llm-benchmark

  1. LLM benchmark
  2. Built-in tools
    1. vLLM
      1. Quick test
      2. 70B model test
    2. TensorRT-LLM
  3. Other tools
    1. NVIDIA GenAI-Perf
    2. AIperf
    3. NVIDIA Dynamo

LLM benchmark

What are the key metrics for a locally deployed LLM?

  1. Token generation speed [tokens/s]

This is the core metric behind QPS.

Metric                Meaning                        Effect on QPS
Decode tokens/s       Tokens generated per second    Determines throughput (QPS)
Prefill tokens/s      Prompt-processing speed        Affects time to first token
KV cache VRAM usage   Cache needed per token         Limits concurrency
Max batch size        Requests handled per batch     Directly determines QPS
  2. VRAM footprint

As a rule of thumb, keep roughly 20% of VRAM free as headroom (see the back-of-the-envelope sketch after this list).

  3. Prefill latency (prompt-processing stage)

Determines the time to first token (TTFT).
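
To get a feel for how the KV cache limits concurrency, here is a back-of-the-envelope sketch. The layer/head numbers below are placeholder examples; read the real values from the model's config.json (num_hidden_layers, num_key_value_heads, head_dim):

# KV cache per token = 2 (K and V) * layers * kv_heads * head_dim * bytes per element
LAYERS=36; KV_HEADS=8; HEAD_DIM=128; DTYPE_BYTES=2       # assumed example values, fp16/bf16
KV_BYTES_PER_TOKEN=$(( 2 * LAYERS * KV_HEADS * HEAD_DIM * DTYPE_BYTES ))
echo "KV cache per token : ${KV_BYTES_PER_TOKEN} bytes"  # 147456 bytes (~144 KiB) with these numbers
# If, say, 6 GiB of VRAM is left for KV cache after loading the weights:
echo "Max cached tokens  : $(( 6 * 1024 * 1024 * 1024 / KV_BYTES_PER_TOKEN ))"

Dividing that cached-token budget by the context length per request gives a rough upper bound on concurrent requests.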

Built-in tools

vLLM

vLLM's built-in benchmark tooling covers most scenarios.

Deploy Qwen3-8B with vLLM and measure throughput on a Tesla T4.

Qwen3-8B would blow up the T4 (OOM), so let's test Qwen3-4B instead.

Official documentation:

Quick test

  1. vLLM provides an official Docker image for deployment. The image runs an OpenAI-compatible server and is available on Docker Hub as vllm/vllm-openai.
docker pull vllm/vllm-openai:nightly-x86_64
  2. Start the server
docker run --runtime nvidia --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=$HF_TOKEN" \
    -p 8000:8000 \
    --ipc=host \
    vllm/vllm-openai:nightly-x86_64 \
    --model Qwen/Qwen3-4B

Sanity-check from the local machine that the server is up:

curl http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-4B",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

Test procedure

  1. Inside the container, benchmark with the built-in tool. First start the server:
vllm serve Qwen/Qwen3-4B \
--port 8000 \
--max-model-len 10240 \
--gpu-memory-utilization 0.95
  • --max-model-len: without it, vLLM uses the model's maximum context length (40960 tokens for Qwen3-4B), and the Tesla T4 immediately runs out of memory.

  • --gpu-memory-utilization: the fraction of GPU memory vLLM is allowed to use; the default is 0.9.
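
To confirm the effective context length the running server actually picked up, you can query the model listing; recent vLLM builds report it in the model card (treat the field name as an assumption and check your version):

curl -s http://127.0.0.1:8000/v1/models | python3 -m json.tool
# look for "max_model_len" in the returned model entry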

VRAM usage right after startup:

Enter the container, then start the LLM service from inside it:

docker run --runtime nvidia --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -e HF_TOKEN=$HF_TOKEN \
    -p 8000:8000 \
    --ipc=host \
    -it --entrypoint /bin/bash \
    vllm/vllm-openai:nightly-x86_64

Run the benchmark

vllm bench serve \
    --model Qwen/Qwen3-4B \
    --host 127.0.0.1 \
    --port 8000 \
    --random-input-len 1024 \
    --random-output-len 1024  \
    --num-prompts  5

Results

Tesla-T4

Tesla T4 results:

tip: install termplotlib and gnuplot to plot the metrics
============ Serving Benchmark Result ============
Successful requests:                     5         
Failed requests:                         0         
Benchmark duration (s):                  40.76     
Total input tokens:                      5120      
Total generated tokens:                  5120      
Request throughput (req/s):              0.12      
Output token throughput (tok/s):         125.63    
Peak output token throughput (tok/s):    140.00    
Peak concurrent requests:                5.00      
Total Token throughput (tok/s):          251.26    
---------------Time to First Token----------------
Mean TTFT (ms):                          934.99    
Median TTFT (ms):                        1306.70   
P99 TTFT (ms):                           1343.47   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          38.88     
Median TPOT (ms):                        38.53     
P99 TPOT (ms):                           39.67     
---------------Inter-token Latency----------------
Mean ITL (ms):                           38.88     
Median ITL (ms):                         38.46     
P99 ITL (ms):                            41.42     
==================================================
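
The 5-prompt run above is only a smoke test. For something closer to a load test, pin the concurrency and use more prompts, e.g. sweeping a few concurrency levels (flag names as in recent vLLM releases; verify with vllm bench serve --help):

for c in 1 4 8 16; do
  vllm bench serve \
      --model Qwen/Qwen3-4B \
      --host 127.0.0.1 \
      --port 8000 \
      --random-input-len 1024 \
      --random-output-len 1024 \
      --num-prompts 200 \
      --max-concurrency "$c" \
      --save-result
done

--save-result writes each run's metrics to a JSON file so the sweeps can be compared afterwards.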

4x A100, quick test

The Azure Spot VM is a Standard NC96ads A100 v4 (96 vCPUs, 880 GiB RAM).

vllm bench serve \
    --model Qwen/Qwen3-4B \
    --host 127.0.0.1 \
    --port 8000 \
    --random-input-len 1024 \
    --random-output-len 1024  \
    --num-prompts  5

4x A100 results:

tip: install termplotlib and gnuplot to plot the metrics
============ Serving Benchmark Result ============
Successful requests:                     5         
Failed requests:                         0         
Benchmark duration (s):                  7.00      
Total input tokens:                      5120      
Total generated tokens:                  5120      
Request throughput (req/s):              0.71      
Output token throughput (tok/s):         731.76    
Peak output token throughput (tok/s):    765.00    
Peak concurrent requests:                5.00      
Total token throughput (tok/s):          1463.53   
---------------Time to First Token----------------
Mean TTFT (ms):                          140.16    
Median TTFT (ms):                        186.10    
P99 TTFT (ms):                           196.95    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          6.69      
Median TPOT (ms):                        6.65      
P99 TPOT (ms):                           6.79      
---------------Inter-token Latency----------------
Mean ITL (ms):                           6.69      
Median ITL (ms):                         6.64      
P99 ITL (ms):                            7.33      
==================================================

70B model test

The 4x A100 throughput is impressive, but that was only a smoke test and does not really characterize a 4x A100 setup. Let's try a larger model and a more thorough benchmark.

Official documentation:

Start the server

The model is DeepSeek-R1-Distill-Llama-70B.

We use pre-quantized weights (AWQ quantization) published by someone else, from the following repository:

Valdemardi/DeepSeek-R1-Distill-Llama-70B-AWQ

From the model card:

This quantized model was created using AutoAWQ version 0.2.8 with quant_config:

{
  "zero_point": True,
  "q_group_size": 128,
  "w_bit": 4,
  "version": "GEMM"
}
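
A quick sanity check on why 4-bit AWQ matters for the weights alone (back-of-the-envelope, parameter count rounded to 70B, KV cache and activations not included):

PARAMS=70000000000
echo "fp16 weights     : ~$(( PARAMS * 2 / 1024 / 1024 / 1024 )) GiB"   # ~130 GiB
echo "4-bit AWQ weights: ~$(( PARAMS / 2 / 1024 / 1024 / 1024 )) GiB"   # ~32 GiB

At 4 bits the weights fit comfortably across 4x A100 80GB, leaving plenty of room for KV cache.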
  • Start the server
docker run --runtime nvidia --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=$HF_TOKEN" \
    -p 8000:8000 \
    --ipc=host \
    vllm/vllm-openai:nightly-x86_64 \
    --model Valdemardi/DeepSeek-R1-Distill-Llama-70B-AWQ \
    --tensor-parallel-size 4 \
    --max-model-len 16384 \
    --gpu-memory-utilization 0.90 \
    --quantization awq \
    --dtype float16
  • Work inside the container
docker exec -it <container_id> bash
# download dataset
# wget https://hugging-face.cn/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
vllm bench serve \
  --backend vllm \
  --model Valdemardi/DeepSeek-R1-Distill-Llama-70B-AWQ  \
  --endpoint /v1/completions \
  --dataset-name sharegpt \
  --dataset-path /vllm-workspace/data/ShareGPT_V3_unfiltered_cleaned_split.json \
  --num-prompts 100

Results:

Inference speed:

tip: install termplotlib and gnuplot to plot the metrics
============ Serving Benchmark Result ============
Successful requests:                     100       
Failed requests:                         0         
Benchmark duration (s):                  112.02    
Total input tokens:                      22398     
Total generated tokens:                  19894     
Request throughput (req/s):              0.89      
Output token throughput (tok/s):         177.60    
Peak output token throughput (tok/s):    501.00    
Peak concurrent requests:                100.00    
Total token throughput (tok/s):          377.56    
---------------Time to First Token----------------
Mean TTFT (ms):                          4890.56   
Median TTFT (ms):                        4740.88   
P99 TTFT (ms):                           8798.57   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          216.18    
Median TPOT (ms):                        160.08    
P99 TPOT (ms):                           764.65    
---------------Inter-token Latency----------------
Mean ITL (ms):                           153.88    
Median ITL (ms):                         142.23    
P99 ITL (ms):                            766.03    
==================================================

TensorRT-LLM

TensorRT-LLM is designed primarily for A100/H100-class GPUs; it has issues on a Tesla T4.

Use the official Docker image to start the service.

NVIDIA's official container registry (NGC) provides images for all kinds of deep learning environments.
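
If pulling from nvcr.io requires authentication, log in to NGC first; the username is literally the string $oauthtoken and the password is your NGC API key:

docker login nvcr.io --username '$oauthtoken'
# paste your NGC API key when prompted for the password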

Official tutorial:

docker run --rm -it --ipc host --gpus all \
  --ulimit memlock=-1 --ulimit stack=67108864 \
  -p 8000:8000 \
  nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc3

Open a bash shell inside the container

docker exec -it <container_id> bash

Start the LLM service

trtllm-serve "Qwen/Qwen3-8B"

Query the service

curl -X POST http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -H "Accept: application/json" \
    -d '{
        "model": "Qwen/Qwen3-8B",
        "messages":[{"role": "system", "content": "You are a helpful assistant."},
                    {"role": "user", "content": "Where is New York? Tell me in a single sentence."}],
        "max_tokens": 1024,
        "temperature": 0
    }'

Results

Load test

Official benchmarking guide: TensorRT-LLM Benchmarking (TensorRT-LLM docs)

The officially recommended workflow:

# Step 1: prepare the dataset
python benchmarks/cpp/prepare_dataset.py \
    --stdout \
    --tokenizer Qwen/Qwen3-8B \
    token-norm-dist \
    --input-mean 128 \
    --output-mean 128 \
    --input-stdev 0 \
    --output-stdev 0 \
    --num-requests 3000 \
    > /tmp/synthetic_128_128.txt

# Step 2: build the engine
trtllm-bench \
    --model Qwen/Qwen3-8B \
    build \
    --dataset /tmp/synthetic_128_128.txt 

# Step 3: throughput benchmark
trtllm-bench \
    --model Qwen/Qwen3-8B \
    throughput \
    --dataset /tmp/synthetic_128_128.txt \
    --engine_dir /tmp/Qwen/Qwen3-8B/tp_1_pp_1

Results:

===========================================================
= PYTORCH BACKEND
===========================================================
Model:                  Qwen/Qwen3-8B
Model Path:             None
TensorRT LLM Version:   1.2
Dtype:                  bfloat16
KV Cache Dtype:         None
Quantization:           None

===========================================================
= MACHINE DETAILS 
===========================================================
NVIDIA A100 80GB PCIe, memory 79.25 GB, 1.51 GHz

===========================================================
= REQUEST DETAILS 
===========================================================
Number of requests:             3000
Number of concurrent requests:  2235.6361
Average Input Length (tokens):  128.0000
Average Output Length (tokens): 128.0000
===========================================================
= WORLD + RUNTIME INFORMATION 
===========================================================
TP Size:                1
PP Size:                1
EP Size:                None
Max Runtime Batch Size: 1792
Max Runtime Tokens:     3328
Scheduling Policy:      GUARANTEED_NO_EVICT
KV Memory Percentage:   90.00%
Issue Rate (req/sec):   8.9362E+13

===========================================================
= PERFORMANCE OVERVIEW 
===========================================================
Request Throughput (req/sec):                     42.9775
Total Output Throughput (tokens/sec):             5501.1246
Total Token Throughput (tokens/sec):              11002.2492
Total Latency (ms):                               69803.9090
Average request latency (ms):                     52018.7121
Per User Output Throughput [w/ ctx] (tps/user):   2.6867
Per GPU Output Throughput (tps/gpu):              5501.1246

-- Request Latency Breakdown (ms) -----------------------

[Latency] P50    : 49434.4831
[Latency] P90    : 69053.3882
[Latency] P95    : 69401.8254
[Latency] P99    : 69515.5505
[Latency] MINIMUM: 29191.0916
[Latency] MAXIMUM: 69536.4003
[Latency] AVERAGE: 52018.7121

===========================================================
= DATASET DETAILS
===========================================================
Dataset Path:         /tmp/synthetic_128_128.txt
Number of Sequences:  3000

-- Percentiles statistics ---------------------------------

        Input              Output           Seq. Length
-----------------------------------------------------------
MIN:   128.0000           128.0000           256.0000
MAX:   128.0000           128.0000           256.0000
AVG:   128.0000           128.0000           256.0000
P50:   128.0000           128.0000           256.0000
P90:   128.0000           128.0000           256.0000
P95:   128.0000           128.0000           256.0000
P99:   128.0000           128.0000           256.0000
===========================================================
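
Besides throughput, trtllm-bench also has a latency mode for per-request measurements; a minimal sketch reusing the same dataset (subcommand and flags assumed from the same CLI, verify with trtllm-bench --help):

trtllm-bench \
    --model Qwen/Qwen3-8B \
    latency \
    --dataset /tmp/synthetic_128_128.txt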

Other tools

NVIDIA GenAI-Perf

Unfortunately, NVIDIA no longer maintains this tool.

blog

Documentation

Project repo

AIperf

The project GenAI-Perf points to now that it is unmaintained; a nice bonus is built-in visualization.

Project repo

NVIDIA Dynamo

Project repo

Benchmark documentation

Dynamo is an LLM inference engine, but it also ships benchmark scripts that export charts automatically.

GitHub