Running vLLM with Docker: practical notes
vLLM Docker documentation:
https://docs.vllm.ai/en/latest/deployment/docker.html
docker run --gpus all \
-v /opt/models:/vllm-workspace \
-p 8000:8000 \
--ipc=host \
--name=vllm \
vllm/vllm-openai:latest \
--model Qwen/Qwen3-14B \
--tensor-parallel-size 2 \
--max-model-len 24576 \
--gpu-memory-utilization 0.9 \
--swap-space 4
Key parameters:
--tensor-parallel-size 2: tensor parallelism across 2 GPUs
--max-model-len: cap the maximum sequence length to save GPU memory
--gpu-memory-utilization 0.9: let vLLM use up to 90% of GPU memory
--swap-space 4: 4 GB of CPU swap space per GPU
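Once the container is up, the OpenAI-compatible endpoint can be smoke-tested with curl. This is a sketch assuming the port mapping above (8000) and the Hugging Face model id passed via --model:

```shell
# List the models the server exposes
curl -s http://localhost:8000/v1/models

# Minimal non-streaming chat completion (model name must match what vLLM serves)
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen/Qwen3-14B",
        "messages": [{"role": "user", "content": "Hello"}],
        "max_tokens": 32
      }'
```

If the second call returns a JSON object with a "choices" array, the server is serving requests correctly.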
4090D 24G x4
Running Qwen3-14B on four 4090D cards with 24 GB each (the command above uses 2 of them via tensor parallelism)
Maximum concurrency for 24,576 tokens per request: 14.45x
2080Ti 22G x1
Qwen3-8B
docker run -d \
--gpus all \
--restart=unless-stopped \
--network=host \
-v /root/models:/vllm-workspace \
--ipc=host \
--name=vllm \
vllm/vllm-openai:latest \
--model Qwen3-8B \
--enforce-eager \
--max-model-len 16384 \
--swap-space 4
Running Qwen3-8B on one 2080Ti with 22 GB of VRAM
Maximum concurrency for 16384 tokens per request: 1.53x
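The "Maximum concurrency" figures quoted in these notes come from the vLLM startup log; for a container named vllm as above, they can be recovered with:

```shell
# Pull the KV-cache concurrency estimate out of the container logs
docker logs vllm 2>&1 | grep "Maximum concurrency"
```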
Qwen2.5-VL-3B-Instruct-AWQ
docker run --gpus all \
-v /root/models:/vllm-workspace \
-p 8000:8000 \
--ipc=host \
--name=vllm \
vllm/vllm-openai:latest \
--model Qwen2.5-VL-3B-Instruct-AWQ \
--trust-remote-code \
--enforce-eager \
--max-model-len 32768 \
--max-num-batched-tokens 32768 \
--max-num-seqs 1 \
--tensor-parallel-size 1 \
--gpu-memory-utilization 0.95
Running Qwen2.5-VL-3B-Instruct-AWQ on one 2080Ti with 22 GB of VRAM
Maximum concurrency for 32768 tokens per request: 12.35x
However, image question answering did not work with this setup.
Qwen2.5-VL-3B-Instruct
docker run --gpus all \
-v /root/models:/vllm-workspace \
-p 8000:8000 \
--ipc=host \
--name=vllm \
vllm/vllm-openai:latest \
--model Qwen2.5-VL-3B-Instruct \
--trust-remote-code \
--enforce-eager \
--max-model-len 32768 \
--max-num-batched-tokens 32768 \
--max-num-seqs 1 \
--tensor-parallel-size 1 \
--gpu-memory-utilization 0.95 \
--limit-mm-per-prompt image=1,video=0 \
--served-model-name qwen2.5-vl
Example request body for POST /v1/chat/completions. Note that the "model" field must match the --served-model-name alias (qwen2.5-vl), not the checkpoint directory name:
{
  "model": "qwen2.5-vl",
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "Describe this image."
        },
        {
          "type": "image_url",
          "image_url": {
            "url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"
          }
        }
      ]
    }
  ],
  "stream": true
}
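A body like the one above can be posted with curl; because "stream" is true, the server replies with server-sent-event chunks, so -N disables output buffering. This assumes the JSON has been saved to a file named request.json (a hypothetical filename):

```shell
# Stream a multimodal chat completion from the local vLLM server
curl -s -N http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d @request.json
```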
Long-term deployment of Qwen3-4B-Instruct-2507
docker run -d \
--gpus all \
--network=host \
--ipc=host \
--restart=unless-stopped \
--shm-size 8g \
-v /root/models:/vllm-workspace \
--name=vllm \
vllm/vllm-openai:latest \
--model Qwen3-4B-Instruct-2507 \
--trust-remote-code \
--enforce-eager \
--tensor-parallel-size 1 \
--max-model-len 32768 \
--gpu-memory-utilization 0.95 \
--swap-space 8
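For a long-running deployment with --restart=unless-stopped, it helps to confirm the server actually finished loading the model. vLLM's OpenAI-compatible server exposes a /health endpoint that returns HTTP 200 once it is ready; a simple polling loop (timeout values are arbitrary):

```shell
# Poll /health until the server is ready, giving up after ~5 minutes
for i in $(seq 1 60); do
  if curl -sf http://localhost:8000/health > /dev/null; then
    echo "vLLM is ready"
    break
  fi
  sleep 5
done
```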