LLM Deployment Notes
I have recently deployed quite a few models at work. These notes record the deployment method and parameter configuration for each model, covering tools such as Ollama and vLLM.
2. Ollama Deployment Notes
I initially used Ollama to deploy models, partly because our intranet hosts an Ollama mirror, so models can be pulled and used directly, and partly because Ollama can run fairly small embedding models, which lets me vectorize small code repositories locally.
2.1 Environment Variables
```shell
export OLLAMA_HOST="0.0.0.0:11434"
export OLLAMA_MODELS=/usr1/ollama/models
```
2.2 Starting the Service
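The service itself is started with `ollama serve`, which picks up the `OLLAMA_HOST` and `OLLAMA_MODELS` variables exported above. A minimal background-start sketch (the log path is illustrative, and the snippet is a no-op on machines where ollama is not installed):

```shell
# Start the Ollama server in the background; guarded so the sketch is
# inert where the ollama binary is absent.
if command -v ollama >/dev/null 2>&1; then
  nohup ollama serve > ollama.log 2>&1 &
  status="started"
else
  status="ollama not installed"
fi
echo "$status"
```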
2.3 Model Management
```shell
# Unload a model from memory
curl -X POST http://localhost:11434/api/unload -H "Content-Type: application/json" -d '{"model": "GLM-4.5-Air"}'
```
3. vLLM Deployment Notes
3.1 Qwen3-Coder-480B-A35B-Instruct-FP8
```shell
VLLM_USE_DEEP_GEMM=1 vllm serve /usr1/huggingface/models/Qwen3-Coder-480B-A35B-Instruct-FP8 \
--max-model-len 131072 \
--enable-expert-parallel \
--data-parallel-size 8 \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder
```
3.2 Qwen3-235B-A22B-Instruct-2507-FP8
```shell
vllm serve /usr1/huggingface/models/Qwen3-235B-A22B-Instruct-2507-FP8 \
--served-model-name Qwen3-235B-A22B-Instruct-2507-FP8 \
--tensor-parallel-size 8 \
--max-model-len 8192 \
--gpu-memory-utilization 0.85 \
--trust-remote-code \
--enable-expert-parallel \
--host 0.0.0.0 \
--enable-auto-tool-choice \
--tool-call-parser hermes
```
Performance notes:
- ~43 GB of VRAM per GPU
- ~344 GB of VRAM in total (out of 368 GB available)
- Relatively short context window (`--max-model-len 8192`)
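The per-card and total figures above are consistent; a quick sanity check of the VRAM accounting:

```shell
# VRAM accounting for the Qwen3-235B FP8 deployment above (all in GB)
per_gpu=43
gpus=8
total_hbm=368
used=$((per_gpu * gpus))      # 43 GB on each of 8 cards
free=$((total_hbm - used))    # headroom left on the node
echo "used=${used}GB free=${free}GB"
```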
3.3 Qwen3-Coder-480B-A35B-Instruct-GPTQ-Int4-Int8Mix
```shell
vllm serve /usr1/huggingface/models/Qwen3-Coder-480B-A35B-Instruct-GPTQ-Int4-Int8Mix \
--served-model-name Qwen3-Coder-480B-A35B-Instruct-GPTQ-Int4-Int8Mix \
--tensor-parallel-size 8 \
--max-model-len 131072 \
--gpu-memory-utilization 0.9 \
--trust-remote-code \
--kv-cache-dtype fp8_e5m2 \
--enable-chunked-prefill \
--max-num-batched-tokens 8192 \
--enable-expert-parallel \
--host 0.0.0.0 \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder
```
Performance notes:
- 8 GPUs; with `--max-model-len` set to 16K or 65536 each GPU uses ~42 GB, but at 131072 usage rises to ~44.7 GB per GPU.
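The jump from 42 GB to 44.7 GB per card at 131072 context is presumably the extra reservation needed for the longer context; spread across the 8 cards it works out to:

```shell
# Extra per-card VRAM observed at --max-model-len 131072 vs 65536 (GB);
# awk handles the fractional arithmetic
delta=$(awk 'BEGIN { printf "%.1f", 44.7 - 42.0 }')
total_extra=$(awk -v d="$delta" 'BEGIN { printf "%.1f", d * 8 }')
echo "per-card: ${delta} GB, across 8 cards: ${total_extra} GB"
```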
3.4 Qwen3-Embedding-8B
```shell
vllm serve /usr1/huggingface/models/Qwen3-Embedding-8B --served-model-name Qwen3-Embedding-8B --tensor-parallel-size 8 --task embedding --host 0.0.0.0 --port 8113 --max-model-len 40000 --max-num-batched-tokens 40000 --max-num-seqs 40 --gpu-memory-utilization 0.12

# Run in the background
nohup vllm serve /usr1/huggingface/models/Qwen3-Embedding-8B --served-model-name Qwen3-Embedding-8B --tensor-parallel-size 8 --task embedding --host 0.0.0.0 --port 8113 --max-model-len 40000 --max-num-batched-tokens 40000 --max-num-seqs 40 --gpu-memory-utilization 0.12 > vllm_qwen3-embeding.log 2>&1 &
```
With `--tensor-parallel-size 8`, each GPU uses ~22 GB of VRAM; with `--tensor-parallel-size 1`, the single GPU uses ~35 GB.
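Total VRAM grows with tensor parallelism because every card pays a fixed per-process overhead (CUDA context, activation buffers, the 0.12 KV budget) on top of its shard of the weights. A rough decomposition, assuming ~16 GB of bf16 weights for the 8B model (an estimate, not a measured figure):

```shell
weights_gb=16                        # ~8B params x 2 bytes (bf16), estimated
# TP=1: one card holds all the weights -> 35 GB observed
overhead_tp1=$((35 - weights_gb))
# TP=8: each card holds 1/8 of the weights -> 22 GB observed per card
overhead_tp8=$((22 - weights_gb / 8))
echo "per-card overhead: TP1=${overhead_tp1}GB TP8=${overhead_tp8}GB"
```

In both cases roughly 20 GB per card is non-weight overhead, which is why TP=8 costs far more total VRAM (8 × 22 GB) than TP=1 (35 GB).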
3.5 Qwen3-235B-A22B-GPTQ-Int4
```shell
vllm serve /usr1/huggingface/models/Qwen3-235B-A22B-GPTQ-Int4 --served-model-name Qwen3-235B-A22B-GPTQ-Int4 --host 0.0.0.0 --max-num-seqs 40 --gpu-memory-utilization 0.9 --tensor-parallel-size 4 --enable-expert-parallel --disable-log-requests --trust-remote-code --max-model-len 131072 --kv-cache-dtype fp8_e5m2 --enable-chunked-prefill --max-num-batched-tokens 8192 --enable-auto-tool-choice --tool-call-parser hermes
```
3.6 GLM-4.6-Int4-Int8Mix
```shell
vllm serve \
/usr1/huggingface/models/GLM-4.6-GPTQ-Int4-Int8Mix \
--served-model-name GLM-4.6-GPTQ-Int4-Int8Mix \
--enable-auto-tool-choice \
--tool-call-parser glm45 \
--reasoning-parser glm45 \
--swap-space 16 \
--max-num-seqs 64 \
--max-model-len 131072 \
--gpu-memory-utilization 0.8 \
--tensor-parallel-size 8 \
--enable-expert-parallel \
--trust-remote-code \
--disable-log-requests \
--host 0.0.0.0 \
--port 8000
```
At `--max-model-len 32768` each GPU uses ~43 GB; at 131072 it is roughly the same, since vLLM pre-allocates its KV-cache pool up to `--gpu-memory-utilization` regardless of the context length.
```shell
nohup vllm serve \
/usr1/huggingface/models/GLM-4.6-GPTQ-Int4-Int8Mix \
--served-model-name GLM-4.6-GPTQ-Int4-Int8Mix \
--enable-auto-tool-choice \
--tool-call-parser glm45 \
--reasoning-parser glm45 \
--swap-space 16 \
--max-num-seqs 64 \
--max-model-len 131072 \
--gpu-memory-utilization 0.8 \
--tensor-parallel-size 8 \
--enable-expert-parallel \
--trust-remote-code \
--disable-log-requests \
--host 0.0.0.0 \
--port 8000 >> vllm_glm4.log 2>&1 &
```
3.7 GLM-4.6-AWQ
```shell
vllm serve \
/usr1/huggingface/models/GLM-4.6-AWQ \
--served-model-name GLM-4.6-AWQ \
--enable-auto-tool-choice \
--tool-call-parser glm45 \
--reasoning-parser glm45 \
--swap-space 16 \
--max-num-seqs 64 \
--max-model-len 202752 \
--gpu-memory-utilization 0.95 \
--tensor-parallel-size 8 \
--enable-expert-parallel \
--trust-remote-code \
--disable-log-requests \
--host 0.0.0.0 \
--port 8000

# Run in the background
nohup vllm serve \
/usr1/huggingface/models/GLM-4.6-AWQ \
--served-model-name GLM-4.6-AWQ \
--enable-auto-tool-choice \
--tool-call-parser glm45 \
--reasoning-parser glm45 \
--swap-space 16 \
--max-num-seqs 64 \
--max-model-len 202752 \
--gpu-memory-utilization 0.95 \
--tensor-parallel-size 8 \
--enable-expert-parallel \
--trust-remote-code \
--enable-log-requests \
--host 0.0.0.0 \
--port 8000 >> vllm_glm4-awq.log 2>&1 &
# --enable-log-outputs \  # add this flag to also log model outputs

# With request logging and performance monitoring
nohup vllm serve \
/usr1/huggingface/models/GLM-4.6-AWQ \
--served-model-name GLM-4.6-AWQ \
--enable-auto-tool-choice \
--tool-call-parser glm45 \
--reasoning-parser glm45 \
--swap-space 16 \
--max-num-seqs 64 \
--max-model-len 202752 \
--tensor-parallel-size 8 \
--enable-expert-parallel \
--trust-remote-code \
--enable-log-requests \
--host 127.0.0.1 \
--port 8000 >> vllm_glm4-awq.log 2>&1 &
```
8 GPUs, ~43 GB of VRAM each.
GLM-4.7-AWQ deployment
- Activate the environment

```shell
source ~/glm4.7/bin/activate
```
- Set environment variables

```shell
export VLLM_USE_DEEP_GEMM=0
export VLLM_USE_FLASHINFER_MOE_FP16=1
export VLLM_USE_FLASHINFER_SAMPLER=0
export OMP_NUM_THREADS=4
```
- Start vLLM

```shell
nohup vllm serve \
/usr1/huggingface/models/GLM-4.7-AWQ \
--served-model-name GLM-4.7-AWQ \
--swap-space 16 \
--max-num-seqs 32 \
--max-model-len 202752 \
--gpu-memory-utilization 0.93 \
--tensor-parallel-size 8 \
--enable-expert-parallel \
--speculative-config.method mtp \
--speculative-config.num_speculative_tokens 1 \
--tool-call-parser glm47 \
--reasoning-parser glm45 \
--enable-auto-tool-choice \
--trust-remote-code \
--enable-log-requests \
--host 127.0.0.1 \
--port 8000 >> vllm_glm4-awq.log 2>&1 &
```
3.8 Qwen3-32B
```shell
vllm serve /usr1/huggingface/models/Qwen3-32B --served-model-name Qwen3-32B --enable-auto-tool-choice --tool-call-parser qwen3 --tensor-parallel-size 8 --gpu-memory-utilization 0.8
```
Before deploying Qwen3-32B you must stop GLM-4.6 first, otherwise there is not enough free VRAM.
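A pre-flight check helps here: find the minimum free VRAM across the cards before launching. In practice the input would come from `nvidia-smi --query-gpu=memory.free --format=csv,noheader,nounits`; sample numbers are hard-coded below so the sketch is self-contained:

```shell
# Sample per-GPU free memory in MiB (stand-in for real nvidia-smi output)
free_mib="1530
46068
45000
46068
44820
46068
46068
46068"
# The most-loaded card determines whether the new model fits
min_free=$(printf '%s\n' "$free_mib" | sort -n | head -n 1)
echo "minimum free across GPUs: ${min_free} MiB"
if [ "$min_free" -lt 40000 ]; then
  echo "not enough free VRAM; stop the running model first"
fi
```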
3.9 Kimi-K2-Thinking
```shell
vllm serve /usr1/huggingface/models/Kimi-K2-Thinking \
--served-model-name kimi-k2-thinking \
--trust-remote-code \
--tensor-parallel-size 8 \
--enable-auto-tool-choice \
--gpu-memory-utilization 0.95 \
--max-num-batched-tokens 32768 \
--tool-call-parser kimi_k2 \
--reasoning-parser kimi_k2
```
3.10 MiniMax-M2.1
```shell
# MiniMax-M2.1
# Note: the model is also served under the name GLM-4.6-AWQ so that
# existing users of that endpoint name can keep using it.
SAFETENSORS_FAST_GPU=1 vllm serve /usr1/huggingface/models/MiniMax-M2-1 \
--served-model-name GLM-4.6-AWQ MiniMax-M2.1 \
--trust-remote-code \
--enable_expert_parallel \
--tensor-parallel-size 8 \
--enable-auto-tool-choice \
--tool-call-parser minimax_m2 \
--reasoning-parser minimax_m2_append_think \
--max-num-seqs 64 \
--max-model-len 196608 \
--enable-log-requests \
--gpu-memory-utilization 0.95 \
--host 127.0.0.1 \
--port 8000
```
3.11 MiniMax-M2.5
```shell
# MiniMax-M2.5
# Note: the model is still served under the name GLM-4.6-AWQ so that
# existing users of that endpoint name can keep using it.
SAFETENSORS_FAST_GPU=1 vllm serve \
/usr1/huggingface/models/MiniMax-M2.5-BF16-INT4-AWQ --trust-remote-code \
--served-model-name GLM-4.6-AWQ \
--enable_expert_parallel --tensor-parallel-size 8 \
--enable-auto-tool-choice --tool-call-parser minimax_m2 \
--reasoning-parser minimax_m2_append_think \
--enable-log-requests \
--host 127.0.0.1 \
--port 8000
```
Run in the background:
```shell
SAFETENSORS_FAST_GPU=1 nohup vllm serve \
/usr1/huggingface/models/MiniMax-M2.5-BF16-INT4-AWQ --trust-remote-code \
--served-model-name GLM-4.6-AWQ \
--enable_expert_parallel --tensor-parallel-size 8 \
--enable-auto-tool-choice --tool-call-parser minimax_m2 \
--reasoning-parser minimax_m2_append_think \
--enable-log-requests \
--host 127.0.0.1 \
--port 8000 >> vllm_minimax2.5-awq.log 2>&1 &
```
This runs successfully.
3.12 Qwen3.5-397B-A17B
```shell
# Activate the vllm environment
source vllm-qwen3.5-plus/bin/activate
# Install the nightly vLLM build and the matching transformers version
# as described at https://huggingface.co/QuantTrio/Qwen3.5-397B-A17B-AWQ
# Load the CUDA 12 libraries bundled with the venv
export LD_LIBRARY_PATH=$VIRTUAL_ENV/lib/python3.10/site-packages/nvidia/cuda_runtime/lib:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=$VIRTUAL_ENV/lib/python3.10/site-packages/nvidia/cublas/lib:$LD_LIBRARY_PATH
# Deployment settings provided on the model card
export VLLM_USE_DEEP_GEMM=0
export VLLM_USE_FLASHINFER_MOE_FP16=1
export VLLM_USE_FLASHINFER_SAMPLER=0
export OMP_NUM_THREADS=16
# Note: the model is served under the name GLM-4.6-AWQ so that existing
# users of that endpoint name can keep using it.
vllm serve \
/usr1/huggingface/models/Qwen3.5-397B-A17B-AWQ \
--served-model-name GLM-4.6-AWQ \
--swap-space 16 \
--max-num-seqs 32 \
--max-model-len 202752 \
--tensor-parallel-size 8 --enable-expert-parallel --enable-auto-tool-choice --tool-call-parser qwen3_coder --reasoning-parser qwen3 --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}' --trust-remote-code --host 127.0.0.1 --port 8000
```
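Each of the two `LD_LIBRARY_PATH` exports above prepends one directory, so the venv-bundled CUDA libraries take precedence over any system copies, with the last export ending up first on the search path. A self-contained illustration with stand-in paths:

```shell
# Stand-in for the system default and the two venv CUDA directories
LD_LIBRARY_PATH="/usr/lib"
LD_LIBRARY_PATH="/venv/nvidia/cuda_runtime/lib:$LD_LIBRARY_PATH"
LD_LIBRARY_PATH="/venv/nvidia/cublas/lib:$LD_LIBRARY_PATH"
echo "$LD_LIBRARY_PATH"
```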
Background deployment with nohup:
```shell
nohup vllm serve \
/usr1/huggingface/models/Qwen3.5-397B-A17B-AWQ \
--served-model-name GLM-4.6-AWQ \
--swap-space 16 \
--max-num-seqs 32 \
--max-model-len 202752 \
--tensor-parallel-size 8 --enable-expert-parallel --enable-auto-tool-choice --tool-call-parser qwen3_coder --reasoning-parser qwen3 --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}' --trust-remote-code --host 127.0.0.1 --port 8000 > vllm_qwen3.5.log 2>&1 &
```
4. Docker Containerized Deployment
4.1 Open-WebUI
```shell
# Start the container
sudo docker run -d -p 0.0.0.0:3000:8080 \
-v open-webui:/app/backend/data \
--name open-webui \
--restart always \
-e OPENAI_API_BASE_URL=http://10.44.151.54:8000/v1 \
ghcr.io/open-webui/open-webui:main

# Stop and remove the container
sudo docker stop open-webui
sudo docker rm open-webui
```
4.2 Lobe-Chat
```shell
sudo docker run -d -p 0.0.0.0:3210:3210 \
--name lobe-chat \
--restart always \
-e OPENAI_PROXY_URL=http://10.44.151.54:8111/v1 \
-e OPENAI_MODEL_LIST="-all,+GLM-4.6-AWQ" \
-e DEFAULT_AGENT_CONFIG="model=GLM-4.6-AWQ" \
lobehub/lobe-chat:latest

# Stop and remove the container
sudo docker stop lobe-chat
sudo docker rm lobe-chat
```
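The `OPENAI_MODEL_LIST` value uses Lobe-Chat's filter syntax: `-all` hides every model and `+NAME` re-adds one, so only `GLM-4.6-AWQ` shows up in the UI. A rough shell rendering of that filter logic (a hypothetical sketch of the semantics, not Lobe-Chat code):

```shell
MODEL_LIST="-all,+GLM-4.6-AWQ"
shown=""
for tok in $(printf '%s' "$MODEL_LIST" | tr ',' ' '); do
  case "$tok" in
    -all) shown="" ;;                             # hide everything
    +*)   shown="${shown:+$shown }${tok#+}" ;;    # re-add one model
  esac
done
echo "visible models: $shown"
```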