LLM Deployment Notes
I have recently deployed quite a few models at work. These notes record the deployment method and parameter configuration for each model, covering tools such as Ollama and vLLM.
2. Ollama Deployment Notes
I initially used Ollama to deploy models, partly because our intranet hosts an Ollama mirror, so models can be pulled and used directly, and partly because Ollama can run fairly small embedding models, which lets me vectorize small code repositories locally.
2.1 Environment Variables
```shell
export OLLAMA_HOST="0.0.0.0:11434"
export OLLAMA_MODELS=/usr1/ollama/models
```
2.2 Starting the Service
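The service itself is started with `ollama serve`, which picks up the `OLLAMA_HOST` and `OLLAMA_MODELS` variables exported above. A minimal background-start sketch (the log path is illustrative, and the snippet is a no-op on machines where ollama is not installed):

```shell
# Start the Ollama server in the background; guarded so the sketch is
# inert where the ollama binary is absent.
if command -v ollama >/dev/null 2>&1; then
  nohup ollama serve > ollama.log 2>&1 &
  status="started"
else
  status="ollama not installed"
fi
echo "$status"
```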
2.3 Model Management
```shell
# Unload a model from memory
curl -X POST http://localhost:11434/api/unload -H "Content-Type: application/json" -d '{"model": "GLM-4.5-Air"}'
```
3. vLLM Deployment Notes
3.1 Qwen3-Coder-480B-A35B-Instruct-FP8
```shell
VLLM_USE_DEEP_GEMM=1 vllm serve /usr1/huggingface/models/Qwen3-Coder-480B-A35B-Instruct-FP8 \
--max-model-len 131072 \
--enable-expert-parallel \
--data-parallel-size 8 \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder
```
3.2 Qwen3-235B-A22B-Instruct-2507-FP8
```shell
vllm serve /usr1/huggingface/models/Qwen3-235B-A22B-Instruct-2507-FP8 \
--served-model-name Qwen3-235B-A22B-Instruct-2507-FP8 \
--tensor-parallel-size 8 \
--max-model-len 8192 \
--gpu-memory-utilization 0.85 \
--trust-remote-code \
--enable-expert-parallel \
--host 0.0.0.0 \
--enable-auto-tool-choice \
--tool-call-parser hermes
```
Performance notes:
- ~43 GB of VRAM per GPU
- ~344 GB of VRAM in total (out of 368 GB available)
- Relatively short context window (`--max-model-len 8192`)
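The per-card and total figures above are consistent; a quick sanity check of the VRAM accounting:

```shell
# VRAM accounting for the Qwen3-235B FP8 deployment above (all in GB)
per_gpu=43
gpus=8
total_hbm=368
used=$((per_gpu * gpus))      # 43 GB on each of 8 cards
free=$((total_hbm - used))    # headroom left on the node
echo "used=${used}GB free=${free}GB"
```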
3.3 Qwen3-Coder-480B-A35B-Instruct-GPTQ-Int4-Int8Mix
```shell
vllm serve /usr1/huggingface/models/Qwen3-Coder-480B-A35B-Instruct-GPTQ-Int4-Int8Mix \
--served-model-name Qwen3-Coder-480B-A35B-Instruct-GPTQ-Int4-Int8Mix \
--tensor-parallel-size 8 \
--max-model-len 131072 \
--gpu-memory-utilization 0.9 \
--trust-remote-code \
--kv-cache-dtype fp8_e5m2 \
--enable-chunked-prefill \
--max-num-batched-tokens 8192 \
--enable-expert-parallel \
--host 0.0.0.0 \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder
```
Performance notes:
- 8 GPUs; with `--max-model-len` set to 16K or 65536 each GPU uses ~42 GB, but at 131072 usage rises to ~44.7 GB per GPU.
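The jump from 42 GB to 44.7 GB per card at 131072 context is presumably the extra reservation needed for the longer context; spread across the 8 cards it works out to:

```shell
# Extra per-card VRAM observed at --max-model-len 131072 vs 65536 (GB);
# awk handles the fractional arithmetic
delta=$(awk 'BEGIN { printf "%.1f", 44.7 - 42.0 }')
total_extra=$(awk -v d="$delta" 'BEGIN { printf "%.1f", d * 8 }')
echo "per-card: ${delta} GB, across 8 cards: ${total_extra} GB"
```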
3.4 Qwen3-Embedding-8B
```shell
vllm serve /usr1/huggingface/models/Qwen3-Embedding-8B --served-model-name Qwen3-Embedding-8B --tensor-parallel-size 8 --task embedding --host 0.0.0.0 --port 8113 --max-model-len 40000 --max-num-batched-tokens 40000 --max-num-seqs 40 --gpu-memory-utilization 0.12

# Run in the background
nohup vllm serve /usr1/huggingface/models/Qwen3-Embedding-8B --served-model-name Qwen3-Embedding-8B --tensor-parallel-size 8 --task embedding --host 0.0.0.0 --port 8113 --max-model-len 40000 --max-num-batched-tokens 40000 --max-num-seqs 40 --gpu-memory-utilization 0.12 > vllm_qwen3-embeding.log 2>&1 &
```
With `--tensor-parallel-size 8`, each GPU uses ~22 GB of VRAM; with `--tensor-parallel-size 1`, the single GPU uses ~35 GB.
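Total VRAM grows with tensor parallelism because every card pays a fixed per-process overhead (CUDA context, activation buffers, the 0.12 KV budget) on top of its shard of the weights. A rough decomposition, assuming ~16 GB of bf16 weights for the 8B model (an estimate, not a measured figure):

```shell
weights_gb=16                        # ~8B params x 2 bytes (bf16), estimated
# TP=1: one card holds all the weights -> 35 GB observed
overhead_tp1=$((35 - weights_gb))
# TP=8: each card holds 1/8 of the weights -> 22 GB observed per card
overhead_tp8=$((22 - weights_gb / 8))
echo "per-card overhead: TP1=${overhead_tp1}GB TP8=${overhead_tp8}GB"
```

In both cases roughly 20 GB per card is non-weight overhead, which is why TP=8 costs far more total VRAM (8 × 22 GB) than TP=1 (35 GB).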
3.5 Qwen3-235B-A22B-GPTQ-Int4
```shell
vllm serve /usr1/huggingface/models/Qwen3-235B-A22B-GPTQ-Int4 --served-model-name Qwen3-235B-A22B-GPTQ-Int4 --host 0.0.0.0 --max-num-seqs 40 --gpu-memory-utilization 0.9 --tensor-parallel-size 4 --enable-expert-parallel --disable-log-requests --trust-remote-code --max-model-len 131072 --kv-cache-dtype fp8_e5m2 --enable-chunked-prefill --max-num-batched-tokens 8192 --enable-auto-tool-choice --tool-call-parser hermes
```
3.6 GLM-4.6-Int4-Int8Mix
```shell
vllm serve \
/usr1/huggingface/models/GLM-4.6-GPTQ-Int4-Int8Mix \
--served-model-name GLM-4.6-GPTQ-Int4-Int8Mix \
--enable-auto-tool-choice \
--tool-call-parser glm45 \
--reasoning-parser glm45 \
--swap-space 16 \
--max-num-seqs 64 \
--max-model-len 131072 \
--gpu-memory-utilization 0.8 \
--tensor-parallel-size 8 \
--enable-expert-parallel \
--trust-remote-code \
--disable-log-requests \
--host 0.0.0.0 \
--port 8000
```
At `--max-model-len 32768` each GPU uses ~43 GB; at 131072 it is roughly the same, since vLLM pre-allocates its KV-cache pool up to `--gpu-memory-utilization` regardless of the context length.
```shell
nohup vllm serve \
/usr1/huggingface/models/GLM-4.6-GPTQ-Int4-Int8Mix \
--served-model-name GLM-4.6-GPTQ-Int4-Int8Mix \
--enable-auto-tool-choice \
--tool-call-parser glm45 \
--reasoning-parser glm45 \
--swap-space 16 \
--max-num-seqs 64 \
--max-model-len 131072 \
--gpu-memory-utilization 0.8 \
--tensor-parallel-size 8 \
--enable-expert-parallel \
--trust-remote-code \
--disable-log-requests \
--host 0.0.0.0 \
--port 8000 >> vllm_glm4.log 2>&1 &
```
3.7 GLM-4.6-AWQ
```shell
vllm serve \
/usr1/huggingface/models/GLM-4.6-AWQ \
--served-model-name GLM-4.6-AWQ \
--enable-auto-tool-choice \
--tool-call-parser glm45 \
--reasoning-parser glm45 \
--swap-space 16 \
--max-num-seqs 64 \
--max-model-len 202752 \
--gpu-memory-utilization 0.95 \
--tensor-parallel-size 8 \
--enable-expert-parallel \
--trust-remote-code \
--disable-log-requests \
--host 0.0.0.0 \
--port 8000

# Run in the background
nohup vllm serve \
/usr1/huggingface/models/GLM-4.6-AWQ \
--served-model-name GLM-4.6-AWQ \
--enable-auto-tool-choice \
--tool-call-parser glm45 \
--reasoning-parser glm45 \
--swap-space 16 \
--max-num-seqs 64 \
--max-model-len 202752 \
--gpu-memory-utilization 0.95 \
--tensor-parallel-size 8 \
--enable-expert-parallel \
--trust-remote-code \
--enable-log-requests \
--host 0.0.0.0 \
--port 8000 >> vllm_glm4-awq.log 2>&1 &
# --enable-log-outputs \  # add this flag to also log model outputs

# With request logging and performance monitoring
nohup vllm serve \
/usr1/huggingface/models/GLM-4.6-AWQ \
--served-model-name GLM-4.6-AWQ \
--enable-auto-tool-choice \
--tool-call-parser glm45 \
--reasoning-parser glm45 \
--swap-space 16 \
--max-num-seqs 64 \
--max-model-len 202752 \
--tensor-parallel-size 8 \
--enable-expert-parallel \
--trust-remote-code \
--enable-log-requests \
--host 127.0.0.1 \
--port 8000 >> vllm_glm4-awq.log 2>&1 &
```
8 GPUs, ~43 GB of VRAM each.
GLM-4.7-AWQ deployment
- Activate the environment

```shell
source ~/glm4.7/bin/activate
```
- Set environment variables

```shell
export VLLM_USE_DEEP_GEMM=0
export VLLM_USE_FLASHINFER_MOE_FP16=1
export VLLM_USE_FLASHINFER_SAMPLER=0
export OMP_NUM_THREADS=4
```
- Start vLLM

```shell
nohup vllm serve \
/usr1/huggingface/models/GLM-4.7-AWQ \
--served-model-name GLM-4.7-AWQ \
--swap-space 16 \
--max-num-seqs 32 \
--max-model-len 202752 \
--gpu-memory-utilization 0.93 \
--tensor-parallel-size 8 \
--enable-expert-parallel \
--speculative-config.method mtp \
--speculative-config.num_speculative_tokens 1 \
--tool-call-parser glm47 \
--reasoning-parser glm45 \
--enable-auto-tool-choice \
--trust-remote-code \
--enable-log-requests \
--host 127.0.0.1 \
--port 8000 >> vllm_glm4-awq.log 2>&1 &
```
3.8 Qwen3-32B
```shell
vllm serve /usr1/huggingface/models/Qwen3-32B --served-model-name Qwen3-32B --enable-auto-tool-choice --tool-call-parser qwen3 --tensor-parallel-size 8 --gpu-memory-utilization 0.8
```
Before deploying Qwen3-32B you must stop GLM-4.6 first, otherwise there is not enough free VRAM.
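A pre-flight check helps here: find the minimum free VRAM across the cards before launching. In practice the input would come from `nvidia-smi --query-gpu=memory.free --format=csv,noheader,nounits`; sample numbers are hard-coded below so the sketch is self-contained:

```shell
# Sample per-GPU free memory in MiB (stand-in for real nvidia-smi output)
free_mib="1530
46068
45000
46068
44820
46068
46068
46068"
# The most-loaded card determines whether the new model fits
min_free=$(printf '%s\n' "$free_mib" | sort -n | head -n 1)
echo "minimum free across GPUs: ${min_free} MiB"
if [ "$min_free" -lt 40000 ]; then
  echo "not enough free VRAM; stop the running model first"
fi
```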
3.9 Kimi-K2-Thinking
```shell
vllm serve /usr1/huggingface/models/Kimi-K2-Thinking \
--served-model-name kimi-k2-thinking \
--trust-remote-code \
--tensor-parallel-size 8 \
--enable-auto-tool-choice \
--gpu-memory-utilization 0.95 \
--max-num-batched-tokens 32768 \
--tool-call-parser kimi_k2 \
--reasoning-parser kimi_k2
```
3.10 MiniMax-M2.1
```shell
# MiniMax-M2.1
# Note: the model is also served under the name GLM-4.6-AWQ so that
# existing users of that endpoint name can keep using it.
SAFETENSORS_FAST_GPU=1 vllm serve /usr1/huggingface/models/MiniMax-M2-1 \
--served-model-name GLM-4.6-AWQ MiniMax-M2.1 \
--trust-remote-code \
--enable_expert_parallel \
--tensor-parallel-size 8 \
--enable-auto-tool-choice \
--tool-call-parser minimax_m2 \
--reasoning-parser minimax_m2_append_think \
--max-num-seqs 64 \
--max-model-len 196608 \
--enable-log-requests \
--gpu-memory-utilization 0.95 \
--host 127.0.0.1 \
--port 8000
```
3.11 MiniMax-M2.5
```shell
# MiniMax-M2.5
# Note: the model is still served under the name GLM-4.6-AWQ so that
# existing users of that endpoint name can keep using it.
SAFETENSORS_FAST_GPU=1 vllm serve \
/usr1/huggingface/models/MiniMax-M2.5-BF16-INT4-AWQ --trust-remote-code \
--served-model-name GLM-4.6-AWQ \
--enable_expert_parallel --tensor-parallel-size 8 \
--enable-auto-tool-choice --tool-call-parser minimax_m2 \
--reasoning-parser minimax_m2_append_think \
--enable-log-requests \
--host 127.0.0.1 \
--port 8000
```
Run in the background:
```shell
SAFETENSORS_FAST_GPU=1 nohup vllm serve \
/usr1/huggingface/models/MiniMax-M2.5-BF16-INT4-AWQ --trust-remote-code \
--served-model-name GLM-4.6-AWQ \
--enable_expert_parallel --tensor-parallel-size 8 \
--enable-auto-tool-choice --tool-call-parser minimax_m2 \
--reasoning-parser minimax_m2_append_think \
--enable-log-requests \
--host 127.0.0.1 \
--port 8000 >> vllm_minimax2.5-awq.log 2>&1 &
```
This runs successfully.
3.12 Qwen3.5-397B-A17B
```shell
# Activate the vllm environment
source vllm-qwen3.5-plus/bin/activate
# Install the nightly vLLM build and the matching transformers version
# as described at https://huggingface.co/QuantTrio/Qwen3.5-397B-A17B-AWQ
# Load the CUDA 12 libraries bundled with the venv
export LD_LIBRARY_PATH=$VIRTUAL_ENV/lib/python3.10/site-packages/nvidia/cuda_runtime/lib:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=$VIRTUAL_ENV/lib/python3.10/site-packages/nvidia/cublas/lib:$LD_LIBRARY_PATH
# Deployment settings provided on the model card
export VLLM_USE_DEEP_GEMM=0
export VLLM_USE_FLASHINFER_MOE_FP16=1
export VLLM_USE_FLASHINFER_SAMPLER=0
export OMP_NUM_THREADS=16
# Note: the model is served under the name GLM-4.6-AWQ so that existing
# users of that endpoint name can keep using it.
vllm serve \
/usr1/huggingface/models/Qwen3.5-397B-A17B-AWQ \
--served-model-name GLM-4.6-AWQ \
--swap-space 16 \
--max-num-seqs 32 \
--max-model-len 202752 \
--tensor-parallel-size 8 --enable-expert-parallel --enable-auto-tool-choice --tool-call-parser qwen3_coder --reasoning-parser qwen3 --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}' --trust-remote-code --host 127.0.0.1 --port 8000
```
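Each of the two `LD_LIBRARY_PATH` exports above prepends one directory, so the venv-bundled CUDA libraries take precedence over any system copies, with the last export ending up first on the search path. A self-contained illustration with stand-in paths:

```shell
# Stand-in for the system default and the two venv CUDA directories
LD_LIBRARY_PATH="/usr/lib"
LD_LIBRARY_PATH="/venv/nvidia/cuda_runtime/lib:$LD_LIBRARY_PATH"
LD_LIBRARY_PATH="/venv/nvidia/cublas/lib:$LD_LIBRARY_PATH"
echo "$LD_LIBRARY_PATH"
```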
Background deployment with nohup:
```shell
nohup vllm serve \
/usr1/huggingface/models/Qwen3.5-397B-A17B-AWQ \
--served-model-name GLM-4.6-AWQ \
--swap-space 16 \
--max-num-seqs 32 \
--max-model-len 202752 \
--tensor-parallel-size 8 --enable-expert-parallel --enable-auto-tool-choice --tool-call-parser qwen3_coder --reasoning-parser qwen3 --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}' --trust-remote-code --host 127.0.0.1 --port 8000 > vllm_qwen3.5.log 2>&1 &
```
4. Docker Containerized Deployment
4.1 Open-WebUI
```shell
# Start the container
sudo docker run -d -p 0.0.0.0:3000:8080 \
-v open-webui:/app/backend/data \
--name open-webui \
--restart always \
-e OPENAI_API_BASE_URL=http://10.44.151.54:8000/v1 \
ghcr.io/open-webui/open-webui:main

# Stop and remove the container
sudo docker stop open-webui
sudo docker rm open-webui
```
4.2 Lobe-Chat
```shell
sudo docker run -d -p 0.0.0.0:3210:3210 \
--name lobe-chat \
--restart always \
-e OPENAI_PROXY_URL=http://10.44.151.54:8111/v1 \
-e OPENAI_MODEL_LIST="-all,+GLM-4.6-AWQ" \
-e DEFAULT_AGENT_CONFIG="model=GLM-4.6-AWQ" \
lobehub/lobe-chat:latest

# Stop and remove the container
sudo docker stop lobe-chat
sudo docker rm lobe-chat
```
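The `OPENAI_MODEL_LIST` value uses Lobe-Chat's filter syntax: `-all` hides every model and `+NAME` re-adds one, so only `GLM-4.6-AWQ` shows up in the UI. A rough shell rendering of that filter logic (a hypothetical sketch of the semantics, not Lobe-Chat code):

```shell
MODEL_LIST="-all,+GLM-4.6-AWQ"
shown=""
for tok in $(printf '%s' "$MODEL_LIST" | tr ',' ' '); do
  case "$tok" in
    -all) shown="" ;;                             # hide everything
    +*)   shown="${shown:+$shown }${tok#+}" ;;    # re-add one model
  esac
done
echo "visible models: $shown"
```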