
Ascend LLM Inference

vLLM

References

https://vllm-ascend.readthedocs.io/en/latest/installation.html

Install nnal first; otherwise running vLLM later fails with: libatb.so: cannot open shared object file: No such file or directory

# run as root
# https://github.com/vllm-project/vllm-ascend/issues/152
$ wget https://ascend-repo.obs.cn-east-2.myhuaweicloud.com/CANN/CANN%208.0.0/Ascend-cann-nnal_8.0.0_linux-aarch64.run
$ chmod +x ./Ascend-cann-nnal_8.0.0_linux-aarch64.run
# install as root
$ ./Ascend-cann-nnal_8.0.0_linux-aarch64.run --install-for-all --install
$ source /usr/local/Ascend/nnal/atb/set_env.sh
$ source /usr/local/Ascend/ascend-toolkit/set_env.sh
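
To confirm nnal is in place and libatb.so is resolvable before going further, a quick sanity check (a sketch assuming the default install prefix /usr/local/Ascend/nnal seen above):

# locate libatb.so under the nnal install tree
$ find /usr/local/Ascend/nnal -name libatb.so
# after sourcing set_env.sh, the dynamic loader should resolve it;
# this raises the same "cannot open shared object file" error if it cannot
$ python3 -c "import ctypes; ctypes.CDLL('libatb.so'); print('libatb.so OK')"
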
Install vllm and vllm-ascend. Note that the torch_npu, vllm, and vllm-ascend versions must match strictly.
# install as a regular (non-root) user
# install torch_npu
$ wget https://pytorch-package.obs.cn-north-4.myhuaweicloud.com/pta/Daily/v2.5.1/20250218.4/pytorch_v2.5.1_py39.tar.gz
$ tar xf pytorch_v2.5.1_py39.tar.gz
$ pip install --upgrade torch_npu-2.5.1.dev20250218-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
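
Before building vLLM, it is worth checking that torch_npu can see the devices; importing torch_npu registers the npu backend onto torch (a minimal sketch):

# True means the driver and torch_npu agree; False usually points to a CANN/driver mismatch
$ python3 -c "import torch, torch_npu; print(torch.npu.is_available())"
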

# Install vllm from source
$ git clone --depth 1 --branch v0.7.1 https://github.com/vllm-project/vllm
$ cd vllm
$ VLLM_TARGET_DEVICE=empty pip install . --extra-index-url https://download.pytorch.org/whl/cpu/

# Install vllm-ascend from source
$ git clone --depth 1 --branch v0.7.1rc1 https://github.com/vllm-project/vllm-ascend.git
$ cd vllm-ascend
$ pip install -e . --extra-index-url https://download.pytorch.org/whl/cpu/
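
Since the version pairing is strict, confirm what actually got installed before moving on:

# expect vllm 0.7.1, vllm-ascend 0.7.1rc1, and the torch_npu 2.5.1 dev wheel from above
$ pip list | grep -Ei 'vllm|torch'
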

Verify that it works

# set the card IDs visible to the process
$ export ASCEND_RT_VISIBLE_DEVICES=4,5,6,7
$ export NPU_VISIBLE_DEVICES=4,5,6,7
# download the model from ModelScope (export it so vllm serve actually sees it)
$ export VLLM_USE_MODELSCOPE=true
$ vllm serve Qwen/Qwen2.5-0.5B-Instruct
...
INFO 02-26 01:36:24 launcher.py:19] Available routes are:
INFO 02-26 01:36:24 launcher.py:27] Route: /openapi.json, Methods: GET, HEAD
INFO 02-26 01:36:24 launcher.py:27] Route: /invocations, Methods: POST
INFO:     Started server process [1923472]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)

$ curl http://localhost:8000/v1/models
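
Listing models only proves the server is up; a minimal request to the OpenAI-compatible chat endpoint exercises actual inference (the message content here is just a placeholder):

$ curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-0.5B-Instruct",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "max_tokens": 32
  }'
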

Tested DeepSeek-R1-Distill-Qwen-32B: generation is very slow at 3.2 tokens/s, and the curl test below takes nearly 3 minutes to return a result. By contrast, a MindIE-deployed DeepSeek 70B inference service on the same node returns in under 1 second.

$ source /usr/local/Ascend/nnal/atb/set_env.sh
$ VLLM_USE_MODELSCOPE=true NPU_VISIBLE_DEVICES=4,5,6,7 ASCEND_RT_VISIBLE_DEVICES=4,5,6,7 vllm serve --tensor-parallel-size 4 --dtype bfloat16 deepseek-ai/DeepSeek-R1-Distill-Qwen-32B
$ curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B",
    "prompt": "描述一下北京的秋天",
    "max_tokens": 512
  }'
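
To put a number on throughput, combine wall-clock time with the usage field the API returns; completion_tokens divided by elapsed seconds gives a rough tokens/s (a sketch that assumes jq is installed; HTTP overhead is included in the timing):

# time one request and read token counts from .usage
$ time curl -s http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B",
    "prompt": "描述一下北京的秋天",
    "max_tokens": 512
  }' | jq '.usage'
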