vLLM
vLLM is an open-source inference and serving engine for large language models. You will need to deploy vLLM for your HPC compute project; the easiest way is to use an official container image.
Use one of the official images, depending on your GPU vendor:
- NVIDIA CUDA: the officially supported container (vllm/vllm-openai)
- AMD ROCm: the AMD-maintained build (rocm/vllm)
Container Image
Pull the image and convert it to a .sif file directly on an HPC node using Apptainer:
# vLLM image for NVIDIA
apptainer pull vllm-openai-nightly.sif docker://vllm/vllm-openai:nightly
# vLLM image for AMD ROCm (use a distinct output name so it does not overwrite the NVIDIA image)
apptainer pull vllm-rocm.sif docker://rocm/vllm
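As a quick sanity check before moving the image, you can confirm that vLLM is importable inside the container. This is a minimal sketch; on some setups the import may expect a GPU, in which case run it on a GPU node and add --nv (or --rocm on AMD nodes):
# Verify the converted image by printing the vLLM version
apptainer exec vllm-openai-nightly.sif python3 -c "import vllm; print(vllm.__version__)"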
Then make sure the resulting .sif image is placed in <your_project_path>/containers.
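A minimal sketch of placing the image and starting vLLM's OpenAI-compatible API server, assuming the NVIDIA image and a recent vLLM that provides the vllm serve entrypoint; the model name and port below are placeholders, and on AMD nodes you would use --rocm instead of --nv:
# Move the image into the project containers directory
mkdir -p <your_project_path>/containers
mv vllm-openai-nightly.sif <your_project_path>/containers/
# Start the OpenAI-compatible API server on a GPU node
# (meta-llama/Llama-3.1-8B-Instruct and port 8000 are example values)
apptainer exec --nv <your_project_path>/containers/vllm-openai-nightly.sif \
    vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000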