
Triton Inference Server

Triton Inference Server is open-source inference serving software that streamlines AI inferencing. You will need to deploy it for your HPC compute project, and the easiest way is to use the official container.

Warning: The associated compute project requires access to a GPU cluster with data center NVIDIA GPU accelerators, such as the A100 or H100.
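To confirm that the accelerators are actually visible from a compute node, a quick check is to run nvidia-smi there. This is a minimal sketch assuming a Slurm-managed cluster; the partition name gpu is a placeholder for your site's GPU partition:

srun --partition=gpu --gres=gpu:1 nvidia-smi

The output should list the available GPUs and the installed driver version.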

The NVIDIA Triton Inference Server is published on NGC as a Docker image. To use the image on the cluster, convert it to the Apptainer .sif format.

Container Image

On a system with Docker:

docker pull nvcr.io/nvidia/tritonserver:25.09-vllm-python-py3

You can convert it directly from the local Docker daemon to an Apptainer .sif container image:

apptainer build tritonserver.sif docker-daemon://nvcr.io/nvidia/tritonserver:25.09-vllm-python-py3

Tip: If Docker is not available on the cluster, run these two steps on a workstation that has it.

Then move or copy tritonserver.sif to your HPC cluster:

scp tritonserver.sif your_user@your.cluster.address:/path/to/compute_project/containers
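Since the image is several gigabytes, it is worth verifying that the transfer completed intact. A simple check is to compare checksums on both ends with sha256sum, which is available on most Linux systems:

sha256sum tritonserver.sif

Run this on the workstation and on the cluster; the two hashes should match.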

Alternatively, you can pull and convert the image directly on an HPC node using Apptainer:

apptainer pull tritonserver_25.09-vllm-python-py3.sif docker://nvcr.io/nvidia/tritonserver:25.09-vllm-python-py3
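Note that this downloads and converts the image through Apptainer's cache, which lives in your home directory by default. If your home quota is limited, redirect the cache to project storage first by setting the standard APPTAINER_CACHEDIR environment variable (the path below is a placeholder):

export APPTAINER_CACHEDIR=/path/to/compute_project/apptainer_cache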

Make sure the .sif image then ends up at <your_project_path>/containers.
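As a quick smoke test, you can start the server from the image with GPU support enabled via Apptainer's --nv flag. This is a minimal sketch run on a GPU node; it assumes a Triton model repository already exists at the placeholder path <your_project_path>/models:

apptainer exec --nv <your_project_path>/containers/tritonserver.sif \
    tritonserver --model-repository=<your_project_path>/models

On a successful start, Triton logs the loaded models and listens on its default HTTP (8000) and gRPC (8001) ports.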