
Running the Full Pipeline

To run the full inference pipeline, follow these steps:

  1. Set the PYTHONPATH

This ensures your Python modules are found correctly:

```bash
export PYTHONPATH=/mnt/proj2/atr-25-3/inference-server-hpc:$PYTHONPATH
```

  2. Start the backend server

This command runs your backend job starter (make sure any required configurations are set in start_aqis.sh):

```bash
python -m tests.heappe_aqis_start_job
```

  3. Launch the Streamlit client

This starts the web UI on port 8084:

```bash
streamlit run src/client/streamlit_app.py --server.port 8084
```

You can now access the web interface at http://localhost:8084.

Multi-GPU Model Management (model_multi.py)

The model_multi.py script provides advanced multi-GPU and multi-model capabilities for the EXA4MIND AI Inference Service. This component enables efficient deployment and management of multiple large language models across multiple GPUs simultaneously.

Key Features

  • Multi-GPU Support: Automatically detects and utilizes all available CUDA devices on the system (see the detection sketch after this list)
  • Round-Robin GPU Assignment: Implements round-robin scheduling for distributing models across GPUs
  • Multiple Model Deployment: Supports loading different LLMs on different GPUs simultaneously
  • Dynamic Model Management: Allows runtime model switching and GPU reallocation
  • Memory Management: Automatically handles GPU memory cleanup when switching models
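
Device discovery itself is straightforward with PyTorch's CUDA utilities. The snippet below is a minimal sketch of only the detection step, not the exact code in model_multi.py:

```python
# Minimal sketch of CUDA device detection (assumes PyTorch is installed);
# model_multi.py may enumerate and label devices differently.
import torch

if torch.cuda.is_available():
    devices = [f"cuda:{i}" for i in range(torch.cuda.device_count())]
else:
    devices = []
print(f"available GPUs: {devices or 'none'}")
```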

Round-Robin Scheduling

When a new model needs to be loaded:

  1. If GPUs are available, the next GPU in the round-robin cycle is assigned.
  2. If all GPUs are occupied, the system evicts the model from the next GPU in the cycle.
  3. The new model is loaded on the freed GPU.
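
The scheduling policy is small enough to sketch in a few lines. The example below is a minimal, self-contained illustration; the GpuPool class and its names are hypothetical and do not match the actual classes in model_multi.py, which also has to load and unload real model weights:

```python
# Minimal round-robin GPU assignment with eviction (illustrative only; names
# such as GpuPool are hypothetical and not part of model_multi.py).
from itertools import cycle


class GpuPool:
    def __init__(self, num_gpus: int):
        self.order = cycle(range(num_gpus))  # round-robin order of device ids
        self.loaded = {}                     # gpu_id -> model_id currently resident

    def assign(self, model_id: str) -> int:
        gpu_id = next(self.order)            # next device in the cycle
        if gpu_id in self.loaded:            # occupied: evict the resident model
            evicted = self.loaded.pop(gpu_id)
            print(f"evicting {evicted} from cuda:{gpu_id}")
        self.loaded[gpu_id] = model_id       # load the new model on the freed GPU
        return gpu_id


pool = GpuPool(num_gpus=2)
print(pool.assign("llama-3-8b"))   # -> 0
print(pool.assign("mistral-7b"))   # -> 1
print(pool.assign("qwen-7b"))      # -> 0, after evicting llama-3-8b
```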

Supported Inference Engines

The multi-GPU system supports three inference engines (a dispatch sketch follows the list):

  1. Hugging Face Transformers: Direct integration with Hugging Face models
  2. Ollama: Local model serving with Ollama
  3. vLLM: High-performance inference with vLLM
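
Conceptually, engine selection amounts to a dispatch table keyed by engine name. The sketch below is purely illustrative; the loader names are hypothetical and are not the functions used in model_multi.py:

```python
# Hypothetical dispatch table mapping engine names to loader functions;
# the real loaders in model_multi.py are engine-specific and more involved.
def load_transformers(model_id, device):
    ...  # e.g. AutoModelForCausalLM.from_pretrained(model_id).to(device)

def load_ollama(model_id, device):
    ...  # e.g. ask a local Ollama server to pull and serve the model

def load_vllm(model_id, device):
    ...  # e.g. construct a vLLM engine for the model

ENGINES = {"transformers": load_transformers, "ollama": load_ollama, "vllm": load_vllm}

def load_model(engine, model_id, device):
    return ENGINES[engine](model_id, device)
```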

API Endpoints

The system responds to three main message types via ZeroMQ (a client example follows the list):

  1. change_model: Load a new model on an available GPU
  2. inference: Perform inference using a specific model
  3. get_model: Retrieve information about currently loaded models
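
A client drives the backend by sending one of these message types over a ZeroMQ socket. The sketch below assumes a REQ/REP socket on tcp://localhost:5555 and JSON messages with type, model_id, engine, and prompt fields; the actual address and field names used by AQIS may differ:

```python
# Minimal ZeroMQ client sketch for the three message types; socket address and
# message field names are assumptions, not the documented AQIS wire protocol.
import zmq

context = zmq.Context()
socket = context.socket(zmq.REQ)        # simple request/reply pattern
socket.connect("tcp://localhost:5555")  # hypothetical backend address

def send(message: dict) -> dict:
    socket.send_json(message)           # serialize and send the request
    return socket.recv_json()           # block until the backend replies

# 1. load a model on an available GPU
send({"type": "change_model", "model_id": "meta-llama/Llama-3-8B", "engine": "vllm"})
# 2. run inference against the loaded model
send({"type": "inference", "model_id": "meta-llama/Llama-3-8B", "prompt": "Hello"})
# 3. ask which models are currently loaded
print(send({"type": "get_model"}))
```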

Memory Management

The system includes sophisticated memory management:

  • Automatic CUDA cache clearing when switching models
  • Proper model cleanup and resource deallocation
  • Memory-efficient model loading and unloading
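
The cleanup performed when a model is evicted can be illustrated with a short PyTorch sequence. This is a minimal sketch, not the literal teardown code in model_multi.py, which may also perform engine-specific cleanup:

```python
# Minimal sketch of freeing GPU memory before loading a new model (assumes PyTorch).
import gc
import torch

# Stand-in for a loaded model occupying GPU memory; real models are loaded by
# the selected inference engine.
model = torch.nn.Linear(4096, 4096)
if torch.cuda.is_available():
    model = model.to("cuda:0")

# When switching models: drop the reference, collect garbage, clear the CUDA cache.
del model
gc.collect()                  # release Python-side objects
if torch.cuda.is_available():
    torch.cuda.empty_cache()  # return cached CUDA blocks to the driver
```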

Sending OpenAI Chat Completions Requests

To send an OpenAI-compatible chat completions request through the pipeline:

  1. Start the AQIS backend job:

```bash
python -m tests.heappe_aqis_start_job
```

  2. Send a chat completion request (replace placeholders as needed):

```bash
python -m tests.heappe_aqis_openai_generate --prompt "Your prompt" --model_id "model_id" --engine "inference_engine"
```

This uses the OpenAI chat format end-to-end: the FastAPI proxy forwards the payload via ZeroMQ unchanged, and the backend routes it to the selected engine (Hugging Face, vLLM, or Ollama).
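
Any OpenAI-compatible client can also post the payload directly to the FastAPI proxy. The sketch below uses the requests library and assumes the proxy exposes /v1/chat/completions on localhost:8000; adjust the URL, port, and model id to match your deployment:

```python
# Hypothetical OpenAI-style chat completions request sent straight to the proxy;
# the endpoint URL and model id are assumptions for illustration only.
import requests

payload = {
    "model": "meta-llama/Llama-3-8B",  # hypothetical model id
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Your prompt"},
    ],
}

response = requests.post(
    "http://localhost:8000/v1/chat/completions",  # assumed proxy endpoint
    json=payload,
    timeout=60,
)
print(response.json()["choices"][0]["message"]["content"])
```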