AQIS
Running the Full Pipeline
To run the full inference pipeline, follow these steps:
- Set the PYTHONPATH
This ensures your Python modules are found correctly:
```bash
export PYTHONPATH=/mnt/proj2/atr-25-3/inference-server-hpc:$PYTHONPATH
```
- Start the backend server
This command starts the backend job (make sure any required configuration is set in start_aqis.sh):
```bash
python -m tests.heappe_aqis_start_job
```
- Launch the Streamlit client
This starts the web UI on port 8084:
```bash
streamlit run src/client/streamlit_app.py --server.port 8084
```
You can now access the web interface at http://localhost:8084.
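Once all three steps are running, a quick way to confirm the client is reachable is a plain HTTP check. This is a minimal sketch that assumes the Streamlit server is running locally on port 8084 as configured above:
```python
import requests

# Minimal reachability check for the Streamlit client
# (assumes it is serving on localhost:8084 as started above).
resp = requests.get("http://localhost:8084", timeout=5)
print("Streamlit client reachable:", resp.status_code == 200)
```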
Multi-GPU Model Management (model_multi.py)
The model_multi.py script provides advanced multi-GPU and multi-model capabilities for the EXA4MIND AI Inference Service. This component enables efficient deployment and management of multiple large language models across multiple GPUs simultaneously.
Key Features
- Multi-GPU Support: Automatically detects and utilizes all available CUDA devices on the system
- Round-Robin GPU Assignment: Implements round-robin scheduling for distributing models across GPUs
- Multiple Model Deployment: Supports loading different LLMs on different GPUs simultaneously
- Dynamic Model Management: Allows runtime model switching and GPU reallocation
- Memory Management: Automatically handles GPU memory cleanup when switching models
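To make the GPU-detection feature above concrete, here is a minimal sketch of how the available CUDA devices can be enumerated with PyTorch; the function name detect_gpus is illustrative and not necessarily what model_multi.py uses.
```python
import torch

# Enumerate all CUDA devices visible to the process
# (illustrative sketch, not the actual model_multi.py implementation).
def detect_gpus():
    if not torch.cuda.is_available():
        return []
    return [f"cuda:{i}" for i in range(torch.cuda.device_count())]

print(detect_gpus())  # e.g. ['cuda:0', 'cuda:1', ...]
```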
Round-Robin Scheduling
When a new model needs to be loaded:
1. If a GPU is available, the next GPU in the round-robin cycle is assigned.
2. If all GPUs are occupied, the system evicts the model from the next GPU in the cycle.
3. The new model is loaded on the freed GPU.
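The sketch below illustrates this scheduling logic; the class name RoundRobinScheduler and its methods are illustrative assumptions, not the actual classes in model_multi.py.
```python
from itertools import cycle

class RoundRobinScheduler:
    """Illustrative sketch of the round-robin GPU assignment described above."""

    def __init__(self, gpu_ids):
        self._cycle = cycle(gpu_ids)   # fixed round-robin order over all GPUs
        self.loaded = {}               # gpu_id -> model currently resident on it

    def place(self, model_name):
        gpu = next(self._cycle)        # step to the next GPU in the cycle
        if gpu in self.loaded:         # all GPUs occupied: evict the resident model
            self._evict(gpu)
        self.loaded[gpu] = model_name  # load the new model on the freed GPU
        return gpu

    def _evict(self, gpu):
        # The real implementation would also free the model and clear CUDA memory.
        del self.loaded[gpu]

# Example: two GPUs, three models.
sched = RoundRobinScheduler(["cuda:0", "cuda:1"])
sched.place("model-a")   # -> cuda:0
sched.place("model-b")   # -> cuda:1
sched.place("model-c")   # -> cuda:0 (model-a is evicted first)
```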
Supported Inference Engines
The multi-GPU system supports three inference engines:
- Hugging Face Transformers: Direct integration with Hugging Face models
- Ollama: Local model serving with Ollama
- vLLM: High-performance inference with vLLM
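As an illustration of how a request might be routed to one of these engines, the sketch below dispatches on an engine name; the engine identifiers and the load_model function are assumptions, while each branch uses the standard entry point of the respective library.
```python
def load_model(engine: str, model_id: str, device: str):
    # Illustrative dispatch from engine name to loader; the actual routing
    # in model_multi.py may differ.
    if engine == "huggingface":
        from transformers import AutoModelForCausalLM, AutoTokenizer
        tokenizer = AutoTokenizer.from_pretrained(model_id)
        model = AutoModelForCausalLM.from_pretrained(model_id).to(device)
        return model, tokenizer
    if engine == "vllm":
        from vllm import LLM
        # vLLM handles device placement itself (e.g. via CUDA_VISIBLE_DEVICES).
        return LLM(model=model_id)
    if engine == "ollama":
        import ollama
        # Ollama serves models through its own daemon; pulling makes the model available.
        ollama.pull(model_id)
        return model_id
    raise ValueError(f"Unknown engine: {engine}")
```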
API Endpoints
The system responds to three main message types via ZeroMQ:
- change_model: Load a new model on an available GPU
- inference: Perform inference using a specific model
- get_model: Retrieve information about currently loaded models
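The exact message schema is not documented here, so the sketch below assumes a simple JSON envelope with a type field and an arbitrary port; the ZeroMQ calls themselves (REP socket, recv_json/send_json) are standard pyzmq.
```python
import zmq

# Illustrative ZeroMQ reply loop for the three message types; the real
# message schema and socket address used by the service may differ.
context = zmq.Context()
socket = context.socket(zmq.REP)
socket.bind("tcp://*:5555")  # assumed address/port

while True:
    msg = socket.recv_json()
    if msg.get("type") == "change_model":
        # Load msg["model_id"] on a GPU chosen by the round-robin scheduler.
        socket.send_json({"status": "ok"})
    elif msg.get("type") == "inference":
        # Run the prompt against the requested, already-loaded model.
        socket.send_json({"status": "ok", "output": "..."})
    elif msg.get("type") == "get_model":
        # Report which models are loaded and on which GPUs.
        socket.send_json({"status": "ok", "models": {}})
    else:
        socket.send_json({"status": "error", "reason": "unknown message type"})
```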
Memory Management
The system includes sophisticated memory management:
- Automatic CUDA cache clearing when switching models
- Proper model cleanup and resource deallocation
- Memory-efficient model loading and unloading
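The cleanup step typically follows the usual PyTorch pattern shown below; the function name unload_model is illustrative and the exact steps in model_multi.py may differ.
```python
import gc
import torch

# Typical cleanup when evicting a model from a GPU (illustrative sketch).
def unload_model(model):
    del model                  # drop the last Python reference to the weights
    gc.collect()               # let Python reclaim the objects
    torch.cuda.empty_cache()   # release cached CUDA memory back to the driver
```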
Sending OpenAI Chat Completions Requests
To send an OpenAI-compatible chat completions request through the pipeline:
- Start the AQIS backend job:
```bash
python -m tests.heappe_aqis_start_job
```
- Send a chat completion request (replace placeholders as needed):
```bash
python -m tests.heappe_aqis_openai_generate --prompt "Your prompt" --model_id "model_id" --engine "inference_engine"
```
This uses the OpenAI chat format end-to-end: the FastAPI proxy forwards the payload via ZeroMQ unchanged, and the backend routes it to the selected engine (Hugging Face, vLLM, or Ollama).
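If the FastAPI proxy exposes the standard /v1/chat/completions route, the same request can also be issued with the official openai Python client; the base URL, API key, and model name below are placeholders for wherever the proxy is actually reachable.
```python
from openai import OpenAI

# Point the standard OpenAI client at the FastAPI proxy
# (base_url and api_key are placeholders, not confirmed values).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="model_id",  # same model_id you would pass to --model_id above
    messages=[{"role": "user", "content": "Your prompt"}],
)
print(response.choices[0].message.content)
```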