AQIS
Running the Full Pipeline
To run the full inference pipeline, follow these steps:
- Set the PYTHONPATH
This ensures your Python modules are found correctly:
```bash
export PYTHONPATH=/mnt/proj2/atr-25-3/inference-server-hpc:$PYTHONPATH
```
- Start the backend server
This command starts the backend job (make sure any required configuration is set in start_aqis.sh):
```bash
python -m tests.heappe_aqis_start_job
```
- Launch the Streamlit client
This starts the web UI on port 8084:
```bash
streamlit run src/client/streamlit_app.py --server.port 8084
```
You can now access the web interface at http://localhost:8084.
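Once all three steps are running, a quick way to confirm the client is reachable is a plain HTTP check. This is a minimal sketch that assumes the Streamlit server is running locally on port 8084 as configured above:
```python
import requests

# Minimal reachability check for the Streamlit client
# (assumes it is serving on localhost:8084 as started above).
resp = requests.get("http://localhost:8084", timeout=5)
print("Streamlit client reachable:", resp.status_code == 200)
```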
Multi-GPU Model Management (model_multi.py)
The model_multi.py script provides advanced multi-GPU and multi-model capabilities for the EXA4MIND AI Inference Service. This component enables efficient deployment and management of multiple large language models across multiple GPUs simultaneously.
Key Features
- Multi-GPU Support: Automatically detects and utilizes all available CUDA devices on the system
- Round-Robin GPU Assignment: Implements round-robin scheduling for distributing models across GPUs
- Multiple Model Deployment: Supports loading different LLMs on different GPUs simultaneously
- Dynamic Model Management: Allows runtime model switching and GPU reallocation
- Memory Management: Automatically handles GPU memory cleanup when switching models
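To make the GPU-detection feature above concrete, here is a minimal sketch of how the available CUDA devices can be enumerated with PyTorch; the function name detect_gpus is illustrative and not necessarily what model_multi.py uses.
```python
import torch

# Enumerate all CUDA devices visible to the process
# (illustrative sketch, not the actual model_multi.py implementation).
def detect_gpus():
    if not torch.cuda.is_available():
        return []
    return [f"cuda:{i}" for i in range(torch.cuda.device_count())]

print(detect_gpus())  # e.g. ['cuda:0', 'cuda:1', ...]
```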
Round-Robin Scheduling
When a new model needs to be loaded:
1. If a GPU is available, the next GPU in the round-robin cycle is assigned.
2. If all GPUs are occupied, the system evicts the model from the next GPU in the cycle.
3. The new model is loaded on the freed GPU.
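The sketch below illustrates this scheduling logic; the class name RoundRobinScheduler and its methods are illustrative assumptions, not the actual classes in model_multi.py.
```python
from itertools import cycle

class RoundRobinScheduler:
    """Illustrative sketch of the round-robin GPU assignment described above."""

    def __init__(self, gpu_ids):
        self._cycle = cycle(gpu_ids)   # fixed round-robin order over all GPUs
        self.loaded = {}               # gpu_id -> model currently resident on it

    def place(self, model_name):
        gpu = next(self._cycle)        # step to the next GPU in the cycle
        if gpu in self.loaded:         # all GPUs occupied: evict the resident model
            self._evict(gpu)
        self.loaded[gpu] = model_name  # load the new model on the freed GPU
        return gpu

    def _evict(self, gpu):
        # The real implementation would also free the model and clear CUDA memory.
        del self.loaded[gpu]

# Example: two GPUs, three models.
sched = RoundRobinScheduler(["cuda:0", "cuda:1"])
sched.place("model-a")   # -> cuda:0
sched.place("model-b")   # -> cuda:1
sched.place("model-c")   # -> cuda:0 (model-a is evicted first)
```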
Supported Inference Engines
The multi-GPU system supports three inference engines:
- Hugging Face Transformers: Direct integration with Hugging Face models
- Ollama: Local model serving with Ollama
- vLLM: High-performance inference with vLLM
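As an illustration of how a request might be routed to one of these engines, the sketch below dispatches on an engine name; the engine identifiers and the load_model function are assumptions, while each branch uses the standard entry point of the respective library.
```python
def load_model(engine: str, model_id: str, device: str):
    # Illustrative dispatch from engine name to loader; the actual routing
    # in model_multi.py may differ.
    if engine == "huggingface":
        from transformers import AutoModelForCausalLM, AutoTokenizer
        tokenizer = AutoTokenizer.from_pretrained(model_id)
        model = AutoModelForCausalLM.from_pretrained(model_id).to(device)
        return model, tokenizer
    if engine == "vllm":
        from vllm import LLM
        # vLLM handles device placement itself (e.g. via CUDA_VISIBLE_DEVICES).
        return LLM(model=model_id)
    if engine == "ollama":
        import ollama
        # Ollama serves models through its own daemon; pulling makes the model available.
        ollama.pull(model_id)
        return model_id
    raise ValueError(f"Unknown engine: {engine}")
```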
API Endpoints
The system responds to three main message types via ZeroMQ:
- change_model: Load a new model on an available GPU
- inference: Perform inference using a specific model
- get_model: Retrieve information about currently loaded models
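The exact message schema is not documented here, so the sketch below assumes a simple JSON envelope with a type field and an arbitrary port; the ZeroMQ calls themselves (REP socket, recv_json/send_json) are standard pyzmq.
```python
import zmq

# Illustrative ZeroMQ reply loop for the three message types; the real
# message schema and socket address used by the service may differ.
context = zmq.Context()
socket = context.socket(zmq.REP)
socket.bind("tcp://*:5555")  # assumed address/port

while True:
    msg = socket.recv_json()
    if msg.get("type") == "change_model":
        # Load msg["model_id"] on a GPU chosen by the round-robin scheduler.
        socket.send_json({"status": "ok"})
    elif msg.get("type") == "inference":
        # Run the prompt against the requested, already-loaded model.
        socket.send_json({"status": "ok", "output": "..."})
    elif msg.get("type") == "get_model":
        # Report which models are loaded and on which GPUs.
        socket.send_json({"status": "ok", "models": {}})
    else:
        socket.send_json({"status": "error", "reason": "unknown message type"})
```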
Memory Management
The system includes sophisticated memory management:
- Automatic CUDA cache clearing when switching models
- Proper model cleanup and resource deallocation
- Memory-efficient model loading and unloading
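The cleanup step typically follows the usual PyTorch pattern shown below; the function name unload_model is illustrative and the exact steps in model_multi.py may differ.
```python
import gc
import torch

# Typical cleanup when evicting a model from a GPU (illustrative sketch).
def unload_model(model):
    del model                  # drop the last Python reference to the weights
    gc.collect()               # let Python reclaim the objects
    torch.cuda.empty_cache()   # release cached CUDA memory back to the driver
```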
Sending OpenAI Chat Completions Requests
To send an OpenAI-compatible chat completions request through the pipeline:
- Start the AQIS backend job:
```bash
python -m tests.heappe_aqis_start_job
```
- Send a chat completion request (replace placeholders as needed):
```bash
python -m tests.heappe_aqis_openai_generate --prompt "Your prompt" --model_id "model_id" --engine "inference_engine"
```
This uses the OpenAI chat format end-to-end: the FastAPI proxy forwards the payload via ZeroMQ unchanged, and the backend routes it to the selected engine (Hugging Face, vLLM, or Ollama).
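If the FastAPI proxy exposes the standard /v1/chat/completions route, the same request can also be issued with the official openai Python client; the base URL, API key, and model name below are placeholders for wherever the proxy is actually reachable.
```python
from openai import OpenAI

# Point the standard OpenAI client at the FastAPI proxy
# (base_url and api_key are placeholders, not confirmed values).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="model_id",  # same model_id you would pass to --model_id above
    messages=[{"role": "user", "content": "Your prompt"}],
)
print(response.choices[0].message.content)
```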