# Testing Guide
This guide covers how to run tests for the inference server. The main focus of the test suite is on integration with the HEAppE Middleware.
## Test Types
| Type | Description |
|---|---|
| Integration | Test HEAppE and inference backend integration. |
| API tests | Test endpoints using FastAPI test client. |
| Benchmark | Test performance using the Hugging Face `inference-benchmarker`. |
## Prerequisites
Before running tests, ensure the following:
- A `.env` file is created and contains valid HEAppE credentials.
- `config.yaml` is configured for your environment.
- Python dependencies are installed:

```shell
pip install -r requirements_api.txt
```
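A minimal `.env` sketch is shown below. The variable names here are purely illustrative; use whatever keys your `config.yaml` and test scripts actually read:

```
# Hypothetical HEAppE credential keys -- match these to your deployment
HEAPPE_URL=https://heappe.example.com
HEAPPE_USERNAME=my_user
HEAPPE_PASSWORD=my_password
```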
## Run tests
### API tests
For testing the inference service API manually, you can use the Swagger UI available at the `/docs` endpoint.
We also provide a Locust test script that simulates multiple users accessing the service:
```shell
locust -f tests/fastapi_locust.py --headless \
  -u 5 -r 1 --run-time 24h --host http://localhost:8000
```
The above command launches 5 API users and runs for 24 hours. Each user is set up to make an inference request every 10 to 20 seconds.
### Benchmark
The recommended benchmark tool is the Hugging Face `inference-benchmarker`.
Example usage for benchmarking the service with the vLLM backend:
```shell
inference-benchmarker \
  --tokenizer-name "Qwen/Qwen2.5-Coder-7B-Instruct" \
  --max-vus 800 \
  --duration 30s \
  --url https://localhost:8000/v1/chat/completions \
  -a <exa4mind_service_api_key> \
  --warmup 10s \
  --benchmark-kind rate \
  --rates 1.0
```
For the Triton server backend, you must also specify a model name that matches the Triton model repository name:
```shell
inference-benchmarker \
  --tokenizer-name "Qwen/Qwen2.5-Coder-7B-Instruct" \
  --model "Qwen2.5-Coder-7B-Instruct" \
  --max-vus 800 \
  --duration 30s \
  --url https://localhost:8000/v1/chat/completions \
  -a <exa4mind_service_api_key> \
  --warmup 10s \
  --benchmark-kind rate \
  --rates 1.0
```
### HEAppE integration
Run a login test, which uses the credentials set in your `.env` file (or otherwise loaded into the environment):

```shell
python -m tests.heappe_login
```
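The login module itself is not shown here, but the underlying pattern of reading credentials from the environment can be sketched as follows; the variable names are hypothetical and should match whatever your `.env` actually defines:

```python
import os

# Hypothetical variable names -- use the keys your .env actually defines
REQUIRED_VARS = ["HEAPPE_URL", "HEAPPE_USERNAME", "HEAPPE_PASSWORD"]

def check_credentials() -> list[str]:
    """Return the names of any required credentials missing from the environment."""
    return [name for name in REQUIRED_VARS if not os.getenv(name)]

missing = check_credentials()
if missing:
    print(f"Missing credentials: {', '.join(missing)}")
```

Running a check like this before the login test makes a missing or partially filled `.env` fail fast with a clear message instead of an opaque authentication error.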