# Testing Guide
This guide covers how to run tests for the inference server. The main focus of the test suite is on integration with the HEAppE Middleware.
## Test Types
| Type | Description |
|---|---|
| Integration | Test HEAppE and inference backend integration. |
| API tests | Test endpoints using FastAPI test client. |
| Benchmark | Test performance using the Hugging Face `inference-benchmarker`. |
## Prerequisites
Before running tests, ensure the following:
- A `.env` file is created and contains valid HEAppE credentials.
- `config.yaml` is configured for your environment.
- Python dependencies are installed:

```shell
pip install -r requirements_api.txt
```
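A minimal `.env` sketch is shown below. The variable names here are purely illustrative; use whatever keys your `config.yaml` and test scripts actually read:

```
# Hypothetical HEAppE credential keys -- match these to your deployment
HEAPPE_URL=https://heappe.example.com
HEAPPE_USERNAME=my_user
HEAPPE_PASSWORD=my_password
```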
## Run tests
### API tests
For testing the inference service API manually, you can use the Swagger UI available at the `/docs` endpoint.
We also provide a Locust test script that simulates multiple users accessing the service:
```shell
locust -f tests/fastapi_locust.py --headless \
  -u 5 -r 1 --run-time 24h --host http://localhost:8000
```
The above command launches 5 API users and runs for 24 hours. Each user is set up to make an inference request every 10 to 20 seconds.
### Benchmark
The recommended benchmark tool is the Hugging Face `inference-benchmarker`.
Example usage for benchmarking the service with the vLLM backend:
```shell
inference-benchmarker \
  --tokenizer-name "Qwen/Qwen2.5-Coder-7B-Instruct" \
  --max-vus 800 \
  --duration 30s \
  --url https://localhost:8000/v1/chat/completions \
  -a <exa4mind_service_api_key> \
  --warmup 10s \
  --benchmark-kind rate \
  --rates 1.0
```
For the Triton server backend, you must also specify a model name that matches the Triton model repository name:
```shell
inference-benchmarker \
  --tokenizer-name "Qwen/Qwen2.5-Coder-7B-Instruct" \
  --model "Qwen2.5-Coder-7B-Instruct" \
  --max-vus 800 \
  --duration 30s \
  --url https://localhost:8000/v1/chat/completions \
  -a <exa4mind_service_api_key> \
  --warmup 10s \
  --benchmark-kind rate \
  --rates 1.0
```
### HEAppE integration
Run a login test, which uses the credentials set in your `.env` file (or otherwise loaded into the environment):

```shell
python -m tests.heappe_login
```
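The login module itself is not shown here, but the underlying pattern of reading credentials from the environment can be sketched as follows; the variable names are hypothetical and should match whatever your `.env` actually defines:

```python
import os

# Hypothetical variable names -- use the keys your .env actually defines
REQUIRED_VARS = ["HEAPPE_URL", "HEAPPE_USERNAME", "HEAPPE_PASSWORD"]

def check_credentials() -> list[str]:
    """Return the names of any required credentials missing from the environment."""
    return [name for name in REQUIRED_VARS if not os.getenv(name)]

missing = check_credentials()
if missing:
    print(f"Missing credentials: {', '.join(missing)}")
```

Running a check like this before the login test makes a missing or partially filled `.env` fail fast with a clear message instead of an opaque authentication error.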