A test suite for validating LLM inference server implementations. It checks that OpenAI-compatible API responses are correctly structured: reasoning fields populated, tool calls parsed, JSON schemas respected.
```
go install github.com/aldehir/llm-serving-tests/cmd/llm-serve-test@latest
```

Or build from source:
```
git clone https://github.com/aldehir/llm-serving-tests
cd llm-serving-tests
go build -o llm-serve-test ./cmd/llm-serve-test
```

Run the tests against a server:

```
llm-serve-test --base-url http://localhost:8080/v1 --model deepseek-r1
```

Required flags:
- `--base-url` - Server base URL (include `/v1` if needed)
- `--model` - Model name to test
Optional flags:
- `--api-key` - API key if your server requires auth
- `--timeout` - Request timeout (default: 30s)
- `--response-header-timeout` - Time to wait for response headers, useful for slow prompt processing (default: 5m)
- `--verbose`/`-v` - Show full request/response for all tests
- `--filter` - Run only tests matching a pattern (e.g. `--filter tool`)
- `--class` - Run only tests of a specific class: `standard`, `reasoning`, or `interleaved`
- `--mode` - Request mode: `blocking`, `streaming`, or `both` (default: `both`)
- `--all`/`-a` - Include tests that are disabled by default
- `--extra`/`-e` - Add custom fields to request payloads (repeatable)
- `--jobs`/`-j` - Number of parallel test executions (default: 1)
Not all models support all features. Use `--class` to run tests appropriate for your model type. Classes are hierarchical (`standard` < `reasoning` < `interleaved`):
- `standard` - Basic functionality: tool calling, JSON schema. Works with any model.
- `reasoning` - Includes `standard` tests, plus tests requiring `reasoning_content` support. For reasoning models like DeepSeek R1.
- `interleaved` - Includes all tests. Adds multi-turn agentic flows where reasoning must be sent back to the server.
```
# Test a standard model
llm-serve-test --base-url http://localhost:8080/v1 --model llama-3 --class standard

# Test a reasoning model
llm-serve-test --base-url http://localhost:8080/v1 --model deepseek-r1 --class reasoning

# Run only streaming tests
llm-serve-test --base-url http://localhost:8080/v1 --model deepseek-r1 --mode streaming

# Run only blocking (non-streaming) tests
llm-serve-test --base-url http://localhost:8080/v1 --model deepseek-r1 --mode blocking

# Run 4 tests in parallel
llm-serve-test --base-url http://localhost:8080/v1 --model deepseek-r1 -j 4
```

List available tests:

```
llm-serve-test list
```

Filter the list:
```
llm-serve-test list --filter tool
llm-serve-test list --class reasoning
```

Some servers need extra parameters. Use `--extra` to add fields to the request body:
```
# String value
llm-serve-test --base-url ... --model ... --extra "custom_param=value"

# JSON value (use := instead of =)
llm-serve-test --base-url ... --model ... --extra "temperature:=0.7"
llm-serve-test --base-url ... --model ... --extra 'stop:=["\n"]'
```

### Basic
- `chat_completion` - Verifies the model returns non-empty content
### Reasoning

- `reasoning_present` - Verifies `reasoning_content` is populated
- `reasoning_not_leaked` - Confirms reasoning doesn't leak into the main `content`
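For context, these tests expect the model's reasoning to arrive in a separate field alongside the regular content, following the DeepSeek-style extension of the OpenAI response shape. A sketch of a passing response (field values are illustrative):

```json
{
  "choices": [
    {
      "message": {
        "role": "assistant",
        "reasoning_content": "The user wants a short answer, so I should just state it...",
        "content": "The answer is 4."
      },
      "finish_reason": "stop"
    }
  ]
}
```

`reasoning_present` requires `reasoning_content` to be non-empty; `reasoning_not_leaked` requires that none of that reasoning text appears in `content`.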
### Tool Calling

- `single_tool_call` - Basic tool call parsing
- `parallel_tool_calls` - Multiple concurrent tool calls
- `required_tool_call` - `tool_choice: "required"` behavior
- `required_tool_call_with_reasoning` - Tool calls don't suppress reasoning output
- `complex_schema_tool_call` - Deeply nested schema with objects, arrays, and enums
- `code_generation_tool_call` - Long-form text output in tool arguments
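These tests exercise the standard OpenAI tool-call response shape, in which arguments arrive as a JSON-encoded string inside each `tool_calls` entry. An illustrative sketch (the tool name and arguments are made up):

```json
{
  "message": {
    "role": "assistant",
    "tool_calls": [
      {
        "id": "call_1",
        "type": "function",
        "function": {
          "name": "get_weather",
          "arguments": "{\"city\": \"Tokyo\"}"
        }
      }
    ]
  },
  "finish_reason": "tool_calls"
}
```

For `parallel_tool_calls`, the `tool_calls` array must contain two or more such entries in a single response.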
### Structured Output

- `json_schema` - Response conforms to the requested JSON schema
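This corresponds to the standard `response_format` field of the OpenAI chat completions request. A minimal sketch of what such a request fragment might look like (the schema contents are illustrative):

```json
{
  "response_format": {
    "type": "json_schema",
    "json_schema": {
      "name": "weather_report",
      "schema": {
        "type": "object",
        "properties": {
          "city": { "type": "string" },
          "temp_c": { "type": "number" }
        },
        "required": ["city", "temp_c"]
      }
    }
  }
}
```

The test passes when the returned `content` parses as JSON and validates against the requested schema.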
### Agentic (Multi-Turn)

- `agentic_tool_call` - Full tool-use loop with reasoning
- `agentic_reasoning_in_template` - Reasoning included when continuing from a tool result
- `agentic_reasoning_not_in_user_template` - Reasoning excluded when the last message is from the user
- `agentic_long_response` - Long text generation after a tool call (disabled by default; use `--all` to include)
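In the agentic tests, the conversation is continued past a tool call, so the request's `messages` array ends up looking roughly like the sketch below, assuming the DeepSeek-style `reasoning_content` field on assistant messages (all contents are illustrative):

```json
[
  { "role": "user", "content": "What's the weather in Tokyo?" },
  {
    "role": "assistant",
    "reasoning_content": "I should call the weather tool first...",
    "content": "",
    "tool_calls": [
      {
        "id": "call_1",
        "type": "function",
        "function": { "name": "get_weather", "arguments": "{\"city\": \"Tokyo\"}" }
      }
    ]
  },
  { "role": "tool", "tool_call_id": "call_1", "content": "{\"temp_c\": 18}" }
]
```

Whether the server's chat template renders the assistant's `reasoning_content` back into the prompt in this situation is what the `agentic_reasoning_in_template` and `agentic_reasoning_not_in_user_template` tests probe.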
All tests support both blocking and streaming modes via `--mode`.
Request/response logs are grouped by model and timestamped:
```
logs/
└── deepseek-r1/
    ├── 2025-01-15_143022/
    │   ├── reasoning_present.log
    │   ├── single_tool_call.log
    │   └── ...
    └── 2025-01-15_152301/
        └── ...
```
The path is printed at the end of each run:
```
Logs written to: ./logs/deepseek-r1/2025-01-15_143022/
```
Use `--verbose` to also print full request/response details to the terminal.
Streaming tests also generate `.stream.jsonl` files for replay (see below).
Streaming tests capture chunks to JSONL files for later visualization. This helps verify streaming output is coherent.
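Each line of a capture holds one streamed chunk. The tool's exact on-disk format is its own, but assuming the server follows the standard OpenAI streaming shape, a captured chunk might look like this (illustrative):

```json
{"object":"chat.completion.chunk","choices":[{"index":0,"delta":{"reasoning_content":"Let me think..."},"finish_reason":null}]}
```

Replaying the file re-renders these deltas in order, which makes gaps, interleaving errors, or malformed chunks easy to spot.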
Replay a single file:
```
llm-serve-test replay "logs/deepseek-r1/2025-01-15_143022/reasoning_present (streaming).stream.jsonl"
```

Replay all streaming captures from a log directory:

```
llm-serve-test replay-all logs/deepseek-r1/2025-01-15_143022/
```

Options:
- `--delay` - Time between chunks (default: 10ms)
The output is styled:
- Reasoning - dark gray, italic, prefixed with `[thinking]`
- Content - regular text
- Tool calls - yellow, with a `[tool: name]` header
```
LLM Serving Tests
=================

Server: http://localhost:8080/v1
Model: deepseek-r1

Reasoning
  ✓ reasoning_present (blocking) (512ms)
  ✓ reasoning_present (streaming) (534ms)
  ✓ reasoning_not_leaked (blocking) (487ms)
  ✓ reasoning_not_leaked (streaming) (501ms)

Tool Calling
  ✓ single_tool_call (blocking) (623ms)
  ✓ single_tool_call (streaming) (645ms)
  ✓ parallel_tool_calls (blocking) (701ms)
  ✗ parallel_tool_calls (streaming) - expected at least 2 tool calls, got 1

Results: 7/8 passed
Logs written to: ./logs/deepseek-r1/2025-01-15_143022/
```
MIT