A test suite for validating LLM inference server implementations. It checks that OpenAI-compatible API responses are correctly structured: reasoning fields populated, tool calls parsed, JSON schemas respected.
```
go install github.com/aldehir/llm-serving-tests/cmd/llm-serve-test@latest
```

Or build from source:
```
git clone https://github.com/aldehir/llm-serving-tests
cd llm-serving-tests
go build -o llm-serve-test ./cmd/llm-serve-test
```

Run the tests against a server:

```
llm-serve-test --base-url http://localhost:8080/v1 --model deepseek-r1
```

Required flags:
- `--base-url` - Server base URL (include `/v1` if needed)
- `--model` - Model name to test
Optional flags:
- `--api-key` - API key if your server requires auth
- `--timeout` - Request timeout (default: 30s)
- `--response-header-timeout` - Time to wait for response headers, useful for slow prompt processing (default: 5m)
- `--verbose`/`-v` - Show full request/response for all tests
- `--filter` - Run only tests matching a pattern (e.g. `--filter tool`)
- `--class` - Run only tests of a specific class: `standard`, `reasoning`, or `interleaved`
- `--mode` - Request mode: `blocking`, `streaming`, or `both` (default: `both`)
- `--all`/`-a` - Include tests that are disabled by default
- `--extra`/`-e` - Add custom fields to request payloads (repeatable)
- `--jobs`/`-j` - Number of parallel test executions (default: 1)
Not all models support all features. Use `--class` to run tests appropriate for your model type. Classes are hierarchical (`standard` < `reasoning` < `interleaved`):
- `standard` - Basic functionality: tool calling, JSON schema. Works with any model.
- `reasoning` - Includes `standard` tests, plus tests requiring `reasoning_content` support. For reasoning models like DeepSeek R1.
- `interleaved` - Includes all tests. Adds multi-turn agentic flows where reasoning must be sent back to the server.
```
# Test a standard model
llm-serve-test --base-url http://localhost:8080/v1 --model llama-3 --class standard

# Test a reasoning model
llm-serve-test --base-url http://localhost:8080/v1 --model deepseek-r1 --class reasoning

# Run only streaming tests
llm-serve-test --base-url http://localhost:8080/v1 --model deepseek-r1 --mode streaming

# Run only blocking (non-streaming) tests
llm-serve-test --base-url http://localhost:8080/v1 --model deepseek-r1 --mode blocking

# Run 4 tests in parallel
llm-serve-test --base-url http://localhost:8080/v1 --model deepseek-r1 -j 4
```

List available tests:

```
llm-serve-test list
```

Filter the list:
```
llm-serve-test list --filter tool
llm-serve-test list --class reasoning
```

Some servers need extra parameters. Use `--extra` to add fields to the request body:
```
# String value
llm-serve-test --base-url ... --model ... --extra "custom_param=value"

# JSON value (use := instead of =)
llm-serve-test --base-url ... --model ... --extra "temperature:=0.7"
llm-serve-test --base-url ... --model ... --extra 'stop:=["\n"]'
```

### Basic
- `chat_completion` - Verifies the model returns non-empty content
### Reasoning

- `reasoning_present` - Verifies `reasoning_content` is populated
- `reasoning_not_leaked` - Confirms reasoning doesn't leak into the main `content`
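For context, these tests expect the model's reasoning to arrive in a separate field alongside the regular content, following the DeepSeek-style extension of the OpenAI response shape. A sketch of a passing response (field values are illustrative):

```json
{
  "choices": [
    {
      "message": {
        "role": "assistant",
        "reasoning_content": "The user wants a short answer, so I should just state it...",
        "content": "The answer is 4."
      },
      "finish_reason": "stop"
    }
  ]
}
```

`reasoning_present` requires `reasoning_content` to be non-empty; `reasoning_not_leaked` requires that none of that reasoning text appears in `content`.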
### Tool Calling

- `single_tool_call` - Basic tool call parsing
- `parallel_tool_calls` - Multiple concurrent tool calls
- `required_tool_call` - `tool_choice: "required"` behavior
- `required_tool_call_with_reasoning` - Tool calls don't suppress reasoning output
- `complex_schema_tool_call` - Deeply nested schema with objects, arrays, and enums
- `code_generation_tool_call` - Long-form text output in tool arguments
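These tests exercise the standard OpenAI tool-call response shape, in which arguments arrive as a JSON-encoded string inside each `tool_calls` entry. An illustrative sketch (the tool name and arguments are made up):

```json
{
  "message": {
    "role": "assistant",
    "tool_calls": [
      {
        "id": "call_1",
        "type": "function",
        "function": {
          "name": "get_weather",
          "arguments": "{\"city\": \"Tokyo\"}"
        }
      }
    ]
  },
  "finish_reason": "tool_calls"
}
```

For `parallel_tool_calls`, the `tool_calls` array must contain two or more such entries in a single response.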
### Structured Output

- `json_schema` - Response conforms to the requested JSON schema
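This corresponds to the standard `response_format` field of the OpenAI chat completions request. A minimal sketch of what such a request fragment might look like (the schema contents are illustrative):

```json
{
  "response_format": {
    "type": "json_schema",
    "json_schema": {
      "name": "weather_report",
      "schema": {
        "type": "object",
        "properties": {
          "city": { "type": "string" },
          "temp_c": { "type": "number" }
        },
        "required": ["city", "temp_c"]
      }
    }
  }
}
```

The test passes when the returned `content` parses as JSON and validates against the requested schema.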
### Agentic (Multi-Turn)

- `agentic_tool_call` - Full tool-use loop with reasoning
- `agentic_reasoning_in_template` - Reasoning included when continuing from a tool result
- `agentic_reasoning_not_in_user_template` - Reasoning excluded when the last message is from the user
- `agentic_long_response` - Long text generation after a tool call (disabled by default; use `--all` to include)
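In the agentic tests, the conversation is continued past a tool call, so the request's `messages` array ends up looking roughly like the sketch below, assuming the DeepSeek-style `reasoning_content` field on assistant messages (all contents are illustrative):

```json
[
  { "role": "user", "content": "What's the weather in Tokyo?" },
  {
    "role": "assistant",
    "reasoning_content": "I should call the weather tool first...",
    "content": "",
    "tool_calls": [
      {
        "id": "call_1",
        "type": "function",
        "function": { "name": "get_weather", "arguments": "{\"city\": \"Tokyo\"}" }
      }
    ]
  },
  { "role": "tool", "tool_call_id": "call_1", "content": "{\"temp_c\": 18}" }
]
```

Whether the server's chat template renders the assistant's `reasoning_content` back into the prompt in this situation is what the `agentic_reasoning_in_template` and `agentic_reasoning_not_in_user_template` tests probe.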
All tests support both blocking and streaming modes via `--mode`.
Request/response logs are grouped by model and timestamped:
```
logs/
└── deepseek-r1/
    ├── 2025-01-15_143022/
    │   ├── reasoning_present.log
    │   ├── single_tool_call.log
    │   └── ...
    └── 2025-01-15_152301/
        └── ...
```
The path is printed at the end of each run:
```
Logs written to: ./logs/deepseek-r1/2025-01-15_143022/
```
Use `--verbose` to also print full request/response details to the terminal.
Streaming tests also generate `.stream.jsonl` files for replay (see below).
Streaming tests capture chunks to JSONL files for later visualization. This helps verify streaming output is coherent.
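Each line of a capture holds one streamed chunk. The tool's exact on-disk format is its own, but assuming the server follows the standard OpenAI streaming shape, a captured chunk might look like this (illustrative):

```json
{"object":"chat.completion.chunk","choices":[{"index":0,"delta":{"reasoning_content":"Let me think..."},"finish_reason":null}]}
```

Replaying the file re-renders these deltas in order, which makes gaps, interleaving errors, or malformed chunks easy to spot.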
Replay a single file:
```
llm-serve-test replay "logs/deepseek-r1/2025-01-15_143022/reasoning_present (streaming).stream.jsonl"
```

Replay all streaming captures from a log directory:

```
llm-serve-test replay-all logs/deepseek-r1/2025-01-15_143022/
```

Options:
- `--delay` - Time between chunks (default: 10ms)
The output is styled:
- Reasoning - dark gray, italic, prefixed with `[thinking]`
- Content - regular text
- Tool calls - yellow, with a `[tool: name]` header
```
LLM Serving Tests
=================

Server: http://localhost:8080/v1
Model: deepseek-r1

Reasoning
  ✓ reasoning_present (blocking) (512ms)
  ✓ reasoning_present (streaming) (534ms)
  ✓ reasoning_not_leaked (blocking) (487ms)
  ✓ reasoning_not_leaked (streaming) (501ms)

Tool Calling
  ✓ single_tool_call (blocking) (623ms)
  ✓ single_tool_call (streaming) (645ms)
  ✓ parallel_tool_calls (blocking) (701ms)
  ✗ parallel_tool_calls (streaming) - expected at least 2 tool calls, got 1

Results: 7/8 passed
Logs written to: ./logs/deepseek-r1/2025-01-15_143022/
```
MIT