llm-serving-tests

A test suite for validating LLM inference server implementations. It checks that OpenAI-compatible API responses are correctly structured: reasoning fields populated, tool calls parsed, and JSON schemas respected.

Install

go install github.com/aldehir/llm-serving-tests/cmd/llm-serve-test@latest

Or build from source:

git clone https://github.com/aldehir/llm-serving-tests
cd llm-serving-tests
go build -o llm-serve-test ./cmd/llm-serve-test

Usage

llm-serve-test --base-url http://localhost:8080/v1 --model deepseek-r1

Required flags:

  • --base-url - Server base URL (include /v1 if needed)
  • --model - Model name to test

Optional flags:

  • --api-key - API key if your server requires auth
  • --timeout - Request timeout (default: 30s)
  • --response-header-timeout - Time to wait for response headers, useful for slow prompt processing (default: 5m)
  • --verbose / -v - Show full request/response for all tests
  • --filter - Run only tests matching a pattern (e.g. --filter tool)
  • --class - Run only tests of a specific class: standard, reasoning, or interleaved
  • --mode - Request mode: blocking, streaming, or both (default: both)
  • --all / -a - Include tests that are disabled by default
  • --extra / -e - Add custom fields to request payloads (repeatable)
  • --jobs / -j - Number of parallel test executions (default: 1)

Test Classes

Not all models support all features. Use --class to run tests appropriate for your model type. Classes are hierarchical (standard < reasoning < interleaved):

  • standard - Basic functionality: tool calling, JSON schema. Works with any model.
  • reasoning - Includes standard tests, plus tests requiring reasoning_content support. For reasoning models like DeepSeek R1.
  • interleaved - Includes all tests. Adds multi-turn agentic flows where reasoning must be sent back to the server.

# Test a standard model
llm-serve-test --base-url http://localhost:8080/v1 --model llama-3 --class standard

# Test a reasoning model
llm-serve-test --base-url http://localhost:8080/v1 --model deepseek-r1 --class reasoning

# Run only streaming tests
llm-serve-test --base-url http://localhost:8080/v1 --model deepseek-r1 --mode streaming

# Run only blocking (non-streaming) tests
llm-serve-test --base-url http://localhost:8080/v1 --model deepseek-r1 --mode blocking

# Run 4 tests in parallel
llm-serve-test --base-url http://localhost:8080/v1 --model deepseek-r1 -j 4

List Available Tests

llm-serve-test list

Filter the list:

llm-serve-test list --filter tool
llm-serve-test list --class reasoning

Custom Request Fields

Some servers need extra parameters. Use --extra to add fields to the request body:

# String value
llm-serve-test --base-url ... --model ... --extra "custom_param=value"

# JSON value (use := instead of =)
llm-serve-test --base-url ... --model ... --extra "temperature:=0.7"
llm-serve-test --base-url ... --model ... --extra 'stop:=["\n"]'
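Taken together, the invocations above would merge fields into the request body along these lines (illustrative sketch; the exact merge behavior is an assumption, and the suite's own `messages` payload is elided):

```json
{
  "model": "deepseek-r1",
  "custom_param": "value",
  "temperature": 0.7,
  "stop": ["\n"]
}
```

With `=` the value is sent as a JSON string; with `:=` it is parsed as raw JSON before being merged in.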

What Gets Tested

Basic

  • chat_completion - Verifies model returns non-empty content

Reasoning

  • reasoning_present - Verifies reasoning_content is populated
  • reasoning_not_leaked - Confirms reasoning doesn't leak into main content
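For a reasoning model, a passing response message looks roughly like this (DeepSeek-style `reasoning_content` extension to the OpenAI message schema; the text shown is illustrative):

```json
{
  "role": "assistant",
  "reasoning_content": "The user asked for 2 + 2, which is 4.",
  "content": "The answer is 4."
}
```

reasoning_present checks that `reasoning_content` is non-empty; reasoning_not_leaked checks that the reasoning text does not also appear inside `content`.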

Tool Calling

  • single_tool_call - Basic tool call parsing
  • parallel_tool_calls - Multiple concurrent tool calls
  • required_tool_call - tool_choice: "required" behavior
  • required_tool_call_with_reasoning - Tool calls don't suppress reasoning output
  • complex_schema_tool_call - Deeply nested schema with objects, arrays, enums
  • code_generation_tool_call - Long-form text output in tool arguments
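These tests validate the standard OpenAI tool-call shape in the assistant message (the tool name, call id, and arguments below are illustrative, not from the suite):

```json
{
  "role": "assistant",
  "content": null,
  "tool_calls": [
    {
      "id": "call_abc123",
      "type": "function",
      "function": {
        "name": "get_weather",
        "arguments": "{\"location\": \"Tokyo\"}"
      }
    }
  ]
}
```

A server that emits tool calls as raw text in `content` instead of populating `tool_calls` will fail these tests.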

Structured Output

  • json_schema - Response conforms to requested JSON schema
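Structured output uses the OpenAI `response_format` request parameter; a request fragment of the kind this test sends looks roughly like the following (the schema itself is illustrative, not the suite's actual schema):

```json
{
  "response_format": {
    "type": "json_schema",
    "json_schema": {
      "name": "answer",
      "schema": {
        "type": "object",
        "properties": { "answer": { "type": "string" } },
        "required": ["answer"]
      }
    }
  }
}
```

The test then asserts that the response `content` is valid JSON conforming to the requested schema.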

Agentic (Multi-Turn)

  • agentic_tool_call - Full tool use loop with reasoning
  • agentic_reasoning_in_template - Reasoning included when continuing from tool result
  • agentic_reasoning_not_in_user_template - Reasoning excluded when last message is from user
  • agentic_long_response - Long text generation after tool call (disabled by default, use --all to include)
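In these multi-turn flows, the suite sends the assistant's own turn, including its `reasoning_content`, back to the server together with the tool result. A sketch of such a follow-up request body (field values illustrative; how the server's chat template handles the reasoning field is exactly what these tests probe):

```json
{
  "messages": [
    { "role": "user", "content": "What's the weather in Tokyo?" },
    {
      "role": "assistant",
      "reasoning_content": "The user wants the weather, so call the tool.",
      "tool_calls": [
        {
          "id": "call_1",
          "type": "function",
          "function": { "name": "get_weather", "arguments": "{\"location\": \"Tokyo\"}" }
        }
      ]
    },
    { "role": "tool", "tool_call_id": "call_1", "content": "{\"temp_c\": 21}" }
  ]
}
```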

All tests support both blocking and streaming modes via --mode.

Logs

Request/response logs are grouped by model and timestamped:

logs/
└── deepseek-r1/
    ├── 2025-01-15_143022/
    │   ├── reasoning_present.log
    │   ├── single_tool_call.log
    │   └── ...
    └── 2025-01-15_152301/
        └── ...

The path is printed at the end of each run:

Logs written to: ./logs/deepseek-r1/2025-01-15_143022/

Use --verbose to also print full request/response details to the terminal.

Streaming tests also generate .stream.jsonl files for replay (see below).

Replay Streaming Responses

Streaming tests capture chunks to JSONL files for later visualization. This helps verify streaming output is coherent.
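Assuming each line of a `.stream.jsonl` file holds one captured SSE chunk as JSON (the exact capture format is not documented here), lines resemble standard `chat.completion.chunk` objects:

```json
{"object":"chat.completion.chunk","choices":[{"index":0,"delta":{"reasoning_content":"Let me"}}]}
```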

Replay a single file:

llm-serve-test replay "logs/deepseek-r1/2025-01-15_143022/reasoning_present (streaming).stream.jsonl"

Replay all streaming captures from a log directory:

llm-serve-test replay-all logs/deepseek-r1/2025-01-15_143022/

Options:

  • --delay - Time between chunks (default: 10ms)

The output is styled:

  • Reasoning - Dark gray, italic, prefixed with [thinking]
  • Content - Regular text
  • Tool calls - Yellow, with [tool: name] header

Example Output

LLM Serving Tests
=================
Server: http://localhost:8080/v1
Model: deepseek-r1

Reasoning
  ✓ reasoning_present (blocking) (512ms)
  ✓ reasoning_present (streaming) (534ms)
  ✓ reasoning_not_leaked (blocking) (487ms)
  ✓ reasoning_not_leaked (streaming) (501ms)

Tool Calling
  ✓ single_tool_call (blocking) (623ms)
  ✓ single_tool_call (streaming) (645ms)
  ✓ parallel_tool_calls (blocking) (701ms)
  ✗ parallel_tool_calls (streaming) - expected at least 2 tool calls, got 1

Results: 7/8 passed

Logs written to: ./logs/deepseek-r1/2025-01-15_143022/

License

MIT
