OpenAI-compatible gateway for caching and cost control.
AI Cost Firewall is a lightweight OpenAI-compatible API gateway that reduces LLM API costs and latency by caching responses using exact matching and semantic similarity.
It sits between applications and LLM providers and forwards only necessary requests to the upstream API.
The project is developed and supported by the creators of VCAL Server.
LLM APIs are expensive and often receive repeated or semantically similar prompts.
Without caching, every request results in:
- unnecessary API calls
- increased token usage
- higher costs
- additional latency
AI Cost Firewall solves this by introducing a two-layer cache:
- Exact cache (Redis) -- instant responses for identical prompts
- Semantic cache (Qdrant) -- reuse answers for similar prompts
Only cache misses are forwarded to the upstream LLM provider.
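The two-layer lookup described above can be sketched as follows. This is an illustrative Python sketch, not the actual Rust implementation; the `exact_cache`, `semantic_cache`, `embed`, and `upstream` names are hypothetical stand-ins for Redis, Qdrant, the embedding API, and the LLM provider.

```python
# Illustrative sketch of the two-layer cache decision (not the actual Rust code).
# exact_cache maps the exact prompt to a cached response; semantic_cache holds
# (embedding, response) pairs compared by cosine similarity.

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb)

def lookup(prompt, embed, exact_cache, semantic_cache, upstream, threshold=0.92):
    # Layer 1: exact match (Redis in the real system)
    if prompt in exact_cache:
        return exact_cache[prompt], "exact_hit"
    # Layer 2: semantic match (Qdrant in the real system)
    vec = embed(prompt)
    best, best_sim = None, 0.0
    for cached_vec, response in semantic_cache:
        sim = cosine(vec, cached_vec)
        if sim > best_sim:
            best, best_sim = response, sim
    if best is not None and best_sim >= threshold:
        return best, "semantic_hit"
    # Miss: forward to the upstream LLM provider
    return upstream(prompt), "miss"
```

Only the last branch costs upstream tokens; the threshold (0.92 by default in the example configuration) decides how aggressively similar prompts are reused.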
The firewall behaves similarly to "nginx for LLM APIs".
*cache hit rate • net savings after embedding overhead • real-time cost reduction*

Local synthetic workload simulating enterprise support queries (VPN, onboarding, access requests).
Demonstrates real-time cost reduction using exact and semantic caching, with full cost breakdown (gross savings, embedding cost, and net savings).

*semantic threshold decisions • pass/fail boundary • real-time request classification*

Mixed synthetic workload simulating enterprise support traffic with both similar and divergent queries. Demonstrates semantic cache behavior under realistic conditions: high pass rate (~99%), non-zero threshold failures (boundary cases), and continuous candidate evaluation. Shows how the system balances reuse and precision while maintaining near-zero upstream calls and stable latency.
Both dashboards are pre-configured and included in the default
docker-compose.yml. See Quick Start (Docker) to run the stack locally.
- OpenAI-compatible `/v1/chat/completions` endpoint
- Exact request caching (Redis)
- Semantic cache (Qdrant)
- Token and cost savings metrics
- Prometheus observability (cost, cache, errors, runtime behavior)
- Error classification (validation / upstream / timeout / internal)
- Upstream latency and timeout tracking
- Semantic cache diagnostics (threshold, candidates, expiration behavior)
- Docker deployment
- nginx-style configuration
- Strict startup validation with clear error messages
- Hot configuration reload (SIGHUP)
- Graceful shutdown with request draining (SIGTERM / SIGINT)
- Readiness and liveness endpoints (`/readyz`, `/healthz`)
- Request size protection (`max_request_body_bytes`)
- Lightweight Rust + Axum implementation
AI Cost Firewall is designed to be safe by default, preventing accidental misconfiguration and unintended upstream costs.
v0.1.4 focuses on operational predictability and observability in real deployments.
- Clear error classification (validation / upstream / timeout / internal)
- Upstream timeout visibility and latency tracking
- Graceful shutdown with request draining and rejection tracking
- Readiness vs liveness separation (`/readyz`, `/healthz`)
- Semantic cache diagnostics:
- candidates checked
- threshold pass/fail
- expired entries skipped
- lookup latency
- Improved logging for cache decisions (hit / miss / semantic reuse)
- Safer configuration with better validation and warnings
The system now behaves predictably under load and is easier to debug in production.
Client applications send requests to the firewall instead of directly to the LLM provider.
Full architecture documentation:
The fastest way to try AI Cost Firewall is using Docker Compose.
Install:
- Docker
- Docker Compose (included with Docker Desktop)
Verify installation:

```shell
docker --version
docker compose version
```

Clone the repository and prepare the configuration:

```shell
git clone https://github.com/vcal-project/ai-firewall.git
cd ai-firewall
cp configs/ai-firewall.conf.example configs/ai-firewall.conf
```

Edit the configuration file and add your API keys:

```shell
nano configs/ai-firewall.conf
```

You should also specify the exact model names returned by your LLM provider (used for cost calculation), for example:

```
gpt-4o-mini-2024-07-18
```
The repository already includes all required Prometheus and Grafana configuration.

This will start the full stack (Firewall, Redis, Qdrant, Prometheus, Grafana):

```shell
docker compose pull
docker compose up -d
docker compose logs -f firewall
```

| Service | URL |
|---|---|
| Firewall API | http://localhost:8080 |
| Prometheus | http://localhost:9090 |
| Grafana | http://localhost:3000 |
The stack includes:
- AI Cost Firewall
- Redis
- Qdrant
- Prometheus
- Grafana
```shell
curl http://localhost:8080/v1/chat/completions \
  -H "Authorization: Bearer <your-key>" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o-mini-2024-07-18",
    "messages": [
      {"role": "user", "content": "Explain Redis briefly."}
    ]
  }'
```

AI Cost Firewall uses a simple nginx-style configuration format.
- Signal-driven operations (SIGHUP reload, SIGTERM graceful shutdown)
Example configuration:
```
listen_addr 0.0.0.0:8080;
redis_url redis://redis:6379;

upstream_base_url https://api.openai.com;
upstream_api_key sk-your-api-key;

embedding_base_url https://api.openai.com;
embedding_api_key sk-your-api-key;
embedding_model text-embedding-3-small;

qdrant_url http://qdrant:6334;
qdrant_collection aif_semantic_cache;
qdrant_vector_size 1536;

cache_ttl_seconds 2592000;
request_timeout_seconds 120;
graceful_shutdown_timeout_seconds 10; # default
max_request_body_bytes 1M;

semantic_cache_enabled true;
semantic_similarity_threshold 0.92;

# Model validation behavior
# By default, only models defined via `model_price` are allowed.
# Unknown models will be rejected with 400.
allow_unknown_models_pass_through false;

# Chat-completion pricing (USD per 1M tokens)
# model_price <model> <input_usd_per_1m_tokens> <output_usd_per_1m_tokens>;
model_price gpt-4o-mini-2024-07-18 0.15 0.60;
model_price gpt-4.1-mini-2025-04-14 0.30 1.20;

# Embedding pricing (optional, used for net cost estimation only)
embedding_price 0.020;
```
If the API returns `gpt-4o-mini-2024-07-18`, the same name must appear in the configuration.
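To illustrate how per-request cost follows from the `model_price` directives (a sketch in Python, not the gateway's actual accounting code; prices are the examples from the configuration above):

```python
# Cost of one chat completion given model_price entries (USD per 1M tokens).
# Illustrative arithmetic only; the real accounting lives in the Rust gateway.

MODEL_PRICES = {
    # model: (input_usd_per_1m_tokens, output_usd_per_1m_tokens)
    "gpt-4o-mini-2024-07-18": (0.15, 0.60),
    "gpt-4.1-mini-2025-04-14": (0.30, 1.20),
}

def request_cost_usd(model, input_tokens, output_tokens):
    in_price, out_price = MODEL_PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

# 1,000 input + 500 output tokens on gpt-4o-mini:
# (1000 * 0.15 + 500 * 0.60) / 1e6 = 0.00045 USD
```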
Misconfiguration is one of the most common causes of unexpected LLM costs. AI Cost Firewall prevents this at startup.
AI Cost Firewall performs strict validation at startup. Example errors:

```
configuration error: semantic_cache_enabled=true requires: embedding_api_key, embedding_model, qdrant_url
configuration error: no allowed models configured: add at least one model_price or set allow_unknown_models_pass_through=true
configuration error: invalid AIF_MAX_REQUEST_BODY_BYTES value 'abc'. Use formats like 1024, 512K, 1M, 2M
```
- Multiple issues reported in a single error
- Invalid configs fail fast
- Prevents unintended upstream usage
AI Cost Firewall validates the model field before forwarding requests upstream.
- Only models defined via `model_price` are considered supported
- Requests with unknown models are rejected with 400 Bad Request
- This prevents accidental or unauthorized upstream usage
Example:
```json
{
  "error": {
    "code": 400,
    "message": "Unsupported model: gpt-unknown",
    "type": "validation_error"
  }
}
```

If you want the gateway to behave like a transparent proxy:

```
allow_unknown_models_pass_through true;
```

In this mode:
- Unknown models are forwarded upstream
- Cost tracking will not be applied for unknown models
- Validation is relaxed
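The validation rule can be sketched like this (a hypothetical Python helper mirroring the documented behavior, not the gateway's actual code):

```python
# Sketch of model validation: only models with a configured price are allowed,
# unless pass-through is enabled. Mirrors the documented 400 response.

def validate_model(model, model_prices, allow_unknown_pass_through=False):
    if model in model_prices:
        return {"allowed": True, "tracked": True}
    if allow_unknown_pass_through:
        # Forwarded upstream, but cost tracking is not applied to unknown models
        return {"allowed": True, "tracked": False}
    return {
        "allowed": False,
        "error": {"code": 400,
                  "message": f"Unsupported model: {model}",
                  "type": "validation_error"},
    }
```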
`cache_ttl_seconds` defines how long cached responses remain valid.
- Exact cache (Redis): TTL is enforced automatically by Redis
- Semantic cache (Qdrant): entries are not physically deleted, but filtered at query time based on expiration
This ensures consistent behavior across both caching layers.
v0.1.4 adds visibility into semantic cache lifecycle:
- how many candidates are evaluated
- how many fail similarity threshold
- how many are skipped due to expiration
This helps diagnose low semantic hit rates and tune thresholds effectively.
Semantic cache entries are not automatically deleted from Qdrant. Expired entries are ignored during lookup, but remain stored in the collection. To reclaim disk space, old entries can be removed manually (for example, with a periodic cleanup script or scheduled job). Automatic cleanup support may be added in future versions.
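Under these TTL semantics, query-time filtering can be sketched as follows (illustrative Python; the entry field names are assumptions, not Qdrant's actual payload schema):

```python
import time

# Sketch of query-time TTL filtering for semantic cache entries.
# Entries stay in storage; expired ones are skipped (and counted) during lookup.

def live_candidates(entries, ttl_seconds, now=None):
    now = time.time() if now is None else now
    live, expired_skipped = [], 0
    for entry in entries:  # entry: {"stored_at": <unix ts>, "response": <str>}
        if now - entry["stored_at"] > ttl_seconds:
            expired_skipped += 1  # skipped at lookup, not deleted from storage
            continue
        live.append(entry)
    return live, expired_skipped
```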
`max_request_body_bytes` defines the maximum request size.

Supported formats: `1024`, `512K`, `1M`, `2M`
Requests exceeding the limit are rejected early:
```json
{
  "error": {
    "code": 413,
    "type": "validation_error",
    "message": "request body exceeds max_request_body_bytes limit"
  }
}
```

Very small values (<1K) trigger a startup warning.
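The accepted size formats can be sketched with a small parser (illustrative Python, not the gateway's actual parser; whether `K`/`M` are 1000- or 1024-based in the real implementation is an assumption, 1024-based here):

```python
# Parse byte-size values like 1024, 512K, 1M, 2M into bytes (1024-based sketch).

def parse_byte_size(value: str) -> int:
    value = value.strip()
    multipliers = {"K": 1024, "M": 1024 * 1024}
    suffix = value[-1].upper()
    if suffix in multipliers:
        return int(value[:-1]) * multipliers[suffix]
    return int(value)  # plain integer byte count
```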
When enabled:
```
semantic_cache_enabled true;
```
Required fields:
- `embedding_base_url`
- `embedding_api_key`
- `embedding_model`
- `qdrant_url`
- `qdrant_collection`
- `qdrant_vector_size`
If no configuration file is provided, AI Cost Firewall falls back to environment variables.
For convenience, you can use a .env file in development:
```
AIF_REDIS_URL=redis://127.0.0.1:6379
AIF_UPSTREAM_API_KEY=sk-xxxx
AIF_EMBEDDING_MODEL=text-embedding-3-small
AIF_EMBEDDING_PRICE_USD_PER_1M_TOKENS=0.020
```

- Variables follow the `AIF_` prefix convention
- `.env` is loaded automatically if present
- Intended for development and simple deployments
If neither a config file nor required environment variables are provided, the application will fail to start with a clear configuration error.
Example errors:
```
configuration error: AIF_REDIS_URL is required when no config file is used
configuration error: invalid AIF_QDRANT_VECTOR_SIZE value 'abc'
```
Full configuration reference:
AI Cost Firewall is designed to behave predictably in production environments.
- Stops accepting new requests
- Allows in-flight requests to complete
- Rejects new requests with 503 during shutdown
- Tracks shutdown state and rejection count
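The shutdown behavior above can be sketched as a small state machine (illustrative Python, not the actual Rust/Axum implementation):

```python
# Sketch of graceful-shutdown handling: once shutdown starts, new requests get
# 503 and are counted; in-flight requests are allowed to finish (drain).

class ShutdownGate:
    def __init__(self):
        self.shutting_down = False
        self.inflight = 0
        self.rejections = 0

    def try_accept(self):
        if self.shutting_down:
            self.rejections += 1  # tracked like aif_shutdown_rejections_total
            return 503
        self.inflight += 1
        return 200

    def finish(self):
        self.inflight -= 1

    def begin_shutdown(self):
        self.shutting_down = True

    def drained(self):
        # Safe to exit once no in-flight requests remain
        return self.shutting_down and self.inflight == 0
```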
- `/healthz` — process is alive
- `/readyz` — ready to serve traffic

During shutdown:

- `/healthz` → OK
- `/readyz` → 503
- Upstream requests are bounded by `request_timeout_seconds`
- Timeouts are explicitly tracked and classified
Prometheus metrics are available at:
Example metrics:
```
aif_requests_total
aif_cache_exact_hits
aif_cache_semantic_hits
aif_cache_misses
aif_tokens_saved
aif_cost_saved_micro_usd
aif_inflight_requests
aif_shutdown_in_progress
aif_shutdown_rejections_total
```
Token and cost savings are calculated for `/v1/chat/completions`.
For semantic cache hits:
- Gross savings are based on avoided chat-completion tokens
- Embedding lookup costs are included and deducted
- Reported savings represent net savings
Metrics:
- `aif_chat_cost_saved_micro_usd` – gross chat-completion savings
- `aif_embedding_cost_micro_usd` – embedding lookup cost
- `aif_cost_saved_micro_usd` – net savings (gross − embedding cost)
- `aif_errors_total{class=...}` – classified errors
- `aif_upstream_timeouts_total` – upstream timeout count
- `aif_upstream_request_duration_seconds` – upstream latency
- `aif_readiness_state` – readiness (1/0)
- `aif_shutdown_in_progress` – shutdown state
- `aif_semantic_candidates_checked_total`
- `aif_semantic_threshold_results_total{result="pass|fail"}`
- `aif_semantic_expired_entries_skipped_total`
- `aif_semantic_lookup_duration_seconds`
Exact cache hits have no embedding cost.
If `embedding_price` is not configured, embedding cost is treated as 0 and savings may be overestimated.
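The net-savings accounting can be illustrated with a short calculation (a sketch; prices are the examples from the configuration above). Since prices are in USD per 1M tokens, `tokens × price` conveniently yields micro-USD directly, which matches the `*_micro_usd` metric names:

```python
# Net savings for a semantic cache hit, in micro-USD (1 USD = 1_000_000 micro-USD).
# Gross savings = avoided chat-completion cost; embedding lookup cost is deducted.
# Because prices are USD per 1M tokens, tokens * price is already micro-USD.

def net_savings_micro_usd(input_tokens, output_tokens,
                          in_price, out_price,        # USD per 1M chat tokens
                          embed_tokens, embed_price): # USD per 1M embedding tokens
    gross = input_tokens * in_price + output_tokens * out_price  # micro-USD
    embedding_cost = embed_tokens * embed_price                  # micro-USD
    return gross - embedding_cost

# A hit avoiding 1000 input + 500 output gpt-4o-mini tokens, with a 20-token
# embedding lookup at 0.020: gross = 450, embedding = 0.4, net = 449.6 micro-USD
```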
Clone the repository if you want to:
- explore the code
- modify configuration templates
- build the firewall locally
- contribute to the project
```shell
git clone https://github.com/vcal-project/ai-firewall.git
cd ai-firewall
```

Build the project:

```shell
cargo build --release
```

Run the firewall:

```shell
cargo run --release
```

AI Cost Firewall includes unit tests for configuration parsing, validation, and core request handling paths.
Key areas covered:
- Config validation (required fields, limits, semantic cache requirements)
- Byte-size parsing (`1M`, `2M`, etc.) for request limits
- Negative configuration tests (invalid values, missing fields, invalid sizes)
- Aggregated validation error tests (multiple misconfigurations reported together)
- Environment variable validation (invalid formats, missing required variables)
- Cost accounting correctness (chat vs embedding vs net)
Run tests locally:
```shell
cargo test
```

If cache performance is lower than expected:
- Check semantic threshold:
  - High threshold → fewer semantic hits
- Inspect the diagnostics dashboard:
  - High `threshold_fail` → threshold too strict
  - High `expired_entries_skipped` → TTL too short
- Check upstream latency:
  - Increasing latency may indicate provider issues
- Check error classification:
  - `validation_error` → request issues
  - `upstream_timeout` → provider slow
  - `internal_error` → system issue
| Document | Description |
|---|---|
| `docs/architecture.md` | System architecture |
| `docs/config-reference.md` | Configuration directives |
| `docs/faq.md` | Frequently asked questions |
| `docs/how-it-works.md` | Request flow and caching logic |
| `docs/quickstart.md` | Full setup guide |
| `docs/operation.md` | Runtime behavior (health checks, shutdown, reload) |
Contributions are welcome.
If you would like to contribute to AI Cost Firewall — whether through bug reports, feature suggestions, documentation improvements, or code — please see:
Before submitting a pull request, please open an issue to discuss the change.
We welcome improvements in:
- performance
- documentation
- testing
- integrations with LLM providers
- observability and metrics
AI Cost Firewall can optionally integrate with VCAL Server for advanced semantic caching and distributed vector storage.
VCAL Server project:
Apache License 2.0


