OpenAI-compatible gateway for caching and cost control.
AI Cost Firewall is a lightweight OpenAI-compatible API gateway that reduces LLM API costs and latency by caching responses using exact matching and semantic similarity.
It sits between applications and LLM providers and forwards only necessary requests to the upstream API.
The project is developed and supported by the creators of VCAL Server.
LLM APIs are expensive and often receive repeated or semantically similar prompts.
Without caching, every request results in:
- unnecessary API calls
- increased token usage
- higher costs
- additional latency
AI Cost Firewall solves this by introducing a two-layer cache:
- Exact cache (Redis) -- instant responses for identical prompts
- Semantic cache (Qdrant) -- reuse answers for similar prompts
Only cache misses are forwarded to the upstream LLM provider.
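The two-layer lookup described above can be sketched as follows. This is an illustrative Python sketch, not the actual Rust implementation; the `exact_cache`, `semantic_cache`, `embed`, and `upstream` names are hypothetical stand-ins for Redis, Qdrant, the embedding API, and the LLM provider.

```python
# Illustrative sketch of the two-layer cache decision (not the actual Rust code).
# exact_cache maps the exact prompt to a cached response; semantic_cache holds
# (embedding, response) pairs compared by cosine similarity.

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb)

def lookup(prompt, embed, exact_cache, semantic_cache, upstream, threshold=0.92):
    # Layer 1: exact match (Redis in the real system)
    if prompt in exact_cache:
        return exact_cache[prompt], "exact_hit"
    # Layer 2: semantic match (Qdrant in the real system)
    vec = embed(prompt)
    best, best_sim = None, 0.0
    for cached_vec, response in semantic_cache:
        sim = cosine(vec, cached_vec)
        if sim > best_sim:
            best, best_sim = response, sim
    if best is not None and best_sim >= threshold:
        return best, "semantic_hit"
    # Miss: forward to the upstream LLM provider
    return upstream(prompt), "miss"
```

Only the last branch costs upstream tokens; the threshold (0.92 by default in the example configuration) decides how aggressively similar prompts are reused.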
The firewall behaves similarly to "nginx for LLM APIs".
*cache hit rate • net savings after embedding overhead • real-time cost reduction*

Local synthetic workload simulating enterprise support queries (VPN, onboarding, access requests).
Demonstrates real-time cost reduction using exact and semantic caching, with full cost breakdown (gross savings, embedding cost, and net savings).

*semantic threshold decisions • pass/fail boundary • real-time request classification*

Mixed synthetic workload simulating enterprise support traffic with both similar and divergent queries. Demonstrates semantic cache behavior under realistic conditions: high pass rate (~99%), non-zero threshold failures (boundary cases), and continuous candidate evaluation. Shows how the system balances reuse and precision while maintaining near-zero upstream calls and stable latency.
Both dashboards are pre-configured and included in the default
docker-compose.yml. See Quick Start (Docker) to run the stack locally.
- OpenAI-compatible `/v1/chat/completions` endpoint
- Exact request caching (Redis)
- Semantic cache (Qdrant)
- Token and cost savings metrics
- Prometheus observability (cost, cache, errors, runtime behavior)
- Error classification (validation / upstream / timeout / internal)
- Upstream latency and timeout tracking
- Semantic cache diagnostics (threshold, candidates, expiration behavior)
- Docker deployment
- nginx-style configuration
- Strict startup validation with clear error messages
- Hot configuration reload (SIGHUP)
- Graceful shutdown with request draining (SIGTERM / SIGINT)
- Readiness and liveness endpoints (`/readyz`, `/healthz`)
- Request size protection (`max_request_body_bytes`)
- Lightweight Rust + Axum implementation
AI Cost Firewall is designed to be safe by default, preventing accidental misconfiguration and unintended upstream costs.
v0.1.4 focuses on operational predictability and observability in real deployments.
- Clear error classification (validation / upstream / timeout / internal)
- Upstream timeout visibility and latency tracking
- Graceful shutdown with request draining and rejection tracking
- Readiness vs liveness separation (`/readyz`, `/healthz`)
- Semantic cache diagnostics:
- candidates checked
- threshold pass/fail
- expired entries skipped
- lookup latency
- Improved logging for cache decisions (hit / miss / semantic reuse)
- Safer configuration with better validation and warnings
The system now behaves predictably under load and is easier to debug in production.
Client applications send requests to the firewall instead of directly to the LLM provider.
Full architecture documentation:
The fastest way to try AI Cost Firewall is using Docker Compose.
Install:
- Docker
- Docker Compose (included with Docker Desktop)
Verify installation:

```shell
docker --version
docker compose version
```

Clone the repository and prepare the configuration:

```shell
git clone https://github.com/vcal-project/ai-firewall.git
cd ai-firewall
cp configs/ai-firewall.conf.example configs/ai-firewall.conf
```

Edit the configuration file and add your API keys:

```shell
nano configs/ai-firewall.conf
```

You should also specify the exact model names returned by your LLM provider (used for cost calculation), for example:

```
gpt-4o-mini-2024-07-18
```
The repository already includes all required Prometheus and Grafana configuration.

This will start the full stack (Firewall, Redis, Qdrant, Prometheus, Grafana):

```shell
docker compose pull
docker compose up -d
docker compose logs -f firewall
```

| Service | URL |
|---|---|
| Firewall API | http://localhost:8080 |
| Prometheus | http://localhost:9090 |
| Grafana | http://localhost:3000 |
The stack includes:
- AI Cost Firewall
- Redis
- Qdrant
- Prometheus
- Grafana
```shell
curl http://localhost:8080/v1/chat/completions \
  -H "Authorization: Bearer <your-key>" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o-mini-2024-07-18",
    "messages": [
      {"role": "user", "content": "Explain Redis briefly."}
    ]
  }'
```

AI Cost Firewall uses a simple nginx-style configuration format.
- Signal-driven operations (SIGHUP reload, SIGTERM graceful shutdown)
Example configuration:
```
listen_addr 0.0.0.0:8080;
redis_url redis://redis:6379;

upstream_base_url https://api.openai.com;
upstream_api_key sk-your-api-key;

embedding_base_url https://api.openai.com;
embedding_api_key sk-your-api-key;
embedding_model text-embedding-3-small;

qdrant_url http://qdrant:6334;
qdrant_collection aif_semantic_cache;
qdrant_vector_size 1536;

cache_ttl_seconds 2592000;
request_timeout_seconds 120;
graceful_shutdown_timeout_seconds 10; # default
max_request_body_bytes 1M;

semantic_cache_enabled true;
semantic_similarity_threshold 0.92;

# Model validation behavior
# By default, only models defined via `model_price` are allowed.
# Unknown models will be rejected with 400.
allow_unknown_models_pass_through false;

# Chat-completion pricing (USD per 1M tokens)
# model_price <model> <input_usd_per_1m_tokens> <output_usd_per_1m_tokens>;
model_price gpt-4o-mini-2024-07-18 0.15 0.60;
model_price gpt-4.1-mini-2025-04-14 0.30 1.20;

# Embedding pricing (optional, used for net cost estimation only)
embedding_price 0.020;
```
If the API returns `gpt-4o-mini-2024-07-18`, the same name must appear in the configuration.
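To illustrate how per-request cost follows from the `model_price` directives (a sketch in Python, not the gateway's actual accounting code; prices are the examples from the configuration above):

```python
# Cost of one chat completion given model_price entries (USD per 1M tokens).
# Illustrative arithmetic only; the real accounting lives in the Rust gateway.

MODEL_PRICES = {
    # model: (input_usd_per_1m_tokens, output_usd_per_1m_tokens)
    "gpt-4o-mini-2024-07-18": (0.15, 0.60),
    "gpt-4.1-mini-2025-04-14": (0.30, 1.20),
}

def request_cost_usd(model, input_tokens, output_tokens):
    in_price, out_price = MODEL_PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

# 1,000 input + 500 output tokens on gpt-4o-mini:
# (1000 * 0.15 + 500 * 0.60) / 1e6 = 0.00045 USD
```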
Misconfiguration is one of the most common causes of unexpected LLM costs. AI Cost Firewall prevents this at startup.
AI Cost Firewall performs strict validation at startup. Example errors:

```
configuration error: semantic_cache_enabled=true requires: embedding_api_key, embedding_model, qdrant_url
configuration error: no allowed models configured: add at least one model_price or set allow_unknown_models_pass_through=true
configuration error: invalid AIF_MAX_REQUEST_BODY_BYTES value 'abc'. Use formats like 1024, 512K, 1M, 2M
```
- Multiple issues reported in a single error
- Invalid configs fail fast
- Prevents unintended upstream usage
AI Cost Firewall validates the model field before forwarding requests upstream.
- Only models defined via `model_price` are considered supported
- Requests with unknown models are rejected with 400 Bad Request
- This prevents accidental or unauthorized upstream usage
Example:
```json
{
  "error": {
    "code": 400,
    "message": "Unsupported model: gpt-unknown",
    "type": "validation_error"
  }
}
```

If you want the gateway to behave like a transparent proxy:

```
allow_unknown_models_pass_through true;
```

In this mode:
- Unknown models are forwarded upstream
- Cost tracking will not be applied for unknown models
- Validation is relaxed
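The validation rule can be sketched like this (a hypothetical Python helper mirroring the documented behavior, not the gateway's actual code):

```python
# Sketch of model validation: only models with a configured price are allowed,
# unless pass-through is enabled. Mirrors the documented 400 response.

def validate_model(model, model_prices, allow_unknown_pass_through=False):
    if model in model_prices:
        return {"allowed": True, "tracked": True}
    if allow_unknown_pass_through:
        # Forwarded upstream, but cost tracking is not applied to unknown models
        return {"allowed": True, "tracked": False}
    return {
        "allowed": False,
        "error": {"code": 400,
                  "message": f"Unsupported model: {model}",
                  "type": "validation_error"},
    }
```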
`cache_ttl_seconds` defines how long cached responses remain valid.
- Exact cache (Redis): TTL is enforced automatically by Redis
- Semantic cache (Qdrant): entries are not physically deleted, but filtered at query time based on expiration
This ensures consistent behavior across both caching layers.
v0.1.4 adds visibility into semantic cache lifecycle:
- how many candidates are evaluated
- how many fail similarity threshold
- how many are skipped due to expiration
This helps diagnose low semantic hit rates and tune thresholds effectively.
Semantic cache entries are not automatically deleted from Qdrant. Expired entries are ignored during lookup, but remain stored in the collection. To reclaim disk space, old entries can be removed manually (for example, with a periodic cleanup script or scheduled job). Automatic cleanup support may be added in future versions.
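Under these TTL semantics, query-time filtering can be sketched as follows (illustrative Python; the entry field names are assumptions, not Qdrant's actual payload schema):

```python
import time

# Sketch of query-time TTL filtering for semantic cache entries.
# Entries stay in storage; expired ones are skipped (and counted) during lookup.

def live_candidates(entries, ttl_seconds, now=None):
    now = time.time() if now is None else now
    live, expired_skipped = [], 0
    for entry in entries:  # entry: {"stored_at": <unix ts>, "response": <str>}
        if now - entry["stored_at"] > ttl_seconds:
            expired_skipped += 1  # skipped at lookup, not deleted from storage
            continue
        live.append(entry)
    return live, expired_skipped
```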
`max_request_body_bytes` defines the maximum request size.

Supported formats: `1024`, `512K`, `1M`, `2M`
Requests exceeding the limit are rejected early:
```json
{
  "error": {
    "code": 413,
    "type": "validation_error",
    "message": "request body exceeds max_request_body_bytes limit"
  }
}
```

Very small values (<1K) trigger a startup warning.
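The accepted size formats can be sketched with a small parser (illustrative Python, not the gateway's actual parser; whether `K`/`M` are 1000- or 1024-based in the real implementation is an assumption, 1024-based here):

```python
# Parse byte-size values like 1024, 512K, 1M, 2M into bytes (1024-based sketch).

def parse_byte_size(value: str) -> int:
    value = value.strip()
    multipliers = {"K": 1024, "M": 1024 * 1024}
    suffix = value[-1].upper()
    if suffix in multipliers:
        return int(value[:-1]) * multipliers[suffix]
    return int(value)  # plain integer byte count
```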
When enabled:
```
semantic_cache_enabled true;
```
Required fields:
- `embedding_base_url`
- `embedding_api_key`
- `embedding_model`
- `qdrant_url`
- `qdrant_collection`
- `qdrant_vector_size`
If no configuration file is provided, AI Cost Firewall falls back to environment variables.
For convenience, you can use a .env file in development:
```
AIF_REDIS_URL=redis://127.0.0.1:6379
AIF_UPSTREAM_API_KEY=sk-xxxx
AIF_EMBEDDING_MODEL=text-embedding-3-small
AIF_EMBEDDING_PRICE_USD_PER_1M_TOKENS=0.020
```

- Variables follow the `AIF_` prefix convention
- `.env` is loaded automatically if present
- Intended for development and simple deployments
If neither a config file nor required environment variables are provided, the application will fail to start with a clear configuration error.
Example errors:
```
configuration error: AIF_REDIS_URL is required when no config file is used
configuration error: invalid AIF_QDRANT_VECTOR_SIZE value 'abc'
```
Full configuration reference:
AI Cost Firewall is designed to behave predictably in production environments.
- Stops accepting new requests
- Allows in-flight requests to complete
- Rejects new requests with 503 during shutdown
- Tracks shutdown state and rejection count
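The shutdown behavior above can be sketched as a small state machine (illustrative Python, not the actual Rust/Axum implementation):

```python
# Sketch of graceful-shutdown handling: once shutdown starts, new requests get
# 503 and are counted; in-flight requests are allowed to finish (drain).

class ShutdownGate:
    def __init__(self):
        self.shutting_down = False
        self.inflight = 0
        self.rejections = 0

    def try_accept(self):
        if self.shutting_down:
            self.rejections += 1  # tracked like aif_shutdown_rejections_total
            return 503
        self.inflight += 1
        return 200

    def finish(self):
        self.inflight -= 1

    def begin_shutdown(self):
        self.shutting_down = True

    def drained(self):
        # Safe to exit once no in-flight requests remain
        return self.shutting_down and self.inflight == 0
```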
- `/healthz` — process is alive
- `/readyz` — ready to serve traffic

During shutdown:

- `/healthz` → OK
- `/readyz` → 503
- Upstream requests are bounded by `request_timeout_seconds`
- Timeouts are explicitly tracked and classified
Prometheus metrics are available at:
Example metrics:
```
aif_requests_total
aif_cache_exact_hits
aif_cache_semantic_hits
aif_cache_misses
aif_tokens_saved
aif_cost_saved_micro_usd
aif_inflight_requests
aif_shutdown_in_progress
aif_shutdown_rejections_total
```
Token and cost savings are calculated for `/v1/chat/completions`.
For semantic cache hits:
- Gross savings are based on avoided chat-completion tokens
- Embedding lookup costs are included and deducted
- Reported savings represent net savings
Metrics:
- `aif_chat_cost_saved_micro_usd` – gross chat-completion savings
- `aif_embedding_cost_micro_usd` – embedding lookup cost
- `aif_cost_saved_micro_usd` – net savings (gross − embedding cost)
- `aif_errors_total{class=...}` – classified errors
- `aif_upstream_timeouts_total` – upstream timeout count
- `aif_upstream_request_duration_seconds` – upstream latency
- `aif_readiness_state` – readiness (1/0)
- `aif_shutdown_in_progress` – shutdown state
- `aif_semantic_candidates_checked_total`
- `aif_semantic_threshold_results_total{result="pass|fail"}`
- `aif_semantic_expired_entries_skipped_total`
- `aif_semantic_lookup_duration_seconds`
Exact cache hits have no embedding cost.
If `embedding_price` is not configured, embedding cost is treated as 0 and savings may be overestimated.
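The net-savings accounting can be illustrated with a short calculation (a sketch; prices are the examples from the configuration above). Since prices are in USD per 1M tokens, `tokens × price` conveniently yields micro-USD directly, which matches the `*_micro_usd` metric names:

```python
# Net savings for a semantic cache hit, in micro-USD (1 USD = 1_000_000 micro-USD).
# Gross savings = avoided chat-completion cost; embedding lookup cost is deducted.
# Because prices are USD per 1M tokens, tokens * price is already micro-USD.

def net_savings_micro_usd(input_tokens, output_tokens,
                          in_price, out_price,        # USD per 1M chat tokens
                          embed_tokens, embed_price): # USD per 1M embedding tokens
    gross = input_tokens * in_price + output_tokens * out_price  # micro-USD
    embedding_cost = embed_tokens * embed_price                  # micro-USD
    return gross - embedding_cost

# A hit avoiding 1000 input + 500 output gpt-4o-mini tokens, with a 20-token
# embedding lookup at 0.020: gross = 450, embedding = 0.4, net = 449.6 micro-USD
```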
Clone the repository if you want to:
- explore the code
- modify configuration templates
- build the firewall locally
- contribute to the project
```shell
git clone https://github.com/vcal-project/ai-firewall.git
cd ai-firewall
```

Build the project:

```shell
cargo build --release
```

Run the firewall:

```shell
cargo run --release
```

AI Cost Firewall includes unit tests for configuration parsing, validation, and core request handling paths.
Key areas covered:
- Config validation (required fields, limits, semantic cache requirements)
- Byte-size parsing (`1M`, `2M`, etc.) for request limits
- Negative configuration tests (invalid values, missing fields, invalid sizes)
- Aggregated validation error tests (multiple misconfigurations reported together)
- Environment variable validation (invalid formats, missing required variables)
- Cost accounting correctness (chat vs embedding vs net)
Run tests locally:
```shell
cargo test
```

If cache performance is lower than expected:
- Check semantic threshold:
  - High threshold → fewer semantic hits
- Inspect the diagnostics dashboard:
  - High `threshold_fail` → threshold too strict
  - High `expired_entries_skipped` → TTL too short
- Check upstream latency:
  - Increasing latency may indicate provider issues
- Check error classification:
  - `validation_error` → request issues
  - `upstream_timeout` → provider slow
  - `internal_error` → system issue
| Document | Description |
|---|---|
| `docs/architecture.md` | System architecture |
| `docs/config-reference.md` | Configuration directives |
| `docs/faq.md` | Frequently asked questions |
| `docs/how-it-works.md` | Request flow and caching logic |
| `docs/quickstart.md` | Full setup guide |
| `docs/operation.md` | Runtime behavior (health checks, shutdown, reload) |
Contributions are welcome.
If you would like to contribute to AI Cost Firewall — whether through bug reports, feature suggestions, documentation improvements, or code — please see:
Before submitting a pull request, please open an issue to discuss the change.
We welcome improvements in:
- performance
- documentation
- testing
- integrations with LLM providers
- observability and metrics
AI Cost Firewall can optionally integrate with VCAL Server for advanced semantic caching and distributed vector storage.
VCAL Server project:
Apache License 2.0


