diff --git a/README.md b/README.md
index 19da4ed..cac21dc 100644
--- a/README.md
+++ b/README.md
@@ -556,6 +556,35 @@ docker compose up -d
 
 ---
 
+## Performance Benchmarks
+
+The following benchmarks were collected by running DocuBot's full 9-agent documentation pipeline in three inference environments. Use these results to choose the deployment profile that fits your needs.
+
+> **Note:** Intel Enterprise Inference was tested on Intel Xeon hardware to demonstrate on-premises SLM deployment for enterprise codebases.
+
+### Results
+
+| Model Type / Inference Provider | Model Name | Deployment | Context Window | Avg Input Tokens | Avg Output Tokens | Avg Total Tokens / Request | P50 Latency (ms) | P95 Latency (ms) | Throughput (req/sec) | Hardware Profile |
+|---|---|---|---|---|---|---|---|---|---|---|
+| vLLM | Qwen3-4B-Instruct-2507 | Local | 262.1K | 3,040.0 | 307.7 | 5,809 | 15,864 | 40,809 | 0.058 | Apple Silicon (Metal) |
+| Enterprise Inference / SLM · [Intel OPEA EI](https://opea.dev) | Qwen3-4B-Instruct-2507 | CPU (Xeon) | 8.1K | 4,211.9 | 270.0 | 4,481 | 10,540 | 32,205 | 0.076 | CPU-only |
+| OpenAI (Cloud) | gpt-4o-mini | API (Cloud) | 128K | 3,820.1 | 316.4 | 4,137 | 7,760 | 23,535 | 0.108 | N/A |
+
+### Key Findings
+
+- **Cloud leads on speed**: gpt-4o-mini delivers 26% faster P50 latency (7,760 ms vs 10,540 ms on Xeon) and 42% higher throughput (0.108 vs 0.076 req/sec) than CPU-only Qwen3-4B — cloud GPUs remove the hardware bottleneck for latency-sensitive pipelines.
+- **The context window gap critically impacts multi-agent workflows**: Cloud's 128K vs Xeon's 8.1K (a 94% reduction) forces aggressive prompt truncation for code documentation tasks, while vLLM's 262.1K on Apple Silicon enables full-context processing without chunking strategies.
+- **Cloud generates more with less input**: gpt-4o-mini produces 17% more output (316 vs 270 tokens) while consuming 9% less input (3,820 vs 4,212 tokens), indicating superior prompt compression and generation efficiency.
+- **Apple Silicon throughput lags despite its large context window**: even with a 32× larger context window (262.1K vs 8.1K), Apple Silicon achieves only 0.058 req/sec — 46% slower than cloud and 24% slower than CPU-only Xeon — suggesting Metal optimization gaps for multi-agent workloads.
+- **Deployment stability affects cost predictability**: cloud shows 26% token variance per run (3,618–4,915 tokens/req) vs Xeon's 7% (4,375–4,688), reflecting dynamic resource allocation versus consistent CPU-bound processing.
+
+### Model Capabilities
+
+| Model | Highlights |
+|---|---|
+| **Qwen3-4B-Instruct-2507** | 4B-parameter code-specialized model with a 262.1K native context window (deployment-limited to 8.1K on Xeon CPU). Supports multi-agent documentation generation, code analysis, and structured JSON output. Enables full on-premises deployment with data sovereignty for enterprise codebases. |
+| **gpt-4o-mini** | Cloud-native multimodal model with a 128K context window, optimized for code understanding and technical documentation. Delivers 42% higher throughput and 26% lower latency than CPU-based alternatives while supporting concurrent multi-agent orchestration at cloud scale. |
+
 ## Environment Variables
 
 Configure the application behavior using environment variables in `api/.env`:
diff --git a/ui/src/images/UI_2.png b/ui/src/images/UI_2.png
deleted file mode 100644
index c519526..0000000
Binary files a/ui/src/images/UI_2.png and /dev/null differ
diff --git a/ui/src/images/UI_4.png b/ui/src/images/UI_4.png
deleted file mode 100644
index e095ef5..0000000
Binary files a/ui/src/images/UI_5_agent.png and /dev/null differ
diff --git a/ui/src/images/UI_5_agent.png b/ui/src/images/UI_5_agent.png
deleted file mode 100644
index 3380c5b..0000000
Binary files a/ui/src/images/UI_5_agent.png and /dev/null differ
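For reviewers checking the benchmark section above: the per-provider metrics (P50/P95 latency, req/sec throughput, token variance) can be derived from raw per-request samples with a few lines of code. This is a minimal sketch, not DocuBot's actual harness — the sample values are invented, and the metric definitions (nearest-rank percentiles, requests over wall-clock seconds, range-over-mean variance) are our reading of how the table's figures are framed.

```python
import statistics

def percentile(samples, p):
    """Nearest-rank percentile: the value at rank ceil(p/100 * n) in sorted order."""
    ordered = sorted(samples)
    rank = max(1, round(p / 100 * len(ordered)))
    return ordered[rank - 1]

# Hypothetical per-request latencies (ms) and total-token counts from one run.
latencies_ms = [6900, 7200, 7760, 8100, 9400, 12500, 18200, 23535]
tokens_per_request = [3618, 3950, 4100, 4250, 4500, 4915]

p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)

# Throughput: completed requests divided by total wall-clock seconds
# (assumes serial execution, so wall-clock time is the sum of latencies).
wall_clock_s = sum(latencies_ms) / 1000
throughput = len(latencies_ms) / wall_clock_s

# Token variance expressed as range over mean, matching the "26% variance" style.
mean_tokens = statistics.mean(tokens_per_request)
variance_pct = (max(tokens_per_request) - min(tokens_per_request)) / mean_tokens * 100

print(f"P50={p50} ms  P95={p95} ms  throughput={throughput:.3f} req/s  variance={variance_pct:.1f}%")
```

If requests run concurrently, replace the serial wall-clock assumption with the span between the first request's start and the last request's finish.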