diff --git a/README.md b/README.md
index ef1c3f9..3cc93c2 100644
--- a/README.md
+++ b/README.md
@@ -6,7 +6,7 @@
 ![License](https://img.shields.io/badge/license-MIT-blue?style=for-the-badge)
 [![CI](https://github.com/peterajhgraham/cortex-engine/actions/workflows/ci.yml/badge.svg)](https://github.com/peterajhgraham/cortex-engine/actions/workflows/ci.yml)

-Brain-computer interfaces need sub-30ms decoding latency at the system level — not a Jupyter notebook, a deployed service. Cortex-Engine is a ground-up build of that production infrastructure: three custom Triton kernels (fused embedding, block-sparse cross-attention, fused RMSNorm+linear), per-channel INT8 quantization with a calibration pipeline (72% weight memory reduction), a continuous-batching inference server with an earliest-deadline-first scheduler and paged streaming KV cache, and a full Prometheus/Grafana/OpenTelemetry observability stack. The model is a Perceiver-style transformer trained on real motor cortex population data from the [Neural Latents Benchmark](https://neurallatents.github.io/); the Wiener filter baseline achieves R² = 0.48 on held-out movement trials and a trained Cortex-S should exceed that. On MPS the server delivers **10.5× throughput** over naive sequential inference; the p99 < 30ms SLO requires CUDA.
+Production inference infrastructure for transformer-based neural decoders. Three custom Triton kernels (fused embedding, block-sparse cross-attention, fused RMSNorm+linear), per-channel INT8 quantization with a calibration pipeline (72% weight memory reduction), a continuous-batching inference server with an earliest-deadline-first scheduler and paged streaming KV cache, and a full Prometheus/Grafana/OpenTelemetry observability stack. The model is a Perceiver-style transformer trained on real motor cortex population data from the [Neural Latents Benchmark](https://neurallatents.github.io/). On MPS the server delivers **10.5× throughput** over naive sequential inference; the p99 < 30ms SLO requires CUDA.

 ---

@@ -128,7 +128,7 @@ Spike Events (neuron_id, time_bin, value)

 ### Phase 2.6 — INT8 Quantization ✓

-- **Per-channel calibration:** Activation scales use 99th-percentile abs-max across calibration batches (more robust than raw max). Weight scales: absmax per output neuron.
+- **Per-channel calibration:** Activation scales use 99th-percentile abs-max across calibration batches. Weight scales: absmax per output neuron.
 - **`QuantizedLinear`:** Stores INT8 weights + float32 scales; dequantizes to bf16 before matmul — device-agnostic, no quantized CUDA kernel dependency.

 ### Phase 3 — Inference Engine ✓
@@ -151,20 +151,20 @@ Spike Events (neuron_id, time_bin, value)

 ## Tech Stack

-| Tool | Why |
+| Tool | Role |
 |---|---|
-| **PyTorch 2.2+** | FSDP2 API, `scaled_dot_product_attention` → FlashAttention dispatch, MPS backend for dev on Apple Silicon. |
-| **Triton** | Custom GPU kernels in Python without CUDA C; generates PTX directly. CUDA-only — no MPS support. |
-| **einops** | Explicit tensor reshapes (`rearrange`) that read like dimension names, not `.view()` chains. |
-| **Hydra** | Hierarchical config composition with CLI overrides; no argparse sprawl as config surface grows. |
-| **Pydantic v2** | Strict typed config and I/O schemas; `model_validator` enforces cross-field constraints at construction. |
-| **pynwb / DANDI** | NLB data is NWB-native; no format conversion needed. |
-| **W&B** | Experiment tracking with offline mode; `wandb sync` to upload when back on network. |
+| **PyTorch 2.2+** | FSDP2, `scaled_dot_product_attention` → FlashAttention dispatch, MPS backend for dev on Apple Silicon. |
+| **Triton** | Custom GPU kernels in Python; generates PTX directly. CUDA-only. |
+| **einops** | Readable tensor reshapes — `rearrange` instead of chains of `.view()`. |
+| **Hydra** | Hierarchical config composition with CLI overrides. |
+| **Pydantic v2** | Typed config and I/O schemas; `model_validator` for cross-field constraints. |
+| **pynwb / DANDI** | NLB data is NWB-native; no conversion step needed. |
+| **W&B** | Experiment tracking with offline mode; `wandb sync` to upload later. |
 | **pytest** | Unit + integration tests; `@pytest.mark.gpu` skips CUDA-only paths on non-CUDA hardware. |
-| **mypy (strict)** | Strict type checking; catches shape bugs before runtime. `py.typed` marker in package. |
+| **mypy (strict)** | Catches shape bugs before runtime. `py.typed` marker in package. |
 | **ruff + black** | Linting and formatting enforced by pre-commit hooks. |
 | **FastAPI** | Async inference server; WebSocket streaming and REST batch endpoints. |
-| **structlog** | Structured JSON logging throughout `cortex/`; no bare `print()` in library code. |
+| **structlog** | Structured JSON logging throughout `cortex/`. |

 ---

@@ -289,7 +289,7 @@ make docker-up # docker compose up -d
 ```
 docker compose -f ops/docker/docker-compose.yml --profile loadtest run loadgen
 ```
-On a CPU-only machine (Mac / CI — no NVIDIA GPU):
+On a CPU-only machine (Mac / CI):

 ```bash
 docker compose \
@@ -302,11 +302,11 @@
 Three dashboards are auto-provisioned at startup (Grafana → Dashboards → Cortex):

-**Traffic dashboard** (`cortex-traffic`) — request rate by endpoint, error rate by type, queue depth, Little's Law in-flight estimate. After `make docker-up` + a few seconds of load, you see a steady ~100 req/s rate line on the top-left panel and near-zero error rate on the top-right. Queue depth stays at 0–2 under normal conditions.
+**Traffic** (`cortex-traffic`) — request rate by endpoint, error rate by type, queue depth, Little's Law in-flight estimate.

-**Latency dashboard** (`cortex-latency`) — p50/p95/p99/p99.9 end-to-end latency time series with 30 ms / 50 ms warning/critical thresholds highlighted. Bottom-left SLO burn gauge reads < 1× (green) during normal operation and spikes to red when the k6 ramping scenario saturates the server. Batch-size heatmap shows the continuous batching coalescing requests into batches of 16–32.
+**Latency** (`cortex-latency`) — p50/p95/p99/p99.9 time series with 30 ms / 50 ms thresholds. SLO burn gauge spikes red when the k6 ramping scenario saturates the server. Batch-size heatmap shows continuous batching coalescing requests into batches of 16–32.

-**Resources dashboard** (`cortex-resources`) — GPU memory used (bytes), GPU utilization % with green (60%) / yellow (85%) / red (95%) bands, KV cache pages used (climbs during streaming sessions, drops on LRU eviction), KV cache hit rate (green above 80% — exploiting the 91.6% sliding-window overlap).
+**Resources** (`cortex-resources`) — GPU memory and utilization, KV cache pages and hit rate. Hit rate should stay above 80% during streaming sessions given the 91.6% sliding-window overlap.

 ---
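The Phase 2.6 bullets in the diff compress a lot of mechanism into two lines. As a reading aid, here is a minimal sketch of the scheme they describe: absmax weight scales per output neuron, INT8 weight storage, dequantization to bf16 before the matmul, and a percentile-based activation scale. The class and function names, signatures, and the exact percentile computation below are illustrative assumptions, not the `cortex` package's actual API.

```python
# Illustrative sketch only; not the cortex-engine implementation.
# Shows the scheme described in Phase 2.6: per-output-channel absmax weight
# scales, INT8 weight storage, and dequantization to bf16 before a standard
# matmul, so no INT8 GEMM kernel (and no CUDA dependency) is required.
import torch
import torch.nn as nn
import torch.nn.functional as F


def calibration_activation_scale(batches: list[torch.Tensor], q: float = 0.99) -> torch.Tensor:
    """One plausible reading of '99th-percentile abs-max': per-channel 99th
    percentile of |activation| over all calibration batches, mapped to INT8 range."""
    flat = torch.cat([b.abs().reshape(-1, b.shape[-1]) for b in batches], dim=0)
    return torch.quantile(flat, q, dim=0) / 127.0


class QuantizedLinearSketch(nn.Module):
    def __init__(self, linear: nn.Linear) -> None:
        super().__init__()
        w = linear.weight.detach()                         # (out_features, in_features)
        scale = w.abs().amax(dim=1, keepdim=True) / 127.0  # absmax per output neuron
        scale = scale.clamp(min=1e-8)
        self.register_buffer("w_int8", torch.round(w / scale).clamp(-127, 127).to(torch.int8))
        self.register_buffer("w_scale", scale.to(torch.float32))
        self.bias = linear.bias

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Dequantize the INT8 weights to bf16 just before the matmul.
        w = self.w_int8.to(torch.bfloat16) * self.w_scale.to(torch.bfloat16)
        bias = self.bias.to(torch.bfloat16) if self.bias is not None else None
        return F.linear(x.to(torch.bfloat16), w, bias)
```

The 72% weight memory reduction quoted in the overview is roughly consistent with this layout: INT8 weights are 4× smaller than float32, and the per-channel float32 scales add back only a small overhead.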
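The overview paragraph also names an earliest-deadline-first scheduler inside the continuous-batching server. EDF is a standard policy: each request carries an absolute deadline, and the scheduler always serves the request whose deadline expires soonest. A minimal queue sketch follows; the names and batching details are hypothetical, not the server's real scheduler.

```python
# Hypothetical EDF queue sketch; the real server's scheduler, continuous
# batching, and deadline bookkeeping are more involved than this.
import heapq
import itertools
import time
from dataclasses import dataclass, field
from typing import Any


@dataclass(order=True)
class _Entry:
    deadline: float  # absolute deadline on the monotonic clock
    seq: int         # tie-breaker for equal deadlines
    request: Any = field(compare=False)


class EDFQueue:
    """Pop requests in order of earliest absolute deadline."""

    def __init__(self) -> None:
        self._heap: list[_Entry] = []
        self._seq = itertools.count()

    def push(self, request: Any, latency_budget_s: float) -> None:
        deadline = time.monotonic() + latency_budget_s
        heapq.heappush(self._heap, _Entry(deadline, next(self._seq), request))

    def pop_batch(self, max_batch: int) -> list[Any]:
        # Coalesce up to max_batch earliest-deadline requests into one batch;
        # this is the step that lets continuous batching form the batches of
        # 16-32 mentioned in the Latency dashboard description.
        batch: list[Any] = []
        while self._heap and len(batch) < max_batch:
            batch.append(heapq.heappop(self._heap).request)
        return batch
```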
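Finally, the "Little's Law in-flight estimate" panel on the Traffic dashboard refers to the identity L = λ·W: mean requests in flight equal the arrival rate times the mean time each request spends in the system. At the ~100 req/s load mentioned in the old Traffic description, an assumed mean latency of 20 ms (illustrative; the README only commits to a p99 < 30 ms SLO) gives L ≈ 100 × 0.020 = 2 requests in flight, which lines up with the 0–2 queue depth that description quotes.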