
# CLAUDE.md — protoLabs AI Lab

Monorepo for model evaluation, training, inference infrastructure, and ML experiments.

## Structure

  • packages/lab-core/ — Shared Pydantic models, GPU utils (strict, tested, publishable)
  • evals/ — LLM eval suite: claw-eval, custom suites, function-call, RAG (strict, tested)
  • models/ — Model inventory, vllm-swap.sh, benchmarks
  • training/ — Fine-tuning workspace (LLaMA-Factory configs, datasets)
  • experiments/ — ML experiments (context-1, ltx-video, flux2, quantize, pixel-gen, stt-whisper, voice-agent)
  • infra/ — Prometheus exporters, vLLM systemd, gateway configs (gateway runs on ava node)

## How we operate

The default work pattern in this lab is a closed cycle:

experiment  →  report  →  engineering  →  test  →  content  →  repeat

Every phase has an exit criterion. Don't move to the next phase until the current one is done.

| Phase | What it is | Exit when |
| --- | --- | --- |
| experiment | research, training, iteration in `experiments/<name>/` | model artifact + Tier-0 baselines (majority + linear probe + one off-the-shelf comparable) + cross-domain held-out eval |
| report | internal RESULTS.md with honest numbers, confusion matrices, what didn't work | written without softening the failures — the report is for next-session-us |
| engineering | wire the artifact into the consuming product (ORBIS, gateway, CI) | a PR landed on the consuming repo's main branch |
| test | validate under real conditions, not just held-out splits | one round of real-world signal observed (real users, real traffic, real audio — not benchmark replay) |
| content | public artifact: HF model card + dataset card + blog post on protolabs.studio | merged blog post + public HF release (private during draft, public after review) |
| repeat | next experiment, informed by what test and content revealed | always |

Bias toward shipping. The trap is doing experiment → report → experiment ad infinitum. The cycle is only valuable when it closes. If a piece of work has been in report for a week without engineering, that's the signal to either ship or kill it.

What this means for new experiments:

  • Don't start a new experiment unless the previous one is past engineering (or has been explicitly parked with a memo)
  • Each experiment plans its content deliverable on day one — what's the blog headline? — and works backward from that headline to the work needed to get there
  • Default to publishing publicly via protoLabsAI/... on HuggingFace and protolabs.studio for the writeup. Privacy is a temporary state during drafting, not the default.

## Companion-stack research workspace

experiments/companion-stack/ is the umbrella for ORBIS-supporting research. New experiments around the conversational AI loop go there, organized by pipe (audio-pre/, text-pre/, llm-context/, text-post/, memory/, visual/).

Read experiments/companion-stack/README.md, ROADMAP.md, and LEARNING.md before starting a new ORBIS-related experiment. The audio-tags experiment (pipes/audio-pre/audio-tags/) is the worked exemplar — copy its shape.

Heuristic for new pipes: if the LLM would have to guess something a small classifier could measure, you're paying LLM cost for a job a 1.7 ms head could do better. That's the whole game.

## Brand & monetization — protolabs.studio

protolabs.studio is the public face. The work in this lab is only valuable to the brand if it surfaces there.

Discipline:

  • Every shipped experiment produces a blog draft before the next experiment starts. Drafts live in experiments/<name>/BLOG.md. The audio-tags experiment is the template.
  • HuggingFace org protoLabsAI/ is the canonical model + dataset publishing target. Models go up private during drafting, flipped public on blog publish.
  • Cross-link aggressively. Blog → HF model + dataset → ORBIS PR → back to blog. The whole stack should be one click apart.
  • Cite our own prior work. v5 cites v4 cites v3 cites v2 cites v0. The lineage is the credibility.

When in doubt about whether something should be public, default to publishing. The lab's expertise is only monetizable to the extent it's visible.

## Using uv

```bash
uv sync                                    # sync all workspaces
uv run pytest                              # run tests (lab-core + evals)
uv run proto-eval claw --model local       # run evals CLI
uv run models --gpu single                 # show model inventory
uv run ruff check .                        # lint everything
```

## Running Models

```bash
bash models/vllm-swap.sh qwen-35b           # speed king MoE FP8 official, 262K (180 tok/s)
bash models/vllm-swap.sh qwen-9b-fp8        # on-the-fly FP8 (140 tok/s)
bash models/vllm-swap.sh qwen-4b-fp8        # on-the-fly FP8 edge (140 tok/s)
bash models/vllm-swap.sh qwen-27b-int4      # daily driver, agentic (53 tok/s)
bash models/vllm-swap.sh qwen-27b-int4-mtp  # daily driver + MTP, chat/creative (70 tok/s)
bash models/vllm-swap.sh qwen-4b-int4       # edge deploy, fastest absolute (297 tok/s)
bash models/vllm-swap.sh qwen-4b            # LoRA base bf16 (155 tok/s)
bash models/vllm-swap.sh qwen-122b-fp8      # quality ceiling FP8 official TP=2 (112 tok/s)
bash models/vllm-swap.sh qwen-122b-int4     # quality ceiling INT4 TP=2 (122 tok/s)
bash models/vllm-swap.sh qwen-27b-fp8-tp2   # FP8 official TP=2 (70 tok/s, 131K)
```
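After a swap, the new model can be sanity-checked against vLLM's OpenAI-compatible API. A minimal check (the port is an assumption; use whatever vllm-swap.sh actually binds):

```bash
# Quick sanity check after a swap (port 8000 is an assumption; match vllm-swap.sh)
curl -s http://localhost:8000/v1/models | python3 -m json.tool
```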

## Speed Testing

```bash
bash models/speed-test.sh           # 5 runs on current model (800 tok gen)
bash models/speed-test.sh 10        # 10 runs
bash models/speed-test.sh 3 short   # 3 short runs (200 tokens)

# A/B compare baseline vs optimized config
cd evals && bash run-ab-speed.sh qwen-4b-int4 5
```

Reports decode tok/s (1/TPOT), wall tok/s, TTFT, and TPOT from vLLM's /metrics endpoint — not wall-clock estimation.
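For reference, the same numbers can be pulled by hand from the metrics endpoint. A minimal sketch, assuming the standard vLLM metric names (vllm:time_per_output_token_seconds_*) and port 8000; verify both against the running instance:

```bash
# Sketch: mean TPOT and decode tok/s straight from vLLM's Prometheus metrics.
# Metric names and port are assumptions; check `curl -s localhost:8000/metrics` first.
curl -s http://localhost:8000/metrics | awk '
  /^vllm:time_per_output_token_seconds_sum/   { sum   = $2 }
  /^vllm:time_per_output_token_seconds_count/ { count = $2 }
  END {
    if (count > 0) {
      tpot = sum / count
      printf "mean TPOT: %.1f ms | decode: %.1f tok/s\n", tpot * 1000, 1 / tpot
    }
  }'
```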

## Optimization Flags (-opt configs)

Suffix any config with -opt to enable P1+P2 flags:

  • --async-scheduling — overlap scheduling with execution
  • --enable-prefix-caching — reuse KV cache for repeated prefixes
  • --performance-mode interactivity — auto-tune scheduler for latency
  • --kv-cache-dtype fp8 — halve KV cache memory, double context capacity

Measured impact (single-request, P1+P2 only): minimal (+1-3% tok/s). Real wins are under concurrent load and multi-turn (prefix caching). FP8 KV doubles context capacity.
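Usage is just the suffixed config name; the exact flag set lives in models/vllm-swap.sh:

```bash
# Daily driver with the P1+P2 optimization flags enabled
bash models/vllm-swap.sh qwen-27b-int4-opt
```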

## MTP Speculative Decoding (-mtp configs)

Native Qwen3.5 Multi-Token Prediction — big speed gains on dense models:

| Model | Baseline | + MTP | Gain | Tool Calling |
| --- | --- | --- | --- | --- |
| 27B INT4 | 53 tok/s | 70 tok/s | +32% | Works, but T08 quality regresses |
| 9B | 92 tok/s | 112 tok/s | +22% | Works, no quality loss |
| 35B MoE | 171 tok/s | 153 tok/s | -11% | N/A — slower, don't use |
  • MTP helps dense models, hurts MoE (routing overhead > speculation savings)
  • 9B + MTP is safe for all workloads including tool calling
  • 27B + MTP: use for chat/creative (70 tok/s), avoid for complex agentic (T08 regresses)
  • MoE FP8 env vars: VLLM_USE_FLASHINFER_MOE_FP8=1 VLLM_FLASHINFER_MOE_BACKEND=latency
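Concretely, those env vars would wrap the 35B MoE launch roughly like this (a sketch; vllm-swap.sh may already export them, so check before duplicating):

```bash
# Sketch only; check models/vllm-swap.sh before setting these a second time
VLLM_USE_FLASHINFER_MOE_FP8=1 \
VLLM_FLASHINFER_MOE_BACKEND=latency \
bash models/vllm-swap.sh qwen-35b
```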

## TP=2 Tuning (122B, 35B-tp2)

NCCL env vars for PCIe (no NVLink): NCCL_ALGO=Ring NCCL_PROTO=Simple NCCL_MIN_NCHANNELS=4 NCCL_MAX_NCHANNELS=8
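As a concrete sketch (assuming the systemd units or vllm-swap.sh don't already set these), a TP=2 launch environment looks like:

```bash
# NCCL tuning for PCIe without NVLink; canonical values belong in the
# systemd units / models/vllm-swap.sh
export NCCL_ALGO=Ring
export NCCL_PROTO=Simple
export NCCL_MIN_NCHANNELS=4
export NCCL_MAX_NCHANNELS=8
export NCCL_P2P_DISABLE=1   # keeps CUDA graphs stable on TP=2 (see tested results below)
bash models/vllm-swap.sh qwen-122b-int4
```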

Tested results:

  • FIX: NCCL_P2P_DISABLE=1 enables stable CUDA graphs on TP=2 Blackwell PCIe
  • 122B INT4: 23→122 tok/s (5.3x, stable over 10 runs)
  • 35B MoE: 22→205 tok/s (9.3x, stable over 10 runs)
  • Root cause: ACS enabled on PCIe bridges corrupts P2P during CUDA graph replay
  • Disabling P2P forces shared memory transport — slight overhead but fully stable
  • TTFT: 3077ms→29ms with prefix caching after warmup
  • Power draw: only 88-96W per card at 600W limit — MoE is not power-bound
  • 35B TP=2: prefix caching fixed 1.8s TTFT → 0.5s (-70%), wall tok/s +25%
  • VLLM_USE_FLASHINFER_MOE_FP8 crashes on 122B FP8 (unsupported quant scheme) — don't use
  • Previous finding that TP=2 needs enforce-eager was WRONG — NCCL_P2P_DISABLE=1 fixes it

## Running Evals

```bash
cd evals

# Profile runs (recommended)
./run.sh profile --name quick --model local    # ~15 min smoke test, 1 trial
./run.sh profile --name full --model local     # ~60-90 min comprehensive, 3 trials

# Individual runs
./run.sh claw --model local --tasks T02,T04,T06,T08 --port-offset 200
./run.sh custom --suite coding --model local --trials 1
./run.sh custom --suite reasoning --model local --trials 1
./run.sh function-call --model local --all-suites
```

## Eval Suites

| Suite | Tests | What it measures |
| --- | --- | --- |
| claw-eval | 52 EN | Agentic tool use (email, calendar, CRM, ops, finance) |
| coding | 10 | Generation (5) + analysis/review/security (5) |
| instruction_following | 5 | Constraint adherence, format compliance |
| reasoning | 5 | Math, logic puzzles, deduction, pattern recognition |
| structured_output | 5 | JSON, YAML, SQL, markdown tables, log parsing |
| summarization | 5 | Compression, action extraction, TL;DR |
| safety | 5 | Refusal, jailbreak resistance, PII, security review |
| creative_writing | 5 | Prose, narrative, character voice |
| roleplay | 5 | RPG GM quality, world building |
| svg_generation | 5 | SVG validity, accuracy, animation |
| research | 4 | Synthesis, conflicting sources, hallucination |
| function_call | 8 | Basic (5) + edge cases (3) |

## Eval Profiles

  • quick — 6 claw tasks + 6 custom suites + FC, 1 trial (~15 min)
  • full — 20 claw tasks + 10 custom suites + FC, 3 trials pass^3 (~60-90 min)

## Model Inventory (/mnt/models)

| Model | Size | tok/s | +MTP | Quick Score | Role |
| --- | --- | --- | --- | --- | --- |
| Qwen 35B MoE FP8 | 35GB | 180 | — | — | Speed king, 262K ctx, single GPU (Qwen official) |
| Qwen 9B FP8 | 19GB* | 141 | — | — | On-the-fly FP8 (+53% vs bf16) |
| Qwen 4B FP8 | 8.8GB* | 141 | — | — | On-the-fly FP8 edge |
| Qwen 27B INT4 | 29GB | 53 | 70 | 86/103 | Daily driver (MTP for chat, baseline for agents) |
| Qwen 122B FP8 | 119GB | 112 | — | — | Quality ceiling FP8 TP=2 (Qwen official) |
| Qwen 122B INT4 | 74GB | 122 | — | 89/103 | Quality ceiling INT4 TP=2 (faster on PCIe) |
| Qwen 27B FP8 TP=2 | 29GB | 70 | — | — | 131K ctx, TP=2 (Qwen official) |
| Qwen 9B BF16 | 19GB | 92 | 112 | 72/103 | Fine-tune base (cold storage) |
| Qwen 4B INT4 | 3.8GB | 297 | — | 56/103 | Edge deploy (fastest absolute) |
| Cydonia 24B | 44GB | — | — | — | Creative/roleplay (holding) |
| Llama 70B AWQ | 38GB | 38 | — | — | Creative/roleplay (holding) |
| Llama 8B AWQ | 5GB | — | — | 50/102 | Eval baseline (poor) |
| Qwen 2B BF16 | 4GB | 307 | — | — | Training (5/5 reasoning!) |
| Qwen 0.8B BF16 | 1.5GB | 547 | — | — | Training |

\* On-the-fly FP8 (--quantization fp8) loads bf16 weights from disk and quantizes them during model load. No separate quant files needed.

Base models (0.8B, 2B, 4B) also downloaded for pretraining. FP8 quants in /mnt/models/quantized/ and on HuggingFace protoLabsAI.

Cold storage (/mnt/data/models-cold/): FLUX.2-klein 9B+base (100GB), Z-Image+Turbo (51GB), Voxtral-Mini-4B (17GB), OCR models (11.4GB).

## Blackwell GPU Constraints

  • CUDA graphs work on single GPU — don't use --enforce-eager (37-470% speedup)
  • TP=2 no longer needs --enforce-eager: set NCCL_P2P_DISABLE=1 to keep CUDA graphs stable under load (see TP=2 Tuning)
  • --disable-custom-all-reduce always needed for TP=2 (PCIe, not NVLink)
  • No xformers / Flash Attention — use PyTorch native SDPA
  • FlashInfer backend crashes — don't use --attention-backend flashinfer
  • INT4 safe on dense models, unstable on MoE (use BF16 for MoE)
  • Capability cliff at 4B→2B: sub-4B models can't do agentic tool use

## Secrets

All secrets in Infisical at secrets.proto-labs.ai. Never commit secrets. Gateway start.sh injects at runtime via Machine Identity.
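A hedged sketch of what runtime injection looks like with the Infisical CLI (flag names vary by CLI version and Machine Identity setup; start.sh is the source of truth):

```bash
# Sketch only; start.sh is canonical and the env slug here is an assumption
infisical run --env=prod -- bash start.sh
```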

## Storage

  • /mnt/models — frequently-accessed model weights only (1TB NVMe, 420GB free)
  • /mnt/data — datasets, checkpoints, outputs, cold model storage (2TB NVMe)
  • /mnt/data/models-cold/ — FLUX, Z-Image, Voxtral, OCR models (moved off fast drive)
  • /mnt/scratch — logs, caches, docker volumes (disposable)