Session Handoff — 2026-03-22

What We Built

1. Comprehensive Eval Suite (128 custom tests + 72 claw + 8 FC)

| Suite | Tests | New? |
|---|---|---|
| creative_writing | 25 | +20 (narrative, character voice, genre, perspective) |
| summarization | 25 | +20 (news, technical, factual consistency) |
| RAG | 25 | +20 (faithfulness, multi-doc, domain-specific, edge cases) |
| coding | 10 | — |
| instruction_following | 5 | — |
| reasoning | 5 | — |
| structured_output | 5 | — |
| safety | 5 | — |
| roleplay | 5 | — |
| svg_generation | 5 | — |
| research | 4 | — |
| function_call | 8 | — |

New infrastructure: pairwise A/B judge, NIAH (needle-in-a-haystack) context-window test, and deepeval/ragas integration.
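As a reference for the pairwise judge, here is a minimal sketch of how one A/B judgment could be assembled; the prompt wording and function name are placeholders, not the suite's actual implementation. Position bias is the classic failure mode, so each pair should be judged twice with A and B swapped and only consistent verdicts kept.

```shell
# Hypothetical prompt builder for one pairwise judgment (not the suite's
# real code). The assembled prompt is what gets POSTed to any
# OpenAI-compatible chat endpoint with temperature 0.
build_judge_prompt() {
  # $1 = task, $2 = response shown as A, $3 = response shown as B
  printf 'Task: %s\n\nResponse A:\n%s\n\nResponse B:\n%s\n\nWhich response better completes the task? Reply with exactly A, B, or TIE.\n' "$1" "$2" "$3"
}

build_judge_prompt "Write a haiku about winter." "snow falls..." "cold wind..."
# Debiasing pass: same pair, A/B swapped:
# build_judge_prompt "Write a haiku about winter." "cold wind..." "snow falls..."
```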

Quick profile is FROZEN as baseline — don't change test composition. Add new profiles instead.

2. Speed Optimization Testing

| Optimization | Result |
|---|---|
| P1+P2 flags (async-scheduling, prefix-caching, FP8 KV) | +1-3% single-request (real wins under concurrency) |
| MTP speculative decoding | +32% on 27B, +22% on 9B, -11% on 35B MoE |
| NCCL tuning for TP=2 | Zero impact on decode speed |
| Prefix caching on TP=2 | TTFT -70% on 35B (1.8s → 0.5s) |
| MoE FP8 env vars | Crashes on 122B FP8 quants, +2% on BF16 MoE |
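For the next session, a launch sketch combining the winning flags from the table. The flag names and the speculative-config JSON shape are assumptions based on recent vLLM releases, not a tested command; verify against `vllm serve --help` and the speculative-decoding docs on the installed version (the MTP method string in particular varies by model family).

```shell
# Hedged sketch only: P1+P2 flags plus MTP from the table above.
# $MODEL is a placeholder; flag availability depends on vLLM version.
vllm serve "$MODEL" \
  --async-scheduling \
  --enable-prefix-caching \
  --kv-cache-dtype fp8 \
  --speculative-config '{"method": "mtp", "num_speculative_tokens": 2}'
```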

3. Grand Model Comparison (Expanded Quick Profile)

| Model | Claw | Code | IF | Rea | SO | Summ/25 | Safe | Creat/25 | RP | FC | Total |
|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT-5.4 | 2/9 | 9 | 5 | 5 | 5 | 21 | 5 | 23 | 5 | 8 | 88/102 |
| Sonnet 4.6 | 4/10 | 9 | 3 | 5 | 4 | 20 | 4 | 25 | 5 | 8 | 87/103 |
| Gemini Flash | 2/8 | 9 | 4 | 4 | 5 | 23 | 4 | 23 | 5 | 8 | 87/101 |
| DeepSeek V3.2 | 2/8 | 9 | 4 | 5 | 4 | 21 | 4 | 25 | 5 | 8 | 87/101 |
| *27B INT4 MTP* | 5 | 8 | 5 | 5 | 5 | 19 | 5 | 21 | 5 | 8 | 86/103 |
| Gemini 3.1 Pro | 0/7 | 10 | 4 | 4 | 5 | 22 | 5 | 22 | 5 | 8 | 85/100 |
| Grok 4.1 Fast | 2/8 | 8 | 4 | 5 | 4 | 22 | 5 | 22 | 5 | 8 | 85/101 |
| *27B INT4* | 5 | 8 | 4 | 4 | 5 | 21 | 5 | 19 | 5 | 8 | 84/103 |
| Haiku 4.5 | 3/8 | 9 | 2 | 4 | 3 | 22 | 4 | 23 | 5 | 8 | 83/101 |
| *35B MoE* | 6 | 10 | 4 | 5 | 4 | 19 | 5 | 11 | 4 | 8 | 76/103 |
| *9B MTP* | 5 | 8 | 4 | 5 | 4 | 19 | 5 | 11 | 3 | 8 | 72/103 |
| *4B INT4* | 5 | 7 | 3 | 5 | 5 | 14 | 4 | 2 | 3 | 8 | 56/103 |
| *Llama 8B AWQ* | 1/9 | 5 | 3 | 2 | 5 | 14 | 5 | 5 | 4 | 6 | 50/102 |

italic = local model

4. Models Tested and Eliminated

| Model | Why Eliminated |
|---|---|
| OmniCoder 9B | Inferior to Qwen3.5-9B (same arch, worse tool use) |
| Mistral-Small-4-119B | MLA attention unsupported on Blackwell SM 12.0 |
| Qwen3.5-27B BF16 | Redundant with INT4 (same quality, 52GB wasted) |
| Qwen3.5-122B FP8 | Redundant with INT4 (INT4 is faster on single GPU) |
| Hermes 3 70B FP8 | FP8 crashes under sustained load, tool parser broken, all claw scores 0.00 |
| Llama 8B AWQ | 50/102 — worst model tested, Qwen 4B beats it at 1/4 the size |

5. Key Discoveries

Creative writing is the great separator. The expanded 25-test creative suite reveals massive quality gaps invisible in the old 5-test version:

  • Frontier ceiling: 25/25 (Sonnet 4.6, DeepSeek V3.2)
  • 27B MTP: 21/25 (competitive)
  • 35B MoE: 11/25 (MoE routing hurts creative?)
  • 9B: 11/25 (fine-tune target)
  • 4B: 2/25 (capability cliff)

Qwen owns everything local. Llama 8B scored 50/102 vs Qwen 9B at 72/103. Hermes 70B couldn't even complete a run. All fine-tuning should be on Qwen architecture.

27B MTP is 2 points behind frontier. At 86/103 with 70 tok/s local, it's within striking distance of GPT-5.4 (88/102). The gap is creative (21 vs 23) and summarization (19 vs 21).


Current Model Inventory (/mnt/models — 415GB free)

Production LLMs

| Model | Size | tok/s | Quick Score | Config |
|---|---|---|---|---|
| Qwen 27B INT4 | 29GB | 53 (70 w/ MTP) | 84-86/103 | qwen-27b-int4 / -mtp |
| Qwen 35B MoE | 67GB | 171 | 76/103 | qwen-35b |
| Qwen 9B | 19GB | 92 (112 w/ MTP) | 72/103 | qwen-9b / -mtp |
| Qwen 4B INT4 | 3.8GB | 297 | 56/103 | qwen-4b-int4 |
| Qwen 122B INT4 | 74GB | ~30 (TP=2) | — | qwen-122b-int4 |

Training / Fine-tune Bases

| Model | Size | Purpose |
|---|---|---|
| Qwen 4B BF16 | 8.8GB | LoRA base |
| Qwen 4B Base | 8.8GB | Pretraining |
| Qwen 2B + Base | 8.6GB | Training experiments (5/5 reasoning at 2B!) |
| Qwen 0.8B + Base | 3.4GB | Training experiments |
| Llama 3.1-8B AWQ | 5GB | Eval baseline (poor: 50/102) |

Creative / Experimental

| Model | Size | Notes |
|---|---|---|
| Cydonia 24B | 44GB | Uncensored Mistral; tool calling broken; holding for manual creative testing |
| Llama 3.3-70B AWQ | 38GB | Creative/roleplay, holding |

Non-LLM

| Model | Size | Notes |
|---|---|---|
| LTX-2.3 | 97GB | Video gen (ComfyUI symlinked, keep on fast drive) |
| fishaudio/s2-pro | 11GB | TTS |

Cold Storage (/mnt/data/models-cold/)

FLUX.2-klein 9B+base (100GB), Z-Image+Turbo (51GB), Voxtral-Mini-4B (17GB), OCR models (11.4GB)


Recommended Production Configs

```shell
# Daily driver — agentic work (tools, claw tasks)
bash models/vllm-swap.sh qwen-27b-int4       # 53 tok/s, reliable tools

# Daily driver — chat, creative, coding
bash models/vllm-swap.sh qwen-27b-int4-mtp   # 70 tok/s, +32% speed

# Fine-tune evaluation
bash models/vllm-swap.sh qwen-9b-mtp         # 112 tok/s, fine-tune baseline

# Edge deploy
bash models/vllm-swap.sh qwen-4b-int4        # 297 tok/s, 3.8GB

# Quality ceiling (needs both GPUs)
bash models/vllm-swap.sh qwen-122b-int4      # 30 tok/s, TP=2

# Speed king (single GPU or TP=2 for long context)
bash models/vllm-swap.sh qwen-35b            # 171 tok/s, best for batch/coding
```

What YOU Need To Do Next

Fine-Tuning (Primary Goal)

  • Install LLaMA-Factory (pip install llamafactory in training venv)
  • Accept Llama 3.1-8B-Instruct license: https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct
  • Set up Langfuse online evaluators for protoClaw production traffic
  • Export tool-use traces from Langfuse as training data
  • First LoRA: Qwen3.5-9B on protoClaw tool-use traces
  • Creative writing LoRA: Qwen3.5-9B on curated creative data (target: 11→20+/25)
  • Create HuggingFace dataset: ArtificialCitizens/protoclaw-agent-v1
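For the first LoRA run, a config sketch to start from. The YAML keys follow LLaMA-Factory's example `train_lora` configs, but the model id, dataset name, and hyperparameters here are placeholders to verify against the version you install; the dataset must first be registered in LLaMA-Factory's `data/dataset_info.json`.

```shell
# Hedged sketch, not a tested run: Qwen3.5-9B LoRA on tool-use traces.
cat > qwen9b_tooluse_lora.yaml <<'EOF'
model_name_or_path: Qwen/Qwen3.5-9B   # placeholder HF id — adjust
stage: sft
do_train: true
finetuning_type: lora
lora_target: all
dataset: protoclaw_tooluse            # register in data/dataset_info.json
template: qwen
cutoff_len: 4096
output_dir: saves/qwen9b-tooluse-lora
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 1.0e-4
num_train_epochs: 3
bf16: true
EOF
llamafactory-cli train qwen9b_tooluse_lora.yaml
```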

Eval Refinement

  • Run full profile (3 trials, pass^3) on 27B MTP — it's our SOTA local model
  • Investigate claw scoring — many models score low, may be rubric/parser issues
  • GPT-5.4-mini/nano need gateway restart with drop_params: true to retest
  • Consider adding WildBench (1,024 tasks) as a periodic deep creative eval
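The pass^3 aggregation for the full-profile run is a one-liner once per-trial results exist. The file format below (one `test_id result` line per test, one file per trial) is an assumption for illustration, not the harness's actual output; a test only counts if it passes in all three trials.

```shell
# Sample stand-in data: three trials of the same three tests.
printf 'creat_01 pass\ncreat_02 pass\ncreat_03 fail\n' > /tmp/trial1.txt
printf 'creat_01 pass\ncreat_02 pass\ncreat_03 pass\n' > /tmp/trial2.txt
printf 'creat_01 pass\ncreat_02 fail\ncreat_03 pass\n' > /tmp/trial3.txt

# Count tests that passed in every one of the three trial files.
awk '$2 == "pass" { p[$1]++ } END { n = 0; for (t in p) if (p[t] == 3) n++; print "pass^3 count: " n }' \
    /tmp/trial1.txt /tmp/trial2.txt /tmp/trial3.txt
# -> pass^3 count: 1   (only creat_01 passes all three trials)
```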

Infrastructure

  • Rebuild protoClaw container for Langfuse tracing
  • Set up Grafana dashboard for protoClaw metrics
  • Concurrency testing — async-scheduling claims 30% but untested under load
  • Install Inspect AI: uv pip install inspect-ai (HumanEval, GSM8K, ARC)
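For the concurrency-testing item, a sketch of the measurement side. It assumes each concurrent worker (e.g. a curl loop against the vLLM OpenAI endpoint) appends a `completion_tokens elapsed_seconds` line to a log; the format and numbers below are made up for illustration. If all requests start together, wall time is roughly the slowest request, so aggregate tok/s is roughly total tokens over max elapsed.

```shell
# Stand-in log: four concurrent requests' token counts and latencies.
printf '512 6.1\n480 5.9\n505 6.3\n498 6.0\n' > /tmp/bench.log

# Aggregate throughput = total tokens / slowest elapsed time.
awk '{ tok += $1; if ($2 > max) max = $2 } END { printf "%.0f tok/s aggregate\n", tok / max }' /tmp/bench.log
# -> 317 tok/s aggregate
```

This is the number to compare against the 30% async-scheduling claim: single-request tok/s barely moved, so any win has to show up in this aggregate figure.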

Models to Revisit

  • Qwen3-235B GPTQ-Int4 (~120GB) — would be our quality ceiling, 415GB free
  • Mistral-Small-4-119B — retry when vLLM ships MLA fixes for SM 12.0
  • Llama 3.1-8B-Instruct BF16 — for fine-tune experiments once license approved