A local-model coding assistant built on forge. Define tools, hand them to forge's WorkflowRunner, and get a working coding assistant — forge handles the agentic loop, guardrails, context management, and backend communication.
Sub-agent mode: Split work across two llama-server slots. A main agent coordinates, while research and builder specialists run on a dedicated slot with fresh context per task. The main agent's context stays clean — it only sees summaries, never raw file contents.
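The "summaries only" property above can be sketched as a dispatch wrapper. This is an illustrative sketch, not forge-code's actual API; the function names are hypothetical:

```python
# Hypothetical sketch of why the main agent's context stays clean: the
# dispatch tool runs the specialist on its own slot and returns only a
# summary string; the specialist's raw transcript never reaches slot 0.
def dispatch_research(task: str, run_specialist) -> str:
    # run_specialist executes on slot 1 with a fresh context and returns
    # (summary, raw_transcript); only the summary flows back to the main agent.
    summary, _raw_transcript = run_specialist(task)
    return summary
```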
Three interfaces:
- CLI — readline REPL for quick use
- TUI — Textual-based terminal UI with inline diffs, permission modals, live context tracking, and plan mode
- VS Code extension — webview chat panel with markdown rendering, side-by-side diffs, and Claude Code-style gutter
Supports Ollama, llama-server (llama.cpp), Llamafile, and Anthropic as backends.
- Python 3.12+
- forge installed
- A running LLM backend (see below)
```
git clone https://github.com/antoinezambelli/forge-code.git
cd forge-code
pip install -e ../forge  # install forge from local checkout
pip install -e .         # install forge-code
```

llama-server (recommended):
```
# Install from https://github.com/ggml-org/llama.cpp/releases
llama-server -m path/to/model.gguf --jinja -ngl 999 --port 8080
```

Ollama (easiest):

```
ollama pull ministral-3:8b-instruct-2512-q4_K_M
```

See forge's Backend Setup for full instructions.
Three interfaces — same tools, same engine:
```
python -m forge_code
```

Minimal interface. No mid-turn cancellation — use the TUI or VS Code extension for Esc-to-cancel support.
```
python -m forge_code --tui
```

Setup (one time):
```
cd vscode
npm install
npm run compile
```

Testing (Extension Development Host):
- Open the `vscode/` directory in VS Code: `code vscode/`
- Press F5 — this launches a new VS Code window (the "Extension Development Host")
- In the new window, open the project you want to work in (e.g. `forge-code/` itself)
- Open the command palette: Ctrl+Shift+P
- Run "forge-code: Open Chat"
- The chat panel opens in the sidebar — type a message to start
The original VS Code window shows debug output. Close the Extension Development Host window to stop.
To install permanently (no F5 needed):
```
cd vscode
npx @vscode/vsce package
code --install-extension forge-code-0.2.0.vsix
```

Configure in VS Code settings (or use `~/.forge-code.json` for defaults):
- `forge-code.pythonPath` — Python interpreter with forge-code installed (default: `python`)
- `forge-code.backend` — `llamaserver` | `ollama` | `llamafile` (default: `llamaserver`)
- `forge-code.model` — Model name (required for ollama)
- `forge-code.port` — Backend server port (default: `8080`)
- `forge-code.gguf` — Path to GGUF file for managed mode
- `forge-code.cacheType` — KV cache quantization (e.g. `q8_0`)
- `forge-code.subAgents` — Enable sub-agent mode (default: `false`)
Create `~/.forge-code.json` to set defaults across all frontends:
```json
{
  "gguf": "/path/to/model.gguf",
  "cacheType": "q8_0",
  "subAgents": true
}
```

Resolution order: CLI flags > VS Code settings > config file > defaults.
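The precedence described above (CLI flags over VS Code settings over config file over defaults) can be sketched as a simple merge. This is an illustrative sketch, not the project's actual loader; the function name and defaults dict are assumptions:

```python
# Hypothetical sketch of the resolution order: later layers win, with
# explicit None values treated as "not set" so they don't clobber defaults.
import json
from pathlib import Path

DEFAULTS = {"backend": "llamaserver", "port": 8080, "subAgents": False}

def resolve_config(cli_flags: dict, vscode_settings: dict, config_path: Path) -> dict:
    file_cfg = json.loads(config_path.read_text()) if config_path.exists() else {}
    merged = dict(DEFAULTS)
    # Lowest to highest priority: config file, VS Code settings, CLI flags.
    for layer in (file_cfg, vscode_settings, cli_flags):
        merged.update({k: v for k, v in layer.items() if v is not None})
    return merged
```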
```
--backend      llamaserver | ollama | llamafile (default: llamaserver)
--model        Model name (required for ollama)
--port         Backend server port (default: 8080)
--gguf         Path to GGUF file — managed mode (starts llama-server automatically)
--cache-type   KV cache quantization (e.g. q8_0)
--sub-agents   Enable sub-agent mode (research + build on slot 1)
--tui          Launch the Textual TUI
--server       Run as JSON-RPC server (used by VS Code extension)
```
Single-agent mode (default):
| Tool | Description | Permission |
|---|---|---|
| `bash` | Execute shell commands | write |
| `view` | Read files with line numbers | read |
| `edit` | String replacement in files | write |
| `write` | Create or overwrite files | write |
| `glob` | Find files by pattern | read |
| `grep` | Search file contents by regex | read |
| `respond` | Model's speech channel to the user | — |
Sub-agent mode (--sub-agents):
The main agent coordinates via dispatch tools. Specialists run on slot 1 with fresh context:
| Tool | Description | Runs on |
|---|---|---|
| `research` | Investigate codebase — returns summary with file paths and recommendations | slot 1 (read-only: view, glob, grep) |
| `build` | Execute coding task — make edits, run tests, report results | slot 1 (full: view, glob, grep, edit, write, bash) |
| `respond` | Model's speech channel to the user | main (slot 0) |
Write tools prompt for permission before executing. Grant always to skip future prompts for that tool within the session. Denying permission cancels the specialist and returns to user input.
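The allow/always/deny flow above can be sketched as a small gate that remembers "always" grants for the rest of the session. This is a hedged sketch with made-up names, not forge-code's actual `PermissionGate` implementation:

```python
# Illustrative allow/always/deny gate: "always" is remembered per tool
# for the session, so later calls to that tool skip the prompt entirely.
class SessionPermissionGate:
    def __init__(self) -> None:
        self.always_allowed: set[str] = set()

    def check(self, tool: str, ask_user) -> bool:
        """Return True if the tool may run. ask_user(tool) prompts the
        user and returns 'allow', 'always', or 'deny'."""
        if tool in self.always_allowed:
            return True  # user granted "always" earlier this session
        answer = ask_user(tool)
        if answer == "always":
            self.always_allowed.add(tool)
            return True
        return answer == "allow"
```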
- Tool call indicators — `[*] tool — summary` for each tool call
- Inline diffs — red/green tinted backgrounds for edit operations
- Permission modal — overlay with diff/command preview and Allow/Always/Deny buttons
- Status bar — live context usage (tokens + %), retry/rescue counters
- Cancel — press Esc to cancel the active turn. The runner finishes the current tool call and stops cleanly. Ctrl+C exits the app.
- Plan mode — `/plan` toggles read-only analysis mode (removes write tools, deepens analysis prompt)
Type `/plan` in the TUI input to toggle plan mode. In plan mode:
- Write tools (bash, edit, write) are removed from the workflow
- The system prompt instructs the model to analyze and recommend rather than modify
- The model explores the codebase with read tools and responds with findings and specific edit recommendations
Type `/plan` again to return to normal mode.
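The tool-removal half of plan mode amounts to filtering the write-capable tools out of the workflow. A minimal sketch, assuming a plain dict registry (the function name is hypothetical; tool names come from the table above):

```python
# Illustrative /plan toggle: plan mode drops bash/edit/write so the
# workflow is read-only; normal mode returns the full tool set.
WRITE_TOOLS = {"bash", "edit", "write"}

def build_workflow_tools(all_tools: dict, plan_mode: bool) -> dict:
    if not plan_mode:
        return dict(all_tools)
    # Read-only subset: view/glob/grep/respond stay available.
    return {name: t for name, t in all_tools.items() if name not in WRITE_TOOLS}
```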
```
src/forge_code/
    __init__.py
    __main__.py      # Entry point: python -m forge_code
    engine.py        # Shared backend — clients, runners, session, slot manager
    cli.py           # readline REPL (thin frontend)
    tui.py           # Textual TUI (thin frontend)
    server.py        # JSON-RPC server for VS Code (thin frontend)
    display.py       # Shared tool call summarizer (CLI + TUI)
    session.py       # Message list, workflow builder, plan mode
    permissions.py   # CLI permission gate (allow/deny/always)
    prompts/
        system.py    # System prompts: main agent, research, builder, plan mode
    tools/
        __init__.py  # Tool registry: build_tools(), build_main_agent_tools()
        specialists.py  # Research + build dispatch tools, CancelOnDeny
        context.py   # ToolContext, PermissionGate protocol
        bash.py      # Shell execution via subprocess
        view.py      # File reading with line numbers
        edit.py      # String replacement editing
        write.py     # File creation/overwrite
        glob.py      # File pattern matching
        grep.py      # Regex content search
tests/
    # 76 unit tests across all tools, permissions, and session
    eval/
        eval_runner.py  # Single-scenario runner with scoring
        batch_eval.py   # Multi-model batch runner (JSONL, resume)
        report.py       # ASCII report tables
        judge_batch.py  # Opus judge batch API script
        review.py       # LLM review prompt generation
        dummy_repo/     # Data pipeline codebase (13 modules, 100 tests)
        scenarios/      # 9 scenarios × 3 prompt levels = 27 task variants
```
See docs/ARCHITECTURE.md for the full design and docs/WORKFLOW.md for the per-turn loop.
The key idea: each user message is one `WorkflowRunner.run()` call with `respond` as the terminal tool. The model stays in tool-calling mode, where forge's full guardrail stack applies. Multi-turn memory is a growing `list[Message]` passed via `initial_messages`. Context compaction is handled transparently by forge using real token counts from the backend.
All three frontends are thin UI layers. Backend logic (clients, runners, session, sub-agent orchestration) lives in Engine.
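The per-turn shape described above can be sketched as follows. The runner interface here is made up for illustration and stands in for forge's actual `WorkflowRunner` API:

```python
# Hypothetical sketch of one turn: append the user message, run the
# workflow until the terminal `respond` tool fires, then fold the new
# messages back into the growing multi-turn history.
def run_turn(runner, messages: list, user_text: str) -> str:
    messages.append({"role": "user", "content": user_text})
    # One runner.run() call per user message; the model stays in
    # tool-calling mode until it invokes `respond`.
    result = runner.run(initial_messages=messages, terminal_tool="respond")
    messages.extend(result.new_messages)  # multi-turn memory keeps growing
    return result.respond_text
```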
```
pip install -e ".[dev]"
python -m pytest tests/ -v
```

9 scenarios measuring how reliably a model completes real coding tasks — bug fixes, refactors, and feature additions against a dummy data-pipeline repo. See Eval Concept for design rationale.
```
# Single scenario
python -m tests.eval.eval_runner --scenario B1 --prompt natural --backend llamafile --model ministral-3-14b --verbose

# Batch eval (JSONL output, automatic resume)
python -m tests.eval.batch_eval --config all --runs 10

# Reports
python -m tests.eval.report eval_results.jsonl
python -m tests.eval.report eval_results.jsonl --by-model
```

- Multi-dimensional scoring — syntax validity, test pass rate, edit distance vs reference, pipeline regression, LLM review (deferred). No binary pass/fail.
- Prompt specificity axis — each scenario runs at three levels (specific, natural, vague) to measure how well the model handles ambiguity.
- Context telemetry — per-step token counts, peak context, tool breakdowns. Designed to empirically measure the effective attention threshold (~12-15K for 8B models).
- Partial credit — a model that writes correct logic with a dropped f-string quote scores differently from one that writes wrong logic. Existing benchmarks miss this.
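To make the partial-credit idea concrete, a multi-dimensional score can combine the listed dimensions into one number. The weights and function below are invented for illustration and are not the eval suite's actual scoring formula:

```python
# Hypothetical weighted combination of the scoring dimensions above;
# weights are made up for the example. Booleans contribute 0 or full
# weight, ratios contribute proportionally (partial credit).
def score(syntax_ok: bool, tests_passed: int, tests_total: int,
          edit_similarity: float, regression_free: bool) -> float:
    test_rate = tests_passed / tests_total if tests_total else 0.0
    return round(
        0.2 * syntax_ok
        + 0.4 * test_rate
        + 0.2 * edit_similarity
        + 0.2 * regression_free,
        3,
    )
```

A model with correct logic but a small syntax slip and a model with valid syntax but wrong logic land at visibly different scores instead of the same binary fail.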
- Architecture — Design decisions, component overview, what forge provides vs what forge-code builds
- Workflow — Per-turn loop, the `respond` tool, permission flow, plan mode
- Eval Concept — Eval suite design, scoring methodology, scenario descriptions
MIT — Copyright (c) 2025-2026 Antoine Zambelli
