# Add todo/checklist tool for structured task tracking

## Problem
Small models (8-14B) lose track of multi-step tasks. Observed in forge-code eval:
- R1 specific: The prompt said "update all imports, call sites, and tests." The model renamed the function definition, updated some call sites, then spiraled trying to edit the test file — 30 iterations, max context, never called respond. It lost track of what was done vs what remained.
- R3 specific: 17 edits across 4 files, got 89/100 tests — close but missed spots. No way to know if it was aware it missed them or just lost the thread.
- B2 natural: 30 iterations of view-view-view-edit-view — reading the same files repeatedly, never converging. No internal state to track "I already checked reporter.py, the bug isn't there."
The pattern: when a task requires 3+ coordinated steps, small models either (a) forget what they've already done and repeat work, (b) forget what they haven't done and respond prematurely, or (c) spiral without converging because they have no external memory of their progress.
Claude Code solves this with TodoWrite — a structured task list the model maintains alongside its work. The model breaks the task into steps, checks them off as it goes, and has a persistent reference for what's left.
## Proposal

Add a `todo` tool to forge-code's tool set:

```python
todo(action="create", items=["Find all usages of apply_transforms", "Update transformer.py", "Update pipeline.py", "Update test_transformer.py"])
todo(action="check", index=0)
todo(action="check", index=1, note="renamed definition")
todo(action="list")
```
The tool maintains a simple list in the tool context (in-memory, per-turn). The model sees the current state in every tool result. No persistence across turns needed.
## System prompt guidance
Add to the system prompt:
- For tasks that touch multiple files or require coordinated changes, use the todo tool to plan your steps before starting. Check off each step as you complete it.
## Eval axis
This is a natural eval axis: same scenario, with and without the todo tool. Hypothesis: multi-file scenarios (R1, R2, R3, F2, F3) improve significantly with todo, while single-file scenarios (B1, F1) show no difference.
Specific metrics to compare:
- Completion rate: Does the model finish all required changes?
- Wasted calls: How many tool calls are repeated (reading the same file twice)?
- Context efficiency: Does structured planning reduce total context consumed?
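The wasted-calls metric can be computed directly from a tool-call trace. A sketch, assuming the trace is a list of `(tool, argument)` pairs and that `view` is the read tool (both assumptions; the real transcript format may differ):

```python
from collections import Counter

def wasted_calls(trace: list[tuple[str, str]]) -> int:
    # A "wasted" call is any re-read of a file already viewed this turn.
    views = Counter(path for tool, path in trace if tool == "view")
    return sum(n - 1 for n in views.values() if n > 1)

trace = [
    ("view", "reporter.py"),
    ("view", "runner.py"),
    ("edit", "runner.py"),
    ("view", "reporter.py"),  # re-read: counts as one wasted call
]
```

Running this per scenario with and without the todo tool gives a direct measure of the view-view-view loop seen in B2.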
## Implementation notes

- New tool: `tools/todo.py` with `TodoParams` and `make_todo_tool(ctx)`
- State lives on `ToolContext` (add a `todo_items: list[TodoItem]` field)
- Read-only tool (no permission gate needed)
- Add to both normal and plan mode tool sets
- System prompt update in `prompts/system.py`
## Alternatives considered

- Prompt-only approach: Just tell the model to "make a mental plan." This might not work: there is no external state, so the "plan" gets compacted away as context fills.
- Required step enforcement: forge's `required_steps` ensures certain tools are called, but doesn't track sub-task progress within a step.
- Scratchpad file: Have the model write a `TODO.md` file. Works but burns file I/O on bookkeeping and pollutes the working directory.