Add todo/checklist tool for structured task tracking #8

@antoinezambelli

Description


Problem

Small models (8-14B) lose track of multi-step tasks. Observed in forge-code eval:

  • R1 specific: The prompt said "update all imports, call sites, and tests." The model renamed the function definition, updated some call sites, then spiraled trying to edit the test file — 30 iterations, max context, never called respond. It lost track of what was done vs what remained.
  • R3 specific: 17 edits across 4 files, passed 89/100 tests — close, but it missed spots. There is no way to tell whether it was aware of the misses or had simply lost the thread.
  • B2 natural: 30 iterations of view-view-view-edit-view — reading the same files repeatedly, never converging. No internal state to track "I already checked reporter.py, the bug isn't there."

The pattern: when a task requires 3+ coordinated steps, small models either (a) forget what they've already done and repeat work, (b) forget what they haven't done and respond prematurely, or (c) spiral without converging because they have no external memory of their progress.

Claude Code solves this with TodoWrite — a structured task list the model maintains alongside its work. The model breaks the task into steps, checks them off as it goes, and has a persistent reference for what's left.

Proposal

Add a todo tool to forge-code's tool set:

todo(action="create", items=["Find all usages of apply_transforms", "Update transformer.py", "Update pipeline.py", "Update test_transformer.py"])
todo(action="check", index=0)
todo(action="check", index=1, note="renamed definition")
todo(action="list")

The tool maintains a simple list in the tool context (in-memory, per-turn). The model sees the current state in every tool result. No persistence across turns needed.

System prompt guidance

Add to the system prompt:

- For tasks that touch multiple files or require coordinated changes, use the todo tool to plan your steps before starting. Check off each step as you complete it.

Eval axis

This is a natural eval axis: same scenario, with and without the todo tool. Hypothesis: multi-file scenarios (R1, R2, R3, F2, F3) improve significantly with todo, while single-file scenarios (B1, F1) show no difference.

Specific metrics to compare:

  • Completion rate: Does the model finish all required changes?
  • Wasted calls: How many tool calls are repeated (reading the same file twice)?
  • Context efficiency: Does structured planning reduce total context consumed?
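The "wasted calls" metric could be computed mechanically from a tool-call trace. A sketch, assuming a simple (tool_name, argument) pair format — not forge-code's actual log schema:

```python
from collections import Counter


def wasted_calls(trace: list[tuple[str, str]]) -> int:
    """Count repeated read-only calls: every `view` of a given path
    beyond the first is counted as wasted work.

    `trace` is a list of (tool_name, argument) pairs, e.g.
    [("view", "reporter.py"), ("edit", "reporter.py"), ...].
    """
    views = Counter(arg for tool, arg in trace if tool == "view")
    return sum(count - 1 for count in views.values())
```

On the B2 trace above (view-view-view-edit-view loops), this would directly quantify how much of the 30 iterations was re-reading files already seen.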

Implementation notes

  • New tool: tools/todo.py with TodoParams and make_todo_tool(ctx)
  • State lives on ToolContext (add a todo_items: list[TodoItem] field)
  • Read-only tool (no permission gate needed)
  • Add to both normal and plan mode tool sets
  • System prompt update in prompts/system.py

Alternatives considered

  • Prompt-only approach: Just tell the model to "make a mental plan." Likely insufficient: there is no external state, so the "plan" gets compacted away as the context fills.
  • Required step enforcement: forge's required_steps ensures certain tools are called, but doesn't track sub-task progress within a step.
  • Scratchpad file: Have the model write a TODO.md file. Works but burns file I/O on bookkeeping and pollutes the working directory.
