# Add todo/checklist tool for structured task tracking

## Problem
Small models (8-14B) lose track of multi-step tasks. Observed in forge-code eval:
- R1 specific: The prompt said "update all imports, call sites, and tests." The model renamed the function definition, updated some call sites, then spiraled trying to edit the test file — 30 iterations, max context, never called respond. It lost track of what was done vs what remained.
- R3 specific: 17 edits across 4 files, got 89/100 tests — close but missed spots. No way to know if it was aware it missed them or just lost the thread.
- B2 natural: 30 iterations of view-view-view-edit-view — reading the same files repeatedly, never converging. No internal state to track "I already checked reporter.py, the bug isn't there."
The pattern: when a task requires 3+ coordinated steps, small models either (a) forget what they've already done and repeat work, (b) forget what they haven't done and respond prematurely, or (c) spiral without converging because they have no external memory of their progress.
Claude Code solves this with TodoWrite — a structured task list the model maintains alongside its work. The model breaks the task into steps, checks them off as it goes, and has a persistent reference for what's left.
## Proposal

Add a `todo` tool to forge-code's tool set:

```python
todo(action="create", items=["Find all usages of apply_transforms", "Update transformer.py", "Update pipeline.py", "Update test_transformer.py"])
todo(action="check", index=0)
todo(action="check", index=1, note="renamed definition")
todo(action="list")
```
The tool maintains a simple list in the tool context (in-memory, per-turn). The model sees the current state in every tool result. No persistence across turns needed.
## System prompt guidance
Add to the system prompt:
- For tasks that touch multiple files or require coordinated changes, use the todo tool to plan your steps before starting. Check off each step as you complete it.
## Eval axis
This is a natural eval axis: same scenario, with and without the todo tool. Hypothesis: multi-file scenarios (R1, R2, R3, F2, F3) improve significantly with todo, while single-file scenarios (B1, F1) show no difference.
Specific metrics to compare:
- Completion rate: Does the model finish all required changes?
- Wasted calls: How many tool calls are repeated (reading the same file twice)?
- Context efficiency: Does structured planning reduce total context consumed?
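The wasted-calls metric can be computed directly from a tool-call trace. A sketch, assuming the trace is a list of `(tool, argument)` pairs and that `view` is the read tool (both assumptions; the real transcript format may differ):

```python
from collections import Counter

def wasted_calls(trace: list[tuple[str, str]]) -> int:
    # A "wasted" call is any re-read of a file already viewed this turn.
    views = Counter(path for tool, path in trace if tool == "view")
    return sum(n - 1 for n in views.values() if n > 1)

trace = [
    ("view", "reporter.py"),
    ("view", "runner.py"),
    ("edit", "runner.py"),
    ("view", "reporter.py"),  # re-read: counts as one wasted call
]
```

Running this per scenario with and without the todo tool gives a direct measure of the view-view-view loop seen in B2.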
## Implementation notes

- New tool: `tools/todo.py` with `TodoParams` and `make_todo_tool(ctx)`
- State lives on `ToolContext` (add a `todo_items: list[TodoItem]` field)
- Read-only tool (no permission gate needed)
- Add to both normal and plan mode tool sets
- System prompt update in `prompts/system.py`
## Alternatives considered

- Prompt-only approach: Just tell the model to "make a mental plan." This might not work: there is no external state, so the "plan" gets compacted away as context fills.
- Required step enforcement: forge's `required_steps` ensures certain tools are called, but doesn't track sub-task progress within a step.
- Scratchpad file: Have the model write a `TODO.md` file. Works but burns file I/O on bookkeeping and pollutes the working directory.