CLAUDE.md: 1 addition, 0 deletions
@@ -50,6 +50,7 @@ docs/ Shared phases + reference docs consumed by skills at runt
| `CHANGELOG.md` | Version history | Add entry for each release |
| `README.md` | User-facing documentation | Keep in sync with feature changes |
| `tests/` | Hook and structure tests | `hooks/` for hook tests, `structure/` for plugin validation |
| `.agent-team/0309-protocol-research/` | Research findings | Reference only; do not modify. Contains 4 reports on protocol, patterns, resilience, and scaling |
**Optional confidence grade**: Append `[X%]` to any finding when confidence is meaningful:
- `H1[95%]: src/auth.py:15, SQL injection via unsanitized input, fix: use parameterized query`
- `M2[60%]: src/api.py:42, possible race condition under load`

Omit the grade when confidence is obviously high (most findings). Use it when a finding is uncertain or based on inference rather than direct evidence.

In COMPLETED messages, include total counts: "N issues: X high, Y medium, Z low"
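The finding format and the COMPLETED summary above could be parsed and tallied roughly as follows. This is a minimal sketch, not part of the protocol: `parse_finding`, `summarize`, and the regular expression are hypothetical names, and the severity letter `L` for "low" is an assumption inferred from the "X high, Y medium, Z low" summary.

```python
import re
from collections import Counter
from typing import List, NamedTuple, Optional

# Hypothetical pattern: severity letter, finding number, optional "[NN%]" grade, then the body.
# Assumes severities are H, M, and L; only H and M appear in the examples above.
FINDING_RE = re.compile(
    r"^(?P<sev>[HML])(?P<num>\d+)(?:\[(?P<conf>\d{1,3})%\])?:\s*(?P<body>.+)$"
)


class Finding(NamedTuple):
    severity: str
    number: int
    confidence: Optional[int]  # None when the grade is omitted
    body: str


def parse_finding(line: str) -> Finding:
    """Parse one finding line, with or without a confidence grade."""
    match = FINDING_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a finding line: {line!r}")
    conf = int(match["conf"]) if match["conf"] is not None else None
    if conf is not None and conf > 100:
        raise ValueError(f"confidence out of range: {conf}")
    return Finding(match["sev"], int(match["num"]), conf, match["body"])


def summarize(findings: List[Finding]) -> str:
    """Build the total-count string used in COMPLETED messages."""
    counts = Counter(f.severity for f in findings)
    return (
        f"{len(findings)} issues: "
        f"{counts['H']} high, {counts['M']} medium, {counts['L']} low"
    )
```

A finding without a grade parses with `confidence=None`, which matches the guidance to omit the grade when confidence is obviously high.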
- [Circular Dependency Detection](#circular-dependency-detection): prevent deadlocks in Phase 2
- [Graceful Degradation](#graceful-degradation): scope reduction under resource pressure
- [Auto-Block on Repeated Failures](#auto-block-on-repeated-failures): escalation after repeated failures
- [Direct Handoff](#direct-handoff): authorized peer-to-peer messaging with audit trail
- [Anti-Pattern Catalog](#anti-pattern-catalog): known coordination pitfalls to avoid

## Communication Protocol
@@ -201,6 +206,25 @@ When Teammate A produces output that Teammate B needs:
Do NOT have teammates message each other directly for handoffs unless they need a back-and-forth discussion. The lead summarizing and forwarding keeps coordination clean and maintains the workspace audit trail.

### Warm vs Cold Handoff

- **Warm handoff**: Lead forwards full context: what was done, why, key decisions, and specific next steps for the receiving teammate. Use when the handoff requires understanding of reasoning.
  ```
  A finished task #3 (auth token refactor). Key changes:
  - Moved token validation to src/auth/validate.ts
  - New interface: TokenResult { valid: boolean, claims: Claims }
  - Decision: used JWT over opaque tokens (see progress.md Decision Log)
  You can now proceed with task #5 using the new TokenResult interface.
  ```

- **Cold handoff**: Lead forwards minimal context: just file paths and a pointer to workspace. Use when the receiving teammate only needs to know what files to read.
  ```
  A finished task #3. Output files: src/auth/validate.ts, src/auth/types.ts.
  Check workspace tasks.md for full details. Proceed with task #5.
  ```

**Default to warm handoffs**: the extra context costs little and prevents follow-up QUESTION messages. Use cold handoffs only when the downstream task is clearly independent (e.g., reviewer just needs to read files).
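If the lead composes handoff messages programmatically, the warm-by-default rule might look like the sketch below. The `TaskResult` type, its field names, and the `independent` flag are illustrative assumptions; the message wording follows the two examples above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TaskResult:
    """Hypothetical record of a finished task, as seen by the lead."""
    teammate: str
    task_id: int
    summary: str
    output_files: List[str]
    key_changes: List[str] = field(default_factory=list)


def cold_handoff(result: TaskResult, next_task: int) -> str:
    """Minimal context: file paths and a pointer to the workspace."""
    files = ", ".join(result.output_files)
    return (
        f"{result.teammate} finished task #{result.task_id}. Output files: {files}.\n"
        f"Check workspace tasks.md for full details. Proceed with task #{next_task}."
    )


def warm_handoff(result: TaskResult, next_task: int, next_step: str = "") -> str:
    """Full context: what changed, key decisions, and the next step."""
    lines = [
        f"{result.teammate} finished task #{result.task_id} ({result.summary}). Key changes:"
    ]
    lines += [f"- {change}" for change in result.key_changes]
    suffix = f" {next_step}" if next_step else ""
    lines.append(f"You can now proceed with task #{next_task}{suffix}.")
    return "\n".join(lines)


def build_handoff(result: TaskResult, next_task: int,
                  next_step: str = "", independent: bool = False) -> str:
    """Default to warm; go cold only when the downstream task is clearly independent."""
    if independent:
        return cold_handoff(result, next_task)
    return warm_handoff(result, next_task, next_step)
```

Making `independent` default to `False` means a caller has to opt in to the cold form explicitly, which mirrors the default stated above.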
## Teammate Not Responding

If a teammate hasn't sent an update after an extended period:
@@ -382,6 +406,156 @@ Completion criteria: Build exits 0 with no errors.
Assign to the nearest available teammate (reviewer or tester preferred, implementer if no others are available).

## Checkpoint/Rollback

Save consistent state at natural breakpoints during long-running tasks. Enables recovery from mid-task failures without losing completed work.

### When to Use

- Tasks expected to take >10 minutes
- Multi-step migrations, large refactors, or batch operations
- Any task where partial failure is possible and rework is expensive

### Protocol

1. **Lead instructs** in spawn prompt: "For long tasks, send CHECKPOINT messages at natural breakpoints (after each module, after each migration step, etc.)"
2. **Teammate sends** CHECKPOINT at each breakpoint:
   ```
   CHECKPOINT #N: {what was completed}, artifacts={file references}, ready_for=[task IDs]
   ```
3. **Lead logs** checkpoint in `progress.md` Decision Log: "Checkpoint: task #N at [milestone]"
4. **On failure**: Lead messages teammate with last checkpoint context:
   ```
   Resume from checkpoint. Last known state:
   - Completed: {checkpoint description}
   - Artifacts: {file references}
   - Remaining: {what's left to do}
   ```
5. **If teammate is unrecoverable**: spawn replacement with checkpoint context in prompt
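The CHECKPOINT line and the resume message from the protocol above can be produced and read back with a few lines of code. The sketch below is one possible shape: the `Checkpoint` type and function names are hypothetical, and it assumes artifacts and task IDs are comma-separated, which the template does not specify.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class Checkpoint:
    number: int
    completed: str
    artifacts: List[str]   # assumed to be comma-separated file references
    ready_for: List[int]   # task IDs unblocked by this checkpoint


def format_checkpoint(cp: Checkpoint) -> str:
    """Render the one-line CHECKPOINT message."""
    artifacts = ",".join(cp.artifacts)
    ready = ",".join(str(task_id) for task_id in cp.ready_for)
    return (
        f"CHECKPOINT #{cp.number}: {cp.completed}, "
        f"artifacts={artifacts}, ready_for=[{ready}]"
    )


CHECKPOINT_RE = re.compile(
    r"^CHECKPOINT #(\d+): (.*), artifacts=(.*), ready_for=\[(.*)\]$"
)


def parse_checkpoint(line: str) -> Checkpoint:
    """Parse a CHECKPOINT message back into its fields."""
    match = CHECKPOINT_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a CHECKPOINT message: {line!r}")
    number, completed, artifacts, ready = match.groups()
    return Checkpoint(
        number=int(number),
        completed=completed,
        artifacts=[a.strip() for a in artifacts.split(",") if a.strip()],
        ready_for=[int(t) for t in ready.split(",") if t.strip()],
    )


def resume_prompt(cp: Checkpoint, remaining: str) -> str:
    """Build the 'Resume from checkpoint' message for step 4 or step 5."""
    return "\n".join([
        "Resume from checkpoint. Last known state:",
        f"- Completed: {cp.completed}",
        f"- Artifacts: {', '.join(cp.artifacts)}",
        f"- Remaining: {remaining}",
    ])
```

The same `resume_prompt` text can go to the original teammate (step 4) or into a replacement's spawn prompt (step 5).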

### Workspace Integration

- Checkpoints are logged in `progress.md` Decision Log (not a separate file)
- Checkpoint artifacts live in the workspace directory: `.agent-team/{team}/checkpoint-{task-id}.md`
- On task completion, checkpoint artifacts can be cleaned up or kept for audit

### Key Rule

Checkpoints are lightweight: a one-line CHECKPOINT message, not a full state dump. The workspace files (`tasks.md`, `issues.md`) already track team-level state. Checkpoints track task-level progress within a single teammate's scope.

## Deadline Escalation

Proactive time-based escalation to prevent tasks from exceeding the user's time budget.

### When to Use

- User has an implicit or explicit time constraint
- A task has been in_progress for an extended period with no PROGRESS or COMPLETED message
- The team session is approaching context limits

### Protocol

1. **Lead tracks** elapsed time by recording the session start in `progress.md`:
   ```
   **Session started**: {timestamp}
   ```
2. **Lead proactively checks** tasks that have been in_progress without updates:
   ```
   Status check on task #N — it's been [duration] since your last update.
   What's your progress? Use PROGRESS or COMPLETED format.
   If blocked, use BLOCKED so I can log and route it.
   ```
3. **Escalation ladder**:
   - **Nudge** (first check): request status update
   - **Warn** (second check, ~5 min later): "Task #N is at risk. Need status or BLOCKED report."
   - **Escalate** (third check): mark task as at-risk in `tasks.md`, consider reassignment or scope reduction
4. **Scope reduction option**: if task is too large, lead proposes splitting:
   ```
   Task #N is taking longer than expected. Options:
   a) Continue (estimated X more minutes)
   b) Split: complete [partial scope], defer [remaining scope] as follow-up
   c) Reassign to [other teammate]
   ```
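The escalation ladder in step 3 can be read as a small function from elapsed silence to the next action. The sketch below is illustrative only. The 10-minute silence threshold is an assumption, since the protocol says only "an extended period", and reusing the ~5 minute gap between the second and third checks is also an assumption.

```python
from datetime import datetime, timedelta

# Assumption: the protocol does not define "extended period"; 10 minutes is a placeholder.
SILENCE_THRESHOLD = timedelta(minutes=10)
# From "~5 min later"; applying it between every rung is an assumption.
RECHECK_INTERVAL = timedelta(minutes=5)

LADDER = ["nudge", "warn", "escalate"]


def next_action(last_update: datetime, now: datetime, checks_sent: int) -> str:
    """Return the next rung to send, or "none" if nothing is due yet.

    checks_sent is how many rungs have already been sent since last_update.
    """
    silent_for = now - last_update
    if silent_for < SILENCE_THRESHOLD or checks_sent >= len(LADDER):
        return "none"
    # One rung is due at the threshold, plus one more per elapsed recheck interval.
    rungs_due = 1 + (silent_for - SILENCE_THRESHOLD) // RECHECK_INTERVAL
    if checks_sent >= rungs_due:
        return "none"
    return LADDER[checks_sent]
```

Any PROGRESS, COMPLETED, or BLOCKED message would reset `last_update` and `checks_sent`, which keeps the ladder about visibility rather than punishment.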

### Key Rule

Deadline escalation is proactive, not punitive. The goal is visibility: silent tasks are the biggest risk to team throughput. Combine with the PROGRESS message type for teammates to self-report before escalation triggers.

## Circular Dependency Detection

Validate task dependency graphs before execution to prevent silent deadlocks.

### When to Use

- Phase 2 plan has 4+ tasks with `blocked by` relationships
- Any time tasks form chains longer than 2 levels deep

### Protocol

1. **During Phase 2**: Before presenting the plan, trace all dependency chains:
   - For each task with `blocked by`, follow the chain: A blocks B blocks C...
   - If any chain leads back to a task already visited, there's a cycle
2. **On cycle detected**: Do NOT present the plan. Instead, restructure:
   - Option A: Merge the cyclic tasks into one (assign to same teammate)
   - Option B: Remove the weakest dependency (the one where the blocker could be worked around)
   - Option C: Split one task to break the cycle (the blocking portion runs first)
3. **Log**: Record the detected cycle and resolution in `progress.md` Decision Log

### Example

```
Task #1: Set up database schema
Task #2: Write API endpoints (blocked by #1)
Task #3: Write migrations (blocked by #2)
Task #1 update: schema depends on migration format (blocked by #3) ← CYCLE

Resolution: Merge #1 and #3 into single task "Database schema + migrations"
```
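Tracing the chains by hand works for small plans; for larger ones the same check is a depth-first search over the `blocked by` edges. The sketch below assumes the plan is available as a mapping from task ID to the IDs that block it. `find_cycle` is a hypothetical helper, not an existing tool.

```python
from typing import Dict, List, Optional


def find_cycle(blocked_by: Dict[int, List[int]]) -> Optional[List[int]]:
    """Return one dependency cycle as a list of task IDs, or None if the graph is a DAG.

    blocked_by maps each task ID to the task IDs it is blocked by.
    """
    UNVISITED, IN_PROGRESS, DONE = 0, 1, 2
    state: Dict[int, int] = {task: UNVISITED for task in blocked_by}
    path: List[int] = []  # tasks on the chain currently being followed

    def visit(task: int) -> Optional[List[int]]:
        state[task] = IN_PROGRESS
        path.append(task)
        for blocker in blocked_by.get(task, []):
            blocker_state = state.get(blocker, UNVISITED)
            if blocker_state == IN_PROGRESS:
                # The chain led back to a task already on the current path.
                return path[path.index(blocker):] + [blocker]
            if blocker_state == UNVISITED:
                found = visit(blocker)
                if found is not None:
                    return found
        path.pop()
        state[task] = DONE
        return None

    for task in list(blocked_by):
        if state[task] == UNVISITED:
            cycle = visit(task)
            if cycle is not None:
                return cycle
    return None


# The example above: #2 blocked by #1, #3 blocked by #2, #1 blocked by #3.
plan = {1: [3], 2: [1], 3: [2]}
# After merging #1 and #3 into a single task (kept here as #1).
merged_plan = {1: [], 2: [1]}
```

Here `find_cycle(plan)` reports the loop through tasks 1, 3, and 2, while `find_cycle(merged_plan)` returns `None`, so the restructured plan can be presented.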

### Prevention

The best prevention is Phase 1 decomposition by independent modules, not by sequential steps. If streams need constant handoffs, merge them.

## Graceful Degradation

Reduce scope rather than stopping when the team hits resource limits or unrecoverable blockers.

### When to Use

- Context window is running low (frequent compaction)
- Multiple teammates are blocked and remediation isn't viable
- User's time budget is exceeded but partial delivery has value

### Protocol

1. **Detect degradation trigger**:
   - 2+ context compactions in short succession
   - 3+ teammates blocked simultaneously
   - Lead judges that full scope cannot be completed
2. **Assess salvageable work**: read `tasks.md` to see which tasks are COMPLETED and what partial value exists
3. **Present scope reduction to user**:
   ```
   Scope reduction needed: [trigger reason]

   Completed work (will be preserved):
   - [task IDs and summaries]

   Work to defer (will be logged as follow-up):
   - [task IDs and summaries]

   Approve reduced scope?
   ```
4. **If approved**:
   - Mark deferred tasks as `deferred` in `tasks.md`
   - Shut down teammates working on deferred tasks
   - Continue to Phase 5 with completed work only
   - Include deferred items in report's Follow-up section
5. **Log**: Record scope reduction decision in `progress.md` Decision Log
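Steps 1 and 3 of the protocol above can be sketched as two small functions: one that decides whether a trigger has fired and one that fills in the scope-reduction message. The 15-minute window used for "short succession" is an assumption, and the function names are hypothetical.

```python
from datetime import datetime, timedelta
from typing import List, Optional

# Assumption: "short succession" is not defined in the protocol; 15 minutes is a placeholder.
COMPACTION_WINDOW = timedelta(minutes=15)


def degradation_trigger(compactions: List[datetime], blocked_teammates: int,
                        now: datetime, lead_override: bool = False) -> Optional[str]:
    """Return the reason a trigger fired, or None if the team should continue."""
    recent = [t for t in compactions if now - t <= COMPACTION_WINDOW]
    if len(recent) >= 2:
        return "2+ context compactions in short succession"
    if blocked_teammates >= 3:
        return "3+ teammates blocked simultaneously"
    if lead_override:
        return "lead judges that full scope cannot be completed"
    return None


def scope_reduction_message(reason: str, completed: List[str], deferred: List[str]) -> str:
    """Fill in the scope-reduction template presented to the user."""
    lines = [f"Scope reduction needed: {reason}", ""]
    lines.append("Completed work (will be preserved):")
    lines += [f"- {item}" for item in completed]
    lines.append("")
    lines.append("Work to defer (will be logged as follow-up):")
    lines += [f"- {item}" for item in deferred]
    lines += ["", "Approve reduced scope?"]
    return "\n".join(lines)
```

The third trigger stays a plain boolean because it is a judgment call by the lead rather than something that can be measured.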

### Key Rule

Graceful degradation is a controlled retreat, not a failure. The user gets partial value immediately and a clear list of what remains. This is always better than a team that burns context trying to finish everything and produces nothing.

## Auto-Block on Repeated Failures

Prevents teammates from spinning on the same error. Escalates automatically after repeated failures.
@@ -433,3 +607,27 @@ For pre-approved information transfers between specific teammates, bypassing the
### Key Rule

The audit trail MUST be maintained. Direct handoffs save time but must still be logged via the lead's workspace updates.

## Anti-Pattern Catalog

Known coordination anti-patterns to avoid. These emerge from research into multi-agent systems (CrewAI, AutoGen, LangGraph, MetaGPT) and distributed systems theory.

### Critical (Prevent by Design)

**Circular Wait Deadlock**: Tasks A→B→C→A where each blocks the next. Prevention: validate dependency DAG in Phase 2 (see [Circular Dependency Detection](#circular-dependency-detection)).

**Race Condition on Shared State**: Two teammates simultaneously edit the same file; last write wins. Prevention: 1:1 file ownership mapping in Phase 2 + PreToolUse hook enforcement.

**Context Overflow Cascade**: Workspace grows unbounded; teammates can't read full context; compaction fires repeatedly. Prevention: batch workspace updates, keep workspace files concise, use [Graceful Degradation](#graceful-degradation) when compaction frequency increases.

**Infinite Re-Debate Loop**: Two teammates keep revisiting a completed decision. Prevention: once a task is COMPLETED, no further work on it unless explicitly reassigned by the lead. Log decisions in `progress.md` Decision Log as the authoritative record.
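For the Race Condition on Shared State entry, the 1:1 file ownership mapping can be checked before the plan is presented. The sketch below covers only that Phase 2 check, not the PreToolUse hook. `ownership_conflicts` and the shape of its input are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, List


def ownership_conflicts(assignments: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Return every file claimed by more than one teammate.

    assignments maps a teammate name to the file paths that teammate may edit.
    An empty result means the mapping is 1:1.
    """
    owners: Dict[str, List[str]] = defaultdict(list)
    for teammate, files in assignments.items():
        for path in set(files):  # ignore duplicates within one teammate's list
            owners[path].append(teammate)
    return {
        path: sorted(names)
        for path, names in owners.items()
        if len(names) > 1
    }
```

Any non-empty result points at files that should be reassigned to a single owner, or at tasks that should be merged, before work starts.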

### Warning (Monitor and Mitigate)

**Silent Failure**: Teammate completes but sends no message, so the task appears blocked but is actually done. Mitigation: First Contact Verification + proactive check-ins. If idle 2+ cycles without any message, investigate.

**Scope Explosion**: Team grows beyond lead's effective span of control (>6 agents). Mitigation: enforce team size limits in Phase 3; for >6, use hierarchical sub-leads or phased execution.

**Single Point of Failure**: All work depends on one teammate; if they fail, the whole team stalls. Mitigation: avoid assigning >50% of tasks to any single teammate. For critical paths, ensure another teammate can take over.

**Byzantine Output**: Teammate reports task complete but output is incorrect or hallucinated. Mitigation: Adversarial Review Rounds for critical tasks; verify file changes actually exist before marking tasks complete (TaskCompleted hook already does this for implementers).
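The Single Point of Failure mitigation (no teammate holding more than 50% of tasks) is easy to check mechanically when tasks are assigned. A minimal sketch, assuming the assignment is available as a mapping from task ID to teammate name; `overloaded_teammates` is a hypothetical helper.

```python
from collections import Counter
from typing import Dict, List


def overloaded_teammates(task_owner: Dict[int, str], max_share: float = 0.5) -> List[str]:
    """Return teammates who own more than max_share of all tasks.

    task_owner maps each task ID to the teammate it is assigned to.
    """
    if not task_owner:
        return []
    total = len(task_owner)
    counts = Counter(task_owner.values())
    return sorted(name for name, n in counts.items() if n / total > max_share)
```

The comparison is strict, so a teammate holding exactly half of the tasks is not flagged, matching the ">50%" wording above.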