Is your feature request related to a problem?
In production, transient failures and poison artifacts require sophisticated retry strategies. Today we only have a global `max_agent_iterations` safeguard, so teams build their own retry wrappers, dead-letter queues, and circuit breakers. This leads to duplicated logic and inconsistent observability across deployments.
Describe the solution you want to see
- Add a first-class retry configuration API on agents/components supporting strategies like exponential backoff with jitter, selective retry by exception type, and maximum attempt limits.
- Provide a built-in dead-letter queue (DLQ) implementation that captures failed artifacts and their traces for later inspection or replay.
- Introduce per-agent circuit breakers with configurable thresholds, recovery windows, and half-open probes.
- Emit structured metrics/traces for retry attempts, DLQ size, and circuit state so operators can monitor behavior in dashboards and alerts.
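
To make the first two bullets concrete, here is a rough sketch of what a retry policy with a dead-letter fallback could look like. Every name in it (`RetryPolicy`, `DeadLetterQueue`, `run_with_retry`) is hypothetical and not an existing Flock API; the handler is assumed to be a plain callable that takes an artifact, and the DLQ is an in-memory stand-in for a durable store.

```python
import random
import time
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class RetryPolicy:
    """Hypothetical retry configuration (illustration only, not a Flock API)."""
    max_attempts: int = 3
    base_delay: float = 0.5   # seconds; cap for the delay after the first failure
    max_delay: float = 30.0   # upper bound on any single delay
    retry_on: tuple = (TimeoutError, ConnectionError)  # exception types worth retrying

    def delay_for(self, attempt: int) -> float:
        # "Full jitter" backoff: pick uniformly between 0 and the exponential cap.
        cap = min(self.max_delay, self.base_delay * 2 ** (attempt - 1))
        return random.uniform(0.0, cap)


@dataclass
class DeadLetterQueue:
    """In-memory stand-in for a durable DLQ."""
    entries: list = field(default_factory=list)

    def put(self, artifact: Any, error: BaseException, attempts: int) -> None:
        self.entries.append(
            {"artifact": artifact, "error": repr(error), "attempts": attempts}
        )


def run_with_retry(
    handler: Callable[[Any], Any],
    artifact: Any,
    policy: RetryPolicy,
    dlq: DeadLetterQueue,
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Call handler(artifact), retrying selected exceptions with backoff."""
    for attempt in range(1, policy.max_attempts + 1):
        try:
            return handler(artifact)
        except policy.retry_on as exc:
            if attempt == policy.max_attempts:
                # Retries exhausted: park the artifact for inspection or replay.
                dlq.put(artifact, exc, attempt)
                raise
            sleep(policy.delay_for(attempt))
        except Exception as exc:
            # Not in retry_on: treat as a poison artifact and dead-letter at once.
            dlq.put(artifact, exc, attempt)
            raise
```

A call might look like `run_with_retry(agent_step, artifact, RetryPolicy(max_attempts=5), dlq)`, where `agent_step` is whatever callable executes the agent. The injectable `sleep` parameter is there only so the backoff can be tested without waiting.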
Describe alternatives you have considered
Current workarounds involve wrapping agents with custom components or external schedulers, but those solutions bypass Flock’s tracing and visibility semantics, and they don’t generalize across teams.
Additional context
Ensure retry metadata travels with artifacts so downstream agents know when data was retried or dead-lettered. Coordinate with persistence efforts (#271) to store DLQ entries durably.
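
One possible shape for that metadata, as a sketch only: the `Artifact` and `RetryMetadata` classes below are hypothetical stand-ins and do not reflect Flock's actual artifact model. The idea is that each retry or dead-letter event produces a new artifact value carrying the updated history, so downstream agents can inspect it without consulting an external store.

```python
from dataclasses import dataclass, field, replace
from typing import Any, Optional


@dataclass(frozen=True)
class RetryMetadata:
    """Hypothetical retry history attached to an artifact."""
    attempts: int = 1                 # how many times processing was attempted
    last_error: Optional[str] = None  # repr() of the most recent failure, if any
    dead_lettered: bool = False       # True once the artifact has been sent to the DLQ


@dataclass(frozen=True)
class Artifact:
    """Hypothetical artifact wrapper (illustration only)."""
    payload: Any
    retry: RetryMetadata = field(default_factory=RetryMetadata)


def mark_retried(artifact: Artifact, error: BaseException) -> Artifact:
    """Return a copy of the artifact that records one more failed attempt."""
    meta = replace(
        artifact.retry,
        attempts=artifact.retry.attempts + 1,
        last_error=repr(error),
    )
    return replace(artifact, retry=meta)


def mark_dead_lettered(artifact: Artifact) -> Artifact:
    """Return a copy of the artifact flagged as dead-lettered."""
    return replace(artifact, retry=replace(artifact.retry, dead_lettered=True))
```

A downstream agent could then branch on `artifact.retry.attempts > 1` or `artifact.retry.dead_lettered`, for example to treat replayed data more conservatively.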