🛡️ [FEATURE] [1.0] Advanced Retry & Error Handling #277

@AndreRatzenberger

Description

Is your feature request related to a problem?

In production, transient failures and poison artifacts require sophisticated retry strategies. Today we only have a global max_agent_iterations safeguard, so teams build their own retry wrappers, dead-letter queues, and circuit breakers. This leads to duplicated logic and inconsistent observability across deployments.

Describe the solution you want to see

  • Add a first-class retry configuration API on agents/components supporting strategies like exponential backoff with jitter, selective retry by exception type, and maximum attempt limits.
  • Provide a built-in dead-letter queue implementation that captures failed artifacts and their traces for later inspection or replay.
  • Introduce per-agent circuit breakers with configurable thresholds, recovery windows, and half-open probes.
  • Emit structured metrics/traces for retry attempts, DLQ size, and circuit state so operators can monitor behavior in dashboards and alerts.
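To make the first and third bullets concrete, here is a minimal sketch of what a retry policy and a per-agent circuit breaker could look like. None of these names (`RetryPolicy`, `run_with_retry`, `CircuitBreaker`) exist in Flock today; they are hypothetical and only illustrate the requested behavior. The backoff uses the "full jitter" variant, which is one of several reasonable choices.

```python
import random
import time
from dataclasses import dataclass
from typing import Callable, Optional, Tuple, Type


@dataclass
class RetryPolicy:
    """Hypothetical retry configuration; not an existing Flock API."""
    max_attempts: int = 3
    base_delay: float = 0.5   # seconds; ceiling for the first retry delay
    max_delay: float = 30.0   # cap on any single delay
    retry_on: Tuple[Type[BaseException], ...] = (TimeoutError, ConnectionError)

    def delay_for(self, attempt: int, rng: Callable[[], float] = random.random) -> float:
        """Full jitter: uniform in [0, min(max_delay, base_delay * 2**(attempt - 1))]."""
        ceiling = min(self.max_delay, self.base_delay * 2 ** (attempt - 1))
        return rng() * ceiling


def run_with_retry(fn, policy: RetryPolicy, sleep=time.sleep):
    """Call fn() until it succeeds or attempts run out.

    Exceptions not listed in policy.retry_on propagate immediately.
    Returns (result, attempts_used).
    """
    for attempt in range(1, policy.max_attempts + 1):
        try:
            return fn(), attempt
        except policy.retry_on:
            if attempt == policy.max_attempts:
                raise  # exhausted: a caller could route the artifact to a DLQ here
            sleep(policy.delay_for(attempt))


class CircuitBreaker:
    """Hypothetical breaker with closed / open / half_open states."""

    def __init__(self, failure_threshold: int = 5, recovery_window: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_window = recovery_window
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.recovery_window:
            return "half_open"  # recovery window elapsed: probes are allowed
        return "open"

    def allow(self) -> bool:
        return self.state != "open"

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        if self.state == "half_open":
            self._opened_at = self._clock()  # probe failed: reopen the circuit
            return
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = self._clock()
```

In this sketch the `sleep` and `clock` parameters are injectable so the behavior can be tested without waiting. A real implementation would probably also limit the number of concurrent half-open probes and emit a metric on every state change, as the fourth bullet asks.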

Describe alternatives you have considered

Current workarounds involve wrapping agents with custom components or external schedulers, but those solutions bypass Flock’s tracing and visibility semantics, and they don’t generalize across teams.

Additional context

Ensure retry metadata travels with artifacts so downstream agents know when data was retried or DLQ’d. Coordinate with persistence efforts (#271) to store DLQ entries durably.
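A rough sketch of how retry metadata could travel with an artifact and how a dead-letter queue could capture it. The types below (`RetryMetadata`, `Artifact`, `InMemoryDLQ`) are hypothetical stand-ins, not Flock's actual artifact model, and the in-memory store is only a placeholder for whatever durable backend comes out of #271.

```python
import time
from collections import deque
from dataclasses import dataclass, field, replace
from typing import Any, Callable, Deque, List, Optional


@dataclass(frozen=True)
class RetryMetadata:
    """Hypothetical metadata carried on every artifact."""
    attempts: int = 0
    last_error: Optional[str] = None
    dead_lettered: bool = False


@dataclass(frozen=True)
class Artifact:
    """Stand-in for a Flock artifact; the real type will differ."""
    payload: Any
    retry: RetryMetadata = field(default_factory=RetryMetadata)


@dataclass
class DeadLetterEntry:
    artifact: Artifact
    trace_id: str
    failed_at: float


def record_attempt(artifact: Artifact, error: BaseException) -> Artifact:
    """Return a copy of the artifact with one more failed attempt recorded."""
    meta = artifact.retry
    updated = replace(meta, attempts=meta.attempts + 1, last_error=repr(error))
    return replace(artifact, retry=updated)


class InMemoryDLQ:
    """Non-durable DLQ sketch; a real one would persist its entries."""

    def __init__(self, maxlen: Optional[int] = None):
        self._entries: Deque[DeadLetterEntry] = deque(maxlen=maxlen)

    def put(self, artifact: Artifact, error: BaseException, trace_id: str,
            now: Callable[[], float] = time.time) -> Artifact:
        """Mark the artifact as dead-lettered, store it, and return the marked copy."""
        meta = replace(artifact.retry, last_error=repr(error), dead_lettered=True)
        marked = replace(artifact, retry=meta)
        self._entries.append(DeadLetterEntry(marked, trace_id, now()))
        return marked

    def __len__(self) -> int:
        return len(self._entries)  # candidate value for a "DLQ size" metric

    def drain(self) -> List[DeadLetterEntry]:
        """Remove and return all entries, e.g. for inspection or replay."""
        entries = list(self._entries)
        self._entries.clear()
        return entries
```

Keeping the metadata immutable and copying on every update means a downstream agent always sees the history attached to the exact artifact version it received, which seems to match the intent above.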

Metadata

  • Assignees: none
  • Labels: enhancement (New feature or request)
  • Type: none
  • Projects: status Backlog
  • Milestone: none
  • Relationships: none
  • Development: no branches or pull requests