Releases: Jasvina/AgentReliabilityKit

v0.1.0 - AgentEvalKit initial public toolkit release

03 May 14:54

Why this release exists

v0.1.0 introduces AgentEvalKit as a public toolkit for the agent reliability loop: capture real runs, replay and diff them, turn traces into reusable eval artifacts, cluster recurring failures, and slice the same evidence into reproducible datasets.

The goal of this release is for the repo to read and run as one coherent workflow rather than a loose collection of tools.

What is included

  • AgentCI for replay-first regression testing of tool-using agents
  • TracePack for packaging traces into reusable benchmark packs
  • FailMap for clustering recurring failures and comparing releases
  • PackSlice for balanced train/eval/test splits from the same pack
  • Root-level automation, docs, and community health files so the full toolchain is easy to discover and run
  • A public roadmap backlog with starter issues for the next improvements

First run

From the repo root:

./scripts/run_automation_demo.sh /tmp/agentevalkit-demo

This produces a machine-readable manifest.json plus per-tool artifacts that show the full pipeline working end to end.
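
To get a first look at the manifest without knowing its schema, a small script along these lines may help. It is only a sketch: it assumes manifest.json is written directly inside the directory passed to the demo script and that the file is ordinary JSON. The function name summarize_manifest is hypothetical and not part of the toolkit.

```python
import json
from pathlib import Path


def summarize_manifest(out_dir: str) -> dict[str, str]:
    """Map each top-level manifest key to the Python type name of its value.

    Assumption: the demo writes manifest.json directly under out_dir.
    No field names are assumed, so this works for any JSON document.
    """
    manifest_path = Path(out_dir) / "manifest.json"
    with manifest_path.open(encoding="utf-8") as fh:
        manifest = json.load(fh)

    # A manifest whose root is not a JSON object is reported as a single entry.
    if not isinstance(manifest, dict):
        return {"<root>": type(manifest).__name__}

    return {key: type(value).__name__ for key, value in manifest.items()}
```

After the demo has finished, calling summarize_manifest("/tmp/agentevalkit-demo") returns a dictionary of top-level keys and value types, which is usually enough to decide which per-tool artifacts to open next.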

Public backlog

The next public work stays focused on the reliability loop:

  • expand adapters and regression diffing in AgentCI
  • improve redaction, labeling, and export coverage in TracePack
  • strengthen release-over-release comparisons and issue routing in FailMap
  • add label-aware, temporal, and reproducibility improvements in PackSlice
  • make root automation outputs easier to consume in CI and dashboards

Scope

This release covers the current monorepo toolchain and its public workflow surface.

It does not try to be:

  • a general-purpose agent framework
  • a broad orchestration platform
  • an open-ended memory layer
  • a demo-first UI product