# Feature Request

## Problem
Benchmark scenarios always execute from the beginning. To investigate agent behavior at a specific point, run ablation studies on a later phase, or compare different agents against identical mid-scenario conditions, you must replay the entire scenario from time zero every time.
This is a general maseval limitation that affects all benchmarks and environments.
## Proposed Solution
Add support for restoring full system state from previously gathered traces to a specific point, then continue execution from there. Full system state includes:
- Environment state -- simulation clock, app state, world/event logs
- Agent state -- message histories, internal memory, accumulated context
- User/participant state -- interaction history, pending messages, turn state
- Tool state -- any stateful tool context accumulated during the run
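As an illustration, the four state components above could be bundled into a single restorable snapshot. Every name below (`SystemCheckpoint` and its fields) is hypothetical, sketched only to mirror the list; none of it is existing maseval API:

```python
from dataclasses import dataclass, field
from typing import Any

# Hypothetical snapshot of full system state; these names do not exist
# in maseval today -- they only mirror the four components listed above.
@dataclass
class SystemCheckpoint:
    sim_time: float                       # environment simulation clock
    env_state: dict[str, Any]             # app state, world/event logs
    agent_messages: list[dict[str, str]]  # agent message history / context
    user_state: dict[str, Any]            # interaction history, turn state
    tool_state: dict[str, Any] = field(default_factory=dict)  # stateful tool context

checkpoint = SystemCheckpoint(
    sim_time=1200.0,
    env_state={"event_log": ["order_placed"]},
    agent_messages=[{"role": "user", "content": "place an order"}],
    user_state={"turn": 12, "pending_messages": []},
)
```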
Given traces from a prior run, maseval should be able to:
- Initialize all components to the state at time `t` by replaying or restoring the recorded history up to that point
- Resume live execution from `t` onward with a real agent making new decisions
- Do this repeatedly from the same checkpoint to enable controlled comparisons
## Sketch

```python
# Run a benchmark, collect traces
result = benchmark.run(agent=agent_a, tasks=tasks)
traces = result.traces  # full system traces from gather_traces()

# Later: replay from midpoint with a different agent
result_b = benchmark.run(
    agent=agent_b,
    tasks=tasks,
    restore_from=traces,
    resume_at=0.5,  # e.g., 50% through the scenario, or a simulation timestamp
)
```
## Use Cases
- Debugging: An agent fails at turn 40 of a 50-turn interaction. Restore to turn 38 and replay with verbose logging, without re-running turns 0-37.
- Ablation studies: Test how different agents or configurations handle identical mid-scenario state.
- Evaluation efficiency: For long benchmarks, evaluate only the critical decision window instead of the full scenario.
- Reproducibility: Save and share checkpoints so collaborators can reproduce results from a specific system state.
- Agent comparison: Run multiple agents from the same checkpoint for controlled comparisons that are not confounded by different early-game trajectories.
## Considerations

- Trace format: `gather_traces()` already captures environment and tool traces. This needs to extend to agent message history and user interaction state in a restorable format.
- Environment support: Each `Environment` subclass needs to implement a restore/checkpoint protocol. Some environments have natural checkpoint semantics (e.g. ARE with its simulation clock); others may need to replay events sequentially to reach the target state.
- Agent restore: `AgentAdapter` needs a way to inject prior message history so the agent's context matches what it would have seen at the checkpoint. Each framework adapter (smolagents, langgraph, etc.) needs its own implementation.
- Determinism: Restoring to a checkpoint does not guarantee the same outcome since LLM responses are stochastic. The value is in controlled starting conditions, not exact replay.
- Checkpoint granularity: Need to define what "midpoint" means across different benchmark types -- simulation time, turn number, percentage, or a specific event.
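To make the environment-support consideration concrete, here is one possible shape for the restore/checkpoint protocol, including the event-replay fallback for environments without native snapshot semantics. All class and method names (`CheckpointableEnvironment`, `checkpoint`, `restore`) are illustrative, not part of maseval:

```python
from abc import ABC, abstractmethod
from typing import Any

# Hypothetical protocol; method names are illustrative, not maseval API.
class CheckpointableEnvironment(ABC):
    @abstractmethod
    def checkpoint(self) -> dict[str, Any]:
        """Serialize enough state to recreate the environment later."""

    @abstractmethod
    def restore(self, state: dict[str, Any]) -> None:
        """Reset the environment to a previously captured state."""

# Environments without native snapshot semantics can fall back to
# replaying recorded events sequentially to reach the target state.
class EventReplayEnvironment(CheckpointableEnvironment):
    def __init__(self) -> None:
        self.clock = 0.0
        self.events: list[str] = []

    def apply_event(self, event: str, dt: float = 1.0) -> None:
        self.events.append(event)
        self.clock += dt

    def checkpoint(self) -> dict[str, Any]:
        return {"events": list(self.events)}

    def restore(self, state: dict[str, Any]) -> None:
        # Rebuild from time zero by replaying the recorded event log.
        self.clock = 0.0
        self.events = []
        for event in state["events"]:
            self.apply_event(event)
```

An environment with a native simulation clock would implement `checkpoint`/`restore` directly instead of replaying.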
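For the agent-restore consideration, history injection could look roughly like the sketch below. `RecordedHistoryAdapter` and `restore_messages` are invented names for illustration, not existing maseval, smolagents, or langgraph API:

```python
# Hypothetical sketch of message-history injection for an agent adapter.
class RecordedHistoryAdapter:
    """Pre-seeds a framework-specific agent with prior message history."""

    def __init__(self) -> None:
        self.messages: list[dict[str, str]] = []

    def restore_messages(self, history: list[dict[str, str]]) -> None:
        # Prior turns are injected before any new ones so the agent's
        # context matches what it would have seen at the checkpoint.
        self.messages = list(history)

    def send_user_message(self, content: str) -> None:
        self.messages.append({"role": "user", "content": content})

adapter = RecordedHistoryAdapter()
adapter.restore_messages([
    {"role": "user", "content": "turn 38: where is my order?"},
    {"role": "assistant", "content": "Let me check the tracking app."},
])
adapter.send_user_message("turn 39: any update?")
```

Each framework adapter would translate the restored history into its own native message format.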
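The granularity question could be resolved by accepting a small union of resume-point types and normalizing them per benchmark. These types and the `to_turn` helper are purely illustrative assumptions, not maseval API:

```python
from dataclasses import dataclass
from typing import Union

# Hypothetical resume-point types; illustrative only, not maseval API.
@dataclass(frozen=True)
class SimTime:
    seconds: float   # absolute simulation-clock time

@dataclass(frozen=True)
class TurnNumber:
    turn: int        # interaction turn index

@dataclass(frozen=True)
class Fraction:
    value: float     # 0.0-1.0, fraction of the scenario

@dataclass(frozen=True)
class EventId:
    event: str       # a specific named event in the trace

ResumePoint = Union[SimTime, TurnNumber, Fraction, EventId]

def to_turn(point: ResumePoint, total_turns: int) -> int:
    """Normalize a resume point to a turn index for turn-based benchmarks."""
    if isinstance(point, TurnNumber):
        return point.turn
    if isinstance(point, Fraction):
        return int(point.value * total_turns)
    # SimTime and EventId would need a trace lookup, elided in this sketch.
    raise NotImplementedError
```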