# Changelog

All notable changes to this project will be documented in this file.

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/), and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
## [Unreleased]

### Added

- Generic `AREEnvironment` in `maseval.interface.environments` for Meta's ARE (Agent Research Environments) platform with lifecycle control (start/stop/pause/resume), notification polling, AUI tool filtering, and oracle mode support. Install with `pip install maseval[are]`. (PR: #55)
- Generic `AREToolWrapper` in `maseval.interface.environments` wraps any ARE `AppTool` with simulation time tracking, invocation history, JSON schema extraction, and MASEval tracing integration. (PR: #55)
- Optional dependency extra `are`: `pip install maseval[are]` (PR: #55)
### Changed

- `Gaia2Environment` now inherits from `AREEnvironment` and only retains GAIA2-specific setup (preprocessing + judge) and trace/config gathering as overrides. (PR: #55)
- `Gaia2GenericTool` is now an alias for `AREToolWrapper`. (PR: #55)
- Renamed `task_data` parameter to `environment_data` across all environment constructors, test fixtures, and examples for consistency with the base class API. (PR: #58)
### Fixed

- Fixed MACS real-data tests passing `{"environment_data": task.environment_data}` instead of `task.environment_data` directly, which caused `setup_state` to silently receive an empty tools list. (PR: #58)
## [0.4.0] - 2026-03-28
### Fixed

#### Core

- Fixed `MessageHistory.to_list()` returning a reference to the internal list instead of a copy, causing simulator logs to contain future conversation messages that hadn't occurred at the time of logging. (PR: #48)
- Fixed `get_git_info()` crashing on detached HEAD (e.g. in CI checkout); it now returns `detached@<short-hash>` as the branch name. (PR: #41)
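The `to_list()` fix comes down to returning a defensive copy instead of the internal list. A minimal sketch of the pattern (the class shape here is hypothetical, not the actual maseval implementation):

```python
class MessageHistory:
    """Sketch of the defensive-copy fix; hypothetical class shape."""

    def __init__(self):
        self._messages = []

    def append(self, message):
        self._messages.append(message)

    def to_list(self):
        # Before the fix: `return self._messages` handed out the internal
        # list, so later appends retroactively appeared in saved snapshots.
        return list(self._messages)


history = MessageHistory()
history.append({"role": "user", "content": "hi"})
snapshot = history.to_list()
history.append({"role": "assistant", "content": "hello"})
# The earlier snapshot is unaffected by the later append.
assert len(snapshot) == 1
```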
#### Interface

- Agent adapter `gather_config()` in smolagents, langgraph, and llamaindex no longer silently swallows exceptions, ensuring config collection errors are visible instead of producing incomplete configuration data. (PR: #53)
### Added

#### Core

- Usage and cost tracking via `Usage` and `TokenUsage` data classes. `ModelAdapter` tracks token usage automatically after each `chat()` call. Components that implement `UsageTrackableMixin` are collected via `gather_usage()`. Live totals are available during benchmark runs via `benchmark.usage` (grand total) and `benchmark.usage_by_component` (per-component breakdowns). Post-hoc analysis via `UsageReporter.from_reports(benchmark.reports)` with breakdowns by task, component, or model. (PR: #45)
- Pluggable cost calculation via the `CostCalculator` protocol. `StaticPricingCalculator` computes cost from user-supplied per-token rates. `LiteLLMCostCalculator` in `maseval.interface.usage` for automatic pricing via LiteLLM's model database (supports `custom_pricing` overrides and `model_id_map`; requires `litellm`). Pass a `cost_calculator` to `ModelAdapter` or `AgentAdapter` to compute `Usage.cost`. Provider-reported cost always takes precedence. (PR: #45)
- `AgentAdapter` now accepts `cost_calculator` and `model_id` parameters. For smolagents, CAMEL, and LlamaIndex, both are auto-detected from the framework's agent object (`LiteLLMCostCalculator` if litellm is installed). LangGraph requires an explicit `model_id` since graphs can contain multiple models. Explicit parameters always override auto-detection. (PR: #45)
- `Task.freeze()` and `Task.unfreeze()` methods to make task data read-only during benchmark runs, preventing accidental mutation of `environment_data`, `user_data`, `evaluation_data`, and `metadata` (including nested dicts). Attribute reassignment is also blocked while frozen. Check state with `Task.is_frozen`. (PR: #42)
- `TaskFrozenError` exception in `maseval.core.exceptions`, raised when attempting to modify a frozen task. (PR: #42)
- Added `InformativeSubsetQueue` and `DISCOQueue` to `maseval.core.task` for subset-based evaluation (e.g., anchor-point selection for DISCO). `DISCOQueue` accepts `anchor_points_path` to load indices from a `.json`/`.pkl` file via `DISCOQueue.load_anchor_points()`. Available via `from maseval import DISCOQueue, InformativeSubsetQueue`. (PR: #34 and #41)
- Added `ModelScorer` abstract base class in `maseval.core.scorer` for log-likelihood scoring, with `loglikelihood()`, `loglikelihood_batch()`, and `loglikelihood_choices()` methods. (PR: #34 and #41)
- Added `SeedGenerator` abstract base class and `DefaultSeedGenerator` implementation for reproducible benchmark runs via SHA-256-based seed derivation (PR: #24)
- Added `seed` and `seed_generator` parameters to `Benchmark.__init__` for enabling reproducibility (PR: #24)
- Added `seed_generator` parameter to all benchmark setup methods (`setup_environment`, `setup_user`, `setup_agents`, `setup_evaluators`) (PR: #24)
- Added `seed` parameter to `ModelAdapter.__init__` for deterministic model inference (PR: #24)
- Added `SeedingError` exception for providers that don't support seeding (Anthropic models raise this if a seed is provided) (PR: #24)
- Added `UserExhaustedError` exception in `maseval.core.exceptions` for flow control when a user's turns are exhausted (PR: #39)
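The changelog doesn't show how `DefaultSeedGenerator` derives seeds, but SHA-256-based derivation is typically done by hashing the root seed together with a component path. A hedged sketch of the idea (the `derive_seed` name, the path scheme, and the 32-bit folding are assumptions, not the real implementation):

```python
import hashlib


def derive_seed(root_seed, path):
    """Deterministically derive a child seed from a root seed and a
    component path (e.g. "simulators/user"). Illustrative sketch only;
    the actual DefaultSeedGenerator may differ."""
    if root_seed is None:
        # Mirrors the documented behaviour: seeding disabled -> None.
        return None
    digest = hashlib.sha256(f"{root_seed}:{path}".encode()).digest()
    # Fold the digest into a 32-bit integer usable by most RNG APIs.
    return int.from_bytes(digest[:4], "big")


# Same root seed and path -> same seed; different paths -> independent seeds.
assert derive_seed(42, "simulators/user") == derive_seed(42, "simulators/user")
assert derive_seed(None, "simulators/user") is None
```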
#### Interface

- Added seed support to interface adapters: `OpenAIModelAdapter`, `GoogleGenAIModelAdapter`, `LiteLLMModelAdapter`, and `HuggingFacePipelineModelAdapter` pass seeds to the underlying APIs (PR: #24)
- Added `HuggingFaceModelScorer` in `maseval.interface.inference`: a log-likelihood scorer backed by a HuggingFace `AutoModelForCausalLM`, with single-token optimisation for MCQ evaluation. Implements the `ModelScorer` interface. (PR: #34 and #41)
- CAMEL-AI integration: `CamelAgentAdapter` and `CamelLLMUser` for evaluating CAMEL-AI `ChatAgent`-based systems (PR: #22)
- Added `CamelAgentUser` for using a CAMEL `ChatAgent` as the user in agent-to-agent evaluation (PR: #22)
- Added `camel_role_playing_execution_loop()` for benchmarks using CAMEL's RolePlaying semantics (PR: #22)
- Added `CamelRolePlayingTracer` and `CamelWorkforceTracer` for capturing orchestration-level traces from CAMEL's multi-agent systems (PR: #22)
#### Benchmarks
- MMLU Benchmark with DISCO support: Integration for evaluating language models on MMLU (Massive Multitask Language Understanding) multiple-choice questions, compatible with the DISCO anchor-point methodology. `MMLUBenchmark` is a framework-agnostic base class (`setup_agents()` and `get_model_adapter()` must be implemented by subclasses); `DefaultMMLUBenchmark` provides a ready-made HuggingFace implementation. Also includes `MMLUEnvironment`, `MMLUEvaluator`, `load_tasks()`, and `compute_benchmark_metrics()`. Install with `pip install maseval[mmlu]`. Optional extras: `lm-eval` (for `DefaultMMLUBenchmark.precompute_all_logprobs_lmeval`), `disco` (for DISCO prediction in the example). (PR: #34 and #41)
- CONVERSE benchmark for contextual safety evaluation in adversarial agent-to-agent conversations, including `ConverseBenchmark`, `DefaultAgentConverseBenchmark`, `ConverseEnvironment`, `ConverseExternalAgent`, `PrivacyEvaluator`, `SecurityEvaluator`, and `load_tasks()` utilities for the `travel`, `real_estate`, and `insurance` domains. Benchmark source files are now downloaded on first use via `ensure_data_exists()` instead of being bundled in the package. (PR: #28)
- GAIA2 Benchmark: Integration with Meta's ARE (Agent Research Environments) platform for evaluating LLM-based agents on dynamic, multi-step scenarios (PR: #26)
  - `Gaia2Benchmark`, `Gaia2Environment`, and `Gaia2Evaluator` components for framework-agnostic evaluation with ARE simulation (PR: #26)
  - `DefaultAgentGaia2Benchmark` with a ReAct-style agent for direct comparison with the ARE reference implementation (PR: #26)
  - Generic tool wrapper (`Gaia2GenericTool`) for MASEval tracing of ARE tools with simulation time tracking (PR: #26, #30)
  - Data loading utilities: `load_tasks()`, `configure_model_ids()` for loading scenarios from HuggingFace (PR: #26)
  - `Gaia2JudgeEngineConfig` for configuring the judge's LLM model and provider (PR: #30)
  - Metrics: `compute_gaia2_metrics()` for GSR (Goal Success Rate) computation by capability type (PR: #26)
  - Support for 5 capability dimensions: execution, search, adaptability, time, ambiguity (PR: #26, #30)
  - Added `gaia2` optional dependency: `pip install maseval[gaia2]` (PR: #26)
- MultiAgentBench Benchmark: Integration with MARBLE MultiAgentBench for evaluating multi-agent collaboration across all 6 paper-defined domains: research, bargaining, coding, database, werewolf, and minecraft (PR: #25, #30)
  - `MultiAgentBenchBenchmark` abstract base class for framework-agnostic multi-agent evaluation with seeding support for evaluators and agents (PR: #25)
  - `MarbleMultiAgentBenchBenchmark` for exact MARBLE reproduction mode using native MARBLE agents (note: MARBLE's internal LLM calls bypass MASEval seeding) (PR: #25)
  - `MultiAgentBenchEnvironment` and `MultiAgentBenchEvaluator` components (PR: #25)
  - Data loading utilities: `load_tasks()`, `configure_model_ids()`, `get_domain_info()`, `ensure_marble_exists()` (PR: #25)
  - MARBLE adapter: `MarbleAgentAdapter` for wrapping MARBLE agents with MASEval tracing (PR: #25)
#### Examples

- Added usage tracking to the 5-A-Day benchmark: `five_a_day_benchmark.ipynb` (section 2.7) and `five_a_day_benchmark.py` (post-run usage summary with per-component and per-task breakdowns). (PR: #45)
- MMLU benchmark example at `examples/mmlu_benchmark/` for evaluating HuggingFace models on MMLU with optional DISCO prediction (`--disco_model_path`, `--disco_transform_path`). Supports local data, HuggingFace dataset repos, and DISCO weights from `.pkl`/`.npz` files or HF repos. (PR: #34 and #41)
- Added a dedicated runnable CONVERSE default benchmark example at `examples/converse_benchmark/default_converse_benchmark.py` for a quick start with `DefaultAgentConverseBenchmark`. (PR: #28)
- Gaia2 benchmark example with Google GenAI and OpenAI model support (PR: #26)
#### Documentation

- Usage & Cost Tracking guide (`docs/guides/usage-tracking.md`) and API reference (`docs/reference/usage.md`). (PR: #45)
#### Testing

- Composable pytest markers (`live`, `credentialed`, `slow`, `smoke`) for fine-grained test selection; default runs exclude slow, credentialed, and smoke tests (PR: #29)
- Marker implication hook: `credentialed` implies `live`, so `-m "not live"` always gives a fully offline run (PR: #29)
- Skip decorators (`requires_openai`, `requires_anthropic`, `requires_google`) for tests needing API keys (PR: #29)
- Data integrity tests for the Tau2, MACS, GAIA2, and MultiAgentBench benchmarks validating download pipelines, file structures, and data content (PR: #29, #30)
- Real-data integration tests for GAIA2 and MultiAgentBench (PR: #30)
- HTTP-level API contract tests for model adapters (OpenAI, Anthropic, Google GenAI, LiteLLM) using `respx` mocks; no API keys needed (PR: #29)
- Live API round-trip tests for all model adapters (`-m credentialed`) (PR: #29)
- CI jobs for slow tests (with benchmark data caching) and credentialed tests (behind GitHub Environment approval) (PR: #29)
- Added `respx` dev dependency for HTTP-level mocking (PR: #29)
- pytest marker `mmlu` for tests that require the MMLU benchmark (HuggingFace + DISCO). (PR: #34 and #41)
### Changed

#### Core

- Simplified seeding API: the `seed_generator` parameter in setup methods is now always non-None (`SeedGenerator` instead of `Optional[SeedGenerator]`). When seeding is disabled (`seed=None`), `derive_seed()` returns `None` instead of raising an error. This eliminates all `if seed_generator is not None:` conditional checks; the same code path works whether seeding is enabled or disabled. (PR: #27)
- Clarified benchmark/evaluator component guidance in docstrings and docs, including recommended evaluator exception behavior with `fail_on_evaluation_error`. (PR: #28)
- `User.respond()` now raises `UserExhaustedError` instead of returning an empty string when the user has no more turns. Set the new `exhausted_response` parameter to return a configurable message instead (e.g. for tool-based integrations where agents call `ask_user`). Affects `LLMUser`, `AgenticLLMUser`, `Tau2User`, and `MACSUser`. (PR: #39)
- `_extract_json_object()` helper in `maseval.core.simulator` replaces brittle markdown-fence stripping with robust outermost-brace extraction for all LLM simulator JSON parsing (`ToolLLMSimulator`, `UserLLMSimulator`, `AgenticUserLLMSimulator`). (PR: #39)
- `UserLLMSimulator` and `AgenticUserLLMSimulator` now preserve stop tokens that appear outside the JSON object in raw LLM output, so `User._check_stop_token` can detect them. (PR: #39)
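Outermost-brace extraction of the kind `_extract_json_object()` performs can be sketched as follows. This is an illustrative reimplementation of the technique, not the actual maseval code:

```python
import json


def extract_json_object(raw: str):
    """Extract the outermost {...} object from raw LLM output, ignoring
    markdown fences and prose around it. Illustrative sketch only."""
    start = raw.find("{")
    if start == -1:
        raise ValueError("no JSON object found")
    depth = 0
    in_string = False
    escaped = False
    for i, ch in enumerate(raw[start:], start):
        if in_string:
            # Ignore braces inside string literals; honour escapes.
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return json.loads(raw[start : i + 1])
    raise ValueError("unbalanced braces in LLM output")


raw = 'Sure! ```json\n{"response": "ok", "meta": {"turn": 1}}\n``` <STOP>'
assert extract_json_object(raw) == {"response": "ok", "meta": {"turn": 1}}
```

Note how text outside the braces (including a trailing stop token like `<STOP>`) survives untouched in `raw`, which is what lets a stop-token check still see it.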
Interface
LlamaIndexAgentAdapter: Addedmax_iterationsconstructor parameter, forwarded toAgentWorkflow.run(). Fixes silent swallowing ofmax_stepsbyFunctionAgent.__init__. (PR: #39)SmolAgentAdapter: New_determine_step_status()detects crashed steps whereAgentGenerationErrorwas raised beforestep.errorwas set, preventing false "success" status on empty steps. (PR: #39)GoogleGenAIModelAdapter: Consecutive tool-response messages are now merged into a singlecontentsentry, fixing Google API errors when multiple tool results are returned in one turn. (PR: #39)- Renamed framework-specific user classes to reflect the new
LLMUserbase (PR: #22):SmolAgentUser→SmolAgentLLMUserLangGraphUser→LangGraphLLMUserLlamaIndexUser→LlamaIndexLLMUser
#### Benchmarks

- `MACSBenchmark` and `Tau2Benchmark` now actively use the seeding system by deriving seeds for model adapters. Seeds are passed to agents, user simulators, tool simulators, and LLM-based evaluators for reproducible runs. (PR: #26)
  - `Gaia2Benchmark`: seeds `agents/gaia2_agent`, `evaluators/judge`
  - `MACSBenchmark`: seeds `environment/tools/tool_{name}`, `simulators/user`, `evaluators/user_gsr`, `evaluators/system_gsr`
  - `Tau2Benchmark`: seeds `simulators/user`, `agents/default_agent`
- All benchmarks except MACS are now labeled as Beta in the docs, BENCHMARKS.md, and the benchmark index, with a warning that results have not yet been validated against the original implementations. (PR: #39)
#### User

- Refactored the `User` class into an abstract base class defining the interface (`get_initial_query()`, `respond()`, `is_done()`) with `LLMUser` as the concrete LLM-driven implementation. This enables non-LLM user implementations (scripted, human-in-the-loop, agent-based). (PR: #22)
- Renamed `AgenticUser` → `AgenticLLMUser` for consistency with the new hierarchy (PR: #22)
#### Testing

- The coverage script (`scripts/coverage_by_feature.py`) now accepts an `--exclude` flag to skip additional markers; it always excludes `credentialed` and `smoke` by default (PR: #29)
### Fixed

#### Core

- `ResultLogger._filter_report()` now includes `status` and `error` fields in persisted results, so saved logs can distinguish successful runs from infrastructure failures. The report schema is now consistent across success and failure paths (`error` is always present, `None` on success). (PR: #38)
- Packaging: Fixed the `setuptools` configuration; `packages` now uses `find` with `include = ["maseval*"]` so subpackages and package data (`.json`, `.jsonl`, `.md`, etc.) are included in PyPI installs. (PR: #39)
#### Benchmarks

- Tau2: Fixed the telecom domain schema to match tau2-bench, added agent/user state synchronization and deterministic network simulation, fixed initialization flow and tool result serialization (PR: #30)
- Tau2: Added the initial agent greeting ("Hi! How can I help you today?") to the user simulator's message history, matching the original tau2-bench orchestrator. Fixed the tool call counter accumulating across agent turns instead of resetting per turn. Corrected `max_steps` comments (the original default is 100, not 200). Documented all known architectural divergences from original tau2-bench in PROVENANCE.md. (PR: #39)
- Tau2: Various bugfixes including user tool routing, environment state synchronization, tool result serialization, telecom domain user models/tools, evaluator assertion logic, and the `addict` dependency for nested dict access. (PR: #39)
- Tau2: Fixed incorrect return type annotations on `DB.load()` and `DB.copy_deep()`; they now use `Self` instead of `"DB"`, so subclass methods return the correct type (PR: #29)
- MultiAgentBench: Fixed bargaining evaluation to use both buyer and seller LLM evaluation prompts, matching the MARBLE paper's methodology. Previously only the seller prompt was used (mirroring a regression in the MARBLE codebase), causing buyer scores to always default to -1 and completion checks to always fail. Now reports `buyer_score`, `seller_score`, and `mean_score` scaled to 0-100. (PR: #39)
- MultiAgentBench: Corrected domain mappings, added missing werewolf/minecraft support, fixed environment constructors, added result summarization matching MARBLE's evaluation pipeline (PR: #30)
- MultiAgentBench: `MarbleMultiAgentBenchBenchmark` now implements MARBLE's multi-iteration coordination loop with all 4 modes (graph, star, chain, tree) instead of executing agents only once. Fixed the default `coordinate_mode` from `"star"` to `"graph"`, matching 1215/1226 MARBLE configs. Uses per-task `max_iterations` from the task config (matching `engine.py:97`), respects per-agent LLM overrides, and initializes the memory type from the task config. (PR: #39)
- MultiAgentBench: Faithfulness audit fixes for reproduction mode: fixed a wrong import path (`marble.utils.utils` → `marble.llms.model_prompting`), added Minecraft agent registration, added per-domain defaults for `max_iterations`/`coordinate_mode`/`environment.type`/`memory.type` from MARBLE YAML configs, resolved hardcoded relative paths for `score.json` and `workspace/solution.py` via `_MARBLE_ROOT`, unified `coordinate_mode` defaults, corrected evaluator and agent model defaults to match MARBLE, and replaced auto-generated agent IDs with strict validation. (PR: #39)
- MultiAgentBench: Fixed a bargaining evaluation crash from `.format()` on single-brace JSON in evaluator prompts. Documented a chain communication assertion bug in MARBLE's `engine.py`. (PR: #39)
- GAIA2: Various fixes for faithful reproduction of ARE reference results: scenario lifecycle, data loading, evaluation flow, multi-turn notification handling, tool filtering, default agent fidelity, and simulation time management (PR: #30)
- MACS: `MACSGenericTool._schema_to_inputs()` now preserves the `items` sub-schema for array-type properties, fixing tool registration with Gemini and OpenAI providers. (PR: #39)
- MACS: Simplified `MACSUser._extract_user_profile()`; it no longer attempts brittle parsing of scenario text and instead points the profile section at the scenario to avoid duplication. (PR: #39)
- Converse: Removed the silent `"gpt-4o"` default for `attacker_model_id`; a `ValueError` is now raised if it is not provided, preventing accidental benchmark misconfiguration. (PR: #39)
- Converse: Various fixes for faithful reproduction of the original. (PR: #32)
## [0.3.0] - 2025-01-18
### Added

#### Parallel Execution

- Added parallel task execution with a `num_workers` parameter in `Benchmark.run()` using `ThreadPoolExecutor` (PR: #14)
- Added `ComponentRegistry` class for thread-safe component registration with thread-local storage (PR: #14)
- Added `TaskContext` for cooperative timeout checking with `check_timeout()`, `elapsed`, `remaining`, and `is_expired` properties (PR: #14)
- Added `TaskProtocol` dataclass with `timeout_seconds`, `timeout_action`, `max_retries`, `priority`, and `tags` fields for task-level execution control (PR: #14)
- Added `TimeoutAction` enum (`SKIP`, `RETRY`, `RAISE`) for configurable timeout behavior (PR: #14)
- Added `TaskTimeoutError` exception with `elapsed`, `timeout`, and `partial_traces` attributes (PR: #14)
- Added `TASK_TIMEOUT` to the `TaskExecutionStatus` enum for timeout classification (PR: #14)
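Because `ThreadPoolExecutor` workers cannot be interrupted preemptively, timeouts have to be cooperative: task code periodically calls a check that raises once the budget is spent. A minimal sketch of what a context with the listed properties might look like (the real `TaskContext` may differ; this is an illustration of the pattern, not maseval's implementation):

```python
import time


class TaskContext:
    """Sketch of cooperative timeout checking. Property names follow the
    changelog; the implementation details are assumptions."""

    def __init__(self, timeout_seconds: float):
        self.timeout_seconds = timeout_seconds
        self._start = time.monotonic()

    @property
    def elapsed(self) -> float:
        return time.monotonic() - self._start

    @property
    def remaining(self) -> float:
        return max(0.0, self.timeout_seconds - self.elapsed)

    @property
    def is_expired(self) -> bool:
        return self.elapsed >= self.timeout_seconds

    def check_timeout(self) -> None:
        # Long-running task code calls this at safe points; threads
        # cannot be force-killed, so cancellation must be cooperative.
        if self.is_expired:
            raise TimeoutError(f"task exceeded {self.timeout_seconds}s")


ctx = TaskContext(timeout_seconds=60.0)
ctx.check_timeout()  # well within budget: does not raise
assert not ctx.is_expired
assert ctx.remaining > 0
```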
#### Task Queue Abstraction

- Added `TaskQueue` abstract base class with an iterator interface for flexible task scheduling (PR: #14)
- Added `SequentialQueue` for simple FIFO task ordering (PR: #14)
- Added `PriorityQueue` for priority-based task scheduling using `TaskProtocol.priority` (PR: #14)
- Added `AdaptiveTaskQueue` abstract base class for feedback-based adaptive scheduling with `initial_state()`, `select_next_task(remaining, state)`, and `update_state(task, report, state)` methods (PR: #14)
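An iterator-style priority queue over a `priority` field can be sketched with `heapq`. This is a standalone illustration of the idea; maseval's `PriorityQueue` (and even whether higher or lower numbers run first) may differ:

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class _Entry:
    sort_key: int
    seq: int  # insertion order: keeps ties FIFO
    task: object = field(compare=False)


class SketchPriorityQueue:
    """Iterator-style task queue: highest priority first, FIFO on ties.
    Illustrative only; not maseval's PriorityQueue."""

    def __init__(self, tasks):
        self._heap = []
        for seq, task in enumerate(tasks):
            # Negate so larger `priority` values are served first.
            key = -getattr(task, "priority", 0)
            heapq.heappush(self._heap, _Entry(key, seq, task))

    def __iter__(self):
        return self

    def __next__(self):
        if not self._heap:
            raise StopIteration
        return heapq.heappop(self._heap).task


@dataclass
class Task:
    id: str
    priority: int = 0


order = [t.id for t in SketchPriorityQueue([Task("a", 1), Task("b", 5), Task("c", 1)])]
assert order == ["b", "a", "c"]
```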
#### ModelAdapter Chat Interface

- Added `chat()` method to `ModelAdapter` as the primary interface for LLM inference, accepting a list of messages in OpenAI format (plus an optional list of tools) and returning a `ChatResponse` object
- Added `ChatResponse` dataclass containing `content`, `tool_calls`, `role`, `usage`, `model`, and `stop_reason` fields for structured response handling
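The described `ChatResponse` shape can be pictured as a plain dataclass. Field names come from the changelog entry; the types and defaults below are assumptions for illustration:

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ChatResponse:
    """Structured result of a chat() call. Field names from the
    changelog; types and defaults are assumed, not maseval's actual ones."""
    content: Optional[str] = None
    tool_calls: list = field(default_factory=list)
    role: str = "assistant"
    usage: Optional[dict] = None
    model: Optional[str] = None
    stop_reason: Optional[str] = None


# A chat() implementation would map the provider payload onto this shape:
resp = ChatResponse(content="Hello!", model="gpt-4o-mini", stop_reason="stop")
assert resp.role == "assistant"
assert resp.tool_calls == []
```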
#### AnthropicModelAdapter

- New `AnthropicModelAdapter` for direct integration with Anthropic Claude models via the official Anthropic SDK
- Handles Anthropic-specific message format conversion (system messages, tool_use/tool_result blocks) internally while accepting OpenAI-compatible input
- Added `anthropic` optional dependency: `pip install maseval[anthropic]`
#### Benchmarks

- Tau2 Benchmark: Full implementation of the tau2-bench benchmark for evaluating LLM-based agents on customer service tasks across the airline, retail, and telecom domains (PR: #16)
  - `Tau2Benchmark`, `Tau2Environment`, `Tau2User`, and `Tau2Evaluator` components for framework-agnostic evaluation (PR: #16)
  - `DefaultAgentTau2Benchmark` using an agent setup closely resembling the original tau2-bench implementation (PR: #16)
  - Data loading utilities: `load_tasks()`, `ensure_data_exists()`, `configure_model_ids()` (PR: #16)
  - Metrics: `compute_benchmark_metrics()`, `compute_pass_at_k()`, `compute_pass_hat_k()` for tau2-style scoring (PR: #16)
  - Domain implementations with tool kits: `AirlineTools`, `RetailTools`, `TelecomTools` with full database simulation (PR: #16)
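For context on the metrics: tau2-bench's pass^k ("pass hat k") estimates the probability that all k i.i.d. trials of a task succeed, averaging C(c, k) / C(n, k) over tasks, where n is the number of trials per task and c the number of successes. A sketch under that reading (the actual `compute_pass_hat_k()` signature is not shown in the changelog and is an assumption here):

```python
from math import comb


def pass_hat_k(successes_per_task, n, k):
    """Estimator of P(all k trials succeed), averaged over tasks:
    mean of C(c, k) / C(n, k). Sketch; signature is assumed."""
    if k > n:
        raise ValueError("k cannot exceed trials per task n")
    total = sum(comb(c, k) for c in successes_per_task)
    return total / (comb(n, k) * len(successes_per_task))


# Two tasks, 4 trials each: one solved 4/4, one solved 2/4.
assert pass_hat_k([4, 2], n=4, k=1) == 0.75  # equals the mean success rate
assert pass_hat_k([4, 2], n=4, k=4) == 0.5   # only the 4/4 task survives k=4
```

Unlike pass@k (at least one of k trials succeeds), pass^k penalizes inconsistency, which is why it is the headline metric for reliability-oriented agent benchmarks.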
#### User

- `AgenticUser` class for users that can use tools during conversations (PR: #16)
- Multiple stop token support: `User` now accepts `stop_tokens` (a list) instead of a single `stop_token`, enabling different termination reasons (PR: #16)
- Stop reason tracking: `User` traces now include `stop_reason`, `max_turns`, `turns_used`, and `stopped_by_user` for detailed termination analysis (PR: #16)
#### Simulator

- `AgenticUserLLMSimulator` for LLM-based user simulation with tool use capabilities (PR: #16)
#### Examples

- Tau2 benchmark example with a default agent implementation and result comparison scripts (PR: #16)
### Changed

#### Benchmark

- The `Benchmark.agent_data` parameter is now optional (defaults to an empty dict) (PR: #16)
- Refactored `Benchmark` to delegate registry operations to the `ComponentRegistry` class (PR: #)
- `Benchmark.run()` now accepts an optional `queue` parameter (`BaseTaskQueue`) for custom task scheduling (PR: #14)
#### Task

- `Task.id` is now of type `str` instead of `UUID`. Benchmarks can provide human-readable IDs directly (e.g., `Task(id="retail_001", ...)`). A UUID string is auto-generated if not provided. (PR: #16)
- Task reports now use `task.id` directly instead of `metadata["task_id"]` (PR: #16)
## [0.2.0] - 2025-12-05
### Added

#### Exceptions and Error Classification

- Added `AgentError`, `EnvironmentError`, and `UserError` exception hierarchy in `maseval.core.exceptions` for classifying execution failures by responsibility (PR: #13)
- Added `TaskExecutionStatus.AGENT_ERROR`, `ENVIRONMENT_ERROR`, `USER_ERROR`, and `UNKNOWN_EXECUTION_ERROR` for fine-grained error classification enabling fair scoring (PR: #13)
- Added validation helpers `validate_argument_type()`, `validate_required_arguments()`, `validate_no_extra_arguments()`, and `validate_arguments_from_schema()` for tool implementers (PR: #13)
- Added `ToolSimulatorError` and `UserSimulatorError` exception subclasses for simulator-specific context while inheriting proper classification (PR: #13)
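Such validation helpers guard tool implementations against malformed LLM-produced arguments. A sketch of what two of them might look like (signatures are hypothetical, and plain `ValueError` stands in for whatever classified maseval exception the real helpers raise):

```python
def validate_required_arguments(arguments: dict, required: list):
    """Check that every required tool argument is present.
    Sketch only: the real helper's signature and exception type differ."""
    missing = [name for name in required if name not in arguments]
    if missing:
        raise ValueError(f"missing required arguments: {', '.join(missing)}")


def validate_no_extra_arguments(arguments: dict, allowed: list):
    """Reject arguments the tool schema does not declare. Sketch only."""
    extra = [name for name in arguments if name not in allowed]
    if extra:
        raise ValueError(f"unexpected arguments: {', '.join(extra)}")


args = {"flight_id": "AA101", "seat": "12C"}
validate_required_arguments(args, ["flight_id"])          # passes
validate_no_extra_arguments(args, ["flight_id", "seat"])  # passes
```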
#### Documentation

- Added an Exception Handling guide explaining error classification, fair scoring, and rerunning failed tasks (PR: #13)
#### Benchmarks

- MACS Benchmark: Multi-Agent Collaboration Scenarios benchmark (PR: #13)
#### Benchmark

- Added `execution_loop()` method to the `Benchmark` base class enabling iterative agent-user interaction (PR: #13)
- Added `max_invocations` constructor parameter to `Benchmark` (default: 1 for backwards compatibility) (PR: #13)
- Added abstract `get_model_adapter(model_id, **kwargs)` method to the `Benchmark` base class as a universal model factory to be used throughout the benchmarks. (PR: #13)
#### User

- Added `max_turns` and `stop_token` parameters to the `User` base class for multi-turn support with early stopping. The same applies to `UserLLMSimulator`. (PR: #13)
- Added `is_done()`, `_check_stop_token()`, and `increment_turn()` methods to the `User` base class (PR: #13)
- Added `get_initial_query()` method to the `User` base class for LLM-generated initial messages (PR: #13)
- Added `initial_query` parameter to the `User` base class to trigger the agentic system. (PR: #13)
#### Environment

- Added `Environment.get_tool(name)` method for single-tool lookup (PR: #13)
#### Interface

- LlamaIndex integration: `LlamaIndexAgentAdapter` and `LlamaIndexLLMUser` for evaluating LlamaIndex workflow-based agents (PR: #7)
- The `logs` property on `SmolAgentAdapter` and `LanggraphAgentAdapter` is now properly populated. (PR: #3)
#### Examples

- Added a new example: the `5_a_day_benchmark` (PR: #10)
### Changed

#### Exception Handling

- `Benchmark` now classifies execution errors into `AGENT_ERROR` (agent's fault), `ENVIRONMENT_ERROR` (tool/infra failure), `USER_ERROR` (user simulator failure), or `UNKNOWN_EXECUTION_ERROR` (unclassified) instead of the generic `TASK_EXECUTION_FAILED` (PR: #13)
- `ToolLLMSimulator` now raises `ToolSimulatorError` (classified as `ENVIRONMENT_ERROR`) on failure (PR: #13)
- `UserLLMSimulator` now raises `UserSimulatorError` (classified as `USER_ERROR`) on failure (PR: #13)
#### Environment

- `Environment.create_tools()` now returns `Dict[str, Any]` instead of `list` (PR: #13)
#### Benchmark

- `Benchmark.run_agents()` signature changed: added a `query: str` parameter (PR: #13)
- `Benchmark.run()` now uses `execution_loop()` internally to handle agent-user interaction cycles (PR: #13)
- The `Benchmark` class now has a `fail_on_setup_error` flag that raises errors observed during task setup (PR: #10)
#### Callback

- `FileResultLogger` now accepts a `pathlib.Path` for the `output_dir` argument and has an `overwrite` argument to prevent overwriting existing log files.
#### Evaluator

- The `Evaluator` class now has a `filter_traces` base method to conveniently adapt the same evaluator to different entities in the traces (PR: #10).
#### Simulator

- The `LLMSimulator` now raises an exception when JSON cannot be decoded instead of returning the error message as text to the agent (PR: #13).
#### Other

- Documentation formatting improved. Added dark mode and links to GitHub (PR: #11).
- Improved the Quick Start Guide in `docs/getting-started/quickstart.md`. (PR: #10)
- `maseval.interface.agents` structure changed. Tools requiring framework imports (beyond just typing) now live in `<framework>_optional.py` and are imported dynamically from `<framework>.py`. (PR: #12)
- Various formatting improvements in the documentation (PR: #12)
- Added documentation for the View Source Code pattern in `CONTRIBUTING.md` and the `_optional.py` pattern in the interface README (PR: #12)
#### Interface

- `LlamaIndexAgentAdapter` now supports multiple LlamaIndex agent types, including `ReActAgent` (workflow-based), `FunctionAgent`, and legacy agents, by checking for `.chat()`, `.query()`, and `.run()` methods in priority order (PR: #10)
### Fixed

#### Other

- Consistent naming of agent `adapter` over `wrapper` (PR: #3)
- Fixed an issue where the `LiteLLM` interface and `Mixin`s were not shown properly in the documentation (PR: #12)
### Removed

- Removed `set_message_history`, `append_message_history`, and `clear_message_history` from `AgentAdapter` and subclasses. (PR: #3)
## [0.1.2] - 2025-11-18
- Automated release workflow with version verification
- Documentation for release process
- Improved project documentation structure