
Changelog

All notable changes to this project will be documented in this file.

The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.

Unreleased

Added

  • Generic AREEnvironment in maseval.interface.environments for Meta's ARE (Agent Research Environments) platform with lifecycle control (start/stop/pause/resume), notification polling, AUI tool filtering, and oracle mode support. Install with pip install maseval[are]. (PR: #55)
  • Generic AREToolWrapper in maseval.interface.environments wraps any ARE AppTool with simulation time tracking, invocation history, JSON schema extraction, and MASEval tracing integration. (PR: #55)
  • Optional dependency extra are: pip install maseval[are] (PR: #55)

Changed

  • Gaia2Environment now inherits from AREEnvironment and only retains GAIA2-specific setup (preprocessing + judge) and trace/config gathering as overrides. (PR: #55)
  • Gaia2GenericTool is now an alias for AREToolWrapper. (PR: #55)
  • Renamed task_data parameter to environment_data across all environment constructors, test fixtures, and examples for consistency with the base class API. (PR: #58)

Fixed

  • Fixed MACS real-data tests passing {"environment_data": task.environment_data} instead of task.environment_data directly, which caused setup_state to silently receive an empty tools list. (PR: #58)

0.4.0 - 2026-03-28

Fixed

Core

  • Fixed MessageHistory.to_list() returning a reference to the internal list instead of a copy, causing simulator logs to contain future conversation messages that hadn't occurred at the time of logging. (PR: #48)
  • Fixed get_git_info() crashing on detached HEAD (e.g. in CI checkout), now returns detached@<short-hash> as the branch name. (PR: #41)

Interface

  • Agent adapter gather_config() in smolagents, langgraph, and llamaindex no longer silently swallows exceptions, ensuring config collection errors are visible instead of producing incomplete configuration data. (PR: #53)

Added

Core

  • Usage and cost tracking via Usage and TokenUsage data classes. ModelAdapter tracks token usage automatically after each chat() call. Components that implement UsageTrackableMixin are collected via gather_usage(). Live totals available during benchmark runs via benchmark.usage (grand total) and benchmark.usage_by_component (per-component breakdowns). Post-hoc analysis via UsageReporter.from_reports(benchmark.reports) with breakdowns by task, component, or model. (PR: #45)
  • Pluggable cost calculation via CostCalculator protocol. StaticPricingCalculator computes cost from user-supplied per-token rates. LiteLLMCostCalculator in maseval.interface.usage for automatic pricing via LiteLLM's model database (supports custom_pricing overrides and model_id_map; requires litellm). Pass a cost_calculator to ModelAdapter or AgentAdapter to compute Usage.cost. Provider-reported cost always takes precedence. (PR: #45)
  • AgentAdapter now accepts cost_calculator and model_id parameters. For smolagents, CAMEL, and LlamaIndex, both are auto-detected from the framework's agent object (LiteLLMCostCalculator if litellm is installed). LangGraph requires explicit model_id since graphs can contain multiple models. Explicit parameters always override auto-detection. (PR: #45)
  • Task.freeze() and Task.unfreeze() methods to make task data read-only during benchmark runs, preventing accidental mutation of environment_data, user_data, evaluation_data, and metadata (including nested dicts). Attribute reassignment is also blocked while frozen. Check state with Task.is_frozen. (PR: #42)
  • TaskFrozenError exception in maseval.core.exceptions, raised when attempting to modify a frozen task. (PR: #42)
  • Added InformativeSubsetQueue and DISCOQueue to maseval.core.task for subset-based evaluation (e.g., anchor-point selection for DISCO). DISCOQueue accepts anchor_points_path to load indices from a .json/.pkl file via DISCOQueue.load_anchor_points(). Available via from maseval import DISCOQueue, InformativeSubsetQueue. (PR: #34 and #41)
  • Added ModelScorer abstract base class in maseval.core.scorer for log-likelihood scoring, with loglikelihood(), loglikelihood_batch(), and loglikelihood_choices() methods. (PR: #34 and #41)
  • Added SeedGenerator abstract base class and DefaultSeedGenerator implementation for reproducible benchmark runs via SHA-256-based seed derivation (PR: #24)
  • Added seed and seed_generator parameters to Benchmark.__init__ for enabling reproducibility (PR: #24)
  • Added seed_generator parameter to all benchmark setup methods (setup_environment, setup_user, setup_agents, setup_evaluators) (PR: #24)
  • Added seed parameter to ModelAdapter.__init__ for deterministic model inference (PR: #24)
  • Added SeedingError exception for providers that don't support seeding (Anthropic models raise this if seed is provided) (PR: #24)
  • Added UserExhaustedError exception in maseval.core.exceptions for flow control when a user's turns are exhausted (PR: #39)
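
A minimal sketch of SHA-256-based seed derivation in the spirit of DefaultSeedGenerator (the function name, signature, and derivation scheme here are illustrative assumptions, not maseval's actual implementation):

```python
import hashlib

def derive_seed(base_seed, component_path):
    """Derive a reproducible per-component seed from a base seed.

    Returns None when seeding is disabled (base_seed is None).
    """
    if base_seed is None:
        return None
    digest = hashlib.sha256(f"{base_seed}:{component_path}".encode()).digest()
    return int.from_bytes(digest[:8], "big")
```

The same (base_seed, component_path) pair always yields the same seed, so components such as a user simulator and an agent each get stable but distinct seeds across runs.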

Interface

  • Added seed support to interface adapters: OpenAIModelAdapter, GoogleGenAIModelAdapter, LiteLLMModelAdapter, HuggingFacePipelineModelAdapter pass seeds to underlying APIs (PR: #24)
  • Added HuggingFaceModelScorer in maseval.interface.inference — log-likelihood scorer backed by a HuggingFace AutoModelForCausalLM, with single-token optimisation for MCQ evaluation. Implements the ModelScorer interface. (PR: #34 and #41)
  • CAMEL-AI integration: CamelAgentAdapter and CamelLLMUser for evaluating CAMEL-AI ChatAgent-based systems (PR: #22)
    • Added CamelAgentUser for using a CAMEL ChatAgent as the user in agent-to-agent evaluation (PR: #22)
    • Added camel_role_playing_execution_loop() for benchmarks using CAMEL's RolePlaying semantics (PR: #22)
    • Added CamelRolePlayingTracer and CamelWorkforceTracer for capturing orchestration-level traces from CAMEL's multi-agent systems (PR: #22)

Benchmarks

  • MMLU Benchmark with DISCO support: Integration for evaluating language models on MMLU (Massive Multitask Language Understanding) multiple-choice questions, compatible with DISCO anchor-point methodology. MMLUBenchmark is a framework-agnostic base class (setup_agents() and get_model_adapter() must be implemented by subclasses); DefaultMMLUBenchmark provides a ready-made HuggingFace implementation. Also includes MMLUEnvironment, MMLUEvaluator, load_tasks(), and compute_benchmark_metrics(). Install with pip install maseval[mmlu]. Optional extras: lm-eval (for DefaultMMLUBenchmark.precompute_all_logprobs_lmeval), disco (for DISCO prediction in the example). (PR: #34 and #41)
  • CONVERSE benchmark for contextual safety evaluation in adversarial agent-to-agent conversations, including ConverseBenchmark, DefaultAgentConverseBenchmark, ConverseEnvironment, ConverseExternalAgent, PrivacyEvaluator, SecurityEvaluator, and load_tasks() utilities for travel, real_estate, and insurance domains. Benchmark source files are now downloaded on first use via ensure_data_exists() instead of being bundled in the package. (PR: #28)
  • GAIA2 Benchmark: Integration with Meta's ARE (Agent Research Environments) platform for evaluating LLM-based agents on dynamic, multi-step scenarios (PR: #26)
    • Gaia2Benchmark, Gaia2Environment, Gaia2Evaluator components for framework-agnostic evaluation with ARE simulation (PR: #26)
    • DefaultAgentGaia2Benchmark with ReAct-style agent for direct comparison with ARE reference implementation (PR: #26)
    • Generic tool wrapper (Gaia2GenericTool) for MASEval tracing of ARE tools with simulation time tracking (PR: #26, #30)
    • Data loading utilities: load_tasks(), configure_model_ids() for loading scenarios from HuggingFace (PR: #26)
    • Gaia2JudgeEngineConfig for configuring the judge's LLM model and provider (PR: #30)
    • Metrics: compute_gaia2_metrics() for GSR (Goal Success Rate) computation by capability type (PR: #26)
    • Support for 5 capability dimensions: execution, search, adaptability, time, ambiguity (PR: #26, #30)
    • Added gaia2 optional dependency: pip install maseval[gaia2] (PR: #26)
  • MultiAgentBench Benchmark: Integration with MARBLE MultiAgentBench for evaluating multi-agent collaboration across all 6 paper-defined domains: research, bargaining, coding, database, werewolf, and minecraft (PR: #25, #30)
    • MultiAgentBenchBenchmark abstract base class for framework-agnostic multi-agent evaluation with seeding support for evaluators and agents (PR: #25)
    • MarbleMultiAgentBenchBenchmark for exact MARBLE reproduction mode using native MARBLE agents (note: MARBLE's internal LLM calls bypass MASEval seeding) (PR: #25)
    • MultiAgentBenchEnvironment and MultiAgentBenchEvaluator components (PR: #25)
    • Data loading utilities: load_tasks(), configure_model_ids(), get_domain_info(), ensure_marble_exists() (PR: #25)
    • MARBLE adapter: MarbleAgentAdapter for wrapping MARBLE agents with MASEval tracing (PR: #25)

Examples

  • Added usage tracking to the 5-A-Day benchmark: five_a_day_benchmark.ipynb (section 2.7) and five_a_day_benchmark.py (post-run usage summary with per-component and per-task breakdowns). (PR: #45)
  • MMLU benchmark example at examples/mmlu_benchmark/ for evaluating HuggingFace models on MMLU with optional DISCO prediction (--disco_model_path, --disco_transform_path). Supports local data, HuggingFace dataset repos, and DISCO weights from .pkl/.npz or HF repos. (PR: #34 and #41)
  • Added a dedicated runnable CONVERSE default benchmark example at examples/converse_benchmark/default_converse_benchmark.py for quick start with DefaultAgentConverseBenchmark. (PR: #28)
  • Gaia2 benchmark example with Google GenAI and OpenAI model support (PR: #26)

Documentation

  • Usage & Cost Tracking guide (docs/guides/usage-tracking.md) and API reference (docs/reference/usage.md). (PR: #45)

Testing

  • Composable pytest markers (live, credentialed, slow, smoke) for fine-grained test selection; default runs exclude slow, credentialed, and smoke tests (PR: #29)
  • Marker implication hook: credentialed implies live, so -m "not live" always gives a fully offline run (PR: #29)
  • Skip decorators (requires_openai, requires_anthropic, requires_google) for tests needing API keys (PR: #29)
  • Data integrity tests for Tau2, MACS, GAIA2, and MultiAgentBench benchmarks validating download pipelines, file structures, and data content (PR: #29, #30)
  • Real-data integration tests for GAIA2 and MultiAgentBench (PR: #30)
  • HTTP-level API contract tests for model adapters (OpenAI, Anthropic, Google GenAI, LiteLLM) using respx mocks — no API keys needed (PR: #29)
  • Live API round-trip tests for all model adapters (-m credentialed) (PR: #29)
  • CI jobs for slow tests (with benchmark data caching) and credentialed tests (behind GitHub Environment approval) (PR: #29)
  • Added respx dev dependency for HTTP-level mocking (PR: #29)
  • pytest marker mmlu for tests that require the MMLU benchmark (HuggingFace + DISCO). (PR: #34 and #41)

Changed

Core

  • Simplified seeding API: seed_generator parameter in setup methods is now always non-None (SeedGenerator instead of Optional[SeedGenerator]). When seeding is disabled (seed=None), derive_seed() returns None instead of raising an error. This eliminates all if seed_generator is not None: conditional checks; the same code path works whether seeding is enabled or disabled. (PR: #27)
  • Clarified benchmark/evaluator component guidance in docstrings and docs, including recommended evaluator exception behavior with fail_on_evaluation_error. (PR: #28)
  • User.respond() now raises UserExhaustedError instead of returning an empty string when the user has no more turns. Set the new exhausted_response parameter to return a configurable message instead (e.g. for tool-based integrations where agents call ask_user). Affects LLMUser, AgenticLLMUser, Tau2User, and MACSUser. (PR: #39)
  • _extract_json_object() helper in maseval.core.simulator replaces brittle markdown-fence stripping with robust outermost-brace extraction for all LLM simulator JSON parsing (ToolLLMSimulator, UserLLMSimulator, AgenticUserLLMSimulator). (PR: #39)
  • UserLLMSimulator and AgenticUserLLMSimulator now preserve stop tokens that appear outside the JSON object in raw LLM output, so User._check_stop_token can detect them. (PR: #39)
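
The outermost-brace extraction described for _extract_json_object can be sketched as follows (illustrative only; the real helper in maseval.core.simulator may handle more edge cases):

```python
import json

def extract_json_object(raw):
    """Extract and parse the outermost {...} object from raw LLM output.

    More robust than stripping markdown fences: leading prose, code
    fences, and trailing stop tokens are all ignored. Note: this uses
    naive brace counting, so braces inside string values could confuse
    it, which json.loads would then surface as a decode error.
    """
    start = raw.find("{")
    if start == -1:
        raise ValueError("no JSON object found in output")
    depth = 0
    for i in range(start, len(raw)):
        if raw[i] == "{":
            depth += 1
        elif raw[i] == "}":
            depth -= 1
            if depth == 0:
                return json.loads(raw[start:i + 1])
    raise ValueError("unbalanced braces in output")
```

Anything outside the braces (such as a trailing stop token) is left untouched, which is what allows stop-token detection on the raw output to keep working.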

Interface

  • LlamaIndexAgentAdapter: Added max_iterations constructor parameter, forwarded to AgentWorkflow.run(). Fixes silent swallowing of max_steps by FunctionAgent.__init__. (PR: #39)
  • SmolAgentAdapter: New _determine_step_status() detects crashed steps where AgentGenerationError was raised before step.error was set, preventing false "success" status on empty steps. (PR: #39)
  • GoogleGenAIModelAdapter: Consecutive tool-response messages are now merged into a single contents entry, fixing Google API errors when multiple tool results are returned in one turn. (PR: #39)
  • Renamed framework-specific user classes to reflect the new LLMUser base (PR: #22):
    • SmolAgentUser → SmolAgentLLMUser
    • LangGraphUser → LangGraphLLMUser
    • LlamaIndexUser → LlamaIndexLLMUser

Benchmarks

  • MACSBenchmark and Tau2Benchmark benchmarks now actively use the seeding system by deriving seeds for model adapters. Seeds are passed to agents, user simulators, tool simulators, and LLM-based evaluators for reproducible runs. (PR: #26)
    • Gaia2Benchmark: Seeds agents/gaia2_agent, evaluators/judge
    • MACSBenchmark: Seeds environment/tools/tool_{name}, simulators/user, evaluators/user_gsr, evaluators/system_gsr
    • Tau2Benchmark: Seeds simulators/user, agents/default_agent
  • All benchmarks except MACS are now labeled as Beta in docs, BENCHMARKS.md, and benchmark index, with a warning that results have not yet been validated against original implementations. (PR: #39)

User

  • Refactored User class into abstract base class defining the interface (get_initial_query(), respond(), is_done()) with LLMUser as the concrete LLM-driven implementation. This enables non-LLM user implementations (scripted, human-in-the-loop, agent-based). (PR: #22)
  • Renamed AgenticUser → AgenticLLMUser for consistency with the new hierarchy (PR: #22)

Testing

  • Coverage script (scripts/coverage_by_feature.py) now accepts --exclude flag to skip additional markers; always excludes credentialed and smoke by default (PR: #29)

Fixed

Core

  • ResultLogger._filter_report() now includes status and error fields in persisted results, so saved logs can distinguish successful runs from infrastructure failures. Report schema is now consistent across success and failure paths (error is always present, None on success). (PR: #38)
  • Packaging: Fixed setuptools configuration — packages now uses find with include = ["maseval*"] so subpackages and package data (.json, .jsonl, .md, etc.) are included in PyPI installs. (PR: #39)

Benchmarks

  • Tau2: Fixed telecom domain schema to match tau2-bench, added agent/user state synchronization and deterministic network simulation, fixed initialization flow and tool result serialization (PR: #30)
  • Tau2: Added initial agent greeting ("Hi! How can I help you today?") to user simulator's message history, matching the original tau2-bench orchestrator. Fixed tool call counter accumulating across agent turns instead of resetting per turn. Corrected max_steps comments (original default is 100, not 200). Documented all known architectural divergences from original tau2-bench in PROVENANCE.md. (PR: #39)
  • Tau2: Various bugfixes including user tool routing, environment state synchronization, tool result serialization, telecom domain user models/tools, evaluator assertion logic, and addict dependency for nested dict access. (PR: #39)
  • Tau2: Fixed incorrect return type annotations on DB.load() and DB.copy_deep() — now use Self instead of "DB", so subclass methods return the correct type (PR: #29)
  • MultiAgentBench: Fixed bargaining evaluation to use both buyer and seller LLM evaluation prompts, matching the MARBLE paper's methodology. Previously only the seller prompt was used (mirroring a regression in the MARBLE codebase), causing buyer scores to always default to -1 and completion checks to always fail. Now reports buyer_score, seller_score, and mean_score scaled to 0-100. (PR: #39)
  • MultiAgentBench: Corrected domain mappings, added missing werewolf/minecraft support, fixed environment constructors, added result summarization matching MARBLE's evaluation pipeline (PR: #30)
  • MultiAgentBench: MarbleMultiAgentBenchBenchmark now implements MARBLE's multi-iteration coordination loop with all 4 modes (graph, star, chain, tree) instead of executing agents only once. Fixed default coordinate_mode from "star" to "graph" matching 1215/1226 MARBLE configs. Uses per-task max_iterations from task config (matching engine.py:97), respects per-agent LLM overrides, and initializes memory type from task config. (PR: #39)
  • MultiAgentBench: Faithfulness audit fixes for reproduction mode — fixed wrong import path (marble.utils.utils → marble.llms.model_prompting), added Minecraft agent registration, per-domain defaults for max_iterations/coordinate_mode/environment.type/memory.type from MARBLE YAML configs, resolved hardcoded relative paths for score.json and workspace/solution.py via _MARBLE_ROOT, unified coordinate_mode defaults, corrected evaluator and agent model defaults to match MARBLE, replaced auto-generated agent IDs with strict validation. (PR: #39)
  • MultiAgentBench: Fixed bargaining evaluation crash from .format() on single-brace JSON in evaluator prompts. Documented chain communication assertion bug in MARBLE's engine.py. (PR: #39)
  • GAIA2: Various fixes for faithful reproduction of ARE reference results — scenario lifecycle, data loading, evaluation flow, multi-turn notification handling, tool filtering, default agent fidelity, and simulation time management (PR: #30)
  • MACS: MACSGenericTool._schema_to_inputs() now preserves items sub-schema for array-type properties, fixing tool registration with Gemini and OpenAI providers. (PR: #39)
  • MACS: Simplified MACSUser._extract_user_profile() — no longer attempts brittle parsing of scenario text; points profile section at the scenario to avoid duplication. (PR: #39)
  • CONVERSE: Removed silent "gpt-4o" default for attacker_model_id; now raises ValueError if not provided, preventing accidental benchmark misconfiguration. (PR: #39)
  • CONVERSE: Various fixes for faithful reproduction of the original implementation. (PR: #32)
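
The _schema_to_inputs fix above can be illustrated with a simplified converter (names and structure are assumptions; the real MACS implementation differs):

```python
def schema_to_inputs(schema):
    """Convert a JSON-schema properties map into a tool-inputs dict.

    The fix: array-typed properties keep their `items` sub-schema
    (previously dropped), which providers like Gemini and OpenAI
    require to accept the tool registration.
    """
    inputs = {}
    for name, prop in schema.get("properties", {}).items():
        entry = {
            "type": prop.get("type", "string"),
            "description": prop.get("description", ""),
        }
        if prop.get("type") == "array" and "items" in prop:
            entry["items"] = prop["items"]  # preserve element schema
        inputs[name] = entry
    return inputs
```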

0.3.0 - 2025-01-18

Added

Parallel Execution

  • Added parallel task execution with num_workers parameter in Benchmark.run() using ThreadPoolExecutor (PR: #14)
  • Added ComponentRegistry class for thread-safe component registration with thread-local storage (PR: #14)
  • Added TaskContext for cooperative timeout checking with check_timeout(), elapsed, remaining, and is_expired properties (PR: #14)
  • Added TaskProtocol dataclass with timeout_seconds, timeout_action, max_retries, priority, and tags fields for task-level execution control (PR: #14)
  • Added TimeoutAction enum (SKIP, RETRY, RAISE) for configurable timeout behavior (PR: #14)
  • Added TaskTimeoutError exception with elapsed, timeout, and partial_traces attributes (PR: #14)
  • Added TASK_TIMEOUT to TaskExecutionStatus enum for timeout classification (PR: #14)
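
A minimal sketch of the cooperative timeout pattern behind TaskContext and TaskTimeoutError (property names follow the changelog; implementation details, and the omitted partial_traces attribute, are assumptions):

```python
import time

class TaskTimeoutError(Exception):
    """Raised when a task exceeds its time budget."""
    def __init__(self, elapsed, timeout):
        super().__init__(f"task exceeded {timeout}s (elapsed {elapsed:.1f}s)")
        self.elapsed = elapsed
        self.timeout = timeout

class TaskContext:
    """Lets long-running task code check its budget cooperatively."""

    def __init__(self, timeout_seconds):
        self.timeout_seconds = timeout_seconds
        self._start = time.monotonic()

    @property
    def elapsed(self):
        return time.monotonic() - self._start

    @property
    def remaining(self):
        return max(0.0, self.timeout_seconds - self.elapsed)

    @property
    def is_expired(self):
        return self.elapsed >= self.timeout_seconds

    def check_timeout(self):
        # Call this at safe points inside the task's own loop.
        if self.is_expired:
            raise TaskTimeoutError(self.elapsed, self.timeout_seconds)
```

Because the check is cooperative, task code must call check_timeout() periodically; nothing is interrupted preemptively.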

Task Queue Abstraction

  • Added TaskQueue abstract base class with iterator interface for flexible task scheduling (PR: #14)
  • Added SequentialQueue for simple FIFO task ordering (PR: #14)
  • Added PriorityQueue for priority-based task scheduling using TaskProtocol.priority (PR: #14)
  • Added AdaptiveTaskQueue abstract base class for feedback-based adaptive scheduling with initial_state(), select_next_task(remaining, state), and update_state(task, report, state) methods (PR: #14)

ModelAdapter Chat Interface

  • Added chat() method to ModelAdapter as the primary interface for LLM inference: it accepts a list of messages in OpenAI format, plus an optional list of tools, and returns a ChatResponse object
  • Added ChatResponse dataclass containing content, tool_calls, role, usage, model, and stop_reason fields for structured response handling
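
The ChatResponse shape described above can be sketched as a dataclass (field types are assumptions inferred from the field names):

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ChatResponse:
    """Structured LLM response with the fields listed above."""
    content: Optional[str] = None
    tool_calls: list = field(default_factory=list)
    role: str = "assistant"
    usage: Optional[dict] = None
    model: Optional[str] = None
    stop_reason: Optional[str] = None
```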

AnthropicModelAdapter

  • New AnthropicModelAdapter for direct integration with Anthropic Claude models via the official Anthropic SDK
  • Handles Anthropic-specific message format conversion (system messages, tool_use/tool_result blocks) internally while accepting OpenAI-compatible input
  • Added anthropic optional dependency: pip install maseval[anthropic]

Benchmarks

  • Tau2 Benchmark: Full implementation of the tau2-bench benchmark for evaluating LLM-based agents on customer service tasks across airline, retail, and telecom domains (PR: #16)
  • Tau2Benchmark, Tau2Environment, Tau2User, Tau2Evaluator components for framework-agnostic evaluation (PR: #16)
  • DefaultAgentTau2Benchmark using an agent setup closely resembling the original tau2-bench implementation (PR: #16)
  • Data loading utilities: load_tasks(), ensure_data_exists(), configure_model_ids() (PR: #16)
  • Metrics: compute_benchmark_metrics(), compute_pass_at_k(), compute_pass_hat_k() for tau2-style scoring (PR: #16)
  • Domain implementations with tool kits: AirlineTools, RetailTools, TelecomTools with full database simulation (PR: #16)

User

  • AgenticUser class for users that can use tools during conversations (PR: #16)
  • Multiple stop token support: User now accepts stop_tokens (list) instead of single stop_token, enabling different termination reasons (PR: #16)
  • Stop reason tracking: User traces now include stop_reason, max_turns, turns_used, and stopped_by_user for detailed termination analysis (PR: #16)

Simulator

  • AgenticUserLLMSimulator for LLM-based user simulation with tool use capabilities (PR: #16)

Examples

  • Tau2 benchmark example with default agent implementation and result comparison scripts (PR: #16)

Changed

Benchmark

  • Benchmark.agent_data parameter is now optional (defaults to empty dict) (PR: #16)
  • Refactored Benchmark to delegate registry operations to ComponentRegistry class (PR: #)
  • Benchmark.run() now accepts optional queue parameter (BaseTaskQueue) for custom task scheduling (PR: #14)

Task

  • Task.id is now str type instead of UUID. Benchmarks can provide human-readable IDs directly (e.g., Task(id="retail_001", ...)). Auto-generates UUID string if not provided. (PR: #16)

Fixed

  • Task reports now use task.id directly instead of metadata["task_id"] (PR: #16)

0.2.0 - 2025-12-05

Added

Exceptions and Error Classification

  • Added AgentError, EnvironmentError, UserError exception hierarchy in maseval.core.exceptions for classifying execution failures by responsibility (PR: #13)
  • Added TaskExecutionStatus.AGENT_ERROR, ENVIRONMENT_ERROR, USER_ERROR, UNKNOWN_EXECUTION_ERROR for fine-grained error classification enabling fair scoring (PR: #13)
  • Added validation helpers: validate_argument_type(), validate_required_arguments(), validate_no_extra_arguments(), validate_arguments_from_schema() for tool implementers (PR: #13)
  • Added ToolSimulatorError and UserSimulatorError exception subclasses for simulator-specific context while inheriting proper classification (PR: #13)
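
Two of the validation helpers can be sketched as follows (signatures are assumptions, and plain ValueError stands in for the classified exceptions in maseval.core.exceptions):

```python
def validate_required_arguments(args, required):
    """Raise if any required tool argument is missing."""
    missing = [name for name in required if name not in args]
    if missing:
        raise ValueError(f"missing required arguments: {missing}")

def validate_no_extra_arguments(args, allowed):
    """Raise if the call supplies arguments the tool does not declare."""
    extra = [name for name in args if name not in allowed]
    if extra:
        raise ValueError(f"unexpected arguments: {extra}")
```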

Documentation

  • Added Exception Handling guide explaining error classification, fair scoring, and rerunning failed tasks (PR: #13)

Benchmarks

  • MACS Benchmark: Multi-Agent Collaboration Scenarios benchmark (PR: #13)

Benchmark

  • Added execution_loop() method to Benchmark base class enabling iterative agent-user interaction (PR: #13)
  • Added max_invocations constructor parameter to Benchmark (default: 1 for backwards compatibility) (PR: #13)
  • Added abstract get_model_adapter(model_id, **kwargs) method to Benchmark base class as universal model factory to be used throughout the benchmarks. (PR: #13)
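
The iterative agent-user interaction of execution_loop() might look roughly like this (a sketch under assumptions; the method signature and loop shape are not maseval's actual implementation):

```python
def execution_loop(agent, user, max_invocations=1):
    """Alternate between user and agent until the user is done or the
    invocation budget is exhausted. Sketch only; names are assumptions."""
    query = user.get_initial_query()
    transcript = []
    for _ in range(max_invocations):
        answer = agent(query)           # run the agentic system once
        transcript.append((query, answer))
        if user.is_done():
            break
        query = user.respond(answer)    # user reacts to the answer
    return transcript
```

With max_invocations=1 (the backwards-compatible default) this degenerates to a single query-answer exchange.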

User

  • Added max_turns and stop_token parameters to User base class for multi-turn support with early stopping. Same applied to UserLLMSimulator. (PR: #13)
  • Added is_done(), _check_stop_token(), and increment_turn() methods to User base class (PR: #13)
  • Added get_initial_query() method to User base class for LLM-generated initial messages (PR: #13)
  • Added initial_query parameter in User base class to trigger the agentic system. (PR: #13)
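
A minimal sketch of the multi-turn and stop-token mechanics described above (simplified; the actual User base class also drives an LLM simulator, and attribute names are assumptions):

```python
class User:
    """Multi-turn user with early stopping via a stop token."""

    def __init__(self, max_turns=5, stop_token="<STOP>"):
        self.max_turns = max_turns
        self.stop_token = stop_token
        self.turns_used = 0
        self._stopped = False

    def increment_turn(self):
        self.turns_used += 1

    def _check_stop_token(self, message):
        # Mark the conversation as stopped if the token appears.
        if self.stop_token and self.stop_token in message:
            self._stopped = True
        return self._stopped

    def is_done(self):
        return self._stopped or self.turns_used >= self.max_turns
```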

Environment

  • Added Environment.get_tool(name) method for single-tool lookup (PR: #13)

Interface

  • LlamaIndex integration: LlamaIndexAgentAdapter and LlamaIndexLLMUser for evaluating LlamaIndex workflow-based agents (PR: #7)
  • The logs property inside SmolAgentAdapter and LanggraphAgentAdapter is now properly populated. (PR: #3)

Examples

  • Added a new example: The 5_a_day_benchmark (PR: #10)

Changed

Exception Handling

  • Benchmark now classifies execution errors into AGENT_ERROR (agent's fault), ENVIRONMENT_ERROR (tool/infra failure), USER_ERROR (user simulator failure), or UNKNOWN_EXECUTION_ERROR (unclassified) instead of generic TASK_EXECUTION_FAILED (PR: #13)
  • ToolLLMSimulator now raises ToolSimulatorError (classified as ENVIRONMENT_ERROR) on failure (PR: #13)
  • UserLLMSimulator now raises UserSimulatorError (classified as USER_ERROR) on failure (PR: #13)

Environment

  • Environment.create_tools() now returns Dict[str, Any] instead of list (PR: #13)

Benchmark

  • Benchmark.run_agents() signature changed: added query: str parameter (PR: #13)
  • Benchmark.run() now uses execution_loop() internally to handle agent-user interaction cycles (PR: #13)
  • Benchmark class now has a fail_on_setup_error flag that raises errors observed during task setup (PR: #10)

Callback

  • FileResultLogger now accepts pathlib.Path for the output_dir argument and has an overwrite argument to prevent existing log files from being overwritten.

Evaluator

  • The Evaluator class now has a filter_traces base method to conveniently adapt the same evaluator to different entities in the traces (PR: #10).

Simulator

  • The LLMSimulator now raises an exception when JSON cannot be decoded instead of returning the error message as text to the agent (PR: #13).

Other

  • Documentation formatting improved. Added darkmode and links to Github (PR: #11).
  • Improved Quick Start Guide in docs/getting-started/quickstart.md. (PR: #10)
  • maseval.interface.agents structure changed: tools requiring framework imports (beyond just typing) now live in <framework>_optional.py and are imported dynamically from <framework>.py. (PR: #12)
  • Various formatting improvements in the documentation (PR: #12)
  • Added documentation for View Source Code pattern in CONTRIBUTING.md and _optional.py pattern in interface README (PR: #12)

Fixed

Interface

  • LlamaIndexAgentAdapter now supports multiple LlamaIndex agent types including ReActAgent (workflow-based), FunctionAgent, and legacy agents by checking for .chat(), .query(), and .run() methods in priority order (PR: #10)

Other

  • Consistent naming of agent adapter over wrapper (PR: #3)
  • Fixed an issue where the LiteLLM interface and Mixins were not shown in the documentation properly (PR: #12)

Removed

  • Removed set_message_history, append_message_history and clear_message_history for AgentAdapter and subclasses. (PR: #3)

0.1.2 - 2025-11-18

Added

  • Automated release workflow with version verification
  • Documentation for release process

Changed

  • Improved project documentation structure

0.1.1 - [Previous release date]