Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
25 commits
Select commit Hold shift + click to select a range
1a2b1f5
design specifications
cemde Mar 27, 2026
96e03d1
Add AREEnvironment implementation plan
cemde Mar 27, 2026
55f36a6
feat: add maseval/interface/environments/ package
cemde Mar 27, 2026
15d1082
feat: add AREToolWrapper with tracing and metadata
cemde Mar 27, 2026
ee1ee4c
feat: add AREEnvironment with scenario path and lifecycle control
cemde Mar 28, 2026
0a0222a
test: add shorthand construction path tests for AREEnvironment
cemde Mar 28, 2026
a9b6dab
feat: add 'are' optional dependency extra
cemde Mar 28, 2026
c239b03
test: add ARE integration smoke tests
cemde Mar 28, 2026
fb8f8ef
added docs
cemde Mar 28, 2026
5915415
are issues identifies
cemde Mar 28, 2026
dcf7139
gaia2 simplifaction and issue fixing plan
cemde Mar 28, 2026
0c366a1
feat(are): add simulation time tracking to AREToolWrapper invocations
cemde Mar 28, 2026
53b9b99
feat: use AppToolAdapter for metadata and remove silent schema defaults
cemde Mar 28, 2026
0304e40
fix(are): remove hasattr fallbacks from oracle mode
cemde Mar 28, 2026
23ff2dd
fix(are): propagate errors from poll_notifications
cemde Mar 29, 2026
0c95bdb
feat(are): add AUI tool filtering and get_turn_notifications
cemde Mar 29, 2026
00fdcb5
fix(test): add AppToolAdapter mock to test_are_tool_wrapper
cemde Mar 29, 2026
02a266a
refactor(gaia2): simplify Gaia2Environment as AREEnvironment subclass
cemde Mar 29, 2026
0a417d9
merge: resolve conflicts with main (task_data → environment_data rename)
cemde Mar 29, 2026
f5403bd
fix: resolve lint and typecheck issues after merge
cemde Mar 29, 2026
b2f9096
test(are): replace import-only tests with real ARE integration tests
cemde Mar 29, 2026
0bd966e
added code review
cemde Apr 10, 2026
56fb7a1
updated are integration
cemde Apr 10, 2026
da6a21d
fixed typing error
cemde Apr 10, 2026
a664b2b
[skip ci] updated changlog
cemde Apr 10, 2026
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
143 changes: 143 additions & 0 deletions ARECRITICALASSESSMENT.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,143 @@
## Critical Assessment: ARE Integration in MASEval

I've read every line of the ARE integration (`are.py`, `are_tool_wrapper.py`), the base `Environment` class, the primary consumer (`Gaia2Environment`), the underlying ARE library (`AppTool`, `AppToolAdapter`, `Environment`, `EnvironmentConfig`), and all tests. Here's my honest assessment.

---

### What works well

**1. The tool wrapper is genuinely useful.** `AREToolWrapper` adds real value: simulation-time-aware tracing, invocation history, JSON schema extraction, and a clean `__call__` interface. Without it, every developer would need to write their own tracing and metadata extraction. The 3-layer design (ARE tool → maseval wrapper → framework-specific) is correct and well-documented.

**2. Notification polling is well-designed.** The two variants (`poll_notifications` for consuming, `get_turn_notifications` for peeking/re-queuing) solve a real problem. Notification handling is one of the trickiest parts of ARE, and having pre-built methods that categorize by `MessageType` saves developers from re-implementing this.

**3. Tracing and config gathering are well-integrated.** `gather_traces()` and `gather_config()` compose cleanly with the base class and feed into maseval's tracing infrastructure. Per-tool invocation history with sim-time metadata is exactly what you'd need for benchmark analysis.

**4. Test coverage is excellent.** Three test files covering unit tests, mock-based tests, and real integration tests against the ARE stack. The integration tests are particularly valuable — they exercise the full lifecycle with real apps.

---

### Problems and concerns

**1. `Gaia2Environment` completely bypasses `AREEnvironment.setup_state()` — raising the question of whether `AREEnvironment` is actually reusable.**

This is the most important finding. `Gaia2Environment.setup_state()` at `maseval/benchmark/gaia2/environment.py:62` does NOT call `super().setup_state()`. It reimplements everything: creating `EnvironmentConfig`, creating the notification system, calling `self._are_env = AREEnv(...)`, calling `self._are_env.run(...)`, setting `self._scenario`.

The parent class's `setup_state()` was designed to be the reusable path, but the only real consumer skips it entirely because it needs `preprocess_scenario()` (which handles oracle, judge creation, turns, etc.) — things the parent can't anticipate. This means:

- The "shorthand path" (`apps` + `events`) has **zero real-world usage** — it's only tested in tests.
- The generic `_run_oracle_mode()` in `AREEnvironment` is also unused by Gaia2 (which uses ARE's own `preprocess_scenario()` oracle run).
- The `_create_notification_system()` helper is bypassed (Gaia2 creates `VerboseNotificationSystem()` directly).

**Implication:** `AREEnvironment` is essentially a bag of utilities (`create_tools()`, `poll_notifications()`, lifecycle methods, accessors) rather than a coherent base class with a usable setup path. The `setup_state()` code is tested but not battle-tested through actual benchmark use.

**2. Invented defaults in the shorthand path violate scientific integrity guidelines.**

At `maseval/interface/environments/are.py:170-173`:
```python
duration = environment_data.get("duration", 1800)
seed = environment_data.get("seed", 0)
start_time = environment_data.get("start_time", 0)
time_increment = environment_data.get("time_increment_in_seconds", 1)
```

Per AGENTS.md's scientific integrity section: *"If a researcher would need to report a parameter in a paper's Experimental Setup section, do not invent a default for it."* Duration and seed are experimental parameters — a researcher would absolutely need to report them. The `1800` default was copied from ARE's `MAX_SCENARIO_DURATION`, but `seed=0` is invented. This is exactly the pattern AGENTS.md warns against.

**3. Excessive use of `getattr` with fallbacks masks errors.**

At `maseval/interface/environments/are.py:134`: `getattr(scenario, "time_increment_in_seconds", 1)` — if the scenario doesn't have this attribute, silently using `1` could lead to incorrect simulation behavior. Same at line 136: `getattr(scenario, "start_time", None)`. Per AGENTS.md: *"Pass through directly, let errors surface."*

This is a pattern throughout: `getattr(scenario, "scenario_id", None)`, `getattr(scenario, "seed", None)`, etc. at lines 147-151. If `scenario` doesn't have these attributes, something is wrong and should fail loudly.

**4. The `filter_aui_tools` flag mutates external state as a side effect.**

At `maseval/interface/environments/are.py:254-255`:
```python
if self._filter_aui_tools and hasattr(app, "wait_for_user_response"):
app.wait_for_user_response = False
```

This mutates the ARE app's internal state as a side effect of creating tools. This is surprising — calling `create_tools()` shouldn't modify app behavior. A developer reusing the same app instance elsewhere would get unexpected behavior.

**5. The `AREToolWrapper` duplicates what `AppToolAdapter` already provides.**

Looking at the ARE source, `AppToolAdapter` already:
- Extracts `name`, `description`, `inputs`, `output_type`
- Has a `forward()` method that delegates to `AppTool.__call__`
- Converts types to HuggingFace format

`AREToolWrapper` wraps an `AppTool`, then immediately creates an `AppToolAdapter` just to read metadata from it, then never uses it again. The schema extraction in `_extract_schema()` at `maseval/interface/environments/are_tool_wrapper.py:78-106` duplicates what `AppTool.to_open_ai()` already does natively.

Maseval adds tracing on top, which is good — but the metadata extraction layer is redundant. You could simplify by composing with `AppToolAdapter` rather than duplicating.

**6. `Any` types everywhere erode the type safety promise.**

`_are_env: Any`, `_scenario: Any`, `get_are_environment() -> Any`, `get_scenario() -> Any`, `get_notification_system() -> Any`. The AGENTS.md philosophy says types should "provide better IDE autocomplete and error detection." With `Any` returns, developers get zero autocomplete on the most important objects.

If the concern is that ARE is an optional dependency, you can use `TYPE_CHECKING` imports (which `are_tool_wrapper.py` already does!) to provide proper types that only resolve during type-checking.

**7. Silent no-ops are dangerous for simulation code.**

Every lifecycle method silently does nothing if `_are_env is None`:
```python
def start(self) -> None:
if self._are_env is not None and self._scenario is not None:
self._are_env.run(...)
```

If someone calls `start()` on a misconfigured environment, nothing happens and nothing is logged. In simulation code, silent failures can waste hours of debugging. These should either raise or at minimum warn.

**8. The cleanup swallows all exceptions.**

At `maseval/interface/environments/are.py:441-444`:
```python
def cleanup(self) -> None:
if self._are_env is not None:
try:
self._are_env.stop()
except Exception:
pass
```

Blanket exception swallowing makes debugging impossible. At minimum, log the exception.

---

### Does it solve real developer problems?

**Partially.** The integration solves the right problems in theory:
- Tool wrapping with tracing (**yes, genuinely helpful**)
- Notification handling (**yes, saves real boilerplate**)
- Lifecycle management (**mostly — but lifecycle is already simple in ARE**)
- Config/trace gathering (**yes, good maseval integration**)

But in practice, the only real consumer (Gaia2) bypasses the setup path entirely, which suggests the abstraction doesn't match what developers actually need when building ARE-based benchmarks. The "reusable base class" framing oversells what you get — what you really get is a collection of useful utilities attached to a partially-useful setup flow.

### Does it give flexibility?

**Mixed.** The shorthand path provides a nice quick-start experience for simple scenarios. But for any real benchmark (like Gaia2), developers need to override `setup_state()` completely because:
- They need `preprocess_scenario()` (ARE's own setup pipeline)
- They need custom judge configuration
- They need benchmark-specific oracle mode handling
- The parent's oracle mode is too simplistic

The fact that the primary consumer had to bypass the parent's setup entirely suggests the abstraction layer is at the wrong level.

### What would make this better?

1. **Accept that `AREEnvironment` is a utility mixin, not a full base class.** The lifecycle methods, notification polling, tool wrapping, and accessors are genuinely useful. But `setup_state()` should either be truly generic (just "give me an ARE Environment and a Scenario and I'll set up tools + accessors") or be left abstract so subclasses always implement it.

2. **Remove the shorthand path or demote it to a factory function.** It has no real consumers and introduces invented defaults.

3. **Remove `getattr` fallbacks.** Let errors surface.

4. **Use proper types via `TYPE_CHECKING`.** You already do this in the tool wrapper — extend it to `AREEnvironment`.

5. **Don't mutate app state in `create_tools()`.** The AUI filtering should be separate from tool creation.

6. **Make lifecycle methods explicit about failure.** `start()` on a `None` environment should raise, not silently succeed.

---

### Bottom line

The ARE integration is a **solid 70%**. The tool wrapper, notification handling, and tracing infrastructure are genuinely useful and well-tested. But the base class `setup_state()` path is an unused abstraction — the only real consumer bypasses it completely. The code would be stronger if it acknowledged its actual role (utilities + tool wrapping) rather than pretending to be a complete, reusable environment setup pipeline. The invented defaults and silent no-ops are the most actionable issues — they directly contradict the project's own scientific integrity and "let errors surface" guidelines.
109 changes: 109 additions & 0 deletions AREISSUES.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,109 @@
# AREEnvironment Issues

Issues identified by comparing `AREEnvironment` (generic ARE wrapper in `maseval/interface/environments/`) with `Gaia2Environment` (dataset-specific implementation in `maseval/benchmark/gaia2/`).

Reviewed through scientific coding principles: silent failures that could produce wrong-but-plausible benchmark results are treated as high priority.

**Design principle:** maseval's `Benchmark` class provides `fail_on_task_error`, `fail_on_setup_error`, `fail_on_evaluation_error`, etc. flags that give users explicit control over error tolerance. Environment and tool code must propagate errors so the benchmark runner can classify them (agent fault vs. infrastructure fault) and respect the user's `fail_on_*` settings. Silent swallowing inside environment code bypasses this mechanism entirely — the user asked for strict mode but gets silent degradation.

## High Priority

### 1. Oracle mode silently degrades to empty data

`maseval/interface/environments/are.py:200-203` uses `hasattr` checks that silently fall back to empty dicts/lists:

```python
oracle_traces = {
"apps_state": oracle_env.get_apps_state() if hasattr(oracle_env, "get_apps_state") else {},
"world_logs": oracle_env.get_world_logs() if hasattr(oracle_env, "get_world_logs") else [],
}
```

If the ARE API changes or the methods are missing, oracle traces will be `{}` and `[]` — structurally valid but scientifically empty. Downstream evaluation code will run without error and produce meaningless scores. This is a textbook "wrong result that looks right" scenario.

`Gaia2Environment` avoids this entirely by delegating to ARE's canonical `preprocess_scenario()`.

No tests cover oracle mode, so this cannot be caught by CI.

**Fix:** Remove the `hasattr` fallbacks. Call the methods directly — if they don't exist, the crash immediately tells you the ARE API changed. Alternatively, delegate to `preprocess_scenario()`. Add oracle mode tests.

### 2. Missing simulation time tracking in AREToolWrapper

`Gaia2GenericTool` records simulation time before/after each tool call (`maseval/benchmark/gaia2/tool_wrapper.py:123-150`). `AREToolWrapper` only records wall-clock time (`maseval/interface/environments/are_tool_wrapper.py:122-128`).

Simulation time is the temporal coordinate of the experiment. Without it, any analysis of agent behavior in time-sensitive ARE scenarios is done against wall-clock time, which conflates LLM latency with simulated environment dynamics. Results would be unreproducible across different hardware or API providers.

**Fix:** Add `_get_simulation_time()` helper and record `simulation_time_before`, `simulation_time_after`, `simulation_time_elapsed` in the invocation `meta` dict, matching `Gaia2GenericTool`.

### 3. Schema extraction silently fabricates defaults

`AREToolWrapper._extract_schema()` (`maseval/interface/environments/are_tool_wrapper.py:90-95`):

```python
"type": getattr(arg, "arg_type", "string"), # fabricates "string" if missing
if not getattr(arg, "has_default", True): # assumes optional if missing
```

If an ARE tool arg is missing `arg_type`, it silently becomes `"string"`. If `has_default` is missing, it silently becomes optional. The schema will look valid but will be wrong — agents will receive incorrect type information and parameter requirements, producing tool calls that either fail in hard-to-trace ways or succeed with wrong arguments.

`Gaia2GenericTool` accesses these directly (`arg.arg_type`, `arg.has_default`) — it will crash immediately if the structure is unexpected, which is the correct behavior.

**Fix:** Remove the silent defaults. Access `arg.arg_type` and `arg.has_default` directly. If ARE tools don't have these attributes, that's a real problem that should surface immediately. Additionally, use ARE's `AppToolAdapter` as the canonical source of tool metadata, as `Gaia2GenericTool` does.

### 4. `poll_notifications()` silently swallows all exceptions

Both `AREEnvironment` (`are.py:330-331`) and `Gaia2Environment` (`environment.py:328-329`):

```python
except Exception:
return [], [], False
```

If the notification system has a bug, returns corrupt data, or the ARE API changes, this catch-all returns "no notifications, simulation not stopped" — a plausible-looking empty result. The agent loop continues as if nothing happened, missing user messages and environment events. The benchmark produces scores based on an agent that never received its inputs.

This is distinct from lifecycle cleanup (where swallowing errors is acceptable). Notification polling is part of the data path — it directly affects what the agent sees and does.

Additionally, this bypasses maseval's `fail_on_*_error` mechanism: even if the user configured strict mode via `fail_on_task_error` or `fail_on_setup_error` to catch infrastructure failures, these errors are swallowed before the benchmark runner ever sees them.

**Fix:** Remove the bare `except Exception`. Catch only the specific exceptions that represent expected transient conditions (if any). Let unexpected errors propagate — the benchmark runner classifies them (`ENVIRONMENT_ERROR` during execution, `SETUP_FAILED` during setup) and the user's `fail_on_*` settings decide whether to abort or continue.

## Medium Priority

### 5. AUI tool filtering missing from AREEnvironment

`Gaia2Environment` filters out `AgentUserInterface` message-retrieval tools and sets `wait_for_user_response = False` (`maseval/benchmark/gaia2/environment.py:178-213`), matching ARE's default agent behavior. `AREEnvironment` doesn't.

Anyone using `AREEnvironment` with ARE's notification-based message delivery will get tools that block waiting for a response or duplicate user messages. This is a correctness issue for interactive scenarios, not just a convenience gap.

**Fix:** Add an opt-in parameter (e.g. `filter_aui_tools=True`) to `AREEnvironment.__init__`, or at minimum document the behavior prominently so users of the generic wrapper know they must handle this themselves.

### 6. Inconsistent error handling in AREEnvironment lifecycle methods

`cleanup()` wraps in try/except (`are.py:360-364`), but `pause()` and `resume_with_offset()` propagate exceptions (`are.py:268-280`). `Gaia2Environment` wraps all lifecycle methods consistently.

For lifecycle methods, the question is: "if this fails, can the experiment still produce correct results?"

- `cleanup()`: Swallowing is acceptable — the task is already done, we're tearing down.
- `pause()` / `resume_with_offset()`: These control simulation time during agent execution. A failure here means time is advancing when it shouldn't be (or vice versa), directly affecting the experimental conditions. The current inconsistency means `pause()` will crash the benchmark run while `cleanup()` won't — but neither behavior is fully correct.

Crucially, swallowing these errors removes the user's ability to control error tolerance via `fail_on_task_error` / `fail_on_setup_error`. The user opted into strict mode precisely to catch conditions like "simulation time wasn't paused during LLM generation." Silent swallowing overrides that choice.

**Fix:** Let `pause()` and `resume_with_offset()` propagate exceptions. The benchmark runner classifies them and the user's `fail_on_*` settings decide the outcome. `cleanup()` is the only lifecycle method where swallowing is acceptable (teardown after task completion).

### 7. No `get_turn_notifications()` on AREEnvironment

`Gaia2Environment` has `get_turn_notifications()` (`maseval/benchmark/gaia2/environment.py:331-380`) which re-queues environment notifications for the inner agent loop — essential for multi-turn ARE scenarios. Without it, environment notifications are consumed by the outer turn loop and never reach the agent's step loop.

Anyone building a multi-turn agent on `AREEnvironment` will silently lose environment notifications between turns.

**Fix:** Add `get_turn_notifications()` to `AREEnvironment`, or document that `poll_notifications()` alone is insufficient for multi-turn agent loops and that users must implement re-queuing themselves.

## Low Priority

### 8. Metadata extraction should use AppToolAdapter

`Gaia2GenericTool` delegates metadata extraction to ARE's `AppToolAdapter` (`maseval/benchmark/gaia2/tool_wrapper.py:68-75`) — the canonical source of truth for tool name, description, inputs, and output_type. `AREToolWrapper` reads these attributes directly from the ARE tool object.

This works today but couples `AREToolWrapper` to ARE's internal tool structure rather than its public adapter API.

**Fix:** Use `AppToolAdapter` in `AREToolWrapper` for metadata extraction.
6 changes: 6 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -9,8 +9,14 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

### Added

- Generic `AREEnvironment` in `maseval.interface.environments` for Meta's ARE (Agent Research Environments) platform with lifecycle control (start/stop/pause/resume), notification polling, AUI tool filtering, and oracle mode support. Install with `pip install maseval[are]`. (PR: #55)
- Generic `AREToolWrapper` in `maseval.interface.environments` wraps any ARE `AppTool` with simulation time tracking, invocation history, JSON schema extraction, and MASEval tracing integration. (PR: #55)
- Optional dependency extra `are`: `pip install maseval[are]` (PR: #55)

### Changed

- `Gaia2Environment` now inherits from `AREEnvironment` and only retains GAIA2-specific setup (preprocessing + judge) and trace/config gathering as overrides. (PR: #55)
- `Gaia2GenericTool` is now an alias for `AREToolWrapper`. (PR: #55)
- Renamed `task_data` parameter to `environment_data` across all environment constructors, test fixtures, and examples for consistency with the base class API. (PR: #58)

### Fixed
Expand Down
Loading