AGENTS.md

Project Overview

MASEval is an orchestration library for benchmarking LLM-based multi-agent systems. Think of it as PyTorch Lightning for agent evaluation—it provides the execution engine while users implement agent logic.

Key Architecture Rule: Strict separation between maseval/core (minimal dependencies) and maseval/interface (optional integrations). Core must NEVER import from interface.

The library is in early development, so breaking changes that are parsimonous are strongly preferred.
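In practice, the core/interface separation usually shows up as a guarded import inside interface modules, so a missing optional dependency fails with an actionable message instead of a bare ImportError. A minimal sketch (the helper name require_optional and the usage comment are illustrative, not part of the actual maseval API):

```python
import importlib


def require_optional(module_name: str, extra: str):
    """Import an optional dependency, raising an actionable error if missing."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"This integration needs the optional dependency '{module_name}'. "
            f"Install it with: pip install 'maseval[{extra}]'"
        ) from exc


# e.g., at the top of a module in maseval/interface/smolagents/:
# smolagents = require_optional("smolagents", "smolagents")
```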

Setup Commands

# Sync environment with all dependencies (creates .venv automatically)
uv sync --all-extras --all-groups

# Activate environment
source .venv/bin/activate

# Or use uv run for individual commands (no activation needed)
uv run python examples/amazon_collab.py
uv run pytest tests/

Code Style and Quality

  • Line length: 144 characters
  • Tool: ruff
  • All checks must pass in CI before merge

# Format code
uv run ruff format .

# Lint and auto-fix issues
uv run ruff check . --fix

Testing Instructions

  • Tests use composable pytest markers — see tests/README.md for full details
  • Markers for what a test covers: core, interface, contract, benchmark, smolagents, langgraph, llamaindex, gaia2, camel
  • Markers for what a test needs: live (network), credentialed (API keys), slow (>30s), smoke (full pipeline)
  • The default pytest run excludes slow, credentialed, and smoke via addopts
  • All tests must pass before PR merge
  • Add/update tests for code changes
  • Benchmark tests follow a two-tier pattern (offline structural + live real-data). See tests/README.md for the recommended pattern when adding or modifying benchmark tests.

# Default — fast tests only
uv run pytest -v

# Core tests only (minimal dependencies)
uv run pytest -m core -v

# Specific integration tests
uv run pytest -m smolagents -v
uv run pytest -m interface -v

# Data download validation (needs network)
uv run pytest -m "live and slow" -v

# Live API tests (needs OPENAI_API_KEY, ANTHROPIC_API_KEY, GOOGLE_API_KEY)
uv run pytest -m credentialed -v

# Fully offline
uv run pytest -m "not live" -v
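As an illustration of how the two marker dimensions compose on a single test (the test name and body here are hypothetical):

```python
import pytest


# A live smolagents integration test: selected by `-m smolagents` or
# `-m interface`, excluded from the default run because it is `slow`,
# and skipped entirely by `-m "not live"`.
@pytest.mark.interface
@pytest.mark.smolagents
@pytest.mark.live
@pytest.mark.slow
def test_smolagents_pipeline_runs_end_to_end():
    ...
```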

Coverage

View coverage by feature area (auto-discovers benchmarks/interfaces):

# Full coverage (default + slow + live, excludes credentialed and smoke)
uv run python scripts/coverage_by_feature.py

# Fast-only (skip slow and live tests)
uv run python scripts/coverage_by_feature.py --exclude slow,live

Manual coverage for specific modules:

pytest --cov=maseval.core.agent --cov-report=term-missing

Typing

Type Checker

The project uses the ty type checker.

# Check types across the project
uv run ty check

# View documentation
uv run ty --help

Documentation: https://docs.astral.sh/ty/

Philosophy & Priorities

Types exist to help users and catch bugs—not to satisfy theoretical purity.

This library uses type hinting to:

  • Provide better IDE autocomplete and error detection
  • Document expected types clearly
  • Catch real errors before runtime

However, pragmatism over pedantry: if a typing pattern improves usability and robustness, use it—even if it technically violates some typing rule.

Example: MACSBenchmark narrows its environment type to MACSEnvironment (instead of generic Environment). This violates strict subtyping rules but provides users with:

  • Precise autocomplete for MACS-specific methods
  • Clear documentation that MACS requires its own environment
  • Better error messages during development
  • Protection against mixing incompatible components (e.g., Tau2Environment cannot be passed to MACSBenchmark)

Unless there's a graceful alternative that preserves usability, choose the pattern that helps users most.
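A self-contained sketch of the narrowing idea described above. The Benchmark/Environment classes and the setup method here are simplified stand-ins, not the real maseval signatures:

```python
class Environment:
    """Simplified stand-in for maseval's Environment base class."""


class MACSEnvironment(Environment):
    def macs_only_method(self) -> str:
        return "macs"


class Benchmark:
    def setup(self, env: Environment) -> None:
        self.env = env


class MACSBenchmark(Benchmark):
    # Narrowing the parameter type technically violates the Liskov
    # substitution principle (a strict checker may flag the override),
    # but gives users precise autocomplete and early, clear errors.
    def setup(self, env: MACSEnvironment) -> None:  # type: ignore[override]
        self.env = env
```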

Guiding principle: Orient yourself on existing patterns in the codebase. Consistency matters more than theoretical correctness.

Syntax Rules

  • Unions: Use A | B notation (not Union[A, B])
  • Optional: Prefer Optional[X] over X | None for explicitness
  • Collections: Use List[...], Dict[..., ...], Sequence[...] instead of list, dict, sequence

Example:

from typing import Any, Dict, List, Optional

def process_data(
    items: List[str],
    config: Optional[Dict[str, Any]] = None
) -> str | int:
    ...

Dependency Management

Three types of dependencies:

  • Core ([project.dependencies]): Required, installed with pip install maseval. Keep minimal!
  • Optional ([project.optional-dependencies]): Published for end users. Framework integrations like smolagents and langgraph, plus convenience bundles like all and examples. Uses self-references (e.g., maseval[all]) to avoid duplication; this is a standard Python packaging feature.
  • Dev Groups ([dependency-groups]): NOT published. Only for contributors. Tools like pytest, ruff, mkdocs.

# Add core dependency (use sparingly!)
uv add <package-name>

# Add optional dependency for end users (e.g., framework integrations)
uv add --optional <extra-name> <package-name>

# Add development dependency (not published - dev tools only)
uv add --group dev <package-name>

# Remove any dependency
uv remove <package-name>

Important: uv add automatically updates both pyproject.toml and uv.lock. Never edit lockfile manually.

Understanding uv sync options:

  • uv sync: Installs only core dependencies
  • uv sync --extra <name>: Adds specific optional dependency (e.g., --extra smolagents)
  • uv sync --all-extras: Installs ALL optional dependencies (includes smolagents, langgraph, etc.)
  • uv sync --group <name>: Adds specific dev group (e.g., --group dev)
  • uv sync --all-groups: Installs ALL dev groups (dev, docs)
  • uv sync --all-extras --all-groups: Full contributor setup with everything

Note: Type checking requires --all-extras to install all framework packages needed for checking types across all interfaces.

Architecture Rules

Core vs Interface Separation (CRITICAL)

  • maseval/core/: Essential library logic. NO optional dependencies. Must work with minimal installation.
  • maseval/interface/: Adapters for external frameworks. ALL dependencies are optional.

NEVER:

  • Import maseval/interface from maseval/core
  • Add optional dependencies to core

Both rules are enforced in CI; violations will fail the build.
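A hedged sketch of how such a check could be implemented (the project's actual CI enforcement may work differently): walk the AST of every core module and flag any import of the interface package.

```python
import ast
import pathlib


def find_forbidden_imports(core_dir: str) -> list[str]:
    """Scan a directory of Python files for imports of maseval.interface."""
    violations = []
    for path in sorted(pathlib.Path(core_dir).rglob("*.py")):
        tree = ast.parse(path.read_text(encoding="utf-8"))
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                names = [alias.name for alias in node.names]
            elif isinstance(node, ast.ImportFrom) and node.module:
                names = [node.module]
            else:
                continue
            violations.extend(
                f"{path}:{node.lineno} imports {name}"
                for name in names
                if name.startswith("maseval.interface")
            )
    return violations
```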

Adding new framework integrations:

  1. Create adapter in maseval/interface/<library>/
  2. Add dependency to [project.optional-dependencies]
  3. Add tests in tests/test_interface/
  4. Mark tests with appropriate pytest marker
  5. Keep adapters small and well-documented

Framework Adapter Pattern:

When implementing adapters for external frameworks, always use the framework's native message storage as the source of truth:

Pattern 1: Persistent State (smolagents)

class MyFrameworkAdapter(AgentAdapter):
    def get_messages(self) -> MessageHistory:
        """Dynamically fetch from framework's internal storage."""
        # Get from framework (e.g., agent.memory, agent.messages)
        framework_messages = self.agent.get_messages()

        # Convert and return immediately (no caching)
        return self._convert_messages(framework_messages)

    def _run_agent(self, query: str) -> MessageHistory:
        # Run agent (updates framework's internal state)
        self.agent.run(query)

        # Return fresh history
        return self.get_messages()

Pattern 2: Stateless/Result-based (LangGraph)

def __init__(self, agent_instance, name, callbacks=None, config=None):
    super().__init__(agent_instance, name, callbacks)
    self._last_result = None  # Cache for stateless mode
    self._config = config  # For stateful mode if supported

def get_messages(self) -> MessageHistory:
    # Try persistent state first (if framework supports it)
    if self._config and hasattr(self.agent, 'get_state'):
        state = self.agent.get_state(self._config)
        return self._convert_messages(state.values['messages'])

    # Fall back to cached result
    if self._last_result:
        return self._convert_messages(self._last_result['messages'])

    return MessageHistory()

def _run_agent(self, query: str) -> MessageHistory:
    result = self.agent.invoke(query, config=self._config)
    self._last_result = result  # Cache result
    return self.get_messages()

Why no caching? Conversion is cheap, and caching creates consistency issues if framework state changes. The framework's internal storage is the single source of truth. For stateless frameworks, cache only the last result and prefer fetching from persistent state if available.
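For illustration, a hypothetical conversion helper could look like the following. The Message and MessageHistory types here are stand-ins (the real maseval types almost certainly differ); the point is only the "rebuild fresh on every read" shape:

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    """Hypothetical stand-in for maseval's message type."""
    role: str
    content: str


@dataclass
class MessageHistory:
    """Hypothetical stand-in for maseval's MessageHistory."""
    messages: list[Message] = field(default_factory=list)


def convert_messages(framework_messages: list[dict]) -> MessageHistory:
    """Rebuild a fresh MessageHistory from framework-native dicts on every call."""
    return MessageHistory(
        messages=[Message(role=m["role"], content=m["content"]) for m in framework_messages]
    )
```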

Documentation

  • Built with MkDocs Material theme + mkdocstrings
  • Source files in docs/, config in mkdocs.yml

# Build with strict checking (catches broken links)
mkdocs build --strict

# Serve locally at http://127.0.0.1:8000
mkdocs serve

Contribution Workflow

  1. Create a feature branch (never commit to main)
  2. Make changes following code style guidelines
  3. Run just all before committing. This formats, lints, typechecks, and tests in one step. See the justfile for all available recipes.
  4. Update documentation if needed
  5. Open PR against main branch
  6. Request review from cemde
  7. Ensure all CI checks pass

CI Pipeline: GitHub Actions runs formatting checks, linting, and test suite across Python versions and OS. All checks must pass before merge.

For VS Code Agents Specifically

Important: VS Code agents (GitHub Copilot in VS Code) do NOT have direct access to run terminal commands like uv, python, pytest, or ruff.

When you need to run commands:

  • Ask the user to execute commands
  • Provide clear explanations of what each command does
  • Don't attempt to run uv, python, pytest, or similar commands directly; it does not work.
  • You CAN read/write files and execute simple terminal commands such as ls, cd, and grep.

Example workflow:

  1. Make code changes using file edit tools
  2. Ask user to run: ruff format . && ruff check . --fix
  3. Ask user to run: pytest -v to verify changes
  4. Read test output if user provides it

Project-Specific Conventions

  • Minimum Python: 3.10 (Primary development: 3.12)
  • Package manager: uv exclusively (don't use pip install in development)
  • Commit guidelines: Reference issues, keep commits focused, run full test suite before pushing

Common Tasks Quick Reference

# Fresh environment setup / Update after pulling changes
just install # uv sync --all-extras --all-groups

# Before committing (format, lint, typecheck, test)
just all

# Run example
uv run python examples/amazon_collab.py

# Add optional dependency
uv add --optional <extra-name> <package-name>

# Check specific test file
uv run pytest tests/test_core/test_agent.py -v

For more commands, see the justfile.

Security and Confidentiality

IMPORTANT: This project contains confidential research material.

  • DO NOT publicly distribute code or data
  • DO NOT publish without explicit permission
  • DO NOT share copyrighted third-party benchmark data

Changelog

When you complete a task, document your changes in the Changelog. Multiple tasks contribute to a single PR, and PRs are compiled into release changelogs.

User-Facing Documentation

Write changelog entries from the user's perspective - describe what the change means for someone using the library, not what you did internally. Focus on features, fixes, and improvements they'll notice or benefit from.

Task-Level Documentation

Add an entry for your completed task under the ## Unreleased section.

Important Rules

  • If you modified something already listed under "Added" in Unreleased, update that existing entry instead of adding a new one under "Changed"
  • Keep entries focused on user impact, not implementation details
  • Multiple task entries will be grouped together under the same PR
  • PR changelogs are then compiled into release notes between versions

Format

Brief description of the user-facing change (PR: #PR_NUMBER_PLACEHOLDER)

Example (User-Facing)

Good:

  • Added support for custom retry strategies in the API client via the retry argument of Client.__init__ (PR: #13)
  • Fixed timeout errors when processing large datasets in func (PR: #4)

Bad (not user-focused):

  • Refactored retry logic into separate module
  • Updated error handling in data_processor.py

Docstrings

Write docstrings for users, not about your implementation process.

Rules

  • Describe what the code does and how to use it
  • Explain parameters, return values, and behavior
  • Never write narratives: "I did...", "First we...", "Then I..."
  • Never include quality claims: "rigorously tested", "well-optimized"
  • Omit implementation details users don't need

Bad (narrative, claims, implementation details)

def calculate_average(numbers: list) -> float:
    """
    I implemented this to calculate averages. First I sum the numbers,
    then divide by count. Rigorously tested and optimized.
    """

Good (clear, user-focused)

def calculate_average(numbers: list) -> float:
    """
    Calculate the arithmetic mean of numbers.

    Args:
        numbers: List of numeric values

    Returns:
        Average as float

    Raises:
        ValueError: If list is empty
    """

mkdocs Rendering

This project uses mkdocstrings to render docstrings as HTML. Follow these rules to ensure proper rendering:

Lists require a blank line before them:

# Bad - renders as one paragraph
"""Subclasses must provide:
- method_one(): Description
- method_two(): Description
"""

# Good - renders as proper bullet list
"""Subclasses must provide:

- `method_one()` - Description
- `method_two()` - Description
"""

Return descriptions must be single-line (multi-line creates multiple table rows):

# Bad
"""
Returns:
    TerminationReason indicating why is_done() returns True,
    or NOT_TERMINATED if the interaction is still ongoing.
"""

# Good
"""
Returns:
    Why `is_done()` returns True, or `NOT_TERMINATED` if still ongoing.
"""

For dictionary returns, document fields in the docstring body using "Output fields:":

# Bad - creates multiple table rows in Returns
"""
Returns:
    Dictionary containing:
    - `name` - User identifier
    - `profile` - User profile data
"""

# Good - fields in body, single-line Returns
"""
Gather execution traces from this user.

Output fields:

- `name` - User identifier
- `profile` - User profile data
- `message_count` - Number of messages in history

Returns:
    Dictionary containing user state and interaction data.
"""

HTML-like strings must be in backticks (otherwise stripped as HTML):

# Bad - </stop> disappears
"""Uses "</stop>" to signal satisfaction."""

# Good
"""Uses `"</stop>"` to signal satisfaction."""

Use backticks for code references - method names, parameters, and values: `is_done()`, `stop_tokens`, `None`

Seeding for Reproducibility

MASEval provides a seeding system for reproducible benchmark runs. Seeds cascade from a global seed through all components, ensuring deterministic behavior when model providers support seeding. Study the code and documentation of Benchmark and DefaultSeedGenerator to understand the mechanism.
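The cascading idea can be sketched as deriving a stable per-component seed from the global seed (DefaultSeedGenerator's actual derivation may differ; this only shows the property the design aims for):

```python
import hashlib


def derive_component_seed(global_seed: int, component_name: str) -> int:
    """Derive a deterministic, component-specific seed from a global seed."""
    digest = hashlib.sha256(f"{global_seed}:{component_name}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


# The same global seed always yields the same per-component seeds,
# while different components get independent, uncorrelated seeds.
```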

Early-Release Status

This project is early-release. Clean, maintainable code is the priority - not backwards compatibility.

  • Break APIs if it improves design
  • Refactor poor implementations
  • Remove technical debt as soon as you identify it
  • Don't preserve bad patterns for compatibility reasons
  • Focus on getting it right, not keeping it the same

We have zero obligation to maintain backwards compatibility. If you find code messy, propose a fix.

Scientific Integrity

MASEval is a scientific library. Scientific integrity is paramount. Never introduce defaults that could silently alter benchmark behavior or experimental outcomes.

The Boundary

Guiding principle: If a researcher would need to report a parameter in a paper's "Experimental Setup" section, do not invent a default for it.

Acceptable (infrastructure/convenience): TaskQueue(limit=None), Logger(verbose=False), num_workers=1, print_results(color=True) — these don't affect scientific results.

Unacceptable (experimental parameters): Temperature, seed, model version, prompt format, simulation duration, agent limits, dataset splits, scoring functions — these alter what's being measured.

Reproducing Benchmarks

When integrating external benchmarks, match the source implementation exactly. Never invent fallback values.

# BAD: Invented defaults
config = EnvironmentConfig(
    duration=getattr(scenario, "duration", 86400),  # Made-up fallback!
)
start_time = getattr(scenario, "start_time", None)  # Hides missing attributes

# GOOD: Pass through directly, let errors surface
config = EnvironmentConfig(
    duration=scenario.duration,  # Trust the source
)
start_time = scenario.start_time  # AttributeError if missing

# GOOD: Copy source defaults with documentation
# Default value copied from original_library/evaluator.py:L45
EVAL_TEMPERATURE = 0.7

class Evaluator:
    def run(self, temperature: Optional[float] = None):
        if temperature is None:
            temperature = EVAL_TEMPERATURE  # From source:L45

# also good:
class Evaluator:
    # default temperature from source:L45
    def run(self, temperature: Optional[float] = 0.7):
        ...

Rule: Only copy defaults that exist in the source. If the original doesn't provide a default, neither should you. Always document the source file and line number.