First off, thank you for considering contributing to MASEval! It's people like you that make our community great. This document provides a guide for making contributions.
Before diving into the technical details, here are the core principles that guide our development process.
We follow a standard GitHub workflow. Following these steps makes it easier to review and merge your changes.
- Create a new branch for your feature or bugfix. Do not commit directly to
main. - Make your changes following our code style (see below).
- Update CHANGELOG.md: Add a brief entry under
[Unreleased]in the appropriate section (Added/Changed/Fixed/Removed). - Open a Pull Request against the
mainbranch - our PR template will guide you through the checklist. - Request a review from
cemde. - Ensure all automated tests pass before your PR can be merged. Our automated checks are detailed in the technical section below.
The maseval package is designed with a strict separation between its core logic and optional integrations. Understanding this is key to contributing effectively.
-
maseval/core: This is the heart of the library. It contains the essential logic and must not have any optional dependencies. It should be fully functional with a minimal installation. -
maseval/interface: This contains adapters for other multi-agent frameworks (likecrewai,langgraph, etc.). All dependencies for these integrations are optional.
Warning
Code in maseval/core must never import from maseval/interface. This separation is critical to keep the core package lightweight and dependency-free. Breaking this rule will cause the library to fail.
This section provides the technical details you'll need to get started with coding.
We use uv for fast and reliable package management. The best way to ensure a consistent environment is to sync it with the project's lockfile.
# Sync your environment with all dependencies (including dev tools and optional dependencies)
# This command will automatically create a virtual environment if one doesn't exist
uv sync --all-extras --all-groupsNote:
uv syncwill automatically create a.venvdirectory if it doesn't exist, so there's no need to runuv venvseparately. The--all-extrasflag includes all optional dependencies (like framework integrations), and--all-groupsincludes development tools likeruffandpytest.
You have two options for running commands in your development environment:
Option 1: Activate the virtual environment (traditional approach)
# Activate the environment (macOS/Linux)
source .venv/bin/activate
# Now you can run commands directly
python examples/amazon_collab.py
pytest tests/
ruff format .Option 2: Use uv run (no activation needed)
# Run commands directly with uv run
uv run python examples/amazon_collab.py
uv run pytest tests/
uv run ruff format .Both approaches work equally well! Use whichever you prefer. The uv run approach is convenient if you don't want to activate the environment, as it automatically uses the correct virtual environment for each command.
Option 3: Use just (recommended)
We provide a justfile as a command runner for common development tasks. This is the easiest way to run commands — no need to remember long uv run invocations.
# Install just (if not already installed)
brew install just # macOS
# or: cargo install just
# List all available commands
just
# Examples
just test # Run default test suite
just test-core # Run core tests
just check # Run all quality checks (format, lint, typecheck)
just docs # Serve documentation locallyAll just recipes accept extra arguments. For example:
just test-core -x --tb=short # Stop on first failure, short tracebacks
just test-benchmark -k "macs" # Run only MACS benchmark testsNote: All commands also work with
uv rundirectly —justis a convenience wrapper.
We use ruff to enforce a consistent code style. Before committing, please run the formatter and linter.
# Format the codebase
ruff format .
# Lint the codebase and fix what can be fixed automatically
ruff check . --fixFor convenience, you can enable pre-commit hooks to automatically format and lint code on every commit:
uv run pre-commit installThis is optional—CI will catch any issues regardless. But if enabled, the hooks will:
- Format code with
ruff format(using project settings frompyproject.toml) - Lint and auto-fix issues with
ruff check --fix
Note: The pre-commit hooks intentionally skip removing unused imports (
F401) and unused variables (F841) to avoid disrupting work-in-progress code. Runuv run ruff check . --fixmanually before opening a PR to clean these up.
Dependencies are defined in pyproject.toml and locked in uv.lock. Understanding the different dependency types is important:
Core Dependencies ([project.dependencies])
- Required for the package to function at all
- Installed by default when users run
pip install maseval - Example:
rich(used for console output in the core library)
Optional Dependencies ([project.optional-dependencies])
- Published with the package and installable by end users
- Used for optional features that users can choose to install
- Installed via
pip install maseval[smolagents]orpip install maseval[langgraph] - Examples:
smolagents,langgraph,openai(framework integrations and inference engines)
Dependency Groups ([dependency-groups])
- Not published with the package (development-only)
- Used by contributors for development, testing, and documentation
- Installed via
uv sync --group devoruv sync --all-groups - Examples:
pytest,ruff,mkdocs(development and documentation tools)
Key Difference: Optional dependencies are for end users who want additional features. Dependency groups are for contributors who need development tools. Only optional dependencies are published to PyPI.
If you need to add or change dependencies, use uv add. This command automatically updates both pyproject.toml and uv.lock.
# Add a core dependency (required for the package to work)
uv add <package-name>
# Add an optional dependency to a specific extra group
uv add --optional <extra-name> <package-name>
# Add a development dependency to a group
uv add --group dev <package-name>
# Remove a dependency
uv remove <package-name>After updating dependencies in a Pull Request, other developers can get the changes simply by running uv sync --all-extras --all-groups.
Our documentation is built with MkDocs. To preview your changes locally:
# Build the documentation with strict checking
mkdocs build --strict
# Serve the documentation locally at http://127.0.0.1:8000
mkdocs serveTip: You can also use
uv run mkdocs build --strictanduv run mkdocs serveif you prefer not to activate the environment.
API reference pages should include links to source files on GitHub. Use the following pattern:
[:material-github: View source](https://github.com/parameterlab/maseval/blob/main/maseval/path/to/YOUR_NEW_CLASS.py){ .md-source-file align=right }
::: maseval.path.to.YOUR_NEW_CLASSThis renders a right-aligned GitHub link above the auto-generated API documentation. See docs/reference/agent.md for a complete example.
You may include jupyter notebook examples directly into the documentation. We use the mkdocs-jupyter plugin to render Jupyter notebooks in the documentation. To avoid duplicating notebook files, we use symbolic links.
Example: The notebook examples/aws_collab/amazon_collab.ipynb is included in the docs via a symlink at docs/examples/amazon_collab.ipynb.
To add a new notebook to the documentation:
-
Create a symlink from the notebook's location to the
docs/directory:ln -sf ../../examples/your_notebook_dir/notebook.ipynb docs/examples/notebook.ipynb
-
Add the notebook to the nav in
mkdocs.yml:nav: - Examples: - Your Notebook: examples/notebook.ipynb
-
Commit the symlink to version control:
git add docs/examples/notebook.ipynb
This approach ensures that:
- The notebook remains in its original location (e.g., with related data files)
- Documentation always reflects the latest notebook version
- No manual copying/syncing is required
When you open a Pull Request, a series of automated checks will run using GitHub Actions (GHA). This is our continuous integration (CI) pipeline, and it ensures that all contributions meet our quality standards.
The pipeline automatically performs the following tasks:
- Linting and Formatting: Verifies that your code adheres to our style guide using
ruff. - Testing (tiered):
- Fast tests (every PR, Python 3.10–3.14): core, benchmark, and all default-suite tests. No API keys needed.
- Slow tests (every PR, Python 3.12): data download and integrity validation.
- Credentialed tests (every PR, Python 3.12): live API tests. Requires maintainer approval to run — secrets are only exposed after approval.
- Type Checking: Validates type annotations using
ty. - Documentation: Ensures documentation builds without errors using
mkdocs.
All checks must pass before your Pull Request can be merged. Contributors don't need API keys — the default and slow test suites run without them. See tests/README.md for how markers work and for the recommended benchmark testing pattern (offline structural tests vs. real-data tests).
Note: You don't need to run all these checks locally - CI will catch issues. However, running
uv run ruff format && uv run ruff checkbefore pushing can save you time.
When creating adapters for external agent frameworks (in maseval/interface/agents/), follow these best practices to ensure consistency and reliability:
Always use the framework's native message storage as the source of truth. Do not cache converted messages in the adapter, as this can lead to inconsistencies if the framework's internal state changes.
Correct Pattern (SmolAgents example):
class SmolAgentAdapter(AgentAdapter):
def get_message_history(self) -> MessageHistory:
"""Dynamically fetch and convert messages from framework's memory."""
# Get messages from framework's internal storage
smol_messages = self.agent.write_memory_to_messages()
# Convert and return (no caching)
return self._convert_smolagents_messages(smol_messages)
def _run_agent(self, query: str) -> MessageHistory:
# Run the agent (updates framework's internal memory)
self.agent.run(query)
# Return by calling get_message_history() to fetch latest
return self.get_message_history()Why this matters:
- Single Source of Truth: The framework maintains the canonical message history
- Always Current: Each call to
get_message_history()fetches the latest state - No Sync Issues: No risk of cached copy becoming stale
- Cheap Conversion: Message format conversion is typically very fast, so caching provides minimal benefit
Anti-pattern to avoid:
# ❌ DON'T DO THIS - Cached copy can become stale
def _run_agent(self, query: str) -> MessageHistory:
self.agent.run(query)
history = self._convert_messages(self.agent.messages)
self._cached_history = history # Bad: creates stale cache
return history
def get_message_history(self) -> MessageHistory:
return self._cached_history # Bad: returns potentially stale dataWhen adding support for a new framework:
- Override
get_messages()to fetch from framework's native storage - Implement
_run_agent()to execute agent and return fresh history - Create conversion method (e.g.,
_convert_X_messages()) for message format - Handle tool calls and tool responses if the framework supports them
- Add optional dependency to
pyproject.tomlunder[project.optional-dependencies] - Add conditional import in
maseval/interface/agents/__init__.py - Write integration tests in
tests/test_interface/ - Update documentation with usage examples
- Provide a
logsproperty inside theAgentAdapter.
Pattern 1: Persistent State (smolagents)
class MyFrameworkAdapter(AgentAdapter):
def get_messages(self) -> MessageHistory:
"""Dynamically fetch from framework's internal storage."""
# Get from framework (e.g., agent.memory, agent.messages)
framework_messages = self.agent.get_messages()
# Convert and return immediately (no caching)
return self._convert_messages(framework_messages)
def _run_agent(self, query: str) -> MessageHistory:
# Run agent (updates framework's internal state)
self.agent.run(query)
# Return by calling get_messages() to fetch latest
return self.get_messages()Why This Works:
- Single Source of Truth: Framework's internal storage is authoritative
- Always Current: Each call to
get_messages()fetches the latest state
Stateless/Result-based (LangGraph pattern):
Frameworks that return results without persistent state can cache the last result:
def __init__(self, agent_instance, name, callbacks=None, config=None):
super().__init__(agent_instance, name, callbacks)
self._last_result = None
self._config = config # For stateful mode if supported
def get_messages(self) -> MessageHistory:
# Try fetching from persistent state if configured
if self._config and hasattr(self.agent, 'get_state'):
state = self.agent.get_state(self._config)
return self._convert_messages(state.values['messages'])
# Fall back to cached result
if self._last_result:
return self._convert_messages(self._last_result['messages'])
return MessageHistory()
def _run_agent(self, query: str) -> MessageHistory:
result = self.agent.invoke(query, config=self._config)
self._last_result = result # Cache for stateless mode
return self.get_messages()The key principle: Always try to fetch from the framework's source of truth first, fall back to caching only when the framework doesn't provide persistent state access.