MASTIF (Multi-Agent System TestIng Framework) is a benchmarking suite for evaluating AI-based multi-agent systems across multiple frameworks, protocols, and generative AI models (LLMs). It supports both standard user-defined tasks and the Mind2Web benchmark, enabling reproducible assessment of agent reasoning, tool use, and web interaction capabilities.
Key Capabilities:
- Multi-Framework Support: Evaluates agents built with CrewAI, Smolagents, LangChain, LangGraph, LlamaIndex, and Semantic Kernel.
- Multi-Model Support: Supports models from both HuggingFace and OpenAI, including open-source and proprietary LLMs.
- Protocol Flexibility: Assesses agent performance under various prompting and reasoning protocols (e.g., MCP, A2A, ACP, standard).
- Mind2Web Benchmark Integration: Runs large-scale, real-world web interaction tasks from the Mind2Web dataset, with automatic sampling and domain breakdowns.
- Token Consumption Metrics: Tracks and reports reasoning tokens, output tokens, and total tokens spent for each test, framework, protocol, and model.
- Detailed Metrics Collection: Captures reasoning steps, latency, task understanding, task deviation, task completion, and domain-specific performance.
- Extensible Tool Use: Evaluates agent tool-calling and web search capabilities.
- Flexible Configuration: Supports switching between models, frameworks, and protocols via environment variables or code.
- Comprehensive Output: Exports results in machine-readable (JSON) formats with detailed summaries and breakdowns. Files out-standard.txt and out-mind2web.txt show examples of human-readable console output.
- Judge Model Integration: You can use LLM-as-a-judge (e.g., GPT-4o-mini) for scoring and evaluation of agent outputs.
MASTIF is designed for researchers, developers, and practitioners who want to systematically compare agentic AI stacks, understand their strengths and weaknesses, and drive improvements in agent reasoning and web automation.
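The per-test token reporting listed among the capabilities can be pictured as a simple aggregation over test records. The sketch below is illustrative only: the `TokenRecord` class, its field names, and the assumption that total tokens equal reasoning plus output tokens are hypothetical and are not taken from MASTIF's code or output schema.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenRecord:
    """Hypothetical per-test record; field names are not MASTIF's schema."""
    framework: str
    protocol: str
    model: str
    reasoning_tokens: int
    output_tokens: int

    @property
    def total_tokens(self) -> int:
        # Assumption for this sketch: total = reasoning + output.
        return self.reasoning_tokens + self.output_tokens


def totals_by_stack(records):
    """Sum token counts per (framework, protocol, model) combination."""
    totals = defaultdict(lambda: {"reasoning": 0, "output": 0, "total": 0})
    for r in records:
        bucket = totals[(r.framework, r.protocol, r.model)]
        bucket["reasoning"] += r.reasoning_tokens
        bucket["output"] += r.output_tokens
        bucket["total"] += r.total_tokens
    return dict(totals)
```

Grouping by the full (framework, protocol, model) triple keeps each stack's token cost separate, which is the comparison the suite is built around.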
python3 -m venv venv
source venv/bin/activate
# Install all dependencies considering all the specific versions used in the project
pip install -r requirements.txt
# Install dependency for Mind2Web
pip install datasets
# Install core
pip install \
huggingface-hub \
openai \
pydantic \
pyyaml \
python-dotenv \
requests \
tiktoken \
transformers
# Install agentic frameworks
pip install langchain langchain-community langchain-core langgraph
pip install crewai crewai-tools
pip install smolagents
pip install llama-index llama-index-core llama-index-llms-huggingface-api
pip install semantic-kernel
# Install tool pool dependencies
pip install duckduckgo-search
pip install playwright
playwright install chromium
pip install wikipedia
pip install arxiv
pip install RestrictedPython
pip install beautifulsoup4
pip install pypdf requests
pip install biopython
pip install youtube-transcript-api
pip install sympy

MASTIF supports two types of tests:
- Custom tasks
- Benchmark tasks; the currently supported benchmark is Mind2Web.
For custom tasks, HF_TOKEN is required and OPENAI_API_KEY is only needed if you configure tests with gpt-* models.
For Mind2Web tasks, the following two keys are required, because MASTIF currently uses OpenAI models to judge the outputs of benchmark tasks.
export HF_TOKEN='your_token'
export OPENAI_API_KEY='your_key'

Make a copy of experiments.yaml, rename it, and customize it according to your needs.
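Before launching a run, the key requirements described above can be checked with a short pre-flight script. This is a minimal sketch, not part of MASTIF: the function name `missing_keys` and the mode labels `"custom"` and `"mind2web"` are hypothetical, and the rule it encodes is only the one stated here (HF_TOKEN always, OPENAI_API_KEY for Mind2Web or for gpt-* models).

```python
import os


def missing_keys(mode, models, env=None):
    """Return the required environment variables that are not set.

    mode:   "custom" or "mind2web" (labels chosen for this sketch).
    models: names of the models the experiment will use.
    env:    mapping to inspect; defaults to the process environment.
    """
    env = os.environ if env is None else env
    required = ["HF_TOKEN"]
    # OpenAI access is needed for the Mind2Web judge or for any gpt-* model.
    needs_openai = mode == "mind2web" or any(m.startswith("gpt-") for m in models)
    if needs_openai:
        required.append("OPENAI_API_KEY")
    return [name for name in required if not env.get(name)]


if __name__ == "__main__":
    missing = missing_keys("mind2web", [])
    if missing:
        print("Missing environment variables:", ", ".join(missing))
```

Passing `env` explicitly makes the check easy to exercise without touching the real environment.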
python main.py experiments/[your experiment file].yaml

- 10 tasks: Quick evaluation (~15 minutes)
- 50 tasks: Medium evaluation (~1 hour)
- 100 tasks: Comprehensive sample (~2 hours)
- All tasks (2,350): Full benchmark (~24+ hours)
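Running a subset of the benchmark implies drawing a sample of tasks and, for reporting, counting how the sample spreads across domains. The sketch below shows one way this could be done with a fixed seed so that the same subset is drawn on every run. It is an assumption-laden illustration: the helper names, the toy task records, and the `"domain"` key are hypothetical, and MASTIF's actual sampling strategy may differ.

```python
import random
from collections import Counter


def sample_tasks(tasks, n, seed=0):
    """Draw up to n tasks without replacement, reproducibly for a given seed."""
    rng = random.Random(seed)
    return rng.sample(tasks, min(n, len(tasks)))


def domain_breakdown(tasks):
    """Count tasks per domain (assumes each task is a dict with a 'domain' key)."""
    return Counter(task["domain"] for task in tasks)


if __name__ == "__main__":
    # Toy records for illustration only; these are not Mind2Web data.
    toy_tasks = [
        {"id": i, "domain": "Travel" if i % 2 else "Shopping"} for i in range(20)
    ]
    subset = sample_tasks(toy_tasks, 10, seed=42)
    print(domain_breakdown(subset))
```

Using a dedicated `random.Random(seed)` instance rather than the module-level generator keeps the draw independent of any other random calls in the program.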
- logs/mind2web-results-TIMESTAMP.json: Mind2Web-specific metrics
- logs/results-TIMESTAMP.json: Standard test results
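Because the result files carry a timestamp in their names, a small helper can locate and load the most recent one. The sketch below assumes only the naming pattern given above (`logs/results-TIMESTAMP.json` and `logs/mind2web-results-TIMESTAMP.json`); the function name `latest_results` is hypothetical, and no assumption is made about the structure of the JSON inside.

```python
import json
from pathlib import Path


def latest_results(log_dir="logs", prefix="results"):
    """Return (path, parsed JSON) for the newest matching file, or None if none exist.

    prefix is "results" for standard runs or "mind2web-results" for Mind2Web runs.
    Files are ordered by modification time, not by parsing the timestamp.
    """
    candidates = sorted(
        Path(log_dir).glob(f"{prefix}-*.json"),
        key=lambda p: p.stat().st_mtime,
    )
    if not candidates:
        return None
    path = candidates[-1]
    with path.open(encoding="utf-8") as fh:
        return path, json.load(fh)


if __name__ == "__main__":
    found = latest_results()
    if found is None:
        print("No result files found in logs/")
    else:
        path, data = found
        # Inspect the top level without assuming a particular schema.
        print(path, type(data).__name__)
```

Ordering by modification time avoids guessing the exact timestamp format used in the file names.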
- Task Understanding: Agent's comprehension of the task
- Task Deviation: Agent's adherence to the task in reasoning steps
- Task Completion: Agent's performance on fulfilling the task
- Reasoning Steps: Number of intermediate reasoning steps
- Domain-specific performance breakdowns
- Standard mode with user-defined tasks: out-standard.txt
- Mind2Web mode with benchmark tasks: out-mind2web.txt
- Mind2Web requires authentication with HuggingFace
- The test set requires accepting terms on HuggingFace
- Focus is on task understanding and action planning capabilities
AIA Human-AI blend, Content edits, Human-initiated, Reviewed, Copilot, Gemini, and Sonnet 4.5 v1.0
More info: https://aiattribution.github.io/create-attribution

