Skip to content

jitrc/tau2-enhanced

Folders and files

NameName
Last commit message
Last commit date

Latest commit

ย 

History

43 Commits
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 

Repository files navigation

tau2-enhanced

Advanced observability and analysis platform for tau2-bench AI agent evaluations

Overview

tau2-enhanced transforms tau2-bench with comprehensive structured logging, advanced analytics, deep observability into AI agent behavior, and intelligent performance optimization agents. It provides enhanced versions of all tau2 domains with rich event tracking, argument analysis, performance monitoring, sophisticated analysis capabilities, and three specialized agents that address critical tau2-bench performance issues - all without modifying the original tau2-bench codebase.

๐Ÿš€ Key Features

Structured Event Logging

  • ExecutionLogger: Comprehensive event logging with JSONL persistence and real-time export
  • ToolExecutionEvent: Rich tool call tracking with timing, arguments, results, and metadata
  • StateChangeEvent: Environment state transitions with detailed diff tracking
  • ContextReductionEvent: Context management and token optimization monitoring

Advanced Analytics

  • 15+ Analysis Methods: From basic metrics to advanced statistical analysis
  • Temporal Analysis: Time-based patterns, trends, and performance evolution
  • Argument Intelligence: Deep function call argument analysis and correlations
  • Error Pattern Detection: Smart error categorization and root cause analysis
  • Performance Optimization: Bottleneck identification and efficiency insights

Comprehensive Argument Tracking

  • Function Signatures: Required vs optional parameter usage analysis
  • Complexity Scoring: Argument and result complexity measurement (0-1 scale)
  • Security Detection: Automatic identification of sensitive data in arguments
  • Type Analysis: Complete argument type distribution and patterns
  • Size Monitoring: Argument and result size tracking with performance correlations

Intelligent Performance Optimization Agents ๐Ÿ†•

  • RetryManagedLLMAgent: 3-attempt retry logic with intelligent error recovery for validation failures
  • ContextManagedLLMAgent: Sliding window context reduction and token optimization to prevent performance cliffs
  • EnhancedLLMAgent: Combined retry + context management for maximum performance improvement
  • Addresses Critical Issues: 87% action execution failures and 53% performance drop at 3,000+ tokens
  • Expected Improvements: 65-70% overall success rate improvement with minimal overhead

Automatic Domain Registration

  • All Domains Enhanced: airline_enhanced, retail_enhanced, telecom_enhanced, mock_enhanced
  • Zero Configuration: Enhanced domains available immediately upon import
  • Seamless Integration: Works with existing tau2-bench commands and workflows

Installation

Setup Virtual Environment

Create and activate a virtual environment first (requires Python 3.10+):

python -m venv tau2-enhanced-env
source tau2-enhanced-env/bin/activate

Prerequisites

tau2-enhanced requires the tau2-bench framework to be installed first:

# Install tau2-bench from Sierra Research
git clone https://github.com/sierra-research/tau2-bench.git
cd tau2-bench
pip install -e .
cd ..

Install tau2-enhanced

  1. Clone this repository:
git clone https://github.com/jitrc/tau2-enhanced.git
cd tau2-enhanced
  1. Install in development mode:
pip install -e .

Verify Installation

Test that both tau2-bench and tau2-enhanced are working:

# Test tau2-bench
tau2 --help

# Test tau2-enhanced
./tau2-enhanced --help

# Test enhanced domains are available
python -c "from tau2_enhanced.domain_registration import EnhancedDomainRegistry; print(list(EnhancedDomainRegistry().register_all_available_domains().keys()))"

Create Reproducible Analysis Results

To reproduce the analysis results shown in samples/analysis/, run these commands:

# Create baseline comparison with different agents (Gemini)
./tau2-enhanced run --domain airline_enhanced --agent llm_agent --agent-llm gemini/gemini-2.5-flash --user-llm gemini/gemini-2.5-flash --num-tasks 10 --num-trials 2 --save-to airline_gemini2_5_flash_10tasks_llm_agent

./tau2-enhanced run --domain airline_enhanced --agent context_agent --agent-llm gemini/gemini-2.5-flash --user-llm gemini/gemini-2.5-flash --num-tasks 10 --num-trials 2 --save-to airline_gemini2_5_flash_10tasks_context_agent

./tau2-enhanced run --domain airline_enhanced --agent retry_agent --agent-llm gemini/gemini-2.5-flash --user-llm gemini/gemini-2.5-flash --num-tasks 10 --num-trials 2 --save-to airline_gemini2_5_flash_10tasks_retry_agent

./tau2-enhanced run --domain airline_enhanced --agent enhanced_agent --agent-llm gemini/gemini-2.5-flash --user-llm gemini/gemini-2.5-flash --num-tasks 10 --num-trials 2 --save-to airline_gemini2_5_flash_10tasks_enhanced_agent

# Create agent comparison with Grok models
./tau2-enhanced run --domain airline_enhanced --agent llm_agent --agent-llm xai/grok-3 --user-llm xai/grok-4-fast-reasoning --num-trials 4 --save-to airline_llm_agent_xai_grok3 --max-concurrency 5

./tau2-enhanced run --domain airline_enhanced --agent context_agent --agent-llm xai/grok-3 --user-llm xai/grok-4-fast-reasoning --num-trials 4 --save-to airline_context_agent_xai_grok3 --max-concurrency 5

./tau2-enhanced run --domain airline_enhanced --agent retry_agent --agent-llm xai/grok-3 --user-llm xai/grok-4-fast-reasoning --num-trials 4 --save-to airline_retry_agent_xai_grok3 --max-concurrency 5

./tau2-enhanced run --domain airline_enhanced --agent enhanced_agent --agent-llm xai/grok-3 --user-llm xai/grok-4-fast-reasoning --num-trials 4 --save-to airline_enhanced_agent_xai_grok3 --max-concurrency 5

# Create comprehensive Grok-3 baseline analysis
./tau2-enhanced run --domain airline_enhanced --agent llm_agent --agent-llm xai/grok-3 --user-llm gemini/gemini-2.5-flash --num-trials 4 --save-to baseline_airline_xai_grok3_gemini2_5_flash

# Analyze results (creates analysis folders in samples/analysis/)
python scripts/analyze_simple_logs.py enhanced_logs/airline_gemini2_5_flash_10tasks_llm_agent.json
python scripts/analyze_simple_logs.py enhanced_logs/baseline_airline_xai_grok3_gemini2_5_flash.json

This will create analysis folders matching the structure in samples/analysis/ with:

  • Enhanced analysis reports (HTML)
  • Tool performance reports
  • Statistical breakdowns and visualizations

Usage

Method 1: Enhanced Agents (Performance Optimization) ๐Ÿ†•

Use the enhanced agents to dramatically improve tau2-bench performance:

# Test individual enhanced agents
tau2 run --domain airline_enhanced --agent retry_agent --num-trials 5
tau2 run --domain airline_enhanced --agent context_agent --num-trials 5
tau2 run --domain airline_enhanced --agent enhanced_agent --num-trials 5

# Compare all variants (recommended for evaluation)
tau2 run --domain airline_enhanced --agent enhanced_agent,retry_agent,context_agent,llm_agent --num-trials 10

Agent Performance Expectations:

  • retry_agent: Reduces 87% action failures to ~45% through intelligent validation error recovery
  • context_agent: Eliminates 53% performance cliff at 3,000+ tokens through context optimization
  • enhanced_agent: Combined solution achieving 65-70% overall success rate improvement

Method 2: Using the tau2-enhanced wrapper (CLI)

Use the tau2-enhanced wrapper script which ensures enhanced domains are properly loaded:

./tau2-enhanced run --domain airline_enhanced --agent llm_agent --agent-llm gemini/gemini-2.5-flash --user-llm gemini/gemini-2.5-flash --task-ids 2 --num-trials 1

Method 3: Programmatic API (Recommended for detailed logging)

Use the Python API for full control over enhanced logging:

from tau2_enhanced import run_enhanced_simulation
from tau2.domains.airline.environment import get_tasks

# Load tasks
tasks = get_tasks()
test_task = next(task for task in tasks if task.id == "2")

# Run enhanced simulation with enhanced agent
results, (main_path, logs_path) = run_enhanced_simulation(
    domain="airline_enhanced",
    tasks=[test_task],
    agent="enhanced_agent",  # Use enhanced agent for maximum performance
    user="user_simulator",
    llm_agent="gemini/gemini-2.5-flash",
    llm_user="gemini/gemini-2.5-flash",
    num_trials=1,
    save_dir="./enhanced_logs"
)

print(f"Results saved to: {main_path}")
print(f"Enhanced logs saved to: {logs_path}")

Method 4: Direct Analysis of Existing Logs

Analyze previously captured enhanced logs:

from tau2_enhanced.analysis import LogAnalyzer

# Load and analyze enhanced logs
analyzer = LogAnalyzer("path/to/enhanced_logs.json")

# Get comprehensive analysis
analysis = analyzer.get_comprehensive_analysis()
print(f"Total tool calls: {analysis['basic_stats']['total_tool_calls']}")
print(f"Success rate: {analysis['basic_stats']['success_rate']:.2%}")

# Argument analysis
arg_analysis = analyzer.get_argument_analysis()
print(f"Most complex arguments: {arg_analysis['complexity_distribution']}")

# Performance insights
perf_analysis = analyzer.get_performance_analysis()
print(f"Slowest tools: {perf_analysis['slowest_tools']}")

Output Files

The enhanced system generates two types of files:

  • Standard Results: enhanced_results_TIMESTAMP.json - Compatible with tau2-bench format
  • Enhanced Logs: enhanced_results_TIMESTAMP_enhanced_logs.json - Detailed execution logs and state snapshots

Available Enhanced Agents ๐Ÿ†•

tau2-enhanced provides three specialized agents to address critical tau2-bench performance issues:

RetryManagedLLMAgent (retry_agent)

  • Purpose: Intelligent validation error recovery with 3-attempt retry logic
  • Addresses: 87% action execution failure rate from parameter validation errors
  • Features:
    • Smart error classification and recovery strategies
    • Exponential backoff retry mechanism
    • Comprehensive retry statistics and success tracking
  • Expected Impact: Reduce action failures from 87% to ~45%

ContextManagedLLMAgent (context_agent)

  • Purpose: Context length optimization through sliding window and token reduction
  • Addresses: 53% performance cliff at 3,000+ tokens due to context pressure
  • Features:
    • Multi-strategy context reduction (sliding window, compression, summarization)
    • Token usage monitoring with warning/critical thresholds
    • Information preservation scoring to maintain conversation quality
  • Expected Impact: Eliminate performance cliff, maintain 70%+ success rate

EnhancedLLMAgent (enhanced_agent) - Recommended

  • Purpose: Combined retry logic + context management for maximum performance
  • Addresses: Both validation errors AND context pressure simultaneously
  • Features:
    • Coordinates retry and context reduction optimally
    • Combined performance metrics and efficiency scoring
    • Production-ready solution with minimal overhead
  • Expected Impact: 65-70% overall success rate improvement

Quick Start with Enhanced Agents

# Import and use enhanced agents directly
from tau2_enhanced import RetryManagedLLMAgent, ContextManagedLLMAgent, EnhancedLLMAgent

# Or get usage examples and performance info
from tau2_enhanced import get_usage_examples, get_performance_expectations, print_enhanced_agent_summary

# Print comprehensive agent information
print_enhanced_agent_summary()

Available Enhanced Domains

All tau2-bench domains are automatically enhanced upon import:

  • airline_enhanced: Enhanced airline booking and management domain
  • retail_enhanced: Enhanced retail customer service domain
  • telecom_enhanced: Enhanced telecommunications support domain
  • mock_enhanced: Enhanced mock/testing domain

Enhanced Logging Details

ToolExecutionEvent Structure

Each tool call is comprehensively tracked with:

{
    # Core execution data
    "tool_name": "search_flights",
    "tool_call_id": "search_flights_1",
    "execution_time": 0.245,
    "success": true,
    "requestor": "assistant",
    "state_changed": true,

    # Enhanced argument tracking
    "args_count": 3,
    "args_size_bytes": 156,
    "args_complexity_score": 0.3,
    "args_types": {"departure": "str", "destination": "str", "date": "str"},
    "has_file_args": false,
    "has_large_args": false,
    "sensitive_args_detected": false,

    # Function signature analysis
    "required_args_provided": ["departure", "destination"],
    "optional_args_provided": ["date"],
    "missing_args": [],
    "unexpected_args": [],

    # Result analysis
    "result_size": 1024,
    "result_type": "dict",
    "result_complexity_score": 0.7,
    "result_contains_errors": false,
    "result_truncated": false
}

StateChangeEvent Structure

Environment state transitions are captured with:

{
    "state_type": "database",
    "action_trigger": "book_flight",
    "pre_state_hash": "abc123...",
    "post_state_hash": "def456...",
    "state_diff": "Added 1 booking record, updated customer balance",
    "change_summary": "Flight booking completed successfully"
}

State Snapshots

Detailed environment snapshots captured at key moments:

  • Before/after each tool call
  • On state changes
  • At simulation start/end
  • On errors or exceptions

Analysis Capabilities

Advanced Visualization & Reporting ๐Ÿ†•

tau2-enhanced includes a comprehensive suite of 9 interactive visualizations and reports:

Interactive Plotly Visualizations

  1. Summary Dashboard - Gauges and bars showing task/tool success rates and execution times
  2. Failure Analysis Plot - Multi-panel dashboard with failure counts, rates, and error types
  3. State Change Plot - Analysis of state-changing vs read-only tools with performance metrics
  4. Tool Flow Sankey - Visual diagram showing tool execution sequences and transitions
  5. Performance Bottleneck Plot - Box plots identifying slow tools and execution outliers

Comprehensive HTML Reports

  1. Tool Report (tool_report.html) - Detailed tool-level analysis with interactive tables
  2. Enhanced Analysis Report (enhanced_analysis_report.html) - All 5 plots embedded in one report
  3. Simulation Report (simulation_report.html) - Simulation-level details featuring:
    • Action Check Failures Summary Table - Comprehensive view of all validation failures with similarity scores (collapsible)
    • Error Pattern Clustering - Groups common error patterns with occurrence counts and affected simulations (collapsible)
    • Per-simulation expandable sections with message logs and detailed action check analysis
    • Smart filters: Show Failed Only, Action Failed Only, Called w/ Diff Only
  4. Markdown Report (analysis_report.md) - Text-based summary for documentation

Quick Analysis

# Analyze any enhanced log file
python scripts/analyze_simple_logs.py enhanced_logs/your_log_file.json

# Outputs generated automatically:
# - analysis_results/your_log_file/tool_report.html
# - analysis_results/your_log_file/enhanced_analysis_report.html
# - analysis_results/your_log_file/simulation_report.html (NEW: with action failures & error clustering!)
# - analysis_results/your_log_file/analysis_report.md

Statistical Analysis Methods

  1. Basic Statistics: Success rates, timing distributions, error counts
  2. Confidence Intervals: Statistical significance of performance metrics
  3. Correlation Analysis: Relationships between arguments and outcomes
  4. Distribution Analysis: Tool usage patterns and frequency distributions
  5. Gini Coefficient: Inequality in tool usage and performance
  6. Shannon Diversity: Diversity of tool calls and argument patterns
  7. Temporal Analysis: Performance trends over time
  8. Error Pattern Analysis: Error categorization and root cause analysis

Performance Analytics

# Get performance bottlenecks
bottlenecks = analyzer.get_performance_bottlenecks()
print(f"Slowest operations: {bottlenecks['slowest_operations']}")

# Analyze tool efficiency
efficiency = analyzer.get_tool_efficiency_metrics()
print(f"Most efficient tools: {efficiency['top_performers']}")

# Get argument correlations
correlations = analyzer.get_argument_correlation_analysis()
print(f"Performance predictors: {correlations['performance_predictors']}")

Argument Intelligence

The system provides deep insights into function call patterns:

  • Complexity Scoring: 0-1 scale based on argument count, types, and sizes
  • Security Detection: Automatic identification of sensitive parameters
  • Usage Patterns: Required vs optional parameter utilization
  • Performance Correlations: How argument complexity affects execution time
  • Type Analysis: Distribution of argument types across tool calls

Architecture

Core Components

  • tau2_enhanced/logging/: Structured event logging system

    • execution_logger.py: Main event logger with JSONL persistence
    • state_tracker.py: Environment state monitoring and diff tracking
    • events.py: Event type definitions and serialization
  • tau2_enhanced/analysis/: Analytics and intelligence system

    • analyzer.py: Comprehensive log analysis with 15+ methods
    • visualizer.py: Data visualization and reporting
  • tau2_enhanced/environments/: Environment enhancement system

    • logging_environment.py: LoggingEnvironment wrapper for tau2 environments
    • domain_environments.py: Automatic domain registration system
  • tau2_enhanced/domain_registration.py: Core domain discovery and registration

Integration Pattern

The system uses monkey patching to intercept environment creation without modifying tau2-bench:

  1. Automatic Discovery: Scans for available tau2 domains
  2. Environment Wrapping: Original environments wrapped with LoggingEnvironment
  3. Log Capture: Tool calls intercepted and logged with detailed metrics
  4. Result Enhancement: Standard tau2 results extended with captured logs
  5. Non-Invasive: Zero modification to existing tau2-bench codebase

Domain Registration System

from tau2_enhanced.domain_registration import EnhancedDomainRegistry

# Automatic registration of all available domains
registry = EnhancedDomainRegistry()
registered = registry.register_all_available_domains()
print(f"Registered domains: {list(registered.keys())}")

# Manual domain registration
registry.register_enhanced_domain(
    domain_name="custom",
    original_get_environment=custom_get_environment,
    original_get_tasks=custom_get_tasks
)

Development

Adding New Enhanced Domains

  1. Implement domain functions in domain_environments.py:
def get_custom_enhanced_environment(*args, **kwargs):
    original_env = get_custom_environment(*args, **kwargs)
    return LoggingEnvironment(original_env, log_file="custom.jsonl")
  1. Register the domain:
registry.register_enhanced_domain(
    domain_name="custom",
    original_get_environment=get_custom_environment,
    original_get_tasks=get_custom_tasks
)
  1. Test integration:
./tau2-enhanced run --domain custom_enhanced --num-trials 1

Custom Event Types

Extend the logging system with custom events:

from tau2_enhanced.logging.events import ExecutionEvent

@dataclass
class CustomAnalysisEvent(ExecutionEvent):
    analysis_type: str = ""
    metrics: Dict[str, Any] = field(default_factory=dict)

    def __post_init__(self):
        super().__post_init__()
        self.source = f"analysis:{self.analysis_type}"

Enhanced Agent Configuration

Agent Configuration and Customization

from tau2_enhanced import EnhancedLLMAgent

# Create and configure enhanced agent
agent = EnhancedLLMAgent()

# Configure all parameters at once
agent.configure_enhanced_agent(
    context_limit=8000,           # Token limit before reduction
    warning_threshold=0.8,        # Start reduction at 80% usage
    critical_threshold=0.95,      # Emergency reduction at 95%
    max_retries=3,               # Maximum retry attempts
    retry_delay_base=0.5         # Base delay between retries
)

# Get comprehensive performance statistics
stats = agent.get_enhanced_statistics()
print(f"Enhancement usage rate: {stats['enhanced_agent_metrics']['enhancement_usage_rate']:.2%}")
print(f"Context reductions: {stats['context_management']['total_reductions']}")
print(f"Successful retries: {stats['retry_mechanism']['successful_sequences']}")

Agent Performance Monitoring

# Monitor agent performance in real-time
def monitor_enhanced_agent_performance(agent):
    stats = agent.get_enhanced_statistics()

    # Context management metrics
    context_stats = stats['context_management']
    print(f"Token savings: {context_stats['total_tokens_saved']}")
    print(f"Information preservation: {context_stats['average_information_preservation']:.1%}")

    # Retry mechanism metrics
    retry_stats = stats['retry_mechanism']
    print(f"Retry success rate: {retry_stats['success_rate']:.1%}")
    print(f"Average attempts to success: {retry_stats['average_attempts']:.1f}")

    # Overall efficiency
    efficiency = stats['performance_analysis']['enhancement_efficiency_score']
    print(f"Enhancement efficiency score: {efficiency:.2f}/1.0")

API Reference

Enhanced Agents API

# Import enhanced agents
from tau2_enhanced import RetryManagedLLMAgent, ContextManagedLLMAgent, EnhancedLLMAgent

# RetryManagedLLMAgent specific methods
retry_agent = RetryManagedLLMAgent()
retry_stats = retry_agent.get_retry_statistics()

# ContextManagedLLMAgent specific methods
context_agent = ContextManagedLLMAgent()
context_agent.set_context_limit(8000)
context_agent.set_reduction_thresholds(warning=0.7, critical=0.9)
context_stats = context_agent.get_context_statistics()

# EnhancedLLMAgent combined methods
enhanced_agent = EnhancedLLMAgent()
enhanced_stats = enhanced_agent.get_enhanced_statistics()
enhanced_agent.reset_enhanced_metrics()  # Reset all metrics

ExecutionLogger

from tau2_enhanced.logging import ExecutionLogger

logger = ExecutionLogger(
    log_file="execution.jsonl",
    auto_flush=True,
    console_output=False
)

# Log tool execution
logger.log_tool_execution(
    tool_name="search_flights",
    success=True,
    execution_time=0.5,
    tool_args={"departure": "NYC", "destination": "LAX"},
    result=flight_results
)

# Get statistics
stats = logger.get_statistics()
print(f"Total events: {stats['total_events']}")

StateTracker

from tau2_enhanced.logging import StateTracker

tracker = StateTracker(
    max_snapshots=1000,
    auto_snapshot=True,
    track_state_hash=True
)

# Create state snapshot
tracker.create_snapshot(
    state_data={"bookings": 5, "customers": 100},
    action_trigger="book_flight",
    metadata={"tool": "book_flight", "success": True}
)

LogAnalyzer

from tau2_enhanced.analysis import LogAnalyzer

analyzer = LogAnalyzer("enhanced_logs.json")

# Comprehensive analysis
analysis = analyzer.get_comprehensive_analysis()

# Specific analysis types
perf_analysis = analyzer.get_performance_analysis()
arg_analysis = analyzer.get_argument_analysis()
error_analysis = analyzer.get_error_analysis()
temporal_analysis = analyzer.get_temporal_analysis()

Integration

The package integrates seamlessly with the existing tau2-bench ecosystem:

  • Non-invasive: Does not modify the original tau2-bench codebase
  • Compatible: Uses the same CLI interface and parameters
  • Flexible: Enhanced domains can be used alongside standard tau2 domains
  • Extensible: Easy to add new enhanced domains and logging features

Requirements

  • Python >= 3.10
  • tau2-bench framework
  • All dependencies listed in pyproject.toml

Key Benefits

Performance Optimization ๐Ÿ†•

  1. Dramatic Success Rate Improvement: 65-70% overall improvement through enhanced agents
  2. Validation Error Recovery: Reduce 87% action failures to ~45% with retry logic
  3. Context Pressure Elimination: Remove 53% performance cliff at 3,000+ tokens
  4. Zero Tau2-Bench Modifications: External enhancement without core codebase changes

Advanced Analytics & Observability

  1. Deep Observability: Complete visibility into agent-environment interactions
  2. Performance Optimization: Identify bottlenecks and efficiency improvements
  3. Behavioral Analysis: Understand agent decision patterns and strategies
  4. Argument Intelligence: Comprehensive function call argument analysis
  5. Error Diagnostics: Smart error categorization and root cause analysis

Research & Production

  1. Research Insights: Statistical analysis for academic and industry research
  2. Production Monitoring: Real-time streaming logs for operational use
  3. Audit Trail: Complete, immutable record of all interactions

Performance Impact Summary

Issue Original Performance Enhanced Agent Solution Expected Improvement
Action Execution Failures 87% failure rate retry_agent with 3-attempt retry Reduce to ~45% failure rate
Context Length Pressure 53% drop at 3,000+ tokens context_agent with optimization Eliminate performance cliff
Combined Issues ~30% overall success enhanced_agent with both solutions 65-70% success rate improvement

Ready to transform your tau2-bench evaluations? Start with enhanced_agent for maximum performance gains!

About

Enhanced tau2-bench for deeper analysis

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors