Version: 1.0
Date: September 28, 2025
Purpose: Step-by-step procedure to execute and evaluate email management agent experiments
This SOP covers the complete process from running the email management agent through Braintrust logging to creating evaluations from the results.
- Braintrust account with API key configured
- OpenAI API key configured in Braintrust
- Python environment with requirements installed
- Project directory: `/Users/fabswill/ReposClaudeCode/braintrustdevdeepdive`
```bash
# Verify these are set
echo $BRAINTRUST_API_KEY
echo $OPENAI_API_KEY

# If not set, export them:
export BRAINTRUST_API_KEY="your_key_here"
export OPENAI_API_KEY="your_key_here"  # (or configure in Braintrust UI)
```

```bash
cd /Users/fabswill/ReposClaudeCode/braintrustdevdeepdive
python demo_email_agent.py
```

Expected Output:
- Console output showing each demo section
- Logs appearing in Braintrust UI under your project
- No evaluation scores (since no evals defined yet)
- Open Braintrust UI: https://www.braintrust.dev/
- Navigate to your project: `Fabs27Sep25DeepDive` (or current project name)
- Go to Logs section
- Verify you see traces for:
  - `process_email_request` (root spans)
  - Individual step spans: `decision_step`, `tool_execution_step`, etc.
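These span names come from instrumentation inside the agent itself. As a reference point, here is a minimal sketch of how such spans are typically produced with Braintrust's `traced` decorator — the function bodies below are placeholder stand-ins, not the real agent logic:

```python
# Sketch: how named spans like decision_step end up in the Braintrust UI.
# The decision/tool logic below is a placeholder, not the real agent code.
try:
    from braintrust import traced
except ImportError:
    # No-op fallback so this sketch runs even without braintrust installed
    def traced(fn=None, **span_attrs):
        if fn is None:
            return lambda f: f
        return fn


@traced  # root span: process_email_request
def process_email_request(query: str) -> str:
    action = decision_step(query)
    return tool_execution_step(action)


@traced  # child span: decision_step
def decision_step(query: str) -> str:
    # Placeholder: the real agent asks an LLM to pick an action
    return "get_system_status" if "status" in query.lower() else "search_emails"


@traced  # child span: tool_execution_step
def tool_execution_step(action: str) -> str:
    return f"executed {action}"


print(process_email_request("Give me a quick system status"))
```

When a Braintrust logger is initialized (as the demo script presumably does), each decorated call shows up as a span nested under its caller, producing the trace tree described above.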
```bash
# If you want to export the raw logs
braintrust export --project "Fabs27Sep25DeepDive" --output logs_export.json
```

In Braintrust UI:
- Click on individual traces to see:
  - Input queries processed
  - Agent decisions made
  - Tool operations executed
  - Final responses generated
- Identify successful patterns and edge cases
- Note any unexpected behaviors or errors
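One way to systematize this step is to mine the exported logs for the queries the agent actually received. A rough sketch, assuming the export is a JSON array of records that each carry an `input` field — the actual field names depend on the real export format:

```python
import json
from collections import Counter


def distinct_inputs(path: str, top_n: int = 10):
    """Return the most common input queries in an exported log file.

    Assumes a JSON array of records, each with an "input" field; the
    actual field name in a Braintrust export may differ.
    """
    with open(path) as f:
        records = json.load(f)
    counts = Counter(r["input"] for r in records if "input" in r)
    return counts.most_common(top_n)


# distinct_inputs("logs_export.json") would list each query with its frequency,
# a convenient starting point for the scenario list below.
```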
Create a list of test scenarios based on what you observed:
Example scenarios to capture:
1. System Status Query: "Give me a quick system status"
2. Zero Inbox Request: "Get all my inboxes to zero and categorize everything"
3. Specific Search: "Find emails about Aidvantage student loans"
4. Urgent Attention: "Show me what needs immediate attention"
5. Account Summary: "Give me a summary of all my email accounts"

```bash
# Create new evaluation file
touch evals/eval_email_management_v2.py
```

Based on Braintrust documentation patterns:
```python
# Template structure for evals/eval_email_management_v2.py
from braintrust import Eval

from agents.email_management import manage_emails
from scoring.email_scoring import DualEmailScorer

# Define test cases based on logged interactions
eval_data = [
    {
        "input": "Give me a quick system status",
        "expected": {
            "expected_actions": ["get_system_status"],
            "required_accounts": [
                "adotob_primary",
                "gmail_fabsgwill",
                "gmail_jahmekyanbwoy",
                "hotmail_fabian_williams",
            ],
            "should_succeed": True,
            "scenario_type": "system_health",
        },
    },
    # Add more scenarios based on your logged interactions
]

Eval(
    "Email Management Agent v2",
    data=lambda: eval_data,
    task=manage_emails,          # Your agent function
    scores=[DualEmailScorer()],  # Your scoring function
)
```

Run the evaluation:

```bash
braintrust eval evals/eval_email_management_v2.py
```

- Review scores in Braintrust UI
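The template imports `DualEmailScorer` from `scoring/email_scoring.py`. If that scorer does not exist yet, note that Braintrust accepts any callable mapping `(input, output, expected)` to a score in `[0, 1]`. The sketch below is an illustrative stand-in — its matching logic is an assumption, not the project's actual scorer:

```python
class DualEmailScorer:
    """Illustrative stand-in for scoring/email_scoring.py.

    Scores two things: whether every expected action is mentioned in the
    agent output, and whether success/failure matched expectations.
    """

    def __call__(self, input, output, expected):
        text = str(output).lower()

        # Half the score: coverage of the expected actions
        actions = expected.get("expected_actions", [])
        hits = sum(1 for a in actions if a.lower() in text)
        action_score = hits / len(actions) if actions else 1.0

        # Other half: did the run succeed when it was supposed to?
        succeeded = "error" not in text
        success_score = 1.0 if succeeded == expected.get("should_succeed", True) else 0.0

        return 0.5 * action_score + 0.5 * success_score


scorer = DualEmailScorer()
print(scorer("q", "ran get_system_status on all accounts",
             {"expected_actions": ["get_system_status"], "should_succeed": True}))
```

Splitting the score into action coverage and success agreement keeps the two failure modes visible separately when you aggregate results in the UI.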
- Identify low-scoring scenarios
- Compare agent outputs with expected results
- Note areas for improvement
Based on results:
- Adjust agent prompts/logic
- Refine scoring criteria
- Add edge case handling
```bash
# Run evaluation again after changes
braintrust eval evals/eval_email_management_v2.py
```

Compare results to see improvements.
```bash
# Run agent demo (generates logs)
python demo_email_agent.py

# Run specific evaluation
braintrust eval evals/eval_email_management_v2.py

# Run all evaluations
python run_email_evals.py --run all

# Export data
braintrust export --project "ProjectName" --output filename.json

# Test setup only
python run_email_evals.py --test
```
```bash
# Check Python syntax
python -m py_compile agents/email_management.py

# Test environment
python -c "import braintrust; print('Braintrust OK')"
python -c "import openai; print('OpenAI OK')"

# Verbose logging
EVAL_VERBOSE_LOGGING=true python demo_email_agent.py
```

- ✅ Demo runs without errors
- ✅ Logs appear in Braintrust UI
- ✅ All agent steps traced properly
- ✅ Realistic email data processed
- ✅ Agent responses reviewed and documented
- ✅ Test scenarios identified from real interactions
- ✅ Edge cases and error conditions noted
- ✅ Evaluation file created with real scenarios
- ✅ Evaluation runs successfully
- ✅ Scores generated for all test cases
- ✅ Results provide actionable insights
- "Not initialized" errors: Check that `BRAINTRUST_API_KEY` is set
- Import errors: Verify that `pip install -r requirements.txt` completed
- No logs appearing: Check API key and network connectivity
- Evaluation failures: Verify evaluation file syntax with `python -m py_compile`
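Most of these issues can be caught before a run with a small preflight check. A sketch using only the standard library — the `preflight` helper here is a convenience for this SOP, not part of the repo:

```python
import importlib.util
import os


def preflight(env=None):
    """Return a list of setup problems; an empty list means good to go."""
    env = os.environ if env is None else env
    problems = []
    # Required environment variables
    for var in ("BRAINTRUST_API_KEY", "OPENAI_API_KEY"):
        if not env.get(var):
            problems.append(f"{var} is not set")
    # Required packages, checked without importing them
    for module in ("braintrust", "openai"):
        if importlib.util.find_spec(module) is None:
            problems.append(f"module '{module}' is not installed")
    return problems


if __name__ == "__main__":
    for problem in preflight():
        print("WARN:", problem)
```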
- Braintrust logs: UI → Project → Logs
- Python errors: Console output
- Evaluation results: UI → Project → Experiments
Current Status: Agent generates logs but no evaluations defined
Next Goal: Create evaluations from logged interactions
Key Insight: Use actual agent outputs as baseline for expected results
This SOP enables repeatable experimentation and continuous improvement of the email management agent through systematic evaluation.