Monitoring Setup Guide

This guide covers the monitoring and observability setup for the Scripture App, including synthetic monitoring with warm-up strategies for cold starts.

🔧 Components

1. Enhanced Health Check Endpoint

Location: backend/app/main.py
Endpoint: /health
Features:
- Database connection warm-up
- Volume count verification
- Warm-up status reporting
- Detailed health information

2. GitHub Actions Synthetic Monitoring

Location: .github/workflows/synthetic-monitoring.yml
Schedule: Every 15 minutes
Features:
- 5-minute warm-up period
- Core endpoint testing
- Performance metrics collection
- Response validation
- Manual trigger support

3. Local Monitoring Script

Location: scripts/monitor.py
Features:
- Local testing capabilities
- Configurable warm-up time
- Performance metrics
- JSON output support
- Command-line interface

🚀 Quick Start

Local Testing

Start the backend server:

cd backend
uv run uvicorn app.main:app --reload --host 0.0.0.0 --port 8000

Run the monitoring script:

# Basic monitoring with 5-minute warm-up
python scripts/monitor.py

# Test without warm-up
python scripts/monitor.py --warm-up false

# Custom warm-up time (2 minutes)
python scripts/monitor.py --wait 2

# Test against production URL
python scripts/monitor.py --url https://scriptures-fast-api.onrender.com

# Save results to file
python scripts/monitor.py --output results.json
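
The flags used above could be parsed along the following lines. This is only a sketch of one possible command-line interface, not the actual contents of scripts/monitor.py: the helper names `str_to_bool` and `build_parser` and all default values are assumptions inferred from the example commands.

```python
import argparse


def str_to_bool(value: str) -> bool:
    """Parse true/false style strings, as in `--warm-up false`."""
    lowered = value.strip().lower()
    if lowered in {"true", "1", "yes", "on"}:
        return True
    if lowered in {"false", "0", "no", "off"}:
        return False
    raise argparse.ArgumentTypeError(f"expected true or false, got {value!r}")


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the four flags shown in the examples."""
    parser = argparse.ArgumentParser(description="Synthetic monitoring (sketch)")
    # All defaults are assumptions; the real script may differ.
    parser.add_argument("--url", default="http://localhost:8000",
                        help="base URL of the API under test")
    parser.add_argument("--warm-up", dest="warm_up", type=str_to_bool, default=True,
                        help="whether to wait for the service to warm up")
    parser.add_argument("--wait", type=float, default=5.0,
                        help="warm-up time in minutes")
    parser.add_argument("--output", default=None,
                        help="optional path for JSON results")
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```

Taking `--warm-up` as a value rather than a bare switch matches the `--warm-up false` form used in the examples.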

GitHub Actions Setup

Update the API URL in .github/workflows/synthetic-monitoring.yml:

echo "API_URL=https://scriptures-fast-api.onrender.com" >> $GITHUB_ENV

Enable GitHub Actions in your repository settings
Monitor the workflow:
- Go to the Actions tab in GitHub
- Open the "Synthetic Monitoring" workflow
- The workflow runs automatically every 15 minutes

📊 Monitoring Features

Health Check Response

{
  "status": "healthy",
  "warmed_up": true,
  "database": "connected",
  "volumes_count": 5,
  "timestamp": "2025-01-05T00:00:00Z"
}
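
A monitoring check might validate this payload roughly as follows. This is a minimal sketch: the function name `health_problems` is hypothetical, and the rule that at least one volume must be present is an assumption rather than something the endpoint guarantees.

```python
import json

REQUIRED_FIELDS = ("status", "warmed_up", "database", "volumes_count", "timestamp")


def health_problems(payload: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means healthy."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in payload]
    if payload.get("status") != "healthy":
        problems.append(f"status is {payload.get('status')!r}")
    if payload.get("warmed_up") is not True:
        problems.append("service is not warmed up")
    if payload.get("database") != "connected":
        problems.append(f"database is {payload.get('database')!r}")
    count = payload.get("volumes_count")
    # Assumption: a healthy database exposes at least one volume.
    if not isinstance(count, int) or count < 1:
        problems.append("volumes_count is not a positive integer")
    return problems


sample = json.loads("""
{
  "status": "healthy",
  "warmed_up": true,
  "database": "connected",
  "volumes_count": 5,
  "timestamp": "2025-01-05T00:00:00Z"
}
""")
print(health_problems(sample))  # prints []
```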

Performance Metrics

Response time tracking
Database connection status
Volume count verification
Error rate monitoring

Alerting

Health check failures
Slow response times (over 5 s for the health check, over 10 s for the random endpoint)
Database connection issues
Endpoint availability
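
The alert conditions above can be expressed as a small rule table. The sketch below assumes each check is identified by a short name ("health", "random") and reports whether the request succeeded and how long it took; the names `SLOW_THRESHOLDS_S` and `alerts_for` are hypothetical.

```python
# Slow-response thresholds in seconds, taken from the list above.
SLOW_THRESHOLDS_S = {"health": 5.0, "random": 10.0}


def alerts_for(check: str, ok: bool, elapsed_s: float) -> list[str]:
    """Return the alerts raised by a single check result."""
    alerts = []
    if not ok:
        alerts.append(f"{check}: request failed")
    limit = SLOW_THRESHOLDS_S.get(check)
    # Checks without a configured threshold are never flagged as slow.
    if limit is not None and elapsed_s > limit:
        alerts.append(f"{check}: slow response ({elapsed_s:.1f}s > {limit:.0f}s)")
    return alerts


print(alerts_for("health", True, 6.0))
```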

🔄 Warm-up Strategy

Why Warm-up?

Render's free tier is subject to cold starts
Services scale down after a period of inactivity
The first request after a scale-down can take 30+ seconds
Subsequent requests are fast

Warm-up Process

Initial Request: Triggers cold start
5-minute Wait: Allows full warm-up
Testing: All endpoints tested
Validation: Response validation
Metrics: Performance data collection
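
The steps above can be sketched in Python using only the standard library. This is an illustration under assumptions, not the workflow's actual implementation: `http_get` and `run_with_warm_up` are hypothetical names, and the `get` and `sleep` parameters exist so the sequence can be exercised without a network or a real five-minute wait.

```python
import time
import urllib.request
from typing import Callable


def http_get(url: str, timeout_s: float = 60.0) -> int:
    """Return the HTTP status code of a GET request."""
    with urllib.request.urlopen(url, timeout=timeout_s) as response:
        return response.status


def run_with_warm_up(
    base_url: str,
    paths: list[str],
    wait_s: float = 300.0,
    get: Callable[[str], int] = http_get,
    sleep: Callable[[float], None] = time.sleep,
) -> dict[str, dict]:
    """Trigger a cold start, wait, then test each path and record timings."""
    # 1. Initial request: it may be slow or fail while the service starts.
    try:
        get(f"{base_url}/health")
    except OSError:
        pass  # tolerated; the request only needs to wake the service
    # 2. Wait for the warm-up period (5 minutes by default).
    sleep(wait_s)
    # 3-5. Test each endpoint, validate the status code, collect timings.
    results = {}
    for path in paths:
        started = time.perf_counter()
        try:
            status = get(f"{base_url}{path}")
        except OSError:  # covers URLError, HTTPError, and timeouts
            status = None
        results[path] = {
            "status": status,
            "ok": status == 200,
            "seconds": time.perf_counter() - started,
        }
    return results
```

For a local run without the wait, something like `run_with_warm_up("http://localhost:8000", ["/health"], wait_s=0)` would exercise the same path.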

GitHub Actions Limitations

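Maximum job time: 6 hours per job on GitHub-hosted runners
Cron frequency: the shortest supported interval is 5 minutes
Resource usage: 2,000 minutes/month (free tier, private repositories; public repositories on standard runners are not metered)
Our setup: 15-minute intervals = 96 runs/day = 2,880 runs in a 30-day month. Each run includes the 5-minute warm-up, so it consumes at least 14,400 minutes/month, well above the 2,000-minute quota if the repository is private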
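
The budget arithmetic can be checked with a few lines of Python. Note that 96 runs/day over a 30-day month gives 2,880 runs, and the minutes consumed depend on how long each run lasts. The 6 minutes per run used below is an assumption (the 5-minute warm-up plus roughly a minute of tests), and `monthly_minutes` is a hypothetical helper.

```python
import math

RUNS_PER_DAY = 24 * 60 // 15   # 96 runs at 15-minute intervals
DAYS_PER_MONTH = 30            # assumption: 30-day month
FREE_TIER_MINUTES = 2000       # monthly quota quoted above


def monthly_minutes(runs_per_day: int, minutes_per_run: float,
                    days: int = DAYS_PER_MONTH) -> int:
    """Minutes consumed per month; each job is rounded up to a whole minute."""
    return runs_per_day * days * math.ceil(minutes_per_run)


used = monthly_minutes(RUNS_PER_DAY, 6)  # assumed 6 minutes per run
print(RUNS_PER_DAY * DAYS_PER_MONTH)     # 2880 runs per month
print(used)                              # 17280 minutes per month
print(used > FREE_TIER_MINUTES)          # True
```

Under these assumptions, staying inside a 2,000-minute quota would mean either a much shorter warm-up or far fewer runs per day.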

🛠 Customization

Modify Warm-up Time

# In .github/workflows/synthetic-monitoring.yml
sleep 300  # 5 minutes
# Change to: sleep 180  # 3 minutes

Add More Endpoints

# Add to the test section
- name: Test additional endpoints
  run: |
    curl --fail -s "$API_URL/api/scriptures/reference/John/3"

Custom Alerting

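# Add Slack/Discord notifications
- name: Notify on failure
  if: failure()
  env:
    # Assumes a repository secret named SLACK_WEBHOOK_URL
    SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}
  run: |
    curl -X POST -H 'Content-type: application/json' \
      --data '{"text":"Scripture App monitoring failed!"}' \
      "$SLACK_WEBHOOK_URL"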

📈 Monitoring Dashboard Ideas

Key Metrics to Track

Uptime: Service availability
Response Times: p50, p95, p99
Cold Start Frequency: How often services scale down
Error Rates: By endpoint and type
User Impact: Cold start vs warm performance
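
As an illustration of the response-time and cold-start metrics, the sketch below computes p50/p95/p99 with the nearest-rank method and the share of requests slower than 30 seconds. The sample data is made up, and the function names are hypothetical; a real dashboard would normally let the metrics backend do this.

```python
import math


def percentile(samples: list[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with pct% at or below it."""
    if not samples:
        raise ValueError("percentile() needs at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def cold_start_rate(samples: list[float], threshold_s: float = 30.0) -> float:
    """Fraction of requests slower than the cold-start threshold."""
    if not samples:
        raise ValueError("cold_start_rate() needs at least one sample")
    return sum(1 for s in samples if s > threshold_s) / len(samples)


# Made-up response times in seconds: nine warm requests and one cold start.
response_times_s = [0.20, 0.30, 0.25, 0.40, 31.0, 0.22, 0.28, 0.35, 0.30, 0.27]
summary = {f"p{p}": percentile(response_times_s, p) for p in (50, 95, 99)}
print(summary)
print(cold_start_rate(response_times_s))
```

With so few samples, p95 and p99 both land on the single cold start, which is why tail percentiles are worth tracking separately from the median.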

Grafana Dashboard Queries

-- Average response time by endpoint
SELECT endpoint, AVG(response_time)
FROM monitoring_metrics
GROUP BY endpoint

-- Cold start detection (responses slower than 30 seconds)
SELECT COUNT(*)
FROM monitoring_metrics
WHERE response_time > 30
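
The two queries can be tried locally against an in-memory SQLite table. This is only a sketch for experimenting with the SQL: the `monitoring_metrics` schema below (two columns, response time in seconds) and the sample rows are assumptions, and a production setup would query whichever data source Grafana is connected to.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed schema: one row per monitored request, response_time in seconds.
conn.execute("CREATE TABLE monitoring_metrics (endpoint TEXT, response_time REAL)")
conn.executemany(
    "INSERT INTO monitoring_metrics VALUES (?, ?)",
    [
        ("/health", 0.5),
        ("/health", 1.5),
        ("/health", 31.0),  # a cold start
        ("/api/scriptures/reference/John/3", 2.0),
    ],
)

# Average response time by endpoint
avg_by_endpoint = dict(
    conn.execute(
        "SELECT endpoint, AVG(response_time) FROM monitoring_metrics GROUP BY endpoint"
    )
)

# Cold start detection
cold_starts = conn.execute(
    "SELECT COUNT(*) FROM monitoring_metrics WHERE response_time > 30"
).fetchone()[0]

print(avg_by_endpoint)
print(cold_starts)
```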

🚨 Troubleshooting

Common Issues

Cold Start Timeouts:
- Increase timeout values in monitoring
- Extend the warm-up period
- Consider upgrading to a paid tier

GitHub Actions Failures:
- Check that the API URL is correct
- Verify endpoint availability
- Review the workflow logs

Local Script Issues:
- Install requests: pip install requests
- Check that the backend is running
- Verify that the URL is accessible

Debug Commands

# Test health endpoint manually
curl -v http://localhost:8000/health

# Check GitHub Actions logs
# Go to Actions tab in GitHub repository

# Run the monitoring script against the local backend
python scripts/monitor.py --url http://localhost:8000

📚 Next Steps

Phase 2: Advanced Monitoring

Add Prometheus metrics
Set up Grafana dashboards
Implement distributed tracing
Add log aggregation

Phase 3: SRE Practices

Define SLI/SLOs
Implement error budgets
Create incident response playbooks
Set up chaos engineering

Last Updated: January 2025
Maintained By: SRE Team
Review Schedule: Monthly
