This document describes the production-grade monitoring and logging system implemented for the LCopilot API.
## Table of Contents

- Overview
- Structured Logging
- Request Tracking
- Health Endpoints
- CloudWatch Integration
- Alerting & Notifications
- Setup Instructions
- Testing
- Troubleshooting
## Overview

The LCopilot API includes comprehensive monitoring with:
- Structured JSON logging with request correlation
- CloudWatch integration for production log aggregation
- Health endpoints for load balancer checks
- Automated alerting for error spikes and performance issues
- Request tracking with unique IDs across the entire request lifecycle
## Structured Logging

Key features:

- JSON format for easy parsing and filtering
- Request correlation with unique request IDs
- Multiple log levels: DEBUG, INFO, WARNING, ERROR, CRITICAL
- Contextual information: service, environment, hostname, version
- Performance metrics: request duration, database query times
- External service tracking: S3, Document AI, database operations
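As an illustration, the JSON shape these features produce can be approximated with a stdlib-only formatter. This is a sketch: the actual implementation uses `structlog`, and the field values below are illustrative.

```python
import io
import json
import logging
import socket
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.now(timezone.utc)
            .isoformat(timespec="milliseconds")
            .replace("+00:00", "Z"),
            "level": record.levelname,
            "service": "lcopilot-api",
            "hostname": socket.gethostname(),
            "message": record.getMessage(),
        }
        # Merge structured fields attached via extra={"extra_fields": {...}}.
        entry.update(getattr(record, "extra_fields", {}))
        return json.dumps(entry)


# Demo: capture one formatted record in memory.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())
demo_logger = logging.getLogger("json-demo")
demo_logger.setLevel(logging.INFO)
demo_logger.addHandler(handler)
demo_logger.info(
    "API request completed",
    extra={"extra_fields": {"http_status_code": 200, "request_duration_ms": 342.56}},
)
entry = json.loads(stream.getvalue())
```

Structured fields ride along in the `extra_fields` dictionary because the stdlib `extra=` mechanism attaches keys directly to the record; `structlog` makes this far more ergonomic, which is why the service uses it.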
Example log entry:

```json
{
  "timestamp": "2025-01-15T10:30:45.123Z",
  "level": "INFO",
  "service": "lcopilot-api",
  "environment": "production",
  "hostname": "api-server-01",
  "version": "2.0.0",
  "request_id": "550e8400-e29b-41d4-a716-446655440000",
  "message": "API request completed",
  "http_method": "POST",
  "http_path": "/api/sessions",
  "http_status_code": 200,
  "request_duration_ms": 342.56,
  "user_id": "user_123"
}
```

### Usage

```python
from app.utils.logger import get_logger, log_api_request

# Get a logger with request context
logger = get_logger("service_name", request_id="req_123")

# Log with structured data
logger.info(
    "Document processed successfully",
    document_id="doc_456",
    processing_time_ms=1250.5,
    document_type="letter_of_credit",
)

# Log API requests automatically
log_api_request(
    logger,
    method="POST",
    path="/api/documents",
    status_code=201,
    duration_ms=800.2,
)
```

## Request Tracking

Every API request gets a unique request ID that:
- Preserves client-provided `X-Request-ID` headers
- Generates a UUID4 when no ID is provided
- Injects into log context automatically
- Returns in response headers for client correlation
- Tracks request lifecycle from start to finish
Request flow:

1. Client Request → [X-Request-ID: abc123] (optional)
2. Middleware → Generate/preserve request ID
3. Log Context → Inject ID into all logs
4. Response → Return ID in X-Request-ID header
5. Error Handling → Include ID in error responses
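The generate-or-preserve step in the flow above can be sketched as a small pure function. This is illustrative only; the actual middleware lives in the application code, and `resolve_request_id` is a hypothetical name.

```python
import uuid
from typing import Optional


def resolve_request_id(incoming: Optional[str]) -> str:
    """Return the client-provided X-Request-ID if present, else a new UUID4."""
    return incoming if incoming else str(uuid.uuid4())


preserved = resolve_request_id("abc123")  # client-supplied ID is kept
generated = resolve_request_id(None)      # a UUID4 is minted otherwise
```

The middleware would call this once per request, inject the result into the log context, and echo it back in the `X-Request-ID` response header.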
## Health Endpoints

### `GET /health/live`

- **Purpose:** Indicates whether the service is alive
- **Used by:** Kubernetes, load balancers

Response:
```json
{
  "status": "ok",
  "timestamp": "2025-01-15T10:30:45.123Z",
  "version": "2.0.0",
  "environment": "production",
  "uptime_seconds": 3600
}
```

### `GET /health/ready`

- **Purpose:** Indicates whether the service is ready to handle requests
- **Used by:** Load balancers for traffic routing

Checks:
- ✅ Database connectivity (PostgreSQL)
- ✅ S3 bucket accessibility
- ✅ Document AI availability (unless using stubs)
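A readiness handler that aggregates checks like these might look like the following sketch. The check names mirror the response shape shown below; the functions themselves are hypothetical stand-ins for the real PostgreSQL, S3, and Document AI probes.

```python
import time
from typing import Callable, Dict


def run_check(check: Callable[[], str]) -> Dict:
    """Time one dependency check and normalise its result."""
    start = time.monotonic()
    try:
        message = check()
        status = "ok"
    except Exception as exc:  # a failing dependency must not crash the endpoint
        message, status = str(exc), "error"
    return {
        "status": status,
        "response_time_ms": round((time.monotonic() - start) * 1000, 2),
        "message": message,
    }


def readiness(checks: Dict[str, Callable[[], str]]) -> Dict:
    """Run every check and compute overall health."""
    results = {name: run_check(check) for name, check in checks.items()}
    healthy = all(r["status"] == "ok" for r in results.values())
    return {
        "status": "ok" if healthy else "error",
        "overall_healthy": healthy,
        "checks": results,
    }


def failing_s3_check() -> str:
    raise RuntimeError("bucket unreachable")


# Demo with one passing and one failing check.
report = readiness({
    "database": lambda: "Database connection successful",
    "s3": failing_s3_check,
})
```

Catching exceptions per check means one unreachable dependency degrades the report to 503 rather than crashing the endpoint itself.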
Response:

```json
{
  "status": "ok",
  "timestamp": "2025-01-15T10:30:45.123Z",
  "overall_healthy": true,
  "checks": {
    "database": {
      "status": "ok",
      "response_time_ms": 23.45,
      "message": "Database connection successful"
    },
    "s3": {
      "status": "ok",
      "response_time_ms": 156.78,
      "message": "S3 bucket 'lcopilot-docs-prod' accessible",
      "bucket": "lcopilot-docs-prod"
    },
    "document_ai": {
      "status": "ok",
      "response_time_ms": 489.12,
      "message": "Document AI accessible (3 processors found)",
      "project": "lcopilot-docai",
      "location": "eu"
    }
  }
}
```

### `GET /health/info`

- **Purpose:** Detailed service information (non-sensitive)
```json
{
  "service": "lcopilot-api",
  "version": "2.0.0",
  "environment": "production",
  "uptime_seconds": 7200,
  "configuration": {
    "use_stubs": false,
    "debug_mode": false,
    "aws_region": "eu-north-1",
    "database_configured": true,
    "s3_bucket": "lcopilot-docs-prod"
  },
  "runtime": {
    "python_version": "3.9.6",
    "process_id": 1234
  }
}
```

## CloudWatch Integration

### Log Configuration

- **Log Group:** `lcopilot-backend`
- **Log Stream:** `{hostname}-{environment}`
- **Retention:** 30 days
- **Format:** JSON with structured fields
### Metric Filters

| Filter Name | Pattern | Metric | Description |
|---|---|---|---|
| `LCopilotErrorFilter` | `{ $.level = "ERROR" }` | `LCopilotErrorCount` | Count of ERROR-level logs |
| `LCopilotCriticalFilter` | `{ $.level = "CRITICAL" }` | `LCopilotCriticalErrorCount` | Count of CRITICAL-level logs |
| `LCopilot5xxErrorFilter` | `{ $.http_status_code >= 500 }` | `LCopilot5xxErrorCount` | Count of 5xx HTTP errors |
| `LCopilotSlowRequestFilter` | `{ $.request_duration_ms > 5000 }` | `LCopilotSlowRequestCount` | Count of slow requests (>5 s) |
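CloudWatch applies these patterns server-side; the following sketch only illustrates what `{ $.level = "ERROR" }` selects from the structured JSON lines this service emits (the function name is hypothetical).

```python
import json


def matches_error_filter(log_line: str) -> bool:
    """Return True when a JSON log line would match { $.level = "ERROR" }."""
    try:
        entry = json.loads(log_line)
    except json.JSONDecodeError:
        return False  # non-JSON lines never match a JSON-path pattern
    return entry.get("level") == "ERROR"
```

This is why structured JSON logging matters: plain-text log lines cannot be selected by `$.field` patterns at all.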
### Environment Variables

```bash
# Required for CloudWatch logging
ENVIRONMENT=production
AWS_REGION=eu-north-1
AWS_ACCESS_KEY_ID=your-access-key
AWS_SECRET_ACCESS_KEY=your-secret-key
```

## Alerting & Notifications

### CloudWatch Alarms

| Alarm Name | Condition | Threshold | Action |
|---|---|---|---|
| `LCopilot-HighErrorRate` | Error count in 1 min | ≥ 5 errors | SNS notification |
| `LCopilot-CriticalErrors` | Critical errors in 1 min | ≥ 1 error | SNS notification |
| `LCopilot-High5xxErrors` | 5xx errors in 5 min | ≥ 10 errors | SNS notification |
| `LCopilot-SlowRequests` | Slow requests in 5 min | ≥ 5 requests | SNS notification |
### SNS Topic

- **Topic Name:** `lcopilot-alerts`
- **Purpose:** Send notifications when alarms trigger
- **Supports:** Email, SMS, Slack webhooks
## Setup Instructions

### 1. Install Dependencies

```bash
cd apps/api
pip install -r requirements.txt
```

Required packages:

- `structlog==24.1.0`: structured logging
- `watchtower==3.0.1`: CloudWatch log handler
- `boto3-stubs[essential]==1.34.0`: AWS SDK type hints
### 2. Configure the Environment

Create `.env.production`:

```bash
# Application
ENVIRONMENT=production
DEBUG=false

# Database
DATABASE_URL=postgresql://user:pass@host:5432/lcopilot

# AWS CloudWatch
AWS_REGION=eu-north-1
AWS_ACCESS_KEY_ID=your-access-key
AWS_SECRET_ACCESS_KEY=your-secret-key

# Service Configuration
USE_STUBS=false
SECRET_KEY=your-production-secret-key
```

### 3. Run the CloudWatch Setup

```bash
cd apps/api
python setup_cloudwatch.py
```

This creates:
- Log group with 30-day retention
- Metric filters for error tracking
- CloudWatch alarms
- SNS topic for notifications
### 4. Subscribe to Alerts

```bash
aws sns subscribe \
  --topic-arn arn:aws:sns:eu-north-1:123456789012:lcopilot-alerts \
  --protocol email \
  --notification-endpoint your-email@domain.com
```

### 5. Start the Application

```bash
python main.py
```

The application will:
- Initialize structured logging
- Configure CloudWatch handlers
- Register health endpoints
- Start request tracking
## Testing

### Run the Test Suite

```bash
python test_monitoring.py
```

This tests:
- ✅ Logging configuration
- ✅ Environment variables
- ✅ Health endpoints
- ✅ Request tracking
- ✅ Error handling
- ✅ CloudWatch dependencies
### Test Health Endpoints

```bash
# Liveness check
curl http://localhost:8000/health/live

# Readiness check
curl http://localhost:8000/health/ready

# Service info
curl http://localhost:8000/health/info
```

### Test Request Tracking

```bash
# Send a request with a custom ID
curl -H "X-Request-ID: test-123" http://localhost:8000/

# Check response headers for the request ID
curl -I http://localhost:8000/
```

### Test Error Handling

```bash
# Trigger a 404 error
curl http://localhost:8000/nonexistent

# Trigger a 500 error (if the endpoint exists)
curl -X POST http://localhost:8000/simulate-error
```

### Test CloudWatch Alarms

To test the CloudWatch alarms:
1. Generate an error spike:

   ```bash
   for i in {1..6}; do
     curl http://localhost:8000/nonexistent
     sleep 1
   done
   ```

2. Check CloudWatch:
   - Go to the AWS CloudWatch Console
   - Open the "Alarms" section
   - Look for the `LCopilot-HighErrorRate` alarm

3. Verify the SNS notification:
   - Check your email for the alarm notification
   - It should arrive within 1-2 minutes
## Troubleshooting

### Logs Not Appearing in CloudWatch

**Symptoms:** Logs are not visible in the CloudWatch console

**Solutions:**

```bash
# Check AWS credentials
aws sts get-caller-identity

# Verify the log group exists
aws logs describe-log-groups --log-group-name-prefix lcopilot-backend

# Check IAM permissions for CloudWatch Logs
# Required: logs:CreateLogGroup, logs:CreateLogStream, logs:PutLogEvents
```

### Health Checks Failing

**Symptoms:** `/health/ready` returns 503

**Solutions:**
```bash
# Check the database connection
psql $DATABASE_URL -c "SELECT 1"

# Verify S3 bucket access
aws s3 ls s3://your-bucket-name

# Test Document AI
python -c "
from google.cloud import documentai
client = documentai.DocumentProcessorServiceClient()
print('Document AI client created successfully')
"
```

### Request IDs Missing

**Symptoms:** Missing `X-Request-ID` headers

**Solutions:**

- Ensure `RequestIDMiddleware` is registered first
- Check the middleware order in `main.py`
- Verify imports are correct
### Logs Not in JSON Format

**Symptoms:** Logs appear as plain text

**Solutions:**

- Check that `ENVIRONMENT=production` is set
- Verify that `structlog` is installed correctly
- Check the logger configuration in `logger.py`
### Debug Commands

```bash
# Check the log format
tail -f /var/log/lcopilot.log | head -1 | python -m json.tool

# Test the logger directly
python -c "
from app.utils.logger import get_logger
logger = get_logger('test')
logger.info('Test message', field='value')
"

# Check the health endpoint locally
curl -v http://localhost:8000/health/ready | python -m json.tool

# Verify the CloudWatch setup
python setup_cloudwatch.py
```

## Performance Considerations

- **Log Level:** Use INFO and above in production (avoid DEBUG)
- **Batch Size:** CloudWatch batches up to 100 log entries per request
- **Rate Limits:** CloudWatch has API rate limits (roughly 5 requests per second)
- **Costs:** CloudWatch Logs is charged per GB ingested and stored
## Best Practices

### Log Levels

- **DEBUG:** Detailed information for diagnosing problems
- **INFO:** General information about application flow
- **WARNING:** Something unexpected happened, but the application continues
- **ERROR:** A serious problem occurred and a function could not complete
- **CRITICAL:** A very serious error; the application may be unable to continue
### What to Log

✅ Do log:
- API requests/responses with duration
- Database operations with timing
- External service calls (S3, Document AI)
- Authentication events
- Business logic errors
- Performance metrics
❌ Don't Log:
- Passwords or secret keys
- Credit card numbers
- Personal identification numbers
- Full request/response bodies (unless sanitized)
- High-frequency debug information in production
### Alert Thresholds

- **Error Rate:** > 5 errors per minute
- **Critical Errors:** ≥ 1 critical error per minute
- **Response Time:** > 5 seconds at the 95th percentile
- **Health Check:** 3 consecutive failures
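The per-minute error-rate rule can be pictured as a sliding window over error timestamps. CloudWatch evaluates this server-side from the metric filters; the class below is only an illustration of the condition, with hypothetical names.

```python
from collections import deque


class ErrorRateWindow:
    """Illustrates the '>= 5 errors within 60 seconds' alarm condition."""

    def __init__(self, window_seconds: float = 60.0, threshold: int = 5):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self._events = deque()

    def record_error(self, timestamp: float) -> None:
        self._events.append(timestamp)

    def should_alarm(self, now: float) -> bool:
        # Drop events that fell out of the evaluation window.
        while self._events and self._events[0] <= now - self.window_seconds:
            self._events.popleft()
        return len(self._events) >= self.threshold


window = ErrorRateWindow()
for t in range(5):  # five errors within one minute
    window.record_error(float(t))
alarm_now = window.should_alarm(10.0)     # errors still inside the window
alarm_later = window.should_alarm(120.0)  # the window has emptied
```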
## Resources

- AWS CloudWatch Logs Documentation
- Structlog Documentation
- FastAPI Middleware Guide
- Kubernetes Health Checks
## 🆘 Need Help?

If you encounter issues with the monitoring setup:

1. Check this troubleshooting guide first
2. Run the test suite: `python test_monitoring.py`
3. Verify AWS credentials and permissions
4. Check the application logs for error messages
5. Consult the CloudWatch console for alarm states

⚡ **Quick health check:** `curl http://localhost:8000/health/live`