Skip to content

Latest commit

 

History

History
716 lines (555 loc) · 31.7 KB

File metadata and controls

716 lines (555 loc) · 31.7 KB

SOAR Architecture

This document provides a comprehensive technical overview of the OpenSecOps Security Orchestration, Automation, and Response (SOAR) architecture, including detailed workflow analysis, data structures, error handling patterns, and testing strategies.

System Overview

OpenSecOps SOAR is a serverless AWS-native security automation platform that processes security findings, automates remediation, manages tickets for manual intervention, and generates AI-powered security reports. The system operates across an AWS Organization, providing centralized security management for all accounts.

Core Architecture Components

1. State Machine Workflows

The system is built around AWS Step Functions state machines that orchestrate complex security workflows:

SOARASFFProcessor (Main Orchestrator)

  • File: statemachines/asff_processor.asl.yaml
  • Purpose: Central orchestrator that processes all Security Hub findings
  • Key Functions:
    • Sets up standardized "scratchpad" data structure
    • Determines workflow routing (suppress, auto-remediate, incident, ticket)
    • Manages Security Hub finding status updates
    • Handles email notifications and ticket management

SOARAutoRemediations

  • File: statemachines/autoremediations.asl.yaml
  • Purpose: Executes automated fixes for 30+ security controls
  • Coverage: EC2, S3, RDS, IAM, ELB, ECS, ECR, KMS, DynamoDB
  • Status: 100% tested (30/30 functions, 428 tests)

SOARIncidents

  • File: statemachines/incidents.asl.yaml
  • Purpose: Handles security incidents requiring immediate response
  • Capabilities:
    • GuardDuty finding processing (EC2, IAM, S3, EKS)
    • Instance termination and forensic data collection
    • Incident classification and escalation

SOARSyncEnabledControlsTable

  • File: statemachines/sync_enabled_controls_table.asl.yaml
  • Purpose: Synchronizes enabled security controls across accounts

SOARHourlyTasks

  • File: statemachines/hourly_tasks.asl.yaml
  • Purpose: Performs hourly maintenance tasks including overdue ticket processing

SOARWeeklyAIReport

  • File: statemachines/weekly_ai_report.asl.yaml
  • Purpose: Generates AI-powered weekly security reports

2. Standard Data Structure (Scratchpad)

All functions operate on a standardized data structure established by the ASFF Processor:

{
  "account": {},              // Account metadata from GetAccountDataFunction
  "finding": {},              // Complete ASFF Security Hub finding data
  "tags": {},                 // Resource tags for context
  "actions": {               // Control flags set by processing functions
    "suppress_finding": false,
    "autoremediation_not_done": false,
    "reconsider_later": false
  },
  "messages": {              // Communication templates
    "actions_taken": "None.",
    "actions_required": "...",
    "ai": {
      "plaintext": "",
      "html": ""
    }
  },
  "db": {}                   // DynamoDB lookup results and cached data
}

3. Error Handling and Retry Patterns

The system implements consistent error handling across all state machines:

Standard Retry Configuration

  • Interval: 3 seconds
  • Backoff Rate: 3.0 (exponential backoff)
  • Max Attempts: 10-100 (varies by function criticality)

Error Type Handling

  • Lambda.ServiceException: Standard retry with exponential backoff
  • Lambda.AWSLambdaException: Extended retry for AWS service issues
  • DynamoDB.ProvisionedThroughputExceededException: Specialized retry for database throttling
  • States.TaskFailed: Catch-all for general task failures
  • States.ALL: Comprehensive catch-all for any unhandled exceptions

Fallback-to-Ticketing Error Handling (v2.2.1+)

  • Comprehensive Coverage: All 30+ autoremediation functions protected by state machine catch-all error handling
  • Centralized Error Handler: SetAutoremediationNotDone state sets actions.autoremediation_not_done = true for any unhandled failures
  • Data Preservation: Pure ASL implementation preserves all original scratchpad data while adding error flag
  • Workflow Integration: Failed autoremediations automatically route to ticketing system for manual intervention
  • Zero Silent Failures: Eliminates workflow failures from unhandled Lambda exceptions or API errors

AI Function Special Handling

  • Timeout: 600 seconds (10 minutes)
  • Throttling Protection: Extended retry intervals
  • Memory Allocation: Higher memory limits for AI processing

Function Architecture (77 Total Lambda Functions)

Auto-Remediation Functions (30 functions - 100% tested)

  • Status: Complete test coverage (428 tests)
  • Services: RDS, EC2, S3, IAM, ELB, ECR, ECS, KMS, DynamoDB
  • Pattern: Standardized ASFF processing with comprehensive error handling
  • Error Resilience: State machine catch-all error handling ensures 100% fallback-to-ticketing coverage (v2.2.1+)

Core Workflow Functions (47 functions - Next Testing Priority)

Category 1: Core Workflow (6 functions) - CRITICAL

  1. get_ticket_and_decide - Central decision maker for all workflow routing
  2. get_account_data - Loads account metadata used throughout system
  3. suppress_finding - Updates Security Hub finding status to SUPPRESSED
  4. suppress_locally - Applies local suppression rules
  5. update_remediated_finding - Marks findings as RESOLVED
  6. account_reassignment_preprocessor - Corrects account routing for delegated services (IAM Access Analyzer, GuardDuty, Inspector, Detective) - ✅ COMPLETE (25 tests passing)

Category 2: Finding Processing (8 functions) - HIGH PRIORITY

  1. compute_penalty_score - Risk scoring algorithm
  2. update_ticketed_finding - Marks findings as NOTIFIED
  3. get_findings_for_account - Retrieves Security Hub findings
  4. get_findings_for_all_accounts - Cross-account finding aggregation
  5. get_findings_for_weekly_report - Report data preparation
  6. get_recent_findings - Recent activity tracking
  7. get_findings_count - Statistical counting
  8. get_findings_count_for_account - Account-specific statistics

Category 3: Incident Response (3 functions) - HIGH PRIORITY

  1. determine_type - Incident classification (EC2, IAM, S3, EKS, Generic)
  2. terminate_instance - Compromised EC2 instance termination
  3. call_disk_forensics_collection - Forensic data collection trigger

Category 4: Communication (15 functions) - MEDIUM PRIORITY

  • Email formatting (9 functions)
  • Ticketing system integration (6 functions)

Category 5: Reporting & Analytics (15 functions) - MEDIUM PRIORITY

  • AI report generation (7 functions)
  • Statistics and metrics (4 functions)
  • Account management (3 functions)
  • System setup (3 functions)

4. Security Finding Processing Workflow

When security findings are detected:

  1. Ingestion: ASFF Processor receives Security Hub findings
  2. Enrichment: Account data and context added via GetAccountDataFunction
  3. Scoring: Penalty scores calculated based on severity and environment
  4. Decision: GetTicketAndDecideFunction determines workflow path:
    • Auto-remediate: Route to SOARAutoRemediations state machine
    • Incident Response: Route to SOARIncidents state machine
    • Manual Intervention: Create ticket and notify teams
    • Suppress: Apply suppression rules and update finding status
  5. Execution: Appropriate actions taken based on decision
  6. Tracking: Results recorded in DynamoDB and Security Hub updated

5. Automated Remediation Architecture

The system includes 30 specialized Lambda functions for remediating specific AWS security controls:

Service Coverage

  • RDS (9 controls): Database encryption, snapshot settings, security configurations
  • EC2 (6 controls): Security group configurations, unnecessary ports, instance settings
  • S3 (5 controls): Bucket permissions, encryption settings, public access
  • IAM (3 controls): Excessive permissions, policy compliance
  • ELB (2 controls): Load balancer security settings
  • ECR (2 controls): Container registry security
  • ECS (1 control): Container service security
  • KMS (1 control): Key management settings
  • DynamoDB (1 control): Database encryption

Standardized Remediation Pattern

  1. ASFF Validation: Verify finding format and required fields
  2. Resource Validation: Confirm resource exists and is accessible
  3. Cross-Account Access: Assume appropriate role in target account
  4. Remediation Logic: Implement AWS-specific security fix
  5. Verification: Confirm remediation was successful
  6. Response Formatting: Return standardized success/failure response
  7. Error Handling: Comprehensive error catching and reporting

Lambda Layer Architecture

  • aws_utils Layer: Cross-account client creation and AWS service utilities
  • rds_remediation Layer: Specialized RDS remediation functions
  • Shared Dependencies: Common libraries and utilities across functions

4. Ticketing and Notification System

For issues requiring human intervention:

  1. Tickets are created with detailed information
  2. Email notifications are sent to appropriate teams
  3. Reminder system tracks overdue tickets
  4. Resolution process is monitored and recorded

The system uses specialized formatters for different issue types (GuardDuty findings, general compliance issues, etc.).

5. Multi-Account Management

SOAR operates across an AWS Organization:

  • Organization Admin Account: Hosts the SOAR application and centralized security operations
  • Security Account: Used for managing the security posture and setting security controls
  • Member Accounts: Where findings are detected and remediated

Cross-account access is managed via IAM roles with least privilege principles.

6. AI Integration

AI capabilities enhance the security operations:

  • Weekly Security Reports: AI-generated analysis of security posture
  • Recommendations: Contextual security improvement suggestions
  • Incident Analysis: AI-assisted investigation of security incidents

7. Reporting and Analytics

The system maintains historical data for:

  • Auto-remediation effectiveness
  • Incident trends
  • Ticket resolution times
  • Account security posture over time

Technical Architecture

Serverless Infrastructure

  • Lambda Functions: 77 total functions, Python 3.12 runtime
    • 30 auto-remediation functions (100% tested)
    • 47 workflow/support functions (next testing target)
  • Lambda Layers: Shared code libraries for common functionality
  • DynamoDB: Stores findings, ticketing data, and configuration
  • EventBridge: Custom event bus coordinates asynchronous processing
  • Step Functions: 5 state machines orchestrate complex workflows
  • Simple Email Service: Handles notifications and alerts
  • IAM Roles: Manages cross-account permissions with least privilege
  • Security Hub: Central finding repository and status management
  • GuardDuty: Threat detection and incident generation

Detailed Data Flow

  1. Finding Ingestion:

    • Security Hub findings trigger SOARASFFProcessor
    • GuardDuty findings processed through incident workflows
    • EventBridge routes findings to appropriate state machines
  2. Data Enrichment:

    • GetAccountDataFunction loads account metadata
    • Tags and resource context added to scratchpad
    • Penalty scoring applied based on environment and severity
  3. Decision Processing:

    • GetTicketAndDecideFunction analyzes finding characteristics
    • Suppression rules evaluated via SuppressLocallyFunction
    • Workflow routing determined (auto-remediate, incident, ticket, suppress)
  4. Action Execution:

    • Auto-remediation: SOARAutoRemediations state machine
    • Incidents: SOARIncidents state machine with forensics
    • Ticketing: Email notifications and external ticket creation
    • Suppression: Security Hub status update to SUPPRESSED
  5. Result Tracking:

    • DynamoDB stores all actions and outcomes
    • Security Hub findings updated with resolution status
    • Statistics collected for reporting and analytics

State Machine Integration Points

ASFF Processor → Auto-Remediation

  • Trigger: actions.suppress_finding = false AND auto-remediation applicable
  • Data: Complete scratchpad with finding and account context
  • Response: Remediation success/failure status

ASFF Processor → Incidents

  • Trigger: GuardDuty findings or high-severity security events
  • Data: Incident-specific context and response requirements
  • Response: Incident handling results and forensic data

Cross-State Machine Error Handling

  • Retry Logic: Consistent 3-second intervals with exponential backoff
  • Error Propagation: Structured error responses maintain workflow state
  • Fallback Mechanisms: Graceful degradation when services unavailable

Deployment Model

  • AWS SAM: Defines infrastructure as code
  • CloudFormation: Manages resource provisioning
  • Multi-region: Supports deployment across regions
  • Versioned Releases: Tracked via git tags and CHANGELOG

Security and Compliance Considerations

  • Least Privilege: Minimal IAM permissions for operations
  • Environment-based Prioritization: Higher security standards for production
  • Audit Trail: Records all remediation actions and decisions
  • Cross-account Security: Controlled access between accounts

Scalability

  • Serverless Architecture: Scales with security finding volume
  • Stateless Processing: Allows horizontal scaling
  • Account Parallelism: Processes multiple accounts concurrently

Operational Model

  • Infrastructure as Code: All resources defined in template.yaml
  • Automated Deployment: Via deploy script
  • CI/CD Ready: Structured for pipeline integration
  • Configuration Management: Via parameters and environment variables

Event-Driven Architecture Pattern

SOAR follows established event-driven patterns for state transitions, with two primary approaches based on AWS API constraints:

Update-and-Reprocess Workflow (Mutable Fields)

  1. State Change: Lambda function calls Security Hub batch_update_findings API
  2. Event Generation: Security Hub automatically generates new finding event with updated status
  3. Fresh Processing: SOAR processes the new event with corrected data from the beginning
  4. Clean Logic: All workflow decisions use the updated finding state

Use Cases:

  • Auto-Remediation Completion: Sets Workflow.Status: 'RESOLVED' → Triggers ticket closure workflow
  • Finding Suppression: Sets Workflow.Status: 'SUPPRESSED' → Triggers cleanup actions
  • Severity Reclassification: Updates Severity.Label → Triggers reprocessing with correct severity

Create-and-Suppress Workflow (Immutable Fields)

  1. Create New Finding: Lambda function calls Security Hub batch_import_findings API with corrected data
  2. Suppress Original: Lambda function calls Security Hub batch_update_findings API to suppress original finding
  3. Terminate Current: Sets terminate_for_reprocessing flag to end current workflow
  4. Fresh Processing: New finding triggers fresh SOAR processing with corrected data

Use Cases:

  • Account Routing Correction: Creates finding in correct account when AwsAccountId is immutable
  • Cross-Account Finding Creation: Routes findings to appropriate account teams

Account Reassignment Preprocessor

AWS Security Hub delegated administration architecture creates account routing mismatches for certain security services (IAM Access Analyzer, GuardDuty, Inspector, Detective). These services generate findings that appear to originate from the Security-Adm account but actually concern resources in member accounts.

Problem: Finding shows AwsAccountId: 111111111111 (Security-Adm) but resource is in account 222222222222

Solution: Account Reassignment Preprocessor detects mismatches and creates corrected findings in the appropriate accounts:

  1. Detection: Two-tier approach comparing AwsAccountId with:
    • Priority 1: ProductFields.ResourceOwnerAccount field
    • Priority 2: Account extracted from resource ARN (arn:aws:service:region:account:resource)
  2. Recreation: Uses batch_import_findings to create properly structured ASFF finding in target account
  3. Suppression: Sets actions.suppress_finding = True to trigger state machine suppression of original finding
  4. Graceful Degradation: Any failure preserves original workflow (fail-safe design)
  5. Fresh Processing: New finding triggers complete SOAR processing in correct account

Technical Implementation:

  • Production-Ready: All 25 tests passing with comprehensive TDD coverage
  • ASFF Compliance: Follows AWS Security Hub BatchImportFindings API requirements exactly
  • Zero AWS Costs: Comprehensive mocking prevents real API calls during testing
  • Cross-Account Support: Uses established SOAR get_client patterns for secure access
  • Unique ID Generation: Creates unique finding IDs using {original-id}-reassigned-{account-id} pattern

Integration: Account Reassignment Preprocessor integrates into the ASFF processor state machine after AWS Health Reclassifier, with Choice node routing to existing suppression functionality when actions.suppress_finding = true.

Benefits

  • Consistent Processing: All workflows use the same updated finding data
  • Audit Trail: Security Hub maintains complete state change history
  • Downstream Integration: Other Security Hub consumers see corrected data
  • Loop Prevention: Updates are unidirectional (e.g., HIGH → INFORMATIONAL only)
  • Correct Team Notification: Findings reach the teams responsible for the actual resources

This pattern enables the SOAR to maintain clean separation between decision logic and state management while ensuring all systems have consistent, up-to-date finding information and appropriate team routing.

CloudWatch Alarm Integration and Monitoring

System Self-Monitoring Architecture

The OpenSecOps platform implements comprehensive self-monitoring through CloudWatch alarms that automatically create SOAR incidents for operational failures. This ensures that infrastructure issues are handled through the same security incident response workflow as other security findings.

Alarm Processing Workflow

  1. Alarm Creation: CloudWatch alarms monitor AWS services (Step Functions, Lambda, etc.)
  2. Event Generation: Alarm state changes to "ALARM" trigger EventBridge events
  3. Event Processing: SOAR-all-alarms-to-sec-hub component converts alarm events to Security Hub findings
  4. SOAR Processing: Main SOAR system processes alarm findings as incidents
  5. Incident Response: Alarms are routed through standard incident workflows with AI analysis

SOAR-all-alarms-to-sec-hub Component

Purpose: Converts CloudWatch alarm state change events into AWS Security Finding Format (ASFF) findings in Security Hub

Trigger Pattern: EventBridge rule matching CloudWatch alarms transitioning to "ALARM" state:

Pattern:
  source: [aws.cloudwatch]
  detail-type: [CloudWatch Alarm State Change]
  detail:
    state:
      value: [ALARM]

Processing Logic:

  • Extracts alarm metadata (name, description, account, region, timestamp)
  • Suppresses CIS-related alarms to avoid operational noise
  • Requires severity level in alarm name (INFORMATIONAL|LOW|MEDIUM|HIGH|CRITICAL)
  • Determines incident domain: "INFRA" (infrastructure) vs "APP" (application) based on alarm name
  • Creates ASFF-compliant Security Hub finding

ASFF Structure Created:

{
  "SchemaVersion": "2018-10-08",
  "Types": ["Software and Configuration Checks/CloudWatch Alarms/soar-cloudwatch-alarms"],
  "Title": "ALARM_NAME",
  "Description": "ALARM_DESCRIPTION or N/A",
  "Severity": {"Label": "HIGH|MEDIUM|LOW|etc"},
  "Resources": [{"Type": "AwsAccountId", "Id": "account_id", "Region": "region"}],
  "ProductFields": {
    "IncidentDomain": "INFRA|APP",
    "TicketDestination": "TEAM"
  }
}

Alarm Naming Convention

All monitoring alarms follow the pattern: [DOMAIN]-[Component]-[Type]-[Severity]

Domain Types:

  • INFRA: Infrastructure alarms routed to infrastructure/operations teams
  • APP: Application alarms that can be routed to application development teams

Examples:

  • INFRA-SOAR-ASFF-Processor-SM-Failure-HIGH
  • INFRA-CombineLogFilesSM-Failure-HIGH
  • INFRA-SOAR-Weekly-AI-Report-SM-Failure-MEDIUM
  • APP-UserService-Lambda-Error-MEDIUM (hypothetical application alarm)

This naming convention enables:

  • Automatic severity extraction and incident routing
  • Incident domain classification (INFRA vs APP) for appropriate team assignment
  • Complete incident management coverage across infrastructure and application layers
  • Consistent AI prompt context and analysis

SOAR Platform Self-Monitoring

Core State Machine Monitoring (HIGH severity):

  • SOAR ASFF Processor failures
  • Auto-Remediation state machine failures
  • Incident processing failures

Operational Monitoring (MEDIUM/LOW severity):

  • Weekly AI Report generation failures
  • Hourly maintenance task failures
  • Control synchronization failures

Alarm Configuration:

  • Metric: ExecutionsFailed from AWS/States namespace
  • Threshold: >= 1 failed execution
  • Period: 60 seconds with 1 evaluation period
  • Treatment: Missing data treated as not breaching

Foundation Platform Self-Monitoring

Log Processing Infrastructure (HIGH severity):

  • Control Tower log aggregation failures
  • Historical log processing failures

Alarm Configuration: Same pattern as SOAR alarms, monitoring Step Functions execution failures

AI-Enhanced Incident Analysis

Recent enhancements (v2.2.1) provide comprehensive AI-powered incident analysis capabilities:

Intelligence Features:

  • Component-Specific Analysis: Deep understanding of each state machine and Lambda function's purpose
  • Context-Aware Prompts: AI prompts distinguish between individual failures vs. systemic issues
  • Impact Assessment: Explains operational impact and urgency levels based on component criticality
  • Targeted Recommendations: Provides specific troubleshooting steps tailored to each alarm type
  • Operational Continuity: Emphasizes that single failures don't indicate complete system breakdown

Enhanced Diagnostic Capabilities:

  • Failure Context: AI understands the role and dependencies of each failing component
  • Troubleshooting Guidance: Component-specific debugging procedures and common resolution steps
  • Automated Infrastructure Knowledge: Self-updating understanding of system architecture
  • Multi-Account Awareness: Context-aware analysis across different AWS accounts and environments

Analysis Inputs: The AI analysis receives the alarm-based ASFF finding and provides contextualized guidance based on:

  • Component type (SOAR core vs. operational vs. Foundation vs. application)
  • Severity level and expected operational impact
  • Component criticality and dependencies
  • Historical patterns and troubleshooting procedures
  • Account context and organizational structure

Limitations and Context Gaps

Current Context Available:

  • Alarm name, description, and basic metadata
  • Account and region information
  • Severity level and incident domain classification
  • Timestamp of alarm trigger

Missing Context for Enhanced Debugging:

  • Specific Step Functions execution ARN and failure details
  • CloudWatch Logs from failed Lambda functions
  • Actual metric values that triggered the alarm
  • Execution timeline and specific failure points
  • Resource-specific details beyond account-level information

This comprehensive alarm integration ensures that both SOAR and Foundation infrastructure failures are automatically detected, classified, and routed through the standard security incident response workflow with appropriate AI-powered analysis and operational guidance.

CloudWatch Alarm Context Enrichment Enhancement

Proposed Architecture Enhancement

To address the context limitations identified above, a CloudWatch Context Enrichment Function is proposed to provide AI-powered incident analysis with comprehensive debugging information and historical pattern analysis.

Enhanced Context Enrichment Function

Purpose: Enrich CloudWatch alarm-based incidents with detailed execution context, logs, and historical pattern analysis before AI processing.

Integration Point: Added to SOARIncidents state machine between instruction retrieval and AI analysis for both APP and INFRA domains.

State Machine Placement:

# After "Format Generic message" and before "Email to Whom?"
EnrichCloudWatchContext:
  Type: Task
  Resource: '${EnrichCloudWatchContextFunctionArn}'
  TimeoutSeconds: 300
  Condition: Finding type contains "CloudWatch Alarms"
  Next: Email to Whom?

Intelligent Enrichment Logic

Service-Aware Processing:

  • Step Functions Failures: Extract execution ARNs, failure details, execution history
  • Lambda Function Errors: Retrieve CloudWatch Logs, error messages, invocation details
  • Generic AWS Services: Basic metric and configuration data

Temporal Correlation:

  • Calculate exact alarm evaluation window using Period × EvaluationPeriods
  • Query failed executions/invocations within specific time window that triggered alarm
  • Eliminates false correlation with unrelated failures

Memory and Pattern Analysis:

  • Query DynamoDB incidents table using configurable retention period (IncidentExpirationInDays)
  • Analyze incident frequency across multiple time windows (7d, 30d, 90d, full retention)
  • Detect pattern trends: INCREASING, DECREASING, STABLE
  • Classify incidents: RARE_OCCURRENCE, FREQUENT_RECENT, RECURRING_PATTERN, ISOLATED_INCIDENT, BASELINE_DEGRADATION
  • Determine success baselines from recent execution history

Enhanced Context Data Structure

Technical Enrichment:

{
  "service_type": "stepfunctions|lambda|generic",
  "alarm_correlation": {
    "evaluation_window": {...},
    "total_executions_in_window": N
  },
  "failed_executions": [
    {
      "execution_arn": "...",
      "failure_details": {...},
      "logs_summary": "..."
    }
  ]
}

Pattern Analysis:

{
  "pattern_analysis": {
    "retention_period_days": 365,
    "incident_trends": {
      "recent_rate": 0.5,
      "trend_direction": "INCREASING|STABLE|DECREASING"
    },
    "pattern_classification": "FREQUENT_RECENT|ISOLATED_INCIDENT|...",
    "baseline_context": {
      "success_rate": 0.95,
      "last_successful_execution": "2024-01-01T12:00:00Z"
    }
  }
}

AI Enhancement Benefits

The enriched context enables AI analysis to provide:

  1. Precise Debugging: "Step XYZ failed with error ABC in execution DEF, check log stream GHI for root cause JKL"
  2. Pattern Awareness: "This is the 3rd occurrence in 7 days, suggesting systematic issue vs. isolated failure"
  3. Baseline Comparison: "Success rate dropped from 95% to 60% vs. isolated failure in otherwise healthy system"
  4. Historical Context: "Last successful execution 2 hours ago vs. no successful executions in 48 hours"
  5. Trend Analysis: "Increasing failure frequency suggests underlying degradation vs. random operational issue"

Error Resilience and Performance

  • Graceful Degradation: Enrichment failures don't block incident processing
  • Intelligent Filtering: Only processes CloudWatch alarm findings with enrichable patterns
  • Configurable Timeouts: 300-second limit prevents workflow delays
  • Cost Optimization: Respects configured retention periods and query limits
  • Cross-Account Integration: Leverages existing SOAR IAM roles for API access

This enhancement transforms AI incident analysis from reactive debugging to pattern-aware, context-rich incident intelligence, significantly improving operational response quality and debugging efficiency.

Implementation Plan

Phase 1: Core Enrichment Function (Week 1-2)

  1. Create Lambda Function

    • functions/enrich_cloudwatch_context/app.py
    • Core function structure with intelligent pattern detection
    • Environment variables: INCIDENTS_TABLE_NAME, EXPIRATION_DAYS
  2. Add Function to Template

    • CloudFormation resource definition
    • IAM permissions for DynamoDB, Step Functions, CloudWatch APIs
    • Environment variable mapping
  3. Basic Service Detection

    • Step Functions alarm pattern matching
    • Lambda function alarm pattern matching
    • Alarm evaluation window calculation

Phase 2: Step Functions Integration (Week 2-3)

  1. Step Functions Enrichment

    • Execution ARN extraction from alarm configuration
    • Failed execution queries within evaluation window
    • Execution history and failure point identification
    • CloudWatch Logs integration for state machine logs
  2. Lambda Functions Enrichment

    • Function name extraction from alarm configuration
    • CloudWatch Logs queries for error patterns
    • Invocation failure correlation with alarm timing

Phase 3: Memory and Pattern Analysis (Week 3-4)

  1. DynamoDB Integration

    • Query incidents table using configurable retention period
    • Multi-timeframe analysis (7d, 30d, 90d, full retention)
    • Incident frequency and trend calculation
  2. Pattern Classification

    • Success baseline analysis from execution history
    • Incident pattern categorization logic
    • Trend direction detection (increasing/stable/decreasing)

Phase 4: State Machine Integration (Week 4-5)

  1. SOARIncidents State Machine Updates

    • Add EnrichCloudWatchContext state after "Format Generic message"
    • Conditional execution for CloudWatch alarm findings only
    • Error handling with graceful degradation
  2. Testing and Validation

    • Unit tests for enrichment logic
    • Integration tests with sample alarm events
    • Performance testing with large DynamoDB datasets

Phase 5: AI Prompt Enhancement (Week 5-6)

  1. AI Prompt Updates

    • Update incident_infra.txt to utilize enriched context
    • Update incident_app.txt for application incidents
    • Test AI analysis quality with enriched data
  2. Documentation and Deployment

    • Update architecture documentation
    • Deployment procedures and rollback plans
    • Performance monitoring and alerting

Implementation Files

New Files to Create:

  • functions/enrich_cloudwatch_context/app.py - Main enrichment function
  • functions/enrich_cloudwatch_context/requirements.txt - Dependencies
  • tests/test_enrich_cloudwatch_context.py - Unit tests

Files to Modify:

  • template.yaml - Add function definition and state machine updates
  • statemachines/incidents.asl.yaml - Add enrichment state
  • ai-prompts/incident_infra.txt - Enhanced prompts for enriched context
  • ai-prompts/incident_app.txt - Enhanced prompts for enriched context

Success Metrics

  • Enrichment Coverage: % of CloudWatch alarms receiving enrichment
  • AI Quality Improvement: Subjective assessment of AI recommendations with enriched context
  • Performance Impact: Enrichment function execution time and state machine latency
  • Pattern Detection Accuracy: Validation of pattern classifications against known incident trends

Integration Points

  • Security Hub: Primary source of findings
  • GuardDuty: Threat detection integration
  • CloudWatch: Infrastructure monitoring and alarm generation
  • EventBridge: Alarm event routing and processing
  • External Ticketing: Can connect to external systems
  • Microsoft Sentinel: Optional integration for enterprise SOC environments

Future Extensibility

The modular architecture allows for:

  • Auto-Remediation Expansion: Adding new security controls and services
  • Security Service Integration: Additional AWS security services
  • AI Enhancement: Advanced threat analysis and response recommendations
  • Custom Playbooks: Organization-specific incident response procedures
  • Third-Party Integrations: SIEM, SOAR, and ticketing system connectors