SOAR Architecture

This document provides a comprehensive technical overview of the OpenSecOps Security Orchestration, Automation, and Response (SOAR) architecture, including detailed workflow analysis, data structures, error handling patterns, and testing strategies.

System Overview

OpenSecOps SOAR is a serverless AWS-native security automation platform that processes security findings, automates remediation, manages tickets for manual intervention, and generates AI-powered security reports. The system operates across an AWS Organization, providing centralized security management for all accounts.

Core Architecture Components

1. State Machine Workflows

The system is built around AWS Step Functions state machines that orchestrate complex security workflows:

SOARASFFProcessor (Main Orchestrator)

File: statemachines/asff_processor.asl.yaml
Purpose: Central orchestrator that processes all Security Hub findings
Key Functions:
- Sets up standardized "scratchpad" data structure
- Determines workflow routing (suppress, auto-remediate, incident, ticket)
- Manages Security Hub finding status updates
- Handles email notifications and ticket management

SOARAutoRemediations

File: statemachines/autoremediations.asl.yaml
Purpose: Executes automated fixes for 30+ security controls
Coverage: EC2, S3, RDS, IAM, ELB, ECS, ECR, KMS, DynamoDB
Status: 100% tested (30/30 functions, 428 tests)

SOARIncidents

File: statemachines/incidents.asl.yaml
Purpose: Handles security incidents requiring immediate response
Capabilities:
- GuardDuty finding processing (EC2, IAM, S3, EKS)
- Instance termination and forensic data collection
- Incident classification and escalation

SOARSyncEnabledControlsTable

File: statemachines/sync_enabled_controls_table.asl.yaml
Purpose: Synchronizes enabled security controls across accounts

SOARHourlyTasks

File: statemachines/hourly_tasks.asl.yaml
Purpose: Performs hourly maintenance tasks including overdue ticket processing

SOARWeeklyAIReport

File: statemachines/weekly_ai_report.asl.yaml
Purpose: Generates AI-powered weekly security reports

2. Standard Data Structure (Scratchpad)

All functions operate on a standardized data structure established by the ASFF Processor:

{
  "account": {},              // Account metadata from GetAccountDataFunction
  "finding": {},              // Complete ASFF Security Hub finding data
  "tags": {},                 // Resource tags for context
  "actions": {               // Control flags set by processing functions
    "suppress_finding": false,
    "autoremediation_not_done": false,
    "reconsider_later": false
  },
  "messages": {              // Communication templates
    "actions_taken": "None.",
    "actions_required": "...",
    "ai": {
      "plaintext": "",
      "html": ""
    }
  },
  "db": {}                   // DynamoDB lookup results and cached data
}

3. Error Handling and Retry Patterns

The system implements consistent error handling across all state machines:

Standard Retry Configuration

Interval: 3 seconds
Backoff Rate: 3.0 (exponential backoff)
Max Attempts: 10-100 (varies by function criticality)

Error Type Handling

Lambda.ServiceException: Standard retry with exponential backoff
Lambda.AWSLambdaException: Extended retry for AWS service issues
DynamoDB.ProvisionedThroughputExceededException: Specialized retry for database throttling
States.TaskFailed: Catch-all for general task failures
States.ALL: Comprehensive catch-all for any unhandled exceptions

Fallback-to-Ticketing Error Handling (v2.2.1+)

Comprehensive Coverage: All 30+ autoremediation functions protected by state machine catch-all error handling
Centralized Error Handler: SetAutoremediationNotDone state sets actions.autoremediation_not_done = true for any unhandled failures
Data Preservation: Pure ASL implementation preserves all original scratchpad data while adding error flag
Workflow Integration: Failed autoremediations automatically route to ticketing system for manual intervention
Zero Silent Failures: Eliminates workflow failures from unhandled Lambda exceptions or API errors

AI Function Special Handling

Timeout: 600 seconds (10 minutes)
Throttling Protection: Extended retry intervals
Memory Allocation: Higher memory limits for AI processing

Function Architecture (77 Total Lambda Functions)

Auto-Remediation Functions (30 functions - 100% tested)

Status: Complete test coverage (428 tests)
Services: RDS, EC2, S3, IAM, ELB, ECR, ECS, KMS, DynamoDB
Pattern: Standardized ASFF processing with comprehensive error handling
Error Resilience: State machine catch-all error handling ensures 100% fallback-to-ticketing coverage (v2.2.1+)

Core Workflow Functions (47 functions - Next Testing Priority)

Category 1: Core Workflow (6 functions) - CRITICAL

get_ticket_and_decide - Central decision maker for all workflow routing
get_account_data - Loads account metadata used throughout system
suppress_finding - Updates Security Hub finding status to SUPPRESSED
suppress_locally - Applies local suppression rules
update_remediated_finding - Marks findings as RESOLVED
account_reassignment_preprocessor - Corrects account routing for delegated services (IAM Access Analyzer, GuardDuty, Inspector, Detective) - ✅ COMPLETE (25 tests passing)

Category 2: Finding Processing (8 functions) - HIGH PRIORITY

compute_penalty_score - Risk scoring algorithm
update_ticketed_finding - Marks findings as NOTIFIED
get_findings_for_account - Retrieves Security Hub findings
get_findings_for_all_accounts - Cross-account finding aggregation
get_findings_for_weekly_report - Report data preparation
get_recent_findings - Recent activity tracking
get_findings_count - Statistical counting
get_findings_count_for_account - Account-specific statistics

Category 3: Incident Response (3 functions) - HIGH PRIORITY

determine_type - Incident classification (EC2, IAM, S3, EKS, Generic)
terminate_instance - Compromised EC2 instance termination
call_disk_forensics_collection - Forensic data collection trigger

Category 4: Communication (15 functions) - MEDIUM PRIORITY

Email formatting (9 functions)
Ticketing system integration (6 functions)

Category 5: Reporting & Analytics (15 functions) - MEDIUM PRIORITY

AI report generation (7 functions)
Statistics and metrics (4 functions)
Account management (3 functions)
System setup (3 functions)

4. Security Finding Processing Workflow

When security findings are detected:

Ingestion: ASFF Processor receives Security Hub findings
Enrichment: Account data and context added via GetAccountDataFunction
Scoring: Penalty scores calculated based on severity and environment
Decision: GetTicketAndDecideFunction determines workflow path:
- Auto-remediate: Route to SOARAutoRemediations state machine
- Incident Response: Route to SOARIncidents state machine
- Manual Intervention: Create ticket and notify teams
- Suppress: Apply suppression rules and update finding status
Execution: Appropriate actions taken based on decision
Tracking: Results recorded in DynamoDB and Security Hub updated

5. Automated Remediation Architecture

The system includes 30 specialized Lambda functions for remediating specific AWS security controls:

Service Coverage

RDS (9 controls): Database encryption, snapshot settings, security configurations
EC2 (6 controls): Security group configurations, unnecessary ports, instance settings
S3 (5 controls): Bucket permissions, encryption settings, public access
IAM (3 controls): Excessive permissions, policy compliance
ELB (2 controls): Load balancer security settings
ECR (2 controls): Container registry security
ECS (1 control): Container service security
KMS (1 control): Key management settings
DynamoDB (1 control): Database encryption

Standardized Remediation Pattern

ASFF Validation: Verify finding format and required fields
Resource Validation: Confirm resource exists and is accessible
Cross-Account Access: Assume appropriate role in target account
Remediation Logic: Implement AWS-specific security fix
Verification: Confirm remediation was successful
Response Formatting: Return standardized success/failure response
Error Handling: Comprehensive error catching and reporting

Lambda Layer Architecture

aws_utils Layer: Cross-account client creation and AWS service utilities
rds_remediation Layer: Specialized RDS remediation functions
Shared Dependencies: Common libraries and utilities across functions

4. Ticketing and Notification System

For issues requiring human intervention:

Tickets are created with detailed information
Email notifications are sent to appropriate teams
Reminder system tracks overdue tickets
Resolution process is monitored and recorded

The system uses specialized formatters for different issue types (GuardDuty findings, general compliance issues, etc.).

5. Multi-Account Management

SOAR operates across an AWS Organization:

Organization Admin Account: Hosts the SOAR application and centralized security operations
Security Account: Used for managing the security posture and setting security controls
Member Accounts: Where findings are detected and remediated

Cross-account access is managed via IAM roles with least privilege principles.

6. AI Integration

AI capabilities enhance the security operations:

Weekly Security Reports: AI-generated analysis of security posture
Recommendations: Contextual security improvement suggestions
Incident Analysis: AI-assisted investigation of security incidents

7. Reporting and Analytics

The system maintains historical data for:

Auto-remediation effectiveness
Incident trends
Ticket resolution times
Account security posture over time

Technical Architecture

Serverless Infrastructure

Lambda Functions: 77 total functions, Python 3.12 runtime
- 30 auto-remediation functions (100% tested)
- 47 workflow/support functions (next testing target)
Lambda Layers: Shared code libraries for common functionality
DynamoDB: Stores findings, ticketing data, and configuration
EventBridge: Custom event bus coordinates asynchronous processing
Step Functions: 5 state machines orchestrate complex workflows
Simple Email Service: Handles notifications and alerts
IAM Roles: Manages cross-account permissions with least privilege
Security Hub: Central finding repository and status management
GuardDuty: Threat detection and incident generation

Detailed Data Flow

Finding Ingestion:
- Security Hub findings trigger SOARASFFProcessor
- GuardDuty findings processed through incident workflows
- EventBridge routes findings to appropriate state machines
Data Enrichment:
- GetAccountDataFunction loads account metadata
- Tags and resource context added to scratchpad
- Penalty scoring applied based on environment and severity
Decision Processing:
- GetTicketAndDecideFunction analyzes finding characteristics
- Suppression rules evaluated via SuppressLocallyFunction
- Workflow routing determined (auto-remediate, incident, ticket, suppress)
Action Execution:
- Auto-remediation: SOARAutoRemediations state machine
- Incidents: SOARIncidents state machine with forensics
- Ticketing: Email notifications and external ticket creation
- Suppression: Security Hub status update to SUPPRESSED
Result Tracking:
- DynamoDB stores all actions and outcomes
- Security Hub findings updated with resolution status
- Statistics collected for reporting and analytics

State Machine Integration Points

ASFF Processor → Auto-Remediation

Trigger: actions.suppress_finding = false AND auto-remediation applicable
Data: Complete scratchpad with finding and account context
Response: Remediation success/failure status

ASFF Processor → Incidents

Trigger: GuardDuty findings or high-severity security events
Data: Incident-specific context and response requirements
Response: Incident handling results and forensic data

Cross-State Machine Error Handling

Retry Logic: Consistent 3-second intervals with exponential backoff
Error Propagation: Structured error responses maintain workflow state
Fallback Mechanisms: Graceful degradation when services unavailable

Deployment Model

AWS SAM: Defines infrastructure as code
CloudFormation: Manages resource provisioning
Multi-region: Supports deployment across regions
Versioned Releases: Tracked via git tags and CHANGELOG

Security and Compliance Considerations

Least Privilege: Minimal IAM permissions for operations
Environment-based Prioritization: Higher security standards for production
Audit Trail: Records all remediation actions and decisions
Cross-account Security: Controlled access between accounts

Scalability

Serverless Architecture: Scales with security finding volume
Stateless Processing: Allows horizontal scaling
Account Parallelism: Processes multiple accounts concurrently

Operational Model

Infrastructure as Code: All resources defined in template.yaml
Automated Deployment: Via deploy script
CI/CD Ready: Structured for pipeline integration
Configuration Management: Via parameters and environment variables

Event-Driven Architecture Pattern

SOAR follows established event-driven patterns for state transitions, with two primary approaches based on AWS API constraints:

Update-and-Reprocess Workflow (Mutable Fields)

State Change: Lambda function calls Security Hub batch_update_findings API
Event Generation: Security Hub automatically generates new finding event with updated status
Fresh Processing: SOAR processes the new event with corrected data from the beginning
Clean Logic: All workflow decisions use the updated finding state

Use Cases:

Auto-Remediation Completion: Sets Workflow.Status: 'RESOLVED' → Triggers ticket closure workflow
Finding Suppression: Sets Workflow.Status: 'SUPPRESSED' → Triggers cleanup actions
Severity Reclassification: Updates Severity.Label → Triggers reprocessing with correct severity

Create-and-Suppress Workflow (Immutable Fields)

Create New Finding: Lambda function calls Security Hub batch_import_findings API with corrected data
Suppress Original: Lambda function calls Security Hub batch_update_findings API to suppress original finding
Terminate Current: Sets terminate_for_reprocessing flag to end current workflow
Fresh Processing: New finding triggers fresh SOAR processing with corrected data

Use Cases:

Account Routing Correction: Creates finding in correct account when AwsAccountId is immutable
Cross-Account Finding Creation: Routes findings to appropriate account teams

Account Reassignment Preprocessor

AWS Security Hub delegated administration architecture creates account routing mismatches for certain security services (IAM Access Analyzer, GuardDuty, Inspector, Detective). These services generate findings that appear to originate from the Security-Adm account but actually concern resources in member accounts.

Problem: Finding shows AwsAccountId: 111111111111 (Security-Adm) but resource is in account 222222222222

Solution: Account Reassignment Preprocessor detects mismatches and creates corrected findings in the appropriate accounts:

Detection: Two-tier approach comparing AwsAccountId with:
- Priority 1: ProductFields.ResourceOwnerAccount field
- Priority 2: Account extracted from resource ARN (arn:aws:service:region:account:resource)
Recreation: Uses batch_import_findings to create properly structured ASFF finding in target account
Suppression: Sets actions.suppress_finding = True to trigger state machine suppression of original finding
Graceful Degradation: Any failure preserves original workflow (fail-safe design)
Fresh Processing: New finding triggers complete SOAR processing in correct account

Technical Implementation:

Production-Ready: All 25 tests passing with comprehensive TDD coverage
ASFF Compliance: Follows AWS Security Hub BatchImportFindings API requirements exactly
Zero AWS Costs: Comprehensive mocking prevents real API calls during testing
Cross-Account Support: Uses established SOAR get_client patterns for secure access
Unique ID Generation: Creates unique finding IDs using {original-id}-reassigned-{account-id} pattern

Integration: Account Reassignment Preprocessor integrates into the ASFF processor state machine after AWS Health Reclassifier, with Choice node routing to existing suppression functionality when actions.suppress_finding = true.

Benefits

Consistent Processing: All workflows use the same updated finding data
Audit Trail: Security Hub maintains complete state change history
Downstream Integration: Other Security Hub consumers see corrected data
Loop Prevention: Updates are unidirectional (e.g., HIGH → INFORMATIONAL only)
Correct Team Notification: Findings reach the teams responsible for the actual resources

This pattern enables the SOAR to maintain clean separation between decision logic and state management while ensuring all systems have consistent, up-to-date finding information and appropriate team routing.

CloudWatch Alarm Integration and Monitoring

System Self-Monitoring Architecture

The OpenSecOps platform implements comprehensive self-monitoring through CloudWatch alarms that automatically create SOAR incidents for operational failures. This ensures that infrastructure issues are handled through the same security incident response workflow as other security findings.

Alarm Processing Workflow

Alarm Creation: CloudWatch alarms monitor AWS services (Step Functions, Lambda, etc.)
Event Generation: Alarm state changes to "ALARM" trigger EventBridge events
Event Processing: SOAR-all-alarms-to-sec-hub component converts alarm events to Security Hub findings
SOAR Processing: Main SOAR system processes alarm findings as incidents
Incident Response: Alarms are routed through standard incident workflows with AI analysis

SOAR-all-alarms-to-sec-hub Component

Purpose: Converts CloudWatch alarm state change events into AWS Security Finding Format (ASFF) findings in Security Hub

Trigger Pattern: EventBridge rule matching CloudWatch alarms transitioning to "ALARM" state:

Pattern:
  source: [aws.cloudwatch]
  detail-type: [CloudWatch Alarm State Change]
  detail:
    state:
      value: [ALARM]

Processing Logic:

Extracts alarm metadata (name, description, account, region, timestamp)
Suppresses CIS-related alarms to avoid operational noise
Requires severity level in alarm name (INFORMATIONAL|LOW|MEDIUM|HIGH|CRITICAL)
Determines incident domain: "INFRA" (infrastructure) vs "APP" (application) based on alarm name
Creates ASFF-compliant Security Hub finding

ASFF Structure Created:

{
  "SchemaVersion": "2018-10-08",
  "Types": ["Software and Configuration Checks/CloudWatch Alarms/soar-cloudwatch-alarms"],
  "Title": "ALARM_NAME",
  "Description": "ALARM_DESCRIPTION or N/A",
  "Severity": {"Label": "HIGH|MEDIUM|LOW|etc"},
  "Resources": [{"Type": "AwsAccountId", "Id": "account_id", "Region": "region"}],
  "ProductFields": {
    "IncidentDomain": "INFRA|APP",
    "TicketDestination": "TEAM"
  }
}

Alarm Naming Convention

All monitoring alarms follow the pattern: [DOMAIN]-[Component]-[Type]-[Severity]

Domain Types:

INFRA: Infrastructure alarms routed to infrastructure/operations teams
APP: Application alarms that can be routed to application development teams

Examples:

INFRA-SOAR-ASFF-Processor-SM-Failure-HIGH
INFRA-CombineLogFilesSM-Failure-HIGH
INFRA-SOAR-Weekly-AI-Report-SM-Failure-MEDIUM
APP-UserService-Lambda-Error-MEDIUM (hypothetical application alarm)

This naming convention enables:

Automatic severity extraction and incident routing
Incident domain classification (INFRA vs APP) for appropriate team assignment
Complete incident management coverage across infrastructure and application layers
Consistent AI prompt context and analysis

SOAR Platform Self-Monitoring

Core State Machine Monitoring (HIGH severity):

SOAR ASFF Processor failures
Auto-Remediation state machine failures
Incident processing failures

Operational Monitoring (MEDIUM/LOW severity):

Weekly AI Report generation failures
Hourly maintenance task failures
Control synchronization failures

Alarm Configuration:

Metric: ExecutionsFailed from AWS/States namespace
Threshold: >= 1 failed execution
Period: 60 seconds with 1 evaluation period
Treatment: Missing data treated as not breaching

Foundation Platform Self-Monitoring

Log Processing Infrastructure (HIGH severity):

Control Tower log aggregation failures
Historical log processing failures

Alarm Configuration: Same pattern as SOAR alarms, monitoring Step Functions execution failures

AI-Enhanced Incident Analysis

Recent enhancements (v2.2.1) provide comprehensive AI-powered incident analysis capabilities:

Intelligence Features:

Component-Specific Analysis: Deep understanding of each state machine and Lambda function's purpose
Context-Aware Prompts: AI prompts distinguish between individual failures vs. systemic issues
Impact Assessment: Explains operational impact and urgency levels based on component criticality
Targeted Recommendations: Provides specific troubleshooting steps tailored to each alarm type
Operational Continuity: Emphasizes that single failures don't indicate complete system breakdown

Enhanced Diagnostic Capabilities:

Failure Context: AI understands the role and dependencies of each failing component
Troubleshooting Guidance: Component-specific debugging procedures and common resolution steps
Automated Infrastructure Knowledge: Self-updating understanding of system architecture
Multi-Account Awareness: Context-aware analysis across different AWS accounts and environments

Analysis Inputs: The AI analysis receives the alarm-based ASFF finding and provides contextualized guidance based on:

Component type (SOAR core vs. operational vs. Foundation vs. application)
Severity level and expected operational impact
Component criticality and dependencies
Historical patterns and troubleshooting procedures
Account context and organizational structure

Limitations and Context Gaps

Current Context Available:

Alarm name, description, and basic metadata
Account and region information
Severity level and incident domain classification
Timestamp of alarm trigger

Missing Context for Enhanced Debugging:

Specific Step Functions execution ARN and failure details
CloudWatch Logs from failed Lambda functions
Actual metric values that triggered the alarm
Execution timeline and specific failure points
Resource-specific details beyond account-level information

This comprehensive alarm integration ensures that both SOAR and Foundation infrastructure failures are automatically detected, classified, and routed through the standard security incident response workflow with appropriate AI-powered analysis and operational guidance.

CloudWatch Alarm Context Enrichment Enhancement

Proposed Architecture Enhancement

To address the context limitations identified above, a CloudWatch Context Enrichment Function is proposed to provide AI-powered incident analysis with comprehensive debugging information and historical pattern analysis.

Enhanced Context Enrichment Function

Purpose: Enrich CloudWatch alarm-based incidents with detailed execution context, logs, and historical pattern analysis before AI processing.

Integration Point: Added to SOARIncidents state machine between instruction retrieval and AI analysis for both APP and INFRA domains.

State Machine Placement:

# After "Format Generic message" and before "Email to Whom?"
EnrichCloudWatchContext:
  Type: Task
  Resource: '${EnrichCloudWatchContextFunctionArn}'
  TimeoutSeconds: 300
  Condition: Finding type contains "CloudWatch Alarms"
  Next: Email to Whom?

Intelligent Enrichment Logic

Service-Aware Processing:

Step Functions Failures: Extract execution ARNs, failure details, execution history
Lambda Function Errors: Retrieve CloudWatch Logs, error messages, invocation details
Generic AWS Services: Basic metric and configuration data

Temporal Correlation:

Calculate exact alarm evaluation window using Period × EvaluationPeriods
Query failed executions/invocations within specific time window that triggered alarm
Eliminates false correlation with unrelated failures

Memory and Pattern Analysis:

Query DynamoDB incidents table using configurable retention period (IncidentExpirationInDays)
Analyze incident frequency across multiple time windows (7d, 30d, 90d, full retention)
Detect pattern trends: INCREASING, DECREASING, STABLE
Classify incidents: RARE_OCCURRENCE, FREQUENT_RECENT, RECURRING_PATTERN, ISOLATED_INCIDENT, BASELINE_DEGRADATION
Determine success baselines from recent execution history

Enhanced Context Data Structure

Technical Enrichment:

{
  "service_type": "stepfunctions|lambda|generic",
  "alarm_correlation": {
    "evaluation_window": {...},
    "total_executions_in_window": N
  },
  "failed_executions": [
    {
      "execution_arn": "...",
      "failure_details": {...},
      "logs_summary": "..."
    }
  ]
}

Pattern Analysis:

{
  "pattern_analysis": {
    "retention_period_days": 365,
    "incident_trends": {
      "recent_rate": 0.5,
      "trend_direction": "INCREASING|STABLE|DECREASING"
    },
    "pattern_classification": "FREQUENT_RECENT|ISOLATED_INCIDENT|...",
    "baseline_context": {
      "success_rate": 0.95,
      "last_successful_execution": "2024-01-01T12:00:00Z"
    }
  }
}

AI Enhancement Benefits

The enriched context enables AI analysis to provide:

Precise Debugging: "Step XYZ failed with error ABC in execution DEF, check log stream GHI for root cause JKL"
Pattern Awareness: "This is the 3rd occurrence in 7 days, suggesting systematic issue vs. isolated failure"
Baseline Comparison: "Success rate dropped from 95% to 60% vs. isolated failure in otherwise healthy system"
Historical Context: "Last successful execution 2 hours ago vs. no successful executions in 48 hours"
Trend Analysis: "Increasing failure frequency suggests underlying degradation vs. random operational issue"

Error Resilience and Performance

Graceful Degradation: Enrichment failures don't block incident processing
Intelligent Filtering: Only processes CloudWatch alarm findings with enrichable patterns
Configurable Timeouts: 300-second limit prevents workflow delays
Cost Optimization: Respects configured retention periods and query limits
Cross-Account Integration: Leverages existing SOAR IAM roles for API access

This enhancement transforms AI incident analysis from reactive debugging to pattern-aware, context-rich incident intelligence, significantly improving operational response quality and debugging efficiency.

Implementation Plan

Phase 1: Core Enrichment Function (Week 1-2)

Create Lambda Function
- functions/enrich_cloudwatch_context/app.py
- Core function structure with intelligent pattern detection
- Environment variables: INCIDENTS_TABLE_NAME, EXPIRATION_DAYS
Add Function to Template
- CloudFormation resource definition
- IAM permissions for DynamoDB, Step Functions, CloudWatch APIs
- Environment variable mapping
Basic Service Detection
- Step Functions alarm pattern matching
- Lambda function alarm pattern matching
- Alarm evaluation window calculation

Phase 2: Step Functions Integration (Week 2-3)

Step Functions Enrichment
- Execution ARN extraction from alarm configuration
- Failed execution queries within evaluation window
- Execution history and failure point identification
- CloudWatch Logs integration for state machine logs
Lambda Functions Enrichment
- Function name extraction from alarm configuration
- CloudWatch Logs queries for error patterns
- Invocation failure correlation with alarm timing

Phase 3: Memory and Pattern Analysis (Week 3-4)

DynamoDB Integration
- Query incidents table using configurable retention period
- Multi-timeframe analysis (7d, 30d, 90d, full retention)
- Incident frequency and trend calculation
Pattern Classification
- Success baseline analysis from execution history
- Incident pattern categorization logic
- Trend direction detection (increasing/stable/decreasing)

Phase 4: State Machine Integration (Week 4-5)

SOARIncidents State Machine Updates
- Add EnrichCloudWatchContext state after "Format Generic message"
- Conditional execution for CloudWatch alarm findings only
- Error handling with graceful degradation
Testing and Validation
- Unit tests for enrichment logic
- Integration tests with sample alarm events
- Performance testing with large DynamoDB datasets

Phase 5: AI Prompt Enhancement (Week 5-6)

AI Prompt Updates
- Update incident_infra.txt to utilize enriched context
- Update incident_app.txt for application incidents
- Test AI analysis quality with enriched data
Documentation and Deployment
- Update architecture documentation
- Deployment procedures and rollback plans
- Performance monitoring and alerting

Implementation Files

New Files to Create:

functions/enrich_cloudwatch_context/app.py - Main enrichment function
functions/enrich_cloudwatch_context/requirements.txt - Dependencies
tests/test_enrich_cloudwatch_context.py - Unit tests

Files to Modify:

template.yaml - Add function definition and state machine updates
statemachines/incidents.asl.yaml - Add enrichment state
ai-prompts/incident_infra.txt - Enhanced prompts for enriched context
ai-prompts/incident_app.txt - Enhanced prompts for enriched context

Success Metrics

Enrichment Coverage: % of CloudWatch alarms receiving enrichment
AI Quality Improvement: Subjective assessment of AI recommendations with enriched context
Performance Impact: Enrichment function execution time and state machine latency
Pattern Detection Accuracy: Validation of pattern classifications against known incident trends

Integration Points

Security Hub: Primary source of findings
GuardDuty: Threat detection integration
CloudWatch: Infrastructure monitoring and alarm generation
EventBridge: Alarm event routing and processing
External Ticketing: Can connect to external systems
Microsoft Sentinel: Optional integration for enterprise SOC environments

Future Extensibility

The modular architecture allows for:

Auto-Remediation Expansion: Adding new security controls and services
Security Service Integration: Additional AWS security services
AI Enhancement: Advanced threat analysis and response recommendations
Custom Playbooks: Organization-specific incident response procedures
Third-Party Integrations: SIEM, SOAR, and ticketing system connectors

Uh oh!

FilesExpand file tree

ARCHITECTURE.md

Latest commit

History