CloudSeer - Master Project Context

Date: March 28, 2026
Purpose: Single source-of-truth handoff document for AI assistants and new contributors.

1) Executive Summary

CloudSeer is a cloud cost intelligence platform that turns AWS telemetry into operations decisions.

Current end-to-end loop:

  1. Pull EC2 CPU from CloudWatch on a timer.
  2. Estimate cost per polling interval.
  3. Store time-series points in SQLite.
  4. Run anomaly detection and forecasting in API routes.
  5. Surface actions in a React dashboard.
  6. Trigger one-click remediation (stop EC2 instance).
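
Step 2 of the loop can be sketched as a proration of the fixed hourly rate over one polling interval. The rate (0.0116) and the 60-second interval come from this document; the function name `estimate_interval_cost` is hypothetical and may not match what backend/aws/cost_estimator.py actually exposes.

```python
# Hypothetical helper; the real estimator lives in backend/aws/cost_estimator.py.
EC2_HOURLY_RATE_USD = 0.0116   # static demo rate documented in section 6
POLL_INTERVAL_SECONDS = 60     # poll cadence documented in section 3


def estimate_interval_cost(hourly_rate: float = EC2_HOURLY_RATE_USD,
                           interval_seconds: int = POLL_INTERVAL_SECONDS) -> float:
    """Prorate an hourly rate over one polling interval."""
    if interval_seconds < 0:
        raise ValueError("interval_seconds must be non-negative")
    return hourly_rate * interval_seconds / 3600.0


if __name__ == "__main__":
    print(round(estimate_interval_cost(), 6))
```

Under these assumptions one interval costs about 0.000193 USD, which agrees with the cost_usd value in the GET /api/metrics sample in section 5.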

CloudSeer is in an MVP-plus stage: technically integrated, demo-ready, and extensible, with known hardcoded constraints documented below.

2) Product Positioning

CloudSeer is designed to move teams from reactive cloud cost review to proactive cloud cost control.

  • Reactive mode: discover overruns late, investigate manually, fix slowly.
  • CloudSeer mode: detect drift early, forecast risk, explain urgency, trigger action.

Core buyer value:

  • Faster anomaly response.
  • Less alert fatigue.
  • Direct path from insight to remediation.
  • Traceable savings narrative.

3) What Exists in Code Right Now

Backend

  • FastAPI app with CORS enabled globally.
  • Async startup polling loop in backend/api/main.py.
  • Poll interval: 60 seconds.
  • CloudWatch lookup window: last 15 minutes, period 300 seconds.
  • SQLite store class: TimeSeriesStore in backend/db/timeseries_store.py.
  • Cost estimator: fixed EC2 hourly rate in backend/aws/cost_estimator.py.
  • API routes mounted under /api:
    • metrics
    • forecast
    • anomalies
    • remediate
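
As a rough illustration of the polling parameters above, the sketch below builds the keyword arguments for a CloudWatch `get_metric_statistics` call and picks the newest datapoint, falling back when none are returned. The helper names `build_cpu_query` and `latest_cpu`, the choice of the `Average` statistic, and the fallback value are assumptions; the actual poller in backend/api/main.py may differ.

```python
from datetime import datetime, timedelta, timezone

INSTANCE_ID = "i-0da3659219976da09"   # hardcoded ID noted in this document
LOOKBACK = timedelta(minutes=15)      # CloudWatch lookup window
PERIOD_SECONDS = 300                  # CloudWatch aggregation period


def build_cpu_query(instance_id: str, now: datetime) -> dict:
    """Build keyword arguments for CloudWatch get_metric_statistics."""
    return {
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "StartTime": now - LOOKBACK,
        "EndTime": now,
        "Period": PERIOD_SECONDS,
        "Statistics": ["Average"],   # assumption: the poller reads the average
    }


def latest_cpu(datapoints: list[dict], fallback: float) -> float:
    """Return the newest Average value, or the fallback if there are none."""
    if not datapoints:
        return fallback
    newest = max(datapoints, key=lambda d: d["Timestamp"])
    return float(newest["Average"])


if __name__ == "__main__":
    query = build_cpu_query(INSTANCE_ID, datetime.now(timezone.utc))
    print(query["StartTime"], query["EndTime"])
```

With boto3 installed and credentials configured, the query could be passed as `boto3.client("cloudwatch").get_metric_statistics(**query)`, and the `Datapoints` list from the response handed to `latest_cpu`.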

Frontend

  • React + Vite app with route layout architecture.
  • Routes:
    • /
    • /anomalies
    • /automation
    • /resources/:id
    • /reports
  • Data fetch layer in frontend/src/api/client.js.
  • Dashboard refresh cadence: every 10 seconds.
  • Recharts visualizations and Framer Motion animations throughout the operator experience.

ML

  • Anomaly module: ml/anomaly/isolation_forest.py.
  • Forecast module: ml/forecasting/prophet_model.py.
  • Synthetic history seeding: ml/forecasting/synthetic_history.py.
  • Preprocessor utilities: ml/pipeline/data_preprocessor.py.
  • ML execution is integrated in live backend endpoints, not notebook-only.

4) Architecture (Runtime Data Path)

AWS CloudWatch (EC2 CPU)
  -> backend/api/main.py (poll_aws, every 60s)
  -> backend/aws/cost_estimator.py (fixed-rate cost increment)
  -> backend/db/timeseries_store.py (SQLite cloudseer.db)
  -> backend/api/routes/*.py
      /api/metrics
      /api/forecast
      /api/anomalies
      /api/remediate
  -> frontend/src/api/client.js
  -> frontend routed pages and components

ML invocation path:

  • /api/anomalies -> detect_anomalies(metrics)
  • /api/forecast -> train_and_forecast(historical_data, horizon_minutes=60)

5) API Contract Snapshot

Base URL: http://localhost:8000

GET /

Returns service status and exposed endpoint list.

GET /api/metrics

Returns metrics grouped by (resource_id, resource_type):

{
  "resources": [
    {
      "id": "i-0da3659219976da09",
      "type": "ec2",
      "metrics": [
        {
          "timestamp": "2026-03-28T10:01:00Z",
          "cpu": 14.31,
          "cost_usd": 0.000193,
          "invocations": null
        }
      ]
    }
  ]
}
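
The grouping by (resource_id, resource_type) can be sketched as follows. This is a minimal illustration, assuming each stored row is a dict keyed by the metrics column names from section 7; `group_metrics` is a hypothetical name, not necessarily the route's actual helper.

```python
from collections import defaultdict


def group_metrics(rows: list[dict]) -> dict:
    """Group flat metric rows into the /api/metrics response shape."""
    grouped = defaultdict(list)
    for row in rows:
        key = (row["resource_id"], row["resource_type"])
        grouped[key].append({
            "timestamp": row["timestamp"],
            "cpu": row["cpu"],
            "cost_usd": row["cost_usd"],
            "invocations": row["invocations"],
        })
    # Dicts preserve insertion order, so resources appear in first-seen order.
    return {
        "resources": [
            {"id": resource_id, "type": resource_type, "metrics": points}
            for (resource_id, resource_type), points in grouped.items()
        ]
    }
```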

GET /api/forecast?resource_id=<optional>

Returns Prophet forecast and spike signal:

{
  "resource_id": null,
  "forecast": [
    {
      "timestamp": "2026-03-28T10:02:00Z",
      "predicted_cost": 0.00021,
      "lower": 0.00018,
      "upper": 0.00024
    }
  ],
  "spike_warning": false,
  "spike_at": null
}
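
This document does not state how spike_warning and spike_at are derived. One plausible rule, shown purely as an assumption, flags the first forecast point whose predicted_cost exceeds a multiple of a baseline cost. The function name `find_spike`, the `baseline_cost` input, and the 1.5 factor are all hypothetical.

```python
def find_spike(forecast: list[dict], baseline_cost: float,
               factor: float = 1.5) -> tuple:
    """Return (spike_warning, spike_at) for a list of forecast points.

    Assumed rule: a spike is the first point whose predicted_cost is
    strictly greater than baseline_cost * factor.
    """
    threshold = baseline_cost * factor
    for point in forecast:
        if point["predicted_cost"] > threshold:
            return True, point["timestamp"]
    return False, None
```

With the sample point above (predicted_cost 0.00021) and a baseline of 0.000193, the threshold is 0.0002895, so this rule reports no spike, consistent with the sample's `"spike_warning": false`.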

GET /api/anomalies

Returns only the items that the ML filter flags as anomalous:

{
  "anomalies": [
    {
      "resource_id": "i-0da3659219976da09",
      "type": "idle_instance",
      "confidence": 0.94,
      "cost_impact_usd": 1.37,
      "recommended_action": "stop_instance",
      "auto_execute": false,
      "claude_summary": "..."
    }
  ]
}

POST /api/remediate

The current backend implementation accepts a payload describing the intent (resource_id, recommended_action), records before_cost, derives after_cost, triggers the remediation, and logs an entry to the remediations SQLite table. Example response:

{
  "success": true,
  "before_cost": 0.0116,
  "after_cost": 0.0,
  "action_taken": "stop_instance",
  "resource_id": "i-0da3659219976da09",
  "status": "success",
  "details": {
    "status": "stopped",
    "instance_id": "i-0da3659219976da09"
  }
}
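
A minimal sketch of how such a response could be assembled and flattened for the remediations table. The after_cost rule (zero after stop_instance, otherwise unchanged), the JSON encoding of details, and both function names are assumptions made for illustration; the route's real logic may differ.

```python
import json


def build_remediation_result(resource_id: str, action: str,
                             before_cost: float, details: dict) -> dict:
    """Assemble a response in the shape shown above."""
    # Assumption: a stopped instance accrues no further compute cost;
    # any other action leaves the cost unchanged in this sketch.
    after_cost = 0.0 if action == "stop_instance" else before_cost
    return {
        "success": True,
        "before_cost": before_cost,
        "after_cost": after_cost,
        "action_taken": action,
        "resource_id": resource_id,
        "status": "success",
        "details": details,
    }


def to_remediation_row(result: dict, timestamp: str) -> tuple:
    """Flatten a result into remediations column order (id omitted)."""
    return (
        timestamp,
        result["resource_id"],
        result["action_taken"],
        result["status"],
        result["before_cost"],
        result["after_cost"],
        json.dumps(result["details"]),  # assumption: details stored as JSON text
    )
```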

GET /api/remediations

Returns the historical log of all system-applied remediations from the remediations SQLite table, for display in the dashboard tables.

6) Important Hardcoded Values and Behavioral Constraints

  • Hardcoded instance ID in backend poller and remediation route: i-0da3659219976da09.
  • Frontend API base URL hardcoded to http://localhost:8000.
  • Cost model uses static EC2 hourly rate (0.0116) for demo simplicity.
  • Poll loop uses fallback CPU value when AWS metric datapoints are temporarily absent.
  • Forecast route trains Prophet in-request (no persisted model cache yet).
  • Anomalies route attempts Anthropic summary generation with 120-second in-memory cache; falls back to deterministic text on any exception.

7) Data Model

SQLite tables:

  1. metrics

Columns:

  • id INTEGER PRIMARY KEY AUTOINCREMENT
  • timestamp TEXT
  • resource_id TEXT
  • resource_type TEXT
  • cpu REAL
  • cost_usd REAL
  • invocations REAL

Read pattern: ascending by timestamp with optional limit.

  2. remediations

Columns:

  • id INTEGER PRIMARY KEY AUTOINCREMENT
  • timestamp TEXT
  • resource_id TEXT
  • action_taken TEXT
  • status TEXT
  • before_cost REAL
  • after_cost REAL
  • details TEXT
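
For reference, the two tables and the metrics read pattern can be expressed with the standard-library sqlite3 module as sketched below. Column definitions follow the lists above; the function names are hypothetical, and whether TimeSeriesStore applies the limit to the oldest or the newest rows is not stated, so this sketch simply takes the first N rows in ascending order.

```python
import sqlite3
from typing import Optional

SCHEMA = """
CREATE TABLE IF NOT EXISTS metrics (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    timestamp TEXT,
    resource_id TEXT,
    resource_type TEXT,
    cpu REAL,
    cost_usd REAL,
    invocations REAL
);
CREATE TABLE IF NOT EXISTS remediations (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    timestamp TEXT,
    resource_id TEXT,
    action_taken TEXT,
    status TEXT,
    before_cost REAL,
    after_cost REAL,
    details TEXT
);
"""


def init_db(conn: sqlite3.Connection) -> None:
    """Create both tables if they do not exist yet."""
    conn.executescript(SCHEMA)


def read_metrics(conn: sqlite3.Connection, limit: Optional[int] = None) -> list:
    """Read metric rows ascending by timestamp, with an optional limit."""
    sql = ("SELECT timestamp, resource_id, resource_type, cpu, cost_usd, "
           "invocations FROM metrics ORDER BY timestamp ASC")
    params: tuple = ()
    if limit is not None:
        sql += " LIMIT ?"
        params = (limit,)
    return conn.execute(sql, params).fetchall()
```

Because timestamps are stored as TEXT, ascending order is chronological only if every row uses the same ISO 8601 format, as in the API samples.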

8) Stack and Versions

Backend (backend/requirements.txt):

  • boto3
  • fastapi
  • uvicorn
  • pydantic
  • python-dotenv

ML (ml/requirements.txt):

  • numpy==2.4.3
  • pandas==3.0.1
  • scikit-learn==1.8.0
  • prophet==1.3.0

Frontend (frontend/package.json):

  • React 18
  • Vite 5
  • react-router-dom 7.x
  • Recharts
  • Framer Motion
  • Tailwind CSS

9) Runbook

Backend

cd backend
pip install -r requirements.txt
uvicorn api.main:app --reload --port 8000

Frontend

cd frontend
npm install
npm run dev

ML Environment

cd ml
pip install -r requirements.txt

10) Known Gaps and Risks

  1. Single-resource focus: runtime flow is effectively one EC2 instance.
  2. No auth, tenancy, or RBAC boundary in API layer.
  3. Forecasting retrains on each request and may scale poorly under load.
  4. Reports page is routed but still offers little operational depth.

11) Near-Term Execution Priorities

  1. Externalize all hardcoded IDs/URLs to env and config.
  2. Introduce model caching and basic inference telemetry.
  3. Expand resource coverage beyond EC2 to Lambda/S3.
  4. Add authentication and org/tenant scoping.
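
Priority 1 could start with a small settings loader on the backend. The sketch below is one possible shape, not an existing module: the `Settings` class, the `load_settings` function, and the `CLOUDSEER_*` variable names are hypothetical, and the defaults are the hardcoded values listed in section 6.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    instance_id: str
    poll_interval_seconds: int
    ec2_hourly_rate_usd: float


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read settings from environment variables, keeping current defaults."""
    return Settings(
        instance_id=env.get("CLOUDSEER_INSTANCE_ID", "i-0da3659219976da09"),
        poll_interval_seconds=int(env.get("CLOUDSEER_POLL_INTERVAL_SECONDS", "60")),
        ec2_hourly_rate_usd=float(env.get("CLOUDSEER_EC2_HOURLY_RATE_USD", "0.0116")),
    )
```

Keeping the existing values as defaults means behavior is unchanged until a variable is set, which suits a demo environment. Since python-dotenv is already in backend/requirements.txt, a `.env` file could feed these variables.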

12) Claude Prompt Starter (Copy/Paste)

Use this when handing context to Claude:

You are joining the CloudSeer codebase (AWS cost intelligence platform).
Read PROJECT_MASTER.md, AI_CONTEXT_BACKEND.md, AI_CONTEXT_FRONTEND.md, AI_CONTEXT_ML.md, and AI_CONTEXT_DESIGN.md.
Assume current date March 28, 2026.
Focus on code-accurate behavior, especially hardcoded instance remediation, API contracts, and ML route integration.
When suggesting changes, preserve existing endpoint response shapes unless explicitly asked to break compatibility.