947 changes: 947 additions & 0 deletions INITIAL-13.md

Large diffs are not rendered by default.

855 changes: 855 additions & 0 deletions PRPs/PRP-13-data-seeder-dashboard.md

Large diffs are not rendered by default.

52 changes: 51 additions & 1 deletion README.md
@@ -11,6 +11,7 @@ Portfolio-grade end-to-end retail demand forecasting system.
- **Dashboard**: React 19 + Vite + Tailwind CSS 4 + shadcn/ui for data exploration and model management
- **RAG Knowledge Base**: Postgres pgvector embeddings + evidence-grounded answers with citations
- **Agentic Layer**: PydanticAI agents for autonomous experimentation and evidence-grounded Q&A with human-in-the-loop approval
- **Data Seeder (The Forge)**: Reproducible synthetic data generator with realistic time-series patterns, scenario presets, and retail effects

## Quick Start

@@ -154,7 +155,9 @@ pnpm preview
```
app/ # FastAPI backend
├── core/ # Config, database, logging, middleware, exceptions
├── shared/ # Pagination, timestamps, error schemas
├── shared/
│ ├── seeder/ # The Forge - randomized database seeder
│ └── ... # Pagination, timestamps, error schemas
├── features/
│ ├── data_platform/ # Store, product, calendar, sales tables
│ ├── ingest/ # Batch upsert endpoints for sales data
@@ -187,6 +190,7 @@ examples/
├── queries/ # Example SQL queries
├── models/ # Baseline model examples (naive, seasonal_naive, moving_average)
├── backtest/ # Backtesting examples (run_backtest, inspect_splits, metrics_demo)
├── seed/ # Data seeder configs and examples (YAML scenarios)
├── compute_features_demo.py # Feature engineering demo
└── registry_demo.py # Model registry workflow demo
scripts/ # Utility scripts
@@ -640,6 +644,52 @@ AGENT_APPROVAL_TIMEOUT_MINUTES=60
AGENT_ENABLE_STREAMING=true
```

### Data Seeder (The Forge)

Generate reproducible synthetic test data with realistic time-series patterns.

**CLI Commands:**
```bash
# Generate complete dataset
uv run python scripts/seed_random.py --full-new --seed 42 --confirm

# Delete all data
uv run python scripts/seed_random.py --delete --confirm

# Append data for new date range
uv run python scripts/seed_random.py --append --start-date 2025-01-01 --end-date 2025-03-31

# Run pre-built scenario
uv run python scripts/seed_random.py --full-new --scenario holiday_rush --confirm

# Show current data counts
uv run python scripts/seed_random.py --status

# Verify data integrity
uv run python scripts/seed_random.py --verify
```

**Scenario Presets:**

| Scenario | Description |
|----------|-------------|
| `retail_standard` | Normal retail patterns with mild seasonality |
| `holiday_rush` | Q4 surge with Black Friday/Christmas peaks |
| `high_variance` | Noisy data with anomalies for robustness testing |
| `stockout_heavy` | Frequent stockouts (25% probability) |
| `new_launches` | 100 products with launch ramp patterns |
| `sparse` | 50% missing combinations, random gaps |

**Features:**
- Deterministic generation with configurable seeds for reproducibility
- Realistic time-series patterns (trend, weekly/monthly seasonality, noise, anomalies)
- Retail effects (promotions, stockouts, price elasticity)
- YAML configuration support for custom scenarios
- Safe deletion with scope control (all/facts/dimensions)
- Dry-run mode for previewing changes
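
As a rough illustration of how seeded generation, trend, weekly seasonality, noise, and stockouts can fit together, here is a minimal standard-library sketch. It is not The Forge's implementation: the function name `generate_daily_sales` and all of its parameters are hypothetical and chosen only for this example.

```python
import math
import random
from datetime import date, timedelta


def generate_daily_sales(
    start: date,
    days: int,
    seed: int,
    base: float = 100.0,
    trend_per_day: float = 0.1,
    weekly_amplitude: float = 0.2,
    noise_sd: float = 5.0,
    stockout_probability: float = 0.0,
) -> list[tuple[date, int]]:
    """Return (date, units_sold) pairs from a seeded pseudo-random process.

    Hypothetical helper for illustration; the real seeder's API may differ.
    """
    rng = random.Random(seed)  # local RNG: the same seed yields the same series
    series = []
    for i in range(days):
        day = start + timedelta(days=i)
        level = base + trend_per_day * i  # linear trend
        # Weekly seasonality: a sine wave with a 7-day period.
        seasonal = 1.0 + weekly_amplitude * math.sin(2 * math.pi * day.weekday() / 7)
        noise = rng.gauss(0.0, noise_sd)
        # Draw the stockout flag on every day so the random stream stays aligned.
        stocked_out = rng.random() < stockout_probability
        units = 0 if stocked_out else max(0, round(level * seasonal + noise))
        series.append((day, units))
    return series


first = generate_daily_sales(date(2025, 1, 1), days=28, seed=42)
second = generate_daily_sales(date(2025, 1, 1), days=28, seed=42)
print(first == second)  # identical seeds reproduce the series
```

Using a dedicated `random.Random(seed)` instance instead of the module-level functions keeps the generator independent of any other code that touches the global random state, which is one common way to get reproducible runs.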

See [examples/seed/README.md](examples/seed/README.md) for detailed configuration options.

### Error Responses (RFC 7807)

All error responses follow RFC 7807 Problem Details format with `Content-Type: application/problem+json`:
9 changes: 9 additions & 0 deletions app/features/seeder/__init__.py
@@ -0,0 +1,9 @@
"""Seeder feature module for managing synthetic data generation via REST API.

This feature provides REST endpoints for the Data Seeder (The Forge),
allowing management of synthetic test data through the dashboard.
"""

from app.features.seeder.routes import router

__all__ = ["router"]
262 changes: 262 additions & 0 deletions app/features/seeder/routes.py
@@ -0,0 +1,262 @@
"""FastAPI routes for seeder operations.

Provides REST endpoints for managing synthetic data generation
through the dashboard admin panel.
"""

from fastapi import APIRouter, Depends, HTTPException, status
from sqlalchemy.ext.asyncio import AsyncSession

from app.core.config import get_settings
from app.core.database import get_db
from app.core.logging import get_logger
from app.features.seeder import schemas, service

router = APIRouter(prefix="/seeder", tags=["seeder"])
logger = get_logger(__name__)


def _check_seeder_enabled() -> None:
    """Check if seeder operations are allowed in current environment.

    Raises:
        HTTPException: If seeder is disabled in production.
    """
    settings = get_settings()
    if not settings.seeder_allow_production and settings.app_env == "production":
        raise HTTPException(
            status_code=status.HTTP_403_FORBIDDEN,
            detail="Seeder operations are not allowed in production environment. "
            "Set SEEDER_ALLOW_PRODUCTION=true to enable (not recommended).",
        )


@router.get(
    "/status",
    response_model=schemas.SeederStatus,
    summary="Get database status",
    description="Returns current row counts for all tables and date range metadata.",
)
async def get_status(
    db: AsyncSession = Depends(get_db),
) -> schemas.SeederStatus:
    """Get current database row counts and metadata.

    Returns counts for all dimension and fact tables, plus date range
    information from sales_daily.
    """
    return await service.get_status(db)


@router.get(
    "/scenarios",
    response_model=list[schemas.ScenarioInfo],
    summary="List scenario presets",
    description="Returns available scenario presets with their default configurations.",
)
async def list_scenarios() -> list[schemas.ScenarioInfo]:
    """List available scenario presets.

    Returns pre-built scenarios like retail_standard, holiday_rush, etc.
    with their default configurations.
    """
    return service.list_scenarios()


@router.post(
    "/generate",
    response_model=schemas.GenerateResult,
    status_code=status.HTTP_201_CREATED,
    summary="Generate new dataset",
    description="Generate a complete synthetic dataset. Requires confirmation in non-dev environments.",
)
async def generate_data(
    params: schemas.GenerateParams,
    db: AsyncSession = Depends(get_db),
) -> schemas.GenerateResult:
    """Generate a new synthetic dataset from scratch.

    This will create stores, products, calendar, sales, inventory,
    price history, and promotions based on the selected scenario.

    **Warning:** This operation may take several minutes for large datasets.

    Args:
        params: Generation parameters including scenario and seed.

    Returns:
        GenerateResult with counts of created records.

    Raises:
        HTTPException: If operation fails or is blocked.
    """
    _check_seeder_enabled()

    try:
        return await service.generate_data(db, params)
    except ValueError as e:
        logger.error(
            "seeder.generate.failed",
            error=str(e),
            error_type=type(e).__name__,
        )
        raise HTTPException(
            status_code=status.HTTP_400_BAD_REQUEST,
            detail=str(e),
        ) from e
    except Exception as e:
        logger.error(
            "seeder.generate.failed",
            error=str(e),
            error_type=type(e).__name__,
            exc_info=True,
        )
        raise HTTPException(
            status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
            detail=f"Generation failed: {e}",
        ) from e
Comment on lines +107 to +117
⚠️ Potential issue | 🟠 Major

Avoid leaking internal exception details in 500 responses.
Line 115/Line 170/Line 225/Line 261 include raw exception messages in detail, which can expose internals. Prefer a generic error message and rely on structured logs for diagnostics.

🔒 Suggested fix (apply similarly to append/delete/verify)
-        raise HTTPException(
-            status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
-            detail=f"Generation failed: {e}",
-        ) from e
+        raise HTTPException(
+            status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
+            detail="Generation failed.",
+        ) from e

Also applies to: 161-171, 216-226, 252-262

🤖 Prompt for AI Agents
In `@app/features/seeder/routes.py` around lines 107 - 117, The handler currently
returns raw exception text in the HTTPException detail (see the block using
logger.error and raising HTTPException with
status.HTTP_500_INTERNAL_SERVER_ERROR); change the raised HTTPException to use a
generic message (e.g., "Internal server error" or "Generation failed") instead
of f"Generation failed: {e}" and keep the existing structured
logger.error(exc_info=True, error=str(e), error_type=type(e).__name__) so
diagnostics remain in logs; apply the same change to the other handlers
referenced (append/delete/verify) that raise HTTPException with interpolated
exception messages.
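
To make the suggestion concrete, here is a framework-free sketch of the pattern: full details go to the log, and the client only sees a generic message. `ApiError` is a hypothetical stand-in for FastAPI's `HTTPException`, and `run_guarded` is an illustrative helper that does not exist in this PR; the standard-library `logging` module is used here in place of the project's structured logger.

```python
import logging

logger = logging.getLogger("seeder")


class ApiError(Exception):
    """Hypothetical stand-in for fastapi.HTTPException (not the real class)."""

    def __init__(self, status_code: int, detail: str) -> None:
        super().__init__(detail)
        self.status_code = status_code
        self.detail = detail


def run_guarded(operation, public_message: str):
    """Run `operation` and map failures to client-safe errors."""
    try:
        return operation()
    except ValueError as exc:
        # Validation-style failures: the message is meant for the caller.
        raise ApiError(400, str(exc)) from exc
    except Exception as exc:
        # Unexpected failures: the traceback is logged, never returned.
        logger.error(
            "%s error_type=%s", public_message, type(exc).__name__, exc_info=True
        )
        raise ApiError(500, public_message) from exc


def failing_operation():
    raise RuntimeError("connection to db-internal:5432 refused")


try:
    run_guarded(failing_operation, "Generation failed.")
except ApiError as err:
    print(err.status_code, err.detail)
```

The 400 branch still forwards `str(exc)`, which matches the existing handlers; that is only safe as long as the service layer raises `ValueError` with messages written for API consumers.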



@router.post(
    "/append",
    response_model=schemas.GenerateResult,
    status_code=status.HTTP_201_CREATED,
    summary="Append data",
    description="Append new data to existing dataset for a specified date range.",
)
async def append_data(
    params: schemas.AppendParams,
    db: AsyncSession = Depends(get_db),
) -> schemas.GenerateResult:
    """Append data to existing dataset.

    Uses existing dimension tables (stores, products) and generates
    new fact records (sales, inventory, etc.) for the specified date range.

    Requires existing dimensions. Run /generate first if database is empty.

    Args:
        params: Append parameters with date range.

    Returns:
        GenerateResult with counts of appended records.

    Raises:
        HTTPException: If no dimensions exist or operation fails.
    """
    _check_seeder_enabled()

    try:
        return await service.append_data(db, params)
    except ValueError as e:
        logger.error(
            "seeder.append.failed",
            error=str(e),
            error_type=type(e).__name__,
        )
        raise HTTPException(
            status_code=status.HTTP_400_BAD_REQUEST,
            detail=str(e),
        ) from e
    except Exception as e:
        logger.error(
            "seeder.append.failed",
            error=str(e),
            error_type=type(e).__name__,
            exc_info=True,
        )
        raise HTTPException(
            status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
            detail=f"Append failed: {e}",
        ) from e


@router.delete(
    "/data",
    response_model=schemas.DeleteResult,
    summary="Delete data",
    description="Delete data with specified scope. Supports dry_run preview.",
)
async def delete_data(
    params: schemas.DeleteParams,
    db: AsyncSession = Depends(get_db),
) -> schemas.DeleteResult:
    """Delete data with specified scope.

    Scopes:
    - `all`: Delete everything (dimensions + facts)
    - `facts`: Delete only fact tables (sales, inventory, prices, promotions)
    - `dimensions`: Delete dimension tables (also deletes facts due to FK constraints)

    Use `dry_run=true` to preview what would be deleted without executing.

    Args:
        params: Delete parameters with scope and dry_run flag.

    Returns:
        DeleteResult with counts of deleted records.

    Raises:
        HTTPException: If operation is blocked or fails.
    """
    _check_seeder_enabled()

    try:
        return await service.delete_data(db, params)
    except ValueError as e:
        logger.error(
            "seeder.delete.failed",
            error=str(e),
            error_type=type(e).__name__,
        )
        raise HTTPException(
            status_code=status.HTTP_400_BAD_REQUEST,
            detail=str(e),
        ) from e
    except Exception as e:
        logger.error(
            "seeder.delete.failed",
            error=str(e),
            error_type=type(e).__name__,
            exc_info=True,
        )
        raise HTTPException(
            status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
            detail=f"Delete failed: {e}",
        ) from e


@router.post(
    "/verify",
    response_model=schemas.VerifyResult,
    summary="Verify data integrity",
    description="Run data integrity checks on current database content.",
)
async def verify_data(
    db: AsyncSession = Depends(get_db),
) -> schemas.VerifyResult:
    """Run data integrity verification.

    Checks performed:
    - Foreign key integrity (sales reference valid stores/products/dates)
    - Non-negative constraints (quantities, prices >= 0)
    - Calendar date coverage (no gaps in date sequence)
    - Data presence (sales data exists)
    - Dimension completeness (stores, products, calendar populated)

    Returns:
        VerifyResult with pass/fail status for each check.
    """
    try:
        return await service.verify_data(db)
    except Exception as e:
        logger.error(
            "seeder.verify.failed",
            error=str(e),
            error_type=type(e).__name__,
            exc_info=True,
        )
        raise HTTPException(
            status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
            detail=f"Verification failed: {e}",
        ) from e
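
Because `delete_data` takes its parameters as a request body on a `DELETE` route, a client has to send JSON with the request rather than query parameters. The sketch below builds such a request with the standard library only and does not send it. Several things are assumptions: the base URL, the absence of any extra path prefix in front of `/seeder`, and the field names `scope` and `dry_run`, which are inferred from the docstring because the `DeleteParams` schema is not shown in this diff.

```python
import json
import urllib.request

# Assumption: host, port, and any API prefix depend on the deployment.
BASE_URL = "http://localhost:8123"


def build_delete_request(scope: str = "facts", dry_run: bool = True) -> urllib.request.Request:
    """Build (but do not send) a DELETE /seeder/data request with a JSON body.

    Field names are inferred from the route docstring, not from the schema.
    """
    body = json.dumps({"scope": scope, "dry_run": dry_run}).encode("utf-8")
    return urllib.request.Request(
        url=f"{BASE_URL}/seeder/data",
        data=body,
        method="DELETE",
        headers={"Content-Type": "application/json"},
    )


request = build_delete_request(scope="facts", dry_run=True)
print(request.get_method(), request.full_url)
# To execute against a running server: urllib.request.urlopen(request)
```

Defaulting `dry_run` to `True` in the helper means a caller has to opt in explicitly before anything is deleted. Some HTTP clients and intermediaries handle bodies on `DELETE` inconsistently, so it is worth checking that the body actually reaches the server in your setup.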