Skip to content

chore(main): release backtesting module#33

Merged
w7-mgfcode merged 5 commits into
mainfrom
dev
Feb 1, 2026
Merged

chore(main): release backtesting module#33
w7-mgfcode merged 5 commits into
mainfrom
dev

Conversation

@w7-mgfcode
Copy link
Copy Markdown
Owner

@w7-mgfcode w7-mgfcode commented Feb 1, 2026

Summary

Release the backtesting module (PRP-6) to main.

Changes included

  • Complete time-series backtesting infrastructure
  • TimeSeriesSplitter with expanding/sliding window strategies
  • MetricsCalculator with MAE, sMAPE, WAPE, Bias, Stability Index
  • BacktestingService with baseline comparisons
  • POST /backtesting/run endpoint
  • 111 tests (95 unit + 16 integration)
  • Documentation updates

Release Notes

Features

  • feat(backtesting): implement time-series backtesting module (PRP-6)

Tests

  • test(backtesting): add integration tests for routes and service

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Backtesting system with expanding and sliding window split strategies
    • Gap parameter to simulate operational data latency
    • Comprehensive forecast metrics: MAE, sMAPE, WAPE, Forecast Bias, Stability Index
    • Automated baseline model comparisons (Naive, Seasonal Naive)
    • Per-fold result storage for downstream visualization
  • Documentation

    • Added backtesting API documentation and example scripts

✏️ Tip: You can customize this high-level summary in your review settings.

)

* docs: update INITIAL-6.md

* feat(backtesting): implement time-series backtesting module (PRP-6)

Add complete backtesting infrastructure for model evaluation:
- TimeSeriesSplitter with expanding/sliding window strategies and gap support
- MetricsCalculator with MAE, sMAPE, WAPE, Bias, and Stability Index
- BacktestingService for orchestrating backtests with baseline comparisons
- POST /backtesting/run endpoint with full response schema
- 95 unit tests covering schemas, splitter, metrics, and service
- Example scripts for API usage, split visualization, and metrics demo

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* docs: update documentation for backtesting module (PRP-6)

- README.md: Add backtesting endpoint, examples, and project structure
- ARCHITECTURE.md: Mark backtesting as implemented with full details

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* chore: update uv.lock version to 0.1.7

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* test(backtesting): add integration tests for routes and service (PRP-6)

Add 16 integration tests that run against real PostgreSQL database:
- 8 route tests for POST /backtesting/run endpoint
- 8 service tests for BacktestingService._load_series_data

Tests use @pytest.mark.integration marker and require docker-compose.
Test data: 120 days of sequential sales (quantity = day number 1-120).

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* fix(backtesting): fix integration test fixtures and format examples

- Use savepoint-based transaction isolation instead of table drop/create
- Fix client dependency override to use async generator
- Format example files (inspect_splits.py, metrics_demo.py)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* fix(backtesting): simplify db_session fixture for CI compatibility

Remove complex savepoint-based isolation that caused issues with
FastAPI dependency injection. Use simpler session pattern that
matches other working integration tests.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* fix(backtesting): use unique IDs and proper cleanup for test isolation

- Generate unique store codes and SKUs using UUID per test
- Use merge() for calendar fixture to handle existing records
- Clean up test data after each test (SalesDaily, TEST-* stores/products)
- Preserve shared Calendar data between tests

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* debug: add error message to test assertion

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* fix(backtesting): remove strict=True from BacktestRequest to allow date coercion

The strict=True config prevented Pydantic from automatically converting
ISO date strings to date objects in JSON requests, causing 422 errors.
Changed to extra="forbid" to still reject unknown fields while allowing
normal type coercion.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* fix(backtesting): clean up calendar entries in test date range

Delete calendar entries from 2024-01-01 to 2024-04-29 during test
cleanup to prevent conflicts with other test modules that insert
calendar records in the same date range.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

---------

Co-authored-by: Gabe@w7dev <gabor@w7-7.net>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Copy link
Copy Markdown

@sourcery-ai sourcery-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry @w7-mgfcode, your pull request is larger than the review limit of 150000 diff characters

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Feb 1, 2026

Warning

Rate limit exceeded

@w7-mgfcode has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 11 minutes and 28 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

📝 Walkthrough

Walkthrough

A comprehensive backtesting system for time-series forecasting models is introduced, featuring time-based cross-validation splits (expanding/sliding windows with configurable gap), a multi-metric evaluation suite (MAE, sMAPE, WAPE, Bias, Stability Index), automated baseline comparisons, leakage validation, and a REST API endpoint. Implementation includes configuration settings, core logic modules, comprehensive test coverage, example scripts, and updated documentation.

Changes

Cohort / File(s) Summary
Design & Documentation
INITIAL-6.md, PRPs/PRP-6-backtesting.md, README.md, docs/ARCHITECTURE.md, docs/validation/pytest-standard.md
Design specifications and implementation guidance for the backtesting feature; README extended with endpoint documentation, testing guidance, and example section; ARCHITECTURE marked feature as implemented.
Configuration
app/core/config.py
Added four backtesting-specific settings: backtest_max_splits, backtest_default_min_train_size, backtest_max_gap, and backtest_results_dir with sensible defaults.
Core Backtesting Module
app/features/backtesting/__init__.py
Public API consolidation module re-exporting key backtesting types and classes (schemas, splitter, metrics, service components) for clean external interface.
Schemas & Data Models
app/features/backtesting/schemas.py
Pydantic schemas defining backtesting configurations, API contracts, and result structures: SplitConfig, BacktestConfig (with deterministic hashing), FoldResult, ModelBacktestResult, BacktestRequest, BacktestResponse.
Time-Series Splitting
app/features/backtesting/splitter.py
TimeSeriesSplitter implementing expanding and sliding window strategies with configurable gap (simulating latency), split boundary tracking, and leakage validation.
Metrics Calculation
app/features/backtesting/metrics.py
MetricsCalculator class providing MAE, sMAPE, WAPE, Bias, and Stability Index metrics with extensive edge-case handling (zeros, NaN, division-by-zero) and per-fold aggregation support.
Backtesting Orchestration
app/features/backtesting/service.py
BacktestingService coordinating end-to-end backtest execution: data loading from DB, time-series splitting, model training/prediction across folds, metric computation, optional baseline comparisons (Naive, Seasonal Naive), and leakage checks.
API Routes
app/features/backtesting/routes.py
FastAPI router exposing POST /backtesting/run endpoint accepting BacktestRequest and returning BacktestResponse; includes request/completion logging and duration measurement.
Feature Integration
app/main.py
Registered backtesting router with FastAPI application via include_router.
Comprehensive Test Suite
app/features/backtesting/tests/conftest.py, test_metrics.py, test_schemas.py, test_splitter.py, test_service.py, test_service_integration.py, test_routes_integration.py
End-to-end and unit test fixtures, data models, and validation covering metric calculations, schema constraints, time-series splitting logic, service workflows, and API integration against PostgreSQL; includes mocks and real database tests.
Example Scripts
examples/backtest/run_backtest.py, inspect_splits.py, metrics_demo.py
Demonstration scripts illustrating REST API invocation, time-series split inspection, and metrics calculator usage with sample data and human-readable output.

Sequence Diagram(s)

sequenceDiagram
    actor Client
    participant API as API Router<br/>/backtesting/run
    participant Service as BacktestingService
    participant DB as Database
    participant Splitter as TimeSeriesSplitter
    participant Model as Forecast Model
    participant Metrics as MetricsCalculator
    participant Baselines as Baseline Models

    Client->>API: POST BacktestRequest<br/>(store_id, product_id, config)
    API->>Service: run_backtest(db, ids, dates, config)
    Service->>DB: _load_series_data(store_id, product_id)
    DB-->>Service: SeriesData (dates, values)
    Service->>Splitter: split(dates, values)
    Splitter-->>Service: Iterator[TimeSeriesSplit]
    
    loop For each fold
        Service->>Model: train on fold.train_indices
        Service->>Model: predict on fold.test_indices
        Service->>Metrics: calculate_all(actuals, predictions)
        Metrics-->>Service: MetricResult (mae, smape, wape, bias)
    end
    
    alt include_baselines == true
        Service->>Baselines: _run_baseline_comparisons(series, splits)
        Baselines-->>Service: list[ModelBacktestResult]
        Service->>Service: _generate_comparison_summary(main, baselines)
    end
    
    Service->>Metrics: aggregate_fold_metrics(all_fold_results)
    Metrics-->>Service: aggregated_metrics (stability indices)
    Service-->>API: BacktestResponse
    API-->>Client: JSON response<br/>(results, metrics, leakage_check)
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

The changes introduce a substantive new backtesting subsystem with high logic density across multiple interdependent modules (schemas, splitter, metrics calculator, service orchestration, API routes). While the architecture is clean and well-organized, reviewers must verify correctness of time-series splitting logic (no leakage), metric calculations with comprehensive edge-case handling, service workflow orchestration, and integration with existing models and database layer. The comprehensive test suite (2000+ lines) aids verification but requires careful review of both implementation and test coverage.

Possibly related PRs

  • PR #21: Shares code-level modifications to app/core/config.py (new Settings fields) and app/main.py (feature router registration), indicating related feature expansion patterns.
  • PR #29: Backtesting implementation directly depends on and exercises forecasting module types, baseline models (Naive, SeasonalNaive), model configurations, and model persistence utilities from this related PR.

Suggested reviewers

  • w7-learn

🐰 A backtest feast arrives with glee,
Splits expand, then slide with ease,
Metrics bloom—mae, bias, and more,
Baselines clash, the victor's score,
No leaks allowed, just truth unfolds—
Your forecasts tested, brave and bold!

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'chore(main): release backtesting module' clearly and concisely describes the primary change: releasing the backtesting module to the main branch, which aligns with the substantial implementation of the backtesting feature detailed throughout the changeset.
Docstring Coverage ✅ Passed Docstring coverage is 97.47% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch dev

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

w7-learn
w7-learn previously approved these changes Feb 1, 2026
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Fix all issues with AI agents
In `@app/core/config.py`:
- Around line 50-54: The four config fields backtest_max_splits,
backtest_default_min_train_size, backtest_max_gap, and backtest_results_dir are
defined but not used; either remove them or wire them into the backtesting
implementation: read these values from the central config/Settings object (where
these fields are declared) and apply them in the backtesting flow—enforce
backtest_max_splits when generating folds in the split generation logic (e.g.,
Backtester.generate_splits / split generator), use
backtest_default_min_train_size as the fallback when no min_train_size is
provided to BacktestingService.run_backtest or Backtester, enforce
backtest_max_gap when validating/creating train/test windows, and use
backtest_results_dir as the destination path in BacktestingService.save_results
or wherever backtest artifacts are written; if the backtesting service has
differently named methods, update the equivalent split-generation, validation,
and save/result-writing methods to consume these config fields.

In `@app/features/backtesting/service.py`:
- Around line 334-383: The _generate_comparison_summary function computes
percent improvements using raw metric values which breaks for signed metrics
like "bias"; update it to handle signed metrics by defining a set of signed
metric names (e.g., SIGNED_METRICS = {"bias", ...}) and, when metric_name is in
that set, compute vs_naive_pct and vs_seasonal_pct using absolute values
(abs(main_value) and abs(naive_value / seasonal_value)) or skip percentage
calculation for zero/NaN as already done; ensure you still store the original
signed values under "main", "naive", and "seasonal_naive" but use abs(...) only
for the percent-change formulas inside _generate_comparison_summary.

In `@PRPs/PRP-6-backtesting.md`:
- Line 14: Implement the missing retrieval endpoint: add a GET handler for GET
/backtesting/results/{backtest_id} that looks up stored results by backtest_id
(use or create functions like storeBacktestResult in the POST /backtesting/run
flow and a corresponding getBacktestResultById or fetchBacktestResult method)
and returns the persisted backtest + fold details or a 404 if not found; if you
choose not to implement it now, add a clear TODO in the API spec and code noting
that GET /backtesting/results/{backtest_id} is deferred and update the success
criteria to remove this endpoint from the deliverables.
🧹 Nitpick comments (11)
app/features/backtesting/schemas.py (2)

65-73: Type annotation for info parameter should use ValidationInfo.

The info parameter is typed as object, which loses type safety. Pydantic v2 provides ValidationInfo for this purpose.

♻️ Proposed fix
+from pydantic import BaseModel, ConfigDict, Field, field_validator, ValidationInfo
-from pydantic import BaseModel, ConfigDict, Field, field_validator

     `@field_validator`("horizon")
     `@classmethod`
-    def validate_horizon_vs_gap(cls, v: int, info: object) -> int:
+    def validate_horizon_vs_gap(cls, v: int, info: ValidationInfo) -> int:
         """Ensure horizon is reasonable relative to gap."""
-        data = getattr(info, "data", {})
+        data = info.data or {}
         gap = data.get("gap", 0)

215-222: Same type annotation improvement applies here.

Use ValidationInfo instead of object for type safety, consistent with the earlier suggestion.

♻️ Proposed fix
     `@field_validator`("end_date")
     `@classmethod`
-    def validate_date_range(cls, v: date_type, info: object) -> date_type:
+    def validate_date_range(cls, v: date_type, info: ValidationInfo) -> date_type:
         """Ensure end_date is after start_date."""
-        data = getattr(info, "data", {})
+        data = info.data or {}
         if "start_date" in data and v <= data["start_date"]:
app/features/backtesting/splitter.py (1)

23-24: Remove unused TYPE_CHECKING block.

The TYPE_CHECKING import and empty pass block serve no purpose and should be removed.

♻️ Proposed fix
-from typing import TYPE_CHECKING, Any
+from typing import Any

 import numpy as np

 from app.features.backtesting.schemas import SplitBoundary, SplitConfig

-if TYPE_CHECKING:
-    pass
examples/backtest/inspect_splits.py (1)

39-42: Gap visualization assumes consecutive daily dates.

The gap date calculation using timedelta(days=1) assumes the data has consecutive daily observations. While this works for the synthetic example data, users adapting this script for non-daily data should be aware of this assumption. Consider adding a comment noting this.

📝 Suggested documentation improvement
         if config.gap > 0:
+            # Note: This assumes consecutive daily dates in the dataset
             gap_start = split.train_dates[-1] + timedelta(days=1)
             gap_end = split.test_dates[0] - timedelta(days=1)
             print(f"  Gap:   {gap_start} to {gap_end} ({config.gap} days)")
app/features/backtesting/metrics.py (1)

270-272: Consider using sample standard deviation for small fold counts.

np.std uses population std (ddof=0) by default. For the typical 3-5 CV folds, sample std (ddof=1) may be statistically more appropriate. This is a minor point and the current implementation is common practice.

app/features/backtesting/tests/conftest.py (1)

46-57: Overly aggressive cleanup deletes all SalesDaily records.

Line 48 deletes all SalesDaily records without filtering, which could interfere with other concurrent tests or shared test data. Consider filtering to only delete records associated with test stores/products.

♻️ Proposed fix to scope cleanup to test data
         # Clean up test data (delete in correct order due to FK constraints)
-            await session.execute(delete(SalesDaily))
+            # Delete sales only for test stores/products to avoid affecting other data
+            await session.execute(
+                delete(SalesDaily).where(
+                    SalesDaily.store_id.in_(
+                        select(Store.id).where(Store.code.like("TEST-%"))
+                    )
+                )
+            )
             await session.execute(delete(Product).where(Product.sku.like("TEST-%")))
             await session.execute(delete(Store).where(Store.code.like("TEST-%")))

You'll need to add the import:

from sqlalchemy import delete, select
docs/validation/pytest-standard.md (1)

720-750: Documentation example differs from actual implementation.

The db_session example shows create_all/drop_all pattern and the client example uses a lambda, but the actual conftest.py implementation:

  1. Does not create/drop tables (relies on existing schema)
  2. Uses an async generator for override_get_db

Consider updating the documentation to match the actual implementation for accuracy.

app/features/backtesting/routes.py (2)

88-88: Consider making BacktestingService a dependency.

Instantiating BacktestingService() per request works but differs from the pattern used in other features. If the service holds no request-specific state, consider injecting it as a FastAPI dependency for consistency and easier testing/mocking.


100-111: Verify duration_ms source.

The handler calculates duration_ms locally (Line 100) and logs it, but the response already contains its own duration_ms from the service. The logged value may differ from the response value. Consider using response.duration_ms for consistency, or clarify that the local measurement includes overhead.

app/features/backtesting/tests/test_schemas.py (1)

162-178: Consider moving date import to module level.

The from datetime import date import is repeated inside multiple test methods (Lines 164, 186, 234, 249, 263, 276). While functional, moving this import to the top of the file alongside other imports would be more conventional and slightly more efficient.

Proposed refactor
 """Tests for backtesting schemas."""
+from datetime import date

 import pytest
 from pydantic import ValidationError

Then remove the inline from datetime import date statements in each test method.

examples/backtest/run_backtest.py (1)

61-64: Consider using response.is_success for robustness.

Currently checks response.status_code != 200, which works for this endpoint but could be more robust with not response.is_success to handle any 2xx status codes.

Proposed change
-    if response.status_code != 200:
+    if not response.is_success:
         print(f"Error: {response.status_code}")
         print(response.text)
         return

Comment thread app/core/config.py
Comment thread app/features/backtesting/service.py
Comment thread PRPs/PRP-6-backtesting.md
- `BacktestResult` — Per-fold actuals vs predictions with lineage metadata
- `BacktestingService` — Orchestrates split generation, model training, prediction, evaluation
- `POST /backtesting/run` — Execute backtest for a series with configurable strategy
- `GET /backtesting/results/{backtest_id}` — Retrieve backtest results with fold details
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# First, explore the backtesting directory structure
find app/features/backtesting -type f -name "*.py" | head -20

Repository: w7-mgfcode/ForecastLabAI

Length of output: 676


🏁 Script executed:

#!/bin/bash
# Search for GET endpoint implementations and result retrieval logic
rg -n "GET|results|backtest_id" app/features/backtesting/ -A 2

Repository: w7-mgfcode/ForecastLabAI

Length of output: 22128


🏁 Script executed:

#!/bin/bash
# Check for route definitions
rg -n "@.*route|@app\.|@router\.|def.*results|def.*backtest" app/features/backtesting/ -A 2

Repository: w7-mgfcode/ForecastLabAI

Length of output: 6399


🏁 Script executed:

#!/bin/bash
# Look at the PRP document itself to understand what was specified
cat -n PRPs/PRP-6-backtesting.md | head -50

Repository: w7-mgfcode/ForecastLabAI

Length of output: 3221


🏁 Script executed:

cat -n app/features/backtesting/routes.py

Repository: w7-mgfcode/ForecastLabAI

Length of output: 5210


Missing GET endpoint for retrieving backtest results.

The PRP specifies GET /backtesting/results/{backtest_id} as part of the deliverable, but only POST /backtesting/run is implemented. While backtest results are returned in the POST response, there is no endpoint to retrieve stored results by backtest_id. Either implement the GET endpoint or document this deferral with a TODO comment and update the success criteria.

🤖 Prompt for AI Agents
In `@PRPs/PRP-6-backtesting.md` at line 14, Implement the missing retrieval
endpoint: add a GET handler for GET /backtesting/results/{backtest_id} that
looks up stored results by backtest_id (use or create functions like
storeBacktestResult in the POST /backtesting/run flow and a corresponding
getBacktestResultById or fetchBacktestResult method) and returns the persisted
backtest + fold details or a 404 if not found; if you choose not to implement it
now, add a clear TODO in the API spec and code noting that GET
/backtesting/results/{backtest_id} is deferred and update the success criteria
to remove this endpoint from the deliverables.

w7-learn and others added 3 commits February 1, 2026 04:58
- Add _validate_config() to enforce settings constraints:
  - Validate n_splits <= BACKTEST_MAX_SPLITS
  - Validate gap <= BACKTEST_MAX_GAP
  - Warn if min_train_size < BACKTEST_DEFAULT_MIN_TRAIN_SIZE
- Add save_results() method using BACKTEST_RESULTS_DIR
- Add unit tests for config validation and result saving

Closes issue with unused config fields in app/core/config.py

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add SIGNED_METRICS class constant to identify signed metrics (e.g., "bias")
- Update _generate_comparison_summary to use absolute values for
  percentage improvement calculations on signed metrics
- Original signed values are preserved in main/naive/seasonal_naive keys
- Add 3 unit tests for signed metric handling:
  - test_comparison_signed_metric_uses_absolute_values
  - test_comparison_signed_metric_positive_values
  - test_comparison_signed_metric_mixed_signs

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
feat(backtesting): wire config fields into implementation
w7-learn
w7-learn previously approved these changes Feb 1, 2026
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants