chore(main): release backtesting module#33
Conversation
) * docs: update INITIAL-6.md * feat(backtesting): implement time-series backtesting module (PRP-6) Add complete backtesting infrastructure for model evaluation: - TimeSeriesSplitter with expanding/sliding window strategies and gap support - MetricsCalculator with MAE, sMAPE, WAPE, Bias, and Stability Index - BacktestingService for orchestrating backtests with baseline comparisons - POST /backtesting/run endpoint with full response schema - 95 unit tests covering schemas, splitter, metrics, and service - Example scripts for API usage, split visualization, and metrics demo Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> * docs: update documentation for backtesting module (PRP-6) - README.md: Add backtesting endpoint, examples, and project structure - ARCHITECTURE.md: Mark backtesting as implemented with full details Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> * chore: update uv.lock version to 0.1.7 Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> * test(backtesting): add integration tests for routes and service (PRP-6) Add 16 integration tests that run against real PostgreSQL database: - 8 route tests for POST /backtesting/run endpoint - 8 service tests for BacktestingService._load_series_data Tests use @pytest.mark.integration marker and require docker-compose. Test data: 120 days of sequential sales (quantity = day number 1-120). Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> * fix(backtesting): fix integration test fixtures and format examples - Use savepoint-based transaction isolation instead of table drop/create - Fix client dependency override to use async generator - Format example files (inspect_splits.py, metrics_demo.py) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> * fix(backtesting): simplify db_session fixture for CI compatibility Remove complex savepoint-based isolation that caused issues with FastAPI dependency injection. Use simpler session pattern that matches other working integration tests. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> * fix(backtesting): use unique IDs and proper cleanup for test isolation - Generate unique store codes and SKUs using UUID per test - Use merge() for calendar fixture to handle existing records - Clean up test data after each test (SalesDaily, TEST-* stores/products) - Preserve shared Calendar data between tests Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> * debug: add error message to test assertion Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> * fix(backtesting): remove strict=True from BacktestRequest to allow date coercion The strict=True config prevented Pydantic from automatically converting ISO date strings to date objects in JSON requests, causing 422 errors. Changed to extra="forbid" to still reject unknown fields while allowing normal type coercion. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> * fix(backtesting): clean up calendar entries in test date range Delete calendar entries from 2024-01-01 to 2024-04-29 during test cleanup to prevent conflicts with other test modules that insert calendar records in the same date range. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> --------- Co-authored-by: Gabe@w7dev <gabor@w7-7.net> Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
There was a problem hiding this comment.
Sorry @w7-mgfcode, your pull request is larger than the review limit of 150000 diff characters
|
Warning Rate limit exceeded
⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. 📝 WalkthroughWalkthroughA comprehensive backtesting system for time-series forecasting models is introduced, featuring time-based cross-validation splits (expanding/sliding windows with configurable gap), a multi-metric evaluation suite (MAE, sMAPE, WAPE, Bias, Stability Index), automated baseline comparisons, leakage validation, and a REST API endpoint. Implementation includes configuration settings, core logic modules, comprehensive test coverage, example scripts, and updated documentation. Changes
Sequence Diagram(s)sequenceDiagram
actor Client
participant API as API Router<br/>/backtesting/run
participant Service as BacktestingService
participant DB as Database
participant Splitter as TimeSeriesSplitter
participant Model as Forecast Model
participant Metrics as MetricsCalculator
participant Baselines as Baseline Models
Client->>API: POST BacktestRequest<br/>(store_id, product_id, config)
API->>Service: run_backtest(db, ids, dates, config)
Service->>DB: _load_series_data(store_id, product_id)
DB-->>Service: SeriesData (dates, values)
Service->>Splitter: split(dates, values)
Splitter-->>Service: Iterator[TimeSeriesSplit]
loop For each fold
Service->>Model: train on fold.train_indices
Service->>Model: predict on fold.test_indices
Service->>Metrics: calculate_all(actuals, predictions)
Metrics-->>Service: MetricResult (mae, smape, wape, bias)
end
alt include_baselines == true
Service->>Baselines: _run_baseline_comparisons(series, splits)
Baselines-->>Service: list[ModelBacktestResult]
Service->>Service: _generate_comparison_summary(main, baselines)
end
Service->>Metrics: aggregate_fold_metrics(all_fold_results)
Metrics-->>Service: aggregated_metrics (stability indices)
Service-->>API: BacktestResponse
API-->>Client: JSON response<br/>(results, metrics, leakage_check)
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes The changes introduce a substantive new backtesting subsystem with high logic density across multiple interdependent modules (schemas, splitter, metrics calculator, service orchestration, API routes). While the architecture is clean and well-organized, reviewers must verify correctness of time-series splitting logic (no leakage), metric calculations with comprehensive edge-case handling, service workflow orchestration, and integration with existing models and database layer. The comprehensive test suite (2000+ lines) aids verification but requires careful review of both implementation and test coverage. Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Fix all issues with AI agents
In `@app/core/config.py`:
- Around line 50-54: The four config fields backtest_max_splits,
backtest_default_min_train_size, backtest_max_gap, and backtest_results_dir are
defined but not used; either remove them or wire them into the backtesting
implementation: read these values from the central config/Settings object (where
these fields are declared) and apply them in the backtesting flow—enforce
backtest_max_splits when generating folds in the split generation logic (e.g.,
Backtester.generate_splits / split generator), use
backtest_default_min_train_size as the fallback when no min_train_size is
provided to BacktestingService.run_backtest or Backtester, enforce
backtest_max_gap when validating/creating train/test windows, and use
backtest_results_dir as the destination path in BacktestingService.save_results
or wherever backtest artifacts are written; if the backtesting service has
differently named methods, update the equivalent split-generation, validation,
and save/result-writing methods to consume these config fields.
In `@app/features/backtesting/service.py`:
- Around line 334-383: The _generate_comparison_summary function computes
percent improvements using raw metric values which breaks for signed metrics
like "bias"; update it to handle signed metrics by defining a set of signed
metric names (e.g., SIGNED_METRICS = {"bias", ...}) and, when metric_name is in
that set, compute vs_naive_pct and vs_seasonal_pct using absolute values
(abs(main_value) and abs(naive_value / seasonal_value)) or skip percentage
calculation for zero/NaN as already done; ensure you still store the original
signed values under "main", "naive", and "seasonal_naive" but use abs(...) only
for the percent-change formulas inside _generate_comparison_summary.
In `@PRPs/PRP-6-backtesting.md`:
- Line 14: Implement the missing retrieval endpoint: add a GET handler for GET
/backtesting/results/{backtest_id} that looks up stored results by backtest_id
(use or create functions like storeBacktestResult in the POST /backtesting/run
flow and a corresponding getBacktestResultById or fetchBacktestResult method)
and returns the persisted backtest + fold details or a 404 if not found; if you
choose not to implement it now, add a clear TODO in the API spec and code noting
that GET /backtesting/results/{backtest_id} is deferred and update the success
criteria to remove this endpoint from the deliverables.
🧹 Nitpick comments (11)
app/features/backtesting/schemas.py (2)
65-73: Type annotation forinfoparameter should useValidationInfo.The
infoparameter is typed asobject, which loses type safety. Pydantic v2 providesValidationInfofor this purpose.♻️ Proposed fix
+from pydantic import BaseModel, ConfigDict, Field, field_validator, ValidationInfo -from pydantic import BaseModel, ConfigDict, Field, field_validator `@field_validator`("horizon") `@classmethod` - def validate_horizon_vs_gap(cls, v: int, info: object) -> int: + def validate_horizon_vs_gap(cls, v: int, info: ValidationInfo) -> int: """Ensure horizon is reasonable relative to gap.""" - data = getattr(info, "data", {}) + data = info.data or {} gap = data.get("gap", 0)
215-222: Same type annotation improvement applies here.Use
ValidationInfoinstead ofobjectfor type safety, consistent with the earlier suggestion.♻️ Proposed fix
`@field_validator`("end_date") `@classmethod` - def validate_date_range(cls, v: date_type, info: object) -> date_type: + def validate_date_range(cls, v: date_type, info: ValidationInfo) -> date_type: """Ensure end_date is after start_date.""" - data = getattr(info, "data", {}) + data = info.data or {} if "start_date" in data and v <= data["start_date"]:app/features/backtesting/splitter.py (1)
23-24: Remove unusedTYPE_CHECKINGblock.The
TYPE_CHECKINGimport and emptypassblock serve no purpose and should be removed.♻️ Proposed fix
-from typing import TYPE_CHECKING, Any +from typing import Any import numpy as np from app.features.backtesting.schemas import SplitBoundary, SplitConfig -if TYPE_CHECKING: - passexamples/backtest/inspect_splits.py (1)
39-42: Gap visualization assumes consecutive daily dates.The gap date calculation using
timedelta(days=1)assumes the data has consecutive daily observations. While this works for the synthetic example data, users adapting this script for non-daily data should be aware of this assumption. Consider adding a comment noting this.📝 Suggested documentation improvement
if config.gap > 0: + # Note: This assumes consecutive daily dates in the dataset gap_start = split.train_dates[-1] + timedelta(days=1) gap_end = split.test_dates[0] - timedelta(days=1) print(f" Gap: {gap_start} to {gap_end} ({config.gap} days)")app/features/backtesting/metrics.py (1)
270-272: Consider using sample standard deviation for small fold counts.
np.stduses population std (ddof=0) by default. For the typical 3-5 CV folds, sample std (ddof=1) may be statistically more appropriate. This is a minor point and the current implementation is common practice.app/features/backtesting/tests/conftest.py (1)
46-57: Overly aggressive cleanup deletes all SalesDaily records.Line 48 deletes all
SalesDailyrecords without filtering, which could interfere with other concurrent tests or shared test data. Consider filtering to only delete records associated with test stores/products.♻️ Proposed fix to scope cleanup to test data
# Clean up test data (delete in correct order due to FK constraints) - await session.execute(delete(SalesDaily)) + # Delete sales only for test stores/products to avoid affecting other data + await session.execute( + delete(SalesDaily).where( + SalesDaily.store_id.in_( + select(Store.id).where(Store.code.like("TEST-%")) + ) + ) + ) await session.execute(delete(Product).where(Product.sku.like("TEST-%"))) await session.execute(delete(Store).where(Store.code.like("TEST-%")))You'll need to add the import:
from sqlalchemy import delete, selectdocs/validation/pytest-standard.md (1)
720-750: Documentation example differs from actual implementation.The
db_sessionexample showscreate_all/drop_allpattern and theclientexample uses a lambda, but the actualconftest.pyimplementation:
- Does not create/drop tables (relies on existing schema)
- Uses an async generator for
override_get_dbConsider updating the documentation to match the actual implementation for accuracy.
app/features/backtesting/routes.py (2)
88-88: Consider making BacktestingService a dependency.Instantiating
BacktestingService()per request works but differs from the pattern used in other features. If the service holds no request-specific state, consider injecting it as a FastAPI dependency for consistency and easier testing/mocking.
100-111: Verify duration_ms source.The handler calculates
duration_mslocally (Line 100) and logs it, but theresponsealready contains its ownduration_msfrom the service. The logged value may differ from the response value. Consider usingresponse.duration_msfor consistency, or clarify that the local measurement includes overhead.app/features/backtesting/tests/test_schemas.py (1)
162-178: Consider movingdateimport to module level.The
from datetime import dateimport is repeated inside multiple test methods (Lines 164, 186, 234, 249, 263, 276). While functional, moving this import to the top of the file alongside other imports would be more conventional and slightly more efficient.Proposed refactor
"""Tests for backtesting schemas.""" +from datetime import date import pytest from pydantic import ValidationErrorThen remove the inline
from datetime import datestatements in each test method.examples/backtest/run_backtest.py (1)
61-64: Consider usingresponse.is_successfor robustness.Currently checks
response.status_code != 200, which works for this endpoint but could be more robust withnot response.is_successto handle any 2xx status codes.Proposed change
- if response.status_code != 200: + if not response.is_success: print(f"Error: {response.status_code}") print(response.text) return
| - `BacktestResult` — Per-fold actuals vs predictions with lineage metadata | ||
| - `BacktestingService` — Orchestrates split generation, model training, prediction, evaluation | ||
| - `POST /backtesting/run` — Execute backtest for a series with configurable strategy | ||
| - `GET /backtesting/results/{backtest_id}` — Retrieve backtest results with fold details |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# First, explore the backtesting directory structure
find app/features/backtesting -type f -name "*.py" | head -20Repository: w7-mgfcode/ForecastLabAI
Length of output: 676
🏁 Script executed:
#!/bin/bash
# Search for GET endpoint implementations and result retrieval logic
rg -n "GET|results|backtest_id" app/features/backtesting/ -A 2Repository: w7-mgfcode/ForecastLabAI
Length of output: 22128
🏁 Script executed:
#!/bin/bash
# Check for route definitions
rg -n "@.*route|@app\.|@router\.|def.*results|def.*backtest" app/features/backtesting/ -A 2Repository: w7-mgfcode/ForecastLabAI
Length of output: 6399
🏁 Script executed:
#!/bin/bash
# Look at the PRP document itself to understand what was specified
cat -n PRPs/PRP-6-backtesting.md | head -50Repository: w7-mgfcode/ForecastLabAI
Length of output: 3221
🏁 Script executed:
cat -n app/features/backtesting/routes.pyRepository: w7-mgfcode/ForecastLabAI
Length of output: 5210
Missing GET endpoint for retrieving backtest results.
The PRP specifies GET /backtesting/results/{backtest_id} as part of the deliverable, but only POST /backtesting/run is implemented. While backtest results are returned in the POST response, there is no endpoint to retrieve stored results by backtest_id. Either implement the GET endpoint or document this deferral with a TODO comment and update the success criteria.
🤖 Prompt for AI Agents
In `@PRPs/PRP-6-backtesting.md` at line 14, Implement the missing retrieval
endpoint: add a GET handler for GET /backtesting/results/{backtest_id} that
looks up stored results by backtest_id (use or create functions like
storeBacktestResult in the POST /backtesting/run flow and a corresponding
getBacktestResultById or fetchBacktestResult method) and returns the persisted
backtest + fold details or a 404 if not found; if you choose not to implement it
now, add a clear TODO in the API spec and code noting that GET
/backtesting/results/{backtest_id} is deferred and update the success criteria
to remove this endpoint from the deliverables.
- Add _validate_config() to enforce settings constraints: - Validate n_splits <= BACKTEST_MAX_SPLITS - Validate gap <= BACKTEST_MAX_GAP - Warn if min_train_size < BACKTEST_DEFAULT_MIN_TRAIN_SIZE - Add save_results() method using BACKTEST_RESULTS_DIR - Add unit tests for config validation and result saving Closes issue with unused config fields in app/core/config.py Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add SIGNED_METRICS class constant to identify signed metrics (e.g., "bias") - Update _generate_comparison_summary to use absolute values for percentage improvement calculations on signed metrics - Original signed values are preserved in main/naive/seasonal_naive keys - Add 3 unit tests for signed metric handling: - test_comparison_signed_metric_uses_absolute_values - test_comparison_signed_metric_positive_values - test_comparison_signed_metric_mixed_signs Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
feat(backtesting): wire config fields into implementation
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Summary
Release the backtesting module (PRP-6) to main.
Changes included
Release Notes
Features
Tests
🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Documentation
✏️ Tip: You can customize this high-level summary in your review settings.