From 12f6cdfd2bbaef4667b07036da97ba8e18378660 Mon Sep 17 00:00:00 2001 From: Gabor Szabo Date: Tue, 19 May 2026 18:22:15 +0200 Subject: [PATCH] feat(backtest): wire feature-aware models into the backtesting fold loop (#244) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Implements PRP-MLZOO-B.2. Feature-aware models (regression, lightgbm — any model with requires_features=True) are now evaluated by POST /backtesting/run and backtest jobs, with per-fold leakage-safe X_train / X_future matrices. - Promote the historical row builder to app/shared/feature_frames/rows.py (build_historical_feature_rows) and add build_future_feature_rows — the leakage-critical per-fold test-window assembler. ForecastingService. _assemble_regression_rows becomes a delegating shim; its leakage spec (test_regression_features_leakage.py) stays green, unedited. - BacktestingService: resolve exogenous data once async (ExogenousFrame), branch the fold loop on requires_features, slice X_train from one historical matrix, rebuild X_future per fold. min_train_size >= 30 enforced loud. - ModelBacktestResult gains additive feature_aware / exogenous_policy fields; _shape_backtest_result keys unchanged (frontend contract byte-stable). - JobService._execute_backtest accepts regression + lightgbm. - Repurpose the interim loud-fail test (PRP-29 #7 / PRP-30 #6 superseded); add the shared X_future leakage spec incl. a gap>0 fold. --- ...IAL-MLZOO-B.2-feature-aware-backtesting.md | 215 ++++++ PRPs/INITIAL/INITIAL-MLZOO-index.md | 6 +- ...PRP-MLZOO-B.2-feature-aware-backtesting.md | 707 ++++++++++++++++++ README.md | 6 + app/features/backtesting/schemas.py | 8 + app/features/backtesting/service.py | 331 +++++++- .../tests/test_feature_aware_backtest.py | 221 ++++++ .../tests/test_routes_integration.py | 82 ++ .../backtesting/tests/test_schemas.py | 42 ++ .../backtesting/tests/test_service.py | 19 +- .../tests/test_service_integration.py | 96 +++ app/features/forecasting/service.py | 39 +- app/features/jobs/service.py | 9 + app/features/jobs/tests/test_service.py | 64 ++ app/shared/feature_frames/__init__.py | 6 + app/shared/feature_frames/rows.py | 201 +++++ .../feature_frames/tests/test_contract.py | 10 +- .../feature_frames/tests/test_leakage.py | 138 ++++ docs/PHASE/5-BACKTESTING.md | 33 + examples/models/feature_frame_contract.md | 35 +- 20 files changed, 2213 insertions(+), 55 deletions(-) create mode 100644 PRPs/INITIAL/INITIAL-MLZOO-B.2-feature-aware-backtesting.md create mode 100644 PRPs/PRP-MLZOO-B.2-feature-aware-backtesting.md create mode 100644 app/features/backtesting/tests/test_feature_aware_backtest.py create mode 100644 app/shared/feature_frames/rows.py diff --git a/PRPs/INITIAL/INITIAL-MLZOO-B.2-feature-aware-backtesting.md b/PRPs/INITIAL/INITIAL-MLZOO-B.2-feature-aware-backtesting.md new file mode 100644 index 00000000..e160a0f7 --- /dev/null +++ b/PRPs/INITIAL/INITIAL-MLZOO-B.2-feature-aware-backtesting.md @@ -0,0 +1,215 @@ +# INITIAL-MLZOO-B.2-feature-aware-backtesting.md - Feature-Aware Backtesting Wiring + +## FEATURE: + +Wire **feature-aware forecasting models** (`RegressionForecaster`, `LightGBMForecaster` — +every model with `requires_features=True`) into the backtesting fold loop so they can be +evaluated by `POST /backtesting/run` and `train`/`backtest` jobs — **without** introducing +target leakage and **without** train/serve skew. + +This is the explicit follow-up deferred by **PRP-30 DECISIONS LOCKED #6** and named there as +`PRP-MLZOO-B.2`. It depends on MLZOO-A (PRP-29, merged `b116489`) and MLZOO-B (PRP-30, merged +`2f1b8a5` / PR #243) being in place. It is the third MLZOO unit, after A (foundation) and +B (first advanced model), and before C (XGBoost/Prophet) and D (frontend/registry). + +### Why current backtesting is incompatible with feature-aware models + +The backtesting slice was built for target-only baseline models and has three structural +gaps that block feature-aware models: + +1. **No exogenous data is loaded.** `BacktestingService._load_series_data` + (`app/features/backtesting/service.py`) selects only `(date, quantity)` from `sales_daily`. + `SeriesData` carries only `dates` + `values`. A feature-aware model needs the canonical + 14-column frame (`canonical_feature_columns()`): target lags, calendar columns, and the + exogenous columns `price_factor` / `promo_active` / `is_holiday` / `days_since_launch`. + +2. **The fold loop is target-only.** `_run_model_backtest` calls `model.fit(y_train)` and + `model.predict(horizon)` with **no `X`**. `RegressionForecaster` / `LightGBMForecaster` + (`requires_features=True`) raise `ValueError("… requires exogenous features X …")` at + `fit()`. That loud failure is the *current interim contract*, pinned by + `app/features/backtesting/tests/test_service.py::test_feature_aware_model_fails_loud_in_backtest` + (cites PRP-29 DECISIONS LOCKED #7). This PRP **supersedes** that interim contract. + +3. **`JobService._execute_backtest` hard-rejects** any `model_type` other than + `naive` / `seasonal_naive` / `moving_average` with `ValueError("Unsupported model_type…")`. + +A backtest of a feature-aware model is the only honest way to compare an advanced model +against the naive/seasonal baselines — without it, PRP-30's LightGBM model can be trained +and scenario-re-forecast but never *evaluated*. + +### Per-fold X_train and X_future construction + +The backtest must build, **per fold**, two leakage-safe feature matrices: + +- **`X_train`** — the historical feature matrix restricted to the fold's train rows. Every + column is observed-and-knowable: target lags read strictly-earlier observed targets, + calendar columns are pure functions of the date, exogenous columns read same-day observed + price/promotion/holiday/launch attributes. This is exactly the matrix + `ForecastingService._build_regression_features` / `_assemble_regression_rows` already + builds for training — it must be **reused**, not re-derived. + +- **`X_future`** — the feature matrix for the fold's test window (`test_size` days, after the + `gap`). The test window has **no observed target**, so its target-lag columns must be built + with the leakage-safe future-lag rule (`build_long_lag_columns`): a lag cell whose source + day lies in the test window is `NaN`, never the observed test value. + +### Leakage-safe fold contracts + +Each fold defines a forecast origin `T` = the fold's `train_end`. The invariant, mirroring +`app/shared/feature_frames/tests/test_leakage.py` and +`app/features/featuresets/tests/test_leakage.py`: + +> A feature value for a test-window day `D` may use ONLY information knowable at the fold +> origin `T`: the observed history at or before `T`, or the calendar (a pure function of the +> date). It may **NEVER** read an observed target at a test-window day. + +The `gap` between `train_end` and `test_start` simulates operational data latency — the +fold's `history_tail` ends at `T` and does **not** include the gap days; lag columns whose +source falls in the gap are therefore `NaN` too. + +The PRP must classify every canonical feature column with the **existing** +`app/shared/feature_frames.FeatureSafety` taxonomy (`SAFE` / `CONDITIONALLY_SAFE` / +`UNSAFE_UNLESS_SUPPLIED`) and decide, per class, how `X_future` is populated. It must also +draw the line — explicitly — between **target leakage** (reading `y` at a horizon day: +forbidden, must be structurally impossible) and **exogenous foresight** (assuming the future +price/promotion calendar is known: a disclosed modelling choice, not target leakage). The +result must record which exogenous policy it used so the metric is interpreted honestly. + +### Async / DB-backed loading requirements + +The fold loop (`_run_model_backtest`) is currently **synchronous and DB-free** — a property +worth keeping (it is unit-testable without a database). Feature-aware folds need exogenous +data (`unit_price`, `promotion` windows, `calendar` holidays, `product.launch_date`) that +only the DB has. The PRP must resolve all of that **async, once, up front** in `run_backtest` +(already `async`) into pure in-memory arrays, then keep the per-fold builders pure and sync — +mirroring how `ForecastingService._build_regression_features` resolves async then delegates +to the pure `_assemble_regression_rows`. + +### How feature-aware models should fail loudly until supported + +Even after this PRP, some paths stay unsupported and must fail **loud**, never silent: + +- A feature-aware model whose required feature frame cannot be sourced (e.g. an + `UNSAFE_UNLESS_SUPPLIED` column with no observed data-platform record and no supplied + assumptions) → raise `ValueError`, never a silent `NaN`/`0.0` fill. +- An unclassifiable feature column (`FeatureSafety` / `feature_safety()` raises `KeyError`) + → propagate loudly. +- The interim `test_feature_aware_model_fails_loud_in_backtest` is **repurposed**, not + deleted: feature-aware backtesting now *succeeds* on the supported path, so the test + becomes (a) a positive "regression model backtests and yields metrics" test plus (b) a new + loud-fail test for the genuinely-unsupported path. + +### Explicit out-of-scope items + +- **No new model families.** No XGBoost, no Prophet-like models — that is MLZOO-C. This PRP + wires the *existing* `regression` and `lightgbm` forecasters into backtesting and adds none. +- **No frontend work.** No changes to `frontend/`, no backtest-page model selector, no new UI. + The `/visualize/backtest` job-result contract (`_shape_backtest_result`) stays byte-stable. +- **No explainability work.** No driver attribution, no `ForecastExplanation`, no + `/explain/*` change — that is MLZOO-D. +- **No scenario persistence changes.** No `scenario_plan` schema change, no `/scenarios/*` + contract change. The scenarios slice's `feature_frame.py` is read as a reference pattern + only; it is not modified. +- **No hyperparameter search, no portfolio/global models, no recursive multi-step + forecasting.** `NaN`-as-unknown for future-sourced lags is kept (PRP-29 DECISIONS #6). +- **No new Alembic migration** — this PRP adds no table or column to the database. + +## EXAMPLES: + +Read these before PRP creation: + +- `PRPs/PRP-29-feature-aware-forecasting-foundation.md` + - The merged foundation. DECISIONS LOCKED #2 (`requires_features` capability flag), + #6 (NaN-as-unknown), #7 (the interim backtest loud-fail this PRP supersedes). + +- `PRPs/PRP-30-lightgbm-first-advanced-model.md` + - The merged first advanced model. DECISIONS LOCKED #6 explicitly defers feature-aware + backtesting to *this* PRP and states why (`_run_model_backtest` is sync, DB-free, + target-only). + +- `app/features/backtesting/service.py` + - `_load_series_data` (target-only load), `SeriesData` (the container to extend), + `_run_model_backtest` (the sync fold loop to branch), `_run_baseline_comparisons`. + +- `app/features/backtesting/splitter.py` + - `TimeSeriesSplitter` — index-based, needs **no change**; each `TimeSeriesSplit` already + carries `train_indices` / `test_indices` / `train_dates` / `test_dates`. + +- `app/features/forecasting/service.py` + - `_build_regression_features` (async exogenous resolution pattern) and + `_assemble_regression_rows` (the pure, leakage-safe historical row builder to promote). + +- `app/shared/feature_frames/contract.py` + - `canonical_feature_columns()`, `build_long_lag_columns`, `build_calendar_columns`, + the `FeatureSafety` enum and `feature_safety()` classifier — the contract to reuse. + +- `app/features/scenarios/feature_frame.py` + - `assemble_future_frame` / `build_exogenous_columns` — the future-frame *pattern* for a + feature-aware model. Reference only; do not import (cross-slice import is forbidden). + +- `app/shared/feature_frames/tests/test_leakage.py` and + `app/features/forecasting/tests/test_regression_features_leakage.py` + - The load-bearing leakage-test patterns the new builders must follow. + +- `app/features/jobs/service.py` + - `_execute_backtest` (the `model_type` allow-list to widen) and `_shape_backtest_result` + (the frontend contract that must NOT drift). + +## DOCUMENTATION: + +- scikit-learn TimeSeriesSplit: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.TimeSeriesSplit.html +- scikit-learn HistGradientBoostingRegressor (NaN-tolerant): https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.HistGradientBoostingRegressor.html +- LightGBM (NaN handling / missing values): https://lightgbm.readthedocs.io/en/stable/Advanced-Topics.html +- Forecasting cross-validation / leakage: https://otexts.com/fpp3/tscv.html +- Pandas time series documentation: https://pandas.pydata.org/docs/user_guide/timeseries.html +- Pydantic documentation: https://docs.pydantic.dev/latest/ +- `PRPs/ai_docs/exogenous-regressor-forecasting.md` — repo-local notes on exogenous-regressor + forecasting and the leakage-safe future-frame rule. + +## OTHER CONSIDERATIONS: + +The main architectural risk is the **train/serve-skew and leakage boundary** — get the +`X_future` exogenous-column policy wrong and the backtest is either leaky (optimistic and +worthless) or skewed (the model sees different feature distributions at train vs evaluate). + +Required decisions for the PRP: + +- **Where the per-fold builders live.** `backtesting` may not import `forecasting` or + `scenarios` (vertical-slice rule). The pure row builders must be promoted into + `app/shared/feature_frames` (the sanctioned cross-cutting home) so both slices consume one + definition. The promotion must be **additive** — `forecasting`'s `_assemble_regression_rows` + and its leakage test must keep working unchanged (a thin delegating shim is acceptable). + +- **The `X_future` exogenous-column policy.** `price_factor` / `promo_active` are + `UNSAFE_UNLESS_SUPPLIED`. In a *scenario* a planner supplies them; a *backtest* has no + assumptions. Decide v1 policy explicitly (recorded as a field on the result), and document + it as target-leakage-free but optimistic (it assumes the future promo/price calendar was + known at `T`). + +- **`gap` handling for future-lag columns.** With `gap > 0` the first test day is + `T + gap + 1`; `build_long_lag_columns` indexes test day `m` as `T + m`. The PRP must state + the offset correction (build `gap + test_size` rows, drop the first `gap`). + +- **The job-result contract.** `_shape_backtest_result` feeds `/visualize/backtest`. Any new + field must be additive; the existing keys must not move. + +Recommended defaults: + +- Reuse `canonical_feature_columns()` for both `X_train` and `X_future` — identical column + set and order on both sides is the train/serve-skew guard. +- Keep the per-fold builders pure and sync; do all DB I/O once in `run_backtest`. +- Branch on `model.requires_features` (a capability flag), never on a `model_type` string. +- Keep `NaN`-as-unknown for future-sourced lag cells — the estimators tolerate it natively. +- Prefer a loud `ValueError` on an unsupported path over any implicit guess or silent fill. + +Validation expectations: + +- Shared-builder leakage tests proving `X_future` lag cells are `NaN` where their source is a + test-window day (the new load-bearing spec). +- Unit tests for the per-fold `X_train` / `X_future` builders (pure, no DB). +- An integration test running a real DB-backed feature-aware (`regression`) backtest and + comparing it against the naive/seasonal baselines in the same response. +- A route test: `POST /backtesting/run` with a `regression` model config → `200`. +- A jobs test: a `backtest` job with `model_type="regression"` succeeds. +- A repurposed loud-fail test for the genuinely-unsupported feature-aware path. +- Regression proof: every existing baseline backtest test stays green with no behaviour change. diff --git a/PRPs/INITIAL/INITIAL-MLZOO-index.md b/PRPs/INITIAL/INITIAL-MLZOO-index.md index b7e9989c..b41faf77 100644 --- a/PRPs/INITIAL/INITIAL-MLZOO-index.md +++ b/PRPs/INITIAL/INITIAL-MLZOO-index.md @@ -17,6 +17,7 @@ Recommended PRP sequence: | --- | --- | --- | --- | | 1 | `INITIAL-MLZOO-A-foundation-feature-frames.md` | PRP-29 | Feature-aware forecasting foundation and leakage-safe frame contracts | | 2 | `INITIAL-MLZOO-B-lightgbm-first-model.md` | PRP-30 | First advanced model path with LightGBM (optional `ml-lightgbm` extra) | +| 2.5 | `INITIAL-MLZOO-B.2-feature-aware-backtesting.md` | PRP-MLZOO-B.2 | Wire feature-aware models into the backtesting fold loop (per-fold leakage-safe `X_train` / `X_future`) | | 3 | `INITIAL-MLZOO-C-xgboost-prophet-extensions.md` | Future PRP | XGBoost and Prophet-like extensions | | 4 | `INITIAL-MLZOO-D-frontend-registry-explainability.md` | Future PRP | UI, registry surfacing, and explanation polish | @@ -25,8 +26,9 @@ Dependency graph: ```text A. Foundation feature frames -> B. LightGBM first model - -> C. XGBoost / Prophet-like extensions - -> D. Frontend / registry / explainability + -> B.2 Feature-aware backtesting + -> C. XGBoost / Prophet-like extensions + -> D. Frontend / registry / explainability ``` The full vision is documented in `docs/optional-features/05-advanced-ml-model-zoo.md`. diff --git a/PRPs/PRP-MLZOO-B.2-feature-aware-backtesting.md b/PRPs/PRP-MLZOO-B.2-feature-aware-backtesting.md new file mode 100644 index 00000000..105a4075 --- /dev/null +++ b/PRPs/PRP-MLZOO-B.2-feature-aware-backtesting.md @@ -0,0 +1,707 @@ +name: "PRP-MLZOO-B.2 — Feature-Aware Backtesting Wiring" +description: | + +## Purpose + +The third unit of the **Advanced ML Model Zoo** sequence (`PRPs/INITIAL/INITIAL-MLZOO-index.md`), +sitting between MLZOO-B (PRP-30) and MLZOO-C. It wires the **existing feature-aware +forecasting models** — `RegressionForecaster` (PRP-27) and `LightGBMForecaster` (PRP-30), +i.e. every model with `requires_features=True` — into the **backtesting fold loop** so they +can be evaluated by `POST /backtesting/run` and `backtest` jobs. + +This is the explicit follow-up deferred by **PRP-30 DECISIONS LOCKED #6**, which named it +`PRP-MLZOO-B.2` and stated the reason: `BacktestingService._run_model_backtest` is +synchronous, DB-free, and target-only, so per-fold leakage-safe `X_train` / `X_future` +construction is itself PRP-sized. + +This PRP wires **zero new models**. It adds **no** XGBoost/Prophet, **no** frontend, **no** +explainability, **no** scenario-persistence change, **no** Alembic migration. If you find +yourself adding a model family, touching `frontend/`, or editing a `scenario_plan` schema — +stop; that is out of scope (see DECISIONS LOCKED #9). + +## What this PRP already inherits (DO NOT re-build) + +MLZOO-A (PRP-29, merged `b116489`) and MLZOO-B (PRP-30, merged `2f1b8a5` / PR #243) shipped +everything a feature-aware *model* needs. Re-use it; do not re-derive it: + +- **The capability flag.** `BaseForecaster.requires_features: ClassVar[bool]` + (`app/features/forecasting/models.py:64`). `RegressionForecaster` and `LightGBMForecaster` + both set it `True`; all three baselines leave it `False`. **Branch on this flag — never on + a `model_type` string.** +- **The shared feature-frame contract.** `app/shared/feature_frames/` owns the pinned + constants (`EXOGENOUS_LAGS`, `HISTORY_TAIL_DAYS`), `canonical_feature_columns()` (the + 14-column set + order), the leakage-safe pure builders `build_long_lag_columns` / + `build_calendar_columns`, and the `FeatureSafety` taxonomy + `feature_safety()` classifier. + This PRP **extends** that package; it writes no new contract constants. +- **The historical feature matrix.** `ForecastingService._assemble_regression_rows` + (`app/features/forecasting/service.py:114`) is the pure, leakage-safe historical row + builder, pinned by `app/features/forecasting/tests/test_regression_features_leakage.py`. + This PRP **promotes** it into `app/shared/feature_frames` so the backtesting slice can + consume it without a forbidden cross-slice import. +- **The future-frame pattern.** `app/features/scenarios/feature_frame.py::assemble_future_frame` + is the reference for a feature-aware model's *future* matrix (calendar + leakage-safe lags + + exogenous). This PRP builds the backtesting equivalent in `app/shared/feature_frames`. + The scenarios module is **read as a pattern, never imported or modified.** +- **The fold splitter.** `app/features/backtesting/splitter.py::TimeSeriesSplitter` is purely + index-based — each `TimeSeriesSplit` already carries `train_indices` / `test_indices` / + `train_dates` / `test_dates`. **It needs no change.** + +The **problem this PRP fixes**: `BacktestingService._load_series_data` loads only +`(date, quantity)`; `_run_model_backtest` calls `model.fit(y_train)` target-only; a +`requires_features=True` model raises `ValueError` at `fit()`. That loud failure is the +*interim* contract pinned by +`app/features/backtesting/tests/test_service.py::test_feature_aware_model_fails_loud_in_backtest` +(PRP-29 DECISIONS LOCKED #7). `JobService._execute_backtest` hard-rejects every non-baseline +`model_type`. The advanced LightGBM model from PRP-30 can be trained and scenario-re-forecast +but **cannot be evaluated against the baselines** — backtesting is the only honest comparison. + +## DEPENDS ON — read before starting + +- `PRPs/INITIAL/INITIAL-MLZOO-B.2-feature-aware-backtesting.md` — this PRP's brief. +- `PRPs/INITIAL/INITIAL-MLZOO-index.md` — the roadmap (A ✅ → B ✅ → **B.2 (this)** → C → D). +- `PRPs/PRP-29-feature-aware-forecasting-foundation.md` — DECISIONS LOCKED #2 + (`requires_features`), #6 (NaN-as-unknown), #7 (the interim backtest loud-fail superseded here). +- `PRPs/PRP-30-lightgbm-first-advanced-model.md` — DECISIONS LOCKED #6 defers feature-aware + backtesting to this PRP and explains why. +- `PRPs/ai_docs/exogenous-regressor-forecasting.md` — the leakage-safe future-frame rule. + +## Goal + +Make `POST /backtesting/run` (and `backtest` jobs) accept `regression` and `lightgbm` model +configs and evaluate them across time-series CV folds with **per-fold, leakage-safe** +`X_train` / `X_future` feature matrices — the advanced models compared head-to-head against +the naive / seasonal baselines, with **no target leakage** and **no train/serve skew**. + +## Why + +- **Portfolio completeness.** A forecasting system whose advanced model cannot be backtested + has no defensible model-selection story. PRP-30 delivered the model; this delivers its + evaluation. +- **Time-safety is the repo's load-bearing invariant** (`product-vision.md` §5). Wiring a + feature-consuming model into CV is exactly where leakage creeps in — doing it once, in a + shared and test-pinned way, protects every future MLZOO model (C, D). +- **It unblocks the MLZOO sequence.** MLZOO-C (XGBoost/Prophet) and MLZOO-D + (frontend/registry) both assume feature-aware models are backtestable; this is their gate. + +## What + +### Technical requirements + +1. **Per-fold `X_train`** — the historical feature matrix sliced to the fold's train rows. + Built once over the full series via the promoted shared builder; sliced by + `split.train_indices`. Leakage-safe by position: every row reads only strictly-earlier + observed targets. +2. **Per-fold `X_future`** — the test-window feature matrix, **rebuilt per fold** (never + sliced from the historical matrix). Target-lag columns use `build_long_lag_columns` with a + `history_tail` ending at the fold origin `T = train_end`; a lag cell whose source is a + test-window day is `NaN`. +3. **Async exogenous loading up front.** `run_backtest` (already `async`) resolves + `unit_price`, `promotion` windows, `calendar` holidays, and `product.launch_date` once, + into pure in-memory arrays. `_run_model_backtest` stays **sync and DB-free**. +4. **Capability branch.** `_run_model_backtest` branches on `model.requires_features`. + Target-only models keep the exact current code path; feature-aware models take the new + per-fold builder path. +5. **`JobService._execute_backtest`** accepts `regression` and `lightgbm` (the latter still + gated by `forecast_enable_lightgbm` inside `model_factory`). +6. **Loud failure** on every unsupported path — never a silent `NaN`/`0.0` fill. +7. **No frontend contract drift.** `_shape_backtest_result` and the `/visualize/backtest` + job-result keys stay byte-stable; new schema fields are additive only. + +### Future feature classes — the `X_future` policy (the architectural core) + +Every canonical column is classified by the **existing** `FeatureSafety` enum +(`app/shared/feature_frames/contract.py`). `X_future` populates each class as follows: + +| Class | Columns | `X_future` source | Leakage status | +|-------|---------|-------------------|----------------| +| `SAFE` | `dow_sin/cos`, `month_sin/cos`, `is_weekend`, `is_month_end`, `is_holiday`, `days_since_launch` | Pure function of the test-window date / timeless attribute (calendar table, product launch date) | None — never reads a target | +| `CONDITIONALLY_SAFE` | `lag_1`, `lag_7`, `lag_14`, `lag_28` | `build_long_lag_columns(history_tail, …)` — `history_tail` ends at `T`; a cell whose source day is in the test window is `NaN` | None — `NaN`-where-future is structurally enforced | +| `UNSAFE_UNLESS_SUPPLIED` | `price_factor`, `promo_active` | v1 policy `observed`: recorded `sales_daily.unit_price` / `promotion` rows for the test window | **No target leakage**; *is* exogenous foresight (see below) | + +**Target leakage vs. exogenous foresight — the line this PRP draws explicitly.** The repo's +load-bearing leakage rule is: *never read an observed target at a horizon day*. Target-lag +columns obey it structurally (`build_long_lag_columns`). The `UNSAFE_UNLESS_SUPPLIED` +exogenous columns (`price_factor`, `promo_active`) read **recorded price/promotion** for the +test window — never the target `y`. That is **not** target leakage. It **is** *exogenous +foresight*: the backtest assumes the future price/promotion calendar was known at `T`. For +retail demand that is realistic (promo calendars are planned ahead) and it keeps `X_train` +and `X_future` distributionally identical (no train/serve skew — both read same-day observed +exogenous). It is, however, optimistic, so the result **records `exogenous_policy="observed"`** +and the metric must be interpreted as "accuracy given a known promo/price plan". A future +PRP may add an `assumptions` policy; v1 ships exactly one. + +### Success Criteria + +- [ ] `POST /backtesting/run` with a `regression` model config returns `200` with per-fold + metrics and a baseline comparison. +- [ ] A `backtest` job with `model_type="regression"` completes `success`. +- [ ] A feature-aware backtest's `X_future` lag cells are `NaN` exactly where their source + day falls in the test window — pinned by a new shared leakage test. +- [ ] `X_train` and `X_future` use the identical `canonical_feature_columns()` set and order. +- [ ] Every existing baseline backtest test passes with **zero** edits. +- [ ] `_shape_backtest_result` output keys are byte-stable (frontend contract intact). +- [ ] An unsupported feature-aware path raises a loud `ValueError`, never silently degrades. +- [ ] `ruff` + `mypy --strict` + `pyright --strict` clean; unit + integration suites green. + +## All Needed Context + +### Documentation & References + +```yaml +- file: app/features/backtesting/service.py + why: _load_series_data (extend), SeriesData (extend), _run_model_backtest (branch), + _run_baseline_comparisons, _validate_config (add min-train guard). + +- file: app/features/backtesting/splitter.py + why: TimeSeriesSplit carries train/test indices + dates. NO CHANGE — index-based. + +- file: app/features/forecasting/service.py + why: _build_regression_features = the async exogenous-resolution pattern to mirror; + _assemble_regression_rows = the pure historical builder to promote to shared. + +- file: app/shared/feature_frames/contract.py + why: canonical_feature_columns(), build_long_lag_columns, build_calendar_columns, + FeatureSafety + feature_safety(). The contract this PRP extends. + +- file: app/features/scenarios/feature_frame.py + why: assemble_future_frame = the future-matrix PATTERN. Reference only — do NOT import + (backtesting -> scenarios is a forbidden cross-slice import). + +- file: app/features/jobs/service.py + why: _execute_backtest (widen the model_type allow-list); _shape_backtest_result + (the frontend contract — additive changes only, keys must not move). + +- file: app/shared/feature_frames/tests/test_leakage.py + why: the load-bearing leakage-test pattern the new builder must follow. + +- file: app/features/forecasting/tests/test_regression_features_leakage.py + why: pins _assemble_regression_rows — must stay GREEN after the promotion (shim). + +- url: https://otexts.com/fpp3/tscv.html + why: time-series cross-validation — the standard this fold loop implements. +``` + +### Current Codebase tree (relevant — all already exist) + +``` +app/ + shared/feature_frames/ + __init__.py # re-exports the contract surface + contract.py # constants, builders, FeatureSafety + tests/test_leakage.py # load-bearing leakage spec + tests/test_contract.py # AST walk — no app/features import + features/ + backtesting/ + service.py # SeriesData, run_backtest, _run_model_backtest (target-only) + splitter.py # TimeSeriesSplitter — index-based, unchanged + schemas.py # BacktestConfig, ModelBacktestResult, BacktestResponse + tests/test_service.py, test_service_integration.py, test_routes_integration.py + forecasting/ + service.py # _assemble_regression_rows (to promote), _build_regression_features + jobs/ + service.py # _execute_backtest (allow-list), _shape_backtest_result +``` + +### Desired Codebase tree — files to ADD + +``` +app/shared/feature_frames/ + rows.py # build_historical_feature_rows (promoted), + # build_future_feature_rows (NEW, leakage-safe) +app/features/backtesting/tests/ + test_feature_aware_backtest.py # unit tests for the fold builders + loud-fail +PRPs/ + PRP-MLZOO-B.2-feature-aware-backtesting.md # this file +``` + +### Files to MODIFY (all additive or behaviour-preserving) + +``` +app/shared/feature_frames/__init__.py # export the two row builders +app/shared/feature_frames/tests/test_leakage.py # ADD build_future_feature_rows leakage spec +app/shared/feature_frames/tests/test_contract.py # ADD: rows.py imports nothing from app/features +app/features/forecasting/service.py # _assemble_regression_rows -> delegating shim +app/features/backtesting/service.py # ExogenousFrame, exogenous load, fold branch +app/features/backtesting/schemas.py # ADD feature_aware + exogenous_policy (additive) +app/features/jobs/service.py # _execute_backtest: accept regression + lightgbm +app/features/backtesting/tests/test_service.py # repurpose the interim loud-fail test +app/features/backtesting/tests/test_service_integration.py # feature-aware DB-backed backtest +app/features/backtesting/tests/test_routes_integration.py # POST /backtesting/run regression +app/features/backtesting/tests/test_schemas.py # the new additive fields +app/features/jobs/tests/test_service.py # backtest job with model_type=regression +examples/models/feature_frame_contract.md # document the backtest future-frame +docs/PHASE/5-BACKTESTING.md # feature-aware backtesting section +README.md # backtest model list: add regression/lightgbm +PRPs/INITIAL/INITIAL-MLZOO-index.md # note B.2 -> this PRP +``` + +### DECISIONS LOCKED (resolved during planning — do NOT re-litigate) + +1. **The per-fold row builders live in `app/shared/feature_frames/rows.py`.** `backtesting` + may not import `forecasting` or `scenarios` (vertical-slice rule). `app/shared/` is the + sanctioned cross-cutting home and already owns the column builders. `rows.py` (pure, + stdlib-only, `app/features` import forbidden — same as `contract.py`) holds the two + row-matrix assemblers. `contract.py` stays the column-builder + taxonomy home. + +2. **`_assemble_regression_rows` is PROMOTED, not duplicated, and the promotion is additive.** + Its body moves verbatim to `build_historical_feature_rows` in `rows.py`. + `ForecastingService._assemble_regression_rows` becomes a one-line delegating shim + (`return build_historical_feature_rows(...)`). `test_regression_features_leakage.py` + imports the shim by its old name and stays **GREEN with zero edits** — the existing + leakage test is not weakened, not moved, not touched. + +3. **`X_train` is sliced from one full-series historical matrix; `X_future` is NEVER sliced.** + The historical matrix is built once over `dates[0:N]` (each row reads only strictly-earlier + targets → leakage-safe as a *training* row). Per fold, `X_train = matrix[train_indices]`. + `X_future` MUST be rebuilt per fold via `build_future_feature_rows` — slicing the + historical matrix for test rows would let `lag_1` read an adjacent test-day observed target + (**target leakage**). This asymmetry is the crux of the PRP (see GOTCHA below). + +4. **The v1 `X_future` exogenous policy is `observed`, recorded on the result.** `price_factor` + / `promo_active` for the test window come from recorded `sales_daily.unit_price` / + `promotion` rows. This is exogenous foresight, not target leakage (it never reads `y`), and + it keeps `X_train`/`X_future` skew-free (both read same-day observed exogenous). + `ModelBacktestResult.exogenous_policy` records `"observed"` so the metric is interpreted + honestly. A `Literal["observed"]` (one value in v1) is the documented extension seam — a + future PRP may add `"assumptions"` without a breaking change. + +5. **Branch on `model.requires_features`, never on a `model_type` string.** `run_backtest` + builds a cheap probe model from `config.model_config_main` to read the flag *before* the + fold loop, deciding whether to load exogenous data. Mirrors + `ForecastingService.train_model`, which already branches on exactly this flag. + The probe is a no-fit `model_factory(...)` construction (cheap) — each of the three + sites that need the flag (`_validate_config`, `run_backtest`, `_run_model_backtest`) + builds its own probe locally; there is no need to thread one instance through, and + `BacktestingService` keeps no probe/matrix instance state. + +6. **The fold loop stays sync and DB-free.** All DB I/O happens once in `run_backtest` + (`async`), resolved into a pure `ExogenousFrame`. `_run_model_backtest` and the row + builders remain unit-testable without a database — the existing architecture is preserved. + +7. **`min_train_size >= 30` is enforced for feature-aware backtests.** `_validate_config` + raises `ValueError` when the main model `requires_features` and + `split_config.min_train_size < 30` (`_MIN_REGRESSION_TRAIN_ROWS`) — each fold's train + window must resolve the lag features. Loud, not silent. + +8. **The interim loud-fail test is REPURPOSED, not deleted.** + `test_feature_aware_model_fails_loud_in_backtest` asserted a `regression` backtest raises + `ValueError`. After this PRP it succeeds. The test is rewritten as (a) a positive + "feature-aware backtest runs and yields metrics" assertion and (b) a new loud-fail + assertion for the genuinely-unsupported path (a `requires_features` model with no + `ExogenousFrame` loaded → `ValueError`). PRP-29 DECISIONS LOCKED #7 and PRP-30 DECISIONS + LOCKED #6 are **superseded** — note this in the PRP commit body and the test docstring. + +9. **OUT OF SCOPE — do not touch.** No new model family (XGBoost/Prophet = MLZOO-C). No + `frontend/` change — `_shape_backtest_result` keys stay byte-stable. No explainability + (MLZOO-D). No `scenario_plan` / `/scenarios/*` change. No Alembic migration (this PRP adds + no table/column). No recursive multi-step forecasting — `NaN`-as-unknown is kept. + +### Known Gotchas of our codebase & Library Quirks + +```python +# CRITICAL: X_future is NEVER a slice of the historical matrix. The historical row for a +# test day reads quantities[i-1] (lag_1) — for a test day that source is an adjacent +# observed TEST-DAY target == target leakage. X_future MUST be rebuilt per fold with +# build_long_lag_columns(history_tail_ending_at_T, ...) so future-sourced lag cells are NaN. + +# CRITICAL: gap offset. With gap > 0 the first test day is T + gap + 1, but +# build_long_lag_columns indexes test day m as T + m. Call it with +# horizon = gap + test_size and DROP the first `gap` rows. With gap=0 (the common case) +# this is a no-op. test_feature_aware_backtest.py MUST cover a gap>0 fold. + +# CRITICAL: history_tail ends at T = train_end, and EXCLUDES the gap days. The gap simulates +# operational data latency — data for gap days is "not yet available" at forecast time. +# history_tail = series.values[:train_end_idx][-HISTORY_TAIL_DAYS:] where +# train_end_idx = split.train_indices[-1] + 1. + +# GOTCHA: leaf-level import rule. app/shared/feature_frames/rows.py may NEVER import from +# app/features/** (test_contract.py enforces it with an AST walk). Keep rows.py pure — +# stdlib math/datetime only, same as contract.py. + +# GOTCHA: SeriesData.__post_init__ computes n_observations from `values`. Adding an optional +# `exogenous: ExogenousFrame | None = None` field is fine — keep it last, keep the default. + +# GOTCHA: ModelBacktestResult is frozen-free but consumed by _shape_backtest_result. New +# fields (feature_aware: bool = False, exogenous_policy: str | None = None) MUST have +# defaults so every existing construction site and test stays valid. + +# GOTCHA: lightgbm in a backtest job. _execute_backtest building a LightGBMModelConfig is +# fine — model_factory still raises ValueError if forecast_enable_lightgbm is False. That +# surfaces as a failed job (loud), which is correct. Do not pre-check the flag in jobs. + +# GOTCHA: baseline comparison. _run_baseline_comparisons runs naive + seasonal_naive — both +# target-only (requires_features=False). They take the UNCHANGED target-only fold path +# even when the main model is feature-aware. Do not feed them X. + +# GOTCHA: line endings — repo has mixed CRLF/LF, no .gitattributes. Run `git diff --stat` +# before committing; re-normalise any whole-file diff to the file's original ending. +``` + +## Implementation Blueprint + +### Data models and structure + +```python +# app/shared/feature_frames/rows.py (NEW — pure, stdlib only) + +def build_historical_feature_rows( + *, dates: list[date], quantities: list[float], prices: list[float], + baseline_price: float, promo_dates: set[date], holiday_dates: set[date], + launch_date: date | None, +) -> list[list[float]]: + """Promoted verbatim from ForecastingService._assemble_regression_rows. + Row i: target lags read quantities[i-lag] (strictly earlier), calendar columns + are pure, exogenous columns read same-day observed attributes. canonical order.""" + +def build_future_feature_rows( + *, test_dates: list[date], history_tail: list[float], gap: int, + test_prices: list[float], baseline_price: float, test_promo_dates: set[date], + test_holiday_dates: set[date], launch_date: date | None, +) -> list[list[float]]: + """Leakage-safe test-window matrix. lag_* columns come from + build_long_lag_columns(history_tail, gap + len(test_dates))[gap:] — NaN where the + source day is in the test window. Calendar columns pure. price_factor / promo_active + from the OBSERVED test-window records (policy='observed'). canonical order. + Raises ValueError if asked to emit a column it cannot classify/source.""" +``` + +```python +# app/features/backtesting/service.py (MODIFY) + +@dataclass +class ExogenousFrame: + """Pre-loaded exogenous data for one series — resolved async in run_backtest, + consumed by the pure/sync fold loop.""" + prices: list[float] # aligned with SeriesData.dates + baseline_price: float # median positive price (>0 fallback 1.0) + promo_dates: set[date] + holiday_dates: set[date] + launch_date: date | None + +@dataclass +class SeriesData: + dates: list[date] + values: np.ndarray + store_id: int + product_id: int + exogenous: ExogenousFrame | None = None # NEW — present only for feature-aware runs + n_observations: int = field(init=False) +``` + +### list of tasks (dependency-ordered) + +```yaml +# ════════ STEP 1 — Shared row builders (pure, no behaviour change) ════════ +Task 1: CREATE app/shared/feature_frames/rows.py + - build_historical_feature_rows: body lifted verbatim from _assemble_regression_rows. + - build_future_feature_rows: NEW — see Per-task pseudocode. + - Pure: import only math/datetime + the contract builders. No app/features import. + +Task 2: MODIFY app/shared/feature_frames/__init__.py + - Export build_historical_feature_rows, build_future_feature_rows. + +Task 3: MODIFY app/features/forecasting/service.py + - _assemble_regression_rows becomes a one-line shim delegating to + build_historical_feature_rows. Keep the signature and name byte-identical so + test_regression_features_leakage.py imports stay valid. + +# ════════ STEP 2 — Shared leakage spec ════════ +Task 4: MODIFY app/shared/feature_frames/tests/test_leakage.py + - ADD: build_future_feature_rows lag cells are NaN exactly where source day is in the + test window; an observed test-day target never appears as a lag value out of place; + gap>0 case; the historical-vs-future asymmetry. +Task 5: MODIFY app/shared/feature_frames/tests/test_contract.py + - ADD rows.py to the AST walk asserting no app/features import. + +# ════════ STEP 3 — Backtesting schemas (additive) ════════ +Task 6: MODIFY app/features/backtesting/schemas.py + - ModelBacktestResult: ADD feature_aware: bool = False, + exogenous_policy: Literal["observed"] | None = None (defaults preserve all callers). + +# ════════ STEP 4 — Backtesting service wiring ════════ +Task 7: MODIFY app/features/backtesting/service.py + - ADD ExogenousFrame; ADD optional SeriesData.exogenous. + - _validate_config: if main model requires_features and min_train_size < 30 -> ValueError. + - run_backtest: build a probe model from config.model_config_main; if requires_features, + call new async _load_exogenous_frame and attach to series_data.exogenous. + - _run_model_backtest: signature is UNCHANGED (still series_data, splitter, + model_config, store_fold_details). It builds a probe model, branches on + probe.requires_features, reads gap from splitter.config.gap, and builds the full + historical matrix as a LOCAL once before the fold loop. + target-only -> existing code path, untouched. + feature-aware -> _run_feature_aware_fold (new helper, all args explicit): per fold + slice X_train from the local historical matrix, build X_future, + fit(y,X_train), predict(test_size, X_future). Set feature_aware + + exogenous_policy on the ModelBacktestResult. + - ADD _load_exogenous_frame (async): unit_price per date, promotion windows, calendar + holidays, product.launch_date — mirrors _build_regression_features. + +# ════════ STEP 5 — Jobs integration ════════ +Task 8: MODIFY app/features/jobs/service.py + - _execute_backtest: add regression + lightgbm branches building RegressionModelConfig / + LightGBMModelConfig. _shape_backtest_result UNCHANGED (frontend contract byte-stable). + +# ════════ STEP 6 — Tests ════════ +Task 9: CREATE app/features/backtesting/tests/test_feature_aware_backtest.py + - Pure unit tests: per-fold X_train/X_future shape + column order; gap>0 fold; + feature-aware model with exogenous=None -> loud ValueError. +Task 10: MODIFY app/features/backtesting/tests/test_service.py + - Repurpose test_feature_aware_model_fails_loud_in_backtest (DECISIONS LOCKED #8). +Task 11: MODIFY app/features/backtesting/tests/test_service_integration.py + - DB-backed regression backtest vs naive/seasonal baselines in one response. +Task 12: MODIFY app/features/backtesting/tests/test_routes_integration.py + - POST /backtesting/run with a regression model config -> 200 + per-fold metrics. +Task 13: MODIFY app/features/backtesting/tests/test_schemas.py + - ModelBacktestResult new fields: defaults + explicit values. +Task 14: MODIFY app/features/jobs/tests/test_service.py + - backtest job with model_type="regression" -> success + shaped result. + +# ════════ STEP 7 — Docs ════════ +Task 15: MODIFY examples/models/feature_frame_contract.md, docs/PHASE/5-BACKTESTING.md, + README.md (backtest model list), PRPs/INITIAL/INITIAL-MLZOO-index.md (B.2 row). +``` + +### Per-task pseudocode (critical details only) + +```python +# ── Task 1 — build_future_feature_rows (the leakage-critical builder) ── +def build_future_feature_rows(*, test_dates, history_tail, gap, test_prices, + baseline_price, test_promo_dates, test_holiday_dates, + launch_date): + horizon = len(test_dates) + columns = canonical_feature_columns() + # lags: build for gap+horizon days, drop the gap lead-in. NaN where source > T. + lag_cols = build_long_lag_columns(history_tail, gap + horizon) # {"lag_k": [...]} + lag_cols = {k: v[gap:] for k, v in lag_cols.items()} + cal_cols = build_calendar_columns(test_dates) # SAFE — pure + rows: list[list[float]] = [] + for j, day in enumerate(test_dates): + row: list[float] = [] + for col in columns: + safety = feature_safety(col) # raises KeyError on unknown -> loud + if col.startswith("lag_"): # CONDITIONALLY_SAFE + row.append(lag_cols[col][j]) + elif col in cal_cols: # SAFE + row.append(cal_cols[col][j]) + elif col == "price_factor": # UNSAFE_UNLESS_SUPPLIED -> observed + row.append(test_prices[j] / baseline_price) + elif col == "promo_active": # UNSAFE_UNLESS_SUPPLIED -> observed + row.append(1.0 if day in test_promo_dates else 0.0) + elif col == "is_holiday": # SAFE — calendar timeless attribute + row.append(1.0 if day in test_holiday_dates else 0.0) + elif col == "days_since_launch": # SAFE — pure fn of date + row.append(float((day - launch_date).days) if launch_date else math.nan) + else: + raise ValueError(f"build_future_feature_rows: unsourced column {col!r}") + rows.append(row) + return rows + +# ── Task 7 — _run_model_backtest branch + _run_feature_aware_fold (pure, sync) ── +# _run_model_backtest gains NO new parameters. `gap` is read from the splitter it already +# receives (splitter.config.gap — SplitConfig is reachable as TimeSeriesSplitter.config). +# The full historical matrix is a LOCAL built once before the fold loop — there is no +# self._historical_matrix and no self.config on BacktestingService (__init__ sets only +# self.settings + self.metrics_calculator). _run_feature_aware_fold takes everything it +# needs as explicit arguments — no phantom instance state. + +def _run_model_backtest(self, series_data, splitter, model_config, store_fold_details): + probe = model_factory(model_config, random_state=self.settings.forecast_random_seed) + feature_aware = probe.requires_features # capability flag, never a string + historical_matrix: np.ndarray | None = None + if feature_aware: + if series_data.exogenous is None: + raise ValueError("feature-aware backtest requires a loaded ExogenousFrame") + exo = series_data.exogenous + historical_matrix = np.array(build_historical_feature_rows( + dates=series_data.dates, quantities=series_data.values.tolist(), + prices=exo.prices, baseline_price=exo.baseline_price, + promo_dates=exo.promo_dates, holiday_dates=exo.holiday_dates, + launch_date=exo.launch_date), dtype=np.float64) + for split in splitter.split(series_data.dates, series_data.values): + if feature_aware: + predictions = self._run_feature_aware_fold( + series_data, split, model_config, historical_matrix, splitter.config.gap) + else: + ... # existing target-only path — UNCHANGED + ... # metrics / FoldResult assembly is shared, unchanged + # set feature_aware=feature_aware and exogenous_policy on the returned ModelBacktestResult. + +def _run_feature_aware_fold(self, series_data, split, model_config, + historical_matrix, gap): + exo = series_data.exogenous # caller already guaranteed non-None + # X_train — slice the full historical matrix (built once, leakage-safe by position) + X_train = historical_matrix[split.train_indices] + y_train = series_data.values[split.train_indices] + # X_future — rebuilt per fold; history_tail ends at T = train_end, excludes gap + train_end_idx = int(split.train_indices[-1]) + 1 + history_tail = series_data.values[:train_end_idx][-HISTORY_TAIL_DAYS:].tolist() + test_idx = split.test_indices + X_future = np.array(build_future_feature_rows( + test_dates=split.test_dates, history_tail=history_tail, gap=gap, + test_prices=[exo.prices[i] for i in test_idx], baseline_price=exo.baseline_price, + test_promo_dates={series_data.dates[i] for i in test_idx if series_data.dates[i] in exo.promo_dates}, + test_holiday_dates={d for d in split.test_dates if d in exo.holiday_dates}, + launch_date=exo.launch_date), dtype=np.float64) + model = model_factory(model_config, random_state=self.settings.forecast_random_seed) + model.fit(y_train, X_train) + return model.predict(len(test_idx), X_future) + +# ── Task 8 — _execute_backtest allow-list ── +# elif model_type == "regression": model_config = RegressionModelConfig() +# elif model_type == "lightgbm": model_config = LightGBMModelConfig() +# else: raise ValueError(f"Unsupported model_type: {model_type}") # e.g. "arima" +``` + +### Integration Points + +```yaml +BACKTESTING SERVICE: + - run_backtest stays the only async entry; _run_model_backtest stays sync. + - the full historical matrix is built once per _run_model_backtest call (feature-aware + path) as a LOCAL variable, sliced per fold — never per-fold rebuilt for X_train, and + never stored on the service instance. + +JOBS: + - _execute_backtest gains regression + lightgbm; _shape_backtest_result is NOT touched. + - a backtest job for a disabled lightgbm fails loud via model_factory — expected. + +SHARED CONTRACT: + - rows.py joins contract.py under app/shared/feature_frames; __init__.py re-exports both. + - forecasting + backtesting both consume one definition — no column-order drift possible. + +NO CHANGE: + - splitter.py, scenarios/**, frontend/**, alembic/**, registry/** — untouched. +``` + +## Phased Execution Plan + +This is one coherent architectural change and **fits one reviewable PR** (~1 branch, +`feat/backtesting-feature-aware-folds`, off `dev`, tracked by **GitHub issue #244** — +every commit references `(#244)`). If the reviewer +prefers a smaller diff, split along the natural seam — Phase 1 is independently mergeable +because it is pure and behaviour-preserving: + +- **Phase 1 — Shared builders + leakage spec (Tasks 1–5).** Promote + `build_historical_feature_rows`, add `build_future_feature_rows`, wire the delegating shim, + add the shared leakage tests. Zero behaviour change — every existing test stays green. A + self-contained PR that lands the contract without touching backtesting. +- **Phase 2 — Backtesting + jobs wiring (Tasks 6–15).** Consume the builders: schema fields, + async exogenous load, the fold-loop branch, jobs allow-list, integration tests, docs. + +Recommended: ship as **one PR** unless the diff is judged too large at review time; the +phase boundary is the fallback, not the default. + +## Validation Loop + +### Level 1: Syntax & Style + +```bash +uv run ruff check . && uv run ruff format --check . +``` + +### Level 2: Type Checks + +```bash +uv run mypy app/ && uv run pyright app/ +# Watch: rows.py must type cleanly with no app/features import; the new optional +# SeriesData.exogenous and the additive ModelBacktestResult fields must not break callers. +``` + +### Level 3: Unit Tests + +```bash +uv run pytest -v -m "not integration" \ + app/shared/feature_frames/tests/ \ + app/features/backtesting/tests/test_service.py \ + app/features/backtesting/tests/test_feature_aware_backtest.py \ + app/features/backtesting/tests/test_schemas.py \ + app/features/forecasting/tests/test_regression_features_leakage.py \ + app/features/jobs/tests/test_service.py +# test_regression_features_leakage.py MUST pass with ZERO edits (the shim preserves it). +uv run pytest -v -m "not integration" # whole fast suite — all green +``` + +### Level 4: Integration Tests + +```bash +docker compose up -d +uv run pytest -v -m integration \ + app/features/backtesting/tests/test_service_integration.py \ + app/features/backtesting/tests/test_routes_integration.py +# A regression backtest must return per-fold metrics + a baseline comparison; +# the response carries exogenous_policy="observed" on the main-model result. +``` + +### Level 5: Manual Validation (dogfood — REQUIRED) + +```bash +# 1. A regression backtest runs end-to-end (needs seeded data). +curl -sX POST localhost:8123/backtesting/run -H 'Content-Type: application/json' \ + -d '{"store_id":1,"product_id":1,"start_date":"2024-01-01","end_date":"2024-12-01", + "config":{"model_config_main":{"model_type":"regression"}, + "split_config":{"n_splits":3,"min_train_size":60,"horizon":14}}}' +# -> 200; main_model_results.feature_aware == true; exogenous_policy == "observed"; +# baseline_results has naive + seasonal_naive; leakage_check_passed == true. + +# 2. min_train_size guard fires loud. +# ... same call with "min_train_size":20 -> 400 RFC-7807, "at least 30". + +# 3. A backtest job with model_type=regression completes success. +curl -sX POST localhost:8123/jobs -H 'Content-Type: application/json' \ + -d '{"job_type":"backtest","params":{"model_type":"regression","store_id":1, + "product_id":1,"start_date":"2024-01-01","end_date":"2024-12-01","n_splits":3}}' +# -> poll GET /jobs/{id} -> status "success", result has per-fold metrics. + +# 4. Baselines unaffected — a naive backtest still works exactly as before. +``` + +## Final Validation Checklist + +- [ ] `ruff` + `mypy --strict` + `pyright --strict` clean. +- [ ] Whole fast unit suite green; `test_regression_features_leakage.py` unedited & green. +- [ ] New shared leakage spec proves `X_future` lag cells are `NaN`-where-future (incl. gap>0). +- [ ] Integration: a `regression` backtest returns per-fold metrics + baseline comparison. +- [ ] `POST /backtesting/run` with `regression` → `200`; with `min_train_size<30` → `400`. +- [ ] A `backtest` job with `model_type="regression"` → `success`. +- [ ] `_shape_backtest_result` output keys byte-identical to pre-PRP (frontend contract). +- [ ] Every baseline backtest test green with zero edits. +- [ ] The interim loud-fail test is repurposed (not deleted); supersession noted. +- [ ] No `frontend/`, `scenarios/`, `alembic/` change; no new migration. +- [ ] `git diff --stat` shows no whole-file CRLF/LF noise. + +## Anti-Patterns to Avoid + +- ❌ Slicing the historical matrix for `X_future` — that leaks adjacent test-day targets. +- ❌ Filling an unknowable future column with `0.0`/`NaN` silently — raise `ValueError`. +- ❌ Branching the fold loop on a `model_type` string — branch on `requires_features`. +- ❌ Importing `forecasting`/`scenarios` from `backtesting` — promote to `app/shared/`. +- ❌ Doing DB I/O inside `_run_model_backtest` — keep it sync; load async up front. +- ❌ Re-deriving `build_long_lag_columns` / `canonical_feature_columns()` — reuse the contract. +- ❌ Weakening or deleting `test_feature_aware_model_fails_loud_in_backtest` — repurpose it. +- ❌ Editing `_shape_backtest_result` keys or any `frontend/` file — out of scope. +- ❌ Adding XGBoost/Prophet, hyperparameter search, or a migration — all out of scope. + +## Open Questions + +1. **`exogenous_policy` v1 = `observed` only.** This PRP ships exactly one policy (recorded + `price`/`promo` for the test window — exogenous foresight, target-leakage-free). A stricter + `origin_carry_forward` policy (carry the last observed price/promo state from `T`, zero + foresight) and an `assumptions` policy (planner-supplied, mirroring the scenarios slice) + are deliberately deferred. **Resolve at PRP review:** is one policy acceptable for v1, or + should `origin_carry_forward` ship alongside as the conservative default? The `Literal` + field is the seam either way. +2. **Feature-aware baseline comparison.** v1 compares a feature-aware main model only against + the *target-only* naive/seasonal baselines. Whether `regression` should also auto-run as a + baseline for a `lightgbm` main model (advanced-vs-advanced) is left to MLZOO-D — flag if + the reviewer wants it sooner. +3. **Per-series caching of the historical matrix.** Built once per `run_backtest` call; not + cached across calls. Fine for single-series backtests; revisit only if a portfolio/batch + backtester (a separate optional feature) ever lands. + +## Confidence Score + +**9/10** for one-pass implementation. The contract (MLZOO-A), both feature-aware models +(PRP-27, PRP-30), the splitter, and the leakage-test patterns all already exist and are +stable. The only genuine design judgement — the `X_future` exogenous policy — is resolved +and locked (DECISIONS LOCKED #4) with Open Question #1 as the explicit review hook. The work +is additive, single-slice-plus-shared, and needs no migration. diff --git a/README.md b/README.md index 4426b908..d406b033 100644 --- a/README.md +++ b/README.md @@ -393,6 +393,12 @@ curl -X POST http://localhost:8123/backtesting/run \ **Baseline Comparisons:** When `include_baselines=true`, automatically compares against naive and seasonal_naive models. +**Feature-Aware Models:** +`regression` and `lightgbm` models can be backtested too — set +`model_config_main.model_type` accordingly. Each fold builds a leakage-safe +per-fold feature matrix (`min_train_size >= 30` required); the result carries +`feature_aware: true` and `exogenous_policy: "observed"`. + See [examples/backtest/](examples/backtest/) for usage examples. ### Model Registry diff --git a/app/features/backtesting/schemas.py b/app/features/backtesting/schemas.py index 537809f0..747d25c2 100644 --- a/app/features/backtesting/schemas.py +++ b/app/features/backtesting/schemas.py @@ -173,6 +173,12 @@ class ModelBacktestResult(BaseModel): fold_results: Results for each fold. aggregated_metrics: Mean metrics across folds. metric_std: Standard deviation of metrics across folds. + feature_aware: True when the model consumed a per-fold feature matrix + (``requires_features``); False for target-only baseline models. + exogenous_policy: How the test-window exogenous columns were sourced. + ``"observed"`` (the recorded price/promotion plan) for a + feature-aware result; ``None`` for a target-only model. Recorded so + the metric is read honestly as "accuracy given a known plan". """ model_type: str @@ -180,6 +186,8 @@ class ModelBacktestResult(BaseModel): fold_results: list[FoldResult] aggregated_metrics: dict[str, float] metric_std: dict[str, float] + feature_aware: bool = False + exogenous_policy: Literal["observed"] | None = None # ============================================================================= diff --git a/app/features/backtesting/service.py b/app/features/backtesting/service.py index b4b8e8f4..209e9081 100644 --- a/app/features/backtesting/service.py +++ b/app/features/backtesting/service.py @@ -17,6 +17,7 @@ import uuid from dataclasses import dataclass, field from datetime import date as date_type +from datetime import timedelta from pathlib import Path from typing import TYPE_CHECKING, Any @@ -34,20 +35,56 @@ ModelBacktestResult, SplitBoundary, ) -from app.features.backtesting.splitter import TimeSeriesSplitter -from app.features.data_platform.models import SalesDaily +from app.features.backtesting.splitter import TimeSeriesSplit, TimeSeriesSplitter +from app.features.data_platform.models import Calendar, Product, Promotion, SalesDaily from app.features.forecasting.models import model_factory from app.features.forecasting.schemas import ( ModelConfig, NaiveModelConfig, SeasonalNaiveModelConfig, ) +from app.shared.feature_frames import ( + HISTORY_TAIL_DAYS, + build_future_feature_rows, + build_historical_feature_rows, +) if TYPE_CHECKING: pass logger = structlog.get_logger() +# Minimum observed train rows a feature-aware model needs per fold to resolve +# its lag features — mirrors ``forecasting.service._MIN_REGRESSION_TRAIN_ROWS``. +# A feature-aware backtest with a smaller ``min_train_size`` fails loud in +# ``_validate_config`` rather than producing all-NaN lag columns silently. +_MIN_FEATURE_AWARE_TRAIN_ROWS = 30 + + +@dataclass +class ExogenousFrame: + """Pre-loaded exogenous data for one series — resolved async, consumed sync. + + A feature-aware backtest needs price / promotion / holiday / launch-date + data to build its per-fold feature matrices, but the fold loop is sync and + DB-free by design. ``run_backtest`` resolves all of it once into this pure + in-memory carrier; the fold builders read it without touching the database. + + Attributes: + prices: Observed unit prices, aligned index-for-index with + :attr:`SeriesData.dates`. + baseline_price: Median positive price (``>0``); fallback ``1.0``. + promo_dates: Days a promotion covered anywhere in the data window. + holiday_dates: Calendar holiday days in the data window. + launch_date: The product's launch date, or ``None``. + """ + + prices: list[float] + baseline_price: float + promo_dates: set[date_type] + holiday_dates: set[date_type] + launch_date: date_type | None + @dataclass class SeriesData: @@ -58,6 +95,8 @@ class SeriesData: values: Target values as numpy array. store_id: Store ID. product_id: Product ID. + exogenous: Pre-loaded exogenous data — present only for a feature-aware + backtest; ``None`` for a target-only run. n_observations: Number of observations. """ @@ -65,6 +104,7 @@ class SeriesData: values: np.ndarray[Any, np.dtype[np.floating[Any]]] store_id: int product_id: int + exogenous: ExogenousFrame | None = None n_observations: int = field(init=False) def __post_init__(self) -> None: @@ -126,6 +166,19 @@ def _validate_config(self, config: BacktestConfig) -> None: message="Using provided min_train_size below recommended default", ) + # Feature-aware models need enough train rows per fold to resolve their + # lag features. Build a cheap probe (no fit) and branch on the + # capability flag — never on a model_type string. Loud, not silent. + probe = model_factory( + config.model_config_main, random_state=self.settings.forecast_random_seed + ) + if probe.requires_features and split_config.min_train_size < _MIN_FEATURE_AWARE_TRAIN_ROWS: + raise ValueError( + f"A feature-aware model ({config.model_config_main.model_type}) needs " + f"min_train_size of at least {_MIN_FEATURE_AWARE_TRAIN_ROWS} to resolve its " + f"lag features per fold; got {split_config.min_train_size}." + ) + def save_results( self, response: BacktestResponse, @@ -217,6 +270,21 @@ async def run_backtest( f"between {start_date} and {end_date}" ) + # Feature-aware models consume a per-fold feature matrix. Branch on the + # capability flag (not a model_type string) and resolve the exogenous + # data once, here in the async entry point — the fold loop stays sync + # and DB-free. Target-only models skip this entirely. + probe = model_factory( + config.model_config_main, random_state=self.settings.forecast_random_seed + ) + if probe.requires_features: + series_data.exogenous = await self._load_exogenous_frame( + db=db, + store_id=store_id, + product_id=product_id, + dates=series_data.dates, + ) + # Create splitter and validate splitter = TimeSeriesSplitter(config.split_config) @@ -284,30 +352,61 @@ def _run_model_backtest( ) -> ModelBacktestResult: """Run backtest for a single model configuration. + Branches on the model's ``requires_features`` capability flag (never a + ``model_type`` string). A target-only model takes the unchanged + target-only path; a feature-aware model builds the full historical + feature matrix once (a local — never instance state) and runs each fold + through :meth:`_run_feature_aware_fold` with a leakage-safe per-fold + ``X_train`` slice and a rebuilt ``X_future``. The method signature is + unchanged — ``gap`` is read from ``splitter.config.gap``. Sync and + DB-free: all exogenous I/O happened in :meth:`run_backtest`. + Args: - series_data: Loaded time series data. + series_data: Loaded time series data (carries ``exogenous`` for a + feature-aware run). splitter: Time series splitter. model_config: Model configuration. store_fold_details: Whether to store per-fold details. Returns: - ModelBacktestResult with all fold results. + ModelBacktestResult with all fold results; ``feature_aware`` and + ``exogenous_policy`` are set for a feature-aware model. + + Raises: + ValueError: If a feature-aware model has no loaded ``ExogenousFrame``. """ fold_results: list[FoldResult] = [] fold_metrics: list[dict[str, float]] = [] + # Probe the capability flag, then build the historical matrix once for + # the whole run (feature-aware path only) — sliced, never rebuilt, for + # each fold's X_train. + probe = model_factory(model_config, random_state=self.settings.forecast_random_seed) + feature_aware: bool = probe.requires_features + historical_matrix: np.ndarray[Any, np.dtype[np.floating[Any]]] | None = None + if feature_aware: + historical_matrix = self._build_historical_matrix(series_data) + for split in splitter.split(series_data.dates, series_data.values): # Extract train and test data - y_train = series_data.values[split.train_indices] y_test = series_data.values[split.test_indices] - - # Create and fit model - model = model_factory(model_config, random_state=self.settings.forecast_random_seed) - model.fit(y_train) - - # Generate predictions horizon = len(split.test_indices) - predictions = model.predict(horizon) + + if historical_matrix is not None: + # Feature-aware path — per-fold leakage-safe X_train / X_future. + predictions = self._run_feature_aware_fold( + series_data=series_data, + split=split, + model_config=model_config, + historical_matrix=historical_matrix, + gap=splitter.config.gap, + ) + else: + # Target-only path — unchanged. + y_train = series_data.values[split.train_indices] + model = model_factory(model_config, random_state=self.settings.forecast_random_seed) + model.fit(y_train) + predictions = model.predict(horizon) # Calculate metrics metrics = self.metrics_calculator.calculate_all( @@ -360,7 +459,113 @@ def _run_model_backtest( fold_results=fold_results, aggregated_metrics=aggregated_metrics, metric_std=metric_std, + feature_aware=feature_aware, + exogenous_policy="observed" if feature_aware else None, + ) + + def _build_historical_matrix( + self, series_data: SeriesData + ) -> np.ndarray[Any, np.dtype[np.floating[Any]]]: + """Build the full-series historical feature matrix for a backtest. + + Built once per :meth:`_run_model_backtest` call as a local — never + instance state. Each row is leakage-safe *as a training row*: its lag + columns read only strictly-earlier observed targets. Per-fold + ``X_train`` is a positional slice of this matrix; ``X_future`` is NEVER + sliced from it (that would leak an adjacent test-day target). + + Args: + series_data: Loaded time series data — must carry ``exogenous``. + + Returns: + Row-major feature matrix aligned with ``series_data.dates``. + + Raises: + ValueError: If ``series_data.exogenous`` is ``None`` — the + genuinely-unsupported path for a feature-aware backtest. + """ + exo = series_data.exogenous + if exo is None: + raise ValueError( + "feature-aware backtest requires a loaded ExogenousFrame on series_data; " + "run_backtest must resolve exogenous data before the fold loop" + ) + rows = build_historical_feature_rows( + dates=series_data.dates, + quantities=[float(v) for v in series_data.values], + prices=exo.prices, + baseline_price=exo.baseline_price, + promo_dates=exo.promo_dates, + holiday_dates=exo.holiday_dates, + launch_date=exo.launch_date, ) + return np.array(rows, dtype=np.float64) + + def _run_feature_aware_fold( + self, + *, + series_data: SeriesData, + split: TimeSeriesSplit, + model_config: ModelConfig, + historical_matrix: np.ndarray[Any, np.dtype[np.floating[Any]]], + gap: int, + ) -> np.ndarray[Any, np.dtype[np.floating[Any]]]: + """Fit + predict one fold of a feature-aware backtest — pure, sync. + + ``X_train`` is a positional slice of the full historical matrix (built + once, leakage-safe by position). ``X_future`` is rebuilt here per fold + via :func:`build_future_feature_rows`: its ``history_tail`` ends at the + fold origin ``T`` (the last train day, gap days excluded), so a lag + cell whose source day falls in the test window is ``NaN``. + + Args: + series_data: Loaded time series data — carries ``exogenous``. + split: The fold's train/test split. + model_config: Model configuration (feature-aware). + historical_matrix: The full-series historical feature matrix. + gap: Gap days between train end and test start. + + Returns: + Per-day predictions for the fold's test window. + + Raises: + ValueError: If ``series_data.exogenous`` is ``None``. + """ + exo = series_data.exogenous + if exo is None: # defensive — the caller guarantees this is non-None + raise ValueError("feature-aware backtest requires a loaded ExogenousFrame") + + # X_train — slice the historical matrix (leakage-safe by position). + x_train = historical_matrix[split.train_indices] + y_train = series_data.values[split.train_indices] + + # X_future — rebuilt per fold. history_tail ends at T = last train day + # and EXCLUDES the gap days (data "not yet available" at forecast time). + train_end_idx = int(split.train_indices[-1]) + 1 + history_tail = [float(v) for v in series_data.values[:train_end_idx][-HISTORY_TAIL_DAYS:]] + test_indices = [int(i) for i in split.test_indices] + test_prices = [exo.prices[i] for i in test_indices] + test_promo_dates = { + series_data.dates[i] for i in test_indices if series_data.dates[i] in exo.promo_dates + } + test_holiday_dates = {d for d in split.test_dates if d in exo.holiday_dates} + x_future = np.array( + build_future_feature_rows( + test_dates=split.test_dates, + history_tail=history_tail, + gap=gap, + test_prices=test_prices, + baseline_price=exo.baseline_price, + test_promo_dates=test_promo_dates, + test_holiday_dates=test_holiday_dates, + launch_date=exo.launch_date, + ), + dtype=np.float64, + ) + + model = model_factory(model_config, random_state=self.settings.forecast_random_seed) + model.fit(y_train, x_train) + return model.predict(len(test_indices), x_future) def _run_baseline_comparisons( self, @@ -543,3 +748,105 @@ async def _load_series_data( store_id=store_id, product_id=product_id, ) + + async def _load_exogenous_frame( + self, + db: AsyncSession, + store_id: int, + product_id: int, + dates: list[date_type], + ) -> ExogenousFrame: + """Load exogenous data for a feature-aware backtest — async, once. + + Mirrors ``ForecastingService._build_regression_features``: resolves the + recorded unit price per date, the promotion-covered days, the calendar + holidays, and the product launch date into a pure :class:`ExogenousFrame` + the sync fold loop consumes. The only ``y``-free reads — never a target. + + Args: + db: Database session. + store_id: Store ID. + product_id: Product ID. + dates: The series dates (from :meth:`_load_series_data`) the prices + must align with index-for-index. + + Returns: + The resolved :class:`ExogenousFrame`. + """ + start_date = dates[0] + end_date = dates[-1] + + # Recorded unit price per date — aligned with the series dates. Every + # series date came from the same SalesDaily window, so each resolves. + price_rows = ( + await db.execute( + select(SalesDaily.date, SalesDaily.unit_price).where( + (SalesDaily.store_id == store_id) + & (SalesDaily.product_id == product_id) + & (SalesDaily.date >= start_date) + & (SalesDaily.date <= end_date) + ) + ) + ).all() + price_by_date = {row.date: float(row.unit_price) for row in price_rows} + prices = [price_by_date.get(day, 0.0) for day in dates] + + # Baseline price = median of the positive prices, so price_factor is + # ~1.0 on a typical day and < 1.0 on a markdown/promo day. + positive_prices = sorted(price for price in prices if price > 0.0) + baseline_price = positive_prices[len(positive_prices) // 2] if positive_prices else 1.0 + + holiday_dates: set[date_type] = set( + ( + await db.execute( + select(Calendar.date).where( + Calendar.date >= start_date, + Calendar.date <= end_date, + Calendar.is_holiday.is_(True), + ) + ) + ) + .scalars() + .all() + ) + + # Promotion-active days: store-specific OR chain-wide rows overlapping + # the data window, expanded to the set of dates they cover. + promo_rows = ( + await db.execute( + select(Promotion.start_date, Promotion.end_date).where( + Promotion.product_id == product_id, + (Promotion.store_id == store_id) | (Promotion.store_id.is_(None)), + Promotion.start_date <= end_date, + Promotion.end_date >= start_date, + ) + ) + ).all() + promo_dates: set[date_type] = set() + for promo in promo_rows: + day = max(promo.start_date, start_date) + last = min(promo.end_date, end_date) + while day <= last: + promo_dates.add(day) + day += timedelta(days=1) + + launch_date: date_type | None = await db.scalar( + select(Product.launch_date).where(Product.id == product_id) + ) + + logger.info( + "backtesting.exogenous_frame_loaded", + store_id=store_id, + product_id=product_id, + n_dates=len(dates), + n_holidays=len(holiday_dates), + n_promo_days=len(promo_dates), + ) + + return ExogenousFrame( + prices=prices, + baseline_price=baseline_price, + promo_dates=promo_dates, + holiday_dates=holiday_dates, + launch_date=launch_date, + ) diff --git a/app/features/backtesting/tests/test_feature_aware_backtest.py b/app/features/backtesting/tests/test_feature_aware_backtest.py new file mode 100644 index 00000000..e5281ea6 --- /dev/null +++ b/app/features/backtesting/tests/test_feature_aware_backtest.py @@ -0,0 +1,221 @@ +"""Unit tests for feature-aware backtesting (MLZOO-B.2). + +Pure, DB-free tests of the per-fold feature-aware path wired into +``BacktestingService._run_model_backtest``: the historical matrix build, the +per-fold ``X_train`` slice / ``X_future`` rebuild, the ``feature_aware`` / +``exogenous_policy`` result fields, the gap > 0 fold, and the loud failure when +a feature-aware model reaches the fold loop with no ``ExogenousFrame`` loaded. + +The leakage invariants of the row builders themselves live in the load-bearing +``app/shared/feature_frames/tests/test_leakage.py``; this file pins the +backtesting *integration* of those builders. +""" + +from __future__ import annotations + +from datetime import date + +import numpy as np +import pytest + +from app.features.backtesting.schemas import SplitConfig +from app.features.backtesting.service import ( + BacktestingService, + ExogenousFrame, + SeriesData, +) +from app.features.backtesting.splitter import TimeSeriesSplitter +from app.features.forecasting.schemas import NaiveModelConfig, RegressionModelConfig +from app.shared.feature_frames import canonical_feature_columns + +_N_FEATURES = len(canonical_feature_columns()) # 14 — 4 lags + 6 calendar + 4 exogenous + + +def _exogenous(n: int) -> ExogenousFrame: + """A flat, no-promo, no-holiday ExogenousFrame aligned with an n-day series.""" + return ExogenousFrame( + prices=[9.99] * n, + baseline_price=9.99, + promo_dates=set(), + holiday_dates=set(), + launch_date=None, + ) + + +def _series(dates: list[date], values: np.ndarray, *, with_exogenous: bool) -> SeriesData: + """Build SeriesData, optionally carrying a loaded ExogenousFrame.""" + return SeriesData( + dates=dates, + values=values, + store_id=1, + product_id=1, + exogenous=_exogenous(len(dates)) if with_exogenous else None, + ) + + +def test_canonical_feature_set_is_fourteen_columns() -> None: + """The feature-aware matrices use exactly the 14-column canonical set.""" + assert _N_FEATURES == 14 + + +def test_build_historical_matrix_shape_matches_series_and_columns( + sample_dates_120: list[date], + sample_values_120: np.ndarray, +) -> None: + """The historical matrix has one row per series day, canonical column width.""" + service = BacktestingService() + series = _series(sample_dates_120, sample_values_120, with_exogenous=True) + + matrix = service._build_historical_matrix(series) + + assert matrix.shape == (120, _N_FEATURES) + + +def test_build_historical_matrix_without_exogenous_fails_loud( + sample_dates_120: list[date], + sample_values_120: np.ndarray, +) -> None: + """No ExogenousFrame -> loud ValueError, never a silent all-NaN matrix.""" + service = BacktestingService() + series = _series(sample_dates_120, sample_values_120, with_exogenous=False) + + with pytest.raises(ValueError, match="ExogenousFrame"): + service._build_historical_matrix(series) + + +def test_feature_aware_fold_predicts_one_value_per_test_day( + sample_dates_120: list[date], + sample_values_120: np.ndarray, + sample_split_config_expanding: SplitConfig, +) -> None: + """A single feature-aware fold yields exactly horizon predictions.""" + service = BacktestingService() + series = _series(sample_dates_120, sample_values_120, with_exogenous=True) + splitter = TimeSeriesSplitter(sample_split_config_expanding) + historical_matrix = service._build_historical_matrix(series) + split = next(splitter.split(series.dates, series.values)) + + predictions = service._run_feature_aware_fold( + series_data=series, + split=split, + model_config=RegressionModelConfig(), + historical_matrix=historical_matrix, + gap=sample_split_config_expanding.gap, + ) + + assert predictions.shape == (len(split.test_indices),) + assert np.all(np.isfinite(predictions)) + + +def test_feature_aware_backtest_produces_per_fold_metrics( + sample_dates_120: list[date], + sample_values_120: np.ndarray, + sample_split_config_expanding: SplitConfig, +) -> None: + """A regression backtest runs end-to-end and yields per-fold metrics. + + Repurposed positive assertion — feature-aware models are backtestable now + (supersedes the PRP-29 interim loud-fail contract). + """ + service = BacktestingService() + series = _series(sample_dates_120, sample_values_120, with_exogenous=True) + splitter = TimeSeriesSplitter(sample_split_config_expanding) + + result = service._run_model_backtest( + series_data=series, + splitter=splitter, + model_config=RegressionModelConfig(), + store_fold_details=True, + ) + + assert result.model_type == "regression" + assert len(result.fold_results) > 0 + assert "mae" in result.aggregated_metrics + for fold in result.fold_results: + assert "mae" in fold.metrics + + +def test_feature_aware_result_records_observed_policy( + sample_dates_120: list[date], + sample_values_120: np.ndarray, + sample_split_config_expanding: SplitConfig, +) -> None: + """A feature-aware result is flagged and records the v1 exogenous policy.""" + service = BacktestingService() + series = _series(sample_dates_120, sample_values_120, with_exogenous=True) + splitter = TimeSeriesSplitter(sample_split_config_expanding) + + result = service._run_model_backtest( + series_data=series, + splitter=splitter, + model_config=RegressionModelConfig(), + store_fold_details=True, + ) + + assert result.feature_aware is True + assert result.exogenous_policy == "observed" + + +def test_target_only_result_is_not_flagged_feature_aware( + sample_dates_120: list[date], + sample_values_120: np.ndarray, + sample_split_config_expanding: SplitConfig, +) -> None: + """A target-only baseline keeps feature_aware False and no exogenous policy.""" + service = BacktestingService() + series = _series(sample_dates_120, sample_values_120, with_exogenous=False) + splitter = TimeSeriesSplitter(sample_split_config_expanding) + + result = service._run_model_backtest( + series_data=series, + splitter=splitter, + model_config=NaiveModelConfig(), + store_fold_details=True, + ) + + assert result.feature_aware is False + assert result.exogenous_policy is None + + +def test_feature_aware_backtest_without_exogenous_fails_loud( + sample_dates_120: list[date], + sample_values_120: np.ndarray, + sample_split_config_expanding: SplitConfig, +) -> None: + """A feature-aware model reaching the fold loop with no ExogenousFrame + must fail LOUD — the genuinely-unsupported path (DECISIONS LOCKED #8).""" + service = BacktestingService() + series = _series(sample_dates_120, sample_values_120, with_exogenous=False) + splitter = TimeSeriesSplitter(sample_split_config_expanding) + + with pytest.raises(ValueError, match="ExogenousFrame"): + service._run_model_backtest( + series_data=series, + splitter=splitter, + model_config=RegressionModelConfig(), + store_fold_details=True, + ) + + +def test_feature_aware_backtest_runs_with_a_gap_fold( + sample_dates_120: list[date], + sample_values_120: np.ndarray, + sample_split_config_with_gap: SplitConfig, +) -> None: + """A gap > 0 fold runs — the lag columns drop the gap lead-in correctly.""" + assert sample_split_config_with_gap.gap > 0 + service = BacktestingService() + series = _series(sample_dates_120, sample_values_120, with_exogenous=True) + splitter = TimeSeriesSplitter(sample_split_config_with_gap) + + result = service._run_model_backtest( + series_data=series, + splitter=splitter, + model_config=RegressionModelConfig(), + store_fold_details=True, + ) + + assert result.feature_aware is True + assert len(result.fold_results) > 0 + for fold in result.fold_results: + assert np.isfinite(fold.metrics["mae"]) diff --git a/app/features/backtesting/tests/test_routes_integration.py b/app/features/backtesting/tests/test_routes_integration.py index efe2af33..a0533146 100644 --- a/app/features/backtesting/tests/test_routes_integration.py +++ b/app/features/backtesting/tests/test_routes_integration.py @@ -393,3 +393,85 @@ async def test_response_contains_all_expected_fields( assert "test_end" in split assert "train_size" in split assert "test_size" in split + + +@pytest.mark.integration +@pytest.mark.asyncio +class TestBacktestingRouteFeatureAwareIntegration: + """Integration tests for POST /backtesting/run with a feature-aware model.""" + + async def test_run_backtest_regression_model( + self, + client: AsyncClient, + sample_store: Store, + sample_product: Product, + sample_sales_120: list[SalesDaily], + ) -> None: + """A regression backtest returns 200 with per-fold metrics + baselines.""" + response = await client.post( + "/backtesting/run", + json={ + "store_id": sample_store.id, + "product_id": sample_product.id, + "start_date": "2024-01-01", + "end_date": "2024-04-29", + "config": { + "split_config": { + "strategy": "expanding", + "n_splits": 3, + "min_train_size": 30, + "gap": 0, + "horizon": 14, + }, + "model_config_main": {"model_type": "regression"}, + "include_baselines": True, + "store_fold_details": True, + }, + }, + ) + + assert response.status_code == 200 + data = response.json() + + main = data["main_model_results"] + assert main["model_type"] == "regression" + assert main["feature_aware"] is True + assert main["exogenous_policy"] == "observed" + assert len(main["fold_results"]) > 0 + assert data["leakage_check_passed"] is True + + baseline_types = {b["model_type"] for b in data["baseline_results"]} + assert baseline_types == {"naive", "seasonal_naive"} + + async def test_run_backtest_regression_rejects_small_min_train_size( + self, + client: AsyncClient, + sample_store: Store, + sample_product: Product, + sample_sales_120: list[SalesDaily], + ) -> None: + """A regression backtest with min_train_size < 30 returns RFC 7807 400.""" + response = await client.post( + "/backtesting/run", + json={ + "store_id": sample_store.id, + "product_id": sample_product.id, + "start_date": "2024-01-01", + "end_date": "2024-04-29", + "config": { + "split_config": { + "strategy": "expanding", + "n_splits": 3, + "min_train_size": 20, + "gap": 0, + "horizon": 14, + }, + "model_config_main": {"model_type": "regression"}, + "include_baselines": False, + "store_fold_details": True, + }, + }, + ) + + assert response.status_code == 400 + assert "at least 30" in response.text diff --git a/app/features/backtesting/tests/test_schemas.py b/app/features/backtesting/tests/test_schemas.py index 31eec119..c071fae9 100644 --- a/app/features/backtesting/tests/test_schemas.py +++ b/app/features/backtesting/tests/test_schemas.py @@ -225,6 +225,48 @@ def test_model_backtest_result_creation(self): assert result.model_type == "naive" assert result.aggregated_metrics["mae"] == 5.0 + def test_feature_aware_fields_default_to_target_only(self): + """The MLZOO-B.2 fields default to a target-only result — every existing + construction site (which omits them) stays valid.""" + result = ModelBacktestResult( + model_type="naive", + config_hash="abc123", + fold_results=[], + aggregated_metrics={"mae": 5.0}, + metric_std={"mae_stability": 10.0}, + ) + + assert result.feature_aware is False + assert result.exogenous_policy is None + + def test_feature_aware_fields_accept_explicit_values(self): + """A feature-aware result carries feature_aware=True + the exogenous policy.""" + result = ModelBacktestResult( + model_type="regression", + config_hash="def456", + fold_results=[], + aggregated_metrics={"mae": 3.0}, + metric_std={"mae_stability": 8.0}, + feature_aware=True, + exogenous_policy="observed", + ) + + assert result.feature_aware is True + assert result.exogenous_policy == "observed" + + def test_exogenous_policy_rejects_an_unknown_value(self): + """exogenous_policy is a closed Literal — only 'observed' in v1.""" + with pytest.raises(ValidationError): + ModelBacktestResult( + model_type="regression", + config_hash="def456", + fold_results=[], + aggregated_metrics={"mae": 3.0}, + metric_std={"mae_stability": 8.0}, + feature_aware=True, + exogenous_policy="assumptions", # type: ignore[arg-type] + ) + class TestBacktestRequest: """Tests for BacktestRequest schema.""" diff --git a/app/features/backtesting/tests/test_service.py b/app/features/backtesting/tests/test_service.py index 87d17cb1..992ccb80 100644 --- a/app/features/backtesting/tests/test_service.py +++ b/app/features/backtesting/tests/test_service.py @@ -125,11 +125,16 @@ def test_feature_aware_model_fails_loud_in_backtest( ) -> None: """A feature-aware model must fail LOUD in a backtest, never run silently. - The fold loop calls ``model.fit(y_train)`` target-only; a - ``RegressionForecaster`` (``requires_features=True``) raises ``ValueError`` - there because no exogenous ``X`` was supplied. Feature-aware backtesting - is wired in PRP-MLZOO-B — until then this loud, non-leaky failure is the - contract (PRP-29 DECISIONS LOCKED #7). + Repurposed by PRP-MLZOO-B.2 (DECISIONS LOCKED #8): feature-aware + backtesting is now wired, so a ``regression`` backtest SUCCEEDS once + ``run_backtest`` has resolved the exogenous data — the positive + assertions live in ``test_feature_aware_backtest.py``. PRP-29 DECISIONS + LOCKED #7 and PRP-30 DECISIONS LOCKED #6 are therefore superseded. + + What stays loud — and is pinned here — is the genuinely-unsupported + path: a ``requires_features`` model reaching the fold loop with no + ``ExogenousFrame`` loaded on ``series_data`` raises ``ValueError`` + rather than degrading to a silent all-NaN matrix. """ from app.features.backtesting.splitter import TimeSeriesSplitter from app.features.forecasting.schemas import RegressionModelConfig @@ -140,10 +145,10 @@ def test_feature_aware_model_fails_loud_in_backtest( values=sample_values_120, store_id=1, product_id=1, - ) + ) # NOTE: no ExogenousFrame loaded — the unsupported path. splitter = TimeSeriesSplitter(sample_split_config_expanding) - with pytest.raises(ValueError, match="requires exogenous features X"): + with pytest.raises(ValueError, match="ExogenousFrame"): service._run_model_backtest( series_data=series_data, splitter=splitter, diff --git a/app/features/backtesting/tests/test_service_integration.py b/app/features/backtesting/tests/test_service_integration.py index d1b0fbd7..81e661ca 100644 --- a/app/features/backtesting/tests/test_service_integration.py +++ b/app/features/backtesting/tests/test_service_integration.py @@ -295,3 +295,99 @@ async def test_backtest_with_gap_produces_correct_splits( actual_gap = (test_start - train_end).days # Gap should be at least gap_days (could be more if data is sparse) assert actual_gap >= gap_days, f"Expected gap >= {gap_days}, got {actual_gap}" + + +@pytest.mark.integration +@pytest.mark.asyncio +class TestBacktestingServiceFeatureAwareIntegration: + """Integration tests for feature-aware backtesting (MLZOO-B.2). + + A ``regression`` model is evaluated end-to-end against the real database: + ``run_backtest`` resolves the exogenous frame, the fold loop builds the + per-fold leakage-safe ``X_train`` / ``X_future``, and the result is + compared head-to-head with the naive / seasonal baselines. + """ + + async def test_regression_backtest_runs_with_baseline_comparison( + self, + db_session: AsyncSession, + sample_store: Store, + sample_product: Product, + sample_sales_120: list[SalesDaily], + ) -> None: + """A regression backtest yields per-fold metrics + a baseline comparison.""" + from app.features.forecasting.schemas import RegressionModelConfig + + service = BacktestingService() + config = BacktestConfig( + split_config=SplitConfig( + strategy="expanding", + n_splits=3, + min_train_size=30, + gap=0, + horizon=14, + ), + model_config_main=RegressionModelConfig(), + include_baselines=True, + store_fold_details=True, + ) + + response = await service.run_backtest( + db=db_session, + store_id=sample_store.id, + product_id=sample_product.id, + start_date=date(2024, 1, 1), + end_date=date(2024, 4, 29), + config=config, + ) + + main = response.main_model_results + assert main.model_type == "regression" + assert main.feature_aware is True + assert main.exogenous_policy == "observed" + assert len(main.fold_results) > 0 + assert "mae" in main.aggregated_metrics + assert response.leakage_check_passed is True + + # Baselines stay target-only and unflagged. + assert response.baseline_results is not None + baseline_types = {b.model_type for b in response.baseline_results} + assert baseline_types == {"naive", "seasonal_naive"} + for baseline in response.baseline_results: + assert baseline.feature_aware is False + assert baseline.exogenous_policy is None + assert response.comparison_summary is not None + + async def test_regression_backtest_rejects_small_min_train_size( + self, + db_session: AsyncSession, + sample_store: Store, + sample_product: Product, + sample_sales_120: list[SalesDaily], + ) -> None: + """A feature-aware backtest with min_train_size < 30 fails loud.""" + from app.features.forecasting.schemas import RegressionModelConfig + + service = BacktestingService() + config = BacktestConfig( + split_config=SplitConfig( + strategy="expanding", + n_splits=3, + min_train_size=20, + gap=0, + horizon=14, + ), + model_config_main=RegressionModelConfig(), + include_baselines=False, + store_fold_details=True, + ) + + with pytest.raises(ValueError, match="at least 30"): + await service.run_backtest( + db=db_session, + store_id=sample_store.id, + product_id=sample_product.id, + start_date=date(2024, 1, 1), + end_date=date(2024, 4, 29), + config=config, + ) diff --git a/app/features/forecasting/service.py b/app/features/forecasting/service.py index f2affc75..ca706803 100644 --- a/app/features/forecasting/service.py +++ b/app/features/forecasting/service.py @@ -11,7 +11,6 @@ from __future__ import annotations -import math import time import uuid from dataclasses import dataclass, field @@ -40,10 +39,8 @@ TrainResponse, ) from app.shared.feature_frames import ( - CALENDAR_COLUMNS, - EXOGENOUS_LAGS, HISTORY_TAIL_DAYS, - build_calendar_columns, + build_historical_feature_rows, canonical_feature_columns, ) @@ -133,9 +130,11 @@ def _assemble_regression_rows( then the calendar columns, then ``price_factor``, ``promo_active``, ``is_holiday``, ``days_since_launch``. - Extracted from :meth:`ForecastingService._build_regression_features` so the - leakage invariant can be unit-tested without a database - (``test_regression_features_leakage.py``). + Delegating shim (MLZOO-B.2): the row-assembly body was promoted verbatim + to :func:`app.shared.feature_frames.build_historical_feature_rows` so the + ``backtesting`` slice can reuse it without a forbidden cross-slice import. + This wrapper keeps the name and signature byte-stable, so the load-bearing + leakage spec (``test_regression_features_leakage.py``) imports it unchanged. Args: dates: Observed days in chronological order. @@ -151,23 +150,15 @@ def _assemble_regression_rows( a lag whose source day precedes the series, and ``days_since_launch`` when the product has no launch date. """ - calendar_columns = build_calendar_columns(dates) - rows: list[list[float]] = [] - for index, day in enumerate(dates): - row: list[float] = [] - # Target long-lag columns — read only strictly-earlier observations. - for lag in EXOGENOUS_LAGS: - row.append(quantities[index - lag] if index >= lag else math.nan) - # Calendar columns — pure functions of the date (shared builder). - for name in CALENDAR_COLUMNS: - row.append(calendar_columns[name][index]) - # Exogenous columns — same-day observed attributes. - row.append(prices[index] / baseline_price) - row.append(1.0 if day in promo_dates else 0.0) - row.append(1.0 if day in holiday_dates else 0.0) - row.append(float((day - launch_date).days) if launch_date is not None else math.nan) - rows.append(row) - return rows + return build_historical_feature_rows( + dates=dates, + quantities=quantities, + prices=prices, + baseline_price=baseline_price, + promo_dates=promo_dates, + holiday_dates=holiday_dates, + launch_date=launch_date, + ) class ForecastingService: diff --git a/app/features/jobs/service.py b/app/features/jobs/service.py index fd0dfdc6..9ec982a0 100644 --- a/app/features/jobs/service.py +++ b/app/features/jobs/service.py @@ -609,8 +609,10 @@ async def _execute_backtest( from app.features.backtesting.schemas import BacktestConfig, SplitConfig from app.features.backtesting.service import BacktestingService from app.features.forecasting.schemas import ( + LightGBMModelConfig, MovingAverageModelConfig, NaiveModelConfig, + RegressionModelConfig, SeasonalNaiveModelConfig, ) @@ -644,6 +646,13 @@ async def _execute_backtest( elif model_type == "moving_average": window_size = params.get("window_size", 7) model_config = MovingAverageModelConfig(window_size=window_size) + elif model_type == "regression": + # Feature-aware — the backtest builds per-fold leakage-safe X. + model_config = RegressionModelConfig() + elif model_type == "lightgbm": + # Feature-aware — still gated by forecast_enable_lightgbm inside + # model_factory; a disabled flag surfaces as a loud failed job. + model_config = LightGBMModelConfig() else: msg = f"Unsupported model_type: {model_type}" raise ValueError(msg) diff --git a/app/features/jobs/tests/test_service.py b/app/features/jobs/tests/test_service.py index 834c630d..9773cb96 100644 --- a/app/features/jobs/tests/test_service.py +++ b/app/features/jobs/tests/test_service.py @@ -21,6 +21,7 @@ SplitBoundary, SplitConfig, ) +from app.features.backtesting.service import BacktestingService from app.features.forecasting.schemas import ( LightGBMModelConfig, RegressionModelConfig, @@ -246,3 +247,66 @@ async def test_execute_train_rejects_unsupported_model_type() -> None: db=cast(AsyncSession, AsyncMock()), params={**_REGRESSION_PARAMS, "model_type": "arima"}, ) + + +# Parameters for a backtest job — _execute_backtest reads these keys. +_BACKTEST_PARAMS: dict[str, Any] = { + "model_type": "regression", + "store_id": 1, + "product_id": 1, + "start_date": "2024-01-01", + "end_date": "2024-12-01", + "n_splits": 3, +} + + +async def test_execute_backtest_builds_regression_config() -> None: + """A backtest job with model_type='regression' builds a RegressionModelConfig. + + ``run_backtest`` is mocked, so the test is pure (no DB): it pins that + ``_execute_backtest`` widened its allow-list and shaped the result. + """ + response = _make_response() + with patch.object( + BacktestingService, "run_backtest", new=AsyncMock(return_value=response) + ) as mock_run: + result = await JobService()._execute_backtest( + db=cast(AsyncSession, AsyncMock()), + params=_BACKTEST_PARAMS, + ) + assert mock_run.call_args is not None + config = mock_run.call_args.kwargs["config"] + assert isinstance(config.model_config_main, RegressionModelConfig) + assert result["model_type"] == "regression" + # The frontend job-result contract is still shaped (byte-stable keys). + assert "fold_metrics" in result + assert "aggregated_metrics" in result + + +async def test_execute_backtest_builds_lightgbm_config() -> None: + """A backtest job with model_type='lightgbm' builds a LightGBMModelConfig. + + ``run_backtest`` is mocked, so ``model_factory``'s feature-flag gate is + never reached and the optional lightgbm dependency is not required. + """ + response = _make_response() + with patch.object( + BacktestingService, "run_backtest", new=AsyncMock(return_value=response) + ) as mock_run: + result = await JobService()._execute_backtest( + db=cast(AsyncSession, AsyncMock()), + params={**_BACKTEST_PARAMS, "model_type": "lightgbm"}, + ) + assert mock_run.call_args is not None + config = mock_run.call_args.kwargs["config"] + assert isinstance(config.model_config_main, LightGBMModelConfig) + assert result["model_type"] == "lightgbm" + + +async def test_execute_backtest_rejects_unsupported_model_type() -> None: + """_execute_backtest still rejects a genuinely unsupported model_type.""" + with pytest.raises(ValueError, match="Unsupported model_type"): + await JobService()._execute_backtest( + db=cast(AsyncSession, AsyncMock()), + params={**_BACKTEST_PARAMS, "model_type": "arima"}, + ) diff --git a/app/shared/feature_frames/__init__.py b/app/shared/feature_frames/__init__.py index b79d7d92..df0568b4 100644 --- a/app/shared/feature_frames/__init__.py +++ b/app/shared/feature_frames/__init__.py @@ -23,6 +23,10 @@ canonical_feature_columns, feature_safety, ) +from app.shared.feature_frames.rows import ( + build_future_feature_rows, + build_historical_feature_rows, +) __all__ = [ "CALENDAR_COLUMNS", @@ -33,6 +37,8 @@ "FeatureSafety", "FutureFeatureFrame", "build_calendar_columns", + "build_future_feature_rows", + "build_historical_feature_rows", "build_long_lag_columns", "canonical_feature_columns", "feature_safety", diff --git a/app/shared/feature_frames/rows.py b/app/shared/feature_frames/rows.py new file mode 100644 index 00000000..9aed8adc --- /dev/null +++ b/app/shared/feature_frames/rows.py @@ -0,0 +1,201 @@ +"""Shared row-matrix assemblers for feature-aware forecasting (MLZOO-B.2). + +This module joins :mod:`app.shared.feature_frames.contract` under the +cross-cutting ``app/shared/feature_frames`` package. ``contract.py`` owns the +pinned constants, the canonical column set, the leakage-safe *column* builders +and the :class:`~app.shared.feature_frames.contract.FeatureSafety` taxonomy; +this module owns the two *row-matrix* assemblers built on top of them: + +* :func:`build_historical_feature_rows` — the historical (training) feature + matrix. Promoted verbatim from ``ForecastingService._assemble_regression_rows`` + so the ``backtesting`` slice can reuse it without a forbidden cross-slice + import (``backtesting -> forecasting`` is not allowed; ``-> app/shared`` is). +* :func:`build_future_feature_rows` — the test-window (future) feature matrix + for a backtest fold. Leakage-safe by construction (see below). + +LEAF-LEVEL: like ``contract.py`` this module may NEVER import from +``app/features/**``. Every function is pure — stdlib ``math`` / ``datetime`` +plus the contract builders only. ``tests/test_contract.py`` enforces it with +an AST walk; ``tests/test_leakage.py`` pins the leakage invariants. + +The leakage rule the future builder obeys (mirrors ``contract.py`` and the +load-bearing ``tests/test_leakage.py``): + + A future feature value for a test-window day may use ONLY information + knowable at the forecast origin ``T`` — the observed history up to and + including ``T``, the calendar (a pure function of the date), or an + exogenous input recorded for the test window (price / promotion). It may + NEVER read an observed *target* at a test-window day. +""" + +from __future__ import annotations + +import math +from datetime import date + +from app.shared.feature_frames.contract import ( + CALENDAR_COLUMNS, + EXOGENOUS_LAGS, + build_calendar_columns, + build_long_lag_columns, + canonical_feature_columns, +) + + +def build_historical_feature_rows( + *, + dates: list[date], + quantities: list[float], + prices: list[float], + baseline_price: float, + promo_dates: set[date], + holiday_dates: set[date], + launch_date: date | None, +) -> list[list[float]]: + """Assemble the historical regression feature matrix — pure, leakage-safe. + + Time-safe by construction: every lag column at row ``i`` reads only the + observed target at ``i - lag`` (a strictly earlier day); calendar columns + are pure functions of the date; ``price_factor`` / ``promo_active`` / + ``is_holiday`` / ``days_since_launch`` read the same-day exogenous + attributes. No row reads a future observation. + + Column order is :func:`canonical_feature_columns` exactly: the target + lags, then the calendar columns, then ``price_factor``, ``promo_active``, + ``is_holiday``, ``days_since_launch``. + + Promoted verbatim from ``ForecastingService._assemble_regression_rows`` + (which now delegates here) so the leakage invariant is unit-tested without + a database (``app/features/forecasting/tests/test_regression_features_leakage.py``) + and the ``backtesting`` slice can reuse it without a cross-slice import. + + Args: + dates: Observed days in chronological order. + quantities: Observed target values aligned with ``dates``. + prices: Observed unit prices aligned with ``dates``. + baseline_price: The typical price; ``price_factor`` is the ratio to it. + promo_dates: Days a promotion covered. + holiday_dates: Calendar holiday days. + launch_date: The product's launch date, or ``None``. + + Returns: + Row-major feature matrix ``[n_observations][n_features]``; ``NaN`` marks + a lag whose source day precedes the series, and ``days_since_launch`` + when the product has no launch date. + """ + calendar_columns = build_calendar_columns(dates) + rows: list[list[float]] = [] + for index, day in enumerate(dates): + row: list[float] = [] + # Target long-lag columns — read only strictly-earlier observations. + for lag in EXOGENOUS_LAGS: + row.append(quantities[index - lag] if index >= lag else math.nan) + # Calendar columns — pure functions of the date (shared builder). + for name in CALENDAR_COLUMNS: + row.append(calendar_columns[name][index]) + # Exogenous columns — same-day observed attributes. + row.append(prices[index] / baseline_price) + row.append(1.0 if day in promo_dates else 0.0) + row.append(1.0 if day in holiday_dates else 0.0) + row.append(float((day - launch_date).days) if launch_date is not None else math.nan) + rows.append(row) + return rows + + +def build_future_feature_rows( + *, + test_dates: list[date], + history_tail: list[float], + gap: int, + test_prices: list[float], + baseline_price: float, + test_promo_dates: set[date], + test_holiday_dates: set[date], + launch_date: date | None, +) -> list[list[float]]: + """Assemble a backtest fold's test-window feature matrix — leakage-safe. + + This is the leakage-critical builder. A test-window day has no observed + target, so the matrix MUST be rebuilt here rather than sliced from the + historical matrix — a sliced historical row would read an adjacent + test-day observed target as its ``lag_1`` cell (target leakage). + + Column population by class (matches the canonical column order exactly): + + * **Target lags** (``lag_*``) — from :func:`build_long_lag_columns` over + ``history_tail``, which ends at the fold origin ``T``. A lag cell whose + source day lies in the test window is ``NaN`` — structurally enforced, + never a recursive prediction. + * **Calendar columns** — pure functions of the test-window date. + * **Exogenous columns** (``price_factor`` / ``promo_active``) — the + *observed* recorded price / promotion for the test window. This reads + no target ``y`` (not target leakage); it is exogenous foresight under + the ``observed`` policy and assumes the future price/promo plan was + known at ``T``. + * ``is_holiday`` / ``days_since_launch`` — calendar / launch-date + attributes, knowable at ``T``. + + Gap handling: with ``gap > 0`` the first test day is ``T + gap + 1`` but + :func:`build_long_lag_columns` indexes its day ``m`` as ``T + m``. The lag + columns are therefore built for ``gap + len(test_dates)`` days and the + first ``gap`` rows dropped. With ``gap == 0`` the slice is a no-op. + + Args: + test_dates: The fold's test-window days (chronological). + history_tail: Observed targets ending at the fold origin ``T`` + (``history_tail[-1] == y[T]``); excludes the gap days. + gap: Gap days between train end and test start (simulated latency). + test_prices: Recorded unit prices aligned with ``test_dates``. + baseline_price: The typical price; ``price_factor`` is the ratio to it. + test_promo_dates: Test-window days a promotion covered. + test_holiday_dates: Test-window calendar holiday days. + launch_date: The product's launch date, or ``None``. + + Returns: + Row-major feature matrix ``[len(test_dates)][n_features]`` in canonical + column order; ``NaN`` marks a future-sourced lag cell and + ``days_since_launch`` when the product has no launch date. + + Raises: + ValueError: When ``gap`` is negative, ``test_prices`` does not align + with ``test_dates``, or a canonical column cannot be sourced. + """ + horizon = len(test_dates) + if gap < 0: + raise ValueError(f"build_future_feature_rows: gap must be >= 0, got {gap}") + if len(test_prices) != horizon: + raise ValueError( + f"build_future_feature_rows: test_prices has {len(test_prices)} entries " + f"but test_dates has {horizon} — they must align" + ) + + # Lags: build for gap + horizon days, then drop the gap lead-in so row j + # corresponds to test day j. NaN-where-future is enforced by the builder. + lag_columns = build_long_lag_columns(history_tail, gap + horizon) + lag_columns = {name: values[gap:] for name, values in lag_columns.items()} + calendar_columns = build_calendar_columns(test_dates) + calendar_names = set(CALENDAR_COLUMNS) + columns = canonical_feature_columns() + + rows: list[list[float]] = [] + for j, day in enumerate(test_dates): + row: list[float] = [] + for column in columns: + if column.startswith("lag_"): # target lag — NaN where future + row.append(lag_columns[column][j]) + elif column in calendar_names: # pure function of the date + row.append(calendar_columns[column][j]) + elif column == "price_factor": # observed exogenous foresight + row.append(test_prices[j] / baseline_price) + elif column == "promo_active": # observed exogenous foresight + row.append(1.0 if day in test_promo_dates else 0.0) + elif column == "is_holiday": # calendar attribute + row.append(1.0 if day in test_holiday_dates else 0.0) + elif column == "days_since_launch": # pure function of the date + row.append(float((day - launch_date).days) if launch_date is not None else math.nan) + else: # loud failure — never a silent 0.0 / NaN fill + raise ValueError( + f"build_future_feature_rows: cannot source future column {column!r}" + ) + rows.append(row) + return rows diff --git a/app/shared/feature_frames/tests/test_contract.py b/app/shared/feature_frames/tests/test_contract.py index 0b68fd6f..6280e19e 100644 --- a/app/shared/feature_frames/tests/test_contract.py +++ b/app/shared/feature_frames/tests/test_contract.py @@ -125,10 +125,14 @@ def test_shared_package_imports_nothing_from_features() -> None: """``app/shared/**`` is leaf-level — it may never import a vertical slice. Walks every ``.py`` file in the package and asserts no module imports a - name under ``app.features`` (AGENTS.md § Architecture). + name under ``app.features`` (AGENTS.md § Architecture). ``rows.py`` (the + MLZOO-B.2 row-matrix assemblers) is walked here too — see the explicit + assertion below that it is part of the package. """ pkg_dir = Path(__file__).resolve().parents[1] # app/shared/feature_frames/ + walked: set[str] = set() for py_file in pkg_dir.rglob("*.py"): + walked.add(py_file.name) source = py_file.read_text(encoding="utf-8") for node in ast.walk(ast.parse(source)): if isinstance(node, ast.ImportFrom) and node.module: @@ -140,3 +144,7 @@ def test_shared_package_imports_nothing_from_features() -> None: assert not alias.name.startswith("app.features"), ( f"ARCHITECTURE BREACH: {py_file} imports {alias.name}" ) + # rows.py must exist and be covered by the leaf-level walk above. + assert {"contract.py", "rows.py"} <= walked, ( + f"expected contract.py and rows.py in the package walk, got {sorted(walked)}" + ) diff --git a/app/shared/feature_frames/tests/test_leakage.py b/app/shared/feature_frames/tests/test_leakage.py index 56ba62f4..f1730d64 100644 --- a/app/shared/feature_frames/tests/test_leakage.py +++ b/app/shared/feature_frames/tests/test_leakage.py @@ -33,10 +33,15 @@ import math from datetime import date, timedelta +import pytest + from app.shared.feature_frames import ( EXOGENOUS_LAGS, build_calendar_columns, + build_future_feature_rows, + build_historical_feature_rows, build_long_lag_columns, + canonical_feature_columns, ) # The forecast origin T is the last observed day; the horizon runs T+1 … T+H. @@ -114,3 +119,136 @@ def test_calendar_columns_are_independent_of_the_target_series() -> None: for cell in values: assert cell not in history_values assert cell not in _FUTURE_TARGETS + + +# --- build_future_feature_rows — the backtest test-window matrix (MLZOO-B.2) -- +# +# build_future_feature_rows assembles one backtest fold's test-window feature +# matrix. It receives ONLY history_tail (entirely <= the fold origin T) — it is +# structurally incapable of reading a test-window observed target. These specs +# pin that, the NaN-where-future contract (including a gap > 0 fold), and the +# historical-vs-future asymmetry that is the reason X_future is rebuilt here +# rather than sliced from the historical matrix. + +_TEST_WINDOW = 14 +_TEST_PRICES = [10.0] * _TEST_WINDOW + + +def test_future_lag_cells_are_drawn_only_from_history() -> None: + """Every non-NaN future lag cell comes from ``history_tail`` — never a target. + + ``build_future_feature_rows`` takes only ``history_tail`` as target data; + a value disjoint from it appearing in any lag cell would be a leak. + """ + test_dates = [_ORIGIN + timedelta(days=offset) for offset in range(1, _TEST_WINDOW + 1)] + columns = canonical_feature_columns() + rows = build_future_feature_rows( + test_dates=test_dates, + history_tail=_HISTORY_TAIL, + gap=0, + test_prices=_TEST_PRICES, + baseline_price=10.0, + test_promo_dates=set(), + test_holiday_dates=set(), + launch_date=None, + ) + history_values = set(_HISTORY_TAIL) + for lag in EXOGENOUS_LAGS: + col = columns.index(f"lag_{lag}") + for j in range(_TEST_WINDOW): + cell = rows[j][col] + if math.isnan(cell): + continue + assert cell in history_values, ( + f"lag_{lag} test day {j} emitted {cell}, not an observed history value" + ) + assert cell not in _FUTURE_TARGETS, ( + f"lag_{lag} test day {j} leaked a future target value {cell}" + ) + + +@pytest.mark.parametrize("gap", [0, 3, 7]) +def test_future_lag_is_nan_exactly_where_source_is_a_test_day(gap: int) -> None: + """A future lag cell is ``NaN`` exactly when its source day is in the test window. + + For lag ``k`` and test day ``j`` (0-indexed) the source day relative to the + origin ``T`` is ``T + gap + j + 1 - k``; it lies in the test window — and + the cell MUST be ``NaN`` — exactly when ``gap + j - k >= 0``. Otherwise the + source is observed history and the cell MUST carry a value. + """ + test_dates = [_ORIGIN + timedelta(days=gap + offset) for offset in range(1, _TEST_WINDOW + 1)] + columns = canonical_feature_columns() + rows = build_future_feature_rows( + test_dates=test_dates, + history_tail=_HISTORY_TAIL, + gap=gap, + test_prices=_TEST_PRICES, + baseline_price=10.0, + test_promo_dates=set(), + test_holiday_dates=set(), + launch_date=None, + ) + for lag in EXOGENOUS_LAGS: + col = columns.index(f"lag_{lag}") + for j in range(_TEST_WINDOW): + cell = rows[j][col] + if gap + j - lag >= 0: + assert math.isnan(cell), ( + f"gap={gap} lag_{lag} day {j}: source is a test day — expected NaN, got {cell}" + ) + else: + assert not math.isnan(cell), ( + f"gap={gap} lag_{lag} day {j}: source is in history — expected a value, got NaN" + ) + + +def test_historical_and_future_lag_columns_are_asymmetric() -> None: + """The crux of MLZOO-B.2: a historical lag row reads adjacent observed targets; + a future lag row does NOT — which is why ``X_future`` is rebuilt here and + never sliced from the historical matrix. + + A continuous sequential series is split at the origin ``T``. The historical + matrix row for a test-window day reads that day's neighbouring *observed* + target as ``lag_1`` (slicing it for ``X_future`` would be target leakage). + The future matrix produces ``NaN`` there instead. + """ + series_len = 60 + train_end = 40 # origin T is index 39 (the last train day) + full = [float(i + 1) for i in range(series_len)] + history_tail = full[:train_end] + columns = canonical_feature_columns() + lag1 = columns.index("lag_1") + + historical = build_historical_feature_rows( + dates=[_ORIGIN + timedelta(days=offset) for offset in range(series_len)], + quantities=full, + prices=[10.0] * series_len, + baseline_price=10.0, + promo_dates=set(), + holiday_dates=set(), + launch_date=None, + ) + # The historical matrix row for a TEST-window day reads an observed + # test-day target as lag_1 — proof that slicing it for X_future leaks. + assert historical[train_end + 1][lag1] == full[train_end], ( + "historical lag_1 for a test-window row must read the adjacent observed target" + ) + + test_window = 10 + future = build_future_feature_rows( + test_dates=[_ORIGIN + timedelta(days=offset) for offset in range(1, test_window + 1)], + history_tail=history_tail, + gap=0, + test_prices=[10.0] * test_window, + baseline_price=10.0, + test_promo_dates=set(), + test_holiday_dates=set(), + launch_date=None, + ) + # The future matrix: test day 0's lag_1 is y[T] (knowable); every later + # day's lag_1 is NaN — it never reads a test-window observed target. + assert future[0][lag1] == history_tail[-1], "future lag_1 day 0 must be the origin y[T]" + for j in range(1, test_window): + assert math.isnan(future[j][lag1]), ( + f"future lag_1 test day {j} must be NaN — it must never read a test-window target" + ) diff --git a/docs/PHASE/5-BACKTESTING.md b/docs/PHASE/5-BACKTESTING.md index d06d8d2e..b465b3d1 100644 --- a/docs/PHASE/5-BACKTESTING.md +++ b/docs/PHASE/5-BACKTESTING.md @@ -373,6 +373,39 @@ $ uv run pytest app/features/backtesting/tests/ -v -m integration --- +## Feature-Aware Backtesting (MLZOO-B.2) + +Since PRP-MLZOO-B.2 the fold loop also evaluates **feature-aware** models +(`regression`, `lightgbm` — any model with `requires_features=True`), not just +the target-only baselines. + +**How it works**: +1. `run_backtest` (async) probes the model's `requires_features` flag. For a + feature-aware model it resolves the exogenous data — recorded price, + promotion windows, calendar holidays, product launch date — once, into a + pure in-memory `ExogenousFrame`. The fold loop stays sync and DB-free. +2. `_run_model_backtest` branches on the flag. The feature-aware path builds the + full historical feature matrix once, then per fold: + - **`X_train`** — a positional slice of that historical matrix. + - **`X_future`** — rebuilt per fold by `build_future_feature_rows` (never + sliced — that would leak an adjacent test-day target as `lag_1`). Its + `history_tail` ends at the fold origin `T`, so future-sourced lag cells + are `NaN`; with `gap > 0` the lag columns drop the gap lead-in. +3. The result records `feature_aware: true` and `exogenous_policy: "observed"` + (the v1 policy — the recorded test-window price/promotion plan; exogenous + foresight, not target leakage). + +**Constraints**: +- `min_train_size >= 30` is required for a feature-aware backtest (each fold + must resolve its lag features) — a smaller value raises `ValueError` (400). +- The naive / seasonal baselines stay target-only — they take the unchanged + code path even when the main model is feature-aware. + +The per-fold row builders live in `app/shared/feature_frames/rows.py`; the +leakage invariants are pinned by `app/shared/feature_frames/tests/test_leakage.py`. + +--- + ## Next Phase Preparation Phase 6 (Model Registry) will use the backtesting module to: diff --git a/examples/models/feature_frame_contract.md b/examples/models/feature_frame_contract.md index 407a0bac..27356c73 100644 --- a/examples/models/feature_frame_contract.md +++ b/examples/models/feature_frame_contract.md @@ -105,12 +105,29 @@ A new feature-aware model (PRP-MLZOO-B onward): 2. Reuses the **shared** frame builders and `canonical_feature_columns()` — it writes **zero** new contract code, so it cannot drift from the regression contract. -3. Consumes the historical frame for `fit(y, X)` and (via - `POST /scenarios/simulate`) the future frame for `predict(horizon, X)`. - -Known limitation: **backtesting is not wired for feature-aware models.** The -backtest fold loop calls `model.fit(y_train)` target-only; a feature-aware model -raises `ValueError` there — a loud, non-leaky failure, pinned by -`test_feature_aware_model_fails_loud_in_backtest`. Feature-aware backtesting is -deferred to PRP-MLZOO-B.2 — it remains pending after PRP-30 (LightGBM, MLZOO-B) -shipped the first advanced feature-aware model. +3. Consumes the historical frame for `fit(y, X)`, the future frame (via + `POST /scenarios/simulate`) for `predict(horizon, X)`, and — since + PRP-MLZOO-B.2 — the per-fold backtest frames for `POST /backtesting/run`. + +## Backtesting frame (per fold) + +Since **PRP-MLZOO-B.2** a feature-aware model is evaluated by the backtesting +fold loop. Each fold builds two matrices from the **same** canonical column set: + +- **`X_train`** — a positional slice of the full historical matrix + (`build_historical_feature_rows`), built once over the whole series. Leakage- + safe *as a training row*: every lag reads a strictly-earlier observed target. +- **`X_future`** — the test-window matrix, **rebuilt per fold** by + `build_future_feature_rows` (never sliced from the historical matrix — a + sliced historical test row would read an adjacent test-day observed target as + its `lag_1` cell, which is target leakage). Its `history_tail` ends at the + fold origin `T`, so a lag cell whose source day is in the test window is + `NaN`. With `gap > 0` the lag columns are built for `gap + horizon` days and + the first `gap` rows dropped. + +The test-window `price_factor` / `promo_active` come from the **recorded** +price/promotion for that window (the v1 `observed` exogenous policy) — that +reads no target `y`, so it is exogenous foresight, not target leakage; the +`ModelBacktestResult` records `exogenous_policy="observed"` so the metric is +read honestly. Both builders live in `app/shared/feature_frames/rows.py` and are +spec'd by the load-bearing `app/shared/feature_frames/tests/test_leakage.py`.