From 12f6cdfd2bbaef4667b07036da97ba8e18378660 Mon Sep 17 00:00:00 2001
From: Gabor Szabo <shellsnake@icloud.com>
Date: Tue, 19 May 2026 18:22:15 +0200
Subject: [PATCH] feat(backtest): wire feature-aware models into the
 backtesting fold loop (#244)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Implements PRP-MLZOO-B.2. Feature-aware models (regression, lightgbm — any
model with requires_features=True) are now evaluated by POST /backtesting/run
and backtest jobs, with per-fold leakage-safe X_train / X_future matrices.

- Promote the historical row builder to app/shared/feature_frames/rows.py
  (build_historical_feature_rows) and add build_future_feature_rows — the
  leakage-critical per-fold test-window assembler. ForecastingService.
  _assemble_regression_rows becomes a delegating shim; its leakage spec
  (test_regression_features_leakage.py) stays green, unedited.
- BacktestingService: resolve exogenous data once async (ExogenousFrame),
  branch the fold loop on requires_features, slice X_train from one historical
  matrix, rebuild X_future per fold. min_train_size >= 30 enforced loud.
- ModelBacktestResult gains additive feature_aware / exogenous_policy fields;
  _shape_backtest_result keys unchanged (frontend contract byte-stable).
- JobService._execute_backtest accepts regression + lightgbm.
- Repurpose the interim loud-fail test (PRP-29 #7 / PRP-30 #6 superseded);
  add the shared X_future leakage spec incl. a gap>0 fold.
---
 ...IAL-MLZOO-B.2-feature-aware-backtesting.md | 215 ++++++
 PRPs/INITIAL/INITIAL-MLZOO-index.md           |   6 +-
 ...PRP-MLZOO-B.2-feature-aware-backtesting.md | 707 ++++++++++++++++++
 README.md                                     |   6 +
 app/features/backtesting/schemas.py           |   8 +
 app/features/backtesting/service.py           | 331 +++++++-
 .../tests/test_feature_aware_backtest.py      | 221 ++++++
 .../tests/test_routes_integration.py          |  82 ++
 .../backtesting/tests/test_schemas.py         |  42 ++
 .../backtesting/tests/test_service.py         |  19 +-
 .../tests/test_service_integration.py         |  96 +++
 app/features/forecasting/service.py           |  39 +-
 app/features/jobs/service.py                  |   9 +
 app/features/jobs/tests/test_service.py       |  64 ++
 app/shared/feature_frames/__init__.py         |   6 +
 app/shared/feature_frames/rows.py             | 201 +++++
 .../feature_frames/tests/test_contract.py     |  10 +-
 .../feature_frames/tests/test_leakage.py      | 138 ++++
 docs/PHASE/5-BACKTESTING.md                   |  33 +
 examples/models/feature_frame_contract.md     |  35 +-
 20 files changed, 2213 insertions(+), 55 deletions(-)
 create mode 100644 PRPs/INITIAL/INITIAL-MLZOO-B.2-feature-aware-backtesting.md
 create mode 100644 PRPs/PRP-MLZOO-B.2-feature-aware-backtesting.md
 create mode 100644 app/features/backtesting/tests/test_feature_aware_backtest.py
 create mode 100644 app/shared/feature_frames/rows.py

diff --git a/PRPs/INITIAL/INITIAL-MLZOO-B.2-feature-aware-backtesting.md b/PRPs/INITIAL/INITIAL-MLZOO-B.2-feature-aware-backtesting.md
new file mode 100644
index 00000000..e160a0f7
--- /dev/null
+++ b/PRPs/INITIAL/INITIAL-MLZOO-B.2-feature-aware-backtesting.md
@@ -0,0 +1,215 @@
+# INITIAL-MLZOO-B.2-feature-aware-backtesting.md - Feature-Aware Backtesting Wiring
+
+## FEATURE:
+
+Wire **feature-aware forecasting models** (`RegressionForecaster`, `LightGBMForecaster` —
+every model with `requires_features=True`) into the backtesting fold loop so they can be
+evaluated by `POST /backtesting/run` and `train`/`backtest` jobs — **without** introducing
+target leakage and **without** train/serve skew.
+
+This is the explicit follow-up deferred by **PRP-30 DECISIONS LOCKED #6** and named there as
+`PRP-MLZOO-B.2`. It depends on MLZOO-A (PRP-29, merged `b116489`) and MLZOO-B (PRP-30, merged
+`2f1b8a5` / PR #243) being in place. It is the third MLZOO unit, after A (foundation) and
+B (first advanced model), and before C (XGBoost/Prophet) and D (frontend/registry).
+
+### Why current backtesting is incompatible with feature-aware models
+
+The backtesting slice was built for target-only baseline models and has three structural
+gaps that block feature-aware models:
+
+1. **No exogenous data is loaded.** `BacktestingService._load_series_data`
+   (`app/features/backtesting/service.py`) selects only `(date, quantity)` from `sales_daily`.
+   `SeriesData` carries only `dates` + `values`. A feature-aware model needs the canonical
+   14-column frame (`canonical_feature_columns()`): target lags, calendar columns, and the
+   exogenous columns `price_factor` / `promo_active` / `is_holiday` / `days_since_launch`.
+
+2. **The fold loop is target-only.** `_run_model_backtest` calls `model.fit(y_train)` and
+   `model.predict(horizon)` with **no `X`**. `RegressionForecaster` / `LightGBMForecaster`
+   (`requires_features=True`) raise `ValueError("… requires exogenous features X …")` at
+   `fit()`. That loud failure is the *current interim contract*, pinned by
+   `app/features/backtesting/tests/test_service.py::test_feature_aware_model_fails_loud_in_backtest`
+   (cites PRP-29 DECISIONS LOCKED #7). This PRP **supersedes** that interim contract.
+
+3. **`JobService._execute_backtest` hard-rejects** any `model_type` other than
+   `naive` / `seasonal_naive` / `moving_average` with `ValueError("Unsupported model_type…")`.
+
+A backtest of a feature-aware model is the only honest way to compare an advanced model
+against the naive/seasonal baselines — without it, PRP-30's LightGBM model can be trained
+and scenario-re-forecast but never *evaluated*.
+
+### Per-fold X_train and X_future construction
+
+The backtest must build, **per fold**, two leakage-safe feature matrices:
+
+- **`X_train`** — the historical feature matrix restricted to the fold's train rows. Every
+  column is observed-and-knowable: target lags read strictly-earlier observed targets,
+  calendar columns are pure functions of the date, exogenous columns read same-day observed
+  price/promotion/holiday/launch attributes. This is exactly the matrix
+  `ForecastingService._build_regression_features` / `_assemble_regression_rows` already
+  builds for training — it must be **reused**, not re-derived.
+
+- **`X_future`** — the feature matrix for the fold's test window (`test_size` days, after the
+  `gap`). The test window has **no observed target**, so its target-lag columns must be built
+  with the leakage-safe future-lag rule (`build_long_lag_columns`): a lag cell whose source
+  day lies in the test window is `NaN`, never the observed test value.
+
+### Leakage-safe fold contracts
+
+Each fold defines a forecast origin `T` = the fold's `train_end`. The invariant, mirroring
+`app/shared/feature_frames/tests/test_leakage.py` and
+`app/features/featuresets/tests/test_leakage.py`:
+
+> A feature value for a test-window day `D` may use ONLY information knowable at the fold
+> origin `T`: the observed history at or before `T`, or the calendar (a pure function of the
+> date). It may **NEVER** read an observed target at a test-window day.
+
+The `gap` between `train_end` and `test_start` simulates operational data latency — the
+fold's `history_tail` ends at `T` and does **not** include the gap days; lag columns whose
+source falls in the gap are therefore `NaN` too.
+
+The PRP must classify every canonical feature column with the **existing**
+`app/shared/feature_frames.FeatureSafety` taxonomy (`SAFE` / `CONDITIONALLY_SAFE` /
+`UNSAFE_UNLESS_SUPPLIED`) and decide, per class, how `X_future` is populated. It must also
+draw the line — explicitly — between **target leakage** (reading `y` at a horizon day:
+forbidden, must be structurally impossible) and **exogenous foresight** (assuming the future
+price/promotion calendar is known: a disclosed modelling choice, not target leakage). The
+result must record which exogenous policy it used so the metric is interpreted honestly.
+
+### Async / DB-backed loading requirements
+
+The fold loop (`_run_model_backtest`) is currently **synchronous and DB-free** — a property
+worth keeping (it is unit-testable without a database). Feature-aware folds need exogenous
+data (`unit_price`, `promotion` windows, `calendar` holidays, `product.launch_date`) that
+only the DB has. The PRP must resolve all of that **async, once, up front** in `run_backtest`
+(already `async`) into pure in-memory arrays, then keep the per-fold builders pure and sync —
+mirroring how `ForecastingService._build_regression_features` resolves async then delegates
+to the pure `_assemble_regression_rows`.
+
+### How feature-aware models should fail loudly until supported
+
+Even after this PRP, some paths stay unsupported and must fail **loud**, never silent:
+
+- A feature-aware model whose required feature frame cannot be sourced (e.g. an
+  `UNSAFE_UNLESS_SUPPLIED` column with no observed data-platform record and no supplied
+  assumptions) → raise `ValueError`, never a silent `NaN`/`0.0` fill.
+- An unclassifiable feature column (`FeatureSafety` / `feature_safety()` raises `KeyError`)
+  → propagate loudly.
+- The interim `test_feature_aware_model_fails_loud_in_backtest` is **repurposed**, not
+  deleted: feature-aware backtesting now *succeeds* on the supported path, so the test
+  becomes (a) a positive "regression model backtests and yields metrics" test plus (b) a new
+  loud-fail test for the genuinely-unsupported path.
+
+### Explicit out-of-scope items
+
+- **No new model families.** No XGBoost, no Prophet-like models — that is MLZOO-C. This PRP
+  wires the *existing* `regression` and `lightgbm` forecasters into backtesting and adds none.
+- **No frontend work.** No changes to `frontend/`, no backtest-page model selector, no new UI.
+  The `/visualize/backtest` job-result contract (`_shape_backtest_result`) stays byte-stable.
+- **No explainability work.** No driver attribution, no `ForecastExplanation`, no
+  `/explain/*` change — that is MLZOO-D.
+- **No scenario persistence changes.** No `scenario_plan` schema change, no `/scenarios/*`
+  contract change. The scenarios slice's `feature_frame.py` is read as a reference pattern
+  only; it is not modified.
+- **No hyperparameter search, no portfolio/global models, no recursive multi-step
+  forecasting.** `NaN`-as-unknown for future-sourced lags is kept (PRP-29 DECISIONS #6).
+- **No new Alembic migration** — this PRP adds no table or column to the database.
+
+## EXAMPLES:
+
+Read these before PRP creation:
+
+- `PRPs/PRP-29-feature-aware-forecasting-foundation.md`
+  - The merged foundation. DECISIONS LOCKED #2 (`requires_features` capability flag),
+    #6 (NaN-as-unknown), #7 (the interim backtest loud-fail this PRP supersedes).
+
+- `PRPs/PRP-30-lightgbm-first-advanced-model.md`
+  - The merged first advanced model. DECISIONS LOCKED #6 explicitly defers feature-aware
+    backtesting to *this* PRP and states why (`_run_model_backtest` is sync, DB-free,
+    target-only).
+
+- `app/features/backtesting/service.py`
+  - `_load_series_data` (target-only load), `SeriesData` (the container to extend),
+    `_run_model_backtest` (the sync fold loop to branch), `_run_baseline_comparisons`.
+
+- `app/features/backtesting/splitter.py`
+  - `TimeSeriesSplitter` — index-based, needs **no change**; each `TimeSeriesSplit` already
+    carries `train_indices` / `test_indices` / `train_dates` / `test_dates`.
+
+- `app/features/forecasting/service.py`
+  - `_build_regression_features` (async exogenous resolution pattern) and
+    `_assemble_regression_rows` (the pure, leakage-safe historical row builder to promote).
+
+- `app/shared/feature_frames/contract.py`
+  - `canonical_feature_columns()`, `build_long_lag_columns`, `build_calendar_columns`,
+    the `FeatureSafety` enum and `feature_safety()` classifier — the contract to reuse.
+
+- `app/features/scenarios/feature_frame.py`
+  - `assemble_future_frame` / `build_exogenous_columns` — the future-frame *pattern* for a
+    feature-aware model. Reference only; do not import (cross-slice import is forbidden).
+
+- `app/shared/feature_frames/tests/test_leakage.py` and
+  `app/features/forecasting/tests/test_regression_features_leakage.py`
+  - The load-bearing leakage-test patterns the new builders must follow.
+
+- `app/features/jobs/service.py`
+  - `_execute_backtest` (the `model_type` allow-list to widen) and `_shape_backtest_result`
+    (the frontend contract that must NOT drift).
+
+## DOCUMENTATION:
+
+- scikit-learn TimeSeriesSplit: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.TimeSeriesSplit.html
+- scikit-learn HistGradientBoostingRegressor (NaN-tolerant): https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.HistGradientBoostingRegressor.html
+- LightGBM (NaN handling / missing values): https://lightgbm.readthedocs.io/en/stable/Advanced-Topics.html
+- Forecasting cross-validation / leakage: https://otexts.com/fpp3/tscv.html
+- Pandas time series documentation: https://pandas.pydata.org/docs/user_guide/timeseries.html
+- Pydantic documentation: https://docs.pydantic.dev/latest/
+- `PRPs/ai_docs/exogenous-regressor-forecasting.md` — repo-local notes on exogenous-regressor
+  forecasting and the leakage-safe future-frame rule.
+
+## OTHER CONSIDERATIONS:
+
+The main architectural risk is the **train/serve-skew and leakage boundary** — get the
+`X_future` exogenous-column policy wrong and the backtest is either leaky (optimistic and
+worthless) or skewed (the model sees different feature distributions at train vs evaluate).
+
+Required decisions for the PRP:
+
+- **Where the per-fold builders live.** `backtesting` may not import `forecasting` or
+  `scenarios` (vertical-slice rule). The pure row builders must be promoted into
+  `app/shared/feature_frames` (the sanctioned cross-cutting home) so both slices consume one
+  definition. The promotion must be **additive** — `forecasting`'s `_assemble_regression_rows`
+  and its leakage test must keep working unchanged (a thin delegating shim is acceptable).
+
+- **The `X_future` exogenous-column policy.** `price_factor` / `promo_active` are
+  `UNSAFE_UNLESS_SUPPLIED`. In a *scenario* a planner supplies them; a *backtest* has no
+  assumptions. Decide v1 policy explicitly (recorded as a field on the result), and document
+  it as target-leakage-free but optimistic (it assumes the future promo/price calendar was
+  known at `T`).
+
+- **`gap` handling for future-lag columns.** With `gap > 0` the first test day is
+  `T + gap + 1`; `build_long_lag_columns` indexes test day `m` as `T + m`. The PRP must state
+  the offset correction (build `gap + test_size` rows, drop the first `gap`).
+
+- **The job-result contract.** `_shape_backtest_result` feeds `/visualize/backtest`. Any new
+  field must be additive; the existing keys must not move.
+
+Recommended defaults:
+
+- Reuse `canonical_feature_columns()` for both `X_train` and `X_future` — identical column
+  set and order on both sides is the train/serve-skew guard.
+- Keep the per-fold builders pure and sync; do all DB I/O once in `run_backtest`.
+- Branch on `model.requires_features` (a capability flag), never on a `model_type` string.
+- Keep `NaN`-as-unknown for future-sourced lag cells — the estimators tolerate it natively.
+- Prefer a loud `ValueError` on an unsupported path over any implicit guess or silent fill.
+
+Validation expectations:
+
+- Shared-builder leakage tests proving `X_future` lag cells are `NaN` where their source is a
+  test-window day (the new load-bearing spec).
+- Unit tests for the per-fold `X_train` / `X_future` builders (pure, no DB).
+- An integration test running a real DB-backed feature-aware (`regression`) backtest and
+  comparing it against the naive/seasonal baselines in the same response.
+- A route test: `POST /backtesting/run` with a `regression` model config → `200`.
+- A jobs test: a `backtest` job with `model_type="regression"` succeeds.
+- A repurposed loud-fail test for the genuinely-unsupported feature-aware path.
+- Regression proof: every existing baseline backtest test stays green with no behaviour change.
diff --git a/PRPs/INITIAL/INITIAL-MLZOO-index.md b/PRPs/INITIAL/INITIAL-MLZOO-index.md
index b7e9989c..b41faf77 100644
--- a/PRPs/INITIAL/INITIAL-MLZOO-index.md
+++ b/PRPs/INITIAL/INITIAL-MLZOO-index.md
@@ -17,6 +17,7 @@ Recommended PRP sequence:
 | --- | --- | --- | --- |
 | 1 | `INITIAL-MLZOO-A-foundation-feature-frames.md` | PRP-29 | Feature-aware forecasting foundation and leakage-safe frame contracts |
 | 2 | `INITIAL-MLZOO-B-lightgbm-first-model.md` | PRP-30 | First advanced model path with LightGBM (optional `ml-lightgbm` extra) |
+| 2.5 | `INITIAL-MLZOO-B.2-feature-aware-backtesting.md` | PRP-MLZOO-B.2 | Wire feature-aware models into the backtesting fold loop (per-fold leakage-safe `X_train` / `X_future`) |
 | 3 | `INITIAL-MLZOO-C-xgboost-prophet-extensions.md` | Future PRP | XGBoost and Prophet-like extensions |
 | 4 | `INITIAL-MLZOO-D-frontend-registry-explainability.md` | Future PRP | UI, registry surfacing, and explanation polish |
 
@@ -25,8 +26,9 @@ Dependency graph:
 ```text
 A. Foundation feature frames
   -> B. LightGBM first model
-      -> C. XGBoost / Prophet-like extensions
-      -> D. Frontend / registry / explainability
+      -> B.2 Feature-aware backtesting
+          -> C. XGBoost / Prophet-like extensions
+          -> D. Frontend / registry / explainability
 ```
 
 The full vision is documented in `docs/optional-features/05-advanced-ml-model-zoo.md`.
diff --git a/PRPs/PRP-MLZOO-B.2-feature-aware-backtesting.md b/PRPs/PRP-MLZOO-B.2-feature-aware-backtesting.md
new file mode 100644
index 00000000..105a4075
--- /dev/null
+++ b/PRPs/PRP-MLZOO-B.2-feature-aware-backtesting.md
@@ -0,0 +1,707 @@
+name: "PRP-MLZOO-B.2 — Feature-Aware Backtesting Wiring"
+description: |
+
+## Purpose
+
+The third unit of the **Advanced ML Model Zoo** sequence (`PRPs/INITIAL/INITIAL-MLZOO-index.md`),
+sitting between MLZOO-B (PRP-30) and MLZOO-C. It wires the **existing feature-aware
+forecasting models** — `RegressionForecaster` (PRP-27) and `LightGBMForecaster` (PRP-30),
+i.e. every model with `requires_features=True` — into the **backtesting fold loop** so they
+can be evaluated by `POST /backtesting/run` and `backtest` jobs.
+
+This is the explicit follow-up deferred by **PRP-30 DECISIONS LOCKED #6**, which named it
+`PRP-MLZOO-B.2` and stated the reason: `BacktestingService._run_model_backtest` is
+synchronous, DB-free, and target-only, so per-fold leakage-safe `X_train` / `X_future`
+construction is itself PRP-sized.
+
+This PRP wires **zero new models**. It adds **no** XGBoost/Prophet, **no** frontend, **no**
+explainability, **no** scenario-persistence change, **no** Alembic migration. If you find
+yourself adding a model family, touching `frontend/`, or editing a `scenario_plan` schema —
+stop; that is out of scope (see DECISIONS LOCKED #9).
+
+## What this PRP already inherits (DO NOT re-build)
+
+MLZOO-A (PRP-29, merged `b116489`) and MLZOO-B (PRP-30, merged `2f1b8a5` / PR #243) shipped
+everything a feature-aware *model* needs. Re-use it; do not re-derive it:
+
+- **The capability flag.** `BaseForecaster.requires_features: ClassVar[bool]`
+  (`app/features/forecasting/models.py:64`). `RegressionForecaster` and `LightGBMForecaster`
+  both set it `True`; all three baselines leave it `False`. **Branch on this flag — never on
+  a `model_type` string.**
+- **The shared feature-frame contract.** `app/shared/feature_frames/` owns the pinned
+  constants (`EXOGENOUS_LAGS`, `HISTORY_TAIL_DAYS`), `canonical_feature_columns()` (the
+  14-column set + order), the leakage-safe pure builders `build_long_lag_columns` /
+  `build_calendar_columns`, and the `FeatureSafety` taxonomy + `feature_safety()` classifier.
+  This PRP **extends** that package; it writes no new contract constants.
+- **The historical feature matrix.** `ForecastingService._assemble_regression_rows`
+  (`app/features/forecasting/service.py:114`) is the pure, leakage-safe historical row
+  builder, pinned by `app/features/forecasting/tests/test_regression_features_leakage.py`.
+  This PRP **promotes** it into `app/shared/feature_frames` so the backtesting slice can
+  consume it without a forbidden cross-slice import.
+- **The future-frame pattern.** `app/features/scenarios/feature_frame.py::assemble_future_frame`
+  is the reference for a feature-aware model's *future* matrix (calendar + leakage-safe lags +
+  exogenous). This PRP builds the backtesting equivalent in `app/shared/feature_frames`.
+  The scenarios module is **read as a pattern, never imported or modified.**
+- **The fold splitter.** `app/features/backtesting/splitter.py::TimeSeriesSplitter` is purely
+  index-based — each `TimeSeriesSplit` already carries `train_indices` / `test_indices` /
+  `train_dates` / `test_dates`. **It needs no change.**
+
+The **problem this PRP fixes**: `BacktestingService._load_series_data` loads only
+`(date, quantity)`; `_run_model_backtest` calls `model.fit(y_train)` target-only; a
+`requires_features=True` model raises `ValueError` at `fit()`. That loud failure is the
+*interim* contract pinned by
+`app/features/backtesting/tests/test_service.py::test_feature_aware_model_fails_loud_in_backtest`
+(PRP-29 DECISIONS LOCKED #7). `JobService._execute_backtest` hard-rejects every non-baseline
+`model_type`. The advanced LightGBM model from PRP-30 can be trained and scenario-re-forecast
+but **cannot be evaluated against the baselines** — backtesting is the only honest comparison.
+
+## DEPENDS ON — read before starting
+
+- `PRPs/INITIAL/INITIAL-MLZOO-B.2-feature-aware-backtesting.md` — this PRP's brief.
+- `PRPs/INITIAL/INITIAL-MLZOO-index.md` — the roadmap (A ✅ → B ✅ → **B.2 (this)** → C → D).
+- `PRPs/PRP-29-feature-aware-forecasting-foundation.md` — DECISIONS LOCKED #2
+  (`requires_features`), #6 (NaN-as-unknown), #7 (the interim backtest loud-fail superseded here).
+- `PRPs/PRP-30-lightgbm-first-advanced-model.md` — DECISIONS LOCKED #6 defers feature-aware
+  backtesting to this PRP and explains why.
+- `PRPs/ai_docs/exogenous-regressor-forecasting.md` — the leakage-safe future-frame rule.
+
+## Goal
+
+Make `POST /backtesting/run` (and `backtest` jobs) accept `regression` and `lightgbm` model
+configs and evaluate them across time-series CV folds with **per-fold, leakage-safe**
+`X_train` / `X_future` feature matrices — the advanced models compared head-to-head against
+the naive / seasonal baselines, with **no target leakage** and **no train/serve skew**.
+
+## Why
+
+- **Portfolio completeness.** A forecasting system whose advanced model cannot be backtested
+  has no defensible model-selection story. PRP-30 delivered the model; this delivers its
+  evaluation.
+- **Time-safety is the repo's load-bearing invariant** (`product-vision.md` §5). Wiring a
+  feature-consuming model into CV is exactly where leakage creeps in — doing it once, in a
+  shared and test-pinned way, protects every future MLZOO model (C, D).
+- **It unblocks the MLZOO sequence.** MLZOO-C (XGBoost/Prophet) and MLZOO-D
+  (frontend/registry) both assume feature-aware models are backtestable; this is their gate.
+
+## What
+
+### Technical requirements
+
+1. **Per-fold `X_train`** — the historical feature matrix sliced to the fold's train rows.
+   Built once over the full series via the promoted shared builder; sliced by
+   `split.train_indices`. Leakage-safe by position: every row reads only strictly-earlier
+   observed targets.
+2. **Per-fold `X_future`** — the test-window feature matrix, **rebuilt per fold** (never
+   sliced from the historical matrix). Target-lag columns use `build_long_lag_columns` with a
+   `history_tail` ending at the fold origin `T = train_end`; a lag cell whose source is a
+   test-window day is `NaN`.
+3. **Async exogenous loading up front.** `run_backtest` (already `async`) resolves
+   `unit_price`, `promotion` windows, `calendar` holidays, and `product.launch_date` once,
+   into pure in-memory arrays. `_run_model_backtest` stays **sync and DB-free**.
+4. **Capability branch.** `_run_model_backtest` branches on `model.requires_features`.
+   Target-only models keep the exact current code path; feature-aware models take the new
+   per-fold builder path.
+5. **`JobService._execute_backtest`** accepts `regression` and `lightgbm` (the latter still
+   gated by `forecast_enable_lightgbm` inside `model_factory`).
+6. **Loud failure** on every unsupported path — never a silent `NaN`/`0.0` fill.
+7. **No frontend contract drift.** `_shape_backtest_result` and the `/visualize/backtest`
+   job-result keys stay byte-stable; new schema fields are additive only.
+
+### Future feature classes — the `X_future` policy (the architectural core)
+
+Every canonical column is classified by the **existing** `FeatureSafety` enum
+(`app/shared/feature_frames/contract.py`). `X_future` populates each class as follows:
+
+| Class | Columns | `X_future` source | Leakage status |
+|-------|---------|-------------------|----------------|
+| `SAFE` | `dow_sin/cos`, `month_sin/cos`, `is_weekend`, `is_month_end`, `is_holiday`, `days_since_launch` | Pure function of the test-window date / timeless attribute (calendar table, product launch date) | None — never reads a target |
+| `CONDITIONALLY_SAFE` | `lag_1`, `lag_7`, `lag_14`, `lag_28` | `build_long_lag_columns(history_tail, …)` — `history_tail` ends at `T`; a cell whose source day is in the test window is `NaN` | None — `NaN`-where-future is structurally enforced |
+| `UNSAFE_UNLESS_SUPPLIED` | `price_factor`, `promo_active` | v1 policy `observed`: recorded `sales_daily.unit_price` / `promotion` rows for the test window | **No target leakage**; *is* exogenous foresight (see below) |
+
+**Target leakage vs. exogenous foresight — the line this PRP draws explicitly.** The repo's
+load-bearing leakage rule is: *never read an observed target at a horizon day*. Target-lag
+columns obey it structurally (`build_long_lag_columns`). The `UNSAFE_UNLESS_SUPPLIED`
+exogenous columns (`price_factor`, `promo_active`) read **recorded price/promotion** for the
+test window — never the target `y`. That is **not** target leakage. It **is** *exogenous
+foresight*: the backtest assumes the future price/promotion calendar was known at `T`. For
+retail demand that is realistic (promo calendars are planned ahead) and it keeps `X_train`
+and `X_future` distributionally identical (no train/serve skew — both read same-day observed
+exogenous). It is, however, optimistic, so the result **records `exogenous_policy="observed"`**
+and the metric must be interpreted as "accuracy given a known promo/price plan". A future
+PRP may add an `assumptions` policy; v1 ships exactly one.
+
+### Success Criteria
+
+- [ ] `POST /backtesting/run` with a `regression` model config returns `200` with per-fold
+      metrics and a baseline comparison.
+- [ ] A `backtest` job with `model_type="regression"` completes `success`.
+- [ ] A feature-aware backtest's `X_future` lag cells are `NaN` exactly where their source
+      day falls in the test window — pinned by a new shared leakage test.
+- [ ] `X_train` and `X_future` use the identical `canonical_feature_columns()` set and order.
+- [ ] Every existing baseline backtest test passes with **zero** edits.
+- [ ] `_shape_backtest_result` output keys are byte-stable (frontend contract intact).
+- [ ] An unsupported feature-aware path raises a loud `ValueError`, never silently degrades.
+- [ ] `ruff` + `mypy --strict` + `pyright --strict` clean; unit + integration suites green.
+
+## All Needed Context
+
+### Documentation & References
+
+```yaml
+- file: app/features/backtesting/service.py
+  why: _load_series_data (extend), SeriesData (extend), _run_model_backtest (branch),
+       _run_baseline_comparisons, _validate_config (add min-train guard).
+
+- file: app/features/backtesting/splitter.py
+  why: TimeSeriesSplit carries train/test indices + dates. NO CHANGE — index-based.
+
+- file: app/features/forecasting/service.py
+  why: _build_regression_features = the async exogenous-resolution pattern to mirror;
+       _assemble_regression_rows = the pure historical builder to promote to shared.
+
+- file: app/shared/feature_frames/contract.py
+  why: canonical_feature_columns(), build_long_lag_columns, build_calendar_columns,
+       FeatureSafety + feature_safety(). The contract this PRP extends.
+
+- file: app/features/scenarios/feature_frame.py
+  why: assemble_future_frame = the future-matrix PATTERN. Reference only — do NOT import
+       (backtesting -> scenarios is a forbidden cross-slice import).
+
+- file: app/features/jobs/service.py
+  why: _execute_backtest (widen the model_type allow-list); _shape_backtest_result
+       (the frontend contract — additive changes only, keys must not move).
+
+- file: app/shared/feature_frames/tests/test_leakage.py
+  why: the load-bearing leakage-test pattern the new builder must follow.
+
+- file: app/features/forecasting/tests/test_regression_features_leakage.py
+  why: pins _assemble_regression_rows — must stay GREEN after the promotion (shim).
+
+- url: https://otexts.com/fpp3/tscv.html
+  why: time-series cross-validation — the standard this fold loop implements.
+```
+
+### Current Codebase tree (relevant — all already exist)
+
+```
+app/
+  shared/feature_frames/
+    __init__.py            # re-exports the contract surface
+    contract.py            # constants, builders, FeatureSafety
+    tests/test_leakage.py  # load-bearing leakage spec
+    tests/test_contract.py # AST walk — no app/features import
+  features/
+    backtesting/
+      service.py           # SeriesData, run_backtest, _run_model_backtest (target-only)
+      splitter.py          # TimeSeriesSplitter — index-based, unchanged
+      schemas.py           # BacktestConfig, ModelBacktestResult, BacktestResponse
+      tests/test_service.py, test_service_integration.py, test_routes_integration.py
+    forecasting/
+      service.py           # _assemble_regression_rows (to promote), _build_regression_features
+    jobs/
+      service.py           # _execute_backtest (allow-list), _shape_backtest_result
+```
+
+### Desired Codebase tree — files to ADD
+
+```
+app/shared/feature_frames/
+  rows.py                                    # build_historical_feature_rows (promoted),
+                                             # build_future_feature_rows (NEW, leakage-safe)
+app/features/backtesting/tests/
+  test_feature_aware_backtest.py             # unit tests for the fold builders + loud-fail
+PRPs/
+  PRP-MLZOO-B.2-feature-aware-backtesting.md # this file
+```
+
+### Files to MODIFY (all additive or behaviour-preserving)
+
+```
+app/shared/feature_frames/__init__.py        # export the two row builders
+app/shared/feature_frames/tests/test_leakage.py   # ADD build_future_feature_rows leakage spec
+app/shared/feature_frames/tests/test_contract.py  # ADD: rows.py imports nothing from app/features
+app/features/forecasting/service.py          # _assemble_regression_rows -> delegating shim
+app/features/backtesting/service.py          # ExogenousFrame, exogenous load, fold branch
+app/features/backtesting/schemas.py          # ADD feature_aware + exogenous_policy (additive)
+app/features/jobs/service.py                 # _execute_backtest: accept regression + lightgbm
+app/features/backtesting/tests/test_service.py            # repurpose the interim loud-fail test
+app/features/backtesting/tests/test_service_integration.py # feature-aware DB-backed backtest
+app/features/backtesting/tests/test_routes_integration.py  # POST /backtesting/run regression
+app/features/backtesting/tests/test_schemas.py            # the new additive fields
+app/features/jobs/tests/test_service.py      # backtest job with model_type=regression
+examples/models/feature_frame_contract.md    # document the backtest future-frame
+docs/PHASE/5-BACKTESTING.md                  # feature-aware backtesting section
+README.md                                    # backtest model list: add regression/lightgbm
+PRPs/INITIAL/INITIAL-MLZOO-index.md           # note B.2 -> this PRP
+```
+
+### DECISIONS LOCKED (resolved during planning — do NOT re-litigate)
+
+1. **The per-fold row builders live in `app/shared/feature_frames/rows.py`.** `backtesting`
+   may not import `forecasting` or `scenarios` (vertical-slice rule). `app/shared/` is the
+   sanctioned cross-cutting home and already owns the column builders. `rows.py` (pure,
+   stdlib-only, `app/features` import forbidden — same as `contract.py`) holds the two
+   row-matrix assemblers. `contract.py` stays the column-builder + taxonomy home.
+
+2. **`_assemble_regression_rows` is PROMOTED, not duplicated, and the promotion is additive.**
+   Its body moves verbatim to `build_historical_feature_rows` in `rows.py`.
+   `ForecastingService._assemble_regression_rows` becomes a one-line delegating shim
+   (`return build_historical_feature_rows(...)`). `test_regression_features_leakage.py`
+   imports the shim by its old name and stays **GREEN with zero edits** — the existing
+   leakage test is not weakened, not moved, not touched.
+
+3. **`X_train` is sliced from one full-series historical matrix; `X_future` is NEVER sliced.**
+   The historical matrix is built once over `dates[0:N]` (each row reads only strictly-earlier
+   targets → leakage-safe as a *training* row). Per fold, `X_train = matrix[train_indices]`.
+   `X_future` MUST be rebuilt per fold via `build_future_feature_rows` — slicing the
+   historical matrix for test rows would let `lag_1` read an adjacent test-day observed target
+   (**target leakage**). This asymmetry is the crux of the PRP (see GOTCHA below).
+
+4. **The v1 `X_future` exogenous policy is `observed`, recorded on the result.** `price_factor`
+   / `promo_active` for the test window come from recorded `sales_daily.unit_price` /
+   `promotion` rows. This is exogenous foresight, not target leakage (it never reads `y`), and
+   it keeps `X_train`/`X_future` skew-free (both read same-day observed exogenous).
+   `ModelBacktestResult.exogenous_policy` records `"observed"` so the metric is interpreted
+   honestly. A `Literal["observed"]` (one value in v1) is the documented extension seam — a
+   future PRP may add `"assumptions"` without a breaking change.
+
+5. **Branch on `model.requires_features`, never on a `model_type` string.** `run_backtest`
+   builds a cheap probe model from `config.model_config_main` to read the flag *before* the
+   fold loop, deciding whether to load exogenous data. Mirrors
+   `ForecastingService.train_model`, which already branches on exactly this flag.
+   The probe is a no-fit `model_factory(...)` construction (cheap) — each of the three
+   sites that need the flag (`_validate_config`, `run_backtest`, `_run_model_backtest`)
+   builds its own probe locally; there is no need to thread one instance through, and
+   `BacktestingService` keeps no probe/matrix instance state.
+
+6. **The fold loop stays sync and DB-free.** All DB I/O happens once in `run_backtest`
+   (`async`), resolved into a pure `ExogenousFrame`. `_run_model_backtest` and the row
+   builders remain unit-testable without a database — the existing architecture is preserved.
+
+7. **`min_train_size >= 30` is enforced for feature-aware backtests.** `_validate_config`
+   raises `ValueError` when the main model `requires_features` and
+   `split_config.min_train_size < 30` (`_MIN_REGRESSION_TRAIN_ROWS`) — each fold's train
+   window must resolve the lag features. Loud, not silent.
+
+8. **The interim loud-fail test is REPURPOSED, not deleted.**
+   `test_feature_aware_model_fails_loud_in_backtest` asserted a `regression` backtest raises
+   `ValueError`. After this PRP it succeeds. The test is rewritten as (a) a positive
+   "feature-aware backtest runs and yields metrics" assertion and (b) a new loud-fail
+   assertion for the genuinely-unsupported path (a `requires_features` model with no
+   `ExogenousFrame` loaded → `ValueError`). PRP-29 DECISIONS LOCKED #7 and PRP-30 DECISIONS
+   LOCKED #6 are **superseded** — note this in the PRP commit body and the test docstring.
+
+9. **OUT OF SCOPE — do not touch.** No new model family (XGBoost/Prophet = MLZOO-C). No
+   `frontend/` change — `_shape_backtest_result` keys stay byte-stable. No explainability
+   (MLZOO-D). No `scenario_plan` / `/scenarios/*` change. No Alembic migration (this PRP adds
+   no table/column). No recursive multi-step forecasting — `NaN`-as-unknown is kept.
+
+### Known Gotchas of our codebase & Library Quirks
+
+```python
+# CRITICAL: X_future is NEVER a slice of the historical matrix. The historical row for a
+#   test day reads quantities[i-1] (lag_1) — for a test day that source is an adjacent
+#   observed TEST-DAY target == target leakage. X_future MUST be rebuilt per fold with
+#   build_long_lag_columns(history_tail_ending_at_T, ...) so future-sourced lag cells are NaN.
+
+# CRITICAL: gap offset. With gap > 0 the first test day is T + gap + 1, but
+#   build_long_lag_columns indexes test day m as T + m. Call it with
+#   horizon = gap + test_size and DROP the first `gap` rows. With gap=0 (the common case)
+#   this is a no-op. test_feature_aware_backtest.py MUST cover a gap>0 fold.
+
+# CRITICAL: history_tail ends at T = train_end, and EXCLUDES the gap days. The gap simulates
+#   operational data latency — data for gap days is "not yet available" at forecast time.
+#   history_tail = series.values[:train_end_idx][-HISTORY_TAIL_DAYS:] where
+#   train_end_idx = split.train_indices[-1] + 1.
+
+# GOTCHA: leaf-level import rule. app/shared/feature_frames/rows.py may NEVER import from
+#   app/features/** (test_contract.py enforces it with an AST walk). Keep rows.py pure —
+#   stdlib math/datetime only, same as contract.py.
+
+# GOTCHA: SeriesData.__post_init__ computes n_observations from `values`. Adding an optional
+#   `exogenous: ExogenousFrame | None = None` field is fine — keep it last, keep the default.
+
+# GOTCHA: ModelBacktestResult is frozen-free but consumed by _shape_backtest_result. New
+#   fields (feature_aware: bool = False, exogenous_policy: str | None = None) MUST have
+#   defaults so every existing construction site and test stays valid.
+
+# GOTCHA: lightgbm in a backtest job. _execute_backtest building a LightGBMModelConfig is
+#   fine — model_factory still raises ValueError if forecast_enable_lightgbm is False. That
+#   surfaces as a failed job (loud), which is correct. Do not pre-check the flag in jobs.
+
+# GOTCHA: baseline comparison. _run_baseline_comparisons runs naive + seasonal_naive — both
+#   target-only (requires_features=False). They take the UNCHANGED target-only fold path
+#   even when the main model is feature-aware. Do not feed them X.
+
+# GOTCHA: line endings — repo has mixed CRLF/LF, no .gitattributes. Run `git diff --stat`
+#   before committing; re-normalise any whole-file diff to the file's original ending.
+```
+
+## Implementation Blueprint
+
+### Data models and structure
+
+```python
+# app/shared/feature_frames/rows.py  (NEW — pure, stdlib only)
+
+def build_historical_feature_rows(
+    *, dates: list[date], quantities: list[float], prices: list[float],
+    baseline_price: float, promo_dates: set[date], holiday_dates: set[date],
+    launch_date: date | None,
+) -> list[list[float]]:
+    """Promoted verbatim from ForecastingService._assemble_regression_rows.
+    Row i: target lags read quantities[i-lag] (strictly earlier), calendar columns
+    are pure, exogenous columns read same-day observed attributes. canonical order."""
+
+def build_future_feature_rows(
+    *, test_dates: list[date], history_tail: list[float], gap: int,
+    test_prices: list[float], baseline_price: float, test_promo_dates: set[date],
+    test_holiday_dates: set[date], launch_date: date | None,
+) -> list[list[float]]:
+    """Leakage-safe test-window matrix. lag_* columns come from
+    build_long_lag_columns(history_tail, gap + len(test_dates))[gap:] — NaN where the
+    source day is in the test window. Calendar columns pure. price_factor / promo_active
+    from the OBSERVED test-window records (policy='observed'). canonical order.
+    Raises ValueError if asked to emit a column it cannot classify/source."""
+```
+
+```python
+# app/features/backtesting/service.py  (MODIFY)
+
+@dataclass
+class ExogenousFrame:
+    """Pre-loaded exogenous data for one series — resolved async in run_backtest,
+    consumed by the pure/sync fold loop."""
+    prices: list[float]              # aligned with SeriesData.dates
+    baseline_price: float            # median positive price (>0 fallback 1.0)
+    promo_dates: set[date]
+    holiday_dates: set[date]
+    launch_date: date | None
+
+@dataclass
+class SeriesData:
+    dates: list[date]
+    values: np.ndarray
+    store_id: int
+    product_id: int
+    exogenous: ExogenousFrame | None = None   # NEW — present only for feature-aware runs
+    n_observations: int = field(init=False)
+```
+
+### list of tasks (dependency-ordered)
+
+```yaml
+# ════════ STEP 1 — Shared row builders (pure, no behaviour change) ════════
+Task 1: CREATE app/shared/feature_frames/rows.py
+  - build_historical_feature_rows: body lifted verbatim from _assemble_regression_rows.
+  - build_future_feature_rows: NEW — see Per-task pseudocode.
+  - Pure: import only math/datetime + the contract builders. No app/features import.
+
+Task 2: MODIFY app/shared/feature_frames/__init__.py
+  - Export build_historical_feature_rows, build_future_feature_rows.
+
+Task 3: MODIFY app/features/forecasting/service.py
+  - _assemble_regression_rows becomes a one-line shim delegating to
+    build_historical_feature_rows. Keep the signature and name byte-identical so
+    test_regression_features_leakage.py imports stay valid.
+
+# ════════ STEP 2 — Shared leakage spec ════════
+Task 4: MODIFY app/shared/feature_frames/tests/test_leakage.py
+  - ADD: build_future_feature_rows lag cells are NaN exactly where source day is in the
+    test window; an observed test-day target never appears as a lag value out of place;
+    gap>0 case; the historical-vs-future asymmetry.
+Task 5: MODIFY app/shared/feature_frames/tests/test_contract.py
+  - ADD rows.py to the AST walk asserting no app/features import.
+
+# ════════ STEP 3 — Backtesting schemas (additive) ════════
+Task 6: MODIFY app/features/backtesting/schemas.py
+  - ModelBacktestResult: ADD feature_aware: bool = False,
+    exogenous_policy: Literal["observed"] | None = None  (defaults preserve all callers).
+
+# ════════ STEP 4 — Backtesting service wiring ════════
+Task 7: MODIFY app/features/backtesting/service.py
+  - ADD ExogenousFrame; ADD optional SeriesData.exogenous.
+  - _validate_config: if main model requires_features and min_train_size < 30 -> ValueError.
+  - run_backtest: build a probe model from config.model_config_main; if requires_features,
+    call new async _load_exogenous_frame and attach to series_data.exogenous.
+  - _run_model_backtest: signature is UNCHANGED (still series_data, splitter,
+    model_config, store_fold_details). It builds a probe model, branches on
+    probe.requires_features, reads gap from splitter.config.gap, and builds the full
+    historical matrix as a LOCAL once before the fold loop.
+      target-only  -> existing code path, untouched.
+      feature-aware -> _run_feature_aware_fold (new helper, all args explicit): per fold
+                       slice X_train from the local historical matrix, build X_future,
+                       fit(y,X_train), predict(test_size, X_future). Set feature_aware +
+                       exogenous_policy on the ModelBacktestResult.
+  - ADD _load_exogenous_frame (async): unit_price per date, promotion windows, calendar
+    holidays, product.launch_date — mirrors _build_regression_features.
+
+# ════════ STEP 5 — Jobs integration ════════
+Task 8: MODIFY app/features/jobs/service.py
+  - _execute_backtest: add regression + lightgbm branches building RegressionModelConfig /
+    LightGBMModelConfig. _shape_backtest_result UNCHANGED (frontend contract byte-stable).
+
+# ════════ STEP 6 — Tests ════════
+Task 9:  CREATE app/features/backtesting/tests/test_feature_aware_backtest.py
+  - Pure unit tests: per-fold X_train/X_future shape + column order; gap>0 fold;
+    feature-aware model with exogenous=None -> loud ValueError.
+Task 10: MODIFY app/features/backtesting/tests/test_service.py
+  - Repurpose test_feature_aware_model_fails_loud_in_backtest (DECISIONS LOCKED #8).
+Task 11: MODIFY app/features/backtesting/tests/test_service_integration.py
+  - DB-backed regression backtest vs naive/seasonal baselines in one response.
+Task 12: MODIFY app/features/backtesting/tests/test_routes_integration.py
+  - POST /backtesting/run with a regression model config -> 200 + per-fold metrics.
+Task 13: MODIFY app/features/backtesting/tests/test_schemas.py
+  - ModelBacktestResult new fields: defaults + explicit values.
+Task 14: MODIFY app/features/jobs/tests/test_service.py
+  - backtest job with model_type="regression" -> success + shaped result.
+
+# ════════ STEP 7 — Docs ════════
+Task 15: MODIFY examples/models/feature_frame_contract.md, docs/PHASE/5-BACKTESTING.md,
+         README.md (backtest model list), PRPs/INITIAL/INITIAL-MLZOO-index.md (B.2 row).
+```
+
+### Per-task pseudocode (critical details only)
+
+```python
+# ── Task 1 — build_future_feature_rows (the leakage-critical builder) ──
+def build_future_feature_rows(*, test_dates, history_tail, gap, test_prices,
+                              baseline_price, test_promo_dates, test_holiday_dates,
+                              launch_date):
+    horizon = len(test_dates)
+    columns = canonical_feature_columns()
+    # lags: build for gap+horizon days, drop the gap lead-in. NaN where source > T.
+    lag_cols = build_long_lag_columns(history_tail, gap + horizon)   # {"lag_k": [...]}
+    lag_cols = {k: v[gap:] for k, v in lag_cols.items()}
+    cal_cols = build_calendar_columns(test_dates)                    # SAFE — pure
+    rows: list[list[float]] = []
+    for j, day in enumerate(test_dates):
+        row: list[float] = []
+        for col in columns:
+            safety = feature_safety(col)              # raises KeyError on unknown -> loud
+            if col.startswith("lag_"):                # CONDITIONALLY_SAFE
+                row.append(lag_cols[col][j])
+            elif col in cal_cols:                     # SAFE
+                row.append(cal_cols[col][j])
+            elif col == "price_factor":               # UNSAFE_UNLESS_SUPPLIED -> observed
+                row.append(test_prices[j] / baseline_price)
+            elif col == "promo_active":               # UNSAFE_UNLESS_SUPPLIED -> observed
+                row.append(1.0 if day in test_promo_dates else 0.0)
+            elif col == "is_holiday":                 # SAFE — calendar timeless attribute
+                row.append(1.0 if day in test_holiday_dates else 0.0)
+            elif col == "days_since_launch":          # SAFE — pure fn of date
+                row.append(float((day - launch_date).days) if launch_date else math.nan)
+            else:
+                raise ValueError(f"build_future_feature_rows: unsourced column {col!r}")
+        rows.append(row)
+    return rows
+
+# ── Task 7 — _run_model_backtest branch + _run_feature_aware_fold (pure, sync) ──
+# _run_model_backtest gains NO new parameters. `gap` is read from the splitter it already
+# receives (splitter.config.gap — SplitConfig is reachable as TimeSeriesSplitter.config).
+# The full historical matrix is a LOCAL built once before the fold loop — there is no
+# self._historical_matrix and no self.config on BacktestingService (__init__ sets only
+# self.settings + self.metrics_calculator). _run_feature_aware_fold takes everything it
+# needs as explicit arguments — no phantom instance state.
+
+def _run_model_backtest(self, series_data, splitter, model_config, store_fold_details):
+    probe = model_factory(model_config, random_state=self.settings.forecast_random_seed)
+    feature_aware = probe.requires_features            # capability flag, never a string
+    historical_matrix: np.ndarray | None = None
+    if feature_aware:
+        if series_data.exogenous is None:
+            raise ValueError("feature-aware backtest requires a loaded ExogenousFrame")
+        exo = series_data.exogenous
+        historical_matrix = np.array(build_historical_feature_rows(
+            dates=series_data.dates, quantities=series_data.values.tolist(),
+            prices=exo.prices, baseline_price=exo.baseline_price,
+            promo_dates=exo.promo_dates, holiday_dates=exo.holiday_dates,
+            launch_date=exo.launch_date), dtype=np.float64)
+    for split in splitter.split(series_data.dates, series_data.values):
+        if feature_aware:
+            predictions = self._run_feature_aware_fold(
+                series_data, split, model_config, historical_matrix, splitter.config.gap)
+        else:
+            ...  # existing target-only path — UNCHANGED
+        ...  # metrics / FoldResult assembly is shared, unchanged
+    # set feature_aware=feature_aware and exogenous_policy on the returned ModelBacktestResult.
+
+def _run_feature_aware_fold(self, series_data, split, model_config,
+                            historical_matrix, gap):
+    exo = series_data.exogenous                        # caller already guaranteed non-None
+    # X_train — slice the full historical matrix (built once, leakage-safe by position)
+    X_train = historical_matrix[split.train_indices]
+    y_train = series_data.values[split.train_indices]
+    # X_future — rebuilt per fold; history_tail ends at T = train_end, excludes gap
+    train_end_idx = int(split.train_indices[-1]) + 1
+    history_tail = series_data.values[:train_end_idx][-HISTORY_TAIL_DAYS:].tolist()
+    test_idx = split.test_indices
+    X_future = np.array(build_future_feature_rows(
+        test_dates=split.test_dates, history_tail=history_tail, gap=gap,
+        test_prices=[exo.prices[i] for i in test_idx], baseline_price=exo.baseline_price,
+        test_promo_dates={series_data.dates[i] for i in test_idx if series_data.dates[i] in exo.promo_dates},
+        test_holiday_dates={d for d in split.test_dates if d in exo.holiday_dates},
+        launch_date=exo.launch_date), dtype=np.float64)
+    model = model_factory(model_config, random_state=self.settings.forecast_random_seed)
+    model.fit(y_train, X_train)
+    return model.predict(len(test_idx), X_future)
+
+# ── Task 8 — _execute_backtest allow-list ──
+#   elif model_type == "regression":  model_config = RegressionModelConfig()
+#   elif model_type == "lightgbm":    model_config = LightGBMModelConfig()
+#   else: raise ValueError(f"Unsupported model_type: {model_type}")   # e.g. "arima"
+```
+
+### Integration Points
+
+```yaml
+BACKTESTING SERVICE:
+  - run_backtest stays the only async entry; _run_model_backtest stays sync.
+  - the full historical matrix is built once per _run_model_backtest call (feature-aware
+    path) as a LOCAL variable, sliced per fold — never per-fold rebuilt for X_train, and
+    never stored on the service instance.
+
+JOBS:
+  - _execute_backtest gains regression + lightgbm; _shape_backtest_result is NOT touched.
+  - a backtest job for a disabled lightgbm fails loud via model_factory — expected.
+
+SHARED CONTRACT:
+  - rows.py joins contract.py under app/shared/feature_frames; __init__.py re-exports both.
+  - forecasting + backtesting both consume one definition — no column-order drift possible.
+
+NO CHANGE:
+  - splitter.py, scenarios/**, frontend/**, alembic/**, registry/** — untouched.
+```
+
+## Phased Execution Plan
+
+This is one coherent architectural change and **fits one reviewable PR** (~1 branch,
+`feat/backtesting-feature-aware-folds`, off `dev`, tracked by **GitHub issue #244** —
+every commit references `(#244)`). If the reviewer
+prefers a smaller diff, split along the natural seam — Phase 1 is independently mergeable
+because it is pure and behaviour-preserving:
+
+- **Phase 1 — Shared builders + leakage spec (Tasks 1–5).** Promote
+  `build_historical_feature_rows`, add `build_future_feature_rows`, wire the delegating shim,
+  add the shared leakage tests. Zero behaviour change — every existing test stays green. A
+  self-contained PR that lands the contract without touching backtesting.
+- **Phase 2 — Backtesting + jobs wiring (Tasks 6–15).** Consume the builders: schema fields,
+  async exogenous load, the fold-loop branch, jobs allow-list, integration tests, docs.
+
+Recommended: ship as **one PR** unless the diff is judged too large at review time; the
+phase boundary is the fallback, not the default.
+
+## Validation Loop
+
+### Level 1: Syntax & Style
+
+```bash
+uv run ruff check . && uv run ruff format --check .
+```
+
+### Level 2: Type Checks
+
+```bash
+uv run mypy app/ && uv run pyright app/
+# Watch: rows.py must type cleanly with no app/features import; the new optional
+# SeriesData.exogenous and the additive ModelBacktestResult fields must not break callers.
+```
+
+### Level 3: Unit Tests
+
+```bash
+uv run pytest -v -m "not integration" \
+  app/shared/feature_frames/tests/ \
+  app/features/backtesting/tests/test_service.py \
+  app/features/backtesting/tests/test_feature_aware_backtest.py \
+  app/features/backtesting/tests/test_schemas.py \
+  app/features/forecasting/tests/test_regression_features_leakage.py \
+  app/features/jobs/tests/test_service.py
+# test_regression_features_leakage.py MUST pass with ZERO edits (the shim preserves it).
+uv run pytest -v -m "not integration"          # whole fast suite — all green
+```
+
+### Level 4: Integration Tests
+
+```bash
+docker compose up -d
+uv run pytest -v -m integration \
+  app/features/backtesting/tests/test_service_integration.py \
+  app/features/backtesting/tests/test_routes_integration.py
+# A regression backtest must return per-fold metrics + a baseline comparison;
+# the response carries exogenous_policy="observed" on the main-model result.
+```
+
+### Level 5: Manual Validation (dogfood — REQUIRED)
+
+```bash
+# 1. A regression backtest runs end-to-end (needs seeded data).
+curl -sX POST localhost:8123/backtesting/run -H 'Content-Type: application/json' \
+  -d '{"store_id":1,"product_id":1,"start_date":"2024-01-01","end_date":"2024-12-01",
+       "config":{"model_config_main":{"model_type":"regression"},
+                 "split_config":{"n_splits":3,"min_train_size":60,"horizon":14}}}'
+#   -> 200; main_model_results.feature_aware == true; exogenous_policy == "observed";
+#      baseline_results has naive + seasonal_naive; leakage_check_passed == true.
+
+# 2. min_train_size guard fires loud.
+#   ... same call with "min_train_size":20  -> 400 RFC-7807, "at least 30".
+
+# 3. A backtest job with model_type=regression completes success.
+curl -sX POST localhost:8123/jobs -H 'Content-Type: application/json' \
+  -d '{"job_type":"backtest","params":{"model_type":"regression","store_id":1,
+       "product_id":1,"start_date":"2024-01-01","end_date":"2024-12-01","n_splits":3}}'
+#   -> poll GET /jobs/{id} -> status "success", result has per-fold metrics.
+
+# 4. Baselines unaffected — a naive backtest still works exactly as before.
+```
+
+## Final Validation Checklist
+
+- [ ] `ruff` + `mypy --strict` + `pyright --strict` clean.
+- [ ] Whole fast unit suite green; `test_regression_features_leakage.py` unedited & green.
+- [ ] New shared leakage spec proves `X_future` lag cells are `NaN`-where-future (incl. gap>0).
+- [ ] Integration: a `regression` backtest returns per-fold metrics + baseline comparison.
+- [ ] `POST /backtesting/run` with `regression` → `200`; with `min_train_size<30` → `400`.
+- [ ] A `backtest` job with `model_type="regression"` → `success`.
+- [ ] `_shape_backtest_result` output keys byte-identical to pre-PRP (frontend contract).
+- [ ] Every baseline backtest test green with zero edits.
+- [ ] The interim loud-fail test is repurposed (not deleted); supersession noted.
+- [ ] No `frontend/`, `scenarios/`, `alembic/` change; no new migration.
+- [ ] `git diff --stat` shows no whole-file CRLF/LF noise.
+
+## Anti-Patterns to Avoid
+
+- ❌ Slicing the historical matrix for `X_future` — that leaks adjacent test-day targets.
+- ❌ Filling an unknowable future column with `0.0`/`NaN` silently — raise `ValueError`.
+- ❌ Branching the fold loop on a `model_type` string — branch on `requires_features`.
+- ❌ Importing `forecasting`/`scenarios` from `backtesting` — promote to `app/shared/`.
+- ❌ Doing DB I/O inside `_run_model_backtest` — keep it sync; load async up front.
+- ❌ Re-deriving `build_long_lag_columns` / `canonical_feature_columns()` — reuse the contract.
+- ❌ Weakening or deleting `test_feature_aware_model_fails_loud_in_backtest` — repurpose it.
+- ❌ Editing `_shape_backtest_result` keys or any `frontend/` file — out of scope.
+- ❌ Adding XGBoost/Prophet, hyperparameter search, or a migration — all out of scope.
+
+## Open Questions
+
+1. **`exogenous_policy` v1 = `observed` only.** This PRP ships exactly one policy (recorded
+   `price`/`promo` for the test window — exogenous foresight, target-leakage-free). A stricter
+   `origin_carry_forward` policy (carry the last observed price/promo state from `T`, zero
+   foresight) and an `assumptions` policy (planner-supplied, mirroring the scenarios slice)
+   are deliberately deferred. **Resolve at PRP review:** is one policy acceptable for v1, or
+   should `origin_carry_forward` ship alongside as the conservative default? The `Literal`
+   field is the seam either way.
+2. **Feature-aware baseline comparison.** v1 compares a feature-aware main model only against
+   the *target-only* naive/seasonal baselines. Whether `regression` should also auto-run as a
+   baseline for a `lightgbm` main model (advanced-vs-advanced) is left to MLZOO-D — flag if
+   the reviewer wants it sooner.
+3. **Per-series caching of the historical matrix.** Built once per `run_backtest` call; not
+   cached across calls. Fine for single-series backtests; revisit only if a portfolio/batch
+   backtester (a separate optional feature) ever lands.
+
+## Confidence Score
+
+**9/10** for one-pass implementation. The contract (MLZOO-A), both feature-aware models
+(PRP-27, PRP-30), the splitter, and the leakage-test patterns all already exist and are
+stable. The only genuine design judgement — the `X_future` exogenous policy — is resolved
+and locked (DECISIONS LOCKED #4) with Open Question #1 as the explicit review hook. The work
+is additive, single-slice-plus-shared, and needs no migration.
diff --git a/README.md b/README.md
index 4426b908..d406b033 100644
--- a/README.md
+++ b/README.md
@@ -393,6 +393,12 @@ curl -X POST http://localhost:8123/backtesting/run \
 **Baseline Comparisons:**
 When `include_baselines=true`, automatically compares against naive and seasonal_naive models.
 
+**Feature-Aware Models:**
+`regression` and `lightgbm` models can be backtested too — set
+`model_config_main.model_type` accordingly. Each fold builds a leakage-safe
+per-fold feature matrix (`min_train_size >= 30` required); the result carries
+`feature_aware: true` and `exogenous_policy: "observed"`.
+
 See [examples/backtest/](examples/backtest/) for usage examples.
 
 ### Model Registry
diff --git a/app/features/backtesting/schemas.py b/app/features/backtesting/schemas.py
index 537809f0..747d25c2 100644
--- a/app/features/backtesting/schemas.py
+++ b/app/features/backtesting/schemas.py
@@ -173,6 +173,12 @@ class ModelBacktestResult(BaseModel):
         fold_results: Results for each fold.
         aggregated_metrics: Mean metrics across folds.
         metric_std: Standard deviation of metrics across folds.
+        feature_aware: True when the model consumed a per-fold feature matrix
+            (``requires_features``); False for target-only baseline models.
+        exogenous_policy: How the test-window exogenous columns were sourced.
+            ``"observed"`` (the recorded price/promotion plan) for a
+            feature-aware result; ``None`` for a target-only model. Recorded so
+            the metric is read honestly as "accuracy given a known plan".
     """
 
     model_type: str
@@ -180,6 +186,8 @@ class ModelBacktestResult(BaseModel):
     fold_results: list[FoldResult]
     aggregated_metrics: dict[str, float]
     metric_std: dict[str, float]
+    feature_aware: bool = False
+    exogenous_policy: Literal["observed"] | None = None
 
 
 # =============================================================================
diff --git a/app/features/backtesting/service.py b/app/features/backtesting/service.py
index b4b8e8f4..209e9081 100644
--- a/app/features/backtesting/service.py
+++ b/app/features/backtesting/service.py
@@ -17,6 +17,7 @@
 import uuid
 from dataclasses import dataclass, field
 from datetime import date as date_type
+from datetime import timedelta
 from pathlib import Path
 from typing import TYPE_CHECKING, Any
 
@@ -34,20 +35,56 @@
     ModelBacktestResult,
     SplitBoundary,
 )
-from app.features.backtesting.splitter import TimeSeriesSplitter
-from app.features.data_platform.models import SalesDaily
+from app.features.backtesting.splitter import TimeSeriesSplit, TimeSeriesSplitter
+from app.features.data_platform.models import Calendar, Product, Promotion, SalesDaily
 from app.features.forecasting.models import model_factory
 from app.features.forecasting.schemas import (
     ModelConfig,
     NaiveModelConfig,
     SeasonalNaiveModelConfig,
 )
+from app.shared.feature_frames import (
+    HISTORY_TAIL_DAYS,
+    build_future_feature_rows,
+    build_historical_feature_rows,
+)
 
 if TYPE_CHECKING:
     pass
 
 logger = structlog.get_logger()
 
+# Minimum observed train rows a feature-aware model needs per fold to resolve
+# its lag features — mirrors ``forecasting.service._MIN_REGRESSION_TRAIN_ROWS``.
+# A feature-aware backtest with a smaller ``min_train_size`` fails loud in
+# ``_validate_config`` rather than producing all-NaN lag columns silently.
+_MIN_FEATURE_AWARE_TRAIN_ROWS = 30
+
+
+@dataclass
+class ExogenousFrame:
+    """Pre-loaded exogenous data for one series — resolved async, consumed sync.
+
+    A feature-aware backtest needs price / promotion / holiday / launch-date
+    data to build its per-fold feature matrices, but the fold loop is sync and
+    DB-free by design. ``run_backtest`` resolves all of it once into this pure
+    in-memory carrier; the fold builders read it without touching the database.
+
+    Attributes:
+        prices: Observed unit prices, aligned index-for-index with
+            :attr:`SeriesData.dates`.
+        baseline_price: Median positive price (``>0``); fallback ``1.0``.
+        promo_dates: Days a promotion covered anywhere in the data window.
+        holiday_dates: Calendar holiday days in the data window.
+        launch_date: The product's launch date, or ``None``.
+    """
+
+    prices: list[float]
+    baseline_price: float
+    promo_dates: set[date_type]
+    holiday_dates: set[date_type]
+    launch_date: date_type | None
+
 
 @dataclass
 class SeriesData:
@@ -58,6 +95,8 @@ class SeriesData:
         values: Target values as numpy array.
         store_id: Store ID.
         product_id: Product ID.
+        exogenous: Pre-loaded exogenous data — present only for a feature-aware
+            backtest; ``None`` for a target-only run.
         n_observations: Number of observations.
     """
 
@@ -65,6 +104,7 @@ class SeriesData:
     values: np.ndarray[Any, np.dtype[np.floating[Any]]]
     store_id: int
     product_id: int
+    exogenous: ExogenousFrame | None = None
     n_observations: int = field(init=False)
 
     def __post_init__(self) -> None:
@@ -126,6 +166,19 @@ def _validate_config(self, config: BacktestConfig) -> None:
                 message="Using provided min_train_size below recommended default",
             )
 
+        # Feature-aware models need enough train rows per fold to resolve their
+        # lag features. Build a cheap probe (no fit) and branch on the
+        # capability flag — never on a model_type string. Loud, not silent.
+        probe = model_factory(
+            config.model_config_main, random_state=self.settings.forecast_random_seed
+        )
+        if probe.requires_features and split_config.min_train_size < _MIN_FEATURE_AWARE_TRAIN_ROWS:
+            raise ValueError(
+                f"A feature-aware model ({config.model_config_main.model_type}) needs "
+                f"min_train_size of at least {_MIN_FEATURE_AWARE_TRAIN_ROWS} to resolve its "
+                f"lag features per fold; got {split_config.min_train_size}."
+            )
+
     def save_results(
         self,
         response: BacktestResponse,
@@ -217,6 +270,21 @@ async def run_backtest(
                 f"between {start_date} and {end_date}"
             )
 
+        # Feature-aware models consume a per-fold feature matrix. Branch on the
+        # capability flag (not a model_type string) and resolve the exogenous
+        # data once, here in the async entry point — the fold loop stays sync
+        # and DB-free. Target-only models skip this entirely.
+        probe = model_factory(
+            config.model_config_main, random_state=self.settings.forecast_random_seed
+        )
+        if probe.requires_features:
+            series_data.exogenous = await self._load_exogenous_frame(
+                db=db,
+                store_id=store_id,
+                product_id=product_id,
+                dates=series_data.dates,
+            )
+
         # Create splitter and validate
         splitter = TimeSeriesSplitter(config.split_config)
 
@@ -284,30 +352,61 @@ def _run_model_backtest(
     ) -> ModelBacktestResult:
         """Run backtest for a single model configuration.
 
+        Branches on the model's ``requires_features`` capability flag (never a
+        ``model_type`` string). A target-only model takes the unchanged
+        target-only path; a feature-aware model builds the full historical
+        feature matrix once (a local — never instance state) and runs each fold
+        through :meth:`_run_feature_aware_fold` with a leakage-safe per-fold
+        ``X_train`` slice and a rebuilt ``X_future``. The method signature is
+        unchanged — ``gap`` is read from ``splitter.config.gap``. Sync and
+        DB-free: all exogenous I/O happened in :meth:`run_backtest`.
+
         Args:
-            series_data: Loaded time series data.
+            series_data: Loaded time series data (carries ``exogenous`` for a
+                feature-aware run).
             splitter: Time series splitter.
             model_config: Model configuration.
             store_fold_details: Whether to store per-fold details.
 
         Returns:
-            ModelBacktestResult with all fold results.
+            ModelBacktestResult with all fold results; ``feature_aware`` and
+            ``exogenous_policy`` are set for a feature-aware model.
+
+        Raises:
+            ValueError: If a feature-aware model has no loaded ``ExogenousFrame``.
         """
         fold_results: list[FoldResult] = []
         fold_metrics: list[dict[str, float]] = []
 
+        # Probe the capability flag, then build the historical matrix once for
+        # the whole run (feature-aware path only) — sliced, never rebuilt, for
+        # each fold's X_train.
+        probe = model_factory(model_config, random_state=self.settings.forecast_random_seed)
+        feature_aware: bool = probe.requires_features
+        historical_matrix: np.ndarray[Any, np.dtype[np.floating[Any]]] | None = None
+        if feature_aware:
+            historical_matrix = self._build_historical_matrix(series_data)
+
         for split in splitter.split(series_data.dates, series_data.values):
             # Extract train and test data
-            y_train = series_data.values[split.train_indices]
             y_test = series_data.values[split.test_indices]
-
-            # Create and fit model
-            model = model_factory(model_config, random_state=self.settings.forecast_random_seed)
-            model.fit(y_train)
-
-            # Generate predictions
             horizon = len(split.test_indices)
-            predictions = model.predict(horizon)
+
+            if historical_matrix is not None:
+                # Feature-aware path — per-fold leakage-safe X_train / X_future.
+                predictions = self._run_feature_aware_fold(
+                    series_data=series_data,
+                    split=split,
+                    model_config=model_config,
+                    historical_matrix=historical_matrix,
+                    gap=splitter.config.gap,
+                )
+            else:
+                # Target-only path — unchanged.
+                y_train = series_data.values[split.train_indices]
+                model = model_factory(model_config, random_state=self.settings.forecast_random_seed)
+                model.fit(y_train)
+                predictions = model.predict(horizon)
 
             # Calculate metrics
             metrics = self.metrics_calculator.calculate_all(
@@ -360,7 +459,113 @@ def _run_model_backtest(
             fold_results=fold_results,
             aggregated_metrics=aggregated_metrics,
             metric_std=metric_std,
+            feature_aware=feature_aware,
+            exogenous_policy="observed" if feature_aware else None,
+        )
+
+    def _build_historical_matrix(
+        self, series_data: SeriesData
+    ) -> np.ndarray[Any, np.dtype[np.floating[Any]]]:
+        """Build the full-series historical feature matrix for a backtest.
+
+        Built once per :meth:`_run_model_backtest` call as a local — never
+        instance state. Each row is leakage-safe *as a training row*: its lag
+        columns read only strictly-earlier observed targets. Per-fold
+        ``X_train`` is a positional slice of this matrix; ``X_future`` is NEVER
+        sliced from it (that would leak an adjacent test-day target).
+
+        Args:
+            series_data: Loaded time series data — must carry ``exogenous``.
+
+        Returns:
+            Row-major feature matrix aligned with ``series_data.dates``.
+
+        Raises:
+            ValueError: If ``series_data.exogenous`` is ``None`` — the
+                genuinely-unsupported path for a feature-aware backtest.
+        """
+        exo = series_data.exogenous
+        if exo is None:
+            raise ValueError(
+                "feature-aware backtest requires a loaded ExogenousFrame on series_data; "
+                "run_backtest must resolve exogenous data before the fold loop"
+            )
+        rows = build_historical_feature_rows(
+            dates=series_data.dates,
+            quantities=[float(v) for v in series_data.values],
+            prices=exo.prices,
+            baseline_price=exo.baseline_price,
+            promo_dates=exo.promo_dates,
+            holiday_dates=exo.holiday_dates,
+            launch_date=exo.launch_date,
         )
+        return np.array(rows, dtype=np.float64)
+
+    def _run_feature_aware_fold(
+        self,
+        *,
+        series_data: SeriesData,
+        split: TimeSeriesSplit,
+        model_config: ModelConfig,
+        historical_matrix: np.ndarray[Any, np.dtype[np.floating[Any]]],
+        gap: int,
+    ) -> np.ndarray[Any, np.dtype[np.floating[Any]]]:
+        """Fit + predict one fold of a feature-aware backtest — pure, sync.
+
+        ``X_train`` is a positional slice of the full historical matrix (built
+        once, leakage-safe by position). ``X_future`` is rebuilt here per fold
+        via :func:`build_future_feature_rows`: its ``history_tail`` ends at the
+        fold origin ``T`` (the last train day, gap days excluded), so a lag
+        cell whose source day falls in the test window is ``NaN``.
+
+        Args:
+            series_data: Loaded time series data — carries ``exogenous``.
+            split: The fold's train/test split.
+            model_config: Model configuration (feature-aware).
+            historical_matrix: The full-series historical feature matrix.
+            gap: Gap days between train end and test start.
+
+        Returns:
+            Per-day predictions for the fold's test window.
+
+        Raises:
+            ValueError: If ``series_data.exogenous`` is ``None``.
+        """
+        exo = series_data.exogenous
+        if exo is None:  # defensive — the caller guarantees this is non-None
+            raise ValueError("feature-aware backtest requires a loaded ExogenousFrame")
+
+        # X_train — slice the historical matrix (leakage-safe by position).
+        x_train = historical_matrix[split.train_indices]
+        y_train = series_data.values[split.train_indices]
+
+        # X_future — rebuilt per fold. history_tail ends at T = last train day
+        # and EXCLUDES the gap days (data "not yet available" at forecast time).
+        train_end_idx = int(split.train_indices[-1]) + 1
+        history_tail = [float(v) for v in series_data.values[:train_end_idx][-HISTORY_TAIL_DAYS:]]
+        test_indices = [int(i) for i in split.test_indices]
+        test_prices = [exo.prices[i] for i in test_indices]
+        test_promo_dates = {
+            series_data.dates[i] for i in test_indices if series_data.dates[i] in exo.promo_dates
+        }
+        test_holiday_dates = {d for d in split.test_dates if d in exo.holiday_dates}
+        x_future = np.array(
+            build_future_feature_rows(
+                test_dates=split.test_dates,
+                history_tail=history_tail,
+                gap=gap,
+                test_prices=test_prices,
+                baseline_price=exo.baseline_price,
+                test_promo_dates=test_promo_dates,
+                test_holiday_dates=test_holiday_dates,
+                launch_date=exo.launch_date,
+            ),
+            dtype=np.float64,
+        )
+
+        model = model_factory(model_config, random_state=self.settings.forecast_random_seed)
+        model.fit(y_train, x_train)
+        return model.predict(len(test_indices), x_future)
 
     def _run_baseline_comparisons(
         self,
@@ -543,3 +748,105 @@ async def _load_series_data(
             store_id=store_id,
             product_id=product_id,
         )
+
+    async def _load_exogenous_frame(
+        self,
+        db: AsyncSession,
+        store_id: int,
+        product_id: int,
+        dates: list[date_type],
+    ) -> ExogenousFrame:
+        """Load exogenous data for a feature-aware backtest — async, once.
+
+        Mirrors ``ForecastingService._build_regression_features``: resolves the
+        recorded unit price per date, the promotion-covered days, the calendar
+        holidays, and the product launch date into a pure :class:`ExogenousFrame`
+        the sync fold loop consumes. The only ``y``-free reads — never a target.
+
+        Args:
+            db: Database session.
+            store_id: Store ID.
+            product_id: Product ID.
+            dates: The series dates (from :meth:`_load_series_data`) the prices
+                must align with index-for-index.
+
+        Returns:
+            The resolved :class:`ExogenousFrame`.
+        """
+        start_date = dates[0]
+        end_date = dates[-1]
+
+        # Recorded unit price per date — aligned with the series dates. Every
+        # series date came from the same SalesDaily window, so each resolves.
+        price_rows = (
+            await db.execute(
+                select(SalesDaily.date, SalesDaily.unit_price).where(
+                    (SalesDaily.store_id == store_id)
+                    & (SalesDaily.product_id == product_id)
+                    & (SalesDaily.date >= start_date)
+                    & (SalesDaily.date <= end_date)
+                )
+            )
+        ).all()
+        price_by_date = {row.date: float(row.unit_price) for row in price_rows}
+        prices = [price_by_date.get(day, 0.0) for day in dates]
+
+        # Baseline price = median of the positive prices, so price_factor is
+        # ~1.0 on a typical day and < 1.0 on a markdown/promo day.
+        positive_prices = sorted(price for price in prices if price > 0.0)
+        baseline_price = positive_prices[len(positive_prices) // 2] if positive_prices else 1.0
+
+        holiday_dates: set[date_type] = set(
+            (
+                await db.execute(
+                    select(Calendar.date).where(
+                        Calendar.date >= start_date,
+                        Calendar.date <= end_date,
+                        Calendar.is_holiday.is_(True),
+                    )
+                )
+            )
+            .scalars()
+            .all()
+        )
+
+        # Promotion-active days: store-specific OR chain-wide rows overlapping
+        # the data window, expanded to the set of dates they cover.
+        promo_rows = (
+            await db.execute(
+                select(Promotion.start_date, Promotion.end_date).where(
+                    Promotion.product_id == product_id,
+                    (Promotion.store_id == store_id) | (Promotion.store_id.is_(None)),
+                    Promotion.start_date <= end_date,
+                    Promotion.end_date >= start_date,
+                )
+            )
+        ).all()
+        promo_dates: set[date_type] = set()
+        for promo in promo_rows:
+            day = max(promo.start_date, start_date)
+            last = min(promo.end_date, end_date)
+            while day <= last:
+                promo_dates.add(day)
+                day += timedelta(days=1)
+
+        launch_date: date_type | None = await db.scalar(
+            select(Product.launch_date).where(Product.id == product_id)
+        )
+
+        logger.info(
+            "backtesting.exogenous_frame_loaded",
+            store_id=store_id,
+            product_id=product_id,
+            n_dates=len(dates),
+            n_holidays=len(holiday_dates),
+            n_promo_days=len(promo_dates),
+        )
+
+        return ExogenousFrame(
+            prices=prices,
+            baseline_price=baseline_price,
+            promo_dates=promo_dates,
+            holiday_dates=holiday_dates,
+            launch_date=launch_date,
+        )
diff --git a/app/features/backtesting/tests/test_feature_aware_backtest.py b/app/features/backtesting/tests/test_feature_aware_backtest.py
new file mode 100644
index 00000000..e5281ea6
--- /dev/null
+++ b/app/features/backtesting/tests/test_feature_aware_backtest.py
@@ -0,0 +1,221 @@
+"""Unit tests for feature-aware backtesting (MLZOO-B.2).
+
+Pure, DB-free tests of the per-fold feature-aware path wired into
+``BacktestingService._run_model_backtest``: the historical matrix build, the
+per-fold ``X_train`` slice / ``X_future`` rebuild, the ``feature_aware`` /
+``exogenous_policy`` result fields, the gap > 0 fold, and the loud failure when
+a feature-aware model reaches the fold loop with no ``ExogenousFrame`` loaded.
+
+The leakage invariants of the row builders themselves live in the load-bearing
+``app/shared/feature_frames/tests/test_leakage.py``; this file pins the
+backtesting *integration* of those builders.
+"""
+
+from __future__ import annotations
+
+from datetime import date
+
+import numpy as np
+import pytest
+
+from app.features.backtesting.schemas import SplitConfig
+from app.features.backtesting.service import (
+    BacktestingService,
+    ExogenousFrame,
+    SeriesData,
+)
+from app.features.backtesting.splitter import TimeSeriesSplitter
+from app.features.forecasting.schemas import NaiveModelConfig, RegressionModelConfig
+from app.shared.feature_frames import canonical_feature_columns
+
+_N_FEATURES = len(canonical_feature_columns())  # 14 — 4 lags + 6 calendar + 4 exogenous
+
+
+def _exogenous(n: int) -> ExogenousFrame:
+    """A flat, no-promo, no-holiday ExogenousFrame aligned with an n-day series."""
+    return ExogenousFrame(
+        prices=[9.99] * n,
+        baseline_price=9.99,
+        promo_dates=set(),
+        holiday_dates=set(),
+        launch_date=None,
+    )
+
+
+def _series(dates: list[date], values: np.ndarray, *, with_exogenous: bool) -> SeriesData:
+    """Build SeriesData, optionally carrying a loaded ExogenousFrame."""
+    return SeriesData(
+        dates=dates,
+        values=values,
+        store_id=1,
+        product_id=1,
+        exogenous=_exogenous(len(dates)) if with_exogenous else None,
+    )
+
+
+def test_canonical_feature_set_is_fourteen_columns() -> None:
+    """The feature-aware matrices use exactly the 14-column canonical set."""
+    assert _N_FEATURES == 14
+
+
+def test_build_historical_matrix_shape_matches_series_and_columns(
+    sample_dates_120: list[date],
+    sample_values_120: np.ndarray,
+) -> None:
+    """The historical matrix has one row per series day, canonical column width."""
+    service = BacktestingService()
+    series = _series(sample_dates_120, sample_values_120, with_exogenous=True)
+
+    matrix = service._build_historical_matrix(series)
+
+    assert matrix.shape == (120, _N_FEATURES)
+
+
+def test_build_historical_matrix_without_exogenous_fails_loud(
+    sample_dates_120: list[date],
+    sample_values_120: np.ndarray,
+) -> None:
+    """No ExogenousFrame -> loud ValueError, never a silent all-NaN matrix."""
+    service = BacktestingService()
+    series = _series(sample_dates_120, sample_values_120, with_exogenous=False)
+
+    with pytest.raises(ValueError, match="ExogenousFrame"):
+        service._build_historical_matrix(series)
+
+
+def test_feature_aware_fold_predicts_one_value_per_test_day(
+    sample_dates_120: list[date],
+    sample_values_120: np.ndarray,
+    sample_split_config_expanding: SplitConfig,
+) -> None:
+    """A single feature-aware fold yields exactly horizon predictions."""
+    service = BacktestingService()
+    series = _series(sample_dates_120, sample_values_120, with_exogenous=True)
+    splitter = TimeSeriesSplitter(sample_split_config_expanding)
+    historical_matrix = service._build_historical_matrix(series)
+    split = next(splitter.split(series.dates, series.values))
+
+    predictions = service._run_feature_aware_fold(
+        series_data=series,
+        split=split,
+        model_config=RegressionModelConfig(),
+        historical_matrix=historical_matrix,
+        gap=sample_split_config_expanding.gap,
+    )
+
+    assert predictions.shape == (len(split.test_indices),)
+    assert np.all(np.isfinite(predictions))
+
+
+def test_feature_aware_backtest_produces_per_fold_metrics(
+    sample_dates_120: list[date],
+    sample_values_120: np.ndarray,
+    sample_split_config_expanding: SplitConfig,
+) -> None:
+    """A regression backtest runs end-to-end and yields per-fold metrics.
+
+    Repurposed positive assertion — feature-aware models are backtestable now
+    (supersedes the PRP-29 interim loud-fail contract).
+    """
+    service = BacktestingService()
+    series = _series(sample_dates_120, sample_values_120, with_exogenous=True)
+    splitter = TimeSeriesSplitter(sample_split_config_expanding)
+
+    result = service._run_model_backtest(
+        series_data=series,
+        splitter=splitter,
+        model_config=RegressionModelConfig(),
+        store_fold_details=True,
+    )
+
+    assert result.model_type == "regression"
+    assert len(result.fold_results) > 0
+    assert "mae" in result.aggregated_metrics
+    for fold in result.fold_results:
+        assert "mae" in fold.metrics
+
+
+def test_feature_aware_result_records_observed_policy(
+    sample_dates_120: list[date],
+    sample_values_120: np.ndarray,
+    sample_split_config_expanding: SplitConfig,
+) -> None:
+    """A feature-aware result is flagged and records the v1 exogenous policy."""
+    service = BacktestingService()
+    series = _series(sample_dates_120, sample_values_120, with_exogenous=True)
+    splitter = TimeSeriesSplitter(sample_split_config_expanding)
+
+    result = service._run_model_backtest(
+        series_data=series,
+        splitter=splitter,
+        model_config=RegressionModelConfig(),
+        store_fold_details=True,
+    )
+
+    assert result.feature_aware is True
+    assert result.exogenous_policy == "observed"
+
+
+def test_target_only_result_is_not_flagged_feature_aware(
+    sample_dates_120: list[date],
+    sample_values_120: np.ndarray,
+    sample_split_config_expanding: SplitConfig,
+) -> None:
+    """A target-only baseline keeps feature_aware False and no exogenous policy."""
+    service = BacktestingService()
+    series = _series(sample_dates_120, sample_values_120, with_exogenous=False)
+    splitter = TimeSeriesSplitter(sample_split_config_expanding)
+
+    result = service._run_model_backtest(
+        series_data=series,
+        splitter=splitter,
+        model_config=NaiveModelConfig(),
+        store_fold_details=True,
+    )
+
+    assert result.feature_aware is False
+    assert result.exogenous_policy is None
+
+
+def test_feature_aware_backtest_without_exogenous_fails_loud(
+    sample_dates_120: list[date],
+    sample_values_120: np.ndarray,
+    sample_split_config_expanding: SplitConfig,
+) -> None:
+    """A feature-aware model reaching the fold loop with no ExogenousFrame
+    must fail LOUD — the genuinely-unsupported path (DECISIONS LOCKED #8)."""
+    service = BacktestingService()
+    series = _series(sample_dates_120, sample_values_120, with_exogenous=False)
+    splitter = TimeSeriesSplitter(sample_split_config_expanding)
+
+    with pytest.raises(ValueError, match="ExogenousFrame"):
+        service._run_model_backtest(
+            series_data=series,
+            splitter=splitter,
+            model_config=RegressionModelConfig(),
+            store_fold_details=True,
+        )
+
+
+def test_feature_aware_backtest_runs_with_a_gap_fold(
+    sample_dates_120: list[date],
+    sample_values_120: np.ndarray,
+    sample_split_config_with_gap: SplitConfig,
+) -> None:
+    """A gap > 0 fold runs — the lag columns drop the gap lead-in correctly."""
+    assert sample_split_config_with_gap.gap > 0
+    service = BacktestingService()
+    series = _series(sample_dates_120, sample_values_120, with_exogenous=True)
+    splitter = TimeSeriesSplitter(sample_split_config_with_gap)
+
+    result = service._run_model_backtest(
+        series_data=series,
+        splitter=splitter,
+        model_config=RegressionModelConfig(),
+        store_fold_details=True,
+    )
+
+    assert result.feature_aware is True
+    assert len(result.fold_results) > 0
+    for fold in result.fold_results:
+        assert np.isfinite(fold.metrics["mae"])
diff --git a/app/features/backtesting/tests/test_routes_integration.py b/app/features/backtesting/tests/test_routes_integration.py
index efe2af33..a0533146 100644
--- a/app/features/backtesting/tests/test_routes_integration.py
+++ b/app/features/backtesting/tests/test_routes_integration.py
@@ -393,3 +393,85 @@ async def test_response_contains_all_expected_fields(
             assert "test_end" in split
             assert "train_size" in split
             assert "test_size" in split
+
+
+@pytest.mark.integration
+@pytest.mark.asyncio
+class TestBacktestingRouteFeatureAwareIntegration:
+    """Integration tests for POST /backtesting/run with a feature-aware model."""
+
+    async def test_run_backtest_regression_model(
+        self,
+        client: AsyncClient,
+        sample_store: Store,
+        sample_product: Product,
+        sample_sales_120: list[SalesDaily],
+    ) -> None:
+        """A regression backtest returns 200 with per-fold metrics + baselines."""
+        response = await client.post(
+            "/backtesting/run",
+            json={
+                "store_id": sample_store.id,
+                "product_id": sample_product.id,
+                "start_date": "2024-01-01",
+                "end_date": "2024-04-29",
+                "config": {
+                    "split_config": {
+                        "strategy": "expanding",
+                        "n_splits": 3,
+                        "min_train_size": 30,
+                        "gap": 0,
+                        "horizon": 14,
+                    },
+                    "model_config_main": {"model_type": "regression"},
+                    "include_baselines": True,
+                    "store_fold_details": True,
+                },
+            },
+        )
+
+        assert response.status_code == 200
+        data = response.json()
+
+        main = data["main_model_results"]
+        assert main["model_type"] == "regression"
+        assert main["feature_aware"] is True
+        assert main["exogenous_policy"] == "observed"
+        assert len(main["fold_results"]) > 0
+        assert data["leakage_check_passed"] is True
+
+        baseline_types = {b["model_type"] for b in data["baseline_results"]}
+        assert baseline_types == {"naive", "seasonal_naive"}
+
+    async def test_run_backtest_regression_rejects_small_min_train_size(
+        self,
+        client: AsyncClient,
+        sample_store: Store,
+        sample_product: Product,
+        sample_sales_120: list[SalesDaily],
+    ) -> None:
+        """A regression backtest with min_train_size < 30 returns RFC 7807 400."""
+        response = await client.post(
+            "/backtesting/run",
+            json={
+                "store_id": sample_store.id,
+                "product_id": sample_product.id,
+                "start_date": "2024-01-01",
+                "end_date": "2024-04-29",
+                "config": {
+                    "split_config": {
+                        "strategy": "expanding",
+                        "n_splits": 3,
+                        "min_train_size": 20,
+                        "gap": 0,
+                        "horizon": 14,
+                    },
+                    "model_config_main": {"model_type": "regression"},
+                    "include_baselines": False,
+                    "store_fold_details": True,
+                },
+            },
+        )
+
+        assert response.status_code == 400
+        assert "at least 30" in response.text
diff --git a/app/features/backtesting/tests/test_schemas.py b/app/features/backtesting/tests/test_schemas.py
index 31eec119..c071fae9 100644
--- a/app/features/backtesting/tests/test_schemas.py
+++ b/app/features/backtesting/tests/test_schemas.py
@@ -225,6 +225,48 @@ def test_model_backtest_result_creation(self):
         assert result.model_type == "naive"
         assert result.aggregated_metrics["mae"] == 5.0
 
+    def test_feature_aware_fields_default_to_target_only(self):
+        """The MLZOO-B.2 fields default to a target-only result — every existing
+        construction site (which omits them) stays valid."""
+        result = ModelBacktestResult(
+            model_type="naive",
+            config_hash="abc123",
+            fold_results=[],
+            aggregated_metrics={"mae": 5.0},
+            metric_std={"mae_stability": 10.0},
+        )
+
+        assert result.feature_aware is False
+        assert result.exogenous_policy is None
+
+    def test_feature_aware_fields_accept_explicit_values(self):
+        """A feature-aware result carries feature_aware=True + the exogenous policy."""
+        result = ModelBacktestResult(
+            model_type="regression",
+            config_hash="def456",
+            fold_results=[],
+            aggregated_metrics={"mae": 3.0},
+            metric_std={"mae_stability": 8.0},
+            feature_aware=True,
+            exogenous_policy="observed",
+        )
+
+        assert result.feature_aware is True
+        assert result.exogenous_policy == "observed"
+
+    def test_exogenous_policy_rejects_an_unknown_value(self):
+        """exogenous_policy is a closed Literal — only 'observed' in v1."""
+        with pytest.raises(ValidationError):
+            ModelBacktestResult(
+                model_type="regression",
+                config_hash="def456",
+                fold_results=[],
+                aggregated_metrics={"mae": 3.0},
+                metric_std={"mae_stability": 8.0},
+                feature_aware=True,
+                exogenous_policy="assumptions",  # type: ignore[arg-type]
+            )
+
 
 class TestBacktestRequest:
     """Tests for BacktestRequest schema."""
diff --git a/app/features/backtesting/tests/test_service.py b/app/features/backtesting/tests/test_service.py
index 87d17cb1..992ccb80 100644
--- a/app/features/backtesting/tests/test_service.py
+++ b/app/features/backtesting/tests/test_service.py
@@ -125,11 +125,16 @@ def test_feature_aware_model_fails_loud_in_backtest(
     ) -> None:
         """A feature-aware model must fail LOUD in a backtest, never run silently.
 
-        The fold loop calls ``model.fit(y_train)`` target-only; a
-        ``RegressionForecaster`` (``requires_features=True``) raises ``ValueError``
-        there because no exogenous ``X`` was supplied. Feature-aware backtesting
-        is wired in PRP-MLZOO-B — until then this loud, non-leaky failure is the
-        contract (PRP-29 DECISIONS LOCKED #7).
+        Repurposed by PRP-MLZOO-B.2 (DECISIONS LOCKED #8): feature-aware
+        backtesting is now wired, so a ``regression`` backtest SUCCEEDS once
+        ``run_backtest`` has resolved the exogenous data — the positive
+        assertions live in ``test_feature_aware_backtest.py``. PRP-29 DECISIONS
+        LOCKED #7 and PRP-30 DECISIONS LOCKED #6 are therefore superseded.
+
+        What stays loud — and is pinned here — is the genuinely-unsupported
+        path: a ``requires_features`` model reaching the fold loop with no
+        ``ExogenousFrame`` loaded on ``series_data`` raises ``ValueError``
+        rather than degrading to a silent all-NaN matrix.
         """
         from app.features.backtesting.splitter import TimeSeriesSplitter
         from app.features.forecasting.schemas import RegressionModelConfig
@@ -140,10 +145,10 @@ def test_feature_aware_model_fails_loud_in_backtest(
             values=sample_values_120,
             store_id=1,
             product_id=1,
-        )
+        )  # NOTE: no ExogenousFrame loaded — the unsupported path.
         splitter = TimeSeriesSplitter(sample_split_config_expanding)
 
-        with pytest.raises(ValueError, match="requires exogenous features X"):
+        with pytest.raises(ValueError, match="ExogenousFrame"):
             service._run_model_backtest(
                 series_data=series_data,
                 splitter=splitter,
diff --git a/app/features/backtesting/tests/test_service_integration.py b/app/features/backtesting/tests/test_service_integration.py
index d1b0fbd7..81e661ca 100644
--- a/app/features/backtesting/tests/test_service_integration.py
+++ b/app/features/backtesting/tests/test_service_integration.py
@@ -295,3 +295,99 @@ async def test_backtest_with_gap_produces_correct_splits(
             actual_gap = (test_start - train_end).days
             # Gap should be at least gap_days (could be more if data is sparse)
             assert actual_gap >= gap_days, f"Expected gap >= {gap_days}, got {actual_gap}"
+
+
+@pytest.mark.integration
+@pytest.mark.asyncio
+class TestBacktestingServiceFeatureAwareIntegration:
+    """Integration tests for feature-aware backtesting (MLZOO-B.2).
+
+    A ``regression`` model is evaluated end-to-end against the real database:
+    ``run_backtest`` resolves the exogenous frame, the fold loop builds the
+    per-fold leakage-safe ``X_train`` / ``X_future``, and the result is
+    compared head-to-head with the naive / seasonal baselines.
+    """
+
+    async def test_regression_backtest_runs_with_baseline_comparison(
+        self,
+        db_session: AsyncSession,
+        sample_store: Store,
+        sample_product: Product,
+        sample_sales_120: list[SalesDaily],
+    ) -> None:
+        """A regression backtest yields per-fold metrics + a baseline comparison."""
+        from app.features.forecasting.schemas import RegressionModelConfig
+
+        service = BacktestingService()
+        config = BacktestConfig(
+            split_config=SplitConfig(
+                strategy="expanding",
+                n_splits=3,
+                min_train_size=30,
+                gap=0,
+                horizon=14,
+            ),
+            model_config_main=RegressionModelConfig(),
+            include_baselines=True,
+            store_fold_details=True,
+        )
+
+        response = await service.run_backtest(
+            db=db_session,
+            store_id=sample_store.id,
+            product_id=sample_product.id,
+            start_date=date(2024, 1, 1),
+            end_date=date(2024, 4, 29),
+            config=config,
+        )
+
+        main = response.main_model_results
+        assert main.model_type == "regression"
+        assert main.feature_aware is True
+        assert main.exogenous_policy == "observed"
+        assert len(main.fold_results) > 0
+        assert "mae" in main.aggregated_metrics
+        assert response.leakage_check_passed is True
+
+        # Baselines stay target-only and unflagged.
+        assert response.baseline_results is not None
+        baseline_types = {b.model_type for b in response.baseline_results}
+        assert baseline_types == {"naive", "seasonal_naive"}
+        for baseline in response.baseline_results:
+            assert baseline.feature_aware is False
+            assert baseline.exogenous_policy is None
+        assert response.comparison_summary is not None
+
+    async def test_regression_backtest_rejects_small_min_train_size(
+        self,
+        db_session: AsyncSession,
+        sample_store: Store,
+        sample_product: Product,
+        sample_sales_120: list[SalesDaily],
+    ) -> None:
+        """A feature-aware backtest with min_train_size < 30 fails loud."""
+        from app.features.forecasting.schemas import RegressionModelConfig
+
+        service = BacktestingService()
+        config = BacktestConfig(
+            split_config=SplitConfig(
+                strategy="expanding",
+                n_splits=3,
+                min_train_size=20,
+                gap=0,
+                horizon=14,
+            ),
+            model_config_main=RegressionModelConfig(),
+            include_baselines=False,
+            store_fold_details=True,
+        )
+
+        with pytest.raises(ValueError, match="at least 30"):
+            await service.run_backtest(
+                db=db_session,
+                store_id=sample_store.id,
+                product_id=sample_product.id,
+                start_date=date(2024, 1, 1),
+                end_date=date(2024, 4, 29),
+                config=config,
+            )
diff --git a/app/features/forecasting/service.py b/app/features/forecasting/service.py
index f2affc75..ca706803 100644
--- a/app/features/forecasting/service.py
+++ b/app/features/forecasting/service.py
@@ -11,7 +11,6 @@
 
 from __future__ import annotations
 
-import math
 import time
 import uuid
 from dataclasses import dataclass, field
@@ -40,10 +39,8 @@
     TrainResponse,
 )
 from app.shared.feature_frames import (
-    CALENDAR_COLUMNS,
-    EXOGENOUS_LAGS,
     HISTORY_TAIL_DAYS,
-    build_calendar_columns,
+    build_historical_feature_rows,
     canonical_feature_columns,
 )
 
@@ -133,9 +130,11 @@ def _assemble_regression_rows(
     then the calendar columns, then ``price_factor``, ``promo_active``,
     ``is_holiday``, ``days_since_launch``.
 
-    Extracted from :meth:`ForecastingService._build_regression_features` so the
-    leakage invariant can be unit-tested without a database
-    (``test_regression_features_leakage.py``).
+    Delegating shim (MLZOO-B.2): the row-assembly body was promoted verbatim
+    to :func:`app.shared.feature_frames.build_historical_feature_rows` so the
+    ``backtesting`` slice can reuse it without a forbidden cross-slice import.
+    This wrapper keeps the name and signature byte-stable, so the load-bearing
+    leakage spec (``test_regression_features_leakage.py``) imports it unchanged.
 
     Args:
         dates: Observed days in chronological order.
@@ -151,23 +150,15 @@ def _assemble_regression_rows(
         a lag whose source day precedes the series, and ``days_since_launch``
         when the product has no launch date.
     """
-    calendar_columns = build_calendar_columns(dates)
-    rows: list[list[float]] = []
-    for index, day in enumerate(dates):
-        row: list[float] = []
-        # Target long-lag columns — read only strictly-earlier observations.
-        for lag in EXOGENOUS_LAGS:
-            row.append(quantities[index - lag] if index >= lag else math.nan)
-        # Calendar columns — pure functions of the date (shared builder).
-        for name in CALENDAR_COLUMNS:
-            row.append(calendar_columns[name][index])
-        # Exogenous columns — same-day observed attributes.
-        row.append(prices[index] / baseline_price)
-        row.append(1.0 if day in promo_dates else 0.0)
-        row.append(1.0 if day in holiday_dates else 0.0)
-        row.append(float((day - launch_date).days) if launch_date is not None else math.nan)
-        rows.append(row)
-    return rows
+    return build_historical_feature_rows(
+        dates=dates,
+        quantities=quantities,
+        prices=prices,
+        baseline_price=baseline_price,
+        promo_dates=promo_dates,
+        holiday_dates=holiday_dates,
+        launch_date=launch_date,
+    )
 
 
 class ForecastingService:
diff --git a/app/features/jobs/service.py b/app/features/jobs/service.py
index fd0dfdc6..9ec982a0 100644
--- a/app/features/jobs/service.py
+++ b/app/features/jobs/service.py
@@ -609,8 +609,10 @@ async def _execute_backtest(
         from app.features.backtesting.schemas import BacktestConfig, SplitConfig
         from app.features.backtesting.service import BacktestingService
         from app.features.forecasting.schemas import (
+            LightGBMModelConfig,
             MovingAverageModelConfig,
             NaiveModelConfig,
+            RegressionModelConfig,
             SeasonalNaiveModelConfig,
         )
 
@@ -644,6 +646,13 @@ async def _execute_backtest(
         elif model_type == "moving_average":
             window_size = params.get("window_size", 7)
             model_config = MovingAverageModelConfig(window_size=window_size)
+        elif model_type == "regression":
+            # Feature-aware — the backtest builds per-fold leakage-safe X.
+            model_config = RegressionModelConfig()
+        elif model_type == "lightgbm":
+            # Feature-aware — still gated by forecast_enable_lightgbm inside
+            # model_factory; a disabled flag surfaces as a loud failed job.
+            model_config = LightGBMModelConfig()
         else:
             msg = f"Unsupported model_type: {model_type}"
             raise ValueError(msg)
diff --git a/app/features/jobs/tests/test_service.py b/app/features/jobs/tests/test_service.py
index 834c630d..9773cb96 100644
--- a/app/features/jobs/tests/test_service.py
+++ b/app/features/jobs/tests/test_service.py
@@ -21,6 +21,7 @@
     SplitBoundary,
     SplitConfig,
 )
+from app.features.backtesting.service import BacktestingService
 from app.features.forecasting.schemas import (
     LightGBMModelConfig,
     RegressionModelConfig,
@@ -246,3 +247,66 @@ async def test_execute_train_rejects_unsupported_model_type() -> None:
             db=cast(AsyncSession, AsyncMock()),
             params={**_REGRESSION_PARAMS, "model_type": "arima"},
         )
+
+
+# Parameters for a backtest job — _execute_backtest reads these keys.
+_BACKTEST_PARAMS: dict[str, Any] = {
+    "model_type": "regression",
+    "store_id": 1,
+    "product_id": 1,
+    "start_date": "2024-01-01",
+    "end_date": "2024-12-01",
+    "n_splits": 3,
+}
+
+
+async def test_execute_backtest_builds_regression_config() -> None:
+    """A backtest job with model_type='regression' builds a RegressionModelConfig.
+
+    ``run_backtest`` is mocked, so the test is pure (no DB): it pins that
+    ``_execute_backtest`` widened its allow-list and shaped the result.
+    """
+    response = _make_response()
+    with patch.object(
+        BacktestingService, "run_backtest", new=AsyncMock(return_value=response)
+    ) as mock_run:
+        result = await JobService()._execute_backtest(
+            db=cast(AsyncSession, AsyncMock()),
+            params=_BACKTEST_PARAMS,
+        )
+    assert mock_run.call_args is not None
+    config = mock_run.call_args.kwargs["config"]
+    assert isinstance(config.model_config_main, RegressionModelConfig)
+    assert result["model_type"] == "regression"
+    # The frontend job-result contract is still shaped (byte-stable keys).
+    assert "fold_metrics" in result
+    assert "aggregated_metrics" in result
+
+
+async def test_execute_backtest_builds_lightgbm_config() -> None:
+    """A backtest job with model_type='lightgbm' builds a LightGBMModelConfig.
+
+    ``run_backtest`` is mocked, so ``model_factory``'s feature-flag gate is
+    never reached and the optional lightgbm dependency is not required.
+    """
+    response = _make_response()
+    with patch.object(
+        BacktestingService, "run_backtest", new=AsyncMock(return_value=response)
+    ) as mock_run:
+        result = await JobService()._execute_backtest(
+            db=cast(AsyncSession, AsyncMock()),
+            params={**_BACKTEST_PARAMS, "model_type": "lightgbm"},
+        )
+    assert mock_run.call_args is not None
+    config = mock_run.call_args.kwargs["config"]
+    assert isinstance(config.model_config_main, LightGBMModelConfig)
+    assert result["model_type"] == "lightgbm"
+
+
+async def test_execute_backtest_rejects_unsupported_model_type() -> None:
+    """_execute_backtest still rejects a genuinely unsupported model_type."""
+    with pytest.raises(ValueError, match="Unsupported model_type"):
+        await JobService()._execute_backtest(
+            db=cast(AsyncSession, AsyncMock()),
+            params={**_BACKTEST_PARAMS, "model_type": "arima"},
+        )
diff --git a/app/shared/feature_frames/__init__.py b/app/shared/feature_frames/__init__.py
index b79d7d92..df0568b4 100644
--- a/app/shared/feature_frames/__init__.py
+++ b/app/shared/feature_frames/__init__.py
@@ -23,6 +23,10 @@
     canonical_feature_columns,
     feature_safety,
 )
+from app.shared.feature_frames.rows import (
+    build_future_feature_rows,
+    build_historical_feature_rows,
+)
 
 __all__ = [
     "CALENDAR_COLUMNS",
@@ -33,6 +37,8 @@
     "FeatureSafety",
     "FutureFeatureFrame",
     "build_calendar_columns",
+    "build_future_feature_rows",
+    "build_historical_feature_rows",
     "build_long_lag_columns",
     "canonical_feature_columns",
     "feature_safety",
diff --git a/app/shared/feature_frames/rows.py b/app/shared/feature_frames/rows.py
new file mode 100644
index 00000000..9aed8adc
--- /dev/null
+++ b/app/shared/feature_frames/rows.py
@@ -0,0 +1,201 @@
+"""Shared row-matrix assemblers for feature-aware forecasting (MLZOO-B.2).
+
+This module joins :mod:`app.shared.feature_frames.contract` under the
+cross-cutting ``app/shared/feature_frames`` package. ``contract.py`` owns the
+pinned constants, the canonical column set, the leakage-safe *column* builders
+and the :class:`~app.shared.feature_frames.contract.FeatureSafety` taxonomy;
+this module owns the two *row-matrix* assemblers built on top of them:
+
+* :func:`build_historical_feature_rows` — the historical (training) feature
+  matrix. Promoted verbatim from ``ForecastingService._assemble_regression_rows``
+  so the ``backtesting`` slice can reuse it without a forbidden cross-slice
+  import (``backtesting -> forecasting`` is not allowed; ``-> app/shared`` is).
+* :func:`build_future_feature_rows` — the test-window (future) feature matrix
+  for a backtest fold. Leakage-safe by construction (see below).
+
+LEAF-LEVEL: like ``contract.py`` this module may NEVER import from
+``app/features/**``. Every function is pure — stdlib ``math`` / ``datetime``
+plus the contract builders only. ``tests/test_contract.py`` enforces it with
+an AST walk; ``tests/test_leakage.py`` pins the leakage invariants.
+
+The leakage rule the future builder obeys (mirrors ``contract.py`` and the
+load-bearing ``tests/test_leakage.py``):
+
+    A future feature value for a test-window day may use ONLY information
+    knowable at the forecast origin ``T`` — the observed history up to and
+    including ``T``, the calendar (a pure function of the date), or an
+    exogenous input recorded for the test window (price / promotion). It may
+    NEVER read an observed *target* at a test-window day.
+"""
+
+from __future__ import annotations
+
+import math
+from datetime import date
+
+from app.shared.feature_frames.contract import (
+    CALENDAR_COLUMNS,
+    EXOGENOUS_LAGS,
+    build_calendar_columns,
+    build_long_lag_columns,
+    canonical_feature_columns,
+)
+
+
+def build_historical_feature_rows(
+    *,
+    dates: list[date],
+    quantities: list[float],
+    prices: list[float],
+    baseline_price: float,
+    promo_dates: set[date],
+    holiday_dates: set[date],
+    launch_date: date | None,
+) -> list[list[float]]:
+    """Assemble the historical regression feature matrix — pure, leakage-safe.
+
+    Time-safe by construction: every lag column at row ``i`` reads only the
+    observed target at ``i - lag`` (a strictly earlier day); calendar columns
+    are pure functions of the date; ``price_factor`` / ``promo_active`` /
+    ``is_holiday`` / ``days_since_launch`` read the same-day exogenous
+    attributes. No row reads a future observation.
+
+    Column order is :func:`canonical_feature_columns` exactly: the target
+    lags, then the calendar columns, then ``price_factor``, ``promo_active``,
+    ``is_holiday``, ``days_since_launch``.
+
+    Promoted verbatim from ``ForecastingService._assemble_regression_rows``
+    (which now delegates here) so the leakage invariant is unit-tested without
+    a database (``app/features/forecasting/tests/test_regression_features_leakage.py``)
+    and the ``backtesting`` slice can reuse it without a cross-slice import.
+
+    Args:
+        dates: Observed days in chronological order.
+        quantities: Observed target values aligned with ``dates``.
+        prices: Observed unit prices aligned with ``dates``.
+        baseline_price: The typical price; ``price_factor`` is the ratio to it.
+        promo_dates: Days a promotion covered.
+        holiday_dates: Calendar holiday days.
+        launch_date: The product's launch date, or ``None``.
+
+    Returns:
+        Row-major feature matrix ``[n_observations][n_features]``; ``NaN`` marks
+        a lag whose source day precedes the series, and ``days_since_launch``
+        when the product has no launch date.
+    """
+    calendar_columns = build_calendar_columns(dates)
+    rows: list[list[float]] = []
+    for index, day in enumerate(dates):
+        row: list[float] = []
+        # Target long-lag columns — read only strictly-earlier observations.
+        for lag in EXOGENOUS_LAGS:
+            row.append(quantities[index - lag] if index >= lag else math.nan)
+        # Calendar columns — pure functions of the date (shared builder).
+        for name in CALENDAR_COLUMNS:
+            row.append(calendar_columns[name][index])
+        # Exogenous columns — same-day observed attributes.
+        row.append(prices[index] / baseline_price)
+        row.append(1.0 if day in promo_dates else 0.0)
+        row.append(1.0 if day in holiday_dates else 0.0)
+        row.append(float((day - launch_date).days) if launch_date is not None else math.nan)
+        rows.append(row)
+    return rows
+
+
+def build_future_feature_rows(
+    *,
+    test_dates: list[date],
+    history_tail: list[float],
+    gap: int,
+    test_prices: list[float],
+    baseline_price: float,
+    test_promo_dates: set[date],
+    test_holiday_dates: set[date],
+    launch_date: date | None,
+) -> list[list[float]]:
+    """Assemble a backtest fold's test-window feature matrix — leakage-safe.
+
+    This is the leakage-critical builder. A test-window day has no observed
+    target, so the matrix MUST be rebuilt here rather than sliced from the
+    historical matrix — a sliced historical row would read an adjacent
+    test-day observed target as its ``lag_1`` cell (target leakage).
+
+    Column population by class (matches the canonical column order exactly):
+
+    * **Target lags** (``lag_*``) — from :func:`build_long_lag_columns` over
+      ``history_tail``, which ends at the fold origin ``T``. A lag cell whose
+      source day lies in the test window is ``NaN`` — structurally enforced,
+      never a recursive prediction.
+    * **Calendar columns** — pure functions of the test-window date.
+    * **Exogenous columns** (``price_factor`` / ``promo_active``) — the
+      *observed* recorded price / promotion for the test window. This reads
+      no target ``y`` (not target leakage); it is exogenous foresight under
+      the ``observed`` policy and assumes the future price/promo plan was
+      known at ``T``.
+    * ``is_holiday`` / ``days_since_launch`` — calendar / launch-date
+      attributes, knowable at ``T``.
+
+    Gap handling: with ``gap > 0`` the first test day is ``T + gap + 1`` but
+    :func:`build_long_lag_columns` indexes its day ``m`` as ``T + m``. The lag
+    columns are therefore built for ``gap + len(test_dates)`` days and the
+    first ``gap`` rows dropped. With ``gap == 0`` the slice is a no-op.
+
+    Args:
+        test_dates: The fold's test-window days (chronological).
+        history_tail: Observed targets ending at the fold origin ``T``
+            (``history_tail[-1] == y[T]``); excludes the gap days.
+        gap: Gap days between train end and test start (simulated latency).
+        test_prices: Recorded unit prices aligned with ``test_dates``.
+        baseline_price: The typical price; ``price_factor`` is the ratio to it.
+        test_promo_dates: Test-window days a promotion covered.
+        test_holiday_dates: Test-window calendar holiday days.
+        launch_date: The product's launch date, or ``None``.
+
+    Returns:
+        Row-major feature matrix ``[len(test_dates)][n_features]`` in canonical
+        column order; ``NaN`` marks a future-sourced lag cell and
+        ``days_since_launch`` when the product has no launch date.
+
+    Raises:
+        ValueError: When ``gap`` is negative, ``test_prices`` does not align
+            with ``test_dates``, or a canonical column cannot be sourced.
+    """
+    horizon = len(test_dates)
+    if gap < 0:
+        raise ValueError(f"build_future_feature_rows: gap must be >= 0, got {gap}")
+    if len(test_prices) != horizon:
+        raise ValueError(
+            f"build_future_feature_rows: test_prices has {len(test_prices)} entries "
+            f"but test_dates has {horizon} — they must align"
+        )
+
+    # Lags: build for gap + horizon days, then drop the gap lead-in so row j
+    # corresponds to test day j. NaN-where-future is enforced by the builder.
+    lag_columns = build_long_lag_columns(history_tail, gap + horizon)
+    lag_columns = {name: values[gap:] for name, values in lag_columns.items()}
+    calendar_columns = build_calendar_columns(test_dates)
+    calendar_names = set(CALENDAR_COLUMNS)
+    columns = canonical_feature_columns()
+
+    rows: list[list[float]] = []
+    for j, day in enumerate(test_dates):
+        row: list[float] = []
+        for column in columns:
+            if column.startswith("lag_"):  # target lag — NaN where future
+                row.append(lag_columns[column][j])
+            elif column in calendar_names:  # pure function of the date
+                row.append(calendar_columns[column][j])
+            elif column == "price_factor":  # observed exogenous foresight
+                row.append(test_prices[j] / baseline_price)
+            elif column == "promo_active":  # observed exogenous foresight
+                row.append(1.0 if day in test_promo_dates else 0.0)
+            elif column == "is_holiday":  # calendar attribute
+                row.append(1.0 if day in test_holiday_dates else 0.0)
+            elif column == "days_since_launch":  # pure function of the date
+                row.append(float((day - launch_date).days) if launch_date is not None else math.nan)
+            else:  # loud failure — never a silent 0.0 / NaN fill
+                raise ValueError(
+                    f"build_future_feature_rows: cannot source future column {column!r}"
+                )
+        rows.append(row)
+    return rows
diff --git a/app/shared/feature_frames/tests/test_contract.py b/app/shared/feature_frames/tests/test_contract.py
index 0b68fd6f..6280e19e 100644
--- a/app/shared/feature_frames/tests/test_contract.py
+++ b/app/shared/feature_frames/tests/test_contract.py
@@ -125,10 +125,14 @@ def test_shared_package_imports_nothing_from_features() -> None:
     """``app/shared/**`` is leaf-level — it may never import a vertical slice.
 
     Walks every ``.py`` file in the package and asserts no module imports a
-    name under ``app.features`` (AGENTS.md § Architecture).
+    name under ``app.features`` (AGENTS.md § Architecture). ``rows.py`` (the
+    MLZOO-B.2 row-matrix assemblers) is walked here too — see the explicit
+    assertion below that it is part of the package.
     """
     pkg_dir = Path(__file__).resolve().parents[1]  # app/shared/feature_frames/
+    walked: set[str] = set()
     for py_file in pkg_dir.rglob("*.py"):
+        walked.add(py_file.name)
         source = py_file.read_text(encoding="utf-8")
         for node in ast.walk(ast.parse(source)):
             if isinstance(node, ast.ImportFrom) and node.module:
@@ -140,3 +144,7 @@ def test_shared_package_imports_nothing_from_features() -> None:
                     assert not alias.name.startswith("app.features"), (
                         f"ARCHITECTURE BREACH: {py_file} imports {alias.name}"
                     )
+    # rows.py must exist and be covered by the leaf-level walk above.
+    assert {"contract.py", "rows.py"} <= walked, (
+        f"expected contract.py and rows.py in the package walk, got {sorted(walked)}"
+    )
diff --git a/app/shared/feature_frames/tests/test_leakage.py b/app/shared/feature_frames/tests/test_leakage.py
index 56ba62f4..f1730d64 100644
--- a/app/shared/feature_frames/tests/test_leakage.py
+++ b/app/shared/feature_frames/tests/test_leakage.py
@@ -33,10 +33,15 @@
 import math
 from datetime import date, timedelta
 
+import pytest
+
 from app.shared.feature_frames import (
     EXOGENOUS_LAGS,
     build_calendar_columns,
+    build_future_feature_rows,
+    build_historical_feature_rows,
     build_long_lag_columns,
+    canonical_feature_columns,
 )
 
 # The forecast origin T is the last observed day; the horizon runs T+1 … T+H.
@@ -114,3 +119,136 @@ def test_calendar_columns_are_independent_of_the_target_series() -> None:
         for cell in values:
             assert cell not in history_values
             assert cell not in _FUTURE_TARGETS
+
+
+# --- build_future_feature_rows — the backtest test-window matrix (MLZOO-B.2) --
+#
+# build_future_feature_rows assembles one backtest fold's test-window feature
+# matrix. It receives ONLY history_tail (entirely <= the fold origin T) — it is
+# structurally incapable of reading a test-window observed target. These specs
+# pin that, the NaN-where-future contract (including a gap > 0 fold), and the
+# historical-vs-future asymmetry that is the reason X_future is rebuilt here
+# rather than sliced from the historical matrix.
+
+_TEST_WINDOW = 14
+_TEST_PRICES = [10.0] * _TEST_WINDOW
+
+
+def test_future_lag_cells_are_drawn_only_from_history() -> None:
+    """Every non-NaN future lag cell comes from ``history_tail`` — never a target.
+
+    ``build_future_feature_rows`` takes only ``history_tail`` as target data;
+    a value disjoint from it appearing in any lag cell would be a leak.
+    """
+    test_dates = [_ORIGIN + timedelta(days=offset) for offset in range(1, _TEST_WINDOW + 1)]
+    columns = canonical_feature_columns()
+    rows = build_future_feature_rows(
+        test_dates=test_dates,
+        history_tail=_HISTORY_TAIL,
+        gap=0,
+        test_prices=_TEST_PRICES,
+        baseline_price=10.0,
+        test_promo_dates=set(),
+        test_holiday_dates=set(),
+        launch_date=None,
+    )
+    history_values = set(_HISTORY_TAIL)
+    for lag in EXOGENOUS_LAGS:
+        col = columns.index(f"lag_{lag}")
+        for j in range(_TEST_WINDOW):
+            cell = rows[j][col]
+            if math.isnan(cell):
+                continue
+            assert cell in history_values, (
+                f"lag_{lag} test day {j} emitted {cell}, not an observed history value"
+            )
+            assert cell not in _FUTURE_TARGETS, (
+                f"lag_{lag} test day {j} leaked a future target value {cell}"
+            )
+
+
+@pytest.mark.parametrize("gap", [0, 3, 7])
+def test_future_lag_is_nan_exactly_where_source_is_a_test_day(gap: int) -> None:
+    """A future lag cell is ``NaN`` exactly when its source day is in the test window.
+
+    For lag ``k`` and test day ``j`` (0-indexed) the source day relative to the
+    origin ``T`` is ``T + gap + j + 1 - k``; it lies in the test window — and
+    the cell MUST be ``NaN`` — exactly when ``gap + j - k >= 0``. Otherwise the
+    source is observed history and the cell MUST carry a value.
+    """
+    test_dates = [_ORIGIN + timedelta(days=gap + offset) for offset in range(1, _TEST_WINDOW + 1)]
+    columns = canonical_feature_columns()
+    rows = build_future_feature_rows(
+        test_dates=test_dates,
+        history_tail=_HISTORY_TAIL,
+        gap=gap,
+        test_prices=_TEST_PRICES,
+        baseline_price=10.0,
+        test_promo_dates=set(),
+        test_holiday_dates=set(),
+        launch_date=None,
+    )
+    for lag in EXOGENOUS_LAGS:
+        col = columns.index(f"lag_{lag}")
+        for j in range(_TEST_WINDOW):
+            cell = rows[j][col]
+            if gap + j - lag >= 0:
+                assert math.isnan(cell), (
+                    f"gap={gap} lag_{lag} day {j}: source is a test day — expected NaN, got {cell}"
+                )
+            else:
+                assert not math.isnan(cell), (
+                    f"gap={gap} lag_{lag} day {j}: source is in history — expected a value, got NaN"
+                )
+
+
+def test_historical_and_future_lag_columns_are_asymmetric() -> None:
+    """The crux of MLZOO-B.2: a historical lag row reads adjacent observed targets;
+    a future lag row does NOT — which is why ``X_future`` is rebuilt here and
+    never sliced from the historical matrix.
+
+    A continuous sequential series is split at the origin ``T``. The historical
+    matrix row for a test-window day reads that day's neighbouring *observed*
+    target as ``lag_1`` (slicing it for ``X_future`` would be target leakage).
+    The future matrix produces ``NaN`` there instead.
+    """
+    series_len = 60
+    train_end = 40  # origin T is index 39 (the last train day)
+    full = [float(i + 1) for i in range(series_len)]
+    history_tail = full[:train_end]
+    columns = canonical_feature_columns()
+    lag1 = columns.index("lag_1")
+
+    historical = build_historical_feature_rows(
+        dates=[_ORIGIN + timedelta(days=offset) for offset in range(series_len)],
+        quantities=full,
+        prices=[10.0] * series_len,
+        baseline_price=10.0,
+        promo_dates=set(),
+        holiday_dates=set(),
+        launch_date=None,
+    )
+    # The historical matrix row for a TEST-window day reads an observed
+    # test-day target as lag_1 — proof that slicing it for X_future leaks.
+    assert historical[train_end + 1][lag1] == full[train_end], (
+        "historical lag_1 for a test-window row must read the adjacent observed target"
+    )
+
+    test_window = 10
+    future = build_future_feature_rows(
+        test_dates=[_ORIGIN + timedelta(days=offset) for offset in range(1, test_window + 1)],
+        history_tail=history_tail,
+        gap=0,
+        test_prices=[10.0] * test_window,
+        baseline_price=10.0,
+        test_promo_dates=set(),
+        test_holiday_dates=set(),
+        launch_date=None,
+    )
+    # The future matrix: test day 0's lag_1 is y[T] (knowable); every later
+    # day's lag_1 is NaN — it never reads a test-window observed target.
+    assert future[0][lag1] == history_tail[-1], "future lag_1 day 0 must be the origin y[T]"
+    for j in range(1, test_window):
+        assert math.isnan(future[j][lag1]), (
+            f"future lag_1 test day {j} must be NaN — it must never read a test-window target"
+        )
diff --git a/docs/PHASE/5-BACKTESTING.md b/docs/PHASE/5-BACKTESTING.md
index d06d8d2e..b465b3d1 100644
--- a/docs/PHASE/5-BACKTESTING.md
+++ b/docs/PHASE/5-BACKTESTING.md
@@ -373,6 +373,39 @@ $ uv run pytest app/features/backtesting/tests/ -v -m integration
 
 ---
 
+## Feature-Aware Backtesting (MLZOO-B.2)
+
+Since PRP-MLZOO-B.2 the fold loop also evaluates **feature-aware** models
+(`regression`, `lightgbm` — any model with `requires_features=True`), not just
+the target-only baselines.
+
+**How it works**:
+1. `run_backtest` (async) probes the model's `requires_features` flag. For a
+   feature-aware model it resolves the exogenous data — recorded price,
+   promotion windows, calendar holidays, product launch date — once, into a
+   pure in-memory `ExogenousFrame`. The fold loop stays sync and DB-free.
+2. `_run_model_backtest` branches on the flag. The feature-aware path builds the
+   full historical feature matrix once, then per fold:
+   - **`X_train`** — a positional slice of that historical matrix.
+   - **`X_future`** — rebuilt per fold by `build_future_feature_rows` (never
+     sliced — that would leak an adjacent test-day target as `lag_1`). Its
+     `history_tail` ends at the fold origin `T`, so future-sourced lag cells
+     are `NaN`; with `gap > 0` the lag columns drop the gap lead-in.
+3. The result records `feature_aware: true` and `exogenous_policy: "observed"`
+   (the v1 policy — the recorded test-window price/promotion plan; exogenous
+   foresight, not target leakage).
+
+**Constraints**:
+- `min_train_size >= 30` is required for a feature-aware backtest (each fold
+  must resolve its lag features) — a smaller value raises `ValueError` (400).
+- The naive / seasonal baselines stay target-only — they take the unchanged
+  code path even when the main model is feature-aware.
+
+The per-fold row builders live in `app/shared/feature_frames/rows.py`; the
+leakage invariants are pinned by `app/shared/feature_frames/tests/test_leakage.py`.
+
+---
+
 ## Next Phase Preparation
 
 Phase 6 (Model Registry) will use the backtesting module to:
diff --git a/examples/models/feature_frame_contract.md b/examples/models/feature_frame_contract.md
index 407a0bac..27356c73 100644
--- a/examples/models/feature_frame_contract.md
+++ b/examples/models/feature_frame_contract.md
@@ -105,12 +105,29 @@ A new feature-aware model (PRP-MLZOO-B onward):
 2. Reuses the **shared** frame builders and `canonical_feature_columns()` — it
    writes **zero** new contract code, so it cannot drift from the regression
    contract.
-3. Consumes the historical frame for `fit(y, X)` and (via
-   `POST /scenarios/simulate`) the future frame for `predict(horizon, X)`.
-
-Known limitation: **backtesting is not wired for feature-aware models.** The
-backtest fold loop calls `model.fit(y_train)` target-only; a feature-aware model
-raises `ValueError` there — a loud, non-leaky failure, pinned by
-`test_feature_aware_model_fails_loud_in_backtest`. Feature-aware backtesting is
-deferred to PRP-MLZOO-B.2 — it remains pending after PRP-30 (LightGBM, MLZOO-B)
-shipped the first advanced feature-aware model.
+3. Consumes the historical frame for `fit(y, X)`, the future frame (via
+   `POST /scenarios/simulate`) for `predict(horizon, X)`, and — since
+   PRP-MLZOO-B.2 — the per-fold backtest frames for `POST /backtesting/run`.
+
+## Backtesting frame (per fold)
+
+Since **PRP-MLZOO-B.2** a feature-aware model is evaluated by the backtesting
+fold loop. Each fold builds two matrices from the **same** canonical column set:
+
+- **`X_train`** — a positional slice of the full historical matrix
+  (`build_historical_feature_rows`), built once over the whole series. Leakage-
+  safe *as a training row*: every lag reads a strictly-earlier observed target.
+- **`X_future`** — the test-window matrix, **rebuilt per fold** by
+  `build_future_feature_rows` (never sliced from the historical matrix — a
+  sliced historical test row would read an adjacent test-day observed target as
+  its `lag_1` cell, which is target leakage). Its `history_tail` ends at the
+  fold origin `T`, so a lag cell whose source day is in the test window is
+  `NaN`. With `gap > 0` the lag columns are built for `gap + horizon` days and
+  the first `gap` rows dropped.
+
+The test-window `price_factor` / `promo_active` come from the **recorded**
+price/promotion for that window (the v1 `observed` exogenous policy) — that
+reads no target `y`, so it is exogenous foresight, not target leakage; the
+`ModelBacktestResult` records `exogenous_policy="observed"` so the metric is
+read honestly. Both builders live in `app/shared/feature_frames/rows.py` and are
+spec'd by the load-bearing `app/shared/feature_frames/tests/test_leakage.py`.