Skip to content

Release: Phase 3 Feature Engineering#25

Merged
w7-mgfcode merged 2 commits into
mainfrom
dev
Jan 31, 2026
Merged

Release: Phase 3 Feature Engineering#25
w7-mgfcode merged 2 commits into
mainfrom
dev

Conversation

@w7-mgfcode
Copy link
Copy Markdown
Owner

@w7-mgfcode w7-mgfcode commented Jan 31, 2026

Summary

Merges dev branch into main with Phase 3 Feature Engineering implementation.

Phase 3: Feature Engineering (PRP-4)

  • FeatureEngineeringService: Time-safe feature computation with CRITICAL leakage prevention
  • API Endpoints: POST /featuresets/compute, POST /featuresets/preview
  • Feature Types:
    • Lag features (shift with positive values only)
    • Rolling features (shift(1) before rolling to exclude current observation)
    • Calendar features (cyclical sin/cos encoding)
    • Exogenous features (price lags, price change, stockout flags)
  • Imputation Strategies: zero, ffill, bfill, mean, expanding_mean, drop
  • 55 tests including comprehensive leakage prevention tests

Code Quality Fixes

  • Validate store_id/product_id presence (no silent defaults)
  • HTTP 400 for unsupported date types
  • Added time-safe expanding_mean imputation strategy
  • Fixed price_pct_change_7d leakage with shift(1)
  • Documentation improvements

Validation

  • ✅ Ruff: All checks passed
  • ✅ MyPy: 0 errors
  • ✅ Pyright: 0 errors
  • ✅ Pytest: 55 tests passed

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Comprehensive time-safe feature engineering: lags, rolling aggregations, calendar features, imputation, and multi-horizon support
    • New compute and preview API endpoints for feature generation with cutoff enforcement and group isolation
  • Documentation

    • Added Feature Engineering docs, architecture updates, README examples, and a Phase 3 feature-engineering writeup
  • Examples & Demos

    • Demo script and example usage for compute/preview workflows
  • Tests

    • Extensive unit and leakage tests for configs, service, routes, and edge cases
  • Chores

    • New runtime dependencies and configuration settings for feature limits

✏️ Tip: You can customize this high-level summary in your review settings.

* docs: update INITIAL-4 and INITIAL-5 with additional references

- Add scikit-learn, mlforecast, and sktime documentation links
- Add considerations for imputation logic, agent tooling, and computation overhead
- Add model persistence documentation references

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* feat(featuresets): implement time-safe feature engineering layer

Add complete feature engineering module with:
- Pydantic schemas for feature configuration (lag, rolling, calendar, exogenous, imputation)
- FeatureEngineeringService with CRITICAL leakage prevention:
  - Lag features use positive shift() only
  - Rolling features use shift(1) BEFORE rolling to exclude current observation
  - Group-aware operations prevent cross-series leakage
  - Cutoff date filtering before any computation
- FastAPI endpoints: POST /featuresets/compute and /featuresets/preview
- Comprehensive test suite (55 tests) including leakage prevention tests
- Example demo script

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* docs: update INITIAL-5.md

* docs: update documentation for Phase 3 Feature Engineering

- README.md: Add featuresets module to project structure and API endpoints
- docs/ARCHITECTURE.md: Add Feature Engineering section (section 6)
- docs/PHASE-index.md: Mark Phase 3 as completed with summary
- docs/PHASE/3-FEATURE_ENGINEERING.md: Create detailed phase documentation

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* fix(featuresets): address code review feedback and prevent data leakage

Routes:
- Validate store_id/product_id presence (no silent defaults to 0)
- Convert ValueError for unsupported date types to HTTP 400

Service:
- Add expanding_mean imputation strategy (time-safe alternative)
- Add warnings when bfill/mean strategies are used (leakage risk)
- Fix price_pct_change_7d to use shift(1) before pct_change

Schemas:
- Add expanding_mean to ImputationConfig Literal type
- Document time-safety of each imputation strategy
- Fix PreviewFeaturesRequest docstring: GET → POST

Documentation:
- Convert bare URLs to markdown links in INITIAL-4.md, INITIAL-5.md
- Fix PRP-4 to show POST for preview endpoint

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* style: format schemas.py

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

---------

Co-authored-by: Gabe@w7dev <gabor@w7-7.net>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Copy link
Copy Markdown

@sourcery-ai sourcery-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry @w7-mgfcode, your pull request is larger than the review limit of 150000 diff characters

@socket-security
Copy link
Copy Markdown

socket-security Bot commented Jan 31, 2026

Review the following changes in direct dependencies. Learn more about Socket for GitHub.

Diff Package Supply Chain
Security
Vulnerability Quality Maintenance License
Addednumpy@​2.4.17510010010070
Addedpandas@​3.0.07610010010080

View full report

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Jan 31, 2026

📝 Walkthrough

Walkthrough

Adds a time-safe feature engineering subsystem: Pydantic configs, a FeatureEngineeringService, data loader and compute helpers, FastAPI endpoints (/featuresets/compute, /featuresets/preview), extensive tests, docs, examples, and new pandas/numpy dependencies; integrates router and config settings into the app.

Changes

Cohort / File(s) Summary
Documentation & Phase Records
INITIAL-4.md, INITIAL-5.md, README.md, docs/ARCHITECTURE.md, docs/PHASE-index.md, docs/PHASE/3-FEATURE_ENGINEERING.md
Added and updated documentation describing the feature engineering design, API endpoints, leakage prevention, implementation details, and phase completion.
Public API & Package Exports
app/features/featuresets/__init__.py
New package initializer re-exporting schemas, service, and result types for external use.
Schemas & Config
app/features/featuresets/schemas.py, app/core/config.py
Introduced Pydantic schema hierarchy and config models (LagConfig, RollingConfig, CalendarConfig, ExogenousConfig, ImputationConfig, FeatureSetConfig), API models (Compute/Preview requests/responses), validation, hashing, and three new Settings fields (feature_max_lookback_days, feature_max_lag, feature_max_window).
Feature Engineering Core
app/features/featuresets/service.py
Implemented FeatureEngineeringService, FeatureComputationResult, FeatureDataLoader, and compute_features_for_series with time-safe steps (cutoff enforcement, imputation, lag/rolling/calendar/exogenous features) and observability.
API Routes & Integration
app/features/featuresets/routes.py, app/main.py
Added FastAPI routes POST /featuresets/compute and POST /featuresets/preview, request validation, DB loading integration, error translation, and registered the router in main app.
Tests & Fixtures
app/features/featuresets/tests/*
Added test package, fixtures (conftest.py), and comprehensive tests: schema validation (test_schemas.py), service behavior (test_service.py), leakage/group isolation/cutoff tests (test_leakage.py), and routes tests scaffolding.
Examples & Demo
examples/compute_features_demo.py
New demo script illustrating compute/preview API usage and a sample feature configuration.
Dependencies
pyproject.toml
Added runtime deps: pandas>=3.0.0, numpy>=2.4.1; added dev stubs (pandas-stubs) in dependency groups.
Small docs/initial notes
INITIAL-4.md
Added documentation links and other-considerations bullets (imputation, tooling, compute overhead) — documentation-only edits.

Sequence Diagram(s)

sequenceDiagram
    participant Client as Client
    participant API as API Routes
    participant Loader as FeatureData<br/>Loader
    participant DB as Database
    participant Service as FeatureEngineering<br/>Service

    Client->>API: POST /featuresets/compute (store_id, product_id, cutoff_date, config)
    activate API
    API->>Loader: load_sales_data(store_id, product_id, start_date, end_date)
    activate Loader
    Loader->>DB: Query sales & calendar
    DB-->>Loader: Return rows
    Loader-->>API: DataFrame
    deactivate Loader

    API->>Service: compute_features(df, cutoff_date)
    activate Service
    Service->>Service: _apply_imputation (group-aware)
    Service->>Service: _compute_lag_features (grouped shift)
    Service->>Service: _compute_rolling_features (shift(1) then agg)
    Service->>Service: _compute_calendar_features
    Service->>Service: _compute_exogenous_features
    Service-->>API: FeatureComputationResult (df, feature_columns, stats)
    deactivate Service

    API->>API: Serialize FeatureRow list, null_counts, duration_ms
    API-->>Client: ComputeFeaturesResponse
    deactivate API
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested reviewers

  • w7-learn

Poem

🐰 Soft paws tap on timestamps’ rim,

I stitch past days so features brim,
Lags and windows, shifted right,
No peeks ahead — I keep the sight,
Rabbit hums: safe data, bright.

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'Release: Phase 3 Feature Engineering' clearly and concisely summarizes the primary change—delivering the complete Phase 3 Feature Engineering system as the main feature of this PR.
Docstring Coverage ✅ Passed Docstring coverage is 98.88% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch dev

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@w7-learn
Copy link
Copy Markdown
Collaborator

@coderabbitai help

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Jan 31, 2026

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.

CodeRabbit commands

These commands are invoked using PR/Issue comments.

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai evaluate custom pre-merge check --instructions <custom-checks-instructions> --name <custom-checks-title> [--mode <error|warning>] to validate the custom pre-merge checks instructions. Defaults to error when --mode is omitted.
    • @coderabbitai ignore pre-merge checks to override pre-merge checks and get an approval on PR.
    • @coderabbitai run pre-merge checks to run pre-merge checks on the pull request.
  • @coderabbitai generate docstrings to generate docstrings for this PR.
  • @coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
  • @coderabbitai generate unit tests to generate unit tests for this PR.
  • @coderabbitai resolve to resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai generate configuration to create a PR that adds the current resolved configuration as .coderabbit.yaml (or show it if already present).
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore or @coderabbit ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit configuration file (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • You can also validate your configuration using the online YAML validator.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Status, support, documentation and community

  • Visit our status page to check the current availability of CodeRabbit.
  • Create a ticket on our support page for assistance with any issues or questions.
  • Visit our documentation site for detailed information on how to use CodeRabbit.
  • Join our Discord community to connect with other users and get help from the community.
  • Follow us on X/Twitter for updates and announcements.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🤖 Fix all issues with AI agents
In `@app/features/featuresets/service.py`:
- Around line 215-217: The loop that sets min_per incorrectly treats 0 as falsy;
change the expression in the loop (where min_per is assigned for each window in
config.windows) to explicitly check for None: use "config.min_periods if
config.min_periods is not None else window" (so an explicit 0 is preserved and
only None falls back to window), updating the assignment in the function/method
that iterates over config.windows and prepares arguments for pandas' rolling().

In `@app/features/featuresets/tests/test_schemas.py`:
- Around line 118-138: Update the TestImputationConfig.test_valid_strategies to
include the missing "expanding_mean" strategy in the ImputationConfig(...)
strategies dict and update the assertion to expect 6 strategies; locate the test
class TestImputationConfig and method test_valid_strategies and modify the
strategies mapping and the assert len(config.strategies) to reflect the six
supported strategies ("zero","ffill","bfill","mean","expanding_mean","drop").

In `@docs/PHASE/3-FEATURE_ENGINEERING.md`:
- Around line 3-6: Update the release reference string "Release: PR `#24`" to the
current PR number by replacing it with "Release: PR `#25`" in the
PHASE/3-FEATURE_ENGINEERING.md content (look for the exact text "Release: PR
`#24`"); ensure the surrounding markdown remains unchanged and commit the updated
file.

In `@PRPs/PRP-4-feature-engineering.md`:
- Around line 40-42: The PRP currently documents the `/featuresets/preview`
endpoint as GET but the implementation uses POST; update every occurrence in
this document to reflect POST (change the checklist line and any other mentions
of GET for `/featuresets/preview`) so the spec matches the implemented endpoint.
Ensure references to the preview endpoint (e.g., "/featuresets/preview", any
summary lines or bullets) use POST and adjust any example calls or verbs
accordingly.
🧹 Nitpick comments (7)
pyproject.toml (1)

164-167: Consider consolidating dev dependencies in a single location.

The [dependency-groups] section (PEP 735) adds pandas-stubs separately from the existing [project.optional-dependencies] dev group (lines 23-32). Having dev dependencies split across two sections could cause confusion.

Consider moving pandas-stubs to the existing dev optional-dependencies for consistency:

♻️ Suggested consolidation
 [project.optional-dependencies]
 dev = [
     "pytest>=8.3.0",
     "pytest-asyncio>=0.24.0",
     "pytest-cov>=6.0.0",
     "httpx>=0.28.0",
     "ruff>=0.8.0",
     "mypy>=1.13.0",
     "pyright>=1.1.390",
+    "pandas-stubs>=2.3.3.260113",
 ]

Then remove the [dependency-groups] section if no other tools rely on it.

app/features/featuresets/schemas.py (1)

177-184: Minor inconsistency: price_lags allows empty tuple unlike lags.

The LagConfig.lags validator requires at least one lag (line 75-76), but ExogenousConfig.price_lags validator only checks for positive values without rejecting an empty tuple. This allows price_lags=() which may be intentional if include_price=False, but could be confusing.

Consider adding a similar check if empty price_lags should be rejected when include_price=True:

♻️ Optional: Add conditional validation
     `@field_validator`("price_lags")
     `@classmethod`
     def validate_price_lags_positive(cls, v: tuple[int, ...]) -> tuple[int, ...]:
         """Ensure all price lags are positive."""
+        # Note: Empty tuple is allowed when include_price=False
         if any(lag <= 0 for lag in v):
             raise ValueError("All price lags must be positive integers")
         return v

Or add a model_validator to enforce non-empty price_lags when include_price=True.

app/features/featuresets/routes.py (2)

259-311: Extract shared row-conversion logic to avoid drift between compute and preview.

The row/feature extraction block is duplicated; any future fix (like the numeric handling above) will need to be applied twice. Consider a small helper to centralize this logic.


3-8: The current NaN and numeric handling works correctly with pandas 3.0.0.

Pandas 3.0.0's to_dict("records") converts NumPy scalars (e.g., np.int64, np.float64) to native Python int and float types, so they pass the isinstance(val, (int, float)) check without issue. The current code using isinstance(val, float) and math.isnan(val) correctly detects NaN values.

That said, using pd.isna(val) is slightly more robust for edge cases, and using numbers.Integral and numbers.Real from the numbers module is more semantically correct for numeric type checking. Consider these as optional improvements rather than bug fixes.

docs/PHASE-index.md (1)

147-152: Clarify imputation defaults vs full strategy set.

The bullet reads like the only imputation options are zero/ffill. Consider marking them as defaults and pointing to the config for the full set to avoid ambiguity.

✏️ Suggested wording
- **Imputation**: Zero-fill for sales, forward-fill for prices
+ **Imputation**: Default strategies like zero-fill for sales and forward-fill for prices (see ImputationConfig for full set)
docs/PHASE/3-FEATURE_ENGINEERING.md (1)

9-12: Tighten redundant wording in the executive summary.

Small wording tweak for clarity (matches the style lint hint).

✏️ Suggested wording
- ensuring that features computed at any cutoff date only use data available at that point in time.
+ ensuring that features computed at any cutoff date only use data available at that time.
docs/ARCHITECTURE.md (1)

207-210: Trim repeated “only” in the lag bullet.

Minor style cleanup per lint hint.

✏️ Suggested wording
- - **Lag features**: `shift(lag)` with positive lag only — ensures only past data accessed
+ - **Lag features**: `shift(lag)` with positive lag only — ensures past data is accessed

Comment thread app/features/featuresets/service.py
Comment thread app/features/featuresets/tests/test_schemas.py
Comment thread docs/PHASE/3-FEATURE_ENGINEERING.md
Comment thread PRPs/PRP-4-feature-engineering.md
w7-mgfcode pushed a commit that referenced this pull request Jan 31, 2026
Service:
- Fix min_periods falsy check to explicit None check (preserves 0)

Tests:
- Add expanding_mean to test_valid_strategies, expect 6 strategies

Documentation:
- Update PR reference from #24 to #25
- Fix all GET /featuresets/preview to POST in PRP-4

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Service:
- Fix min_periods falsy check to explicit None check (preserves 0)

Tests:
- Add expanding_mean to test_valid_strategies, expect 6 strategies

Documentation:
- Update PR reference from #24 to #25
- Fix all GET /featuresets/preview to POST in PRP-4

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@w7-mgfcode w7-mgfcode merged commit 888c082 into main Jan 31, 2026
11 of 12 checks passed
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

🤖 Fix all issues with AI agents
In `@app/features/featuresets/service.py`:
- Around line 257-296: _compute_calendar_features currently ignores
config.include_is_holiday so the holiday flag never gets added to outputs;
modify the method to check config.include_is_holiday and when true set
result["is_holiday"] = dates["is_holiday"].astype(int) (or dates.dt.is_holiday
if using a datetime accessor) and append "is_holiday" to columns so the feature
is returned and included in stats; ensure you follow the same casting/typing
pattern used for is_weekend/is_month_end in the function.
- Around line 360-404: The _compute_exogenous_features method currently ignores
ExogenousConfig.include_promo and include_inventory; either implement those
feature computations or fail fast. Update _compute_exogenous_features to (A)
implement promo features when config.include_promo is true by reading the promo
column (e.g., "is_promo" or configured name), creating lagged promo columns
analogous to price (use groupby(self.entity_cols).shift(lag) for each
config.promo_lags and any pct/change or flag derivations), and (B) implement
inventory features when config.include_inventory is true by using the inventory
column (e.g., "on_hand" or configured name) to produce lagged inventory columns
and derived indicators (like inventory_pct_change or inventory_level_lag_N)
similar to how price/stockout are done; if you prefer validation instead, add an
explicit check at the start of _compute_exogenous_features that raises a
RuntimeError when config.include_promo or config.include_inventory is true but
no corresponding implementation/columns exist, referencing the ExogenousConfig
flags and the _compute_exogenous_features method so reviewers can locate
changes.

In `@docs/PHASE/3-FEATURE_ENGINEERING.md`:
- Around line 19-33: Update the documentation to include the missing
expanding_mean imputation strategy: in the section describing ImputationConfig
and the deliverables table listing strategies, add "expanding_mean" to the list
of strategies alongside zero, ffill, bfill, mean, drop, and update any
descriptive text to briefly note that expanding_mean computes a cumulative mean
per series (expanding window) as an imputation option; ensure the symbol
ImputationConfig and any mentions of FeatureSetConfig/ComputeFeaturesRequest
remain consistent after adding the new strategy.

In `@PRPs/PRP-4-feature-engineering.md`:
- Around line 773-775: The current pseudocode uses a falsy check when assigning
min_per inside the loop over config.windows (for window in config.windows) which
will override an explicit 0 in config.min_periods; change the condition to
explicitly test for None so min_per preserves 0 (i.e., set min_per to
config.min_periods if config.min_periods is not None, otherwise to window),
updating the assignment that defines min_per.
- Around line 1123-1144: The preview_features() example is calling the compute
endpoint and using lookback_days which doesn't match the Preview request schema;
change the POST URL from "/featuresets/compute" to "/featuresets/preview" and
update the JSON payload to conform to the PreviewRequest schema (remove or
replace "lookback_days" and any compute-specific fields with the
preview-specific fields expected by PreviewRequest), keeping the same
identifying caller function preview_features() so the example exercises the
preview endpoint and schema correctly.

Comment on lines +257 to +296
if config.include_day_of_week:
dow = dates.dt.dayofweek # 0=Monday, 6=Sunday
if config.use_cyclical_encoding:
result["dow_sin"] = np.sin(2 * np.pi * dow / 7)
result["dow_cos"] = np.cos(2 * np.pi * dow / 7)
columns.extend(["dow_sin", "dow_cos"])
else:
result["day_of_week"] = dow
columns.append("day_of_week")

if config.include_month:
month = dates.dt.month
if config.use_cyclical_encoding:
result["month_sin"] = np.sin(2 * np.pi * month / 12)
result["month_cos"] = np.cos(2 * np.pi * month / 12)
columns.extend(["month_sin", "month_cos"])
else:
result["month"] = month
columns.append("month")

if config.include_quarter:
result["quarter"] = dates.dt.quarter
columns.append("quarter")

if config.include_year:
result["year"] = dates.dt.year
columns.append("year")

if config.include_is_weekend:
result["is_weekend"] = dates.dt.dayofweek.isin([5, 6]).astype(int)
columns.append("is_weekend")

if config.include_is_month_end:
result["is_month_end"] = dates.dt.is_month_end.astype(int)
columns.append("is_month_end")

# is_holiday would require calendar table lookup
# Handled separately if data is joined from Calendar table

return result, columns
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Honor include_is_holiday in feature outputs.
include_is_holiday is set in config and the calendar data can be merged, but _compute_calendar_features never adds the is_holiday column to feature_columns, so it’s omitted from responses and stats.

🔧 Suggested fix
         if config.include_is_month_end:
             result["is_month_end"] = dates.dt.is_month_end.astype(int)
             columns.append("is_month_end")
 
-        # is_holiday would require calendar table lookup
-        # Handled separately if data is joined from Calendar table
+        if config.include_is_holiday and "is_holiday" in result.columns:
+            result["is_holiday"] = result["is_holiday"].astype(int)
+            columns.append("is_holiday")
🤖 Prompt for AI Agents
In `@app/features/featuresets/service.py` around lines 257 - 296,
_compute_calendar_features currently ignores config.include_is_holiday so the
holiday flag never gets added to outputs; modify the method to check
config.include_is_holiday and when true set result["is_holiday"] =
dates["is_holiday"].astype(int) (or dates.dt.is_holiday if using a datetime
accessor) and append "is_holiday" to columns so the feature is returned and
included in stats; ensure you follow the same casting/typing pattern used for
is_weekend/is_month_end in the function.

Comment on lines +360 to +404
def _compute_exogenous_features(self, df: pd.DataFrame) -> tuple[pd.DataFrame, list[str]]:
"""Compute exogenous features (price, promo, inventory).

CRITICAL: All exogenous features are lagged to prevent leakage.

Args:
df: Input dataframe with exogenous columns.

Returns:
Tuple of (dataframe with exogenous features, list of new column names).
"""
config = self.config.exogenous_config
if config is None:
raise RuntimeError("_compute_exogenous_features called without exogenous_config")

result = df.copy()
columns: list[str] = []

# Price features (if price column exists)
if config.include_price and "unit_price" in df.columns:
for lag in config.price_lags:
col_name = f"price_lag_{lag}"
result[col_name] = df.groupby(self.entity_cols, observed=True)["unit_price"].shift(
lag
)
columns.append(col_name)

if config.include_price_change:
# CRITICAL: shift(1) before pct_change to prevent using current price
# This computes: (price[t-1] - price[t-8]) / price[t-8]
# Without shift(1), it would use current price at t, causing leakage
result["price_pct_change_7d"] = df.groupby(self.entity_cols, observed=True)[
"unit_price"
].transform(lambda x: x.shift(1).pct_change(periods=7))
columns.append("price_pct_change_7d")

# Stockout flag (if inventory column exists)
if config.include_stockout_flag and "is_stockout" in df.columns:
# Lagged stockout flag (yesterday's stockout)
result["stockout_lag_1"] = df.groupby(self.entity_cols, observed=True)[
"is_stockout"
].shift(1)
columns.append("stockout_lag_1")

return result, columns
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Search for promo/inventory feature references and ExogenousConfig definition
rg -n "include_promo|include_inventory|promo|inventory" app/features/featuresets -g '*.py'

Repository: w7-mgfcode/ForecastLabAI

Length of output: 911


🏁 Script executed:

#!/bin/bash
# Find and examine ExogenousConfig definition
fd -e py -x grep -l "class ExogenousConfig" {} \;

Repository: w7-mgfcode/ForecastLabAI

Length of output: 105


🏁 Script executed:

#!/bin/bash
# Get the full path and examine ExogenousConfig to see what fields it defines
rg -n "class ExogenousConfig" -A 50 app/features/featuresets -g '*.py'

Repository: w7-mgfcode/ForecastLabAI

Length of output: 4094


🏁 Script executed:

#!/bin/bash
# Search for any handling of include_promo or include_inventory beyond what we've seen
rg -n "include_promo|include_inventory" app/features/featuresets -g '*.py' -B 2 -A 2

Repository: w7-mgfcode/ForecastLabAI

Length of output: 990


🏁 Script executed:

#!/bin/bash
# Check if there are any other methods in the service that might compute promo/inventory
rg -n "def.*promo|def.*inventory" app/features/featuresets -g '*.py'

Repository: w7-mgfcode/ForecastLabAI

Length of output: 50


🏁 Script executed:

#!/bin/bash
# Check if _compute_exogenous_features is called/tested and how these flags are used
rg -n "_compute_exogenous_features" app/features/featuresets -g '*.py' -B 2 -A 2

Repository: w7-mgfcode/ForecastLabAI

Length of output: 1249


Implement promo and inventory feature computation or add validation to prevent silent misconfiguration.

ExogenousConfig exposes include_promo (defaults to True) and include_inventory flags, but _compute_exogenous_features() only implements price and stockout features. These flags are silently ignored with no error, creating a functional gap where users expecting promo/inventory features receive none. Either implement the missing feature computation or add explicit validation to raise an error if these flags are enabled.

🤖 Prompt for AI Agents
In `@app/features/featuresets/service.py` around lines 360 - 404, The
_compute_exogenous_features method currently ignores
ExogenousConfig.include_promo and include_inventory; either implement those
feature computations or fail fast. Update _compute_exogenous_features to (A)
implement promo features when config.include_promo is true by reading the promo
column (e.g., "is_promo" or configured name), creating lagged promo columns
analogous to price (use groupby(self.entity_cols).shift(lag) for each
config.promo_lags and any pct/change or flag derivations), and (B) implement
inventory features when config.include_inventory is true by using the inventory
column (e.g., "on_hand" or configured name) to produce lagged inventory columns
and derived indicators (like inventory_pct_change or inventory_level_lag_N)
similar to how price/stockout are done; if you prefer validation instead, add an
explicit check at the start of _compute_exogenous_features that raises a
RuntimeError when config.include_promo or config.include_inventory is true but
no corresponding implementation/columns exist, referencing the ExogenousConfig
flags and the _compute_exogenous_features method so reviewers can locate
changes.

Comment on lines +19 to +33
### 1. Feature Configuration Schemas

**File**: `app/features/featuresets/schemas.py`

Pydantic v2 schemas with frozen configs for reproducibility:

| Schema | Purpose |
|--------|---------|
| `LagConfig` | Lag feature configuration (lags, target_column, fill_value) |
| `RollingConfig` | Rolling window configuration (windows, aggregations, min_periods) |
| `CalendarConfig` | Calendar feature configuration (day_of_week, month, cyclical encoding) |
| `ExogenousConfig` | Exogenous feature configuration (price_lags, stockout flags) |
| `ImputationConfig` | Imputation strategies (zero, ffill, bfill, mean, drop) |
| `FeatureSetConfig` | Complete feature configuration with config_hash() |
| `ComputeFeaturesRequest` | API request schema with validation |
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Document the new expanding_mean strategy.
The service supports expanding_mean, but the doc lists only five strategies in both the deliverables table and the Imputation section.

📝 Suggested edits
-| `ImputationConfig` | Imputation strategies (zero, ffill, bfill, mean, drop) |
+| `ImputationConfig` | Imputation strategies (zero, ffill, bfill, mean, expanding_mean, drop) |

-**Strategies**: `zero`, `ffill`, `bfill`, `mean`, `drop`
+**Strategies**: `zero`, `ffill`, `bfill`, `mean`, `expanding_mean`, `drop`

Also applies to: 199-211

🤖 Prompt for AI Agents
In `@docs/PHASE/3-FEATURE_ENGINEERING.md` around lines 19 - 33, Update the
documentation to include the missing expanding_mean imputation strategy: in the
section describing ImputationConfig and the deliverables table listing
strategies, add "expanding_mean" to the list of strategies alongside zero,
ffill, bfill, mean, drop, and update any descriptive text to briefly note that
expanding_mean computes a cumulative mean per series (expanding window) as an
imputation option; ensure the symbol ImputationConfig and any mentions of
FeatureSetConfig/ComputeFeaturesRequest remain consistent after adding the new
strategy.

Comment on lines +773 to +775
for window in config.windows:
min_per = config.min_periods if config.min_periods else window

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Preserve explicit min_periods=0 in pseudocode.
The PRP uses a falsy check, which would override an explicit 0. Align this snippet with the implemented is None check.

📝 Suggested edit
-            min_per = config.min_periods if config.min_periods else window
+            min_per = window if config.min_periods is None else config.min_periods
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
for window in config.windows:
min_per = config.min_periods if config.min_periods else window
for window in config.windows:
min_per = window if config.min_periods is None else config.min_periods
🤖 Prompt for AI Agents
In `@PRPs/PRP-4-feature-engineering.md` around lines 773 - 775, The current
pseudocode uses a falsy check when assigning min_per inside the loop over
config.windows (for window in config.windows) which will override an explicit 0
in config.min_periods; change the condition to explicitly test for None so
min_per preserves 0 (i.e., set min_per to config.min_periods if
config.min_periods is not None, otherwise to window), updating the assignment
that defines min_per.

Comment on lines +1123 to +1144
async def preview_features() -> None:
"""Preview features for store 1, product 1."""
async with httpx.AsyncClient() as client:
response = await client.post(
"http://localhost:8123/featuresets/compute",
json={
"store_id": 1,
"product_id": 1,
"cutoff_date": "2024-01-31",
"lookback_days": 60,
"config": {
"schema_version": "1.0",
"name": "preview_test",
"lag_config": {
"schema_version": "1.0",
"lags": [1, 7, 14],
},
"rolling_config": {
"schema_version": "1.0",
"windows": [7, 14],
"aggregations": ["mean", "std"],
},
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Preview example should call the preview endpoint and schema.
The preview_features.py example posts to /featuresets/compute and uses lookback_days, which doesn’t match the Preview request schema.

📝 Suggested edit
-        response = await client.post(
-            "http://localhost:8123/featuresets/compute",
+        response = await client.post(
+            "http://localhost:8123/featuresets/preview",
             json={
                 "store_id": 1,
                 "product_id": 1,
                 "cutoff_date": "2024-01-31",
-                "lookback_days": 60,
+                "sample_rows": 10,
                 "config": {
🤖 Prompt for AI Agents
In `@PRPs/PRP-4-feature-engineering.md` around lines 1123 - 1144, The
preview_features() example is calling the compute endpoint and using
lookback_days which doesn't match the Preview request schema; change the POST
URL from "/featuresets/compute" to "/featuresets/preview" and update the JSON
payload to conform to the PreviewRequest schema (remove or replace
"lookback_days" and any compute-specific fields with the preview-specific fields
expected by PreviewRequest), keeping the same identifying caller function
preview_features() so the example exercises the preview endpoint and schema
correctly.

w7-mgfcode added a commit that referenced this pull request Feb 1, 2026
* feat(featuresets): implement time-safe feature engineering layer (#24)

* docs: update INITIAL-4 and INITIAL-5 with additional references

- Add scikit-learn, mlforecast, and sktime documentation links
- Add considerations for imputation logic, agent tooling, and computation overhead
- Add model persistence documentation references

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* feat(featuresets): implement time-safe feature engineering layer

Add complete feature engineering module with:
- Pydantic schemas for feature configuration (lag, rolling, calendar, exogenous, imputation)
- FeatureEngineeringService with CRITICAL leakage prevention:
  - Lag features use positive shift() only
  - Rolling features use shift(1) BEFORE rolling to exclude current observation
  - Group-aware operations prevent cross-series leakage
  - Cutoff date filtering before any computation
- FastAPI endpoints: POST /featuresets/compute and /featuresets/preview
- Comprehensive test suite (55 tests) including leakage prevention tests
- Example demo script

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* docs: update INITIAL-5.md

* docs: update documentation for Phase 3 Feature Engineering

- README.md: Add featuresets module to project structure and API endpoints
- docs/ARCHITECTURE.md: Add Feature Engineering section (section 6)
- docs/PHASE-index.md: Mark Phase 3 as completed with summary
- docs/PHASE/3-FEATURE_ENGINEERING.md: Create detailed phase documentation

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* fix(featuresets): address code review feedback and prevent data leakage

Routes:
- Validate store_id/product_id presence (no silent defaults to 0)
- Convert ValueError for unsupported date types to HTTP 400

Service:
- Add expanding_mean imputation strategy (time-safe alternative)
- Add warnings when bfill/mean strategies are used (leakage risk)
- Fix price_pct_change_7d to use shift(1) before pct_change

Schemas:
- Add expanding_mean to ImputationConfig Literal type
- Document time-safety of each imputation strategy
- Fix PreviewFeaturesRequest docstring: GET → POST

Documentation:
- Convert bare URLs to markdown links in INITIAL-4.md, INITIAL-5.md
- Fix PRP-4 to show POST for preview endpoint

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* style: format schemas.py

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

---------

Co-authored-by: Gabe@w7dev <gabor@w7-7.net>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>

* fix: address code review feedback

Service:
- Fix min_periods falsy check to explicit None check (preserves 0)

Tests:
- Add expanding_mean to test_valid_strategies, expect 6 strategies

Documentation:
- Update PR reference from #24 to #25
- Fix all GET /featuresets/preview to POST in PRP-4

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* docs: update DAILY-FLOW.md for Phase 4 Forecasting (#27)

* docs: update DAILY-FLOW.md for Phase 4 Forecasting

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* docs: add PRP-5 for Forecasting module

Comprehensive PRP including:
- Model zoo (naive, seasonal naive, moving average)
- Unified BaseForecaster interface (fit/predict/serialize)
- ModelBundle persistence with joblib
- 15 ordered implementation tasks
- 40+ test cases specified
- Integration with FeatureEngineeringService

Confidence: 8/10

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

---------

Co-authored-by: Gabe@w7dev <gabor@w7-7.net>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>

* feat(forecasting): implement baseline model zoo and unified interface (#28)

* feat(forecasting): implement baseline model zoo and unified interface

Add forecasting module (PRP-5) with:
- BaseForecaster ABC with scikit-learn-style interface (fit/predict)
- NaiveForecaster, SeasonalNaiveForecaster, MovingAverageForecaster
- ModelBundle persistence with joblib serialization
- POST /forecasting/train and /forecasting/predict endpoints
- ForecastingService for orchestration
- 81 unit tests covering schemas, models, persistence, and service
- Example scripts demonstrating each baseline model
- LightGBM placeholder (feature-flagged, not yet implemented)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* docs: Update documentation for forecasting module (PRP-5)

- Add forecasting API endpoints to README.md with examples
- Update ARCHITECTURE.md with forecasting implementation details
- Add scikit-learn and joblib to dependencies list
- Add forecasting config variables to .env.example
- Mark forecasting module as IMPLEMENTED in architecture docs

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* fix: address CI lint and type check failures

- Add type: ignore for intentional type mismatch in frozen config test
- Add S101 ignore for examples/ to allow assert statements

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

---------

Co-authored-by: Gabe@w7dev <gabor@w7-7.net>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>

* fix(forecasting): add security validations and fix documentation

Security improvements:
- Add constructor validation for season_length >= 1 in SeasonalNaiveForecaster
- Add constructor validation for window_size >= 1 in MovingAverageForecaster
- Add path traversal prevention in ForecastingService.predict()
- Validate .joblib extension and artifacts directory containment
- Log rejection reasons for security auditing

Test improvements:
- Fix get_settings patching to wrap ForecastingService construction
- Add tests for constructor validation
- Add tests for path traversal and extension validation

Documentation fixes:
- Fix config parameter names in ARCHITECTURE.md (season_length, window_size)
- Fix README example to use season_length instead of seasonal_period
- Fix markdown issues in PRP-5 (code fences, ATX headings)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

---------

Co-authored-by: Gabe@w7dev <gabor@w7-7.net>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants