Skip to content

feat(featuresets): implement time-safe feature engineering layer#23

Closed
w7-mgfcode wants to merge 3 commits into
mainfrom
feat/prp-4-feature-engineering
Closed

feat(featuresets): implement time-safe feature engineering layer#23
w7-mgfcode wants to merge 3 commits into
mainfrom
feat/prp-4-feature-engineering

Conversation

@w7-mgfcode
Copy link
Copy Markdown
Owner

@w7-mgfcode w7-mgfcode commented Jan 31, 2026

Summary

  • Implement complete feature engineering module (PRP-4) with time-safe feature computation
  • Add FeatureEngineeringService with CRITICAL leakage prevention patterns
  • Create FastAPI endpoints for feature computation and preview
  • Comprehensive test suite with 55 tests including leakage prevention tests

Features

  • Lag features: Positive shift() only to prevent future data access
  • Rolling features: shift(1) BEFORE rolling to exclude current observation
  • Calendar features: Cyclical encoding (sin/cos) for day of week, month
  • Group isolation: Entity-aware groupby prevents cross-series leakage
  • Imputation strategies: Zero-fill for sales, forward-fill for prices
  • Cutoff enforcement: Data filtered before any computation

Files Added

  • app/features/featuresets/ - Complete module (schemas, service, routes)
  • app/features/featuresets/tests/ - 55 unit tests
  • examples/compute_features_demo.py - Demo script
  • PRPs/PRP-4-feature-engineering.md - PRP document

Test plan

  • All 55 unit tests passing
  • Ruff linting: clean
  • MyPy: no issues
  • Pyright: 0 errors
  • Manual API testing with seeded data

🤖 Generated with Claude Code

Summary by CodeRabbit

Release Notes

  • New Features

    • Added feature engineering pipeline with time-safe computation and leakage prevention.
    • Introduced /featuresets/compute and /featuresets/preview API endpoints for configurable feature generation.
    • Support for lag, rolling window, calendar, and exogenous features with flexible imputation strategies.
  • Documentation

    • Added example demo script demonstrating feature computation and preview workflows.
  • Dependencies

    • Added pandas and numpy as runtime dependencies.

✏️ Tip: You can customize this high-level summary in your review settings.

w7-learn and others added 3 commits January 31, 2026 21:57
- Add scikit-learn, mlforecast, and sktime documentation links
- Add considerations for imputation logic, agent tooling, and computation overhead
- Add model persistence documentation references

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add complete feature engineering module with:
- Pydantic schemas for feature configuration (lag, rolling, calendar, exogenous, imputation)
- FeatureEngineeringService with CRITICAL leakage prevention:
  - Lag features use positive shift() only
  - Rolling features use shift(1) BEFORE rolling to exclude current observation
  - Group-aware operations prevent cross-series leakage
  - Cutoff date filtering before any computation
- FastAPI endpoints: POST /featuresets/compute and /featuresets/preview
- Comprehensive test suite (55 tests) including leakage prevention tests
- Example demo script

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Copy link
Copy Markdown

@sourcery-ai sourcery-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry @w7-mgfcode, you have reached your weekly rate limit of 500000 diff characters.

Please try again later or upgrade to continue using Sourcery

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Jan 31, 2026

📝 Walkthrough

Walkthrough

This PR introduces a production-ready, time-safe feature engineering vertical slice for ForecastLabAI. It adds configuration schemas for lag, rolling, calendar, exogenous, and imputation features; a FeatureEngineeringService implementing leakage-free feature computation; FastAPI endpoints for computing and previewing features; comprehensive tests; and example scripts. Dependencies on pandas and numpy are added.

Changes

Cohort / File(s) Summary
Documentation & Configuration
INITIAL-4.md, INITIAL-5.md, app/core/config.py
Added feature engineering documentation references and three new Settings fields (feature_max_lookback_days, feature_max_lag, feature_max_window) to enforce size constraints.
Feature Engineering Schemas
app/features/featuresets/schemas.py, app/features/featuresets/__init__.py
Introduced Pydantic-based configuration schemas (FeatureSetConfig, LagConfig, RollingConfig, CalendarConfig, ExogenousConfig, ImputationConfig) with validation, immutability, and config-hash support; added API request/response schemas (ComputeFeaturesRequest/Response, PreviewFeaturesRequest) with FeatureRow data model.
Feature Engineering Service & Data Loading
app/features/featuresets/service.py
Implemented FeatureEngineeringService with time-safe feature computation (cutoff handling, lag, rolling, calendar, exogenous, imputation); added FeatureDataLoader for database retrieval; introduced compute_features_for_series orchestration function.
API Routes & Integration
app/features/featuresets/routes.py, app/main.py
Added FastAPI router with POST /featuresets/compute and GET /featuresets/preview endpoints; integrated router into main app; both endpoints handle data validation, NaN normalization, error mapping, and structured logging.
Tests & Fixtures
app/features/featuresets/tests/*
Added conftest.py with time-series and config fixtures; comprehensive test suites for schemas (validation, defaults, immutability), service (lag, rolling, calendar, imputation, cutoff enforcement), leakage detection (cross-series isolation, future-data prevention), and route integration.
Examples & Demo Scripts
examples/compute_features_demo.py
Added demonstration script showing feature engineering API usage via httpx, including sample config creation, preview and full computation invocations, error handling, and result parsing.
Dependencies
pyproject.toml
Added pandas>=3.0.0 and numpy>=2.4.1 as runtime dependencies; introduced pandas-stubs>=2.3.3.260113 in dev dependency group.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant Route as FastAPI Route
    participant DataLoader as FeatureDataLoader
    participant DB as Database
    participant Service as FeatureEngineeringService
    
    Client->>Route: POST /featuresets/compute (store_id, product_id, cutoff_date, config)
    Route->>Route: Validate ComputeFeaturesRequest
    Route->>DataLoader: load_sales_data(store_id, product_id, date_range)
    DataLoader->>DB: Query SalesDaily records
    DB-->>DataLoader: Return sales data
    Route->>DataLoader: load_calendar_data(date_range)
    DataLoader->>DB: Query Calendar records
    DB-->>DataLoader: Return calendar data
    Route->>Service: compute_features(merged_df, cutoff_date)
    Service->>Service: Sort & filter by cutoff_date
    Service->>Service: Apply imputation (group-aware)
    Service->>Service: Compute lag features (per-entity)
    Service->>Service: Compute rolling features (shifted)
    Service->>Service: Compute calendar features (with cyclical encoding)
    Service->>Service: Compute exogenous features
    Service->>Service: Collect stats (nulls, row counts)
    Service-->>Route: Return FeatureComputationResult
    Route->>Route: Convert DataFrame to FeatureRow list
    Route->>Route: Handle NaN/None normalization
    Route-->>Client: ComputeFeaturesResponse (rows, columns, hash, stats)
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly related PRs

Suggested reviewers

  • w7-learn

Poem

🐰 Features flow like clover in spring,
Lags and rolling means in a line,
Cutoff dates keep the leakage away,
Time-safe computing—a feature buffet! 🌾

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and specifically describes the main change: implementing a time-safe feature engineering layer, which is the primary objective of this PR.
Docstring Coverage ✅ Passed Docstring coverage is 98.88% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch feat/prp-4-feature-engineering

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@socket-security
Copy link
Copy Markdown

Review the following changes in direct dependencies. Learn more about Socket for GitHub.

Diff Package Supply Chain
Security
Vulnerability Quality Maintenance License
Addednumpy@​2.4.17510010010070
Addedpandas@​3.0.07610010010080

View full report

@w7-mgfcode w7-mgfcode closed this Jan 31, 2026
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 7

🤖 Fix all issues with AI agents
In `@app/features/featuresets/routes.py`:
- Around line 135-136: The code silently defaults missing store_id/product_id to
0 by using record.get("store_id", 0) / record.get("product_id", 0); update the
parsing logic that constructs the store_id/product_id (where record is used) to
explicitly validate presence and value (e.g., require the keys exist and that
int(record["store_id"]) >= 1 and int(record["product_id"]) >= 1), and on failure
raise a clear exception (ValueError) or log an error including the offending
record content instead of silently using 0 so corrupted rows are surfaced
immediately.
- Around line 129-130: The raise ValueError for unsupported date types (the
branch raising "Cannot extract date from {type(date_val)}") can surface as a
500; update the handler around that logic in app/features/featuresets/routes.py
so invalid date types are caught and translated into a client-facing error
(e.g., raise a ValidationError or an HTTP 400/BadRequest exception) that
includes the offending type and a clear message; locate the block that inspects
date_val, replace or wrap the ValueError with a caught/converted error response
(or re-raise a framework ValidationError/HTTPException) so callers receive a 4xx
with details instead of an unhandled 500.

In `@app/features/featuresets/service.py`:
- Around line 324-329: The bfill and mean strategies (in the branch where
strategy == "bfill" and strategy == "mean" that operate on result grouped by
self.entity_cols) leak future information; update the function/class docstring
to note this limitation, add an "expanding_mean" (or "expanding-mean") option
that computes groupby(...)[col].transform(lambda x:
x.expanding().mean().fillna(...)) so each row only uses past values, and in
configuration/validation logic emit a warning or error when users select "bfill"
or plain "mean" for time-series use to guide them toward "expanding_mean".
Ensure references to the symbols result, self.entity_cols, strategy, bfill,
mean, and expanding_mean are used so reviewers can locate the changes.
- Around line 362-367: The price_pct_change_7d calculation leaks by using the
current row's unit_price; change the pipeline to compute percent change on a
1-day-shifted series so only historical prices are used (i.e., take
df.groupby(self.entity_cols, observed=True)["unit_price"].shift(1) and then
.pct_change(periods=7) so the feature at time T uses prices up through T-1).
Keep the result key price_pct_change_7d and
columns.append("price_pct_change_7d") as before.

In `@INITIAL-4.md`:
- Around line 21-23: Replace the three bare URLs
(https://scikit-learn.org/stable/modules/compose.html,
https://www.nixtla.io/blog/automated-time-series-feature-engineering-with-mlforecast?utm_source=chatgpt.com#introduction-to-mlforecast,
https://www.sktime.net/en/stable/api_reference/transformations.html) with
Markdown links using the [text](url) syntax (e.g., [scikit-learn compose
module](https://scikit-learn.org/stable/modules/compose.html)) so they are not
bare URLs and comply with markdownlint MD034; update the visible link text to a
concise descriptive label for each URL.

In `@INITIAL-5.md`:
- Around line 31-33: Replace the bare URLs with markdown links by converting
each plain URL into the [text](url) form; specifically update the three
occurrences (https://scikit-learn.org/stable/modules/compose.html,
https://scikit-learn.org/stable/glossary.html,
https://scikit-learn.org/stable/model_persistence.html) to descriptive link text
(e.g., [Scikit-learn: ColumnTransformer
documentation](https://scikit-learn.org/stable/modules/compose.html)) so the
file no longer triggers MD034; edit the lines containing those URLs in
INITIAL-5.md (look for the exact URL strings) and replace them with appropriate
markdown link syntax.

In `@PRPs/PRP-4-feature-engineering.md`:
- Around line 8-12: The PRP lists the preview endpoint as GET but the
implemented route uses router.post("/preview"); update the documentation entry
in PRP-4-feature-engineering.md to read "POST /featuresets/preview" instead of
"GET /featuresets/preview" so it matches the implemented route
(router.post("/preview")) and the API behavior.
🧹 Nitpick comments (13)
pyproject.toml (2)

163-167: Avoid dev-dependency drift between optional-dependencies and dependency-groups.
pandas-stubs only appears in dependency-groups.dev, so pip install -e .[dev] won’t include it. Consider adding it to the dev extra (or document that dependency-groups is the required path).

♻️ Suggested consolidation
 [project.optional-dependencies]
 dev = [
     "pytest>=8.3.0",
     "pytest-asyncio>=0.24.0",
     "pytest-cov>=6.0.0",
     "httpx>=0.28.0",
     "ruff>=0.8.0",
     "mypy>=1.13.0",
     "pyright>=1.1.390",
+    "pandas-stubs>=2.3.3.260113",
 ]

19-20: Consider upper-bounding pandas and numpy versions, and plan a staged upgrade for pandas 3.0.0.

The specified versions are available: pandas 3.0.0 (released Jan 21, 2026) and numpy 2.4.1 (released Jan 10, 2026) both support Python 3.12. However, pandas 3.0.0 release notes recommend upgrading to pandas 2.3.x first to clear deprecation warnings before migrating to 3.0.0. Adding upper bounds prevents unexpected major version upgrades and is a good practice.

♻️ Proposed guardrail version range
-    "pandas>=3.0.0",
-    "numpy>=2.4.1",
+    "pandas>=3.0.0,<4.0.0",
+    "numpy>=2.4.1,<3.0.0",
app/core/config.py (1)

38-41: Add Pydantic Field bounds for feature caps.
These caps act as guardrails; adding ge/le prevents invalid env overrides and documents intent.

♻️ Suggested change
-from pydantic_settings import BaseSettings, SettingsConfigDict
+from pydantic import Field
+from pydantic_settings import BaseSettings, SettingsConfigDict
@@
-    feature_max_lookback_days: int = 1095  # 3 years
-    feature_max_lag: int = 365
-    feature_max_window: int = 90
+    feature_max_lookback_days: int = Field(1095, ge=30, le=1825, description="Max lookback days")
+    feature_max_lag: int = Field(365, ge=1, le=730, description="Max lag value")
+    feature_max_window: int = Field(90, ge=1, le=365, description="Max rolling window")
app/features/featuresets/tests/conftest.py (1)

140-147: Consider adding schema_version to the nested LagConfig for consistency.
This keeps fixtures aligned if schema_version becomes required across configs.

♻️ Suggested tweak
-        lag_config=LagConfig(lags=(1,)),
+        lag_config=LagConfig(schema_version="1.0", lags=(1,)),
app/features/featuresets/tests/test_service.py (1)

234-248: Consider strengthening the ffill assertion.

The assertion non_null_count >= len(result.df) - 1 is permissive and may not catch subtle bugs. If the first row has a valid value, forward fill should result in zero NaNs. Consider asserting the exact expected behavior based on your test fixture's data.

app/features/featuresets/schemas.py (2)

177-183: Inconsistent empty tuple handling for price_lags.

Unlike LagConfig.validate_lags_positive and RollingConfig.validate_windows_positive, this validator doesn't reject an empty tuple. Consider adding a check for consistency, or document if empty price_lags is intentionally allowed (to disable price lagging while keeping other exogenous features).

♻️ Suggested fix for consistency
 `@field_validator`("price_lags")
 `@classmethod`
 def validate_price_lags_positive(cls, v: tuple[int, ...]) -> tuple[int, ...]:
     """Ensure all price lags are positive."""
+    if not v:
+        raise ValueError("At least one price lag must be specified")
     if any(lag <= 0 for lag in v):
         raise ValueError("All price lags must be positive integers")
     return v

186-207: Mutable dict in frozen model undermines immutability guarantee.

ImputationConfig.strategies is a dict, which is mutable. Although Pydantic v2's frozen=True prevents reassignment of the field itself (e.g., config.strategies = {...}), it does not prevent in-place mutation of the dict contents (e.g., config.strategies["new_col"] = "zero"). This is known as "faux immutability" and can undermine the intent of a frozen model.

Consider using an immutable mapping type (e.g., types.MappingProxyType or a frozendict library) or document this limitation.

app/features/featuresets/routes.py (1)

105-139: Extract shared record-to-FeatureRow conversion logic.

The record conversion logic (handling NaN/None in features and date extraction) is nearly identical between compute_features and preview_features. Consider extracting a helper function to reduce duplication and ensure consistent behavior.

♻️ Suggested helper extraction
def _convert_records_to_rows(
    records: list[dict],
    feature_columns: list[str],
    date_column: str,
) -> list[FeatureRow]:
    """Convert DataFrame records to FeatureRow objects."""
    rows: list[FeatureRow] = []
    for record in records:
        features: dict[str, float | int | None] = {}
        for col in feature_columns:
            val = record.get(col)
            if val is None or (isinstance(val, float) and math.isnan(val)):
                features[col] = None
            elif isinstance(val, (int, float)):
                features[col] = float(val) if isinstance(val, float) else int(val)
            else:
                features[col] = None

        date_val = record.get(date_column)
        if isinstance(date_val, pd.Timestamp):
            row_date = date_val.date()
        elif isinstance(date_val, date_type):
            row_date = date_val
        elif date_val is not None and hasattr(date_val, "date"):
            row_date = date_val.date()
        else:
            raise ValueError(f"Cannot extract date from {type(date_val)}")

        rows.append(
            FeatureRow(
                date=row_date,
                store_id=int(record.get("store_id", 0)),
                product_id=int(record.get("product_id", 0)),
                features=features,
            )
        )
    return rows

Also applies to: 243-276

examples/compute_features_demo.py (1)

22-24: Consider making API_BASE configurable.

Hardcoding localhost:8123 limits flexibility. Consider using an environment variable with a default for easier testing against different environments.

♻️ Suggested improvement
+import os
+
 # API configuration
-API_BASE = "http://localhost:8123"
+API_BASE = os.environ.get("FORECASTLAB_API_BASE", "http://localhost:8123")
 FEATURES_ENDPOINT = f"{API_BASE}/featuresets"
app/features/featuresets/service.py (4)

26-27: Unused TYPE_CHECKING block.

The TYPE_CHECKING block contains only pass and no conditional imports. Consider removing it or adding the intended type-only imports.

🔧 Proposed fix
-if TYPE_CHECKING:
-    pass
-

183-191: Inconsistent use of df vs result in lag computation.

Line 186-188 uses df (the original input) for the groupby/shift operation, but then assigns to result. This works because df and result have the same data at this point (result = df.copy()), but it's inconsistent with other methods like _compute_rolling_features which also use df. While functionally correct here, using result consistently would be clearer for maintainability.


267-275: Cyclical encoding for month may not correctly represent adjacency of December and January.

The current formula np.sin(2 * np.pi * month / 12) maps month values 1-12. However, this means:

  • January (1) → sin(2π/12) ≈ 0.5
  • December (12) → sin(2π) = 0

For proper cyclical encoding where December and January should be adjacent, consider using (month - 1) / 12 to map months to [0, 11/12]:

🔧 Proposed fix
         if config.include_month:
             month = dates.dt.month
             if config.use_cyclical_encoding:
-                result["month_sin"] = np.sin(2 * np.pi * month / 12)
-                result["month_cos"] = np.cos(2 * np.pi * month / 12)
+                result["month_sin"] = np.sin(2 * np.pi * (month - 1) / 12)
+                result["month_cos"] = np.cos(2 * np.pi * (month - 1) / 12)
                 columns.extend(["month_sin", "month_cos"])

386-444: Consider adding explicit error handling for database operations.

The load_sales_data method executes a database query but doesn't handle potential exceptions (e.g., connection failures, query timeouts). While the caller (compute_features_for_series) may handle this at a higher level, and the routes.py shows DatabaseError handling, documenting or handling database errors here could improve debuggability.

Comment on lines +129 to +130
else:
raise ValueError(f"Cannot extract date from {type(date_val)}")
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Unhandled ValueError may leak as 500 error.

The ValueError raised here for invalid date types will propagate as an unhandled exception, resulting in a generic 500 error. Consider catching this and converting to a more informative error response (e.g., ValidationError or a custom error with details).

🛡️ Suggested error handling
+from app.core.exceptions import ValidationError
 ...
             else:
-                raise ValueError(f"Cannot extract date from {type(date_val)}")
+                raise ValidationError(
+                    message=f"Cannot extract date from column '{request.config.date_column}'",
+                    details={"date_type": str(type(date_val))},
+                )
🤖 Prompt for AI Agents
In `@app/features/featuresets/routes.py` around lines 129 - 130, The raise
ValueError for unsupported date types (the branch raising "Cannot extract date
from {type(date_val)}") can surface as a 500; update the handler around that
logic in app/features/featuresets/routes.py so invalid date types are caught and
translated into a client-facing error (e.g., raise a ValidationError or an HTTP
400/BadRequest exception) that includes the offending type and a clear message;
locate the block that inspects date_val, replace or wrap the ValueError with a
caught/converted error response (or re-raise a framework
ValidationError/HTTPException) so callers receive a 4xx with details instead of
an unhandled 500.

Comment on lines +135 to +136
store_id=int(record.get("store_id", 0)),
product_id=int(record.get("product_id", 0)),
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Silent default of 0 for missing store_id/product_id.

Using record.get("store_id", 0) and record.get("product_id", 0) silently defaults to 0 if these columns are missing. Since the API validates that store_id >= 1 and product_id >= 1, a 0 value would indicate corrupted data. Consider raising an error or logging a warning instead of silently defaulting.

🤖 Prompt for AI Agents
In `@app/features/featuresets/routes.py` around lines 135 - 136, The code silently
defaults missing store_id/product_id to 0 by using record.get("store_id", 0) /
record.get("product_id", 0); update the parsing logic that constructs the
store_id/product_id (where record is used) to explicitly validate presence and
value (e.g., require the keys exist and that int(record["store_id"]) >= 1 and
int(record["product_id"]) >= 1), and on failure raise a clear exception
(ValueError) or log an error including the offending record content instead of
silently using 0 so corrupted rows are surfaced immediately.

Comment on lines +324 to +329
elif strategy == "bfill":
result[col] = result.groupby(self.entity_cols, observed=True)[col].bfill()
elif strategy == "mean":
result[col] = result.groupby(self.entity_cols, observed=True)[col].transform(
lambda x: x.fillna(x.mean())
)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Potential time leakage in bfill and mean imputation strategies.

While the data is filtered to cutoff_date, these strategies can still cause subtle time leakage within the training window:

  • bfill: Fills missing values with future values from later rows in the same series
  • mean: Uses the mean of all values in the group, including those after the current row

For time-series forecasting, this means a row at time t could be imputed using information from time t+1, t+2, etc., which wouldn't be available at prediction time.

Consider:

  1. Documenting this limitation in the docstring
  2. Adding an "expanding mean" strategy that only uses past values
  3. Warning users in configuration validation
🤖 Prompt for AI Agents
In `@app/features/featuresets/service.py` around lines 324 - 329, The bfill and
mean strategies (in the branch where strategy == "bfill" and strategy == "mean"
that operate on result grouped by self.entity_cols) leak future information;
update the function/class docstring to note this limitation, add an
"expanding_mean" (or "expanding-mean") option that computes
groupby(...)[col].transform(lambda x: x.expanding().mean().fillna(...)) so each
row only uses past values, and in configuration/validation logic emit a warning
or error when users select "bfill" or plain "mean" for time-series use to guide
them toward "expanding_mean". Ensure references to the symbols result,
self.entity_cols, strategy, bfill, mean, and expanding_mean are used so
reviewers can locate the changes.

Comment on lines +362 to +367
if config.include_price_change:
# Price change vs 7 days ago
result["price_pct_change_7d"] = df.groupby(self.entity_cols, observed=True)[
"unit_price"
].pct_change(periods=7)
columns.append("price_pct_change_7d")
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Potential leakage in price_pct_change_7d calculation.

The current implementation uses pct_change(periods=7) directly, which compares the current price to the price 7 days ago. At prediction time for day T, you wouldn't know price_T—only price_{T-1} and earlier.

To maintain consistency with the time-safe approach used elsewhere (e.g., shift(1) before rolling), consider shifting before calculating percent change:

🛡️ Proposed fix
             if config.include_price_change:
-                # Price change vs 7 days ago
-                result["price_pct_change_7d"] = df.groupby(self.entity_cols, observed=True)[
+                # Price change: yesterday's price vs 8 days ago (shifted to avoid leakage)
+                result["price_pct_change_7d"] = df.groupby(self.entity_cols, observed=True)[
                     "unit_price"
-                ].pct_change(periods=7)
+                ].shift(1).pct_change(periods=7)
                 columns.append("price_pct_change_7d")
🤖 Prompt for AI Agents
In `@app/features/featuresets/service.py` around lines 362 - 367, The
price_pct_change_7d calculation leaks by using the current row's unit_price;
change the pipeline to compute percent change on a 1-day-shifted series so only
historical prices are used (i.e., take df.groupby(self.entity_cols,
observed=True)["unit_price"].shift(1) and then .pct_change(periods=7) so the
feature at time T uses prices up through T-1). Keep the result key
price_pct_change_7d and columns.append("price_pct_change_7d") as before.

Comment thread INITIAL-4.md
Comment on lines +21 to +23
- https://scikit-learn.org/stable/modules/compose.html
- https://www.nixtla.io/blog/automated-time-series-feature-engineering-with-mlforecast?utm_source=chatgpt.com#introduction-to-mlforecast
- https://www.sktime.net/en/stable/api_reference/transformations.html
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Fix markdownlint MD034: replace bare URLs with links.
These links should be wrapped for lint compliance.

📝 Suggested fix
-- https://scikit-learn.org/stable/modules/compose.html
-- https://www.nixtla.io/blog/automated-time-series-feature-engineering-with-mlforecast?utm_source=chatgpt.com#introduction-to-mlforecast
-- https://www.sktime.net/en/stable/api_reference/transformations.html
+- [scikit-learn compose module](https://scikit-learn.org/stable/modules/compose.html)
+- [MLForecast blog](https://www.nixtla.io/blog/automated-time-series-feature-engineering-with-mlforecast?utm_source=chatgpt.com#introduction-to-mlforecast)
+- [sktime transformations API](https://www.sktime.net/en/stable/api_reference/transformations.html)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
- https://scikit-learn.org/stable/modules/compose.html
- https://www.nixtla.io/blog/automated-time-series-feature-engineering-with-mlforecast?utm_source=chatgpt.com#introduction-to-mlforecast
- https://www.sktime.net/en/stable/api_reference/transformations.html
- [scikit-learn compose module](https://scikit-learn.org/stable/modules/compose.html)
- [MLForecast blog](https://www.nixtla.io/blog/automated-time-series-feature-engineering-with-mlforecast?utm_source=chatgpt.com#introduction-to-mlforecast)
- [sktime transformations API](https://www.sktime.net/en/stable/api_reference/transformations.html)
🧰 Tools
🪛 markdownlint-cli2 (0.20.0)

[warning] 21-21: Bare URL used

(MD034, no-bare-urls)


[warning] 22-22: Bare URL used

(MD034, no-bare-urls)


[warning] 23-23: Bare URL used

(MD034, no-bare-urls)

🤖 Prompt for AI Agents
In `@INITIAL-4.md` around lines 21 - 23, Replace the three bare URLs
(https://scikit-learn.org/stable/modules/compose.html,
https://www.nixtla.io/blog/automated-time-series-feature-engineering-with-mlforecast?utm_source=chatgpt.com#introduction-to-mlforecast,
https://www.sktime.net/en/stable/api_reference/transformations.html) with
Markdown links using the [text](url) syntax (e.g., [scikit-learn compose
module](https://scikit-learn.org/stable/modules/compose.html)) so they are not
bare URLs and comply with markdownlint MD034; update the visible link text to a
concise descriptive label for each URL.

Comment thread INITIAL-5.md
Comment on lines +31 to +33
- https://scikit-learn.org/stable/modules/compose.html
- https://scikit-learn.org/stable/glossary.html
- https://scikit-learn.org/stable/model_persistence.html
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Fix markdownlint MD034: replace bare URLs with links.
The linter warnings are valid; wrap these in link syntax.

📝 Suggested fix
-- https://scikit-learn.org/stable/modules/compose.html
-- https://scikit-learn.org/stable/glossary.html
-- https://scikit-learn.org/stable/model_persistence.html
+- [scikit-learn compose module](https://scikit-learn.org/stable/modules/compose.html)
+- [scikit-learn glossary](https://scikit-learn.org/stable/glossary.html)
+- [scikit-learn model persistence](https://scikit-learn.org/stable/model_persistence.html)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
- https://scikit-learn.org/stable/modules/compose.html
- https://scikit-learn.org/stable/glossary.html
- https://scikit-learn.org/stable/model_persistence.html
- [scikit-learn compose module](https://scikit-learn.org/stable/modules/compose.html)
- [scikit-learn glossary](https://scikit-learn.org/stable/glossary.html)
- [scikit-learn model persistence](https://scikit-learn.org/stable/model_persistence.html)
🧰 Tools
🪛 markdownlint-cli2 (0.20.0)

[warning] 31-31: Bare URL used

(MD034, no-bare-urls)


[warning] 32-32: Bare URL used

(MD034, no-bare-urls)


[warning] 33-33: Bare URL used

(MD034, no-bare-urls)

🤖 Prompt for AI Agents
In `@INITIAL-5.md` around lines 31 - 33, Replace the bare URLs with markdown links
by converting each plain URL into the [text](url) form; specifically update the
three occurrences (https://scikit-learn.org/stable/modules/compose.html,
https://scikit-learn.org/stable/glossary.html,
https://scikit-learn.org/stable/model_persistence.html) to descriptive link text
(e.g., [Scikit-learn: ColumnTransformer
documentation](https://scikit-learn.org/stable/modules/compose.html)) so the
file no longer triggers MD034; edit the lines containing those URLs in
INITIAL-5.md (look for the exact URL strings) and replace them with appropriate
markdown link syntax.

Comment on lines +8 to +12
- `FeatureSetConfig` — Pydantic-based configuration schema with versioning
- `FeatureEngineeringService` — Core service computing features with cutoff enforcement
- `POST /featuresets/compute` — API endpoint to compute features for a series
- `GET /featuresets/preview` — Preview feature output for debugging/exploration
- Time-safety enforced at every level (lags, rolling, calendar)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify the actual HTTP method used for the preview endpoint
rg -n "featuresets/preview" --type py

Repository: w7-mgfcode/ForecastLabAI

Length of output: 152


🏁 Script executed:

#!/bin/bash
# Search for the preview endpoint route definition more broadly
rg -n "preview" --type py -B 2 -A 2 | grep -E "(def|@|/featuresets)"

Repository: w7-mgfcode/ForecastLabAI

Length of output: 3366


Change preview endpoint from GET to POST in documentation.
The actual route implementation uses @router.post("/preview") (app/features/featuresets/routes.py:178), not GET. Update the PRP to reflect POST /featuresets/preview for consistency.

🤖 Prompt for AI Agents
In `@PRPs/PRP-4-feature-engineering.md` around lines 8 - 12, The PRP lists the
preview endpoint as GET but the implemented route uses router.post("/preview");
update the documentation entry in PRP-4-feature-engineering.md to read "POST
/featuresets/preview" instead of "GET /featuresets/preview" so it matches the
implemented route (router.post("/preview")) and the API behavior.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants