safefeat
License: MIT

Leakage-safe, point-in-time feature engineering for event logs.

safefeat builds ML features from event data using only information available at prediction time — no future data, no silent leakage, no surprises in production.


The Problem

When you compute features like "total purchases in the last 30 days" without anchoring to a cutoff time, you accidentally include future events. Your model looks great in training — then falls apart in production.

```python
# ❌ Leaky — uses ALL events, including future ones
features = events.groupby("user_id")["amount"].sum()
df = spine.merge(features, on="user_id")

# ✅ Safe — only uses events before each cutoff_time
X = build_features(spine, tables, spec, event_time_cols={"events": "event_time"})
```

Install

```bash
pip install safefeat
```

How It Works

safefeat works with three components:

| Component | Description |
| --- | --- |
| Spine | When to make predictions — one row per (entity_id, cutoff_time) |
| Events | Historical time-series data tied to each entity |
| Spec | Declarative definition of what features to compute |

For each row in the spine, safefeat joins only events where `event_time <= cutoff_time`, then computes your features. Future events are excluded.
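Conceptually, that join-then-filter step can be reproduced in plain pandas. This is an illustrative sketch, not safefeat's actual implementation; the data is made up, with one deliberately future event for `u1`:

```python
import pandas as pd

spine = pd.DataFrame({
    "entity_id": ["u1", "u2"],
    "cutoff_time": pd.to_datetime(["2024-01-10", "2024-01-31"]),
})
events = pd.DataFrame({
    "entity_id": ["u1", "u1", "u2", "u2"],
    # u1's second event happens AFTER its cutoff and must be excluded
    "event_time": pd.to_datetime(["2024-01-05", "2024-01-12", "2024-01-10", "2024-01-30"]),
    "amount": [10.0, 20.0, 5.0, 25.0],
})

# Pair each spine row with that entity's events, then keep only
# events at or before the cutoff -- the point-in-time filter.
joined = spine.merge(events, on="entity_id")
kept = joined[joined["event_time"] <= joined["cutoff_time"]]

# Aggregating the filtered events gives leakage-safe features.
features = (
    kept.groupby(["entity_id", "cutoff_time"])["amount"]
    .sum()
    .rename("amount_sum_all")
    .reset_index()
)
```

Here `u1`'s 2024-01-12 event is dropped, so its feature value is 10.0 rather than the leaky 30.0.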


Quick Start

```python
import pandas as pd
from safefeat import build_features, WindowAgg

spine = pd.DataFrame({
    "entity_id":   ["u1", "u2"],
    "cutoff_time": ["2024-01-10", "2024-01-31"],
})

events = pd.DataFrame({
    "entity_id":  ["u1", "u1", "u2", "u2"],
    "event_time": ["2024-01-05", "2024-01-06", "2024-01-10", "2024-01-30"],
    "amount":     [10.0, 20.0, 5.0, 25.0],
    "event_type": ["click", "purchase", "purchase", "click"],
})

spec = [
    WindowAgg(
        table="events",
        windows=["7D", "30D"],
        metrics={
            "*":          ["count"],
            "amount":     ["sum", "mean"],
            "event_type": ["nunique"],
        },
    )
]

X = build_features(
    spine=spine,
    tables={"events": events},
    spec=spec,
    event_time_cols={"events": "event_time"},
    allowed_lag="0s",
)
```

Output columns follow the pattern `{table}__{column}__{agg}__{window}`; the wildcard row count is shortened to `{table}__n_events__{window}`:

```text
events__n_events__7d               # number of events in the last 7 days
events__amount__sum__7d            # total spend in the last 7 days
events__amount__mean__7d           # average spend per event in the last 7 days
events__event_type__nunique__7d    # distinct event types seen in the last 7 days

events__n_events__30d              # number of events in the last 30 days
events__amount__sum__30d           # total spend in the last 30 days
events__amount__mean__30d          # average spend per event in the last 30 days
events__event_type__nunique__30d   # distinct event types seen in the last 30 days
```
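Because the naming scheme is a fixed `__`-separated pattern, feature names can be split back into their parts. This helper is purely illustrative and not part of the safefeat API; it handles both the 4-part form and the shorter row-count form:

```python
def parse_feature_name(name: str) -> dict:
    """Split a safefeat-style column name into its parts.

    Handles {table}__{column}__{agg}__{window} and the 3-part
    row-count form {table}__n_events__{window}.
    (Hypothetical helper -- not part of safefeat itself.)
    """
    parts = name.split("__")
    if len(parts) == 4:
        table, column, agg, window = parts
    elif len(parts) == 3:
        # Row-count features omit the aggregation segment.
        table, column, window = parts
        agg = "count"
    else:
        raise ValueError(f"unexpected feature name: {name!r}")
    return {"table": table, "column": column, "agg": agg, "window": window}
```

Such a helper is handy when grouping model importances by source table or window.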

Demo Dataset

safefeat ships with a synthetic e-commerce dataset for experimentation:

```python
from safefeat.datasets import load_customer_demo

events, spine = load_customer_demo()
```

See the customer demo examples for worked exercises on this dataset.

Window aggregations

Windows support days, months, years, and unlimited history:

```python
spec = [
    WindowAgg(
        table="events",
        windows=["7D", "30D", "3M", "1Y", None],  # None = all history before cutoff
        metrics={
            "*":          ["count"],
            "amount":     ["sum", "mean"],
            "event_type": ["nunique"],
        },
    )
]
```
| Unit | Example | Meaning |
| --- | --- | --- |
| D | "30D" | Exact days |
| M | "3M" | Calendar months |
| Y | "1Y" | Calendar years |
| None | None | All history before cutoff |
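The distinction between exact-day and calendar units matters near month ends. Assuming safefeat's units behave like the corresponding pandas offsets (an assumption, not confirmed by the docs above), the window starts differ like this:

```python
import pandas as pd

cutoff = pd.Timestamp("2024-03-31")

# "30D" style: a fixed-length window of exactly 30 days.
start_30d = cutoff - pd.Timedelta(days=30)

# "3M" style: a calendar window -- pandas steps back three whole
# months, landing on 2023-12-31 (December also has 31 days, so
# no end-of-month clamping is needed here).
start_3m = cutoff - pd.DateOffset(months=3)
```

A 90-day window and a 3-month window therefore generally cover slightly different spans of history.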

Recency features

```python
from safefeat import RecencyBlock

spec = [RecencyBlock(table="events")]

X = build_features(
    spine=spine,
    tables={"events": events},
    spec=spec,
    event_time_cols={"events": "event_time"},
)
# Adds: events__recency (days since last event before cutoff_time)
```

Filter to a specific event type:

```python
spec = [
    RecencyBlock(
        table="events",
        filter_col="event_type",
        filter_value="purchase",
    )
]
# Adds: events__recency__event_type_purchase
```
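A recency value is easy to sanity-check by hand. This sketch (illustrative data, not the safefeat implementation) computes both the overall and the purchase-filtered recency for a single cutoff, including a future purchase that must not count:

```python
import pandas as pd

cutoff = pd.Timestamp("2024-01-10")
events = pd.DataFrame({
    "event_time": pd.to_datetime(["2024-01-05", "2024-01-06", "2024-01-12"]),
    "event_type": ["purchase", "click", "purchase"],
})

# Only events at or before the cutoff are eligible.
past = events[events["event_time"] <= cutoff]

# Overall recency: days since the most recent past event (the click).
recency_days = (cutoff - past["event_time"].max()).days

# Filtered recency: restricted to purchases. The 2024-01-12 purchase
# is after the cutoff, so only the 2024-01-05 one counts.
purchases = past[past["event_type"] == "purchase"]
purchase_recency_days = (cutoff - purchases["event_time"].max()).days
```

Here the overall recency is 4 days while the purchase recency is 5 days, since the latest pre-cutoff event is a click.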

Audit report

Verify exactly which events were included and dropped for each prediction point:

```python
X, audit = build_features(
    spine=spine,
    tables={"events": events},
    spec=spec,
    event_time_cols={"events": "event_time"},
    return_report=True,
)

events_audit = audit.tables.get("events")
print(events_audit.total_joined_pairs)    # total event-cutoff pairs considered
print(events_audit.kept_pairs)            # events before cutoff (used)
print(events_audit.dropped_future_pairs)  # events after cutoff (excluded)
```
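The audit counters can be cross-checked by hand. Assuming `total_joined_pairs` counts every same-entity (event, cutoff) pair, as the comments above suggest, a back-of-the-envelope version with illustrative data looks like this:

```python
import pandas as pd

spine = pd.DataFrame({
    "entity_id": ["u1", "u2"],
    "cutoff_time": pd.to_datetime(["2024-01-10", "2024-01-31"]),
})
events = pd.DataFrame({
    "entity_id": ["u1", "u1", "u2", "u2"],
    "event_time": pd.to_datetime(["2024-01-05", "2024-01-12", "2024-01-10", "2024-01-30"]),
})

# Every (event, cutoff) pair for the same entity is "considered".
pairs = spine.merge(events, on="entity_id")
total = len(pairs)

# Pairs at or before the cutoff are kept; the rest are future leaks.
kept = int((pairs["event_time"] <= pairs["cutoff_time"]).sum())
dropped_future = total - kept
```

With this data, 4 pairs are considered, 3 are kept, and 1 (u1's 2024-01-12 event) is dropped as future.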

Multiple event tables

Pass multiple tables — each with its own event time column:

```python
spec = [
    WindowAgg(table="transactions", windows=["30D"], metrics={"amount": ["sum"]}),
    WindowAgg(table="logins",       windows=["7D"],  metrics={"*": ["count"]}),
    RecencyBlock(table="transactions"),
]

X = build_features(
    spine=spine,
    tables={
        "transactions": transactions_df,
        "logins":        logins_df,
    },
    spec=spec,
    event_time_cols={
        "transactions": "transaction_time",
        "logins":       "login_time",
    },
)
```

The `table=` name is just a label — it must match a key in `tables` and `event_time_cols`, but can be anything you choose.


Development

```bash
pip install -e ".[dev]"
pytest -q
ruff check .
```

Contributing

Contributions, bug reports, and feature requests are welcome. Open an issue at github.com/AlishaAng/safefeat/issues.
