Leakage-safe, point-in-time feature engineering for event logs.
safefeat builds ML features from event data using only information available at prediction time — no future data, no silent leakage, no surprises in production.
When you compute features like "total purchases in the last 30 days" without anchoring to a cutoff time, you accidentally include future events. Your model looks great in training — then falls apart in production.
# ❌ Leaky — uses ALL events, including future ones
features = events.groupby("user_id")["amount"].sum()
df = spine.merge(features, on="user_id")
# ✅ Safe — only uses events before each cutoff_time
X = build_features(spine, tables, spec, event_time_cols={"events": "event_time"})
pip install safefeat
safefeat works with three components:
| Component | Description |
|---|---|
| Spine | When to make predictions — one row per (entity_id, cutoff_time) |
| Events | Historical time-series data tied to each entity |
| Spec | Declarative definition of what features to compute |
For each row in the spine, safefeat joins only events where event_time <= cutoff_time, then computes your features. Future events are excluded.
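The point-in-time rule above can be illustrated with plain pandas. This is only a sketch of the idea, not safefeat's implementation; the data is hypothetical, and u1 is given one event after its cutoff so the difference between the leaky and the safe total is visible.

```python
import pandas as pd

# Hypothetical data: u1 has one event (amount 99.0) after its cutoff.
spine = pd.DataFrame({
    "entity_id": ["u1", "u2"],
    "cutoff_time": pd.to_datetime(["2024-01-10", "2024-01-31"]),
})
events = pd.DataFrame({
    "entity_id": ["u1", "u1", "u2"],
    "event_time": pd.to_datetime(["2024-01-05", "2024-01-20", "2024-01-30"]),
    "amount": [10.0, 99.0, 25.0],
})

# Pair every spine row with every event of the same entity.
pairs = spine.merge(events, on="entity_id")

# Keep only events at or before that row's own cutoff.
is_past = pairs["event_time"] <= pairs["cutoff_time"]
kept = pairs[is_past]

safe_total = kept.groupby(["entity_id", "cutoff_time"])["amount"].sum()
leaky_total = events.groupby("entity_id")["amount"].sum()

print(safe_total)   # u1 total excludes the 2024-01-20 event
print(leaky_total)  # u1 total includes it
print("dropped future pairs:", int((~is_past).sum()))
```

The inner merge drops spine rows whose entity has no events at all; a real pipeline would need to keep those rows and fill their features explicitly.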
import pandas as pd
from safefeat import build_features, WindowAgg
spine = pd.DataFrame({
"entity_id": ["u1", "u2"],
"cutoff_time": ["2024-01-10", "2024-01-31"],
})
events = pd.DataFrame({
"entity_id": ["u1", "u1", "u2", "u2"],
"event_time": ["2024-01-05", "2024-01-06", "2024-01-10", "2024-01-30"],
"amount": [10.0, 20.0, 5.0, 25.0],
"event_type": ["click", "purchase", "purchase", "click"],
})
spec = [
WindowAgg(
table="events",
windows=["7D", "30D"],
metrics={
"*": ["count"],
"amount": ["sum", "mean"],
"event_type": ["nunique"],
},
)
]
X = build_features(
spine=spine,
tables={"events": events},
spec=spec,
event_time_cols={"events": "event_time"},
allowed_lag="0s",
)
Output columns follow the pattern {table}__{column}__{agg}__{window}; the "*" count appears as {table}__n_events__{window}:
events__n_events__7d # number of events in the last 7 days
events__amount__sum__7d # total spend in the last 7 days
events__amount__mean__7d # average spend per event in the last 7 days
events__event_type__nunique__7d # distinct event types seen in the last 7 days
events__n_events__30d # number of events in the last 30 days
events__amount__sum__30d # total spend in the last 30 days
events__amount__mean__30d # average spend per event in the last 30 days
events__event_type__nunique__30d # distinct event types seen in the last 30 days
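To sanity-check values in columns like these, the count and sum features can be recomputed by hand. The sketch below uses plain pandas on the quick-start data and assumes a window covers (cutoff - N days, cutoff]; how safefeat treats the window boundaries is not stated here, so that convention is an assumption. The helper name window_features is hypothetical.

```python
import pandas as pd

spine = pd.DataFrame({
    "entity_id": ["u1", "u2"],
    "cutoff_time": pd.to_datetime(["2024-01-10", "2024-01-31"]),
})
events = pd.DataFrame({
    "entity_id": ["u1", "u1", "u2", "u2"],
    "event_time": pd.to_datetime(
        ["2024-01-05", "2024-01-06", "2024-01-10", "2024-01-30"]
    ),
    "amount": [10.0, 20.0, 5.0, 25.0],
})
keys = ["entity_id", "cutoff_time"]


def window_features(spine, events, days):
    """Count and sum events in (cutoff - days, cutoff] for each spine row."""
    pairs = spine.merge(events, on="entity_id")
    start = pairs["cutoff_time"] - pd.Timedelta(days=days)
    in_window = (pairs["event_time"] > start) & (
        pairs["event_time"] <= pairs["cutoff_time"]
    )
    agg = (
        pairs[in_window]
        .groupby(keys)["amount"]
        .agg(["count", "sum"])
        .reset_index()
    )
    return agg.rename(columns={
        "count": f"events__n_events__{days}d",
        "sum": f"events__amount__sum__{days}d",
    })


# Left-merge back onto the spine so every prediction row is preserved.
check = spine.merge(window_features(spine, events, 7), on=keys, how="left")
check = check.merge(window_features(spine, events, 30), on=keys, how="left")
print(check)
```

For this data no event falls on a window boundary, so the result does not depend on the boundary convention chosen.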
safefeat ships with a synthetic e-commerce dataset for experimentation:
from safefeat.datasets import load_customer_demo
events, spine = load_customer_demo()
See the customer demo examples for worked questions using this dataset.
Windows support days, months, years, and unlimited history:
spec = [
WindowAgg(
table="events",
windows=["7D", "30D", "3M", "1Y", None], # None = all history before cutoff
metrics={
"*": ["count"],
"amount": ["sum", "mean"],
"event_type": ["nunique"],
},
)
]
| Unit | Example | Meaning |
|---|---|---|
| D | "30D" | Exact days |
| M | "3M" | Calendar months |
| Y | "1Y" | Calendar years |
| None | None | All history before cutoff |
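The difference between exact days and calendar units shows up at month ends and in leap years. The sketch below uses pandas offsets to illustrate it; whether safefeat resolves "M" and "Y" windows exactly like pd.DateOffset is an assumption, since the table only says "calendar months" and "calendar years".

```python
import pandas as pd

cutoff = pd.Timestamp("2024-03-31")

# Exact days: always subtracts 30 * 24 hours.
start_30d = cutoff - pd.Timedelta(days=30)

# Calendar month: same day of the previous month, clipped to the month's end.
start_1m = cutoff - pd.DateOffset(months=1)

# Calendar year from a leap day: clipped to Feb 28 of the previous year.
start_1y = pd.Timestamp("2024-02-29") - pd.DateOffset(years=1)

print(start_30d.date())  # 2024-03-01
print(start_1m.date())   # 2024-02-29
print(start_1y.date())   # 2023-02-28
```

A "30D" window and a "1M" window therefore start on different dates for the same cutoff, which can matter for features sensitive to month boundaries.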
from safefeat import RecencyBlock
spec = [RecencyBlock(table="events")]
X = build_features(
spine=spine,
tables={"events": events},
spec=spec,
event_time_cols={"events": "event_time"},
)
# Adds: events__recency (days since last event before cutoff_time)
Filter to a specific event type:
spec = [
RecencyBlock(
table="events",
filter_col="event_type",
filter_value="purchase",
)
]
# Adds: events__recency__event_type_purchase
Verify exactly which events were included and dropped for each prediction point:
X, audit = build_features(
spine=spine,
tables={"events": events},
spec=spec,
event_time_cols={"events": "event_time"},
return_report=True,
)
events_audit = audit.tables.get("events")
print(events_audit.total_joined_pairs) # total event-cutoff pairs considered
print(events_audit.kept_pairs) # events before cutoff (used)
print(events_audit.dropped_future_pairs) # events after cutoff (excluded)
Pass multiple tables, each with its own event time column:
spec = [
WindowAgg(table="transactions", windows=["30D"], metrics={"amount": ["sum"]}),
WindowAgg(table="logins", windows=["7D"], metrics={"*": ["count"]}),
RecencyBlock(table="transactions"),
]
X = build_features(
spine=spine,
tables={
"transactions": transactions_df,
"logins": logins_df,
},
spec=spec,
event_time_cols={
"transactions": "transaction_time",
"logins": "login_time",
},
)
The table= name is just a label: it must match a key in tables and event_time_cols, but can be anything you choose.
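The multi-table call references transactions_df and logins_df without defining them. A minimal sketch of inputs with the expected shape follows; the rows are hypothetical, and it assumes each table carries an entity_id column like the quick-start events table. The final check only confirms that every table label has a matching time column before features are built.

```python
import pandas as pd

# Hypothetical inputs; values are illustrative only.
spine = pd.DataFrame({
    "entity_id": ["u1", "u2"],
    "cutoff_time": pd.to_datetime(["2024-01-10", "2024-01-31"]),
})
transactions_df = pd.DataFrame({
    "entity_id": ["u1", "u2"],
    "transaction_time": pd.to_datetime(["2024-01-04", "2024-01-20"]),
    "amount": [40.0, 15.0],
})
logins_df = pd.DataFrame({
    "entity_id": ["u1", "u1", "u2"],
    "login_time": pd.to_datetime(["2024-01-08", "2024-01-09", "2024-01-29"]),
})

tables = {"transactions": transactions_df, "logins": logins_df}
event_time_cols = {"transactions": "transaction_time", "logins": "login_time"}

# List any table label whose declared time column is missing from its frame.
problems = [
    name for name, frame in tables.items()
    if event_time_cols.get(name) not in frame.columns
]
print(problems)  # [] when every label maps to an existing time column
```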
pip install -e ".[dev]"
pytest -q
ruff check .
Contributions, bug reports, and feature requests are welcome. Open an issue at github.com/AlishaAng/safefeat/issues.
