safefeat
License: MIT

Leakage-safe, point-in-time feature engineering for event logs.

safefeat builds ML features from event data using only information available at prediction time — no future data, no silent leakage, no surprises in production.


The Problem

When you compute features like "total purchases in the last 30 days" without anchoring to a cutoff time, you accidentally include future events. Your model looks great in training — then falls apart in production.

```python
# ❌ Leaky — uses ALL events, including future ones
features = events.groupby("user_id")["amount"].sum()
df = spine.merge(features, on="user_id")

# ✅ Safe — only uses events before each cutoff_time
X = build_features(spine, tables, spec, event_time_cols={"events": "event_time"})
```

Install

```bash
pip install safefeat
```

How It Works

safefeat works with three components:

| Component | Description |
| --- | --- |
| Spine | When to make predictions — one row per (entity_id, cutoff_time) |
| Events | Historical time-series data tied to each entity |
| Spec | Declarative definition of what features to compute |

For each row in the spine, safefeat joins only events where `event_time <= cutoff_time`, then computes your features. Future events are excluded.
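Conceptually, that join-then-filter step can be reproduced in plain pandas. This is an illustrative sketch, not safefeat's actual implementation; the data is made up, with one deliberately future event for `u1`:

```python
import pandas as pd

spine = pd.DataFrame({
    "entity_id": ["u1", "u2"],
    "cutoff_time": pd.to_datetime(["2024-01-10", "2024-01-31"]),
})
events = pd.DataFrame({
    "entity_id": ["u1", "u1", "u2", "u2"],
    # u1's second event happens AFTER its cutoff and must be excluded
    "event_time": pd.to_datetime(["2024-01-05", "2024-01-12", "2024-01-10", "2024-01-30"]),
    "amount": [10.0, 20.0, 5.0, 25.0],
})

# Pair each spine row with that entity's events, then keep only
# events at or before the cutoff -- the point-in-time filter.
joined = spine.merge(events, on="entity_id")
kept = joined[joined["event_time"] <= joined["cutoff_time"]]

# Aggregating the filtered events gives leakage-safe features.
features = (
    kept.groupby(["entity_id", "cutoff_time"])["amount"]
    .sum()
    .rename("amount_sum_all")
    .reset_index()
)
```

Here `u1`'s 2024-01-12 event is dropped, so its feature value is 10.0 rather than the leaky 30.0.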


Quick Start

```python
import pandas as pd
from safefeat import build_features, WindowAgg

spine = pd.DataFrame({
    "entity_id":   ["u1", "u2"],
    "cutoff_time": ["2024-01-10", "2024-01-31"],
})

events = pd.DataFrame({
    "entity_id":  ["u1", "u1", "u2", "u2"],
    "event_time": ["2024-01-05", "2024-01-06", "2024-01-10", "2024-01-30"],
    "amount":     [10.0, 20.0, 5.0, 25.0],
    "event_type": ["click", "purchase", "purchase", "click"],
})

spec = [
    WindowAgg(
        table="events",
        windows=["7D", "30D"],
        metrics={
            "*":          ["count"],
            "amount":     ["sum", "mean"],
            "event_type": ["nunique"],
        },
    )
]

X = build_features(
    spine=spine,
    tables={"events": events},
    spec=spec,
    event_time_cols={"events": "event_time"},
    allowed_lag="0s",
)
```

Output columns follow the pattern `{table}__{column}__{agg}__{window}`; the wildcard row count is shortened to `{table}__n_events__{window}`:

```text
events__n_events__7d               # number of events in the last 7 days
events__amount__sum__7d            # total spend in the last 7 days
events__amount__mean__7d           # average spend per event in the last 7 days
events__event_type__nunique__7d    # distinct event types seen in the last 7 days

events__n_events__30d              # number of events in the last 30 days
events__amount__sum__30d           # total spend in the last 30 days
events__amount__mean__30d          # average spend per event in the last 30 days
events__event_type__nunique__30d   # distinct event types seen in the last 30 days
```
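Because the naming scheme is a fixed `__`-separated pattern, feature names can be split back into their parts. This helper is purely illustrative and not part of the safefeat API; it handles both the 4-part form and the shorter row-count form:

```python
def parse_feature_name(name: str) -> dict:
    """Split a safefeat-style column name into its parts.

    Handles {table}__{column}__{agg}__{window} and the 3-part
    row-count form {table}__n_events__{window}.
    (Hypothetical helper -- not part of safefeat itself.)
    """
    parts = name.split("__")
    if len(parts) == 4:
        table, column, agg, window = parts
    elif len(parts) == 3:
        # Row-count features omit the aggregation segment.
        table, column, window = parts
        agg = "count"
    else:
        raise ValueError(f"unexpected feature name: {name!r}")
    return {"table": table, "column": column, "agg": agg, "window": window}
```

Such a helper is handy when grouping model importances by source table or window.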

Demo Dataset

safefeat ships with a synthetic e-commerce dataset for experimentation:

```python
from safefeat.datasets import load_customer_demo

events, spine = load_customer_demo()
```

See the customer demo examples for worked exercises on this dataset.

Window aggregations

Windows support days, months, years, and unlimited history:

```python
spec = [
    WindowAgg(
        table="events",
        windows=["7D", "30D", "3M", "1Y", None],  # None = all history before cutoff
        metrics={
            "*":          ["count"],
            "amount":     ["sum", "mean"],
            "event_type": ["nunique"],
        },
    )
]
```
| Unit | Example | Meaning |
| --- | --- | --- |
| D | "30D" | Exact days |
| M | "3M" | Calendar months |
| Y | "1Y" | Calendar years |
| None | None | All history before cutoff |
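The distinction between exact-day and calendar units matters near month ends. Assuming safefeat's units behave like the corresponding pandas offsets (an assumption, not confirmed by the docs above), the window starts differ like this:

```python
import pandas as pd

cutoff = pd.Timestamp("2024-03-31")

# "30D" style: a fixed-length window of exactly 30 days.
start_30d = cutoff - pd.Timedelta(days=30)

# "3M" style: a calendar window -- pandas steps back three whole
# months, landing on 2023-12-31 (December also has 31 days, so
# no end-of-month clamping is needed here).
start_3m = cutoff - pd.DateOffset(months=3)
```

A 90-day window and a 3-month window therefore generally cover slightly different spans of history.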

Recency features

```python
from safefeat import RecencyBlock

spec = [RecencyBlock(table="events")]

X = build_features(
    spine=spine,
    tables={"events": events},
    spec=spec,
    event_time_cols={"events": "event_time"},
)
# Adds: events__recency (days since last event before cutoff_time)
```

Filter to a specific event type:

```python
spec = [
    RecencyBlock(
        table="events",
        filter_col="event_type",
        filter_value="purchase",
    )
]
# Adds: events__recency__event_type_purchase
```
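A recency value is easy to sanity-check by hand. This sketch (illustrative data, not the safefeat implementation) computes both the overall and the purchase-filtered recency for a single cutoff, including a future purchase that must not count:

```python
import pandas as pd

cutoff = pd.Timestamp("2024-01-10")
events = pd.DataFrame({
    "event_time": pd.to_datetime(["2024-01-05", "2024-01-06", "2024-01-12"]),
    "event_type": ["purchase", "click", "purchase"],
})

# Only events at or before the cutoff are eligible.
past = events[events["event_time"] <= cutoff]

# Overall recency: days since the most recent past event (the click).
recency_days = (cutoff - past["event_time"].max()).days

# Filtered recency: restricted to purchases. The 2024-01-12 purchase
# is after the cutoff, so only the 2024-01-05 one counts.
purchases = past[past["event_type"] == "purchase"]
purchase_recency_days = (cutoff - purchases["event_time"].max()).days
```

Here the overall recency is 4 days while the purchase recency is 5 days, since the latest pre-cutoff event is a click.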

Audit report

Verify exactly which events were included and dropped for each prediction point:

```python
X, audit = build_features(
    spine=spine,
    tables={"events": events},
    spec=spec,
    event_time_cols={"events": "event_time"},
    return_report=True,
)

events_audit = audit.tables.get("events")
print(events_audit.total_joined_pairs)    # total event-cutoff pairs considered
print(events_audit.kept_pairs)            # events before cutoff (used)
print(events_audit.dropped_future_pairs)  # events after cutoff (excluded)
```
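The audit counters can be cross-checked by hand. Assuming `total_joined_pairs` counts every same-entity (event, cutoff) pair, as the comments above suggest, a back-of-the-envelope version with illustrative data looks like this:

```python
import pandas as pd

spine = pd.DataFrame({
    "entity_id": ["u1", "u2"],
    "cutoff_time": pd.to_datetime(["2024-01-10", "2024-01-31"]),
})
events = pd.DataFrame({
    "entity_id": ["u1", "u1", "u2", "u2"],
    "event_time": pd.to_datetime(["2024-01-05", "2024-01-12", "2024-01-10", "2024-01-30"]),
})

# Every (event, cutoff) pair for the same entity is "considered".
pairs = spine.merge(events, on="entity_id")
total = len(pairs)

# Pairs at or before the cutoff are kept; the rest are future leaks.
kept = int((pairs["event_time"] <= pairs["cutoff_time"]).sum())
dropped_future = total - kept
```

With this data, 4 pairs are considered, 3 are kept, and 1 (u1's 2024-01-12 event) is dropped as future.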

Multiple event tables

Pass multiple tables — each with its own event time column:

```python
spec = [
    WindowAgg(table="transactions", windows=["30D"], metrics={"amount": ["sum"]}),
    WindowAgg(table="logins",       windows=["7D"],  metrics={"*": ["count"]}),
    RecencyBlock(table="transactions"),
]

X = build_features(
    spine=spine,
    tables={
        "transactions": transactions_df,
        "logins":        logins_df,
    },
    spec=spec,
    event_time_cols={
        "transactions": "transaction_time",
        "logins":       "login_time",
    },
)
```

The `table=` name is just a label — it must match a key in `tables` and `event_time_cols`, but can be anything you choose.


Development

```bash
pip install -e ".[dev]"
pytest -q
ruff check .
```

Contributing

Contributions, bug reports, and feature requests are welcome. Open an issue at github.com/AlishaAng/safefeat/issues.
