Training-data filtering is a core AI-safety control: harmful language, sensitive personal data, representational bias, and poisoned examples can all be encoded at scale and persist through training. Recent literature pushes toward explicit constitutional safety classifiers and stronger data-centric defenses, including Anthropic's constitutional-classifier work (2025), DABUF-style data-attribution and unlearning methods, and the Wild Patterns Reloaded survey of training-data poisoning.

Quick start:

```python
from safedata import FilterPipeline, ToxicityClassifier, PIIDetector, PoisonDetector
from safedata.filters.cascade import BlocklistFilter
from safedata.filters.dedup import NearDuplicateFilter

dataset = [
    {"id": "1", "text": "You are terrible and should disappear."},
    {"id": "2", "text": "Contact me at jane@example.com"},
]

pipeline = FilterPipeline([
    BlocklistFilter(path="blocklists/", blocklist=["slur_example"]),
    NearDuplicateFilter(threshold=0.95),
    ToxicityClassifier(model="microsoft/deberta-v3-base", threshold=0.85),
    PIIDetector(redact=True, categories=["ssn", "email", "phone"]),
    PoisonDetector(method="spectral", sample_rate=0.01),
])

results = pipeline.filter(dataset, audit_log="audit.jsonl")
print(results.summary)
```
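The quick start treats NearDuplicateFilter as a black box with a 0.95 similarity threshold. As a standalone illustration of one common approach, here is a character-shingle Jaccard sketch; the shingle size `k=5` is an illustrative assumption, not the filter's actual parameter:

```python
def shingles(text: str, k: int = 5) -> set[str]:
    """Set of k-character shingles of a lowercased string."""
    text = text.lower()
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}

def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity: intersection size over union size."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def near_duplicate_pairs(texts: list[str], threshold: float = 0.95):
    """Yield index pairs whose shingle sets meet the similarity threshold."""
    sets = [shingles(t) for t in texts]
    for i in range(len(sets)):
        for j in range(i + 1, len(sets)):
            if jaccard(sets[i], sets[j]) >= threshold:
                yield i, j
```

At scale, the quadratic pair scan is usually replaced with MinHash/LSH so that candidate pairs are found without comparing every pair directly.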
Not production-ready. This repository is a technical reference implementation for experimentation and review. Its goals:
- Provide a practical safety filtering stack for pretraining/finetuning data.
- Make every filtering decision auditable (see the audit-log sketch after this list).
- Evaluate robustness under adversarial evasion attacks.
- Include fairness and calibration diagnostics by default.
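On auditability: `pipeline.filter(..., audit_log="audit.jsonl")` writes a JSONL log with one record per filtering decision. The record schema isn't documented in this README, so the `filter` and `action` field names below are illustrative assumptions:

```python
import json
from collections import Counter

# Tally decisions per (filter, action) pair from the JSONL audit log.
# The "filter" and "action" field names are assumptions for illustration;
# check the schema that FilterPipeline actually emits.
by_decision = Counter()
with open("audit.jsonl") as f:
    for line in f:
        record = json.loads(line)
        by_decision[(record["filter"], record["action"])] += 1

for (name, action), count in sorted(by_decision.items()):
    print(f"{name:>20}  {action:<8}  {count}")
```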
Components:
- ToxicityClassifier: multi-label scoring (toxicity, severe_toxicity, obscene, threat, insult, identity_attack) with optional transformer backend and temperature scaling.
- PIIDetector: regex + optional spaCy NER + contextual PII rules (see the sketch after this list).
- BiasDetector: embedding diagnostics, representation ratios, counterfactual swapping, optional Fairlearn metrics.
- PoisonDetector: spectral signatures and DBSCAN anomaly detection.
- FilterPipeline: cascade architecture with audit logging.
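As a standalone illustration of the regex layer, here is a minimal redaction sketch for the `ssn`, `email`, and `phone` categories used in the quick start. The patterns are deliberately simplified assumptions, not PIIDetector's actual rules:

```python
import re

# Simplified redaction patterns, for illustration only. PIIDetector's real
# rule set also layers NER and context; these regexes are deliberately naive.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b(?:\+?1[\s.-]?)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}\b"),
}

def redact(text: str) -> str:
    """Replace each match with its category placeholder, e.g. [EMAIL]."""
    for category, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{category.upper()}]", text)
    return text

print(redact("Contact me at jane@example.com or 555-867-5309"))
# -> Contact me at [EMAIL] or [PHONE]
```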
Known limitations:
- Cannot verify factual accuracy; detection is pattern matching only.
- Heuristic fallbacks are weaker than task-specific trained classifiers.
- Poison detection quality depends on embedding/activation quality (see the spectral-signature sketch after this list).
- Fairness metrics depend on sensitive-attribute annotations or proxies.
- Evasion coverage in this repo is intentionally limited.
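To make the embedding-quality caveat concrete: the spectral-signatures method (Tran et al., 2018) scores each example by its squared projection onto the top singular direction of the centered embedding matrix, so it can only separate poisoned from clean data when the embeddings themselves do. A minimal numpy sketch, not PoisonDetector's implementation; the `flag_fraction` default is an illustrative assumption:

```python
import numpy as np

def spectral_scores(embeddings: np.ndarray) -> np.ndarray:
    """Squared projection of each centered embedding onto the top
    right-singular vector; poisoned examples tend to score high."""
    centered = embeddings - embeddings.mean(axis=0)
    # Top right-singular vector of the (n_examples, dim) centered matrix.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return (centered @ vt[0]) ** 2

def flag_outliers(embeddings: np.ndarray, flag_fraction: float = 0.01) -> np.ndarray:
    """Indices of the highest-scoring examples, returned as poison candidates."""
    scores = spectral_scores(embeddings)
    k = max(1, int(len(scores) * flag_fraction))
    return np.argsort(scores)[-k:]
```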
References:
- Anthropic (2025), "Constitutional Classifiers: Defending Against Universal Jailbreaks Across Thousands of Hours of Red Teaming".
- Pan et al. (2025), "Detecting and Filtering Unsafe Training Data via Data Attribution" (DABUF).
- Cinà et al. (2023), "Wild Patterns Reloaded: A Survey of Machine Learning Security against Training Data Poisoning".
Install and run:

```bash
pip install -e .                 # core install
pip install -e ".[ml,dev]"       # with optional ML and dev extras
pytest                           # run the test suite
python examples/quick_start.py   # run the quick-start example above
```