codernate92/SafeData

safedata

from safedata import FilterPipeline, ToxicityClassifier, PIIDetector, PoisonDetector
from safedata.filters.cascade import BlocklistFilter
from safedata.filters.dedup import NearDuplicateFilter

dataset = [
    {"id": "1", "text": "You are terrible and should disappear."},
    {"id": "2", "text": "Contact me at jane@example.com"},
]

# Stages run as a cascade: cheap rule-based filters first, model-based stages last.
pipeline = FilterPipeline([
    BlocklistFilter(path="blocklists/", blocklist=["slur_example"]),
    NearDuplicateFilter(threshold=0.95),
    ToxicityClassifier(model="microsoft/deberta-v3-base", threshold=0.85),
    PIIDetector(redact=True, categories=["ssn", "email", "phone"]),
    PoisonDetector(method="spectral", sample_rate=0.01),
])

# Every filtering decision is appended to the audit log.
results = pipeline.filter(dataset, audit_log="audit.jsonl")
print(results.summary)
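The `threshold=0.95` passed to `NearDuplicateFilter` reads naturally as a similarity cutoff. As a minimal, dependency-free sketch of what near-duplicate detection can look like (Jaccard similarity over character shingles; the implementation here is illustrative, not the library's actual one):

```python
def shingles(text: str, k: int = 5) -> set:
    """Character k-grams of a whitespace-normalized, lowercased string."""
    text = " ".join(text.lower().split())
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: |A ∩ B| / |A ∪ B|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def is_near_duplicate(t1: str, t2: str, threshold: float = 0.95) -> bool:
    return jaccard(shingles(t1), shingles(t2)) >= threshold

print(is_near_duplicate("Contact me at jane@example.com",
                        "Contact me at jane@example.com!"))  # True
```

Production deduplicators typically replace the exact Jaccard computation with MinHash/LSH so comparisons stay sub-quadratic, but the cutoff semantics are the same.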

Training-data filtering is a core AI-safety control: harmful language, sensitive personal data, representational bias, and poisoned examples can all be encoded at scale and persist through training. Recent literature pushes toward explicit constitutional safety classifiers and stronger data-centric defenses, including Anthropic's constitutional classifier work (2025), DABUF-style data attribution and unlearning methods, and the Wild Patterns Reloaded survey on harmful data artifacts.

Status

Not production-ready. This repository is a technical reference implementation for experimentation and review.

Why this project exists

  • Provide a practical safety filtering stack for pretraining/finetuning data.
  • Make every filtering decision auditable.
  • Evaluate robustness under adversarial evasion attacks.
  • Include fairness and calibration diagnostics by default.
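To make "auditable" concrete, here is one plausible shape for an audit record, written as append-only JSONL. The field names below are assumptions for illustration, not the library's actual schema:

```python
import hashlib
import json

text = "Contact me at jane@example.com"

# Hypothetical audit record for one filtering decision.
record = {
    "id": "2",
    "filter": "PIIDetector",     # which stage made the decision
    "action": "redact",          # e.g. keep / drop / redact
    "categories": ["email"],
    # Store a hash of the input rather than the raw text, so the log
    # itself does not leak the PII it documents.
    "text_sha256": hashlib.sha256(text.encode()).hexdigest(),
}

# Append-only JSONL keeps the log streamable and diff-friendly.
with open("audit.jsonl", "a") as f:
    f.write(json.dumps(record) + "\n")
```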

Core capabilities

  • ToxicityClassifier: multi-label scoring (toxicity, severe_toxicity, obscene, threat, insult, identity_attack) with optional transformer backend and temperature scaling.
  • PIIDetector: regex + optional spaCy NER + contextual PII rules.
  • BiasDetector: embedding diagnostics, representation ratios, counterfactual swapping, optional Fairlearn metrics.
  • PoisonDetector: spectral signatures and DBSCAN anomaly detection.
  • FilterPipeline: cascade architecture with audit logging.
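To give a feel for the regex layer of a PII detector, here is a simplified sketch. The patterns below are deliberately minimal illustrations; real detection needs many more rules (international formats, obfuscated addresses) plus the NER and contextual layers listed above:

```python
import re

# Simplified patterns; category names mirror the example config above.
PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "ssn":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def redact(text: str) -> str:
    """Replace each detected span with its category tag, e.g. [EMAIL]."""
    for name, pat in PATTERNS.items():
        text = pat.sub(f"[{name.upper()}]", text)
    return text

print(redact("Contact me at jane@example.com or 555-123-4567"))
# Contact me at [EMAIL] or [PHONE]
```

Note the ordering matters: the SSN pattern runs before the phone pattern so that `123-45-6789` is tagged as an SSN rather than partially matched as a phone number.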

Honest limitations

  • Cannot verify factual accuracy; the filters match patterns, they do not check facts.
  • Heuristic fallbacks are weaker than task-specific trained classifiers.
  • Poison detection quality depends on embedding/activation quality.
  • Fairness metrics depend on sensitive-attribute annotations or proxies.
  • Evasion coverage in this repo is intentionally limited.
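The spectral-signatures method named under PoisonDetector (and the embedding-quality caveat above) can be sketched without any ML dependency: score each example by its projection onto the top singular direction of the centered representation matrix, since a poisoned cluster tends to concentrate mass along that direction. Below, plain-Python power iteration stands in for an SVD, and the "representations" are toy vectors; in practice they would come from model embeddings or activations:

```python
import random

def top_direction(rows, iters=100):
    """Power iteration for the top eigenvector of X^T X (X centered)."""
    d = len(rows[0])
    mean = [sum(r[j] for r in rows) / len(rows) for j in range(d)]
    X = [[r[j] - mean[j] for j in range(d)] for r in rows]
    v = [random.random() for _ in range(d)]
    for _ in range(iters):
        Xv = [sum(x[j] * v[j] for j in range(d)) for x in X]      # X v
        w = [sum(X[i][j] * Xv[i] for i in range(len(X))) for j in range(d)]  # X^T (X v)
        norm = sum(c * c for c in w) ** 0.5 or 1.0
        v = [c / norm for c in w]
    return v, mean

def spectral_scores(rows):
    """|projection| of each centered row onto the top singular direction."""
    v, mean = top_direction(rows)
    d = len(rows[0])
    return [abs(sum((r[j] - mean[j]) * v[j] for j in range(d))) for r in rows]

random.seed(0)
# 50 "clean" points near the origin, 5 "poisoned" points shifted along one axis.
clean = [[random.gauss(0, 1) for _ in range(8)] for _ in range(50)]
poison = [[random.gauss(6, 1)] + [random.gauss(0, 1) for _ in range(7)] for _ in range(5)]
scores = spectral_scores(clean + poison)
# The poisoned rows (indices 50..54) should dominate the top-direction scores.
flagged = sorted(range(len(scores)), key=lambda i: -scores[i])[:5]
print(sorted(flagged))
```

This is exactly where the limitation bites: if the representations do not separate poisoned from clean examples, the top direction captures benign variance and the scores are uninformative.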

Research context

  • Anthropic (2025), "Constitutional Classifiers: Defending Against Universal Jailbreaks Across Thousands of Hours of Red Teaming".
  • Pan et al. (2025), "Detecting and Filtering Unsafe Training Data via Data Attribution (DABUF)".
  • Cinà et al. (2023), "Wild Patterns Reloaded: A Survey of Machine Learning Security against Training Data Poisoning".

Install

pip install -e .
pip install -e .[ml,dev]

Validate

pytest
python examples/quick_start.py

Docs

About

Auditable training-data safety filtering stack for toxicity, PII, bias, poisoning, and robustness evaluation.
