safehorizon

Personal Digital Twin CLI — turn your own platform data exports into a training corpus, fine-tune a LoRA adapter on Modal's cloud GPUs so it mirrors your communication style, and save the result as a portable .safetensors file.

What it does

  1. Ingests data exports from Google, Instagram, Netflix, WhatsApp, and Spotify (standard ZIP downloads from each platform's privacy page).
  2. Normalises conversations into a deduplicated JSONL training corpus with PII redacted (see the example record after this list).
  3. Trains a QLoRA adapter on Modal using a small (<3B parameter) causal-LM base model.
  4. Saves the adapter as a .safetensors file you can load with PEFT anywhere.
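
For illustration, a single corpus record might look like the line below. The exact field names are an assumption based on the chat-formatted schema used for training, not a documented format:

{"messages": [{"role": "user", "content": "Are we still on for dinner?"}, {"role": "assistant", "content": "Sure, 7pm at the usual place."}], "source": "whatsapp"}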

The trained adapter is typically 5–50 MB — small enough to carry on a USB drive or store in a private git repo.


Installation

pip install safehorizon
# or, from source:
git clone https://github.com/ufukkaraca/safehorizon
cd safehorizon
pip install -e .

Authentication

Modal (required for training)

pip install modal
modal token new          # opens browser, stores token in ~/.modal.toml

HuggingFace (required for gated models like Llama)

export HF_TOKEN=hf_...
# or add it to .env

How to export your data

Platform    Where to go
Google      myaccount.google.com/data-and-privacy → Download your data
Instagram   Settings → Your activity → Download your information
Netflix     netflix.com/account/getmyinfo
WhatsApp    Open a chat → ⋮ → More → Export chat
Spotify     spotify.com/account/privacy → Request data

Google Takeout exports can be split across multiple ZIPs — pass each part separately and safehorizon will merge them automatically.


Quick start

All-in-one (run)

safehorizon run \
  --name "Alice" \
  --whatsapp  ~/exports/WhatsApp\ Chat\ with\ Bob.zip \
  --google    ~/exports/takeout-20240101-001.zip \
  --google    ~/exports/takeout-20240101-002.zip \
  --instagram ~/exports/instagram-alice.zip \
  --netflix   ~/exports/netflix.zip \
  --spotify   ~/exports/my_spotify_data.zip \
  --output    ~/my-twin

This creates:

~/my-twin/
  corpus.jsonl          ← normalised training data
  adapter/
    adapter_config.json
    adapter_model.safetensors   ← your digital twin weights
    tokenizer.json
    …

Step by step

# 1. Parse exports → JSONL
safehorizon ingest \
  --name "Alice" \
  --whatsapp chat.zip \
  --output corpus.jsonl

# 2. Train on Modal → adapter
safehorizon train \
  --corpus corpus.jsonl \
  --output ./adapter

CLI reference

safehorizon ingest

Parse platform exports into a training corpus.

Flag               Default        Description
--name             (required)     Your name as it appears in your messages
--google PATH                     Google Takeout ZIP (repeatable)
--instagram PATH                  Instagram export ZIP (repeatable)
--netflix PATH                    Netflix export ZIP (repeatable)
--whatsapp PATH                   WhatsApp export ZIP (repeatable)
--spotify PATH                    Spotify export ZIP (repeatable)
--output PATH      corpus.jsonl   Output JSONL file
--min-chars INT    40             Drop samples shorter than this
--max-chars INT    8000           Drop samples longer than this
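
For example, to build a corpus from two exports while keeping only medium-length samples (paths and thresholds are illustrative):

safehorizon ingest \
  --name "Alice" \
  --whatsapp ~/exports/chat.zip \
  --google   ~/exports/takeout-001.zip \
  --min-chars 80 \
  --max-chars 4000 \
  --output corpus.jsonl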

safehorizon train

Fine-tune a LoRA adapter on Modal.

Flag                Default                         Description
--corpus PATH       (required)                      JSONL corpus from ingest
--output DIR        ./adapter                       Local adapter output directory
--base-model ID     unsloth/Llama-3.2-1B-Instruct   HuggingFace model ID
--rank INT          16                              LoRA rank r
--alpha INT         2×rank                          LoRA scaling α
--epochs INT        3                               Training epochs
--batch-size INT    4                               Per-device batch size
--grad-accum INT    4                               Gradient accumulation steps
--lr FLOAT          2e-4                            Peak learning rate
--max-seq-len INT   2048                            Max token length per sample
--gpu TYPE          A10G                            Modal GPU type (A10G, A100, …)
--hf-token TOKEN    $HF_TOKEN                       HuggingFace token for gated models
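
For example, to train a higher-rank adapter for a single epoch on an A100 (values are illustrative; all flags are listed above):

safehorizon train \
  --corpus corpus.jsonl \
  --rank 32 \
  --alpha 64 \
  --epochs 1 \
  --gpu A100 \
  --output ./adapter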

safehorizon run

Shorthand for ingest + train in one command (accepts all flags from both).


Loading the adapter

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_model = "unsloth/Llama-3.2-1B-Instruct"
adapter_path = "./my-twin/adapter"

tokenizer = AutoTokenizer.from_pretrained(adapter_path)
model = AutoModelForCausalLM.from_pretrained(base_model)
model = PeftModel.from_pretrained(model, adapter_path)
model.eval()

# Chat with your twin
messages = [{"role": "user", "content": "How was your weekend?"}]
input_ids = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True)
output = model.generate(input_ids, max_new_tokens=200)
print(tokenizer.decode(output[0], skip_special_tokens=True))
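
Optionally, you can merge the adapter into the base weights so the result can be used without PEFT at inference time. Continuing from the snippet above, a minimal sketch using PEFT's merge_and_unload (it assumes the full-precision model fits in memory, and the output directory is illustrative):

# Fold the LoRA weights into the base model and save a standalone copy
merged = model.merge_and_unload()
merged.save_pretrained("./my-twin/merged-model")
tokenizer.save_pretrained("./my-twin/merged-model")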

Architecture

safehorizon/
  ingest/
    archive.py       Split-ZIP assembly
    google.py        Google Takeout parser
    instagram.py     Instagram DM/comments parser
    netflix.py       Netflix viewing history parser
    whatsapp.py      WhatsApp chat export parser
    spotify.py       Spotify streaming history parser
  normalize/
    schema.py        PII redaction + sample filtering
    writer.py        Deduplicating JSONL writer
  train/
    modal_app.py     Modal GPU image + QLoRA training function
    orchestrator.py  Local → Modal call orchestration
  ui/
    progress.py      Rich console + progress bars
  cli.py             Click CLI (ingest / train / run)
  models.py          Shared data types (ChatSample, Turn, …)

Training approach (a configuration sketch follows the list):

  • Base model frozen and quantised to 4-bit NF4 (QLoRA / BitsAndBytes).
  • LoRA adapters injected into all attention + MLP projection matrices.
  • Trained with SFTTrainer (TRL) on the chat-formatted JSONL corpus.
  • Adapter saved in .safetensors format (no arbitrary code execution risk).
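
As a rough illustration of what the Modal training function sets up (not the project's exact code; the quantisation and LoRA settings below mirror the defaults in the CLI reference):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Freeze and quantise the base model to 4-bit NF4 (QLoRA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "unsloth/Llama-3.2-1B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# Inject LoRA adapters into the attention + MLP projection matrices
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,   # 2×rank
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# The wrapped model is then trained with TRL's SFTTrainer on the
# chat-formatted corpus and saved via model.save_pretrained(...)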

Privacy notes

  • All processing of your raw exports happens locally before anything is sent to Modal.
  • PII (email addresses, phone numbers, passwords, API keys) is automatically redacted from training samples (a simplified redaction sketch follows this list).
  • The trained adapter is sent back to your machine; no copy is retained by Modal after the job completes.
  • To permanently delete your twin, delete the .safetensors file — it contains all personalisation.
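
The redaction step lives in normalize/schema.py. The snippet below is not the project's actual logic, just a minimal sketch of how regex-based redaction of emails and phone numbers can work:

import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def redact(text: str) -> str:
    # Replace matches with placeholder tokens rather than dropping the sample
    text = EMAIL.sub("[EMAIL]", text)
    text = PHONE.sub("[PHONE]", text)
    return text

print(redact("Call me on +44 7700 900123 or mail alice@example.com"))
# -> Call me on [PHONE] or mail [EMAIL]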

License

MIT
