# safehorizon

Personal Digital Twin CLI — fine-tune a LoRA adapter that mirrors your communication style from your own platform data exports. Training runs on Modal's cloud GPUs, and the result is saved as a portable `.safetensors` file.
- Ingests data exports from Google, Instagram, Netflix, WhatsApp, and Spotify (standard ZIP downloads from each platform's privacy page).
- Normalises conversations into a deduplicated JSONL training corpus with PII redacted.
- Trains a QLoRA adapter on Modal using a small (<3B parameter) causal-LM base model.
- Saves the adapter as a `.safetensors` file you can load with PEFT anywhere.
The trained adapter is typically 5–50 MB — small enough to carry on a USB drive or store in a private git repo.
## Installation

```bash
pip install safehorizon

# or, from source:
git clone https://github.com/yourname/safehorizon
cd safehorizon
pip install -e .
```

Training runs on Modal, so install the Modal client and authenticate, and set a HuggingFace token (needed for gated base models):

```bash
pip install modal
modal token new          # opens browser, stores token in ~/.modal.toml
export HF_TOKEN=hf_...   # or add it to .env
```

## Getting your data

| Platform | Where to go |
|---|---|
| Google | myaccount.google.com/data-and-privacy → Download your data |
| Instagram | Settings → Your activity → Download your information |
| Netflix | netflix.com/account/getmyinfo |
| WhatsApp | Open a chat → ⋮ → More → Export chat |
| Spotify | spotify.com/account/privacy → Request data |
Google Takeout exports can be split across multiple ZIPs — pass each part separately and safehorizon will merge them automatically.
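The multi-part merge can be pictured as a union over the parts' members; a minimal sketch, assuming (as the name of `ingest/archive.py` suggests) that parts are simply combined, with later parts winning on duplicate paths:

```python
import io
import zipfile

# Illustrative split-export merging (safehorizon's archive.py may differ):
# read every ZIP part and union their members into one {name: bytes} map.
def merge_parts(parts):
    members = {}
    for part in parts:
        with zipfile.ZipFile(part) as zf:
            for name in zf.namelist():
                members[name] = zf.read(name)  # later parts overwrite
    return members

def make_part(files):
    # Build an in-memory ZIP "part" for demonstration.
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name, data in files.items():
            zf.writestr(name, data)
    buf.seek(0)
    return buf

part1 = make_part({"Takeout/Chat/messages.json": b"[]"})
part2 = make_part({"Takeout/Mail/inbox.mbox": b""})
print(sorted(merge_parts([part1, part2])))  # members of both parts, once each
```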
## Quick start

```bash
safehorizon run \
  --name "Alice" \
  --whatsapp ~/exports/WhatsApp\ Chat\ with\ Bob.zip \
  --google ~/exports/takeout-20240101-001.zip \
  --google ~/exports/takeout-20240101-002.zip \
  --instagram ~/exports/instagram-alice.zip \
  --netflix ~/exports/netflix.zip \
  --spotify ~/exports/my_spotify_data.zip \
  --output ~/my-twin
```

This creates:

```
~/my-twin/
  corpus.jsonl                  ← normalised training data
  adapter/
    adapter_config.json
    adapter_model.safetensors   ← your digital twin weights
    tokenizer.json
    …
```
You can also run the two stages separately:

```bash
# 1. Parse exports → JSONL
safehorizon ingest \
  --name "Alice" \
  --whatsapp chat.zip \
  --output corpus.jsonl

# 2. Train on Modal → adapter
safehorizon train \
  --corpus corpus.jsonl \
  --output ./adapter
```

## CLI reference

### `safehorizon ingest`

Parse platform exports into a training corpus.
| Flag | Default | Description |
|---|---|---|
| `--name` | (required) | Your name as it appears in your messages |
| `--google PATH` | — | Google Takeout ZIP (repeatable) |
| `--instagram PATH` | — | Instagram export ZIP (repeatable) |
| `--netflix PATH` | — | Netflix export ZIP (repeatable) |
| `--whatsapp PATH` | — | WhatsApp export ZIP (repeatable) |
| `--spotify PATH` | — | Spotify export ZIP (repeatable) |
| `--output PATH` | `corpus.jsonl` | Output JSONL file |
| `--min-chars INT` | `40` | Drop samples shorter than this |
| `--max-chars INT` | `8000` | Drop samples longer than this |
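To make the `--min-chars`/`--max-chars` and deduplication behaviour concrete, here is a minimal sketch of what the ingest stage's filtering writer could look like. This is illustrative only — the `{"text": ...}` schema and the exact-duplicate hashing are assumptions, not safehorizon's confirmed implementation:

```python
import hashlib
import json

# Sketch of ingest filtering + deduplication (assumed behaviour, the real
# writer may differ): apply the --min-chars/--max-chars bounds, then drop
# exact duplicates by content hash before writing JSONL.
def write_corpus(samples, path, min_chars=40, max_chars=8000):
    seen = set()
    kept = 0
    with open(path, "w", encoding="utf-8") as f:
        for text in samples:
            if not (min_chars <= len(text) <= max_chars):
                continue  # outside the length bounds, drop
            digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
            if digest in seen:
                continue  # exact duplicate, drop
            seen.add(digest)
            f.write(json.dumps({"text": text}, ensure_ascii=False) + "\n")
            kept += 1
    return kept

if __name__ == "__main__":
    import os
    import tempfile
    path = os.path.join(tempfile.mkdtemp(), "corpus.jsonl")
    n = write_corpus(["hello " * 10, "hello " * 10, "hi"], path)
    print(n)  # 1: the duplicate and the too-short sample are dropped
```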
### `safehorizon train`

Fine-tune a LoRA adapter on Modal.
| Flag | Default | Description |
|---|---|---|
| `--corpus PATH` | (required) | JSONL corpus from `ingest` |
| `--output DIR` | `./adapter` | Local adapter output directory |
| `--base-model ID` | `unsloth/Llama-3.2-1B-Instruct` | HuggingFace model ID |
| `--rank INT` | `16` | LoRA rank r |
| `--alpha INT` | 2×rank | LoRA scaling α |
| `--epochs INT` | `3` | Training epochs |
| `--batch-size INT` | `4` | Per-device batch size |
| `--grad-accum INT` | `4` | Gradient accumulation steps |
| `--lr FLOAT` | `2e-4` | Peak learning rate |
| `--max-seq-len INT` | `2048` | Max token length per sample |
| `--gpu TYPE` | `A10G` | Modal GPU type (A10G, A100, …) |
| `--hf-token TOKEN` | `$HF_TOKEN` | HuggingFace token for gated models |
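For orientation, here is how these flags could plausibly map onto a PEFT `LoraConfig`; the target-module list is an assumption for a Llama-style model, not safehorizon's confirmed setting. Note that `--alpha` defaults to twice `--rank`:

```python
# Hypothetical flag → LoRA-config mapping (safehorizon's actual mapping
# may differ). --alpha defaults to 2 × --rank.
def lora_kwargs(rank=16, alpha=None):
    return {
        "r": rank,
        "lora_alpha": alpha if alpha is not None else 2 * rank,
        # "all attention + MLP projection matrices" on a Llama-style model:
        "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj",
                           "gate_proj", "up_proj", "down_proj"],
        "task_type": "CAUSAL_LM",
    }

print(lora_kwargs()["lora_alpha"])        # defaults: r=16 → alpha 32
print(lora_kwargs(rank=8)["lora_alpha"])  # alpha follows the rank → 16
```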
### `safehorizon run`

Shorthand for `ingest` + `train` in one command (accepts all flags from both).
## Using your twin

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_model = "unsloth/Llama-3.2-1B-Instruct"
adapter_path = "./my-twin/adapter"

tokenizer = AutoTokenizer.from_pretrained(adapter_path)
model = AutoModelForCausalLM.from_pretrained(base_model)
model = PeftModel.from_pretrained(model, adapter_path)
model.eval()

# Chat with your twin
messages = [{"role": "user", "content": "How was your weekend?"}]
input_ids = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True)
output = model.generate(input_ids, max_new_tokens=200)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

## Project layout

```
safehorizon/
  ingest/
    archive.py        Split-ZIP assembly
    google.py         Google Takeout parser
    instagram.py      Instagram DM/comments parser
    netflix.py        Netflix viewing history parser
    whatsapp.py       WhatsApp chat export parser
    spotify.py        Spotify streaming history parser
  normalize/
    schema.py         PII redaction + sample filtering
    writer.py         Deduplicating JSONL writer
  train/
    modal_app.py      Modal GPU image + QLoRA training function
    orchestrator.py   Local → Modal call orchestration
  ui/
    progress.py       Rich console + progress bars
  cli.py              Click CLI (ingest / train / run)
  models.py           Shared data types (ChatSample, Turn, …)
```
## How it works

Training approach:

- Base model frozen and quantised to 4-bit NF4 (QLoRA / BitsAndBytes).
- LoRA adapters injected into all attention + MLP projection matrices.
- Trained with SFTTrainer (TRL) on the chat-formatted JSONL corpus.
- Adapter saved in `.safetensors` format (no arbitrary-code-execution risk).
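This is also why the adapter stays in the 5–50 MB range: only the low-rank factors are trainable. A back-of-envelope count for a single projection matrix (the 2048-dimension figure below is illustrative, not tied to any specific base model):

```python
# LoRA adds A (d_in × r) and B (r × d_out) next to each frozen
# d_in × d_out projection, so r * (d_in + d_out) new parameters per matrix.
def lora_params(d_in, d_out, r=16):
    return r * (d_in + d_out)

full = 2048 * 2048                 # one square projection, illustrative size
added = lora_params(2048, 2048)    # default rank r=16
print(f"{added:,} LoRA params vs {full:,} frozen ({added / full:.2%})")
# → 65,536 LoRA params vs 4,194,304 frozen (1.56%)
```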
## Privacy

- All processing of your raw exports happens locally before anything is sent to Modal.
- PII (email addresses, phone numbers, passwords, API keys) is automatically redacted from training samples.
- The trained adapter is sent back to your machine; no copy is retained by Modal after the job completes.
- To permanently delete your twin, delete the `.safetensors` file — it contains all personalisation.
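The redaction pass can be pictured as a regex sweep over each sample. The two patterns below are illustrative only; safehorizon's real rules (in `normalize/schema.py`) are presumably broader, e.g. also covering passwords and API keys:

```python
import re

# Illustrative PII patterns, not safehorizon's actual rules.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

def redact(text):
    # Replace every matched span with a bracketed placeholder.
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(redact("Reach me at alice@example.com or +1 555-123-4567."))
# → Reach me at [EMAIL] or [PHONE].
```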
## License

MIT