Takeout Tool

Data timeline exploration tool for Google Takeout and other social media archives, forked and heavily modified from Google's Timesketch.

Instead of uploading sensitive files to a server, this project processes everything locally by outsourcing parsing to Pyodide, a port of CPython to WebAssembly that runs a full Python environment to the web browser.

Quickstart

Prerequisites

Node.js 18+
yarn

Setup

cd webapp
yarn install
yarn serve

The frontend usually runs on http://localhost:5001

Architecture

UI (JavaScript): The user selects a platform (e.g., Facebook) and drops a ZIP file into the browser.
Unzip (JavaScript): JS unzips the uploaded file in local browser storage and discards files not mentioned in the platform schemas.
Parsing engine (Python/Pyodide): JavaScript passes each remaining file string AND the schema YAML to functions in python_core/. Python handles all the data cleaning and field name standardization.
Storage (SQLite/OPFS): Python/Pyodide extracts raw data into SQLite, which is persisted in the browser's Origin Private File System (OPFS). Data is normalized and stored in structured tables for querying.
UI (JavaScript): UI elements are rendered from the local SQLite database.

The way I like to think about it, the Python/Pyodide engine pretends to be the "server" in the classic client-server model, even though the data doesn't leave the local machine.

Understanding and writing new schema

The platform YAML files in manifests/ (i.e., apple.yaml) provide instructions for the Python engine to map data exports into (a slightly modified version of) the Elastic Common Schema (ECS). The goal of this is to minimize how much of the Python engine in python_core we have to rewrite if we want to add support for a new platform or if the platforms change file formats.

A manifest has two main sections:

files: physical files to parse (inputs)
views: specs for the logical events/states extracted from those files (outputs).

(1) File Sources

Define the physical file once, even if it contains multiple types of events.

files:
  - id: "insta_devices"                 # Unique reference ID        
    path: "path/to/devices.json"        # Relative path in the ZIP
    parser:
      format: "json"
      json_root: "devices_devices[]"
      drop_duplicates:                  # optional deduplication, follows pandas conventions
        subset: ["Device ID"]
        keep: "last"
  - id: ....

The attributes under parser: describe the shape of the file and how the engine should parse it.

format:
- json: standard JSON, ok for nested dictionaries within dictionaries, but not nested lists.
- json_label_values: Special Meta format where data is stored in lists of {label: "Key", value: "Val"}.
- jsonl: newline delimited JSON format, common in Discord
- csv
- csv_multi: a very odd file that contains multiple CSV sections separated by titles and newlines, common in Apple
json_root: (json/json_label_values/jsonl only; optional): The path to the list of objects you want to parse. Use [] to denote a list (e.g., account_activity_v2[]).
drop_duplicates: (optional) logic to clean data at the source level. Follows pandas conventions. Requires:
- subset: list of columns to check
- keep: first, last, or row_completeness (keep the row with the most non-null row entries)

(2) Views

A view defines how to transform a source into a strem of events/states (vaguely using ECS conventions). You can have multiple views for a single source (e.g., one for "Logins" and one for "Logouts").

views:
  - file: 
      id: "insta_devices"           # Must match a file ID
      where: {source: "status", op: "==", value: "active"}.  # Filtering (optional) only rows that match this condition

    # hardcoded ECS values for every event in this view.
    static:
      event.kind: "asset"                           # 'event' (action) or 'asset' (state)
      event.category: ["authentication", "host"]    # follow ECS conventions
      event.type: ["info"]                          # follow ECS conventions
      entity.type: "authenticated_device"
    
    # dynamic mappings
    fields:
      - {target: "entity.last_seen_timestamp", source: "'Last Login'.timestamp", type: "datetime"}
      - {target: "user_agent.original", source: "'User Agent'.value", type: "string"}

file attributes:

where selects only rows that match this condition
- Simple: i.e., {source: "event_type", op: "==", value: "login"}
- Complex: Uses logic: any or all with a conditions list.
  - Supports operators: ==, startswith, contains, endswith, !=.
  - Add more supported operators in webapp/src/pyparser/base.py.

path traversal for source keys

JSON (and jsonl) source fields support path traversal for nested dictionaries via dot notation and bracket notation. Use single quotes for keys containing spaces, e.g.:

session.ip_address
'Device ID details'.first_seen_time
push_tokens[0].id: Gets the id of the first item in the list.

Static attributes:

hardcoded ECS (Elastic Common Schema) values for every event in this view.

event.kind
- event: points in time (Logins, Messages, Clicks).
- asset: inventory items (Devices, Contacts). use this for "State" snapshots.
  - for asset rows, map specific entity identifiers (see Custom Fields below).

Dynamic field mappings:

Every very item must have target (the standardized ECS field name) and a source (the raw data key in the file), and optionally a type and transform. Path traversal for source fields follows conventions above.

If transform is set to "coalesce" (or if source is a list), the engine will pick the first non-null value.

e.g., { ... source: ["push_tokens[0].id", "family_id"], transform: "coalesce"}

Custom ECS Fields currently used

ECS isn't really built for this use case, so we have a couple of custom fields that we're consistently using at the moment

device.id.[platform]: internal device fingerprint (not user ID) for a specific platform --> also + .ad
device.given_name: user-defined nickname for device (e.g., "Bob's iPhone"), common in Apple
device.imei: International Mobile Equipment Identity
device.meid: Mobile Equipment Identifier
entity.type: for assets. use value "authenticated_device" for trusted device lists
entity.[first/last]_seen_timestamp: for assets, instead of @timestamp
url.title: for title of web page associated with a search/website
client.session.id and client.session.type
user.email.new and user.email.old
device.screen_resolution

Name		Name	Last commit message	Last commit date
Latest commit History 50 Commits
.github/workflows		.github/workflows
UA-Extract-purepy @ 88a25b7		UA-Extract-purepy @ 88a25b7
manifests		manifests
python_core		python_core
supplementary_material_screenings		supplementary_material_screenings
tests		tests
webapp		webapp
.gitattributes		.gitattributes
.gitignore		.gitignore
.gitmodules		.gitmodules
LICENSE		LICENSE
README.md		README.md
config.yaml		config.yaml
package-lock.json		package-lock.json
package.json		package.json
schema.sql		schema.sql

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Takeout Tool

Quickstart

Architecture

Understanding and writing new schema

(1) File Sources

(2) Views

file attributes:

path traversal for source keys

Static attributes:

Dynamic field mappings:

Custom ECS Fields currently used

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Takeout Tool

Quickstart

Architecture

Understanding and writing new schema

(1) File Sources

(2) Views

file attributes:

path traversal for source keys

Static attributes:

Dynamic field mappings:

Custom ECS Fields currently used

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages