"Hidden in Plain Bytes" Source Code
This repository contains code and documentation for processing data exports from major online platforms, used in the research paper:
Julia Nonnenkamp, Naman Gupta, Abhimanyu Dev Gupta, and Rahul Chatterjee. 2025. Hidden in Plain Bytes: Investigating Interpersonal Account Compromise with Data Exports. In Proceedings of the 2025 ACM SIGSAC Conference on Computer and Communications Security (CCS '25), October 13–17, 2025, Taipei, Taiwan. ACM, 1–14. DOI: 10.1145/3719027.3765147.
This repository processes 12 data exports obtained under "right of access" requests (GDPR/CCPA) from 6 major online platforms: Apple, Discord, Facebook, Google, Instagram, and Snapchat. The data was collected to investigate and identify interpersonal account compromise patterns using real-world data exports.
The processed dataset is available on Zenodo with controlled access due to the sensitive nature of Tech-Facilitated Abuse (TFA) research. Access is restricted to prevent misuse.
To access the data:
- Request access via the Zenodo repository
- Download the dataset files: `sam_january_cleaned.zip` and `sam_february_cleaned.zip`
- Extract both ZIP files recursively in your working directory before running the `transform.ipynb` notebook.
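Because the exports nest ZIP archives inside ZIP archives, "extract recursively" means unpacking every inner archive as well. A minimal helper along these lines (not part of this repository) sketches the idea:

```python
import zipfile
from pathlib import Path

def extract_recursively(zip_path: Path, dest: Path) -> None:
    """Extract zip_path into dest, then extract any ZIPs found inside it."""
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(dest)
    # Snapshot the inner ZIPs before recursing, so newly extracted
    # files do not perturb the iteration.
    for inner in list(dest.rglob("*.zip")):
        inner_dest = inner.with_suffix("")  # e.g. foo.zip -> foo/
        if not inner_dest.exists():
            extract_recursively(inner, inner_dest)

# Example:
# extract_recursively(Path("sam_january_cleaned.zip"), Path("sam_january_cleaned"))
```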
- Python >=3.12
- Git
- Clone this repository:

  ```
  git clone https://github.com/WISPR-lab/data-exports-tfa.git
  cd data-exports-tfa
  ```

- Create and activate a virtual environment:

  ```
  python -m venv .venv313
  source .venv313/bin/activate  # on Windows: .venv313\Scripts\activate
  ```

- Install dependencies:

  ```
  pip install -r requirements.txt
  ```
- Prepare your data: Ensure the dataset ZIP files are extracted in your working directory
- Open the analysis notebook: Launch `transform.ipynb` in Jupyter or VS Code
- Configure paths: Follow the instructions in the notebook to set your data input and output paths
- Run the analysis: Execute the notebook cells to reproduce the data transformation and analysis
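If you prefer a headless run, the notebook can also be executed end-to-end from the command line with nbconvert (assuming Jupyter is available in your environment; the output filename here is illustrative):

```shell
jupyter nbconvert --to notebook --execute transform.ipynb --output transform_executed.ipynb
```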
The main analysis is contained in `transform.ipynb`, which:
- Parses raw data exports from multiple platforms
- Transforms nested JSON/CSV structures into flattened DataFrames
- Groups similar data elements across platforms according to characteristic features (see Section 3 of the paper)
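The flattening step can be pictured with pandas' `json_normalize`, which turns nested dictionaries into dotted column names. A minimal sketch (the record shape and field names are illustrative, not the actual export schema):

```python
import pandas as pd

# Illustrative nested records, loosely modeled on platform export entries;
# the field names are made up for this sketch.
records = [
    {"event": "login",
     "timestamp": "2024-01-05T10:00:00Z",
     "device": {"type": "mobile", "os": "iOS"}},
    {"event": "login",
     "timestamp": "2024-01-06T11:30:00Z",
     "device": {"type": "desktop", "os": "Windows"}},
]

# json_normalize flattens the nested "device" dict into
# "device.type" and "device.os" columns.
df = pd.json_normalize(records)
print(sorted(df.columns))  # → ['device.os', 'device.type', 'event', 'timestamp']
```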
The `scripts/` directory contains modular utilities:
- `parse.py`: Core parsing and data transformation logic
- `group_utils.py`: Functions for clustering and grouping similar data elements
- `timeutils.py`: Time-related parsing utilities
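Platform exports tend to record timestamps in mixed conventions (epoch seconds, epoch milliseconds, ISO 8601 strings). A normalizer along these lines illustrates the idea; the function name and heuristics are hypothetical, not the actual `timeutils.py` API:

```python
from datetime import datetime, timezone

def to_utc(value):
    """Normalize a timestamp (epoch secs/ms or ISO 8601 string) to aware UTC."""
    if isinstance(value, (int, float)):
        # Heuristic: epoch-second values this large would be far in the
        # future, so treat anything above 1e11 as milliseconds.
        if value > 1e11:
            value /= 1000.0
        return datetime.fromtimestamp(value, tz=timezone.utc)
    # ISO 8601 string; map a trailing "Z" to an explicit UTC offset.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))
```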
For extended documentation of the parser utilities and data structures, see `appendix.md`.
- Code: Released under the MIT License
- Paper: Distributed under a Creative Commons NonCommercial-ShareAlike license
- Dataset: Available under a CC International license with controlled access
This research was conducted at the University of Wisconsin–Madison.