Information-theoretic evaluation of synthetic, simulated, and ground-truth data, with a focus on fidelity metrics (e.g., Jensen–Shannon divergence, MMD, classifier two-sample tests) across multiple data modalities.
- B. Ismalej, X. Ruan, X. Jiang, “Evaluating Privacy and Utility of Synthetic Tabular Data with Membership Inference Attacks”. To appear in IEEE Future Machine Learning and Data Science (FMLDS) 2025, Nov 2-5, 2025, Los Angeles, CA, USA.
- More context + talk slides: Project page on my website
- Core experiment pipelines (CLI): under experiments/
  - experiments/adult_tabular: GT vs SIM vs SYN (SDV CTGAN/TVAE) on UCI Adult.
  - experiments/time_series_har: GT vs SIM (DFM-mosaic) on UCI HAR inertial signals.
  - experiments/bank_marketing: currently mirrors the Adult pipeline (see note below).
- Shared utilities: under common/ (metrics, I/O, seeding, visualization).
- Docs: LaTeX writeups (and built PDFs) under docs/, documenting methodology and exact run commands.
- Tutorial notebook: notebooks/synthetic_data_tutorial.ipynb (intro-style SDV walkthrough).

Run an experiment from the repo root:
python -m experiments.adult_tabular.run \
--seed 42 \
--n_eval 5000 \
--sim_mode gaussian_copula \
--synth ctgan \
--epochs 300 \
--batch_size 500 \
--pac 10 \
--out results_ctgan.json

This writes:
- Figures to experiments/adult_tabular/figures/
- Results JSON to experiments/adult_tabular/<out>

python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

- Entry point: python -m experiments.adult_tabular.run
- Key flags:
  - --sim_mode {independent,gaussian_copula}
  - --synth {ctgan,tvae}
  - --epochs, --batch_size, --pac (CTGAN requires batch_size % pac == 0)
  - --n_eval (size for SIM/SYN sampling and GT reference)
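
The CTGAN divisibility constraint on --batch_size and --pac can be checked before any training starts. The following is a minimal sketch, assuming an argparse-based CLI: the function names, the defaults (copied from the example command, not confirmed as the real defaults), and the error message are illustrative assumptions rather than the repository's actual code.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser mirroring the documented flags.
    # Defaults are taken from the example command and are assumptions.
    p = argparse.ArgumentParser(prog="experiments.adult_tabular.run")
    p.add_argument("--seed", type=int, default=42)
    p.add_argument("--n_eval", type=int, default=5000)
    p.add_argument("--sim_mode", choices=["independent", "gaussian_copula"],
                   default="gaussian_copula")
    p.add_argument("--synth", choices=["ctgan", "tvae"], default="ctgan")
    p.add_argument("--epochs", type=int, default=300)
    p.add_argument("--batch_size", type=int, default=500)
    p.add_argument("--pac", type=int, default=10)
    p.add_argument("--out", default="results.json")  # hypothetical default
    return p


def parse_args(argv=None) -> argparse.Namespace:
    parser = build_parser()
    args = parser.parse_args(argv)
    # CTGAN groups samples into packs of size `pac`, so the batch size
    # must be an exact multiple of it. TVAE is not subject to this check.
    if args.synth == "ctgan" and args.batch_size % args.pac != 0:
        parser.error(
            f"--batch_size ({args.batch_size}) must be divisible by "
            f"--pac ({args.pac}) when --synth ctgan is used"
        )
    return args
```

With this check, `--batch_size 500 --pac 10` is accepted, while `--batch_size 500 --pac 7` exits with a usage error instead of failing later inside the synthesizer.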
- Entry point: python -m experiments.time_series_har.run
- Example:

python -m experiments.time_series_har.run \
--seed 42 \
--n_sim 4000 \
--sg_window 15 --sg_poly 3 \
--segments_min 3 --segments_max 6 \
--seglen_min_frac 0.15 --seglen_max_frac 0.35 \
--residual_scale 1.0 \
--reconstruct mul \
--out results_sim.json

Outputs:
- experiments/time_series_har/figures/ (global overlays + per-class grids)
- experiments/time_series_har/<out> (results JSON)

- common/: shared utilities
  - metrics.py: JS divergence (histogram), RBF-MMD, C2ST AUC
  - io.py: JSON writing + runtime metadata helpers
  - sampling.py: global seeding
  - viz.py: plotting helpers used by experiments
- experiments/: reproducible CLI pipelines
- docs/: LaTeX/PDF writeups
- notebooks/: standalone tutorial notebook
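
To make the three fidelity metrics listed for metrics.py concrete, here is a rough, dependency-free sketch of each. It is not the code in common/metrics.py: the function names, the bin count, the kernel bandwidth `gamma`, and the nearest-centroid classifier used for the C2ST are all assumptions chosen to keep the example short.

```python
import math
import random


def js_divergence_hist(x, y, bins=20):
    """JS divergence (base 2, in [0, 1]) between histograms of two 1-D samples."""
    lo, hi = min(min(x), min(y)), max(max(x), max(y))
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if all values are equal

    def hist(values):
        counts = [0] * bins
        for v in values:
            counts[min(int((v - lo) / width), bins - 1)] += 1
        return [c / len(values) for c in counts]

    p, q = hist(x), hist(y)
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(a, b):
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def rbf_mmd2(X, Y, gamma=1.0):
    """Biased estimate of squared MMD with k(a, b) = exp(-gamma * ||a - b||^2)."""
    def k(a, b):
        return math.exp(-gamma * sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

    def mean_k(A, B):
        return sum(k(a, b) for a in A for b in B) / (len(A) * len(B))

    return mean_k(X, X) + mean_k(Y, Y) - 2 * mean_k(X, Y)


def auc(scores_pos, scores_neg):
    """Probability that a positive outscores a negative (ties count as 0.5)."""
    wins = sum((sp > sn) + 0.5 * (sp == sn)
               for sp in scores_pos for sn in scores_neg)
    return wins / (len(scores_pos) * len(scores_neg))


def c2st_auc(X, Y, seed=0):
    """C2ST sketch: fit a nearest-centroid classifier on one half of each
    sample and report its AUC on the held-out half."""
    rng = random.Random(seed)
    X, Y = list(X), list(Y)
    rng.shuffle(X)
    rng.shuffle(Y)
    hx, hy = len(X) // 2, len(Y) // 2

    def centroid(rows):
        return [sum(col) / len(rows) for col in zip(*rows)]

    cx, cy = centroid(X[:hx]), centroid(Y[:hy])

    def score(r):
        # Higher score means the row lies closer to the Y centroid.
        dx = sum((a - b) ** 2 for a, b in zip(r, cx))
        dy = sum((a - b) ** 2 for a, b in zip(r, cy))
        return dx - dy

    return auc([score(r) for r in Y[hy:]], [score(r) for r in X[hx:]])
```

For all three metrics, lower values indicate higher fidelity: a JS divergence or MMD near 0 and a C2ST AUC near 0.5 mean the two samples are hard to tell apart, while an AUC near 1.0 means a classifier separates them easily.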