This repository contains the POC implementation for the EuroMLSys 2026 paper "Block-Aware Distributed Data Pipelines for Out-of-Core Tabular Machine Learning". It provides the components for distributed preprocessing and data loading for DNN training on large tabular datasets.
- src/cacheloader/: cache-aware DataLoader.
- mneme_torc/: distributed Mneme for preprocessing.
- examples/: runnable examples and launch scripts.
- scripts/: dataset generation utilities.
- requirements.txt, pyproject.toml: packaging and dependencies.
- install.sh, mpi.cfg: installation and MPI configuration.
- Python >= 3.10
- MPI runtime (preferably MPICH).
- A CUDA-enabled PyTorch build if you plan to use GPU acceleration.
The project uses torcpy for distributed execution, which depends on mpi4py. For reproducible builds, configure mpi4py to point to your MPI installation through mpi.cfg, then run the provided script.
- Update mpi.cfg with the path to your MPI installation (for example, /opt/mpich).
- Adjust the version of torch inside install.sh according to your CUDA version.
- Run the installer:
./install.sh
install.sh creates a virtual environment, configures mpi4py to use the MPI compiler wrappers, installs the project in editable mode, and pins CUDA 12.1-compatible PyTorch wheels.
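After installation, a short check can confirm that the interpreter sees the prerequisites listed above. The following is a minimal sketch rather than part of the repository: the function name check_environment is hypothetical, and it only reports what it finds without changing anything.

```python
import importlib.util
import shutil
import sys


def check_environment():
    """Report which prerequisites are visible from the current interpreter."""
    report = {
        "python_ok": sys.version_info >= (3, 10),
        # mpiexec is the launcher name shipped by MPICH and Open MPI.
        "mpiexec_path": shutil.which("mpiexec"),
        "mpi4py_installed": importlib.util.find_spec("mpi4py") is not None,
        "torch_installed": importlib.util.find_spec("torch") is not None,
        "mpi_library": None,
        "cuda_available": None,
    }
    if report["mpi4py_installed"]:
        try:
            from mpi4py import MPI  # importing this module initializes MPI
            report["mpi_library"] = MPI.Get_library_version().strip()
        except ImportError:
            # mpi4py is present but could not load its MPI library.
            report["mpi_library"] = None
    if report["torch_installed"]:
        try:
            import torch
            report["cuda_available"] = torch.cuda.is_available()
        except ImportError:
            report["cuda_available"] = None
    return report


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value}")
```

If mpi_library names a different MPI implementation than the one configured in mpi.cfg, mpi4py was probably built against another installation and should be reinstalled.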
Generate a synthetic dataset with the helper script:
./scripts/create_ds.sh
This calls scripts/create_csv_large_scale.py with defaults of 16M samples and 700 numerical features. Adjust the arguments in the script to match your scale or feature configuration.
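For a quick smoke test at a much smaller scale, a stand-in generator can be useful. The sketch below is not the repository's create_csv_large_scale.py; the function name, the column naming (f0, f1, ...), and the standard-normal values are all assumptions made for illustration. Rows are written one at a time, so memory use stays constant regardless of the sample count.

```python
import csv
import random


def write_synthetic_csv(path, num_samples, num_features, seed=0):
    """Write a CSV of purely numerical features, streaming row by row."""
    rng = random.Random(seed)  # seeded so repeated runs produce the same file
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        # Hypothetical header layout: f0, f1, ..., f{num_features - 1}
        writer.writerow([f"f{i}" for i in range(num_features)])
        for _ in range(num_samples):
            writer.writerow(
                [f"{rng.gauss(0.0, 1.0):.6f}" for _ in range(num_features)]
            )
    return num_samples


if __name__ == "__main__":
    # Tiny configuration for a local test, far below the 16M x 700 default.
    write_synthetic_csv("tiny_dataset.csv", num_samples=1_000, num_features=8)
```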
The examples demonstrate the preprocessing fit stage and the expected launch patterns for distributed runs.
- examples/fit_example.py: distributed preprocessing fit using Mneme through a preprocessor pipeline construct.
- examples/launch_fit.sh: MPI launcher script for the fit example. Update MPI_EXEC, the hostfile, and the input file paths (the dataset and the cached offsets, if they exist) before running.
- examples/loading_example.py: data loading example using CacheLoader across one or more GPUs.
- examples/launch_loading.sh: torchrun launcher script for the loading example.
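When a script is started with torchrun, each process receives its position in the job through the RANK, LOCAL_RANK, and WORLD_SIZE environment variables. The sketch below shows one generic way a multi-GPU loader could divide data blocks among processes using those values. It is an illustration only: how CacheLoader actually assigns blocks is not described here, and the helper names rank_info and blocks_for_rank are hypothetical.

```python
import os


def rank_info(env=None):
    """Read the process layout that torchrun exports; defaults cover a single process."""
    env = os.environ if env is None else env
    rank = int(env.get("RANK", 0))
    local_rank = int(env.get("LOCAL_RANK", 0))
    world_size = int(env.get("WORLD_SIZE", 1))
    return rank, local_rank, world_size


def blocks_for_rank(num_blocks, rank, world_size):
    """Return a contiguous, near-equal range of block indices for one rank.

    The first (num_blocks % world_size) ranks receive one extra block.
    """
    base, extra = divmod(num_blocks, world_size)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return range(start, stop)


if __name__ == "__main__":
    rank, local_rank, world_size = rank_info()
    mine = blocks_for_rank(num_blocks=10, rank=rank, world_size=world_size)
    print(f"rank {rank} (local {local_rank}) of {world_size}: blocks {list(mine)}")
```

A contiguous split keeps each process reading neighbouring regions of the file, which tends to suit block-oriented storage; a strided split would be an equally valid choice for other access patterns.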
- requirements.txt lists the Python dependencies; torch is installed separately to match your CUDA runtime.
- mpi.cfg controls how mpi4py is built. If you move your MPI installation, update mpi_dir and reinstall.
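Before reinstalling, it can help to confirm that the paths derived from mpi_dir actually exist. The sketch below parses an INI-style configuration with the standard configparser module. The section name [mpi] and the mpicc and mpicxx keys are assumptions modelled on the template that mpi4py distributes; the mpi.cfg in this repository may be laid out differently, so adjust the names to match it.

```python
import configparser
import os

# Hypothetical contents; only mpi_dir is named in this README.
SAMPLE_CFG = """
[mpi]
mpi_dir = /opt/mpich
mpicc = %(mpi_dir)s/bin/mpicc
mpicxx = %(mpi_dir)s/bin/mpicxx
"""


def read_mpi_paths(cfg_text, section="mpi"):
    """Resolve the MPI paths, expanding %(mpi_dir)s references."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    return {key: parser.get(section, key) for key in ("mpi_dir", "mpicc", "mpicxx")}


def missing_paths(paths):
    """Return the configured entries that do not exist on this machine."""
    return [key for key, value in paths.items() if not os.path.exists(value)]


if __name__ == "__main__":
    paths = read_mpi_paths(SAMPLE_CFG)
    for key, value in paths.items():
        print(f"{key} = {value}")
    print("missing:", missing_paths(paths))
```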
If you find this codebase helpful for your research, we kindly ask that you cite our EuroMLSys 2026 paper, "Block-Aware Distributed Data Pipelines for Out-of-Core Tabular Machine Learning".