Block-Aware Distributed Data Pipelines

This repository contains the proof-of-concept (POC) implementation for the EuroMLSys 2026 paper "Block-Aware Distributed Data Pipelines for Out-of-Core Tabular Machine Learning". It provides the components for distributed preprocessing and data loading for DNN training on large tabular datasets.

Repository layout

src/
- cacheloader/: Cache-aware DataLoader.
- mneme_torc/: Mneme for distributed preprocessing.
examples/: runnable examples and launch scripts.
scripts/: dataset generation utilities.
requirements.txt, pyproject.toml: packaging and dependencies.
install.sh, mpi.cfg: installation and MPI configuration.

Prerequisites

Python >= 3.10
MPI runtime (preferably MPICH).
A CUDA-enabled PyTorch build if you plan to use GPU acceleration.

Installation

The project uses torcpy for distributed execution, which depends on mpi4py. For reproducible builds, configure mpi4py to point to your MPI installation through mpi.cfg, then run the provided script.

Update mpi.cfg with the path to your MPI installation (for example, /opt/mpich).
Adjust the torch version inside install.sh to match your CUDA version.
Run the installer:
```
./install.sh
```

install.sh creates a virtual environment, configures mpi4py to use the MPI compiler wrappers, installs the project in editable mode, and pins CUDA 12.1 compatible PyTorch wheels.
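After the installer finishes, a quick sanity check of the environment can save a failed launch later. The following is a minimal sketch, not part of the repository: `check_environment` is a hypothetical helper, and the import names `mpi4py`, `torch`, and `torcpy` are assumed to match the installed distributions. It only tests whether the packages are importable and whether the interpreter meets the Python >= 3.10 prerequisite.

```python
import importlib.util
import sys


def check_environment(packages=("mpi4py", "torch", "torcpy")):
    """Report the Python version check and whether each package is importable.

    Hypothetical helper for illustration; the package names are assumptions.
    """
    report = {"python_ok": sys.version_info >= (3, 10)}
    for name in packages:
        # find_spec returns None when a top-level package is not installed.
        report[name] = importlib.util.find_spec(name) is not None
    return report


if __name__ == "__main__":
    for key, ok in check_environment().items():
        print(f"{key}: {'ok' if ok else 'missing'}")
```

Run it inside the virtual environment that install.sh created so that it inspects the same interpreter the examples will use.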

Dataset creation

Generate a synthetic dataset with the helper script:

./scripts/create_ds.sh

This calls scripts/create_csv_large_scale.py with defaults for 16M samples and 700 numerical features. Adjust the arguments in the script to match your scale or feature configuration.
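For a quick smoke test before generating the full-size dataset, a much smaller CSV of the same general shape can be produced with the standard library alone. This is a sketch under stated assumptions, not the logic of scripts/create_csv_large_scale.py: the function name `write_synthetic_csv`, the `f0..fN` column names, and the trailing binary `label` column are all hypothetical.

```python
import csv
import random


def write_synthetic_csv(path, n_samples, n_features, seed=0):
    """Write n_samples rows of n_features Gaussian floats plus a binary label.

    Illustrative only: the column layout is an assumption, not the repo's format.
    """
    rng = random.Random(seed)  # seeded so repeated runs give the same file
    header = [f"f{i}" for i in range(n_features)] + ["label"]
    with open(path, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(header)
        for _ in range(n_samples):
            row = [f"{rng.gauss(0.0, 1.0):.6f}" for _ in range(n_features)]
            row.append(rng.randint(0, 1))
            writer.writerow(row)
    return path
```

For example, `write_synthetic_csv("tiny.csv", n_samples=1000, n_features=8)` writes a file small enough to inspect by hand. Rows are streamed one at a time, so memory use stays flat as `n_samples` grows.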

Examples

The examples demonstrate the preprocessing fit stage and the expected launch patterns for distributed runs.

examples/fit_example.py: distributed preprocessing fit using Mneme through a preprocessor pipeline construct.
examples/launch_fit.sh: MPI launcher script for the fit example. Update MPI_EXEC, the hostfile, and the input file paths (the dataset and, if they exist, the cached offsets) before running.
examples/loading_example.py: data loading example using CacheLoader across one or more GPUs.
examples/launch_loading.sh: torchrun launcher script for the loading example.
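When a script is started with torchrun, each worker process receives its position through the `RANK` and `WORLD_SIZE` environment variables. The sketch below shows one common way a worker could use them to claim a contiguous share of data blocks. It is an illustration only: `rank_info` and `blocks_for_rank` are hypothetical names, and this is not necessarily how CacheLoader assigns blocks.

```python
import os


def rank_info(env=None):
    """Read the rank and world size that torchrun exports to each worker.

    Falls back to a single-process setup when the variables are absent.
    """
    env = os.environ if env is None else env
    return int(env.get("RANK", 0)), int(env.get("WORLD_SIZE", 1))


def blocks_for_rank(n_blocks, rank, world_size):
    """Return the contiguous range of block indices owned by `rank`.

    The first `n_blocks % world_size` ranks each take one extra block,
    so every block is assigned exactly once.
    """
    base, extra = divmod(n_blocks, world_size)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return range(start, stop)


if __name__ == "__main__":
    rank, world_size = rank_info()
    print(rank, list(blocks_for_rank(10, rank, world_size)))
```

Contiguous ranges keep each worker reading neighbouring regions of the file, which tends to suit block-oriented storage better than a strided assignment.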

Notes

requirements.txt lists the Python dependencies; torch is installed separately to match your CUDA runtime.
mpi.cfg controls how mpi4py is built. If you move your MPI installation, update mpi_dir and reinstall.
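Before reinstalling, it can help to confirm that the mpi_dir recorded in mpi.cfg still points at an existing directory. The sketch below assumes mpi.cfg is an INI-style file with mpi_dir defined under a section named `mpi`; that section name and the helper `read_mpi_dir` are assumptions, so adjust them to match the file shipped in this repository.

```python
import configparser
import os


def read_mpi_dir(cfg_path, section="mpi"):
    """Return (mpi_dir, exists) for the given config file.

    Assumes an INI layout with an `mpi_dir` key under `section`;
    the default section name is a guess, not taken from the repository.
    """
    parser = configparser.ConfigParser()
    # ConfigParser.read silently skips missing files, so check its result.
    if not parser.read(cfg_path):
        raise FileNotFoundError(cfg_path)
    mpi_dir = parser.get(section, "mpi_dir")
    return mpi_dir, os.path.isdir(mpi_dir)


if __name__ == "__main__":
    path, exists = read_mpi_dir("mpi.cfg")
    print(f"mpi_dir = {path} ({'found' if exists else 'not found'})")
```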

Citation

If you find this codebase helpful for your research, we kindly ask that you cite our EuroMLSys 2026 paper, "Block-Aware Distributed Data Pipelines for Out-of-Core Tabular Machine Learning".
