
Block-Aware Distributed Data Pipelines

This repository contains the proof-of-concept (POC) implementation for the EuroMLSys 2026 paper "Block-Aware Distributed Data Pipelines for Out-of-Core Tabular Machine Learning". It provides the components for distributed preprocessing and data loading for DNN training on large tabular datasets.

Repository layout

  • src/
    • cacheloader/: Cache-aware DataLoader.
    • mneme_torc/: Mneme-based distributed preprocessing.
  • examples/: runnable examples and launch scripts.
  • scripts/: dataset generation utilities.
  • requirements.txt, pyproject.toml: packaging and dependencies.
  • install.sh, mpi.cfg: installation and MPI configuration.

Prerequisites

  • Python >= 3.10
  • MPI runtime (preferably MPICH).
  • A CUDA-enabled PyTorch build if you plan to use GPU acceleration.

Installation

The project uses torcpy for distributed execution, which depends on mpi4py. For reproducible builds, configure mpi4py to point to your MPI installation through mpi.cfg, then run the provided script.

  1. Update mpi.cfg with the path to your MPI installation (for example, /opt/mpich).
  2. Adjust the torch version inside install.sh to match your CUDA version.
  3. Run the installer:
    ./install.sh

install.sh creates a virtual environment, configures mpi4py to use the MPI compiler wrappers, installs the project in editable mode, and pins CUDA 12.1-compatible PyTorch wheels by default (adjust the torch pin in step 2 if your CUDA version differs).
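
For reference, here is a minimal mpi.cfg sketch, assuming an MPICH installation under /opt/mpich. The section and key names follow mpi4py's mpi.cfg conventions, but treat the exact layout as an assumption and verify it against the mpi.cfg shipped in this repository:

    [mpi]
    # Path to the MPI installation root (assumed location; edit to match yours).
    mpi_dir = /opt/mpich
    # Compiler wrappers resolved relative to mpi_dir via ConfigParser interpolation.
    mpicc = %(mpi_dir)s/bin/mpicc
    mpicxx = %(mpi_dir)s/bin/mpicxx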

Dataset creation

Generate a synthetic dataset with the helper script:

./scripts/create_ds.sh

This calls scripts/create_csv_large_scale.py with defaults for 16M samples and 700 numerical features. Adjust the arguments in the script to match your scale or feature configuration.
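
If you need different defaults, you can also invoke the underlying script directly. A minimal sketch; the flag names below are hypothetical, so check the argument parser in scripts/create_csv_large_scale.py for the actual names:

    # Flag names are illustrative, not confirmed against the script.
    python scripts/create_csv_large_scale.py --num-samples 16000000 --num-features 700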

Examples

The examples demonstrate the preprocessing fit stage and the expected launch patterns for distributed runs; a launch sketch follows the list.

  • examples/fit_example.py: distributed preprocessing fit using Mneme through a preprocessor pipeline construct.
  • examples/launch_fit.sh: MPI launcher script for the fit example. Update MPI_EXEC, the hostfile, and the input file paths (dataset and cached offsets, if any) before running.
  • examples/loading_example.py: data loading example using CacheLoader on one or more GPUs.
  • examples/launch_loading.sh: torchrun launcher script for the loading example.
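
The launchers wrap invocations along these lines. A minimal sketch, assuming MPICH's mpiexec and a single-node torchrun setup; the rank count, hostfile name, and GPU count are placeholders:

    # Distributed preprocessing fit (MPICH-style hostfile flag):
    mpiexec -n 8 -f hostfile python examples/fit_example.py
    # Data loading across 4 GPUs on one node:
    torchrun --nproc_per_node=4 examples/loading_example.py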

Notes

  • requirements.txt lists the Python dependencies; torch is installed separately to match your CUDA runtime.
  • mpi.cfg controls how mpi4py is built. If you move your MPI installation, update mpi_dir and reinstall; see the sketch below.
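
If you would rather not rerun the full installer after an MPI move, mpi4py can be rebuilt directly. A minimal sketch, assuming the install.sh virtual environment is active; it uses mpi4py's documented MPICC environment override rather than mpi.cfg:

    # Force a from-source rebuild of mpi4py against the new MPI location.
    MPICC=/opt/mpich/bin/mpicc pip install --force-reinstall --no-cache-dir --no-binary=mpi4py mpi4py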

Citation

If you find this codebase helpful for your research, we kindly ask that you cite our EuroMLSys 2026 paper, "Block-Aware Distributed Data Pipelines for Out-of-Core Tabular Machine Learning".
