GPU Ordering

A GPU-accelerated fill-reducing ordering library for sparse linear systems. This project implements and benchmarks various fill-reducing ordering algorithms, with a focus on patch-based nested dissection methods that leverage GPU parallelism.

Quick Start

Prerequisites

CMake 3.18+
CUDA Toolkit (with cuSPARSE, cuSOLVER)
GCC with C++17 support
OpenMP

Build

# Create build directory
mkdir cmake-build-release && cd cmake-build-release

# Configure
cmake .. -DCMAKE_BUILD_TYPE=Release

# Build
make -j$(nproc)

Or use the provided build script:

./scripts/build.sh

Run Benchmark

./cmake-build-release/benchmark/gpu_ordering_benchmark \
  -i /path/to/mesh.obj \
  -s CUDSS \
  -a PATCH_ORDERING \
  -g 0 \
  -p metis_kway

This example:

Loads a triangle mesh from an OBJ file
Constructs a cotangent Laplacian matrix
Applies the PATCH_ORDERING fill-reducing ordering using METIS k-way partitioning for patches
Solves the system using the NVIDIA cuDSS GPU solver

Command Line Options

Option	Description	Values
`-i, --input`	Input mesh file	Path to OBJ file
`-s, --solver`	Linear solver	`CHOLMOD`, `CUDSS`, `PARTH_SOLVER`, `STRUMPACK`
`-a, --ordering`	Ordering algorithm	`DEFAULT`, `METIS`, `RXMESH_ND`, `PATCH_ORDERING`, `PARTH`, `NEUTRAL`
`-g, --use_gpu`	Use GPU for patch ordering	`0` (CPU), `1` (GPU)
`-p, --patch_type`	Patch generation method	`rxmesh`, `metis_kway`
`-o, --output`	Output folder	Path to output directory

Solver Types

CHOLMOD: CPU-based Cholesky solver from SuiteSparse
CUDSS: NVIDIA cuDSS GPU sparse direct solver
PARTH_SOLVER: Parth direct solver
STRUMPACK: Sparse direct solver with GPU support (requires MPI)

Ordering Types

DEFAULT: Use the solver's default ordering
METIS: METIS nested dissection ordering
RXMESH_ND: RXMesh-based nested dissection
PATCH_ORDERING: Patch-based fill-reducing ordering (main algorithm under development)
PARTH: Parth ordering algorithm
NEUTRAL: Identity permutation (no reordering)

Project Structure

gpu_ordering/
├── benchmark/           # Benchmark executables
├── cmake/               # CMake configuration and recipes
├── input/               # Sample mesh files
├── output/              # Output directory
├── scripts/             # Build and utility scripts
└── src/                 # Source code
    ├── AnalysisScripts/ # Python analysis and plotting scripts
    ├── LinSysSolvers/   # Linear solver wrappers
    ├── Ordering/        # Fill-reducing ordering algorithms
    ├── PatchOrdering/   # Patch-based ordering implementation
    └── utils/           # Helper utilities

src/AnalysisScripts/

Python scripts for analysis and visualization:

FactorVisualization/see_factor.py: Visualizes sparse matrix sparsity patterns using spy plots. Compares factor structures between different orderings.
OrderingAnalysis/see_ratio.py: Compares fill-ratios between ordering methods. Generates normalized comparison plots against METIS baseline.

src/LinSysSolvers/

Wrappers for sparse direct linear solvers. Entry point: LinSysSolver.hpp

// Create a solver instance
LinSysSolver* solver = LinSysSolver::create(LinSysSolverType::GPU_CUDSS);

// Set the sparse matrix (CSC format)
solver->setMatrix(outerIndexPtr, innerIndexPtr, values, numRows, numNonZeros);

// Symbolic analysis with user-defined permutation
solver->analyze_pattern(permutation, etree);

// Numerical factorization
solver->factorize();

// Solve Ax = b
solver->solve(rhs, result);

Files:

File	Description
`LinSysSolver.hpp`	Abstract base class and factory (`create()` function)
`LinSysSolver.cpp`	Factory implementation
`CHOLMODSolver.hpp/cpp`	SuiteSparse CHOLMOD wrapper
`CUDSSSolver.hpp/cu`	NVIDIA cuDSS GPU solver wrapper
`STRUMPACKSolver.hpp/cpp`	STRUMPACK solver wrapper

src/Ordering/

Fill-reducing ordering algorithms. Entry point: ordering.h

// Create an ordering instance
Ordering* ordering = Ordering::create(DEMO_ORDERING_TYPE::PATCH_ORDERING);

// Configure options
ordering->setOptions({{"use_gpu", "1"}, {"patch_type", "metis_kway"}});

// Set the adjacency graph (CSR format, without diagonal)
ordering->setGraph(Gp, Gi, numNodes, numEdges);

// Optionally provide mesh data for mesh-aware orderings
if (ordering->needsMesh()) {
    ordering->setMesh(V_data, V_rows, V_cols, F_data, F_rows, F_cols);
}

// Initialize and compute permutation
ordering->init();
ordering->compute_permutation(perm, etree, compute_etree);

Files:

File	Description
`ordering.h`	Abstract base class and factory (`create()` function)
`ordering.cu`	Factory implementation
`metis_ordering.cpp/h`	METIS nested dissection wrapper
`parth_ordering.cpp/h`	Parth ordering wrapper
`rxmesh_ordering.cu/h`	RXMesh-based nested dissection
`patch_ordering.cu/h`	Patch-based ordering (main algorithm)
`neutral_ordering.cpp/h`	Identity permutation

src/PatchOrdering/

The core patch-based fill-reducing ordering algorithm. This is the main algorithm under development.

Algorithm Overview:

Partition the mesh/graph into patches (using RXMesh or METIS k-way)
Build a quotient graph where each patch is a supernode
Recursively bisect the quotient graph to create a decomposition tree
Find separators using bipartite matching refinement
Apply local fill-reducing ordering (AMD) within each tree node
Assemble the final permutation in post-order

Files:

File	Description
`cpu_ordering_with_patch.cpp/h`	CPU implementation of patch-based ordering
`gpu_ordering_with_patch.cu/h`	GPU-accelerated implementation

Key Classes:

GPUOrdering_PATCH: Main GPU ordering class with decomposition tree management
DecompositionTree: Binary tree representing the nested dissection structure
QuotientGraph: Compressed graph where patches are supernodes

src/utils/

Helper utilities for benchmarking and development:

File	Description
`check_valid_permutation.cpp/h`	Validates that a permutation is valid (bijection)
`compute_inverse_perm.cpp/h`	Computes inverse permutation
`get_factor_nnz.cpp/h`	Estimates Cholesky factor non-zeros for a given ordering
`remove_diagonal.cpp/h`	Removes diagonal entries from sparse matrix (for graph representation)
`metis_helper.cpp/h`	METIS utility functions
`min_vertex_cover_bipartite.cpp/h`	Hopcroft-Karp algorithm for bipartite matching
`cuda_error_handler.h`	CUDA error checking macros
`nvtx_helper.h`	NVIDIA Tools Extension for profiling

benchmark/

Benchmark executables:

cudss_benchmark/cudss_benchmark.cpp: Main benchmark that:
1. Loads a triangle mesh
2. Constructs a cotangent Laplacian matrix
3. Applies a fill-reducing ordering
4. Solves the linear system
5. Reports timing and fill-ratio metrics
ordering_benchmark.cpp: Ordering-focused benchmark for comparing fill-ratios across different ordering methods
etree_analysis/etree_analysis.cpp: Elimination tree analysis benchmark for studying tree structure properties

Running the Benchmark Script

The project includes a comprehensive benchmark script that automates testing across multiple ordering algorithms and configurations.

Script Location

scripts/benchmark/cudss_ordering_benchmark.sh

Configuration

Before running the script, you must configure three variables at the top of the script:

Variable	Description	Example
`INPUT_ROOT`	Directory containing your `.obj` mesh files	`/path/to/meshes`
`OUTPUT_CSV`	Path prefix for the output CSV file (without `.csv` extension)	`/path/to/output/results`
`BENCHMARK_BIN`	Path to the compiled benchmark binary	`/path/to/cmake-build-release/benchmark/gpu_ordering_cudss_benchmark`

Edit the script to set these paths:

INPUT_ROOT="/path/to/your/mesh/directory"
OUTPUT_CSV="/path/to/output/benchmark_results"
BENCHMARK_BIN="/path/to/gpu_ordering/cmake-build-release/benchmark/gpu_ordering_cudss_benchmark"

What the Script Benchmarks

The script automatically discovers all .obj files in INPUT_ROOT and runs three categories of benchmarks:

DEFAULT Ordering: Uses the solver's built-in ordering for each mesh
PARTH Ordering: Tests the Parth algorithm with varying binary_level values:
- binary_level: 2, 4, 6, 8, 10
PATCH_ORDERING: Tests patch-based ordering with all combinations of:
- patch_type: metis_kway, rxmesh
- patch_size: 64, 256, 512

Running the Benchmark

# Make the script executable (if needed)
chmod +x scripts/benchmark/cudss_ordering_benchmark.sh

# Run the benchmark
./scripts/benchmark/cudss_ordering_benchmark.sh

Output

Results are appended to ${OUTPUT_CSV}.csv, containing timing and fill-ratio metrics for each configuration tested

Build Options

CMake options for enabling/disabling features:

-Dgpu_ordering_WITH_PARTH=ON       # Enable Parth ordering (default: ON)
-Dgpu_ordering_WITH_SUITESPARSE=ON # Enable SuiteSparse/CHOLMOD (default: ON)
-Dgpu_ordering_WITH_STRUMPACK=OFF  # Enable STRUMPACK (default: OFF, requires MPI)
-Dgpu_ordering_WITH_PROFILE=ON     # Enable profiling support (default: ON)

Cross-Platform Build Instructions

The project uses CMake recipes that automatically detect and configure BLAS/LAPACK for SuiteSparse. The detection order is:

Pre-set BLAS_LIBRARIES/LAPACK_LIBRARIES (command line override)
Intel OneAPI MKL
OpenBLAS (system-installed)
Any system BLAS/LAPACK (e.g., Apple Accelerate on macOS)
Build OpenBLAS from source (fallback)

Linux

On most Linux distributions, the build should work out of the box if you have a system BLAS/LAPACK installed:

# Ubuntu/Debian
sudo apt-get install libopenblas-dev

# Fedora/RHEL
sudo dnf install openblas-devel

macOS

macOS includes the Accelerate framework which provides BLAS/LAPACK. No additional setup is required.

Windows

On Windows, the library naming conventions differ from Linux. Pre-built OpenBLAS binaries typically use 32-bit integers and have different file names.

Option 1: Using Pre-built OpenBLAS (Recommended)

Automated Setup (PowerShell):

Run the following commands in PowerShell to automatically download and configure OpenBLAS:

# Download and extract OpenBLAS
$OpenBLAS_VERSION = "0.3.28"
$OpenBLAS_URL = "https://github.com/OpenMathLib/OpenBLAS/releases/download/v${OpenBLAS_VERSION}/OpenBLAS-${OpenBLAS_VERSION}-x64.zip"
$OpenBLAS_DIR = "C:\OpenBLAS"

Invoke-WebRequest -Uri $OpenBLAS_URL -OutFile openblas.zip
Expand-Archive -Path openblas.zip -DestinationPath $OpenBLAS_DIR -Force

# Move contents from versioned subdirectory to parent (if present)
$SubDir = Get-ChildItem -Path $OpenBLAS_DIR -Directory | Select-Object -First 1
if ($SubDir) {
    Move-Item -Path "$($SubDir.FullName)\*" -Destination $OpenBLAS_DIR -Force
    Remove-Item -Path $SubDir.FullName -Force -Recurse
}

# Auto-detect library and include paths
$OPENBLAS_LIB = (Get-ChildItem -Path $OpenBLAS_DIR -Recurse -Filter "*openblas*.lib" | Select-Object -First 1).FullName
$OPENBLAS_INCLUDE = (Get-ChildItem -Path $OpenBLAS_DIR -Recurse -Directory -Filter "include" | Select-Object -First 1).FullName

# Configure and build with Visual Studio 2022
cmake -B build `
    -G "Visual Studio 17 2022" `
    -DCMAKE_BUILD_TYPE=Release `
    -DBLAS_LIBRARIES="$OPENBLAS_LIB" `
    -DLAPACK_LIBRARIES="$OPENBLAS_LIB" `
    -DOPENBLAS_INCLUDE_DIR="$OPENBLAS_INCLUDE"

cmake --build build --config Release --parallel

Manual Setup:

Download pre-built OpenBLAS from GitHub Releases
Extract to a known location (e.g., C:\OpenBLAS)
Configure CMake with the following variables:

# PowerShell
cmake -B build -DCMAKE_BUILD_TYPE=Release `
    -G "Visual Studio 17 2022" `
    -DBLAS_LIBRARIES="C:/OpenBLAS/libopenblas.lib" `
    -DLAPACK_LIBRARIES="C:/OpenBLAS/libopenblas.lib" `
    -DOPENBLAS_INCLUDE_DIR="C:/OpenBLAS/include"

:: Command Prompt
cmake -B build -DCMAKE_BUILD_TYPE=Release ^
    -G "Visual Studio 17 2022" ^
    -DBLAS_LIBRARIES="C:/OpenBLAS/libopenblas.lib" ^
    -DLAPACK_LIBRARIES="C:/OpenBLAS/libopenblas.lib" ^
    -DOPENBLAS_INCLUDE_DIR="C:/OpenBLAS/include"

Option 2: Using Environment Variables

Set environment variables before running CMake:

# PowerShell
$env:BLAS_LIBRARIES = "C:/OpenBLAS/libopenblas.lib"
$env:LAPACK_LIBRARIES = "C:/OpenBLAS/libopenblas.lib"
$env:OPENBLAS_INCLUDE_DIR = "C:/OpenBLAS/include"

cmake -B build -DCMAKE_BUILD_TYPE=Release -G "Visual Studio 17 2022"

Option 3: Using vcpkg

vcpkg install openblas:x64-windows
cmake -B build -DCMAKE_BUILD_TYPE=Release -DCMAKE_TOOLCHAIN_FILE=[vcpkg-root]/scripts/buildsystems/vcpkg.cmake

Windows Library Naming Notes

Platform	BLAS Library Name	Notes
Linux	`libopenblas.a` or `libopenblas.so`	64-bit integers supported
macOS	`libopenblas.a` or `libopenblas.dylib`	64-bit integers supported
Windows	`libopenblas.lib`	Typically uses 32-bit integers

The CMake recipes automatically handle these differences:

On Windows, SuiteSparse is configured with SUITESPARSE_USE_64BIT_BLAS=OFF
On Linux/macOS, SuiteSparse uses SUITESPARSE_USE_64BIT_BLAS=ON

License

See LICENSE file.

Name		Name	Last commit message	Last commit date
Latest commit History 46 Commits
Renders		Renders
assets		assets
benchmark		benchmark
cmake		cmake
input		input
output		output
python_scripts		python_scripts
scripts		scripts
src		src
tools		tools
.gitignore		.gitignore
CMakeLists.txt		CMakeLists.txt
LICENSE		LICENSE
README.md		README.md
RXMesh.log		RXMesh.log
patch_ids_tet_vertex_order.txt		patch_ids_tet_vertex_order.txt
single_factor_benchmark.log		single_factor_benchmark.log

Folders and files

Latest commit

History

Repository files navigation

GPU Ordering

Quick Start

Prerequisites

Build

Run Benchmark

Command Line Options

Solver Types

Ordering Types

Project Structure

src/AnalysisScripts/

src/LinSysSolvers/

src/Ordering/

src/PatchOrdering/

src/utils/

benchmark/

Running the Benchmark Script

Script Location

Configuration

What the Script Benchmarks

Running the Benchmark

Output

Build Options

Cross-Platform Build Instructions

Linux

macOS

Windows

Option 1: Using Pre-built OpenBLAS (Recommended)

Option 2: Using Environment Variables

Option 3: Using vcpkg

Windows Library Naming Notes

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages