A GPU-accelerated fill-reducing ordering library for sparse linear systems. This project implements and benchmarks various fill-reducing ordering algorithms, with a focus on patch-based nested dissection methods that leverage GPU parallelism.
- CMake 3.18+
- CUDA Toolkit (with cuSPARSE, cuSOLVER)
- GCC with C++17 support
- OpenMP
# Create build directory
mkdir cmake-build-release && cd cmake-build-release
# Configure
cmake .. -DCMAKE_BUILD_TYPE=Release
# Build
make -j$(nproc)Or use the provided build script:
./scripts/build.sh./cmake-build-release/benchmark/gpu_ordering_benchmark \
-i /path/to/mesh.obj \
-s CUDSS \
-a PATCH_ORDERING \
-g 0 \
-p metis_kwayThis example:
- Loads a triangle mesh from an OBJ file
- Constructs a cotangent Laplacian matrix
- Applies the
PATCH_ORDERINGfill-reducing ordering using METIS k-way partitioning for patches - Solves the system using the NVIDIA cuDSS GPU solver
| Option | Description | Values |
|---|---|---|
-i, --input |
Input mesh file | Path to OBJ file |
-s, --solver |
Linear solver | CHOLMOD, CUDSS, PARTH_SOLVER, STRUMPACK |
-a, --ordering |
Ordering algorithm | DEFAULT, METIS, RXMESH_ND, PATCH_ORDERING, PARTH, NEUTRAL |
-g, --use_gpu |
Use GPU for patch ordering | 0 (CPU), 1 (GPU) |
-p, --patch_type |
Patch generation method | rxmesh, metis_kway |
-o, --output |
Output folder | Path to output directory |
- CHOLMOD: CPU-based Cholesky solver from SuiteSparse
- CUDSS: NVIDIA cuDSS GPU sparse direct solver
- PARTH_SOLVER: Parth direct solver
- STRUMPACK: Sparse direct solver with GPU support (requires MPI)
- DEFAULT: Use the solver's default ordering
- METIS: METIS nested dissection ordering
- RXMESH_ND: RXMesh-based nested dissection
- PATCH_ORDERING: Patch-based fill-reducing ordering (main algorithm under development)
- PARTH: Parth ordering algorithm
- NEUTRAL: Identity permutation (no reordering)
gpu_ordering/
├── benchmark/ # Benchmark executables
├── cmake/ # CMake configuration and recipes
├── input/ # Sample mesh files
├── output/ # Output directory
├── scripts/ # Build and utility scripts
└── src/ # Source code
├── AnalysisScripts/ # Python analysis and plotting scripts
├── LinSysSolvers/ # Linear solver wrappers
├── Ordering/ # Fill-reducing ordering algorithms
├── PatchOrdering/ # Patch-based ordering implementation
└── utils/ # Helper utilities
Python scripts for analysis and visualization:
- FactorVisualization/see_factor.py: Visualizes sparse matrix sparsity patterns using spy plots. Compares factor structures between different orderings.
- OrderingAnalysis/see_ratio.py: Compares fill-ratios between ordering methods. Generates normalized comparison plots against METIS baseline.
Wrappers for sparse direct linear solvers. Entry point: LinSysSolver.hpp
// Create a solver instance
LinSysSolver* solver = LinSysSolver::create(LinSysSolverType::GPU_CUDSS);
// Set the sparse matrix (CSC format)
solver->setMatrix(outerIndexPtr, innerIndexPtr, values, numRows, numNonZeros);
// Symbolic analysis with user-defined permutation
solver->analyze_pattern(permutation, etree);
// Numerical factorization
solver->factorize();
// Solve Ax = b
solver->solve(rhs, result);Files:
| File | Description |
|---|---|
LinSysSolver.hpp |
Abstract base class and factory (create() function) |
LinSysSolver.cpp |
Factory implementation |
CHOLMODSolver.hpp/cpp |
SuiteSparse CHOLMOD wrapper |
CUDSSSolver.hpp/cu |
NVIDIA cuDSS GPU solver wrapper |
STRUMPACKSolver.hpp/cpp |
STRUMPACK solver wrapper |
Fill-reducing ordering algorithms. Entry point: ordering.h
// Create an ordering instance
Ordering* ordering = Ordering::create(DEMO_ORDERING_TYPE::PATCH_ORDERING);
// Configure options
ordering->setOptions({{"use_gpu", "1"}, {"patch_type", "metis_kway"}});
// Set the adjacency graph (CSR format, without diagonal)
ordering->setGraph(Gp, Gi, numNodes, numEdges);
// Optionally provide mesh data for mesh-aware orderings
if (ordering->needsMesh()) {
ordering->setMesh(V_data, V_rows, V_cols, F_data, F_rows, F_cols);
}
// Initialize and compute permutation
ordering->init();
ordering->compute_permutation(perm, etree, compute_etree);Files:
| File | Description |
|---|---|
ordering.h |
Abstract base class and factory (create() function) |
ordering.cu |
Factory implementation |
metis_ordering.cpp/h |
METIS nested dissection wrapper |
parth_ordering.cpp/h |
Parth ordering wrapper |
rxmesh_ordering.cu/h |
RXMesh-based nested dissection |
patch_ordering.cu/h |
Patch-based ordering (main algorithm) |
neutral_ordering.cpp/h |
Identity permutation |
The core patch-based fill-reducing ordering algorithm. This is the main algorithm under development.
Algorithm Overview:
- Partition the mesh/graph into patches (using RXMesh or METIS k-way)
- Build a quotient graph where each patch is a supernode
- Recursively bisect the quotient graph to create a decomposition tree
- Find separators using bipartite matching refinement
- Apply local fill-reducing ordering (AMD) within each tree node
- Assemble the final permutation in post-order
Files:
| File | Description |
|---|---|
cpu_ordering_with_patch.cpp/h |
CPU implementation of patch-based ordering |
gpu_ordering_with_patch.cu/h |
GPU-accelerated implementation |
Key Classes:
GPUOrdering_PATCH: Main GPU ordering class with decomposition tree managementDecompositionTree: Binary tree representing the nested dissection structureQuotientGraph: Compressed graph where patches are supernodes
Helper utilities for benchmarking and development:
| File | Description |
|---|---|
check_valid_permutation.cpp/h |
Validates that a permutation is valid (bijection) |
compute_inverse_perm.cpp/h |
Computes inverse permutation |
get_factor_nnz.cpp/h |
Estimates Cholesky factor non-zeros for a given ordering |
remove_diagonal.cpp/h |
Removes diagonal entries from sparse matrix (for graph representation) |
metis_helper.cpp/h |
METIS utility functions |
min_vertex_cover_bipartite.cpp/h |
Hopcroft-Karp algorithm for bipartite matching |
cuda_error_handler.h |
CUDA error checking macros |
nvtx_helper.h |
NVIDIA Tools Extension for profiling |
Benchmark executables:
-
cudss_benchmark/cudss_benchmark.cpp: Main benchmark that:
- Loads a triangle mesh
- Constructs a cotangent Laplacian matrix
- Applies a fill-reducing ordering
- Solves the linear system
- Reports timing and fill-ratio metrics
-
ordering_benchmark.cpp: Ordering-focused benchmark for comparing fill-ratios across different ordering methods
-
etree_analysis/etree_analysis.cpp: Elimination tree analysis benchmark for studying tree structure properties
The project includes a comprehensive benchmark script that automates testing across multiple ordering algorithms and configurations.
scripts/benchmark/cudss_ordering_benchmark.shBefore running the script, you must configure three variables at the top of the script:
| Variable | Description | Example |
|---|---|---|
INPUT_ROOT |
Directory containing your .obj mesh files |
/path/to/meshes |
OUTPUT_CSV |
Path prefix for the output CSV file (without .csv extension) |
/path/to/output/results |
BENCHMARK_BIN |
Path to the compiled benchmark binary | /path/to/cmake-build-release/benchmark/gpu_ordering_cudss_benchmark |
Edit the script to set these paths:
INPUT_ROOT="/path/to/your/mesh/directory"
OUTPUT_CSV="/path/to/output/benchmark_results"
BENCHMARK_BIN="/path/to/gpu_ordering/cmake-build-release/benchmark/gpu_ordering_cudss_benchmark"The script automatically discovers all .obj files in INPUT_ROOT and runs three categories of benchmarks:
-
DEFAULT Ordering: Uses the solver's built-in ordering for each mesh
-
PARTH Ordering: Tests the Parth algorithm with varying
binary_levelvalues:binary_level: 2, 4, 6, 8, 10
-
PATCH_ORDERING: Tests patch-based ordering with all combinations of:
patch_type:metis_kway,rxmeshpatch_size: 64, 256, 512
# Make the script executable (if needed)
chmod +x scripts/benchmark/cudss_ordering_benchmark.sh
# Run the benchmark
./scripts/benchmark/cudss_ordering_benchmark.shResults are appended to ${OUTPUT_CSV}.csv, containing timing and fill-ratio metrics for each configuration tested
CMake options for enabling/disabling features:
-Dgpu_ordering_WITH_PARTH=ON # Enable Parth ordering (default: ON)
-Dgpu_ordering_WITH_SUITESPARSE=ON # Enable SuiteSparse/CHOLMOD (default: ON)
-Dgpu_ordering_WITH_STRUMPACK=OFF # Enable STRUMPACK (default: OFF, requires MPI)
-Dgpu_ordering_WITH_PROFILE=ON # Enable profiling support (default: ON)The project uses CMake recipes that automatically detect and configure BLAS/LAPACK for SuiteSparse. The detection order is:
- Pre-set
BLAS_LIBRARIES/LAPACK_LIBRARIES(command line override) - Intel OneAPI MKL
- OpenBLAS (system-installed)
- Any system BLAS/LAPACK (e.g., Apple Accelerate on macOS)
- Build OpenBLAS from source (fallback)
On most Linux distributions, the build should work out of the box if you have a system BLAS/LAPACK installed:
# Ubuntu/Debian
sudo apt-get install libopenblas-dev
# Fedora/RHEL
sudo dnf install openblas-develmacOS includes the Accelerate framework which provides BLAS/LAPACK. No additional setup is required.
On Windows, the library naming conventions differ from Linux. Pre-built OpenBLAS binaries typically use 32-bit integers and have different file names.
Automated Setup (PowerShell):
Run the following commands in PowerShell to automatically download and configure OpenBLAS:
# Download and extract OpenBLAS
$OpenBLAS_VERSION = "0.3.28"
$OpenBLAS_URL = "https://github.com/OpenMathLib/OpenBLAS/releases/download/v${OpenBLAS_VERSION}/OpenBLAS-${OpenBLAS_VERSION}-x64.zip"
$OpenBLAS_DIR = "C:\OpenBLAS"
Invoke-WebRequest -Uri $OpenBLAS_URL -OutFile openblas.zip
Expand-Archive -Path openblas.zip -DestinationPath $OpenBLAS_DIR -Force
# Move contents from versioned subdirectory to parent (if present)
$SubDir = Get-ChildItem -Path $OpenBLAS_DIR -Directory | Select-Object -First 1
if ($SubDir) {
Move-Item -Path "$($SubDir.FullName)\*" -Destination $OpenBLAS_DIR -Force
Remove-Item -Path $SubDir.FullName -Force -Recurse
}
# Auto-detect library and include paths
$OPENBLAS_LIB = (Get-ChildItem -Path $OpenBLAS_DIR -Recurse -Filter "*openblas*.lib" | Select-Object -First 1).FullName
$OPENBLAS_INCLUDE = (Get-ChildItem -Path $OpenBLAS_DIR -Recurse -Directory -Filter "include" | Select-Object -First 1).FullName
# Configure and build with Visual Studio 2022
cmake -B build `
-G "Visual Studio 17 2022" `
-DCMAKE_BUILD_TYPE=Release `
-DBLAS_LIBRARIES="$OPENBLAS_LIB" `
-DLAPACK_LIBRARIES="$OPENBLAS_LIB" `
-DOPENBLAS_INCLUDE_DIR="$OPENBLAS_INCLUDE"
cmake --build build --config Release --parallelManual Setup:
- Download pre-built OpenBLAS from GitHub Releases
- Extract to a known location (e.g.,
C:\OpenBLAS) - Configure CMake with the following variables:
# PowerShell
cmake -B build -DCMAKE_BUILD_TYPE=Release `
-G "Visual Studio 17 2022" `
-DBLAS_LIBRARIES="C:/OpenBLAS/libopenblas.lib" `
-DLAPACK_LIBRARIES="C:/OpenBLAS/libopenblas.lib" `
-DOPENBLAS_INCLUDE_DIR="C:/OpenBLAS/include":: Command Prompt
cmake -B build -DCMAKE_BUILD_TYPE=Release ^
-G "Visual Studio 17 2022" ^
-DBLAS_LIBRARIES="C:/OpenBLAS/libopenblas.lib" ^
-DLAPACK_LIBRARIES="C:/OpenBLAS/libopenblas.lib" ^
-DOPENBLAS_INCLUDE_DIR="C:/OpenBLAS/include"Set environment variables before running CMake:
# PowerShell
$env:BLAS_LIBRARIES = "C:/OpenBLAS/libopenblas.lib"
$env:LAPACK_LIBRARIES = "C:/OpenBLAS/libopenblas.lib"
$env:OPENBLAS_INCLUDE_DIR = "C:/OpenBLAS/include"
cmake -B build -DCMAKE_BUILD_TYPE=Release -G "Visual Studio 17 2022"vcpkg install openblas:x64-windows
cmake -B build -DCMAKE_BUILD_TYPE=Release -DCMAKE_TOOLCHAIN_FILE=[vcpkg-root]/scripts/buildsystems/vcpkg.cmake| Platform | BLAS Library Name | Notes |
|---|---|---|
| Linux | libopenblas.a or libopenblas.so |
64-bit integers supported |
| macOS | libopenblas.a or libopenblas.dylib |
64-bit integers supported |
| Windows | libopenblas.lib |
Typically uses 32-bit integers |
The CMake recipes automatically handle these differences:
- On Windows, SuiteSparse is configured with
SUITESPARSE_USE_64BIT_BLAS=OFF - On Linux/macOS, SuiteSparse uses
SUITESPARSE_USE_64BIT_BLAS=ON
See LICENSE file.