This document outlines the development roadmap for achieving a commercial-grade, open-source CFD library.
- Pluggable solver architecture (function pointers, registry pattern)
- Multiple solver backends (CPU, SIMD/AVX2, OpenMP, CUDA)
- 2D/3D incompressible Navier-Stokes solver
- Full 3D support with branch-free stride_z=0 pattern for 2D fallback
- Explicit Euler and projection methods
- Cross-platform builds (Windows, Linux, macOS)
- CI/CD with GitHub Actions
- Unity test framework integration
- VTK and CSV output with timestamped directories
- Visualization library (cfd-visualization)
- Thread-safe library initialization
- SIMD Poisson solvers (Jacobi and Red-Black SOR with AVX2)
- Boundary condition abstraction layer with runtime backend selection
- Neumann and Periodic boundary conditions (all backends)
- GPU boundary condition kernels (CUDA)
- Comprehensive test suite (core, solvers, simulation, I/O)
- 12 example programs demonstrating various features
Each algorithm should have scalar (CPU) + SIMD + OMP variants. Track gaps here.
| Category | Algorithm | CPU | AVX2 | NEON | OMP | GPU |
|---|---|---|---|---|---|---|
| N-S Solvers | Explicit Euler | done | done | — | done | — |
| | Projection | done | done | — | done | done |
| | RK2 (Heun) | done | done | — | done | — |
| Linear Solvers | Jacobi | done | done | done | — | — |
| | SOR | done | done | done | — | — |
| | Red-Black SOR | done | done | done | done | — |
| | CG / PCG | done | done | done | done | — |
| | BiCGSTAB | done | done | done | — | — |
| Boundary Conds | All types | done | done | done | done | done |
- No 3D validation benchmarks — resolved: Taylor-Green 3D and Poiseuille 3D added
- No turbulence models
- Limited linear solvers (no multigrid)
- No restart/checkpoint capability
OMP Red-Black SOR Poisson Solver Convergence (P1) ✅ RESOLVED
Root cause: The hard-coded omega=1.5 was suboptimal for larger grids. For a 33×33 grid the optimal SOR omega is ~1.83; with omega=1.5 the spectral radius of the iteration matrix was too close to 1, so the solver exceeded the iteration limit before converging.
Fix: All SOR and Red-Black SOR solvers now auto-compute optimal omega from grid dimensions using the Jacobi spectral radius formula: ω_opt = 2/(1+√(1-ρ_J²)). Users can override by setting params.omega > 0. Projection backends remain on CG (grid-size-independent).
Action items:
- Profile OMP Red-Black SOR to identify convergence bottleneck — suboptimal omega
- Compare OMP vs AVX2 Red-Black implementations for differences — identical algorithm
- Test omega parameter sweep (1.0 to 1.9) for optimal convergence — auto-computed
- Verify convergence via iteration-count tests — test_optimal_omega.c
- Consider switch to Chebyshev acceleration or SSOR — not needed, optimal omega suffices
Grid Convergence Non-Monotonic Behavior (P1) ✅ RESOLVED
Root cause: Red-Black SOR Poisson solver had insufficient convergence on larger grids. Switching all projection backends to CG (Conjugate Gradient) resolved the issue. Grid convergence is now strictly monotonic (17×17: 0.046, 25×25: 0.037, 33×33: 0.032).
Action items:
- Investigate why 33×33 produces worse results than 25×25 — Red-Black SOR Poisson solver
- May need more Poisson iterations for larger grids — solved by switch to CG
- Consider using SIMD Red-Black SOR or multigrid for better accuracy — CG sufficient
- Remove relaxed tolerance from grid convergence tests
- Add strict grid convergence test that FAILs if RMS increases with refinement
SIMD Poisson Strict Tolerance (Section 1.2)
Current SIMD Poisson solvers produce valid results but may not converge to strict tolerance (1e-6) on challenging problems like sinusoidal RHS within iteration limits. They converge properly on simpler problems (zero RHS, uniform RHS). See docs/simd-optimization-analysis.md for details.
Convergence Order BC-Limited (Section 1.3.2)
Spatial convergence achieves ~O(h^1.5) rather than theoretical O(h²), limited by first-order boundary conditions. Temporal convergence O(dt) is difficult to isolate; spatial error dominates on practical grids.
Jacobi Preconditioner on Uniform Grids (Section 1.2.4)
For the uniform-grid Laplacian with constant coefficients, the Jacobi preconditioner M⁻¹ = 1/(2/dx² + 2/dy²) is a constant scalar, which doesn't improve the condition number. PCG provides benefit for problems with variable coefficients or non-uniform grids.
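A one-line justification (sketch): a constant-diagonal preconditioner only rescales the system, so every eigenvalue is scaled equally and the condition number is unchanged:

```latex
M^{-1}A = \alpha A \quad\text{with}\quad \alpha = \frac{1}{2/\Delta x^2 + 2/\Delta y^2}
\qquad\Rightarrow\qquad
\kappa(M^{-1}A) = \frac{\alpha\,\lambda_{\max}(A)}{\alpha\,\lambda_{\min}(A)} = \kappa(A).
```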
Modular Library Circular Dependencies (Section 4.2)
The modular libraries have circular dependencies: cfd_scalar/cfd_simd call poisson_solve() (defined in cfd_api), while cfd_api links against cfd_scalar/cfd_simd. Resolved on Linux with linker groups; Windows/macOS handle automatically. Future: consider weak symbols or plugin architecture.
OMP Loop Variable int Overflow for Large Grids (P3)
Status: Deferred — low risk, no practical impact yet
OMP backends cast size_t loop variables to int for MSVC OpenMP 2.0 compatibility. Overflows for grids where nx * ny > INT_MAX (~46K × 46K).
- Audit all OMP backends for `size_t` → `int` casts
- Add `CFD_ASSERT(nx * ny <= INT_MAX)` guards if targeting large grids
- Consider requiring OpenMP 3.0+ when dropping MSVC OMP 2.0 support
Debug Mode SIMD Benchmarking (Section 1.10)
Current tests run in Debug mode where SIMD may be slower than scalar due to lack of compiler optimizations. Release mode benchmark suite needed for accurate performance measurements.
Spectral Radius Test BC Assumption (Section 1.2.3)
The Jacobi spectral radius test uses Dirichlet BCs (p=0 on boundary) because the ρ = cos(πh) formula applies only to the Dirichlet problem. The SOR optimal ω = 2/(1 + sin(πh)) also applies to Dirichlet BCs; with Neumann BCs optimal ω is typically lower (1.5-1.7).
Status: COMPLETE
- 0.1 Safe Error Handling — `cfd_status_t`, error propagation, resource cleanup
- 0.2 Thread Safety — removed static buffers, thread-safe registry, re-entrant solvers
- 0.3 API Robustness — input validation, logging callback, version header, symbol visibility
- 0.4 API & Robustness Testing — comprehensive test suites for all subsystems
- 0.5 Error Handling & Robustness — removed static `warned` flag, `run_simulation_step()` returns `cfd_status_t`
- 0.6 Structured Logging & Diagnostics — `cfd_log()` API with levels, component tags, thread-safe callbacks
0.6 Deferred items:
- Log filtering by component (e.g., only show "boundary" logs)
- Timestamps and colored output
- Structured data API for metrics (convergence stats, timings)
Goal: Make the solver practically usable for real problems.
All BC types implemented across all backends (Scalar, AVX2, NEON, OMP, GPU): Dirichlet, Neumann, Periodic, No-slip, Inlet (uniform/parabolic/custom), Outlet, Symmetry, Moving wall, Time-varying. Code refactored with shared templates (~505 lines removed).
Implemented: Jacobi, SOR, Red-Black SOR, CG/PCG, BiCGSTAB (all with SIMD backends). CG is the default Poisson solver for all projection methods.
Still needed:
- GMRES (Generalized Minimal Residual) for non-symmetric systems
- GMRES scalar
- GMRES AVX2
- GMRES NEON
- GMRES OMP
- GMRES GPU
- SSOR (Symmetric SOR) preconditioner
- GPU (CUDA) linear solver backends
  - Standalone Jacobi GPU (refactor from monolithic `solver_projection_jacobi_gpu.cu`)
  - Red-Black SOR GPU
  - CG GPU
  - BiCGSTAB GPU
- ILU preconditioner
- Geometric multigrid
- Algebraic multigrid (AMG) solver
- AMG preconditioner (for use with CG/GMRES/BiCGSTAB)
- Performance benchmarking in Release mode
Completed sub-sections: Poisson Accuracy Tests (1.2.1), Laplacian Validation (1.2.2), Convergence Validation (1.2.3), PCG (1.2.4).
- Upwind differencing (1st order) for stability
- Central differencing with delayed correction
- High-resolution TVD schemes (Van Leer, Superbee)
- Gradient limiters (Barth-Jespersen, Venkatakrishnan)
- Migrate solver code to use `cfd/math/stencils.h` (currently inline implementations)
Completed sub-sections: Stencil Tests (1.3.1), Convergence Order (1.3.2), MMS (1.3.3), Divergence-Free (1.3.4).
- SIMPLE algorithm for incompressible flow
- SIMPLEC / PISO variants
- Pseudo-transient continuation
- Convergence acceleration (relaxation)
Each time integrator requires a scalar (CPU) reference implementation first, then backend variants. See /add-ns-time-integrator for the cross-backend workflow.
Implemented: RK2 (Heun's method) — all CPU/AVX2/OMP backends, O(dt²) verified.
Still needed:
- RK2 CUDA (`rk2_gpu`)
- RK4 (classical Runge-Kutta) — scalar
- RK4 AVX2+OMP (`rk4_optimized`)
- RK4 OpenMP (`rk4_omp`)
- RK4 CUDA (`rk4_gpu`)
- Implicit Euler (backward Euler)
- Crank-Nicolson (2nd order implicit)
- BDF2 (backward differentiation)
- Adaptive time stepping with error control
- Binary checkpoint format
- Save/restore complete simulation state
- Portable across platforms
- Version compatibility
Solve F(x)=0 where F is nonlinear. Required for steady-state Navier-Stokes.
- Newton-Raphson iteration
- Picard iteration (successive substitution)
- Quasi-Newton methods (BFGS, L-BFGS)
- Line search and globalization
- Nonlinear solver abstraction interface
Find eigenvalues/eigenvectors for stability analysis.
- Power iteration
- Inverse iteration
- Arnoldi iteration
- Stability analysis framework
Status: OpenMP implemented, SIMD and CUDA pending
Current: OpenMP parallelization for velocity magnitude and field statistics with threshold-based parallelization (OMP_THRESHOLD = 1000 cells).
Still needed:
- Add AVX2 implementation for velocity magnitude
- Add NEON implementation for velocity magnitude (ARM64)
- Combine with OpenMP (`#pragma omp simd` or manual intrinsics)
- Add SIMD horizontal reduction for statistics
- Add runtime CPU feature detection
- Benchmark against scalar+OpenMP version
- Add CUDA kernel for velocity magnitude
- Add parallel reduction for statistics (use CUB library or custom)
- Share GPU memory with CUDA solvers (avoid CPU<->GPU transfers)
- Add threshold logic (only use GPU for large grids)
- Benchmark transfer overhead vs compute benefit
Status: SIMD Poisson integration completed, further optimizations pending. Current ~1.3-1.5x speedup limited by Amdahl's law (parallelizable fraction ~80%).
- Increase `POISSON_MAX_ITER` or implement adaptive tolerance
- Add optional multigrid preconditioner for faster convergence
- Investigate Red-Black omega parameter tuning
- Create Release mode benchmark suite
- Profile to identify remaining bottlenecks
- Implement hybrid projection solver using OpenMP across rows + SIMD within rows
- Benchmark scaling efficiency
- Compare against pure OMP and pure SIMD implementations
- Implement multigrid V-cycle framework
- Add restriction/prolongation operators
- Use SIMD Jacobi/Red-Black as smoothers at each level
- Benchmark convergence rate vs pure iterative methods
Goal: Support more physical phenomena.
- Temperature advection-diffusion
- Thermal boundary conditions
- Buoyancy coupling (Boussinesq approximation)
- Heat source terms
- Conjugate heat transfer
- Variable properties (viscosity/density as function of T)
- Temperature-dependent thermal conductivity
- Spalart-Allmaras (1-equation)
- k-epsilon standard
- k-epsilon realizable
- k-omega SST
- Wall functions
- Low-Reynolds number treatment
- Density-based solver
- Ideal gas equation of state
- Shock capturing (MUSCL, WENO)
- Pressure-based compressible (SIMPLE variants)
- Multi-species advection-diffusion
- Variable diffusivity
- Source terms for reactions
- Mass fraction constraints
- Volume of Fluid (VOF)
- Level Set method
- Surface tension
- Phase change
Goal: Support complex geometries.
Status: Complete. "2D as subset of 3D" approach — nz=1 produces bit-identical results to previous 2D code. Branch-free solver loops using precomputed constants (stride_z=0, inv_dz2=0.0). All 8 phases done: indexing macros, core data structures, 3D stencils, NS solvers, BCs, SIMD, OMP, CUDA, I/O, examples, docs.
- Cell-centered finite volume
- Face-based data structures
- Gradient reconstruction
- Cell connectivity
- Triangle/tetrahedral elements
- Quadrilateral/hexahedral elements
- Mixed element support
- Gmsh format (.msh)
- VTK unstructured (.vtu)
- CGNS format
- OpenFOAM polyMesh
- Mesh quality metrics
- Mesh validation
- Cell-based refinement
- Refinement criteria (gradient, error)
- Coarsening
- Load balancing
- Hanging nodes treatment
Goal: Scale to large problems.
- Domain decomposition
- Ghost cell exchange
- Parallel I/O
- Load balancing
- Hybrid MPI+OpenMP
Status: Implemented in v0.1.5. Split monolithic library into per-backend targets.
| Library | CMake Target | Contents | Dependencies |
|---|---|---|---|
| `cfd_core` | `CFD::Core` | Grid, memory, I/O, status, common utilities | None |
| `cfd_scalar` | `CFD::Scalar` | Scalar CPU solvers | `CFD::Core` |
| `cfd_simd` | `CFD::SIMD` | AVX2/NEON SIMD solvers | `CFD::Core`, `CFD::Scalar` |
| `cfd_omp` | `CFD::OMP` | OpenMP parallelized solvers | `CFD::Core`, `CFD::Scalar` |
| `cfd_cuda` | `CFD::CUDA` | CUDA GPU solvers | `CFD::Core` |
| `cfd_library` | `CFD::Library` | Unified library (all backends) | All above |
Still needed:
- Update examples/tests to link against specific backends (optional)
- Plugin loading system for dynamic backend selection
See Known Issues — modular library circular dependencies.
Implemented: CUDA device detection, GPU projection with Jacobi Poisson, GPU memory management, GPU BC kernels, configurable GPU settings, GPU solver statistics.
Still needed:
- Multi-GPU support
- Unified memory optimization
- Advanced async transfers (multi-stream overlap, double buffering)
- GPU-aware MPI
- GPU linear solver backends tracked in section 1.2
- Built-in profiling
- Memory usage tracking
- Roofline analysis integration
- Scaling benchmarks
Goal: Industry-standard data formats.
- Parallel HDF5 support
- Compression options
- Chunked storage
- XDMF metadata
Implemented: VTK legacy ASCII format, scalar/vector/flow field output, timestamped run directories.
Still needed:
- VTK XML format (.vtu, .pvtu)
- Parallel VTK files
- Time series support
- Binary encoding
Implemented: timeseries data, centerline profiles, global statistics, velocity magnitude, automatic headers.
- Efficient binary format
- Incremental checkpoints
- Automatic recovery
- Catalyst/ParaView integration
- ADIOS2 integration
Goal: Validate against reference solutions and provide comprehensive documentation.
Status: COMPLETE — All backends validated at 33×33, Re=100. Projection RMS < 0.10, Explicit Euler RMS < 0.15, all backends within 0.1% of each other. Grid convergence now monotonic.
Remaining work (129×129 full validation):
- Confirm all backends pass at 129×129 on EC2 and record RMS values
- Use CAVITY_FULL_VALIDATION=1 build flag
Acceptance Criteria:
- RMS error vs Ghia < 0.10 for Re=100, 400, 1000
- All 7 solver backends produce identical results (within 0.1%)
- Grid convergence: error decreases monotonically with refinement
Resolved by switching projection backends from Red-Black SOR to CG. Grid convergence tests now enforce strict monotonicity with convergence rate reporting.
Velocity/energy decay, L2 error, grid convergence, divergence-free, backend consistency, long-time and low-viscosity stability — all verified.
Profile stability (<1% RMS), mass conservation (<1%), pressure gradient (<5% error), inlet BC (machine precision) — all verified.
- Backward-facing step - compare to Armaly et al. (1983)
- Flow over cylinder - compare to Williamson (1996)
Goal: Full-length validation tests that run during releases (too slow for CI).
CI vs Release Parameters:
| Test | CI Mode | Release Mode |
|---|---|---|
| Cavity Ghia Validation | 33×33, 5000 steps | 129×129, 50000 steps |
| Cavity Re=400 Stability | 25×25, 500 steps | 65×65, 20000 steps |
| Grid Convergence | 17→25→33 | 33→65→129 |
| Taylor-Green Vortex | 32×32, 200 steps | 128×128, 10000 steps |
Release Validation Tests to Implement:
- Full Ghia validation at 129×129 grid for Re=100, 400, 1000
- Extended cavity flow convergence (run to true steady-state, residual < 1e-8)
- Multi-Reynolds grid convergence study (Richardson extrapolation)
- Taylor-Green vortex decay rate verification (extended time)
- Cross-architecture consistency check (all backends produce identical results)
- Memory usage and performance regression benchmarks
- Grid independence studies for benchmark cases
- Time step independence studies
- Richardson extrapolation for error estimation
- Automated convergence reporting
- Doxygen API documentation
- Theory/mathematics guide
- User tutorials
- Installation guide
- Troubleshooting guide
- Performance tuning guide
- Developer guide
Implemented: 12 examples (minimal, basic simulation, animated flow, velocity visualization, performance/runtime comparison, solver selection, custom BCs, custom source terms, CSV export, lid-driven cavity).
Still needed:
- Heat transfer examples
- Turbulent flow examples
- Parallel computing examples (MPI)
Goal: Enable fast inference of pre-trained neural network surrogate models in pure C, without Python dependencies.
Priority: P3 - Future enhancement
Train models in Python (PyTorch/JAX), deploy inference in C for:
- ~1000× speedup over traditional CFD for inference
- No external dependencies at runtime
- SIMD-optimized matrix operations
- Embedded systems / HPC integration
- Define binary weight format (.cfdnn)
- JSON metadata support for model info
- Weight loading API
- Dense (fully connected) layer
- Activation functions (ReLU, Leaky ReLU, Tanh, Sigmoid, GELU, Swish/SiLU)
- Batch normalization
- Layer normalization
- Dropout (inference mode = identity)
- AVX2 matrix-vector multiply
- AVX2 matrix-matrix multiply (for batched inference)
- NEON equivalents for ARM
- Cache-optimized tiling for large layers
| Architecture | Status | Notes |
|---|---|---|
| Fully Connected (MLP) | Planned | Priority - most common for PINNs |
| Convolutional (Conv2D) | Future | For grid-based models |
| Fourier Neural Operator | Future | State-of-the-art for PDEs |
- High-level `cfdnn_predict()` and `cfdnn_predict_batch()` API
- Model loading/freeing lifecycle
- Use ML prediction as initial guess for iterative solver
- Adaptive switching: ML for steady-state estimate, CFD for accuracy
- Uncertainty quantification (ensemble models)
- Inference time benchmarks vs Python
- Accuracy comparison with CFD solver
- Memory usage profiling
- Example: Cavity flow surrogate
- Load and run inference on exported PyTorch models
- <1ms inference time for 64×64 grid
- <5% L2 error compared to CFD solver (Re < 200)
- SIMD kernels show >2× speedup over scalar
- GGML - Lightweight C tensor library
- ONNX Runtime C API - Alternative: use ONNX format
- TensorFlow Lite Micro - Embedded ML reference
Goal: Provide C-level building blocks for ML workflows while Python handles training and high-level orchestration.
Priority: P3 - Optional/Alternative to Phase 7
Alternative to full C inference (Phase 7). Python handles training/orchestration, C provides optimized compute kernels. Lower effort, easier to support new architectures.
- SIMD-optimized matrix operations (`cfd_matmul`, `cfd_matmul_add_bias`)
- Activation functions (ReLU, Tanh, GELU — in-place)
- Python bindings for C compute kernels
- Optimized NS residual computation for PINN training
- Finite difference operators (gradient, laplacian)
- Batch simulation API for dataset generation with OpenMP parallelization
- Zero-copy shared buffers between C and Python/NumPy
| Aspect | Phase 7 (Full C Inference) | Phase 8 (Hybrid) |
|---|---|---|
| Implementation effort | High | Medium |
| Python dependency | None at runtime | Required |
| New architecture support | Requires C changes | Just Python |
| Deployment | Embedded/HPC friendly | Python environment |
| Performance | Highest | High (kernel-level) |
| Flexibility | Fixed architectures | Any PyTorch model |
- Stretched grid formula fix, tanh-based stretching, grid unit tests
- Phases 0–4: Indexing macros, core data structures, 3D stencils, NS solvers, BCs
- Phases 5–7: SIMD, OMP, CUDA backends for 3D
- Phase 8: 3D VTK output, examples, validation (Taylor-Green 3D, Poiseuille 3D)
- Energy equation
- Thermal boundary conditions
- Natural convection validation
- At least one RANS model (k-epsilon or SA)
- Wall functions
- Turbulent channel flow validation
- MPI parallelization
- Scalability benchmarks
- HDF5 parallel I/O
- Unstructured mesh support
- Gmsh import
- Complex geometry examples
- All Phase 1-6 features complete
- Comprehensive validation suite
- Complete documentation
- Stable API
- Performance optimized
| Priority | Meaning |
|---|---|
| P0 | Critical - blocks v1.0 release |
| P1 | Important - required for v1.0 |
| P2 | Valuable - nice to have for v1.0 |
| P3 | Future - post v1.0 |
See CONTRIBUTING.md for guidelines on contributing to this project.
When working on roadmap items:
- Create an issue referencing the roadmap item
- Create a feature branch
- Implement with tests
- Update documentation
- Submit PR referencing the issue
- Ghia et al. (1982) - Lid-driven cavity
- Kim & Moin (1985) - Turbulent channel flow
- Armaly et al. (1983) - Backward-facing step
- Williamson (1996) - Vortex shedding from cylinder
- Ferziger & Peric - "Computational Methods for Fluid Dynamics"
- Versteeg & Malalasekera - "An Introduction to CFD"
- Moukalled et al. - "The Finite Volume Method in CFD"
- Wilcox - "Turbulence Modeling for CFD"
- Pope - "Turbulent Flows"