Comprehensive guide to solvers and numerical methods in the CFD Framework.
The CFD Framework implements multiple solvers for the incompressible Navier-Stokes equations:
∂u/∂t + (u·∇)u = -∇p/ρ + ν∇²u
∇·u = 0
All solvers support both 2D (nz=1) and 3D (nz>1) grids using a branch-free approach: stride_z=0 when nz==1 causes z-terms to vanish, producing bit-identical 2D results with zero overhead.
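The stride trick can be sketched in a few lines. `stride_z` and `laplacian` below are illustrative helpers (not the framework's actual API), assuming the row-major layout implied above:

```c
#include <stddef.h>

/* Branch-free 2D/3D indexing: with nz == 1 the z-stride is 0, so
 * u[idx - sz] and u[idx + sz] alias u[idx] and the z-part of the
 * Laplacian contributes exactly zero -- no branch, no extra cost. */
static size_t stride_z(size_t nx, size_t ny, size_t nz) {
    return (nz == 1) ? 0 : nx * ny;
}

static double laplacian(const double* u, size_t idx,
                        size_t sx, size_t sy, size_t sz, double inv_h2) {
    double lap = (u[idx - sx] - 2.0 * u[idx] + u[idx + sx])   /* x-terms */
               + (u[idx - sy] - 2.0 * u[idx] + u[idx + sy])   /* y-terms */
               + (u[idx - sz] - 2.0 * u[idx] + u[idx + sz]);  /* 0 when sz == 0 */
    return lap * inv_h2;
}
```

With `sz == 0` the third line evaluates to exactly `0.0`, which is why the 2D results are bit-identical to a dedicated 2D kernel.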
Simple forward Euler time integration using finite differences.
Equation:
u^(n+1) = u^n + dt * [-(u·∇)u + ν∇²u - ∇p/ρ]
Characteristics:
- First-order accurate in time
- Simple and fast
- Does not enforce incompressibility strictly
- Good for learning and debugging
Stability: CFL condition must be satisfied:
dt ≤ min(dx²/(4ν), dx/u_max)
Available Backends:
| Solver | Backend | Description |
|---|---|---|
| `explicit_euler` | Scalar | Basic implementation |
| `explicit_euler_optimized` | SIMD | SIMD-optimized (auto-detects AVX2/NEON) |
| `explicit_euler_omp` | OpenMP | Multi-threaded |
| `explicit_euler_gpu` | CUDA | GPU-accelerated |
Chorin's projection method - properly enforces incompressibility constraint.
Algorithm (Fractional Step Method):
1. Predictor Step - Compute intermediate velocity (ignore pressure):
   u* = u^n + dt * [-(u·∇)u + ν∇²u]
2. Pressure Poisson Solve - Enforce incompressibility:
   ∇²p^(n+1) = (ρ/dt) * ∇·u*
3. Corrector Step - Project velocity to divergence-free space:
   u^(n+1) = u* - (dt/ρ) * ∇p^(n+1)
Characteristics:
- Second-order accurate in space (central differences)
- First-order accurate in time
- Properly enforces ∇·u = 0
- More expensive due to Poisson solve
Available Backends:
| Solver | Backend | Description |
|---|---|---|
| `projection` | Scalar | Basic implementation |
| `projection_optimized` | SIMD | SIMD-optimized (runtime detection: AVX2/NEON) |
| `projection_omp` | OpenMP | Multi-threaded |
| `projection_jacobi_gpu` | GPU | CUDA-accelerated (Jacobi iteration) |
The projection method requires solving the pressure Poisson equation:
∇²p = f
Algorithm:
p_ij^(k+1) = (p_i-1,j + p_i+1,j + p_i,j-1 + p_i,j+1 - h²f_ij) / 4
Characteristics:
- Simple, parallel-friendly
- Slow convergence: O(n²) iterations for n×n grid
- Unconditionally stable
- Good for GPU/SIMD
Convergence Rate: ρ ≈ 1 - π²/(2n²)
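A minimal scalar sweep implementing the update formula above; the array layout and function name are illustrative, not the framework's internals:

```c
/* One Jacobi sweep for the 2D Poisson equation grad^2 p = f on a uniform
 * grid (spacing h) with fixed (Dirichlet) boundary values. Reads from p,
 * writes interior points of p_new; boundary rows/columns are untouched. */
static void jacobi_sweep(const double* p, double* p_new, const double* f,
                         int nx, int ny, double h) {
    for (int j = 1; j < ny - 1; j++)
        for (int i = 1; i < nx - 1; i++) {
            int k = i + j * nx;
            p_new[k] = 0.25 * (p[k - 1] + p[k + 1] + p[k - nx] + p[k + nx]
                               - h * h * f[k]);
        }
}
```

The caller swaps (or copies) `p_new` into `p` between sweeps; the two-buffer form is what makes Jacobi embarrassingly parallel.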
Usage:
```c
poisson_solver_t* solver = poisson_solver_create(POISSON_METHOD_JACOBI,
                                                 POISSON_BACKEND_SIMD);
```

Algorithm:
p_ij^(k+1) = (1-ω)p_ij^k + (ω/4)(p_i-1,j + p_i+1,j + p_i,j-1 + p_i,j+1 - h²f_ij)
Optimal Relaxation (Dirichlet BCs):
ω_opt = 2 / (1 + sin(πh))
Characteristics:
- Faster than Jacobi (ω > 1)
- Sequential row updates (row j depends on j-1)
- Optimal ω depends on problem
- SIMD variant uses Block SOR: processes SIMD_WIDTH consecutive cells per block, with intra-block left-neighbor approximation (see Block SOR technical note)
Convergence Rate: ρ ≈ 1 - 2πh (with optimal ω)
Available Backends:
| Solver | Backend | Description |
|---|---|---|
| `sor_scalar` | Scalar | Sequential Gauss-Seidel + SOR relaxation |
| `sor_simd` | SIMD | Block SOR (auto-detects AVX2/NEON) |
Usage:
```c
poisson_solver_params_t params = poisson_solver_params_default();
params.omega = 1.5;  // Relaxation parameter

// Scalar (fully sequential, best convergence per iteration)
poisson_solver_t* solver = poisson_solver_create(POISSON_METHOD_SOR,
                                                 POISSON_BACKEND_SCALAR);

// SIMD (Block SOR, higher throughput, slightly more iterations)
poisson_solver_t* solver_simd = poisson_solver_create(POISSON_METHOD_SOR,
                                                      POISSON_BACKEND_SIMD);
```

Algorithm:
- Split grid into red and black points (checkerboard)
- Update all red points (can parallelize)
- Update all black points (can parallelize)
Characteristics:
- Same convergence as SOR
- Parallelizable (red/black decoupled)
- SIMD and GPU friendly
- Preferred over standard SOR for performance
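The checkerboard update can be written as a single in-place sweep; names and layout below are illustrative:

```c
/* One red-black SOR sweep for grad^2 p = f: cells with (i + j) even are
 * "red", odd are "black". All cells of one color depend only on the other
 * color, so each inner pass can be parallelized or vectorized freely. */
static void rb_sor_sweep(double* p, const double* f, int nx, int ny,
                         double h, double omega) {
    for (int color = 0; color < 2; color++)
        for (int j = 1; j < ny - 1; j++)
            for (int i = 1; i < nx - 1; i++) {
                if ((i + j) % 2 != color) continue;
                int k = i + j * nx;
                double gs = 0.25 * (p[k - 1] + p[k + 1] + p[k - nx] + p[k + nx]
                                    - h * h * f[k]);
                p[k] = (1.0 - omega) * p[k] + omega * gs;
            }
}
```

Unlike plain SOR, the update is in-place with no row-to-row ordering constraint inside a color, which is what makes it SIMD- and GPU-friendly.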
Usage:
```c
poisson_solver_t* solver = poisson_solver_create(POISSON_METHOD_REDBLACK_SOR,
                                                 POISSON_BACKEND_SIMD);
```

Algorithm: Krylov subspace method for symmetric positive definite systems.
Characteristics:
- Converges in at most N iterations (N = number of unknowns) in exact arithmetic
- Practical convergence: O(√κ) where κ = condition number
- Memory efficient (no matrix storage)
- Best for large grids
Convergence: For Poisson equation, κ ≈ 4(n-1)²/π², giving O(n) iterations.
Usage:
```c
poisson_solver_t* solver = poisson_solver_create(POISSON_METHOD_CG,
                                                 POISSON_BACKEND_SIMD);
```

Algorithm: CG with preconditioner M to improve conditioning.
Available Preconditioners:
- Jacobi (Diagonal): M = diag(A)
- Simple, cheap per iteration
- Variable coefficients: can reduce iterations
- Constant coefficients (uniform grid): no benefit
- SSOR: M = (D + L)D⁻¹(D + U) (future)
- ILU: Incomplete LU factorization (future)
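The "no benefit" caveat for constant coefficients is easy to see in code. A Jacobi preconditioner applies M⁻¹r = r / diag(A); when the diagonal is the same value everywhere (4/h² for the uniform-grid Laplacian), this is a uniform rescaling that leaves the Krylov directions, and hence the iteration count, unchanged. Sketch (illustrative names):

```c
/* z = M^-1 r with M = diag(A). With a constant diagonal this is just a
 * uniform scaling of the residual, which does not change CG's search
 * directions -- hence no benefit on uniform grids. */
static void jacobi_precondition(const double* r, double* z, const double* diag,
                                int n) {
    for (int k = 0; k < n; k++)
        z[k] = r[k] / diag[k];
}
```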
Usage:
```c
poisson_solver_params_t params = poisson_solver_params_default();
params.preconditioner = POISSON_PRECOND_JACOBI;  // Enable preconditioning

poisson_solver_t* solver = poisson_solver_create(POISSON_METHOD_CG,
                                                 POISSON_BACKEND_SIMD);
poisson_solver_init(solver, nx, ny, dx, dy, &params);  // Pass params with preconditioner
```

Algorithm: Bi-Conjugate Gradient Stabilized for non-symmetric systems.
Characteristics:
- Handles non-symmetric matrices
- More robust than CG for difficult problems
- Higher cost per iteration than CG
Usage:
```c
poisson_solver_t* solver = poisson_solver_create(POISSON_METHOD_BICGSTAB,
                                                 POISSON_BACKEND_SCALAR);
```

Problem: 65×65 grid, tolerance = 1e-6
| Method | Iterations | Time (ms) | Notes |
|---|---|---|---|
| Jacobi | ~8000 | 45 | Simple, slow |
| SOR (ω=1.5) | ~2000 | 15 | Good serial performance |
| Red-Black SOR | ~2000 | 8 | Parallelizable |
| CG | ~80 | 5 | Best for large grids |
| PCG (Jacobi) | ~80 | 5.5 | No benefit on uniform grid |
| BiCGSTAB | ~40 | 4 | Fastest convergence |
Note: Jacobi preconditioning provides no benefit on uniform grids with constant coefficients (diagonal is constant 4/h²).
Basic C implementation:
- No SIMD, no parallelization
- Portable to all platforms
- Good baseline for correctness
Use when:
- Debugging
- Small grids (<50×50)
- Platform doesn't support SIMD
Vectorized implementation using intrinsics:
AVX2 (x86-64):
- 4 double operations per instruction (256-bit)
- Fused multiply-add (FMA) for performance
- 2-3x speedup over scalar
NEON (ARM64):
- 2 double operations per instruction (128-bit)
- ARM64 architecture optimization
- 1.5-2x speedup over scalar
Features:
- Aligned memory allocation
- Vectorized inner loops
- Scalar remainder handling
Use when:
- Medium to large grids (>100×100)
- Modern CPU (Intel Haswell+ or ARM Cortex-A57+)
- Single-threaded workload
Example:
```c
// AVX2 inner loop (simplified; assumes n is a multiple of 4 and
// 32-byte-aligned arrays -- real kernels handle the scalar remainder)
for (size_t i = 0; i < n; i += 4) {
    __m256d u_vec = _mm256_load_pd(&u[i]);
    __m256d v_vec = _mm256_load_pd(&v[i]);
    __m256d result = _mm256_fmadd_pd(u_vec, v_vec, sum_vec);
    _mm256_store_pd(&out[i], result);
}
```

Multi-threaded parallelization:
Features:
- Thread pool for hot loops
- Static scheduling for load balance
- Scales with CPU cores
Speedup: ~3-6x on 8-core CPU (depends on grid size and cache)
Use when:
- Large grids (>200×200)
- Multi-core CPU available
- Long simulations
Example:
```c
#pragma omp parallel for schedule(static)
for (int j = 1; j < ny - 1; j++) {
    for (int i = 1; i < nx - 1; i++) {
        // Stencil operation...
    }
}
```

GPU-accelerated using CUDA:
Features:
- Thousands of parallel threads
- Coalesced memory access
- Shared memory tiling
Speedup: 10-50x for very large grids (>500×500)
Use when:
- Very large grids (>200×200)
- Many time steps
- NVIDIA GPU available
Crossover Point:
- Small grids: CPU faster (data transfer overhead)
- Grid >200×200: GPU becomes beneficial
- Grid >500×500: GPU significantly faster
Example:
```c
// Check if GPU should be used
gpu_config_t config = gpu_config_default();
if (gpu_should_use(&config, nx, ny, num_steps)) {
    solver = cfd_solver_create(registry, "projection_jacobi_gpu");
} else {
    solver = cfd_solver_create(registry, "projection_optimized");
}
```

| Solver | Time (ms) | Speedup | Accuracy |
|---|---|---|---|
| explicit_euler | 2.6 | 1.0x | Low |
| explicit_euler_optimized | 0.9 | 2.9x | Low |
| explicit_euler_omp (8 cores) | 0.8 | 3.3x | Low |
| projection | 19.0 | 1.0x | High |
| projection_optimized | 5.3 | 3.6x | High |
| projection_omp (8 cores) | 4.2 | 4.5x | High |
| projection_jacobi_gpu | 8.4 | 0.45x† | High |
† GPU slower on small grids due to data transfer overhead
| Grid | Time (s) | Memory (MB) | Iterations/step |
|---|---|---|---|
| 50×50 | 0.5 | 0.02 | ~150 |
| 100×100 | 2.1 | 0.08 | ~300 |
| 200×200 | 9.8 | 0.31 | ~600 |
| 500×500 | 82.4 | 1.91 | ~1500 |
| Solver | Time (s) | Speedup |
|---|---|---|
| projection_optimized | 824 | 1.0x |
| projection_jacobi_gpu | 68 | 12.1x |
Need strict incompressibility enforcement?
├─ No → Use Explicit Euler family (faster)
│ ├─ Small grid (<100×100) → explicit_euler
│ ├─ Medium grid (100-500) → explicit_euler_optimized or explicit_euler_omp
│ └─ Large grid (>500) → explicit_euler_omp or explicit_euler_gpu
│
└─ Yes → Use Projection Method family
├─ Small grid (<100×100) → projection
├─ Medium grid (100-500) → projection_optimized or projection_omp
└─ Large grid (>500) → projection_jacobi_gpu
GPU available and grid >200×200?
└─ Use CUDA variant for 10-50x speedup
Learning/Debugging:
- `explicit_euler` or `projection` - Simple, predictable behavior
- Easy to inspect intermediate results

Production Simulations:
- `projection_optimized` or `projection_omp` (medium grids)
- `projection_jacobi_gpu` (large grids)
- Best accuracy and performance

Benchmarking:
- Always use `CMAKE_BUILD_TYPE=Release`
- Disable I/O during timing measurements
- Run multiple iterations for statistical significance
- Dirichlet - Fixed values: u(boundary) = u_bc
- Neumann - Fixed gradient: ∂u/∂n(boundary) = g_bc
- Periodic - u(0) = u(L)
- No-slip - u = 0, v = 0 (walls)
- Inlet - Specified velocity profile
- Outlet - Zero-gradient: ∂u/∂n = 0
Boundary conditions are applied after each solver step:
```c
void apply_boundary_conditions(flow_field* field, grid_t* grid) {
    // No-slip walls (u = 0, v = 0)
    for (size_t i = 0; i < grid->nx; i++) {
        field->u[i + 0 * grid->nx] = 0.0;               // Bottom
        field->u[i + (grid->ny - 1) * grid->nx] = 0.0;  // Top
    }
    // Lid-driven cavity: the moving top wall overrides the no-slip value above
    for (size_t i = 0; i < grid->nx; i++) {
        field->u[i + (grid->ny - 1) * grid->nx] = 1.0;
    }
}
```

Classic benchmark problem validated against Ghia et al. (1982):
Setup:
- Square cavity [0,1] × [0,1]
- Top wall moves with velocity u=1
- All other walls no-slip
- Reynolds number Re = ρUL/μ
Results (Re=100, 129×129 grid):
- Centerline velocity profiles match published data
- RMS error < 0.01
See validation/lid-driven-cavity.md for details.
Analytical solution for decaying vortex:
Exact Solution:
u(x,y,t) = -cos(x)sin(y)exp(-2νt)
v(x,y,t) = sin(x)cos(y)exp(-2νt)
p(x,y,t) = -0.25(cos(2x) + cos(2y))exp(-4νt)
Results:
- Velocity error < 1% at t=1.0
- Energy decay matches analytical solution
3D Extension:
u(x,y,z,t) = cos(x)sin(y)cos(z)exp(-3νt)
v(x,y,z,t) = -sin(x)cos(y)cos(z)exp(-3νt)
w(x,y,z,t) = 0
- Velocity decays as exp(-3νt), kinetic energy as exp(-6νt)
- Validated on 16×16×16 grid with ν=0.01
- Chorin, A.J. (1968). "Numerical Solution of the Navier-Stokes Equations". Mathematics of Computation.
- Ferziger & Peric - "Computational Methods for Fluid Dynamics"
- Versteeg & Malalasekera - "An Introduction to CFD"
- Ghia, U., Ghia, K.N., Shin, C.T. (1982). "High-Re Solutions for Incompressible Flow Using the Navier-Stokes Equations and a Multigrid Method". Journal of Computational Physics.
- Saad, Y. - "Iterative Methods for Sparse Linear Systems"
- Barrett et al. - "Templates for the Solution of Linear Systems"
- Examples - See solvers in action
- API Reference - API documentation
- Building - Build with specific solvers