Fix BLAS spec violation: zero C when beta=0 in split_k and local_multiply_cpu by exopoiesis · Pull Request #168 · eth-cscs/COSMA

exopoiesis · 2026-04-07T11:17:27Z

Summary

When beta=0, the BLAS specification states that matrix C may be uninitialized and must not be read. COSMA violates this in two places, causing sporadic incorrect results and segfaults in downstream codes (specifically CP2K AIMD with GPU-accelerated COSMA).

Bug 1: `multiply.cpp` split_k accumulation

When the strategy splits along k, iterations K>0 set beta=1 to accumulate partial results into C. However, C is allocated from the memory pool and never initialized. With NaN/Inf in uninitialized memory:

K=0: beta=0 → OK (result overwrites C)
K=1: beta=1 → reads uninitialized C → 1.0 * NaN + partial = NaN

Fix: zero-fill C before the split_k loop when beta=0 and divisor > 1.

Bug 2: `local_multiply_cpu` scaling

Cvalue *= beta;  // when beta=0 and Cvalue=NaN: 0.0 * NaN = NaN (IEEE 754)

Fix: conditional assignment: Cvalue = (beta == 0) ? 0 : Cvalue * beta

Relationship to PR #159

PR #159 (v2.8.0) added C.fill(beta) in multiply_using_layout, which fixes the user-facing matrix C. However, the internal C_cosma buffer (from memory pool) used in recursive multiply() is not covered. This PR fixes the internal path.

Reproduction

Found in CP2K 2025.2+ AIMD: parallel_gemm('T','N', ..., beta=0.0, overlap_vv) called from make_basis_lowdin in qs_mo_methods.F. With MPI ranks > 1 and COSMA splitting along k, uninitialized pool memory corrupts the Lowdin orthogonalization → segfault or silent wrong results.

CP2K correctly passes beta=0 per BLAS spec — the matrices need not be pre-initialized.

Testing

Added BetaZero.FloatUninitialized and BetaZero.DoubleUninitialized regression tests
Tests fill C with NaN, call local_multiply_cpu with beta=0, verify no NaN in output and correct result
All existing tests pass (test.mapper, test.multiply_using_layout, test.scalar_matmul including new tests)
Note: test.multiply (16 MPI ranks) fails/hangs on both upstream master and this branch — pre-existing MPI RMA issue unrelated to this change

🤖 Generated with Claude Code

…al_multiply_cpu When beta=0, the BLAS specification states that matrix C may be uninitialized and its contents must not be read. COSMA violated this in two places: 1. multiply.cpp split_k path: iterations K>0 set beta=1 to accumulate partial results, but the C buffer from the memory pool was never initialized. With NaN/Inf in uninitialized memory, this produces garbage results (0*NaN=NaN per IEEE 754, and 1*NaN=NaN for K>0). 2. local_multiply_cpu: `Cvalue *= beta` reads C even when beta=0. Per IEEE 754, 0.0 * NaN = NaN, so this corrupts the output. Fix: (1) zero-fill C before the split_k loop when beta=0 and divisor>1, (2) use conditional assignment in local_multiply_cpu. Add regression tests (BetaZero/FloatUninitialized, DoubleUninitialized) that fill C with NaN and verify beta=0 produces correct, NaN-free results. Found while debugging sporadic segfaults in CP2K AIMD with GPU-accelerated COSMA (parallel_gemm called with beta=0 from make_basis_lowdin). PR eth-cscs#159 (v2.8.0) partially addressed this in multiply_using_layout but did not cover the internal split_k accumulation or local_multiply_cpu. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

simonpintarelli · 2026-04-07T18:39:56Z

cscs-ci run GH200

exopoiesis mentioned this pull request Apr 7, 2026

Initialize matrices before parallel_gemm in make_basis_lowdin cp2k/cp2k#5044

Open

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix BLAS spec violation: zero C when beta=0 in split_k and local_multiply_cpu#168

Fix BLAS spec violation: zero C when beta=0 in split_k and local_multiply_cpu#168
exopoiesis wants to merge 1 commit intoeth-cscs:masterfrom
exopoiesis:fix/zero-c-buffer-split-k

exopoiesis commented Apr 7, 2026

Uh oh!

simonpintarelli commented Apr 7, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

exopoiesis commented Apr 7, 2026

Summary

Bug 1: multiply.cpp split_k accumulation

Bug 2: local_multiply_cpu scaling

Relationship to PR #159

Reproduction

Testing

Uh oh!

simonpintarelli commented Apr 7, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Bug 1: `multiply.cpp` split_k accumulation

Bug 2: `local_multiply_cpu` scaling