Skip to content

Fix BLAS spec violation: zero C when beta=0 in split_k and local_multiply_cpu#168

Open
exopoiesis wants to merge 1 commit intoeth-cscs:masterfrom
exopoiesis:fix/zero-c-buffer-split-k
Open

Fix BLAS spec violation: zero C when beta=0 in split_k and local_multiply_cpu#168
exopoiesis wants to merge 1 commit intoeth-cscs:masterfrom
exopoiesis:fix/zero-c-buffer-split-k

Conversation

@exopoiesis
Copy link
Copy Markdown

Summary

When beta=0, the BLAS specification states that matrix C may be uninitialized and must not be read. COSMA violates this in two places, causing sporadic incorrect results and segfaults in downstream codes (specifically CP2K AIMD with GPU-accelerated COSMA).

Bug 1: multiply.cpp split_k accumulation

When the strategy splits along k, iterations K>0 set beta=1 to accumulate partial results into C. However, C is allocated from the memory pool and never initialized. With NaN/Inf in uninitialized memory:

  • K=0: beta=0 → OK (result overwrites C)
  • K=1: beta=1 → reads uninitialized C → 1.0 * NaN + partial = NaN

Fix: zero-fill C before the split_k loop when beta=0 and divisor > 1.

Bug 2: local_multiply_cpu scaling

Cvalue *= beta;  // when beta=0 and Cvalue=NaN: 0.0 * NaN = NaN (IEEE 754)

Fix: conditional assignment: Cvalue = (beta == 0) ? 0 : Cvalue * beta

Relationship to PR #159

PR #159 (v2.8.0) added C.fill(beta) in multiply_using_layout, which fixes the user-facing matrix C. However, the internal C_cosma buffer (from memory pool) used in recursive multiply() is not covered. This PR fixes the internal path.

Reproduction

Found in CP2K 2025.2+ AIMD: parallel_gemm('T','N', ..., beta=0.0, overlap_vv) called from make_basis_lowdin in qs_mo_methods.F. With MPI ranks > 1 and COSMA splitting along k, uninitialized pool memory corrupts the Lowdin orthogonalization → segfault or silent wrong results.

CP2K correctly passes beta=0 per BLAS spec — the matrices need not be pre-initialized.

Testing

  • Added BetaZero.FloatUninitialized and BetaZero.DoubleUninitialized regression tests
  • Tests fill C with NaN, call local_multiply_cpu with beta=0, verify no NaN in output and correct result
  • All existing tests pass (test.mapper, test.multiply_using_layout, test.scalar_matmul including new tests)
  • Note: test.multiply (16 MPI ranks) fails/hangs on both upstream master and this branch — pre-existing MPI RMA issue unrelated to this change

🤖 Generated with Claude Code

…al_multiply_cpu

When beta=0, the BLAS specification states that matrix C may be
uninitialized and its contents must not be read. COSMA violated this
in two places:

1. multiply.cpp split_k path: iterations K>0 set beta=1 to accumulate
   partial results, but the C buffer from the memory pool was never
   initialized. With NaN/Inf in uninitialized memory, this produces
   garbage results (0*NaN=NaN per IEEE 754, and 1*NaN=NaN for K>0).

2. local_multiply_cpu: `Cvalue *= beta` reads C even when beta=0.
   Per IEEE 754, 0.0 * NaN = NaN, so this corrupts the output.

Fix: (1) zero-fill C before the split_k loop when beta=0 and
divisor>1, (2) use conditional assignment in local_multiply_cpu.

Add regression tests (BetaZero/FloatUninitialized, DoubleUninitialized)
that fill C with NaN and verify beta=0 produces correct, NaN-free results.

Found while debugging sporadic segfaults in CP2K AIMD with GPU-accelerated
COSMA (parallel_gemm called with beta=0 from make_basis_lowdin).
PR eth-cscs#159 (v2.8.0) partially addressed this in multiply_using_layout but
did not cover the internal split_k accumulation or local_multiply_cpu.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@simonpintarelli
Copy link
Copy Markdown
Member

cscs-ci run GH200

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants