fix: CpuGemmDirectConv2d: accumulate in fp32 unless in fast_mode #1287

Open

Sqvid wants to merge 1 commit into ARM-software:main from Sqvid:fix-f16-fast-conv

Conversation

Contributor

@Sqvid Sqvid commented May 5, 2026

There is a bug in the PyTorch + oneDNN + ACL stack where numerical errors were observed in certain f16 convolutions.
PyTorch issue: pytorch/pytorch#177245
oneDNN issue: uxlfoundation/oneDNN#5106

The root cause is that we accumulate in f16 rather than f32 even when fast_mode is false. This happens because CpuGemmDirectConv2d::configure() calls CpuGemmAssemblyDispatch::configure() here:

```cpp
_gemm_asm_func->configure(src, &_perm_weights, biases, dst, asm_info);
```

which in turn does the following:

```cpp
// If fast_mode is disabled, we must enable it when fp32 accumulation is not set for fp16.
bool is_fp16 =
    a->data_type() == DataType::F16 && b->data_type() == DataType::F16 && d->data_type() == DataType::F16;
bool fast_mode = info.fast_mode || (is_fp16 && !info.use_fp32_acc);
```

My solution is therefore to change CpuGemmDirectConv2d::init_assembly_metadata() so that use_fp32_acc is true unless enable_fast_math is set. This resolves the bug at the oneDNN level. Hopefully @robert-hardwick can confirm whether this fixes the PyTorch bug as well.

Change-Id: Id00f8e17b3349893164eb0b7edd616345488515e
Signed-off-by: Siddhartha Menon <siddhartha.menon@arm.com>
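
A minimal sketch of what the proposed change to init_assembly_metadata() could look like, assuming the Conv2dInfo flag is enable_fast_math and that AsmGemmInfo exposes the fast_mode and use_fp32_acc fields referenced in the dispatch snippet above. The exact signature and the elided setup are assumptions; this is not the actual patch:

```cpp
// Sketch: make CpuGemmDirectConv2d request fp32 accumulation whenever fast math
// has not been requested, so the f16 fast-mode fallback in
// CpuGemmAssemblyDispatch::configure() is no longer triggered.
AsmGemmInfo CpuGemmDirectConv2d::init_assembly_metadata(const Conv2dInfo &info, bool is_indirect)
{
    AsmGemmInfo asm_info;
    // ... existing metadata setup (method, padding, activation info, ...) ...

    asm_info.fast_mode = info.enable_fast_math;
    // Accumulate f16 GEMMs in fp32 unless the caller explicitly opted into fast math.
    asm_info.use_fp32_acc = !info.enable_fast_math;

    return asm_info;
}
```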

@Dongsung-arm Dongsung-arm left a comment


Looks good to me.
This matches the intended behavior: FP32 accumulation by default, and only disabled when enable_fast_math is set.
