fix: use mixed arch for sectioned sync kernels #484
zhangstevenunity merged 3 commits into hw-native-sys:main
Code Review
This pull request refines the AI Core architecture inference logic by introducing a check for mixed section synchronization, which now accounts for both intra-block and explicit pipe synchronization flags. This ensures that sectioned kernels on A5 and 910B platforms are correctly compiled using the dav-c310 mixed-kernel mode. Review feedback suggests improving the robustness of synchronization function detection by using regular expressions instead of simple string matching to handle whitespace variations. Additionally, there is a recommendation to ensure consistent architecture assignment for other platforms, such as Ascend910, when mixed section synchronization is identified to avoid potential compilation errors.
    if has_mixed_section_sync:
        return "dav-c310"
This logic correctly identifies the need for dav-c310 on A5/910B platforms when has_mixed_section_sync is true, but it fails to handle other platforms (like Ascend910) which should use dav-c220 in this case. This is inconsistent with the logic implemented in generate_testcase (lines 1425-1432) and may lead to compilation errors on those platforms. Consider updating this helper to return dav-c220 for non-A5/910B SoCs when mixed section synchronization is detected.
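The per-SoC behavior requested in this comment can be sketched as follows. This is a minimal illustration, not the helper from generate_testcase.py: the function name `infer_aicore_arch`, the `C310_SOC_MARKERS` tuple, and the substring-based SoC matching are assumptions made for the example, and the `-vec` fallback reflects only the default described in the PR summary.

```python
# Hypothetical markers for the SoC group the review calls "A5/910B".
# The real script may identify SoCs differently.
C310_SOC_MARKERS = ("a5", "910b")


def infer_aicore_arch(soc_version: str, has_mixed_section_sync: bool) -> str:
    """Pick a --cce-aicore-arch value (illustrative sketch only)."""
    soc = soc_version.lower()
    # Base arch depends on the SoC group: c310 for the A5/910B group, else c220.
    base = "dav-c310" if any(m in soc for m in C310_SOC_MARKERS) else "dav-c220"
    if has_mixed_section_sync:
        # Mixed cube+vec sections that synchronize need the mixed-kernel arch.
        return base
    # Otherwise keep the vector-only variant mentioned in the PR summary.
    return f"{base}-vec"
```

With this shape, the mixed-sync branch never returns a c310 arch for a SoC outside the A5/910B group, which is the inconsistency the comment points out.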
    #
    # IMPORTANT: the default arch depends on the Ascend SoC.
    has_mix_macros = "__DAV_CUBE__" in kernel_text and "__DAV_VEC__" in kernel_text
    has_flag_sync = "set_flag(" in kernel_text or "wait_flag(" in kernel_text
The string matching for set_flag( and wait_flag( is fragile as it does not account for potential whitespace between the function name and the parenthesis (e.g., set_flag (PIPE_V, ...)). Using a regular expression with word boundaries and optional whitespace is more robust and consistent with other checks in this script (like has_packed_pred_mask).
Suggested change:
    - has_flag_sync = "set_flag(" in kernel_text or "wait_flag(" in kernel_text
    + has_flag_sync = re.search(r"\b(set|wait)_flag\s*\(", kernel_text) is not None
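The difference between the two checks can be seen on a couple of small inputs. The sketch below uses the regular expression from the suggestion; the helper names and the sample kernel lines are made up for illustration.

```python
import re

# Pattern taken from the suggested change: word boundary, set_/wait_ prefix,
# optional whitespace before the opening parenthesis.
FLAG_SYNC_RE = re.compile(r"\b(set|wait)_flag\s*\(")


def has_flag_sync_substring(text: str) -> bool:
    """Original check: plain substring match."""
    return "set_flag(" in text or "wait_flag(" in text


def has_flag_sync_regex(text: str) -> bool:
    """Suggested check: regex with word boundary and optional whitespace."""
    return FLAG_SYNC_RE.search(text) is not None


# Illustrative inputs (not taken from a real kernel).
spaced_call = "set_flag (PIPE_V, PIPE_MTE3, EVENT_ID0);"
lookalike_call = "reset_flag(state);"
```

Here the substring check misses `spaced_call` because of the space before the parenthesis, and it also fires on `lookalike_call`, since `reset_flag(` contains `set_flag(`. The regex handles both cases as intended.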
    has_packed_pred_mask = re.search(r"\bTCMPS?\s*\(", raw_kernel_for_analysis) is not None
    has_dav_cube = "__DAV_CUBE__" in raw_kernel
    has_dav_vec = "__DAV_VEC__" in raw_kernel
    has_flag_sync = "set_flag(" in raw_kernel or "wait_flag(" in raw_kernel
Similar to the check in _infer_aicore_arch, this string matching is fragile. Using a regular expression would be more robust against variations in code formatting.
Suggested change:
    - has_flag_sync = "set_flag(" in raw_kernel or "wait_flag(" in raw_kernel
    + has_flag_sync = re.search(r"\b(set|wait)_flag\s*\(", raw_kernel) is not None
Codex Review
This comment is updated automatically by the review bot.
Summary: No issues were detected in PR #484; the review returned findings=[].
Findings: No issues found.
/run a5 qwen3_decode_layer_incore_0,qwen3_decode_layer_incore_1,qwen3_decode_layer_incore_2,qwen3_decode_layer_incore_3,qwen3_decode_layer_incore_4,qwen3_decode_layer_incore_5,qwen3_decode_layer_incore_6,qwen3_decode_layer_incore_7,qwen3_decode_layer_incore_8,qwen3_decode_layer_incore_9,qwen3_decode_layer_incore_10,qwen3_decode_layer_incore_11,qwen3_decode_layer_incore_12,qwen3_decode_layer_incore_13,qwen3_decode_layer_incore_14,qwen3_decode_layer_incore_15,qwen3_decode_layer_incore_16,qwen3_decode_layer_incore_17,qwen3_decode_layer_incore_18,qwen3_decode_layer_incore_19 --pto-level=level3
A5 board test failed
Failed cases
A5 board test failure details: PR #484
qwen3_decode_layer_incore_8
qwen3_decode_layer_incore_6
qwen3_decode_layer_incore_5
qwen3_decode_layer_incore_4
qwen3_decode_layer_incore_3
qwen3_decode_layer_incore_17
/run a5 qwen3_decode_layer_incore_0,qwen3_decode_layer_incore_1,qwen3_decode_layer_incore_2,qwen3_decode_layer_incore_3,qwen3_decode_layer_incore_4,qwen3_decode_layer_incore_5,qwen3_decode_layer_incore_6,qwen3_decode_layer_incore_7,qwen3_decode_layer_incore_8,qwen3_decode_layer_incore_9,qwen3_decode_layer_incore_10,qwen3_decode_layer_incore_11,qwen3_decode_layer_incore_12,qwen3_decode_layer_incore_13,qwen3_decode_layer_incore_14,qwen3_decode_layer_incore_15,qwen3_decode_layer_incore_16,qwen3_decode_layer_incore_17,qwen3_decode_layer_incore_18,qwen3_decode_layer_incore_19 --pto-level=level3
A5 board test failed
Failed cases
A5 board test failure details: PR #484
qwen3_decode_layer_incore_6
qwen3_decode_layer_incore_17
/run a5 qwen3_decode_layer_incore_0,qwen3_decode_layer_incore_1,qwen3_decode_layer_incore_2,qwen3_decode_layer_incore_3,qwen3_decode_layer_incore_4,qwen3_decode_layer_incore_5,qwen3_decode_layer_incore_6,qwen3_decode_layer_incore_7,qwen3_decode_layer_incore_8,qwen3_decode_layer_incore_9,qwen3_decode_layer_incore_10,qwen3_decode_layer_incore_11,qwen3_decode_layer_incore_12,qwen3_decode_layer_incore_13,qwen3_decode_layer_incore_14,qwen3_decode_layer_incore_15,qwen3_decode_layer_incore_16,qwen3_decode_layer_incore_17,qwen3_decode_layer_incore_18,qwen3_decode_layer_incore_19 --pto-level=level3
A5 board test failed
Log tail
/run a5 qwen3_decode_layer_incore_0,qwen3_decode_layer_incore_1,qwen3_decode_layer_incore_2,qwen3_decode_layer_incore_3,qwen3_decode_layer_incore_4,qwen3_decode_layer_incore_5,qwen3_decode_layer_incore_6,qwen3_decode_layer_incore_7,qwen3_decode_layer_incore_8,qwen3_decode_layer_incore_9,qwen3_decode_layer_incore_10,qwen3_decode_layer_incore_11,qwen3_decode_layer_incore_12,qwen3_decode_layer_incore_13,qwen3_decode_layer_incore_14,qwen3_decode_layer_incore_15,qwen3_decode_layer_incore_16,qwen3_decode_layer_incore_17,qwen3_decode_layer_incore_18,qwen3_decode_layer_incore_19 --pto-level=level3
A5 board test passed
Summary
generate_testcase.py now uses dav-c310/dav-c220 for sectioned-sync kernels instead of forcing dav-c310-vec/dav-c220-vec
Why
run_remote_npu_validation.sh was generating testcase CMake for some A5 sectioned kernels with --cce-aicore-arch=dav-c310-vec while still forcing both -D__DAV_CUBE__ and -D__DAV_VEC__. For kernels that also use explicit set_flag/wait_flag or set_intra_block/wait_intra_block across those sections, that diverges from PTO-ISA's mixed-kernel build mode and can trigger compile-time illegal sync parameter errors.

This change aligns those sectioned-sync kernels with the mixed-kernel compile path without broadening the behavior for ordinary single-section kernels.
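The detection condition described above (both section macros present, plus an explicit sync call of either family) can be sketched as a single predicate. This is an assumption-laden illustration rather than the code in generate_testcase.py: the function name `has_mixed_section_sync`, the combined regular expression, and the sample kernel text are all invented for the example.

```python
import re

# Matches set_flag/wait_flag and set_intra_block/wait_intra_block calls,
# tolerating whitespace before the parenthesis (illustrative pattern).
SYNC_CALL_RE = re.compile(r"\b(set|wait)_(flag|intra_block)\s*\(")


def has_mixed_section_sync(kernel_text: str) -> bool:
    """True when a kernel has both cube and vec sections and explicit sync."""
    has_cube = "__DAV_CUBE__" in kernel_text
    has_vec = "__DAV_VEC__" in kernel_text
    has_sync = SYNC_CALL_RE.search(kernel_text) is not None
    return has_cube and has_vec and has_sync


# Synthetic kernel text; the call arguments are placeholders, not real code.
mixed_kernel = """
#ifdef __DAV_CUBE__
  set_flag(PIPE_M, PIPE_FIX, EVENT_ID0);
#endif
#ifdef __DAV_VEC__
  wait_intra_block (PIPE_V, 0);
#endif
"""
vec_only_kernel = "#ifdef __DAV_VEC__\n  set_flag(PIPE_V, PIPE_MTE3, EVENT_ID0);\n#endif\n"
```

Only kernels for which this predicate holds would move to the mixed arch; a single-section kernel such as `vec_only_kernel` keeps its existing arch, which matches the narrow scope stated above.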
Validation
With generate_testcase.py, verified that _infer_aicore_arch() returns dav-c310 for A5 mixed-section kernels with set_flag/wait_flag
Ran generate_testcase.py on a synthetic A5 mixed-section kernel and confirmed that the generated CMakeLists.txt uses --cce-aicore-arch=dav-c310
Scope
This PR only targets the compile-time arch selection issue in remote validation. It does not attempt to address separate runtime MTE OOR failures.