From 7b5318646d1c8e4047900a1484e032c4b412848a Mon Sep 17 00:00:00 2001
From: hdsong2
Date: Tue, 17 Mar 2026 16:38:19 +0800
Subject: [PATCH 1/3] refactor(github-action-diagnose): expand troubleshooting reference and simplify SKILL.md

Move inline failure patterns from SKILL.md into the reference doc and
add new sections: driver/hardware keywords, K8s runner deprecation,
OOM/resource overflow, multi-node cascading timeout, and node network
failure with identification signals and differentiation guide.
---
 .../github-action-diagnose/SKILL.md           | 10 +----
 .../references/ascend-troubleshooting.md      | 45 +++++++++++++++++++
 2 files changed, 47 insertions(+), 8 deletions(-)

diff --git a/skills/infrastructure/github-action-diagnose/SKILL.md b/skills/infrastructure/github-action-diagnose/SKILL.md
index 5e1fc11..657d572 100644
--- a/skills/infrastructure/github-action-diagnose/SKILL.md
+++ b/skills/infrastructure/github-action-diagnose/SKILL.md
@@ -49,18 +49,12 @@ allowed-tools: Bash(gh run view:*), Bash(gh run list:*), Bash(gh api:*), Bash(ku
 - Pod 已销毁:在日志中搜索 `Successfully assigned to `。

 #### 2b. 环境故障深度定位
-检查以下故障模式:
-- **驱动/硬件失效**:`npu-smi info` 报错、`ERR99999`、`error code 507035`、`Device not found`。
-- **资源溢出 (OOM)**:系统级 `Killed` 信号、`Bus error`(SHM 不足)、`No space left on device`。
-- **多机连锁超时**:多机任务中某节点报 `Timeout`,优先识别并检查 Master 节点。Master 节点识别方法:
-  - 检查 `RANK_TABLE_FILE` 中 `rank_id=0` 对应的节点
-  - 或在日志中搜索 `master_addr` / `MASTER_ADDR` 环境变量指向的节点
-  - 确认 Master 节点是否有 `Unexpected Exit` 或驱动报错
+读取 `references/ascend-troubleshooting.md`,按其中的故障模式逐一比对日志,涵盖:驱动/硬件失效、资源饱和、OOM、设备插件异常、Runner 版本过期、多机连锁超时、节点出口网络故障


 ### Step 3: 非环境问题判定

-如果 Step 2 未发现环境/硬件故障,检查以下非环境因素:
+如果 Step 2 未发现环境/硬件/网络故障,检查以下非环境因素:
 - **YAML 语法错误**:如 `undefined variable "False"`(应为小写 `false`)。
 - **业务逻辑报错**:`AssertionError`、Python Traceback 指向业务源码、测试用例失败。

diff --git a/skills/infrastructure/github-action-diagnose/references/ascend-troubleshooting.md b/skills/infrastructure/github-action-diagnose/references/ascend-troubleshooting.md
index 60118bf..52adb21 100644
--- a/skills/infrastructure/github-action-diagnose/references/ascend-troubleshooting.md
+++ b/skills/infrastructure/github-action-diagnose/references/ascend-troubleshooting.md
@@ -22,6 +22,11 @@ Standard output of `npu-smi info`:
 | HBM Usage | If near 100%, memory fragmentation or leakage may occur. |
 | Temp | Should be < 80°C typically. High temp leads to frequency reduction. |

+Driver/hardware failure keywords in logs:
+- `ERR99999`: General NPU driver error.
+- `error code 507035`: NPU device initialization failure.
+- `Device not found`: NPU not recognized by driver.
+
 ## 3. Kernel & Driver Logs (dmesg)
 Key strings to search for:
 - `hiai: npu heartbeat loss`: Hardware/Firmware crash.
@@ -37,3 +42,43 @@ Look for:
 - `get npu device count failed`
 - `npu device is unhealthy`

+## 5. K8s Runner Logs
+Key strings to search for:
+- **Runner version v2.330.0 is deprecated and cannot receive messages**: github action runner exits for its version is deprecated
+
+## 6. OOM / Resource Overflow
+- `Killed`: system-level OOM kill signal.
+- `Bus error`: insufficient shared memory (SHM).
+- `No space left on device`: disk full.
+
+## 7. Multi-node Cascading Timeout
+When a multi-node job reports `Timeout`, identify the Master node first — it is the most likely root cause.
+
+Master node identification:
+- Check `RANK_TABLE_FILE` for the node with `rank_id=0`.
+- Search logs for `master_addr` / `MASTER_ADDR` environment variable.
+- Confirm whether the Master node has `Unexpected Exit` or driver errors.
+
+## 8. Node Network Failure (出口网络故障)
+
+When a runner node has degraded or broken external network connectivity, symptoms appear during dependency installation or artifact download steps — not in NPU-related steps.
+
+### Identification Signals
+
+| curl Error | Code | Description |
+|---|---|---|
+| `SSL connection timeout` | 28 | SSL handshake blocked; 0 bytes transferred over minutes |
+| `HTTP/2 stream 0 was not closed cleanly: PROTOCOL_ERROR (err 1)` | 92 | Connection reset by server/proxy after partial download |
+| `Could not resolve host` | 6 | DNS resolution failure |
+
+### Precursor Pattern
+- Download speed consistently < 50 KB/s for multi-MB files throughout the job (visible in wget/curl progress output)
+- Example: a 2.35 MB file taking 4+ minutes at 8–14 KB/s before the final curl failure
+
+### Differentiation from Code Issues
+- Failure is in a `curl`/`wget`/`pip` download command, not a Python traceback or test assertion
+- Other concurrent jobs on different runner nodes complete the same installation steps successfully in the same Run
+- Error codes are OS-level (curl exit codes), not Python exceptions
+
+### Node Tracing
+Apply the standard Step 2a node tracing procedure. If multiple failing jobs share the same runner name prefix (same node pool), the fault is pool-level rather than isolated to a single node.
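Review note: the keyword lists that this patch adds to the reference doc (driver/hardware keywords, runner deprecation, OOM/resource overflow, and the curl errors in section 8) could be applied to a job log roughly as in the sketch below. This is only an illustration under stated assumptions: the category labels, the `classify_log` function, and the sample log lines are hypothetical and not part of the patch. Only the matched substrings come from the reference doc, and the runner-version pattern generalizes the single version string quoted there.

```python
import re

# Substrings come from the reference doc in this patch.
# Category labels are hypothetical, chosen for illustration only.
FAILURE_PATTERNS = {
    "driver_hardware": [r"ERR99999", r"error code 507035", r"Device not found"],
    # Assumption: any runner version may be reported, not just v2.330.0.
    "runner_deprecated": [r"Runner version v[\d.]+ is deprecated"],
    "oom_resource": [r"\bKilled\b", r"Bus error", r"No space left on device"],
    "node_network": [
        r"SSL connection timeout",
        r"PROTOCOL_ERROR",
        r"Could not resolve host",
    ],
}


def classify_log(log_text):
    """Return {category: [matching lines]} for every category that matched."""
    hits = {}
    for line in log_text.splitlines():
        for category, patterns in FAILURE_PATTERNS.items():
            if any(re.search(p, line) for p in patterns):
                hits.setdefault(category, []).append(line.strip())
    return hits


# Hypothetical log excerpt, made up for this example.
sample = """\
curl: (6) Could not resolve host: example.invalid
RuntimeError: error code 507035
all tests passed
"""
result = classify_log(sample)
```

A match here is a hint, not a verdict. A word like `Killed` can appear in benign output, and the differentiation checks in section 8 (for example, whether jobs on other runner nodes passed the same step) still need to be done by hand.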
From 05bc51b3bfafd67189498d1d7cf3fb5082476663 Mon Sep 17 00:00:00 2001
From: hdsong2
Date: Wed, 18 Mar 2026 14:48:26 +0800
Subject: [PATCH 2/3] chore: add CLAUDE.local.md to gitignore

---
 .gitignore | 1 +
 1 file changed, 1 insertion(+)

diff --git a/.gitignore b/.gitignore
index 3a2d7e3..43a2903 100644
--- a/.gitignore
+++ b/.gitignore
@@ -13,6 +13,7 @@
 # Claude Code settings
 .claude/
 .omc/
+CLAUDE.local.md

 # Temporary files
 *.tmp
From d75b2f2aa43fb22774da875349e87158c44e3e96 Mon Sep 17 00:00:00 2001
From: hdsong2
Date: Wed, 18 Mar 2026 14:56:34 +0800
Subject: [PATCH 3/3] fix(github-action-diagnose): replace kubectl config with gh pr view in allowed-tools

---
 skills/infrastructure/github-action-diagnose/SKILL.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/skills/infrastructure/github-action-diagnose/SKILL.md b/skills/infrastructure/github-action-diagnose/SKILL.md
index 657d572..0f7ad79 100644
--- a/skills/infrastructure/github-action-diagnose/SKILL.md
+++ b/skills/infrastructure/github-action-diagnose/SKILL.md
@@ -1,7 +1,7 @@
 ---
 name: github-action-diagnose
 description: 诊断昇腾(Ascend)NPU 集群上 GitHub Actions 执行失败的原因,定位基础设施故障与根因分析
-allowed-tools: Bash(gh run view:*), Bash(gh run list:*), Bash(gh api:*), Bash(kubectl get:*), Bash(kubectl describe:*), Bash(kubectl logs:*), Bash(kubectl config:*), Read, Grep
+allowed-tools: Bash(gh run view:*), Bash(gh run list:*), Bash(gh pr view:*), Bash(gh api:*), Bash(kubectl get:*), Bash(kubectl describe:*), Bash(kubectl logs:*), Read, Grep
 ---

 ## Your task