1 change: 1 addition & 0 deletions .gitignore
@@ -13,6 +13,7 @@
# Claude Code settings
.claude/
.omc/
CLAUDE.local.md

# Temporary files
*.tmp
12 changes: 3 additions & 9 deletions skills/infrastructure/github-action-diagnose/SKILL.md
@@ -1,7 +1,7 @@
---
name: github-action-diagnose
description: Diagnose why GitHub Actions runs fail on Ascend NPU clusters, locating infrastructure faults and analyzing root causes
allowed-tools: Bash(gh run view:*), Bash(gh run list:*), Bash(gh api:*), Bash(kubectl get:*), Bash(kubectl describe:*), Bash(kubectl logs:*), Bash(kubectl config:*), Read, Grep
allowed-tools: Bash(gh run view:*), Bash(gh run list:*), Bash(gh pr view:*), Bash(gh api:*), Bash(kubectl get:*), Bash(kubectl describe:*), Bash(kubectl logs:*), Read, Grep
---

## Your task
@@ -49,18 +49,12 @@ allowed-tools: Bash(gh run view:*), Bash(gh run list:*), Bash(gh api:*), Bash(ku
- Pod already destroyed: search the logs for `Successfully assigned <runner_name> to <node_name>`.
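
The log search above could be sketched as follows. This is a minimal illustration, not part of the skill: the function name `find_node`, the sample log line, and the optional `<namespace>/` prefix before the pod name are all assumptions.

```python
import re
from typing import Optional

# Pattern taken from the bullet above. The optional "<namespace>/" prefix is an
# assumption about how the scheduler event may render the pod name.
ASSIGNED_RE = re.compile(
    r"Successfully assigned (?:[^\s/]+/)?(?P<pod>[^\s/]+) to (?P<node>\S+)"
)


def find_node(log_text: str, runner_name: str) -> Optional[str]:
    """Return the node the runner pod was scheduled on, or None if not found."""
    for match in ASSIGNED_RE.finditer(log_text):
        if match.group("pod") == runner_name:
            return match.group("node")
    return None


# Hypothetical log line, for illustration only.
sample = "Normal Scheduled Successfully assigned ci/runner-abc12 to node-07"
print(find_node(sample, "runner-abc12"))
```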

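#### 2b. In-depth environment fault localization
Check for the following failure modes:

- **Driver/hardware failure**: `npu-smi info` reports an error, `ERR99999`, `error code 507035`, `Device not found`.
- **Resource overflow (OOM)**: system-level `Killed` signal, `Bus error` (insufficient SHM), `No space left on device`.
- **Multi-node cascading timeout**: when a node in a multi-node job reports `Timeout`, identify and check the Master node first. To identify the Master node:
  - Check which node `rank_id=0` maps to in `RANK_TABLE_FILE`
  - Or search the logs for the node that the `master_addr` / `MASTER_ADDR` environment variable points to
  - Confirm whether the Master node shows `Unexpected Exit` or driver errors
Read `references/ascend-troubleshooting.md` and compare the logs against each failure mode it lists, covering: driver/hardware failure, resource saturation, OOM, device plugin anomalies, deprecated Runner version, multi-node cascading timeout, and node egress network failure

### Step 3: Identifying non-environment issues

If Step 2 finds no environment/hardware fault, check the following non-environment factors:
If Step 2 finds no environment/hardware/network fault, check the following non-environment factors:

- **YAML syntax error**: e.g. `undefined variable "False"` (should be lowercase `false`).
- **Business logic error**: `AssertionError`, a Python traceback pointing into business source code, or a failing test case.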
@@ -22,6 +22,11 @@ Standard output of `npu-smi info`:
| HBM Usage | If near 100%, memory fragmentation or leakage may occur. |
| Temp | Should be < 80°C typically. High temp leads to frequency reduction. |

Driver/hardware failure keywords in logs:
- `ERR99999`: General NPU driver error.
- `error code 507035`: NPU device initialization failure.
- `Device not found`: NPU not recognized by driver.
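
A minimal sketch of scanning a job log for these keywords. The keyword table is copied from the list above; the function name `scan_driver_errors` and the sample log are hypothetical.

```python
from typing import Dict, List, Tuple

# Keyword -> meaning, copied from the list above.
DRIVER_KEYWORDS: Dict[str, str] = {
    "ERR99999": "General NPU driver error",
    "error code 507035": "NPU device initialization failure",
    "Device not found": "NPU not recognized by driver",
}


def scan_driver_errors(log_text: str) -> List[Tuple[int, str, str]]:
    """Return (line number, keyword, meaning) for every keyword hit in the log."""
    hits = []
    for line_no, line in enumerate(log_text.splitlines(), start=1):
        for keyword, meaning in DRIVER_KEYWORDS.items():
            if keyword in line:
                hits.append((line_no, keyword, meaning))
    return hits


# Hypothetical log excerpt, for illustration only.
sample_log = "step 1 ok\nruntime failed, error code 507035\nstep 3 skipped"
for hit in scan_driver_errors(sample_log):
    print(hit)
```

Plain substring matching is used here on the assumption that the keywords appear verbatim in the logs; a stricter match may be needed if they also occur in unrelated output.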

## 3. Kernel & Driver Logs (dmesg)
Key strings to search for:
- `hiai: npu heartbeat loss`: Hardware/Firmware crash.
@@ -37,3 +42,43 @@ Look for:
- `get npu device count failed`
- `npu device is unhealthy`

## 5. K8s Runner Logs
Key strings to search for:
- `Runner version v2.330.0 is deprecated and cannot receive messages`: the GitHub Actions runner exits because its version is deprecated.
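
Since the version number in this message varies between runners, matching on the surrounding text may be more robust than matching the full string. A small sketch, assuming the message wording stays as quoted above; `deprecated_runner_version` is a hypothetical helper name.

```python
import re
from typing import Optional

# Assumes the message wording quoted above; only the version number varies.
DEPRECATED_RE = re.compile(
    r"Runner version v(\d+(?:\.\d+)*) is deprecated and cannot receive messages"
)


def deprecated_runner_version(log_text: str) -> Optional[str]:
    """Return the deprecated runner version found in the log, or None."""
    match = DEPRECATED_RE.search(log_text)
    return match.group(1) if match else None


line = "Runner version v2.330.0 is deprecated and cannot receive messages."
print(deprecated_runner_version(line))
```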

## 6. OOM / Resource Overflow
- `Killed`: system-level OOM kill signal.
- `Bus error`: insufficient shared memory (SHM).
- `No space left on device`: disk full.

## 7. Multi-node Cascading Timeout
When a multi-node job reports `Timeout`, identify the Master node first — it is the most likely root cause.

Master node identification:
- Check `RANK_TABLE_FILE` for the node with `rank_id=0`.
- Search logs for `master_addr` / `MASTER_ADDR` environment variable.
- Confirm whether the Master node has `Unexpected Exit` or driver errors.
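
The `rank_id=0` lookup could be sketched as below. The JSON layout (`server_list` entries with a `server_id` and a `device` list whose items carry `rank_id`) is an assumption about the rank table format and should be checked against the actual `RANK_TABLE_FILE` in use. The function name and sample addresses are hypothetical.

```python
import json
from typing import Optional


def find_master_server(rank_table_json: str) -> Optional[str]:
    """Return the server_id hosting rank_id 0, or None if it is not present.

    Assumes a rank table shaped like:
    {"server_list": [{"server_id": ..., "device": [{"rank_id": ...}, ...]}, ...]}
    """
    table = json.loads(rank_table_json)
    for server in table.get("server_list", []):
        for device in server.get("device", []):
            # rank_id may be stored as a string or an integer; compare as text.
            if str(device.get("rank_id")) == "0":
                return server.get("server_id")
    return None


# Hypothetical two-node rank table, for illustration only.
sample_table = json.dumps({
    "server_list": [
        {"server_id": "10.0.0.2", "device": [{"device_id": "0", "rank_id": "8"}]},
        {"server_id": "10.0.0.1", "device": [{"device_id": "0", "rank_id": "0"}]},
    ]
})
print(find_master_server(sample_table))
```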

## 8. Node Egress Network Failure

When a runner node has degraded or broken external network connectivity, symptoms appear during dependency installation or artifact download steps — not in NPU-related steps.

### Identification Signals

| curl Error | Code | Description |
|---|---|---|
| `SSL connection timeout` | 28 | SSL handshake blocked; 0 bytes transferred over minutes |
| `HTTP/2 stream 0 was not closed cleanly: PROTOCOL_ERROR (err 1)` | 92 | Connection reset by server/proxy after partial download |
| `Could not resolve host` | 6 | DNS resolution failure |
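
A minimal sketch of spotting these signals in a job log, assuming failures show up in curl's usual `curl: (<code>) <message>` form. The code table mirrors the one above; the function name and sample log are hypothetical.

```python
import re
from typing import Dict, List, Tuple

# curl exit codes from the table above.
CURL_NETWORK_CODES: Dict[int, str] = {
    6: "DNS resolution failure",
    28: "Timeout (SSL handshake blocked)",
    92: "HTTP/2 stream error (connection reset after partial download)",
}

# Assumes curl's "curl: (<code>) <message>" error format.
CURL_ERROR_RE = re.compile(r"curl: \((\d+)\)")


def classify_curl_failures(log_text: str) -> List[Tuple[int, str]]:
    """Return (exit code, description) for each network-related curl failure."""
    failures = []
    for match in CURL_ERROR_RE.finditer(log_text):
        code = int(match.group(1))
        if code in CURL_NETWORK_CODES:
            failures.append((code, CURL_NETWORK_CODES[code]))
    return failures


# Hypothetical log excerpt, for illustration only.
sample_log = "downloading...\ncurl: (28) SSL connection timeout\nretrying"
print(classify_curl_failures(sample_log))
```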

### Precursor Pattern
- Download speed consistently < 50 KB/s for multi-MB files throughout the job (visible in wget/curl progress output)
- Example: a 2.35 MB file taking 4+ minutes at 8–14 KB/s before the final curl failure
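
As a quick check on the example figures, the average speed can be computed from file size and elapsed time. This sketch assumes 1 MB = 1024 KB and takes the 50 KB/s threshold from the first bullet; the helper names are hypothetical.

```python
SLOW_THRESHOLD_KB_S = 50.0  # threshold from the precursor pattern above


def average_speed_kb_s(size_mb: float, seconds: float) -> float:
    """Average download speed in KB/s, assuming 1 MB = 1024 KB."""
    return size_mb * 1024 / seconds


def is_slow_download(size_mb: float, seconds: float) -> bool:
    """True if the average speed falls below the precursor threshold."""
    return average_speed_kb_s(size_mb, seconds) < SLOW_THRESHOLD_KB_S


# The example above: 2.35 MB in 4 minutes.
speed = average_speed_kb_s(2.35, 4 * 60)
print(f"{speed:.2f} KB/s, slow: {is_slow_download(2.35, 4 * 60)}")
```

The result is roughly 10 KB/s, which sits inside the 8–14 KB/s range quoted in the example and well under the threshold.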

### Differentiation from Code Issues
- Failure is in a `curl`/`wget`/`pip` download command, not a Python traceback or test assertion
- Other concurrent jobs on different runner nodes complete the same installation steps successfully in the same Run
- Error codes are OS-level (curl exit codes), not Python exceptions

### Node Tracing
Apply the standard Step 2a node tracing procedure. If multiple failing jobs share the same runner name prefix (same node pool), the fault is pool-level rather than isolated to a single node.
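
The pool-level check could be sketched as grouping failing jobs by runner name prefix. The naming scheme here is an assumption: runner names are taken to be `<pool-prefix>-<suffix>`, so the prefix is everything before the last hyphen. The function names and runner names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def group_by_pool(runner_names: Iterable[str]) -> Dict[str, List[str]]:
    """Group runner names by prefix (assumed: text before the last hyphen)."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for name in runner_names:
        prefix = name.rsplit("-", 1)[0]
        groups[prefix].append(name)
    return dict(groups)


def suspected_pools(runner_names: Iterable[str], min_failures: int = 2) -> List[str]:
    """Return prefixes shared by at least min_failures failing runners."""
    groups = group_by_pool(runner_names)
    return sorted(p for p, names in groups.items() if len(names) >= min_failures)


# Hypothetical runner names of failing jobs, for illustration only.
failing = ["npu-pool-a-x7k2p", "npu-pool-a-q9m4z", "npu-pool-b-h3t8w"]
print(suspected_pools(failing))
```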