env probe: schema 1.1 — full ROCm/PyTorch identity capture#161

Draft
oyazdanb wants to merge 1 commit into main from users/oyazdanb/env-probe-v1.1

Collaborator

oyazdanb commented May 5, 2026

Extends the A1 env probe (#152) from the original 4 library blocks
into a 22-block schema covering every ROCm + PyTorch identity surface
that materially affects trial reproducibility on MI300/MI200 hosts.
Goal: when two trials diverge, `diff <(jq -S . a.json) <(jq -S . b.json)`
surfaces the cause directly, instead of multi-day investigations into
implicit-environment drift.
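
For readers without `jq` at hand, the same comparison can be sketched in Python. This is a minimal illustration, not part of the PR: the helper names `flatten` and `diff_snapshots` are hypothetical, and the sample dicts only mimic the shape of two env.json snapshots.

```python
from typing import Any

_MISSING = "<missing>"  # sentinel for keys present in only one snapshot


def flatten(obj: Any, prefix: str = "") -> dict[str, Any]:
    """Flatten nested dicts/lists into {"a.b[0]": value} pairs."""
    if isinstance(obj, dict) and obj:
        out: dict[str, Any] = {}
        for key in sorted(obj):
            child = f"{prefix}.{key}" if prefix else str(key)
            out.update(flatten(obj[key], child))
        return out
    if isinstance(obj, list) and obj:
        out = {}
        for index, item in enumerate(obj):
            out.update(flatten(item, f"{prefix}[{index}]"))
        return out
    # Scalars and empty containers are treated as leaves.
    return {prefix: obj}


def diff_snapshots(a: dict, b: dict) -> list[tuple[str, Any, Any]]:
    """Return (dotted_key, value_in_a, value_in_b) for every differing leaf."""
    flat_a, flat_b = flatten(a), flatten(b)
    return [
        (key, flat_a.get(key, _MISSING), flat_b.get(key, _MISSING))
        for key in sorted(flat_a.keys() | flat_b.keys())
        if flat_a.get(key, _MISSING) != flat_b.get(key, _MISSING)
    ]


# Illustrative snapshots only; real env.json files carry many more blocks.
trial_a = {"rocblas": {"lib_hash": "aaa"}, "gpu_arch": {"agent_count": 8}}
trial_b = {"rocblas": {"lib_hash": "bbb"}, "gpu_arch": {"agent_count": 8}}
for key, left, right in diff_snapshots(trial_a, trial_b):
    print(f"{key}: {left!r} -> {right!r}")
```

In practice the two dicts would come from `json.load` on the two snapshot files.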

New top-level blocks (all follow the existing fail-soft contract --
fully-shaped dict, partial_reasons line per missing field, never raises):

* rocblas, miopen, rccl: parallel to hipblaslt -- header-parsed
  rocm_release_tweak + package_version + lib_hash (resolved through
  symlinks; falls through to versioned filenames in stripped images)
  + per-library kernel-DB filename fingerprint.
* composable_kernel: two sub-blocks. `system` reads
  /opt/rocm/include/ck/version.h for version + 40-char SHA +
  ck_tile_present. `pytorch_bundled` runs `nm -D | c++filt` over
  libtorch_hip.so to count ck:: symbols actually compiled into the
  loaded wheel (~417 MB binary; uses NM_TIMEOUT_SEC=30 separately
  from SHORT_TIMEOUT_SEC=5). Two sibling booleans
  pytorch_use_ck_sdpa / pytorch_use_ck_gemm parse
  -DUSE_ROCM_CK_SDPA / -DUSE_ROCM_CK_GEMM out of
  torch.__config__.show() -- these are build-time cmake flags, not
  runtime env vars (a common misconception that an earlier draft of
  this PR got wrong).
* tensile: optional pip probe + sorted-filenames sha256 over the
  union of the hipBLASLt + rocBLAS kernel DBs (parent-dir-namespaced
  by library so same-named files in both dirs don't collide).
* triton, fbgemm, aiter: pip-importable Python pkg version. fbgemm
  also surfaces -DUSE_FBGEMM / -DUSE_FBGEMM_GENAI build-time flags
  from torch.__config__.show() (FBGEMM is vendored inside the
  PyTorch wheel; the standalone fbgemm_gpu pip pkg is rarely
  installed alongside).
* aotriton: bundled in <torch>/lib/libaotriton_v2.so.MAJOR.MINOR.PATCH
  via cmake/External (NOT a third_party submodule). Version parsed
  from filename via numeric-tuple sort (string sort would order 0.9.0
  after 0.10.0). The lib_hash now uses the same precomputed best_path
  via a new _hash_file_path helper, so version + hash describe the
  same file. Captures bundled_present, bundled_version,
  bundled_lib_hash, bundled_images_dir_present (the
  aotriton.images/ dir of pre-compiled kernel images), and the
  AOTRITON_INSTALLED_PREFIX env var (operator override).
* gpu_arch: rocm_agent_enumerator subprocess (works without /dev/kfd
  on most hosts; falls back to /opt/rocm/bin if not on PATH).
  Captures agent_count, gfx_targets (sorted unique), and
  agent_arch_counts (per-arch distribution -- {"gfx942": 8} on a
  homogeneous box; {"gfx1100": 1, "gfx942": 6} on a mixed-arch host).
  Filters the gfx000 placeholder some hosts include for the host CPU
  agent.
* host: kernel_release + kernel_version + machine + glibc_version
  via os.uname() + os.confstr(CS_GNU_LIBC_VERSION). Closes the gap
  where rdhc was the only firmware/kernel source on customer hosts
  with sudo unconfigured.
* pytorch_build: structured complement to pytorch_version. Always
  captures git_commit (from torch.version.git_version) +
  hip_version + cuda_version + debug + install_kind. When a PyTorch
  source tree is detected (AORTA_PYTORCH_SRC env var, PEP 660
  editable-install marker, or walk-up .git+third_party from
  torch.__file__), additionally populates
  submodule_commits.{composable_kernel, aiter, fbgemm} via `git -C
  third_party/<sub> rev-parse HEAD`. Wheel installs fall back to a
  partial_reasons line containing the literal GitHub-tree URL with
  the captured git_commit substituted in, so an operator reading
  env.json gets a copy-pasteable recovery URL without leaving the
  doc.
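
The numeric-tuple sort mentioned in the aotriton bullet can be illustrated as follows. This is a sketch under assumptions: `pick_best_versioned` is a hypothetical name, and the real probe may select its best_path differently. The point is only that comparing integer tuples ranks 0.10.0 above 0.9.0, where a plain string comparison does not.

```python
import re
from typing import Optional

# Matches only fully versioned filenames such as libaotriton_v2.so.0.10.0.
_VERSIONED = re.compile(r"^libaotriton_v2\.so\.(\d+)\.(\d+)\.(\d+)$")


def pick_best_versioned(names: list[str]) -> Optional[tuple[str, tuple[int, ...]]]:
    """Return (filename, version_tuple) for the highest version, or None."""
    best: Optional[tuple[str, tuple[int, ...]]] = None
    for name in names:
        match = _VERSIONED.match(name)
        if match is None:
            continue  # unversioned or oddly named files are skipped
        version = tuple(int(part) for part in match.groups())
        if best is None or version > best[1]:
            best = (name, version)
    return best


candidates = [
    "libaotriton_v2.so",
    "libaotriton_v2.so.0.9.0",
    "libaotriton_v2.so.0.10.0",
]
print("numeric pick:", pick_best_versioned(candidates))
print("string max:  ", max(candidates))
```

Hashing the file returned by the same selection step is what keeps the reported version and lib_hash describing one file.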

Renames (schema 1.0 -> 1.1, see SCHEMA_VERSION changelog comment in
environment.py for the field-by-field history):

* hipblaslt/rocblas/miopen.commit -> .rocm_release_tweak: AMD sets
  *_VERSION_TWEAK to the ROCm release identifier shared across every
  library in a release, NOT to per-library upstream commit SHAs. The
  old name misled consumers (every library in a given ROCm release
  shows the same value -- useless for distinguishing per-library
  drift). lib_hash is the per-binary signal.
* hipblaslt/rocblas.tensile_yaml_revision -> .kernel_db_revision:
  matches miopen.kernel_db_revision; modern hipBLASLt/rocBLAS ship
  .dat files (binary), not .yaml.
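
A kernel-DB filename fingerprint of the kind described above (sorted filenames, sha256, namespaced by library) might look roughly like this. The function name and the newline separator are assumptions for illustration; the exact byte layout that environment.py hashes is not shown in this PR description.

```python
import hashlib
from typing import Optional


def fingerprint_filenames(names_by_library: dict[str, list[str]]) -> Optional[str]:
    """sha256 over the sorted, library-namespaced kernel-DB filenames.

    Prefixing each name with its library keeps a file called "x.dat" in the
    hipBLASLt directory distinct from "x.dat" in the rocBLAS directory.
    """
    entries = sorted(
        f"{library}/{name}"
        for library, names in names_by_library.items()
        for name in names
    )
    if not entries:
        return None  # nothing to fingerprint; the caller records a partial reason
    digest = hashlib.sha256()
    for entry in entries:
        digest.update(entry.encode("utf-8"))
        digest.update(b"\n")  # separator (assumption) so adjacent names cannot merge
    return digest.hexdigest()


print(fingerprint_filenames({"hipblaslt": ["a.dat", "b.dat"], "rocblas": ["a.dat"]}))
```

Only filenames are hashed here, so the fingerprint detects added, removed, or renamed kernel-DB files but not content changes within a file.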

env_vars: 22 additions, 1 removal (now 31 vs 1.0's 13). Added GPU
scoping (HIP_VISIBLE_DEVICES, ROCR_VISIBLE_DEVICES,
HSA_OVERRIDE_GFX_VERSION), launch (HIP_LAUNCH_BLOCKING), build
target (PYTORCH_ROCM_ARCH), MIOpen (MIOPEN_SYSTEM_DB_PATH,
MIOPEN_USER_DB_PATH, MIOPEN_DEBUG_DISABLE_FIND_DB,
MIOPEN_FIND_MODE), SDPA backend (TORCH_ROCM_FA_PREFER_CK,
TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL), GEMM backend + autotune
(TORCH_BLAS_PREFER_HIPBLASLT, TORCH_HIPBLASLT_TUNING_FILE,
TORCH_HIPBLASLT_TUNING_OVERRIDE_FILE), and NCCL/RCCL extras
(NCCL_P2P_LEVEL, NCCL_IB_HCA, NCCL_SOCKET_IFNAME,
RCCL_MSCCL_ENABLE). Removed USE_ROCM_CK_SDPA -- it's a build-time
cmake flag, not a runtime env var; setting it in the workload's
environment does nothing. Replaced by
composable_kernel.pytorch_use_ck_sdpa and pytorch_use_ck_gemm.
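
To make the build-time versus runtime distinction concrete, here is a minimal sketch of reading a `-D` flag out of the text returned by `torch.__config__.show()`. The function name `has_build_flag` is hypothetical, and so is the convention that a bare `-DFLAG` counts as enabled while `-DFLAG=OFF` or `-DFLAG=0` counts as disabled; the PR's own parser may differ.

```python
import re


def has_build_flag(config_text: str, flag: str) -> bool:
    """True if -D<flag> appears in the config text and is not set to an off value."""
    for token in re.split(r"[\s,]+", config_text):
        if not token.startswith("-D"):
            continue
        name, _, value = token[2:].partition("=")
        if name == flag:  # exact match, so USE_FBGEMM != USE_FBGEMM_GENAI
            return value.upper() not in {"0", "OFF", "FALSE", "NO"}
    return False


# Made-up config text that mimics the shape of a CXX_FLAGS line.
sample = (
    "CXX_FLAGS= -DNDEBUG -DUSE_FBGEMM_GENAI -DUSE_ROCM_CK_SDPA "
    "-DUSE_ROCM_CK_GEMM=OFF, PERF_WITH_AVX=1"
)
print(has_build_flag(sample, "USE_ROCM_CK_SDPA"))
```

With PyTorch installed, `sample` would be replaced by the string that `torch.__config__.show()` returns. Exporting `USE_ROCM_CK_SDPA=1` in the shell has no effect on this result, because the flag is baked in when the wheel is compiled.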

Refactor: extracted _parse_version_header, _hash_shared_library,
_hash_file_path, _kernel_db_filename_fingerprint,
_capture_python_package_version, _safe_import_torch as shared
helpers used by every library/Python-pkg block. Existing
_parse_hipblaslt_header / _hash_hipblaslt_library /
_tensile_fingerprint / _capture_pytorch_version stay as 1-line
wrappers so the existing TestHipblaslt* / TestPytorchVersion test
classes pass without edits.
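
The shared-helper-plus-thin-wrapper shape, combined with the fail-soft contract, can be sketched like this. The signatures are hypothetical (the real helpers are underscore-prefixed and their parameters are not shown here), and sha256 as the lib_hash algorithm is an assumption.

```python
import hashlib
from pathlib import Path
from typing import Optional


def hash_file_path(path: Path, partial_reasons: list[str], label: str) -> Optional[str]:
    """Hash one file; on failure record a reason and return None, never raise."""
    try:
        digest = hashlib.sha256()
        with path.open("rb") as handle:
            # Read in 1 MiB chunks so large shared libraries are not loaded whole.
            for chunk in iter(lambda: handle.read(1 << 20), b""):
                digest.update(chunk)
        return digest.hexdigest()
    except OSError as exc:
        partial_reasons.append(f"{label}: could not hash {path}: {exc}")
        return None


def hash_hipblaslt_library(path: Path, partial_reasons: list[str]) -> Optional[str]:
    """One-line wrapper kept so existing callers and tests keep their entry point."""
    return hash_file_path(path, partial_reasons, "hipblaslt.lib_hash")
```

Keeping the old name as a wrapper means the library-specific tests exercise the shared code path without being rewritten.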

CLI: new -v/--verbose flag dumps the full snapshot JSON to stdout
after the brief. partial_reasons echoed inline so the operator can
fix issues without jq'ing env.json. Closing [PARTIAL, N reason(s)]
or [OK] marker repeats the probe state at end-of-output.
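
As a rough sketch of the closing-marker behaviour (function names and the indentation of the inline reasons are assumptions; only the two marker strings come from the description above):

```python
def closing_marker(partial_reasons: list[str]) -> str:
    """Return the end-of-output status marker for a probe run."""
    count = len(partial_reasons)
    return "[OK]" if count == 0 else f"[PARTIAL, {count} reason(s)]"


def render_footer(partial_reasons: list[str]) -> list[str]:
    """Echo each partial reason inline, then repeat the probe state once."""
    lines = [f"  - {reason}" for reason in partial_reasons]
    lines.append(closing_marker(partial_reasons))
    return lines


print("\n".join(render_footer(["rdhc: sudo not configured"])))
```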

summary(): expanded from the original 6 lines to ~18 lines (one
labelled cell per top-level block) so an operator running the probe
sees the new GEMM/kernel-library identities without reading the JSON.
Self-explanatory wording for absent-but-expected pip pkgs (e.g.
"[Tensile pip pkg: (not installed); build-time tool, normal]" rather
than "pip=None"). [PARTIAL, N reason(s)] only at end-of-output, never
duplicated on the runtime line.

Tests: 218 passing (107 original + 111 new across TestRocblas*,
TestCK*, TestCKPytorchBuildFlags, TestCombinedKernelDbFingerprint,
TestTensileBlock, TestTritonBlock, TestFbgemmBlock, TestAiterBlock,
TestAotriton*, TestMiopen, TestRccl, TestGpuArch, TestHostBlock,
TestPytorchBuildBlockShape, TestDetectPytorchInstallKind,
TestGitRevParseHead, TestCapturePytorchSubmodules,
TestCapturePytorchBuildIntegration, TestPytorchVersionRealTorch,
TestSafeImportTorch, TestHashFilePath, TestPythonPackageVersionHelper,
plus stripped-image-fallback regression for _hash_shared_library and
multi-version-crossover regression for _capture_aotriton hash).

Docs: docs/env-probe.md schema table, sources-of-data table, CLI
section, PyTorch source-tree submodule probing section, schema
changelog, and field-naming notes (rocm_release_tweak vs commit;
hip.version vs pytorch_build.hip_version) all updated.
src/aorta/instrumentation/README.md schema-version block updated to
v1.1 with all 22 top-level keys listed; submodule purpose
description shortened with a bullet list under the table.

Verified end-to-end on MI300X / ROCm 7.2.1 / PyTorch
2.9.1+rocm7.2.1: every block populates cleanly; partial_reasons
contains only the expected pre-existing entries (rdhc-sudo,
version_dev, wheel-install submodule recovery URL). Wall time ~2.2s
warm baremetal, well inside the <15s overall target.

Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 5, 2026 21:50
@oyazdanb oyazdanb marked this pull request as draft May 5, 2026 21:51
Contributor

Copilot AI left a comment

Pull request overview

Expands the aorta env probe environment snapshot schema from v1.0 to v1.1 to capture substantially more ROCm + PyTorch reproducibility identity surfaces (additional ROCm libraries, CK/AOTriton, GPU arch + host info, and structured PyTorch build metadata), while maintaining the existing fail-soft/never-raise contract.

Changes:

  • Bumps env-probe schema to 1.1, adding many new top-level blocks (rocBLAS, CK, Tensile, Triton/FBGEMM/AITER, AOTriton, MIOpen, RCCL, gpu_arch, host, pytorch_build) and renaming a couple of existing fields for correctness.
  • Updates aorta env probe CLI output: adds -v/--verbose, prints inline partial_reasons, and prints a closing [OK]/[PARTIAL, N] marker.
  • Extends tests and docs to cover/describe the new schema shape and CLI behavior.

Reviewed changes

Copilot reviewed 4 out of 5 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| src/aorta/instrumentation/environment.py | Implements schema v1.1 blocks, shared helpers, summary formatting, and additional probes (rocBLAS/CK/Tensile/AOTriton/MIOpen/RCCL/gpu_arch/host/pytorch_build). |
| src/aorta/cli/env.py | Adds verbose output and improves operator-facing CLI output (inline partial reasons + closing status marker). |
| tests/instrumentation/test_environment.py | Expands unit/integration tests for new blocks, helpers, schema shape, and CLI output behavior. |
| docs/env-probe.md | Updates schema documentation, CLI docs/examples, changelog, and operational guidance. |
| src/aorta/instrumentation/README.md | Updates instrumentation README to reflect the expanded env-probe block list and schema v1.1. |

Comment on lines +1203 to +1211
# Build candidate list: unversioned first, then any versioned
# filenames sorted descending so the highest-versioned wins. String
# sort works for the conventional ``soname.MAJOR.MINOR.PATCH``
# layout since each component is left-padded by the package version
# numbering (and where it does not, "good enough" still picks a
# real file deterministically).
candidates: list[Path] = [lib_dir / soname]
try:
    versioned = sorted(lib_dir.glob(f"{soname}.*"), reverse=True)

Comment on lines +80 to +84
# - Added 7 new env vars to CANONICAL_ENV_VARS (HIP_VISIBLE_DEVICES,
# ROCR_VISIBLE_DEVICES, HSA_OVERRIDE_GFX_VERSION,
# HIP_LAUNCH_BLOCKING, MIOPEN_FIND_MODE, TORCH_HIPBLASLT_TUNING_FILE,
# TORCH_HIPBLASLT_TUNING_OVERRIDE_FILE) plus the SDPA/Tuning/MIOpen
# family from earlier.