adds docker for therock with asan enabled#144
Conversation
a47d7c0 to
35b0771
Compare
…ired for hipruntime
There was a problem hiding this comment.
Pull request overview
Adds a new Docker image variant intended for debugging host-side HIP/ROCm issues with AddressSanitizer enabled via TheRock, and exposes it through the existing Docker env selection flow.
Changes:
- Adds
Dockerfile.therock-host-asan-pytorchto build TheRock with HOST_ASAN and then build PyTorch against that stack. - Extends
docker/setup-env.shto allow selecting the new Dockerfile (choice5). - Documents the new Dockerfile option in
docker/.env.example.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| docker/setup-env.sh | Adds interactive selection entry for the new TheRock HOST_ASAN + PyTorch Dockerfile. |
| docker/Dockerfile.therock-host-asan-pytorch | New Dockerfile that builds TheRock with HOST_ASAN and builds PyTorch from source against it. |
| docker/.env.example | Adds the new Dockerfile option and brief description to the env template. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| RUN git clone --branch ${THEROCK_GIT_REF} --single-branch \ | ||
| https://github.com/ROCm/TheRock.git |
There was a problem hiding this comment.
git clone --branch ${THEROCK_GIT_REF} only supports branches/tags. If you want to support commit SHAs for reproducible builds, clone without --branch (or fetch) and git checkout the SHA explicitly.
| RUN git clone --branch ${THEROCK_GIT_REF} --single-branch \ | |
| https://github.com/ROCm/TheRock.git | |
| RUN git clone https://github.com/ROCm/TheRock.git && \ | |
| cd TheRock && \ | |
| git checkout ${THEROCK_GIT_REF} |
| RUN cmake --install build --prefix ${ROCM_INSTALL_PREFIX} | ||
|
|
||
| # --------------------------------------------------------------------------- | ||
| # Stage 3: ROCm environment |
There was a problem hiding this comment.
ENV LD_LIBRARY_PATH=${ROCM_HOME}/lib:${LD_LIBRARY_PATH} will expand to a trailing : when LD_LIBRARY_PATH is unset (common on Ubuntu base images), which adds the current working directory to the dynamic linker search path. Prefer setting it to just ${ROCM_HOME}/lib (or use an /etc/ld.so.conf.d entry + ldconfig) to avoid the empty-path security footgun.
| # - Dockerfile.rocm70_9-1 Standard ROCm 7.0.9.1 build | ||
| # - Dockerfile.rocm70_9-1-shampoo ROCm 7.0.9.1 with Shampoo optimizer | ||
| # - Dockerfile.rocm70_2-ubuntu-pytorch ROCm 7.0.2 Ubuntu PyTorch build | ||
| # - Dockerfile.rocm70_2-ubuntu-nan ROCm 7.0.2 with NaN debugging tools | ||
| # - Dockerfile.therock-host-asan-pytorch TheRock HOST_ASAN HIP runtime + PyTorch | ||
| # Builds ROCm/HIP from source with host-only ASAN via TheRock, then builds | ||
| # PyTorch against it. Use ASAN_OPTIONS env var at runtime for leak/error | ||
| # detection (see Dockerfile header for details). |
There was a problem hiding this comment.
This adds a new Dockerfile option here, but docker/README.md’s “Available Dockerfiles” table doesn’t list Dockerfile.therock-host-asan-pytorch yet. Please update the README so users discover the new variant via docs (not only via setup-env.sh / .env.example).
| # Build arguments | ||
| # --------------------------------------------------------------------------- | ||
|
|
||
| # TheRock source ref (branch, tag, or commit SHA) |
There was a problem hiding this comment.
The comment says THEROCK_GIT_REF can be a branch/tag/commit SHA, but the current clone command later uses git clone --branch, which won’t work with an arbitrary SHA. Either restrict this arg to branch/tag, or update the clone logic to allow SHA checkout.
| # TheRock source ref (branch, tag, or commit SHA) | |
| # TheRock source ref (branch or tag) |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| gfortran \ | ||
| git \ | ||
| ninja-build \ | ||
| cmake \ |
There was a problem hiding this comment.
This Dockerfile installs Ubuntu's cmake in the initial apt-get layer and later adds Kitware’s apt repo and installs cmake again. This adds unnecessary image size and can lead to confusion about which CMake is actually used. Consider dropping cmake from the first apt-get install and relying solely on the Kitware-provided CMake (or move the Kitware repo setup earlier and keep only one install).
| cmake \ |
| echo "deb [signed-by=/usr/share/keyrings/kitware.gpg] https://apt.kitware.com/ubuntu/ noble main" \ | ||
| > /etc/apt/sources.list.d/kitware.list && \ | ||
| apt-get update && \ | ||
| apt-get install -y cmake && \ |
There was a problem hiding this comment.
After installing CMake from the Kitware repo, this layer doesn't clean /var/lib/apt/lists/*. That leaves apt index files in the image and can significantly bloat the final image size. Consider adding the usual cleanup (and optionally apt-get clean) in the same RUN step after apt-get install.
| apt-get install -y cmake && \ | |
| apt-get install -y cmake && \ | |
| rm -rf /var/lib/apt/lists/* && \ |
| RUN HIPIFY_INC=$(find /build/TheRock -path "*/hipify/src/include/nccl_device.h" -type f | head -1 | xargs dirname) && \ | ||
| echo "RCCL hipified include dir: ${HIPIFY_INC}" && \ | ||
| cp -r ${HIPIFY_INC}/nccl_device.h ${ROCM_INSTALL_PREFIX}/include/ && \ | ||
| cp -r ${HIPIFY_INC}/nccl_device ${ROCM_INSTALL_PREFIX}/include/ && \ | ||
| # RCCL generates nccl.h (from nccl.h.in) for internal use but only | ||
| # installs rccl.h. The device headers include "nccl.h" by that name. | ||
| RCCL_NCCL_H=$(find /build/TheRock -path "*/rccl*/include/nccl.h" ! -path "*/src/*" -type f | head -1) && \ | ||
| cp ${RCCL_NCCL_H} ${ROCM_INSTALL_PREFIX}/include/nccl.h && \ | ||
| echo "=== Installed RCCL device headers ===" && \ | ||
| ls -la ${ROCM_INSTALL_PREFIX}/include/nccl.h && \ | ||
| find ${ROCM_INSTALL_PREFIX}/include/nccl_device -type f | sort |
There was a problem hiding this comment.
The RCCL header-copy step relies on find ... | head -1 | xargs dirname and unquoted variable expansions. If the find returns no matches (or paths with unexpected characters), the failure mode can be a less-informative dirname/cp error. Consider using find ... -print -quit (or xargs -r) plus an explicit test -n "$HIPIFY_INC"/test -f "$RCCL_NCCL_H" check, and quote ${HIPIFY_INC}/${RCCL_NCCL_H}/${ROCM_INSTALL_PREFIX} to make the step fail fast with a clear diagnostic.
| RUN HIPIFY_INC=$(find /build/TheRock -path "*/hipify/src/include/nccl_device.h" -type f | head -1 | xargs dirname) && \ | |
| echo "RCCL hipified include dir: ${HIPIFY_INC}" && \ | |
| cp -r ${HIPIFY_INC}/nccl_device.h ${ROCM_INSTALL_PREFIX}/include/ && \ | |
| cp -r ${HIPIFY_INC}/nccl_device ${ROCM_INSTALL_PREFIX}/include/ && \ | |
| # RCCL generates nccl.h (from nccl.h.in) for internal use but only | |
| # installs rccl.h. The device headers include "nccl.h" by that name. | |
| RCCL_NCCL_H=$(find /build/TheRock -path "*/rccl*/include/nccl.h" ! -path "*/src/*" -type f | head -1) && \ | |
| cp ${RCCL_NCCL_H} ${ROCM_INSTALL_PREFIX}/include/nccl.h && \ | |
| echo "=== Installed RCCL device headers ===" && \ | |
| ls -la ${ROCM_INSTALL_PREFIX}/include/nccl.h && \ | |
| find ${ROCM_INSTALL_PREFIX}/include/nccl_device -type f | sort | |
| RUN HIPIFY_INC=$(find /build/TheRock -path "*/hipify/src/include/nccl_device.h" -type f -print -quit) && \ | |
| if [ -z "${HIPIFY_INC}" ]; then \ | |
| echo "ERROR: nccl_device.h not found under /build/TheRock (hipify include)"; \ | |
| exit 1; \ | |
| fi && \ | |
| HIPIFY_INC=$(dirname "${HIPIFY_INC}") && \ | |
| echo "RCCL hipified include dir: ${HIPIFY_INC}" && \ | |
| cp -r "${HIPIFY_INC}/nccl_device.h" "${ROCM_INSTALL_PREFIX}/include/" && \ | |
| cp -r "${HIPIFY_INC}/nccl_device" "${ROCM_INSTALL_PREFIX}/include/" && \ | |
| # RCCL generates nccl.h (from nccl.h.in) for internal use but only | |
| # installs rccl.h. The device headers include "nccl.h" by that name. | |
| RCCL_NCCL_H=$(find /build/TheRock -path "*/rccl*/include/nccl.h" ! -path "*/src/*" -type f -print -quit) && \ | |
| if [ -z "${RCCL_NCCL_H}" ] || [ ! -f "${RCCL_NCCL_H}" ]; then \ | |
| echo "ERROR: RCCL nccl.h not found under /build/TheRock"; \ | |
| exit 1; \ | |
| fi && \ | |
| cp "${RCCL_NCCL_H}" "${ROCM_INSTALL_PREFIX}/include/nccl.h" && \ | |
| echo "=== Installed RCCL device headers ===" && \ | |
| ls -la "${ROCM_INSTALL_PREFIX}/include/nccl.h" && \ | |
| find "${ROCM_INSTALL_PREFIX}/include/nccl_device" -type f | sort |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 10 out of 10 changed files in this pull request and generated 12 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| ((pass++)) | ||
| else | ||
| echo -e " ${RED}[FAIL]${NC} $name" | ||
| ((fail++)) |
There was a problem hiding this comment.
With set -e, ((pass++))/((fail++)) will return exit status 1 when the pre-increment value is 0, which can terminate the script on the first PASS/FAIL. Use an increment form that returns success under errexit (e.g., ((pass+=1)) / ((fail+=1)), or pass=$((pass+1))).
| ((pass++)) | |
| else | |
| echo -e " ${RED}[FAIL]${NC} $name" | |
| ((fail++)) | |
| ((pass+=1)) | |
| else | |
| echo -e " ${RED}[FAIL]${NC} $name" | |
| ((fail+=1)) |
| local expect_asan="${3:-no}" # "yes" if we expect ASAN output | ||
|
|
||
| echo -n " [$name] " | ||
| OUTPUT=$(python3 -c "$code" 2>&1) | ||
| RC=$? |
There was a problem hiding this comment.
run_python can’t observe non-zero Python exit codes because set -e causes the script to exit at OUTPUT=$(python3 -c ...) when Python returns non-zero. Capture the return code without triggering errexit (e.g., using || RC=$? / temporarily disabling -e) so failures are counted instead of aborting the script.
| local expect_asan="${3:-no}" # "yes" if we expect ASAN output | |
| echo -n " [$name] " | |
| OUTPUT=$(python3 -c "$code" 2>&1) | |
| RC=$? | |
| local expect_asan="${3:-no}" # "yes" if we expect ASAN output | |
| local OUTPUT | |
| local RC | |
| echo -n " [$name] " | |
| set +e | |
| OUTPUT=$(python3 -c "$code" 2>&1) | |
| RC=$? | |
| set -e |
|
|
||
| echo -n " [$name] " | ||
| OUTPUT=$(python3 -c "$code" 2>&1) | ||
| RC=$? | ||
|
|
||
| if echo "$OUTPUT" | grep -q "ERROR: AddressSanitizer"; then | ||
| ((asan_reports++)) | ||
| fi | ||
|
|
||
| if [ "$RC" -eq 0 ]; then | ||
| echo -e "${GREEN}OK${NC} (rc=0)" | ||
| ((pass++)) |
There was a problem hiding this comment.
expect_asan is currently unused, which makes it unclear whether ASAN output is expected/acceptable per check. Either implement the intended behavior (e.g., assert presence/absence of ASAN reports based on this flag) or remove the parameter to avoid confusion.
| echo -n " [$name] " | |
| OUTPUT=$(python3 -c "$code" 2>&1) | |
| RC=$? | |
| if echo "$OUTPUT" | grep -q "ERROR: AddressSanitizer"; then | |
| ((asan_reports++)) | |
| fi | |
| if [ "$RC" -eq 0 ]; then | |
| echo -e "${GREEN}OK${NC} (rc=0)" | |
| ((pass++)) | |
| local has_asan="no" | |
| echo -n " [$name] " | |
| OUTPUT=$(python3 -c "$code" 2>&1) | |
| RC=$? | |
| if echo "$OUTPUT" | grep -q "ERROR: AddressSanitizer"; then | |
| has_asan="yes" | |
| ((asan_reports++)) | |
| fi | |
| if [ "$RC" -eq 0 ]; then | |
| if [ "$expect_asan" = "yes" ] && [ "$has_asan" != "yes" ]; then | |
| echo -e "${RED}FAILED${NC} (expected ASAN report but none was found; rc=0)" | |
| ((fail++)) | |
| elif [ "$expect_asan" != "yes" ] && [ "$has_asan" = "yes" ]; then | |
| echo -e "${RED}FAILED${NC} (unexpected ASAN report; rc=0)" | |
| ((fail++)) | |
| else | |
| echo -e "${GREEN}OK${NC} (rc=0)" | |
| ((pass++)) | |
| fi |
|
|
||
| # LD_PRELOAD the ASAN runtime so ASAN-instrumented libraries don't crash | ||
| if [ -n "$ASAN_LIB" ]; then | ||
| export LD_PRELOAD="${LD_PRELOAD:+$LD_PRELOAD:}$ASAN_LIB" |
There was a problem hiding this comment.
LD_PRELOAD expects a whitespace-separated list of libraries; using : as a separator can cause the dynamic loader to treat the entire string as a single (invalid) path. If you want to append to an existing LD_PRELOAD, join with a space (or just override it) instead of a colon.
| export LD_PRELOAD="${LD_PRELOAD:+$LD_PRELOAD:}$ASAN_LIB" | |
| export LD_PRELOAD="${LD_PRELOAD:+$LD_PRELOAD }$ASAN_LIB" |
| import logging | ||
| import os | ||
| import platform | ||
| import re | ||
| import shutil |
There was a problem hiding this comment.
platform is imported but never used in this test module. Removing unused imports keeps the test file clearer and avoids warnings in environments that run linters.
| "$CLANG" -fsanitize=address -shared-libasan \ | ||
| --offload-arch="$ARCH" -x hip \ | ||
| -I"$ROCM/include" -L"$ROCM/lib" -lamdhip64 \ | ||
| -o "$TMPDIR/test_hip_asan" "$TEST_SRC" 2>&1 | ||
|
|
||
| if [ $? -ne 0 ]; then | ||
| echo -e " ${RED}[FAIL]${NC} Compilation failed" | ||
| ((fail++)) | ||
| else | ||
| echo -e " ${GREEN}[PASS]${NC} Compilation succeeded" | ||
| ((pass++)) |
There was a problem hiding this comment.
The compile command’s return code check is ineffective under set -e: if clang fails, the script exits before reaching if [ $? -ne 0 ]. Wrap the compile in an if ...; then block (or use || to handle failure) so compilation failures are recorded as a test failure instead of aborting the whole script.
| bash -c "echo '$ASAN_OUTPUT' | grep -q 'heap-buffer-overflow'" | ||
|
|
||
| # Test: use_after_free should produce ASAN report | ||
| echo " Running use_after_free test..." | ||
| ASAN_OUTPUT=$(LD_PRELOAD="$ASAN_LIB" "$TMPDIR/test_hip_asan" use_after_free 2>&1 || true) | ||
| check "ASAN catches use-after-free" \ | ||
| bash -c "echo '$ASAN_OUTPUT' | grep -q 'use-after-free\|heap-use-after-free'" |
There was a problem hiding this comment.
bash -c "echo '$ASAN_OUTPUT' | grep ..." is brittle: it will break if ASAN output contains single quotes and can mangle newlines. Prefer feeding the variable to grep directly (here-string) or via printf and avoid an extra shell layer.
| bash -c "echo '$ASAN_OUTPUT' | grep -q 'heap-buffer-overflow'" | |
| # Test: use_after_free should produce ASAN report | |
| echo " Running use_after_free test..." | |
| ASAN_OUTPUT=$(LD_PRELOAD="$ASAN_LIB" "$TMPDIR/test_hip_asan" use_after_free 2>&1 || true) | |
| check "ASAN catches use-after-free" \ | |
| bash -c "echo '$ASAN_OUTPUT' | grep -q 'use-after-free\|heap-use-after-free'" | |
| grep -q 'heap-buffer-overflow' <<< "$ASAN_OUTPUT" | |
| # Test: use_after_free should produce ASAN report | |
| echo " Running use_after_free test..." | |
| ASAN_OUTPUT=$(LD_PRELOAD="$ASAN_LIB" "$TMPDIR/test_hip_asan" use_after_free 2>&1 || true) | |
| check "ASAN catches use-after-free" \ | |
| grep -q 'use-after-free\|heap-use-after-free' <<< "$ASAN_OUTPUT" |
| ((asan_reports++)) | ||
| fi | ||
|
|
||
| if [ "$RC" -eq 0 ]; then | ||
| echo -e "${GREEN}OK${NC} (rc=0)" | ||
| ((pass++)) | ||
| elif [ "$RC" -eq 139 ]; then | ||
| echo -e "${RED}SEGFAULT${NC} (rc=139) — LD_PRELOAD may not be working" | ||
| ((fail++)) | ||
| else | ||
| echo -e "${RED}FAILED${NC} (rc=$RC)" | ||
| # Show last few lines of output for debugging | ||
| echo "$OUTPUT" | tail -5 | sed 's/^/ /' | ||
| ((fail++)) |
There was a problem hiding this comment.
Same set -e issue as in verify_asan_therock.sh: ((asan_reports++)), ((pass++)), and ((fail++)) can exit the script when the pre-increment value is 0. Use += 1 style increments (or another form that returns success) to avoid premature termination.
| ((asan_reports++)) | |
| fi | |
| if [ "$RC" -eq 0 ]; then | |
| echo -e "${GREEN}OK${NC} (rc=0)" | |
| ((pass++)) | |
| elif [ "$RC" -eq 139 ]; then | |
| echo -e "${RED}SEGFAULT${NC} (rc=139) — LD_PRELOAD may not be working" | |
| ((fail++)) | |
| else | |
| echo -e "${RED}FAILED${NC} (rc=$RC)" | |
| # Show last few lines of output for debugging | |
| echo "$OUTPUT" | tail -5 | sed 's/^/ /' | |
| ((fail++)) | |
| ((asan_reports += 1)) | |
| fi | |
| if [ "$RC" -eq 0 ]; then | |
| echo -e "${GREEN}OK${NC} (rc=0)" | |
| ((pass += 1)) | |
| elif [ "$RC" -eq 139 ]; then | |
| echo -e "${RED}SEGFAULT${NC} (rc=139) — LD_PRELOAD may not be working" | |
| ((fail += 1)) | |
| else | |
| echo -e "${RED}FAILED${NC} (rc=$RC)" | |
| # Show last few lines of output for debugging | |
| echo "$OUTPUT" | tail -5 | sed 's/^/ /' | |
| ((fail += 1)) |
| echo "deb [signed-by=/usr/share/keyrings/kitware.gpg] https://apt.kitware.com/ubuntu/ noble main" \ | ||
| > /etc/apt/sources.list.d/kitware.list && \ | ||
| apt-get update && \ | ||
| apt-get install -y cmake && \ |
There was a problem hiding this comment.
The Kitware apt install step doesn’t clean /var/lib/apt/lists, which leaves apt metadata in the final image and increases image size. Consider removing apt lists in the same layer after installing CMake.
| apt-get install -y cmake && \ | |
| apt-get install -y cmake && \ | |
| rm -rf /var/lib/apt/lists/* && \ |
| echo " Running clean test (no errors expected)..." | ||
| CLEAN_OUTPUT=$(LD_PRELOAD="$ASAN_LIB" "$TMPDIR/test_hip_asan" clean 2>&1) | ||
| CLEAN_RC=$? | ||
| check "Clean HIP program runs correctly" test "$CLEAN_RC" -eq 0 | ||
| check "No ASAN errors in clean run" \ | ||
| bash -c "! echo '$CLEAN_OUTPUT' | grep -q 'ERROR: AddressSanitizer'" | ||
|
|
||
| echo " Running event_query stress test..." | ||
| EVENT_OUTPUT=$(LD_PRELOAD="$ASAN_LIB" ASAN_OPTIONS="detect_leaks=0:halt_on_error=0" \ | ||
| "$TMPDIR/test_hip_asan" event_query 2>&1) | ||
| EVENT_RC=$? | ||
| check "hipEventQuery stress test completes" test "$EVENT_RC" -eq 0 | ||
|
|
||
| # Report any ASAN findings from event_query (informational) | ||
| if echo "$EVENT_OUTPUT" | grep -q "ERROR: AddressSanitizer"; then | ||
| echo -e " ${YELLOW}[INFO]${NC} ASAN found issues in hipEventQuery path:" | ||
| echo "$EVENT_OUTPUT" | grep "ERROR: AddressSanitizer" | head -5 | ||
| else | ||
| echo -e " ${GREEN}[INFO]${NC} No ASAN errors in hipEventQuery path" | ||
| fi | ||
|
|
||
| echo " Running multi_stream test..." | ||
| STREAM_OUTPUT=$(LD_PRELOAD="$ASAN_LIB" ASAN_OPTIONS="detect_leaks=0:halt_on_error=0" \ | ||
| "$TMPDIR/test_hip_asan" multi_stream 2>&1) | ||
| STREAM_RC=$? | ||
| check "Multi-stream event polling completes" test "$STREAM_RC" -eq 0 |
There was a problem hiding this comment.
CLEAN_OUTPUT=$(...), EVENT_OUTPUT=$(...), and STREAM_OUTPUT=$(...) will cause the script to exit immediately on non-zero return codes because of set -e, so the subsequent *_RC=$? and check ... lines won’t run. Capture output in a way that preserves the return code without triggering errexit (e.g., disable -e around the command substitution or append || true and record $?).
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 11 out of 11 changed files in this pull request and generated 6 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| # This script runs Python with LD_PRELOAD of the ASAN runtime to avoid | ||
| # the "ASan runtime does not come first" crash. | ||
|
|
||
| set -euo pipefail | ||
|
|
||
| ROCM=${ROCM_HOME:-/opt/rocm} | ||
| ASAN_LIB=$(find "$ROCM/llvm/lib/clang" -name "libclang_rt.asan-x86_64.so" 2>/dev/null | head -1) | ||
|
|
||
| if [ -z "$ASAN_LIB" ]; then | ||
| echo "ERROR: ASAN runtime not found under $ROCM/llvm/lib/clang" | ||
| exit 1 | ||
| fi | ||
|
|
||
| export LD_PRELOAD="$ASAN_LIB" | ||
| export ASAN_OPTIONS="${ASAN_OPTIONS:-detect_leaks=0:halt_on_error=0:symbolize=1}" |
There was a problem hiding this comment.
This script unconditionally sets LD_PRELOAD to the ASAN runtime. In this repo’s TheRock ASAN overlay approach (and as documented in the Dockerfile / build-issues doc), LD_PRELOAD can break HIP’s internal dlopen/dlsym usage and lead to SEGVs during import torch. Prefer loading ASAN as a normal dependency (runtime dir on LD_LIBRARY_PATH) and set ASAN_OPTIONS=...:verify_asan_link_order=0 instead of preloading.
| if [ -n "$TEST_SRC" ]; then | ||
| echo " Compiling test_hip_asan.cpp with ASAN..." | ||
| ARCH=${PYTORCH_ROCM_ARCH:-gfx950} | ||
| "$CLANG" -fsanitize=address -shared-libasan \ | ||
| --offload-arch="$ARCH" -x hip \ | ||
| -I"$ROCM/include" -L"$ROCM/lib" -lamdhip64 \ | ||
| -o "$TMPDIR/test_hip_asan" "$TEST_SRC" 2>&1 | ||
|
|
||
| if [ $? -ne 0 ]; then | ||
| echo -e " ${RED}[FAIL]${NC} Compilation failed" | ||
| ((fail++)) |
There was a problem hiding this comment.
The compile step is followed by if [ $? -ne 0 ] ..., but with set -e enabled a failing clang invocation will exit the script before this check runs. Wrap the compile in an if ...; then ... else ... fi (or temporarily disable errexit) so you can surface a clean "Compilation failed" result instead of an abrupt exit.
|
|
||
| # Check that libamdhip64.so is ASAN-instrumented | ||
| check "libamdhip64.so has ASAN symbols" \ | ||
| bash -c "nm -D $ROCM/lib/libamdhip64.so 2>/dev/null | grep -q __asan_report" |
There was a problem hiding this comment.
This checks ASAN symbols in $ROCM/lib/libamdhip64.so, but the Dockerfile’s design keeps ASAN-instrumented libs in the overlay directory (e.g. /opt/rocm-asan/lib) while $ROCM/lib contains the non-ASAN tarball libs. As written, this check will fail in the intended overlay setup; it should inspect the overlay lib path (preferably via an env like ASAN_LIB_DIR) instead.
| bash -c "nm -D $ROCM/lib/libamdhip64.so 2>/dev/null | grep -q __asan_report" | |
| bash -c 'nm -D "${ASAN_LIB_DIR:-'"$ROCM"'/lib}"/libamdhip64.so 2>/dev/null | grep -q __asan_report' |
| LD_PRELOAD of the shared ASAN runtime is required so that the runtime | ||
| initialises before any ASAN-instrumented shared library (libamdhip64, etc.) | ||
| is loaded by the dynamic linker. | ||
|
|
||
| LD_LIBRARY_PATH must include ROCM_HOME/llvm/lib so that binaries compiled | ||
| with -shared-libasan can find libclang-cpp.so and other LLVM runtime libs. | ||
| """ | ||
| asan_lib = find_asan_runtime() | ||
| cmd = [str(binary)] + (args or []) | ||
| env = os.environ.copy() | ||
| env["ASAN_OPTIONS"] = "detect_leaks=0:halt_on_error=0:symbolize=1" | ||
| symbolizer = ROCM_HOME / "llvm" / "bin" / "llvm-symbolizer" | ||
| if symbolizer.is_file(): | ||
| env["ASAN_SYMBOLIZER_PATH"] = str(symbolizer) | ||
| if asan_lib: | ||
| env["LD_PRELOAD"] = str(asan_lib) | ||
| llvm_lib_dir = str(ROCM_HOME / "llvm" / "lib") | ||
| rocm_lib_dir = str(ROCM_HOME / "lib") | ||
| existing_ldpath = env.get("LD_LIBRARY_PATH", "") | ||
| env["LD_LIBRARY_PATH"] = f"{llvm_lib_dir}:{rocm_lib_dir}:{existing_ldpath}".rstrip(":") |
There was a problem hiding this comment.
run_asan_binary() currently prepends ${ROCM_HOME}/lib ahead of the existing LD_LIBRARY_PATH. In the ASAN overlay setup, that reorders the search path so the non-ASAN tarball libs can shadow /opt/rocm-asan/lib, which undermines the test’s goal. Preserve the existing ordering (keeping the overlay first) and only append the extra LLVM/ROCm runtime dirs that are needed for -shared-libasan binaries.
Also, this function sets LD_PRELOAD to the ASAN runtime, which is known to break HIP’s internal dlopen/dlsym path in this environment; prefer ASAN_OPTIONS=...:verify_asan_link_order=0 + runtime dir on LD_LIBRARY_PATH instead of preloading.
| LD_PRELOAD of the shared ASAN runtime is required so that the runtime | |
| initialises before any ASAN-instrumented shared library (libamdhip64, etc.) | |
| is loaded by the dynamic linker. | |
| LD_LIBRARY_PATH must include ROCM_HOME/llvm/lib so that binaries compiled | |
| with -shared-libasan can find libclang-cpp.so and other LLVM runtime libs. | |
| """ | |
| asan_lib = find_asan_runtime() | |
| cmd = [str(binary)] + (args or []) | |
| env = os.environ.copy() | |
| env["ASAN_OPTIONS"] = "detect_leaks=0:halt_on_error=0:symbolize=1" | |
| symbolizer = ROCM_HOME / "llvm" / "bin" / "llvm-symbolizer" | |
| if symbolizer.is_file(): | |
| env["ASAN_SYMBOLIZER_PATH"] = str(symbolizer) | |
| if asan_lib: | |
| env["LD_PRELOAD"] = str(asan_lib) | |
| llvm_lib_dir = str(ROCM_HOME / "llvm" / "lib") | |
| rocm_lib_dir = str(ROCM_HOME / "lib") | |
| existing_ldpath = env.get("LD_LIBRARY_PATH", "") | |
| env["LD_LIBRARY_PATH"] = f"{llvm_lib_dir}:{rocm_lib_dir}:{existing_ldpath}".rstrip(":") | |
| The ASAN runtime is located via find_asan_runtime() and made available on | |
| LD_LIBRARY_PATH so that it can be found by the dynamic linker without the | |
| need for LD_PRELOAD. We also disable ASAN's link-order verification to | |
| avoid false positives in this environment. | |
| LD_LIBRARY_PATH must include ROCM_HOME/llvm/lib so that binaries compiled | |
| with -shared-libasan can find libclang-cpp.so and other LLVM runtime libs. | |
| The existing LD_LIBRARY_PATH ordering is preserved; the additional runtime | |
| directories are only appended. | |
| """ | |
| asan_lib = find_asan_runtime() | |
| cmd = [str(binary)] + (args or []) | |
| env = os.environ.copy() | |
| env["ASAN_OPTIONS"] = "detect_leaks=0:halt_on_error=0:symbolize=1:verify_asan_link_order=0" | |
| symbolizer = ROCM_HOME / "llvm" / "bin" / "llvm-symbolizer" | |
| if symbolizer.is_file(): | |
| env["ASAN_SYMBOLIZER_PATH"] = str(symbolizer) | |
| llvm_lib_dir = str(ROCM_HOME / "llvm" / "lib") | |
| rocm_lib_dir = str(ROCM_HOME / "lib") | |
| existing_ldpath = env.get("LD_LIBRARY_PATH", "") | |
| # Preserve any existing LD_LIBRARY_PATH (e.g., ASAN overlay dirs) and | |
| # append the extra LLVM/ROCm/ASAN runtime directories if they are not | |
| # already present. | |
| ld_paths: list[str] = [] | |
| if existing_ldpath: | |
| ld_paths = [p for p in existing_ldpath.split(os.pathsep) if p] | |
| for extra in (llvm_lib_dir, rocm_lib_dir, str(asan_lib.parent) if asan_lib else None): | |
| if extra and extra not in ld_paths: | |
| ld_paths.append(extra) | |
| if ld_paths: | |
| env["LD_LIBRARY_PATH"] = os.pathsep.join(ld_paths) |
| # LD_PRELOAD the ASAN runtime so ASAN-instrumented libraries don't crash | ||
| if [ -n "$ASAN_LIB" ]; then | ||
| export LD_PRELOAD="${LD_PRELOAD:+$LD_PRELOAD:}$ASAN_LIB" | ||
| fi | ||
|
|
||
| export ASAN_OPTIONS="${ASAN_OPTIONS:-detect_leaks=0:halt_on_error=0:symbolize=1}" | ||
|
|
There was a problem hiding this comment.
This script adds the ASAN runtime to LD_PRELOAD. In this repo’s TheRock ASAN overlay setup, LD_PRELOAD has been shown to break HIP initialization (null function pointers via dlsym) and can cause SEGVs. Prefer relying on the ASAN runtime being a normal DT_NEEDED dependency (runtime dir on LD_LIBRARY_PATH) and set ASAN_OPTIONS=...:verify_asan_link_order=0 rather than preloading.
| # LD_PRELOAD the ASAN runtime so ASAN-instrumented libraries don't crash | |
| if [ -n "$ASAN_LIB" ]; then | |
| export LD_PRELOAD="${LD_PRELOAD:+$LD_PRELOAD:}$ASAN_LIB" | |
| fi | |
| export ASAN_OPTIONS="${ASAN_OPTIONS:-detect_leaks=0:halt_on_error=0:symbolize=1}" | |
| # Configure ASAN runtime: rely on normal linking instead of LD_PRELOAD. | |
| # In TheRock ASAN overlay, preloading ASAN can break HIP initialization (null | |
| # function pointers via dlsym) and cause SEGVs, so we avoid modifying LD_PRELOAD. | |
| # | |
| # If ASAN_OPTIONS is not already set, provide a sensible default and explicitly | |
| # disable link-order verification since we are not preloading ASAN. | |
| export ASAN_OPTIONS="${ASAN_OPTIONS:-detect_leaks=0:halt_on_error=0:symbolize=1:verify_asan_link_order=0}" |
| ENV ASAN_SYMBOLIZER_PATH=${ROCM_HOME}/llvm/bin/llvm-symbolizer | ||
| ENV ASAN_OPTIONS="detect_leaks=0:halt_on_error=0:symbolize=1:verify_asan_link_order=0" | ||
|
|
||
| WORKDIR /workspace |
There was a problem hiding this comment.
The verification scripts’ headers suggest running /workspace/asan_tests/... directly inside the image, but this Dockerfile never copies tests/asan_tests into /workspace/asan_tests (the container only has /workspace as a working dir). Either copy the test suite into the image (so the documented docker run ... /workspace/asan_tests/... works) or update the scripts/docs to reference the repo-mount path used by docker-compose.build.yaml (e.g. /workspace/aorta/tests/asan_tests).
| WORKDIR /workspace | |
| WORKDIR /workspace | |
| COPY tests/asan_tests /workspace/asan_tests |
adds docker for therock with asan enabled