Skip to content

adds docker for therock with asan enabled#144

Open
amd-vivekag wants to merge 9 commits intomainfrom
users/vivekag/therock_asan_docker
Open

adds docker for therock with asan enabled#144
amd-vivekag wants to merge 9 commits intomainfrom
users/vivekag/therock_asan_docker

Conversation

@amd-vivekag
Copy link
Copy Markdown
Collaborator

adds docker for therock with asan enabled

Copilot AI review requested due to automatic review settings March 20, 2026 15:21
@amd-vivekag amd-vivekag force-pushed the users/vivekag/therock_asan_docker branch from a47d7c0 to 35b0771 Compare March 20, 2026 15:21
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a new Docker image variant intended for debugging host-side HIP/ROCm issues with AddressSanitizer enabled via TheRock, and exposes it through the existing Docker env selection flow.

Changes:

  • Adds Dockerfile.therock-host-asan-pytorch to build TheRock with HOST_ASAN and then build PyTorch against that stack.
  • Extends docker/setup-env.sh to allow selecting the new Dockerfile (choice 5).
  • Documents the new Dockerfile option in docker/.env.example.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

File Description
docker/setup-env.sh Adds interactive selection entry for the new TheRock HOST_ASAN + PyTorch Dockerfile.
docker/Dockerfile.therock-host-asan-pytorch New Dockerfile that builds TheRock with HOST_ASAN and builds PyTorch from source against it.
docker/.env.example Adds the new Dockerfile option and brief description to the env template.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +115 to +116
RUN git clone --branch ${THEROCK_GIT_REF} --single-branch \
https://github.com/ROCm/TheRock.git
Copy link

Copilot AI Mar 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

git clone --branch ${THEROCK_GIT_REF} only supports branches/tags. If you want to support commit SHAs for reproducible builds, clone without --branch (or fetch) and git checkout the SHA explicitly.

Suggested change
RUN git clone --branch ${THEROCK_GIT_REF} --single-branch \
https://github.com/ROCm/TheRock.git
RUN git clone https://github.com/ROCm/TheRock.git && \
cd TheRock && \
git checkout ${THEROCK_GIT_REF}

Copilot uses AI. Check for mistakes.
Comment on lines +152 to +155
RUN cmake --install build --prefix ${ROCM_INSTALL_PREFIX}

# ---------------------------------------------------------------------------
# Stage 3: ROCm environment
Copy link

Copilot AI Mar 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ENV LD_LIBRARY_PATH=${ROCM_HOME}/lib:${LD_LIBRARY_PATH} will expand to a trailing : when LD_LIBRARY_PATH is unset (common on Ubuntu base images), which adds the current working directory to the dynamic linker search path. Prefer setting it to just ${ROCM_HOME}/lib (or use an /etc/ld.so.conf.d entry + ldconfig) to avoid the empty-path security footgun.

Copilot uses AI. Check for mistakes.
Comment thread docker/.env.example
Comment on lines +14 to +21
# - Dockerfile.rocm70_9-1 Standard ROCm 7.0.9.1 build
# - Dockerfile.rocm70_9-1-shampoo ROCm 7.0.9.1 with Shampoo optimizer
# - Dockerfile.rocm70_2-ubuntu-pytorch ROCm 7.0.2 Ubuntu PyTorch build
# - Dockerfile.rocm70_2-ubuntu-nan ROCm 7.0.2 with NaN debugging tools
# - Dockerfile.therock-host-asan-pytorch TheRock HOST_ASAN HIP runtime + PyTorch
# Builds ROCm/HIP from source with host-only ASAN via TheRock, then builds
# PyTorch against it. Use ASAN_OPTIONS env var at runtime for leak/error
# detection (see Dockerfile header for details).
Copy link

Copilot AI Mar 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This adds a new Dockerfile option here, but docker/README.md’s “Available Dockerfiles” table doesn’t list Dockerfile.therock-host-asan-pytorch yet. Please update the README so users discover the new variant via docs (not only via setup-env.sh / .env.example).

Copilot uses AI. Check for mistakes.
# Build arguments
# ---------------------------------------------------------------------------

# TheRock source ref (branch, tag, or commit SHA)
Copy link

Copilot AI Mar 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment says THEROCK_GIT_REF can be a branch/tag/commit SHA, but the current clone command later uses git clone --branch, which won’t work with an arbitrary SHA. Either restrict this arg to branch/tag, or update the clone logic to allow SHA checkout.

Suggested change
# TheRock source ref (branch, tag, or commit SHA)
# TheRock source ref (branch or tag)

Copilot uses AI. Check for mistakes.
Copilot AI review requested due to automatic review settings March 24, 2026 17:49
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

gfortran \
git \
ninja-build \
cmake \
Copy link

Copilot AI Mar 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This Dockerfile installs Ubuntu's cmake in the initial apt-get layer and later adds Kitware’s apt repo and installs cmake again. This adds unnecessary image size and can lead to confusion about which CMake is actually used. Consider dropping cmake from the first apt-get install and relying solely on the Kitware-provided CMake (or move the Kitware repo setup earlier and keep only one install).

Suggested change
cmake \

Copilot uses AI. Check for mistakes.
echo "deb [signed-by=/usr/share/keyrings/kitware.gpg] https://apt.kitware.com/ubuntu/ noble main" \
> /etc/apt/sources.list.d/kitware.list && \
apt-get update && \
apt-get install -y cmake && \
Copy link

Copilot AI Mar 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After installing CMake from the Kitware repo, this layer doesn't clean /var/lib/apt/lists/*. That leaves apt index files in the image and can significantly bloat the final image size. Consider adding the usual cleanup (and optionally apt-get clean) in the same RUN step after apt-get install.

Suggested change
apt-get install -y cmake && \
apt-get install -y cmake && \
rm -rf /var/lib/apt/lists/* && \

Copilot uses AI. Check for mistakes.
Comment on lines +238 to +248
RUN HIPIFY_INC=$(find /build/TheRock -path "*/hipify/src/include/nccl_device.h" -type f | head -1 | xargs dirname) && \
echo "RCCL hipified include dir: ${HIPIFY_INC}" && \
cp -r ${HIPIFY_INC}/nccl_device.h ${ROCM_INSTALL_PREFIX}/include/ && \
cp -r ${HIPIFY_INC}/nccl_device ${ROCM_INSTALL_PREFIX}/include/ && \
# RCCL generates nccl.h (from nccl.h.in) for internal use but only
# installs rccl.h. The device headers include "nccl.h" by that name.
RCCL_NCCL_H=$(find /build/TheRock -path "*/rccl*/include/nccl.h" ! -path "*/src/*" -type f | head -1) && \
cp ${RCCL_NCCL_H} ${ROCM_INSTALL_PREFIX}/include/nccl.h && \
echo "=== Installed RCCL device headers ===" && \
ls -la ${ROCM_INSTALL_PREFIX}/include/nccl.h && \
find ${ROCM_INSTALL_PREFIX}/include/nccl_device -type f | sort
Copy link

Copilot AI Mar 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The RCCL header-copy step relies on find ... | head -1 | xargs dirname and unquoted variable expansions. If the find returns no matches (or paths with unexpected characters), the failure mode can be a less-informative dirname/cp error. Consider using find ... -print -quit (or xargs -r) plus an explicit test -n "$HIPIFY_INC"/test -f "$RCCL_NCCL_H" check, and quote ${HIPIFY_INC}/${RCCL_NCCL_H}/${ROCM_INSTALL_PREFIX} to make the step fail fast with a clear diagnostic.

Suggested change
RUN HIPIFY_INC=$(find /build/TheRock -path "*/hipify/src/include/nccl_device.h" -type f | head -1 | xargs dirname) && \
echo "RCCL hipified include dir: ${HIPIFY_INC}" && \
cp -r ${HIPIFY_INC}/nccl_device.h ${ROCM_INSTALL_PREFIX}/include/ && \
cp -r ${HIPIFY_INC}/nccl_device ${ROCM_INSTALL_PREFIX}/include/ && \
# RCCL generates nccl.h (from nccl.h.in) for internal use but only
# installs rccl.h. The device headers include "nccl.h" by that name.
RCCL_NCCL_H=$(find /build/TheRock -path "*/rccl*/include/nccl.h" ! -path "*/src/*" -type f | head -1) && \
cp ${RCCL_NCCL_H} ${ROCM_INSTALL_PREFIX}/include/nccl.h && \
echo "=== Installed RCCL device headers ===" && \
ls -la ${ROCM_INSTALL_PREFIX}/include/nccl.h && \
find ${ROCM_INSTALL_PREFIX}/include/nccl_device -type f | sort
RUN HIPIFY_INC=$(find /build/TheRock -path "*/hipify/src/include/nccl_device.h" -type f -print -quit) && \
if [ -z "${HIPIFY_INC}" ]; then \
echo "ERROR: nccl_device.h not found under /build/TheRock (hipify include)"; \
exit 1; \
fi && \
HIPIFY_INC=$(dirname "${HIPIFY_INC}") && \
echo "RCCL hipified include dir: ${HIPIFY_INC}" && \
cp -r "${HIPIFY_INC}/nccl_device.h" "${ROCM_INSTALL_PREFIX}/include/" && \
cp -r "${HIPIFY_INC}/nccl_device" "${ROCM_INSTALL_PREFIX}/include/" && \
# RCCL generates nccl.h (from nccl.h.in) for internal use but only
# installs rccl.h. The device headers include "nccl.h" by that name.
RCCL_NCCL_H=$(find /build/TheRock -path "*/rccl*/include/nccl.h" ! -path "*/src/*" -type f -print -quit) && \
if [ -z "${RCCL_NCCL_H}" ] || [ ! -f "${RCCL_NCCL_H}" ]; then \
echo "ERROR: RCCL nccl.h not found under /build/TheRock"; \
exit 1; \
fi && \
cp "${RCCL_NCCL_H}" "${ROCM_INSTALL_PREFIX}/include/nccl.h" && \
echo "=== Installed RCCL device headers ===" && \
ls -la "${ROCM_INSTALL_PREFIX}/include/nccl.h" && \
find "${ROCM_INSTALL_PREFIX}/include/nccl_device" -type f | sort

Copilot uses AI. Check for mistakes.
Copilot AI review requested due to automatic review settings March 25, 2026 13:37
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 12 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread tests/asan_tests/verify_asan_therock.sh Outdated
Comment on lines +34 to +37
((pass++))
else
echo -e " ${RED}[FAIL]${NC} $name"
((fail++))
Copy link

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With set -e, ((pass++))/((fail++)) will return exit status 1 when the pre-increment value is 0, which can terminate the script on the first PASS/FAIL. Use an increment form that returns success under errexit (e.g., ((pass+=1)) / ((fail+=1)), or pass=$((pass+1))).

Suggested change
((pass++))
else
echo -e " ${RED}[FAIL]${NC} $name"
((fail++))
((pass+=1))
else
echo -e " ${RED}[FAIL]${NC} $name"
((fail+=1))

Copilot uses AI. Check for mistakes.
Comment thread tests/asan_tests/verify_asan_pytorch.sh Outdated
Comment on lines +37 to +41
local expect_asan="${3:-no}" # "yes" if we expect ASAN output

echo -n " [$name] "
OUTPUT=$(python3 -c "$code" 2>&1)
RC=$?
Copy link

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

run_python can’t observe non-zero Python exit codes because set -e causes the script to exit at OUTPUT=$(python3 -c ...) when Python returns non-zero. Capture the return code without triggering errexit (e.g., using || RC=$? / temporarily disabling -e) so failures are counted instead of aborting the script.

Suggested change
local expect_asan="${3:-no}" # "yes" if we expect ASAN output
echo -n " [$name] "
OUTPUT=$(python3 -c "$code" 2>&1)
RC=$?
local expect_asan="${3:-no}" # "yes" if we expect ASAN output
local OUTPUT
local RC
echo -n " [$name] "
set +e
OUTPUT=$(python3 -c "$code" 2>&1)
RC=$?
set -e

Copilot uses AI. Check for mistakes.
Comment thread tests/asan_tests/verify_asan_pytorch.sh Outdated
Comment on lines +38 to +49

echo -n " [$name] "
OUTPUT=$(python3 -c "$code" 2>&1)
RC=$?

if echo "$OUTPUT" | grep -q "ERROR: AddressSanitizer"; then
((asan_reports++))
fi

if [ "$RC" -eq 0 ]; then
echo -e "${GREEN}OK${NC} (rc=0)"
((pass++))
Copy link

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

expect_asan is currently unused, which makes it unclear whether ASAN output is expected/acceptable per check. Either implement the intended behavior (e.g., assert presence/absence of ASAN reports based on this flag) or remove the parameter to avoid confusion.

Suggested change
echo -n " [$name] "
OUTPUT=$(python3 -c "$code" 2>&1)
RC=$?
if echo "$OUTPUT" | grep -q "ERROR: AddressSanitizer"; then
((asan_reports++))
fi
if [ "$RC" -eq 0 ]; then
echo -e "${GREEN}OK${NC} (rc=0)"
((pass++))
local has_asan="no"
echo -n " [$name] "
OUTPUT=$(python3 -c "$code" 2>&1)
RC=$?
if echo "$OUTPUT" | grep -q "ERROR: AddressSanitizer"; then
has_asan="yes"
((asan_reports++))
fi
if [ "$RC" -eq 0 ]; then
if [ "$expect_asan" = "yes" ] && [ "$has_asan" != "yes" ]; then
echo -e "${RED}FAILED${NC} (expected ASAN report but none was found; rc=0)"
((fail++))
elif [ "$expect_asan" != "yes" ] && [ "$has_asan" = "yes" ]; then
echo -e "${RED}FAILED${NC} (unexpected ASAN report; rc=0)"
((fail++))
else
echo -e "${GREEN}OK${NC} (rc=0)"
((pass++))
fi

Copilot uses AI. Check for mistakes.
Comment thread tests/asan_tests/run_therock_sanity.sh Outdated

# LD_PRELOAD the ASAN runtime so ASAN-instrumented libraries don't crash
if [ -n "$ASAN_LIB" ]; then
export LD_PRELOAD="${LD_PRELOAD:+$LD_PRELOAD:}$ASAN_LIB"
Copy link

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LD_PRELOAD expects a whitespace-separated list of libraries; using : as a separator can cause the dynamic loader to treat the entire string as a single (invalid) path. If you want to append to an existing LD_PRELOAD, join with a space (or just override it) instead of a colon.

Suggested change
export LD_PRELOAD="${LD_PRELOAD:+$LD_PRELOAD:}$ASAN_LIB"
export LD_PRELOAD="${LD_PRELOAD:+$LD_PRELOAD }$ASAN_LIB"

Copilot uses AI. Check for mistakes.
Comment on lines +26 to +30
import logging
import os
import platform
import re
import shutil
Copy link

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

platform is imported but never used in this test module. Removing unused imports keeps the test file clearer and avoids warnings in environments that run linters.

Copilot uses AI. Check for mistakes.
Comment thread tests/asan_tests/verify_asan_therock.sh Outdated
Comment on lines +73 to +83
"$CLANG" -fsanitize=address -shared-libasan \
--offload-arch="$ARCH" -x hip \
-I"$ROCM/include" -L"$ROCM/lib" -lamdhip64 \
-o "$TMPDIR/test_hip_asan" "$TEST_SRC" 2>&1

if [ $? -ne 0 ]; then
echo -e " ${RED}[FAIL]${NC} Compilation failed"
((fail++))
else
echo -e " ${GREEN}[PASS]${NC} Compilation succeeded"
((pass++))
Copy link

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The compile command’s return code check is ineffective under set -e: if clang fails, the script exits before reaching if [ $? -ne 0 ]. Wrap the compile in an if ...; then block (or use || to handle failure) so compilation failures are recorded as a test failure instead of aborting the whole script.

Copilot uses AI. Check for mistakes.
Comment thread tests/asan_tests/verify_asan_therock.sh Outdated
Comment on lines +89 to +95
bash -c "echo '$ASAN_OUTPUT' | grep -q 'heap-buffer-overflow'"

# Test: use_after_free should produce ASAN report
echo " Running use_after_free test..."
ASAN_OUTPUT=$(LD_PRELOAD="$ASAN_LIB" "$TMPDIR/test_hip_asan" use_after_free 2>&1 || true)
check "ASAN catches use-after-free" \
bash -c "echo '$ASAN_OUTPUT' | grep -q 'use-after-free\|heap-use-after-free'"
Copy link

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bash -c "echo '$ASAN_OUTPUT' | grep ..." is brittle: it will break if ASAN output contains single quotes and can mangle newlines. Prefer feeding the variable to grep directly (here-string) or via printf and avoid an extra shell layer.

Suggested change
bash -c "echo '$ASAN_OUTPUT' | grep -q 'heap-buffer-overflow'"
# Test: use_after_free should produce ASAN report
echo " Running use_after_free test..."
ASAN_OUTPUT=$(LD_PRELOAD="$ASAN_LIB" "$TMPDIR/test_hip_asan" use_after_free 2>&1 || true)
check "ASAN catches use-after-free" \
bash -c "echo '$ASAN_OUTPUT' | grep -q 'use-after-free\|heap-use-after-free'"
grep -q 'heap-buffer-overflow' <<< "$ASAN_OUTPUT"
# Test: use_after_free should produce ASAN report
echo " Running use_after_free test..."
ASAN_OUTPUT=$(LD_PRELOAD="$ASAN_LIB" "$TMPDIR/test_hip_asan" use_after_free 2>&1 || true)
check "ASAN catches use-after-free" \
grep -q 'use-after-free\|heap-use-after-free' <<< "$ASAN_OUTPUT"

Copilot uses AI. Check for mistakes.
Comment thread tests/asan_tests/verify_asan_pytorch.sh Outdated
Comment on lines +44 to +57
((asan_reports++))
fi

if [ "$RC" -eq 0 ]; then
echo -e "${GREEN}OK${NC} (rc=0)"
((pass++))
elif [ "$RC" -eq 139 ]; then
echo -e "${RED}SEGFAULT${NC} (rc=139) — LD_PRELOAD may not be working"
((fail++))
else
echo -e "${RED}FAILED${NC} (rc=$RC)"
# Show last few lines of output for debugging
echo "$OUTPUT" | tail -5 | sed 's/^/ /'
((fail++))
Copy link

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same set -e issue as in verify_asan_therock.sh: ((asan_reports++)), ((pass++)), and ((fail++)) can exit the script when the pre-increment value is 0. Use += 1 style increments (or another form that returns success) to avoid premature termination.

Suggested change
((asan_reports++))
fi
if [ "$RC" -eq 0 ]; then
echo -e "${GREEN}OK${NC} (rc=0)"
((pass++))
elif [ "$RC" -eq 139 ]; then
echo -e "${RED}SEGFAULT${NC} (rc=139) — LD_PRELOAD may not be working"
((fail++))
else
echo -e "${RED}FAILED${NC} (rc=$RC)"
# Show last few lines of output for debugging
echo "$OUTPUT" | tail -5 | sed 's/^/ /'
((fail++))
((asan_reports += 1))
fi
if [ "$RC" -eq 0 ]; then
echo -e "${GREEN}OK${NC} (rc=0)"
((pass += 1))
elif [ "$RC" -eq 139 ]; then
echo -e "${RED}SEGFAULT${NC} (rc=139) — LD_PRELOAD may not be working"
((fail += 1))
else
echo -e "${RED}FAILED${NC} (rc=$RC)"
# Show last few lines of output for debugging
echo "$OUTPUT" | tail -5 | sed 's/^/ /'
((fail += 1))

Copilot uses AI. Check for mistakes.
echo "deb [signed-by=/usr/share/keyrings/kitware.gpg] https://apt.kitware.com/ubuntu/ noble main" \
> /etc/apt/sources.list.d/kitware.list && \
apt-get update && \
apt-get install -y cmake && \
Copy link

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Kitware apt install step doesn’t clean /var/lib/apt/lists, which leaves apt metadata in the final image and increases image size. Consider removing apt lists in the same layer after installing CMake.

Suggested change
apt-get install -y cmake && \
apt-get install -y cmake && \
rm -rf /var/lib/apt/lists/* && \

Copilot uses AI. Check for mistakes.
Comment on lines +114 to +139
echo " Running clean test (no errors expected)..."
CLEAN_OUTPUT=$(LD_PRELOAD="$ASAN_LIB" "$TMPDIR/test_hip_asan" clean 2>&1)
CLEAN_RC=$?
check "Clean HIP program runs correctly" test "$CLEAN_RC" -eq 0
check "No ASAN errors in clean run" \
bash -c "! echo '$CLEAN_OUTPUT' | grep -q 'ERROR: AddressSanitizer'"

echo " Running event_query stress test..."
EVENT_OUTPUT=$(LD_PRELOAD="$ASAN_LIB" ASAN_OPTIONS="detect_leaks=0:halt_on_error=0" \
"$TMPDIR/test_hip_asan" event_query 2>&1)
EVENT_RC=$?
check "hipEventQuery stress test completes" test "$EVENT_RC" -eq 0

# Report any ASAN findings from event_query (informational)
if echo "$EVENT_OUTPUT" | grep -q "ERROR: AddressSanitizer"; then
echo -e " ${YELLOW}[INFO]${NC} ASAN found issues in hipEventQuery path:"
echo "$EVENT_OUTPUT" | grep "ERROR: AddressSanitizer" | head -5
else
echo -e " ${GREEN}[INFO]${NC} No ASAN errors in hipEventQuery path"
fi

echo " Running multi_stream test..."
STREAM_OUTPUT=$(LD_PRELOAD="$ASAN_LIB" ASAN_OPTIONS="detect_leaks=0:halt_on_error=0" \
"$TMPDIR/test_hip_asan" multi_stream 2>&1)
STREAM_RC=$?
check "Multi-stream event polling completes" test "$STREAM_RC" -eq 0
Copy link

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CLEAN_OUTPUT=$(...), EVENT_OUTPUT=$(...), and STREAM_OUTPUT=$(...) will cause the script to exit immediately on non-zero return codes because of set -e, so the subsequent *_RC=$? and check ... lines won’t run. Capture output in a way that preserves the return code without triggering errexit (e.g., disable -e around the command substitution or append || true and record $?).

Copilot uses AI. Check for mistakes.
Copilot AI review requested due to automatic review settings March 31, 2026 16:32
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 6 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread tests/asan_tests/verify_asan_pytorch.sh Outdated
Comment on lines +9 to +23
# This script runs Python with LD_PRELOAD of the ASAN runtime to avoid
# the "ASan runtime does not come first" crash.

set -euo pipefail

ROCM=${ROCM_HOME:-/opt/rocm}
ASAN_LIB=$(find "$ROCM/llvm/lib/clang" -name "libclang_rt.asan-x86_64.so" 2>/dev/null | head -1)

if [ -z "$ASAN_LIB" ]; then
echo "ERROR: ASAN runtime not found under $ROCM/llvm/lib/clang"
exit 1
fi

export LD_PRELOAD="$ASAN_LIB"
export ASAN_OPTIONS="${ASAN_OPTIONS:-detect_leaks=0:halt_on_error=0:symbolize=1}"
Copy link

Copilot AI Mar 31, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This script unconditionally sets LD_PRELOAD to the ASAN runtime. In this repo’s TheRock ASAN overlay approach (and as documented in the Dockerfile / build-issues doc), LD_PRELOAD can break HIP’s internal dlopen/dlsym usage and lead to SEGVs during import torch. Prefer loading ASAN as a normal dependency (runtime dir on LD_LIBRARY_PATH) and set ASAN_OPTIONS=...:verify_asan_link_order=0 instead of preloading.

Copilot uses AI. Check for mistakes.
Comment thread tests/asan_tests/verify_asan_therock.sh Outdated
Comment on lines +70 to +80
if [ -n "$TEST_SRC" ]; then
echo " Compiling test_hip_asan.cpp with ASAN..."
ARCH=${PYTORCH_ROCM_ARCH:-gfx950}
"$CLANG" -fsanitize=address -shared-libasan \
--offload-arch="$ARCH" -x hip \
-I"$ROCM/include" -L"$ROCM/lib" -lamdhip64 \
-o "$TMPDIR/test_hip_asan" "$TEST_SRC" 2>&1

if [ $? -ne 0 ]; then
echo -e " ${RED}[FAIL]${NC} Compilation failed"
((fail++))
Copy link

Copilot AI Mar 31, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The compile step is followed by if [ $? -ne 0 ] ..., but with set -e enabled a failing clang invocation will exit the script before this check runs. Wrap the compile in an if ...; then ... else ... fi (or temporarily disable errexit) so you can surface a clean "Compilation failed" result instead of an abrupt exit.

Copilot uses AI. Check for mistakes.
Comment thread tests/asan_tests/verify_asan_therock.sh Outdated

# Check that libamdhip64.so is ASAN-instrumented
check "libamdhip64.so has ASAN symbols" \
bash -c "nm -D $ROCM/lib/libamdhip64.so 2>/dev/null | grep -q __asan_report"
Copy link

Copilot AI Mar 31, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This checks ASAN symbols in $ROCM/lib/libamdhip64.so, but the Dockerfile’s design keeps ASAN-instrumented libs in the overlay directory (e.g. /opt/rocm-asan/lib) while $ROCM/lib contains the non-ASAN tarball libs. As written, this check will fail in the intended overlay setup; it should inspect the overlay lib path (preferably via an env like ASAN_LIB_DIR) instead.

Suggested change
bash -c "nm -D $ROCM/lib/libamdhip64.so 2>/dev/null | grep -q __asan_report"
bash -c 'nm -D "${ASAN_LIB_DIR:-'"$ROCM"'/lib}"/libamdhip64.so 2>/dev/null | grep -q __asan_report'

Copilot uses AI. Check for mistakes.
Comment thread tests/asan_tests/test_therock_asan.py Outdated
Comment on lines +126 to +145
LD_PRELOAD of the shared ASAN runtime is required so that the runtime
initialises before any ASAN-instrumented shared library (libamdhip64, etc.)
is loaded by the dynamic linker.

LD_LIBRARY_PATH must include ROCM_HOME/llvm/lib so that binaries compiled
with -shared-libasan can find libclang-cpp.so and other LLVM runtime libs.
"""
asan_lib = find_asan_runtime()
cmd = [str(binary)] + (args or [])
env = os.environ.copy()
env["ASAN_OPTIONS"] = "detect_leaks=0:halt_on_error=0:symbolize=1"
symbolizer = ROCM_HOME / "llvm" / "bin" / "llvm-symbolizer"
if symbolizer.is_file():
env["ASAN_SYMBOLIZER_PATH"] = str(symbolizer)
if asan_lib:
env["LD_PRELOAD"] = str(asan_lib)
llvm_lib_dir = str(ROCM_HOME / "llvm" / "lib")
rocm_lib_dir = str(ROCM_HOME / "lib")
existing_ldpath = env.get("LD_LIBRARY_PATH", "")
env["LD_LIBRARY_PATH"] = f"{llvm_lib_dir}:{rocm_lib_dir}:{existing_ldpath}".rstrip(":")
Copy link

Copilot AI Mar 31, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

run_asan_binary() currently prepends ${ROCM_HOME}/lib ahead of the existing LD_LIBRARY_PATH. In the ASAN overlay setup, that reorders the search path so the non-ASAN tarball libs can shadow /opt/rocm-asan/lib, which undermines the test’s goal. Preserve the existing ordering (keeping the overlay first) and only append the extra LLVM/ROCm runtime dirs that are needed for -shared-libasan binaries.

Also, this function sets LD_PRELOAD to the ASAN runtime, which is known to break HIP’s internal dlopen/dlsym path in this environment; prefer ASAN_OPTIONS=...:verify_asan_link_order=0 + runtime dir on LD_LIBRARY_PATH instead of preloading.

Suggested change
LD_PRELOAD of the shared ASAN runtime is required so that the runtime
initialises before any ASAN-instrumented shared library (libamdhip64, etc.)
is loaded by the dynamic linker.
LD_LIBRARY_PATH must include ROCM_HOME/llvm/lib so that binaries compiled
with -shared-libasan can find libclang-cpp.so and other LLVM runtime libs.
"""
asan_lib = find_asan_runtime()
cmd = [str(binary)] + (args or [])
env = os.environ.copy()
env["ASAN_OPTIONS"] = "detect_leaks=0:halt_on_error=0:symbolize=1"
symbolizer = ROCM_HOME / "llvm" / "bin" / "llvm-symbolizer"
if symbolizer.is_file():
env["ASAN_SYMBOLIZER_PATH"] = str(symbolizer)
if asan_lib:
env["LD_PRELOAD"] = str(asan_lib)
llvm_lib_dir = str(ROCM_HOME / "llvm" / "lib")
rocm_lib_dir = str(ROCM_HOME / "lib")
existing_ldpath = env.get("LD_LIBRARY_PATH", "")
env["LD_LIBRARY_PATH"] = f"{llvm_lib_dir}:{rocm_lib_dir}:{existing_ldpath}".rstrip(":")
The ASAN runtime is located via find_asan_runtime() and made available on
LD_LIBRARY_PATH so that it can be found by the dynamic linker without the
need for LD_PRELOAD. We also disable ASAN's link-order verification to
avoid false positives in this environment.
LD_LIBRARY_PATH must include ROCM_HOME/llvm/lib so that binaries compiled
with -shared-libasan can find libclang-cpp.so and other LLVM runtime libs.
The existing LD_LIBRARY_PATH ordering is preserved; the additional runtime
directories are only appended.
"""
asan_lib = find_asan_runtime()
cmd = [str(binary)] + (args or [])
env = os.environ.copy()
env["ASAN_OPTIONS"] = "detect_leaks=0:halt_on_error=0:symbolize=1:verify_asan_link_order=0"
symbolizer = ROCM_HOME / "llvm" / "bin" / "llvm-symbolizer"
if symbolizer.is_file():
env["ASAN_SYMBOLIZER_PATH"] = str(symbolizer)
llvm_lib_dir = str(ROCM_HOME / "llvm" / "lib")
rocm_lib_dir = str(ROCM_HOME / "lib")
existing_ldpath = env.get("LD_LIBRARY_PATH", "")
# Preserve any existing LD_LIBRARY_PATH (e.g., ASAN overlay dirs) and
# append the extra LLVM/ROCm/ASAN runtime directories if they are not
# already present.
ld_paths: list[str] = []
if existing_ldpath:
ld_paths = [p for p in existing_ldpath.split(os.pathsep) if p]
for extra in (llvm_lib_dir, rocm_lib_dir, str(asan_lib.parent) if asan_lib else None):
if extra and extra not in ld_paths:
ld_paths.append(extra)
if ld_paths:
env["LD_LIBRARY_PATH"] = os.pathsep.join(ld_paths)

Copilot uses AI. Check for mistakes.
Comment thread tests/asan_tests/run_therock_sanity.sh Outdated
Comment on lines +67 to +73
# LD_PRELOAD the ASAN runtime so ASAN-instrumented libraries don't crash
if [ -n "$ASAN_LIB" ]; then
export LD_PRELOAD="${LD_PRELOAD:+$LD_PRELOAD:}$ASAN_LIB"
fi

export ASAN_OPTIONS="${ASAN_OPTIONS:-detect_leaks=0:halt_on_error=0:symbolize=1}"

Copy link

Copilot AI Mar 31, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This script adds the ASAN runtime to LD_PRELOAD. In this repo’s TheRock ASAN overlay setup, LD_PRELOAD has been shown to break HIP initialization (null function pointers via dlsym) and can cause SEGVs. Prefer relying on the ASAN runtime being a normal DT_NEEDED dependency (runtime dir on LD_LIBRARY_PATH) and set ASAN_OPTIONS=...:verify_asan_link_order=0 rather than preloading.

Suggested change
# LD_PRELOAD the ASAN runtime so ASAN-instrumented libraries don't crash
if [ -n "$ASAN_LIB" ]; then
export LD_PRELOAD="${LD_PRELOAD:+$LD_PRELOAD:}$ASAN_LIB"
fi
export ASAN_OPTIONS="${ASAN_OPTIONS:-detect_leaks=0:halt_on_error=0:symbolize=1}"
# Configure ASAN runtime: rely on normal linking instead of LD_PRELOAD.
# In TheRock ASAN overlay, preloading ASAN can break HIP initialization (null
# function pointers via dlsym) and cause SEGVs, so we avoid modifying LD_PRELOAD.
#
# If ASAN_OPTIONS is not already set, provide a sensible default and explicitly
# disable link-order verification since we are not preloading ASAN.
export ASAN_OPTIONS="${ASAN_OPTIONS:-detect_leaks=0:halt_on_error=0:symbolize=1:verify_asan_link_order=0}"

Copilot uses AI. Check for mistakes.
ENV ASAN_SYMBOLIZER_PATH=${ROCM_HOME}/llvm/bin/llvm-symbolizer
ENV ASAN_OPTIONS="detect_leaks=0:halt_on_error=0:symbolize=1:verify_asan_link_order=0"

WORKDIR /workspace
Copy link

Copilot AI Mar 31, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The verification scripts’ headers suggest running /workspace/asan_tests/... directly inside the image, but this Dockerfile never copies tests/asan_tests into /workspace/asan_tests (the container only has /workspace as a working dir). Either copy the test suite into the image (so the documented docker run ... /workspace/asan_tests/... works) or update the scripts/docs to reference the repo-mount path used by docker-compose.build.yaml (e.g. /workspace/aorta/tests/asan_tests).

Suggested change
WORKDIR /workspace
WORKDIR /workspace
COPY tests/asan_tests /workspace/asan_tests

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants