Paper: arXiv Preprint
Artifact Archive: Zenodo Permanent Record
Authors: Sabaat Haroon, Ahmad Faraz Khan, Ahmad Humayun, Waris Gill, Abdul Haddi Amjad, Ali R. Butt, Mohammad Taha Khan, Muhammad Ali Gulzar
This work provides the experimental framework used to conduct the first large-scale empirical investigation into the robustness of Large Language Models' (LLMs) fault localization (FL) capabilities. While LLMs are increasingly used for software maintenance, our research reveals that their performance is often tied to surface-level syntactic cues rather than true program semantics.
This artifact provides:
Robustness Testing: A framework to apply Semantic-Preserving Mutations (SPMs), such as misleading comments, misleading variable names, or dead code, to evaluate whether a model's fault localization accuracy remains unaffected.
Reproducibility: Pre-configured scripts to replicate the findings of RQ1, RQ2, and RQ3 as presented in the paper.
Extensibility: Instructions to add custom datasets and test new models via Ollama or proprietary APIs.
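To illustrate the SPM idea, here is a minimal sketch of one mutation type, dead-code insertion, in Python. This is an illustrative example only, not the artifact's actual mutation scripts (which live in the repository); the function name `insert_dead_code` is hypothetical.

```python
def insert_dead_code(source: str, line_no: int) -> str:
    """Insert a semantically inert statement before the given 1-based line.

    The inserted branch can never execute, so program behavior is
    preserved while the surface syntax changes (a strength-1 SPM).
    """
    lines = source.splitlines()
    # Match the indentation of the target line so the file stays valid.
    target = lines[line_no - 1]
    indent = len(target) - len(target.lstrip())
    dead = " " * indent + "if False: _unused = 0  # dead code (SPM)"
    lines.insert(line_no - 1, dead)
    return "\n".join(lines) + "\n"
```

A robust fault localizer should report the same (shifted) bug line before and after such a mutation.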
- Permanently hosted on Zenodo and supplemented on GitHub.
- Fully packaged with Docker
- All scripts included
The artifact is:
Documented: Includes installation steps, execution commands (quick and full modes), expected outputs, and dataset format specification.
Consistent: Directly implements the evaluation framework and reproduces the paper’s reported results.
Complete: Contains all scripts, data (or generation pipeline), and plotting tools required for reproduction.
Exercisable: Runs in a 15–20 minute quick mode and supports full pipeline execution to regenerate results.
Paper: ICST 2026 (Accepted)
Archived Artifact: The exact version of this repository (including code, configurations, and instructions) is archived at Zenodo
GitHub Repository: GitHub
License: MIT
All Java and Python datasets used in the study are fully included in this artifact, along with the complete set of data processing and handling scripts.
- Docker
- Ollama installed on host
- 12GB RAM recommended
- Commodity GPU required for the quick mode
- A high-end GPU (e.g., NVIDIA RTX 5090) is required for full paper reproduction.
- Java installed (for Java pipeline)
- A longer runtime should be expected for full reproduction.
Check installation:

```shell
docker --version
```

If not installed:
- Linux: https://docs.docker.com/engine/install/
- macOS/Windows: https://www.docker.com/products/docker-desktop/
- Download from https://ollama.com/download

Pull model:

```shell
ollama pull llama3.2:3b
```

The quick (`--eval-only`) mode runs evaluation on pre-generated data.
From repository root, build the image:

```shell
docker build -t artifact-eval ./artifact
```

Linux (host networking):

```shell
cd artifact
chmod +x run_artifact.sh
docker run --rm -it --network host -v "$(pwd)":/artifact -w /artifact artifact-eval ./run_artifact.sh --eval-only llama3.2:3b
```

macOS/Windows (Ollama reached via `host.docker.internal`):

```shell
cd artifact
chmod +x run_artifact.sh
docker run --rm -it -e OLLAMA_HOST=http://host.docker.internal:11434 -v "$(pwd)":/artifact -w /artifact artifact-eval ./run_artifact.sh --eval-only llama3.2:3b
```

The expected terminal output appears at the start of the run.
Wait 10–15 minutes for the run to complete.
After completion, the artifact/ directory will contain:

- results_summary.txt
- artifact_results.png
- artifact_results_strength_comparison.png
- artifact_results_mutation_types.png
- artifact_results_windowed.png
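As a quick sanity check after a run (this helper is not part of the artifact's scripts; `missing_outputs` is a hypothetical name), you can verify that every expected output file was produced:

```python
from pathlib import Path

# Output files the quick run is expected to produce (from the list above).
EXPECTED_OUTPUTS = [
    "results_summary.txt",
    "artifact_results.png",
    "artifact_results_strength_comparison.png",
    "artifact_results_mutation_types.png",
    "artifact_results_windowed.png",
]

def missing_outputs(artifact_dir: str) -> list[str]:
    """Return the expected output files absent from artifact_dir."""
    root = Path(artifact_dir)
    return [name for name in EXPECTED_OUTPUTS if not (root / name).is_file()]
```

An empty return value means the run produced every expected file.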
Shows, for each bug type (SAM), how many of the N programs were still correctly localized after each of the five SPMs (mutation strength 1). Open:
artifact_results.png
- For each bug type (SAM), localization success at mutation strength 1 vs 4 for each of the five SPMs.
- Total number of programs still correctly localized, by SPM type, summed over all bug types, at strength 1 vs 4.
Open:
artifact_results_strength_comparison.png
artifact_results_mutation_types.png
Cumulative count of correct (matches) vs incorrect (mismatches) localizations by bug position in the file. Open:
artifact_results_windowed.png
Regenerates SPMs and recomputes results for the first N programs:

```shell
./run_artifact.sh llama3.2:3b 5
```

Default N = 5.
Requires Java installed on host:

```shell
cd artifact_java
chmod +x run_artifact.sh run-experiments.sh
./run_artifact.sh llama3.2:3b
```

Full paper reproduction (Python pipeline):

```shell
./run_paper_python.sh llama3.2:3b
```

This mode requires a high-end GPU and has a significantly longer runtime.
Each buggy program must be one JSON file containing:

| Field | Type | Description |
| --- | --- | --- |
| instruction | string | Intended behavior |
| buggy_code | string | Full source code with bug |
| line_no | number | 1-based bug line |
| line_no_percent | string | Percentage location |
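The README does not spell out how line_no_percent is derived. One hypothetical computation consistent with the example record (the bug on line 2 of a 2-line file gives "100") is:

```python
def line_no_percent(buggy_code: str, line_no: int) -> str:
    """Percentage position of the bug line within the file.

    Hypothetical reconstruction: matches the shipped example,
    where line 2 of a 2-line program yields "100".
    """
    total = len(buggy_code.splitlines())
    return str(round(100 * line_no / total))
```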
Example:

```json
{
  "instruction": "Return the sum of two integers.",
  "buggy_code": "def add(a, b):\n return a - b\n",
  "line_no": 2,
  "line_no_percent": "100"
}
```

Place dataset folders inside:

- artifact/ for Python
- artifact_java/ for Java
Or mount a custom directory during Docker execution.
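Before mounting a custom dataset, it may help to sanity-check each record against the schema above. The checker below is a hypothetical convenience, not part of the artifact; it verifies the required fields, their types, and that line_no falls inside buggy_code:

```python
# Required dataset fields and their expected Python types (per the schema above).
REQUIRED_FIELDS = {
    "instruction": str,
    "buggy_code": str,
    "line_no": int,
    "line_no_percent": str,
}

def validate_record(record: dict) -> list[str]:
    """Return a list of problems; an empty list means the record looks valid."""
    problems = []
    for field, typ in REQUIRED_FIELDS.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], typ):
            problems.append(f"{field} should be {typ.__name__}")
    if not problems:
        # The 1-based bug line must exist in the source code.
        total = len(record["buggy_code"].splitlines())
        if not 1 <= record["line_no"] <= total:
            problems.append("line_no outside buggy_code")
    return problems
```

Running it over every JSON file in a custom dataset folder before a full run can catch formatting mistakes early.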
- The artifact is permanently archived on Zenodo and supplemented on GitHub.
- The archived version includes all source code, datasets, mutation scripts, evaluation pipelines, and plotting utilities required to reproduce the results reported in the paper. The system is containerized using Docker to ensure portability across environments.
The artifact directly implements the methodology described in the paper and is structured to make reproduction straightforward. The README provides clear setup instructions and execution commands for both a lightweight quick mode (approximately 15–20 minutes) and the full experimental pipeline. Each research question corresponds to specific scripts and output artifacts, allowing reviewers to trace generated results back to the evaluation stages described in the paper. All components necessary to regenerate the reported findings are included in the repository.
- For questions related to the paper or advanced usage, contact the author directly at [sabaat@vt.edu](mailto:sabaat@vt.edu).
If you use our work in your research, please cite the paper:
@inproceedings{haroon2026assessing,
title={Assessing the Impact of Code Changes on the Fault Localizability of Large Language Models},
author={Haroon, Sabaat and Khan, Ahmad Faraz and Humayun, Ahmad and Gill, Waris and Amjad, Abdul Haddi and Butt, Ali R and Khan, Mohammad Taha and Gulzar, Muhammad Ali},
booktitle={2026 IEEE Conference on Software Testing, Verification and Validation (ICST)},
year={2026}
}