Paper: arXiv Preprint
Artifact Archive: Zenodo Permanent Record
Authors: Sabaat Haroon, Ahmad Faraz Khan, Ahmad Humayun, Waris Gill, Abdul Haddi Amjad, Ali R. Butt, Mohammad Taha Khan, Muhammad Ali Gulzar
This work provides the experimental framework used to conduct the first large-scale empirical investigation into the robustness of Large Language Models' (LLMs) fault localization (FL) capabilities. While LLMs are increasingly used for software maintenance, our research reveals that their performance is often tied to surface-level syntactic cues rather than true program semantics.
This artifact provides:
Robustness Testing: A framework to apply Semantic-Preserving Mutations (SPMs), such as misleading comments, misleading variable names, or dead code, to evaluate whether a model's fault localization accuracy remains unaffected.
Reproducibility: Pre-configured scripts to replicate the findings of RQ1, RQ2, and RQ3 as presented in the paper.
Extensibility: Instructions to add custom datasets and test new models via Ollama or proprietary APIs.
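To illustrate the SPM idea, here is a minimal sketch of one mutation type, dead-code insertion, in Python. This is an illustrative example only, not the artifact's actual mutation scripts (which live in the repository); the function name `insert_dead_code` is hypothetical.

```python
def insert_dead_code(source: str, line_no: int) -> str:
    """Insert a semantically inert statement before the given 1-based line.

    The inserted branch can never execute, so program behavior is
    preserved while the surface syntax changes (a strength-1 SPM).
    """
    lines = source.splitlines()
    # Match the indentation of the target line so the file stays valid.
    target = lines[line_no - 1]
    indent = len(target) - len(target.lstrip())
    dead = " " * indent + "if False: _unused = 0  # dead code (SPM)"
    lines.insert(line_no - 1, dead)
    return "\n".join(lines) + "\n"
```

A robust fault localizer should report the same (shifted) bug line before and after such a mutation.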
- Permanently hosted on Zenodo and supplemented on GitHub.
- Fully packaged with Docker
- All scripts included
The artifact is:
Documented: Includes installation steps, execution commands (quick and full modes), expected outputs, and dataset format specification.
Consistent: Directly implements the evaluation framework and reproduces the paper’s reported results.
Complete: Contains all scripts, data (or generation pipeline), and plotting tools required for reproduction.
Exercisable: Runs in a 15–20 minute quick mode and supports full pipeline execution to regenerate results.
Paper: ICST 2026 (Accepted)
Archived Artifact: The exact version of this repository (including code, configurations, and instructions) is archived at Zenodo
GitHub Repository: GitHub
License: MIT
All Java and Python datasets used in the study are fully included in this artifact, along with the complete set of data processing and handling scripts.
- Docker
- Ollama installed on host
- 12GB RAM recommended
- Commodity GPU required for the quick mode
- A high-end GPU (e.g., NVIDIA RTX 5090) is required for full paper reproduction.
- Java installed (for Java pipeline)
- A longer runtime should be expected for full reproduction.
Check installation:

```shell
docker --version
```

If not installed:
- Linux: https://docs.docker.com/engine/install/
- macOS/Windows: https://www.docker.com/products/docker-desktop/
- Download from https://ollama.com/download

Pull model:

```shell
ollama pull llama3.2:3b
```

The quick (`--eval-only`) mode runs evaluation on pre-generated data.
From repository root, build the image:

```shell
docker build -t artifact-eval ./artifact
```

Linux (host networking):

```shell
cd artifact
chmod +x run_artifact.sh
docker run --rm -it --network host -v "$(pwd)":/artifact -w /artifact artifact-eval ./run_artifact.sh --eval-only llama3.2:3b
```

macOS/Windows (Ollama reached via `host.docker.internal`):

```shell
cd artifact
chmod +x run_artifact.sh
docker run --rm -it -e OLLAMA_HOST=http://host.docker.internal:11434 -v "$(pwd)":/artifact -w /artifact artifact-eval ./run_artifact.sh --eval-only llama3.2:3b
```

The expected terminal output appears at the start of the run.
Wait 10–15 minutes for the run to complete.
After completion, the artifact/ directory will contain:

- results_summary.txt
- artifact_results.png
- artifact_results_strength_comparison.png
- artifact_results_mutation_types.png
- artifact_results_windowed.png
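As a quick sanity check after a run (this helper is not part of the artifact's scripts; `missing_outputs` is a hypothetical name), you can verify that every expected output file was produced:

```python
from pathlib import Path

# Output files the quick run is expected to produce (from the list above).
EXPECTED_OUTPUTS = [
    "results_summary.txt",
    "artifact_results.png",
    "artifact_results_strength_comparison.png",
    "artifact_results_mutation_types.png",
    "artifact_results_windowed.png",
]

def missing_outputs(artifact_dir: str) -> list[str]:
    """Return the expected output files absent from artifact_dir."""
    root = Path(artifact_dir)
    return [name for name in EXPECTED_OUTPUTS if not (root / name).is_file()]
```

An empty return value means the run produced every expected file.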
Shows, for each bug type (SAM), how many of the N programs were still correctly localized after each of the five SPMs (mutation strength 1). Open:
artifact_results.png
- For each bug type (SAM), localization success at mutation strength 1 vs 4 for each of the five SPMs.
- Total number of programs still correctly localized, by SPM type, summed over all bug types, at strength 1 vs 4.
Open:
artifact_results_strength_comparison.png
artifact_results_mutation_types.png
Cumulative count of correct (matches) vs incorrect (mismatches) localizations by bug position in the file. Open:
artifact_results_windowed.png
Regenerates SPMs and recomputes results for the first N programs:

```shell
./run_artifact.sh llama3.2:3b 5
```

Default N = 5.
Requires Java installed on host:

```shell
cd artifact_java
chmod +x run_artifact.sh run-experiments.sh
./run_artifact.sh llama3.2:3b
```

Full paper reproduction (Python pipeline):

```shell
./run_paper_python.sh llama3.2:3b
```

This mode requires a high-end GPU and has a significantly longer runtime.
Each buggy program must be one JSON file containing:

| Field | Type | Description |
| --- | --- | --- |
| instruction | string | Intended behavior |
| buggy_code | string | Full source code with bug |
| line_no | number | 1-based bug line |
| line_no_percent | string | Percentage location |
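The README does not spell out how line_no_percent is derived. One hypothetical computation consistent with the example record (the bug on line 2 of a 2-line file gives "100") is:

```python
def line_no_percent(buggy_code: str, line_no: int) -> str:
    """Percentage position of the bug line within the file.

    Hypothetical reconstruction: matches the shipped example,
    where line 2 of a 2-line program yields "100".
    """
    total = len(buggy_code.splitlines())
    return str(round(100 * line_no / total))
```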
Example:

```json
{
  "instruction": "Return the sum of two integers.",
  "buggy_code": "def add(a, b):\n return a - b\n",
  "line_no": 2,
  "line_no_percent": "100"
}
```

Place dataset folders inside:

- artifact/ for Python
- artifact_java/ for Java
Or mount a custom directory during Docker execution.
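Before mounting a custom dataset, it may help to sanity-check each record against the schema above. The checker below is a hypothetical convenience, not part of the artifact; it verifies the required fields, their types, and that line_no falls inside buggy_code:

```python
# Required dataset fields and their expected Python types (per the schema above).
REQUIRED_FIELDS = {
    "instruction": str,
    "buggy_code": str,
    "line_no": int,
    "line_no_percent": str,
}

def validate_record(record: dict) -> list[str]:
    """Return a list of problems; an empty list means the record looks valid."""
    problems = []
    for field, typ in REQUIRED_FIELDS.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], typ):
            problems.append(f"{field} should be {typ.__name__}")
    if not problems:
        # The 1-based bug line must exist in the source code.
        total = len(record["buggy_code"].splitlines())
        if not 1 <= record["line_no"] <= total:
            problems.append("line_no outside buggy_code")
    return problems
```

Running it over every JSON file in a custom dataset folder before a full run can catch formatting mistakes early.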
- The artifact is permanently archived on Zenodo and supplemented on GitHub.
- The archived version includes all source code, datasets, mutation scripts, evaluation pipelines, and plotting utilities required to reproduce the results reported in the paper. The system is containerized using Docker to ensure portability across environments.
The artifact directly implements the methodology described in the paper and is structured to make reproduction straightforward. The README provides clear setup instructions and execution commands for both a lightweight quick mode (approximately 15–20 minutes) and the full experimental pipeline. Each research question corresponds to specific scripts and output artifacts, allowing reviewers to trace generated results back to the evaluation stages described in the paper. All components necessary to regenerate the reported findings are included in the repository.
- For questions related to the paper or advanced usage, contact the author directly at [sabaat@vt.edu](mailto:sabaat@vt.edu).
If you use our work in your research, please cite the paper:
@inproceedings{haroon2026assessing,
title={Assessing the Impact of Code Changes on the Fault Localizability of Large Language Models},
author={Haroon, Sabaat and Khan, Ahmad Faraz and Humayun, Ahmad and Gill, Waris and Amjad, Abdul Haddi and Butt, Ali R and Khan, Mohammad Taha and Gulzar, Muhammad Ali},
booktitle={2026 IEEE Conference on Software Testing, Verification and Validation (ICST)},
year={2026}
}