SEED-VT/LLMCodeImpact

Assessing the Impact of Code Changes on the Fault Localizability of Large Language Models

Paper: arXiv Preprint
Artifact Archive: Zenodo Permanent Record
Authors: Sabaat Haroon, Ahmad Faraz Khan, Ahmad Humayun, Waris Gill, Abdul Haddi Amjad, Ali R. Butt, Mohammad Taha Khan, Muhammad Ali Gulzar


1. Purpose

This work provides the experimental framework used to conduct the first large-scale empirical investigation into the robustness of Large Language Models' (LLMs) fault localization (FL) capabilities. While LLMs are increasingly used for software maintenance, our research reveals that their performance is often tied to surface-level syntactic cues rather than true program semantics.

This artifact provides:

Robustness Testing: A framework that applies Semantic-Preserving Mutations (SPMs), such as misleading comments, misleading variable names, or dead code, to evaluate whether the model's fault-localization accuracy remains unaffected.

Reproducibility: Pre-configured scripts to replicate the findings of RQ1, RQ2, and RQ3 as presented in the paper.

Extensibility: Instructions to add custom datasets and test new models via Ollama or proprietary APIs.
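As a concrete illustration of an SPM, the sketch below inserts unreachable dead code before a chosen line. It is a minimal example of the idea only, not the artifact's actual mutation engine:

```python
# Minimal sketch of one Semantic-Preserving Mutation (SPM): dead-code
# insertion. Illustrative only -- NOT the artifact's mutation engine.
# The mutated program behaves exactly like the original.

def insert_dead_code(source: str, line_no: int) -> str:
    """Insert an unreachable branch before 1-based line `line_no`."""
    lines = source.splitlines()
    target = lines[line_no - 1]
    # Reuse the target line's indentation so the result still parses.
    indent = target[: len(target) - len(target.lstrip())]
    dead = [indent + "if False:  # never executes", indent + "    pass"]
    return "\n".join(lines[: line_no - 1] + dead + lines[line_no - 1:])

buggy = "def add(a, b):\n    return a - b"
print(insert_dead_code(buggy, 2))
```

A robust model's fault-localization answer should be unchanged by such a rewrite; the framework measures how often it actually is.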

Badges Claimed:

Artifact Available

  • Permanently hosted on Zenodo and supplemented on GitHub.
  • Fully packaged with Docker
  • All scripts included

Artifact Reviewed

The artifact is:

Documented: Includes installation steps, execution commands (quick and full modes), expected outputs, and dataset format specification.

Consistent: Directly implements the evaluation framework and reproduces the paper’s reported results.

Complete: Contains all scripts, data (or generation pipeline), and plotting tools required for reproduction.

Exercisable: Runs in a 15–20 minute quick mode and supports full pipeline execution to regenerate results.


2. Provenance

Paper: ICST 2026 (Accepted)

Archived Artifact: The exact version of this repository (including code, configurations, and instructions) is archived at Zenodo

GitHub Repository: GitHub

License: MIT


3. Data & System Requirements

All Java and Python datasets used in the study are fully included in this artifact, along with the complete set of data processing and handling scripts.

Minimum System Requirements (Artifact Evaluation Mode)

  • Docker
  • Ollama installed on host
  • 12GB RAM recommended
  • Commodity GPU required

Full Paper Reproduction

  • A high-end GPU (e.g., NVIDIA RTX 5090) is required for full paper reproduction.
  • Java installed (for Java pipeline)
  • A longer runtime should be expected for full reproduction.

4. Setup

4.1 Install Docker

Check installation:

    docker --version

If not installed, follow the official Docker installation guide for your platform.


4.2 Install Ollama

Install Ollama on the host, then pull the evaluation model:

    ollama pull llama3.2:3b

[Screenshot: successful ollama pull]


5. Quick Evaluation (15–20 Minutes)

Runs evaluation on pre-generated data.

Step 1: Build Docker Image

From repository root:

    docker build -t artifact-eval ./artifact

[Screenshot: successful Docker build]


Step 2: Run Quick Evaluation

Linux

    cd artifact
    chmod +x run_artifact.sh
    docker run --rm -it --network host \
        -v "$(pwd)":/artifact \
        -w /artifact \
        artifact-eval \
        ./run_artifact.sh --eval-only llama3.2:3b

macOS / Windows

    cd artifact
    chmod +x run_artifact.sh
    docker run --rm -it \
        -e OLLAMA_HOST=http://host.docker.internal:11434 \
        -v "$(pwd)":/artifact \
        -w /artifact \
        artifact-eval \
        ./run_artifact.sh --eval-only llama3.2:3b

Expected terminal output at the start of run:

[Screenshot: successful run start]

Wait 10–15 minutes for the run to complete. At the end of a successful run you should see:

[Screenshot: successful run end]


7. Expected Output Files

After completion, the artifact/ directory will contain:

  • results_summary.txt
  • artifact_results.png
  • artifact_results_strength_comparison.png
  • artifact_results_mutation_types.png
  • artifact_results_windowed.png

[Screenshot: successful file generation]


8. Interpreting Results

RQ1: Robustness to SPMs

For each bug type (SAM), how many of the N programs were still correctly localized after each of the five SPMs (mutation strength 1). Open:

artifact_results.png

[Example plot: Artifact Results]
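The measurement behind this plot can be sketched as follows; the record fields here are hypothetical, not the artifact's actual result schema:

```python
# Illustrative computation behind the RQ1 robustness numbers: for each
# SPM, count how many programs the model localized correctly on the
# original code and STILL localizes correctly after the mutation
# (strength 1). Record fields are hypothetical, not the artifact's schema.

from collections import Counter

results = [
    # (spm_type, correct_before_mutation, correct_after_mutation)
    ("misleading_comment", True, True),
    ("misleading_comment", True, False),
    ("dead_code", True, True),
    ("dead_code", True, True),
]

still_correct = Counter(
    spm for spm, before, after in results if before and after
)
print(dict(still_correct))  # {'misleading_comment': 1, 'dead_code': 2}
```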

RQ2: Effect of Mutation Type & Strength

  • For each bug type (SAM), localization success at mutation strength 1 vs 4 for each of the five SPMs.
  • Total number of programs still correctly localized per SPM type, summed over all bug types, at strength 1 vs 4.

Open:

artifact_results_strength_comparison.png
artifact_results_mutation_types.png

[Example plots: Mutation Strength Analysis, Mutation Type Analysis]

RQ3: Effect of Fault Location

Cumulative count of correct (matches) vs incorrect (mismatches) localizations by bug position in the file. Open:

artifact_results_windowed.png

[Example plot: Fault Location Analysis]
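The windowing used for this view can be sketched as below; the window size and field names are assumptions for illustration, not the artifact's exact parameters:

```python
# Illustrative sketch of RQ3 windowing: bucket localizations by where the
# bug sits in the file (as a percentage of file length) and count matches
# vs mismatches per window. Window size and fields are assumptions.

def window_counts(records, window=25):
    """records: (line_no_percent, matched) pairs.
    Returns {window_start: [matches, mismatches]}."""
    buckets = {}
    for percent, matched in records:
        # Clamp 100% into the last window instead of a one-off bucket.
        start = min(int(percent) // window * window, 100 - window)
        counts = buckets.setdefault(start, [0, 0])
        counts[0 if matched else 1] += 1
    return buckets

records = [(10, True), (40, False), (95, True), (100, True)]
print(window_counts(records))  # {0: [1, 0], 25: [0, 1], 75: [2, 0]}
```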


9. Full Pipeline (runtime scales with N; expect 30 min+)

Regenerates SPMs and recomputes results for the first N programs:

    ./run_artifact.sh llama3.2:3b 5

Default N = 5.


10. Java Pipeline

Requires Java installed on host:

    cd artifact_java
    chmod +x run_artifact.sh run-experiments.sh
    ./run_artifact.sh llama3.2:3b

11. Full Paper Reproduction (RQ4, RQ5)

    ./run_paper_python.sh llama3.2:3b

This mode requires a high-end GPU and has a significantly longer runtime.


12. Adding Your Own Python or Java Projects

Each buggy program must be one JSON file containing:

    Field            Type     Description
    instruction      string   Intended behavior
    buggy_code       string   Full source code with the bug
    line_no          number   1-based bug line
    line_no_percent  string   Percentage location of the bug line

Example:

    {
      "instruction": "Return the sum of two integers.",
      "buggy_code": "def add(a, b):\n    return a - b\n",
      "line_no": 2,
      "line_no_percent": "100"
    }

Place dataset folders inside:

  • artifact/ for Python
  • artifact_java/ for Java

Or mount a custom directory during Docker execution.
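Before dropping in custom files, a quick sanity check against the field table above can catch malformed records early. The helper below is illustrative, not part of the artifact:

```python
# Quick sanity check for a custom dataset record, based on the field
# table above. The helper name and checks are illustrative, not part of
# the artifact's pipeline.

import json

REQUIRED = {"instruction": str, "buggy_code": str,
            "line_no": int, "line_no_percent": str}

def validate_record(raw: str) -> dict:
    rec = json.loads(raw)
    for field, typ in REQUIRED.items():
        if not isinstance(rec.get(field), typ):
            raise ValueError(f"missing or mistyped field: {field}")
    # line_no is 1-based and must point inside buggy_code.
    n_lines = len(rec["buggy_code"].splitlines())
    if not 1 <= rec["line_no"] <= n_lines:
        raise ValueError("line_no outside buggy_code")
    return rec

rec = validate_record(json.dumps({
    "instruction": "Return the sum of two integers.",
    "buggy_code": "def add(a, b):\n    return a - b\n",
    "line_no": 2,
    "line_no_percent": "100",
}))
print("valid record, bug at line", rec["line_no"])
```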


13. How This Artifact Meets ICST Criteria

Artifact Available

  • The artifact is permanently archived on Zenodo and supplemented on GitHub.
  • The archived version includes all source code, datasets, mutation scripts, evaluation pipelines, and plotting utilities required to reproduce the results reported in the paper. The system is containerized using Docker to ensure portability across environments.

Artifact Reviewed

The artifact directly implements the methodology described in the paper and is structured to make reproduction straightforward. The README provides clear setup instructions and execution commands for both a lightweight quick mode (approximately 15–20 minutes) and the full experimental pipeline. Each research question corresponds to specific scripts and output artifacts, allowing reviewers to trace generated results back to the evaluation stages described in the paper. All components necessary to regenerate the reported findings are included in the repository.


14. Contact


Citation

If you use our work in your research, please cite the paper:

@inproceedings{haroon2026assessing,
  title={Assessing the Impact of Code Changes on the Fault Localizability of Large Language Models},
  author={Haroon, Sabaat and Khan, Ahmad Faraz and Humayun, Ahmad and Gill, Waris and Amjad, Abdul Haddi and Butt, Ali R and Khan, Mohammad Taha and Gulzar, Muhammad Ali},
  booktitle={2026 IEEE Conference on Software Testing, Verification and Validation (ICST)},
  year={2026}
}
