This is the repository for our paper Towards Explainable Code Readability Classification with Graph Neural Networks.
Context: Code readability is a central concern for developers, since more readable code indicates higher maintainability, reusability, and portability. In recent years, many deep learning-based code readability classification methods have been proposed. Among them, a Graph Neural Network (GNN)-based model has achieved the best performance in code readability classification. However, it is still unclear which aspects of the model’s input lead to its decisions, which hinders its practical use in the software industry.
Objective: To improve the interpretability of existing code readability classification models and identify the key code characteristics that drive their readability predictions, we propose an explanation framework with GNN explainers towards transparent and trustworthy code readability classification.
Method: First, we propose a simplified Abstract Syntax Tree (AST)-based code representation method, which transforms Java code snippets into ASTs and discards lower-level nodes with limited information. Then, we retrain the state-of-the-art GNN-based model with our simplified program graphs. Finally, we employ SubgraphX to explain the model’s code readability predictions at the subgraph level and visualize the explanation results to further analyze what causes such predictions.
Result: The experimental results show that sequential logic, code comments, selection logic, and nested structure are the most influential code characteristics when classifying code snippets as readable or unreadable. Further investigations indicate the model’s proficiency in capturing features related to complex logic structures and extensive data flows but point to its limitations in identifying readability issues associated with naming conventions and code formatting.
Conclusion: The explainability analysis conducted in this research is the first step toward more transparent and reliable code readability classification. We believe that our findings are useful in providing constructive suggestions for developers to write more readable code and in delineating directions for future model improvement.
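The AST simplification idea from the Method (parse a snippet, then discard lower-level nodes) can be illustrated with a minimal sketch. This is not the repository's implementation: it uses Python's standard-library `ast` module as a stand-in for a Java parser, and the `LOW_LEVEL` set and the `simplified_graph` function are hypothetical names chosen only for this example.

```python
import ast

# Hypothetical choice: node types treated as "lower-level" and dropped.
# The paper's actual pruning rule for Java ASTs is not reproduced here.
LOW_LEVEL = (ast.Name, ast.Constant, ast.expr_context, ast.operator,
             ast.cmpop, ast.boolop, ast.unaryop)


def simplified_graph(source):
    """Return (node_labels, edges) for a pruned AST of `source`."""
    labels, edges = [], []

    def visit(node, parent):
        if isinstance(node, LOW_LEVEL):
            return  # discard the node together with its subtree
        index = len(labels)
        labels.append(type(node).__name__)
        if parent is not None:
            edges.append((parent, index))  # directed parent -> child edge
        for child in ast.iter_child_nodes(node):
            visit(child, index)

    visit(ast.parse(source), None)
    return labels, edges


snippet = "if x > 0:\n    y = x + 1\nelse:\n    y = 0\n"
labels, edges = simplified_graph(snippet)
print(labels)
print(edges)
```

The resulting label list and edge list are the kind of input a graph library would turn into node features and an adjacency structure; how the repository encodes them is not shown here.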
Explainability
├─Code
│ └─checkpoint
│ ├─code_hh
│ └─code_readability_new
├─Dataset
│ ├─Neutral
│ ├─Readable
│ └─Unreadable
├─explanation
│ ├─explainer
│ └─utils
├─newResults
│ ├─readable
│ └─unreadable
├─Questionnaire
└─result
In Code folder:
The code_dataset.py file implements the pipeline that constructs the AST-based program graphs.
The Code/checkpoint/code_readability_new folder stores the best GCN model parameters obtained from five-fold training. These model files can be loaded with the PyTorch library using model.load_state_dict(torch.load(model_path)).
In Dataset folder:
The Dataset/Neutral folder stores the code snippets that are labeled as neutral.
The Dataset/Readable folder stores the code snippets that are labeled as readable.
The Dataset/Unreadable folder stores the code snippets that are labeled as unreadable.
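A quick way to check the class balance of the three label folders is to count the files in each one. The sketch below is illustrative only: `LABELS` and `count_snippets` are hypothetical names that are not part of the repository, and the example call assumes the script is run from the repository root.

```python
from pathlib import Path

# Folder names taken from the directory tree of this repository.
LABELS = ("Neutral", "Readable", "Unreadable")


def count_snippets(dataset_root, pattern="*"):
    """Map each label folder to the number of files it contains."""
    root = Path(dataset_root)
    counts = {}
    for label in LABELS:
        folder = root / label
        if folder.is_dir():
            counts[label] = sum(1 for p in folder.glob(pattern) if p.is_file())
        else:
            counts[label] = 0  # a missing folder counts as zero
    return counts


if __name__ == "__main__":
    print(count_snippets("Dataset"))
```

Passing a narrower `pattern` (for example one that matches a single file extension) restricts the count to those files.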
In explanation folder:
The explanation/explainer folder stores the implementation of SubgraphX.
The explanation/explaining_subgraphx.py file implements the entire pipeline of graph construction, GNN-based model training, and GNN interpretation with SubgraphX.
The explanation/input.pkl file stores the input data for the GNN-based model. This file can be loaded with the pandas library using pkl_file = pd.read_pickle(pkl_path).
In newResults folder, all the visualized interpretation results (png files) are stored:
The newResults/readable folder stores the visualized interpretation results of code snippets that are judged as readable.
The newResults/readable/Scalabrino{i}.png file shows the visualized interpretation result of the corresponding code snippet.
The newResults/readable/Scalabrino{i}.java.pt file stores the intermediate results of the interpretation process.
newResults/unreadablefolder stores the visualized interpretation results of code snippets that are judged as unreadable.
In Questionnaire folder:
The Questionnaire/data_analysis.ipynb file contains the data analysis of the questionnaire.
The Questionnaire/questionnaire.pdf file shows the content of the questionnaire.
The Questionnaire/questionnaire_modified.pdf file is a revised questionnaire in accordance with the requirements of gender inclusion and survey rigor.
The Questionnaire/questionnaire_results.xlsx file stores the raw data of the questionnaire.
In result folder, all predictions of the GCN classifier on the dataset are stored.
conda create -n CodeReadability python=3.9
conda activate CodeReadability
The detailed versions of the dependencies are recorded in the requirements.txt file. Run the following command to install them:
pip install -r requirements.txt
Then, you can run the following commands to start the entire pipeline of graph construction, GNN-based model training, and GNN interpretation with SubgraphX:
cd explanation
python explaining_subgraphx.py
In this repository, several files with uncommon extensions are used. Below is a detailed explanation of their purposes and the methods to load them:
Files with the .pt and .pth extensions store the trained model parameters. These are generated during the training process of the GNN-based model and are essential for making predictions or further fine-tuning.
To load a .pt or .pth file, you can use the PyTorch library as follows:
import torch
checkpoint = torch.load("model.pt")  # the same call works for "model.pth"
Files with the .ipynb extension are Jupyter Notebook files. These are used for interactive data analysis, visualization, and documentation purposes. For example, Questionnaire/data_analysis.ipynb contains the data analysis process for the questionnaire results.
To open and run a .ipynb file, you need to install Jupyter Notebook or JupyterLab. Use the following command to start Jupyter Notebook:
jupyter notebook
Then, navigate to the directory containing the .ipynb file and open it in the browser.
Files with the .pkl extension are used to store serialized Python objects. In this project, explanation/input.pkl contains the input data for the GNN-based model. These files are useful for saving and loading complex data structures such as dictionaries, lists, or custom objects.
To load a .pkl file, you can use the pandas library as follows:
import pandas as pd
data = pd.read_pickle("input.pkl")
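Before feeding a pickle file to the model, it can help to check what kind of object it holds. The following is a minimal sketch that uses the standard-library `pickle` module instead of pandas; `describe_pickle` is a hypothetical helper, and the demo dictionary is stand-in data, since the real layout of explanation/input.pkl is not described here.

```python
import pickle
import tempfile
from pathlib import Path


def describe_pickle(path):
    """Load a pickle file and return its top-level type name and length."""
    with open(path, "rb") as handle:
        obj = pickle.load(handle)  # only unpickle files from a trusted source
    size = len(obj) if hasattr(obj, "__len__") else None
    return type(obj).__name__, size


# Stand-in data written to a temporary file for demonstration purposes.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.pkl"
    with open(demo_path, "wb") as handle:
        pickle.dump({"graphs": [1, 2, 3], "labels": [0, 1, 1]}, handle)
    summary = describe_pickle(demo_path)
    print(summary)
```

To inspect the project file, the same helper could be called as `describe_pickle("explanation/input.pkl")` from the repository root, with the project dependencies installed so that any pickled library objects can be reconstructed.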