Hello,
I am trying to reproduce the results of this repository on the Devign dataset, but I am encountering some difficulties.
After running multiple experiments, the F1 score I obtained is only around 57–58, which is significantly lower than the results reported in the paper.
During reproduction I also noticed several issues in the code:
- The preprocessing pipeline is inefficient and relies heavily on brute-force operations.
- Because of this, I had to rewrite and optimize several parts of the data processing before the pipeline would run properly.
- Even with the optimized preprocessing, and after checking that the dataset and splits are correct, the F1 score stays around 57–58.
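To make "the splits are correct" concrete, a check of roughly the following kind can be run on the processed data. This is only a minimal sketch, not code from the repository: the function name `check_splits` and the `(sample_id, label)` pair layout are assumptions made for illustration, with label 1 assumed to mean "vulnerable".

```python
from collections import Counter


def check_splits(train, valid, test):
    """Report ID overlap between splits and per-split class balance.

    Each split is assumed to be a list of (sample_id, label) pairs,
    where label 1 marks a vulnerable function (hypothetical layout).
    """
    splits = {"train": train, "valid": valid, "test": test}

    # Collect the set of sample IDs in each split.
    ids = {name: {sid for sid, _ in rows} for name, rows in splits.items()}

    # Count IDs shared between each pair of splits; any non-zero value is leakage.
    pairs = [("train", "valid"), ("train", "test"), ("valid", "test")]
    overlaps = {(a, b): len(ids[a] & ids[b]) for a, b in pairs}

    # Share of the data in each split and its fraction of positive labels.
    total = sum(len(rows) for rows in splits.values())
    report = {}
    for name, rows in splits.items():
        labels = Counter(label for _, label in rows)
        report[name] = {
            "fraction": len(rows) / total if total else 0.0,
            "positive_rate": labels[1] / len(rows) if rows else 0.0,
        }
    return overlaps, report
```

If every overlap count is zero and the split fractions and positive rates match what the paper describes, the splits themselves are unlikely to explain the gap in F1.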
I would like to ask:
- Has anyone successfully reproduced the reported results on the Devign dataset?
- If so, what F1 score did you obtain?
- Are there any important preprocessing steps, hyperparameters, or dataset filtering steps that are not clearly documented?
Any guidance would be greatly appreciated.
Thank you!