Hi, I have three questions:

1/3 Evaluation of the searched code:
I evaluated the searched code from results/arc_gpt3.5_results.jsonl (line 147), which reports Median: 13.7%.
But I get a much lower result when I re-run the evaluation myself: 95% Bootstrap Confidence Interval: (1.3%, 5.7%), Median: 3.3%.
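For reference, here is a minimal sketch of how I compute the median and 95% bootstrap CI from the per-task accuracies (my own re-implementation for checking, not the repo's evaluation code; the function name and resampling count are my choices):

```python
import random
import statistics

def median_and_bootstrap_ci(scores, n_boot=1000, alpha=0.05, seed=0):
    """Return (median, (ci_lo, ci_hi)) using the percentile bootstrap.

    scores: list of per-task accuracies in [0, 1].
    """
    rng = random.Random(seed)  # fixed seed for reproducibility
    boot_medians = []
    for _ in range(n_boot):
        # resample with replacement, same size as the original
        sample = [rng.choice(scores) for _ in scores]
        boot_medians.append(statistics.median(sample))
    boot_medians.sort()
    lo = boot_medians[int(alpha / 2 * n_boot)]
    hi = boot_medians[int((1 - alpha / 2) * n_boot) - 1]
    return statistics.median(scores), (lo, hi)
```

If the official evaluation computes the interval differently (e.g. a different resampling count or percentile convention), that alone could shift the CI endpoints slightly, but not by the margin reported above.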

2/3 Running search.py and evaluation:
When I run search.py myself and then evaluate, the best result is "95% Bootstrap Confidence Interval: (3.0%, 14.0%), Median: 8.0%", which still falls short of the results reported in the paper.

3/3 Are the reported results based on the sampled training dataset rather than the evaluation dataset?

Looking forward to your reply.