Hi, I have three questions:

1/3 Evaluation of the searched code:
I evaluated the searched code from results/arc_gpt3.5_results.jsonl (line 147), which reports Median: 13.7%.
But I get a much lower result when I re-run the evaluation myself: 95% Bootstrap Confidence Interval: (1.3%, 5.7%), Median: 3.3%.
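For reference, here is a minimal sketch of how I compute the median and 95% bootstrap CI from the per-task accuracies (my own re-implementation for checking, not the repo's evaluation code; the function name and resampling count are my choices):

```python
import random
import statistics

def median_and_bootstrap_ci(scores, n_boot=1000, alpha=0.05, seed=0):
    """Return (median, (ci_lo, ci_hi)) using the percentile bootstrap.

    scores: list of per-task accuracies in [0, 1].
    """
    rng = random.Random(seed)  # fixed seed for reproducibility
    boot_medians = []
    for _ in range(n_boot):
        # resample with replacement, same size as the original
        sample = [rng.choice(scores) for _ in scores]
        boot_medians.append(statistics.median(sample))
    boot_medians.sort()
    lo = boot_medians[int(alpha / 2 * n_boot)]
    hi = boot_medians[int((1 - alpha / 2) * n_boot) - 1]
    return statistics.median(scores), (lo, hi)
```

If the official evaluation computes the interval differently (e.g. a different resampling count or percentile convention), that alone could shift the CI endpoints slightly, but not by the margin reported above.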

2/3 Running search.py and evaluation:
When I run search.py myself and then evaluate, the best result is "95% Bootstrap Confidence Interval: (3.0%, 14.0%), Median: 8.0%", which still falls short of the results reported in the paper.

3/3 Are the reported results based on the sampled training dataset rather than the evaluation dataset?

Looking forward to your reply.