Enquiry about reproducing results on ARC  #21

@dercaft

Hi, I have three questions:

1/3 Evaluation of the searched code:

I evaluated the searched code from results/arc_gpt3.5_results.jsonl (line 147), which reports Median: 13.7%.
But I get a much lower result: 95% Bootstrap Confidence Interval: (1.3%, 5.7%), Median: 3.3%.
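For reference, this is roughly the bootstrap procedure I used to compute the interval above (a minimal sketch, assuming the evaluation produces one accuracy score per ARC task; the function name and parameters here are my own, not from the repo):

```python
import random
import statistics

def bootstrap_ci(scores, n_boot=10000, alpha=0.05, seed=0):
    """Percentile-bootstrap 95% CI and median over per-task accuracy scores."""
    rng = random.Random(seed)
    n = len(scores)
    medians = []
    for _ in range(n_boot):
        # Resample the per-task scores with replacement and record the median.
        sample = [scores[rng.randrange(n)] for _ in range(n)]
        medians.append(statistics.median(sample))
    medians.sort()
    lo = medians[int((alpha / 2) * n_boot)]
    hi = medians[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi, statistics.median(scores)
```

If the repo's evaluation aggregates differently (e.g. bootstrapping the mean instead of the median), that alone could explain part of the gap, so please correct me if so.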


2/3 Running search.py and evaluation:

python search.py

The best result I get is "95% Bootstrap Confidence Interval: (3.0%, 14.0%), Median: 8.0%", which still falls short of the results reported in the paper.

3/3 Are the reported results based on the sampled training dataset rather than on the evaluation dataset?

Looking forward to your reply.
