Hi @Leolty, this is a creative and challenging task built on an interesting idea. It exercises many abilities that matter in realistic scenarios, including web interaction, visual perception, and code implementation and execution.
I have two small concerns:
1. Note-head count: I believe the answer makes sense. However, the definition of a note-head is ambiguous, so unless the number of correct models is specified, I would assume that GPT-5.2-Thinking (Extended) has five note-heads. You might consider explicitly stating how many models satisfy the requirement that “note-head counts match the original image.”
2. The agent output example in evaluation.md seems to be missing.
Once these two issues are resolved, I’d be happy to include this amazing task. Thank you!
A complex and interesting task that requires GUI interaction, visual inspection, and complex coding.