[New Task] find-fingers #21
Pull request overview
This PR adds three new image recognition tasks to the CocoaBench benchmark: find-players, find-hero, and find-fingers. Each task requires agents to identify and distinguish specific objects in images containing misleading elements, testing their visual recognition, preprocessing capabilities, and reasoning abilities.
Changes:
- Added three new encrypted benchmark tasks (find-players, find-hero, find-fingers) following the contribution guidelines structure
- Each task includes encrypted instruction, evaluation, solution, and metadata files
- Added image URLs hosted on postimg.cc for each task
Reviewed changes
Copilot reviewed 18 out of 18 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| cocoabench-head/find-players/* | New task files for identifying players in an image |
| cocoabench-head/find-hero/* | New task files for identifying heroes in an image |
| cocoabench-head/find-fingers/* | New task files for identifying fingers in an image |

@@ -0,0 +1 @@
02c10714a5911f76 No newline at end of file
The PR title indicates a single new task "[New Task] find-fingers", but this PR actually adds three separate tasks: find-players, find-hero, and find-fingers. The PR description should be updated to mention all three tasks being added, or explain why they are being grouped together in a single PR.
Hi @Eric123-tech! Thanks for the contribution! I read all three tasks carefully. My current impression is that they mostly look like single-image VQA / counting / ID-style problems (even though models can still make mistakes). In contrib/CONTRIBUTING.md, we emphasize multi-step/multi-tool/multi-cognition tasks, typically ones that are difficult to solve in just a few steps and require more extensive work (e.g., search/browse, reasoning, and coding). From that lens, these may not align with the benchmark direction as closely as we would like. The find-hero task does feel somewhat more promising, but it currently reads like a prior-knowledge recognition question. Is the intended workflow that the agent should actually search/browse all Honor of Kings heroes and compare candidates one by one?
9140c12 to 46ea831
Hi @Leolty, thanks for the feedback! You are right that in a general scenario, one-to-one visual matching might be needed. However, in the image provided for this task, the original hero's key quote/line appears at the bottom. The intended workflow is that the agent can extract this text and use it to search for the specific hero directly, rather than browsing through all candidates visually.
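
For illustration only, the workflow described above (crop the bottom of the image, read the quote, search with it) might be sketched roughly as follows. This is not the task's actual solution: the function names, the 20% strip height, the `ocr` callable, and the `example.com` search endpoint are all hypothetical placeholders.

```python
from typing import Callable, Tuple
from urllib.parse import quote_plus

# (left, upper, right, lower): the box convention used by Pillow's Image.crop.
Box = Tuple[int, int, int, int]


def bottom_strip_box(width: int, height: int, fraction: float = 0.2) -> Box:
    """Return the crop box covering the bottom `fraction` of an image.

    The 0.2 default is an assumption; the real quote region may differ.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    upper = height - round(height * fraction)
    return (0, upper, width, height)


def build_search_url(quote_text: str, game: str = "Honor of Kings") -> str:
    """Turn OCR'd quote text into a search URL (placeholder endpoint)."""
    cleaned = " ".join(quote_text.split())  # collapse OCR whitespace noise
    query = f'"{cleaned}" {game} hero'
    return "https://example.com/search?q=" + quote_plus(query)


def hero_search_url(width: int, height: int, ocr: Callable[[Box], str]) -> str:
    """Crop box -> OCR -> search URL. `ocr` is any caller-supplied function
    that returns the text found inside the given box (hypothetical)."""
    return build_search_url(ocr(bottom_strip_box(width, height)))
```

An agent would pass its own OCR routine as `ocr` and then follow the resulting query with whatever search tool it has available, instead of comparing hero portraits one by one.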
This task provides an image and requires identifying and distinguishing the objects specified in the instructions. The image contains many misleading elements, testing the agent’s ability to accurately recognize the image, perform preprocessing, and carry out deep reasoning.