feat: add benchmark support to PrunaDataModule and implement PartiPrompts #502
Conversation
Cursor Bugbot has reviewed your changes and found 3 potential issues.
begumcig
left a comment
Really nice David! I think we shouldn't tightly couple metrics with datasets under benchmark, but otherwise it all looks good to me! Thank you!
…nctions
- Fix category filter to handle list[str] in setup_parti_prompts_dataset
- Remove unused BenchmarkInfo and benchmark_info imports in test
- Add TODO comment for unused list_benchmarks and get_benchmark_info functions
Fixes comments from PR #502
- Fix category filter to handle list[str] in setup_*_dataset functions
- Remove unused BenchmarkInfo imports in test files
- Align with PR #502 fixes
Force-pushed from 1a4e4de to e0e96b9
simlang
left a comment
Super nice! I left only two suggestions; you can decide whether to accept them or not 🚀
Force-pushed from 3be835c to 7ebb4cd
- Remove shuffle from test-only prompt datasets (test sets should not be shuffled)
- Remove duplicate Benchmark/benchmark_info from data/__init__.py (already in evaluation/benchmarks)
- Use explicit if/else for category is None in setup_parti_prompts_dataset
- Add status_code check to ImgEdit for consistency with OneIG
- Put load_dataset on one line in _load_oneig_generic
- Derive alignment_cats from ONEIG_DATASET_CATEGORIES
Made-with: Cursor
begumcig
left a comment
Thank you so much David, really nice work on this! I like the use of Literal types for benchmark categories, and the BenchmarkRegistry with validations.
A few updates would be super cool before merge: it would be great to wire BenchmarkRegistry into Task/EvaluationAgent, and adding raise_for_status() in the dataset loaders would also make dataset loading more robust. I also added a few nitpicky fixes, but I'm leaving those up to you!
Overall though, this is really strong!! Once those integration and robustness bits are addressed, I think it’s ready to merge.
- Add PartiPrompts dataset with category/challenge subset filtering
- Simplify benchmark system: BenchmarkInfo, list_benchmarks, get_benchmark_info
- Extend base_datasets with 4-tuple (subsets), BENCHMARK_CATEGORY_CONFIG
- Add GenEval, LongTextBench, HPS benchmarks; ImgEdit, GEditBench with subsets
- Add get_literal_values_from_param for category discovery from Literal types
- Category filter tests via _category_filter_params and test_category_filter
Made-with: Cursor
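The get_literal_values_from_param helper mentioned in this commit could look roughly like the hedged sketch below; the real signature, location, and edge-case handling in the PR may differ, and the PartiPrompts categories shown are only an illustrative subset.

```python
import inspect
from typing import Literal, get_args, get_origin


def get_literal_values_from_param(func, param_name: str) -> tuple[str, ...]:
    """Return the values allowed by a parameter annotated with typing.Literal."""
    annotation = inspect.signature(func).parameters[param_name].annotation
    if get_origin(annotation) is Literal:
        return get_args(annotation)
    # Also handle Optional[Literal[...]] / unions by scanning the union members.
    return tuple(
        value
        for member in get_args(annotation)
        if get_origin(member) is Literal
        for value in get_args(member)
    )


# Hypothetical setup function with a small, illustrative subset of categories.
def setup_parti_prompts_dataset(
    category: Literal["Animals", "Artifacts", "World Knowledge"] | None = None,
):
    ...


print(get_literal_values_from_param(setup_parti_prompts_dataset, "category"))
# -> ('Animals', 'Artifacts', 'World Knowledge')
```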
Force-pushed from 398d340 to 40c6112
- Reorganized import statements in src/pruna/data/utils.py to follow PEP 8 guidelines, enhancing code clarity.
…sistency
- Removed unused ignore rule from ruff configuration in pyproject.toml.
- Refactored function signatures in utils.py for improved readability by removing unnecessary line breaks.
- Adjusted import order in task.py for consistency.
- Simplified test case formatting in test_datamodule.py and test_aesthetics_laion.py for better clarity.

…on_agent.py
- Added 'D420' to the ignore rules in pyproject.toml for ruff.
- Reformatted import statements in evaluation_agent.py for improved readability.
- Updated docstring in evaluation_agent.py to clarify dataloader arguments and corrected example section formatting.
- Cleared out unused metrics in BenchmarkRegistry to reflect human evaluation focus.

- Enhanced descriptions for various benchmarks to include specific references and evaluation methods.
- Updated metrics for benchmarks such as COCO, GenEval, HPS, ImgEdit, Long Text Bench, OneIG, and DPG to reflect accurate scoring models and methodologies as per recent papers.
- Clarified the absence of certain metrics in Pruna for better transparency.
simlang
left a comment
so only two comments, but they kinda imply a lot 👀
Review context (stratify_dataset signature):
def stratify_dataset(dataset: Dataset, sample_size: int, seed: int = 42) -> Dataset:
def stratify_dataset(
Since we now use this function in every test dataset, we have to make seeding (which leads to shuffling) optional; test sets should never be shuffled, as already mentioned in other comments.
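A minimal sketch of the optional-seed behaviour being requested here, assuming a Hugging Face datasets.Dataset; the PR's actual stratification across categories is more involved than this, and the parameter names are taken from the reviewed signature above.

```python
from datasets import Dataset


def stratify_dataset(dataset: Dataset, sample_size: int, seed: int | None = None) -> Dataset:
    """Subsample `dataset` down to `sample_size`, shuffling only when a seed is given."""
    if seed is not None:
        # Train/validation splits may pass a seed; test sets pass None and keep their order.
        dataset = dataset.shuffle(seed=seed)
    return dataset.select(range(min(sample_size, len(dataset))))
```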
Review context (EvaluationAgent docstring):
datamodule : PrunaDataModule, optional
    The dataloader to use for the evaluation. Required if task is not provided.
    The dataloader to use for the evaluation. Required if task/benchmark not provided.
benchmark : str, optional
I can't tell you exactly why, but I don't super-like adding benchmark (because now we have task, request and benchmark, which are all kinda the same but not really), tokenizer (a single use case for the EvaluationAgent) and dataloader_args (only for benchmark) - it feels a bit cluttered for some reason.
But to be very honest, I don't have very constructive feedback on how to improve this 🫠
Maybe one way would be to not pass the benchmark as a string but as a Task, which is created outside of the agent and carries all the information it needs - but this might require Benchmark inheriting the Task interface.
what do you think?
I removed benchmark from the init and created a separate classmethod. I feel this is something we'll refactor during the 1.0 release; as you already mentioned, there are a lot of single-use arguments that could have been enforced as part of the Task.
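A hedged sketch of what that separate classmethod could look like; the import paths, registry accessor, and constructor arguments below are assumptions rather than the PR's actual code.

```python
from pruna.data.pruna_datamodule import PrunaDataModule  # assumed import path
from pruna.evaluation.benchmarks import BenchmarkRegistry  # assumed import path


class EvaluationAgent:
    def __init__(self, request, datamodule=None, **kwargs):
        ...

    @classmethod
    def from_benchmark(cls, benchmark: str, **dataloader_args) -> "EvaluationAgent":
        # Resolve the benchmark here instead of adding a `benchmark` argument to __init__,
        # then delegate to the normal constructor so Task/request handling stays unchanged.
        datamodule = PrunaDataModule.from_string(benchmark, **dataloader_args)
        metrics = BenchmarkRegistry.get_benchmark_info(benchmark).metrics  # assumed accessor
        return cls(request=metrics, datamodule=datamodule)
```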
…ibility
- Changed the seed parameter to be optional, allowing for no shuffling when set to None.
- Updated docstring to clarify the behavior of the seed parameter and improved overall documentation formatting.
- Ensured that shuffling only occurs if a seed is provided, enhancing the function's usability.

…t setup functions
- Updated multiple dataset setup functions to remove the seed parameter from the stratify_dataset calls, enhancing consistency across the codebase.
- Ensured that the stratification process remains functional while simplifying the function signatures.

…tion_agent.py
- Consolidated import statements for improved readability and adherence to PEP 8 guidelines.
- Revised docstring to clarify the requirements for the 'request' and 'datamodule' parameters when not using a 'task', enhancing documentation clarity.

- Upgraded ruff pre-commit hook from v0.14.6 to v0.15.4 for improved linting capabilities.
- Enhanced the docstring in stratify_dataset to clarify return values and exceptions, improving documentation quality.

…n framework
- Added a new benchmarks.py file containing the Benchmark and BenchmarkRegistry classes to manage benchmark datasets and their metadata.
- Implemented methods for registering benchmarks, retrieving benchmark metadata, and listing available benchmarks by task type.
- Included detailed descriptions and references for various benchmarks, enhancing the evaluation framework's capabilities.
- Removed the old __init__.py file in the benchmarks directory as its content has been migrated to the new structure.

…ings in utils and evaluation_agent
- Downgraded ruff pre-commit hook from v0.15.4 to v0.14.6 for compatibility.
- Cleaned up docstrings in utils.py to remove unnecessary sections, enhancing clarity.
- Updated method names and docstrings in evaluation_agent.py to better reflect functionality and improve documentation quality.

…and data modules
- Updated PrunaDataModule to use a dictionary for sampling parameters, enhancing clarity and maintainability.
- Refactored EvaluationAgent's evaluate_benchmark method to from_benchmark, better reflecting its purpose, and updated the associated docstring for clarity.
- Simplified metric registration checks in BenchmarkRegistry by introducing a has_metric method, improving code readability.
- Removed unused base_datasets references in Task class, streamlining the codebase.

…stratify_dataset
- Updated the docstring in stratify_dataset to use consistent formatting for parameters and return values, improving clarity and readability.
- Clarified the exceptions raised by the function, ensuring better documentation quality.

…taset
- Enhanced the docstring in the stratify_dataset function for better readability and consistency.
- Updated the has_metric method's docstring in MetricRegistry to include detailed parameter and return descriptions, improving overall documentation quality.
simlang
left a comment
Thanks for addressing all the comments! Looks amazing!
One little docstring update, and then the question of whether we should add some info/warn that seeding won't do anything here.
Review context (setup_geneval_dataset signature):
def setup_geneval_dataset(
    seed: int,
I feel like I'm being super annoying about this seed - since we stopped using the seed now, do you think it makes sense to info/warn the user that a non-None seed won't change anything (and, while writing this, we should make the parameter optional as well)?
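One possible shape for that info/warn, sketched under the assumption of a plain module-level logger and a simplified setup signature; the real loader does more than shown here.

```python
import logging

logger = logging.getLogger(__name__)


def setup_geneval_dataset(seed: int | None = None, **dataset_kwargs):
    # Test-only benchmark: never shuffled, so a provided seed is a no-op worth flagging.
    if seed is not None:
        logger.warning("GenEval is a test-only benchmark; the provided seed has no effect.")
    ...  # load and return the test-only prompt dataset
```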
Suggested change:
Before:
>>> agent = EvaluationAgent.from_benchmark("Parti Prompts", model)
>>> agent = EvaluationAgent.from_benchmark("HPS", model, category="anime", fraction=0.1)
After:
>>> agent = EvaluationAgent.from_benchmark("Parti Prompts")
>>> agent = EvaluationAgent.from_benchmark("HPS", category="anime", fraction=0.1)
Two related fixes:
1. _benchmark_category_smoke() now picks "Anime_Stylization" for OneIG instead
of the alphabetically first literal ("3d rendering"). Fine-grained art styles
such as "3d rendering" can have fewer than 4 samples, which made the nightly CI
fail deterministically when assert len(prompts) == 4 was evaluated.
2. test_benchmark_category_filter: relax `assert len(prompts) == 4` to
`assert 1 <= len(prompts) <= 4` so category-filtered datasets with a small
number of samples do not fail the test.
Fixes the nightly failures introduced in #502 (3f01339).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Summary
Adds a benchmark system to PrunaDataModule and implements PartiPrompts plus 7 additional benchmarks for text-to-image and image editing evaluation.
Benchmarks Included
Changes
- `BenchmarkInfo` dataclass, `benchmark_info` registry, `BENCHMARK_CATEGORY_CONFIG` for category/subset filtering
- `_prepare_test_only_prompt_dataset` for test-only prompt datasets
- `list_benchmarks()` and `get_benchmark_info()` for filtering by task type
- `category` or `subset` selection via `PrunaDataModule.from_string(name, category=...)`
- `prompt_with_auxiliaries_collate` preserves metadata (Category, Challenge, questions, text_content, etc.) for evaluation
Usage
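A hedged usage sketch built from the entry points listed above; the import paths, keyword arguments, and the smashed model variable are assumptions.

```python
from pruna.data.pruna_datamodule import PrunaDataModule  # assumed import path
from pruna.evaluation.evaluation_agent import EvaluationAgent  # assumed import path

# Load a benchmark dataset directly, optionally filtered by category.
datamodule = PrunaDataModule.from_string("Parti Prompts", category="Animals")

# Or let the agent wire up the benchmark's datamodule and metrics in one call.
agent = EvaluationAgent.from_benchmark("HPS", category="anime", fraction=0.1)
results = agent.evaluate(smashed_model)  # `smashed_model` is assumed to be defined elsewhere
```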
Merged PRs
Test Plan
- `test_benchmark_category_filter` – category filtering for all benchmarks
- `test_dm_from_string` – loading each dataset via `from_string`
- `test_long_text_bench_auxiliaries`, `test_oneig_text_rendering_auxiliaries` – auxiliaries preserved
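For reference, a hedged sketch of what the from_string smoke test could look like; the parametrized benchmark names and the dataloader access are assumptions about the actual test suite.

```python
import pytest

from pruna.data.pruna_datamodule import PrunaDataModule  # assumed import path


@pytest.mark.parametrize("dataset_name", ["Parti Prompts", "GenEval", "HPS"])
def test_dm_from_string(dataset_name: str):
    # Smoke test: the benchmark can be loaded by name and yields at least one test batch.
    dm = PrunaDataModule.from_string(dataset_name)
    batch = next(iter(dm.test_dataloader()))
    assert batch is not None
```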