
feat: add benchmark support to PrunaDataModule and implement PartiPrompts #502

Merged

davidberenstein1957 merged 14 commits into main from feat/add-partiprompts-benchmark-to-pruna on Mar 19, 2026

Conversation

davidberenstein1957 (Member) commented on Jan 22, 2026

Summary

Adds a benchmark system to PrunaDataModule and implements PartiPrompts plus seven additional benchmarks for text-to-image and image-editing evaluation.

Benchmarks Included

| Benchmark | Task type | Description |
| --- | --- | --- |
| PartiPrompts | text_to_image | 1,600+ prompts across 12 categories and 11 challenge aspects |
| GenEval | text_to_image | Fine-grained compositional evaluation (object co-occurrence, positioning, counting, color binding) |
| HPS | text_to_image | Human preference score (HPDv2) with 4 categories: anime, concept-art, paintings, photo |
| LongTextBench | text_to_image | Long, detail-rich prompts (~285 tokens) for compositional reasoning |
| OneIG | text_to_image | Text rendering and image-text alignment (anime, portrait, object) |
| DPG | text_to_image | Descriptive prompt generation (entity, attribute, relation, global, other) |
| ImgEdit | image_edit | 8 edit types: replace, add, remove, adjust, extract, style, background, compose |
| GEditBench | image_edit | 11 task types for fine-grained editing (background_change, color_alter, etc.) |

Changes

  • Benchmark system: BenchmarkInfo dataclass, benchmark_info registry, BENCHMARK_CATEGORY_CONFIG for category/subset filtering (a hedged sketch of these pieces follows this list)
  • Shared helper: _prepare_test_only_prompt_dataset for test-only prompt datasets
  • Discovery: list_benchmarks() and get_benchmark_info() for filtering by task type
  • Category filtering: All benchmarks support category or subset via PrunaDataModule.from_string(name, category=...)
  • Auxiliaries: prompt_with_auxiliaries_collate preserves metadata (Category, Challenge, questions, text_content, etc.) for evaluation
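
For illustration, a minimal sketch of how these registry pieces could fit together. All field names, the config shape, and the example categories below are assumptions for the sketch, not the PR's actual definitions:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkInfo:
    # Hypothetical fields -- the PR's actual dataclass may differ.
    name: str
    task_type: str                    # e.g. "text_to_image" or "image_edit"
    categories: tuple[str, ...] = ()  # valid values for the `category` filter
    description: str = ""


# Registry keyed by benchmark name; discovery helpers walk this dict.
benchmark_info: dict[str, BenchmarkInfo] = {
    "PartiPrompts": BenchmarkInfo(
        name="PartiPrompts",
        task_type="text_to_image",
        categories=("Animals", "Artifacts", "World Knowledge"),  # illustrative subset
        description="1,600+ prompts across 12 categories and 11 challenge aspects",
    ),
}


def list_benchmarks(task_type: str | None = None) -> list[str]:
    """Return registered benchmark names, optionally filtered by task type."""
    return [
        name
        for name, info in benchmark_info.items()
        if task_type is None or info.task_type == task_type
    ]
```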

Usage

```python
from pruna.data import PrunaDataModule

# Load PartiPrompts with a category filter
dm = PrunaDataModule.from_string("PartiPrompts", category="Animals")

# Load HPS with the anime category
dm = PrunaDataModule.from_string("HPS", category="anime")

# Load ImgEdit with a specific edit type
dm = PrunaDataModule.from_string("ImgEdit", category="replace")

# List all text-to-image benchmarks
from pruna.data import list_benchmarks

list_benchmarks(task_type="text_to_image")
```

Merged PRs

Test Plan

  • test_benchmark_category_filter – category filtering for all benchmarks (a hedged test sketch follows this list)
  • test_dm_from_string – loading each dataset via from_string
  • test_long_text_bench_auxiliaries, test_oneig_text_rendering_auxiliaries – auxiliaries preserved
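
For illustration, a hedged sketch of what the category-filter test could look like; the benchmark/category pairs and the dataloader access are assumptions, not the PR's actual test code:

```python
import pytest

from pruna.data import PrunaDataModule


# Hypothetical benchmark/category pairs -- the PR derives these from
# _category_filter_params rather than hard-coding them.
@pytest.mark.parametrize(
    "benchmark, category",
    [
        ("PartiPrompts", "Animals"),
        ("HPS", "anime"),
        ("ImgEdit", "replace"),
    ],
)
def test_benchmark_category_filter(benchmark: str, category: str) -> None:
    dm = PrunaDataModule.from_string(benchmark, category=category)
    batch = next(iter(dm.test_dataloader()))
    # The filtered datamodule should still yield at least one batch.
    assert batch is not None
```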

davidberenstein1957 marked this pull request as ready for review on February 2, 2026 04:24
cursor (Bot) left a comment

Cursor Bugbot has reviewed your changes and found 3 potential issues.

Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable Autofix in the Cursor dashboard.

Comment @cursor review or bugbot run to trigger another review on this PR

begumcig (Member) left a comment

Really nice David! I think we shouldn't tightly couple metrics with datasets under benchmark, but otherwise it all looks good to me! Thank you!

davidberenstein1957 added a commit that referenced this pull request Feb 20, 2026
…nctions

- Fix category filter to handle list[str] in setup_parti_prompts_dataset
- Remove unused BenchmarkInfo and benchmark_info imports in test
- Add TODO comment for unused list_benchmarks and get_benchmark_info functions

Fixes comments from PR #502
davidberenstein1957 added a commit that referenced this pull request Feb 20, 2026
- Fix category filter to handle list[str] in setup_*_dataset functions
- Remove unused BenchmarkInfo imports in test files
- Align with PR #502 fixes
davidberenstein1957 added a commit that referenced this pull request Feb 26, 2026
…nctions

- Fix category filter to handle list[str] in setup_parti_prompts_dataset
- Remove unused BenchmarkInfo and benchmark_info imports in test
- Add TODO comment for unused list_benchmarks and get_benchmark_info functions

Fixes comments from PR #502
davidberenstein1957 force-pushed the feat/add-partiprompts-benchmark-to-pruna branch from 1a4e4de to e0e96b9 on February 26, 2026 10:18
simlang (Member) left a comment

Super nice! Left only two suggestions you can decide if you want to accept them or not 🚀

davidberenstein1957 added a commit that referenced this pull request Feb 27, 2026
…nctions

- Fix category filter to handle list[str] in setup_parti_prompts_dataset
- Remove unused BenchmarkInfo and benchmark_info imports in test
- Add TODO comment for unused list_benchmarks and get_benchmark_info functions

Fixes comments from PR #502
davidberenstein1957 force-pushed the feat/add-partiprompts-benchmark-to-pruna branch from 3be835c to 7ebb4cd on February 27, 2026 10:26
davidberenstein1957 added a commit that referenced this pull request Feb 27, 2026
- Remove shuffle from test-only prompt datasets (test sets should not be shuffled)
- Remove duplicate Benchmark/benchmark_info from data/__init__.py (already in evaluation/benchmarks)
- Use explicit if/else for category is None in setup_parti_prompts_dataset
- Add status_code check to ImgEdit for consistency with OneIG
- Put load_dataset on one line in _load_oneig_generic
- Derive alignment_cats from ONEIG_DATASET_CATEGORIES

Made-with: Cursor
begumcig (Member) left a comment

Thank you so much David, really nice work on this! I like the use of Literal types for benchmark categories, and the BenchmarkRegistry with validations.

A few updates would be super cool before merge: it would be great to wire BenchmarkRegistry into Task/EvaluationAgent, and adding raise_for_status() in the dataset loaders would make loading more robust. I also added a few nitpicky fixes, but I'm leaving them up to you!

Overall though, this is really strong!! Once those integration and robustness bits are addressed, I think it’s ready to merge.
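
For illustration, a hedged sketch of the kind of robustness check being suggested; the helper name and the requests usage are assumptions, since the actual loader code is not shown in this thread:

```python
import requests


def _download_prompt_file(url: str) -> str:
    """Fetch a benchmark prompt file, failing fast on HTTP errors."""
    response = requests.get(url, timeout=30)
    # Without this, an HTTP error page could be parsed as if it were data.
    response.raise_for_status()
    return response.text
```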

- Add PartiPrompts dataset with category/challenge subset filtering
- Simplify benchmark system: BenchmarkInfo, list_benchmarks, get_benchmark_info
- Extend base_datasets with 4-tuple (subsets), BENCHMARK_CATEGORY_CONFIG
- Add GenEval, LongTextBench, HPS benchmarks; ImgEdit, GEditBench with subsets
- Add get_literal_values_from_param for category discovery from Literal types
- Category filter tests via _category_filter_params and test_category_filter

Made-with: Cursor
davidberenstein1957 force-pushed the feat/add-partiprompts-benchmark-to-pruna branch from 398d340 to 40c6112 on March 5, 2026 09:41
- Reorganized import statements in src/pruna/data/utils.py to follow PEP 8 guidelines, enhancing code clarity.
…sistency

- Removed unused ignore rule from ruff configuration in pyproject.toml.
- Refactored function signatures in utils.py for improved readability by removing unnecessary line breaks.
- Adjusted import order in task.py for consistency.
- Simplified test case formatting in test_datamodule.py and test_aesthetics_laion.py for better clarity.
…on_agent.py

- Added 'D420' to the ignore rules in pyproject.toml for ruff.
- Reformatted import statements in evaluation_agent.py for improved readability.
- Updated docstring in evaluation_agent.py to clarify dataloader arguments and corrected example section formatting.
- Cleared out unused metrics in BenchmarkRegistry to reflect human evaluation focus.
- Enhanced descriptions for various benchmarks to include specific references and evaluation methods.
- Updated metrics for benchmarks such as COCO, GenEval, HPS, ImgEdit, Long Text Bench, OneIG, and DPG to reflect accurate scoring models and methodologies as per recent papers.
- Clarified the absence of certain metrics in Pruna for better transparency.
simlang (Member) left a comment

so only two comments, but they kinda imply a lot 👀

Comment thread: src/pruna/data/utils.py

```diff
-def stratify_dataset(dataset: Dataset, sample_size: int, seed: int = 42) -> Dataset:
+def stratify_dataset(
```

Since we use this function now in every test dataset, we have to make seeding (which leads to shuffling) optional: test sets should never be shuffled, as already mentioned in other comments.
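
For illustration, a minimal sketch of the optional-seed variant being requested here; the signature follows the diff above, but the stratification logic itself is elided to a plain head-sample:

```python
from datasets import Dataset


def stratify_dataset(dataset: Dataset, sample_size: int, seed: int | None = None) -> Dataset:
    """Sample `sample_size` rows; shuffle only when a seed is provided.

    Sketch only -- the real helper also stratifies across categories.
    """
    if seed is not None:
        # Test datasets pass seed=None so their order is never perturbed.
        dataset = dataset.shuffle(seed=seed)
    return dataset.select(range(min(sample_size, len(dataset))))
```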

```diff
 datamodule : PrunaDataModule, optional
-    The dataloader to use for the evaluation. Required if task is not provided.
+    The dataloader to use for the evaluation. Required if task/benchmark not provided.
+benchmark : str, optional
```

I can't tell you exactly why, but I don't super-like adding benchmark (because now we have task, request, and benchmark, which are all kinda the same but not really), tokenizer (a single use-case in the EvaluationAgent), and dataloader_args (only for benchmark) - it feels a bit cluttered for some reason. But to be very honest, I don't have very constructive feedback on how to improve this 🫠

Maybe one way would be not to pass the benchmark as a string, but as a Task, which is created outside of the agent and arrives with all the information it needs - but this might require Benchmark inheriting the Task interface.

What do you think?

davidberenstein1957 (Member, Author) replied:

I removed benchmark from the init and created a separate classmethod. I feel this is something we'll refactor during the 1.0 release; as you already mentioned, there are a lot of single-use arguments that could have been enforced as part of the Task.
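
For illustration, a rough sketch of the classmethod pattern described here; the constructor signature and internals are assumptions, and only the from_benchmark name and PrunaDataModule.from_string come from this PR:

```python
from pruna.data import PrunaDataModule


class EvaluationAgent:
    # Hypothetical constructor -- the real EvaluationAgent takes more arguments.
    def __init__(self, datamodule: PrunaDataModule | None = None) -> None:
        self.datamodule = datamodule

    @classmethod
    def from_benchmark(cls, benchmark: str, **dataloader_args) -> "EvaluationAgent":
        """Build an agent from a registered benchmark name.

        Sketch of the factory discussed above; the real method would also
        resolve metrics and validation via the BenchmarkRegistry.
        """
        dm = PrunaDataModule.from_string(benchmark, **dataloader_args)
        return cls(datamodule=dm)
```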

…ibility

- Changed the seed parameter to be optional, allowing for no shuffling when set to None.
- Updated docstring to clarify the behavior of the seed parameter and improved overall documentation formatting.
- Ensured that shuffling only occurs if a seed is provided, enhancing the function's usability.
…t setup functions

- Updated multiple dataset setup functions to remove the seed parameter from the stratify_dataset calls, enhancing consistency across the codebase.
- Ensured that the stratification process remains functional while simplifying the function signatures.
…tion_agent.py

- Consolidated import statements for improved readability and adherence to PEP 8 guidelines.
- Revised docstring to clarify the requirements for the 'request' and 'datamodule' parameters when not using a 'task', enhancing documentation clarity.
- Upgraded ruff pre-commit hook from v0.14.6 to v0.15.4 for improved linting capabilities.
- Enhanced the docstring in stratify_dataset to clarify return values and exceptions, improving documentation quality.
…n framework

- Added a new benchmarks.py file containing the Benchmark and BenchmarkRegistry classes to manage benchmark datasets and their metadata.
- Implemented methods for registering benchmarks, retrieving benchmark metadata, and listing available benchmarks by task type.
- Included detailed descriptions and references for various benchmarks, enhancing the evaluation framework's capabilities.
- Removed the old __init__.py file in the benchmarks directory as its content has been migrated to the new structure.
…ings in utils and evaluation_agent

- Downgraded ruff pre-commit hook from v0.15.4 to v0.14.6 for compatibility.
- Cleaned up docstrings in utils.py to remove unnecessary sections, enhancing clarity.
- Updated method names and docstrings in evaluation_agent.py to better reflect functionality and improve documentation quality.
…and data modules

- Updated PrunaDataModule to use a dictionary for sampling parameters, enhancing clarity and maintainability.
- Refactored EvaluationAgent's evaluate_benchmark method to from_benchmark, better reflecting its purpose and updated the associated docstring for clarity.
- Simplified metric registration checks in BenchmarkRegistry by introducing a has_metric method, improving code readability.
- Removed unused base_datasets references in Task class, streamlining the codebase.
…stratify_dataset

- Updated the docstring in stratify_dataset to use consistent formatting for parameters and return values, improving clarity and readability.
- Clarified the exceptions raised by the function, ensuring better documentation quality.
…taset

- Enhanced the docstring in the stratify_dataset function for better readability and consistency.
- Updated the has_metric method's docstring in MetricRegistry to include detailed parameter and return descriptions, improving overall documentation quality.
simlang (Member) left a comment

Thanks for addressing all the comments! Looks amazing!
One little docstring update, and then the question of whether we should add some info/warning that seeding won't do anything here.



```python
def setup_geneval_dataset(
    seed: int,
```
I feel like I'm being super annoying about this seed, but since we stopped using the seed now, do you think it makes sense to info/warn the user that a non-None seed won't change anything? (And while writing this: we should make the parameter optional as well.)

Comment on lines +115 to +116

Suggested change:

```diff
->>> agent = EvaluationAgent.from_benchmark("Parti Prompts", model)
->>> agent = EvaluationAgent.from_benchmark("HPS", model, category="anime", fraction=0.1)
+>>> agent = EvaluationAgent.from_benchmark("Parti Prompts")
+>>> agent = EvaluationAgent.from_benchmark("HPS", category="anime", fraction=0.1)
```

davidberenstein1957 merged commit 3f01339 into main on Mar 19, 2026
7 checks passed
davidberenstein1957 added a commit that referenced this pull request Apr 14, 2026
Two related fixes:

1. _benchmark_category_smoke() now picks "Anime_Stylization" for OneIG instead
   of the alphabetically first literal ("3d rendering"). Fine-grained art styles
   such as "3d rendering" can have fewer than 4 samples, which made the nightly CI
   fail deterministically when assert len(prompts) == 4 was evaluated.

2. test_benchmark_category_filter: relax `assert len(prompts) == 4` to
   `assert 1 <= len(prompts) <= 4` so category-filtered datasets with a small
   number of samples do not fail the test.

Fixes the nightly failures introduced in #502 (3f01339).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
