
feat: add benchmark support to PrunaDataModule and implement PartiPrompts #502

Merged

davidberenstein1957 merged 14 commits into main from feat/add-partiprompts-benchmark-to-pruna on Mar 19, 2026

Conversation

davidberenstein1957 (Member) commented on Jan 22, 2026

Summary

Adds a benchmark system to PrunaDataModule and implements PartiPrompts plus seven additional benchmarks for text-to-image and image-editing evaluation.

Benchmarks Included

| Benchmark | Task type | Description |
| --- | --- | --- |
| PartiPrompts | text_to_image | 1,600+ prompts across 12 categories and 11 challenge aspects |
| GenEval | text_to_image | Fine-grained compositional evaluation (object co-occurrence, positioning, counting, color binding) |
| HPS | text_to_image | Human preference score (HPDv2) with 4 categories: anime, concept-art, paintings, photo |
| LongTextBench | text_to_image | Long, detail-rich prompts (~285 tokens) for compositional reasoning |
| OneIG | text_to_image | Text rendering and image-text alignment (anime, portrait, object) |
| DPG | text_to_image | Descriptive prompt generation (entity, attribute, relation, global, other) |
| ImgEdit | image_edit | 8 edit types: replace, add, remove, adjust, extract, style, background, compose |
| GEditBench | image_edit | 11 task types for fine-grained editing (background_change, color_alter, etc.) |

Changes

  • Benchmark system: BenchmarkInfo dataclass, benchmark_info registry, BENCHMARK_CATEGORY_CONFIG for category/subset filtering (a hedged sketch of these pieces follows this list)
  • Shared helper: _prepare_test_only_prompt_dataset for test-only prompt datasets
  • Discovery: list_benchmarks() and get_benchmark_info() for filtering by task type
  • Category filtering: All benchmarks support category or subset via PrunaDataModule.from_string(name, category=...)
  • Auxiliaries: prompt_with_auxiliaries_collate preserves metadata (Category, Challenge, questions, text_content, etc.) for evaluation
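
For illustration, a minimal sketch of how these registry pieces could fit together. All field names, the config shape, and the example categories below are assumptions for the sketch, not the PR's actual definitions:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkInfo:
    # Hypothetical fields -- the PR's actual dataclass may differ.
    name: str
    task_type: str                    # e.g. "text_to_image" or "image_edit"
    categories: tuple[str, ...] = ()  # valid values for the `category` filter
    description: str = ""


# Registry keyed by benchmark name; discovery helpers walk this dict.
benchmark_info: dict[str, BenchmarkInfo] = {
    "PartiPrompts": BenchmarkInfo(
        name="PartiPrompts",
        task_type="text_to_image",
        categories=("Animals", "Artifacts", "World Knowledge"),  # illustrative subset
        description="1,600+ prompts across 12 categories and 11 challenge aspects",
    ),
}


def list_benchmarks(task_type: str | None = None) -> list[str]:
    """Return registered benchmark names, optionally filtered by task type."""
    return [
        name
        for name, info in benchmark_info.items()
        if task_type is None or info.task_type == task_type
    ]
```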

Usage

```python
from pruna.data import PrunaDataModule

# Load PartiPrompts with a category filter
dm = PrunaDataModule.from_string("PartiPrompts", category="Animals")

# Load HPS with the anime category
dm = PrunaDataModule.from_string("HPS", category="anime")

# Load ImgEdit with a specific edit type
dm = PrunaDataModule.from_string("ImgEdit", category="replace")

# List all text-to-image benchmarks
from pruna.data import list_benchmarks

list_benchmarks(task_type="text_to_image")
```

Merged PRs

Test Plan

  • test_benchmark_category_filter – category filtering for all benchmarks (a hedged test sketch follows this list)
  • test_dm_from_string – loading each dataset via from_string
  • test_long_text_bench_auxiliaries, test_oneig_text_rendering_auxiliaries – auxiliaries preserved
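
For illustration, a hedged sketch of what the category-filter test could look like; the benchmark/category pairs and the dataloader access are assumptions, not the PR's actual test code:

```python
import pytest

from pruna.data import PrunaDataModule


# Hypothetical benchmark/category pairs -- the PR derives these from
# _category_filter_params rather than hard-coding them.
@pytest.mark.parametrize(
    "benchmark, category",
    [
        ("PartiPrompts", "Animals"),
        ("HPS", "anime"),
        ("ImgEdit", "replace"),
    ],
)
def test_benchmark_category_filter(benchmark: str, category: str) -> None:
    dm = PrunaDataModule.from_string(benchmark, category=category)
    batch = next(iter(dm.test_dataloader()))
    # The filtered datamodule should still yield at least one batch.
    assert batch is not None
```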

davidberenstein1957 marked this pull request as ready for review on February 2, 2026 04:24
cursor (Bot) left a comment

Cursor Bugbot has reviewed your changes and found 3 potential issues.

Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable Autofix in the Cursor dashboard.

Comment @cursor review or bugbot run to trigger another review on this PR

begumcig (Member) left a comment

Really nice David! I think we shouldn't tightly couple metrics with datasets under benchmark, but otherwise it all looks good to me! Thank you!

davidberenstein1957 added a commit that referenced this pull request Feb 20, 2026
…nctions

- Fix category filter to handle list[str] in setup_parti_prompts_dataset
- Remove unused BenchmarkInfo and benchmark_info imports in test
- Add TODO comment for unused list_benchmarks and get_benchmark_info functions

Fixes comments from PR #502
davidberenstein1957 added a commit that referenced this pull request Feb 20, 2026
- Fix category filter to handle list[str] in setup_*_dataset functions
- Remove unused BenchmarkInfo imports in test files
- Align with PR #502 fixes
davidberenstein1957 added a commit that referenced this pull request Feb 26, 2026
…nctions

- Fix category filter to handle list[str] in setup_parti_prompts_dataset
- Remove unused BenchmarkInfo and benchmark_info imports in test
- Add TODO comment for unused list_benchmarks and get_benchmark_info functions

Fixes comments from PR #502
davidberenstein1957 force-pushed the feat/add-partiprompts-benchmark-to-pruna branch from 1a4e4de to e0e96b9 on February 26, 2026 10:18
simlang (Member) left a comment

Super nice! Left only two suggestions you can decide if you want to accept them or not 🚀

davidberenstein1957 added a commit that referenced this pull request Feb 27, 2026
…nctions

- Fix category filter to handle list[str] in setup_parti_prompts_dataset
- Remove unused BenchmarkInfo and benchmark_info imports in test
- Add TODO comment for unused list_benchmarks and get_benchmark_info functions

Fixes comments from PR #502
davidberenstein1957 force-pushed the feat/add-partiprompts-benchmark-to-pruna branch from 3be835c to 7ebb4cd on February 27, 2026 10:26
davidberenstein1957 added a commit that referenced this pull request Feb 27, 2026
- Remove shuffle from test-only prompt datasets (test sets should not be shuffled)
- Remove duplicate Benchmark/benchmark_info from data/__init__.py (already in evaluation/benchmarks)
- Use explicit if/else for category is None in setup_parti_prompts_dataset
- Add status_code check to ImgEdit for consistency with OneIG
- Put load_dataset on one line in _load_oneig_generic
- Derive alignment_cats from ONEIG_DATASET_CATEGORIES

Made-with: Cursor
begumcig (Member) left a comment

Thank you so much David, really nice work on this! I like the use of Literal types for benchmark categories, and the BenchmarkRegistry with validations.

A few updates would be super cool before merge: it would be great to wire BenchmarkRegistry into Task/EvaluationAgent, and adding raise_for_status() in the dataset loaders would make loading more robust. I also added a few nitpicky fixes, but I'm leaving them up to you!

Overall though, this is really strong!! Once those integration and robustness bits are addressed, I think it’s ready to merge.
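
For illustration, a hedged sketch of the kind of robustness check being suggested; the helper name and the requests usage are assumptions, since the actual loader code is not shown in this thread:

```python
import requests


def _download_prompt_file(url: str) -> str:
    """Fetch a benchmark prompt file, failing fast on HTTP errors."""
    response = requests.get(url, timeout=30)
    # Without this, an HTTP error page could be parsed as if it were data.
    response.raise_for_status()
    return response.text
```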

- Add PartiPrompts dataset with category/challenge subset filtering
- Simplify benchmark system: BenchmarkInfo, list_benchmarks, get_benchmark_info
- Extend base_datasets with 4-tuple (subsets), BENCHMARK_CATEGORY_CONFIG
- Add GenEval, LongTextBench, HPS benchmarks; ImgEdit, GEditBench with subsets
- Add get_literal_values_from_param for category discovery from Literal types
- Category filter tests via _category_filter_params and test_category_filter

Made-with: Cursor
davidberenstein1957 force-pushed the feat/add-partiprompts-benchmark-to-pruna branch from 398d340 to 40c6112 on March 5, 2026 09:41
- Reorganized import statements in src/pruna/data/utils.py to follow PEP 8 guidelines, enhancing code clarity.
…sistency

- Removed unused ignore rule from ruff configuration in pyproject.toml.
- Refactored function signatures in utils.py for improved readability by removing unnecessary line breaks.
- Adjusted import order in task.py for consistency.
- Simplified test case formatting in test_datamodule.py and test_aesthetics_laion.py for better clarity.
…on_agent.py

- Added 'D420' to the ignore rules in pyproject.toml for ruff.
- Reformatted import statements in evaluation_agent.py for improved readability.
- Updated docstring in evaluation_agent.py to clarify dataloader arguments and corrected example section formatting.
- Cleared out unused metrics in BenchmarkRegistry to reflect human evaluation focus.
- Enhanced descriptions for various benchmarks to include specific references and evaluation methods.
- Updated metrics for benchmarks such as COCO, GenEval, HPS, ImgEdit, Long Text Bench, OneIG, and DPG to reflect accurate scoring models and methodologies as per recent papers.
- Clarified the absence of certain metrics in Pruna for better transparency.
simlang (Member) left a comment

so only two comments, but they kinda imply a lot 👀

Comment thread: src/pruna/data/utils.py

```diff
-def stratify_dataset(dataset: Dataset, sample_size: int, seed: int = 42) -> Dataset:
+def stratify_dataset(
```

Since we use this function now in every test dataset, we have to make seeding (which leads to shuffling) optional: test sets should never be shuffled, as already mentioned in other comments.
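
For illustration, a minimal sketch of the optional-seed variant being requested here; the signature follows the diff above, but the stratification logic itself is elided to a plain head-sample:

```python
from datasets import Dataset


def stratify_dataset(dataset: Dataset, sample_size: int, seed: int | None = None) -> Dataset:
    """Sample `sample_size` rows; shuffle only when a seed is provided.

    Sketch only -- the real helper also stratifies across categories.
    """
    if seed is not None:
        # Test datasets pass seed=None so their order is never perturbed.
        dataset = dataset.shuffle(seed=seed)
    return dataset.select(range(min(sample_size, len(dataset))))
```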

```diff
 datamodule : PrunaDataModule, optional
-    The dataloader to use for the evaluation. Required if task is not provided.
+    The dataloader to use for the evaluation. Required if task/benchmark not provided.
+benchmark : str, optional
```

I can't tell you exactly why, but I don't super-like adding benchmark (because now we have task, request, and benchmark, which are all kinda the same but not really), tokenizer (a single use-case in the EvaluationAgent), and dataloader_args (only for benchmark) - it feels a bit cluttered for some reason. But to be very honest, I don't have very constructive feedback on how to improve this 🫠

Maybe one way would be not to pass the benchmark as a string, but as a Task, which is created outside of the agent and arrives with all the information it needs - but this might require Benchmark inheriting the Task interface.

What do you think?

davidberenstein1957 (Member, Author) replied:

I removed benchmark from the init and created a separate classmethod. I feel this is something we'll refactor during the 1.0 release; as you already mentioned, there are a lot of single-use arguments that could have been enforced as part of the Task.
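
For illustration, a rough sketch of the classmethod pattern described here; the constructor signature and internals are assumptions, and only the from_benchmark name and PrunaDataModule.from_string come from this PR:

```python
from pruna.data import PrunaDataModule


class EvaluationAgent:
    # Hypothetical constructor -- the real EvaluationAgent takes more arguments.
    def __init__(self, datamodule: PrunaDataModule | None = None) -> None:
        self.datamodule = datamodule

    @classmethod
    def from_benchmark(cls, benchmark: str, **dataloader_args) -> "EvaluationAgent":
        """Build an agent from a registered benchmark name.

        Sketch of the factory discussed above; the real method would also
        resolve metrics and validation via the BenchmarkRegistry.
        """
        dm = PrunaDataModule.from_string(benchmark, **dataloader_args)
        return cls(datamodule=dm)
```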

…ibility

- Changed the seed parameter to be optional, allowing for no shuffling when set to None.
- Updated docstring to clarify the behavior of the seed parameter and improved overall documentation formatting.
- Ensured that shuffling only occurs if a seed is provided, enhancing the function's usability.
…t setup functions

- Updated multiple dataset setup functions to remove the seed parameter from the stratify_dataset calls, enhancing consistency across the codebase.
- Ensured that the stratification process remains functional while simplifying the function signatures.
…tion_agent.py

- Consolidated import statements for improved readability and adherence to PEP 8 guidelines.
- Revised docstring to clarify the requirements for the 'request' and 'datamodule' parameters when not using a 'task', enhancing documentation clarity.
- Upgraded ruff pre-commit hook from v0.14.6 to v0.15.4 for improved linting capabilities.
- Enhanced the docstring in stratify_dataset to clarify return values and exceptions, improving documentation quality.
…n framework

- Added a new benchmarks.py file containing the Benchmark and BenchmarkRegistry classes to manage benchmark datasets and their metadata.
- Implemented methods for registering benchmarks, retrieving benchmark metadata, and listing available benchmarks by task type.
- Included detailed descriptions and references for various benchmarks, enhancing the evaluation framework's capabilities.
- Removed the old __init__.py file in the benchmarks directory as its content has been migrated to the new structure.
…ings in utils and evaluation_agent

- Downgraded ruff pre-commit hook from v0.15.4 to v0.14.6 for compatibility.
- Cleaned up docstrings in utils.py to remove unnecessary sections, enhancing clarity.
- Updated method names and docstrings in evaluation_agent.py to better reflect functionality and improve documentation quality.
…and data modules

- Updated PrunaDataModule to use a dictionary for sampling parameters, enhancing clarity and maintainability.
- Refactored EvaluationAgent's evaluate_benchmark method to from_benchmark, better reflecting its purpose and updated the associated docstring for clarity.
- Simplified metric registration checks in BenchmarkRegistry by introducing a has_metric method, improving code readability.
- Removed unused base_datasets references in Task class, streamlining the codebase.
…stratify_dataset

- Updated the docstring in stratify_dataset to use consistent formatting for parameters and return values, improving clarity and readability.
- Clarified the exceptions raised by the function, ensuring better documentation quality.
…taset

- Enhanced the docstring in the stratify_dataset function for better readability and consistency.
- Updated the has_metric method's docstring in MetricRegistry to include detailed parameter and return descriptions, improving overall documentation quality.
simlang (Member) left a comment

Thanks for addressing all the comments! Looks amazing!
One little docstring update, and then the question of whether we should add some info/warning that seeding won't do anything here.



```python
def setup_geneval_dataset(
    seed: int,
```
I feel like I'm being super annoying about this seed, but since we stopped using the seed now, do you think it makes sense to info/warn the user that a non-None seed won't change anything? (And while writing this: we should make the parameter optional as well.)

Comment on lines +115 to +116

Suggested change:

```diff
->>> agent = EvaluationAgent.from_benchmark("Parti Prompts", model)
->>> agent = EvaluationAgent.from_benchmark("HPS", model, category="anime", fraction=0.1)
+>>> agent = EvaluationAgent.from_benchmark("Parti Prompts")
+>>> agent = EvaluationAgent.from_benchmark("HPS", category="anime", fraction=0.1)
```

davidberenstein1957 merged commit 3f01339 into main on Mar 19, 2026
7 checks passed
davidberenstein1957 added a commit that referenced this pull request Apr 14, 2026
Two related fixes:

1. _benchmark_category_smoke() now picks "Anime_Stylization" for OneIG instead
   of the alphabetically first literal ("3d rendering"). Fine-grained art styles
   such as "3d rendering" can have fewer than 4 samples, which made the nightly CI
   fail deterministically when assert len(prompts) == 4 was evaluated.

2. test_benchmark_category_filter: relax `assert len(prompts) == 4` to
   `assert 1 <= len(prompts) <= 4` so category-filtered datasets with a small
   number of samples do not fail the test.

Fixes the nightly failures introduced in #502 (3f01339).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
