FEAT Add BeaverTails dataset loader by romanlutz · Pull Request #1424 · Azure/PyRIT

romanlutz · 2026-03-01T14:28:01Z

Add remote dataset loader for BeaverTails (PKU-Alignment/BeaverTails), containing 330k+ QA pairs annotated across 14 harm categories for safety alignment research. Filters to unsafe entries by default.

Copilot

Pull request overview

Adds a new remote seed dataset loader for the BeaverTails HuggingFace dataset, making it discoverable via SeedDatasetProvider and documenting its availability.

Changes:

Introduces _BeaverTailsDataset remote loader with optional unsafe_only filtering (default: unsafe only).
Registers the loader in the remote datasets module and adds unit tests for filtering behavior.
Updates the “Loading Built-in Datasets” notebook output to include the new dataset name.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| `pyrit/datasets/seed_datasets/remote/beaver_tails_dataset.py` | New HuggingFace-backed loader that converts BeaverTails rows into `SeedPrompt`s (unsafe-only by default). |
| `pyrit/datasets/seed_datasets/remote/__init__.py` | Imports/exports the new loader so it’s auto-registered/discoverable. |
| `tests/unit/datasets/test_beaver_tails_dataset.py` | Adds unit tests covering unsafe-only vs all-entries behavior and dataset naming. |
| `doc/code/datasets/1_loading_datasets.ipynb` | Notebook updated to reflect the new dataset in the available list (but now includes executed outputs/metadata). |
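
As a rough illustration of the unsafe-only filtering described above, here is a minimal sketch. It is not the loader's actual code. `PromptRecord` is a hypothetical stand-in for PyRIT's `SeedPrompt`, the function name is invented, and the row layout (`prompt`, `response`, `category`, `is_safe`) is an assumption about the BeaverTails schema. Whether the real loader records harm categories this way is also an assumption.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class PromptRecord:  # hypothetical stand-in for PyRIT's SeedPrompt
    value: str
    harm_categories: list


def rows_to_prompts(rows: Iterable[dict], unsafe_only: bool = True) -> list:
    """Convert rows to prompt records, optionally keeping only unsafe entries."""
    prompts = []
    for row in rows:
        # Assumed schema: "is_safe" is True when the QA pair was annotated as safe.
        if unsafe_only and row["is_safe"]:
            continue
        # Assumed schema: "category" maps a harm name to a boolean flag.
        harms = [name for name, flagged in row["category"].items() if flagged]
        # Only the prompt text is extracted; the response is ignored.
        prompts.append(PromptRecord(value=row["prompt"], harm_categories=harms))
    return prompts


sample_rows = [
    {"prompt": "q1", "response": "a1",
     "category": {"violence": True, "privacy": False}, "is_safe": False},
    {"prompt": "q2", "response": "a2",
     "category": {"violence": False, "privacy": False}, "is_safe": True},
]
unsafe = rows_to_prompts(sample_rows)
everything = rows_to_prompts(sample_rows, unsafe_only=False)
```

With the default `unsafe_only=True`, only the first sample row survives. Passing `unsafe_only=False` keeps both.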

pyrit/datasets/seed_datasets/remote/beaver_tails_dataset.py

doc/code/datasets/1_loading_datasets.ipynb

Copilot

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.

doc/code/datasets/1_loading_datasets.ipynb

pyrit/datasets/seed_datasets/remote/beaver_tails_dataset.py

Copilot

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

pyrit/datasets/seed_datasets/remote/beaver_tails_dataset.py

doc/code/datasets/1_loading_datasets.ipynb

Copilot

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated no new comments.

Copilot

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

pyrit/datasets/seed_datasets/remote/beaver_tails_dataset.py

doc/code/datasets/1_loading_datasets.ipynb

Copilot

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

pyrit/datasets/seed_datasets/remote/beaver_tails_dataset.py

Copilot

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

Comments suppressed due to low confidence (1)

pyrit/prompt_converter/braille_converter.py:133

In _get_braile, is_number is still reset to False after processing any character that isn’t in numberPunctuations (line 132). Since digits aren’t in numberPunctuations, this resets the number-mode after every digit, causing the Braille number indicator (characterUnicodes['num']) to be emitted before each digit instead of once per digit sequence. Consider resetting number-mode only when leaving a numeric run (e.g., when is_number is True and the current char is neither a digit nor an allowed number punctuation).

        is_number = False
        for char in text:
            if char in escapeCharacters:
                output += char
            elif char.isupper():
                if char.lower() in characterUnicodes:
                    output += characterUnicodes["caps"]
                    output += characterUnicodes[char.lower()]
            elif char in characterUnicodes:
                if char.isdigit() and not is_number:
                    is_number = True
                    output += characterUnicodes["num"]
                output += characterUnicodes[char]
            if is_number and char not in numberPunctuations:
                is_number = False
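
One way to read the suggestion above is sketched below. It is not the converter's actual code. The lookup tables are tiny hypothetical stand-ins, the function name is invented, and capital-letter and escape handling are left out. Number mode starts on the first digit of a run and ends only when a character appears that is neither a digit nor number punctuation.

```python
# Hypothetical, minimal stand-ins for the converter's lookup tables.
CHARACTER_UNICODES = {
    "num": "\u283c",  # Braille number indicator
    "1": "\u2801",
    "2": "\u2803",
    "a": "\u2801",
    ",": "\u2802",
}
NUMBER_PUNCTUATIONS = {",", "."}


def to_braille(text: str) -> str:
    """Emit the number indicator once per digit run, not once per digit."""
    output = ""
    in_number = False
    for char in text:
        if char.isdigit():
            if not in_number:
                # First digit of a run: enter number mode and emit the indicator.
                in_number = True
                output += CHARACTER_UNICODES["num"]
        elif in_number and char not in NUMBER_PUNCTUATIONS:
            # Leaving the numeric run: any non-digit, non-punctuation character.
            in_number = False
        # Unknown characters pass through unchanged in this sketch.
        output += CHARACTER_UNICODES.get(char, char)
    return output
```

Under these assumptions, `to_braille("12")` emits a single indicator before both digits, while `to_braille("1a2")` emits one before each digit because the letter ends the first run.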

pyrit/executor/attack/printer/markdown_printer.py

Copilot

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

pyrit/prompt_converter/braille_converter.py

varunj-msft

Loader looks clean. Worth splitting the braille and markdown printer fixes into their own PRs? The braille one may need a follow-up on the reset condition.

pyrit/prompt_converter/braille_converter.py

pyrit/datasets/seed_datasets/remote/beaver_tails_dataset.py

Wrap SeedPrompt construction in try/except TemplateSyntaxError to gracefully skip prompts that contain Jinja2 syntax (e.g. endraw) which would crash the template parser. Add test for this case. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot AI review requested due to automatic review settings March 1, 2026 14:28

Copilot started reviewing on behalf of romanlutz March 1, 2026 14:28

romanlutz force-pushed the romanlutz/add-beaver-tails-dataset branch from 7b635d9 to b652d70 March 1, 2026 14:28

Copilot AI reviewed Mar 1, 2026

pyrit/datasets/seed_datasets/remote/beaver_tails_dataset.py (outdated, resolved)

doc/code/datasets/1_loading_datasets.ipynb (outdated, resolved)

doc/code/datasets/1_loading_datasets.ipynb (resolved)

romanlutz force-pushed the romanlutz/add-beaver-tails-dataset branch 2 times, most recently from 9741ae3 to 1fd2ef7 March 2, 2026 13:02

Copilot AI review requested due to automatic review settings March 2, 2026 13:02

Copilot started reviewing on behalf of romanlutz March 2, 2026 13:03

Copilot AI reviewed Mar 2, 2026

Copilot AI review requested due to automatic review settings March 2, 2026 13:56

Copilot started reviewing on behalf of romanlutz March 2, 2026 13:57

Copilot AI reviewed Mar 2, 2026

pyrit/datasets/seed_datasets/remote/beaver_tails_dataset.py (outdated, resolved)

pyrit/datasets/seed_datasets/remote/beaver_tails_dataset.py (resolved)

doc/code/datasets/1_loading_datasets.ipynb (outdated, resolved)

Copilot AI review requested due to automatic review settings March 2, 2026 15:07

Copilot started reviewing on behalf of romanlutz March 2, 2026 15:07

Copilot AI reviewed Mar 2, 2026

romanlutz and others added 7 commits March 2, 2026 13:48

Remove dataset_name from constructor, hardcode as class constant

3a71604

The HF dataset identifier is now a class constant HF_DATASET_NAME instead of a constructor parameter, consistent with other loaders. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Use AsyncMock for _fetch_from_huggingface in tests

4f4fe8d

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
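
A minimal sketch of that testing pattern follows, using only the standard library. `DemoLoader`, `fetch_prompts` and the zero-argument signature of `_fetch_from_huggingface` are hypothetical stand-ins, not PyRIT's real API. The point is that an awaited method needs an `AsyncMock`, whose call returns an awaitable, rather than a plain `MagicMock`.

```python
import asyncio
from unittest.mock import AsyncMock, patch


class DemoLoader:  # hypothetical stand-in for the real loader class
    async def _fetch_from_huggingface(self) -> list:
        # The real method would download data; a unit test should never get here.
        raise RuntimeError("network access is not expected in unit tests")

    async def fetch_prompts(self) -> list:
        rows = await self._fetch_from_huggingface()
        return [row["prompt"] for row in rows]


async def run_demo() -> list:
    fake_rows = [{"prompt": "q1"}, {"prompt": "q2"}]
    # AsyncMock returns a coroutine when called, so `await` works on it.
    with patch.object(
        DemoLoader, "_fetch_from_huggingface", new=AsyncMock(return_value=fake_rows)
    ) as mocked:
        prompts = await DemoLoader().fetch_prompts()
        mocked.assert_awaited_once()
    return prompts


result = asyncio.run(run_demo())
```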

Precompute source_url and groups outside the loop

9fd9044

For a 330k-row dataset, this avoids hundreds of thousands of redundant string/list allocations. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
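
The idea can be sketched as ordinary loop-invariant hoisting. The function name, record layout and `groups` value below are placeholders for illustration, not taken from the loader. Only the dataset identifier comes from the PR description.

```python
HF_DATASET_NAME = "PKU-Alignment/BeaverTails"


def build_records(prompts: list) -> list:
    # Built once, before the loop, instead of once per row.
    source_url = f"https://huggingface.co/datasets/{HF_DATASET_NAME}"
    groups = ["example-group"]  # placeholder value

    records = []
    for prompt in prompts:
        # Every record references the same string and the same list object.
        records.append({"value": prompt, "source": source_url, "groups": groups})
    return records


records = build_records(["q1", "q2"])
```

One trade-off of this sketch: because all records share one list object, mutating `groups` on one record would affect the others, so it only suits values that are treated as read-only.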

Wrap prompt values in raw/endraw to preserve Jinja2 syntax

f98493c

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
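
A small standard-library sketch of the wrapping idea follows. The helper names are hypothetical. In Jinja2, text between `{% raw %}` and `{% endraw %}` is emitted verbatim, so `{{ ... }}` inside a prompt is not treated as a template variable. A prompt that itself contains an `endraw` tag closes the block early, which is the case the page says was later handled with `try/except TemplateSyntaxError`. The regex pre-check below is only an illustrative alternative for spotting such prompts, not the approach used in the PR.

```python
import re

RAW_OPEN = "{% raw %}"
RAW_CLOSE = "{% endraw %}"
# Matches endraw tags, including whitespace-control variants like {%- endraw -%}.
ENDRAW_TAG = re.compile(r"\{%[-+]?\s*endraw\s*[-+]?%\}")


def wrap_raw(value: str) -> str:
    """Wrap a prompt so Jinja2 renders its braces literally."""
    return f"{RAW_OPEN}{value}{RAW_CLOSE}"


def closes_raw_early(value: str) -> bool:
    """Return True if the prompt contains an endraw tag of its own."""
    return ENDRAW_TAG.search(value) is not None


wrapped = wrap_raw("Hello {{ name }}")
```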

Add license notice and content warning to docstring

e8d1379

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

fix: update notebook output for rebased datasets

a91052f

Copilot AI review requested due to automatic review settings March 2, 2026 21:50

romanlutz force-pushed the romanlutz/add-beaver-tails-dataset branch from 8a9dccb to a91052f March 2, 2026 21:50

Copilot started reviewing on behalf of romanlutz March 2, 2026 21:51

Copilot AI reviewed Mar 2, 2026

pyrit/datasets/seed_datasets/remote/beaver_tails_dataset.py (outdated, resolved)

doc/code/datasets/1_loading_datasets.ipynb (outdated, resolved)

romanlutz and others added 2 commits March 2, 2026 16:48

merge main, add E402/E501 to doc per-file-ignores

29dec57

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Merge remote-tracking branch 'origin/main' into romanlutz/add-beaver-…

064f3eb

…tails-dataset # Conflicts: # pyproject.toml

Copilot AI review requested due to automatic review settings March 3, 2026 04:50

Copilot started reviewing on behalf of romanlutz March 3, 2026 04:51

Copilot AI reviewed Mar 3, 2026

pyrit/datasets/seed_datasets/remote/beaver_tails_dataset.py (outdated, resolved)

pyrit/datasets/seed_datasets/remote/beaver_tails_dataset.py (outdated, resolved)

romanlutz and others added 2 commits March 2, 2026 21:01

Merge remote-tracking branch 'origin/main' into romanlutz/add-beaver-…

7722b6e

…tails-dataset

fix: clarify BeaverTails loader extracts prompts only, not QA pairs

4cd75b5

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot AI review requested due to automatic review settings March 3, 2026 05:05

Copilot started reviewing on behalf of romanlutz March 3, 2026 05:06

Copilot AI reviewed Mar 3, 2026

pyrit/executor/attack/printer/markdown_printer.py (resolved)

romanlutz and others added 2 commits March 2, 2026 21:21

Merge remote-tracking branch 'origin/main' into romanlutz/add-beaver-…

0173c09

…tails-dataset

Use strftime for cleaner UTC timestamp in markdown printer

8ffd8b4

Replaces isoformat().replace('+00:00', 'Z') with strftime('%Y-%m-%dT%H:%M:%SZ') for second-resolution timestamps without microsecond noise. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
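
The difference between the two formats can be seen with a fixed UTC timestamp. This is a standalone illustration with an invented sample value, not the printer's code.

```python
from datetime import datetime, timezone

# Sample timestamp with a non-zero microsecond component.
stamp = datetime(2026, 3, 3, 12, 43, 5, 123456, tzinfo=timezone.utc)

# Old approach: keeps microseconds when they are non-zero.
old_style = stamp.isoformat().replace("+00:00", "Z")

# New approach: fixed second resolution with a literal "Z" suffix.
new_style = stamp.strftime("%Y-%m-%dT%H:%M:%SZ")

print(old_style)  # 2026-03-03T12:43:05.123456Z
print(new_style)  # 2026-03-03T12:43:05Z
```

Note that the `Z` in the `strftime` pattern is a literal character, so this only yields a correct timestamp when the `datetime` is already in UTC.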

Copilot AI review requested due to automatic review settings March 3, 2026 12:43

Copilot started reviewing on behalf of romanlutz March 3, 2026 12:44

Copilot AI reviewed Mar 3, 2026

pyrit/prompt_converter/braille_converter.py (resolved)

varunj-msft reviewed Mar 3, 2026

pyrit/prompt_converter/braille_converter.py (resolved)

pyrit/datasets/seed_datasets/remote/beaver_tails_dataset.py (outdated, resolved)

romanlutz and others added 4 commits March 3, 2026 15:19

Merge origin/main and revert unrelated changes

01d636c

Merge latest main into the branch. Revert unrelated changes to braille_converter.py and markdown_printer.py that don't belong in the BeaverTails dataset PR. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Merge branch 'romanlutz/add-beaver-tails-dataset' of https://github.c…

8ce7634

…om/romanlutz/PyRIT into romanlutz/add-beaver-tails-dataset

Revert unrelated markdown_printer.py change

700714a

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

rlundeen2 approved these changes Mar 4, 2026

rlundeen2 self-assigned this Mar 4, 2026

romanlutz merged commit 799e981 into Azure:main Mar 4, 2026
37 checks passed

romanlutz deleted the romanlutz/add-beaver-tails-dataset branch March 4, 2026 01:36

4 participants