FEAT Add ToxicChat dataset loader #1422
Open

romanlutz wants to merge 20 commits into Azure:main from romanlutz:romanlutz/add-toxic-chat-dataset
+400 −9
Commits (20)
79e2bf9 Add ToxicChat dataset loader
dec4d62 Remove dataset_name from constructor, hardcode as class constant
ead6d2e Use AsyncMock for _fetch_from_huggingface in tests
38f86c8 Precompute source_url and groups outside the loop
409cb36 Wrap prompt values in raw/endraw, remove TemplateSyntaxError catch
e812345 Fix ruff formatting
25f0115 Add license notice and content warning to docstring
2c59d7c Merge remote-tracking branch 'origin/main' into romanlutz/add-toxic-c…
1745a46 fix: update notebook output with all dataset names
3d0e7d2 add E402/E501 to doc per-file-ignores
a2d610f Merge remote-tracking branch 'origin/main' into romanlutz/add-toxic-c…
33a4b3e Merge remote-tracking branch 'origin/main' into romanlutz/add-toxic-c…
bb25d03 fix: address copilot comments - exception handling and test coverage
dc11907 Merge remote-tracking branch 'origin/main' into romanlutz/add-toxic-c…
4938f4e Register ToxicChat in remote __init__.py and add exc_info to debug log
8872002 Narrow exception handling in ToxicChat loader
1416436 Merge remote-tracking branch 'origin/main' into romanlutz/add-toxic-c…
77ba56b Address review comments: narrow exception, handle {% for %} edge case…
d411500 Resolve merge conflict: keep TemplateSyntaxError and raw wrapper stri…
07529fd Populate harm_categories from toxicity, jailbreaking, and openai_mode…
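The raw/endraw handling touched by commits 409cb36 and 77ba56b can be sketched in isolation (a simplified model of the idea, not PyRIT's actual rendering path): untrusted user input is wrapped in Jinja2 `{% raw %}`/`{% endraw %}` markers so any template syntax it contains is treated as literal text, and the wrapper is stripped back off if a later rendering step skips the value and leaves the markers in place.

```python
RAW_PREFIX = "{% raw %}"
RAW_SUFFIX = "{% endraw %}"


def wrap_raw(user_input: str) -> str:
    """Wrap untrusted text in Jinja2 raw markers so it is not executed as a template."""
    return f"{RAW_PREFIX}{user_input}{RAW_SUFFIX}"


def strip_leftover_wrapper(value: str) -> str:
    """If rendering was skipped (e.g. the input contained {% for %}), drop the leftover markers."""
    if value.startswith(RAW_PREFIX) and value.endswith(RAW_SUFFIX):
        return value[len(RAW_PREFIX):-len(RAW_SUFFIX)]
    return value
```

This is why the loader below checks `startswith(raw_prefix)`/`endswith(raw_suffix)` before slicing: a prompt whose rendering succeeded no longer carries the markers, so the strip is a no-op for it.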
pyrit/datasets/seed_datasets/remote/toxic_chat_dataset.py (164 additions, 0 deletions)
```python
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT license.

import json
import logging
from typing import Any

from jinja2 import TemplateSyntaxError

from pyrit.datasets.seed_datasets.remote.remote_dataset_loader import (
    _RemoteDatasetLoader,
)
from pyrit.models import SeedDataset, SeedPrompt

logger = logging.getLogger(__name__)


class _ToxicChatDataset(_RemoteDatasetLoader):
    """
    Loader for the ToxicChat dataset from HuggingFace.

    ToxicChat contains approximately 10k real user-chatbot conversations from the Chatbot Arena,
    annotated for toxicity and jailbreaking attempts. It provides real-world examples of
    how users interact with LLMs in adversarial ways.

    References:
        - https://huggingface.co/datasets/lmsys/toxic-chat
        - https://arxiv.org/abs/2310.17389
    License: CC BY-NC 4.0

    Warning: This dataset contains toxic, offensive, and jailbreaking content from real user
    conversations. Consult your legal department before using these prompts for testing.
    """

    HF_DATASET_NAME: str = "lmsys/toxic-chat"

    OPENAI_MODERATION_THRESHOLD: float = 0.8

    def __init__(
        self,
        *,
        config: str = "toxicchat0124",
        split: str = "train",
    ):
        """
        Initialize the ToxicChat dataset loader.

        Args:
            config: Dataset configuration. Defaults to "toxicchat0124".
            split: Dataset split to load. Defaults to "train".
        """
        self.config = config
        self.split = split

    @property
    def dataset_name(self) -> str:
        """Return the dataset name."""
        return "toxic_chat"

    def _extract_harm_categories(self, item: dict[str, Any]) -> list[str]:
        """
        Extract harm categories from toxicity, jailbreaking, and openai_moderation fields.

        Args:
            item: A single dataset row.

        Returns:
            list[str]: Harm category labels for this entry.
        """
        categories: list[str] = []

        if item.get("toxicity") == 1:
            categories.append("toxicity")
        if item.get("jailbreaking") == 1:
            categories.append("jailbreaking")

        openai_mod = item.get("openai_moderation", "[]")
        try:
            moderation_scores = json.loads(openai_mod) if isinstance(openai_mod, str) else openai_mod
            for category, score in moderation_scores:
                if score > self.OPENAI_MODERATION_THRESHOLD:
                    categories.append(category)
        except (json.JSONDecodeError, TypeError, ValueError):
            logger.debug(f"Could not parse openai_moderation for conv_id={item.get('conv_id', 'unknown')}")

        return categories

    async def fetch_dataset(self, *, cache: bool = True) -> SeedDataset:
        """
        Fetch ToxicChat dataset from HuggingFace and return as SeedDataset.

        Args:
            cache: Whether to cache the fetched dataset. Defaults to True.

        Returns:
            SeedDataset: A SeedDataset containing the ToxicChat user inputs.
        """
        logger.info(f"Loading ToxicChat dataset from {self.HF_DATASET_NAME}")

        data = await self._fetch_from_huggingface(
            dataset_name=self.HF_DATASET_NAME,
            config=self.config,
            split=self.split,
            cache=cache,
        )

        authors = [
            "Zi Lin",
            "Zihan Wang",
            "Yongqi Tong",
            "Yangkun Wang",
            "Yuxin Guo",
            "Yujia Wang",
            "Jingbo Shang",
        ]
        description = (
            "ToxicChat contains ~10k real user-chatbot conversations from the Chatbot Arena, "
            "annotated for toxicity and jailbreaking attempts. It provides real-world examples "
            "of adversarial user interactions with LLMs."
        )

        source_url = f"https://huggingface.co/datasets/{self.HF_DATASET_NAME}"
        groups = ["UC San Diego"]

        raw_prefix = "{% raw %}"
        raw_suffix = "{% endraw %}"

        seed_prompts: list[SeedPrompt] = []
        for item in data:
            user_input = item["user_input"]
            harm_categories = self._extract_harm_categories(item)
            try:
                prompt = SeedPrompt(
                    value=f"{{% raw %}}{user_input}{{% endraw %}}",
                    data_type="text",
                    dataset_name=self.dataset_name,
                    description=description,
                    source=source_url,
                    authors=authors,
                    groups=groups,
                    harm_categories=harm_categories,
                    metadata={
                        "toxicity": str(item.get("toxicity", "")),
                        "jailbreaking": str(item.get("jailbreaking", "")),
                        "human_annotation": str(item.get("human_annotation", "")),
                    },
                )

                # If user_input contains Jinja2 control structures (e.g., {% for %}),
                # render_template_value_silent may skip rendering and leave the raw wrapper.
                if prompt.value.startswith(raw_prefix) and prompt.value.endswith(raw_suffix):
                    prompt.value = prompt.value[len(raw_prefix) : -len(raw_suffix)]

                seed_prompts.append(prompt)
            except TemplateSyntaxError:
                conv_id = item.get("conv_id", "unknown")
                logger.debug(
                    f"Skipping entry with conv_id={conv_id}: failed to parse as Jinja2 template",
                    exc_info=True,
                )

        logger.info(f"Successfully loaded {len(seed_prompts)} prompts from ToxicChat dataset")

        return SeedDataset(seeds=seed_prompts, dataset_name=self.dataset_name)
```
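For reference, the `openai_moderation` field arrives as a JSON-serialized list of `[category, score]` pairs, and categories only count when their score exceeds the 0.8 threshold. The extraction logic can be exercised standalone with a hypothetical sample row (a sketch mirroring `_extract_harm_categories`, not the PR's code itself):

```python
import json

OPENAI_MODERATION_THRESHOLD = 0.8


def extract_harm_categories(item: dict) -> list[str]:
    # Standalone mirror of _ToxicChatDataset._extract_harm_categories, for illustration.
    categories: list[str] = []
    if item.get("toxicity") == 1:
        categories.append("toxicity")
    if item.get("jailbreaking") == 1:
        categories.append("jailbreaking")
    openai_mod = item.get("openai_moderation", "[]")
    try:
        scores = json.loads(openai_mod) if isinstance(openai_mod, str) else openai_mod
        for category, score in scores:
            if score > OPENAI_MODERATION_THRESHOLD:
                categories.append(category)
    except (json.JSONDecodeError, TypeError, ValueError):
        pass  # unparseable moderation payload: keep whatever labels we already have
    return categories


# Hypothetical row shaped like a ToxicChat entry.
row = {
    "toxicity": 1,
    "jailbreaking": 0,
    "openai_moderation": json.dumps([["harassment", 0.91], ["violence", 0.12]]),
}
print(extract_harm_categories(row))  # ['toxicity', 'harassment']
```

Note that a malformed `openai_moderation` payload degrades gracefully: the toxicity/jailbreaking labels survive and only the moderation-derived categories are dropped.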
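Commit ead6d2e replaces the HuggingFace fetch with `AsyncMock` in the tests. The general pattern, sketched here against a stand-in class rather than PyRIT's actual test suite, is to assign an `AsyncMock` over the async fetch method so the loader never touches the network:

```python
import asyncio
from unittest.mock import AsyncMock


class FakeLoader:
    # Stand-in for _ToxicChatDataset; only the fetch path matters here.
    async def _fetch_from_huggingface(self, **kwargs):
        raise RuntimeError("network access should be mocked in tests")

    async def fetch_dataset(self):
        return await self._fetch_from_huggingface(dataset_name="lmsys/toxic-chat")


async def run_test():
    loader = FakeLoader()
    # AsyncMock produces an awaitable, so it can replace the async method directly.
    loader._fetch_from_huggingface = AsyncMock(return_value=[{"user_input": "hi"}])
    data = await loader.fetch_dataset()
    # Verify the mock was awaited exactly once with the expected arguments.
    loader._fetch_from_huggingface.assert_awaited_once_with(dataset_name="lmsys/toxic-chat")
    return data


print(asyncio.run(run_test()))  # [{'user_input': 'hi'}]
```

In a real pytest suite the same substitution would typically go through `unittest.mock.patch.object` or a pytest fixture rather than direct attribute assignment.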