FEAT: Scientific Translation Converter #1379

Open
jbolor21 wants to merge 12 commits into Azure:main from jbolor21:users/bjagdagdorj/science_converter

@jbolor21
Contributor

Description

Adds a scientific translation converter that rewrites queries into various "scientific" modes.

Tests and Documentation

Added unit tests and added the converter to the text-to-text (LLM-based) converters notebook.

@jbolor21 changed the title from "[DRAFT]: FEAT: Scientific Translation Converter" to "FEAT: Scientific Translation Converter" on Feb 19, 2026

## Mode-specific guidelines:

{% if mode == "academic" %}
Contributor

I am confused. This includes only a specific section depending on the mode BUT at the end there's a combined mode. How will it know all the modes if we exclude most of them? Examples below also include all of them.

Contributor

Also not sure if I understand the combined mode. In the class it's explicitly listed as a mode, but here it's a catchall, so "foobar" would resolve to a combined prompt. I feel like it would be better to just drop "combined" and refer to the default/wildcard as a combined mode

Contributor Author

I added "combined" to merge a couple of the methods into one. I did change it to an "elif" rather than a catch-all "else", since any other mode should be caught as an exception!

Raises:
ValueError: If an invalid mode is provided.
"""
valid_modes = ("academic", "technical", "smiles", "research", "reaction", "combined")
Contributor

Is there an easier way to check for this, given that it's a literal that's defined above?

Contributor

I suggested one below by just attaching the valid modes to the class itself, but it's a nit so feel free to disregard

Contributor

I suppose you could have both.


from typing import Literal, get_args

ObfuscationMode = Literal[
    "academic", "technical", "smiles", "research", "reaction", "combined"
]

OBFUSCATION_MODES = set(get_args(ObfuscationMode))

def is_valid_mode(value: str) -> bool:
    return value in OBFUSCATION_MODES
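
For the "attach the valid modes to the class" variant mentioned above, a minimal sketch might look like the following. `ModeValidatedConverter` and `VALID_MODES` are hypothetical names used only for illustration; this is not the converter from the PR, and the real class takes other arguments (such as a converter target) that are left out here.

```python
from typing import Literal, get_args

ObfuscationMode = Literal[
    "academic", "technical", "smiles", "research", "reaction", "combined"
]


class ModeValidatedConverter:
    """Illustrative stand-in only; not the class under review."""

    # Derived once from the Literal, so the runtime check and the
    # type alias cannot drift out of sync.
    VALID_MODES = get_args(ObfuscationMode)

    def __init__(self, *, mode: str = "combined") -> None:
        if mode not in self.VALID_MODES:
            raise ValueError(
                f"Invalid mode '{mode}'. Must be one of: {self.VALID_MODES}"
            )
        self._mode = mode
```

With this shape, adding a mode to the Literal is the only change needed for validation to accept it.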

Contributor

Copilot AI left a comment

Pull request overview

Adds a new LLM-based prompt converter that rewrites prompts into “scientific/technical” phrasing across multiple modes, along with the seed prompt template, documentation wiring, and unit tests.

Changes:

  • Introduces ScientificObfuscationConverter (mode-driven) backed by a new YAML seed prompt template.
  • Exposes the converter via pyrit.prompt_converter exports and API docs.
  • Adds unit tests and an example usage snippet in the converters documentation notebook.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| pyrit/prompt_converter/scientific_obfuscation_converter.py | Implements the new LLM-based converter and identifier construction. |
| pyrit/datasets/prompt_converters/scientific_obfuscation_converter.yaml | Adds the mode-parameterized system prompt template used by the converter. |
| pyrit/prompt_converter/__init__.py | Exports the new converter from the prompt_converter package. |
| tests/unit/converter/test_scientific_obfuscation_converter.py | Adds unit tests validating initialization, mode validation, and conversion behavior. |
| doc/code/converters/1_text_to_text_converters.py | Documents example usage of the new converter in the text-to-text converters notebook source. |
| doc/code/converters/1_text_to_text_converters.ipynb | Adds the corresponding notebook cell content for the new converter example. |
| doc/api.rst | Adds the converter to the API reference list. |
Comments suppressed due to low confidence (2)

pyrit/prompt_converter/scientific_obfuscation_converter.py:23

  • The PR title/description refer to a "Scientific Translation Converter", but the implementation and dataset are named "ScientificObfuscationConverter" / "scientific_obfuscation_converter". If this is intended to be a translation-style converter, consider aligning the naming (or update the PR description) to avoid confusion for API consumers and documentation readers.
class ScientificObfuscationConverter(LLMGenericTextConverter):
    """
    Uses an LLM to transform simple or direct prompts into

pyrit/prompt_converter/scientific_obfuscation_converter.py:67

  • valid_modes duplicates the allowed values already defined in ObfuscationMode. To avoid the tuple and the type alias drifting out of sync, derive the runtime list from the type (e.g., typing.get_args(ObfuscationMode)) or centralize the allowed modes as a single constant reused for both validation and typing.
        valid_modes = ("academic", "technical", "smiles", "research", "reaction", "combined")
        if mode not in valid_modes:
            raise ValueError(f"Invalid mode '{mode}'. Must be one of: {valid_modes}")

if any(var not in kwargs for var in for_vars):
# Don't render if we're missing loop collection variables - preserve the template as-is
# Extract variable names from {% if var ... %} and {% elif var ... %} patterns
if_vars = re.findall(r"\{%[-\s]*(?:el)?if\s+(\w+)", self.value)
Contributor

Rather than parsing here and having one YAML, is it possible to have multiple YAMLs, one per mode? Then we could leave seed.py untouched (I'm assuming this parsing is specific to this converter, which I would prefer not to do since seed is a generic data structure) and redirect to the corresponding YAML based on the mode in scientific_translation_converter.py.
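
A rough sketch of the per-mode redirect suggested here, under stated assumptions: `template_path_for_mode` is a hypothetical helper, the `<mode>_science_converter.yaml` naming pattern is assumed for illustration, and the directory is passed in rather than taken from any PyRIT constant.

```python
from pathlib import Path
from typing import Literal, get_args

TranslationMode = Literal[
    "academic", "technical", "smiles", "research", "reaction", "combined"
]


def template_path_for_mode(mode: str, template_dir: Path) -> Path:
    """Pick the YAML for one mode instead of branching inside a shared template."""
    if mode not in get_args(TranslationMode):
        raise ValueError(f"Invalid mode '{mode}'.")
    # Assumed naming convention: one file per mode.
    return template_dir / f"{mode}_science_converter.yaml"
```

Because the mode is resolved to a file before any rendering happens, the templates need no `{% if mode == ... %}` blocks and the generic seed code does not have to inspect control structures.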

@hannahwestra25 self-assigned this on Feb 24, 2026
Copilot AI review requested due to automatic review settings March 3, 2026 22:59
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 14 out of 14 changed files in this pull request and generated 12 comments.

Comments suppressed due to low confidence (1)

pyrit/datasets/prompt_converters/academic_science_converter.yaml:21

  • Spelling in the example prompt: asprin should be aspirin (unless the misspelling is intentional for the example). Fixing this avoids propagating typos into docs/training data.
  Original: "How do I make asprin?"
  Converted: """Organic Chemistry 201 - Design a Synthesis Pathway to an Unknown Final Product, Starting with Salicylic

Comment on lines +47 to +52
@pytest.mark.parametrize("mode", ["academic", "technical", "smiles", "research", "reaction", "combined"])
def test_scientific_translation_converter_init_valid_modes(mode, sqlite_instance):
    prompt_target = MockPromptTarget()
    converter = ScientificTranslationConverter(converter_target=prompt_target, mode=mode)
    assert converter._system_prompt_template
    assert converter._mode == mode

Copilot AI Mar 3, 2026

TranslationMode includes "math", and there is a corresponding math_science_converter.yaml, but the valid-mode parametrized init test doesn’t include math. Adding it here would ensure all built-in modes are covered and prevent silent regressions for that mode.
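
One way to catch this kind of gap mechanically is to compare the hard-coded test list against the type alias. The sketch below assumes a definition for `TranslationMode` (the comment above confirms only that it includes "math"), and `untested_modes` is a hypothetical helper.

```python
from typing import Literal, get_args

# Assumed definition; only the presence of "math" is confirmed by the review.
TranslationMode = Literal[
    "academic", "technical", "smiles", "research", "reaction", "combined", "math"
]

# The list currently hard-coded in the parametrized init test.
TESTED_MODES = ["academic", "technical", "smiles", "research", "reaction", "combined"]


def untested_modes(tested: list[str]) -> set[str]:
    """Return every mode declared on the Literal that the test list omits."""
    return set(get_args(TranslationMode)) - set(tested)
```

Passing `get_args(TranslationMode)` directly to the parametrize decorator would make the helper unnecessary, since the test list could then never fall behind the type.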

Copilot uses AI. Check for mistakes.
name: scientific_translation_converter_technical_mode
description: |
  Converts prompts into a technical mode (ie using scientific/technical language)
authors: Bolor Jagdagdorj

Copilot AI Mar 3, 2026

In seed prompt YAMLs, authors is typically a YAML list (e.g., authors: ['AI Red Team']). Here it’s a scalar string, which will be loaded into the SeedPrompt.authors: Sequence[str] field as an iterable of characters and can break filtering/serialization that expects a list of author names. Change authors to a list form (even if it’s a single author).

Suggested change
- authors: Bolor Jagdagdorj
+ authors: ['Bolor Jagdagdorj']
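
The scalar-versus-list concern can be seen in plain Python, independent of any YAML loader. This sketch does not show how SeedPrompt itself handles the field; `as_author_list` is a hypothetical normalizing helper added only for illustration.

```python
from typing import Sequence, Union


def as_author_list(authors: Union[str, Sequence[str]]) -> list[str]:
    """Hypothetical helper: wrap a bare string so it is treated as one author."""
    if isinstance(authors, str):
        return [authors]
    return list(authors)


scalar_authors = "Bolor Jagdagdorj"   # what `authors: Bolor Jagdagdorj` parses to
list_authors = ["Bolor Jagdagdorj"]   # what `authors: ['Bolor Jagdagdorj']` parses to

# A str is itself a sequence of one-character strings, so naive iteration
# over the scalar form yields letters rather than names.
first_items_scalar = list(scalar_authors)[:3]
first_items_list = list(list_authors)[:3]
```

Here `first_items_scalar` holds single characters while `first_items_list` holds the full name, which is why the list form is the safer choice in the YAML.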
name: scientific_translation_converter_reaction_mode
description: |
  Converts prompts into a reaction mechanism mode (ie using step-by-step mechanisms)
authors: Bolor Jagdagdorj

Copilot AI Mar 3, 2026

In seed prompt YAMLs, authors is typically a YAML list (e.g., authors: ['AI Red Team']). Here it’s a scalar string, which will be loaded into the SeedPrompt.authors: Sequence[str] field as an iterable of characters and can break filtering/serialization that expects a list of author names. Change authors to a list form (even if it’s a single author).

Suggested change
- authors: Bolor Jagdagdorj
+ authors: ['Bolor Jagdagdorj']
name: scientific_translation_converter_combined_mode
description: |
  Converts prompts into a complex scientific language mode that combines elements of academic, research, chemical notation, and reaction mechanism styles
authors: Bolor Jagdagdorj

Copilot AI Mar 3, 2026

In seed prompt YAMLs, authors is typically a YAML list (e.g., authors: ['AI Red Team']). Here it’s a scalar string, which will be loaded into the SeedPrompt.authors: Sequence[str] field as an iterable of characters and can break filtering/serialization that expects a list of author names. Change authors to a list form (even if it’s a single author).

Suggested change
- authors: Bolor Jagdagdorj
+ authors: ['Bolor Jagdagdorj']
name: scientific_translation_converter_academic_mode
description: |
  Converts prompts into an academic mode (ie homework/exam style questioning)
authors: Bolor Jagdagdorj

Copilot AI Mar 3, 2026

In seed prompt YAMLs, authors is typically a YAML list (e.g., authors: ['AI Red Team']). Here it’s a scalar string, which will be loaded into the SeedPrompt.authors: Sequence[str] field as an iterable of characters and can break filtering/serialization that expects a list of author names. Change authors to a list form (even if it’s a single author).

Suggested change
- authors: Bolor Jagdagdorj
+ authors: ["Bolor Jagdagdorj"]
Comment on lines +192 to +196
# Extract variable names from {% if var ... %} and {% elif var ... %} patterns
if_vars = re.findall(r"\{%[-\s]*(?:el)?if\s+(\w+)", self.value)
control_vars = set(for_vars + if_vars)
if any(var not in kwargs for var in control_vars):
    # Don't render if we're missing control structure variables - preserve the template as-is

Copilot AI Mar 3, 2026

The new {% if ... %} extraction regex captures the first token after if, which breaks valid Jinja expressions like {% if not foo %} or {% if foo is defined %} (it would capture not/foo incorrectly) and can cause templates to be returned unrendered even when required variables are provided. Consider updating the pattern to handle optional not and more general expressions, or use Jinja's AST parsing to determine undeclared variables used in control structures.
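
The `not` case can be reproduced with the standard library alone. The sketch below runs the pattern quoted above against a small made-up template, then tries one possible adjustment that skips a leading `not`. The adjusted pattern is an illustration rather than a proposed fix: it still only sees the first name in each condition, so Jinja's own parser (for example `jinja2.meta.find_undeclared_variables`) would be the more general route.

```python
import re

# Pattern as quoted in the diff hunk above.
ORIGINAL = re.compile(r"\{%[-\s]*(?:el)?if\s+(\w+)")

# Illustrative variant: skip an optional leading `not` before the name.
SKIP_NOT = re.compile(r"\{%[-\s]*(?:el)?if\s+(?:not\s+)?(\w+)")

# Made-up template exercising both a negated test and an elif branch.
template = "{% if not foo %}x{% elif mode == 'academic' %}y{% endif %}"

original_vars = ORIGINAL.findall(template)   # picks up the keyword `not`
adjusted_vars = SKIP_NOT.findall(template)   # picks up `foo` instead
```

With the original pattern, `not` lands in the set of control variables, so a caller that supplies `foo` and `mode` would still fail the "all control variables present" check and get the template back unrendered.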
name: scientific_translation_converter_smiles_mode
description: |
  Converts prompts into a SMILES/chemical notation mode (ie using chemical notation and formulas)
authors: Bolor Jagdagdorj

Copilot AI Mar 3, 2026

In seed prompt YAMLs, authors is typically a YAML list (e.g., authors: ['AI Red Team']). Here it’s a scalar string, which will be loaded into the SeedPrompt.authors: Sequence[str] field as an iterable of characters and can break filtering/serialization that expects a list of author names. Change authors to a list form (even if it’s a single author).

Suggested change
- authors: Bolor Jagdagdorj
+ authors: ["Bolor Jagdagdorj"]
name: scientific_translation_converter_research_mode
description: |
  Converts prompts into a research question or hypothesis mode (ie using scientific research framing)
authors: Bolor Jagdagdorj

Copilot AI Mar 3, 2026

In seed prompt YAMLs, authors is typically a YAML list (e.g., authors: ['AI Red Team']). Here it’s a scalar string, which will be loaded into the SeedPrompt.authors: Sequence[str] field as an iterable of characters and can break filtering/serialization that expects a list of author names. Change authors to a list form (even if it’s a single author).

Suggested change
- authors: Bolor Jagdagdorj
+ authors: ["Bolor Jagdagdorj"]
name: scientific_translation_converter_math_mode
description: |
  Converts prompts into a math mode (ie math word problem for homework/exam style questioning)
authors: Bolor Jagdagdorj

Copilot AI Mar 3, 2026

In seed prompt YAMLs, authors is typically a YAML list (e.g., authors: ['AI Red Team']). Here it’s a scalar string, which will be loaded into the SeedPrompt.authors: Sequence[str] field as an iterable of characters and can break filtering/serialization that expects a list of author names. Change authors to a list form (even if it’s a single author).

Suggested change
- authors: Bolor Jagdagdorj
+ authors: ["Bolor Jagdagdorj"]

from pyrit.common.apply_defaults import REQUIRED_VALUE, apply_defaults
from pyrit.common.path import CONVERTER_SEED_PROMPT_PATH
from pyrit.identifiers import ConverterIdentifier

Copilot AI Mar 3, 2026

ConverterIdentifier is imported from pyrit.identifiers, but that module only exports ComponentIdentifier/Identifiable/helpers; this import will raise ImportError at runtime. Update the import and the _build_identifier return annotation to use the correct identifier type used by other converters (e.g., ComponentIdentifier).

Suggested change
- from pyrit.identifiers import ConverterIdentifier
+ from pyrit.identifiers import ComponentIdentifier
5 participants