
attacks: Unify chat templating#97

Merged
tomtseng merged 23 commits into main from tomtseng/chat-attack
Apr 20, 2026

Conversation

@tomtseng tomtseng commented Feb 14, 2026

Summary

Convert attack training data formatting from manual f-strings to tokenizer.apply_chat_template(), then remove the now-unused user_prefix / assistant_prefix / end_turn fields from ModelConfig.

Parent PR: #95

Motivation

PR #95 unified all eval prompt formatting onto apply_chat_template via configurable Jinja2 templates. However, the attack training data pipelines still formatted prompts using manual f-strings with model_config.user_prefix / assistant_prefix / end_turn:

# Old approach — duplicated template knowledge in f-strings
prompt = f"{user_prefix}{data_point['prompt']}{end_turn}"
completion = f"{assistant_prefix}{data_point['rejected']}{end_turn}"

This meant ModelConfig had to carry three string fields just to support this formatting path. It's nicer to have attacks also go through apply_chat_template, reducing the number of distinct chat templating code paths.

Changes

  • New helper (evals/utils.py): format_prompt_completion(user_content, assistant_content, tokenizer) uses the chat template to create the prompt and completion fields that exactly match inference-time formatting.
  • ModelConfig simplified: drops user_prefix, assistant_prefix, and end_turn fields.
  • TextTemplate.to_jinja2(): strips trailing spaces from the assistant prefix in the add_generation_prompt block to avoid BPE tokenization boundary misalignment (see below).
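A minimal sketch of the boundary-splitting idea behind format_prompt_completion (the stub tokenizer and its toy template are illustrative stand-ins, not the actual evals/utils.py implementation):

```python
# Sketch of the format_prompt_completion idea. StubTokenizer is an
# illustrative stand-in for a real transformers tokenizer; the actual
# helper in evals/utils.py may differ in details.

class StubTokenizer:
    """Mimics tokenizer.apply_chat_template for a toy chat template."""

    def apply_chat_template(self, messages, tokenize=False, add_generation_prompt=False):
        out = ""
        for m in messages:
            role = "User" if m["role"] == "user" else "Assistant"
            out += f"{role}: {m['content']}<end>"
        if add_generation_prompt:
            # No trailing space here, mirroring the to_jinja2() fix that
            # avoids BPE tokenization boundary misalignment.
            out += "Assistant:"
        return out


def format_prompt_completion(user_content, assistant_content, tokenizer):
    """Derive the prompt/completion boundary from the chat template itself."""
    user_msgs = [{"role": "user", "content": user_content}]
    full_msgs = user_msgs + [{"role": "assistant", "content": assistant_content}]
    # Prompt = user turn plus the assistant prefix (add_generation_prompt=True),
    # exactly matching inference-time formatting.
    prompt = tokenizer.apply_chat_template(user_msgs, add_generation_prompt=True)
    full = tokenizer.apply_chat_template(full_msgs)
    assert full.startswith(prompt), "template must be prefix-consistent"
    return prompt, full[len(prompt):]
```

Because the same template renders both strings, prompt + completion is guaranteed to equal the full templated conversation.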

Behavior changes

  • ModelConfig API — Constructing a ModelConfig no longer requires or accepts user_prefix, assistant_prefix, or end_turn. Any external code or YAML configs passing these fields to ModelConfig.from_dict() will now raise ValueError.
  • SFT loss boundary shift for non-PLAIN templates — The old f-string approach put the assistant prefix (e.g., "Assistant: ") at the start of the completion field; the new format_prompt_completion uses add_generation_prompt=True, which places the assistant prefix at the end of the prompt instead. The full concatenated string is identical, but because SFTTrainer computes loss only on completion tokens, the model no longer receives gradient signal on the fixed assistant-prefix tokens. This is the more correct behavior since the assistant prefix is a fixed formatting token that doesn't need to be learned during fine-tuning.
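The boundary shift described above can be illustrated with made-up prefix strings (the prefixes here are not any model's actual template; only the relative placement matters):

```python
# Illustration of the SFT loss boundary shift; prefix strings are made up.
user_prefix, assistant_prefix, end_turn = "User: ", "Assistant:", "<end>"
q, a = "question", "refusal"

# Old f-string path: the assistant prefix starts the completion, so
# SFTTrainer also computed loss on these fixed formatting tokens.
old_prompt = f"{user_prefix}{q}{end_turn}"
old_completion = f"{assistant_prefix} {a}{end_turn}"

# New path (add_generation_prompt=True): the prefix ends the prompt, so
# loss is computed only on the assistant's actual content.
new_prompt = f"{user_prefix}{q}{end_turn}{assistant_prefix}"
new_completion = f" {a}{end_turn}"

# The concatenated training string is identical either way; only the
# loss boundary moves.
assert old_prompt + old_completion == new_prompt + new_completion
```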

Regression test results

Ran every attack (except embedding_attack) with best-trial hyperparameters from nov7_trial on Qwen3-0.6B and Qwen3-0.6B-Base. b1/b2 are the old code (commit 8c4cb2e) with seeds 42/123; f1/f2 are the PR branch (commit d72495c) with the same seeds.
Scripts: scripts/user/tomtseng/chat_template_refactor_testing_260217/run_regression.py and scripts/user/tomtseng/chat_template_refactor_testing_260217/analyze_results.py.

In summary, results look roughly within noise: some are higher, some are lower. Things I would flag:

  • Benign LoRA finetuning StrongREJECT is maybe worse, though benign attacks aren't a top priority anyway
  • Backdoor jailbreak-tuning might be worse on MMLU now, not sure why

StrongREJECT scores

| Model | Attack | b1 | b2 | f1 | f2 |
| --- | --- | --- | --- | --- | --- |
| qwen3_0_6b | backdoor_finetune | 0.588 | 0.629 | 0.605 | 0.548 |
| qwen3_0_6b | benign_full_parameter_finetune | 0.153 | 0.349 | 0.374 | 0.359 |
| qwen3_0_6b | benign_lora_finetune | 0.365 | 0.351 | 0.300 | 0.248 |
| qwen3_0_6b | competing_objectives_finetune | 0.630 | 0.599 | 0.559 | 0.570 |
| qwen3_0_6b | full_parameter_finetune | 0.510 | 0.414 | 0.524 | 0.568 |
| qwen3_0_6b | lora_finetune | 0.440 | 0.460 | 0.431 | 0.443 |
| qwen3_0_6b | multilingual_finetune | 0.490 | 0.443 | 0.496 | 0.452 |
| qwen3_0_6b | no_weight_modification | 0.199 | 0.198 | 0.206 | 0.202 |
| qwen3_0_6b | style_modulation_finetune | 0.626 | 0.305 | 0.580 | 0.583 |
| qwen3_0_6b_base | backdoor_finetune | 0.630 | 0.625 | 0.544 | 0.625 |
| qwen3_0_6b_base | benign_full_parameter_finetune | 0.473 | 0.440 | 0.495 | 0.448 |
| qwen3_0_6b_base | benign_lora_finetune | 0.419 | 0.380 | 0.198 | 0.089 |
| qwen3_0_6b_base | competing_objectives_finetune | 0.540 | 0.485 | 0.303 | 0.487 |
| qwen3_0_6b_base | full_parameter_finetune | 0.555 | 0.526 | 0.597 | 0.588 |
| qwen3_0_6b_base | lora_finetune | 0.574 | 0.529 | 0.597 | 0.594 |
| qwen3_0_6b_base | multilingual_finetune | 0.602 | 0.572 | 0.592 | 0.558 |
| qwen3_0_6b_base | no_weight_modification | 0.320 | 0.323 | 0.325 | 0.321 |
| qwen3_0_6b_base | style_modulation_finetune | 0.211 | 0.255 | 0.402 | 0.343 |

Summary (StrongREJECT):

| Comparison | n | Mean diff | RMS |
| --- | --- | --- | --- |
| f1 − b1 (fix vs old, seed 42) | 18 | −0.011 | 0.109 |
| f2 − b2 (fix vs old, seed 123) | 18 | +0.008 | 0.111 |
| b2 − b1 (old code noise, seeds 123 vs 42) | 18 | −0.025 | 0.097 |

MMLU-Pro scores

| Model | Attack | b1 | b2 | f1 | f2 |
| --- | --- | --- | --- | --- | --- |
| qwen3_0_6b | backdoor_finetune | 0.243 | 0.221 | 0.150 | 0.164 |
| qwen3_0_6b | benign_full_parameter_finetune | 0.171 | 0.243 | 0.171 | 0.221 |
| qwen3_0_6b | benign_lora_finetune | 0.179 | 0.164 | 0.207 | 0.186 |
| qwen3_0_6b | competing_objectives_finetune | 0.214 | 0.086 | 0.193 | 0.157 |
| qwen3_0_6b | full_parameter_finetune | 0.036 | 0.093 | 0.057 | 0.086 |
| qwen3_0_6b | lora_finetune | 0.193 | 0.214 | 0.193 | 0.243 |
| qwen3_0_6b | multilingual_finetune | 0.179 | 0.179 | 0.186 | 0.186 |
| qwen3_0_6b | no_weight_modification | 0.286 | 0.293 | 0.286 | 0.279 |
| qwen3_0_6b | style_modulation_finetune | 0.021 | 0.136 | 0.136 | 0.029 |
| qwen3_0_6b_base | backdoor_finetune | 0.171 | 0.193 | 0.093 | 0.064 |
| qwen3_0_6b_base | benign_full_parameter_finetune | 0.250 | 0.279 | 0.264 | 0.250 |
| qwen3_0_6b_base | benign_lora_finetune | 0.243 | 0.236 | 0.214 | 0.214 |
| qwen3_0_6b_base | competing_objectives_finetune | 0.271 | 0.193 | 0.186 | 0.207 |
| qwen3_0_6b_base | full_parameter_finetune | 0.250 | 0.243 | 0.279 | 0.271 |
| qwen3_0_6b_base | lora_finetune | 0.207 | 0.200 | 0.236 | 0.193 |
| qwen3_0_6b_base | multilingual_finetune | 0.271 | 0.271 | 0.207 | 0.279 |
| qwen3_0_6b_base | no_weight_modification | 0.314 | 0.300 | 0.307 | 0.329 |
| qwen3_0_6b_base | style_modulation_finetune | 0.250 | 0.236 | 0.243 | 0.236 |

Summary (MMLU-Pro):

| Comparison | n | Mean diff | RMS |
| --- | --- | --- | --- |
| f1 − b1 (fix vs old, seed 42) | 18 | −0.008 | 0.049 |
| f2 − b2 (fix vs old, seed 123) | 18 | −0.010 | 0.048 |
| b2 − b1 (old code noise, seeds 123 vs 42) | 18 | +0.002 | 0.051 |

Fix-vs-old RMS is comparable to old code's own seed-to-seed noise for both StrongREJECT and MMLU-Pro. No systematic regression.
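For reference, the Mean diff and RMS columns in the summary tables can be computed as follows (a sketch; the actual analyze_results.py script is not shown here):

```python
import math

def mean_diff(fix_scores, old_scores):
    """Mean of elementwise differences fix - old across the n attack runs."""
    diffs = [f - o for f, o in zip(fix_scores, old_scores)]
    return sum(diffs) / len(diffs)

def rms_diff(fix_scores, old_scores):
    """Root-mean-square of elementwise differences: a magnitude-of-change
    measure to compare against seed-to-seed noise (the b2 - b1 row)."""
    diffs = [f - o for f, o in zip(fix_scores, old_scores)]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))
```

Mean diff near zero with RMS comparable to the seed-to-seed row is what "within noise" means here.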

@tomtseng tomtseng changed the base branch from main to tomtseng/chat2 February 14, 2026 00:58
@tomtseng tomtseng force-pushed the tomtseng/chat-attack branch from bd11c13 to ec287da Compare February 14, 2026 04:28
@tomtseng tomtseng marked this pull request as ready for review February 14, 2026 05:03
@tomtseng tomtseng requested a review from sdhossain February 14, 2026 05:04
@sdhossain sdhossain mentioned this pull request Feb 24, 2026
tomtseng and others added 6 commits March 10, 2026 11:02
we seem to get stronger results if we don't apply chat template
Temporarily switch add_generation_prompt from True to False so the
assistant prefix is included in the completion (matching the old
string-concat behavior). This lets us isolate whether the boundary
change is responsible for the StrongREJECT regression.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@sdhossain sdhossain left a comment


lgtm -- quite clean! if we could have backwards compatibility, that would be nice.

regression_threshold=args.regression_threshold,
)


Collaborator

did we / can we run this?

Collaborator

oh i guess we'd need to run the models beforehand.

Collaborator Author

yup I didn't add them into the PR description yet, thanks for the reminder and I've added them now

Uses ``tokenizer.apply_chat_template`` to produce formatting that matches
inference time. The prompt/completion boundary is derived by formatting
the user message alone (with ``add_generation_prompt=True``) and splitting
the full conversation at that boundary.
Collaborator

can we also add a note on how different this is from our string concatenation?

also an idea for backwards compatibility: we could add an arg like legacy_template where, if that is not None, it uses the old method?

Collaborator Author

Maybe we just hold on merging these chat templating PRs until next time we're ready to re-run experiments.

I think the reason to add backwards compatibility would be: you're in the middle of running a set of experiments, then I merge these PRs, and now you can't pull from main anymore without pulling in changes that make your set of experiments inconsistent.

If we think these changes are simpler, more logically correct, and what we should default to, then keeping backwards compatibility means maintaining a legacy codepath, which can get messy and confusing if it's not necessary.
If we did want to keep backwards compatibility I think the changes would be:

  • allow an add_generation_prompt arg to format_prompt_completion to address the SFT loss boundary change of this PR
  • add a flag to toggle templating in StrongREJECT

and then a few other small changes (our hand-written llama3 template was slightly different from the actual llama3 chat template) would also get merged in without a flag.

Not a terrible set of changes, but it takes plumbing through some params.

Base automatically changed from tomtseng/chat2 to main April 20, 2026 22:42
@tomtseng tomtseng merged commit ca4fade into main Apr 20, 2026
2 checks passed
@tomtseng tomtseng deleted the tomtseng/chat-attack branch April 20, 2026 22:42