
attacks: Unify chat templating#97

Merged
tomtseng merged 23 commits into main from tomtseng/chat-attack
Apr 20, 2026

Conversation

@tomtseng tomtseng commented Feb 14, 2026

Summary

Convert attack training data formatting from manual f-strings to tokenizer.apply_chat_template(), then remove the now-unused user_prefix / assistant_prefix / end_turn fields from ModelConfig.

Parent PR: #95

Motivation

PR #95 unified all eval prompt formatting onto apply_chat_template via configurable Jinja2 templates. However, the attack training data pipelines still formatted prompts using manual f-strings with model_config.user_prefix / assistant_prefix / end_turn:

# Old approach — duplicated template knowledge in f-strings
prompt = f"{user_prefix}{data_point['prompt']}{end_turn}"
completion = f"{assistant_prefix}{data_point['rejected']}{end_turn}"

This meant ModelConfig had to carry three string fields just to support this formatting path. It's nicer to have attacks also go through apply_chat_template, reducing the number of distinct chat templating code paths.

Changes

  • New helper (evals/utils.py): format_prompt_completion(user_content, assistant_content, tokenizer) uses the chat template to create the prompt and completion fields that exactly match inference-time formatting.
  • ModelConfig simplified: drops user_prefix, assistant_prefix, and end_turn fields.
  • TextTemplate.to_jinja2(): strips trailing spaces from the assistant prefix in the add_generation_prompt block to avoid BPE tokenization boundary misalignment (see below).
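A minimal sketch of the boundary-splitting idea behind format_prompt_completion (the stub tokenizer and its toy template are illustrative stand-ins, not the actual evals/utils.py implementation):

```python
# Sketch of the format_prompt_completion idea. StubTokenizer is an
# illustrative stand-in for a real transformers tokenizer; the actual
# helper in evals/utils.py may differ in details.

class StubTokenizer:
    """Mimics tokenizer.apply_chat_template for a toy chat template."""

    def apply_chat_template(self, messages, tokenize=False, add_generation_prompt=False):
        out = ""
        for m in messages:
            role = "User" if m["role"] == "user" else "Assistant"
            out += f"{role}: {m['content']}<end>"
        if add_generation_prompt:
            # No trailing space here, mirroring the to_jinja2() fix that
            # avoids BPE tokenization boundary misalignment.
            out += "Assistant:"
        return out


def format_prompt_completion(user_content, assistant_content, tokenizer):
    """Derive the prompt/completion boundary from the chat template itself."""
    user_msgs = [{"role": "user", "content": user_content}]
    full_msgs = user_msgs + [{"role": "assistant", "content": assistant_content}]
    # Prompt = user turn plus the assistant prefix (add_generation_prompt=True),
    # exactly matching inference-time formatting.
    prompt = tokenizer.apply_chat_template(user_msgs, add_generation_prompt=True)
    full = tokenizer.apply_chat_template(full_msgs)
    assert full.startswith(prompt), "template must be prefix-consistent"
    return prompt, full[len(prompt):]
```

Because the same template renders both strings, prompt + completion is guaranteed to equal the full templated conversation.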

Behavior changes

  • ModelConfig API — Constructing a ModelConfig no longer requires or accepts user_prefix, assistant_prefix, or end_turn. Any external code or YAML configs passing these fields to ModelConfig.from_dict() will now raise ValueError.
  • SFT loss boundary shift for non-PLAIN templates — The old f-string approach put the assistant prefix (e.g., "Assistant: ") at the start of the completion field; the new format_prompt_completion uses add_generation_prompt=True, which places the assistant prefix at the end of the prompt instead. The full concatenated string is identical, but because SFTTrainer computes loss only on completion tokens, the model no longer receives gradient signal on the fixed assistant-prefix tokens. This is the more correct behavior since the assistant prefix is a fixed formatting token that doesn't need to be learned during fine-tuning.
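The boundary shift described above can be illustrated with made-up prefix strings (the prefixes here are not any model's actual template; only the relative placement matters):

```python
# Illustration of the SFT loss boundary shift; prefix strings are made up.
user_prefix, assistant_prefix, end_turn = "User: ", "Assistant:", "<end>"
q, a = "question", "refusal"

# Old f-string path: the assistant prefix starts the completion, so
# SFTTrainer also computed loss on these fixed formatting tokens.
old_prompt = f"{user_prefix}{q}{end_turn}"
old_completion = f"{assistant_prefix} {a}{end_turn}"

# New path (add_generation_prompt=True): the prefix ends the prompt, so
# loss is computed only on the assistant's actual content.
new_prompt = f"{user_prefix}{q}{end_turn}{assistant_prefix}"
new_completion = f" {a}{end_turn}"

# The concatenated training string is identical either way; only the
# loss boundary moves.
assert old_prompt + old_completion == new_prompt + new_completion
```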

Regression test results

Ran every attack (except embedding_attack) with best-trial hyperparameters from nov7_trial on Qwen3-0.6B and Qwen3-0.6B-Base. b1/b2 are the old code (commit 8c4cb2e) with seeds 42/123; f1/f2 are the PR branch (commit d72495c) with the same seeds.
Scripts: scripts/user/tomtseng/chat_template_refactor_testing_260217/run_regression.py and scripts/user/tomtseng/chat_template_refactor_testing_260217/analyze_results.py.

In summary, results look roughly within noise: some are higher, some are lower. Things I would flag:

  • Benign LoRA finetuning StrongREJECT is maybe worse, though benign attacks aren't a top priority anyway
  • Backdoor jailbreak-tuning might be worse on MMLU now, not sure why

StrongREJECT scores

| Model | Attack | b1 | b2 | f1 | f2 |
| --- | --- | --- | --- | --- | --- |
| qwen3_0_6b | backdoor_finetune | 0.588 | 0.629 | 0.605 | 0.548 |
| qwen3_0_6b | benign_full_parameter_finetune | 0.153 | 0.349 | 0.374 | 0.359 |
| qwen3_0_6b | benign_lora_finetune | 0.365 | 0.351 | 0.300 | 0.248 |
| qwen3_0_6b | competing_objectives_finetune | 0.630 | 0.599 | 0.559 | 0.570 |
| qwen3_0_6b | full_parameter_finetune | 0.510 | 0.414 | 0.524 | 0.568 |
| qwen3_0_6b | lora_finetune | 0.440 | 0.460 | 0.431 | 0.443 |
| qwen3_0_6b | multilingual_finetune | 0.490 | 0.443 | 0.496 | 0.452 |
| qwen3_0_6b | no_weight_modification | 0.199 | 0.198 | 0.206 | 0.202 |
| qwen3_0_6b | style_modulation_finetune | 0.626 | 0.305 | 0.580 | 0.583 |
| qwen3_0_6b_base | backdoor_finetune | 0.630 | 0.625 | 0.544 | 0.625 |
| qwen3_0_6b_base | benign_full_parameter_finetune | 0.473 | 0.440 | 0.495 | 0.448 |
| qwen3_0_6b_base | benign_lora_finetune | 0.419 | 0.380 | 0.198 | 0.089 |
| qwen3_0_6b_base | competing_objectives_finetune | 0.540 | 0.485 | 0.303 | 0.487 |
| qwen3_0_6b_base | full_parameter_finetune | 0.555 | 0.526 | 0.597 | 0.588 |
| qwen3_0_6b_base | lora_finetune | 0.574 | 0.529 | 0.597 | 0.594 |
| qwen3_0_6b_base | multilingual_finetune | 0.602 | 0.572 | 0.592 | 0.558 |
| qwen3_0_6b_base | no_weight_modification | 0.320 | 0.323 | 0.325 | 0.321 |
| qwen3_0_6b_base | style_modulation_finetune | 0.211 | 0.255 | 0.402 | 0.343 |

Summary (StrongREJECT):

| Comparison | n | Mean diff | RMS |
| --- | --- | --- | --- |
| f1 − b1 (fix vs old, seed 42) | 18 | −0.011 | 0.109 |
| f2 − b2 (fix vs old, seed 123) | 18 | +0.008 | 0.111 |
| b2 − b1 (old code noise, seeds 123 vs 42) | 18 | −0.025 | 0.097 |

MMLU-Pro scores

| Model | Attack | b1 | b2 | f1 | f2 |
| --- | --- | --- | --- | --- | --- |
| qwen3_0_6b | backdoor_finetune | 0.243 | 0.221 | 0.150 | 0.164 |
| qwen3_0_6b | benign_full_parameter_finetune | 0.171 | 0.243 | 0.171 | 0.221 |
| qwen3_0_6b | benign_lora_finetune | 0.179 | 0.164 | 0.207 | 0.186 |
| qwen3_0_6b | competing_objectives_finetune | 0.214 | 0.086 | 0.193 | 0.157 |
| qwen3_0_6b | full_parameter_finetune | 0.036 | 0.093 | 0.057 | 0.086 |
| qwen3_0_6b | lora_finetune | 0.193 | 0.214 | 0.193 | 0.243 |
| qwen3_0_6b | multilingual_finetune | 0.179 | 0.179 | 0.186 | 0.186 |
| qwen3_0_6b | no_weight_modification | 0.286 | 0.293 | 0.286 | 0.279 |
| qwen3_0_6b | style_modulation_finetune | 0.021 | 0.136 | 0.136 | 0.029 |
| qwen3_0_6b_base | backdoor_finetune | 0.171 | 0.193 | 0.093 | 0.064 |
| qwen3_0_6b_base | benign_full_parameter_finetune | 0.250 | 0.279 | 0.264 | 0.250 |
| qwen3_0_6b_base | benign_lora_finetune | 0.243 | 0.236 | 0.214 | 0.214 |
| qwen3_0_6b_base | competing_objectives_finetune | 0.271 | 0.193 | 0.186 | 0.207 |
| qwen3_0_6b_base | full_parameter_finetune | 0.250 | 0.243 | 0.279 | 0.271 |
| qwen3_0_6b_base | lora_finetune | 0.207 | 0.200 | 0.236 | 0.193 |
| qwen3_0_6b_base | multilingual_finetune | 0.271 | 0.271 | 0.207 | 0.279 |
| qwen3_0_6b_base | no_weight_modification | 0.314 | 0.300 | 0.307 | 0.329 |
| qwen3_0_6b_base | style_modulation_finetune | 0.250 | 0.236 | 0.243 | 0.236 |

Summary (MMLU-Pro):

| Comparison | n | Mean diff | RMS |
| --- | --- | --- | --- |
| f1 − b1 (fix vs old, seed 42) | 18 | −0.008 | 0.049 |
| f2 − b2 (fix vs old, seed 123) | 18 | −0.010 | 0.048 |
| b2 − b1 (old code noise, seeds 123 vs 42) | 18 | +0.002 | 0.051 |

Fix-vs-old RMS is comparable to old code's own seed-to-seed noise for both StrongREJECT and MMLU-Pro. No systematic regression.
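For reference, the Mean diff and RMS columns in the summary tables can be computed as follows (a sketch; the actual analyze_results.py script is not shown here):

```python
import math

def mean_diff(fix_scores, old_scores):
    """Mean of elementwise differences fix - old across the n attack runs."""
    diffs = [f - o for f, o in zip(fix_scores, old_scores)]
    return sum(diffs) / len(diffs)

def rms_diff(fix_scores, old_scores):
    """Root-mean-square of elementwise differences: a magnitude-of-change
    measure to compare against seed-to-seed noise (the b2 - b1 row)."""
    diffs = [f - o for f, o in zip(fix_scores, old_scores)]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))
```

Mean diff near zero with RMS comparable to the seed-to-seed row is what "within noise" means here.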

@tomtseng tomtseng changed the base branch from main to tomtseng/chat2 February 14, 2026 00:58
@tomtseng tomtseng force-pushed the tomtseng/chat-attack branch from bd11c13 to ec287da Compare February 14, 2026 04:28
@tomtseng tomtseng marked this pull request as ready for review February 14, 2026 05:03
@tomtseng tomtseng requested a review from sdhossain February 14, 2026 05:04
@sdhossain sdhossain mentioned this pull request Feb 24, 2026
tomtseng and others added 6 commits March 10, 2026 11:02
we seem to get stronger results if we don't apply chat template
Temporarily switch add_generation_prompt from True to False so the
assistant prefix is included in the completion (matching the old
string-concat behavior). This lets us isolate whether the boundary
change is responsible for the StrongREJECT regression.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@sdhossain sdhossain left a comment


lgtm -- quite clean! if we could have backwards compatibility, that would be nice.

regression_threshold=args.regression_threshold,
)


Collaborator

did we / can we run this?

Collaborator

oh i guess we'd need to run the models beforehand.

Collaborator Author

yup I didn't add them into the PR description yet, thanks for the reminder and I've added them now

Uses ``tokenizer.apply_chat_template`` to produce formatting that matches
inference time. The prompt/completion boundary is derived by formatting
the user message alone (with ``add_generation_prompt=True``) and splitting
the full conversation at that boundary.
Collaborator

can we also add a note on how different this is from our string concatenation?

also an idea for backwards compatibility: we could add an arg like legacy_template where, if that is not None, it uses the old method?

Collaborator Author

Maybe we just hold on merging these chat templating PRs until next time we're ready to re-run experiments.

I think the reason to add backwards compatibility would be: you're in the middle of running a set of experiments, then I merge these PRs, and now you can't pull from main anymore without pulling in changes that make your set of experiments inconsistent.

If we think these changes are simpler, more logically correct, and what we should default to, then keeping backwards compatibility means maintaining a legacy codepath, which can get messy and confusing if it's not necessary.
If we did want to keep backwards compatibility I think the changes would be:

  • allow an add_generation_prompt arg to format_prompt_completion to address the SFT loss boundary change of this PR
  • add a flag to toggle templating in StrongREJECT

and then a few other small changes (our hand-written llama3 template was slightly different from the actual llama3 chat template) would also get merged in without a flag.

Not a terrible set of changes, but it takes plumbing through some params.

Base automatically changed from tomtseng/chat2 to main April 20, 2026 22:42
@tomtseng tomtseng merged commit ca4fade into main Apr 20, 2026
2 checks passed
@tomtseng tomtseng deleted the tomtseng/chat-attack branch April 20, 2026 22:42