we seem to get stronger results if we don't apply chat template
Temporarily switch add_generation_prompt from True to False so the assistant prefix is included in the completion (matching the old string-concat behavior). This lets us isolate whether the boundary change is responsible for the StrongREJECT regression. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
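The effect of this toggle can be illustrated with a toy template (a sketch; the real pipeline uses `tokenizer.apply_chat_template`, and the `"Assistant: "` prefix here is an assumed value for illustration):

```python
# Toy illustration of the add_generation_prompt toggle described above.
# The real code calls tokenizer.apply_chat_template; this stand-in just
# shows where the assistant prefix lands in each mode.

ASSISTANT_PREFIX = "Assistant: "  # assumed prefix, for illustration

def render(user_msg, add_generation_prompt):
    text = f"User: {user_msg}\n"
    if add_generation_prompt:
        text += ASSISTANT_PREFIX  # prefix becomes part of the *prompt*
    return text

# add_generation_prompt=True: the prefix sits at the end of the prompt,
# so completion-only SFT loss never sees it.
prompt_new = render("hi", add_generation_prompt=True)
completion_new = "hello"

# add_generation_prompt=False (old string-concat behavior): the
# completion must carry the prefix itself, so it receives loss.
prompt_old = render("hi", add_generation_prompt=False)
completion_old = ASSISTANT_PREFIX + "hello"

# The concatenated training string is identical either way; only the
# prompt/completion boundary moves.
assert prompt_new + completion_new == prompt_old + completion_old
```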
…for testing" This reverts commit 8f6a4ff.
…ly masking to be misaligned during training
…mplate" This reverts commit bb092e9.
sdhossain left a comment
lgtm -- quite clean! if we could have backwards compatibility, that would be nice.
    regression_threshold=args.regression_threshold,
)
did we / can we run this?
oh i guess we'd need to run the models beforehand.
yup I didn't add them into the PR description yet, thanks for the reminder and I've added them now
Uses ``tokenizer.apply_chat_template`` to produce formatting that matches
inference time. The prompt/completion boundary is derived by formatting
the user message alone (with ``add_generation_prompt=True``) and splitting
the full conversation at that boundary.
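The splitting trick this docstring describes can be sketched with a toy template (a sketch; the real helper calls `tokenizer.apply_chat_template`, and the toy template and role markers below are assumptions for illustration):

```python
# Illustrative sketch of deriving the prompt/completion boundary by
# formatting the user message alone with the generation prompt, then
# splitting the fully formatted conversation at that boundary.

def apply_chat_template(messages, add_generation_prompt=False):
    """Toy chat template: <|role|>content<|end|> per turn (an assumed
    stand-in for the tokenizer's real template)."""
    out = "".join(f"<|{m['role']}|>{m['content']}<|end|>" for m in messages)
    if add_generation_prompt:
        out += "<|assistant|>"
    return out

def format_prompt_completion(user_content, assistant_content):
    # The user turn plus the generation prompt *is* the boundary string.
    prompt = apply_chat_template(
        [{"role": "user", "content": user_content}],
        add_generation_prompt=True,
    )
    full = apply_chat_template([
        {"role": "user", "content": user_content},
        {"role": "assistant", "content": assistant_content},
    ])
    # This only works if the template is prefix-stable, i.e. the full
    # conversation starts with the generation-prompt-formatted prefix.
    assert full.startswith(prompt), "template must be prefix-stable"
    return {"prompt": prompt, "completion": full[len(prompt):]}

pair = format_prompt_completion("Hi", "Hello!")
# pair["prompt"]     == "<|user|>Hi<|end|><|assistant|>"
# pair["completion"] == "Hello!<|end|>"
```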
can we also add a note on how different this is from our string concatenation?
also an idea for backwards compatibility: we could add an arg like `legacy_template` where, if that is not None, it uses the old method?
Maybe we just hold on merging these chat templating PRs until next time we're ready to re-run experiments.
I think the reason to add backwards compatibility would be: you're in the middle of running a set of experiments, then I merge these PRs, and now you can't pull from main anymore without pulling in changes that make your set of experiments inconsistent.
If we think these changes are simpler, more logically correct, and what we should default to, then keeping backwards compatibility means maintaining a legacy codepath, which can get messy and confusing if it's not necessary.
If we did want to keep backwards compatibility I think the changes would be:
- allow an `add_generation_prompt` arg to `format_prompt_completion` to address the SFT loss boundary change of this PR
- add a flag to toggle templating in StrongREJECT
and then a few other small changes (our hand-written llama3 template was slightly different from the actual llama3 chat template) would also get merged in without a flag.
Not a terrible set of changes but takes plumbing through some params
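If we went that route, the plumbing might look roughly like this (a sketch of the hypothetical backwards-compatible signature discussed above; the `add_generation_prompt` parameter and its default are assumptions, not merged code):

```python
# Hypothetical backwards-compatibility sketch: thread an
# add_generation_prompt flag through format_prompt_completion so the old
# loss boundary (assistant prefix inside the completion) stays reachable.

def format_prompt_completion(user_content, assistant_content, tokenizer,
                             add_generation_prompt=True):
    messages = [{"role": "user", "content": user_content}]
    # With add_generation_prompt=False, the prompt ends right after the
    # user turn, so the assistant prefix falls into the completion
    # (the old string-concat behavior).
    prompt = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=add_generation_prompt,
    )
    full = tokenizer.apply_chat_template(
        messages + [{"role": "assistant", "content": assistant_content}],
        tokenize=False,
    )
    return {"prompt": prompt, "completion": full[len(prompt):]}
```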
…rBench into tomtseng/chat-attack
Summary
Convert attack training data formatting from manual f-strings to `tokenizer.apply_chat_template()`, then remove the now-unused `user_prefix`/`assistant_prefix`/`end_turn` fields from `ModelConfig`.

Parent PR: #95
Motivation
PR #95 unified all eval prompt formatting onto `apply_chat_template` via configurable Jinja2 templates. However, the attack training data pipelines still formatted prompts using manual f-strings with `model_config.user_prefix`/`assistant_prefix`/`end_turn`. This meant `ModelConfig` had to carry three string fields to support this formatting path, but it's nicer to have attacks also go through `apply_chat_template` to reduce the number of different chat templating code paths.

Changes
- New helper (`evals/utils.py`): `format_prompt_completion(user_content, assistant_content, tokenizer)` uses the chat template to create the `prompt` and `completion` fields that exactly match inference-time formatting.
- `ModelConfig` simplified: drops the `user_prefix`, `assistant_prefix`, and `end_turn` fields.
- `TextTemplate.to_jinja2()`: strips trailing spaces from the assistant prefix in the `add_generation_prompt` block to avoid BPE tokenization boundary misalignment (see below).

Behavior changes
- `ModelConfig` API — Constructing a `ModelConfig` no longer requires or accepts `user_prefix`, `assistant_prefix`, or `end_turn`. Any external code or YAML configs passing these fields to `ModelConfig.from_dict()` will now raise `ValueError`.
- SFT loss boundary — The old code placed the assistant prefix (e.g. `"Assistant: "`) at the start of the `completion` field; the new `format_prompt_completion` uses `add_generation_prompt=True`, which places the assistant prefix at the end of the `prompt` instead. The full concatenated string is identical, but because `SFTTrainer` computes loss only on completion tokens, the model no longer receives gradient signal on the fixed assistant-prefix tokens. This is the more correct behavior since the assistant prefix is a fixed formatting token that doesn't need to be learned during fine-tuning.

Regression test results
Ran every attack (except `embedding_attack`) with best-trial hyperparameters from `nov7_trial` on Qwen3-0.6B and Qwen3-0.6B-Base. `b1`/`b2` are the old code (commit `8c4cb2e`) with seeds 42/123; `f1`/`f2` are the PR branch (commit `d72495c`) with the same seeds.

The scripts are `scripts/user/tomtseng/chat_template_refactor_testing_260217/run_regression.py` and `scripts/user/tomtseng/chat_template_refactor_testing_260217/analyze_results.py`.
In summary, things look roughly within noise; some are higher, some are lower. The things I would flag are:
StrongREJECT scores
Summary (StrongREJECT):
MMLU-Pro scores
Summary (MMLU-Pro):
Fix-vs-old RMS is comparable to old code's own seed-to-seed noise for both StrongREJECT and MMLU-Pro. No systematic regression.
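The noise comparison in the last line can be sketched as follows (a sketch; the score values are made up, and defining RMS as the root-mean-square difference of paired scores is an assumption about how the analysis script works):

```python
import math

def rms_diff(xs, ys):
    """Root-mean-square difference between two paired score lists."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(xs, ys)) / len(xs))

# Hypothetical per-attack scores, one entry per attack config.
b1 = [0.10, 0.20, 0.30]  # old code, seed 42
b2 = [0.12, 0.18, 0.33]  # old code, seed 123
f1 = [0.11, 0.21, 0.29]  # PR branch, seed 42

seed_noise = rms_diff(b1, b2)   # old code's own seed-to-seed noise
fix_vs_old = rms_diff(b1, f1)   # fix-vs-old difference
# "No systematic regression" here means fix_vs_old is comparable to
# (not clearly larger than) seed_noise.
```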