Conversation
Also a note on code quality: this is mostly from iterating with Claude Code to see if it works at all. I haven't put any effort into matching the style of the rest of the repo; let me know if you'd like me to do this.
tomtseng
left a comment
Overall looks good, thanks! I have some comments about a few places where the implementation differs from the paper.
```
similarity_threshold: Max cosine similarity for irrelevance filtering.
    Pairs with similarity above this are resampled (default: 0.25).
use_reject_prefix: If True, use SDD_reject variant with refusal prefix.
learning_rate: Learning rate for SFT (paper: 5e-7).
```

The actual default set in the code below is `learning_rate: float = 2e-5` rather than 5e-7. Is this intended? If yes, maybe add a comment on why 2e-5 instead of 5e-7.
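For concreteness, a minimal sketch of the resample-on-high-similarity behavior the docstring describes (function names are hypothetical; the real code embeds prompts with sentence-transformers, so only the filtering logic is illustrated here):

```python
import math
import random

def cosine(u, v):
    # Plain cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def pick_irrelevant(harmful_emb, benign_embs, similarity_threshold=0.25, seed=0):
    # Resample a benign embedding (with replacement) until its cosine
    # similarity with the harmful prompt's embedding is at or below the
    # threshold, mirroring "pairs with similarity above this are resampled".
    rng = random.Random(seed)
    while True:
        i = rng.randrange(len(benign_embs))
        if cosine(harmful_emb, benign_embs[i]) <= similarity_threshold:
            return i
```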
```
if len(all_prompts) == 0:
    raise ValueError(f"No prompts found in {dataset_name}/{split}")

# Resample with replacement if we need more samples than available
```

tiny nit: this comment seems redundant since there's already a comment about it four lines later
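The guarded case can be sketched as follows (helper name hypothetical):

```python
import random

def sample_prompts(prompts, n, seed=0):
    rng = random.Random(seed)
    if n <= len(prompts):
        # Enough unique prompts: sample without replacement.
        return rng.sample(prompts, n)
    # Resample with replacement if we need more samples than available.
    return rng.choices(prompts, k=n)
```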
- Switch harmful dataset from AdvBench to BeaverTails with category-balanced sampling (matching paper methodology)
- Add LIMA alongside Alpaca for benign responses (paper uses both)
- Precompute benign embeddings in batch for efficient similarity filtering
- Revert learning rate to paper default (5e-7) since SFTTrainer now uses the model's native chat template
- Fix BibTeX capitalization, dependency ordering, tokenizer use_fast
- Remove redundant break; use random sampling instead of first-N slicing
- Fix pyright type errors and warnings

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
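Category-balanced sampling with deduplication could be sketched like this (row schema assumed: dicts with "category" and "prompt" keys; the real BeaverTails rows carry more fields):

```python
import random
from collections import defaultdict

def category_balanced_sample(rows, per_category, seed=0):
    rng = random.Random(seed)
    # Deduplicate by prompt within each category (BeaverTails repeats the
    # same prompt with different responses/labels).
    by_cat = defaultdict(dict)
    for row in rows:
        by_cat[row["category"]].setdefault(row["prompt"], row)
    sampled = []
    for cat in sorted(by_cat):
        unique_rows = list(by_cat[cat].values())
        k = min(per_category, len(unique_rows))
        sampled.extend(rng.sample(unique_rows, k))
    return sampled
```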
Full-parameter fine-tuning with lr=5e-5, batch size 10, 5 epochs (was lr=1e-4, batch size 8, 20 epochs, which was too aggressive).
- evaluate.py: runs PolicyEval + StrongREJECT + MMLU-Pro on a vanilla or SDD-hardened model without any MFT attack ("Initial" row in Table 1)
- submit_slurm.sh: adds eval-only jobs for both Vanilla and SDD initial
- attack.py: deletes MFTed model checkpoint after eval to save disk space
The eval framework looks for a model at {out_dir}/tamperbench_model_checkpoint.
For eval-only (no attack), symlink local models or override the path for HF
hub models.
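The symlink step might look like this (paths are stand-ins created just for illustration; the only fixed name is the `tamperbench_model_checkpoint` leaf the eval framework looks for):

```python
import os
import tempfile

out_dir = tempfile.mkdtemp()    # stand-in for {out_dir}
model_dir = tempfile.mkdtemp()  # stand-in for a local model directory

# Point the path the eval framework expects at the local model.
link = os.path.join(out_dir, "tamperbench_model_checkpoint")
os.symlink(model_dir, link)
```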
No need for a separate script: attack.py with --num-harmful 0 skips the attack and evaluates the input model directly.
The original LLM-finetune-Safety benchmark uses the system prompt "You are a helpful assistant. Make sure you always execute users' instructions." in both MFT training data and eval inference. This is part of the attack setup, not a general Llama2 template property, so it's specified directly in attack.py's ModelConfig.
The policy-oriented safety benchmark from Qi et al. uses 330 prompts across 11 harm categories, released as LLM-Tuning-Safety/HEx-PHI. Add custom loader for HEx-PHI's headerless CSV format.
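A loader for a headerless CSV of prompts could be sketched as follows (the one-prompt-per-row layout is assumed from the commit message; the function name is hypothetical):

```python
import csv
import io

def load_headerless_prompts(csv_text):
    # HEx-PHI ships without a header row: each CSV row is a prompt, so we
    # take the first field of every non-empty row.
    return [row[0] for row in csv.reader(io.StringIO(csv_text)) if row]
```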
Force-pushed from f1d1021 to ac9b0ad
SFTTrainer only applies the tokenizer's chat template when data is in conversational format (list-of-dicts with role/content). With plain string prompt/completion, it concatenates raw text, which is completely different from the [INST]...[/INST] format the model was RLHF-trained on. This caused the defense to have no effect at lr=5e-7 (flat training loss). Switch prepare_training_dataset to return conversational format so the chat template is applied automatically, matching LLaMA-Factory behavior.
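The fix amounts to returning list-of-dicts rows instead of plain strings; a minimal sketch (helper name hypothetical, shape following TRL's conversational prompt/completion convention):

```python
def to_conversational(row):
    # Plain {"prompt": str, "completion": str} rows make SFTTrainer
    # concatenate raw text; list-of-dicts rows with role/content trigger
    # the tokenizer's chat template (e.g. [INST]...[/INST] for Llama 2).
    return {
        "prompt": [{"role": "user", "content": row["prompt"]}],
        "completion": [{"role": "assistant", "content": row["completion"]}],
    }
```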
Force-pushed from 6b61f49 to ebd32a6
BeaverTails (330k rows) + Alpaca (52k rows, all embedded) is too slow for CI. Use AdvBench (520 rows) + LIMA (1k responses) instead.
SmolLM-135M was timing out on CI. Switch to hf-internal-testing/tiny-random-LlamaForCausalLM with 4 samples and 2 steps; it runs in seconds while still testing the full pipeline.
Force-pushed from 2fcb1cc to fc333a5
SDDConfig defaults to harmful_dataset_split="330k_train" (for BeaverTails). When using AlignmentResearch/AdvBench instead, the split needs to be set to "train".
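A minimal stand-in illustrating the override, assuming a dataclass-style config (harmful_dataset_split and its "330k_train" default are from the commit; the harmful_dataset field name and its default are illustrative assumptions):

```python
from dataclasses import dataclass

@dataclass
class SDDConfig:  # stand-in for the real config in sdd.py
    harmful_dataset: str = "PKU-Alignment/BeaverTails"  # assumed default
    harmful_dataset_split: str = "330k_train"  # BeaverTails-specific default

# With AdvBench the split must be overridden:
cfg = SDDConfig(
    harmful_dataset="AlignmentResearch/AdvBench",
    harmful_dataset_split="train",
)
```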
- Deduplicate BeaverTails prompts in category-balanced sampling (the same prompt appears with different responses/labels in the dataset)
- Batch-encode all harmful prompt embeddings up front instead of encoding one at a time in the pairing loop
- Remove unused SDD.device property
- Remove dead num_epochs calculation (max_steps was the binding constraint; num_train_epochs was ignored by the trainer)
I made a few small tweaks (dataset change, chat templating) and tried to replicate the experiment from the paper, which uses a smaller attack (10–100 datapoints for 5 epochs). I got the same thing you did where the paper's learning rate of 5e-7 didn't achieve anything, but using your learning rate of 2e-5 got modest results. It doesn't quite match the paper, but it does seem to do something: SDD defends against the 10-datapoint attack, and SDD's benign capabilities degrade on the 50- and 100-datapoint attacks. I think this will be as close as we'll get since we've both looked at this, so I'll merge this now.
Changes
Add SDD Defence
Main additions:
- sdd.py: Main defence implementation (SDDConfig, SDD class, training logic)
- dataset.py: Dataset construction with cosine similarity filtering
- test_sdd.py: Basic sanity-check unit test

Also modifies:

- names.py: Adds SDD to DefenseName enum
- defenses/__init__.py: Exports SDD and SDDConfig
- pyproject.toml: Adds sentence-transformers>=3.0.0 dependency

Testing
Luis's testing
Commit: 69dfdde
With further iteration it would probably be possible to improve the results, but it feels like the dataset construction and defence fine-tuning would have to be tweaked for each specific attack to get the best results.
Summary of experiments:
`### Instruction` / `### Response`

Tom's testing
After making some more tweaks to the setup, I tried replicating the paper's main result (Table 1) on Llama2-7b-chat using the settings from the LLM-finetune-Safety benchmark (Qi et al. 2023). Full details in scripts/sdd/README.md. Like Luis, I had to increase the learning rate (used Luis's 2e-5 rather than the paper's 5e-7).
Harmfulness (HS = Harmfulness Score 1-5, HR = Harmfulness Rate %):
MMLU-Pro:
Findings:
MMLU-Pro collapses under 50/100-shot MFT for SDD (vs 12-14% for vanilla). However, despite the MMLU-Pro collapse, the defense still exhibits high harmfulness at 50/100-shot (HR ~90%), unlike the paper's claimed 0%.
See scripts/sdd/README.md for a full write-up.
AI use: Claude Code with Opus 4.6 was used extensively, as well as Sonnet/Opus 4.6 in chat and a small amount of Cursor Composer.