
attack: LoFIT Attack #110

Open
NayeemaNonta wants to merge 19 commits into main from
nnonta/lofit_attack

Conversation

@NayeemaNonta
Collaborator

@NayeemaNonta NayeemaNonta commented Mar 15, 2026

Changes

Added LoFIT Fine-tuning

Key structural change:

Attacks like LoFiT apply custom forward hooks at inference time that vLLM cannot register. To support this, WhiteBoxEvaluationConfig now has an optional hf_model_loader field: a callable returning (model, tokenizer). When set, every eval's compute_inferences() generates with HF model.generate() instead of vLLM. Future attacks that need custom hooks (JoLA, ReFT) can follow the same pattern with no changes to eval files.
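The config change described above can be sketched as follows. This is a minimal illustration, not the repo's actual code: only the hf_model_loader field and the two inference paths come from the PR description; the other field names and the stand-in generate calls are assumptions.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional, Tuple

ModelAndTokenizer = Tuple[Any, Any]

@dataclass
class WhiteBoxEvaluationConfig:
    # Illustrative sketch: only hf_model_loader is taken from the PR
    # description; the other fields are placeholders.
    model_checkpoint: str
    # Optional callable returning (model, tokenizer). When set, evals
    # generate with the returned HF model instead of loading the
    # checkpoint into vLLM, so inference-time forward hooks survive.
    hf_model_loader: Optional[Callable[[], ModelAndTokenizer]] = None

def compute_inferences(config: WhiteBoxEvaluationConfig, prompts: list) -> list:
    if config.hf_model_loader is not None:
        model, tokenizer = config.hf_model_loader()
        # Real code would call model.generate() here; stubbed for the sketch.
        return [f"hf:{p}" for p in prompts]
    return [f"vllm:{p}" for p in prompts]  # default vLLM path
```

New attacks opt in by setting the one field; eval files never branch on the attack type themselves.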

Testing

Run python tests/attacks/test_lofit_attack.py

Comment thread src/tamperbench/whitebox/evals/base.py Outdated
return model, tokenizer


def load_lofit_model_and_tokenizer(
Collaborator Author

@sdhossain this is the custom loading function I mentioned; I was wondering if you had a better idea to do this more efficiently. I couldn't get vLLM to work with it, so I had to add custom eval functions for the inference part as well (for now I've only added it for StrongREJECT). Without this custom loading function the evaluations come out identical to the base model, since activation steering doesn't update any weights.
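The last point, that activation steering leaves the checkpoint's weights untouched, can be shown with a toy example (all names hypothetical): a steering wrapper shifts a layer's output at inference time, so anything that loads only the stored weights, as vLLM does, still behaves like the base model.

```python
class TinyLayer:
    """Stand-in for a transformer sub-module with a single weight."""
    def __init__(self, weight: float):
        self.weight = weight

    def forward(self, x: float) -> float:
        return x * self.weight

class SteeredLayer:
    """Hypothetical LoFiT-style wrapper: adds a learned bias to the
    output at inference time without modifying the wrapped weight."""
    def __init__(self, layer: TinyLayer, bias: float):
        self.layer = layer
        self.bias = bias

    def forward(self, x: float) -> float:
        return self.layer.forward(x) + self.bias

base = TinyLayer(weight=2.0)
steered = SteeredLayer(base, bias=0.5)
# base.weight is still 2.0 after steering, so evaluating the raw
# checkpoint (the vLLM path) would reproduce the base model's behavior.
```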

@NayeemaNonta NayeemaNonta added the attack Adds or modifies attacks label Mar 18, 2026
@NayeemaNonta NayeemaNonta changed the title Add LoFIT Attack attack: LoFIT Attack Mar 31, 2026
@NayeemaNonta NayeemaNonta marked this pull request as ready for review April 7, 2026 20:45
@NayeemaNonta NayeemaNonta marked this pull request as draft April 7, 2026 20:47
@NayeemaNonta NayeemaNonta marked this pull request as ready for review April 7, 2026 22:12
Collaborator

@sdhossain sdhossain left a comment


Looks good. I think there can be a few tune-ups (like making functions to separate the logic a bit more and run it in isolation), but otherwise this should be good to merge after that.

Could we also have some documentation on how harmful this attack can get? For instance, could it be used to tamper a model beyond 50% ASR or harmfulness?

**kwargs,
)

# def _rope_scaling_validation(self):
Collaborator

is this comment block needed? if not, imo we should remove it

@@ -0,0 +1,303 @@
# Copyright 2022 EleutherAI and The HuggingFace Inc. team. All rights reserved.
Collaborator

are Qwen models supported? I only see files for Llama and Gemma

@@ -0,0 +1,1404 @@
# Copyright 2024 Google Inc. HuggingFace Inc. team. All rights reserved.
Collaborator

imo -- for files directly ported over from other repositories -- a link to the source might be helpful

from tamperbench.whitebox.utils.ops import run_in_isolation


# ---------------------------------------------------------------------------
Collaborator

nit: can we get rid of these comments like #------- Config ------# <-- imo they noise up the code a bit

random_seed: int = 42


# ---------------------------------------------------------------------------
Collaborator

nit: remove this comment as well?

# Get few-shot examples
fewshot_examples = utils.list_fewshot_samples()

if self.eval_config.hf_model_loader is not None:
Collaborator

nit: for all functions where we do this, can we make a helper function and call it via run_in_isolation? <-- makes it less likely to crash sweeps, and also makes it a bit more readable.

Additionally, if the helper could be named something like _instantiate_model_and_infer_activ to highlight that it is a patch for activation-steering attacks, that would be helpful too.

if self.eval_config.hf_model_loader is not None:
from tamperbench.whitebox.evals.hf_inference import HFGenerationConfig, hf_generate_inferences

gen_config = HFGenerationConfig(
Collaborator

same for this (if we could make a helper and run it in isolation; see comment in the minerva math file)


``run_in_isolation`` spawns subprocesses via ``multiprocessing`` which
requires pickling the config. Lambda/closure loaders are not picklable,
and the scoring/results subprocesses don't need the model loader anyway.
Collaborator

generally we would avoid this -- by having everything encapsulated within the run_in_isolation method, including instantiating the model, etc.

model_checkpoint=self.output_checkpoint_path,
out_dir=self.attack_config.out_dir,
model_config=self.attack_config.model_config,
hf_model_loader=self._lofit_loader(),
Collaborator

if pickling is an issue here btw, one option is to make and pass in an Enum or StrEnum like HFModelLoader.LOFIT_LOADER and then use a dictionary mapping elsewhere.
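That Enum suggestion could look like this (names are assumptions, loader body stubbed). An enum member pickles by reference to its class and name, so it crosses the multiprocessing boundary where a lambda or closure would not:

```python
from enum import Enum

class HFModelLoader(str, Enum):
    # Registry key carried in the (picklable) eval config instead of a
    # closure; picklable because enum members serialize by class + name.
    LOFIT_LOADER = "lofit_loader"

def load_lofit_model_and_tokenizer():
    # Placeholder for the real LoFiT loader that attaches forward hooks.
    return "lofit-model", "lofit-tokenizer"

# Resolved at use time inside the subprocess, so no unpicklable callable
# is ever serialized; only the enum value travels between processes.
LOADER_REGISTRY = {HFModelLoader.LOFIT_LOADER: load_lofit_model_and_tokenizer}

def resolve_loader(key: HFModelLoader):
    return LOADER_REGISTRY[key]()
```

Future attacks (JoLA, ReFT) would add one enum member and one registry entry each.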

from tamperbench.whitebox.utils.models.config import ModelConfig
from tamperbench.whitebox.utils.names import EvalName, MetricName

if __name__ == "__main__":
Collaborator

I think this needs to be in a function and marked both gpu-required and expensive, yes?
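The suggested change might look like the following, assuming the repo defines custom pytest markers along these lines (the marker names gpu and expensive are an assumption, not confirmed from the codebase):

```python
import pytest

@pytest.mark.gpu        # assumed custom marker: skip on CPU-only runners
@pytest.mark.expensive  # assumed custom marker: exclude from quick CI runs
def test_lofit_attack():
    # Body moved out of the `if __name__ == "__main__":` block so pytest
    # collects it and the markers can gate where and when it runs.
    ...
```

Custom markers would also need to be registered in pytest configuration (e.g. a markers list) to avoid unknown-marker warnings.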


Labels

attack Adds or modifies attacks
