
attack: LoFIT Attack #110

Open
NayeemaNonta wants to merge 19 commits into main from
nnonta/lofit_attack

Conversation

@NayeemaNonta
Collaborator

@NayeemaNonta NayeemaNonta commented Mar 15, 2026

Changes

Added LoFIT Fine-tuning

Key structural change:

Attacks like LoFiT apply custom forward hooks at inference time that vLLM cannot register. To support this, WhiteBoxEvaluationConfig now has an optional hf_model_loader field: a callable returning (model, tokenizer). When set, every eval's compute_inferences() generates with HF model.generate() instead of vLLM. Future attacks that need custom hooks (JoLA, ReFT) can follow the same pattern with no changes to eval files.
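The config change described above can be sketched as follows. This is a minimal illustration, not the repo's actual code: only the hf_model_loader field and the two inference paths come from the PR description; the other field names and the stand-in generate calls are assumptions.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional, Tuple

ModelAndTokenizer = Tuple[Any, Any]

@dataclass
class WhiteBoxEvaluationConfig:
    # Illustrative sketch: only hf_model_loader is taken from the PR
    # description; the other fields are placeholders.
    model_checkpoint: str
    # Optional callable returning (model, tokenizer). When set, evals
    # generate with the returned HF model instead of loading the
    # checkpoint into vLLM, so inference-time forward hooks survive.
    hf_model_loader: Optional[Callable[[], ModelAndTokenizer]] = None

def compute_inferences(config: WhiteBoxEvaluationConfig, prompts: list) -> list:
    if config.hf_model_loader is not None:
        model, tokenizer = config.hf_model_loader()
        # Real code would call model.generate() here; stubbed for the sketch.
        return [f"hf:{p}" for p in prompts]
    return [f"vllm:{p}" for p in prompts]  # default vLLM path
```

New attacks opt in by setting the one field; eval files never branch on the attack type themselves.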

Testing

Run python tests/attacks/test_lofit_attack.py

Comment thread src/tamperbench/whitebox/evals/base.py Outdated
return model, tokenizer


def load_lofit_model_and_tokenizer(
Collaborator Author

@sdhossain this is the custom loading function I mentioned; I was wondering if you had a better idea to do this more efficiently. I couldn't get vLLM to work with it, so I had to add custom eval functions for the inference part as well (for now I've only added it for StrongREJECT). Without this custom loading function the evaluations come out identical to the base model, since activation steering doesn't update any weights.
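The last point, that activation steering leaves the checkpoint's weights untouched, can be shown with a toy example (all names hypothetical): a steering wrapper shifts a layer's output at inference time, so anything that loads only the stored weights, as vLLM does, still behaves like the base model.

```python
class TinyLayer:
    """Stand-in for a transformer sub-module with a single weight."""
    def __init__(self, weight: float):
        self.weight = weight

    def forward(self, x: float) -> float:
        return x * self.weight

class SteeredLayer:
    """Hypothetical LoFiT-style wrapper: adds a learned bias to the
    output at inference time without modifying the wrapped weight."""
    def __init__(self, layer: TinyLayer, bias: float):
        self.layer = layer
        self.bias = bias

    def forward(self, x: float) -> float:
        return self.layer.forward(x) + self.bias

base = TinyLayer(weight=2.0)
steered = SteeredLayer(base, bias=0.5)
# base.weight is still 2.0 after steering, so evaluating the raw
# checkpoint (the vLLM path) would reproduce the base model's behavior.
```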

@NayeemaNonta NayeemaNonta added the attack Adds or modifies attacks label Mar 18, 2026
@NayeemaNonta NayeemaNonta changed the title Add LoFIT Attack attack: LoFIT Attack Mar 31, 2026
@NayeemaNonta NayeemaNonta marked this pull request as ready for review April 7, 2026 20:45
@NayeemaNonta NayeemaNonta marked this pull request as draft April 7, 2026 20:47
@NayeemaNonta NayeemaNonta marked this pull request as ready for review April 7, 2026 22:12
Collaborator

@sdhossain sdhossain left a comment


Looks good. I think there can be a few tune-ups (like making functions to separate the logic a bit more and run it in isolation), but otherwise this should be good to merge after that.

Could we also have some documentation on how harmful this attack can get? For instance, could it be used to tamper a model beyond 50% ASR or harmfulness?

**kwargs,
)

# def _rope_scaling_validation(self):
Collaborator

is this comment block needed? if not, imo we should remove it

@@ -0,0 +1,303 @@
# Copyright 2022 EleutherAI and The HuggingFace Inc. team. All rights reserved.
Collaborator

are Qwen models supported? I only see files for Llama and Gemma

@@ -0,0 +1,1404 @@
# Copyright 2024 Google Inc. HuggingFace Inc. team. All rights reserved.
Collaborator

imo -- for files directly ported over from other repositories -- a link to the source might be helpful

from tamperbench.whitebox.utils.ops import run_in_isolation


# ---------------------------------------------------------------------------
Collaborator

nit: can we get rid of these comments like #------- Config ------# <-- imo they noise up the code a bit

random_seed: int = 42


# ---------------------------------------------------------------------------
Collaborator

nit: remove this comment as well?

# Get few-shot examples
fewshot_examples = utils.list_fewshot_samples()

if self.eval_config.hf_model_loader is not None:
Collaborator

nit: for all functions where we do this, can we make a helper function and call it via run_in_isolation? <-- makes it less likely to crash sweeps, and also makes it a bit more readable.

Additionally, if the helper could be named something like _instantiate_model_and_infer_activ to highlight that it is a patch for activation-steering attacks, that would be helpful too.

if self.eval_config.hf_model_loader is not None:
from tamperbench.whitebox.evals.hf_inference import HFGenerationConfig, hf_generate_inferences

gen_config = HFGenerationConfig(
Collaborator

same for this (if we could make a helper and run it in isolation; see comment in the minerva math file)


``run_in_isolation`` spawns subprocesses via ``multiprocessing`` which
requires pickling the config. Lambda/closure loaders are not picklable,
and the scoring/results subprocesses don't need the model loader anyway.
Collaborator

generally we would avoid this -- by having everything encapsulated within the run_in_isolation method, including instantiating the model, etc.

model_checkpoint=self.output_checkpoint_path,
out_dir=self.attack_config.out_dir,
model_config=self.attack_config.model_config,
hf_model_loader=self._lofit_loader(),
Collaborator

if pickling is an issue here btw, one option is to make and pass in an Enum or StrEnum like HFModelLoader.LOFIT_LOADER and then use a dictionary mapping elsewhere.
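That Enum suggestion could look like this (names are assumptions, loader body stubbed). An enum member pickles by reference to its class and name, so it crosses the multiprocessing boundary where a lambda or closure would not:

```python
from enum import Enum

class HFModelLoader(str, Enum):
    # Registry key carried in the (picklable) eval config instead of a
    # closure; picklable because enum members serialize by class + name.
    LOFIT_LOADER = "lofit_loader"

def load_lofit_model_and_tokenizer():
    # Placeholder for the real LoFiT loader that attaches forward hooks.
    return "lofit-model", "lofit-tokenizer"

# Resolved at use time inside the subprocess, so no unpicklable callable
# is ever serialized; only the enum value travels between processes.
LOADER_REGISTRY = {HFModelLoader.LOFIT_LOADER: load_lofit_model_and_tokenizer}

def resolve_loader(key: HFModelLoader):
    return LOADER_REGISTRY[key]()
```

Future attacks (JoLA, ReFT) would add one enum member and one registry entry each.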

from tamperbench.whitebox.utils.models.config import ModelConfig
from tamperbench.whitebox.utils.names import EvalName, MetricName

if __name__ == "__main__":
Collaborator

I think this needs to be in a function and marked both gpu-required and expensive, yes?
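The suggested change might look like the following, assuming the repo defines custom pytest markers along these lines (the marker names gpu and expensive are an assumption, not confirmed from the codebase):

```python
import pytest

@pytest.mark.gpu        # assumed custom marker: skip on CPU-only runners
@pytest.mark.expensive  # assumed custom marker: exclude from quick CI runs
def test_lofit_attack():
    # Body moved out of the `if __name__ == "__main__":` block so pytest
    # collects it and the markers can gate where and when it runs.
    ...
```

Custom markers would also need to be registered in pytest configuration (e.g. a markers list) to avoid unknown-marker warnings.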


Labels

attack Adds or modifies attacks
