Figure 1: Schematic overview of the safety mirage findings for safety fine-tuned VLMs.
This is the official code repository for the ICLR 2026 paper Safety Mirage: How Spurious Correlations Undermine VLM Safety Fine-tuning and Can Be Mitigated by Machine Unlearning.
- 🎉 Our other paper, on LLM unlearning detection, has also been accepted by ICLR! 📚
- 🏆 Congrats! Our paper Safety Mirage has been accepted by ICLR 2026! ✨
Our safety unlearning framework is built on LLaVA-1.5, so the required installation steps can also be found in that repository. Alternatively, you can follow the steps below:
- Clone this repository and navigate to the VLM-Safety-MU folder:

```bash
git clone https://github.com/OPTML-Group/VLM-Safety-MU
cd VLM-Safety-MU
```

- Install the package:

```bash
conda create -n llava python=3.10 -y
conda activate llava
pip install --upgrade pip  # enable PEP 660 support
pip install -e .
```

- Install additional packages for training:

```bash
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
```
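After installation, a quick sanity check (hypothetical; not part of the repo's scripts) can confirm that the editable install succeeded and that your GPUs are visible:

```python
# Hypothetical sanity check: confirm the editable install of the llava package
# and that PyTorch can see the GPUs used by the training scripts.
import torch
import llava  # installed above via `pip install -e .`

print(f"CUDA available: {torch.cuda.is_available()}")
print(f"Visible GPUs:   {torch.cuda.device_count()}")
```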
The forget and retain datasets are derived from the VLGuard dataset. For the full data preparation pipeline, please refer to `data/data.md`.

Before running the unlearning fine-tuning, place the generated training data (the forget/retain JSON files) and the VLGuard training images into the corresponding folders specified in the training scripts.
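For illustration only, here is a minimal sketch of the LLaVA-style conversation format that such JSON files follow; the id, image path, and contents below are hypothetical, and the real files come from the pipeline in `data/data.md`:

```python
import json

# Hypothetical example entry in the LLaVA-1.5 conversation format; the actual
# forget/retain records are produced by the data preparation pipeline.
forget_example = [
    {
        "id": "vlguard_000001",                # hypothetical sample id
        "image": "vlguard_images/000001.jpg",  # relative to the image folder
        "conversations": [
            {"from": "human", "value": "<image>\nDescribe the content of this image."},
            {"from": "gpt", "value": "I'm sorry, but I can't help with that."},
        ],
    }
]

with open("train_forget.json", "w") as f:
    json.dump(forget_example, f, indent=2)
```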
Our base model LLaVA-1.5 will be downloaded automatically when you run the training scripts. No action is needed.
We support two unlearning algorithms: NPO (Negative Preference Optimization) and RMU (Representation Misdirection Unlearning).
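To make the two objectives concrete, here is a minimal PyTorch sketch of the standard NPO forget loss (Zhang et al., 2024) and RMU-style representation losses (Li et al., 2024). This illustrates the published formulations, not this repo's implementation; the tensor shapes and layer choices are assumptions:

```python
import torch
import torch.nn.functional as F

def npo_forget_loss(logp_theta, logp_ref, beta=0.1):
    """Standard NPO loss on forget data.

    logp_theta / logp_ref: per-example sequence log-probabilities under the
    current and frozen reference models. beta corresponds to --npo_beta.
    """
    log_ratio = logp_theta - logp_ref
    # NPO: -(2 / beta) * E[log sigmoid(-beta * log(pi_theta / pi_ref))]
    return -(2.0 / beta) * F.logsigmoid(-beta * log_ratio).mean()

def rmu_losses(h_forget, h_retain, h_retain_frozen, control):
    """RMU-style losses on hidden states at a chosen layer.

    Forget: push forget-data representations toward a (scaled) random control
    vector. Retain: keep retain-data representations close to the frozen
    model's (retain term weighted by --rmu_retain_alpha in training).
    """
    forget_loss = F.mse_loss(h_forget, control.expand_as(h_forget))
    retain_loss = F.mse_loss(h_retain, h_retain_frozen)
    return forget_loss, retain_loss
```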
For full-parameter unlearning fine-tuning, run:

```bash
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/v1_5/finetune_unlearn.sh
```

For LoRA unlearning fine-tuning, run:

```bash
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/v1_5/finetune_unlearn_lora.sh
```

Here are some unlearning-related options to note:
- `--unlearn_type`: unlearning algorithm type; either `npo` or `rmu`.
- `--rmu_llava_loss_weight`: weight for the LLaVA training loss on the retain data.
- `--rmu_retain_alpha`: weight for the RMU loss on the retain data.
- `--npo_beta`: balancing parameter for the NPO algorithm.
- `--npo_forget_alpha`: weight for the NPO loss on the forget data.
- `--npo_llava_loss_weight`: weight for the LLaVA training loss on the retain data.
The data paths and the output directory must also be specified.
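For reference, here is a minimal sketch of how these options might be declared; the default values are hypothetical, and the authoritative definitions live in the training code:

```python
import argparse

# Hypothetical declaration of the unlearn-specific options listed above;
# defaults are illustrative, not the repo's actual values.
parser = argparse.ArgumentParser()
parser.add_argument("--unlearn_type", choices=["npo", "rmu"], required=True,
                    help="unlearning algorithm")
parser.add_argument("--rmu_llava_loss_weight", type=float, default=1.0,
                    help="LLaVA training loss weight on retain data (RMU)")
parser.add_argument("--rmu_retain_alpha", type=float, default=1.0,
                    help="RMU loss weight on retain data")
parser.add_argument("--npo_beta", type=float, default=0.1,
                    help="NPO balancing parameter")
parser.add_argument("--npo_forget_alpha", type=float, default=1.0,
                    help="NPO loss weight on forget data")
parser.add_argument("--npo_llava_loss_weight", type=float, default=1.0,
                    help="LLaVA training loss weight on retain data (NPO)")
args = parser.parse_args()
```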
`scripts/v1_5/finetune_unlearn_npo.sh` is the dedicated script for running a single NPO fine-tune with full-parameter training and DeepSpeed ZeRO-3:

```bash
bash scripts/v1_5/finetune_unlearn_npo.sh
```

Data paths and the output directory are controlled by variables at the top of the script:

```bash
RETAIN_DATA_PATH="../unlearn_data_npo/train_retain_mixed.json"
FORGET_DATA_PATH="../unlearn_data_npo/train_forget.json"
OUT_DIR="./checkpoints_npo/..."
```

The `eval/` folder contains the test data and evaluation scripts used to measure model safety. See `eval/evaluation.md` for full details.
- Test data — the VLGuard test split (`eval/data/test.json`) for normal inputs, and one-word jailbreak variants (`eval/safety_data/`) with 1-shot and 3-shot attack prefixes.
- Model inference (`eval/run_eval_each_model.sh`) — runs `VLGuard_eval.py` across all models and evaluation cases. Supports optional question sampling:

  ```bash
  bash eval/run_eval_each_model.sh      # use all questions (default)
  bash eval/run_eval_each_model.sh 128  # sample 128 questions per dataset
  ```

- Post-evaluation (`eval/run_post_eval.sh`) — computes the rejection rate via keyword matching (`eval/llm-eval/rejection_eval.py`) and the LLM-judged ASR via Qwen2.5-VL (`eval/llm-eval/llm-judge.py`, `eval/llm-eval/llm-judge-asr-3shot.py`).
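As a rough illustration of the keyword-matching idea behind the rejection-rate metric, here is a minimal sketch; the actual phrase list and file handling live in `eval/llm-eval/rejection_eval.py`, and the keywords below are assumptions:

```python
# Hypothetical refusal phrases; the real list is defined in rejection_eval.py.
REFUSAL_KEYWORDS = ["i'm sorry", "i cannot", "i can't", "i apologize"]

def rejection_rate(responses):
    """Fraction of responses containing any refusal keyword."""
    refused = sum(
        any(k in r.lower() for k in REFUSAL_KEYWORDS) for r in responses
    )
    return refused / max(len(responses), 1)

# Example usage with hypothetical model outputs:
outputs = ["I'm sorry, but I can't help with that.", "Sure, here is the answer."]
print(f"rejection rate: {rejection_rate(outputs):.2f}")
```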
If you find our code or paper helpful, please cite our work:
```bibtex
@article{chen2025safety,
  title={Safety Mirage: How Spurious Correlations Undermine VLM Safety Fine-Tuning and Can Be Mitigated by Machine Unlearning},
  author={Chen, Yiwei and Yao, Yuguang and Zhang, Yihua and Shen, Bingquan and Liu, Gaowen and Liu, Sijia},
  journal={arXiv preprint arXiv:2503.11832},
  year={2025}
}
```
