Skip to content

OPTML-Group/VLM-Safety-Unlearn

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

22 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Safety Mirage: How Spurious Correlations Undermine VLM Safety Fine-tuning and Can Be Mitigated by Machine Unlearning

teaser
Figure 1: Schematic overview of safety mirage findings of safety fine-tuned VLM.

This is the official code repository for the ICLR 2026 paper Safety Mirage: How Spurious Correlations Undermine VLM Safety Fine-tuning and Can Be Mitigated by Machine Unlearning.

News

Installation

Our safety-unlearn framework has been developed on the LLaVA-1.5, so the require installments could also be found from here. Also, you could use following steps:

  1. Clone this repository and navigate to LLaVA folder
git clone https://github.com/OPTML-Group/VLM-Safety-MU
cd VLM-Safety-MU
  1. Install Package
conda create -n llava python=3.10 -y
conda activate llava
pip install --upgrade pip  # enable PEP 660 support
pip install -e .
  1. Install additional packages for training cases
pip install -e ".[train]"
pip install flash-attn --no-build-isolation

Data Preparation

The forget and retain datasets are derived from the VLGuard dataset. For the full data preparation pipeline, please refer to data/data.md.

Place the previously generated training data (forget/retain JSON files) and the VLGuard training images into the corresponding folders specified in the training scripts before running unlearning fine-tune.

Unlearning Fine-tune

Our base model LLaVA-1.5 will be downloaded automatically when you run the training scripts. No action is needed.

We support two unlearning algorithms: NPO (Negative Preference Optimization) and RMU (Representation Mismatch Unlearning).

Full-parameter and LoRA Variants

For full-parameter unlearning fine-tune, you should run

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/v1_5/finetune_unlearn.sh

For LoRA unlearning fine-tune, you should run

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/v1_5/finetune_unlearn_lora.sh

Here are some unlearn related options to note:

  • --unlearn_type: unlearning algorithm type, which could be npo or rmu.
  • --rmu_llava_loss_weight: weight for LLaVA training loss on the retain data.
  • --rmu_retain_alpha: weight for RMU loss on the retain data.
  • --npo_beta: balancing parameter for the NPO algorithm.
  • --npo_forget_alpha: weight for NPO loss on the forget data.
  • --npo_llava_loss_weight: weight for LLaVA training loss on the retain data.

Also, the data path and the output directory should also be specified.

NPO Training

scripts/v1_5/finetune_unlearn_npo.sh is the dedicated script for running a single NPO fine-tune with full-parameter training and DeepSpeed ZeRO-3:

bash scripts/v1_5/finetune_unlearn_npo.sh

Data paths and output directory are controlled by variables at the top of the script:

RETAIN_DATA_PATH="../unlearn_data_npo/train_retain_mixed.json"
FORGET_DATA_PATH="../unlearn_data_npo/train_forget.json"
OUT_DIR="./checkpoints_npo/..."

Evaluation

The eval/ folder contains the test data and evaluation scripts used to measure model safety. See eval/evaluation.md for full details.

  • Test data — VLGuard test split (eval/data/test.json) for normal inputs, and one-word jailbreak variants (eval/safety_data/) with 1-shot and 3-shot attack prefixes.
  • Model inference (eval/run_eval_each_model.sh) — runs VLGuard_eval.py across all models and evaluation cases. Supports optional question sampling:
    bash eval/run_eval_each_model.sh          # use all questions (default)
    bash eval/run_eval_each_model.sh 128      # sample 128 questions per dataset
  • Post-evaluation (eval/run_post_eval.sh) — computes rejection rate via keyword matching (eval/llm-eval/rejection_eval.py) and LLM-judged ASR via Qwen2.5-VL (eval/llm-eval/llm-judge.py, eval/llm-eval/llm-judge-asr-3shot.py).

Cite This Work

If you found our code or paper helpful, please cite our work~

@article{chen2025safety,
  title={Safety Mirage: How Spurious Correlations Undermine VLM Safety Fine-Tuning and Can Be Mitigated by Machine Unlearning},
  author={Chen, Yiwei and Yao, Yuguang and Zhang, Yihua and Shen, Bingquan and Liu, Gaowen and Liu, Sijia},
  journal={arXiv preprint arXiv:2503.11832},
  year={2025}
}