Official code for the NeurIPS 2025 paper “CF-VLM: CounterFactual Vision-Language Fine-tuning”.
Recent years have witnessed remarkable progress in Vision-Language Models (VLMs) for cross-modal semantic understanding. However, they still struggle with fine-grained discrimination and deep causal reasoning tasks. Existing VLMs often rely on surface-level statistical correlations, failing to capture the causal logic between vision and text.
To address this, we propose CounterFactual Vision-Language Fine-tuning (CF-VLM): by injecting targeted counterfactual samples, CF-VLM sharpens the model's sensitivity to the uniqueness and stability of factual image-caption pairs and to key causal micro-edits, without disrupting basic cross-modal alignment. This improves compositional reasoning, generalization, and factual consistency. Extensive experiments show that CF-VLM outperforms strong baselines and state-of-the-art methods across multiple reasoning benchmarks, while also alleviating visual hallucination.
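The exact training objective is given in the paper; purely as an illustration of the idea, a counterfactual-aware fine-tuning signal can be sketched as a margin loss that requires the factual caption to score higher against the image than a minimally edited counterfactual caption. The function names and the margin value below are illustrative assumptions, not the paper's actual formulation:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def counterfactual_margin_loss(img_emb, factual_emb, counterfactual_emb, margin=0.2):
    """Hinge penalty when a counterfactual caption scores too close to the factual one.

    img_emb:            embedding of the image
    factual_emb:        embedding of the original (factual) caption
    counterfactual_emb: embedding of a caption with one causal detail edited
    (illustrative sketch, not the loss defined in the paper)
    """
    s_pos = cosine(img_emb, factual_emb)
    s_neg = cosine(img_emb, counterfactual_emb)
    return max(0.0, margin - (s_pos - s_neg))

# Toy 2-D embeddings: the image aligns with the factual caption.
img = np.array([1.0, 0.0])
factual = np.array([0.9, 0.1])
counterfactual = np.array([0.1, 0.9])
print(counterfactual_margin_loss(img, factual, counterfactual))  # 0.0: already separated by more than the margin
```

Pushing the factual-vs-counterfactual gap above a margin, rather than only maximizing factual similarity, is what makes the model attend to the single edited detail.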
Please refer to the paper for theoretical details and full experiments.
- Installation
- Environment Requirements
- Data Preparation
- Project Structure
- Quick Start
- Training
- Inference & Evaluation
- FAQ
- License
- Citation
- Acknowledgement
- Contact
## Installation

- Clone the repository

  ```bash
  git clone https://github.com/your_org/CF-VLM.git
  cd CF-VLM
  ```

- (Optional) Create a virtual environment

  ```bash
  python3 -m venv .venv
  source .venv/bin/activate    # Linux / macOS
  # .\.venv\Scripts\activate   # Windows PowerShell
  ```

- Install dependencies

  ```bash
  pip install -r requirements.txt
  ```

- Configure the Qwen2.5-VL inference model

  Please follow the official documentation for deployment and API setup:
  👉 Qwen official documentation: https://github.com/QwenLM/Qwen2.5-VL
## Environment Requirements

- Python 3.9+
- PyTorch 2.1+
- CUDA 11.8+
- NVIDIA GPU (A100 80GB recommended)
- Dependencies listed in `requirements.txt`
## Data Preparation

Run `process.py` to generate counterfactual data:

```bash
python process.py --input_path data/raw --output_path data/counterfactual --num_workers 8 --seed 42
```

## Project Structure

```
CF-VLM/
├─ process.py
├─ clip_train.py
├─ Qwen_train.py
├─ requirements.txt
├─ README.md
└─ README_zh.md
```
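The on-disk format of the generated data is defined by `process.py`; purely as a hypothetical illustration, one record pairing an image with its factual caption and a few counterfactual edits might look like the following (all field names here are assumptions, not the script's actual schema):

```python
import json

# Hypothetical record layout for one counterfactual sample.
# Field names are illustrative only; consult process.py for the real schema.
record = {
    "image": "data/raw/images/000001.jpg",
    "factual_caption": "A red car parked next to a tree.",
    "counterfactual_captions": [
        {"caption": "A blue car parked next to a tree.", "edit": "color"},
        {"caption": "A red car parked next to a fence.", "edit": "object"},
    ],
}

# Records like this are often stored one per line (JSON Lines) for streaming.
line = json.dumps(record)
restored = json.loads(line)
print(restored["counterfactual_captions"][0]["edit"])  # color
```

Keeping each counterfactual tagged with the kind of micro-edit it applies makes it easy to analyze which edit types the fine-tuned model discriminates well.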
## Quick Start

- Generate counterfactual data

  ```bash
  python process.py
  ```

- Train the CLIP model

  ```bash
  python clip_train.py
  ```

- Train the Qwen model

  ```bash
  python Qwen_train.py
  ```
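Evaluation details are given in the paper; as a rough sketch of one natural sanity check after training, the snippet below computes pairwise ranking accuracy, i.e. how often the factual caption scores above every counterfactual edit of the same image. The similarity scores here are toy numbers standing in for model outputs, and the function name is illustrative, not part of this repository:

```python
def pairwise_accuracy(samples):
    """Fraction of samples whose factual score beats all counterfactual scores.

    Each sample is (factual_score, [counterfactual_scores]).
    Illustrative metric; not a script from this repository.
    """
    correct = sum(1 for pos, negs in samples if all(pos > neg for neg in negs))
    return correct / len(samples)

# Toy image-caption similarity scores.
samples = [
    (0.91, [0.40, 0.35]),  # factual caption wins clearly
    (0.55, [0.60, 0.20]),  # a counterfactual outranks the factual caption
    (0.73, [0.70, 0.69]),  # factual caption wins, narrowly
]
print(pairwise_accuracy(samples))  # 2/3 of the samples ranked correctly
```

A model that relies on surface-level correlations tends to fail exactly on the narrow-margin cases, which is what counterfactual fine-tuning targets.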
## Citation

If you find this project useful, please cite:

```bibtex
@misc{zhang2025cfvlmcounterfactualvisionlanguagefinetuning,
      title={CF-VLM: CounterFactual Vision-Language Fine-tuning},
      author={Jusheng Zhang and Kaitong Cai and Yijia Fan and Jian Wang and Keze Wang},
      year={2025},
      eprint={2506.17267},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2506.17267},
}
```