Official code for the NeurIPS 2025 paper “CF-VLM: CounterFactual Vision-Language Fine-tuning”.
Recent years have witnessed remarkable progress in Vision-Language Models (VLMs) for cross-modal semantic understanding. However, they still struggle with fine-grained discrimination and deep causal reasoning tasks. Existing VLMs often rely on surface-level statistical correlations, failing to capture the causal logic between vision and text.
To address this, we propose CounterFactual Vision-Language Fine-tuning (CF-VLM): by injecting targeted counterfactual samples, CF-VLM sharpens the model's sensitivity to the uniqueness and stability of factual image-caption pairs and to key causal micro-edits, without disrupting basic cross-modal alignment. This improves compositional reasoning, generalization, and factual consistency. Extensive experiments show that CF-VLM outperforms strong baselines and state-of-the-art methods across multiple reasoning benchmarks, while also alleviating visual hallucination.
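The exact training objective is given in the paper; purely as an illustration of the idea, a counterfactual-aware fine-tuning signal can be sketched as a margin loss that requires the factual caption to score higher against the image than a minimally edited counterfactual caption. The function names and the margin value below are illustrative assumptions, not the paper's actual formulation:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def counterfactual_margin_loss(img_emb, factual_emb, counterfactual_emb, margin=0.2):
    """Hinge penalty when a counterfactual caption scores too close to the factual one.

    img_emb:            embedding of the image
    factual_emb:        embedding of the original (factual) caption
    counterfactual_emb: embedding of a caption with one causal detail edited
    (illustrative sketch, not the loss defined in the paper)
    """
    s_pos = cosine(img_emb, factual_emb)
    s_neg = cosine(img_emb, counterfactual_emb)
    return max(0.0, margin - (s_pos - s_neg))

# Toy 2-D embeddings: the image aligns with the factual caption.
img = np.array([1.0, 0.0])
factual = np.array([0.9, 0.1])
counterfactual = np.array([0.1, 0.9])
print(counterfactual_margin_loss(img, factual, counterfactual))  # 0.0: already separated by more than the margin
```

Pushing the factual-vs-counterfactual gap above a margin, rather than only maximizing factual similarity, is what makes the model attend to the single edited detail.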
Please refer to the paper for theoretical details and full experiments.
- Installation
- Environment Requirements
- Data Preparation
- Project Structure
- Quick Start
- Training
- Inference & Evaluation
- FAQ
- License
- Citation
- Acknowledgement
- Contact
## Installation

- Clone the repository

  ```bash
  git clone https://github.com/your_org/CF-VLM.git
  cd CF-VLM
  ```

- (Optional) Create a virtual environment

  ```bash
  python3 -m venv .venv
  source .venv/bin/activate    # Linux / macOS
  # .\.venv\Scripts\activate   # Windows PowerShell
  ```

- Install dependencies

  ```bash
  pip install -r requirements.txt
  ```

- Configure the Qwen2.5-VL inference model

  Please follow the official documentation for deployment and API setup:
  👉 Qwen official documentation: https://github.com/QwenLM/Qwen2.5-VL
## Environment Requirements

- Python 3.9+
- PyTorch 2.1+
- CUDA 11.8+
- NVIDIA GPU (A100 80GB recommended)
- Dependencies listed in `requirements.txt`
## Data Preparation

Run `process.py` to generate counterfactual data:

```bash
python process.py --input_path data/raw --output_path data/counterfactual --num_workers 8 --seed 42
```

## Project Structure

```
CF-VLM/
├─ process.py
├─ clip_train.py
├─ Qwen_train.py
├─ requirements.txt
├─ README.md
└─ README_zh.md
```
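The on-disk format of the generated data is defined by `process.py`; purely as a hypothetical illustration, one record pairing an image with its factual caption and a few counterfactual edits might look like the following (all field names here are assumptions, not the script's actual schema):

```python
import json

# Hypothetical record layout for one counterfactual sample.
# Field names are illustrative only; consult process.py for the real schema.
record = {
    "image": "data/raw/images/000001.jpg",
    "factual_caption": "A red car parked next to a tree.",
    "counterfactual_captions": [
        {"caption": "A blue car parked next to a tree.", "edit": "color"},
        {"caption": "A red car parked next to a fence.", "edit": "object"},
    ],
}

# Records like this are often stored one per line (JSON Lines) for streaming.
line = json.dumps(record)
restored = json.loads(line)
print(restored["counterfactual_captions"][0]["edit"])  # color
```

Keeping each counterfactual tagged with the kind of micro-edit it applies makes it easy to analyze which edit types the fine-tuned model discriminates well.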
## Quick Start

- Generate counterfactual data

  ```bash
  python process.py
  ```

- Train the CLIP model

  ```bash
  python clip_train.py
  ```

- Train the Qwen model

  ```bash
  python Qwen_train.py
  ```
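Evaluation details are given in the paper; as a rough sketch of one natural sanity check after training, the snippet below computes pairwise ranking accuracy, i.e. how often the factual caption scores above every counterfactual edit of the same image. The similarity scores here are toy numbers standing in for model outputs, and the function name is illustrative, not part of this repository:

```python
def pairwise_accuracy(samples):
    """Fraction of samples whose factual score beats all counterfactual scores.

    Each sample is (factual_score, [counterfactual_scores]).
    Illustrative metric; not a script from this repository.
    """
    correct = sum(1 for pos, negs in samples if all(pos > neg for neg in negs))
    return correct / len(samples)

# Toy image-caption similarity scores.
samples = [
    (0.91, [0.40, 0.35]),  # factual caption wins clearly
    (0.55, [0.60, 0.20]),  # a counterfactual outranks the factual caption
    (0.73, [0.70, 0.69]),  # factual caption wins, narrowly
]
print(pairwise_accuracy(samples))  # 2/3 of the samples ranked correctly
```

A model that relies on surface-level correlations tends to fail exactly on the narrow-margin cases, which is what counterfactual fine-tuning targets.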
## Citation

If you find this project useful, please cite:

```bibtex
@misc{zhang2025cfvlmcounterfactualvisionlanguagefinetuning,
      title={CF-VLM: CounterFactual Vision-Language Fine-tuning},
      author={Jusheng Zhang and Kaitong Cai and Yijia Fan and Jian Wang and Keze Wang},
      year={2025},
      eprint={2506.17267},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2506.17267},
}
```