DeepScan: A Training-Free Framework for Visually Grounded Reasoning in Large Vision-Language Models
Official implementation of the CVPR 2026 paper
TL;DR. DeepScan is a training-free framework for visually grounded reasoning in LVLMs. Instead of relying on brittle one-shot, coarse-to-fine localization, it adopts a bottom-up pipeline: Hierarchical Scanning for cue discovery and evidence recovery, Refocusing for context-optimal evidence views, and Evidence-Enhanced Reasoning for final answer generation from multi-granular evidence memory. DeepScan achieves 90.6% on V* with Qwen2.5-VL-7B, and scales consistently across LVLM architectures and model sizes.
- TODO. Release the evaluation scripts.
- 2026-03. The core codebase is open-sourced!
- 2026-02. DeepScan was accepted to CVPR 2026 main track.
Humans often solve challenging visual problems in a bottom-up manner: they first identify subtle local cues, then recover the full evidence from those cues, and finally reason over the recovered evidence. DeepScan is built on the same intuition.
DeepScan contains three tightly coupled stages:
- **Hierarchical Scanning**
- Partition the image into local patches.
- Use a search expert to produce patch-wise attention maps.
- Convert connected cue regions into point-based proxies using both semantic saliency and topological interiority.
- Recover image-level evidence via point-prompt segmentation, followed by morphological post-processing.
- Retain only the top-k smallest evidence candidates for efficient evidence judgment.
- **Refocusing**
- Starting from the fused evidence crop, search over a concise set of candidate views.
- Use Zoom-In and Zoom-Out actions to calibrate the surrounding context.
- Select the smallest view that still fully contains the evidence needed for answering.
- **Evidence-Enhanced Reasoning**
- Build a Hybrid Evidence Memory composed of:
- fine-grained evidence crops from Hierarchical Scanning, and
- a coarse-grained refined view from Refocusing.
- Materialize them as an ordered multi-image prompt for the LVLM.
- Generate answers that are both more accurate and better grounded in the visual evidence.
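As a concrete illustration, the cue-to-point step of Hierarchical Scanning can be sketched as follows. This is a minimal, dependency-free sketch, not the repository's implementation: the function name `extract_point_proxies`, the threshold `tau`, and 4-connectivity are assumptions for illustration.

```python
from collections import deque

def extract_point_proxies(attn, tau=0.5, k=2):
    """Convert thresholded attention regions into point proxies.

    Sketch of the assumed cue-to-point behaviour:
      1. binarise the attention map at `tau` (semantic saliency),
      2. group salient cells into 4-connected regions,
      3. pick each region's most interior cell (topological interiority),
      4. keep the k smallest regions as evidence candidates.
    Returns a list of (row, col) points, smallest regions first.
    """
    h, w = len(attn), len(attn[0])
    mask = [[attn[r][c] >= tau for c in range(w)] for r in range(h)]
    seen = [[False] * w for _ in range(h)]
    regions = []
    for r0 in range(h):
        for c0 in range(w):
            if mask[r0][c0] and not seen[r0][c0]:
                # BFS to collect one connected cue region.
                region, q = [], deque([(r0, c0)])
                seen[r0][c0] = True
                while q:
                    r, c = q.popleft()
                    region.append((r, c))
                    for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        nr, nc = r + dr, c + dc
                        if 0 <= nr < h and 0 <= nc < w and mask[nr][nc] and not seen[nr][nc]:
                            seen[nr][nc] = True
                            q.append((nr, nc))
                regions.append(region)

    def interior_point(region):
        cells = set(region)
        # Distance to the region boundary via multi-source BFS from cells
        # adjacent to the outside; the farthest cell is the most interior.
        dist = {p: 0 for p in region
                if any((p[0] + dr, p[1] + dc) not in cells
                       for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)))}
        q = deque(dist)
        while q:
            r, c = q.popleft()
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                n = (r + dr, c + dc)
                if n in cells and n not in dist:
                    dist[n] = dist[(r, c)] + 1
                    q.append(n)
        return max(region, key=lambda p: dist.get(p, 0))

    regions.sort(key=len)  # retain the smallest candidates first
    return [interior_point(reg) for reg in regions[:k]]
```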
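The Refocusing search can be sketched in the same spirit. Here `is_complete` stands in for the LVLM's view-completeness query, and the candidate `scales` (including a zoom-in factor below 1) are hypothetical values, not the paper's actual schedule.

```python
def refocus(evidence_box, image_size, is_complete, scales=(0.75, 1.0, 1.5, 2.5)):
    """Pick the smallest candidate view that still passes a completeness check.

    `evidence_box` is (x0, y0, x1, y1) for the fused evidence crop; candidate
    views scale the crop about its centre (small scales = Zoom-In, large
    scales = Zoom-Out) and are tried smallest-first, so the first view that
    passes the check is the smallest complete one.
    """
    W, H = image_size
    x0, y0, x1, y1 = evidence_box
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
    w, h = x1 - x0, y1 - y0
    for s in scales:
        nw, nh = w * s, h * s
        view = (max(0, int(cx - nw / 2)), max(0, int(cy - nh / 2)),
                min(W, int(cx + nw / 2)), min(H, int(cy + nh / 2)))
        if is_complete(view):
            return view
    return (0, 0, W, H)  # fall back to the full image
```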
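Finally, materializing the Hybrid Evidence Memory as an ordered multi-image prompt might look like the following. The OpenAI-style chat schema used here is an assumption for illustration; the repository's exact message format may differ.

```python
def build_evidence_prompt(question, evidence_crops, refined_view):
    """Materialise the Hybrid Evidence Memory as an ordered multi-image prompt.

    Fine-grained crops from Hierarchical Scanning come first, followed by the
    coarse-grained refined view from Refocusing, then the question text.
    """
    content = []
    for i, crop in enumerate(evidence_crops):
        content.append({"type": "text", "text": f"Evidence crop {i + 1}:"})
        content.append({"type": "image_url", "image_url": {"url": crop}})
    content.append({"type": "text", "text": "Refined view of the scene:"})
    content.append({"type": "image_url", "image_url": {"url": refined_view}})
    content.append({"type": "text", "text": question})
    return [{"role": "user", "content": content}]
```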
Unlike RL-based visually grounded reasoning methods, DeepScan is plug-and-play and training-free. It can be integrated with different LVLM backbones without additional adaptation cost.
DeepScan/
├── scripts/
│   ├── blip_server/      # Search-expert service (BLIP-ITM + Grad-CAM attention)
│   ├── expert_server/    # Visual-expert service (LangSAM-based detection)
│   ├── lmm_server/       # LVLM serving scripts (e.g., LLaVA / Qwen backends)
│   ├── sam2_server/      # SAM2 point-prompt segmentation service
│   ├── pope/
│   └── vstar/
├── src/
│   ├── eval.py           # Evaluation script for prediction files
│   ├── qwen_runtime.py   # Local Qwen-based runtime for LVLM querying
│   ├── run.py            # Main evaluation / inference entry point
│   ├── utils.py          # Common utilities
│   └── policies/
│       ├── deepscan.py   # DeepScan policy implementation
│       ├── visual_grounding.py
│       ├── control_point_sam.py
│       ├── client.py
│       ├── mstc.py
│       └── ...
└── README.md
git clone https://github.com/YChenL/DeepScan
cd DeepScan

We recommend Python 3.10+.

conda create -n deepscan python=3.10 -y
conda activate deepscan

This codebase is built around a service-oriented pipeline. At minimum, you will need PyTorch, Transformers, OpenCV, FastAPI, and the supporting packages used by the search / visual experts and LVLM runtime.

pip install -r requirements.txt

Depending on your local setup, you will also need the expert-side dependencies used in this repository:

# Search expert
pip install salesforce-lavis

# Visual expert
pip install lang-sam

# SAM2 backend
# Install from your local / official SAM2 checkout as needed.

The provided code snapshot contains several environment-specific local paths / placeholders that should be updated before launch. In particular, check:

- scripts/blip_server/blip_service.py
- scripts/expert_server/model_service.py
- scripts/sam2_server/sam2_service.py
- scripts/lmm_server/llava_server.sh
- scripts/lmm_server/qwen_server.sh
Before running the pipeline, replace local placeholders with the actual paths for:
- the BLIP-ITM tokenizer / checkpoint,
- the LangSAM / GroundingDINO / SAM2 checkpoints,
- the SAM2 repository root / config / weights,
- and the LVLM checkpoint you want to serve.
DeepScan augments an LVLM with two plug-and-play experts:
The paper uses BLIP-ITM as the search expert to produce patch-wise Grad-CAM attention maps for local cue exploration.
The visual expert exposes two primitives:
- point-prompt segmentation, and
- text-conditioned detection.
In the paper, DeepScan uses LangSAM as the visual expert. In this repository snapshot, the visual grounding pipeline is implemented through the combination of:
- a LangSAM-based detection service, and
- a SAM2 point-prompt segmentation service.
The paper evaluates DeepScan on five LVLMs:
- LLaVA-1.5-7B
- Qwen2-VL-7B
- Qwen2.5-VL-7B
- Qwen2.5-VL-32B
- Qwen2.5-VL-72B
This repository also includes example serving scripts for LLaVA / Qwen-style backends under scripts/lmm_server/.
DeepScan is organized as a multi-service inference pipeline. In a typical setup, you should launch:
- the search-expert server,
- the visual-expert server,
- the SAM2 segmentation server, and
- the LVLM server / runtime.
The corresponding launch scripts are under:
scripts/blip_server/
scripts/expert_server/
scripts/sam2_server/
scripts/lmm_server/
Please adapt ports, checkpoint paths, and CUDA device assignment to your environment before starting them.
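Once the services are up, each one is reached over HTTP. The helper below is a generic sketch of such a client; the route names, ports, and payload schemas are placeholders, not the repository's actual API.

```python
import json
import urllib.request

def service_url(host, port, route):
    """Normalise a service endpoint URL."""
    return f"http://{host}:{port}/{route.lstrip('/')}"

def query_service(host, port, route, payload, timeout=120):
    """POST a JSON payload to one of the pipeline's expert services.

    Generic stdlib-only client; replace host/port/route with the values
    configured in your launch scripts.
    """
    req = urllib.request.Request(
        service_url(host, port, route),
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```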
The main evaluation entry point is src/run.py.
python src/run.py \
--model-path Qwen/Qwen2.5-VL-7B-Instruct \
--question-file path/to/questions.tsv \
--answers-file outputs/deepscan_predictions.jsonl \
--method_name deepscan \
--temperature 0.0

Useful arguments include:
--model-path LVLM checkpoint / served model name
--question-file Input question file (TSV)
--answers-file Output prediction file (JSONL)
--method_name Method name, e.g. deepscan
--num-chunks Number of data chunks for parallel evaluation
--chunk-idx Current chunk index
--temperature Sampling temperature
--image-size Image resize limit used by the client runtime
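The `--num-chunks` / `--chunk-idx` pair implies a data split like the one sketched below (the exact slicing in src/run.py may differ): contiguous, near-equal chunks that together cover every question exactly once, so each worker processes chunk `--chunk-idx` out of `--num-chunks`.

```python
def get_chunk(questions, num_chunks, chunk_idx):
    """Select this worker's share of the questions for parallel evaluation."""
    n = len(questions)
    size = -(-n // num_chunks)  # ceiling division
    start = chunk_idx * size
    return questions[start:start + size]
```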
After inference, you can evaluate predictions with:
python src/eval.py --path outputs/deepscan_predictions.jsonl

DeepScan relies on three lightweight LVLM query templates:
- **Evidence Decomposition**
- Extract the objects mentioned in the question.
- Used to decide whether the question is single-object or multi-object, and thus which patch size to use.
- **Evidence Judgment**
- Judge whether a cropped evidence candidate actually contains clues for answering the question.
- **View Completeness Justification**
- Judge whether a refocused view fully contains every target object without truncation.
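To make the three templates concrete, here are illustrative stand-ins. The exact wording used by the repository will differ; these strings are assumptions for illustration only.

```python
# Hypothetical paraphrases of the three query templates described above.
TEMPLATES = {
    "evidence_decomposition": (
        "List the objects mentioned in this question, separated by commas.\n"
        "Question: {question}"
    ),
    "evidence_judgment": (
        "Does this image crop contain clues for answering the question below?\n"
        "Answer yes or no.\nQuestion: {question}"
    ),
    "view_completeness": (
        "Does this view fully contain every target object ({targets}) "
        "without truncation? Answer yes or no."
    ),
}

def render(name, **fields):
    """Fill one of the templates above with question-specific fields."""
    return TEMPLATES[name].format(**fields)
```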
DeepScan provides strong gains on fine-grained and visually grounded reasoning benchmarks.
- V* (Qwen2.5-VL-7B backbone): 90.6% overall
- 93.0% Attribute
- 86.8% Spatial
- Improvement over vanilla Qwen2.5-VL-7B:
- +16.3% on V*
- +5.5% on TreeBench
- HR-Bench:
- 75.0% on HR-4K
- 72.4% on HR-8K
- TreeBench:
- 42.5% overall
- 37.3 mIoU
- Scaling:
- DeepScan-72B reaches 94.2% on V* at k = ∞
DeepScan is also competitive with strong RL-based visually grounded reasoning methods while remaining fully training-free.
DeepScan is designed as a test-time scaling framework, so it introduces extra inference cost compared with vanilla one-shot inference. At the same time, it admits an explicit performance–efficiency trade-off through:
- the patch size,
- the number of retained evidence candidates (k), and
- the batched engineering optimizations described in the supplementary material.
In the optimized implementation discussed in the supplementary material, DeepScan benefits substantially from:
- batched attention-map computation,
- batched top-k evidence judgment,
- batched view justification, and
- vLLM-based serving.
These optimizations reduce the sequential overhead of visually grounded search and significantly improve throughput.
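The batched top-k evidence judgment, for instance, amounts to packing per-crop queries into grouped requests rather than issuing one LVLM call per candidate. The sketch below shows the idea generically (hypothetical prompt wording and payload schema, not the repository's code); a batch-aware backend such as vLLM can then process each group in one forward pass.

```python
def batch_judgment_prompts(question, crops, batch_size=8):
    """Pack top-k evidence-judgment queries into batched requests.

    Each crop gets the same judgment prompt; up to `batch_size` queries are
    grouped per request to amortise serving overhead.
    """
    prompt = f"Does this crop contain clues for answering: {question} Answer yes or no."
    queries = [{"image": crop, "prompt": prompt} for crop in crops]
    return [queries[i:i + batch_size] for i in range(0, len(queries), batch_size)]
```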
DeepScan builds on several excellent open-source projects. Special thanks to DyFo for its inspiring open-source release. We also acknowledge the following projects and model ecosystems:
- Qwen2-VL / Qwen2.5-VL
- LAVIS
- LangSAM
- SAM2
- vLLM
We thank the authors and maintainers of these projects for making their work available.
If you find DeepScan useful, please cite:
@article{li2026deepscan,
title={DeepScan: A Training-Free Framework for Visually Grounded Reasoning in Large Vision-Language Models},
author={Li, Yangfu and Zhan, Hongjian and Chen, Jiawei and Gong, Yuning and Liu, Qi and Lu, Yue},
journal={arXiv preprint arXiv:2603.03857},
year={2026}
}