We maintain the data collection for this repo on the Hugging Face Hub, which hosts all the training and evaluation data for RSCoVLM.
To make the data easy to download and use, we have uploaded all the processed images and annotations. We recommend downloading the entire dataset and extracting it; we strive to make it ready for immediate use. If you already have some of the images or annotations locally, you can exclude certain files during the download to save time. To make this convenient, we have compressed the images and annotations of each dataset separately.
If you encounter any issues, such as errors in the data, or have other questions about the dataset, feel free to contact me via GitHub issues (preferred) or email me directly. We will continue to maintain the dataset.
First, read the Statement and ToU. If you agree, go to our dataset page and click the button to request access; your request will be approved automatically. NOTE: submitting a request means that you agree to our ToU.
You can download the files with your web browser on the file page.
We recommend downloading in a terminal using huggingface_hub (pip install --upgrade huggingface_hub). Refer to the huggingface_hub documentation for more usage details.
# Log in to your Hugging Face account (if required; you can create a token at https://huggingface.co/settings/tokens)
hf login
# Set a Hugging Face mirror for users in mainland China (if required):
export HF_ENDPOINT=https://hf-mirror.com
# Download the whole repo (you can also change --local-dir to your own data path and create a symlink):
hf download Qingyun/remote-sensing-sft-data --repo-type dataset --local-dir ./playground/data
# If an error (such as a network error) interrupts the download, simply re-run the same command; a recent huggingface_hub will resume the download.
If you have already downloaded some of the data, you can exclude those files to save time. For example, you can exclude the DOTA (split_ss_dota) trainval images with the --exclude option. You can also download only certain files with the positional filenames argument or the --include option.
# This will skip the excluded files and download everything else (quote the glob so the shell does not expand it).
hf download Qingyun/remote-sensing-sft-data --repo-type dataset --local-dir ./playground/data --exclude "**split_ss_dota_trainval**"
# This will download a single file into the corresponding folder.
hf download Qingyun/remote-sensing-sft-data split_ss_dota/trainval/split_ss_dota_trainval_annfiles.tar.gz --repo-type dataset --local-dir ./playground/data
# This will download the matching files, arranged as in the repo.
hf download Qingyun/remote-sensing-sft-data --repo-type dataset --local-dir ./playground/data --include "**split_ss_dota_trainval**"
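If you prefer a programmatic interface over the CLI, the same filtered download can be expressed with huggingface_hub's snapshot_download. This is a sketch, not part of the repo; the function name and the local_dir default are illustrative, and it requires pip install --upgrade huggingface_hub:

```python
# Programmatic download sketch using huggingface_hub's snapshot_download
# (hypothetical helper; mirrors the `hf download --include` example above).
from huggingface_hub import snapshot_download


def download_dota_trainval(local_dir: str = "./playground/data") -> str:
    """Download only the DOTA trainval archives from the dataset repo."""
    return snapshot_download(
        repo_id="Qingyun/remote-sensing-sft-data",
        repo_type="dataset",
        local_dir=local_dir,
        allow_patterns=["**split_ss_dota_trainval**"],  # same glob as --include
    )


if __name__ == "__main__":
    print(download_dota_trainval())
```

Like the CLI, snapshot_download skips files that are already present, so re-running it after a network error resumes where it left off.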
Then, extract all files from the compressed files.
find . \( -name "*.tar.gz" -o -name "*.part0" \) -execdir bash -c 'f="$1"; if [[ "$f" == *.part0 ]]; then cat "$f" "$f.part1" | tar -zxvf -; else tar -zxvf "$f"; fi' _ {} \;
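If bash is unavailable, the same extraction logic can be sketched in pure Python with only the standard library. It assumes the same naming as the command above: a *.part0 file whose second half is the sibling file with ".part1" appended. Note that split archives are reassembled in memory, so very large archives need correspondingly large RAM:

```python
# Pure-Python extraction sketch, mirroring the find/cat one-liner above
# (stdlib only; assumes the *.part0 / *.part0.part1 naming of the command).
import io
import tarfile
from pathlib import Path


def extract_all(root: str = ".") -> None:
    """Extract every *.tar.gz under root, next to the archive itself.

    Split archives (a *.part0 file plus a companion with ".part1"
    appended) are concatenated in memory before extraction.
    """
    root_path = Path(root)
    for part0 in sorted(root_path.rglob("*.part0")):
        part1 = Path(str(part0) + ".part1")
        data = part0.read_bytes() + part1.read_bytes()
        with tarfile.open(fileobj=io.BytesIO(data), mode="r:gz") as tar:
            tar.extractall(part0.parent)
    for archive in sorted(root_path.rglob("*.tar.gz")):
        with tarfile.open(archive, mode="r:gz") as tar:
            tar.extractall(archive.parent)
```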
Finally, if desired, you can delete all the compressed files.
# list the files to delete for checking (if required)
find . -type f -name "*.tar.gz*" -print
# delete
find . -type f -name "*.tar.gz*" -exec rm -f {} \;
We support both the LLaVA-style conversations format and the Qwen-style OpenAI-like messages format:
- Each training sample of the conversations-format data is a dict.
Your annotation file should follow one of the two formats below:
- Single-image example (json or jsonl entry):
{
"image": "images/001.jpg",
"source_dataset": "sub_dataset",
"conversations": [
{
"from": "human",
"value": "<image>\nWhat's the main object in this picture?"
},
{
"from": "gpt",
"value": "A red apple on a wooden table"
}
]
}
- Multi-image example:
{
"images": ["cats/001.jpg", "cats/002.jpg"],
"source_dataset": ["sub_dataset1", "sub_dataset2"],
"conversations": [
{
"from": "human",
"value": "<image>\n<image>\nWhat are the differences between these two cats?"
},
{
"from": "gpt",
"value": "The first cat is an orange tabby with short fur and green eyes, while the second is a gray Siamese with blue eyes and pointed coloration. They also appear to be in different environments - the first is indoors on a couch, the second is outdoors in a garden."
}
]
}
NOTE: Only local paths are supported for the images.
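Before training, it can be worth sanity-checking each entry. The helper below is a sketch, not part of the repo; in particular, the rule that the number of &lt;image&gt; placeholders must equal the number of images is an assumption inferred from the examples above:

```python
# Hypothetical validator for conversations-format samples (not part of the repo).
def check_conversations_sample(sample: dict) -> None:
    """Raise AssertionError if a conversations-format sample looks malformed."""
    assert ("image" in sample) != ("images" in sample), "need exactly one of image/images"
    images = sample["images"] if "images" in sample else [sample["image"]]
    turns = sample["conversations"]
    assert turns and turns[0]["from"] == "human", "conversation must start with human"
    for turn in turns:
        assert turn["from"] in ("human", "gpt"), f"unknown role {turn['from']!r}"
        assert isinstance(turn["value"], str), "value must be a string"
    # assumption: one <image> placeholder per image, all placed in human turns
    n_tags = sum(t["value"].count("<image>") for t in turns if t["from"] == "human")
    assert n_tags == len(images), f"{len(images)} image(s) vs {n_tags} <image> tag(s)"
```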
- Each training sample of the messages-format data is a list.
- Single-image example (json or jsonl entry):
[
{
"role": "user",
"content": [
{
"type": "image",
"image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"
},
{"type": "text", "text": "Describe this image."}
]
}
]
- Multi-image example:
[
{
"role": "user",
"content": [
{"type": "image", "image": "file:///path/to/image1.jpg"},
{"type": "image", "image": "file:///path/to/image2.jpg"},
{"type": "text", "text": "Identify the similarities between these images."}
]
},
{"role": "assistant", "content": "They are the same."}
]
We support three ways to tell the program which data to train with.
- You can pass the annotation json/jsonl to --data_path and the corresponding image folder path to --image_folder.
- You can pass multiple --image_folder and --data_path; their numbers must be equal.
- You can only use SupervisedDatasetForQwen2_5_VL in this mode.
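The conversations and messages formats above carry the same information, so converting between them is mechanical. A hypothetical converter (not part of the repo; the function name is illustrative) from conversations format to messages format could look like:

```python
# Hypothetical converter (not part of the repo) from conversations-format
# samples to openai-like messages-format lists.
def conversations_to_messages(sample: dict) -> list:
    """Turn a conversations-format dict into a messages-format list."""
    images = sample["images"] if "images" in sample else [sample["image"]]
    img_iter = iter(images)
    messages = []
    for turn in sample["conversations"]:
        if turn["from"] == "human":
            text = turn["value"]
            # one image item per <image> placeholder, in order of appearance
            content = [
                {"type": "image", "image": next(img_iter)}
                for _ in range(text.count("<image>"))
            ]
            content.append({"type": "text", "text": text.replace("<image>", "").strip()})
            messages.append({"role": "user", "content": content})
        else:  # "gpt"
            messages.append({"role": "assistant", "content": turn["value"]})
    return messages
```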
You can register frequently used datasets as built-in datasets.
- Step 1: Register Your Dataset
Open the file: rscoagent/training/data/config.py
Add a dictionary describing your dataset, including both the annotation file and the image root path:
YOUR_DATASET = {
"annotation_path": "/absolute/path/to/your_dataset/annotations.json",
"data_path": "/absolute/path/to/your_dataset/images/",
}
Then register it in data_dict:
data_dict = {
"your_dataset": YOUR_DATASET,
# other pre-defined datasets...
}
- Step 2: Pass the name of your dataset to --datasets.
- You can pass multiple dataset names; a ConcatDataset may be built.
- Use "dataset_name%N" to sample N% of the data.
- For refgeo_* datasets, make sure the 'dataset_type' is set in the data config.
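For illustration, the "dataset_name%N" convention could be parsed as follows. This is only a sketch of the convention; the repo's actual parsing code may differ:

```python
# Hypothetical parser (not the repo's implementation) for "dataset_name%N".
def parse_dataset_spec(spec: str) -> tuple:
    """Split "name%N" into (name, fraction); plain names keep all data."""
    if "%" in spec:
        name, pct = spec.rsplit("%", 1)
        return name, int(pct) / 100.0
    return spec, 1.0
```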
You can also write a json/yaml config file as a replacement for the data config.
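For example, a yaml replacement mirroring the Python data config above might look like this. The schema is hypothetical; check the repo's config parser for the exact keys it expects:

```yaml
# Hypothetical yaml data config mirroring the data_dict example above.
your_dataset:
  annotation_path: /absolute/path/to/your_dataset/annotations.json
  data_path: /absolute/path/to/your_dataset/images/
```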
We release the data under a CC-BY-4.0 license, with the primary intent of supporting research activities. We do not impose any additional usage limitations, but users must comply with the terms of use (ToUs) of the source datasets. This dataset is a processed version, intended solely for academic sharing by the owner, and does not involve any commercial use or other violations of the ToUs. Any usage of this dataset should be regarded as usage of the original datasets. If there are any concerns regarding potential copyright infringement in the release of this dataset, please contact me, and we will remove any data that may pose a risk.
@ARTICLE{li2026rscovlm,
author={Li, Qingyun and Ma, Shuran and Luo, Junwei and Yu, Yi and Zhou, Yue and Wang, Fengxiang and Lu, Xudong and Wang, Xiaoxing and He, Xin and Chen, Yushi and Yang, Xue},
title={Co-Training Vision-Language Models for Remote Sensing Multi-Task Learning},
journal={Remote Sensing},
volume={18},
year={2026},
number={2},
article-number={222},
url={https://www.mdpi.com/2072-4292/18/2/222},
issn={2072-4292},
doi={10.3390/rs18020222}
}
@INPROCEEDINGS{11242725,
author={Li, Qingyun and He, Xin and Shu, Xinya and Yu, Yi and Chen, Dong and Chen, Yushi and Yang, Xue},
booktitle={IGARSS 2025 - 2025 IEEE International Geoscience and Remote Sensing Symposium},
title={A Simple Aerial Detection Baseline of Multimodal Language Models},
year={2025},
pages={6833-6837},
doi={10.1109/IGARSS55030.2025.11242725}
}
Please also cite the papers of the original source datasets that you adopt in your research.