- Following LLaVA's instructions, you MUST first download eval.zip. It contains custom annotations, scripts, and the prediction files of LLaVA v1.5. Extract it to `eval`; this also provides the general directory structure for all datasets.
- After downloading everything, organize the data under `eval` as follows:
```
eval
├── gqa
│   ├── answers
│   ├── data
│   └── llava_gqa_testdev_balanced.jsonl
├── llava-bench-in-the-wild
│   ├── answers
│   ├── answers_gpt4.jsonl
│   ├── bard_0718.jsonl
│   ├── bing_chat_0629.jsonl
│   ├── context.jsonl
│   ├── images
│   ├── questions.jsonl
│   ├── README.md
│   └── reviews
├── mmbench
│   ├── answers
│   ├── answers_upload
│   ├── mmbench_dev_20230712.tsv
│   └── mmbench_dev_cn_20231003.tsv
├── MME
│   ├── answers
│   ├── convert_answer_to_mme.py
│   └── llava_mme.jsonl
├── mm-vet
│   ├── answers
│   ├── bard_set.json
│   ├── convert_answers.py
│   ├── images
│   ├── llava-mm-vet.jsonl
│   ├── mm-vet.json
│   └── results
├── pope
│   ├── answers
│   ├── coco
│   ├── llava_pope_test.jsonl
│   └── val2014
├── scienceqa
│   ├── answers
│   ├── images
│   ├── llava_test_CQM-A.json
│   ├── pid_splits.json
│   └── problems.json
├── seed_bench
│   ├── answers
│   ├── answers_upload
│   ├── extract_video_frames.py
│   └── llava-seed-bench.jsonl
├── textvqa
│   ├── answers
│   ├── llava_textvqa_val_v051_ocr.jsonl
│   ├── TextVQA_0.5.1_val.json
│   └── train_images
├── vizwiz
│   ├── answers
│   ├── answers_upload
│   ├── llava_test.jsonl
│   ├── test
│   ├── test.json
│   ├── train.json
│   └── val.json
└── vqav2
    ├── answers
    ├── answers_upload
    ├── llava_vqav2_mscoco_test2015.jsonl
    ├── llava_vqav2_mscoco_test-dev2015.jsonl
    └── test2015
```

Our evaluation code comes from LLaVA; thanks for their contribution! You can refer to the official repository for validation, but we also provide off-the-shelf scripts.
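Once everything is extracted, a quick way to confirm the layout matches the tree above is to check that each benchmark folder exists. This is a throwaway helper, not part of the repo:

```Shell
# check_eval_layout ROOT: print any benchmark folder missing under ROOT.
# Returns non-zero if anything is missing.
check_eval_layout() {
  root="$1"; ok=0
  for d in gqa llava-bench-in-the-wild mmbench MME mm-vet pope \
           scienceqa seed_bench textvqa vizwiz vqav2; do
    [ -d "$root/$d" ] || { echo "missing: $root/$d"; ok=1; }
  done
  return "$ok"
}
check_eval_layout eval || echo "eval/ layout incomplete"
```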
### VQAv2

- Download `test2015` and put it under `eval/vqav2`.
- Multi-GPU inference.

LLaVA-based model:

```Shell
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/v1/eval/llava/vqav2.sh
```

MoE-based model:

```Shell
bash scripts/v1/eval/moe_llava/vqav2.sh
```

- Submit the results to the evaluation server: `eval/vqav2/answers_upload`.
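The multi-GPU scripts work by sharding the question file into one chunk per GPU and merging the per-chunk answer files afterwards. The sharding itself can be sketched in plain shell; the file names below are illustrative stand-ins, not the repo's actual code:

```Shell
# Illustrative only: split a question file into one chunk per GPU,
# then merge the per-chunk answer files back together for scoring.
CHUNKS=4                                    # e.g. one chunk per visible GPU
seq 1 10 > questions.jsonl                  # stand-in for a real llava_*.jsonl
split -n l/"$CHUNKS" -d questions.jsonl chunk_   # chunk_00 .. chunk_03
# ...each chunk would be answered on its own GPU here...
cat chunk_0? > merge.jsonl                  # merged file is what gets scored
```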
### GQA

- Download the data following the official instructions here and put it under `eval/gqa/data`.
- Multi-GPU inference.

LLaVA-based model:

```Shell
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/v1/eval/llava/gqa.sh
```

MoE-based model:

```Shell
bash scripts/v1/eval/moe_llava/gqa.sh
```

### VizWiz

LLaVA-based model:

```Shell
CUDA_VISIBLE_DEVICES=0 bash scripts/v1/eval/llava/vizwiz.sh
```

MoE-based model:

```Shell
bash scripts/v1/eval/moe_llava/vizwiz.sh
```

- Submit the results to the evaluation server: `eval/vizwiz/answers_upload`.
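Several benchmarks here are scored by uploading a predictions file. Before submitting, it can be worth confirming the file is well-formed JSONL. A throwaway helper (not part of the repo):

```Shell
# check_jsonl FILE: fail if any line is not valid JSON; print the line count.
check_jsonl() {
  python3 -c '
import json, sys
n = 0
for line in open(sys.argv[1]):
    json.loads(line)   # raises (non-zero exit) on a malformed line
    n += 1
print(n)' "$1"
}
# e.g.: check_jsonl eval/vizwiz/answers_upload/<your-answers>.jsonl
```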
### ScienceQA

- Under `eval/scienceqa`, download `images`, `pid_splits.json`, and `problems.json` from the `data/scienceqa` folder of the ScienceQA repo.
- Single-GPU inference and evaluation.

LLaVA-based model:

```Shell
CUDA_VISIBLE_DEVICES=0 bash scripts/v1/eval/llava/sqa.sh
```

MoE-based model:

```Shell
bash scripts/v1/eval/moe_llava/sqa.sh
```

### TextVQA

- Download `TextVQA_0.5.1_val.json` and the images, and extract them to `eval/textvqa`.
- Single-GPU inference and evaluation.

LLaVA-based model:

```Shell
CUDA_VISIBLE_DEVICES=0 bash scripts/v1/eval/llava/textvqa.sh
```

MoE-based model:

```Shell
bash scripts/v1/eval/moe_llava/textvqa.sh
```

### POPE

- Download `coco` from POPE and put it under `eval/pope`.
- Single-GPU inference and evaluation.

LLaVA-based model:

```Shell
CUDA_VISIBLE_DEVICES=0 bash scripts/v1/eval/llava/pope.sh
```

MoE-based model:

```Shell
bash scripts/v1/eval/moe_llava/pope.sh
```

### MME

- Download the data following the official instructions here.
- Put the downloaded images under `MME_Benchmark_release_version`.
- Put the official `eval_tool` and `MME_Benchmark_release_version` under `eval/MME`.
- Single-GPU inference and evaluation.

LLaVA-based model:

```Shell
CUDA_VISIBLE_DEVICES=0 bash scripts/v1/eval/llava/mme.sh
```

MoE-based model:

```Shell
bash scripts/v1/eval/moe_llava/mme.sh
```

### MMBench

- Download `mmbench_dev_20230712.tsv` and put it under `eval/mmbench`.
- Single-GPU inference.

LLaVA-based model:

```Shell
CUDA_VISIBLE_DEVICES=0 bash scripts/v1/eval/llava/mmbench.sh
```

MoE-based model:

```Shell
bash scripts/v1/eval/moe_llava/mmbench.sh
```

- Submit the results to the evaluation server: `eval/mmbench/answers_upload/mmbench_dev_20230712`.
### MMBench-CN

- Download `mmbench_dev_cn_20231003.tsv` and put it under `eval/mmbench`.
- Single-GPU inference.

LLaVA-based model:

```Shell
CUDA_VISIBLE_DEVICES=0 bash scripts/v1/eval/llava/mmbench_cn.sh
```

MoE-based model:

```Shell
bash scripts/v1/eval/moe_llava/mmbench_cn.sh
```

- Submit the results to the evaluation server: `eval/mmbench/answers_upload/mmbench_dev_cn_20231003`.
### SEED-Bench

- Follow the official instructions to download the images and the videos. Put the images under `eval/seed_bench/SEED-Bench-image`.
- Extract the middle frame from each downloaded video and put the frames under `eval/seed_bench/SEED-Bench-video-image`.
- Multi-GPU inference and evaluation.

LLaVA-based model:

```Shell
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/v1/eval/llava/seed.sh
```

MoE-based model:

```Shell
bash scripts/v1/eval/moe_llava/seed.sh
```

- Optionally, submit the results to the leaderboard (`eval/seed_bench/answers_upload`) using the official Jupyter notebook.
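The repo ships `eval/seed_bench/extract_video_frames.py` for the middle-frame step; if you prefer a shell route, the same thing can be done with ffmpeg/ffprobe. A sketch, assuming both tools are installed:

```Shell
# middle_frame VIDEO OUT: extract the single middle frame of VIDEO to OUT.
# Reads the duration with ffprobe, then seeks to duration/2 for one frame.
middle_frame() {
  video="$1"; out="$2"
  dur=$(ffprobe -v error -show_entries format=duration \
        -of default=noprint_wrappers=1:nokey=1 "$video")
  ffmpeg -v error -ss "$(awk -v d="$dur" 'BEGIN{print d/2}')" \
         -i "$video" -frames:v 1 -y "$out"
}
# e.g.: middle_frame clip.mp4 eval/seed_bench/SEED-Bench-video-image/clip.jpg
```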
### LLaVA-Bench-in-the-Wild

- Extract the contents of `llava-bench-in-the-wild` to `eval/llava-bench-in-the-wild`.
- Single-GPU inference and evaluation.

LLaVA-based model:

```Shell
CUDA_VISIBLE_DEVICES=0 bash scripts/v1/eval/llava/llavabench.sh
```

MoE-based model:

```Shell
bash scripts/v1/eval/moe_llava/llavabench.sh
```

### MM-Vet

- Extract `mm-vet.zip` to `eval/mm-vet`.
- Single-GPU inference.

LLaVA-based model:

```Shell
CUDA_VISIBLE_DEVICES=0 bash scripts/v1/eval/llava/mmvet.sh
```

MoE-based model:

```Shell
bash scripts/v1/eval/moe_llava/mmvet.sh
```
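If you want to run several of the single-GPU benchmarks back to back, a small wrapper loop works; `run_evals` is a hypothetical helper (not part of the repo), and the task names match the script names used above:

```Shell
# run_evals SCRIPT_DIR TASK...: run a set of eval scripts sequentially on GPU 0.
run_evals() {
  dir="$1"; shift
  for task in "$@"; do
    CUDA_VISIBLE_DEVICES=0 bash "$dir/$task.sh"
  done
}
# e.g.: run_evals scripts/v1/eval/llava sqa textvqa pope mme mmbench
```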