EditEval is a benchmark for evaluating text-guided video editing methods. We provide the data and scripts, including:
- 200 source videos and 1,280 edited videos produced by 8 video editing models
- 1,010 editing text prompts covering 8 task categories
- Human annotations from 4 annotators across 3 evaluation dimensions, along with inter-annotator agreement computation
- MLLM inference outputs from 8 multimodal large language models, with correlation analysis against human annotations
```
EditEval/
├── source_video/
│   ├── download.sh                 # Download 200 source videos from Google Drive
│   └── *.mp4                       # Source video files (after download)
├── edited_video/
│   ├── download.sh                 # Download 1,280 edited videos from Google Drive
│   └── *.mp4                       # Edited video files (after download)
├── mllm_results/
│   ├── download.sh                 # Download MLLM inference results from Google Drive
│   ├── 4o/                         # GPT-4o
│   ├── 4o_0806/                    # GPT-4o-0806
│   ├── gemini/                     # Gemini-Pro
│   ├── one_vision_7b/              # LLaVA-OneVision-7B
│   ├── qwen_vl/                    # Qwen-VL-Chat
│   ├── timechat/                   # TimeChat
│   ├── videollama2/                # VideoLLaMA2
│   └── vila/                       # VILA-1.5-40B
├── annotations/
│   ├── data_worker_1.csv           # Annotation results from annotator 1
│   ├── data_worker_2.csv           # Annotation results from annotator 2
│   ├── data_worker_3.csv           # Annotation results from annotator 3
│   └── data_worker_4.csv           # Annotation results from annotator 4
├── labeled_full.csv                # Aggregated human annotation scores (1,280 samples)
├── edit_eval_text_prompts.csv      # 1,010 editing text prompts with task metadata
├── inter-annotator-agreement.py    # Compute inter-annotator agreement
├── compute_correlation.py          # Compute MLLM-human score correlations
└── README.md
```
The 1,280 edited videos are produced by the following 8 video editing models, each contributing 160 samples:
| Model | Samples |
|---|---|
| FateZero | 160 |
| RAVE | 160 |
| Text2Video-Zero | 160 |
| TokenFlow | 160 |
| Tune-A-Video | 160 |
| VidToMe | 160 |
| Pix2Video | 160 |
| vid2vid-zero | 160 |
Source video statistics:

| Statistic | Number |
|---|---|
| Total video clips | 200 |
| Video resolution | 480 × 480 |
| Video length | 25 frames |
Statistics of the full set of 1,010 editing text prompts:

| Statistic | Number |
|---|---|
| Total text prompts | 1,010 |
| Single-Target Editing | 706 |
| Animal Editing | 56 |
| Human Editing | 107 |
| Object Editing | 96 |
| Background Editing | 143 |
| Overall Style Editing | 152 |
| Color Transfer | 152 |
| Multiple-Target Editing | 304 |
Statistics of the 160 text prompts used for the edited videos:

| Statistic | Number |
|---|---|
| Total text prompts | 160 |
| Single-Target Editing | 96 |
| Animal Editing | 12 |
| Human Editing | 12 |
| Object Editing | 12 |
| Background Editing | 12 |
| Overall Style Editing | 24 |
| Color Transfer | 24 |
| Multiple-Target Editing | 64 |
| Maximum editing targets per prompt | 5 |
| Minimum editing targets per prompt | 1 |
| Average editing targets per prompt | 1.8 |
| Maximum caption length | 29 |
| Minimum caption length | 6 |
| Average caption length | 14.4 |
| Maximum text prompt length | 29 |
| Minimum text prompt length | 7 |
| Average text prompt length | 16.5 |
Each sample is evaluated on 3 dimensions (scored 1–5 by human annotators):
- Textual Faithfulness: How well the edited video aligns with the editing text prompt
- Frame Consistency: Temporal coherence and smoothness across frames
- Video Fidelity: Overall visual quality and realism of the edited video
```bash
pip install gdown pandas scipy prettytable krippendorff numpy
```

Download the 200 source videos from Google Drive:
```bash
cd source_video
bash download.sh
cd ..
```

Download the 1,280 edited videos from Google Drive:
```bash
cd edited_video
bash download.sh
cd ..
```

Download the MLLM inference outputs (8 models × 1,280 JSON files) from Google Drive:
```bash
cd mllm_results
bash download.sh
cd ..
```

The mllm_results/ directory contains inference outputs from 8 multimodal large language models. Each model directory contains 1,280 JSON files (one per sample), with scores for the 3 evaluation dimensions.
| Directory | Model |
|---|---|
| `4o/` | GPT-4o |
| `4o_0806/` | GPT-4o-0806 |
| `gemini/` | Gemini-Pro |
| `one_vision_7b/` | LLaVA-OneVision-7B |
| `qwen_vl/` | Qwen-VL-Chat |
| `timechat/` | TimeChat |
| `videollama2/` | VideoLLaMA2 |
| `vila/` | VILA-1.5-40B |
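For illustration, one per-sample result file can be loaded with the standard `json` module. The file name and field names below are invented for this sketch; consult the released JSON files for the actual schema:

```python
import json

# Hypothetical per-sample result file: one JSON file per edited video,
# holding a score for each of the three evaluation dimensions.
# The actual keys in the released files may differ.
sample = json.loads("""
{
    "video": "0001_fatezero.mp4",
    "textual_faithfulness": 4,
    "frame_consistency": 3,
    "video_fidelity": 4
}
""")

dimensions = ("textual_faithfulness", "frame_consistency", "video_fidelity")
scores = {dim: sample[dim] for dim in dimensions}
print(scores)
# {'textual_faithfulness': 4, 'frame_consistency': 3, 'video_fidelity': 4}
```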
The annotations/ directory contains per-annotator scores (with personal information removed). The aggregated scores (averaged over 4 annotators) are in labeled_full.csv.
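A minimal sketch of the aggregation step, assuming the per-annotator scores have already been read from the CSVs (the sample ID and dimension names here are placeholders, not the actual column headers):

```python
from statistics import mean

# Toy stand-in for the four per-annotator tables; in the repository these
# come from annotations/data_worker_*.csv, whose column names may differ.
annotators = [
    {"clip_001": {"faithfulness": 4, "consistency": 3, "fidelity": 4}},
    {"clip_001": {"faithfulness": 5, "consistency": 3, "fidelity": 4}},
    {"clip_001": {"faithfulness": 4, "consistency": 4, "fidelity": 3}},
    {"clip_001": {"faithfulness": 4, "consistency": 4, "fidelity": 4}},
]

def aggregate(annotators, sample_id):
    """Average each dimension's score over all annotators for one sample."""
    dims = annotators[0][sample_id].keys()
    return {d: mean(a[sample_id][d] for a in annotators) for d in dims}

print(aggregate(annotators, "clip_001"))
# {'faithfulness': 4.25, 'consistency': 3.5, 'fidelity': 3.75}
```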
Compute inter-annotator agreement:

```bash
python inter-annotator-agreement.py
```

Expected output:
```
----------------------------------------
Metrics for Textual Faithfulness:
Averaged Kendall's τc: 0.6408 ± 0.0728
Averaged Spearman's ρ: 0.7095 ± 0.0810
Krippendorff's α: 0.6995
----------------------------------------
Metrics for Frame Consistency:
Averaged Kendall's τc: 0.6505 ± 0.0226
Averaged Spearman's ρ: 0.7293 ± 0.0244
Krippendorff's α: 0.6688
----------------------------------------
Metrics for Video Fidelity:
Averaged Kendall's τc: 0.6126 ± 0.0295
Averaged Spearman's ρ: 0.6935 ± 0.0299
Krippendorff's α: 0.6628
```
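The script itself relies on scipy and the krippendorff package. As a self-contained illustration of one ingredient, here is a stdlib-only sketch that averages pairwise Spearman's ρ over all annotator pairs, using average ranks to handle tied scores (the toy score matrix is invented):

```python
from itertools import combinations
from statistics import mean

def ranks(xs):
    """1-based ranks, with tied values assigned their average rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1  # extend the tie group
        avg = (i + j) / 2 + 1  # average of positions i+1 .. j+1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def pearson(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman's ρ = Pearson correlation of the rank vectors."""
    return pearson(ranks(x), ranks(y))

# Toy scores from 4 annotators on 6 samples (1-5 scale).
scores = [
    [5, 3, 4, 2, 1, 4],
    [5, 2, 4, 2, 1, 3],
    [4, 3, 5, 1, 2, 4],
    [5, 3, 4, 1, 1, 4],
]
pairwise = [spearman(a, b) for a, b in combinations(scores, 2)]
print(f"Averaged Spearman's ρ: {mean(pairwise):.4f}")
```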
Compute Pearson, Spearman, and Kendall τ correlations between MLLM-generated scores and human annotations.

Run on a single model:

```bash
python compute_correlation.py --labeled_csv labeled_full.csv --mllm_dir mllm_results/vila
```

Run on multiple models:

```bash
python compute_correlation.py --labeled_csv labeled_full.csv --mllm_dir mllm_results/vila mllm_results/videollama2
```

Run on all models:

```bash
python compute_correlation.py --labeled_csv labeled_full.csv --mllm_dir all
```

If you find EditEval useful in your research, please cite our paper:
```bibtex
@inproceedings{liu2025editeval,
  author    = {Bingshuai Liu and Ante Wang and Zijun Min and Chenyang Lyu and
               Longyue Wang and Zhihao Wang and Xu Han and Peng Li and Jinsong Su},
  title     = {EditEval: Towards Comprehensive and Automatic Evaluation for Text-guided Video Editing},
  booktitle = {Proceedings of the 33rd ACM International Conference on Multimedia (MM '25)},
  year      = {2025},
  pages     = {3507--3516},
  publisher = {ACM},
  doi       = {10.1145/3746027.3755100},
  url       = {https://doi.org/10.1145/3746027.3755100}
}
```