ViDiC

A benchmark for Video Difference Captioning

Nanjing University    Kuaishou Technology

Abstract

Figure 1: Overview of the Video Difference Captioning framework, comparing compositional, spatial, and temporal changes.

Understanding visual differences between dynamic scenes requires the comparative perception of compositional, spatial, and temporal changes—a capability that remains underexplored in existing vision-language systems. While prior work on Image Difference Captioning (IDC) has enabled models to describe semantic changes between static images, these approaches fail to capture motion continuity, event evolution, or editing consistency over time.

We introduce the ViDiC (Video Difference Captioning) task and its corresponding ViDiC-1K dataset, designed to evaluate the ability of Multimodal Large Language Models (MLLMs) to provide fine-grained descriptions of similarities and differences between video pairs. ViDiC comprises 1,000 curated video pairs annotated with over 4,000 comparative checklist items, covering seven categories: Subject, Style, Background, Cinematography, Motion, Location (Position), and Playback Techniques.

To ensure reliable evaluation, we propose a dual-checklist framework that measures the accuracy of similarity and difference separately, using a scalable LLM-as-a-Judge protocol. Experiments on twelve state-of-the-art multimodal models reveal a significant performance gap in their comparative description and difference perception abilities, with all models underperforming on playback- and camera-related dimensions.

Leaderboard

We evaluate video comparison using Accuracy over a set of Dual-Checklist questions.
Avg: Average Score    Diff: Difference Accuracy    Sim: Similarity Accuracy

Fine-grained Categories:
Subj (Subject), Motion, Pos (Position), Backgr (Background), Cam (Camera Work), Style, Tech (Playback Technique).

💡 indicates models evaluated with their "thinking" or explicit reasoning mode activated.

All scores are percentages. Avg, Diff., and Sim. are overall metrics; the remaining columns report per-category performance.

| # | Model | Param | Avg | Diff. | Sim. | Subj | Motion | Pos. | Backgr. | Cam. | Style | Tech. |
|---|-------|-------|-----|-------|------|------|--------|------|---------|------|-------|-------|
| - | Gemini-2.5-Pro 💡 | - | 66.72 | 63.73 | 75.33 | 67.71 | 62.78 | 68.24 | 70.65 | 59.97 | 75.79 | 74.32 |
| - | GPT-5 💡 | - | 62.94 | 57.32 | 79.17 | 61.52 | 57.78 | 65.31 | 69.15 | 57.39 | 77.60 | 54.66 |
| - | Gemini-2.5-Flash 💡 | - | 58.87 | 52.11 | 78.37 | 59.63 | 51.29 | 57.23 | 63.98 | 52.82 | 81.58 | 55.41 |
| - | Gemini-2.0-Flash 💡 | - | 53.71 | 50.26 | 63.66 | 58.90 | 48.71 | 57.86 | 57.11 | 47.30 | 55.79 | 18.92 |
| - | GPT-4o 💡 | - | 49.95 | 39.14 | 81.12 | 46.79 | 43.53 | 51.89 | 53.73 | 49.18 | 77.89 | 27.03 |
| - | Qwen3-VL-32B | 32B | 61.38 | 58.54 | 71.50 | 64.60 | 51.77 | 62.00 | 68.62 | 52.66 | 74.86 | 47.83 |
| - | InternVL-3.5 💡 | 38B | 52.44 | 46.25 | 70.30 | 52.66 | 43.04 | 53.77 | 59.80 | 47.80 | 72.63 | 20.27 |
| - | Qwen2.5-VL-Inst | 72B | 49.71 | 42.56 | 70.30 | 48.07 | 44.82 | 48.11 | 55.92 | 46.42 | 68.95 | 22.97 |
| - | Mimo-VL-SFT | 7B | 52.59 | 46.51 | 70.17 | 54.39 | 46.55 | 51.25 | 57.31 | 48.37 | 67.71 | 25.33 |
| - | GLM-4.1V 💡 | 9B | 40.95 | 33.99 | 61.08 | 42.60 | 34.35 | 38.13 | 47.26 | 33.83 | 64.58 | 14.67 |
| - | Llama-3.2 | 11B | 19.43 | 5.23 | 61.01 | 14.48 | 20.31 | 17.84 | 13.44 | 29.56 | 40.00 | 11.70 |
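For reference, here is a minimal sketch of how per-item judge verdicts could be aggregated into the scores above, assuming Diff. and Sim. are accuracies over difference and similarity items respectively and Avg pools all checklist items; the exact weighting used for the leaderboard may differ, and the field names are hypothetical.

    from typing import Dict, List

    def aggregate_scores(items: List[Dict]) -> Dict[str, float]:
        """Aggregate per-item verdicts into leaderboard-style percentages.

        Each item is assumed to look like:
          {"type": "difference" | "similarity", "category": "Motion", "correct": True}
        (illustrative field names, not the released schema).
        """
        def accuracy(subset: List[Dict]) -> float:
            return 100.0 * sum(i["correct"] for i in subset) / len(subset) if subset else 0.0

        diff_items = [i for i in items if i["type"] == "difference"]
        sim_items = [i for i in items if i["type"] == "similarity"]

        scores = {
            "Avg": accuracy(items),        # pooled accuracy over all checklist items
            "Diff": accuracy(diff_items),  # accuracy on difference questions
            "Sim": accuracy(sim_items),    # accuracy on similarity questions
        }
        # Per-category breakdown (Subject, Motion, Position, ...).
        for category in {i["category"] for i in items}:
            scores[category] = accuracy([i for i in items if i["category"] == category])
        return scores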

Dataset Examples

Video Comparison Pairs

Each example in ViDiC consists of a video pair and a fine-grained checklist.
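Purely for illustration, the sketch below shows how a single example might be organized in code; the field names, paths, and questions are hypothetical and do not reflect the released annotation schema.

    # Hypothetical structure of one ViDiC-1K example (illustrative only).
    example = {
        "pair_id": "vidic_0001",
        "video_a": "videos/0001_a.mp4",
        "video_b": "videos/0001_b.mp4",
        "checklist": [
            {   # Difference question: a true proposition the caption must affirm.
                "type": "difference",
                "category": "Motion",
                "question": "Does the caption state that the subject walks in video A but runs in video B?",
            },
            {   # Similarity question: framed inversely to penalize hallucinated changes.
                "type": "similarity",
                "category": "Background",
                "question": "Does the caption avoid claiming that the background differs between the videos?",
            },
        ],
    }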

Benchmark Statistics


Evaluation Methodology

We propose a Dual-Checklist Evaluation Framework to ensure reliable measurement of comparative captioning. Traditional captioning metrics measure textual similarity rather than factual correctness; ViDiC instead quantifies factual accuracy against a human-annotated checklist of binary questions derived from predefined dimensions (Subject, Style, Background, etc.).

  • Similarity Questions: Framed inversely to penalize hallucinations. A response is correct if it confirms similarity or omits the attribute.
  • Difference Questions: Framed as verifiable propositions about specific differences. The model must correctly affirm these true statements.
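A minimal sketch of how these two scoring rules could be applied to a single checklist item, assuming the judge's verdict has already been reduced to an affirmed/denied/omitted label (a convention of this sketch, not necessarily the official implementation):

    def score_item(item_type: str, judge_verdict: str) -> bool:
        """Apply the dual-checklist scoring rules to one judge verdict.

        `judge_verdict` is assumed to be one of:
          "affirmed" - the caption explicitly states the proposition
          "denied"   - the caption contradicts the proposition
          "omitted"  - the caption does not mention the attribute at all
        """
        if item_type == "difference":
            # Difference questions are true propositions: only an explicit
            # affirmation counts as correct.
            return judge_verdict == "affirmed"
        if item_type == "similarity":
            # Similarity questions penalize hallucinations: confirming the
            # similarity or omitting the attribute both count as correct.
            return judge_verdict in ("affirmed", "omitted")
        raise ValueError(f"unknown item type: {item_type}")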

We leverage an LLM-as-a-Judge protocol (using GPT-5-Mini or equivalent high-capability models) to compare the generated captions against the ground-truth checklist without accessing the video pixels directly. This ensures scalable and interpretable benchmarking.
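As a rough illustration, one judging call under this protocol might look as follows, assuming an OpenAI-style chat API; the prompt wording, verdict labels, and model identifier are placeholders rather than the exact setup used for the leaderboard.

    from openai import OpenAI

    client = OpenAI()

    JUDGE_PROMPT = """You are grading a video-comparison caption against one checklist item.
    Caption: {caption}
    Checklist item: {question}
    Answer with exactly one word: affirmed, denied, or omitted."""

    def judge_item(caption: str, question: str, model: str = "gpt-5-mini") -> str:
        """Ask the judge model whether the caption affirms, denies, or omits the item.

        The judge sees only the caption text and the checklist item, never the videos.
        """
        response = client.chat.completions.create(
            model=model,
            messages=[{
                "role": "user",
                "content": JUDGE_PROMPT.format(caption=caption, question=question),
            }],
        )
        return response.choices[0].message.content.strip().lower()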

Citation


    @misc{wu2025vidicvideodifferencecaptioning,
      title={ViDiC: Video Difference Captioning}, 
      author={Jiangtao Wu and Shihao Li and Zhaozhou Bian and Yuanxing Zhang and Jialu Chen and Runzhe Wen and An Ping and Yiwen He and Jiakai Wang and Jiaheng Liu},
      year={2025},
      eprint={2512.03405},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2512.03405}, 
    }