Figure 1: Overview of the Video Difference Captioning (ViDiC) framework. Each example in ViDiC consists of a video pair and a fine-grained checklist of binary similarity and difference questions.
Example 1 (Video A / Video B)
Similarity Question: Are the man's clothes different between the two videos? Correct Answer: No
Difference Question: Does the Source video feature diffuse, cool-toned light, while the Target video features bright, warm-toned backlighting from the upper left? Correct Answer: Yes
Difference Question: Is the sky in the Source video completely grey and cloudy, in contrast to the Target video where the sky is blue with a visible sun? Correct Answer: Yes
Example 2 (Video A / Video B)
Similarity Question: Do the clothing attributes of the shooter and goalkeeper vary between the Source and Target videos? Correct Answer: No
Difference Question: Regarding the camera's position, is it true that the Source video is filmed more from the side, while the Target video is filmed from a more central position? Correct Answer: Yes
Difference Question: Is the building in the background more clearly visible in the Target video compared to the Source video? Correct Answer: Yes
Example 3 (Video A / Video B)
Similarity Question: Are the camera movements between the Source and Target videos different? Correct Answer: No
Difference Question: Regarding the lighting, is the Source video illuminated by bright, white daylight, in contrast to the Target video, which is illuminated by the warm, colored light of a sunrise or sunset? Correct Answer: Yes
Difference Question: In terms of atmosphere, does the Source video show a blue sky with white clouds, while the Target video shows a sky with warmer colors? Correct Answer: Yes
We propose a Dual-Checklist Evaluation Framework to enable reliable measurement of comparative captioning quality. Traditional metrics reward textual similarity rather than factual correctness; ViDiC instead quantifies factual accuracy against a human-annotated checklist of binary questions derived from predefined dimensions (Subject, Style, Background, etc.).
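To make the checklist structure concrete, a single benchmark entry could be represented as below. This is a minimal sketch using the examples from Figure 1; the field names (pair_id, dimension, etc.) are illustrative assumptions on our part, not the released ViDiC schema.

```python
# A minimal sketch of one ViDiC-style checklist entry.
# Field names are illustrative assumptions, not the released schema.
checklist_entry = {
    "pair_id": "example_001",
    "source_video": "videos/example_001_a.mp4",  # hypothetical paths
    "target_video": "videos/example_001_b.mp4",
    "checklist": [
        {
            "type": "similarity",      # similarity vs. difference question
            "dimension": "Subject",    # one of the predefined dimensions
            "question": "Are the man's clothes different between the two videos?",
            "answer": "No",            # ground-truth binary answer
        },
        {
            "type": "difference",
            "dimension": "Style",
            "question": "Does the Source video feature diffuse, cool-toned light, "
                        "while the Target video features bright, warm-toned "
                        "backlighting from the upper left?",
            "answer": "Yes",
        },
    ],
}
```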
We use an LLM-as-a-Judge protocol (GPT-5-Mini or an equivalent high-capability model) to verify generated captions against the ground-truth checklist without accessing the video pixels directly, which keeps the benchmarking scalable and interpretable.
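A minimal sketch of such a judge loop is shown below, assuming the OpenAI Python client and the illustrative checklist_entry structure above. The prompt wording, the scoring rule (fraction of matched answers), and the "gpt-5-mini" model string are our assumptions, not the paper's exact protocol.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def judge_caption(caption: str, checklist: list[dict],
                  model: str = "gpt-5-mini") -> float:
    """Score a generated difference caption against a ground-truth checklist.

    For each binary question the judge answers Yes/No using only the caption
    text (never the video pixels); the score is the fraction of questions
    whose judged answer matches the human-annotated answer.
    """
    correct = 0
    for item in checklist:
        prompt = (
            "You are grading a video difference caption.\n"
            f"Caption: {caption}\n"
            f"Question: {item['question']}\n"
            "Based only on the caption, answer with exactly 'Yes' or 'No'."
        )
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        judged = response.choices[0].message.content.strip().lower()
        if judged.startswith(item["answer"].lower()):
            correct += 1
    return correct / len(checklist)
```

Because the judge sees only the caption and the question, any credit it awards must be grounded in what the caption actually states, which is what makes the per-question scores interpretable.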
@misc{wu2025vidicvideodifferencecaptioning,
title={ViDiC: Video Difference Captioning},
author={Jiangtao Wu and Shihao Li and Zhaozhou Bian and Yuanxing Zhang and Jialu Chen and Runzhe Wen and An Ping and Yiwen He and Jiakai Wang and Jiaheng Liu},
year={2025},
eprint={2512.03405},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2512.03405},
}