Unlocking Video-LLM via Agent-of-Thoughts Distillation

Yudi Shi1,2, Shangzhe Di1,2, Qirui Chen1,2, Weidi Xie1
1School of Artificial Intelligence, Shanghai Jiao Tong University, China   
2CMIC, Shanghai Jiao Tong University, China   
Teaser

Our method, AoTD, distills multi-step reasoning and spatial-temporal understanding into a single generative video-language model. When addressing complex VideoQA tasks, a model trained with AoTD (as shown in (b)) can generate step-by-step reasoning that leads to the correct answer. In contrast, previous models trained solely on question-answer pairs (as in (a)) produce only a final answer, without intermediate reasoning, which can lead to incorrect conclusions.

Abstract

This paper tackles the problem of video question answering (VideoQA), a task that often requires multi-step reasoning and a deep understanding of spatial-temporal dynamics. While large video-language models perform well on benchmarks, they often lack explainability and spatial-temporal grounding. In this paper, we propose Agent-of-Thoughts Distillation (AoTD), a method that enhances models by incorporating automatically generated Chain-of-Thoughts (CoTs) into the instruction-tuning process. Specifically, we leverage an agent-based system to decompose complex questions into sub-tasks and address them with specialized vision models; the intermediate results are then treated as reasoning chains. We also introduce a verification mechanism using a large language model (LLM) to ensure the reliability of the generated CoTs. Extensive experiments demonstrate that AoTD improves performance on both multiple-choice and open-ended benchmarks.

Method

Overall Structure

Overview of Agent-of-Thoughts Distillation (AoTD). Step 1: selecting the best-performing agent for each sub-task to construct an agent-based system. Step 2: decomposing the question into an executable program and running the chosen models sequentially to generate an execution trace. Step 3: converting and filtering the execution trace with an LLM to produce high-quality natural-language CoTs. Step 4: distilling the CoTs into a Video-LLM with two forms of prompt, allowing it to strike a balance between concise answers and comprehensive rationales. The final model is Video-LLM-AoTD.
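To make the pipeline concrete, below is a minimal Python sketch of Steps 2 and 3: question decomposition, sequential agent execution, and LLM-based verification. Everything here is hypothetical; the sub-task agents are trivial stubs, and decompose and verify stand in for LLM calls. It illustrates the data-generation flow under these assumptions, not the authors' released implementation.

# Minimal sketch of the AoTD CoT-generation pipeline (Steps 2-3).
# All helpers are illustrative stand-ins, not the released code.
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class SubTask:
    name: str          # e.g. "temporal_grounding", "object_detection"
    query: str         # sub-question handed to the specialized agent
    result: str = ""   # filled in during execution

# Step 1 (assumed given): the best-performing specialized vision model
# per sub-task, represented here by trivial stubs.
AGENTS: dict[str, Callable[[str], str]] = {
    "temporal_grounding": lambda q: "frames 12-48",
    "object_detection":   lambda q: "a person and a cup",
    "captioning":         lambda q: "the person lifts the cup and drinks",
}

def decompose(question: str) -> list[SubTask]:
    """Stand-in for the LLM that parses a question into an executable
    program, i.e. an ordered list of sub-task calls."""
    return [SubTask(name, question) for name in AGENTS]

def verify(trace: list[SubTask], question: str, answer: str) -> Optional[str]:
    """Stand-in for the LLM that rewrites an execution trace into a
    natural-language CoT and rejects traces that do not support the answer."""
    if not trace or answer.lower() not in trace[-1].result.lower():
        return None  # placeholder reliability filter
    return " ".join(f"[{t.name}] {t.result}." for t in trace)

def generate_cot(question: str, answer: str) -> Optional[str]:
    trace = decompose(question)              # Step 2: build the program
    for t in trace:                          # Step 2: run agents in order
        t.result = AGENTS[t.name](t.query)   # accumulate execution trace
    return verify(trace, question, answer)   # Step 3: convert and filter

if __name__ == "__main__":
    cot = generate_cot("What does the person do after picking up the cup?", "drinks")
    print(cot)  # accepted CoTs feed the Step 4 instruction-tuning mix

Accepted CoTs would then be paired with two prompt forms, answer-only and answer-with-rationale, to build the Step 4 instruction-tuning data.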

Results

MC-VQA Results

Comparison with Video-LLMs on MC-VQA benchmarks. LLaVA-NeXT-Video-AoTD improves performance on all multiple-choice benchmarks compared with the Instruct version. * means results reproduced by ourselves.

OE-VQA Results

Comparison with Video-LLMs on OE-VQA benchmarks. LLaVA-NeXT-Video-AoTD improves performance on all open-ended benchmarks compared with the Instruct version. * means results reproduced by ourselves.

Qualitative Results

VQA Results

Visualization of rationales. LLaVA-NeXT-Video-AoTD outputs rationales that contain both spatial-temporal grounding of key information and a step-by-step reasoning process for solving the question.

BibTeX

@article{shi2024aotd,
  title={Unlocking Video-LLM via Agent-of-Thoughts Distillation},
  author={Shi, Yudi and Di, Shangzhe and Chen, Qirui and Xie, Weidi},
  journal={arXiv preprint arXiv:2412.01694},
  year={2024}
}