We introduce Ego-R1, a novel framework for reasoning over ultra-long (i.e., spanning days or weeks) egocentric videos, which leverages a structured Chain-of-Tool-Thought (CoTT) process orchestrated by an Ego-R1 Agent trained via reinforcement learning (RL).
Inspired by human problem-solving strategies, CoTT decomposes complex reasoning into modular steps, with the RL agent invoking specific tools, one per step, to iteratively and collaboratively answer sub-questions, tackling tasks such as temporal retrieval and multi-modal understanding. We design a two-stage training paradigm involving supervised fine-tuning (SFT) of a pretrained language model on CoTT data, followed by RL, to enable our agent to dynamically propose step-by-step tools for long-range reasoning. To facilitate training, we construct a dataset called Ego-R1 Data, which consists of Ego-CoTT-25K for SFT and Ego-QA-4.4K for RL. Furthermore, our Ego-R1 Agent is evaluated on a newly curated week-long video QA benchmark, Ego-R1 Bench, which contains human-verified QA pairs from hybrid sources.
Extensive results demonstrate that the dynamic, tool-augmented chain-of-thought reasoning of our Ego-R1 Agent can effectively tackle the unique challenges of understanding ultra-long egocentric videos, significantly extending the time coverage from a few hours to a week.
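To make the CoTT process concrete, below is a minimal sketch of the step-by-step loop it describes: at each step the agent emits an explicit thought, invokes a single tool, and conditions the next step on the returned observation. The agent/tool interfaces and the step cap are illustrative assumptions, not the exact implementation.

```python
# Minimal sketch of a Chain-of-Tool-Thought (CoTT) loop: think, call one tool
# per step, feed the observation back, and stop when a final answer is reached.
# The agent/tool signatures here are illustrative, not the paper's exact API.
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Step:
    thought: str                # explicit reasoning emitted at this step
    tool: Optional[str]         # tool invoked at this step (None when answering)
    arguments: Dict[str, str]   # arguments passed to the tool
    observation: str = ""       # tool output fed into the next step


def cott_answer(question: str,
                agent: Callable[[str, List[Step]], Step],
                tools: Dict[str, Callable[..., str]],
                max_steps: int = 8) -> str:
    """Iteratively reason and call one tool per step until an answer is produced."""
    history: List[Step] = []
    for _ in range(max_steps):
        step = agent(question, history)          # propose a thought + one tool call
        if step.tool is None:                    # agent decides it can answer now
            return step.thought
        step.observation = tools[step.tool](**step.arguments)  # execute the tool
        history.append(step)                     # observation conditions next step
    return history[-1].observation if history else ""
```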
To unleash the reasoning capabilities of LLMs under the CoT prompting paradigm and to enable dynamic tool selection conditioned on current observations and past actions, we introduce Ego-R1 Data, a dataset designed to enable agentic tool use with Chain-of-Tool-Thought (CoTT) reasoning chains.
Data generation pipeline of the Ego-R1 Data. We first obtain raw QA pairs from both AI-generated and human-annotated sources, based on 6 raw videos collected from 6 participants and their corresponding logs. The verified and processed Multiple Choice Questions (MCQs) serve as the foundation of the Ego-R1 Data (left). We take questions without answers for Chain-of-Tool-Thought (CoTT) generation, which involves creating reasoning chains that include explicit thinking steps and dynamic tool-calling sequences (right).
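As a rough illustration of what such a record might contain, the schema below pairs a verified MCQ with a CoTT trace of interleaved thoughts and tool calls. The field names, tool names, and content are hypothetical examples, not entries from the released dataset.

```python
# Illustrative (unofficial) schema for one Ego-R1 Data record: a verified MCQ
# plus a CoTT trace of explicit thoughts and tool calls with their observations.
example_record = {
    "video_id": "participant_03_week1",            # hypothetical identifier
    "question": "Where did I leave my keys on Tuesday evening?",
    "options": {"A": "Kitchen counter", "B": "Desk", "C": "Coat pocket", "D": "Car"},
    "answer": "B",
    "cott": [
        {
            "thought": "I need to localize Tuesday evening first.",
            "tool": "temporal_retrieval",          # hypothetical tool name
            "args": {"query": "Tuesday evening", "granularity": "hour"},
            "observation": "Candidate segment: day 2, 18:00-21:00.",
        },
        {
            "thought": "Now inspect that segment for the keys.",
            "tool": "video_qa",                    # hypothetical tool name
            "args": {"segment": "day2_18-21", "query": "where are the keys placed"},
            "observation": "The keys are set down on the desk at 19:42.",
        },
        {"thought": "The keys were left on the desk.",
         "tool": None, "args": {}, "observation": ""},
    ],
}
```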
Our goal is to train a language model capable of performing long-form video reasoning via a structured long-chain reasoning schema that automatically invokes multi-turn tool calls to collaboratively solve the problem. Inspired by recent post-training techniques, we design our training framework with a two-stage strategy.
Ego-R1 employs a two-stage training approach: Stage 1 utilizes supervised fine-tuning with CoTT data to establish structured tool-calling capabilities, while Stage 2 applies multi-turn reinforcement learning with rule-based rewards to optimize iterative reasoning and tool execution across diverse question types.
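For Stage 2, a rule-based reward typically combines a format check on the rollout with the accuracy of the final answer. The sketch below follows that common recipe under assumed tag conventions and weights; it is not the paper's exact reward definition.

```python
# Minimal sketch of a rule-based reward for multi-turn RL: a small format term
# plus an answer-accuracy term. Tag names and weights are assumptions.
import re


def rule_based_reward(rollout: str, gold_answer: str) -> float:
    """Score one multi-turn rollout serialized as text with tagged segments."""
    reward = 0.0
    # Format term: the rollout should contain parseable tool calls and exactly
    # one final answer so the executor and verifier can process it.
    tool_calls = re.findall(r"<tool_call>.*?</tool_call>", rollout, re.DOTALL)
    answers = re.findall(r"<answer>(.*?)</answer>", rollout, re.DOTALL)
    if tool_calls and len(answers) == 1:
        reward += 0.2
    # Accuracy term: exact match of the chosen MCQ option against the gold label.
    if answers and answers[-1].strip().upper() == gold_answer.strip().upper():
        reward += 1.0
    return reward


# Example usage
trace = "<tool_call>video_qa(...)</tool_call> ... <answer>B</answer>"
print(rule_based_reward(trace, "B"))   # 1.2
```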
*Bold indicates best performance; underlined values indicate second best. Results from the 72B version of a model or obtained using fewer frames are marked in gray. As some of the QA pairs in EgoLifeQA were used for CoTT generation and training, we excluded these from evaluation and retained only a clean subset for fair testing.
@misc{tian2025egor1chainoftoolthoughtultralongegocentric,
  title={Ego-R1: Chain-of-Tool-Thought for Ultra-Long Egocentric Video Reasoning},
  author={Shulin Tian and Ruiqi Wang and Hongming Guo and Penghao Wu and Yuhao Dong and Xiuying Wang and Jingkang Yang and Hao Zhang and Hongyuan Zhu and Ziwei Liu},
  year={2025},
  eprint={2506.13654},
  archivePrefix={arXiv},
}