Paper: https://arxiv.org/abs/2501.00888
GitHub: https://github.com/Alibaba-NLP/CHRONOS
Demo: https://modelscope.cn/studios/vickywu1022/CHRONOS

In the digital age, the volume of news is growing exponentially, which makes it crucial to extract and organize event timelines from massive amounts of text. To address this challenge, Alibaba’s Tongyi Lab and researchers from Shanghai Jiao Tong University proposed CHRONOS, a new agent-based framework for news timeline summarization named after Chronos, the Greek god of time. The framework guides the model to retrieve relevant news through iterative self-questioning, a form of slow thinking, and finally generates a chronological news summary, offering a novel solution to news timeline summarization.

Slow thinking is a method of deep analysis and reasoning that breaks a problem down step by step and explores the connections between pieces of information, leading to more comprehensive and accurate understanding and answers. In news timeline summarization, slow thinking helps the model understand the complex relationships among retrieved news articles: it starts from broad questions and gradually refines them to track events and generate higher-quality timeline summaries, in both open-domain and closed-domain settings.
For example, for the news “National Football Team 1-0 Bahrain,” CHRONOS can summarize a large volume of news and present the full course of the event. Comparing the two rounds of questioning, the later round builds on the earlier questions and focuses on details and deeper factors: it moves from key players’ performance in the current match to their past form, and then to the performance of other relevant teams. This style of questioning helps the model understand news events more comprehensively and deeply. For news spanning a longer period, such as “China’s Lunar Exploration Program,” CHRONOS can also focus on key events and present the timeline clearly to users.

Below is a detailed introduction to the paper’s content:
Task Introduction
Timeline Summarization (TLS) is a classic task in natural language processing. It aims to extract key events from large amounts of text and arrange them in chronological order, providing a structured view of how a topic or field has developed. In the news domain, for example, timeline summarization helps users quickly understand the full course of a news event. The task requires not only identifying important events but also understanding the temporal and causal relationships between them, in order to generate a coherent, concise, and informative timeline summary.

Depending on where the events are retrieved from, TLS can be divided into closed-domain and open-domain settings. In closed-domain TLS, the timeline is created from a predefined set of news articles on a specific topic or field; in open-domain TLS, news articles are searched and retrieved directly from the internet. Previous work has mainly focused on the closed-domain setting. Open-domain TLS requires strong information retrieval and filtering capabilities, as well as the ability to identify and connect events without a global view of the corpus, which poses new requirements and challenges.
CHRONOS Method
To address these challenges, we propose the CHRONOS framework, which generates accurate and comprehensive timeline summaries through iterative questioning and handles TLS in both open-domain and closed-domain settings.
1. Motivation
The core of timeline generation lies in establishing the temporal and causal relationships between events. Each news event can be represented as a node, and the goal of the task is to establish edges between these nodes that capture their relevance, ultimately forming a heterogeneous graph rooted at the node of the topic news. A retrieval mechanism that searches for relevant news articles is therefore an effective way to establish these edges and connect events.
2. Overview
CHRONOS leverages the capabilities of large models to simulate the human information retrieval process: asking questions, posing new questions based on the retrieval results, and finally collecting comprehensive information about related events and summarizing it into a timeline.

CHRONOS includes the following modules:
- Self-Questioning: First, search for coarse-grained news background information, then iteratively pose questions to retrieve more relevant news.
- Question Rewriting: Break down complex or poorly performing questions into more specific and easily retrievable queries.
- Timeline Generation: Summarize a timeline highlighting important events by merging the timelines generated from each round of retrieval.
3. Self-Questioning
3.1 Coarse-Grained Background Research
The starting point of slow thinking is the coarse-grained background research on news events. In the initial phase of self-questioning, CHRONOS first uses the title of the target news as keywords for search, collecting information that is most directly related to the target news. This information forms the news context, laying the foundation for subsequent deep analysis and iterative questioning. This step is similar to how humans begin to think about a new problem by first obtaining some basic background information to better understand the context and framework of the problem.
3.2 Question Example Selection
After the coarse-grained background research, CHRONOS uses the in-context learning capabilities of large models: a small number of few-shot examples in the prompt guide the model to generate high-quality questions about the target news.
To evaluate the quality of example questions, the concept of Chrono-Informativeness (CI) is introduced. CI measures how well a set of questions aligns with the events in the reference timeline: questions with a high CI value are more likely to retrieve articles related to the target news event. It is computed as the F1 score between the dates in the timeline built from the retrieved articles and the dates in the reference timeline.
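As an illustration of the date-based F1 behind CI, here is a minimal sketch. It assumes dates are already normalized to strings such as `YYYY-MM-DD`; the function name `date_f1` and the exact-match treatment of dates are assumptions made for illustration, not details taken from the paper.

```python
from typing import Iterable


def date_f1(predicted_dates: Iterable[str], reference_dates: Iterable[str]) -> float:
    """F1 between two sets of dates (exact string match, an assumption)."""
    pred, ref = set(predicted_dates), set(reference_dates)
    hits = len(pred & ref)
    if hits == 0:
        # No shared date (or an empty set): precision and recall are both 0.
        return 0.0
    precision = hits / len(pred)
    recall = hits / len(ref)
    return 2 * precision * recall / (precision + recall)


# Dates found via a question set vs. dates in the reference timeline.
retrieved = ["2024-01-01", "2024-01-02"]
reference = ["2024-01-02", "2024-01-03", "2024-01-04"]
score = date_f1(retrieved, reference)
```

Here one of the two retrieved dates matches one of the three reference dates, so precision is 1/2 and recall is 1/3.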
With the goal of maximizing the Chrono-Informativeness of the question set, a “news-question” example pool is constructed to guide question generation for new target news. For each new target news item, the examples most similar to it are retrieved dynamically using cosine similarity, which keeps the examples contextually relevant and temporally informative.
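The similarity-based example selection can be sketched as follows. This is a toy version that assumes each pool entry already carries an embedding vector; the dictionary keys (`news`, `questions`, `embedding`), the function names, and the tiny 2-dimensional vectors are all hypothetical, and the choice of embedding model is left open.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_examples(target_vec: Sequence[float], pool: List[Dict], k: int = 2) -> List[Dict]:
    """Return the k pool entries whose embeddings are closest to the target."""
    ranked = sorted(pool, key=lambda ex: cosine(target_vec, ex["embedding"]), reverse=True)
    return ranked[:k]


# Hypothetical example pool with made-up embeddings.
pool = [
    {"news": "A", "questions": ["What led to A?"], "embedding": [1.0, 0.1]},
    {"news": "B", "questions": ["Who was involved in B?"], "embedding": [0.0, 1.0]},
    {"news": "C", "questions": ["What followed C?"], "embedding": [0.9, 0.5]},
]
selected = top_k_examples([1.0, 0.0], pool, k=2)
```

The selected entries would then be placed in the prompt as few-shot examples for question generation.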
3.3 Iterative Questioning
The core of CHRONOS’s slow thinking lies in iterative questioning. In each round, CHRONOS poses new questions based on the retrieval results of the previous round, exploring new information and gradually building connections between events. The loop stops when the timeline contains enough events or the maximum number of iterations is reached. This process is akin to how humans refine their understanding by continuously posing new questions and exploring new possibilities while thinking.
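The loop described above can be sketched roughly as follows. The question generator `ask` and the search function `search` are passed in as plain callables, standing in for the LLM and the search engine; the article schema, the parameter names, and the use of distinct dates as a proxy for "enough events" are assumptions for illustration only.

```python
from typing import Callable, Dict, List, Tuple

Article = Dict[str, str]  # hypothetical schema: {"date": "YYYY-MM-DD", "text": "..."}


def iterative_questioning(
    topic: str,
    ask: Callable[[str, List[Article]], List[str]],
    search: Callable[[str], List[Article]],
    min_events: int = 3,
    max_rounds: int = 5,
) -> Tuple[List[Article], int]:
    # Round 0: coarse-grained background search with the news title as the query.
    context: List[Article] = list(search(topic))
    seen = {(a["date"], a["text"]) for a in context}
    rounds = 0
    # Stop once enough distinct dates are covered or the round budget is spent.
    while len({a["date"] for a in context}) < min_events and rounds < max_rounds:
        for question in ask(topic, context):
            for article in search(question):
                key = (article["date"], article["text"])
                if key not in seen:  # skip articles that were already retrieved
                    seen.add(key)
                    context.append(article)
        rounds += 1
    return context, rounds


# Toy stand-ins for the search engine and the question-asking model.
index = {
    "Topic": [{"date": "2024-01-01", "text": "a"}],
    "q1": [{"date": "2024-01-02", "text": "b"}],
    "q2": [{"date": "2024-01-03", "text": "c"}],
}


def fake_search(query: str) -> List[Article]:
    return index.get(query, [])


def fake_ask(topic: str, context: List[Article]) -> List[str]:
    return ["q%d" % len(context)]


articles, rounds = iterative_questioning("Topic", fake_ask, fake_search)
```

In this toy run the background search returns one article, and two further rounds of questions each add one article with a new date before the stopping condition is met.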
3.4 Question Rewriting
Query rewriting is a common optimization in retrieval-augmented generation. In CHRONOS, broad or complex questions generated in the initial questioning phase are rewritten into 2-3 sub-questions that are easier to retrieve with, yielding more specific and targeted queries and improving the search engine’s retrieval effectiveness. A few examples are also included in the prompt to guide the large model, so that complex questions are turned into more specific queries while the original intent is preserved.
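To make the few-shot rewriting step concrete, here is a small sketch of how such a prompt could be assembled and how the model's reply could be parsed. The prompt wording, the "- " bullet convention for sub-questions, and both function names are hypothetical; the actual prompts used by CHRONOS are not reproduced here.

```python
from typing import List, Sequence, Tuple


def build_rewrite_prompt(question: str, examples: Sequence[Tuple[str, List[str]]]) -> str:
    """Assemble a few-shot prompt; each example is (original question, sub-questions)."""
    lines = ["Rewrite the question into 2-3 specific, search-friendly sub-questions.", ""]
    for original, sub_questions in examples:
        lines.append(f"Question: {original}")
        lines.extend(f"- {sub}" for sub in sub_questions)
        lines.append("")
    lines.append(f"Question: {question}")
    return "\n".join(lines)


def parse_subquestions(response: str, max_n: int = 3) -> List[str]:
    """Keep lines that start with '- ' and cap the result at max_n sub-questions."""
    subs = []
    for line in response.splitlines():
        line = line.strip()
        if line.startswith("- "):
            subs.append(line[2:].strip())
    return subs[:max_n]


examples = [("How did the mission unfold?", ["When did the probe launch?", "When did it land?"])]
prompt = build_rewrite_prompt("What happened after the match?", examples)
subs = parse_subquestions("- a\n- b\nnoise\n- c\n- d")
```

Capping the parsed list keeps the number of search queries per question bounded even if the model returns more sub-questions than requested.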
3.5 Timeline Generation
CHRONOS generates a complete timeline summary in two stages: Generation and Merging.
- Generation: Identify key events and details by analyzing the news articles retrieved in each round. Utilize the understanding and generation capabilities of large models to extract the occurrence dates and related details of each event, and write concise descriptions for each event. These events and descriptions are organized into a preliminary timeline, arranged in chronological order, providing a foundation for the subsequent merging phase.
- Merging: Integrate the preliminary timelines generated from multiple rounds of retrieval into a coherent final summary. This process involves aligning events in different timelines, resolving any conflicts in dates or descriptions, and selecting the most representative and important events.
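The merging stage can be illustrated with a simple rule-based stand-in. In CHRONOS the alignment and conflict resolution rely on a large model; the heuristics below (group by date, rank dates by how often they are mentioned, keep the longest description) are assumptions chosen only to make the idea runnable, and `merge_timelines` is a hypothetical name.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

Timeline = List[Tuple[str, str]]  # (ISO date, event description)


def merge_timelines(timelines: List[Timeline], max_events: Optional[int] = None) -> Timeline:
    # Align events across rounds by grouping descriptions under the same date.
    by_date: Dict[str, List[str]] = defaultdict(list)
    for timeline in timelines:
        for date, summary in timeline:
            by_date[date].append(summary)
    # Rank dates by number of mentions (ties broken by earlier date) and keep the top ones.
    ranked = sorted(by_date, key=lambda d: (-len(by_date[d]), d))
    if max_events is not None:
        ranked = ranked[:max_events]
    # Resolve conflicting descriptions by keeping the longest; output in chronological order.
    return [(d, max(by_date[d], key=len)) for d in sorted(ranked)]


round1 = [("2024-01-01", "Launch"), ("2024-01-05", "Landing")]
round2 = [("2024-01-01", "Launch of the probe"), ("2024-01-03", "Orbit")]
merged = merge_timelines([round1, round2], max_events=2)
```

ISO-formatted date strings sort chronologically as plain strings, which is why no date parsing is needed in this sketch.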
Open-TLS
To evaluate TLS systems, we collected timelines written by professional journalists on recent news events and built a new dataset called Open-TLS. Compared with previous closed-domain datasets, Open-TLS is larger and more diverse in content, covering politics, economy, society, sports, and science and technology. It is also more up to date, providing a more comprehensive and challenging benchmark for open-domain TLS.

Experimental Results
1. Experimental Setup
The CHRONOS system was built on GPT-3.5-Turbo, GPT-4, and Qwen2.5-72B, and its TLS performance was evaluated in both open-domain and closed-domain settings. The main evaluation metrics are:
- ROUGE-N: Measures the N-gram overlap between the generated timeline and the reference timeline. Specifically includes:
  (1) Concat F1: Calculates ROUGE by concatenating all date summaries to assess overall consistency;
  (2) Agree F1: Calculates ROUGE using summaries only for matching dates to assess accuracy for specific dates;
  (3) Align F1: Aligns predicted and reference summaries based on similarity and date proximity before calculating ROUGE, assessing post-alignment consistency.
- Date F1: Measures the degree of matching dates in the generated timeline with the true dates in the reference timeline.
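The difference between the concatenation-based and agreement-based variants can be seen in a simplified unigram sketch. This is not the official evaluation code: tokenization is plain whitespace splitting, only ROUGE-1 is shown, and normalizing the agreement overlap by the token counts of the whole timelines is an assumption about how the metric is defined.

```python
from collections import Counter
from typing import Dict

Timeline = Dict[str, str]  # date -> summary for that date


def tokens(text: str) -> Counter:
    return Counter(text.lower().split())


def f1(overlap: int, pred_total: int, ref_total: int) -> float:
    if overlap == 0:
        return 0.0
    precision, recall = overlap / pred_total, overlap / ref_total
    return 2 * precision * recall / (precision + recall)


def total_tokens(timeline: Timeline) -> int:
    return sum(sum(tokens(s).values()) for s in timeline.values())


def concat_f1(pred: Timeline, ref: Timeline) -> float:
    """Unigram F1 over all summaries pooled together; dates are ignored."""
    p = sum((tokens(s) for s in pred.values()), Counter())
    r = sum((tokens(s) for s in ref.values()), Counter())
    return f1(sum((p & r).values()), sum(p.values()), sum(r.values()))


def agree_f1(pred: Timeline, ref: Timeline) -> float:
    """Unigram overlap counted only on dates present in both timelines."""
    overlap = sum(
        sum((tokens(pred[d]) & tokens(ref[d])).values()) for d in pred.keys() & ref.keys()
    )
    return f1(overlap, total_tokens(pred), total_tokens(ref))


pred = {"2024-01-01": "probe launched", "2024-01-02": "probe landed"}
ref = {"2024-01-01": "probe launched", "2024-01-03": "probe landed"}
concat_score = concat_f1(pred, ref)
agree_score = agree_f1(pred, ref)
```

In this toy case the text matches perfectly but one event is dated a day off, so the concatenation-based score is unaffected while the agreement-based score drops, which is the motivation for reporting date-aware variants.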
2. Open-Domain TLS
In the open-domain TLS experiments, CHRONOS was compared with several baselines, including searching directly for the target news (DIRECT) and rewriting the target news into queries for retrieval (REWRITE). By iteratively self-questioning and retrieving relevant news articles, CHRONOS significantly improved both event summarization quality and date alignment accuracy, outperforming the baselines on all metrics.

3. Closed-Domain TLS
In the closed-domain TLS experiments, CHRONOS was compared with representative previous work: (1) CLUST, based on event aggregation (Gholipour Ghalandari and Ifrim, 2020); (2) EGC, based on event graph models (Li et al., 2021); and (3) LLM-TLS, which uses large models for event clustering (Hu et al., 2024). Results on the classic datasets Crisis and T17 show that CHRONOS performs comparably to these methods and reaches SOTA results on the AR-2 metric on both datasets, demonstrating strong performance and adaptability across different types of events and time spans.

4. Runtime Analysis
Another advantage of CHRONOS is its efficiency. Compared with LLM-TLS, which also relies on large models but must process every article in the news repository, CHRONOS uses retrieval augmentation to focus on the most relevant news articles, significantly reducing processing time. This makes it more practical in real applications, especially in scenarios that require rapid responses.

Case Study
We analyzed in depth how the model handles specific, representative news events, such as major product launches by Apple, to observe how CHRONOS generates timelines through self-questioning and information retrieval that move from shallow to deep. In the case study, CHRONOS accurately extracted key events and dates, while also revealing areas for improvement, such as omitted events and hallucinated dates.

Conclusion
The CHRONOS framework provides a novel and effective solution for timeline summarization by combining the iterative self-questioning of large language models with retrieval-augmented generation. The core of the method is to simulate the human information retrieval process, gradually deepening the understanding of events by continuously posing and answering new questions, and ultimately generating a comprehensive and coherent timeline summary.
Experimental results demonstrate CHRONOS’s capabilities in complex event retrieval and timeline construction, and show the framework’s potential and accuracy in practical news timeline generation. Whether this approach of iterative questioning, retrieval, and generation can generalize to more general tasks is worth further research.
Reference:
[1] Demian Gholipour Ghalandari and Georgiana Ifrim. 2020. Examining the state-of-the-art in news timeline summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1322–1334, Online. Association for Computational Linguistics.
[2] Manling Li, Tengfei Ma, Mo Yu, Lingfei Wu, Tian Gao, Heng Ji, and Kathleen McKeown. 2021. Timeline summarization based on event graph compression via time-aware optimal transport. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 6443–6456, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
[3] Qisheng Hu, Geonsik Moon, and Hwee Tou Ng. 2024. From moments to milestones: Incremental timeline summarization leveraging large language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7232–7246, Bangkok, Thailand. Association for Computational Linguistics.