LLM Based Information Synchronization

About

In today’s digital era, non-English content on platforms like Wikipedia is often outdated or incomplete, especially in low-resource languages. Our work tackles this problem through two key contributions:

Example:

Below is an example of information synchronization from our dataset. On the right is a reference table in a high-resource language, and on the left is an outdated table in a low-resource language. Updates made by our model are highlighted.

Methodology

Our solution leverages large language models (LLMs) through a hierarchical task decomposition strategy, in which each step is accompanied by its justification:

Each of these steps was introduced to tackle specific challenges identified during the review process, and together they form a robust framework for multilingual information synchronization.

Results

Our main results on updating information, shown in Table 1, demonstrate that our proposed decomposition technique significantly outperforms several baselines.

Dataset

To evaluate multilingual table synchronization, we introduce the INFOUPDATE dataset, which simulates the real-world process of updating outdated Wikipedia tables. This dataset consists of approximately 950 annotated instances spanning 9 categories (Album, Athlete, City, College, Company, Country, Musician, Person, and Stadium) across 14 languages: English, Spanish, French, German, Arabic, Hindi, Korean, Russian, Afrikaans, Cebuano, Swedish, Dutch, Turkish, and Chinese.

The dataset construction process involves extracting two versions of the same Wikipedia table entity from different time periods. Specifically, we extract an old version from 2018 (Source Table) and a new version from 2023 (Reference Table). The goal is to update the outdated source table using information from the reference table while ensuring alignment with a manually curated gold-standard table.
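As an illustration of how two dated versions of the same page could be retrieved, the following is a minimal sketch that builds a MediaWiki API request for the latest revision at or before a cutoff timestamp. It is not the authors' extraction pipeline, which is not described here. The function name `revision_query_url` and the end-of-year cutoffs are assumptions; the text only states that the versions come from 2018 and 2023.

```python
from urllib.parse import urlencode

# Per-language Wikipedia API endpoint
API_TEMPLATE = "https://{lang}.wikipedia.org/w/api.php"


def revision_query_url(lang: str, title: str, as_of: str) -> str:
    """Build a URL requesting the newest revision of `title` at or before `as_of`.

    `as_of` is an ISO 8601 timestamp such as "2018-12-31T23:59:59Z".
    This helper is hypothetical and only constructs the URL; it sends no request.
    """
    params = {
        "action": "query",
        "prop": "revisions",
        "titles": title,
        "rvlimit": 1,            # only the single closest revision
        "rvstart": as_of,        # start enumerating from this timestamp
        "rvdir": "older",        # walk backwards in time from rvstart
        "rvprop": "ids|timestamp|content",
        "rvslots": "main",
        "format": "json",
    }
    return API_TEMPLATE.format(lang=lang) + "?" + urlencode(params)


# Assumed cutoffs: the end of each year named in the text
source_url = revision_query_url("hi", "Example", "2018-12-31T23:59:59Z")
reference_url = revision_query_url("en", "Example", "2023-12-31T23:59:59Z")
print(source_url)
print(reference_url)
```

The returned wikitext would still need to be parsed to isolate the infobox table, which this sketch leaves out.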

The objective of the Information Synchronization task is to update rows in the Source Table (TS) using the Reference Table (TR) so that the generated Output Table (TO) closely matches the Gold Table (TG).
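To make the task definition concrete, here is a minimal Python sketch. It is not the LLM-based method from the paper: it assumes each table is a flat mapping from row key to value, that keys and values are already aligned across languages, and it uses a naive copy-from-reference rule as a stand-in for the model. The names `naive_sync` and `row_f1`, the toy rows, and the exact-match row F1 are hypothetical illustrations; the paper's own evaluation metrics may differ.

```python
from typing import Dict

Table = Dict[str, str]  # row key -> cell value (assumed flat table layout)


def naive_sync(source: Table, reference: Table) -> Table:
    """Produce an output table by overlaying reference rows on the source.

    Rows found only in the source are kept; rows in both take the reference value.
    """
    output = dict(source)
    output.update(reference)
    return output


def row_f1(output: Table, gold: Table) -> float:
    """Exact-match F1 over (key, value) rows of the output against the gold table."""
    out_rows = set(output.items())
    gold_rows = set(gold.items())
    correct = len(out_rows & gold_rows)
    if correct == 0:
        return 0.0
    precision = correct / len(out_rows)
    recall = correct / len(gold_rows)
    return 2 * precision * recall / (precision + recall)


# Toy example with invented values (not taken from the dataset)
t_s = {"name": "Example Stadium", "capacity": "40,000", "surface": "Grass"}
t_r = {"name": "Example Stadium", "capacity": "45,000", "opened": "1999"}
t_g = {"name": "Example Stadium", "capacity": "45,000",
       "surface": "Grass", "opened": "1999"}

t_o = naive_sync(t_s, t_r)
print(row_f1(t_s, t_g))  # score of the outdated source table
print(row_f1(t_o, t_g))  # score after the naive update
```

In the real cross-lingual setting the reference rows are in a different language than the source, so alignment and translation are needed before any such overlay is possible.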

Dataset Statistics

Below is the dataset breakdown by language and category, showcasing the diversity of data collected for evaluating information synchronization.

Language    Tables    Category    Tables
Afrikaans   7         Album       76
Arabic      120       Athlete     70
Cebuano     4         City        108
German      105       College     112
English     206       Company     148
Spanish     23        Country     122
French      123       Musician    138
Hindi       64        Person      108
Korean      93        Stadium     66
Dutch       21
Russian     131
Swedish     15
Turkish     18
Chinese     18

Dataset Statistics - Number of tables by language and category.
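As a quick sanity check on the counts above, the short Python snippet below sums both breakdowns. The numbers are copied from the table; the variable names are illustrative only. Both breakdowns total 948 tables, in line with the "approximately 950" figure quoted earlier.

```python
# Table counts copied from the Dataset Statistics table
tables_by_language = {
    "Afrikaans": 7, "Arabic": 120, "Cebuano": 4, "German": 105,
    "English": 206, "Spanish": 23, "French": 123, "Hindi": 64,
    "Korean": 93, "Dutch": 21, "Russian": 131, "Swedish": 15,
    "Turkish": 18, "Chinese": 18,
}
tables_by_category = {
    "Album": 76, "Athlete": 70, "City": 108, "College": 112,
    "Company": 148, "Country": 122, "Musician": 138, "Person": 108,
    "Stadium": 66,
}

total_by_language = sum(tables_by_language.values())
total_by_category = sum(tables_by_category.values())

print(len(tables_by_language), len(tables_by_category))  # 14 9
print(total_by_language, total_by_category)              # 948 948
```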

People

This paper was a research collaboration between people working at IIT Guwahati, University of Utah, and University of Pennsylvania.

From left to right, Siddharth Khincha, Tushar Kataria, Ankita Anand, Dan Roth and Vivek Gupta.

Citation

Please cite our paper as below if you use the INFOSYNC dataset.

Acknowledgement

The authors sincerely thank the reviewers and meta-reviewer of NAACL 2025 for their valuable pointers to related work, corrections, and helpful comments. The authors also thank Wikipedia, the largest free resource, for the InfoSync tables.