LLM Based Information Synchronization

About

In today’s digital era, non-English content on platforms like Wikipedia is often outdated or incomplete, especially in low-resource languages. Our work tackles this problem through two key contributions:

Information Updation Dataset: We introduce a new dataset that simulates the process of updating outdated Wikipedia tables by comparing older versions with current, human-curated tables.
Hierarchical Task Decomposition: Rather than a single-step update, we propose a structured approach that decomposes the synchronization task into multiple sub-tasks. This includes translating tables into a common language, converting them into knowledge graphs for improved reasoning, aligning and merging information from multiple sources, and finally updating the tables. This methodology not only improves performance but also offers interpretability and modularity to address complex edge cases.

Example:

Below is an example of information synchronization from our dataset. On the right is a reference table in a high-resource language, and on the left is an outdated table in a low-resource language. Updates made by our model are highlighted.

Methodology

Our solution uses large language models (LLMs) in a hierarchical task decomposition with four steps, each listed below with its justification:

Translation: Converting all tables to English ensures consistency, as most state-of-the-art LLMs are optimized for English, reducing noise from translation discrepancies.
Knowledge Graph Construction: Representing tables as structured graphs enhances the model’s reasoning abilities, as LLMs perform better with graph-structured information than with raw tabular data.
Alignment and Merging: The merging step is critical to identify overlapping, missing, or outdated information. It ensures that the most accurate and comprehensive data is preserved in the final output, and it addresses ambiguities that could arise from a direct update.
Final Update: Once the information is aligned and merged, the updated knowledge graph is converted back to a table. This step is essential to reconcile differences and produce a coherent output that mirrors a human-curated table.

Each of these steps was introduced to tackle specific challenges identified during the review process, and together they form a robust framework for multilingual information synchronization.
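
The four steps above can be sketched as a small pipeline. This is a minimal illustration, not the released implementation: in the actual method each step is carried out by an LLM, whereas here every step is a deterministic stand-in. The names `translate`, `to_graph`, `merge`, `to_table`, and `synchronize`, the toy lexicon, and the "Example City" data are all hypothetical, and the sketch assumes that the reference value wins whenever the two tables disagree.

```python
from typing import Callable, Dict, Set, Tuple

Table = Dict[str, str]          # infobox-style table: row key -> value
Triple = Tuple[str, str, str]   # (entity, relation, value)


def translate(table: Table, to_english: Callable[[str], str]) -> Table:
    """Step 1: map every key and value into English (stand-in for an LLM call)."""
    return {to_english(k): to_english(v) for k, v in table.items()}


def to_graph(entity: str, table: Table) -> Set[Triple]:
    """Step 2: represent each row as an (entity, relation, value) triple."""
    return {(entity, key, value) for key, value in table.items()}


def merge(source: Set[Triple], reference: Set[Triple]) -> Set[Triple]:
    """Step 3: align triples on (entity, relation).

    Assumption: the reference value overrides the source on conflict,
    and rows present in only one graph are kept.
    """
    merged = {(e, r): v for e, r, v in source}
    merged.update({(e, r): v for e, r, v in reference})
    return {(e, r, v) for (e, r), v in merged.items()}


def to_table(graph: Set[Triple]) -> Table:
    """Step 4: flatten the merged graph back into a table."""
    return {r: v for _, r, v in sorted(graph)}


def synchronize(entity: str, source: Table, reference: Table,
                to_english: Callable[[str], str]) -> Table:
    src = to_graph(entity, translate(source, to_english))
    ref = to_graph(entity, translate(reference, to_english))
    return to_table(merge(src, ref))


# Toy data (hypothetical): an outdated Spanish table and a newer English one.
toy_lexicon = {"Población": "Population", "Alcalde": "Mayor"}
to_english = lambda text: toy_lexicon.get(text, text)

source = {"Población": "1,000", "Alcalde": "A. Old"}
reference = {"Population": "1,200", "Mayor": "B. New", "Country": "Exampleland"}

updated = synchronize("Example City", source, reference, to_english)
print(updated)
```

Keeping each step as a separate function mirrors the modularity argument: any single stage (for example, the merge policy) can be swapped or inspected without touching the others. The sketch leaves the output in English.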

Results

Our main results on updating information, shown in Table 1, demonstrate that our proposed decomposition technique significantly outperforms several baselines.

Dataset

To evaluate multilingual table synchronization, we introduce the INFOUPDATE dataset, which simulates the real-world process of updating outdated Wikipedia tables. This dataset consists of approximately 950 annotated instances spanning 9 categories (Album, Athlete, City, College, Company, Country, Musician, Person, and Stadium) across 14 languages: English, Spanish, French, German, Arabic, Hindi, Korean, Russian, Afrikaans, Cebuano, Swedish, Dutch, Turkish, and Chinese.

The dataset construction process involves extracting two versions of the same Wikipedia table entity from different time periods. Specifically, we extract an old version from 2018 (Source Table) and a new version from 2023 (Reference Table). The goal is to update the outdated source table using information from the reference table while ensuring alignment with a manually curated gold-standard table.

Source Table (TS): The outdated version of an entity in language Li.
Reference Table (TR): The updated version of the entity in a different language Lj (i ≠ j).
Gold Table (TG): A human-annotated table that integrates and synchronizes all available updates.

The objective of the Information Synchronization task is to update rows in the Source Table (TS) using the Reference Table (TR) so that the generated Output Table (TO) closely matches the Gold Table (TG).
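
To make the objective concrete, the snippet below scores an Output Table against a Gold Table. It is only a sketch: the metric used here (precision, recall, and F1 over exact key/value row matches) is an assumption chosen for illustration, not necessarily the metric reported in the paper, and the function name `row_scores` and the toy tables are hypothetical.

```python
from typing import Dict, Tuple

Table = Dict[str, str]  # row key -> value


def row_scores(output: Table, gold: Table) -> Tuple[float, float, float]:
    """Precision, recall, and F1 over exact (key, value) row matches."""
    out_rows, gold_rows = set(output.items()), set(gold.items())
    matched = len(out_rows & gold_rows)
    precision = matched / len(out_rows) if out_rows else 0.0
    recall = matched / len(gold_rows) if gold_rows else 0.0
    f1 = 2 * precision * recall / (precision + recall) if matched else 0.0
    return precision, recall, f1


# Toy tables (hypothetical): the output still carries one outdated row.
gold = {"Population": "1,200", "Mayor": "B. New", "Country": "Exampleland"}
output = {"Population": "1,200", "Mayor": "A. Old", "Country": "Exampleland"}

precision, recall, f1 = row_scores(output, gold)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Exact matching is deliberately strict. A real evaluation would likely need to normalize values (dates, number formats, transliterations) before comparing rows.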

Dataset Statistics

Below is the dataset breakdown by language and category, showcasing the diversity of data collected for evaluating information synchronization.

Dataset Statistics - Number of tables by language.
Language	Tables
Afrikaans	7
Arabic	120
Cebuano	4
German	105
English	206
Spanish	23
French	123
Hindi	64
Korean	93
Dutch	21
Russian	131
Swedish	15
Turkish	18
Chinese	18

Dataset Statistics - Number of tables by category.
Category	Tables
Album	76
Athlete	70
City	108
College	112
Company	148
Country	122
Musician	138
Person	108
Stadium	66
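
As a quick consistency check, the counts above can be totaled per breakdown. The snippet copies the numbers from the statistics table. The variable names are illustrative only. Both breakdowns sum to 948 tables, in line with the "approximately 950" instances stated earlier.

```python
# Counts copied from the dataset statistics table above.
tables_by_language = {
    "Afrikaans": 7, "Arabic": 120, "Cebuano": 4, "German": 105,
    "English": 206, "Spanish": 23, "French": 123, "Hindi": 64,
    "Korean": 93, "Dutch": 21, "Russian": 131, "Swedish": 15,
    "Turkish": 18, "Chinese": 18,
}
tables_by_category = {
    "Album": 76, "Athlete": 70, "City": 108, "College": 112,
    "Company": 148, "Country": 122, "Musician": 138, "Person": 108,
    "Stadium": 66,
}

total_by_language = sum(tables_by_language.values())
total_by_category = sum(tables_by_category.values())

print(len(tables_by_language), "languages,", total_by_language, "tables")
print(len(tables_by_category), "categories,", total_by_category, "tables")
```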

People

This paper was a research collaboration between researchers at IIT Guwahati, University of Utah, and University of Pennsylvania.

From left to right, Siddharth Khincha, Tushar Kataria, Ankita Anand, Dan Roth and Vivek Gupta.

Citation

Please cite our paper as below if you use the INFOSYNC dataset.

Acknowledgement

The authors sincerely thank the reviewers and meta-reviewer of NAACL 2025 for their valuable pointers to related work, corrections, and helpful comments. The authors also thank Wikipedia, the largest free resource, for the InfoSync tables.