@inproceedings{shen-etal-2025-autoclean,
    title = "{A}uto{C}lean: {LLM}s Can Prepare Their Training Corpus",
    author = "Shen, Xingyu and
      Hu, Shengding and
      Zhang, Xinrong and
      Han, Xu and
      Meng, Xiaojun and
      Wei, Jiansheng and
      Liu, Zhiyuan and
      Sun, Maosong",
    editor = "Dziri, Nouha and
      Ren, Sean (Xiang) and
      Diao, Shizhe",
    booktitle = "Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (System Demonstrations)",
    month = apr,
    year = "2025",
    address = "Albuquerque, New Mexico",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.naacl-demo.9/",
    doi = "10.18653/v1/2025.naacl-demo.9",
    pages = "85--95",
    ISBN = "979-8-89176-191-9",
abstract = "Recent studies highlight the reliance of Large Language Models (LLMs) on high-quality, diverse data for optimal performance. The data sourced from the Internet often aggregated into datasets like the Common Crawl corpus, presents significant quality variability and necessitates extensive cleaning. Moreover, specific domain knowledge is usually presented in HTML, but there is a lack of effective methods to clean them into the training corpus automatically. Traditional cleaning methods involve either labor-intensive human teams that lack scalability or static heuristics that lead to suboptimal outcomes and are unable to be applied to specific target domains. In this paper, inspired by the recent progress in employing LLMs as versatile agents for diverse tasks, we take the initiative to explore the potential of these agents in automating data-cleaning methodologies. By configuring LLMs as an agent team that imitates the human data-cleaning team, we can automatically generate cleaning rules that traditionally require the involvement of data-cleaning experts. These rules are developed using a limited number of data samples and can then be applied broadly to substantial portions of raw data from the same domain. We demonstrate the efficiency and effectiveness of on both pre-train scale corpora such as Common Crawl and specific target websites. Both automatic and human evaluations of the quality of the cleaned content highlight the feasibility of using LLMs to prepare their training corpus."
}
<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
  <mods ID="shen-etal-2025-autoclean">
    <titleInfo>
      <title>AutoClean: LLMs Can Prepare Their Training Corpus</title>
    </titleInfo>
    <name type="personal">
      <namePart type="given">Xingyu</namePart>
      <namePart type="family">Shen</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">author</roleTerm>
      </role>
    </name>
    <name type="personal">
      <namePart type="given">Shengding</namePart>
      <namePart type="family">Hu</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">author</roleTerm>
      </role>
    </name>
    <name type="personal">
      <namePart type="given">Xinrong</namePart>
      <namePart type="family">Zhang</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">author</roleTerm>
      </role>
    </name>
    <name type="personal">
      <namePart type="given">Xu</namePart>
      <namePart type="family">Han</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">author</roleTerm>
      </role>
    </name>
    <name type="personal">
      <namePart type="given">Xiaojun</namePart>
      <namePart type="family">Meng</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">author</roleTerm>
      </role>
    </name>
    <name type="personal">
      <namePart type="given">Jiansheng</namePart>
      <namePart type="family">Wei</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">author</roleTerm>
      </role>
    </name>
    <name type="personal">
      <namePart type="given">Zhiyuan</namePart>
      <namePart type="family">Liu</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">author</roleTerm>
      </role>
    </name>
    <name type="personal">
      <namePart type="given">Maosong</namePart>
      <namePart type="family">Sun</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">author</roleTerm>
      </role>
    </name>
    <originInfo>
      <dateIssued>2025-04</dateIssued>
    </originInfo>
    <typeOfResource>text</typeOfResource>
    <relatedItem type="host">
      <titleInfo>
        <title>Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (System Demonstrations)</title>
      </titleInfo>
      <name type="personal">
        <namePart type="given">Nouha</namePart>
        <namePart type="family">Dziri</namePart>
        <role>
          <roleTerm authority="marcrelator" type="text">editor</roleTerm>
        </role>
      </name>
      <name type="personal">
        <namePart type="given">Sean</namePart>
        <namePart type="given">(Xiang)</namePart>
        <namePart type="family">Ren</namePart>
        <role>
          <roleTerm authority="marcrelator" type="text">editor</roleTerm>
        </role>
      </name>
      <name type="personal">
        <namePart type="given">Shizhe</namePart>
        <namePart type="family">Diao</namePart>
        <role>
          <roleTerm authority="marcrelator" type="text">editor</roleTerm>
        </role>
      </name>
      <originInfo>
        <publisher>Association for Computational Linguistics</publisher>
        <place>
          <placeTerm type="text">Albuquerque, New Mexico</placeTerm>
        </place>
      </originInfo>
      <genre authority="marcgt">conference publication</genre>
      <identifier type="isbn">979-8-89176-191-9</identifier>
    </relatedItem>
    <abstract>Recent studies highlight the reliance of Large Language Models (LLMs) on high-quality, diverse data for optimal performance. The data sourced from the Internet, often aggregated into datasets like the Common Crawl corpus, presents significant quality variability and necessitates extensive cleaning. Moreover, specific domain knowledge is usually presented in HTML, but there is a lack of effective methods to clean it into the training corpus automatically. Traditional cleaning methods involve either labor-intensive human teams that lack scalability or static heuristics that lead to suboptimal outcomes and cannot be applied to specific target domains. In this paper, inspired by the recent progress in employing LLMs as versatile agents for diverse tasks, we explore the potential of these agents in automating data-cleaning methodologies. By configuring LLMs as an agent team that imitates a human data-cleaning team, we can automatically generate cleaning rules that traditionally require the involvement of data-cleaning experts. These rules are developed using a limited number of data samples and can then be applied broadly to substantial portions of raw data from the same domain. We demonstrate the efficiency and effectiveness of AutoClean on both pre-training-scale corpora such as Common Crawl and specific target websites. Both automatic and human evaluations of the quality of the cleaned content highlight the feasibility of using LLMs to prepare their training corpus.</abstract>
<identifier type="citekey">shen-etal-2025-autoclean</identifier>
<identifier type="doi">10.18653/v1/2025.naacl-demo.9</identifier>
<location>
<url>https://aclanthology.org/2025.naacl-demo.9/</url>
</location>
<part>
<date>2025-04</date>
<extent unit="page">
<start>85</start>
<end>95</end>
</extent>
</part>
</mods>
</modsCollection>
%0 Conference Proceedings
%T AutoClean: LLMs Can Prepare Their Training Corpus
%A Shen, Xingyu
%A Hu, Shengding
%A Zhang, Xinrong
%A Han, Xu
%A Meng, Xiaojun
%A Wei, Jiansheng
%A Liu, Zhiyuan
%A Sun, Maosong
%Y Dziri, Nouha
%Y Ren, Sean (Xiang)
%Y Diao, Shizhe
%S Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (System Demonstrations)
%D 2025
%8 April
%I Association for Computational Linguistics
%C Albuquerque, New Mexico
%@ 979-8-89176-191-9
%F shen-etal-2025-autoclean
%X Recent studies highlight the reliance of Large Language Models (LLMs) on high-quality, diverse data for optimal performance. The data sourced from the Internet, often aggregated into datasets like the Common Crawl corpus, presents significant quality variability and necessitates extensive cleaning. Moreover, specific domain knowledge is usually presented in HTML, but there is a lack of effective methods to clean it into the training corpus automatically. Traditional cleaning methods involve either labor-intensive human teams that lack scalability or static heuristics that lead to suboptimal outcomes and cannot be applied to specific target domains. In this paper, inspired by the recent progress in employing LLMs as versatile agents for diverse tasks, we explore the potential of these agents in automating data-cleaning methodologies. By configuring LLMs as an agent team that imitates a human data-cleaning team, we can automatically generate cleaning rules that traditionally require the involvement of data-cleaning experts. These rules are developed using a limited number of data samples and can then be applied broadly to substantial portions of raw data from the same domain. We demonstrate the efficiency and effectiveness of AutoClean on both pre-training-scale corpora such as Common Crawl and specific target websites. Both automatic and human evaluations of the quality of the cleaned content highlight the feasibility of using LLMs to prepare their training corpus.
%R 10.18653/v1/2025.naacl-demo.9
%U https://aclanthology.org/2025.naacl-demo.9/
%U https://doi.org/10.18653/v1/2025.naacl-demo.9
%P 85-95
Markdown (Informal)
[AutoClean: LLMs Can Prepare Their Training Corpus](https://aclanthology.org/2025.naacl-demo.9/) (Shen et al., NAACL 2025)
ACL
Xingyu Shen, Shengding Hu, Xinrong Zhang, Xu Han, Xiaojun Meng, Jiansheng Wei, Zhiyuan Liu, and Maosong Sun. 2025. AutoClean: LLMs Can Prepare Their Training Corpus. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (System Demonstrations), pages 85–95, Albuquerque, New Mexico. Association for Computational Linguistics.
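
The abstract describes a two-stage pipeline: an LLM agent team drafts cleaning rules from a handful of sample documents, and those rules are then applied cheaply across the rest of the domain's raw data without further model calls. Below is a minimal, hypothetical Python sketch of that generate-once, apply-broadly pattern; the prompt wording, the injected `llm` callable, and the regex-based rule format are illustrative assumptions, not the paper's actual agent team or rule language.

```python
import re
from typing import Callable, Iterable, Iterator, List

# Hypothetical prompt; the paper's agents and rule format may differ.
RULE_PROMPT = (
    "You are a data-cleaning expert. From the sample documents below, "
    "write one regular expression per line that matches boilerplate to delete.\n\n"
    "{samples}"
)


def generate_rules(llm: Callable[[str], str], samples: List[str]) -> List[re.Pattern]:
    """Stage 1: consult the LLM once, on a few samples, to draft deletion rules."""
    reply = llm(RULE_PROMPT.format(samples="\n---\n".join(samples)))
    rules: List[re.Pattern] = []
    for line in reply.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            rules.append(re.compile(line))
        except re.error:
            continue  # discard model output that is not a valid regex
    return rules


def clean_corpus(docs: Iterable[str], rules: List[re.Pattern]) -> Iterator[str]:
    """Stage 2: apply the fixed rules to the whole corpus, with no LLM calls."""
    for doc in docs:
        for rule in rules:
            doc = rule.sub("", doc)
        yield doc


if __name__ == "__main__":
    # Stand-in for a real LLM client; a deployment would call an actual model.
    fake_llm = lambda _prompt: r"(?m)^Cookie notice:.*\n?"
    rules = generate_rules(fake_llm, ["Cookie notice: we use cookies.\nBody text."])
    print(list(clean_corpus(["Cookie notice: accept all?\nKeep this line."], rules)))
    # -> ['Keep this line.']
```

The property the abstract emphasizes is cost: the model is only consulted on the small sample set, while the bulk of the corpus is processed by the cheap compiled rules, which is what makes the approach feasible at pre-training scale.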