BERT-based Document Clustering: Unveiling Semantic Patterns in 20News Group, Reuters, and BBC Sports Corpora

Ratnam Dodda; Suresh Babu Alladi

doi:10.22541/au.171506422.20645846/v1

Abstract

Document clustering plays a pivotal role in structuring and analyzing large text collections. In this paper, we apply Bidirectional Encoder Representations from Transformers (BERT), a pre-trained language model, to document clustering on three datasets: 20News Group, Reuters, and BBC Sports. We represent each document with BERT's pre-trained contextual embeddings, on the premise that they capture semantic relationships within the text and thereby improve clustering. Our objective is to assess how well BERT-based clustering adapts to the characteristics of each dataset, and thus to offer insight into the model's generalizability, effectiveness, and potential applications across domains.
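
The abstract does not specify how document vectors are formed or which clustering algorithm is used. As a minimal sketch of an embedding-then-cluster pipeline, the example below assumes mean pooling over token embeddings (ignoring padding) followed by k-means; both choices are assumptions, not the authors' stated method. The token vectors are toy two-dimensional stand-ins for the outputs of a pre-trained BERT encoder, and the function names `mean_pool`, `l2_normalize`, and `kmeans` are hypothetical helpers written for illustration.

```python
import math
import random


def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1 (padding is ignored)."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vec):
                total[i] += value
    return [t / max(count, 1) for t in total]


def l2_normalize(vec):
    """Scale a vector to unit length so Euclidean distance tracks cosine similarity."""
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's k-means; returns one cluster label per input point."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(iters):
        # Assignment step: attach each point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Toy stand-ins for per-token BERT outputs (assumption: real vectors would
# come from a pre-trained encoder). The last token of each document is
# padding and is masked out.
documents = [
    ([[1.0, 0.0], [0.8, 0.2], [9.0, 9.0]], [1, 1, 0]),
    ([[0.9, 0.1], [1.0, 0.1], [9.0, 9.0]], [1, 1, 0]),
    ([[0.0, 1.0], [0.2, 0.8], [9.0, 9.0]], [1, 1, 0]),
    ([[0.1, 0.9], [0.1, 1.0], [9.0, 9.0]], [1, 1, 0]),
]

doc_vectors = [l2_normalize(mean_pool(tokens, mask)) for tokens, mask in documents]
labels = kmeans(doc_vectors, k=2)
```

Normalizing the pooled vectors before clustering is a design choice of this sketch: on unit-length vectors, squared Euclidean distance is a monotone function of cosine similarity, so k-means groups documents by direction rather than by vector magnitude.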