BERT-Based Document Clustering: Unveiling Semantic Patterns in the
20 Newsgroups, Reuters, and BBC Sports Corpora
Abstract
Document clustering plays a pivotal role in structuring and
analyzing large text collections. In this paper, we use Bidirectional
Encoder Representations from Transformers (BERT), a pre-trained
Transformer language model, to cluster documents from three datasets:
20 Newsgroups, Reuters, and BBC Sports. BERT’s contextualized
embeddings represent each word in the context of its surrounding text,
giving the clustering step a richer view of document semantics. We
cluster documents using BERT’s pre-trained contextual embeddings and
assess how well the approach adapts to the unique characteristics of
each dataset. Our aim is to shed light on the model’s performance,
generalizability, and potential applications across diverse domains.