Evaluating Large Language Models: ChatGPT-4, Mistral 8x7B, and Google Gemini Benchmarked Against MMLU

Kensuke Ono; Akira Morita

doi:10.36227/techrxiv.170956672.21573677/v1

Abstract

This study explores the capabilities of contemporary large language models (LLMs), specifically ChatGPT-4, Google Gemini, and Mistral 8x7B, in processing and generating text across languages, with a focused comparison of English and Japanese. Using a benchmarking methodology anchored in the Massive Multitask Language Understanding (MMLU) framework, we quantitatively assess the performance of these models on a variety of linguistic tasks designed to challenge their understanding, reasoning, and language generation. The tests range from simple grammatical assessments to complex reasoning and comprehension challenges, enabling a comprehensive evaluation of each model's linguistic proficiency and adaptability. Our key finding is that language performance differs significantly among the evaluated LLMs: ChatGPT-4 demonstrates superior proficiency in English, Google Gemini excels in Japanese, and Mistral 8x7B shows balanced performance across both languages. These results highlight the influence of training data diversity, model architecture, and linguistic focus on the ability of LLMs to understand and generate human language. Our study also underscores the need to incorporate a more diverse and inclusive range of linguistic data in the training of future LLMs. We advocate for language technologies that bridge linguistic gaps, enhance cross-cultural communication, and foster a more equitable digital landscape for users worldwide.

27 Feb 2024: Submitted to TechRxiv

04 Mar 2024: Published in TechRxiv
