Large language models (LLMs) have transformed natural language processing (NLP), yet open multilingual LLMs remain scarce, and those that exist cover few languages. They typically prioritize well-resourced languages such as French and German, while widely spoken but under-resourced languages such as Hindi, Bengali, and Urdu are overlooked.
To address this disparity, we introduce Babel, a multilingual LLM that covers the top 25 languages by number of speakers, which together account for over 90% of the global population, including many languages neglected by other open multilingual LLMs.
Unlike traditional continued-pretraining approaches, Babel expands its parameter count through a layer extension technique that raises its performance ceiling. We introduce two variants: Babel-9B, designed for efficient single-GPU inference and fine-tuning, and Babel-83B, which sets a new standard for open multilingual LLMs. Extensive evaluations on multilingual tasks demonstrate Babel's superior performance over open LLMs of comparable size. In addition, using existing supervised fine-tuning datasets, the chat variants perform remarkably well: Babel-9B-Chat leads among 10B-sized LLMs, and Babel-83B-Chat performs comparably to GPT-4o on certain tasks.
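Depth-wise layer extension is straightforward to prototype. Below is a minimal sketch assuming a LLaMA-style decoder whose blocks live in `model.model.layers` and assuming new blocks are initialized by duplicating existing ones in place; the exact placement and initialization Babel uses may differ, so treat this as an illustration rather than the recipe from the paper.

```python
# Minimal sketch of depth-wise layer extension (model up-scaling).
# Assumption: a LLaMA-style decoder; Babel's actual layer placement
# and initialization may differ.
import copy

import torch.nn as nn
from transformers import AutoModelForCausalLM


def extend_layers(model, insert_after):
    """Insert a duplicate of each decoder block whose index is in
    `insert_after` directly behind the original, increasing depth."""
    new_layers = []
    for i, layer in enumerate(model.model.layers):
        new_layers.append(layer)
        if i in insert_after:
            new_layers.append(copy.deepcopy(layer))  # copy weights verbatim
    # Re-index per-layer attention so KV-cache bookkeeping stays consistent
    # (newer transformers versions track a layer_idx on each attention module).
    for idx, layer in enumerate(new_layers):
        if hasattr(layer, "self_attn") and hasattr(layer.self_attn, "layer_idx"):
            layer.self_attn.layer_idx = idx
    model.model.layers = nn.ModuleList(new_layers)
    model.config.num_hidden_layers = len(new_layers)
    return model


# Usage (hypothetical base checkpoint), followed by continued pretraining:
# model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
# model = extend_layers(model, insert_after={7, 15, 23})
```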
Languages supported by Babel, sorted by number of speakers (B = billion, M = million). Highlighted languages are those underexplored by previous multilingual LLMs.
Language | Speakers | Language Family | Macroarea |
---|---|---|---|
English | 1.5B | Germanic | Worldwide |
Chinese (Mandarin) | 1.4B | Sinitic | Asia |
Hindi | 700M | Indo-Aryan | Asia |
Spanish | 595M | Romance | Americas, Europe |
Standard Arabic | 400M | Semitic | Asia, Africa |
French | 300M | Romance | Europe, Africa, Americas |
Bengali | 300M | Indo-Aryan | Asia |
Portuguese | 270M | Romance | Americas, Europe, Africa |
Russian | 260M | Slavic | Europe, Asia |
Urdu | 230M | Indo-Aryan | Asia |
Indonesian | 200M | Malayo-Polynesian | Asia |
Standard German | 135M | Germanic | Europe |
Japanese | 130M | Japonic | Asia |
Swahili | 100M | Bantu | Africa |
Filipino (Tagalog) | 100M | Malayo-Polynesian | Asia |
Tamil | 90M | Dravidian | Asia |
Vietnamese | 86M | Vietic | Asia |
Turkish | 85M | Turkic | Asia, Europe |
Italian | 85M | Romance | Europe |
Javanese | 83M | Malayo-Polynesian | Asia |
Korean | 80M | Koreanic | Asia |
Hausa | 80M | Chadic | Africa |
Iranian Persian | 80M | Indo-Iranian | Asia |
Thai | 80M | Kra-Dai | Asia |
Burmese | 50M | Tibeto-Burman | Asia |
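
Both variants load through the standard Hugging Face `transformers` interface. The sketch below assumes the chat checkpoint is published under the hub ID `Tower-Babel/Babel-9B-Chat` and ships a chat template; adjust the ID to the actual release.

```python
# Minimal inference sketch for the chat variant.
# Assumption: hub ID "Tower-Babel/Babel-9B-Chat"; adjust to the actual release.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Tower-Babel/Babel-9B-Chat"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Spanish prompt: "Write a short poem about the sea."
messages = [{"role": "user", "content": "Escribe un poema corto sobre el mar."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```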
We evaluate Babel on multilingual tasks spanning several categories: world knowledge (MMMLU, M3Exam), commonsense reasoning (XCOPA), mathematical reasoning (MGSM), natural language inference (XNLI), and translation (Flores-200). The first table compares Babel-9B with open LLMs of comparable (10B-class) size:

Dataset | GLM4-9B | Gemma2-9B | Mistral-12B | Llama3.1-8B | Qwen2.5-7B | Babel-9B |
---|---|---|---|---|---|---|
MMMLU | 55.6 | 59.8 | 52.8 | 49.4 | 56.7 | 59.4 |
M3Exam | 56.6 | 61.6 | 54.2 | 52.5 | 58.8 | 61.3 |
XCOPA | 87.3 | 84.6 | 81.3 | 75.9 | 81.1 | 89.2 |
MGSM | 39.0 | 34.3 | 26.0 | 18.0 | 41.1 | 43.4 |
XNLI | 69.9 | 61.7 | 55.0 | 48.9 | 70.3 | 71.9 |
Flores-200 | 46.6 | 53.2 | 50.8 | 50.9 | 45.5 | 55.1 |
Average | 59.2 | 59.5 | 53.4 | 49.3 | 58.9 | 63.4 |

The second table compares Babel-83B with leading open LLMs of larger scale:

Dataset | Llama3.1-70B | Qwen2.5-72B | Babel-83B
---|---|---|---|
MMMLU | 69.1 | 74.7 | 76.3 |
M3Exam | 67.4 | 71.2 | 72.1 |
XCOPA | 92.6 | 81.1 | 92.8 |
MGSM | 48.9 | 63.9 | 62.6 |
XNLI | 66.2 | 74.9 | 76.6 |
Flores-200 | 57.4 | 53.1 | 58.8 |
Average | 66.9 | 69.8 | 73.2 |
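
The Average rows are the unweighted mean of the six task scores; a quick check for the two Babel columns:

```python
# Sanity check: "Average" = unweighted mean of the six task scores
# (MMMLU, M3Exam, XCOPA, MGSM, XNLI, Flores-200), as reported above.
scores = {
    "Babel-9B":  [59.4, 61.3, 89.2, 43.4, 71.9, 55.1],
    "Babel-83B": [76.3, 72.1, 92.8, 62.6, 76.6, 58.8],
}
for model, vals in scores.items():
    print(f"{model}: {sum(vals) / len(vals):.1f}")
# Babel-9B: 63.4, Babel-83B: 73.2, matching the tables
```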
We would like to thank Guanzheng Chen for assisting with the implementation of the training codebase. Our special thanks go to our professional and native linguists—Tantong Champaiboon, Nguyen Ngoc Yen Nhi, and Tara Devina Putri—who contributed to building, evaluating, and fact-checking our sampled pretraining dataset. We also appreciate Fan Wang, Jiasheng Tang, Xin Li, and Hao Zhang for their efforts in coordinating computing resources.
If you find our project useful, please star our repo and cite our work as follows.
Corresponding Author: wxzhang@sutd.edu.sg
```bibtex
@misc{zhao2025babelopenmultilinguallarge,
      title={Babel: Open Multilingual Large Language Models Serving Over 90% of Global Speakers},
      author={Yiran Zhao and Chaoqun Liu and Yue Deng and Jiahao Ying and Mahani Aljunied and Zhaodonghui Li and Lidong Bing and Hou Pong Chan and Yu Rong and Deli Zhao and Wenxuan Zhang},
      year={2025},
      eprint={2503.00865},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2503.00865},
}
```