Original Article

JJCIT. 2025; 11(3): 319-335


Jordanian Arabic to Modern Standard Arabic Translation Using a Large Model Tuned on a Purpose-Built Dataset and Synthetic Error Injection

Gheith A. Abandah, Moath R. Khaleel, Iyad F. Jafar, Mohammad R. Abdel-majeed, Yousef H. Hamdan, Ashraf E. Suyyagh, Asma A. Abdel-karim, Shorouk M. Alawawdeh.



Abstract

This paper addresses the challenge of accurately translating Jordanian Arabic into Modern Standard Arabic (MSA) and correcting common linguistic errors. Although MSA is the formal standard for Arabic communication, the widespread use of local dialects in social media and everyday interactions often results in texts laden with spelling and grammatical issues. To overcome these challenges, we present an end-to-end system based on a newly constructed Jordanian Arabic dataset (JODA) comprising 59,135 sentences, as well as the Tashkeela dataset perturbed through synthetic error injection. We employ ByT5, a large pre-trained language model that processes text at the byte level, making it resilient to spelling variations and morphological complexities common in Arabic dialects. Our experimental results show that fine-tuning ByT5 on JODA and a 10% error-injected Tashkeela subset notably raises BLEU scores and lowers character error rates (CER). Combining JODA with the synthetically modified Tashkeela data reduces the CER to 4.64% on the Test-200 test set and 1.65% on the TSMTS test set. Moreover, manual inspections reveal that the model produces correct or near-correct translations in most cases. Finally, we developed a custom smartphone keyboard and a web portal to demonstrate how the system can be made easily accessible to interested users, offering a practical solution for millions of Arabic speakers seeking to produce accurate, diacritized MSA text. This solution is currently limited to the Jordanian dialect; future work will focus on developing similar datasets and solutions for other Arabic dialects.

Key words: Jordanian Arabic, Modern Standard Arabic, Dialectal Translation, Large Language Models, Synthetic Error Injection, Natural Language Processing, ByT5
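
The abstract reports results as character error rates (CER). As a rough illustration of how such a figure is typically obtained, the sketch below computes CER as the character-level edit distance between a reference and a system output, divided by the reference length. This is the common textbook definition and an assumption on our part; the abstract does not state the exact normalization or preprocessing the authors used. The function names and the sample strings are hypothetical.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of character insertions, deletions and
    substitutions needed to turn `hyp` into `ref`."""
    # prev[j] holds the distance between the ref prefix processed so far
    # and the first j characters of hyp.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # character missing from hyp
                            curr[j - 1] + 1,      # extra character in hyp
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by reference length.
    Assumed definition; not taken from the paper."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return levenshtein(ref, hyp) / len(ref)


if __name__ == "__main__":
    # Hypothetical strings: the output drops one of four reference letters.
    reference = "كتاب"
    output = "كتب"
    print(f"CER = {cer(reference, output):.2%}")
```

Because the distance is counted per Unicode code point here, each diacritic mark counts as its own character, so a missing diacritic in diacritized MSA output would register as one error under this sketch.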








The articles in Bibliomed are open access articles licensed under Creative Commons Attribution 4.0 International License (CC BY), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.