Original Article

JJCIT. 2022; 8(4): 370-387


COTA 2.0: an Automatic Corrector of Tunisian Arabic Social Media Texts

Asma Mekki, Inès Zribi, Mariem Ellouze, Lamia Hadrich Belguith.




Abstract

Orthographic noise in written text is a common concern for NLP, especially when processing social network comments and raw documents. This noise stems mainly from orthographic conventions and morphological ambiguity. We propose to automatically normalize social media dialect corpora following CODA-TUN, the Conventional Orthography for Tunisian Arabic (TA). The existing system developed for TA cannot handle all forms of TA, so we extend its rules and lexicons to address the peculiarities of social media dialect. For certain words, the COTA Orthography 1.0 system offers the user several candidate corrections. In the new version, we therefore incorporate a trigram language model to select the right correction automatically. Our results show that the system can reduce transcription errors by 95.72%.

Key words: Orthographic normalization, Tunisian Arabic, COTA Orthography system, CODA-TUN
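
The abstract states only that a trigram language model chooses among several candidate corrections. The following is a minimal sketch of how such a selection step might look, not the authors' implementation. The function names (`train_trigram_counts`, `log_prob`, `select_correction`), the add-one smoothing, the sentence padding symbols, and the placeholder tokens are all assumptions made for illustration.

```python
import math
from collections import Counter


def train_trigram_counts(sentences):
    """Count trigrams and their two-token histories over padded sentences."""
    tri, bi = Counter(), Counter()
    vocab = set()
    for tokens in sentences:
        padded = ["<s>", "<s>"] + list(tokens) + ["</s>"]
        vocab.update(padded)
        for i in range(2, len(padded)):
            tri[tuple(padded[i - 2:i + 1])] += 1
            bi[tuple(padded[i - 2:i])] += 1
    return tri, bi, len(vocab)


def log_prob(tokens, tri, bi, vocab_size):
    """Log-probability of a token sequence with add-one smoothing (an assumption)."""
    padded = ["<s>", "<s>"] + list(tokens) + ["</s>"]
    total = 0.0
    for i in range(2, len(padded)):
        history = tuple(padded[i - 2:i])
        trigram = tuple(padded[i - 2:i + 1])
        total += math.log((tri[trigram] + 1) / (bi[history] + vocab_size))
    return total


def select_correction(left, candidates, right, model):
    """Return the candidate that makes the surrounding sentence most probable."""
    tri, bi, vocab_size = model
    return max(
        candidates,
        key=lambda c: log_prob(list(left) + [c] + list(right), tri, bi, vocab_size),
    )


# Hypothetical toy corpus of already-normalized sentences (placeholder tokens).
corpus = [["a", "b", "c"], ["a", "b", "c"], ["a", "d", "e"]]
model = train_trigram_counts(corpus)
best = select_correction(["a"], ["b", "d"], ["c"], model)
```

In this sketch each candidate is scored in its full sentence context, so both the preceding and the following words influence the choice. A real system would likely train on a much larger normalized corpus and use a stronger smoothing method.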







The articles in Bibliomed are open access articles licensed under Creative Commons Attribution 4.0 International License (CC BY), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.