In this paper, we address part-of-speech (POS) tagging and morphological classification of Arabic text. Our focus is Classical Arabic (CA) and Modern Standard Arabic (MSA) in which the text is vocalized, with diacritics on most of its letters. The proposed classification method is lexicon-free and tokenization-agnostic, and it requires no stemming or artificial intelligence techniques. The goal is to reduce the resources needed to classify Arabic text. The method builds on the fact that each verb in Arabic follows a morphological pattern (وزن) that can be used to identify the word. Classification is driven by a finite state machine translated into regular expressions (REs), with each verb tense represented by a set of REs. The order in which the REs are processed affects the accuracy of the result: whenever a match is found, the word is marked so that no further matches occur. The method is lightweight and provides a best-effort classifier in which the closest match is assigned as the tag.
Key words: Part of Speech Tagging, Arabic Rule Based Classifier, Natural Languages, Context Free.
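As an illustration of the ordered, first-match-wins matching described in the abstract, the following is a minimal sketch in Python. The two verb patterns, the catch-all rule, the tag names, and the function `tag` are hypothetical and chosen only for illustration; they are not the rule set used in this paper, and they assume fully vocalized input written with standard Unicode diacritics.

```python
import re

# Unicode code points for the short vowels and sukun (assumed encoding).
FATHA, DAMMA, KASRA, SUKUN = "\u064E", "\u064F", "\u0650", "\u0652"
YA = "\u064A"
LETTER = "[\u0621-\u064A]"                    # any basic Arabic letter
SHORT_VOWEL = f"[{FATHA}{DAMMA}{KASRA}]"      # any one short vowel
DIACRITIC = "[\u064B-\u0652]"                 # tanwin, short vowels, shadda, sukun

# Ordered rule list: specific patterns first, the catch-all last.
# These patterns are illustrative assumptions, not the paper's rules.
RULES = [
    # Present tense on the pattern ya-f'-a/i/u-l-u, e.g. "yaktubu"
    ("VERB_PRESENT", re.compile(
        f"{YA}{FATHA}{LETTER}{SUKUN}{LETTER}{SHORT_VOWEL}{LETTER}{DAMMA}")),
    # Past tense on the pattern fa-'a/i/u-la, e.g. "kataba"
    ("VERB_PAST", re.compile(
        f"{LETTER}{FATHA}{LETTER}{SHORT_VOWEL}{LETTER}{FATHA}")),
    # Catch-all: any sequence of Arabic letters with optional diacritics
    ("WORD", re.compile(f"(?:{LETTER}{DIACRITIC}*)+")),
]

def tag(word):
    """Return the tag of the first rule whose pattern matches the whole word."""
    for label, pattern in RULES:
        if pattern.fullmatch(word):
            return label  # stop here so later rules cannot re-tag the word
    return "UNKNOWN"

if __name__ == "__main__":
    kataba = "\u0643\u064E\u062A\u064E\u0628\u064E"               # past-tense form
    yaktubu = "\u064A\u064E\u0643\u0652\u062A\u064F\u0628\u064F"  # present-tense form
    for w in (kataba, yaktubu):
        print(w, tag(w))
```

In this sketch the past-tense form also satisfies the catch-all rule, so placing the catch-all last is what keeps the more specific tag; reordering the list would change the result, which is the sense in which rule order affects accuracy.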