Volume 9 Number 7 (Jul. 2014)
JSW 2014 Vol.9(7): 1818-1826 ISSN: 1796-217X
doi: 10.4304/jsw.9.7.1818-1826

Challenges of Diacritical Marker or Hudhaa Character in Tokenization of Oromo Text

Abraham Tesso Nedjo, Degen Huang, Xiaoxia Liu

School of Computer Science and Technology, Faculty of Electronic Information and Electrical Engineering, Dalian University of Technology, Dalian, China

Abstract—Tokenization in natural language processing is the task of identifying every token in a text. For languages such as Oromo, for which little language-processing work has been done so far, tokenization cannot be overlooked. This paper reports on an Oromo tokenizer that we designed and tested to accommodate the challenge of the diacritical marker Hudhaa, a sign that represents the in-word glottal sound in the Oromo language. In this work, we studied the effect of using the acute accent as the diacritical mark, rather than other, potentially confusing marks such as the right quote, to write Hudhaa. Accuracy is a prime factor in evaluating any Natural Language Processing (NLP) system, so we measured the accuracy of our system on 1.2 MB of Oromo text data (9,686 sentences containing 164,932 words); the algorithm achieved an accuracy of 99.94%.

Index Terms—Diacritical Marker; Glottal; Hudhaa; Oromo; Tokenization
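
To make the Hudhaa ambiguity concrete, the following is a minimal Python sketch, not the authors' algorithm. It assumes that a Hudhaa mark is any apostrophe-like character standing between two letters, and that the same character at a word edge is a quotation mark. The names HUDHAA_CHARS, normalize_hudhaa, and tokenize are hypothetical and chosen only for illustration.

```python
import re

# Assumption: characters that may be used to write Hudhaa inside a word
# (ASCII apostrophe, right single quote, acute accent, modifier apostrophe).
HUDHAA_CHARS = "'\u2019\u00B4\u02BC"

LETTER = r"[^\W\d_]"  # any Unicode letter

TOKEN_RE = re.compile(
    LETTER + r"+(?:[" + HUDHAA_CHARS + r"]" + LETTER + r"+)*"  # word, Hudhaa kept inside
    r"|\d+(?:[.,]\d+)*"                                        # number such as 99.94
    r"|\S"                                                     # any other single symbol
)


def normalize_hudhaa(text):
    """Rewrite in-word apostrophe-like marks as an acute accent.

    Marks at a word edge are left alone, since there they are
    assumed to be quotation marks rather than Hudhaa.
    """
    pattern = r"(?<=" + LETTER + r")['\u2019\u02BC](?=" + LETTER + r")"
    return re.sub(pattern, "\u00B4", text)


def tokenize(text):
    """Split text into words, numbers, and single punctuation symbols."""
    return TOKEN_RE.findall(text)


if __name__ == "__main__":
    sample = "Inni 'bal'aa' jedhe."
    print(tokenize(sample))
    print(tokenize(normalize_hudhaa(sample)))
```

In this sketch, the quoted word 'bal'aa' is split into an opening quote, the word bal'aa with its Hudhaa intact, and a closing quote. After normalization the in-word mark becomes an acute accent, so it can no longer be confused with the surrounding quotation marks.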


Cite: Abraham Tesso Nedjo, Degen Huang, Xiaoxia Liu, "Challenges of Diacritical Marker or Hudhaa Character in Tokenization of Oromo Text," Journal of Software, vol. 9, no. 7, pp. 1818-1826, 2014.

