A Novel Central Kurdish Part-of-Speech Corpus and Deep Tagging Model Evaluation

Haneen Al-Raghefy; Halgurd S. Maghdid; Akar H. Taher

doi:10.14500/aro.12641

Authors

Haneen Al-Raghefy Department of Software Engineering, Faculty of Engineering, Koya University, Koya, Kurdistan Region – F.R. Iraq https://orcid.org/0009-0007-1867-6614
Halgurd S. Maghdid Department of Engineering Research Center, Deanship of R&D Center, Koya University, Koya, Kurdistan Region – F.R. Iraq https://orcid.org/0000-0003-1109-4009
Akar H. Taher Department of Software Engineering, Faculty of Engineering, Koya University, Koya, Kurdistan Region – F.R. Iraq https://orcid.org/0000-0001-5502-0015

Keywords:

Annotated dataset, Central Kurdish language, Deep learning, Long short-term memory

Abstract

Building effective natural language processing (NLP) tools for low-resource languages such as the Central Kurdish language (CKL) remains difficult, chiefly because annotated text is scarce. Without a large corpus that specifies how words function grammatically, even basic tasks such as part-of-speech (POS) tagging, a building block of many language technologies, are hard to perform. To address this gap, this study presents the first comprehensive POS-tagged corpus for CKL. The dataset consists of 108,680 words manually annotated with a set of 86 tags. Unlike simpler tagging schemes, this tagset accounts for the complexity of Kurdish grammar and allows a single word to carry multiple valid tags, reflecting the language’s natural ambiguity. Using this resource, the study benchmarks a range of deep learning models, including bidirectional long short-term memory (BiLSTM) networks. To handle ambiguity, the paper introduces adaptive tag cycling, a method applied within the BiLSTM that trains the model to consider all possible tags. The best-performing model, an ensemble of neural sequence taggers, achieves 92.3% accuracy with stop-words retained and 89.5% with stop-words removed on broad grammatical categories (main tags). On the full fine-grained tagset (detailed tags), the same model attains 79.0% accuracy with stop-words retained and 76.2% with stop-words removed. The study thus makes two key contributions: (i) a new dataset that supports future Kurdish NLP research, and (ii) a strong performance benchmark for CKL POS tagging.
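
The abstract does not spell out how adaptive tag cycling operates, so the following is only a minimal sketch of one possible reading, not the authors' implementation. It assumes that each token is stored with a list of valid tags, that the training target for an ambiguous token rotates through that list from one epoch to the next, and that a prediction counts as correct when it matches any valid tag. The function names `cycled_targets` and `multi_tag_accuracy` and the toy tag labels are hypothetical and are not taken from the corpus or its 86-tag scheme.

```python
from typing import List, Sequence


def cycled_targets(valid_tags: Sequence[Sequence[str]], epoch: int) -> List[str]:
    """Pick one training target per token for the given epoch.

    Assumption: an ambiguous token exposes a different valid tag each
    epoch (round-robin), so over several epochs the model sees all of them.
    """
    return [tags[epoch % len(tags)] for tags in valid_tags]


def multi_tag_accuracy(
    predicted: Sequence[str], valid_tags: Sequence[Sequence[str]]
) -> float:
    """Fraction of tokens whose predicted tag is among the token's valid tags.

    Assumption: any valid tag counts as a correct prediction.
    """
    if len(predicted) != len(valid_tags):
        raise ValueError("predicted and valid_tags must have the same length")
    if not predicted:
        return 0.0
    hits = sum(pred in tags for pred, tags in zip(predicted, valid_tags))
    return hits / len(predicted)


# Toy sentence of three tokens; the tag labels are placeholders only.
valid = [["NOUN"], ["VERB", "NOUN"], ["ADJ", "ADV", "NOUN"]]

for epoch in range(3):
    print(epoch, cycled_targets(valid, epoch))

print(multi_tag_accuracy(["NOUN", "NOUN", "VERB"], valid))
```

In this sketch an unambiguous token always keeps its single tag, while a token with k valid tags returns to the same target every k epochs. A real BiLSTM training loop would feed these per-epoch targets into its loss function; that part is omitted here because the abstract gives no details about it.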
