A Novel Central Kurdish Part-of-Speech Corpus and Deep Tagging Model Evaluation
DOI:
https://doi.org/10.14500/aro.12641
Keywords:
Annotated dataset, Central Kurdish language, Deep learning, Long short-term memory
Abstract
For many low-resource languages, including the Central Kurdish language (CKL), building effective natural language processing (NLP) tools has been a challenge because annotated text is scarce. Without a large corpus that specifies how words function grammatically, it is difficult to perform basic tasks such as part-of-speech (POS) tagging, which is a building block of many language technologies. To address this issue, this study presents the first comprehensive POS-tagged corpus for CKL. The dataset consists of 108,680 words manually tagged with 86 tags. Unlike simpler tagging schemes, the 86 tags account for the complexity of Kurdish grammar and allow a single word to have multiple valid tags, reflecting the language’s natural ambiguity. Using this resource, the study benchmarks a range of deep learning models, including bidirectional long short-term memory (BiLSTM) networks. To address the ambiguity challenge, the paper introduces a new method, adaptive tag cycling within the BiLSTM, which trains the model to consider all possible tags. The most advanced model in this study, an ensemble of neural sequence taggers, achieves 92.3% accuracy with stop-words retained and 89.5% with stop-words removed on broad grammatical categories (main tags). On the full fine-grained tagset (detailed tags), the same model attains 79.0% accuracy with stop-words retained and 76.2% with stop-words removed. The study thus makes two key contributions: (i) a new dataset that supports future Kurdish NLP research, and (ii) a strong performance benchmark for CKL POS tagging.
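The abstract reports accuracy on a corpus where a word may carry several valid tags, both with stop-words retained and with stop-words removed. The following is a minimal sketch of how such a score could be computed. It assumes that a prediction counts as correct when it matches any of a token's valid tags and that stop-words are simply excluded from scoring; the abstract does not specify the authors' exact protocol. The function name `multi_tag_accuracy`, the placeholder tokens, and the tag labels are hypothetical and are not taken from the corpus.

```python
from typing import Iterable, Sequence, Set


def multi_tag_accuracy(
    tokens: Sequence[str],
    predicted: Sequence[str],
    gold: Sequence[Set[str]],
    stop_words: Iterable[str] = (),
) -> float:
    """Share of scored tokens whose predicted tag is among the valid gold tags.

    Assumption: a token may have several valid tags, and any of them counts
    as a correct prediction. Tokens listed in stop_words are not scored.
    """
    if not (len(tokens) == len(predicted) == len(gold)):
        raise ValueError("tokens, predicted and gold must have the same length")
    skip = set(stop_words)
    scored = 0
    correct = 0
    for token, tag, valid_tags in zip(tokens, predicted, gold):
        if token in skip:
            continue  # stop-word: excluded from the score
        scored += 1
        if tag in valid_tags:
            correct += 1
    return correct / scored if scored else 0.0


# Hypothetical toy data: placeholder tokens and illustrative tag labels.
tokens = ["w1", "w2", "w3", "w4"]
gold = [{"NOUN"}, {"ADJ", "NOUN"}, {"PREP"}, {"VERB"}]
predicted = ["NOUN", "ADJ", "PREP", "NOUN"]

with_stop = multi_tag_accuracy(tokens, predicted, gold)
without_stop = multi_tag_accuracy(tokens, predicted, gold, stop_words={"w3"})
print(with_stop, without_stop)
```

In this toy example the second token is ambiguous, so either of its two tags is accepted. Excluding the stop-word changes the denominator, which is one reason the two reported settings can yield different accuracies for the same model.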
License
Copyright (c) 2026 Haneen Al-Raghefy, Halgurd S. Maghdid, Akar H. Taher

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Authors who choose to publish their work with Aro agree to the following terms:
- Authors retain the copyright to their work and grant the journal the right of first publication. The work is simultaneously licensed under a Creative Commons Attribution License [CC BY-NC-SA 4.0]. This license allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.
- Authors have the freedom to enter into separate agreements for the non-exclusive distribution of the journal's published version of the work. This includes options such as posting it to an institutional repository or publishing it in a book, as long as proper acknowledgement is given to its initial publication in this journal.
- Authors are encouraged to share and post their work online, including in institutional repositories or on their personal websites, both prior to and during the submission process. This practice can lead to productive exchanges and increase the visibility and citation of the published work.
By agreeing to these terms, authors acknowledge the importance of open access and the benefits it brings to the scholarly community.
Accepted 2026-02-11
Published 2026-06-20

ARO Journal is a scientific, peer-reviewed, periodical, and diamond OAJ that has no APC or ASC.