Kurdish Textbooks Corpus (KTC)

About

Kurdish Textbooks Corpus (KTC) is a domain-specific corpus of K-12 textbooks in Sorani Kurdish. It comprises 31 educational textbooks on various topics, published from 2011 to 2018 by the Ministry of Education of the Kurdistan Region of Iraq. The corpus is normalized and categorized into 12 educational subjects, and contains 693,800 tokens (110,297 types) in total. More statistics are provided in the following table.

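| Module title | Course level            | #Chapters | #Tokens | #Sentences |
|--------------|-------------------------|-----------|---------|------------|
| Economics    | 12                      | 7         | 32,823  | 1,023      |
| Genocide     | 10                      | 8         | 16,243  | 670        |
| Geography    | 10                      | 10        | 27,999  | 884        |
| History      | 10,12                   | 20        | 79,845  | 2,065      |
| Human Rights | 10                      | 5         | 11,527  | 340        |
| Kurdish      | 7,8,9,10,12             | 86        | 153,334 | 6,348      |
| Kurdology    | 10,11                   | 6         | 34,282  | 931        |
| Philosophy   | 11                      | 6         | 21,953  | 549        |
| Physics      | 1,2,3,4                 | 30        | 111,032 | 4,022      |
| Theology     | 1,4,5,6,7,8,9,10,11,12  | 191       | 115,349 | 3,661      |
| Sociology    | 8,9                     | 42        | 68,044  | 2,082      |
| Social Study | 10                      | 6         | 21,369  | 578        |
| Total        | 31 (textbooks)          | 417       | 693,800 | 23,153     |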
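
The token and type figures reported for the corpus distinguish running words (tokens) from distinct word forms (types). As a rough illustration, the sketch below counts both using plain whitespace splitting. This tokenization is an assumption and may not match how the published figures were computed; `token_type_counts` is a hypothetical helper name, and the sample string is a placeholder rather than corpus text.

```python
def token_type_counts(text: str) -> tuple[int, int]:
    """Return (tokens, types) for a text, splitting on whitespace.

    Assumption: whitespace tokenization. The corpus statistics may have
    been produced with a different tokenizer.
    """
    tokens = text.split()          # running words
    types = set(tokens)            # distinct word forms
    return len(tokens), len(types)


# Placeholder sample, not taken from the corpus.
sample = "the book and the pen"
n_tokens, n_types = token_type_counts(sample)
print(n_tokens, n_types)
```

Applied to the full corpus text, the same two numbers correspond to the token and type totals quoted in the description, up to differences in tokenization.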

Given the accuracy of the text from scientific, grammatical, and orthographic points of view, we believe that KTC is also a fine-grained resource. The corpus will contribute to various NLP tasks in Kurdish, particularly language modeling and grammatical error correction.

Get KTC

Our resource is publicly available for non-commercial use under the CC BY-NC-SA 4.0 license. It can be downloaded at https://github.com/KurdishBLARK/KTC
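
Once the repository has been cloned locally, per-folder token counts can be approximated with a short script. The sketch below assumes the corpus is stored as UTF-8 `.txt` files grouped in top-level folders. That layout, the local `KTC` directory name, and the helper `corpus_token_counts` are assumptions for illustration, not documented properties of the repository.

```python
from pathlib import Path


def corpus_token_counts(root: str) -> dict[str, int]:
    """Map each top-level folder under `root` to its whitespace-token count.

    Assumptions: files are UTF-8 plain text with a .txt extension, and the
    first path component below `root` names the subject. Files placed
    directly in `root` are grouped under ".".
    """
    root_path = Path(root)
    counts: dict[str, int] = {}
    for path in sorted(root_path.rglob("*.txt")):
        rel = path.relative_to(root_path)
        subject = rel.parts[0] if len(rel.parts) > 1 else "."
        text = path.read_text(encoding="utf-8")
        counts[subject] = counts.get(subject, 0) + len(text.split())
    return counts
```

For example, `corpus_token_counts("KTC")` would return a dictionary keyed by folder name if the clone lives in a directory called `KTC`. The counts may differ from the published table because the tokenization here is plain whitespace splitting.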

Please cite the following paper if you are using KTC:

  @inproceedings{omarabdulrahman2018KTC,
    title={A Rule-Based Kurdish Text Transliteration System},
    author={Omer Abdulrahman, Roshna and Hassani, Hossein and Ahmadi, Sina},
    booktitle={Proceedings of the third WiNLP workshop},
    year={2019},
    publisher={ACM}
  }