The Kurdish Textbooks Corpus (KTC) is a domain-specific corpus of K-12 textbooks in Sorani. It is composed of 31 educational textbooks on various topics, published from 2011 to 2018 by the Ministry of Education of the Kurdistan Region of Iraq. The corpus is normalized and categorized into 12 educational subjects, and contains 693,800 tokens (110,297 types). More statistics are provided in the following table.
Module title | Course level | #Chapters | #Tokens | #Sentences |
---|---|---|---|---|
Economics | 12 | 7 | 32,823 | 1,023 |
Genocide | 10 | 8 | 16,243 | 670 |
Geography | 10 | 10 | 27,999 | 884 |
History | 10,12 | 20 | 79,845 | 2,065 |
Human Rights | 10 | 5 | 11,527 | 340 |
Kurdish | 7,8,9,10,12 | 86 | 153,334 | 6,348 |
Kurdology | 10,11 | 6 | 34,282 | 931 |
Philosophy | 11 | 6 | 21,953 | 549 |
Physics | 1,2,3,4 | 30 | 111,032 | 4,022 |
Theology | 1,4,5,6,7,8,9,10,11,12 | 191 | 115,349 | 3,661 |
Sociology | 8,9 | 42 | 68,044 | 2,082 |
Social Study | 10 | 6 | 21,369 | 578 |
Total | 31 textbooks | 417 | 693,800 | 23,153 |
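The totals in the last row can be checked against the per-subject rows. The following is a minimal sketch in Python that sums the columns; the figures are copied from the table, while the names `STATS` and `column_totals` are illustrative and not part of the KTC distribution.

```python
# Per-subject figures copied from the table above:
# subject -> (chapters, tokens, sentences)
STATS = {
    "Economics": (7, 32_823, 1_023),
    "Genocide": (8, 16_243, 670),
    "Geography": (10, 27_999, 884),
    "History": (20, 79_845, 2_065),
    "Human Rights": (5, 11_527, 340),
    "Kurdish": (86, 153_334, 6_348),
    "Kurdology": (6, 34_282, 931),
    "Philosophy": (6, 21_953, 549),
    "Physics": (30, 111_032, 4_022),
    "Theology": (191, 115_349, 3_661),
    "Sociology": (42, 68_044, 2_082),
    "Social Study": (6, 21_369, 578),
}


def column_totals(stats):
    """Return (chapters, tokens, sentences) summed over all subjects."""
    chapters = sum(row[0] for row in stats.values())
    tokens = sum(row[1] for row in stats.values())
    sentences = sum(row[2] for row in stats.values())
    return chapters, tokens, sentences


if __name__ == "__main__":
    chapters, tokens, sentences = column_totals(STATS)
    print(chapters, tokens, sentences)  # 417 693800 23153
    # Average sentence length in tokens, derived from the two totals.
    print(round(tokens / sentences, 1))
```

The same dictionary can be reused to compare subjects, for example by ranking them by tokens per sentence.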
Because the text is accurate from scientific, grammatical, and orthographic points of view, we believe that KTC is also a fine-grained resource. The corpus can contribute to various NLP tasks in Kurdish, particularly language modeling and grammatical error correction.
Our resource is publicly available for non-commercial use under the CC BY-NC-SA 4.0 license. It can be downloaded at https://github.com/KurdishBLARK/KTC
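As a rough illustration of how token and type counts like those reported above are obtained, the sketch below counts whitespace-separated tokens and distinct types over a local copy of the corpus. It rests on several assumptions that are not confirmed by this page: that the repository has been cloned into a folder named `KTC`, that the texts are UTF-8 `.txt` files, and that splitting on whitespace is an acceptable tokenization. The function names are hypothetical. Because the tokenizer used for the official statistics is not described here, the numbers produced this way may differ from 693,800 tokens and 110,297 types.

```python
import sys
from collections import Counter
from pathlib import Path


def read_texts(root):
    """Yield the content of every .txt file under root (assumed UTF-8)."""
    for path in sorted(Path(root).rglob("*.txt")):
        yield path.read_text(encoding="utf-8")


def count_tokens_and_types(texts):
    """Count tokens and distinct types using plain whitespace splitting.

    This is an assumed tokenization, not necessarily the one used
    to produce the statistics reported for KTC.
    """
    counts = Counter()
    for text in texts:
        counts.update(text.split())
    n_tokens = sum(counts.values())  # every occurrence
    n_types = len(counts)            # distinct word forms
    return n_tokens, n_types


if __name__ == "__main__":
    # "KTC" is an assumed local folder name for the cloned repository.
    root = sys.argv[1] if len(sys.argv) > 1 else "KTC"
    n_tokens, n_types = count_tokens_and_types(read_texts(root))
    print(f"tokens: {n_tokens}, types: {n_types}")
```

Run it as `python count.py path/to/KTC`, where `count.py` is whatever name the script is saved under.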
Please cite the following paper if you are using KTC: