Kurdish Textbooks Corpus (KTC) is a domain-specific corpus of K-12 textbooks in Sorani. It is composed of 31 educational textbooks on various topics, published from 2011 to 2018 by the Ministry of Education of the Kurdistan Region of Iraq. The corpus is normalized and categorized into 12 educational subjects, and contains 693,800 tokens (110,297 types). More statistics are provided in the following table.
| Module title | Course level | #Chapters | #Tokens | #Sentences |
Given the accuracy of the text from scientific, grammatical and orthographic points of view, we believe that KTC is also a fine-grained resource. The corpus will contribute to various NLP tasks in Kurdish, particularly in language modeling and grammatical error correction.
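The token and type counts reported above distinguish running words (tokens) from distinct word forms (types). The following is a minimal sketch of how such counts could be computed. It assumes plain whitespace tokenization, which is not necessarily the tokenization used to produce the published figures, and the function name `count_tokens_and_types` and the sample lines are hypothetical placeholders rather than part of KTC.

```python
from collections import Counter
from typing import Iterable, Tuple


def count_tokens_and_types(lines: Iterable[str]) -> Tuple[int, int]:
    """Return (tokens, types) for the given lines.

    Assumption: tokens are separated by whitespace. The KTC statistics
    may have been produced with a different tokenizer.
    """
    counts = Counter()
    for line in lines:
        # str.split() with no arguments splits on any run of whitespace.
        counts.update(line.split())
    # Tokens are all occurrences; types are the distinct word forms.
    return sum(counts.values()), len(counts)


# Placeholder sample; a real run would pass the lines of the corpus files.
sample = ["a b a", "b c"]
tokens, types = count_tokens_and_types(sample)
print(f"tokens={tokens} types={types}")
```

To apply this to the corpus, the lines of each text file can be passed in place of `sample`, for example by opening the file with `encoding="utf-8"` and iterating over it.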
Please cite the following paper if you are using KTC: