Is Kurdish a less-resourced language?

TL;DR: Yes, for the time being, but hopefully not for long, if we take serious action.

To find out why a language with 20-30 million speakers is still less-resourced, read on. 🙃


A language is called “less-resourced” when the only resources available for processing it computationally are descriptive grammars and online resources. Researchers have consistently described Kurdish as a less-resourced language. As the following figure indicates, Kurdish is not alone: most languages in the world are still less-resourced. More precisely, among the languages spoken in the Middle East, of which Kurdish is a major one, only a couple can be considered medium-resourced, let alone high-resourced!

A conceptual view of the NLP resource hierarchy (Source).

In this post, I am going to dig deeper into this problem: to what extent is Kurdish less-resourced? And what types of resources are available or missing?

Kurdish lexicography

Lexicography is the study of words and the practice of compiling them into resources called “dictionaries”. Nowadays, much of the effort in this field goes into creating electronic dictionaries, hence the name electronic lexicography (or e-lexicography). Electronic dictionaries form the backbone of fundamental applications and tools in natural language processing, such as spell-checking and named-entity recognition, i.e. finding the names of places, people or organisations.

In a recent study (Ahmadi & Hassani, 2019), we comprehensively surveyed Kurdish lexicographic resources across the major Kurdish dialects (Kurmanji, Sorani, Hawrami, Southern Kurdish and Zaza), scripts (Latin, Persian-Arabic and Cyrillic) and dictionary types (bilingual, monolingual or multilingual). The results were astonishing. Let’s take a look at what our analysis says:

Kurdish Lexicographic Resources: A survey (Source).

Our survey covered all the dictionaries that we could access in printed or electronic form, 71 in total, and suggests that every Kurdish dialect has at least one lexicographic resource. More than 70% of the Kurdish dictionaries are in Sorani or Kurmanji. Only a handful of the resources are electronic: FreeDict, which provides Kurmanji-Turkish/German/English dictionaries as well as a Sorani-Kurmanji dictionary, Wiktionary and Dictio, to mention a few. Moreover, only a couple of those resources are in structured formats like XML, and none of them is in RDF. This makes them harder to interoperate with and to integrate into other resources and applications.
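To make the interoperability point more concrete, here is a minimal Python sketch of how a program can read a dictionary entry once it is stored in a structured format such as XML. The entry below is hypothetical and simplified: the element names are only loosely modelled on TEI-style markup, and the sample word is my own illustration rather than an excerpt from any of the surveyed dictionaries.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified entry. The element names are assumptions made
# for illustration and are not copied from any surveyed dictionary.
SAMPLE_ENTRY = """
<entry>
  <form><orth>av</orth></form>
  <gramGrp><pos>noun</pos></gramGrp>
  <sense><cit type="trans" lang="en"><quote>water</quote></cit></sense>
</entry>
"""


def parse_entry(xml_text):
    """Extract the headword, part of speech and translations of one entry."""
    root = ET.fromstring(xml_text)
    return {
        "headword": root.findtext("form/orth"),
        "pos": root.findtext("gramGrp/pos"),
        "translations": [q.text for q in root.iterfind("sense/cit/quote")],
    }


entry = parse_entry(SAMPLE_ENTRY)
print(entry)
```

Because every piece of information sits in a named element, another tool can reuse the entry without guessing where the headword ends and the translation begins, which is exactly what a scanned or free-text dictionary cannot offer.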

Domain-specific and general-purpose corpora

Terminological resources, also known as terminologies, and glossaries are other types of resources that provide information about a specific domain. Various such resources exist for Kurdish, such as glossaries of animal names or dictionaries of cities and villages. One outstanding resource, and one of the few available online, is Kurmancî, the biannual linguistic magazine published by the Kurdish Institute of Paris since 1987. You can query the database of this terminological resource here.

General literature

Vejîn Books is an outstanding initiative for collecting historical and literary texts in Kurdish. The project is a Wikiproject to which many collaborators actively contribute in order to create electronic versions of classical Kurdish works, such as poems.

The Kurdish Digital Library of the Kurdish Institute of Paris also contains over 10,000 monographs about the Kurds in 25 languages. However, not all of them are available in electronic form and accessible to the public.

Kurdish media

Kurdish media has played a pivotal role over the past decades in creating a universal Kurdish identity. Happily, many Kurdish news agencies publish content in Kurdish on the Web on a daily basis. Such content not only enriches the language in terms of resources, but also facilitates the computational processing of Kurdish. For instance, creating a machine translation system requires a huge number of parallel sentences, i.e. sentences paired with their translations. A parallel corpus, a corpus containing such parallel examples in two or more languages, is then used to create models for machine translation. Some of the main Kurdish broadcast news stations are the following:

  • Kurdistan24: Turkish, English, Kurmanji (Latin script), Sorani (Arabic script), Arabic, Persian
  • NRT: Kurmanji (Latin and Arabic scripts), Sorani (Arabic script), English, Arabic
  • Rûdaw: Turkish, English, Kurmanji (Latin script), Sorani (Arabic script), Arabic

I spent some time playing with the links of the news articles to find out whether the same articles are available in different languages; this matters because we could build parallel corpora from the news content. It seems that the same articles are not available in all the languages. However, changing the language ID in the URL of an article to that of another language leads, in some cases, to articles on closely related topics. For instance, the following is the content of this news item in the available languages on Rûdaw:

  • English: Resolution of ongoing protests in central and southern Iraq cannot be achieved through a security solution, Iraq’s three presidencies and its judiciary chief asserted in a joint meeting on Sunday - despite security forces continuing their crackdown to bring an end to the protests.
  • Arabic: أكد صالح محمد العراقي، المقرب من زعيم التيار الصدري، مقتدى الصدر، أن الأخير لم يتفق مع أي جهة بشأن إبقاء الحكومة الحالية برئاسة عادل عبدالمهدي، مضيفاً: "الصدر مع الشعب وكفاكم سماعاً للأكاذيب والإشاعات".
  • Sorani: سەرچاوەیەك لە نووسینگەی عەلی سیستانی، مەرجەعی باڵای شیعەی عێراق لە نەجەف رەتیكردەوە مەرجەعیەت لایەنێك بێت لە رێككەوتنی نێوان سەدر، حەكیم و قاسم سولەیمانی لەبارەی كۆتاییهێنان بە خۆپیشاندانەكان و هێشتنەوەی حكومەتی ئێستای عێراق.
  • Kurmanji: Fraksyona Nehcî ya Wetenî li Parlamentoya Iraqê ragihand ku divê di dema 30 rojan de parlamento û hikûmet bên hilweşandin û yasayeke nû bo hilbijartinê were derxistin.
  • Turkish: İnsan Hakları İzleme Örgütü (HRW), Iraklı güvenlik güçlerini ülkede devam eden gözstericilere karşı aşırı şiddet kullanmakla eleştirdi. Örgüt, sadece 25 Ekim’nden bu yana devam eden gösterilerde 16 kişinin gaz bombası kapsülleri ile öldürüldüğünü açıkladı.
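The URL experiment described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: I assume that the language ID is the first segment of the URL path, and both the list of language IDs and the example host are hypothetical placeholders rather than the actual structure of any news website.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumption: the language ID is the first path segment, e.g. /english/...
# These IDs are illustrative placeholders, not a verified list.
LANGUAGE_IDS = ["english", "arabic", "sorani", "kurmanci", "turkish"]


def language_variants(url, language_ids=LANGUAGE_IDS):
    """Return candidate URLs of the same article in each language."""
    parts = urlsplit(url)
    segments = parts.path.split("/")  # the leading "/" yields an empty first item
    if len(segments) < 2 or segments[1] not in language_ids:
        raise ValueError("no known language ID found in the URL path")
    variants = {}
    for lang in language_ids:
        new_path = "/".join([segments[0], lang] + segments[2:])
        variants[lang] = urlunsplit(parts._replace(path=new_path))
    return variants


# Hypothetical article URL on a placeholder host.
variants = language_variants("https://news.example.org/english/middleeast/iraq/12345")
for lang, candidate in variants.items():
    print(lang, candidate)
```

The function only generates candidate links. As the example from Rûdaw shows, the article behind a swapped URL may cover a related topic without being a translation, so each candidate pair would still have to be fetched and compared before it goes into a parallel corpus.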

Community efforts

Wikipedia

Wikipedia is a free encyclopaedia created by volunteers. According to the list of Wikipedias, the three Kurdish dialects available, Kurmanji, Sorani and Zazaki, each have over 10,000 articles on Wikipedia. There are a few other websites with wiki functionalities, like https://www.kurdipedia.org, but even if we take them into account, Kurdish remains a considerable margin away from being considered a medium-resourced language (having 100k documents). The following table summarises the current status of the Kurdish dialects on Wikipedia (November 2019), with Persian, Catalan and Serbian included for comparison:

| Language | Wiki | Articles | Total pages | Edits | Admins | Users | Active Users | Files | Depth |
| Kurmanji Kurdish | ku | 26334 | 64874 | 739494 | 4 | 39013 | 57 | 580 | 24 |
| Sorani Kurdish | ckb | 24925 | 130974 | 645920 | 7 | 37344 | 142 | 955 | 89 |
| Zazaki | diq | 12184 | 31055 | 399744 | 6 | 18903 | 67 | 215 | 31 |
| Persian | fa | 700361 | 4479817 | 27557535 | 31 | 869033 | 5142 | 59532 | 179 |
| Catalan | ca | 628692 | 1547511 | 22264282 | 21 | 338436 | 1597 | 11863 | 31 |
| Serbian | sr | 626216 | 3838890 | 22263049 | 18 | 252892 | 771 | 32517 | 153 |
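The gap to the medium-resourced mark can be worked out directly from the table. The short Python sketch below adds up the three Kurdish Wikipedias, reading the column after “Articles” as the total number of pages (my interpretation of the header) and treating the 100k figure mentioned above as a threshold on articles, which is also an assumption.

```python
# (articles, total pages) per wiki, copied from the table above (November 2019).
KURDISH_WIKIS = {
    "ku": (26334, 64874),
    "ckb": (24925, 130974),
    "diq": (12184, 31055),
}

# The "100k documents" figure for medium-resourced languages cited above;
# applying it to article counts is an assumption of this sketch.
MEDIUM_RESOURCED_THRESHOLD = 100_000

total_articles = sum(articles for articles, _ in KURDISH_WIKIS.values())
total_pages = sum(pages for _, pages in KURDISH_WIKIS.values())
shortfall = MEDIUM_RESOURCED_THRESHOLD - total_articles

print(f"Kurdish articles: {total_articles:,}")      # 63,443
print(f"Kurdish total pages: {total_pages:,}")      # 226,903
print(f"Articles missing to reach 100k: {shortfall:,}")  # 36,557
```

Even with the three dialects combined, the article count stays well below 100k, while Persian, Catalan and Serbian each exceed 600k articles on their own.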

Wiktionary

Like Wikipedia, Wiktionary is a free platform where volunteers provide lexicographic data. As of November 2019, according to the Wiktionary statistics, there are fewer than 1,300 lemmata available for Kurdish on Wiktionary!

| Dialect | #Gloss definitions | #Entries | #Gloss entries | #Form definitions | Total definitions |
| Northern Kurdish (Kurmanji) | 2251 | 5340 | 2034 | 9188 | 11439 |
| Central Kurdish (Sorani) | 339 | 290 | 279 | 11 | 350 |
| Southern Kurdish | 55 | 46 | 46 | 0 | 55 |
| Zazaki | 842 | 697 | 691 | 5 | 847 |
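A quick way to read this table is to check how its columns relate and how unevenly the dialects are covered. The Python sketch below assumes that “Total definitions” is the sum of gloss definitions and form definitions, which is my reading of the headers; the check in the loop confirms that this holds for every row.

```python
# dialect -> (gloss definitions, entries, gloss entries, form definitions,
#             total definitions), copied from the table above.
WIKTIONARY = {
    "Northern Kurdish (Kurmanji)": (2251, 5340, 2034, 9188, 11439),
    "Central Kurdish (Sorani)": (339, 290, 279, 11, 350),
    "Southern Kurdish": (55, 46, 46, 0, 55),
    "Zazaki": (842, 697, 691, 5, 847),
}

# Assumed relation between the columns: total = gloss + form definitions.
for dialect, (gloss_defs, _, _, form_defs, total_defs) in WIKTIONARY.items():
    assert gloss_defs + form_defs == total_defs, dialect

all_definitions = sum(row[4] for row in WIKTIONARY.values())
kurmanji_share = WIKTIONARY["Northern Kurdish (Kurmanji)"][4] / all_definitions

print(f"Total definitions across dialects: {all_definitions:,}")
print(f"Share contributed by Kurmanji: {kurmanji_share:.1%}")
```

Most of the Kurmanji total comes from form definitions rather than glosses, so the number of entries that actually explain a word is much smaller than the headline total suggests.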

I am not particularly happy that the three Kurdish Wikipedias together have only about 226k pages (roughly 63k of them articles), in comparison to over 4 million pages in Persian, 1.5 million in Catalan and 3.8 million in Serbian!

Conclusion

Now that we know about some of the major resources of the Kurdish language, I would like to draw a few conclusions:

  1. Kurdish is less-resourced because of a lack of electronic resources. Just take a look at the dictionaries. Many Kurdish dictionaries are only available in hard copy, although they could be used on the Web if they were available electronically. Which of those many Russian-Kurdish dictionaries have you ever queried online?

  2. Kurdish will remain less-resourced unless appropriate actions are taken to
    • retro-digitize printed resources
    • contribute to the current platforms such as Wikipedia and Wiktionary
    • raise awareness among Kurdish speakers to use their own language to create content and write on the Web
    • encourage Kurdish writers and scholars to provide their works free and open to the public
  3. Kurdish language processing: in a previous post, I discussed why Kurdish language processing matters. One of the main reasons is precisely to create new resources, manipulate the current ones and process them automatically.

Last updated on 16 November 2019.