TL;DR: Yes, for the time being, but hopefully no, if we take serious action.
To learn more about why a language with 20-30 million speakers is still less-resourced, read the post. 🙃
A language is called “less-resourced” when the only resources available for processing it computationally are descriptive grammars and online resources. Researchers have always highlighted that Kurdish is a less-resourced language. As the following figure indicates, Kurdish is not the only one: most languages in the world are still less-resourced. More precisely, among the languages spoken in the Middle East, of which Kurdish is a major one, only a couple can be considered medium-resourced, let alone high-resourced!
In this post, I am going to dig deeper into this problem: to what extent is Kurdish less-resourced? And what types of resources are available or missing?
Lexicography is the study of words and the practice of compiling them into resources called “dictionaries”. Nowadays, much of the effort in this field goes into creating electronic dictionaries, hence the name electronic lexicography (or e-lexicography). Electronic dictionaries form the backbone of fundamental applications and tools in natural language processing, such as spell-checking and named-entity recognition, i.e. finding the names of places, people or organisations.
In recent work (Ahmadi & Hassani, 2019), we carried out a comprehensive study of Kurdish lexicographic resources, covering the major Kurdish dialects (Kurmanji, Sorani, Hawrami, Southern Kurdish and Zaza), scripts (Latin, Persian-Arabic and Cyrillic) and dictionary types (bilingual, monolingual or multilingual). The results were astonishing. Let’s take a look at what our analysis says:
Our survey, which included all the dictionaries that we could access in printed or electronic form, i.e. 71 dictionaries, suggests that every Kurdish dialect has at least one lexicographic resource. Proportionally, more than 70% of the Kurdish dictionaries are in Sorani or Kurmanji. Only a handful of the resources are electronic: FreeDict, which provides Kurmanji-Turkish/German/English dictionaries as well as a Sorani-Kurmanji dictionary, Wiktionary and Dictio, to mention a few. Moreover, only a couple of those resources are in structured formats like XML, and none of them is in RDF. This makes them harder to interoperate with and to integrate into other resources and applications.
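To give a rough idea of what a “structured format” means here, the following is a minimal sketch of a single dictionary entry serialised as XML with Python's standard library. The element and attribute names (`entry`, `headword`, `translation`, `dialect`, `script`, `lang`) are made up for illustration and do not follow any particular standard or any of the dictionaries in the survey; the example word is Kurmanji “av” (water).

```python
import xml.etree.ElementTree as ET


def entry_to_xml(headword, dialect, script, translations):
    """Build an XML element for one dictionary entry.

    The schema is hypothetical: a real project would more likely
    follow an established lexicographic standard.
    """
    entry = ET.Element("entry", {"dialect": dialect, "script": script})
    ET.SubElement(entry, "headword").text = headword
    # One <translation> element per target language.
    for lang, gloss in translations.items():
        ET.SubElement(entry, "translation", {"lang": lang}).text = gloss
    return entry


entry = entry_to_xml("av", "kurmanji", "latin", {"en": "water"})
xml_text = ET.tostring(entry, encoding="unicode")
print(xml_text)

# Because the entry is structured, another program can read it back
# without guessing where the headword ends and the gloss begins.
parsed = ET.fromstring(xml_text)
print(parsed.find("headword").text, "->", parsed.find("translation").text)
```

Once entries are stored this way, a spell-checker or another dictionary project can reuse the data without parsing free text, which is the kind of interoperability that a scanned or plain-text dictionary cannot offer.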
Terminological resources (aka terminologies) and glossaries are other types of resources, which provide information about a specific domain. Various such resources exist for Kurdish, such as a glossary of animal names or a dictionary of cities and villages. One outstanding resource, which is also one of the few available online, is Kurmancî, the biannual linguistic magazine published by the Kurdish Institute of Paris since 1987. You can query the database of this terminological resource here.
Vejîn Books is an outstanding initiative for collecting historical and literary texts in Kurdish. The project is a Wikiproject to which many collaborators actively contribute, creating electronic versions of classical Kurdish works, such as poems.
The Kurdish Digital Library of the Kurdish Institute of Paris also contains over 10,000 monographs about the Kurds in 25 languages. However, not all of them are available in electronic version and accessible to the public.
The Kurdish media has played a crucial role during the past decades in creating a universal Kurdish identity. Happily, many Kurdish news agencies publish content in Kurdish on the Web on a daily basis. Such content not only enriches the language in terms of resources, but also facilitates the computational processing of the Kurdish language. For instance, creating a machine translation system requires a huge number of parallel sentences, i.e. sentences paired with their translations. A parallel corpus, a corpus containing such parallel examples in two or more languages, is then used to train models for machine translation. Some of the main Kurdish broadcast news stations are the following:
I spent some time playing with the links of the news articles to find out whether the same articles are available in different languages; this matters because we could build parallel corpora from the news content. It seems that the same articles are not available in all the languages. However, changing the language ID in the URL of an article to that of another language yields articles on closely related topics in some cases. For instance, the following is the content of this news article in the languages available on Rûdaw:
|Rûdaw||Resolution of ongoing protests in central and southern Iraq cannot be achieved through a security solution, Iraq’s three presidencies and its judiciary chief asserted in a joint meeting on Sunday - despite security forces continuing their crackdown to bring an end to the protests.||أكد صالح محمد العراقي، المقرب من زعيم التيار الصدري، مقتدى الصدر، أن الأخير لم يتفق مع أي جهة بشأن إبقاء الحكومة الحالية برئاسة عادل عبدالمهدي، مضيفاً: "الصدر مع الشعب وكفاكم سماعاً للأكاذيب والإشاعات".||سەرچاوەیەك لە نووسینگەی عەلی سیستانی، مەرجەعی باڵای شیعەی عێراق لە نەجەف رەتیكردەوە مەرجەعیەت لایەنێك بێت لە رێككەوتنی نێوان سەدر، حەكیم و قاسم سولەیمانی لەبارەی كۆتاییهێنان بە خۆپیشاندانەكان و هێشتنەوەی حكومەتی ئێستای عێراق.||Fraksyona Nehcî ya Wetenî li Parlamentoya Iraqê ragihand ku divê di dema 30 rojan de parlamento û hikûmet bên hilweşandin û yasayeke nû bo hilbijartinê were derxistin.||İnsan Hakları İzleme Örgütü (HRW), Iraklı güvenlik güçlerini ülkede devam eden gözstericilere karşı aşırı şiddet kullanmakla eleştirdi. Örgüt, sadece 25 Ekim’nden bu yana devam eden gösterilerde 16 kişinin gaz bombası kapsülleri ile öldürüldüğünü açıkladı.|
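The URL experiment described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the language IDs in `LANGUAGE_IDS`, the `news.example.org` address and the idea that the language appears as one path segment are all hypothetical, so the actual URL layout of any given news site has to be checked by hand.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical language IDs; a real site may use different path segments.
LANGUAGE_IDS = ("english", "arabic", "sorani", "kurmanci", "turkish")


def language_variants(url, language_ids=LANGUAGE_IDS):
    """Return candidate URLs of the same article in the other languages.

    Assumes the language ID is a single segment of the URL path.
    Returns an empty dict if no known language ID is found.
    """
    parts = urlsplit(url)
    segments = parts.path.split("/")
    for index, segment in enumerate(segments):
        if segment in language_ids:
            break
    else:
        return {}

    variants = {}
    for lang in language_ids:
        if lang == segments[index]:
            continue  # skip the language of the original URL
        swapped = segments[:index] + [lang] + segments[index + 1:]
        variants[lang] = urlunsplit(parts._replace(path="/".join(swapped)))
    return variants


candidates = language_variants(
    "https://news.example.org/english/middleeast/iraq/10112019"
)
for lang, candidate in candidates.items():
    print(lang, candidate)
```

As the Rûdaw example shows, a candidate URL that resolves does not guarantee a translation of the same article, so the pages retrieved this way would still need to be compared and aligned before they could go into a parallel corpus.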
Wikipedia is a free encyclopaedia created by volunteers. According to the list of Wikipedias, the three Kurdish dialects available, Kurmanji, Sorani and Zazaki, have over 10,000 articles on Wikipedia. There are a few other websites with wiki functionality, like https://www.kurdipedia.org, but even if we take them into account, Kurdish is still a considerable margin away from being considered a medium-resourced language (having 100k documents). The following table summarises the current status of the Kurdish dialects on Wikipedia (November 2019):
Like Wikipedia, Wiktionary is a free platform where volunteers contribute lexicographic data. As of November 2019, according to the Wiktionary statistics, there are fewer than 1,300 lemmata available for Kurdish on Wiktionary!
|Dialect||#Gloss definitions||#Entries||#Gloss entries||#Form definitions||Total definitions|
|Northern Kurdish (Kurmanji)||2251||5340||2034||9188||11439|
|Central Kurdish (Sorani)||339||290||279||11||350|
I am not particularly happy that there are only 226k articles in Kurdish available on Wikipedia, in comparison to over 4 million in Persian, 1.5 million in Catalan and 3.8 million in Serbian!
Now that we know about some of the major resources of the Kurdish language, I would like to draw a few conclusions:
Kurdish is less-resourced because of a lack of electronic resources. Just take a look at the dictionaries. Many Kurdish dictionaries are available only in hard copy, although they could be used on the Web if they were available electronically. Which of those many Russian-Kurdish dictionaries have you ever queried online?
Last updated on 16 November 2019.