Why does Kurdish language processing matter?

⚡️ October 2020 Update: Check out the Kurdish Language Processing Toolkit


Since the first time I touched my home computer keyboard in 2001, I have asked myself whether it would be possible to make a computer understand my mother tongue, Kurdish. Well, what I was imagining back then was limited to a Kurdish interface for Windows, particularly when I was going through the installation instructions of Return to Castle Wolfenstein and Mafia: The City of Lost Heaven! Not to mention the spontaneous questions that Clippy used to ask, and how curious I was to see the same messages in Kurdish.

Kurdish Clippy!
Me, when I was 9: “What if Clippy could ask ‘çon î’ (how are you) in Kurdish?”

In the 2000s, there were a few application-based projects for introducing Kurdish to the computer world. For instance, a few Kurdish fonts were made based on the Persian and Arabic keyboards, and the interfaces of some software were translated into Kurdish. Since then, a lot has changed in the Information Technology (IT) world. Computers have become smaller and more efficient, connections have become faster than ever and, subsequently, the world has become a smaller place. Still, the overall availability of electronic resources and processing tools for Kurdish is far from satisfactory. Kurdish is not just a language with a history that reflects the culture of its speakers. Indeed, it is the only element that has kept Kurds as a nation.

In this post, I would like to address the importance of language technology for the Kurdish language. Nowadays, with a plethora of (unstructured) online Kurdish resources and the recent progress in natural language processing and machine learning, I think that it is more feasible than ever to provide the tools and resources that will finally make Kurdish understandable by machines. I will also explain why I believe that processing Kurdish is no longer as difficult as it was in the 2000s, and why there is currently a huge potential for Kurdish language processing.

What is Natural Language Processing?

Formally speaking, Natural Language Processing (NLP) refers to the computer processing of natural language: the language that we speak. NLP is part of the field of study called language technology. If you use a spell checker on your cell phone or your computer, or you ask Siri or Google Assistant about the nearest restaurant, then you have already been using some of the most common applications of NLP. Language technology is so present in today's life that most of us use it without even knowing!

Language technology is a fundamental part of the Web. At our current pace (in 2019), we are producing 2.5 quintillion bytes of data each day, in other words 2.5 × 10^18 bytes (source), mostly unstructured data such as text, speech and so on. NLP technologies enable us to extract information from this data and process it for different applications.

NLP total revenue by segments. Recreated based on the data from Tractica.

So, if you have a small or big business and would like to know about the public opinion regarding your services, or you are a politician and would like to know what is happening around you, or you are doing anything for which human communication and interaction are important, then you will definitely need language technology.

NLP applications: a few examples for Kurdish

Some of the applications of NLP are:

  • Transliteration: Kurdish is written in two scripts. The Latin-based script is used for writing in the Kurmanji dialect, mostly spoken in Bakûr (“north” in Kurdish, the Kurdish regions in Turkey) and parts of Rojava (“west” in Kurdish, the Kurdish regions in Syria), while the Arabic-based script is used in Rojhełat (“east” in Kurdish, the Kurdish regions of Iran) and Basûr (“south” in Kurdish, the Kurdish regions of Iraq). A transliterator enables us to automatically convert one script to the other. As an example, “birayetî” (brotherhood in Kurdish) in the Latin-based script is transliterated as “برایەتی” in the Arabic-based script.

  • Morphological analysis: Kurdish is an inflectional language. That is, a word can turn into a new word when morphemes, such as suffixes and prefixes, are added to it. For instance, “jinekan” (the women) is composed of “jin” (woman) and “ekan” (definite article added as a suffix to the noun). Detecting each morpheme is the task of a morphological analyser. This is challenging, especially because of the huge number of derivational and inflectional morphemes in Kurdish. As another example, the Sorani word “dîtimin”, which means “I saw them”, is composed of “dîtim” (“I saw”) and “in”, the object added as an accusative suffix.

  • Part-of-speech tagging: How do we know the grammatical role of each part of this sentence: “Min gułekanim bon kird” (I smelled the flowers)? Well, that is what a part-of-speech tagger (POS tagger) does. In this case, a Kurdish POS tagger will tell you that “min” is a pronoun, “gułekanim” is a definite noun and “bon kird” is the first person singular of the verb “bon kirdin” (to smell) in the past tense.

  • Machine translation: This is perhaps the most popular application of NLP. What if all dialects of Kurdish could be translated into other languages automatically? What is the equivalent of the Hawrami word “jerej” in Kurmanji and Sorani?[1] Machine translation takes care of the automatic translation using a parallel corpus. A corpus is a collection of texts gathered for a specific purpose. In a parallel corpus, the texts in two languages are aligned with each other. For instance, in a parallel corpus of Sorani Kurdish and English, the sentence “The distance between Sanandaj and Sulaymaniyah is 253 kilometers” is aligned with “mewday nêwan Sine û Silêmanî 253 kîlomîtr e”. Now imagine that in our corpus, which contains thousands of parallel sentences, there are 100 sentences in which those words are used. That is where machine translation, based on statistical methods and, more recently, neural network models, predicts that “Sine” means “Sanandaj” and “Silêmanî” means “Sulaymaniyah”, and can even translate sentences and more sophisticated texts.

  • Speech recognition: Just try “OK Google” on your Android phone, or “Hey, Siri” to activate your speech recogniser. Then talk as you would with a real person and let the machine deal with it! That is what a speech recogniser does.
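
To make the morphological analysis example above a bit more concrete, here is a minimal Python sketch of naive suffix stripping. It is only an illustration under strong assumptions, not how a real Kurdish analyser works: the suffix list contains just the two suffixes mentioned above (“ekan” and “in”), and the function name split_suffix is hypothetical.

```python
# Toy suffix list: only the two suffixes mentioned in the text (an assumption,
# real Kurdish morphology has far more derivational and inflectional morphemes).
SUFFIXES = ("ekan", "in")


def split_suffix(word, suffixes=SUFFIXES):
    """Split a word into (stem, suffix) by stripping the longest matching suffix.

    Returns (word, "") when no suffix in the list matches.
    """
    # Try longer suffixes first so that "ekan" wins over any shorter candidate.
    for suffix in sorted(suffixes, key=len, reverse=True):
        # Require a non-empty stem to remain after stripping.
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)], suffix
    return word, ""


for word in ("jinekan", "dîtimin", "jin"):
    print(word, "->", split_suffix(word))
```

The first two words are split as described above, but the same rule also cuts “jin” into “j” + “in”. This kind of over-splitting hints at why a real analyser needs a lexicon and grammatical rules rather than a plain list of endings.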

To discover more about NLP, Jurafsky and Martin’s handbook on speech and language processing and Natural Language Processing and its Applications are two useful resources.

Challenges of Kurdish language processing

Kurdish is a less-resourced language. A less-resourced language is a language for which there are not enough language resources for it to be fully processed. In the following, I mention some of the main characteristics of the Kurdish language which may explain the challenges in Kurdish language processing and why they have not been efficiently addressed yet.

Diversity in dialects

Having various dialects and sub-dialects, Kurdish is known as a dialect-rich language and is sometimes referred to as a dialect continuum. This richness is interesting when you observe that what goes by one name in a village goes by another in the neighbouring one. As a personal observation, in Kêle Çermig, a village near Sanandaj in Eastern Kurdistan, people use the word amêjeng for yeast, while in Syaseran it is called amyan. Such differences are not limited to vocabulary, but also extend to phonology and phonetics. Almost all the dialects and sub-dialects of Kurdish have something distinct in terms of pronunciation.

Dialectal differences can even be observed between two neighbouring villages.

Having said that, the variety of dialects creates a gap between the speakers of the same language and, to some extent, a kind of barrier in communication. Some believe that such diversity should be addressed by defining a standard language. However, defining a standard language for Kurdish has been a matter of debate, without reaching a consensus.

Diversity in scripts

Due to historical and geographical reasons, several scripts are used when it comes to Kurdish writing. Cyrillic, Arabo-Persian, Latin and even Armenian alphabets have been used to write Kurdish texts. In recent years, the Kurdish Academy of Language has tried to unify those alphabets and present a unified alphabet for Kurdish called Yekgirtú. However, Yekgirtú does not seem to be popular among either scholars or the public; among all alphabets, the Arabo-Persian and the Latin alphabets are the most widely used, by Sorani and Kurmanji speakers respectively. The following table compares those alphabets.

Kurdish phonemes (IPA) Latin-based Yekgirtú Arabic-based
[a:] A a A a ا
[b] B b B b ب
[t͡ʃ] Ç ç C c چ
[d͡ʒ] C c J j ج
[d] D d D d د
[æ] E e E e ه
[eː] Ê ê É é ێ
[f] F f F f ف
[g] G g G g گ
[h] H h H h ھ
[I] I i I i
[i:] Î î Í í ى
[ʒ] J j Jh jh ژ
[k] K k K k ک
[l] L l L l ل
[ɬ] Ł ł Ll ll ڵ
[m] M m M m م
[n] N n N n ن
[oː] O o O o ۆ
[p] P p P p پ
[q] Q q Q q ق
[ɾ] R r R r ر
[r] Ř ř Rr rr ڕ
[s] S s S s س
[ʃ] Ş ş Sh sh ش
[t] T t T t ت
[ʊ] U u U u و
[u:] Û û Ú ú وو
[v] V v V v ڤ
[w] W w W w و
[x] X x X x خ
[j] Y y Y y ى
[z] Z z Z z ز
[ħ] Ḧ ḧ H', h' ح
[ʕ] Ë ë ع
[ɣ] Ẍ ẍ X', x' غ
[ʉ:] Ù ù ۊ
[γ] ڎ
[ʁ] Ğ ğ
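
The Latin-based and Arabic-based columns of the table can be read as a character mapping, which is the core of a very simple transliterator. The following Python sketch is a simplified assumption, not a complete system: the names LATIN_TO_ARABIC and transliterate are hypothetical, only a subset of the letters is covered, the short vowel “i” is simply dropped (the table lists no Arabic-based letter for it), and spelling conventions such as how word-initial vowels are written are ignored.

```python
import unicodedata

# Partial Latin -> Arabic-based mapping, written as Unicode escapes so that the
# exact code points are unambiguous. Illustrative subset only (an assumption).
LATIN_TO_ARABIC = {
    "a": "\u0627",  # ا
    "b": "\u0628",  # ب
    "ç": "\u0686",  # چ
    "c": "\u062c",  # ج
    "d": "\u062f",  # د
    "e": "\u06d5",  # ە
    "ê": "\u06ce",  # ێ
    "f": "\u0641",  # ف
    "g": "\u06af",  # گ
    "h": "\u06be",  # ھ
    "i": "",        # short vowel, not written in the Arabic-based script
    "î": "\u06cc",  # ی
    "j": "\u0698",  # ژ
    "k": "\u06a9",  # ک
    "l": "\u0644",  # ل
    "ł": "\u06b5",  # ڵ
    "m": "\u0645",  # م
    "n": "\u0646",  # ن
    "o": "\u06c6",  # ۆ
    "p": "\u067e",  # پ
    "q": "\u0642",  # ق
    "r": "\u0631",  # ر
    "ř": "\u0695",  # ڕ
    "s": "\u0633",  # س
    "ş": "\u0634",  # ش
    "t": "\u062a",  # ت
    "u": "\u0648",  # و
    "û": "\u0648\u0648",  # وو
    "v": "\u06a4",  # ڤ
    "w": "\u0648",  # و
    "x": "\u062e",  # خ
    "y": "\u06cc",  # ی
    "z": "\u0632",  # ز
}


def transliterate(word: str) -> str:
    """Convert a Latin-script Kurdish word letter by letter (naive sketch)."""
    # NFC makes sure that letters such as "î" are single precomposed characters.
    word = unicodedata.normalize("NFC", word.lower())
    # Characters without a mapping (digits, punctuation) are kept unchanged.
    return "".join(LATIN_TO_ARABIC.get(ch, ch) for ch in word)


print(transliterate("birayetî"))
```

Under these assumptions, the call above should produce the same seven letters as the Arabic-based form of “birayetî” given earlier. Going in the other direction is harder: as the table shows, the same Arabic-based letter stands for both “u” and “w”, and another one for both “î” and “y”, so a reverse transliterator has to use the context to decide.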

Lack of standards

As discussed earlier, the diversity in dialects and scripts turns the richness of the language into a challenge. Such challenges are usually addressed by defining certain standards, which do not exist for Kurdish yet. Defining a standard Kurdish language or Kurdish alphabet and deploying those standards require governmental action and should be supported by scholars and the Kurdish public.

Lack of resources

Electronic resources provide textual information about a language and are essential for text mining in particular, and NLP in general. This information can be collected and used as a text corpus. Fortunately, there are currently websites which are active in creating content in Kurdish, notably news agencies. Unfortunately, we still need more resources, especially expert-made ones such as lexicons and parallel corpora.

Lack of investment

Honestly, I believe that lack of investment should have been listed as the first reason for Kurdish lagging behind in NLP.

A project needs to be funded to make progress, which is unfortunately not the case for Kurdish-related projects. Due to the political constraints in Kurdistan, funding a Kurdish-related project is not a priority for businesses. Even in Southern Kurdistan, where there has been autonomy (kind of) since 1992, there has not been a big advancement in this field.

Such a difficulty seems to be common to less-resourced languages. The following paragraph from the summary of the discussion on Less resourced languages and Language technology of the seventh international conference on Language Resources and Evaluation (LREC) explains how getting a project funded for a less-resourced language may be challenging:

One of the problems that was underlined is the difficulties in convincing politicians to fund the creation of language resources (LR) for less-resourced languages (LRL). Per Langgård suggested that it would be necessary to build a scheme to assist developers to have success in that endeavour; Khalid Choukri said that even for large European languages it was also difficult to convince European Union politicians to fund R&D in the field, and that we needed to give politicians a larger picture and something they can sell to the media. Along the same lines, Igor Leturia mentioned that we should convince politicians that we do not only do research but that we produce products that politicians can see.

Public awareness

In addition to the aforementioned factors, I should bitterly admit that there has been a kind of ignorance among the Kurds regarding their language and its correct usage as their formal language. I hope that my generation can use the power of the Internet to make more people aware of the importance of language technology.

What to do then?

For you as a person

Use UTF-8, Please!

Each piece of text that you type is useful for making a language processable. Use a Kurdish keyboard and care about the correctness of what you type, as much as you can. Happily, there are currently many keyboards available on Google Play and iTunes which support Kurdish alphabets in UTF-8. (Take a look at Gboard - the Google Keyboard if you are an Android user).
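
One practical reason why careful typing matters: on a keyboard that is not designed for Kurdish, look-alike Arabic letters can be typed in place of the Kurdish ones. They look nearly identical on screen, but they are different Unicode code points, so a machine treats the two spellings as different words. The Python sketch below shows a tiny clean-up step under my own assumptions: the name normalize_kurdish and the three replacements are illustrative, and are far from a complete normalisation scheme.

```python
# Illustrative look-alike replacements (an assumption, not an exhaustive list).
LOOKALIKES = str.maketrans({
    "\u064a": "\u06cc",  # ARABIC LETTER YEH          -> ARABIC LETTER FARSI YEH
    "\u0649": "\u06cc",  # ARABIC LETTER ALEF MAKSURA -> ARABIC LETTER FARSI YEH
    "\u0643": "\u06a9",  # ARABIC LETTER KAF          -> ARABIC LETTER KEHEH
})


def normalize_kurdish(text: str) -> str:
    """Replace look-alike Arabic letters with the ones used for Kurdish."""
    return text.translate(LOOKALIKES)


# "Kurdî" typed with the Arabic kaf and yeh instead of the Kurdish letters.
typed = "\u0643\u0648\u0631\u062f\u064a"
clean = normalize_kurdish(typed)

print(typed == clean)              # False: same look, different code points
print(len(clean.encode("utf-8")))  # 10: each of the five letters takes two bytes
```

Typing the right letters in the first place is of course better than cleaning up afterwards, which is why a proper Kurdish keyboard matters.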

Understand the value of data

We create information every day by sending messages to our friends and visiting social media. Try creating content in Kurdish. Blogging is a very interesting way to let others know about your ideas.

Be creative and think about NLP

No matter what your expertise is, you can make a change in the current situation. If you are a Kurdish music fan, you can write down the lyrics of Kurdish songs and create a corpus from them. If you are interested in medical sciences, you can collect the terminology of your profession. If you are teaching a module related to computer science, NLP or computational linguistics, why not let your students work on a Kurdish-related mini-project for their final project? And so on.

Research in Kurdish language processing

We need to do more research in Kurdish language processing. I strongly believe in open data and open-source tools, given the current lack of investment.

For you as a business or entrepreneur

Invest

Just like any other field in computer engineering, your business can make money by creating Kurdish language technology.

Fund research projects

Collaborate with academic research units and fund research projects related to Kurdish language processing. Taking a few interns per year can make a huge contribution to the field.

Promote usage of tools for Kurdish

News agencies, publishing houses, authorities and all those who have a voice in society can promote the usage of tools which are made for Kurdish language processing.

Footnotes

[1] “jerej” means “partridge” in Hawrami Kurdish. In Sorani and Kurmanji Kurdish, “kew” and “vitik” are used, respectively.


Last updated on 26 March 2019.