Kurdish Multilingual Parallel Corpus


Aligned corpora are essential resources with many applications in natural language processing, particularly in machine translation. Such a resource consists of pairs of sentences, or chunks of sentences, in two languages or dialects that are translations of each other, and is therefore useful for training an automatic translation model. One can imagine how tedious and expensive manual translation can be!

Figure: Our approach to automatically retrieving identical news articles.

Kurdish, as a less-resourced language, lacks the large parallel corpora required for further progress in statistical and neural machine translation. To tackle this problem in a viable way, we create a large parallel corpus for the Kurmanji and Sorani dialects of Kurdish and extend it with Kurdish-English translations, using multilingual news websites as the source. Although there are many multilingual and multi-dialectal news websites for Kurdish, none of them provides its content in an interoperable way. In other words, the same news article is translated differently in each version, and often the translations are not even linked to one another.

Parallel corpus

Our parallel corpus contains three manually aligned corpora, Sorani-Kurmanji, Sorani-English and Kurmanji-English, in several formats: Translation Memory eXchange (.tmx), parallel annotated text for use with ParaConc, and raw parallel text (.txt). In the raw text format, each line corresponds to the same line in the other aligned file. The corpus contains 12,327 translation pairs between the two major dialects of Kurdish, Sorani and Kurmanji. We also provide 1,797 English-Kurmanji and 650 English-Sorani translation pairs.
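As a rough illustration of how the raw text and .tmx formats described above could be read, here is a minimal Python sketch. It is not part of the corpus distribution. The function names and any file names passed to them are hypothetical. The TMX reader assumes the common TMX 1.4 layout (`tu` elements containing `tuv` elements with an `xml:lang` attribute and a `seg` child, with no XML namespace on the elements). The actual files in the repository may be organised differently.

```python
import xml.etree.ElementTree as ET

# Fully qualified name that ElementTree uses for the xml:lang attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def read_lines(path):
    """Read a UTF-8 text file into a list of lines without trailing newlines."""
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]


def read_line_aligned(source_path, target_path):
    """Pair line i of the source file with line i of the target file.

    Raises ValueError if the two files differ in length, since the
    alignment would then be unreliable.
    """
    source_lines = read_lines(source_path)
    target_lines = read_lines(target_path)
    if len(source_lines) != len(target_lines):
        raise ValueError(
            f"line counts differ: {len(source_lines)} vs {len(target_lines)}"
        )
    return list(zip(source_lines, target_lines))


def read_tmx(tmx_path):
    """Return one {language code: segment text} dict per translation unit.

    Assumes un-namespaced TMX 1.4 style elements (tu / tuv / seg).
    """
    units = []
    for tu in ET.parse(tmx_path).getroot().iter("tu"):
        unit = {}
        for tuv in tu.iter("tuv"):
            # TMX 1.4 uses xml:lang; older files may use a plain lang attribute.
            lang = tuv.get(XML_LANG) or tuv.get("lang")
            seg = tuv.find("seg")
            if lang and seg is not None:
                unit[lang] = "".join(seg.itertext()).strip()
        units.append(unit)
    return units
```

Calling `read_line_aligned` with the paths of two aligned .txt files returns a list of `(source, target)` tuples. The length check is a deliberate design choice: a single missing line would otherwise silently shift every later pair.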

It should be noted that machine translation methods, particularly those based on neural networks, have a voracious appetite for data. As a consequence, these datasets may be more suitable for evaluation, fine-tuning, or training in low-resource setups.

The following shows a sample of the corpus:

Source | Translation target
لە سیاسەتی فەرمیدا ئەرکێکی زۆری خستە سەرشانی. | Di siyaseta legal de wezîfeyên girîng da ser milê xwe.
Bi biharê re li çiyayên Kurdistanê gelek celebên pincaran şîn bû. | In the mountains of Kurdistan different kinds of edible plants grow in spring.
لەشاری کاراسنۆدار-ی فیدراسیۆنی ڕووسیا کۆرسی زمانی کوردی بۆمندالان کرایەوە. | A Kurdish language course has been started for children in the Russian city of Krasnodar.

Nota bene

Cite this corpus

Download this parallel corpus at https://github.com/KurdishBLARK/InterdialectCorpus. If you use this resource, please cite the following paper:

@misc{ahmadi2020leveraging,
      title={Leveraging Multilingual News Websites for Building a Kurdish Parallel Corpus}, 
      author={Sina Ahmadi and Hossein Hassani and Daban Q. Jaff},
      year={2020},
      eprint={2010.01554},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}