Zaza-Gorani Corpus


In 2008, when I was 15, I visited Hawraman for the first time, specifically Hawraman Takht. I remember how intrigued I was by everything! I went to a local grocery, and when the saleswoman counted out my change I understood almost nothing! One word that I still remember her using was yerê [yaere:] for "three", whereas in Sorani Kurdish we say [se:].

Not only did the language fascinate me; I was also impressed by the local way of life. People build their houses like stairways on steep mountain slopes. Minorsky describes the area in The Guran as follows:

In the gorges of Awraman (near Tawele and Beyare) one cannot help admiring the extraordinary skill with which the villagers build up and utilize small terraces of land for gardening and general crops.

Although the Kurdish- and Hawrami-speaking regions are geographically contiguous and have long been historically and culturally close, the languages themselves differ significantly. The common belief is that Hawrami is a dialect of Kurdish. At the time, I was not really aware of the linguistic differences between the two; I thought of Hawraman as just another Kurdish-speaking region and therefore considered Hawrami a dialect of Kurdish. Now I know that I was wrong!

A few photos of Hawraman. Photo by courtesy of Salah Elmizade.

A few years later, I came to know about another ethnic group, the Zazas. Because they live in the Kurdish regions of Turkey, I found that Zazaki is also considered a dialect of Kurdish (by some). After reading a few references about Zazaki and Gorani, the language to which Hawrami also belongs, I realized that I had not been well informed about the two: they in fact belong to a separate branch called the Zaza-Gorani language family, not to the Kurdish branch of the Indo-European languages. Zazaki and Gorani, along with Shabaki, all belong to the Zaza-Gorani language family and have many characteristics in common.

The Zaza-Gorani corpus

As a native Kurdish speaker interested in languages, and particularly in computational linguistics, I recently returned to this question to better understand the differences between Kurdish and the Zaza-Gorani languages. To this end, I created a corpus from news articles collected from various sources on several topics, such as science, politics, culture and art. The outcome of this project is presented in this paper. The following table describes the corpus:

                        Zazaki        Gorani
Number of articles      4,855         428
Word tokens             1,633,770     194,563
Word types              102,665       41,454
Word types              10,802,266    2,246,425
Average word length     4.84          5.50
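As a rough illustration of how figures like those in the table above can be computed, here is a minimal Python sketch. The regular-expression tokenizer, the lowercasing step, and the function name corpus_statistics are illustrative assumptions, not the exact procedure behind the table. In particular, the average word length is taken over tokens here, and the simple \w+ pattern ignores details such as combining marks and zero-width characters.

```python
import re
from collections import Counter


def corpus_statistics(texts):
    """Compute simple corpus statistics for a list of article strings.

    Assumption: a token is any maximal run of Unicode word characters,
    compared case-insensitively. Real corpora usually need a more careful,
    script-aware tokenizer.
    """
    tokens = [tok for text in texts for tok in re.findall(r"\w+", text.lower())]
    type_counts = Counter(tokens)  # distinct word forms and their frequencies
    n_tokens = len(tokens)
    avg_len = sum(len(tok) for tok in tokens) / n_tokens if n_tokens else 0.0
    return {
        "articles": len(texts),
        "tokens": n_tokens,
        "types": len(type_counts),
        "avg_word_length": avg_len,
    }


# Toy input (not taken from the corpus): two tiny "articles".
stats = corpus_statistics(["yerê yerê se", "se du"])
print(stats)
```

Running the same function separately over the Zazaki and the Gorani articles would give one column of statistics per language.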

Moreover, I examined Zipf's law, a rank-frequency distribution, for the Kurdish and Zaza-Gorani corpora. The law states that in a reasonably large data set, such as a language corpus, the frequency of a word is roughly inversely proportional to its rank in the frequency list: frequency is proportional to 1/rank^a for some exponent a close to 1. Because this is a power law, plotting frequency against rank with both axes on logarithmic scales gives an approximately straight line.

Figure: Zipf's law in the Pewan corpus, containing Sorani and Kurmanji Kurdish texts, versus the Zaza-Gorani corpus, containing Zazaki and Gorani texts.
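The rank-frequency relationship behind such a comparison can be checked with a few lines of Python. The sketch below ranks words by frequency and estimates the power-law exponent with an ordinary least-squares line fitted in log-log space. The function names rank_frequency and zipf_exponent are hypothetical, and a least-squares fit is only one simple way to estimate the exponent; it is not necessarily the method used to produce the figure.

```python
import math
from collections import Counter


def rank_frequency(tokens):
    """Return (rank, frequency) pairs, with rank 1 for the most frequent word."""
    freqs = sorted(Counter(tokens).values(), reverse=True)
    return list(enumerate(freqs, start=1))


def zipf_exponent(rank_freq):
    """Estimate a in frequency ~ C / rank**a by least squares on log-log data.

    Needs at least two (rank, frequency) pairs.
    """
    xs = [math.log(rank) for rank, _ in rank_freq]
    ys = [math.log(freq) for _, freq in rank_freq]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return -sxy / sxx  # the fitted slope is negative, so negate it


# Toy data built to follow frequency = 12 / rank exactly.
toy_tokens = ["a"] * 12 + ["b"] * 6 + ["c"] * 4 + ["d"] * 3
pairs = rank_frequency(toy_tokens)
print(pairs)
print(round(zipf_exponent(pairs), 3))
```

On real text the points only approximately follow a line, so the estimated exponent depends on tokenization and on how many low-frequency words are included.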

Get the corpus

Download the Zaza-Gorani corpus at https://github.com/sinaahmadi/ZazaGoraniCorpus.

If you use this resource, please cite the following publication:

@inproceedings{ahmadi2020zazagorani,
  title= "Building a Corpus for the Zaza–Gorani Language Family",
  author= "Ahmadi, Sina",
  booktitle="Proceedings of the Seventh Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2020)",
  pages="",
  year="2020"
}