Kurdish lexicographical resources

Towards Kurdish e-lexicography

This paper describes the development of lexicographic resources for Kurdish and provides a lexical model for the language. Kurdish is considered a less-resourced language and currently lacks machine-readable lexical resources. Linked Data and the Semantic Web offer e-lexicography a unique potential: they enable interoperability across lexical resources by elevating traditional linguistic data to machine-processable semantic formats. We therefore present our lexicon in the OntoLex-Lemon ontology, a standard model for sharing lexical information on the Semantic Web. The research covers the Sorani, Kurmanji, and Hawrami dialects of Kurdish. It suggests that, although Kurdish is a less-resourced language, it has a wide range of documented lexicons; because these are not machine-readable, however, they cannot contribute to language processing. The outcome of this project, which is publicly available, assists scholars in their efforts towards making Kurdish a resource-rich language.
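To give a feel for what an entry in the OntoLex-Lemon model can look like, here is a minimal sketch that serializes one lexical entry as Turtle using only the Python standard library. It is an illustration under stated assumptions, not the format of the published data: the `ex:` base IRI, the entry identifier `av_n`, the helper names, and the sample entry (Kurmanji "av", glossed as "water") are all hypothetical, and the use of `skos:definition` on the sense is just one common modelling choice. The language tag `kmr` is the ISO 639-3 code for Northern Kurdish (Kurmanji).

```python
# Illustrative sketch only: the base IRI, identifiers, and sample entry are
# assumptions and are not taken from the published lexicon.

PREFIXES = {
    "ontolex": "http://www.w3.org/ns/lemon/ontolex#",
    "lexinfo": "http://www.lexinfo.net/ontology/2.0/lexinfo#",
    "skos": "http://www.w3.org/2004/02/skos/core#",
    "ex": "http://example.org/kurdish-lexicon/",  # hypothetical base IRI
}


def literal(text, lang):
    """Return a language-tagged Turtle literal, escaping backslashes and quotes."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"@{lang}'


def entry_to_turtle(entry_id, lemma, lang, pos, gender=None, gloss=None, gloss_lang="en"):
    """Serialize one word entry as an OntoLex-Lemon Turtle snippet."""
    subject = f"ex:{entry_id}"

    # Properties attached to the lexical entry itself.
    entry_props = [
        "a ontolex:LexicalEntry, ontolex:Word",
        f"lexinfo:partOfSpeech lexinfo:{pos}",
    ]
    if gender:
        entry_props.append(f"lexinfo:gender lexinfo:{gender}")
    entry_props.append(f"ontolex:canonicalForm {subject}_form")
    if gloss:
        entry_props.append(f"ontolex:sense {subject}_sense")

    blocks = [
        "\n".join(f"@prefix {p}: <{iri}> ." for p, iri in PREFIXES.items()),
        f"{subject} " + " ;\n    ".join(entry_props) + " .",
        # The canonical form carries the written representation of the lemma.
        f"{subject}_form a ontolex:Form ;\n"
        f"    ontolex:writtenRep {literal(lemma, lang)} .",
    ]
    if gloss:
        blocks.append(
            f"{subject}_sense a ontolex:LexicalSense ;\n"
            f"    skos:definition {literal(gloss, gloss_lang)} ."
        )
    return "\n\n".join(blocks) + "\n"


if __name__ == "__main__":
    print(entry_to_turtle("av_n", "av", "kmr", "noun", gender="feminine", gloss="water"))
```

Building the Turtle by hand keeps the sketch dependency-free; for real data, an RDF library that validates and serializes the graph would be the safer choice.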

Development workflow

The following is the workflow that we follow to create our electronic lexicographic resources:

[Figure: Workflow to create lexicographic resources for Kurdish]

I wish it had been as straightforward as it looks, but it was not. The task required our full attention to manually extract the relevant information and to validate the parts that were retrieved semi-automatically. The following video shows how hectic the task was:

[Video: Manually curating and cleaning Kurdish lexicographic resources]

Evaluation

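| Resource | Entries: Word | Entries: MWE | Attributes: Gender & POS | Attributes: Etymology | Attributes: Idioms | Attributes: Examples | Polysemy degree |
|----------|---------------|--------------|--------------------------|-----------------------|--------------------|----------------------|-----------------|
| Kurmanji | 4172 | 122 | 3420 (76.64%) | 213 (4.96%) | 340 | 265 (6.35%) | 1.03% |
| Sorani | 5683 | 160 | 5348 (91.37%) | 111 (1.89%) | 82 | 543 (9.55%) | 1.06% |
| Hawrami | 1184 | 165 | 1184 (87.76%) | 242 (17.93%) | 123 | 10 (0.008%) | 1.01 |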

To find out more, read our paper. Our data is available at Kurdish lexicographical resources.

Reference

If you use these resources in your research, please don’t forget to cite our paper:

@inproceedings{ahmadi2019kurdishlex,
  title = "Towards Electronic Lexicography for the {K}urdish Language",
  author = "Ahmadi, Sina and Hassani, Hossein and McCrae, John P.",
  booktitle = "Proceedings of the sixth biennial conference on electronic lexicography (eLex)",
  month = "10",
  year = "2019",
  address = "Sintra, Portugal",
  url = "https://elex.link/elex2019/wp-content/uploads/2019/09/eLex_2019_50.pdf",
  pages = "881--906"
}

License

This corpus is openly available for non-commercial use under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) license.