Welcome / Hûn bi xêr hatin / بە خێر بێن! 🙂

Introduction

Language technology is an increasingly important field in our information era. It depends on our knowledge of human language and on computational methods to process it. While the latter progress constantly as new methods and more efficient techniques are invented, the processability of human languages does not evolve at the same pace. This is particularly true of languages with scarce resources and limited grammars, also known as less-resourced languages.

Despite a plethora of performant tools and frameworks for natural language processing (NLP), such as NLTK, Stanza and spaCy, progress on less-resourced languages is often hindered not only by the lack of basic tools and resources but also by the limited availability of previous studies under an open-source licence. This is particularly the case for Kurdish.

Kurdish Language

Kurdish is a less-resourced Indo-European language spoken by 20-30 million speakers in the Kurdish regions of Turkey, Iraq, Iran and Syria, and among the Kurdish diaspora around the world. It is mainly spoken in four dialects (also referred to as languages): Kurmanji, Sorani, Southern Kurdish and Laki.

Kurdish has historically been written in various scripts, namely Cyrillic, Armenian, Latin and Arabic, of which the latter two are still widely in use. Efforts to standardize the Kurdish alphabets and orthographies have not been adopted by all Kurdish speakers in all regions. As such, the Kurmanji dialect is mostly written in the Latin-based script, while Sorani, Southern Kurdish and Laki are mostly written in the Arabic-based script.
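Because the same language appears in two scripts, a first practical step when handling Kurdish text is often to check which script a given string uses. The following is a minimal sketch using only the Python standard library. It is not part of KLPT, and the function name `dominant_script` is hypothetical. Note that the script alone does not determine the dialect; it only reflects the convention described above.

```python
import unicodedata
from collections import Counter


def dominant_script(text: str) -> str:
    """Return 'Latin', 'Arabic', or 'Unknown' for the script most letters use.

    Hypothetical helper for illustration only; it is not a KLPT function.
    """
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, punctuation, and whitespace
        # Unicode character names start with the script name,
        # e.g. "LATIN SMALL LETTER E WITH CIRCUMFLEX" or "ARABIC LETTER BEH".
        name = unicodedata.name(ch, "")
        if name.startswith("LATIN"):
            counts["Latin"] += 1
        elif name.startswith("ARABIC"):
            counts["Arabic"] += 1
    if not counts:
        return "Unknown"
    return counts.most_common(1)[0][0]


if __name__ == "__main__":
    print(dominant_script("Hûn bi xêr hatin"))  # Latin-based script
    print(dominant_script("بە خێر بێن"))         # Arabic-based script
```

A check like this can help choose the script to pass to downstream processing, but the dialect itself still has to be known or identified separately.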

KLPT - The Kurdish Language Processing Toolkit

KLPT, the Kurdish Language Processing Toolkit, is an NLP toolkit for the Kurdish language. The current version (0.1) comes with four core modules, namely preprocess, stem, transliterate and tokenize, and addresses basic language processing tasks such as text preprocessing, stemming, tokenization, spelling error detection and correction, and morphological analysis for the Sorani and Kurmanji dialects of Kurdish. More importantly, it is an open-source project!

To find out more about how to use the tool, please check the "User Guide" section of this website.

Cite this project

Please consider citing this paper if you use any part of the data or the tool:

@inproceedings{ahmadi2020klpt,
    title = "{KLPT} {--} {K}urdish Language Processing Toolkit",
    author = "Ahmadi, Sina",
    booktitle = "Proceedings of Second Workshop for NLP Open Source Software (NLP-OSS)",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.nlposs-1.11",
    pages = "72--84"
}

You can also watch the presentation of this paper at https://slideslive.com/38939750/klpt-kurdish-language-processing-toolkit.

License

Kurdish Language Processing Toolkit by Sina Ahmadi is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License which means: