Pewan is the first standard test collection for evaluating Sorani information retrieval systems. To build Pewan, we carefully followed TREC’s standard methodology for constructing test collections. Specifically, we first collected a large volume of documents written in Sorani, and then used a desktop search tool to compile a list of queries. Next, we used three widely used open-source information retrieval systems, together with our own implementations of two well-known retrieval models, to create a result pool for each query. Our team members then manually assessed these pools to produce the list of relevant documents for each query.
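The pooling step can be illustrated with a minimal sketch. It assumes each system produces a ranked list of document IDs per query and that the pool for a query is the union of the top-ranked documents from every system. The function name `build_pools`, the pool depth, and the document and query IDs are all hypothetical, since the depth and data formats actually used for Pewan are not stated here.

```python
from typing import Dict, List, Set

# A run maps each query ID to a ranked list of document IDs (best first).
Run = Dict[str, List[str]]


def build_pools(runs: List[Run], depth: int) -> Dict[str, Set[str]]:
    """Union the top-`depth` documents of every run, per query."""
    pools: Dict[str, Set[str]] = {}
    for run in runs:
        for query_id, ranked_docs in run.items():
            # Only the first `depth` results of each system enter the pool.
            pools.setdefault(query_id, set()).update(ranked_docs[:depth])
    return pools


# Hypothetical runs from three systems for two queries.
run_a = {"q1": ["d1", "d2", "d3"], "q2": ["d7", "d8"]}
run_b = {"q1": ["d2", "d4", "d1"], "q2": ["d8", "d9"]}
run_c = {"q1": ["d5", "d1", "d2"]}

# depth=2 is an assumed value chosen only for this illustration.
pools = build_pools([run_a, run_b, run_c], depth=2)
for query_id in sorted(pools):
    print(query_id, sorted(pools[query_id]))
```

Only the documents in each pool are then judged by assessors, which keeps the manual effort manageable compared with judging the whole corpus for every query.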
The Pewan text corpus contains Sorani and Kurmanji texts collected by crawling the websites of news agencies. In total, 115,340 Sorani articles and 25,572 Kurmanji articles were collected. The articles are dated between 2003 and 2012, and their sizes range from 1 KB to 154 KB (2.6 KB on average).
| | Sorani | Kurmanji |
|---|---|---|
| Number of articles | 115,340 | 25,572 |
| Number of distinct words | 501,054 | 127,272 |
| Total number of words | 18,110,723 | 4,120,027 |
Download Pewan at https://github.com/klpp/pewan/.
If you use this resource, please cite the following publications: