Monolingual Word Sense Alignment (MWSA) is the task of aligning word senses across resources in the same language. A word can be defined in different ways in different resources; word sense alignment is the task of finding out which of those definitions describe the same or a related sense. You can find out more about this task in this post.
As the core of my Ph.D. work within the ELEXIS project, we developed a set of 17 datasets for the task of MWSA. These datasets cover 15 languages and are based on expert-made dictionaries along with collaboratively curated ones, such as Wiktionary. The following table shows the statistics of the datasets: each cell gives the number of senses, with the number of words in the definitions in parentheses. An X means that no figure is given for that part of speech.
| Language | Resource | Nouns | Verbs | Adjectives | Adverbs | Other | All |
|---|---|---|---|---|---|---|---|
| Basque (eu) | Basque Wordnet | 929 (6836) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 929 (6836) |
| | Euskal Hiztegia | 971 (7754) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 971 (7754) |
| Bulgarian (bg) | BTB-WN | 1394 (15649) | 175 (1698) | 305 (3187) | 50 (338) | 0 (0) | 1924 (20872) |
| | Bulgarian Wiktionary | 1273 (12883) | 164 (1107) | 194 (1418) | 39 (306) | 0 (0) | 1670 (15714) |
| Danish (da) | Ordbog over det danske Sprog | 2176 (282040) | 983 (119163) | 436 (60599) | 0 (0) | 0 (0) | 3595 (461802) |
| | Den Danske Ordbog | 1036 (12326) | 383 (4045) | 248 (2228) | 0 (0) | 0 (0) | 1667 (18599) |
| Dutch (nl) | Woordenboek der Nederlandsche Taal | 1459 (28979) | 405 (5185) | 527 (7878) | 106 (2662) | 0 (0) | 2497 (44704) |
| | Algemeen Nederlands Woordenboek | 497 (8443) | 140 (1542) | 109 (1393) | 13 (172) | 0 (0) | 759 (11550) |
| English (KD) (en) | Global | 92 (532) | 107 (617) | 80 (457) | 57 (257) | 61 (283) | 397 (2146) |
| | Password | 66 (536) | 72 (417) | 62 (324) | 33 (177) | 46 (188) | 279 (1642) |
| English (NUIG) (en) | Webster 1913 | 1131 (11606) | 741 (4622) | 373 (2585) | 45 (269) | 0 (0) | 2290 (19082) |
| | Princeton WordNet | 730 (12166) | 496 (6980) | 249 (2892) | 24 (207) | 0 (0) | 1499 (22245) |
| Estonian (et) | Dictionary of Estonian (EKS) | 543 (4012) | 273 (1598) | 151 (747) | 98 (451) | 78 (370) | 1143 (7178) |
| | Estonian Basic Dictionary (PSV) | 543 (4492) | 273 (1983) | 151 (1097) | 98 (596) | 79 (468) | 1144 (8636) |
| German (de) | German Wiktionary | 2026 (15160) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 2026 (15160) |
| | German OmegaWiki | 1266 (14354) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 1266 (14354) |
| Hungarian (hu) | Comprehensive | X | X | X | X | X | 1355 (14654) |
| | Explanatory | X | X | X | X | X | 1038 (10934) |
| Irish (ga) | An Foclóir Beag | 891 (8053) | 11 (95) | 55 (267) | 10 (56) | 36 (171) | 1003 (8642) |
| | Irish Wiktionary | 1209 (6696) | 8 (45) | 61 (181) | 10 (41) | 36 (109) | 1324 (7072) |
| Italian (it) | ItalWordNet | 408 (3128) | 352 (2411) | 0 (0) | 0 (0) | 0 (0) | 760 (5539) |
| | SIMPLE | 290 (1990) | 218 (1240) | 0 (0) | 0 (0) | 0 (0) | 508 (3230) |
| Serbian (sr) | Serbian WordNet | 691 (5864) | 985 (6522) | 92 (713) | 0 (0) | 0 (0) | 1768 (13099) |
| | Dictionary of Serbo-Croatian Literary Language | 289 (2360) | 281 (1527) | 29 (215) | 0 (0) | 0 (0) | 599 (4102) |
| Slovenian (JSI) (sl) | Slovene WordNet | 409 (1106) | 303 (901) | 237 (733) | 44 (133) | 0 (0) | 993 (2873) |
| | Slovene Lexical Database | 284 (2237) | 191 (1047) | 220 (1486) | 29 (102) | 0 (0) | 724 (4872) |
| Slovenian (ISJFR) (sl) | Standard Slovenian Dictionary (eSSKJ) | 229 (2060) | 109 (911) | 76 (620) | 0 (0) | 60 (588) | 474 (4179) |
| | Kostelski slovar | 151 (1050) | 61 (308) | 45 (257) | 0 (0) | 38 (263) | 295 (1878) |
| Spanish (es) | Diccionario de la lengua española | 617 (7986) | 225 (2426) | 305 (3269) | 26 (161) | 24 (250) | 1197 (14092) |
| | Spanish Wiktionary | 602 (6421) | 227 (2045) | 294 (2825) | 25 (129) | 22 (123) | 1170 (11543) |
| Portuguese (pt-pt) | Dicionário da Língua Portuguesa Contemporânea | 285 (4060) | 58 (686) | 110 (1287) | 9 (143) | 1 (9) | 463 (6185) |
| | Dicionário Aberto | 199 (1521) | 53 (203) | 67 (372) | 3 (15) | 1 (5) | 323 (2116) |
| Russian (ru) | Ozhegov-Shvedova | 258 (2038) | 109 (615) | 101 (533) | 15 (77) | 44 (368) | 527 (3631) |
| | Dictionary of the Russian Language (MAS) | 310 (2811) | 173 (1338) | 190 (1219) | 20 (114) | 71 (1010) | 764 (6492) |
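
Since each cell pairs a sense count with a word count, the average definition length of a resource can be read off directly. The snippet below is a small sketch of that arithmetic; the helper names `parse_cell` and `mean_definition_length` are made up for illustration and are not part of the datasets or their tooling. Cells marked X (as in the Hungarian rows) do not match the expected pattern and raise a `ValueError`.

```python
import re


def parse_cell(cell: str) -> tuple[int, int]:
    """Split a table cell such as '929 (6836)' into (senses, words)."""
    match = re.fullmatch(r"\s*(\d+)\s*\((\d+)\)\s*", cell)
    if match is None:
        raise ValueError(f"unexpected cell format: {cell!r}")
    return int(match.group(1)), int(match.group(2))


def mean_definition_length(cell: str) -> float:
    """Average number of words per sense definition for one cell."""
    senses, words = parse_cell(cell)
    return words / senses if senses else 0.0


# Values copied from the "All" column of the table.
for resource, cell in [
    ("Basque Wordnet", "929 (6836)"),
    ("Ordbog over det danske Sprog", "3595 (461802)"),
]:
    print(f"{resource}: {mean_definition_length(cell):.1f} words per sense")
```

Comparing the two resources this way makes the difference in definition style visible: the Danish historical dictionary has far longer definitions per sense than the Basque Wordnet.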
The following shows a sample of the dataset:
```json
{
  "lemma": "prehistoric",
  "part-of-speech_tag": "adjective",
  "gender": "",
  "meta_ID": "",
  "resource_1_senses": [
    {
      "#text": "belonging to or existing in times before recorded history",
      "external_ID": "prehistoric.s.01"
    },
    {
      "#text": "of or relating to times before written history",
      "external_ID": "prehistoric.a.02"
    },
    {
      "#text": "no longer fashionable",
      "external_ID": "prehistoric.s.03"
    }
  ],
  "resource_2_senses": [
    {
      "#text": "of or pertaining to a period before written history begins;",
      "external_ID": ""
    }
  ],
  "alignment": [
    {
      "sense_source": "belonging to or existing in times before recorded history",
      "sense_target": "of or pertaining to a period before written history begins;",
      "semantic_relationship": "related"
    },
    {
      "sense_source": "of or relating to times before written history",
      "sense_target": "of or pertaining to a period before written history begins;",
      "semantic_relationship": "exact"
    }
  ]
}
```
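
To make the structure of an entry concrete, here is a minimal Python sketch that reads the sample entry and turns it into labelled sense pairs. The field names come from the sample; everything else is an assumption for illustration: the function name `candidate_pairs` is hypothetical, and the `"none"` label used for pairs that do not appear under `alignment` is a placeholder chosen here, not something the sample itself specifies.

```python
import json

# Copy of the sample entry, trimmed to the fields used in this sketch.
ENTRY_JSON = """
{
  "lemma": "prehistoric",
  "resource_1_senses": [
    {"#text": "belonging to or existing in times before recorded history"},
    {"#text": "of or relating to times before written history"},
    {"#text": "no longer fashionable"}
  ],
  "resource_2_senses": [
    {"#text": "of or pertaining to a period before written history begins;"}
  ],
  "alignment": [
    {"sense_source": "belonging to or existing in times before recorded history",
     "sense_target": "of or pertaining to a period before written history begins;",
     "semantic_relationship": "related"},
    {"sense_source": "of or relating to times before written history",
     "sense_target": "of or pertaining to a period before written history begins;",
     "semantic_relationship": "exact"}
  ]
}
"""


def candidate_pairs(entry: dict, default_label: str = "none") -> list[tuple[str, str, str]]:
    """Label every (resource 1 sense, resource 2 sense) pair of one entry.

    Pairs listed under "alignment" keep their semantic relationship;
    all other pairs get `default_label` (an assumption of this sketch).
    """
    labels = {
        (a["sense_source"], a["sense_target"]): a["semantic_relationship"]
        for a in entry["alignment"]
    }
    return [
        (s1["#text"], s2["#text"], labels.get((s1["#text"], s2["#text"]), default_label))
        for s1 in entry["resource_1_senses"]
        for s2 in entry["resource_2_senses"]
    ]


entry = json.loads(ENTRY_JSON)
for source, target, label in candidate_pairs(entry):
    print(f"[{label}] {source}  <->  {target}")
```

Note that the alignment refers to senses by their definition text rather than by an ID, so the lookup above matches on the exact `#text` strings.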
The datasets are freely and publicly available at https://github.com/elexis-eu/MWSA.
If you’re using any part of these datasets, please don’t forget to cite our paper:
```bibtex
@inproceedings{ahmadi2020multilingual,
  title={A Multilingual Evaluation Dataset for Monolingual Word Sense Alignment},
  author={Ahmadi, Sina and McCrae, John P. and Nimb, Sanni and Khan, Fahad and Monachini, Monica and Pedersen, Bolette S. and Declerck, Thierry and Wissik, Tanja and Bellandi, Andrea and Pisani, Irene and Troelsgård, Thomas and Olsen, Sussi and Krek, Simon and Lipp, Veronika and Váradi, Tamás and Simon, László and Győrffy, András and Tiberius, Carole and Schoonheim, Tanneke and Ben Moshe, Yifat and Rudich, Maya and Abu Ahmad, Raya and Lonke, Dorielle and Kovalenko, Kira and Langemets, Margit and Kallas, Jelena and Dereza, Oksana and Fransen, Theodorus and Cillessen, David and Lindemann, David and Alonso, Mikel and Salgado, Ana and Sancho, José Luis and Ureña-Ruiz, Rafael-J. and Simov, Kiril and Osenova, Petya and Kancheva, Zara and Radev, Ivaylo and Stanković, Ranka and Perdih, Andrej and Gabrovšek, Dejan},
  booktitle={Proceedings of the 12th Language Resources and Evaluation Conference (LREC 2020)},
  year={2020},
  date={2020-05-11},
  address={Marseille, France}
}
```