A Multilingual Dataset for Monolingual Word Sense Alignment

Monolingual Word Sense Alignment (MWSA)

Monolingual Word Sense Alignment (MWSA) is the task of aligning word senses across resources in the same language. The same word can be defined differently in different resources, so the task is to determine which senses in one resource correspond to which senses in another, and how they are related. You can find out more about this task in this post.

As the principal material of my Ph.D. within the ELEXIS project, we developed a set of 17 datasets for the MWSA task. These datasets cover 15 languages and are based on expert-made dictionaries as well as collaboratively curated ones, such as Wiktionary. The following table reports the number of senses per part of speech for each resource, with the number of words in the definitions given in parentheses.

| Language | Resource | Nouns | Verbs | Adjectives | Adverbs | Other | All |
|---|---|---|---|---|---|---|---|
| Basque (eu) | Basque Wordnet | 929 (6836) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 929 (6836) |
| | Euskal Hiztegia | 971 (7754) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 971 (7754) |
| Bulgarian (bg) | BTB-WN | 1394 (15649) | 175 (1698) | 305 (3187) | 50 (338) | 0 (0) | 1924 (20872) |
| | Bulgarian Wiktionary | 1273 (12883) | 164 (1107) | 194 (1418) | 39 (306) | 0 (0) | 1670 (15714) |
| Danish (da) | Ordbog over det danske Sprog | 2176 (282040) | 983 (119163) | 436 (60599) | 0 (0) | 0 (0) | 3595 (461802) |
| | Den Danske Ordbog | 1036 (12326) | 383 (4045) | 248 (2228) | 0 (0) | 0 (0) | 1667 (18599) |
| Dutch (nl) | Woordenboek der Nederlandsche Taal | 1459 (28979) | 405 (5185) | 527 (7878) | 106 (2662) | 0 (0) | 2497 (44704) |
| | Algemeen Nederlands Woordenboek | 497 (8443) | 140 (1542) | 109 (1393) | 13 (172) | 0 (0) | 759 (11550) |
| English (KD) (en) | Global | 92 (532) | 107 (617) | 80 (457) | 57 (257) | 61 (283) | 397 (2146) |
| | Password | 66 (536) | 72 (417) | 62 (324) | 33 (177) | 46 (188) | 279 (1642) |
| English (NUIG) (en) | Webster 1913 | 1131 (11606) | 741 (4622) | 373 (2585) | 45 (269) | 0 (0) | 2290 (19082) |
| | Princeton WordNet | 730 (12166) | 496 (6980) | 249 (2892) | 24 (207) | 0 (0) | 1499 (22245) |
| Estonian (et) | Dictionary of Estonian (EKS) | 543 (4012) | 273 (1598) | 151 (747) | 98 (451) | 78 (370) | 1143 (7178) |
| | Estonian Basic Dictionary (PSV) | 543 (4492) | 273 (1983) | 151 (1097) | 98 (596) | 79 (468) | 1144 (8636) |
| German (de) | German Wiktionary | 2026 (15160) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 2026 (15160) |
| | German OmegaWiki | 1266 (14354) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 1266 (14354) |
| Hungarian (hu) | Comprehensive | X | X | X | X | X | 1355 (14654) |
| | Explanatory | X | X | X | X | X | 1038 (10934) |
| Irish (ga) | An Foclóir Beag | 891 (8053) | 11 (95) | 55 (267) | 10 (56) | 36 (171) | 1003 (8642) |
| | Irish Wiktionary | 1209 (6696) | 8 (45) | 61 (181) | 10 (41) | 36 (109) | 1324 (7072) |
| Italian (it) | ItalWordNet | 408 (3128) | 352 (2411) | 0 (0) | 0 (0) | 0 (0) | 760 (5539) |
| | SIMPLE | 290 (1990) | 218 (1240) | 0 (0) | 0 (0) | 0 (0) | 508 (3230) |
| Serbian (sr) | Serbian WordNet | 691 (5864) | 985 (6522) | 92 (713) | 0 (0) | 0 (0) | 1768 (13099) |
| | Dictionary of Serbo-Croatian Literary Language | 289 (2360) | 281 (1527) | 29 (215) | 0 (0) | 0 (0) | 599 (4102) |
| Slovenian (JSI) (sl) | Slovene WordNet | 409 (1106) | 303 (901) | 237 (733) | 44 (133) | 0 (0) | 993 (2873) |
| | Slovene Lexical Database | 284 (2237) | 191 (1047) | 220 (1486) | 29 (102) | 0 (0) | 724 (4872) |
| Slovenian (ISJFR) (sl) | Standard Slovenian Dictionary (eSSKJ) | 229 (2060) | 109 (911) | 76 (620) | 0 (0) | 60 (588) | 474 (4179) |
| | Kostelski slovar | 151 (1050) | 61 (308) | 45 (257) | 0 (0) | 38 (263) | 295 (1878) |
| Spanish (es) | Diccionario de la lengua española | 617 (7986) | 225 (2426) | 305 (3269) | 26 (161) | 24 (250) | 1197 (14092) |
| | Spanish Wiktionary | 602 (6421) | 227 (2045) | 294 (2825) | 25 (129) | 22 (123) | 1170 (11543) |
| Portuguese (pt-pt) | Dicionário da Língua Portuguesa Contemporânea | 285 (4060) | 58 (686) | 110 (1287) | 9 (143) | 1 (9) | 463 (6185) |
| | Dicionário Aberto | 199 (1521) | 53 (203) | 67 (372) | 3 (15) | 1 (5) | 323 (2116) |
| Russian (ru) | Ozhegov-Shvedova | 258 (2038) | 109 (615) | 101 (533) | 15 (77) | 44 (368) | 527 (3631) |
| | Dictionary of the Russian Language (MAS) | 310 (2811) | 173 (1338) | 190 (1219) | 20 (114) | 71 (1010) | 764 (6492) |
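
For readers who want to compute similar per-part-of-speech counts themselves, here is a minimal Python sketch. It assumes entries in the JSON layout shown in the Example section (`part-of-speech_tag`, `resource_1_senses`, `#text`). The function name `count_senses_and_words` is hypothetical, and counting words by splitting on whitespace is an assumption; the tokenization behind the published figures may differ.

```python
from collections import Counter


def count_senses_and_words(entries, resource_key):
    """Count senses and definition words per part of speech for one resource.

    `resource_key` is "resource_1_senses" or "resource_2_senses".
    """
    senses = Counter()
    words = Counter()
    for entry in entries:
        # Fall back to "other" when the part-of-speech tag is missing or empty.
        pos = entry.get("part-of-speech_tag") or "other"
        for sense in entry.get(resource_key, []):
            senses[pos] += 1
            # Assumption: a "word" is any whitespace-separated token.
            words[pos] += len(sense.get("#text", "").split())
    return senses, words


# A shortened entry for illustration only.
sample_entries = [
    {
        "lemma": "prehistoric",
        "part-of-speech_tag": "adjective",
        "resource_1_senses": [
            {"#text": "no longer fashionable"},
            {"#text": "of or relating to times before written history"},
        ],
        "resource_2_senses": [
            {"#text": "of or pertaining to a period before written history begins;"},
        ],
    }
]

senses, words = count_senses_and_words(sample_entries, "resource_1_senses")
print(senses["adjective"], words["adjective"])
```

Calling the function once per resource key gives the two rows reported for a language pair.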

Example

The following shows a sample of the dataset:

{
    "lemma": "prehistoric",
    "part-of-speech_tag": "adjective",
    "gender": "",
    "meta_ID": "",
    "resource_1_senses": [
        {
            "#text": "belonging to or existing in times before recorded history",
            "external_ID": "prehistoric.s.01"
        },
        {
            "#text": "of or relating to times before written history",
            "external_ID": "prehistoric.a.02"
        },
        {
            "#text": "no longer fashionable",
            "external_ID": "prehistoric.s.03"
        }
    ],
    "resource_2_senses": [
        {
            "#text": "of or pertaining to a period before written history begins;",
            "external_ID": ""
        }
    ],
    "alignment": [
        {
            "sense_source": "belonging to or existing in times before recorded history",
            "sense_target": "of or pertaining to a period before written history begins;",
            "semantic_relationship": "related"
        },
        {
            "sense_source": "of or relating to times before written history",
            "sense_target": "of or pertaining to a period before written history begins;",
            "semantic_relationship": "exact"
        }
    ]
}
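
In the sample above, each alignment refers to its senses by definition text rather than by identifier. The following Python sketch shows one way to map aligned pairs back to the `external_ID` of the source sense and to list the source senses left unaligned. The helper names `resolve_alignments` and `unaligned_source_senses` are hypothetical, and the sketch assumes that `sense_source` always matches a `#text` value in `resource_1_senses` exactly, as it does in this sample.

```python
import json

# The sample entry from above, abbreviated to the fields used here.
SAMPLE = """
{
  "lemma": "prehistoric",
  "resource_1_senses": [
    {"#text": "belonging to or existing in times before recorded history",
     "external_ID": "prehistoric.s.01"},
    {"#text": "of or relating to times before written history",
     "external_ID": "prehistoric.a.02"},
    {"#text": "no longer fashionable",
     "external_ID": "prehistoric.s.03"}
  ],
  "resource_2_senses": [
    {"#text": "of or pertaining to a period before written history begins;",
     "external_ID": ""}
  ],
  "alignment": [
    {"sense_source": "belonging to or existing in times before recorded history",
     "sense_target": "of or pertaining to a period before written history begins;",
     "semantic_relationship": "related"},
    {"sense_source": "of or relating to times before written history",
     "sense_target": "of or pertaining to a period before written history begins;",
     "semantic_relationship": "exact"}
  ]
}
"""


def resolve_alignments(entry):
    """Return (source external_ID, target definition, relationship) triples."""
    # Assumption: definition texts are unique within resource_1_senses.
    source_ids = {s["#text"]: s.get("external_ID", "")
                  for s in entry["resource_1_senses"]}
    return [
        (source_ids.get(link["sense_source"], ""),
         link["sense_target"],
         link["semantic_relationship"])
        for link in entry.get("alignment", [])
    ]


def unaligned_source_senses(entry):
    """Return definitions in resource 1 that take part in no alignment."""
    aligned = {link["sense_source"] for link in entry.get("alignment", [])}
    return [s["#text"] for s in entry["resource_1_senses"]
            if s["#text"] not in aligned]


entry = json.loads(SAMPLE)
links = resolve_alignments(entry)
for source_id, target, relation in links:
    print(f"{source_id} --{relation}--> {target}")
print("unaligned:", unaligned_source_senses(entry))
```

Looking senses up by text keeps the sketch close to the file layout; if definitions repeat within a resource, a lookup keyed on position or identifier would be safer.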

Get the datasets

The datasets are publicly and freely available at https://github.com/elexis-eu/MWSA.

Reference

If you’re using any part of these datasets, please don’t forget to cite our paper:

@inproceedings{ahmadi2020multilingual,
	title={A Multilingual Evaluation Dataset for Monolingual Word Sense Alignment},
	author="Ahmadi, Sina and McCrae, John P. and Nimb, Sanni and Khan, Fahad and Monachini, Monica and Pedersen, Bolette S. and Declerck, Thierry and Wissik, Tanja and Bellandi, Andrea and Pisani, Irene and Troelsgård, Thomas and Olsen, Sussi and Krek, Simon and Lipp, Veronika and Váradi, Tamás and Simon, László and Győrffy, András and Tiberius, Carole and Schoonheim, Tanneke and Ben Moshe, Yifat and Rudich, Maya and Abu Ahmad, Raya and Lonke, Dorielle and Kovalenko, Kira and Langemets, Margit and Kallas, Jelena and Dereza, Oksana and Fransen, Theodorus and Cillessen, David and Lindemann, David and Alonso, Mikel and Salgado, Ana and Sancho, José Luis and Ureña-Ruiz, Rafael-J. and Simov, Kiril and Osenova, Petya and Kancheva, Zara and Radev, Ivaylo and Stanković, Ranka and Perdih, Andrej and Gabrovšek, Dejan",
	booktitle="Proceedings of the 12th Language Resources and Evaluation Conference (LREC 2020)",
	year={2020},
	date="2020-05-11",
	address= "Marseille, France"
}