10 basic but essential SPARQL queries for lexicographical data on Wikidata

The Semantic Web, an extension of the World Wide Web (WWW), provides an effective means of data representation and enables users and computers to retrieve and share information efficiently. The Resource Description Framework (RDF) is the foundational data model of the Semantic Web. Unlike traditional databases, such as relational ones, where data has to adhere to a fixed schema, RDF data is not prescribed by a schema and can be interpreted without additional information, which makes the RDF data model self-describing. To learn more about RDF, you can read one of my previous blog posts on data modelling with RDF.

More recently, the Web of Linked Data, which makes RDF data available over the HyperText Transfer Protocol (HTTP), and Linguistic Linked Open Data have gained traction alongside the Semantic Web, particularly in the natural language processing (NLP) community, as a standard for creating linguistic resources. According to the W3C,

Linked Data lies at the heart of what the Semantic Web is all about: large-scale integration of, and reasoning on, data on the Web. Almost all applications listed in, say, the collection of Semantic Web Case Studies and Use Cases are essentially based on the accessibility of, and integration of, Linked Data at various levels of complexity.

Moreover, the Semantic Web and Linked Data offer a unique potential to electronic lexicography: by lifting printed or otherwise unstructured linguistic data into machine-readable semantic formats, they enable interoperability across lexical resources.

The Semantic Web and Linked Data facilitate retrieving information from huge resources such as printed dictionaries (Photo taken at DSL in Copenhagen)

Queries

We present 10 basic but essential queries in SPARQL, an RDF query language, for retrieving lexicographical information. To this end, we use the SPARQL endpoint of Wikidata, which also ships with a few lexeme queries as examples.

It is important to get familiar with Ontolex-Lemon and the Ontolex-Lemon lexicography module (lexicog), as lexicographical data on Wikidata is modelled based on these ontologies.
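
Note that all prefixes used in the queries below, such as wd:, wdt:, p:, ps:, wikibase:, ontolex:, dct:, skos: and bd:, are predefined on the Wikidata endpoint and can therefore be omitted there. If you run the queries against another SPARQL engine, or simply prefer to be explicit, they correspond to the following declarations:

PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX p: <http://www.wikidata.org/prop/>
PREFIX ps: <http://www.wikidata.org/prop/statement/>
PREFIX wikibase: <http://wikiba.se/ontology#>
PREFIX bd: <http://www.bigdata.com/rdf#>
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX ontolex: <http://www.w3.org/ns/lemon/ontolex#>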

Moreover, a list of other useful queries is provided at:

Unfortunately, not all languages are equally represented on Wikidata. In this tutorial, we focus on some of the richly represented ones, e.g. English and French. So, if you modify the queries to work on another language and get few or no results, make sure that the language is sufficiently represented on Wikidata, in addition to double-checking the syntax of your queries.
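
A quick way to check this is to count how many lexemes each language currently has; the following sketch is a variant of query 6 below, which counts senses rather than lexemes:

# Rough coverage check: number of lexemes per language
SELECT ?languageLabel (COUNT(?l) AS ?lexemes) WHERE {
  ?l a ontolex:LexicalEntry ;
       dct:language ?language .
  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "en" .
  }
}
GROUP BY ?languageLabel
ORDER BY DESC(?lexemes)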

1- Retrieve lexemes describing book (L536) in different languages

SELECT ?lemma ?languageLabel WHERE {
  ?l a ontolex:LexicalEntry ; 
       ontolex:sense ?sense ; 
       dct:language ?language ; 
       wikibase:lemma ?lemma.
  ?sense wdt:P5137 wd:Q571 . # Q571 is the Wikidata item for the concept "book"
  # The label service returns labels in English;
  # appending "Label" to a variable name (e.g. ?languageLabel) yields the label of the entity it is bound to.
  SERVICE wikibase:label { 
    bd:serviceParam wikibase:language "en". 
  }
}

Run this query

2- List the antonyms of “sadness” in English

SELECT ?l ?lemma ?antonymLabel WHERE {
  VALUES ?lemma {'sadness'@en} 
  ?l wikibase:lemma ?lemma;
          wdt:P461 ?antonym . # P461 is the property for opposite of or antonymy
  SERVICE wikibase:label {
   bd:serviceParam wikibase:language "en" .
  }  
}

Run this query

3- Retrieve the lemmas in a given language with part-of-speech, glosses and usage examples

SELECT ?lemma ?categoryLabel ?gloss ?example WHERE {
   ?l a ontolex:LexicalEntry ;
        dct:language ?language ;
        wikibase:lemma ?lemma ;
        wikibase:lexicalCategory ?category ;
        ontolex:sense ?sense .
  # The language is identified by its ISO 639-1 code (P218)
  ?language wdt:P218 'en' .
  # Find the gloss (definition) of the senses
  ?sense skos:definition ?gloss .
  # Usage example
  ?l p:P5831 ?statement .
  ?statement ps:P5831 ?example .
  # Get only those lexemes for which senses are available
  FILTER EXISTS {?l ontolex:sense ?sense }
  # Set the language of the glosses to English
  FILTER(LANG(?gloss) = "en")
  SERVICE wikibase:label {
   bd:serviceParam wikibase:language "en" .
  }  
}
LIMIT 100

Run this query

4- Find verb forms in French ending with “é”

SELECT DISTINCT * WHERE {
     ?l a ontolex:LexicalEntry ; 
       dct:language wd:Q150 ; 
       wikibase:lexicalCategory wd:Q24905 ; 
       wikibase:lemma ?lemma ; 
       ontolex:lexicalForm ?form .
    ?form ontolex:representation ?word .
    FILTER (regex(?word, 'é$'))
}

Run this query

5- Create a trilingual dictionary of headwords

# Create a French-German-Basque lexicon
SELECT DISTINCT ?sense ?frLemma ?deLemma ?euLemma WHERE {
  # Align lemmata across languages via the item of their senses (P5137), bound to ?sense
    ?fr dct:language wd:Q150;
        wikibase:lemma ?frLemma;
        ontolex:sense [ wdt:P5137 ?sense ].
    ?de dct:language wd:Q188;
        wikibase:lemma ?deLemma;
        ontolex:sense [ wdt:P5137 ?sense ].
    ?eu dct:language wd:Q8752;
        wikibase:lemma ?euLemma;
        ontolex:sense [ wdt:P5137 ?sense ].
  }
ORDER BY ASC(UCASE(STR(?frLemma)))
LIMIT 100 

Run this query

6- Count the number of senses per language

SELECT ?languageLabel (COUNT(?sense) AS ?count ) WHERE {
  ?l a ontolex:LexicalEntry ;
       dct:language ?language ;
       ontolex:sense ?sense .
  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "en" .
  }  
}
GROUP BY ?languageLabel
ORDER BY DESC(?count)

Run this query

7- Lemmatize a word in English

SELECT ?word ?lemma WHERE {
  VALUES ?word {'brought'@en} 
  ?l a ontolex:LexicalEntry ; 
       dct:language wd:Q1860 ; 
       wikibase:lemma ?lemma ; 
       ontolex:lexicalForm ?form .
  ?form ontolex:representation ?word .
} 

Run this query
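
Going the other way round, you can also list all inflected forms of a lexeme together with their grammatical features. The sketch below assumes the English lemma 'bring'@en and uses wikibase:grammaticalFeature, which links a form to its grammatical feature items in Wikidata's RDF model:

# All forms of the English lexeme "bring" with their grammatical features
SELECT ?word ?featureLabel WHERE {
  ?l a ontolex:LexicalEntry ;
       dct:language wd:Q1860 ;
       wikibase:lemma 'bring'@en ;
       ontolex:lexicalForm ?form .
  ?form ontolex:representation ?word .
  OPTIONAL { ?form wikibase:grammaticalFeature ?feature . }
  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "en" .
  }
}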

8- Retrieve masculine nouns in French that end with specific characters

SELECT * WHERE {
  ?l a ontolex:LexicalEntry ; 
       # entries in French (Q150)
       dct:language wd:Q150 ;
       wikibase:lexicalCategory wd:Q1084 ; # nouns only
       wdt:P5185 wd:Q499327 ; # masculine grammatical gender
       wikibase:lemma ?lemma.
  FILTER (regex(?lemma, '(tion|ie|ique|aison|sion)$'))
  SERVICE wikibase:label { 
    bd:serviceParam wikibase:language "en". 
  }
} 

Run this query

9- Create a picture dictionary of animals in English

#defaultView:ImageGrid
SELECT DISTINCT * WHERE {
  ?l dct:language wd:Q1860;
     wikibase:lemma ?lemma;
     ontolex:sense ?sense.
  # senses belonging to a concept that is a kind of animal (Q729)
  ?sense wdt:P5137 ?concept .
  ?concept wdt:P18 ?image ;
           wdt:P279+ wd:Q729 .
  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "en".
  }
}

Run this query

10- Check if a word exists in a given language (i.e. spelling error detection)

ASK WHERE {
  VALUES ?word {'amazing'@en} 
  ?l a ontolex:LexicalEntry ; 
       dct:language wd:Q1860 ; 
       wikibase:lemma ?lemma ; 
       ontolex:lexicalForm ?form .
  ?form ontolex:representation ?word .
} 

Run this query

In addition to using the Wikidata endpoint directly, you can integrate SPARQL queries into your own code. For instance, you can use the following in Python:

from SPARQLWrapper import SPARQLWrapper, XML

# The actual SPARQL endpoint of the Wikidata Query Service lives at /sparql
sparql = SPARQLWrapper("https://query.wikidata.org/sparql")
sparql.setQuery("""
    ASK WHERE {
      VALUES ?word {'amazing'@en} 
      ?l a ontolex:LexicalEntry ; 
           dct:language wd:Q1860 ; 
           wikibase:lemma ?lemma ; 
           ontolex:lexicalForm ?form .
      ?form ontolex:representation ?word .
    } 
""")

# An ASK query returns a boolean; with XML as the return format,
# convert() yields an xml.dom.minidom document
sparql.setReturnFormat(XML)
results = sparql.query().convert()
print(results.toxml())

Last updated on 9 March 2021.