stem package
The Stem module deals with various tasks, mainly through the following functions:
- check_spelling: spell error detection
- correct_spelling: spell error correction
- analyze: morphological analysis
- stem: stemming, e.g. "بڕاوە" → "بڕ"
- lemmatize: lemmatization, e.g. "بردمنەوە" → "بردن"
It is recommended that this module be applied to tokens produced by the tokenization module. Please note that in this version, only Sorani is supported by this module. The module is based on the Kurdish Hunspell project.
Regarding stemming, the following procedure is followed:
- For single-word tokens, such as "kirin" (to do), the stem of the token is returned.
- For compound forms and multi-word expressions, the stem of the noun, adjective, or adverb is taken into account. For instance, in light verb constructions such as "bar kirin" (to load), the stem of the nominal component "bar" is returned. In other cases, the stem of the semantically more important part of the MWE token is returned, as in "دەست تێ وەردان" (dest-tê-werdan), where the stem of "dest" is returned.
Examples:
>>> from klpt.stem import Stem
>>> stemmer = Stem("Sorani", "Arabic")
>>> stemmer.check_spelling("سوتاندبووت")
False
>>> stemmer.correct_spelling("سوتاندبووت")
(False, ['ستاندبووت', 'سووتاندبووت', 'سووڕاندبووت', 'ڕووتاندبووت', 'فەوتاندبووت', 'بووژاندبووت'])
>>> stemmer.analyze("دیتبامن")
[{'pos': ['verb'], 'description': 'past_stem_transitive_active', 'stem': 'دی', 'lemma': ['دیتن'], 'base': 'دیت', 'prefixes': '', 'suffixes': 'بامن'}]
>>> stemmer.stem("دەچینەوە")
['چ']
>>> stemmer.stem("گورەکە", mark_unknown=True)
['_گور_']
>>> stemmer.lemmatize("گوڵەکانم")
['گوڵ', 'گوڵە']
>>> stemmer = Stem("Kurmanji", "Latin")
>>> stemmer.analyze("dibêjim")
[{'base': 'gotin', 'description': 'vblex_tv_pri_p1_sg', 'pos': '', 'terminal_suffix': '', 'formation': ''}]
analyze(self, word_form)
Morphological analysis of a given word.
It returns morphological analyses. Each analysis is returned as a dictionary with the following keys:
- "pos": the part-of-speech of the word-form according to the Universal Dependencies tag set.
- "description": the morphological description, based on the is flag
- "prefixes": anything appearing before the base
- "suffixes": anything appearing after the base
- "stem": the stem of the word (the st flag)
- "lemma": the lemma of the word (the lem flag)
- "formation": if the ds flag is set, its value is assigned to description and formation is set to derivational. Although the majority of our morphological rules cover inflectional forms, it is not accurate to say that all of them are inflectional. Therefore, we only set this value to derivational where we are sure.
- "base": the value of the ts flag. The definition of a terminal suffix is a bit tricky in Hunspell. According to the Hunspell documentation, "Terminal suffix fields are inflectional suffix fields 'removed' by additional (not terminal) suffixes". In other words, the ts flag in Hunspell represents whatever is left after stripping all affixes. Therefore, it is the morphological base.
As for the word "دیتبامن" (that I have seen them), the morphological analysis would look like this: [{'pos': ['verb'], 'description': 'past_stem_transitive_active', 'stem': 'دی', 'lemma': ['دیتن'], 'base': 'دیت', 'prefixes': '', 'suffixes': 'بامن'}] If the input cannot be analyzed morphologically, an empty list is returned.
Sorani: More details regarding Sorani Kurdish morphological analysis can be found at https://github.com/sinaahmadi/KurdishHunspell.
Kurmanji: Regarding Kurmanji, we use the morphological analyzer provided by the Kurmanji part of the Apertium project. Please note that there are subtle differences in how the analyzers work in Hunspell and Apertium. For instance, the base in the Kurmanji analysis refers to the lemma, while in Sorani (from Hunspell) it refers to the morphological base.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
word_form | str | a single word-form | required |
Exceptions:
Type | Description |
---|---|
TypeError | only string as input |
Returns:
Type | Description |
---|---|
list(dict) | a list of all possible morphological analyses according to the defined morphological rules |
Source code in klpt/stem.py
```python
def analyze(self, word_form):
    """
    Morphological analysis of a given word.
    It returns morphological analyses. Each analysis is returned as a dictionary with the following keys:
    - "pos": the part-of-speech of the word-form according to [the Universal Dependencies tag set](https://universaldependencies.org/u/pos/index.html).
    - "description": the morphological description, based on the `is` flag
    - "prefixes": anything appearing before the base
    - "suffixes": anything appearing after the base
    - "stem": the stem of the word (the `st` flag)
    - "lemma": the lemma of the word (the `lem` flag)
    - "formation": if the `ds` flag is set, its value is assigned to description and formation is set to derivational. Although the majority of our morphological rules cover inflectional forms, it is not accurate to say that all of them are inflectional. Therefore, we only set this value to derivational where we are sure.
    - "base": the value of the `ts` flag. The definition of a terminal suffix is a bit tricky in Hunspell. According to [the Hunspell documentation](http://manpages.ubuntu.com/manpages/trusty/en/man4/hunspell.4.html), "Terminal suffix fields are inflectional suffix fields 'removed' by additional (not terminal) suffixes". In other words, the `ts` flag in Hunspell represents whatever is left after stripping all affixes. Therefore, it is the morphological base.
    As for the word "دیتبامن" (that I have seen them), the morphological analysis would look like this: [{'pos': ['verb'], 'description': 'past_stem_transitive_active', 'stem': 'دی', 'lemma': ['دیتن'], 'base': 'دیت', 'prefixes': '', 'suffixes': 'بامن'}]
    If the input cannot be analyzed morphologically, an empty list is returned.
    Sorani:
        More details regarding Sorani Kurdish morphological analysis can be found at [https://github.com/sinaahmadi/KurdishHunspell](https://github.com/sinaahmadi/KurdishHunspell).
    Kurmanji:
        Regarding Kurmanji, we use the morphological analyzer provided by the [Kurmanji part](https://github.com/apertium/apertium-kmr) of the Apertium project.
        Please note that there are subtle differences in how the analyzers work in Hunspell and Apertium. For instance, the `base` in the Kurmanji analysis refers to the lemma, while in Sorani (from Hunspell) it refers to the morphological base.
    Args:
        word_form (str): a single word-form
    Raises:
        TypeError: only string as input
    Returns:
        (list(dict)): a list of all possible morphological analyses according to the defined morphological rules
    """
    if not isinstance(word_form, str):
        raise TypeError("Only a word (str) is allowed.")
    else:
        word_analysis = list()
        if self.dialect == "Sorani" and self.script == "Arabic":
            # given the morphological analysis of a word-form with Hunspell flags,
            # extract relevant information and return a dictionary
            for analysis in list(self.huns.analyze(word_form)):
                analysis_dict = dict()
                for item in analysis.split():
                    if ":" not in item:
                        continue
                    if item.split(":")[1] == "ts":
                        # the ts flag exceptionally appears after the value, as value:key, in the Hunspell output;
                        # anything except the terminal suffix (ts) is considered to be the base
                        analysis_dict["base"] = item.split(":")[0]
                        affixes = utility.extract_prefix_suffix(word_form, item.split(":")[0])
                        analysis_dict["prefixes"] = affixes[0]
                        analysis_dict["suffixes"] = affixes[2]
                    elif item.split(":")[0] in self.hunspell_flags.keys():
                        # assign the key:value pairs from the Hunspell string output to the dictionary output of the current function
                        if item.split(":")[0] == "ds":
                            # for the ds flag, add derivational as the formation type, otherwise inflectional
                            analysis_dict[self.hunspell_flags[item.split(":")[0]]] = "derivational"
                            analysis_dict[self.hunspell_flags["is"]] = item.split(":")[1]
                        elif item.split(":")[0] == "st":
                            # for the st flag, the stem should be cleaned first
                            analysis_dict[self.hunspell_flags[item.split(":")[0]]] = self.clean_stem(item.split(":")[1])
                        else:
                            # remove I, T or V using clean_stem()
                            analysis_dict[self.hunspell_flags[item.split(":")[0]]] = self.clean_stem(item.split(":")[1])
                # convert lemma and pos to a list and split on "_" when there is more than one output, e.g. more than one lemma for a given word
                if "lemma" in analysis_dict:
                    analysis_dict["lemma"] = analysis_dict["lemma"].split("_")
                else:
                    analysis_dict["lemma"] = [""]
                if "pos" in analysis_dict:
                    analysis_dict["pos"] = analysis_dict["pos"].split("_")
                else:
                    analysis_dict["pos"] = [""]
                # for nouns, the base is the lemma
                if len(analysis_dict["pos"]) and analysis_dict["pos"] != ["verb"]:
                    analysis_dict["lemma"] = [analysis_dict["base"]]
                word_analysis.append(analysis_dict)
        elif self.dialect == "Kurmanji" and self.script == "Latin":
            att_analysis = Analysis("Kurmanji", "Latin").analyze(word_form)
            # check whether the word-form could be analyzed
            if not len(att_analysis):
                # the word-form could not be analyzed
                return []
            for analysis in att_analysis:
                analysis_dict = dict()
                structure = analysis[0].split("<", 1)
                analysis_dict["base"], analysis_dict["description"] = structure[0], structure[1].replace("><", "_").replace(">", "").strip()
                analysis_dict["pos"] = ""
                analysis_dict["terminal_suffix"] = ""
                analysis_dict["formation"] = ""
                # TODO: the description needs further information extraction so that some values can be assigned to the "pos" key
                # analysis_dict["terminal_suffix"] = word_form.replace(analysis_dict["base"], "")
                word_analysis.append(analysis_dict)
        return word_analysis
```
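For the Sorani branch, the flag-parsing step can be exercised in isolation. The sketch below re-implements only that string handling; the flag-to-key mapping and the sample analysis string are illustrative assumptions, not actual KurdishHunspell output:

```python
# Hypothetical flag-to-key mapping; KLPT's real self.hunspell_flags may differ.
HUNSPELL_FLAGS = {"po": "pos", "is": "description", "st": "stem", "lem": "lemma"}

def parse_hunspell_analysis(analysis):
    """Turn one space-separated Hunspell analysis string into a dictionary."""
    result = {}
    for item in analysis.split():
        if ":" not in item:
            continue
        key, _, value = item.partition(":")
        if value == "ts":
            # ts appears as value:ts, so the part before the colon is the base
            result["base"] = key
        elif key in HUNSPELL_FLAGS:
            result[HUNSPELL_FLAGS[key]] = value
    return result

analysis = parse_hunspell_analysis("po:verb st:دی is:past_stem_transitive_active دیت:ts")
print(analysis["base"])  # دیت
```

Note how the `ts` field inverts the usual key:value order, which is why the real method special-cases it before consulting the flag table.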
check_spelling(self, word)
Check spelling of a word
Parameters:
Name | Type | Description | Default |
---|---|---|---|
word | str | input word to be spell-checked | required |
Exceptions:
Type | Description |
---|---|
TypeError | only string as input |
Returns:
Type | Description |
---|---|
bool | True if the spelling is correct, False if the spelling is incorrect |
Source code in klpt/stem.py
```python
def check_spelling(self, word):
    """Check spelling of a word
    Args:
        word (str): input word to be spell-checked
    Raises:
        TypeError: only string as input
    Returns:
        bool: True if the spelling is correct, False if the spelling is incorrect
    """
    if not isinstance(word, str) or not (self.dialect == "Sorani" and self.script == "Arabic"):
        raise TypeError("Not supported yet.")
    else:
        return self.huns.spell(word)
```
clean_stem(self, word)
Remove extra characters in the stem. The following issue was observed when stemming with Hunspell (version 2.0.2): the retrieved stem of a verb is accompanied by the flag of the word, which is an unwanted extra character. Possible flags are T, V and I. :lf should also be taken into account.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
word | str | stem | required |
Source code in klpt/stem.py
```python
def clean_stem(self, word):
    """Remove extra characters in the stem
    The following issue was observed when stemming with Hunspell (version 2.0.2):
    the retrieved stem of a verb is accompanied by the flag of the word,
    which is an unwanted extra character.
    Possible flags are T, V and I. :lf should also be taken into account.
    Args:
        word (str): stem
    """
    for char in ["V", "I", "T"]:
        word = word.replace(char, "")
    return word.replace(":lf", "")
```
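Since the method is pure string replacement, its behavior is easy to check standalone. The sketch below mirrors the method body; the sample inputs are illustrative:

```python
def clean_stem(word):
    # strip the stray Hunspell flag characters V, I, T and the :lf marker
    for char in ["V", "I", "T"]:
        word = word.replace(char, "")
    return word.replace(":lf", "")

print(clean_stem("بڕV"))     # بڕ
print(clean_stem("bar:lf"))  # bar
```

Note that the replacement applies anywhere in the string, not only at the end, which is safe for Arabic-script stems since they never contain Latin V, I or T.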
correct_spelling(self, word)
Correct spelling errors if the input word is incorrect. It returns a tuple where the first element indicates the correctness of the word (True if correct, False if incorrect). A correctly spelled word returns (True, []). If the input word is incorrect, suggestions are provided in a list as the second element of the tuple, as (False, [...]); if no suggestion is available, the list is empty, as (False, []).
Parameters:
Name | Type | Description | Default |
---|---|---|---|
word | str | input word to be spell-checked | required |
Exceptions:
Type | Description |
---|---|
TypeError | only string as input |
Returns:
Type | Description |
---|---|
tuple (bool, list) | the correctness flag and the list of suggestions |
Source code in klpt/stem.py
```python
def correct_spelling(self, word):
    """
    Correct spelling errors if the input word is incorrect. It returns a tuple where the first element indicates the correctness of the word (True if correct, False if incorrect).
    A correctly spelled word returns (True, []). If the input word is incorrect, suggestions are provided in a list as the second element of the tuple, as (False, [...]).
    If no suggestion is available, the list is empty, as (False, []).
    Args:
        word (str): input word to be spell-checked
    Raises:
        TypeError: only string as input
    Returns:
        tuple (bool, list)
    """
    if not isinstance(word, str) or not (self.dialect == "Sorani" and self.script == "Arabic"):
        raise TypeError("Not supported yet.")
    else:
        if self.check_spelling(word):
            return (True, [])
        return (False, list(self.huns.suggest(word)))
```
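A hedged sketch of consuming the (bool, list) return contract described above; `pick_correction` is a hypothetical helper written for illustration, not part of KLPT:

```python
def pick_correction(result, original):
    """Given correct_spelling's (is_correct, suggestions) tuple, choose a word."""
    is_correct, suggestions = result
    if is_correct:
        return original  # nothing to fix
    # take the top suggestion; fall back to the input when the list is empty
    return suggestions[0] if suggestions else original

print(pick_correction((True, []), "bar"))                       # bar
print(pick_correction((False, ["سووتاندبووت"]), "سوتاندبووت"))  # سووتاندبووت
```

Callers should always handle the (False, []) case, since Hunspell may flag a word without being able to suggest anything.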
lemmatize(self, word)
A function for lemmatization of words
Parameters:
Name | Type | Description | Default |
---|---|---|---|
word | str | given a word, return its lemma form, i.e. dictionary entry form | required |
Exceptions:
Type | Description |
---|---|
TypeError | only string as input |
Returns:
Type | Description |
---|---|
list | list of lemma(s) |
Source code in klpt/stem.py
```python
def lemmatize(self, word):
    """A function for lemmatization of words
    Args:
        word (str): given a word, return its lemma form, i.e. dictionary entry form
    Raises:
        TypeError: only string as input
    Returns:
        list: list of lemma(s)
    """
    if not isinstance(word, str) or not (self.dialect == "Sorani" and self.script == "Arabic"):
        raise TypeError("Not supported yet.")
    else:
        word_analysis = self.analyze(word)
        return list(set([item for sublist in word_analysis for item in sublist["lemma"] if item != '']))
```
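The flatten-and-deduplicate step at the end of lemmatize can be isolated as below; the sample analyses imitate the shape of analyze()'s output and are illustrative only:

```python
# hypothetical analyses for one word-form, each carrying a "lemma" list
word_analysis = [
    {"lemma": ["گوڵ", ""]},  # an empty string stands for a missing lemma
    {"lemma": ["گوڵە"]},
    {"lemma": ["گوڵ"]},      # duplicate across analyses
]
# flatten all lemma lists, drop empty strings, deduplicate via a set
lemmas = list(set([item for sublist in word_analysis for item in sublist["lemma"] if item != '']))
print(sorted(lemmas))  # ['گوڵ', 'گوڵە']
```

This explains why lemmatize may return more than one lemma for an ambiguous word-form, as in the "گوڵەکانم" example above.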
stem(self, word, mark_unknown=False)
A function for stemming a single word
Parameters:
Name | Type | Description | Default |
---|---|---|---|
word | str | input word to be stemmed | required |
mark_unknown | bool | if the given word is unknown in the tagged lexicon, KLPT stems it following rules. Such stems can be marked with "_" if this variable is set to True | False |
Exceptions:
Type | Description |
---|---|
TypeError | only string as input |
Returns:
Type | Description |
---|---|
list | list of stem(s) |
Source code in klpt/stem.py
```python
def stem(self, word, mark_unknown=False):
    """A function for stemming a single word
    Args:
        word (str): input word to be stemmed
        mark_unknown (bool): if the given word is unknown in the tagged lexicon, KLPT stems it following rules. Such stems can be marked with "_" if this variable is set to True
    Raises:
        TypeError: only string as input
    Returns:
        list: list of stem(s)
    """
    if not isinstance(word, str) or not (self.dialect == "Sorani" and self.script == "Arabic"):
        raise TypeError("Not supported yet.")
    else:
        stems = list(set([self.clean_stem(i) for i in self.huns.stem(word)]))
        if len(stems):
            return stems
        else:
            # not detected by Hunspell or the word doesn't exist in the tagged lexicon
            for verb in self.light_verbs:
                if word.endswith(verb) and len(word.rpartition(verb)[0]):
                    stems = list(set([self.clean_stem(i) for i in self.huns.stem(word.rpartition(verb)[0].strip())]))
                    if len(stems):
                        # the word is a compound form with a light verb; the other part can be stemmed by Hunspell
                        return stems
                    else:
                        # the word is a compound form with a light verb but the other part cannot be stemmed by Hunspell
                        word = word.rpartition(verb)[0].strip()
            # the other part of the word, or the whole word, cannot be stemmed by Hunspell,
            # so find the stem following morphological rules by checking whether removing possible prefixes and suffixes helps to find the stem.
            # Note: even though the same morphemes used in the tokenization system are used in the rules here, there is a subtle difference.
            # In the tokenization system, the trimming is done in such a way that shorter morphemes are checked first for suffixes
            # (suffixes in the JSON file are sorted by length) and longer prefixes are trimmed first.
            # For the stemmer, however, we proceed differently by first checking the longer morphemes, then the shorter ones (for both prefixes and suffixes).
            # This is due to the different purposes of the two tasks. Therefore, the list of morphemes is to be reversed for suffixes (not prefixes).
            for preposition in self.morphemes["prefixes"]:
                if word.startswith(preposition) and len(word.split(preposition, 1)) > 1:
                    if len(list(set([self.clean_stem(i) for i in self.huns.stem(word.split(preposition, 1)[1])]))):
                        stems = list(set([self.clean_stem(i) for i in self.huns.stem(word.split(preposition, 1)[1])]))
                        if mark_unknown:
                            return ["_" + i + "_" for i in stems]
                    else:
                        word = word.split(preposition, 1)[1]
                    break
            for postposition in reversed(list(self.morphemes["suffixes"])):
                if word.endswith(postposition) and len(word.rpartition(postposition)[0]):
                    if len(list(set([self.clean_stem(i) for i in self.huns.stem(word.rpartition(postposition)[0])]))):
                        stems = list(set([self.clean_stem(i) for i in self.huns.stem(word.rpartition(postposition)[0])]))
                        if mark_unknown:
                            return ["_" + i + "_" for i in stems]
                    else:
                        word = word.rpartition(postposition)[0]
                    break
            # not possible to stem the word using the tagged lexicon or the rule-based approach; return the word as it is
            if mark_unknown:
                return ["_" + word + "_"]
            else:
                return [word]
```
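The longest-first affix stripping described in the comments above can be sketched as follows; the Latin-script morpheme list is a toy example, not KLPT's actual JSON data:

```python
def strip_longest_suffix(word, suffixes):
    """Try suffixes from longest to shortest and strip the first match."""
    for suffix in sorted(suffixes, key=len, reverse=True):
        # never strip the whole word: the remainder must be non-empty
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)]
    return word

# checking "ekan" before "an" avoids leaving a spurious "ek" behind
print(strip_longest_suffix("gulekan", ["an", "ek", "ekan"]))  # gul
```

This illustrates why the stemmer reverses the length-sorted suffix list: trimming the shortest match first, as the tokenizer does, would yield a wrong residue for the stemmer's purposes.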