Stemming and Lemmatization in Natural Language Processing using spaCy
Stemming is the process of reducing inflected words to their word stem or base form.
A stemming algorithm reduces the word “saying” to the root word “say,” whereas “presumably” becomes “presum.” As you can see, the result is not always a valid word.
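As a rough illustration of the idea only, here is a toy suffix-stripping stemmer (not a real algorithm such as Porter's) that reproduces the two examples above:

```python
# Toy suffix-stripping stemmer -- an illustration of the "chop off the
# ending" idea only, not a real stemming algorithm such as Porter's.
SUFFIXES = ["ably", "ing", "ly", "ed", "s"]

def crude_stem(word: str) -> str:
    # Check longer suffixes first so "ably" wins over "ly".
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        # Require a reasonably long remainder to avoid over-stemming.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

print(crude_stem("saying"))      # say
print(crude_stem("presumably"))  # presum
```

Note how the output “presum” is not a dictionary word; this is exactly the crudeness that lemmatization avoids.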
Lemmatization is closely related to stemming, but lemmatization is the algorithmic process of determining the lemma of a word based on its intended meaning.
For example, in English, the verb “to walk” may appear as “walk,” “walked,” “walks,” or “walking.” The base form, “walk,” that one might look up in a dictionary, is called the lemma for the word.
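The dictionary-lookup idea behind a lemma can be sketched with a tiny hand-made table; real lemmatizers such as spaCy's combine much larger tables with morphological rules and the token's part of speech. The table below is invented purely for illustration:

```python
# A tiny hand-made lemma lookup table (illustrative only).
LEMMA_TABLE = {
    "walk": "walk",
    "walked": "walk",
    "walks": "walk",
    "walking": "walk",
}

def lookup_lemma(word: str) -> str:
    # Fall back to the lowercased word itself when it is not in the table.
    return LEMMA_TABLE.get(word.lower(), word.lower())

print([lookup_lemma(w) for w in ["walk", "walked", "walks", "walking"]])
# ['walk', 'walk', 'walk', 'walk']
```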
Difference between Stemming and Lemmatization
| Stemming | Lemmatization |
| --- | --- |
| Stemming does the job in a crude, heuristic way: it chops off the ends of words, assuming the remainder is the form we are looking for, and it often removes derivational affixes as well. | Lemmatization does the job more elegantly, using a vocabulary and morphological analysis of words. It aims to remove inflectional endings only and return the dictionary form of a word, known as the lemma. |
Although some libraries provide methods for both stemming and lemmatization, it is best practice to use lemmatization to obtain the root word correctly.
Let’s explore lemmatization with a few examples.
spaCy doesn’t have a built-in stemmer, as lemmatization is considered more accurate and useful.
```python
import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp(u'Go, Goes, Gone, Going')
for token in doc:
    print(token.text, "==>", token.lemma_)
```
```
Go ==> go
, ==> ,
Goes ==> go
, ==> ,
Gone ==> go
, ==> ,
Going ==> go
```
Now that we know what stemming and lemmatization do, the takeaway is simple: whenever we need the root form of a word, we should apply lemmatization.
For example, lemmatization is often used in building search engines. You may have wondered how Google returns the articles you meant to find even when your search text was not precisely worded. Lemmatization is one of the techniques that makes this possible.
Imagine you search with the text, “When will the next season of Thomas Jones be releasing?” Suppose the search engine does simple word-frequency matching between the query and its documents. In that case, the query probably won’t match an article with the caption “Thomas Jones next season release date.” If we lemmatize the original question before matching it against the documents, we are likely to get better results.
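The effect can be sketched with simple set overlap between the query and the caption. The tiny lemma table here is invented just for this example; a real system would use a full lemmatizer such as spaCy's:

```python
# Illustrative only: compare word overlap before and after lemmatization.
LEMMAS = {"releasing": "release", "release": "release"}

def tokens(text: str) -> set:
    # Strip basic punctuation and lowercase each word.
    return {w.strip("?.,").lower() for w in text.split()}

def lemmatize(words: set) -> set:
    return {LEMMAS.get(w, w) for w in words}

query = "When will the next season of Thomas Jones be releasing?"
caption = "Thomas Jones next season release date"

raw_overlap = tokens(query) & tokens(caption)
lemma_overlap = lemmatize(tokens(query)) & lemmatize(tokens(caption))

print(sorted(raw_overlap))    # ['jones', 'next', 'season', 'thomas']
print(sorted(lemma_overlap))  # 'release' now matches as well
```

After lemmatization, “releasing” in the query and “release” in the caption reduce to the same lemma, so the article scores one more matching term.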