Finding and Removing Stop Words from Text using spaCy
Stop words are high-frequency words like a, an, the, to, and also that we sometimes want to filter out of a document before further processing.
Stop words carry little lexical content and usually do not add much meaning on their own.
Below is a list of 25 semantically non-selective stop words that are common in Reuters-RCV1:
a | an | and | are | as
at | be | by | for | from
has | he | in | is | it
its | of | on | that | the
to | was | were | will | with
Let’s get into some code and try to understand how things work.
To see all the words defined as stop words in spaCy we can run the following lines of code:
Example 1:
from spacy.lang.en.stop_words import STOP_WORDS

print(STOP_WORDS)
Output:
{'nothing', 'seemed', 'herself', 'few', 'how', 'beyond', 'keep', 'thereby', 'with', 'fifty', 'can', 'two', 'where', 'he', 'third', 'then', '‘ve', 'anyhow', 'becomes', 'mostly', 'might', 'were', 'its', 'which', 'show', 'everywhere', 'have', 'doing', 'ca', 'whole', 'anywhere', 'the', '‘s', 'sometime', 'so', 'seems', 'been', 'front', 'may', 'beforehand', 'than', 'various', 'ourselves', 'except', 'none', 'bottom', 'will', 'has', 'on', 'never', 'yourselves', 'within', 'that', 'or', 'between', 'move', 'each', 'had', 'name', 'there', 'and', 'further', 'too', '‘d', 'they', 'because', "'ll", 'it', 'did', 'throughout', 'made', 'therefore', 'everyone', 'twenty', 'himself', 'used', 'indeed', 'together', 'whither', 'regarding', 'latterly', 'behind', "'re", 'hereby', 'say', 'every', 'almost', 'nine', 'wherein', 'a', 'cannot', 'whereas', 'hers', 'without', '‘ll', 'serious', 'yourself', 'about', 'n‘t', 'she', 'in', 'perhaps', 'besides', 'become', 'those', 'others', 'seeming', 'put', 'us', 'above', 'them', 'nor', 'get', 'not', 'meanwhile', 'four', 'our', 'am', 'hence', 'namely', 'make', 'her', '‘m', 'here', 'several', 'side', 'just', 'thereafter', 'what', 'anyway', 'over', 'myself', 'even', 'however', 'upon', 'neither', 'thus', 'someone', 'forty', 'we', 'one', 'part', 'take', 'noone', 'though', "'s", 'thence', 'still', 'own', '‘re', 'somehow', '’ve', 'very', 'as', "n't", 'although', 'anything', 'again', 'three', 'into', 'an', 'during', 'this', 'other', 'such', 'same', 'whoever', '’d', 'per', 'along', 'elsewhere', 're', '’m', 'why', 'down', 'five', 'whose', 'last', 'call', 'once', 'eleven', '’s', 'nevertheless', 'towards', 'some', 'top', 'is', 'but', 'mine', "'ve", 'do', 'whereby', 'him', 'all', 'to', 'could', 'among', 'sometimes', 'full', 'much', 'ever', 'most', 'thereupon', 'at', 'hereupon', 'more', 'rather', 'onto', 'through', 'also', 'up', 'whom', 'i', 'for', 'herein', "'m", 'ours', 'you', 'go', 'least', 'already', 'against', 'afterwards', 'formerly', 'amount', 'yet', 'please', 'first', 'six', 'former', 'whenever', 'often', 'until', 'his', 'off', 'now', 'only', 'be', 'themselves', 'should', 'yours', 'next', 'fifteen', 'if', 'everything', 'whereupon', 'by', 'whereafter', 'give', 'n’t', 'under', 'always', 'around', 'your', 'using', 'another', 'done', 'whence', 'many', 'empty', 'amongst', 'ten', 'me', 'alone', "'d", 'out', 'whatever', 'from', 'otherwise', 'below', 'of', 'well', 'my', 'while', 'twelve', 'these', 'nobody', 'latter', 'after', 'being', 'sixty', 'something', 'therein', 'via', 'became', 'must', 'when', 'both', 'was', 'whether', '’re', 'else', 'enough', 'either', 'moreover', 'quite', 'really', 'hereafter', 'since', 'unless', 'hundred', 'seem', 'thru', 'toward', 'would', 'somewhere', 'does', 'less', 'their', 'anyone', 'are', 'who', 'no', 'any', 'nowhere', 'back', 'becoming', '’ll', 'wherever', 'itself', 'across', 'due', 'before', 'eight', 'beside', 'see'}
spaCy's English stop word list contains just over 300 entries (the exact number varies between versions). We can always add our own stop words or override the existing list if needed.
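If the default list does not fit our needs, we can modify it. Below is a minimal sketch (the word btw is just an example token) that adds a custom stop word and removes an existing one by updating nlp.Defaults.stop_words and flipping the is_stop flag on the corresponding vocabulary entries:

import spacy

nlp = spacy.load('en_core_web_sm')

# Add a custom stop word ('btw' is just an example token)
nlp.Defaults.stop_words.add('btw')
nlp.vocab['btw'].is_stop = True

# Remove an existing stop word from the default list
nlp.Defaults.stop_words.discard('beyond')
nlp.vocab['beyond'].is_stop = False

print(nlp.vocab['btw'].is_stop)     # True
print(nlp.vocab['beyond'].is_stop)  # False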
To check whether a word is a stop word, we can look it up in the nlp object's vocabulary and read the is_stop attribute of the resulting lexeme.
Example 2:
import spacy

nlp = spacy.load('en_core_web_sm')
print(nlp.vocab[u'is'].is_stop)
Output:
True
Example 3:
import spacy

nlp = spacy.load('en_core_web_sm')
print(nlp.vocab[u'are'].is_stop)
print(nlp.vocab[u'am'].is_stop)
print(nlp.vocab[u'was'].is_stop)
print(nlp.vocab[u'were'].is_stop)
Output:
True
True
True
True
Example 4:
import spacy

nlp = spacy.load('en_core_web_sm')
print(nlp.vocab[u'going'].is_stop)
print(nlp.vocab[u'playing'].is_stop)
print(nlp.vocab[u'singing'].is_stop)
print(nlp.vocab[u'dancing'].is_stop)
Output:
False
False
False
False
Stop word removal is an important part of text cleanup: it strips low-information tokens from the data before we do the actual processing to make sense of the text.
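To actually remove stop words from a piece of text, we can run the text through the nlp object and keep only the tokens whose is_stop attribute is False. The sentence below is just a made-up sample input:

import spacy

nlp = spacy.load('en_core_web_sm')

doc = nlp(u'This is a sample sentence and we will remove the stop words from it.')

# Keep only tokens that are neither stop words nor punctuation
filtered_tokens = [token.text for token in doc if not token.is_stop and not token.is_punct]
print(filtered_tokens)
# Expected output (roughly): ['sample', 'sentence', 'remove', 'stop', 'words']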
A Use Case for Stop Word Removal
Suppose we are building a bot that tries to cheer people up by assessing their mood. To formulate the right response, the bot needs to analyze the sentiment of the text the user types in. Before beginning even basic sentiment analysis, we should remove the noise that stop words add to the data.
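A minimal sketch of that preprocessing step is shown below; analyze_sentiment is a hypothetical placeholder for whatever sentiment model the bot actually uses, and the user input is invented:

import spacy

nlp = spacy.load('en_core_web_sm')

def remove_stop_words(text):
    # Strip stop words and punctuation before sentiment analysis
    doc = nlp(text)
    return ' '.join(token.text for token in doc
                    if not token.is_stop and not token.is_punct)

user_input = u'I am feeling very happy about the new update'
cleaned = remove_stop_words(user_input)
print(cleaned)  # e.g. 'feeling happy new update'

# analyze_sentiment(cleaned)  # hypothetical downstream sentiment model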