Stop words are high-frequency words such as a, an, the, to, and also that we often filter out of a document before further processing.

Stop words usually carry little lexical content and contribute little to the meaning of a text.

Below is a list of 25 semantically non-selective stop words that are common in Reuters-RCV1.

a an and are as
at be by for from
has he in is it
its of on that the
to was were will with

Let’s get into some code and try to understand how things work.

To see all the words defined as stop words in spaCy we can run the following lines of code:

Example 1:

from spacy.lang.en.stop_words import STOP_WORDS

# STOP_WORDS is a plain Python set of strings
print(STOP_WORDS)

Output:

{'nothing', 'seemed', 'herself', 'few', 'how', 'beyond', 'keep', 'thereby', 'with', 'fifty', 'can', 'two', 'where', 'he', 'third', 'then', '‘ve', 'anyhow', 'becomes', 'mostly', 'might', 'were', 'its', 'which', 'show', 'everywhere', 'have', 'doing', 'ca', 'whole', 'anywhere', 'the', '‘s', 'sometime', 'so', 'seems', 'been', 'front', 'may', 'beforehand', 'than', 'various', 'ourselves', 'except', 'none', 'bottom', 'will', 'has', 'on', 'never', 'yourselves', 'within', 'that', 'or', 'between', 'move', 'each', 'had', 'name', 'there', 'and', 'further', 'too', '‘d', 'they', 'because', "'ll", 'it', 'did', 'throughout', 'made', 'therefore', 'everyone', 'twenty', 'himself', 'used', 'indeed', 'together', 'whither', 'regarding', 'latterly', 'behind', "'re", 'hereby', 'say', 'every', 'almost', 'nine', 'wherein', 'a', 'cannot', 'whereas', 'hers', 'without', '‘ll', 'serious', 'yourself', 'about', 'n‘t', 'she', 'in', 'perhaps', 'besides', 'become', 'those', 'others', 'seeming', 'put', 'us', 'above', 'them', 'nor', 'get', 'not', 'meanwhile', 'four', 'our', 'am', 'hence', 'namely', 'make', 'her', '‘m', 'here', 'several', 'side', 'just', 'thereafter', 'what', 'anyway', 'over', 'myself', 'even', 'however', 'upon', 'neither', 'thus', 'someone', 'forty', 'we', 'one', 'part', 'take', 'noone', 'though', "'s", 'thence', 'still', 'own', '‘re', 'somehow', '’ve', 'very', 'as', "n't", 'although', 'anything', 'again', 'three', 'into', 'an', 'during', 'this', 'other', 'such', 'same', 'whoever', '’d', 'per', 'along', 'elsewhere', 're', '’m', 'why', 'down', 'five', 'whose', 'last', 'call', 'once', 'eleven', '’s', 'nevertheless', 'towards', 'some', 'top', 'is', 'but', 'mine', "'ve", 'do', 'whereby', 'him', 'all', 'to', 'could', 'among', 'sometimes', 'full', 'much', 'ever', 'most', 'thereupon', 'at', 'hereupon', 'more', 'rather', 'onto', 'through', 'also', 'up', 'whom', 'i', 'for', 'herein', "'m", 'ours', 'you', 'go', 'least', 'already', 'against', 'afterwards', 'formerly', 'amount', 'yet', 'please', 
'first', 'six', 'former', 'whenever', 'often', 'until', 'his', 'off', 'now', 'only', 'be', 'themselves', 'should', 'yours', 'next', 'fifteen', 'if', 'everything', 'whereupon', 'by', 'whereafter', 'give', 'n’t', 'under', 'always', 'around', 'your', 'using', 'another', 'done', 'whence', 'many', 'empty', 'amongst', 'ten', 'me', 'alone', "'d", 'out', 'whatever', 'from', 'otherwise', 'below', 'of', 'well', 'my', 'while', 'twelve', 'these', 'nobody', 'latter', 'after', 'being', 'sixty', 'something', 'therein', 'via', 'became', 'must', 'when', 'both', 'was', 'whether', '’re', 'else', 'enough', 'either', 'moreover', 'quite', 'really', 'hereafter', 'since', 'unless', 'hundred', 'seem', 'thru', 'toward', 'would', 'somewhere', 'does', 'less', 'their', 'anyone', 'are', 'who', 'no', 'any', 'nowhere', 'back', 'becoming', '’ll', 'wherever', 'itself', 'across', 'due', 'before', 'eight', 'beside', 'see'}

spaCy’s default English list holds roughly 300 stop words; the exact number depends on the spaCy version and can be checked with len(STOP_WORDS). We can also add our own stop words or remove existing ones to customize the list.
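As a rough sketch of that customization, the snippet below counts the default entries, then adds one stop word and removes another. The words "btw" and "serious" and the use of spacy.blank("en") (a tokenizer-only pipeline that needs no model download) are illustrative assumptions, not part of the examples in this article.

```python
import spacy
from spacy.lang.en.stop_words import STOP_WORDS

# The size of the default list varies between spaCy versions.
print(len(STOP_WORDS))

# A blank English pipeline is enough to work with stop-word flags.
nlp = spacy.blank("en")

# Add a custom stop word: update the default set and the vocabulary entry.
# Note: the default set is shared, so the change lasts for the whole session.
nlp.Defaults.stop_words.add("btw")
nlp.vocab["btw"].is_stop = True

# Remove a default stop word that we want to treat as content.
nlp.Defaults.stop_words.discard("serious")
nlp.vocab["serious"].is_stop = False

print(nlp.vocab["btw"].is_stop)      # True
print(nlp.vocab["serious"].is_stop)  # False
```

Updating both the set and the vocabulary entry keeps the two in sync: the set is used when new vocabulary entries are created, while the flag covers entries that already exist.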

To check whether a word is a stop word, we can load a spaCy pipeline (the nlp object), look the word up in nlp.vocab, and read the is_stop attribute of the resulting entry.

Example 2:

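import spacy

# Load the small English pipeline (it must be installed beforehand)
nlp = spacy.load('en_core_web_sm')

# Look up "is" in the vocabulary and read its stop-word flag
print(nlp.vocab['is'].is_stop)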

Output:

True

Example 3:

import spacy

nlp = spacy.load('en_core_web_sm')

# Inflected forms of "to be" are all on the stop list
print(nlp.vocab['are'].is_stop)
print(nlp.vocab['am'].is_stop)
print(nlp.vocab['was'].is_stop)
print(nlp.vocab['were'].is_stop)

Output:

True
True
True
True

Example 4:

import spacy

nlp = spacy.load('en_core_web_sm')

# Content-bearing verbs are not on the stop list
print(nlp.vocab['going'].is_stop)
print(nlp.vocab['playing'].is_stop)
print(nlp.vocab['singing'].is_stop)
print(nlp.vocab['dancing'].is_stop)

Output:

False
False
False
False

Removing stop words is an important part of text cleanup. It strips out low-information tokens before we do the actual processing to make sense of the text.
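The removal step itself can be sketched with the is_stop flag that spaCy also exposes on each token of a processed text. The sample sentence and the choice of spacy.blank("en") (tokenizer only, no model download) are assumptions made for illustration.

```python
import spacy

# Tokenization and the is_stop flag do not need a trained model,
# so a blank English pipeline is used here.
nlp = spacy.blank("en")

text = "The quick brown fox is jumping over the lazy dog."  # sample sentence
doc = nlp(text)

# Keep only tokens that are neither stop words nor punctuation.
content_words = [tok.text for tok in doc if not tok.is_stop and not tok.is_punct]
print(content_words)
```

Only the content-bearing tokens should survive the filter, while words such as "is" and "over" are dropped along with the final period.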

Use Case of Stop Words Removal

Suppose we are building a bot that tries to cheer people up by assessing their mood. The bot needs to analyze the sentiment of the text the user enters so that it can formulate an appropriate response. Before beginning even basic sentiment analysis, we should remove the noise that stop words introduce into the data.
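A minimal sketch of that preprocessing step follows. One design choice is an assumption on our part: spaCy's default list includes negations such as 'not' and 'never', which can flip the sentiment of a sentence, so this sketch keeps them. The helper name clean_for_sentiment, the NEGATIONS set, and the sample sentence are hypothetical.

```python
import spacy
from spacy.lang.en.stop_words import STOP_WORDS

# Negations can flip sentiment, so they are kept even though they are
# in the default list (illustrative choice, adjust as needed).
NEGATIONS = {"not", "no", "never", "nor", "n't"}

# Set difference builds a new set; the default list is left untouched.
sentiment_stop_words = STOP_WORDS - NEGATIONS

# A blank pipeline provides tokenization without a trained model.
nlp = spacy.blank("en")

def clean_for_sentiment(text):
    """Return lowercase tokens with stop words and punctuation removed."""
    return [
        tok.lower_
        for tok in nlp(text)
        if tok.lower_ not in sentiment_stop_words and not tok.is_punct
    ]

print(clean_for_sentiment("I am not happy with the service today."))
# ['not', 'happy', 'service', 'today']
```

The cleaned token list can then be passed to whatever sentiment scorer the bot uses.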