Part-Of-Speech (POS) Tagging in Natural Language Processing using spaCy
Part-of-speech (POS) tagging in Natural Language Processing is the process of reading text and assigning a part of speech, such as noun, verb, or adjective, to each word or token.
POS tagging is especially useful when we want to identify an entity, such as a person or a place, in a given sentence.
Why is POS tagging needed for chatbots?
Chatbots need POS tagging to reduce the complexity of understanding input that the bot was not trained on, or was trained on with low confidence. With POS tagging, we can identify the relevant parts of the input text and do string matching on only those parts. For example, to check whether a sentence mentions a location, we can rely on the tagger marking the location word as a noun (NOUN, or PROPN for a proper name such as India), take all the nouns from the tagged list, and check whether each one appears in a preset list of locations.
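The noun-filtering idea described above can be sketched roughly as follows. This is a minimal illustration, not a production approach: `TaggedToken`, `KNOWN_LOCATIONS`, and `find_locations` are hypothetical names made up for this sketch, and the tagged tokens are copied by hand from the kind of output spaCy prints for "I am going to visit India next week." so that the snippet runs without loading a model.

```python
from collections import namedtuple

# Stand-in for a tagged token (hypothetical helper for this sketch).
# It only carries the two attributes the filter needs: text and pos_.
TaggedToken = namedtuple("TaggedToken", ["text", "pos_"])

# Hypothetical preset list of locations the chatbot knows about.
KNOWN_LOCATIONS = {"india", "redmond", "washington"}


def find_locations(tokens, known_locations=KNOWN_LOCATIONS):
    """Return the texts of noun-like tokens that appear in the preset list."""
    noun_tags = {"NOUN", "PROPN"}  # common nouns and proper nouns
    return [
        token.text
        for token in tokens
        if token.pos_ in noun_tags and token.text.lower() in known_locations
    ]


# Tags written out by hand for: "I am going to visit India next week."
tagged = [
    TaggedToken("I", "PRON"),
    TaggedToken("am", "AUX"),
    TaggedToken("going", "VERB"),
    TaggedToken("to", "PART"),
    TaggedToken("visit", "VERB"),
    TaggedToken("India", "PROPN"),
    TaggedToken("next", "ADJ"),
    TaggedToken("week", "NOUN"),
    TaggedToken(".", "PUNCT"),
]

print(find_locations(tagged))  # ['India']
```

Because the function only reads `text` and `pos_`, it should also accept the tokens of a spaCy `doc` directly. Only "India" is returned here: "week" is a noun but is not in the preset list, and every other token is filtered out by its tag before any string matching happens.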
Let’s get our hands dirty with some real POS tagging examples.
Example 1:
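import spacy

# Load the small English pipeline (install it first with: python -m spacy download en_core_web_sm)
nlp = spacy.load('en_core_web_sm')
doc = nlp('I am learning the basics of natural language processing at Asquero')

# Print each token together with its coarse-grained part-of-speech tag
for token in doc:
    print(token.text, token.pos_)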
Output:
I PRON
am AUX
learning VERB
the DET
basics NOUN
of ADP
natural ADJ
language NOUN
processing NOUN
at ADP
Asquero PROPN
Example 2:
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp('I am going to visit India next week.')
for token in doc:
    print(token.text, token.pos_)
Output:
I PRON
am AUX
going VERB
to PART
visit VERB
India PROPN
next ADJ
week NOUN
. PUNCT
Example 3:
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp('Microsoft Corporation is an American multinational technology company with headquarters in Redmond, Washington.')
# Print several linguistic attributes for each token
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_, token.shape_, token.is_alpha, token.is_stop)
Output (arranged as a table for readability):
Text | Lemma | POS | Tag | Dep | Shape | Alpha | Stop |
Microsoft | Microsoft | PROPN | NNP | compound | Xxxxx | True | False |
Corporation | Corporation | PROPN | NNP | nsubj | Xxxxx | True | False |
is | be | AUX | VBZ | ROOT | xx | True | True |
an | an | DET | DT | det | xx | True | True |
American | american | ADJ | JJ | amod | Xxxxx | True | False |
multinational | multinational | ADJ | JJ | amod | xxxx | True | False |
technology | technology | NOUN | NN | compound | xxxx | True | False |
company | company | NOUN | NN | attr | xxxx | True | False |
with | with | ADP | IN | prep | xxxx | True | True |
headquarters | headquarter | NOUN | NNS | pobj | xxxx | True | False |
in | in | ADP | IN | prep | xx | True | True |
Redmond | Redmond | PROPN | NNP | pobj | Xxxxx | True | False |
, | , | PUNCT | , | punct | , | False | False |
Washington | Washington | PROPN | NNP | appos | Xxxxx | True | False |
. | . | PUNCT | . | punct | . | False | False |
The table below explains each attribute printed in the output above.
Attribute | Description |
TEXT | The original text of the word being processed |
LEMMA | The base (dictionary) form of the word |
POS | The coarse-grained part of speech of the word |
TAG | The fine-grained part-of-speech tag, which adds some morphological information (e.g., VBZ marks a third-person singular present-tense verb) |
DEP | Syntactic dependency (i.e., the relation between tokens) |
SHAPE | The shape of the word (e.g., its capitalization, punctuation, and digit pattern) |
ALPHA | Does the token consist only of alphabetic characters? |
STOP | Is the word a stop word, i.e., part of a stop list of very common words? |
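To make the SHAPE and ALPHA columns more concrete, here is a rough pure-Python approximation of how such values can be derived from a token's text. It is a sketch for illustration only and is not spaCy's own code: `approximate_shape` is a hypothetical helper, and it assumes that uppercase letters map to `X`, lowercase letters to `x`, digits to `d`, and that each run of the same symbol is cut off after four characters, which is consistent with the shapes shown in the output table above.

```python
def approximate_shape(text):
    """Map characters to X / x / d and cap each run of one symbol at four."""
    shape = []
    previous = ""
    run_length = 0
    for char in text:
        if char.isalpha():
            symbol = "X" if char.isupper() else "x"
        elif char.isdigit():
            symbol = "d"
        else:
            symbol = char  # punctuation and other characters are kept as-is
        # Count how many times in a row the current symbol has appeared
        run_length = run_length + 1 if symbol == previous else 1
        previous = symbol
        if run_length <= 4:
            shape.append(symbol)
    return "".join(shape)


for word in ["Microsoft", "is", "multinational", ","]:
    # str.isalpha() plays the role of the ALPHA column in this sketch
    print(word, approximate_shape(word), word.isalpha())
# Microsoft Xxxxx True
# is xx True
# multinational xxxx True
# , , False
```

Capping the runs explains why both "Microsoft" and "Corporation" share the shape `Xxxxx` despite their different lengths: the shape records the pattern of a word, not its exact length.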