Tokenization is the process of segmenting a string of characters into words (tokens).

During text preprocessing, we deal with a raw string of characters and need to identify all the individual words it contains. So we perform tokenization, which converts the string of characters into a sequence of words by segmenting it at the word boundaries we observe.

Sometimes, depending on the problem or use case at hand, we might need to perform sentence segmentation instead of tokenization. For example, if we want to find all the important sentences in a piece of text or a document, we first need to extract every sentence from it, and that is exactly what sentence segmentation does.

Is sentence segmentation a challenging problem?

Sentence segmentation is the problem of deciding where a sentence begins and ends, so that we obtain a complete unit of words that we call a sentence. Are there any challenges involved in this problem? One obvious approach is to look for a dot, an exclamation mark, or a question mark in the text and treat it as the end of a sentence. But there are exceptions: a dot may follow an abbreviation ("Dr.", "etc."), or it may sit inside a decimal number ("3.14"). These ambiguities make sentence segmentation a bit challenging.
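
To see why such a rule is fragile, consider the naive approach of splitting on every period followed by a space. The sketch below (with a made-up example text) shows it breaking on an abbreviation and a time expression:

import re

# Illustrative text containing an abbreviation, a decimal number, and "a.m."
text = "Dr. Smith paid $3.50 for coffee at 10 a.m. and left. It was raining."

# Naive rule: every period followed by whitespace ends a sentence.
for sentence in re.split(r"(?<=\.)\s+", text):
    print(sentence)

# Prints four "sentences" instead of two, because the periods in
# "Dr." and "a.m." are wrongly treated as sentence boundaries.
# The decimal in "$3.50" survives only because its period is not
# followed by a space.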

How do we solve the problem of sentence segmentation?

The problem of sentence segmentation reduces to deciding whether a given dot is a full stop. If it is, it marks the end of a sentence; otherwise it does not. This turns sentence segmentation into a classification problem: the task is to build a classifier that decides whether the current position is the end of a sentence or not. Since there are two classes, it is a binary classifier, though in general a classifier can have multiple classes as well.

How do we build this binary classifier for sentence segmentation? We can start with a set of hand-written rules: if a dot satisfies the rules, it is assigned to one class ("end of sentence"); otherwise it belongs to the other class ("not end of sentence"), as the sketch below illustrates.
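
As a rough illustration, here is a minimal rule-based binary classifier. The abbreviation list and the rules themselves are simplified assumptions for demonstration, not a complete solution:

ABBREVIATIONS = {"dr.", "mr.", "mrs.", "prof.", "etc.", "e.g.", "i.e."}

def ends_sentence(token, next_token=None):
    """Binary classifier: does the period ending `token` close a sentence?"""
    # Rule 1: a known abbreviation does not end a sentence.
    if token.lower() in ABBREVIATIONS:
        return False
    # Rule 2: a period inside a number is a decimal point, not a full stop.
    if token.replace(".", "").isdigit():
        return False
    # Rule 3: if the next word starts with a lowercase letter,
    # the sentence most likely continues.
    if next_token and next_token[:1].islower():
        return False
    return True

print(ends_sentence("Dr.", "Smith"))    # False (abbreviation)
print(ends_sentence("3.50", "euros"))   # False (decimal number)
print(ends_sentence("raining."))        # True  (end of sentence)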

Challenges in tokenization

When designing a tokenization algorithm, there are certain cases we need to handle carefully. For example, suppose we encounter the word "Finland's". How do we treat this token? Should we keep it as "Finland's", split it apart, or remove the apostrophe and keep "Finland"? We must also decide whether contractions such as "what're", "shouldn't", and "I'm" should be one token or several. Similar questions arise with multi-word names such as "San Francisco": should we treat it as a single token or as multiple tokens? The answers to these questions are flexible and depend on the type of problem or the use case at hand.
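
For reference, spaCy's English tokenizer takes a middle path here: it splits off clitics such as 's and n't as separate tokens, while "San Francisco" stays two ordinary tokens that can later be recognized as a single named entity. A quick check, with the expected output shown in comments:

import spacy

nlp = spacy.load("en_core_web_sm")

# Clitics are split off as their own tokens:
print([t.text for t in nlp("Finland's capital")])  # ['Finland', "'s", 'capital']
print([t.text for t in nlp("We shouldn't stop")])  # ['We', 'should', "n't", 'stop']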

Tokenization using spaCy
import spacy

# Load the small English pipeline
# (install it first with: python -m spacy download en_core_web_sm)
nlp = spacy.load('en_core_web_sm')

# Processing the text returns a Doc object, a sequence of Token objects
doc = nlp('I am going to visit India next week.')

# Print each token on its own line
for token in doc:
    print(token.text)

Output:

I
am
going
to
visit
India
next
week
.
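
Note that spaCy returns the final period as its own token rather than attaching it to "week": punctuation marks are treated as separate tokens.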
Sentence Segmentation using spaCy
import spacy

# Load the small English pipeline
nlp = spacy.load("en_core_web_sm")

# The Doc's sents property yields the detected sentence spans
doc = nlp("This is the first sentence. This is the second sentence. And this is the third sentence.")

for sent in doc.sents:
    print(sent.text)

Output:

This is the first sentence.
This is the second sentence.
And this is the third sentence.
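
Under the hood, en_core_web_sm derives sentence boundaries from its dependency parser, which is why it copes with abbreviations and decimal numbers better than a plain split on periods. If fast, rule-based splitting is enough, spaCy also ships a sentencizer component; a minimal sketch:

import spacy

# A blank English pipeline with only the rule-based sentencizer,
# which splits on sentence-final punctuation such as ., ! and ?
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

doc = nlp("Is this the second sentence? Yes! It is.")
for sent in doc.sents:
    print(sent.text)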