Zero Frequency Model in Markov Models in Text Compression
Markov models are particularly useful in text compression, where the probability of the next letter is heavily influenced by the preceding letters.
In current text compression literature, the kth-order Markov models are more widely known as finite context models.
Consider the word "preceding".
Suppose that we have already processed "preceding" and are going to encode the next letter.
If we take no account of the context and treat each letter as a surprise, the probability of the letter "g" occurring is relatively low.
If we use a first-order Markov model or a single-letter context (that is, we look at the probability model given n), we can see that the probability of "g" would increase substantially.
As we increase the context size (go from n to in to din and so on), the probability of the alphabet becomes more and more skewed, which results in lower entropy.
The longer the context, the better its predictive value.
If we were to store the probability model with respect to all the contexts of a given length, the number of contexts would grow exponentially with the length of context.
Consider a context model of order four (the context is determined by the last four symbols).
If we take an alphabet size of 95, the possible number of contexts is 95 (more then 81 million).
Context modeling in text compression schemes tends to be an adaptive strategy in which the probabilities for different symbols in different contexts are updated as they are encountered.
However, this means that we will often encounter symbols that have bot been encountered for the first time, followed by a prearranged code for that symbol.
This would significantly increase the length of the code for the symbol on its first occurrence.
However, if this situation did not occur too often, the overhead associated with such occurrences would be small compared to the total number of bits used to encode the output of the source.
Unfortunately, in context-based encoding, the zero frequency problem is encountered often enough for overhead to be a problem, especially for longer contexts.