
Speech and Language Processing. Daniel Jurafsky & James H. Martin. Copyright © 2016. All rights reserved. Draft of November 7, 2016.

CHAPTER 4

Language Modeling with N-grams

“You are uniformly charming!” cried he, with a smile of associating and now and then I bowed and they perceived a chaise and four to wish for.
Random sentence generated from a Jane Austen trigram model

Being able to predict the future is not always a good thing. Cassandra of Troy had the gift of foreseeing but was cursed by Apollo that her predictions would never be believed. Her warnings of the destruction of Troy were ignored and to simplify, let’s just say that things just didn’t go well for her later.

In this chapter we take up the somewhat less fraught topic of predicting words. What word, for example, is likely to follow

Please turn your homework ...

Hopefully, most of you concluded that a very likely word is in, or possibly over, but probably not refrigerator or the. In the following sections we will formalize this intuition by introducing models that assign a probability to each possible next word. The same models will also serve to assign a probability to an entire sentence. Such a model, for example, could predict that the following sequence has a much higher probability of appearing in a text:

all of a sudden I notice three guys standing on the sidewalk

than does this same set of words in a different order:

on guys all I of notice sidewalk three a sudden standing the

Why would you want to predict upcoming words, or assign probabilities to sentences? Probabilities are essential in any task in which we have to identify words in noisy, ambiguous input, like speech recognition or handwriting recognition. In the movie Take the Money and Run, Woody Allen tries to rob a bank with a sloppily written hold-up note that the teller incorrectly reads as “I have a gub”. As Russell and Norvig (2002) point out, a language processing system could avoid making this mistake by using the knowledge that the sequence “I have a gun” is far more probable than the non-word “I have a gub” or even “I have a gull”.

In spelling correction, we need to find and correct spelling errors like Their are two midterms in this class, in which There was mistyped as Their. A sentence starting with the phrase There are will be much more probable than one starting with Their are, allowing a spellchecker to both detect and correct these errors.

Assigning probabilities to sequences of words is also essential in machine translation. Suppose we are translating a Chinese source sentence:

他 向 记者 介绍了 主要 内容
He to reporters introduced main content


As part of the process we might have built the following set of potential rough English translations:

he introduced reporters to the main contents of the statement
he briefed to reporters the main contents of the statement
he briefed reporters on the main contents of the statement

A probabilistic model of word sequences could suggest that briefed reporters on is a more probable English phrase than briefed to reporters (which has an awkward to after briefed) or introduced reporters to (which uses a verb that is less fluent English in this context), allowing us to correctly select the third translation above.

Probabilities are also important for augmentative communication (Newell et al., 1998) systems. People like the physicist Stephen Hawking who are unable to physically talk or sign can instead use simple movements to select words from a menu to be spoken by the system. Word prediction can be used to suggest likely words for the menu.

Models that assign probabilities to sequences of words are called language models or LMs. In this chapter we introduce the simplest model that assigns probabilities to sentences and sequences of words, the N-gram. An N-gram is a sequence of N words: a 2-gram (or bigram) is a two-word sequence of words like “please turn”, “turn your”, or “your homework”, and a 3-gram (or trigram) is a three-word sequence of words like “please turn your”, or “turn your homework”. We’ll see how to use N-gram models to estimate the probability of the last word of an N-gram given the previous words, and also to assign probabilities to entire sequences. In a bit of terminological ambiguity, we usually drop the word “model”, and thus the term N-gram is used to mean either the word sequence itself or the predictive model that assigns it a probability. Whether estimating probabilities of next words or of whole sequences, the N-gram model is one of the most important tools in speech and language processing.

4.1 N-Grams

Let’s begin with the task of computing P(w|h), the probability of a word w given some history h. Suppose the history h is “its water is so transparent that” and we want to know the probability that the next word is the:

P(the|its water is so transparent that). (4.1)

One way to estimate this probability is from relative frequency counts: take a very large corpus, count the number of times we see its water is so transparent that, and count the number of times this is followed by the. This would be answering the question “Out of the times we saw the history h, how many times was it followed by the word w”, as follows:

P(the | its water is so transparent that) = C(its water is so transparent that the) / C(its water is so transparent that)    (4.2)

With a large enough corpus, such as the web, we can compute these counts and estimate the probability from Eq. 4.2. You should pause now, go to the web, and compute this estimate for yourself.
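To make the counting concrete, here is a minimal Python sketch (not from the text) of this relative-frequency estimate; the toy corpus and the function name are invented for illustration, and a realistic estimate would of course need web-scale counts.

```python
def relative_frequency(tokens, history, word):
    """Estimate P(word | history) = C(history + word) / C(history)
    by counting occurrences of the token sequences in a tokenized corpus."""
    h, hw = tuple(history), tuple(history) + (word,)
    count = lambda seq: sum(
        1 for i in range(len(tokens) - len(seq) + 1)
        if tuple(tokens[i:i + len(seq)]) == seq)
    c_h = count(h)
    return count(hw) / c_h if c_h else 0.0

# Toy corpus; a real estimate needs a very large collection of text.
corpus = "its water is so transparent that the fish can be seen".split()
history = "its water is so transparent that".split()
print(relative_frequency(corpus, history, "the"))   # 1.0 on this tiny corpus
```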


we approximate it with the probability

P(the|that) (4.6)

When we use a bigram model to predict the conditional probability of the next word, we are thus making the following approximation:

P(w_n | w_1^{n-1}) ≈ P(w_n | w_{n-1})    (4.7)

The assumption that the probability of a word depends only on the previous word is called a Markov assumption. Markov models are the class of probabilistic models that assume we can predict the probability of some future unit without looking too far into the past. We can generalize the bigram (which looks one word into the past) to the trigram (which looks two words into the past) and thus to the N-gram (which looks N − 1 words into the past). Thus, the general equation for this N-gram approximation to the conditional probability of the next word in a sequence is

P(w_n | w_1^{n-1}) ≈ P(w_n | w_{n-N+1}^{n-1})    (4.8)

Given the bigram assumption for the probability of an individual word, we can compute the probability of a complete word sequence by substituting Eq. 4.7 into Eq. 4.4:

P(w_1^n) ≈ ∏_{k=1}^{n} P(w_k | w_{k-1})    (4.9)

How do we estimate these bigram or N-gram probabilities? An intuitive way to estimate probabilities is called maximum likelihood estimation or MLE. We get the MLE estimate for the parameters of an N-gram model by getting counts from a corpus, and normalizing the counts so that they lie between 0 and 1.¹ For example, to compute a particular bigram probability of a word y given a previous word x, we’ll compute the count of the bigram C(xy) and normalize by the sum of all the bigrams that share the same first word x:

P(w_n | w_{n-1}) = C(w_{n-1} w_n) / ∑_w C(w_{n-1} w)    (4.10)

We can simplify this equation, since the sum of all bigram counts that start with a given word w_{n-1} must be equal to the unigram count for that word w_{n-1} (the reader should take a moment to be convinced of this):

P(w_n | w_{n-1}) = C(w_{n-1} w_n) / C(w_{n-1})    (4.11)

Let’s work through an example using a mini-corpus of three sentences. We’ll first need to augment each sentence with a special symbol <s> at the beginning of the sentence, to give us the bigram context of the first word. We’ll also need a special end-symbol </s>.²

¹ For probabilistic models, normalizing means dividing by some total count so that the resulting probabilities fall legally between 0 and 1.
² We need the end-symbol to make the bigram grammar a true probability distribution. Without an end-symbol, the sentence probabilities for all sentences of a given length would sum to one. This model would define an infinite set of probability distributions, with one distribution per sentence length. See Exercise 4.5.


<s> I am Sam </s>
<s> Sam I am </s>
<s> I do not like green eggs and ham </s>

Here are the calculations for some of the bigram probabilities from this corpus:

P(I|<s>) = 2/3 = .67    P(Sam|<s>) = 1/3 = .33    P(am|I) = 2/3 = .67
P(</s>|Sam) = 1/2 = .5    P(Sam|am) = 1/2 = .5    P(do|I) = 1/3 = .33

For the general case of MLE N-gram parameter estimation:

P(w_n | w_{n-N+1}^{n-1}) = C(w_{n-N+1}^{n-1} w_n) / C(w_{n-N+1}^{n-1})    (4.12)

Equation 4.12 (like Eq. 4.11) estimates the N-gram probability by dividing the observed frequency of a particular sequence by the observed frequency of a prefix.

This ratio is called a relative frequency. We said above that this use of relative frequencies as a way to estimate probabilities is an example of maximum likelihood estimation or MLE. In MLE, the resulting parameter set maximizes the likelihood of the training set T given the model M (i.e., P(T|M)). For example, suppose the word Chinese occurs 400 times in a corpus of a million words like the Brown corpus. What is the probability that a random word selected from some other text of, say, a million words will be the word Chinese? The MLE of its probability is 400/1,000,000 or .0004. Now .0004 is not the best possible estimate of the probability of Chinese occurring in all situations; it might turn out that in some other corpus or context Chinese is a very unlikely word. But it is the probability that makes it most likely that Chinese will occur 400 times in a million-word corpus. We present ways to modify the MLE estimates slightly to get better probability estimates in Section 4.4.

Let’s move on to some examples from a slightly larger corpus than our 14-word example above. We’ll use data from the now-defunct Berkeley Restaurant Project, a dialogue system from the last century that answered questions about a database of restaurants in Berkeley, California (Jurafsky et al., 1994). Here are some text-normalized sample user queries (a sample of 9332 sentences is on the website):

can you tell me about any good cantonese restaurants close by
mid priced thai food is what i’m looking for
tell me about chez panisse
can you give me a listing of the kinds of food that are available
i’m looking for a good place to eat breakfast
when is caffe venezia open during the day

Figure 4.1 shows the bigram counts from a piece of a bigram grammar from the Berkeley Restaurant Project. Note that the majority of the values are zero. In fact, we have chosen the sample words to cohere with each other; a matrix selected from a random set of seven words would be even more sparse. Figure 4.2 shows the bigram probabilities after normalization (dividing each row by the following unigram counts):

i     want  to    eat   chinese  food  lunch  spend
2533  927   2417  746   158      1093  341    278

Here are a few other useful probabilities:

P(i|<s>) = 0.25    P(english|want) = 0.0011
P(food|english) = 0.5    P(</s>|food) = 0.68
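As a quick illustration of Eq. 4.11 (a sketch, not the Berkeley Restaurant Project code), the following Python reproduces the bigram estimates from the three-sentence mini-corpus above; the function name is mine.

```python
from collections import defaultdict

def bigram_mle(sentences):
    """MLE bigram probabilities P(w_n | w_{n-1}) = C(w_{n-1} w_n) / C(w_{n-1}),
    with <s> and </s> sentence markers (Eq. 4.11)."""
    unigram, bigram = defaultdict(int), defaultdict(int)
    for sent in sentences:
        tokens = ["<s>"] + sent.split() + ["</s>"]
        for w1, w2 in zip(tokens, tokens[1:]):
            unigram[w1] += 1
            bigram[(w1, w2)] += 1
    return lambda w, prev: bigram[(prev, w)] / unigram[prev] if unigram[prev] else 0.0

# The three-sentence mini-corpus from the text.
p = bigram_mle(["I am Sam", "Sam I am", "I do not like green eggs and ham"])
print(round(p("I", "<s>"), 2))     # 0.67
print(round(p("Sam", "<s>"), 2))   # 0.33
print(round(p("am", "I"), 2))      # 0.67
print(round(p("</s>", "Sam"), 2))  # 0.5
```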


In practice, language model probabilities are always represented and computed in log format, as log probabilities. Since probabilities are (by definition) less than or equal to 1, the more probabilities we multiply together, the smaller the product becomes. Multiplying enough N-grams together would result in numerical underflow. By using log probabilities instead of raw probabilities, we get numbers that are not as small. Adding in log space is equivalent to multiplying in linear space, so we combine log probabilities by adding them. The result of doing all computation and storage in log space is that we only need to convert back into probabilities if we need to report them at the end; then we can just take the exp of the logprob:

p_1 × p_2 × p_3 × p_4 = exp(log p_1 + log p_2 + log p_3 + log p_4)    (4.13)
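A toy Python demonstration of this point (illustrative only, not from the text): the product of a few probabilities equals the exponentiated sum of their logs, and for long sequences only the log-space version avoids underflow.

```python
import math

# Eq. 4.13 with four toy probabilities: product vs. exponentiated log sum.
p1, p2, p3, p4 = 0.1, 0.05, 0.2, 0.01
direct = p1 * p2 * p3 * p4
via_logs = math.exp(math.log(p1) + math.log(p2) + math.log(p3) + math.log(p4))
print(direct, via_logs)                   # both 1e-05, up to rounding

# With long sequences the raw product underflows, but the log sum does not.
many = [1e-4] * 100                       # pretend per-word probabilities
print(math.prod(many))                    # 0.0 -- numerical underflow
print(sum(math.log(p) for p in many))     # about -921.0, still representable
```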

4.2 Evaluating Language Models

The best way to evaluate the performance of a language model is to embed it in an application and measure how much the application improves. Such end-to-end evaluation is called extrinsic evaluation. Extrinsic evaluation is the only way to know if a particular improvement in a component is really going to help the task at hand. Thus, for speech recognition, we can compare the performance of two language models by running the speech recognizer twice, once with each language model, and seeing which gives the more accurate transcription.

Unfortunately, running big NLP systems end-to-end is often very expensive. Instead, it would be nice to have a metric that can be used to quickly evaluate potential improvements in a language model. An intrinsic evaluation metric is one that measures the quality of a model independent of any application.

For an intrinsic evaluation of a language model we need a test set. As with many of the statistical models in our field, the probabilities of an N-gram model come from the corpus it is trained on, the training set or training corpus. We can then measure the quality of an N-gram model by its performance on some unseen data called the test set or test corpus. So if we are given a corpus of text and want to compare two different N-gram models, we divide the data into training and test sets, train the parameters of both models on the training set, and then compare how well the two trained models fit the test set.

But what does it mean to “model the test set”? The answer is simple: whichever model assigns a higher probability to the test set—meaning it more accurately predicts the test set—is a better model. Given two probabilistic models, the better model is the one that has a tighter fit to the test data or that better predicts the details of the test data, and hence will assign a higher probability to the test data.

Since our evaluation metric is based on test set probability, it’s important not to let the test sentences into the training set. Suppose we are trying to compute the probability of a particular “test” sentence. If our test sentence is part of the training corpus, we will mistakenly assign it an artificially high probability when it occurs in the test set. We call this situation training on the test set. Training on the test set introduces a bias that makes the probabilities all look too high and causes huge inaccuracies in perplexity.

Sometimes we use a particular test set so often that we implicitly tune to its characteristics. We then need a fresh test set that is truly unseen. In such cases, we call the initial test set the development test set or devset. How do we divide our data into training, development, and test sets? We want our test set to be as large


as possible, since a small test set may be accidentally unrepresentative, but we also want as much training data as possible. At the minimum, we would want to pick the smallest test set that gives us enough statistical power to measure a statistically significant difference between two potential models. In practice, we often just divide our data into 80% training, 10% development, and 10% test. Given a large corpus that we want to divide into training and test, test data can either be taken from some continuous sequence of text inside the corpus, or we can remove smaller “stripes” of text from randomly selected parts of our corpus and combine them into a test set.
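As a hedged sketch of this bookkeeping (one simple variant that shuffles sentences rather than taking a contiguous chunk or stripes of text; the helper name is mine), an 80/10/10 split might look like this:

```python
import random

def split_corpus(sentences, train=0.8, dev=0.1, seed=0):
    """Shuffle sentences and split them into training, development, and test
    sets; the 80/10/10 proportions are the rule of thumb from the text."""
    data = list(sentences)
    random.Random(seed).shuffle(data)
    n = len(data)
    n_train, n_dev = int(n * train), int(n * dev)
    return data[:n_train], data[n_train:n_train + n_dev], data[n_train + n_dev:]

train_set, dev_set, test_set = split_corpus(f"sentence {i}" for i in range(100))
print(len(train_set), len(dev_set), len(test_set))  # 80 10 10
```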

4.2.1 Perplexity

In practice we don’t use raw probability as our metric for evaluating language models, but a variant called perplexity. The perplexity (sometimes called PP for short) of a language model on a test set is the inverse probability of the test set, normalized by the number of words. For a test set W = w_1 w_2 ... w_N:

PP(W) = P(w_1 w_2 ... w_N)^{-1/N}
      = [ 1 / P(w_1 w_2 ... w_N) ]^{1/N}    (4.14)

We can use the chain rule to expand the probability of W :

PP(W) = [ ∏_{i=1}^{N} 1 / P(w_i | w_1 ... w_{i-1}) ]^{1/N}    (4.15)

Thus, if we are computing the perplexity of W with a bigram language model, we get:

PP(W) = [ ∏_{i=1}^{N} 1 / P(w_i | w_{i-1}) ]^{1/N}    (4.16)
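A small Python sketch of Eq. 4.16 (not from the text; the digit mini-language below anticipates the example discussed next): the perplexity is computed in log space and exponentiated at the end, which is equivalent to taking the N-th root of the inverse probability. The initial <s> provides context but is not counted in N, since we only count predicted positions.

```python
import math

def bigram_perplexity(test_tokens, bigram_prob):
    """Perplexity of a token sequence under a bigram model (Eq. 4.16):
    PP(W) = exp(-(1/N) * sum_i log P(w_i | w_{i-1}))."""
    log_sum, n = 0.0, 0
    for prev, w in zip(test_tokens, test_tokens[1:]):
        log_sum += math.log(bigram_prob(w, prev))
        n += 1
    return math.exp(-log_sum / n)

# The digit mini-language: ten equally likely digits.
digits = [str(d) for d in range(10)]
uniform = lambda w, prev: 1.0 / 10
test = ["<s>"] + [digits[i % 10] for i in range(30)]
print(bigram_perplexity(test, uniform))  # 10.0, the branching factor
```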

Note that because of the inverse in Eq. 4.15, the higher the conditional probability of the word sequence, the lower the perplexity. Thus, minimizing perplexity is equivalent to maximizing the test set probability according to the language model. What we generally use for word sequence in Eq. 4.15 or Eq. 4.16 is the entire sequence of words in some test set. Since this sequence will cross many sentence boundaries, we need to include the begin- and end-sentence markers <s> and </s> in the probability computation. We also need to include the end-of-sentence marker </s> (but not the beginning-of-sentence marker <s>) in the total count of word tokens N.

There is another way to think about perplexity: as the weighted average branching factor of a language. The branching factor of a language is the number of possible next words that can follow any word. Consider the task of recognizing the digits in English (zero, one, two, ..., nine), given that each of the 10 digits occurs with equal probability P = 1/10. The perplexity of this mini-language is in fact 10. To see that, imagine a string of digits of length N. By Eq. 4.15, the perplexity will be


case. Imagine all the words of the English language covering the probability space between 0 and 1, each word covering an interval proportional to its frequency. We choose a random value between 0 and 1 and print the word whose interval includes this chosen value. We continue choosing random numbers and generating words until we randomly generate the sentence-final token </s>. We can use the same technique to generate bigrams by first generating a random bigram that starts with <s> (according to its bigram probability). Let’s say the second word of that bigram is w. We next choose a random bigram starting with w (again, drawn according to its bigram probability), and so on. To give an intuition for the increasing power of higher-order N-grams, Fig. 4.3 shows random sentences generated from unigram, bigram, trigram, and 4-gram models trained on Shakespeare’s works.

1-gram: –To him swallowed confess hear both. Which. Of save on trail for are ay device and rote life have
        –Hill he late speaks; or! a more to leg less first you enter
2-gram: –Why dost stand forth thy canopy, forsooth; he is this palpable hit the King Henry. Live king. Follow.
        –What means, sir. I confess she? then all sorts, he is trim, captain.
3-gram: –Fly, and will rid me these news of price. Therefore the sadness of parting, as they say, ’tis done.
        –This shall forbid it should be branded, if renown made it empty.
4-gram: –King Henry. What! I will go seek the traitor Gloucester. Exeunt some of the watch. A great banquet serv’d in;
        –It cannot be but so.

Figure 4.3 Eight sentences randomly generated from four N-grams computed from Shakespeare’s works. All characters were mapped to lower-case and punctuation marks were treated as words. Output is hand-corrected for capitalization to improve readability.
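The sampling procedure just described (the technique used to generate sentences like those in Fig. 4.3 and Fig. 4.4) can be sketched in a few lines of Python; the mini-corpus and function names here are illustrative, not the Shakespeare setup.

```python
import random
from collections import defaultdict

def train_bigrams(sentences):
    """Collect raw bigram counts with <s>/</s> markers."""
    counts = defaultdict(lambda: defaultdict(int))
    for sent in sentences:
        tokens = ["<s>"] + sent.split() + ["</s>"]
        for w1, w2 in zip(tokens, tokens[1:]):
            counts[w1][w2] += 1
    return counts

def generate(counts, seed=None):
    """Sample a sentence: start from <s>, repeatedly draw the next word in
    proportion to its bigram count, and stop when </s> is drawn."""
    rng = random.Random(seed)
    word, out = "<s>", []
    while True:
        nxt = counts[word]
        word = rng.choices(list(nxt), weights=list(nxt.values()))[0]
        if word == "</s>":
            return " ".join(out)
        out.append(word)

counts = train_bigrams(["I am Sam", "Sam I am", "I do not like green eggs and ham"])
print(generate(counts, seed=1))
```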

The longer the context on which we train the model, the more coherent the sentences. In the unigram sentences, there is no coherent relation between words or any sentence-final punctuation. The bigram sentences have some local word-to-word coherence (especially if we consider that punctuation counts as a word). The trigram and 4-gram sentences are beginning to look a lot like Shakespeare. Indeed, a careful investigation of the 4-gram sentences shows that they look a little too much like Shakespeare. The words It cannot be but so are directly from King John. This is because, not to put the knock on Shakespeare, his oeuvre is not very large as corpora go (N = 884,647, V = 29,066), and our N-gram probability matrices are ridiculously sparse. There are V² = 844,000,000 possible bigrams alone, and the number of possible 4-grams is V⁴ = 7 × 10¹⁷. Thus, once the generator has chosen the first 4-gram (It cannot be but), there are only five possible continuations (that, I, he, thou, and so); indeed, for many 4-grams, there is only one continuation.

To get an idea of the dependence of a grammar on its training set, let’s look at an N-gram grammar trained on a completely different corpus: the Wall Street Journal (WSJ) newspaper. Shakespeare and the Wall Street Journal are both English, so we might expect some overlap between our N-grams for the two genres. Fig. 4.4 shows sentences generated by unigram, bigram, and trigram grammars trained on 40 million words from WSJ. Compare these examples to the pseudo-Shakespeare in Fig. 4.3.


1-gram: Months the my and issue of year foreign new exchange’s september were recession exchange new endorsed a acquire to six executives
2-gram: Last December through the way to preserve the Hudson corporation N. B. E. C. Taylor would seem to complete the major central planners one point five percent of U. S. E. has already old M. X. corporation of living on information such as more frequently fishing to keep her
3-gram: They also point to ninety nine point six billion dollars from two hundred four oh six three percent of the rates of interest stores as Mexico and Brazil on market conditions

Figure 4.4 Three sentences randomly generated from three N-gram models computed from 40 million words of the Wall Street Journal, lower-casing all characters and treating punctuation as words. Output was then hand-corrected for capitalization to improve readability.

While superficially they both seem to model “English-like sentences”, there is obviously no overlap whatsoever in possible sentences, and little if any overlap even in small phrases. This stark difference tells us that statistical models are likely to be pretty useless as predictors if the training sets and the test sets are as different as Shakespeare and WSJ.

How should we deal with this problem when we build N-gram models? One way is to be sure to use a training corpus that has a similar genre to whatever task we are trying to accomplish. To build a language model for translating legal documents, we need a training corpus of legal documents. To build a language model for a question-answering system, we need a training corpus of questions.

Matching genres is still not sufficient. Our models may still be subject to the problem of sparsity. For any N-gram that occurred a sufficient number of times, we might have a good estimate of its probability. But because any corpus is limited, some perfectly acceptable English word sequences are bound to be missing from it. That is, we’ll have many cases of putative “zero probability N-grams” that should really have some non-zero probability. Consider the words that follow the bigram denied the in the WSJ Treebank3 corpus, together with their counts:

denied the allegations: 5
denied the speculation: 2
denied the rumors: 1
denied the report: 1

But suppose our test set has phrases like:

denied the offer
denied the loan

Our model will incorrectly estimate that the P(offer|denied the) is 0! These zeros—things that don’t ever occur in the training set but do occur in the test set—are a problem for two reasons. First, their presence means we are underestimating the probability of all sorts of words that might occur, which will hurt the performance of any application we want to run on this data. Second, if the probability of any word in the test set is 0, the entire probability of the test set is 0. By definition, perplexity is based on the inverse probability of the test set. Thus if some words have zero probability, we can’t compute perplexity at all, since we can’t divide by 0!


4.4.1 Laplace Smoothing

The simplest way to do smoothing is to add one to all the bigram counts, before we normalize them into probabilities. All the counts that used to be zero will now have a count of 1, the counts of 1 will be 2, and so on. This algorithm is called Laplace smoothing. Laplace smoothing does not perform well enough to be used in modern N-gram models, but it usefully introduces many of the concepts that we see in other smoothing algorithms, gives a useful baseline, and is also a practical smoothing algorithm for other tasks like text classification (Chapter 7).

Let’s start with the application of Laplace smoothing to unigram probabilities. Recall that the unsmoothed maximum likelihood estimate of the unigram probability of the word w_i is its count c_i normalized by the total number of word tokens N:

P(w_i) = c_i / N

Laplace smoothing merely adds one to each count (hence its alternate name add-one smoothing). Since there are V words in the vocabulary and each one was incremented, we also need to adjust the denominator to take into account the extra V observations. (What happens to our P values if we don’t increase the denominator?)

P_Laplace(w_i) = (c_i + 1) / (N + V)    (4.18)

Instead of changing both the numerator and denominator, it is convenient to describe how a smoothing algorithm affects the numerator, by defining an adjusted count c*. This adjusted count is easier to compare directly with the MLE counts and can be turned into a probability like an MLE count by normalizing by N. To define this count, since we are only changing the numerator in addition to adding 1 we’ll also need to multiply by a normalization factor N/(N + V):

c*_i = (c_i + 1) × N / (N + V)    (4.19)

We can now turn c*_i into a probability P*_i by normalizing by N.

A related way to view smoothing is as discounting (lowering) some non-zero counts in order to get the probability mass that will be assigned to the zero counts. Thus, instead of referring to the discounted counts c*, we might describe a smoothing algorithm in terms of a relative discount d_c, the ratio of the discounted counts to the original counts:

d_c = c* / c

Now that we have the intuition for the unigram case, let’s smooth our Berkeley Restaurant Project bigrams. Figure 4.5 shows the add-one smoothed counts for the bigrams in Fig. 4.1, and Figure 4.6 shows the add-one smoothed probabilities for the bigrams in Fig. 4.2. Recall that normal bigram probabilities are computed by normalizing each row of counts by the unigram count:

P(w_n | w_{n-1}) = C(w_{n-1} w_n) / C(w_{n-1})    (4.20)


         i    want  to   eat  chinese  food  lunch  spend
i        6    828   1    10   1        1     1      3
want     3    1     609  2    7        7     6      2
to       3    1     5    687  3        1     7      212
eat      1    1     3    1    17       3     43     1
chinese  2    1     1    1    1        83    2      1
food     16   1     16   1    2        5     1      1
lunch    3    1     1    1    1        2     1      1
spend    2    1     2    1    1        1     1      1

Figure 4.5 Add-one smoothed bigram counts for eight of the words (out of V = 1446) in the Berkeley Restaurant Project corpus of 9332 sentences. Previously-zero counts are in gray.

For add-one smoothed bigram counts, we need to augment the unigram count by the number of total word types in the vocabulary V :

P*_Laplace(w_n | w_{n-1}) = (C(w_{n-1} w_n) + 1) / (C(w_{n-1}) + V)    (4.21)

Thus, each of the unigram counts given in the previous section will need to be augmented by V = 1446. The result is the smoothed bigram probabilities in Fig. 4.6.

         i        want     to       eat      chinese  food     lunch    spend
i        0.0015   0.21     0.00025  0.0025   0.00025  0.00025  0.00025  0.
want     0.0013   0.00042  0.26     0.00084  0.0029   0.0029   0.0025   0.
to       0.00078  0.00026  0.0013   0.18     0.00078  0.00026  0.0018   0.
eat      0.00046  0.00046  0.0014   0.00046  0.0078   0.0014   0.02     0.
chinese  0.0012   0.00062  0.00062  0.00062  0.00062  0.052    0.0012   0.
food     0.0063   0.00039  0.0063   0.00039  0.00079  0.002    0.00039  0.
lunch    0.0017   0.00056  0.00056  0.00056  0.00056  0.0011   0.00056  0.
spend    0.0012   0.00058  0.0012   0.00058  0.00058  0.00058  0.00058  0.

Figure 4.6 Add-one smoothed bigram probabilities for eight of the words (out of V = 1446) in the BeRP corpus of 9332 sentences. Previously-zero probabilities are in gray.

It is often convenient to reconstruct the count matrix so we can see how much a smoothing algorithm has changed the original counts. These adjusted counts can be computed by Eq. 4.22. Figure 4.7 shows the reconstructed counts.

c*(w_{n-1} w_n) = [C(w_{n-1} w_n) + 1] × C(w_{n-1}) / (C(w_{n-1}) + V)    (4.22)

         i     want   to     eat    chinese  food  lunch  spend
i        3.8   527    0.64   6.4    0.64     0.64  0.64   1.
want     1.2   0.39   238    0.78   2.7      2.7   2.3    0.
to       1.9   0.63   3.1    430    1.9      0.63  4.4    133
eat      0.34  0.34   1      0.34   5.8      1     15     0.
chinese  0.2   0.098  0.098  0.098  0.098    8.2   0.2    0.
food     6.9   0.43   6.9    0.43   0.86     2.2   0.43   0.
lunch    0.57  0.19   0.19   0.19   0.19     0.38  0.19   0.
spend    0.32  0.16   0.32   0.16   0.16     0.16  0.16   0.

Figure 4.7 Add-one reconstituted counts for eight words (of V = 1446) in the BeRP corpus of 9332 sentences. Previously-zero counts are in gray.
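A minimal Python sketch of add-one smoothing for bigrams (Eq. 4.21), run here on the mini-corpus from Section 4.1 rather than the BeRP data; the function name is mine.

```python
from collections import defaultdict

def laplace_bigram(sentences):
    """Add-one (Laplace) smoothed bigram estimates, Eq. 4.21:
    P*(w_n | w_{n-1}) = (C(w_{n-1} w_n) + 1) / (C(w_{n-1}) + V)."""
    unigram, bigram, vocab = defaultdict(int), defaultdict(int), set()
    for sent in sentences:
        tokens = ["<s>"] + sent.split() + ["</s>"]
        vocab.update(tokens)
        for w1, w2 in zip(tokens, tokens[1:]):
            unigram[w1] += 1
            bigram[(w1, w2)] += 1
    V = len(vocab)
    def prob(w, prev):
        return (bigram[(prev, w)] + 1) / (unigram[prev] + V)
    return prob

p = laplace_bigram(["I am Sam", "Sam I am", "I do not like green eggs and ham"])
print(p("am", "I"))     # seen bigram: (2 + 1) / (3 + V)
print(p("green", "I"))  # unseen bigram now gets a small non-zero probability
```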


In a slightly more sophisticated version of linear interpolation, each λ weight is computed by conditioning on the context. This way, if we have particularly accurate counts for a particular bigram, we assume that the counts of the trigrams based on this bigram will be more trustworthy, so we can make the λs for those trigrams higher and thus give that trigram more weight in the interpolation. Equation 4.26 shows the equation for interpolation with context-conditioned weights:

P̂(w_n | w_{n-2} w_{n-1}) = λ_1(w_{n-2}^{n-1}) P(w_n | w_{n-2} w_{n-1})
                          + λ_2(w_{n-2}^{n-1}) P(w_n | w_{n-1})
                          + λ_3(w_{n-2}^{n-1}) P(w_n)    (4.26)
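For intuition, here is a sketch of the simpler, non-context-conditioned interpolation (the form referred to as Eq. 4.24 in the text); the λ values and the component models below are illustrative placeholders, not values learned from a held-out corpus.

```python
def interpolate(p_uni, p_bi, p_tri, lambdas=(0.1, 0.3, 0.6)):
    """Simple linear interpolation of unigram, bigram, and trigram estimates:
    P_hat(w_n | w_{n-2} w_{n-1}) = l1*P(w_n) + l2*P(w_n|w_{n-1}) + l3*P(w_n|w_{n-2} w_{n-1}).
    The lambdas must sum to 1 so the mixture is still a probability."""
    l1, l2, l3 = lambdas
    assert abs(l1 + l2 + l3 - 1.0) < 1e-9
    def prob(w, w1, w2):
        return l1 * p_uni(w) + l2 * p_bi(w, w2) + l3 * p_tri(w, w1, w2)
    return prob

# Stub component models with invented values, just to show the mixing.
p = interpolate(lambda w: 0.001,
                lambda w, prev: 0.01,
                lambda w, w1, w2: 0.0)   # trigram unseen: MLE estimate is 0
print(p("food", "want", "chinese"))      # non-zero despite the zero trigram
```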

How are these λ values set? Both the simple interpolation and conditional interpolation λs are learned from a held-out corpus. A held-out corpus is an additional training corpus that we use to set hyperparameters like these λ values, by choosing the λ values that maximize the likelihood of the held-out corpus. That is, we fix the N-gram probabilities and then search for the λ values that—when plugged into Eq. 4.24—give us the highest probability of the held-out set. There are various ways to find this optimal set of λs. One way is to use the EM algorithm defined in Chapter 9, which is an iterative learning algorithm that converges on locally optimal λs (Jelinek and Mercer, 1980).

In a backoff N-gram model, if the N-gram we need has zero counts, we approximate it by backing off to the (N-1)-gram. We continue backing off until we reach a history that has some counts. In order for a backoff model to give a correct probability distribution, we have to discount the higher-order N-grams to save some probability mass for the lower-order N-grams. Just as with add-one smoothing, if the higher-order N-grams aren’t discounted and we just used the undiscounted MLE probability, then as soon as we replaced an N-gram which has zero probability with a lower-order N-gram, we would be adding probability mass, and the total probability assigned to all possible strings by the language model would be greater than 1! In addition to this explicit discount factor, we’ll need a function α to distribute this probability mass to the lower-order N-grams.

This kind of backoff with discounting is also called Katz backoff. In Katz backoff we rely on a discounted probability P* if we’ve seen this N-gram before (i.e., if we have non-zero counts). Otherwise, we recursively back off to the Katz probability for the shorter-history (N-1)-gram. The probability for a backoff N-gram P_BO is thus computed as follows:

P_BO(w_n | w_{n-N+1}^{n-1}) =
    P*(w_n | w_{n-N+1}^{n-1}),                         if C(w_{n-N+1}^n) > 0
    α(w_{n-N+1}^{n-1}) P_BO(w_n | w_{n-N+2}^{n-1}),    otherwise    (4.27)

Katz backoff is often combined with a smoothing method called Good-Turing. The combined Good-Turing backoff algorithm involves quite detailed computation for estimating the Good-Turing smoothing and the P* and α values.


4.5 Kneser-Ney Smoothing

One of the most commonly used and best performing N-gram smoothing methods is the interpolated Kneser-Ney algorithm (Kneser and Ney 1995, Chen and Goodman 1998).

Kneser-Ney has its roots in a method called absolute discounting. Recall that discounting of the counts for frequent N-grams is necessary to save some probability mass for the smoothing algorithm to distribute to the unseen N-grams. To see this, we can use a clever idea from Church and Gale (1991). Consider an N-gram that has count 4. We need to discount this count by some amount. But how much should we discount it? Church and Gale’s clever idea was to look at a held-out corpus and just see what the count is for all those bigrams that had count 4 in the training set. They computed a bigram grammar from 22 million words of AP newswire and then checked the counts of each of these bigrams in another 22 million words. On average, a bigram that occurred 4 times in the first 22 million words occurred 3.23 times in the next 22 million words. The following table from Church and Gale (1991) shows these counts for bigrams with c from 0 to 9:

Bigram count in training set    Bigram count in heldout set
0                               0.
1                               0.
2                               1.
3                               2.
4                               3.
5                               4.
6                               5.
7                               6.
8                               7.
9                               8.

Figure 4.8 For all bigrams in 22 million words of AP newswire of count 0, 1, 2,...,9, the counts of these bigrams in a held-out corpus also of 22 million words.

The astute reader may have noticed that except for the held-out counts for 0 and 1, all the other bigram counts in the held-out set could be estimated pretty well by just subtracting 0.75 from the count in the training set! Absolute discounting formalizes this intuition by subtracting a fixed (absolute) discount d from each count. The intuition is that since we have good estimates already for the very high counts, a small discount d won’t affect them much. It will mainly modify the smaller counts, for which we don’t necessarily trust the estimate anyway, and Fig. 4.8 suggests that in practice this discount is actually a good one for bigrams with counts 2 through 9. The equation for interpolated absolute discounting applied to bigrams:

P_AbsoluteDiscounting(w_i | w_{i-1}) = [C(w_{i-1} w_i) − d] / ∑_v C(w_{i-1} v) + λ(w_{i-1}) P(w_i)    (4.28)

The first term is the discounted bigram, and the second term is the unigram with an interpolation weight λ. We could just set all the d values to .75, or we could keep a separate discount value of 0.5 for the bigrams with counts of 1.

Kneser-Ney discounting (Kneser and Ney, 1995) augments absolute discounting with a more sophisticated way to handle the lower-order unigram distribution.
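A sketch of interpolated absolute discounting for bigrams (Eq. 4.28) follows. Eq. 4.28 leaves λ(w_{i−1}) unspecified, so this sketch uses the conventional normalization (d times the number of distinct continuations of the context, divided by the context's total count); the toy counts are invented.

```python
from collections import defaultdict

def absolute_discounting(bigram_counts, unigram_probs, d=0.75):
    """Interpolated absolute discounting (Eq. 4.28): subtract a fixed discount d
    from every non-zero bigram count and give the saved mass to the unigram
    distribution via the interpolation weight lambda(w_{i-1})."""
    context_totals = defaultdict(int)
    context_types = defaultdict(int)
    for (w1, _), c in bigram_counts.items():
        context_totals[w1] += c
        context_types[w1] += 1          # number of distinct continuations of w1
    def prob(w, prev):
        total = context_totals[prev]
        if total == 0:
            return unigram_probs(w)
        discounted = max(bigram_counts.get((prev, w), 0) - d, 0) / total
        lam = d * context_types[prev] / total   # mass saved by discounting
        return discounted + lam * unigram_probs(w)
    return prob

# Toy counts and a stub unigram model, invented for illustration.
counts = {("denied", "the"): 9, ("denied", "any"): 1}
p = absolute_discounting(counts, unigram_probs=lambda w: 0.01)
print(p("the", "denied"), p("report", "denied"))
```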


word types that we discounted; in other words, the number of times we applied the normalized discount. The general recursive formulation is as follows:

P_KN(w_i | w_{i-n+1}^{i-1}) = max(c_KN(w_{i-n+1}^i) − d, 0) / ∑_v c_KN(w_{i-n+1}^{i-1} v) + λ(w_{i-n+1}^{i-1}) P_KN(w_i | w_{i-n+2}^{i-1})    (4.35)

where the definition of the count c_KN depends on whether we are counting the highest-order N-gram being interpolated (for example trigram if we are interpolating trigram, bigram, and unigram) or one of the lower-order N-grams (bigram or unigram if we are interpolating trigram, bigram, and unigram):

c_KN(·) =
    count(·)              for the highest order
    continuationcount(·)  for lower orders    (4.36)

The continuation count is the number of unique single word contexts for ·. At the termination of the recursion, unigrams are interpolated with the uniform distribution, where the parameter ε is the empty string:

P_KN(w) = max(c_KN(w) − d, 0) / ∑_{w′} c_KN(w′) + λ(ε) · 1/V    (4.37)

If we want to include an unknown word <UNK>, it’s just included as a regular vocabulary entry with count zero, and hence its probability will be a lambda-weighted uniform distribution λ(ε)/V.

The best-performing version of Kneser-Ney smoothing is called modified Kneser-Ney smoothing, and is due to Chen and Goodman (1998). Rather than use a single fixed discount d, modified Kneser-Ney uses three different discounts d_1, d_2, and d_3+ for N-grams with counts of 1, 2, and three or more, respectively. See Chen and Goodman (1998, p. 19) or Heafield et al. (2013) for the details.
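To make the continuation-count idea concrete, here is a compact sketch of interpolated Kneser-Ney for bigrams (a single discount d, not the modified three-discount variant); the corpus and function names are illustrative.

```python
from collections import defaultdict

def kneser_ney_bigram(sentences, d=0.75):
    """Interpolated Kneser-Ney for bigrams: a discounted bigram estimate plus a
    lambda-weighted continuation-count unigram P_CONT(w)."""
    bigram = defaultdict(int)
    context_total = defaultdict(int)
    continuations = defaultdict(set)      # w -> set of distinct left contexts
    for sent in sentences:
        tokens = ["<s>"] + sent.split() + ["</s>"]
        for w1, w2 in zip(tokens, tokens[1:]):
            bigram[(w1, w2)] += 1
            context_total[w1] += 1
            continuations[w2].add(w1)
    total_bigram_types = len(bigram)
    def p_continuation(w):
        # How many distinct contexts w completes, normalized by all bigram types.
        return len(continuations[w]) / total_bigram_types
    def prob(w, prev):
        total = context_total[prev]
        if total == 0:
            return p_continuation(w)
        types_after_prev = sum(1 for (w1, _) in bigram if w1 == prev)
        lam = d * types_after_prev / total
        return max(bigram.get((prev, w), 0) - d, 0) / total + lam * p_continuation(w)
    return prob

p = kneser_ney_bigram(["I am Sam", "Sam I am", "I do not like green eggs and ham"])
print(p("am", "I"), p("ham", "I"))
```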

4.6 The Web and Stupid Backoff

By using text from the web, it is possible to build extremely large language models. In 2006 Google released a very large set of N-gram counts, including N-grams (1-grams through 5-grams) from all the five-word sequences that appear at least 40 times from 1,024,908,267,229 words of running text on the web; this includes 1,176,470,663 five-word sequences using over 13 million unique word types (Franz and Brants, 2006). Some examples:

5-gram                        Count
serve as the incoming         92
serve as the incubator        99
serve as the independent      794
serve as the index            223
serve as the indication       72
serve as the indicator        120
serve as the indicators       45
serve as the indispensable    111
serve as the indispensible    40
serve as the individual       234


Efficiency considerations are important when building language models that use such large sets of N-grams. Rather than store each word as a string, it is generally represented in memory as a 64-bit hash number, with the words themselves stored on disk. Probabilities are generally quantized using only 4-8 bits (instead of 8-byte floats), and N-grams are stored in reverse tries. N-grams can also be shrunk by pruning, for example only storing N-grams with counts greater than some threshold (such as the count threshold of 40 used for the Google N-gram release) or using entropy to prune less-important N-grams (Stolcke, 1998). Another option is to build approximate language models using techniques like Bloom filters (Talbot and Osborne 2007, Church et al. 2007). Finally, efficient language model toolkits like KenLM (Heafield 2011, Heafield et al. 2013) use sorted arrays, efficiently combine probabilities and backoffs in a single value, and use merge sorts to efficiently build the probability tables in a minimal number of passes through a large corpus.

Although with these toolkits it is possible to build web-scale language models using full Kneser-Ney smoothing, Brants et al. (2007) show that with very large language models a much simpler algorithm may be sufficient. The algorithm is called stupid backoff. Stupid backoff gives up the idea of trying to make the language model a true probability distribution. There is no discounting of the higher-order probabilities. If a higher-order N-gram has a zero count, we simply backoff to a lower order N-gram, weighed by a fixed (context-independent) weight. This algorithm does not produce a probability distribution, so we’ll follow Brants et al. (2007) in referring to it as S:

S(w_i | w_{i-k+1}^{i-1}) =
    count(w_{i-k+1}^i) / count(w_{i-k+1}^{i-1})    if count(w_{i-k+1}^i) > 0
    λ S(w_i | w_{i-k+2}^{i-1})                     otherwise    (4.38)

The backoff terminates in the unigram, which has probability S(w) = count(w)/N. Brants et al. (2007) find that a value of 0.4 worked well for λ.
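A sketch of stupid backoff scoring (Eq. 4.38) with λ = 0.4 follows; the toy count table and function names are invented, and a real system would store the counts far more compactly, as described above.

```python
def stupid_backoff(ngram_counts, total_words, lam=0.4):
    """Stupid backoff (Eq. 4.38): relative frequency if the N-gram was seen,
    otherwise lambda times the score of the shorter context; terminates in the
    unigram relative frequency count(w)/N. Returns a score, not a probability."""
    def score(word, context):
        context = tuple(context)
        if not context:
            return ngram_counts.get((word,), 0) / total_words
        full = context + (word,)
        num = ngram_counts.get(full, 0)
        if num > 0:
            return num / ngram_counts.get(context, 1)
        return lam * score(word, context[1:])
    return score

# Toy counts keyed by token tuples, invented for illustration.
counts = {("i",): 5, ("want",): 3, ("i", "want"): 2, ("i", "want", "food"): 0,
          ("want", "food"): 1, ("food",): 2}
S = stupid_backoff(counts, total_words=10)
print(S("want", ("i",)))         # seen bigram: 2/5
print(S("food", ("i", "want")))  # unseen trigram: 0.4 * count(want food)/count(want)
```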

4.7 Advanced: Perplexity’s Relation to Entropy

We introduced perplexity in Section 4.2.1 as a way to evaluate N-gram models on a test set. A better N-gram model is one that assigns a higher probability to the test data, and perplexity is a normalized version of the probability of the test set. The perplexity measure actually arises from the information-theoretic concept of cross-entropy, which explains otherwise mysterious properties of perplexity (why the inverse probability, for example?) and its relationship to entropy. Entropy is a measure of information. Given a random variable X ranging over whatever we are predicting (words, letters, parts of speech, the set of which we’ll call χ) and with a particular probability function, call it p(x), the entropy of the random variable X is:

H(X) = − ∑_{x∈χ} p(x) log_2 p(x)    (4.39)

The log can, in principle, be computed in any base. If we use log base 2, the resulting value of entropy will be measured in bits. One intuitive way to think about entropy is as a lower bound on the number of bits it would take to encode a certain decision or piece of information in the optimal