Kneser–Ney smoothing


Kneser–Ney smoothing, also known as Kneser–Essen–Ney smoothing, is a method primarily used to calculate the probability distribution of n-grams in a document based on their histories.[1] It is widely considered the most effective method of smoothing due to its use of absolute discounting, in which a fixed value is subtracted from the count of each observed n-gram and the resulting probability mass is redistributed to lower-order models. This approach has been considered equally effective for both higher- and lower-order n-grams. The method was proposed in a 1994 paper by Reinhard Kneser, Ute Essen and Hermann Ney.[2]

A common example that illustrates the concept behind this method is the frequency of the bigram "San Francisco". If it appears several times in a training corpus, the frequency of the unigram "Francisco" will also be high. Relying only on the unigram frequency to predict the frequencies of n-grams leads to skewed results;[3] however, Kneser–Ney smoothing corrects this by considering the frequency of the unigram in relation to the number of distinct words that can precede it.
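As a rough, self-contained illustration of this skew (the toy corpus below is hypothetical, not taken from any cited source), the following Python sketch compares the raw count of "francisco" with the number of distinct words that precede it:

    from collections import Counter

    # Hypothetical toy corpus: "francisco" is frequent but only ever follows "san".
    corpus = ("san francisco is foggy . i love san francisco . "
              "san francisco has hills . the glass is full").split()

    bigrams = set(zip(corpus, corpus[1:]))
    unigram_counts = Counter(corpus)
    # Distinct words observed immediately before each word (its continuation contexts).
    preceders = {w: {a for (a, b) in bigrams if b == w} for w in unigram_counts}

    print(unigram_counts["francisco"], len(preceders["francisco"]))  # 3 occurrences, 1 distinct predecessor
    print(unigram_counts["is"], len(preceders["is"]))                # 2 occurrences, 2 distinct predecessors

Despite its high unigram count, "francisco" follows only one word in this corpus, so Kneser–Ney smoothing assigns it a low probability of appearing after a novel context.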

Method

Let $c(w, w')$ be the number of occurrences of the word $w$ followed by the word $w'$ in the corpus.

The equation for bigram probabilities is as follows:

$$p_{KN}(w_i \mid w_{i-1}) = \frac{\max(c(w_{i-1}, w_i) - \delta,\, 0)}{\sum_{w'} c(w_{i-1}, w')} + \lambda_{w_{i-1}}\, p_{KN}(w_i)$$ [4]

where the unigram probability $p_{KN}(w_i)$ depends on how likely it is to see the word $w_i$ in an unfamiliar context. It is estimated as the number of distinct words that $w_i$ appears after, divided by the number of distinct pairs of consecutive words in the corpus:

$$p_{KN}(w_i) = \frac{|\{w' : 0 < c(w', w_i)\}|}{|\{(w', w'') : 0 < c(w', w'')\}|}$$

Note that $p_{KN}$ is a proper distribution, as the values defined in the above way are non-negative and sum to one.
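As a minimal sketch of this continuation probability (using a hypothetical toy corpus; the variable names are illustrative), the counts can be computed directly from the bigram types and checked to sum to one:

    from collections import Counter

    tokens = "the cat sat on the mat the cat ate the fish".split()
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    vocab = set(tokens)

    # |{w' : c(w', w) > 0}| -- distinct left contexts of each word w.
    continuation = {w: sum(1 for (_, b) in bigram_counts if b == w) for w in vocab}
    # |{(w', w'') : c(w', w'') > 0}| -- distinct bigram types in the corpus.
    distinct_bigrams = len(bigram_counts)

    p_kn_unigram = {w: continuation[w] / distinct_bigrams for w in vocab}
    assert abs(sum(p_kn_unigram.values()) - 1.0) < 1e-12  # proper distribution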

The parameter $\delta$ is a constant which denotes the discount value subtracted from the count of each n-gram, usually between 0 and 1.

The value of the normalizing constant $\lambda_{w_{i-1}}$ is calculated to make the sum of conditional probabilities $p_{KN}(w_i \mid w_{i-1})$ over all $w_i$ equal to one. Observe that (provided $\delta < 1$) for each $w_i$ which occurs at least once in the context of $w_{i-1}$ in the corpus, we discount the probability by exactly the same constant amount $\delta / \left(\sum_{w'} c(w_{i-1}, w')\right)$, so the total discount depends linearly on the number of unique words $w_i$ that can occur after $w_{i-1}$. This total discount is a budget we can spread over all $p_{KN}(w_i \mid w_{i-1})$ proportionally to $p_{KN}(w_i)$. As the values of $p_{KN}(w_i)$ sum to one, we can simply define $\lambda_{w_{i-1}}$ to be equal to this total discount:

$$\lambda_{w_{i-1}} = \frac{\delta}{\sum_{w'} c(w_{i-1}, w')}\, |\{w' : 0 < c(w_{i-1}, w')\}|$$
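Putting the pieces together, a minimal sketch of the bigram estimate (again on a hypothetical toy corpus, with $\delta = 0.75$ chosen arbitrarily) might look like this; the final assertion checks that the probabilities conditioned on a given history sum to one:

    from collections import Counter

    tokens = "the cat sat on the mat the cat ate the fish".split()
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    vocab = sorted(set(tokens))
    delta = 0.75  # arbitrary discount in (0, 1)

    # Continuation (unigram) probability p_KN(w), as defined above.
    p_cont = {w: sum(1 for (_, b) in bigram_counts if b == w) / len(bigram_counts) for w in vocab}

    def p_kn_bigram(word, prev):
        total = sum(c for (a, _), c in bigram_counts.items() if a == prev)  # sum_w' c(prev, w')
        distinct = sum(1 for (a, _) in bigram_counts if a == prev)          # |{w' : c(prev, w') > 0}|
        lam = delta * distinct / total                                      # normalizing constant
        return max(bigram_counts[(prev, word)] - delta, 0) / total + lam * p_cont[word]

    # The conditional distribution given the history "the" sums to one.
    assert abs(sum(p_kn_bigram(w, "the") for w in vocab) - 1.0) < 1e-12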

This equation can be extended to n-grams. Let $w_{i-n+1}^{i-1}$ be the $n-1$ words before $w_i$:

$$p_{KN}(w_i \mid w_{i-n+1}^{i-1}) = \frac{\max(c(w_{i-n+1}^{i-1}, w_i) - \delta,\, 0)}{\sum_{w'} c(w_{i-n+1}^{i-1}, w')} + \delta\, \frac{|\{w' : 0 < c(w_{i-n+1}^{i-1}, w')\}|}{\sum_{w'} c(w_{i-n+1}^{i-1}, w')}\, p_{KN}(w_i \mid w_{i-n+2}^{i-1})$$ [5]
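A recursive sketch of this higher-order formula, under the same toy-corpus assumptions, is shown below. The handling of completely unseen histories (where the sums in the denominator are zero) is an implementation choice not specified by the formula above; here the sketch simply falls back to the shorter history:

    from collections import Counter

    tokens = "the cat sat on the mat the cat ate the fish".split()
    delta = 0.75  # arbitrary discount in (0, 1)

    # n-gram counts for every order up to N (order n maps an n-tuple of words to its count).
    N = 3
    counts = {n: Counter(zip(*(tokens[i:] for i in range(n)))) for n in range(2, N + 1)}

    def p_kn(word, history):
        """Kneser-Ney probability of `word` given a tuple `history` of up to N-1 preceding words."""
        if not history:
            # Base case: continuation unigram probability (distinct left contexts / distinct bigrams).
            bigrams = counts[2]
            return sum(1 for (_, b) in bigrams if b == word) / len(bigrams)
        grams = counts[len(history) + 1]
        total = sum(c for g, c in grams.items() if g[:-1] == history)  # sum_w' c(history, w')
        if total == 0:
            return p_kn(word, history[1:])  # unseen history: fall back to the shorter one
        distinct = sum(1 for g in grams if g[:-1] == history)          # |{w' : c(history, w') > 0}|
        lam = delta * distinct / total
        return max(grams[history + (word,)] - delta, 0) / total + lam * p_kn(word, history[1:])

    vocab = sorted(set(tokens))
    assert abs(sum(p_kn(w, ("the", "cat")) for w in vocab) - 1.0) < 1e-12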

This model uses the concept of absolute-discounting interpolation, which incorporates information from higher- and lower-order language models. The addition of the term for lower-order n-grams adds more weight to the overall probability when the count for the higher-order n-gram is zero.[6] Similarly, the weight of the lower-order model decreases when the count of the n-gram is non-zero.
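As a concrete (hypothetical) numerical illustration with $\delta = 0.75$: when the bigram is unseen, the lower-order term carries the entire estimate, whereas a frequently seen bigram leaves it only a small corrective role:

$$c(w_{i-1}, w_i) = 0: \qquad p_{KN}(w_i \mid w_{i-1}) = 0 + \lambda_{w_{i-1}}\, p_{KN}(w_i)$$

$$c(w_{i-1}, w_i) = 10,\ \sum_{w'} c(w_{i-1}, w') = 20: \qquad p_{KN}(w_i \mid w_{i-1}) = \frac{10 - 0.75}{20} + \lambda_{w_{i-1}}\, p_{KN}(w_i) \approx 0.46 + \lambda_{w_{i-1}}\, p_{KN}(w_i)$$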

Modified Kneser–Ney smoothing

Modifications of this method also exist. Chen and Goodman's 1998 paper lists and benchmarks several such modifications; their modified Kneser–Ney smoothing replaces the single discount $\delta$ with separate discounts for n-grams occurring once, twice, and three or more times.[7] The method has also been scaled to very large corpora: a MapReduce implementation was once used for Google Translate,[8] and KenLM is a performant open-source implementation.[9]
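As an illustrative sketch only (it assumes KenLM's Python bindings are installed and that an ARPA-format model, here given the hypothetical name model.arpa, has already been estimated, for example with KenLM's lmplz tool), querying such a model could look like:

    import kenlm  # KenLM Python bindings

    model = kenlm.Model("model.arpa")                 # hypothetical pre-built ARPA file
    print(model.score("the cat sat on the mat"))      # log10 probability of the sentence under the model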

References

  1. ^ "A Bayesian Interpretation of Interpolated Kneser-Ney". NUS School of Computing Technical Report TRA2/06.
  2. ^ Ney, Hermann; Essen, Ute; Kneser, Reinhard (January 1994). "On structuring probabilistic dependences in stochastic language modelling". Computer Speech & Language. 8 (1): 1–38. doi:10.1006/csla.1994.1001.
  3. ^ "Introduction to Computational Linguistics". Brown University.
  4. ^ "Kneser Ney Smoothing Explained".
  5. ^ "NLP Tutorial: Smoothing".
  6. ^ "An Empirical Study of Smoothing Techniques for Language Modeling".
  7. ^ "An Empirical Study of Smoothing Techniques for Language Modeling". p. 21.
  8. ^ "Large Language Models in Machine Translation".
  9. ^ "Estimation . KenLM . Code . Kenneth Heafield".