N-Grams: Joint Probability
Let’s start by examining the simplest case of joint probability: the probability of a single word, via a unigram. If we want to know the probability of a word, without any context appearing before or after it, we can take a corpus of text and convert it into a dictionary with each word as the key and its count as the value: { "a": 1000, "an": 3, "animal": 12, ... }. Say we want to calculate P("a"), the likelihood of “a.” We take the number of occurrences of the word “a” and divide it by the total number of words in the corpus: P("a") ≈ count("a") / corpus_word_count. This is called maximum likelihood estimation (MLE), and we use the “≈” sign because our corpus is only a sample and not a perfect representation of the word’s true likelihood. ...
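As a minimal sketch of this idea, here is one way the unigram MLE could be computed in Python. The corpus, the `unigram_probability` function name, and the toy sentence are all illustrative assumptions, not part of the original text:

```python
from collections import Counter

def unigram_probability(word, corpus_tokens):
    """Estimate P(word) by maximum likelihood: count(word) / total token count."""
    counts = Counter(corpus_tokens)          # word -> count dictionary, e.g. {"a": 3, ...}
    return counts[word] / len(corpus_tokens) # relative frequency in the corpus

# Hypothetical toy corpus; a real estimate would use a much larger corpus.
corpus = "a cat saw a dog and an animal saw a cat".split()
print(unigram_probability("a", corpus))  # 3 / 11 ≈ 0.27
```

With a realistic corpus the dictionary of counts would be built once and reused for every lookup, rather than recounted per query as in this sketch.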