# On the terminology of Bayes' theorem

Take Bob. Bob is not feeling so great and has a runny nose. This is an observation that may be explained by various underlying conditions. Perhaps Bob has a cold, perhaps he has allergies, or perhaps he unfortunately picked up COVID-19. Given that we know Bob has a runny nose, which of these potential explanations is most likely (assuming for the sake of simplicity that these are the only three options)?

Bayes’ theorem allows us to formulate an answer to that question. Let `R=runny nose` stand for the observation that Bob has a runny nose, and let `H` be the variable indicating the three hypotheses. Then Bayes’ theorem looks as follows:

\[P(H|R)=\frac{P(R|H) \times P(H)}{P(R)}\]

Where e.g. `P(H=cold|R=True)` should be read as “the probability that Bob has a cold, given that we know he has a runny nose”. Our confidence in each of the three hypotheses depends on several factors.

To start with, it depends on how likely that hypothesis is in the first place. For example, even during the current pandemic, if you look at the whole population it is still more likely that you have a regular cold than COVID-19. This is called the “prior” probability of the hypothesis. Our confidence in the hypotheses also depends on the conditional probability `P(R|H)`, which answers questions like: “Assuming I have COVID-19, how likely is it then that I develop a runny nose?” This probability is commonly called the “likelihood” of the hypothesis. Finally, Bayes’ theorem normalizes the whole bunch into a proper probability by taking into account the “marginal” probability of someone (a random individual from the overall population) developing a runny nose. Bringing it all together: Bob is, for example, more likely to have a regular cold than COVID-19 if 1) a cold occurs more commonly across the population and 2) almost all people who have a cold have a runny nose (i.e. a cold would be a good “explanation” of the symptoms).

In this terminology we can informally rewrite Bayes’ theorem as:

\[posterior\ probability = \frac{likelihood \times prior}{marginal\ probability}\]
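
To make these terms concrete, here is a minimal Python sketch that plugs made-up numbers into the formula. The priors and likelihoods below are purely illustrative (not real medical figures) and assume, as above, that the three hypotheses are the only options:

```python
# A minimal numeric sketch of Bayes' theorem for Bob's runny nose.
# All numbers are made up purely for illustration.

priors = {"cold": 0.70, "allergies": 0.25, "covid": 0.05}        # P(H)
likelihoods = {"cold": 0.90, "allergies": 0.70, "covid": 0.40}   # P(R|H)

# Marginal probability of a runny nose, P(R), summed over the three hypotheses
# (we assume these are the only possible explanations).
marginal = sum(priors[h] * likelihoods[h] for h in priors)

# Posterior P(H|R) = P(R|H) * P(H) / P(R) for each hypothesis
posteriors = {h: likelihoods[h] * priors[h] / marginal for h in priors}

for h, p in posteriors.items():
    print(f"P({h} | runny nose) = {p:.2f}")
```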

The probability resulting from Bayes’ theorem is usually called the “posterior” probability, because it expresses how much our “prior” confidence in `H={h1, h2, h3}` has changed *after* we learn that Bob has a runny nose. Of course, these probabilities change again once Bob takes a test for any of these ailments, and even then Bayes’ theorem allows us to take the probabilities of false positives and false negatives into account.
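
To sketch that last point, here is how a positive test result would update the COVID-19 hypothesis, assuming hypothetical test characteristics (the sensitivity and false positive rate below are chosen purely for illustration):

```python
# Sketch: updating the belief that Bob has COVID-19 after a positive test.
# The prior and the test characteristics are hypothetical.

prior = 0.05            # P(covid) before seeing the test result
sensitivity = 0.90      # P(positive | covid), i.e. 1 - false negative rate
false_positive = 0.02   # P(positive | not covid)

# Marginal probability of a positive test result
p_positive = sensitivity * prior + false_positive * (1 - prior)

# Posterior after observing a positive test
posterior = sensitivity * prior / p_positive
print(f"P(covid | positive test) = {posterior:.2f}")
```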

In short, Bayes’ theorem is pretty awesome because it can be used to express how the probability of one event depends on related possible events. The application of Bayes’ theorem is not necessarily Bayesian, though. A typical Bayesian would additionally interpret the probabilities involved as “credences” or “degrees of belief”, and then apply the process of *conditionalization* (the diachronic application of Bayes’ theorem across an “old” and a “new” moment in time) to express how our (subjective) beliefs change when we learn new information.

Now, there’s one term in Bayes’ theorem that caused some confusion when I first considered it more closely: the so-called “likelihood” `P(R|H)`. Bayes’ theorem is derived from the definition of conditional probability. As a conditional probability, we would for example read `P(R=True|H=COVID-19)` as follows: “How likely is it that Bob has a runny nose, assuming he has COVID-19?” In that case, we talk of the likelihood of the *data*, in this case Bob’s observed symptom. In the literature, however, `P(R=True|H=COVID-19)` is usually called the likelihood *of the hypothesis*.

What cleared up my confusion was to have a look at the mathematics of *maximum likelihood estimation*. I’ll get technical for one minute and then recap in more understandable language.

Assume we have a parametric model with parameters \(\theta\), e.g. a probability density function \(p_{\theta}(x)\). The *likelihood* of this parametric model given observed data \(X\) is then defined as \(L(\theta|X) = \prod_{x \in X} p_{\theta}(x)\) (in practice one usually maximizes its logarithm, the log-likelihood, which has the same maximizer). What we are doing with maximum likelihood estimation is finding the best parametric model given the observed data, which means that we want to choose the parameters of our model such that the data is most likely under this parametric model.

This is, by definition, the task of finding the maximum likelihood:

\[\hat{\theta} = \operatorname{arg\,max}_{\theta} L(\theta|X)\]

where \(\hat{\theta}\) is known as the *maximum likelihood estimate* (MLE). In slightly more understandable terms: the likelihood of parameters \(\theta\) for a given observation is how likely that observation is under the parametric model with parameters \(\theta\).
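
To tie the definition back to something tangible, here is a small Python sketch of MLE for a Bernoulli model (a biased coin). The data and the grid search are purely illustrative; for this model the analytical MLE is simply the sample mean:

```python
import math

# Sketch of maximum likelihood estimation for a Bernoulli model.
# Hypothetical coin flips (1 = heads, 0 = tails).
X = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]

def log_likelihood(theta, data):
    """Log-likelihood: sum of log p_theta(x) over the observed data."""
    return sum(math.log(theta if x == 1 else 1 - theta) for x in data)

# Crude grid search for the theta that maximizes the likelihood
thetas = [i / 1000 for i in range(1, 1000)]
mle = max(thetas, key=lambda t: log_likelihood(t, X))

print(f"MLE estimate: {mle:.3f}")              # ~0.7
print(f"Sample mean:  {sum(X) / len(X):.3f}")  # the analytical MLE for this model
```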

And now in plain language: a good hypothesis/model assigns a high probability to the observed data.

So when `P(R=True|H=COVID-19)` is called the likelihood of the hypothesis `H=COVID-19`, this is because that hypothesis assigns a high probability to the symptom of a runny nose and is thus a likely “explanation” of that symptom (where we understand explanation bare-bones, in terms of probability). The bottom line is that a hypothesis is more likely when it makes the data that we *observed* and *know* to be the case more likely.
