The Frequentist Hypothesis Testing method

When Primer published this video a few days ago, I jumped at the chance to learn more about the Frequentist Hypothesis Testing method.

I had heard of the p-value and the null hypothesis before but had never really understood what they were. Following Primer's explanation, I started adding a new class to LibCapy to implement that method, but quickly got lost. Something wasn't quite right, not in his explanations, but in my understanding. For example, what if we change the game so that the winner is the one with the fewest heads? The mean of the cheater's distribution moves to the left of the fair player's, and the definitions of p-value and statistical power, as they are given in the video, don't make sense any more (to me at least). I felt that there should be some kind of symmetry, and that the reasoning should be the same wherever the distribution of the null hypothesis is, relative to the one of the alternative hypothesis.

Then I looked for other materials on the web, and found in particular these slides from a lecture at MIT, which confirmed my gut feeling. So, to nail it down, I'm writing this article, explaining to myself how I understand it.

We make two hypotheses: the null hypothesis (the one we want to check), and the alternate hypothesis (the one that is correct if the null hypothesis isn't). In the example of Primer, the null hypothesis is "the player is a cheater", and the alternate hypothesis is "the player is fair". We could also have more than one alternate hypothesis to check the null hypothesis against; the reasoning stays the same. Each hypothesis is described as a probability distribution. In the example of Primer, it is the probability distribution of getting \(n\) heads after \(N\) plays, one for each hypothesis.

We define a range for \(n\) on the distribution of the null hypothesis such that it covers "most of" that distribution. In the test rule of the video, \(n\in[16,23]\) is chosen (cf. 14:26), which covers approximately 80.4% of the distribution for the cheater when \(N=23\). We can also calculate the proportion of the distribution of the alternate hypothesis on that same range: in the test rule of the video it is approximately 4.7%. How the range is chosen depends on what we are looking at and what we are trying to check. That's where my confusion came from when watching Primer's video. In general, we will have a normal distribution and want to look at the range spanning symmetrically around its mean. But in the example of the video, a very low \(n\) would be a sign of fairness, and a very high \(n\) would be a sign of cheating. It then makes sense to consider the right-hand side range for the null hypothesis, and the left-hand side range for the alternate hypothesis.
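The two proportions quoted above can be checked with a few lines of Python. This is only a sketch under assumptions: each distribution is taken to be binomial, the fair coin lands heads with probability 0.5, and the cheater's coin with probability 0.75. That last value is not stated in this article; it is used here because it reproduces the 80.4% figure. The function names are made up for the illustration.

```python
from math import comb

def binomial_pmf(n: int, N: int, p: float) -> float:
    """Probability of exactly n heads in N flips when a flip lands heads with probability p."""
    return comb(N, n) * p**n * (1 - p) ** (N - n)

def proportion_in_range(lo: int, hi: int, N: int, p: float) -> float:
    """Share of the Binomial(N, p) distribution lying on the range [lo, hi]."""
    return sum(binomial_pmf(n, N, p) for n in range(lo, hi + 1))

N = 23
P_FAIR = 0.5
P_CHEATER = 0.75  # assumption, not given in the article

cheater_share = proportion_in_range(16, 23, N, P_CHEATER)
fair_share = proportion_in_range(16, 23, N, P_FAIR)
print(f"cheater: {cheater_share:.3f}")  # prints "cheater: 0.804"
print(f"fair:    {fair_share:.3f}")     # prints "fair:    0.047"
```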

The proportion inside the range for the alternate hypothesis is called the "p-value". The proportion inside the range for the null hypothesis is called the "statistical power". To affirm with confidence that an event satisfies the null hypothesis (in the example, the event is "a player having \(n\) heads after \(N\) tries") we want: the event to be in the range, the statistical power to be as large as possible, and the p-value to be as small as possible.
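The three criteria can be gathered into one small function. The sketch below keeps the names exactly as they are defined in this article; note that common textbook usage differs (there, the 4.7% of the example would typically be called the significance level). The binomial model, the cheater's heads probability of 0.75, and the names `RangeTest` and `range_test` are assumptions made for the illustration.

```python
from math import comb
from typing import NamedTuple

class RangeTest(NamedTuple):
    in_range: bool   # is the observed event inside the chosen range?
    power: float     # share of the null distribution inside the range (article's naming)
    p_value: float   # share of the alternate distribution inside the range (article's naming)

def share(lo: int, hi: int, N: int, p: float) -> float:
    """Share of the Binomial(N, p) distribution lying on the range [lo, hi]."""
    return sum(comb(N, n) * p**n * (1 - p) ** (N - n) for n in range(lo, hi + 1))

def range_test(n: int, lo: int, hi: int, N: int, p_null: float, p_alt: float) -> RangeTest:
    """Evaluate an observed count n against the range [lo, hi]."""
    return RangeTest(lo <= n <= hi, share(lo, hi, N, p_null), share(lo, hi, N, p_alt))

# A player with 18 heads after 23 flips, null = cheater (0.75, assumed), alternate = fair (0.5).
result = range_test(n=18, lo=16, hi=23, N=23, p_null=0.75, p_alt=0.5)
print(result)
```

A result with `in_range` true, a large `power` and a small `p_value` is the situation in which we would label the player a cheater with confidence.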

If the distributions don't overlap much, it's easy to have a range including a large proportion of the null hypothesis distribution (i.e. a large statistical power) without including too much of the alternate hypothesis distribution (i.e. a small p-value). The two hypotheses are easy to tell apart: it's easy to judge whether an event matches the null hypothesis, and unlikely that we mislabel an event from the alternate hypothesis.

If the distributions overlap a lot, we can't have a range including a large proportion of the null hypothesis without including a large proportion of the alternate hypothesis too. We have to sacrifice either the statistical power or the p-value. This expresses the uncertainty about the decision we make about whether or not an event matches the null hypothesis.
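To see this trade-off on the coin example, one can slide the lower bound of the range and watch both proportions move together. A minimal sketch, assuming binomial distributions, \(N=23\), a fair coin at 0.5 and a cheater's coin at 0.75 (the latter is an assumption, not a value from this article):

```python
from math import comb

def tail_share(lo: int, N: int, p: float) -> float:
    """Share of the Binomial(N, p) distribution lying on the range [lo, N]."""
    return sum(comb(N, n) * p**n * (1 - p) ** (N - n) for n in range(lo, N + 1))

N = 23
# One row per candidate range [lo, N]: (lo, share of cheater distribution, share of fair distribution)
rows = [(lo, tail_share(lo, N, 0.75), tail_share(lo, N, 0.5)) for lo in range(12, 21)]
for lo, cheater, fair in rows:
    print(f"n in [{lo}, {N}]: cheater {cheater:.3f}, fair {fair:.3f}")
```

Widening the range (smaller lower bound) raises both columns, and narrowing it lowers both: one cannot be improved without degrading the other.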

In the example of the video, as \(N\) increases the distance between the means of the two distributions increases as well, allowing a larger statistical power and a smaller p-value, hence a more confident judgement.
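The effect of \(N\) can be illustrated by fixing a cap on the proportion of the fair distribution inside the range and looking at how much of the cheater distribution the resulting range covers. In this sketch the 5% cap, the cheater's heads probability of 0.75, the chosen values of \(N\) and the function names are all assumptions picked for the illustration.

```python
from math import comb

def tail_share(lo: int, N: int, p: float) -> float:
    """Share of the Binomial(N, p) distribution lying on the range [lo, N]."""
    return sum(comb(N, n) * p**n * (1 - p) ** (N - n) for n in range(lo, N + 1))

def smallest_lower_bound(N: int, p_fair: float, max_fair_share: float) -> int:
    """Smallest lo such that [lo, N] holds at most max_fair_share of the fair distribution."""
    for lo in range(N + 1):
        # tail_share decreases as lo grows, so the first hit is the smallest bound
        if tail_share(lo, N, p_fair) <= max_fair_share:
            return lo
    return N + 1  # empty range: even the single outcome n = N exceeds the cap

cheater_by_N = {}
for N in (10, 23, 50, 100):
    lo = smallest_lower_bound(N, 0.5, 0.05)
    cheater_by_N[N] = tail_share(lo, N, 0.75)
    print(f"N={N}: range [{lo}, {N}], cheater share {cheater_by_N[N]:.3f}")
```

For these values of \(N\), the share of the cheater distribution inside the range grows with \(N\) while the share of the fair distribution stays under the cap.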

In conclusion, what matters is not finding the range itself, which is just a concrete example to explain how frequentist hypothesis testing can be used. What we are really looking at here is: given an event and two distributions each representing a hypothesis ("event matches distribution 1", and "event matches distribution 2"), and given a level of confidence when saying "event matches distribution 1" and another level of confidence when saying "event doesn't match distribution 2", is the affirmation "event matches distribution 1 and doesn't match distribution 2" correct or incorrect? In other words, "is the event in the (statistical power)% most frequent events of distribution 1 and not in the (1.0 - p-value)% most frequent events of distribution 2", where "most frequent events of a distribution" is to be defined based on what we are testing.
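As a check of the symmetry I was looking for, the sketch below compares the original game with a mirrored one where the fewest heads win. In the mirrored game the cheater's coin is assumed to land heads 25% of the time and the range becomes the left-hand tail \(n\in[0,7]\); these values, like the 75% of the original game, are assumptions chosen for the illustration.

```python
from math import comb

def share(lo: int, hi: int, N: int, p: float) -> float:
    """Share of the Binomial(N, p) distribution lying on the range [lo, hi]."""
    return sum(comb(N, n) * p**n * (1 - p) ** (N - n) for n in range(lo, hi + 1))

N = 23
# Original game: many heads suggest cheating, so the range is the right-hand tail.
most_heads = (share(16, 23, N, 0.75), share(16, 23, N, 0.5))
# Mirrored game: few heads suggest cheating, so the range is the left-hand tail.
fewest_heads = (share(0, 7, N, 0.25), share(0, 7, N, 0.5))

print("most heads win:  ", most_heads)
print("fewest heads win:", fewest_heads)
```

Both games give the same pair of proportions: only the definition of the range changed, not the reasoning.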