Wednesday, July 11, 2018

Forget p-values? Or just let them be what they are

The p-value has always been controversial. It is required by certain publications, banned from some journals, hated by many, yet quoted widely. Not all p-values are loved equally, though. Because of a "rule" popularized some 90 years ago, small values below 0.05 have been the crowd's favorite.

When teaching hypothesis testing, we explain that the entire spectrum of the p-value serves a single purpose: quantifying the "agreement" between an observed set of data and a statement (or claim) known as the null hypothesis.

Why are we "obsessed" with the small values, then? Why can't we talk about any specific p-value the same way we talk about today's temperature, i.e., as a measure of something?

First of all, the scale of the p-value is hard to talk about. This is different from temperature: the difference between 0.21 and 0.20 is not the same as the difference between 0.02 and 0.01.

It almost feels like we should use the reciprocal of the p-value to discuss how unlikely the corresponding observed "data" (represented by a summary/test statistic) would be, assuming the null hypothesis is true.
If the null hypothesis is true, it takes, on average, 100 independent tests to observe a p-value below 0.01. A p-value under 0.02 is twice as likely to occur, taking only about 50 tests to observe; in that sense, 0.01 is twice as unlikely as 0.02. By the same calculation, 0.21 and 0.20 are almost identical in terms of likeliness under the null.
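
To make the "how many tests until we see it" reading concrete, here is a minimal simulation sketch; the function name average_wait and the repetition count are my own, and the sketch relies only on the fact that under a true null hypothesis (with a continuous test statistic) the p-value is uniform on (0, 1):

    import numpy as np

    rng = np.random.default_rng(0)

    def average_wait(threshold, n_repeats=2000):
        # Average number of independent tests of a true null until a
        # p-value falls below `threshold`. Each uniform draw plays the
        # role of one null p-value, so the wait is geometric with
        # mean 1/threshold.
        waits = []
        for _ in range(n_repeats):
            count = 0
            while rng.uniform() >= threshold:
                count += 1
            waits.append(count + 1)
        return np.mean(waits)

    for alpha in (0.21, 0.20, 0.02, 0.01):
        print(f"p < {alpha:0.2f}: about {average_wait(alpha):.0f} tests on average")

The average waits come out near 5, 5, 50, and 100, matching the reciprocal reading above.
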
In Introductory Statistics, we teach that a test of significance has four steps:
  1. stating the hypotheses and a desired level of significance;
  2. computing the test statistic;
  3. finding the p-value;
  4. drawing a conclusion based on the p-value.
It is step 4 that requires us to draw a line somewhere on the spectrum of p-values between 0 and 1. That line is known as the level of significance.
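
As a small sketch of these four steps, consider a one-sample t-test using scipy; the data, the null mean of 5.0, and the 0.05 level below are purely hypothetical choices:

    import numpy as np
    from scipy import stats

    # Step 1: hypotheses and a desired level of significance
    #   H0: the population mean is 5.0;  Ha: the population mean is not 5.0
    alpha = 0.05

    # Hypothetical sample of 12 measurements
    data = np.array([5.1, 4.9, 5.4, 5.0, 5.3, 5.2, 4.8, 5.5, 5.1, 5.0, 5.2, 5.3])

    # Steps 2 and 3: compute the test statistic and find the p-value
    t_stat, p_value = stats.ttest_1samp(data, popmean=5.0)
    print(f"t = {t_stat:.2f}, p-value = {p_value:.3f}")

    # Step 4: conclude, which is where a line must be drawn at alpha
    if p_value < alpha:
        print(f"Reject H0 at the {alpha} level.")
    else:
        print(f"Fail to reject H0 at the {alpha} level.")

The first three steps are the same no matter what; only step 4 forces a choice of where the line sits.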

I never enjoyed explaining how one should choose a level of significance, and many of my students felt confused. Technically speaking, if a student derived a p-value of 0.28, she could claim it is significant at a significance level of 0.30. The reason this is silly is that a chosen significance level should convey a sense of rare occurrence: so rare that it is deemed to contradict the null hypothesis. No one with common sense would argue that a chance close to 1 out of 3 represents rarity.

What common sense fails to deliver is how rare is contradictory enough. A recent HBR article showed that people vary widely in how they interpret words of likelihood such as "rare".

The solution?
"Use probabilities instead of words to avoid misinterpretation." 
The p-value and the significance level serve precisely this purpose.

Why does 1/20 need to be a universal choice? It doesn't. Statisticians are not much bothered by "insignificant results", as we think 0.051 is just as interesting as 0.049. Whenever possible, we simply report the actual p-value instead of stating that we reject/accept the null hypothesis at a certain level. We use the p-value to quantify the strength of evidence and to compare it across variables and studies.

However, sometimes we don't have a choice, so we get creative.

For any particular test of a null hypothesis against an alternative, a representative (i.e., selection-bias-free) sample of p-values would offer a much better picture than the current published record: a handful of p-values under 0.05 out of who-knows-how-many trials. There have been suggestions to publish insignificant results as well, to avoid the so-called "cherry-picking" based on p-values. Despite the apparent appeal of such a reform, I cannot imagine it is practically possible. For one thing, if we assume that most people have been following the 0.05 "rule", so that only roughly 1 in 20 results ever reaches print, publishing all the insignificant results would mean something like a 20-fold increase in the number of published studies. Still, it would probably create a very interesting data set for data mining.

What would be more useful is a public database of p-values from repeated studies of precisely the same test (not just the same null hypothesis, since a test depends on the alternative as well). In such a database, the p-value can finally be just what it is: a measure of agreement between a data set and a claim.
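
As a rough sketch of what such a database could reveal, one can simulate many repeated studies of exactly the same test, once with the null true and once with a small real effect; the sample size, effect size, and absence of selection here are all assumptions of mine:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    n_studies, n_per_study = 1000, 30

    def repeated_pvalues(true_mean):
        # p-values from repeated one-sample t-tests of H0: mean = 0,
        # each on a fresh sample of size n_per_study.
        pvals = []
        for _ in range(n_studies):
            sample = rng.normal(loc=true_mean, scale=1.0, size=n_per_study)
            pvals.append(stats.ttest_1samp(sample, popmean=0.0).pvalue)
        return np.array(pvals)

    for label, mu in [("null true (mean = 0.0)", 0.0), ("small effect (mean = 0.3)", 0.3)]:
        p = repeated_pvalues(mu)
        print(f"{label}: {np.mean(p < 0.05):.0%} of studies below 0.05, "
              f"median p-value = {np.median(p):.2f}")

Under the true null the p-values spread out roughly uniformly, with about 1 in 20 studies landing below 0.05 by chance alone; under the small effect they concentrate toward zero, yet many studies still miss the 0.05 line. The full collection tells that story far better than the few "significant" values that would survive a 0.05 filter.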