E-values in statistics: apt additions or instruments of generational revolt?

Moinak Bhaduri

Title: E-values: Calibration, combination and applications

Authors and year: Vladimir Vovk, Ruodu Wang, June 2021

Journal: The Annals of Statistics (https://doi.org/10.1214/20-AOS2020)

It was never meant to last, you know. Statistical measures have their heydays; permanent relevance is no guarantee. The p-value was – and still is – a tool like no other. Through the years it has been caressed and condemned, worshipped and feared, praised and slandered – all the while standing at the crossroads of almost every exercise in hypothesis testing, modeling, and prediction. Operationally, a p-value is convenient: we reject, almost mechanically, our null assumption if this value falls below a discipline-specific threshold such as 0.01 or 0.05. Still, its cumbersome construction, which invites tricky interpretation and stunning misuse, frequently lands it on the wrong side of both practitioners and stats purists. Bodies such as the American Statistical Association have issued cautions around its use (https://doi.org/10.1080/00031305.2016.1154108). Experts have been hearing its death rattle for quite a while. The article “E-values: Calibration, combination and applications” by V. Vovk and R. Wang could be the final twist of the knife. Here, the authors offer a promising alternative – the e-value – which can coexist with – and, at times, replace – its troubled ancestor.

But first, let’s brew some tea!

Recall an anecdote from your stats 101 class about a lady tasting tea. The “lady”, Muriel, claimed she could tell, just by tasting, whether milk or tea was added to a cup first. Curious, R.A. Fisher, the statistician, laid out eight cups, four with milk (M) added first, four with tea (T):

{T,M,M,T,M,T,T,M}

Muriel guessed all of them right, stunning Fisher. What would be the usual p-value in this setting? Take Fisher’s belief, that it is not possible to detect which one was added first, as our null hypothesis, the default assumption (leaving aside whether Muriel may be taken as a standard representative of the general public). The p-value would then be 1/70. Why? Recall that the p-value is defined as the probability, assuming the null is true, of observing something at least as extreme as what we actually saw. So, assuming Muriel was bluffing (i.e., the null is true and she is simply guessing at random), all arrangements of the 4 Ts and 4 Ms must have been equally likely to her. How many such arrangements are there? Well, 70: out of eight places, four can be reserved for the Ms in C(8,4) = 70 ways (the Ts automatically get assigned to the other spots). So, the probability that Muriel would hit one specific arrangement – the right one – is 1/70, a slim number (about 0.0143) below the much fabled 0.05 threshold, showing it would have been very unlikely for Muriel to get the right order had she been purely guessing. So, maybe, there is a way to say which ingredient was added first (technically, we could reject the null assumption). Now this is a neat ending – a happy one for Muriel, who gets just credit for her tasting prowess.
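For readers who like to check the counting, here is a minimal Python sketch. It assumes, as in the text, that a guessing Muriel simply names four of the eight cups as milk-first and that every such choice is equally likely; the variable names are made up for illustration.

```python
from fractions import Fraction
from itertools import combinations

# Positions (0-indexed) of the milk-first cups in {T,M,M,T,M,T,T,M}.
true_milk = {1, 2, 4, 7}

# Every way a guesser could name four of the eight cups as milk-first.
guesses = list(combinations(range(8), 4))

# Only one of those choices matches the true arrangement exactly.
perfect = sum(1 for g in guesses if set(g) == true_milk)

# p-value for a perfect score under pure guessing.
p_value = Fraction(perfect, len(guesses))
print(len(guesses), p_value, float(p_value))
```

The enumeration lists 70 possible guesses, exactly one of which is fully correct, which is where the 1/70 comes from.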

Why, then, the migration to E-values?

The need is best grasped through the tough statistical environment under which Muriel had to perform. The odds were stacked heavily against her. What if she was still extremely good, but not perfect? What if, let’s say, she slipped once, mistaking one milk-first cup for a tea-first one? Because she knows there are four cups of each kind, that slip forces a second error elsewhere, so she would end up with six of the eight cups right. Will the p-value still be so tiny? No: the probability that a chance guess would do at least this well (at most one such slip) is 17/70 ≈ 0.24. The p-value balloons, making us a lot more prone to saying the differentiation is impossible even though she’s still a great taster, getting the majority right.
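The 17/70 figure can be verified by counting. The sketch below assumes Muriel knows there are four cups of each kind and names exactly four as milk-first, so her score is the number of truly milk-first cups among her four picks; the function name and its parameters are hypothetical.

```python
from fractions import Fraction
from math import comb

def prob_at_least(k_min, milk=4, tea=4):
    """Chance that a random pick of `milk` cups contains at least
    `k_min` of the truly milk-first cups (a hypergeometric tail)."""
    total = comb(milk + tea, milk)
    favorable = sum(
        comb(milk, k) * comb(tea, milk - k)  # k right picks, milk - k wrong ones
        for k in range(k_min, milk + 1)
    )
    return Fraction(favorable, total)

p_perfect = prob_at_least(4)   # all four milk-first cups identified
p_one_slip = prob_at_least(3)  # at least three of the four identified
print(p_perfect, p_one_slip, float(p_one_slip))
```

There are 16 ways to pick exactly three of the four milk-first cups and one way to pick all four, which gives (16 + 1)/70 = 17/70.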

E-values rectify this issue, restoring some fairness by allowing Muriel to express how confident she feels in her judgment. Imagine Fisher conducting the experiment as a sequence of rounds, each time tossing a coin: if it lands heads, he prepares {T,M}; if tails, {M,T}. Muriel, starting with an initial wealth of $1, is asked to bet on the outcome of each toss. Table 1 describes the details.

Table 1: Tracking Muriel’s sequential wealth through E-values. The larger these values are, the more evidence we have against the null hypothesis.

And the idea is this: if Muriel is bluffing, it would be hard for her to amass a huge fortune just by chance. A record of Muriel’s wealth, then, would offer another way to test the same hypothesis, only this time leveling the playing field somewhat (see Table 1). Her wealth values, call them e-values, therefore build a superior logical system for the same testing job, one that threatens to dethrone the more established p-values.
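The betting idea can be sketched in a few lines of Python. Everything specific here is an assumption made for illustration and is not taken from Table 1 or from the paper: four rounds, even-odds bets, and a Muriel who stakes half of her current wealth on each call.

```python
from itertools import product

def wealth_path(calls_correct, fraction=0.5, start=1.0):
    """Wealth after each round when a fixed fraction of current wealth
    is staked at even odds (hypothetical betting rule)."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must lie in [0, 1]")
    wealth = start
    path = []
    for correct in calls_correct:
        stake = fraction * wealth
        wealth += stake if correct else -stake  # win or lose the stake
        path.append(wealth)
    return path

all_right = wealth_path([True, True, True, True])
one_slip = wealth_path([True, True, False, True])

# Under the null, each call is right with probability 1/2, so all 16
# win/lose patterns are equally likely; average the final wealth over them.
null_mean = sum(
    wealth_path(outcome)[-1] for outcome in product([True, False], repeat=4)
) / 16

print(all_right[-1], one_slip[-1], null_mean)
```

With these made-up stakes, four correct calls turn $1 into about $5.06, and a single slip still leaves Muriel ahead at about $1.69. Under pure guessing the average final wealth stays at $1, so by Markov's inequality the chance of ending with $20 or more is at most 1/20. That is why a large fortune counts as evidence against the null.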

Generalizations

How potent is this milk-tea metaphor? Can every hypothesis be viewed as some version of this tasting conundrum? To produce an e-value, must there always be a betting of some kind going on? Vovk and Wang scream an emphatic no: they show how to generate an e-value from a p-value that already exists, they engineer bridges (called calibrators, usually decreasing functions on [0,1]) to switch back and forth between the two systems, and they highlight which bridges are the most efficient (in the sense of generating the biggest possible e-value for a given p-value). The benefits of a switch from the p-system (reject the null for small p-values) to the e-system (reject the null for large e-values), however, are not just conceptual but also computational:

Aggregating evidence was always a headache in the p-system: when several people conducted the same test, there was no easy way to combine their p-values into a grand verdict. With e-values, this is not a problem: they can simply be averaged.
E-values embrace entanglements: to create a p-value, we frequently need independent observations. Real observations, however, are often correlated. E-values can be built easily out of such dependent data.
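To make the bridges and the averaging concrete, here is a small Python sketch. It uses one well-known family of p-to-e calibrators, f(p) = κ·p^(κ−1) with 0 < κ < 1 (each integrates to 1 over [0,1]), and the e-to-p map min(1, 1/e). The choice κ = 0.5, the function names, and the example e-values are assumptions picked for illustration only.

```python
from statistics import fmean

def p_to_e(p, kappa=0.5):
    """Turn a p-value into an e-value via f(p) = kappa * p**(kappa - 1)."""
    if not 0.0 < kappa < 1.0:
        raise ValueError("kappa must lie strictly between 0 and 1")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if p == 0.0:
        return float("inf")  # the calibrator blows up at p = 0
    return kappa * p ** (kappa - 1.0)

def e_to_p(e):
    """Turn an e-value into a p-value via min(1, 1/e)."""
    return min(1.0, 1.0 / e) if e > 0 else 1.0

def merge_e_values(e_values):
    """Combine e-values for the same null by taking their plain average."""
    return fmean(e_values)

e_from_p = p_to_e(0.01)          # small p-value -> sizeable e-value
p_round_trip = e_to_p(e_from_p)  # going back does not recover p = 0.01

# Three hypothetical e-values reported by three labs for the same null.
combined = merge_e_values([5.0, 0.8, 12.0])
print(e_from_p, p_round_trip, combined, e_to_p(combined))
```

Two things are worth noticing. The round trip p → e → p returns a larger p-value than the one we started with, so each crossing of the bridge costs some evidence. And the averaged e-value is again an e-value however the three labs' data depend on one another, which is the convenience the first point above refers to.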

Additionally, in case we are not super sure how our data are generated (that is, we fear a misspecified model), p-values falter while e-values still produce a reasonably reliable test. In case, however, we are sure of the model, p-values generate better power. It’s up to the experimenter to decide whether staying with the old system is a price worth paying for such occasional effectiveness.

At least among frequentist tools, p-values have enjoyed almost unchallenged authority for decades, and there may be some hesitancy in letting go of an arrangement so firmly established. Some fond memories of p-values may linger on, maybe even a nostalgic tug. Still, with each passing day, it becomes harder for us to pirouette around their fractured reputation. Their stranglehold on inference is over. And the case against them that Vovk and Wang have made is devastatingly frank.