# The TAS Articles: Geoff’s Take

**Yay!** I salute the editors and everyone else who toiled for more than a year to create this wonderful collection of **TAS articles**. Yes, let’s move to a **“post p<.05 world”** as quickly as we can.

**Much to applaud**

Numerous good practices are identified in multiple articles. For example:

- Recognise that there’s much more to statistical analysis and interpretation of results than any mere inference calculation.
- Promote Open Science practices across many disciplines.
- Move beyond dichotomous decision making, whatever ‘bright line’ criterion it’s based on.
- Examine the assumptions of any statistical model, and be aware of limitations.
- Work to change journal policies, incentive structures for researchers, and statistical education.

And much, much more. I won’t belabour all this valuable advice.

**The editorial is a good summary** I said as much in this blog post. The slogan is ATOM: “**A**ccept uncertainty, and be **T**houghtful, **O**pen, and **M**odest.” Scan the second part of the editorial for a brief dot-point summary of each of the 43 articles.

**Estimation to the fore** I still think **our (Bob’s) article** sets out the most-likely-achievable and practical way forward, based on **the new statistics** (estimation and meta-analysis) and **Open Science**. **Anderson** provides another good discussion of estimation, with a succinct explanation of confidence intervals (CIs) and their interpretation.

**What’s (largely) missing: Bayes and bootstrapping**

The core issue, imho, is moving beyond dichotomous decision making to estimation. Bob and I, and many others, advocate CI approaches, but Bayesian estimation and bootstrapping are also valuable techniques, likely to become more widely used. It’s a great shame these are not strongly represented.

There are articles that advocate a role for Bayesian ideas, but I can’t see any article that focuses on explaining and advocating **the Bayesian new statistics**, based on credible intervals. The closest is probably **Ruberg et al.**, but their discussion is complicated and technical, and focussed specifically on decision making for drug approval.

I suspect **Bayesian estimation** is likely to prove an effective and widely-applicable way to move beyond NHST. In my view, the main limitation at the moment is the lack of good materials and tools, especially for introducing the techniques to beginners. Advocacy and a beginners’ guide would have been a valuable addition to the TAS collection.

**Bootstrapping** to generate interval estimates can avoid some assumptions, and thus increase robustness and expand the scope of estimation. An article focussing on explaining and advocating bootstrapping for estimation would have been another valuable addition.
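To make the idea concrete, here is a minimal sketch of the percentile bootstrap (the function and the sample data are illustrative, not from any TAS article): resample with replacement, compute the statistic each time, and take quantiles of the resampled statistics as the interval, with no normality assumption.

```python
import random
import statistics

def bootstrap_ci(data, stat=statistics.mean, n_resamples=10_000, level=0.95, seed=1):
    """Percentile bootstrap CI: resample with replacement, take quantiles."""
    rng = random.Random(seed)
    stats = sorted(stat(rng.choices(data, k=len(data))) for _ in range(n_resamples))
    lo = stats[int((1 - level) / 2 * n_resamples)]
    hi = stats[int((1 + level) / 2 * n_resamples) - 1]
    return lo, hi

# Hypothetical sample; the percentile method needs no distributional assumption.
sample = [2.1, 3.4, 1.8, 5.0, 2.7, 3.9, 4.2, 2.2, 3.1, 4.8]
low, high = bootstrap_ci(sample)
```

The same function works unchanged for medians, trimmed means, or correlations, which is exactly the expanded scope the paragraph above is pointing at.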

**The big delusion: neo-_p_ approaches**

I and many others have long argued that we should simply *not* use NHST or *p* values at all. Or should use them only in rare situations where they are necessary—if these ever occur. For me, the biggest disappointment with the TAS collection is that a considerable number of articles present some version of the following argument: “Yes, there are problems with *p* values as they have been used, but what we should do is:

- use .005 rather than .05 as the criterion for statistical significance, or
- teach about them better, or
- think about *p* values in the following different way, or
- replace them with this modified version of *p*, or
- supplement them in the following way, or
- …”

There seems to be an assumption that *p* values should—or at least will—be retained in some way. Why? I suspect that none of the proposed neo-*p* approaches is likely to become very widely used. However, they blunt the core message that it’s perfectly possible to move on from any form of dichotomous decision making, and simply not use NHST or *p* values at all. To this extent they are an unfortunate distraction.

**_p_ as a decaf CI** One example of neo-*p* as a needless distraction is the contribution of **Betensky**. She argues correctly and cogently that (1) merely changing a *p* threshold, for example from .05 to .005, is a poor strategy, and (2) interpretation of any *p* value needs to consider the context, in particular *N* and the estimated effect size. Knowing all that, she correctly explains, permits calculation of the CI, which provides a sound basis for interpretation. Therefore, she concludes, a *p* value, when considered in context in this way, does provide information about the strength of evidence. That’s true, but why not simply calculate and interpret the CI? Once we have the CI, a *p* value adds nothing, and is likely to mislead by encouraging dichotomisation.

**Using predictions**

I’ll mention just one further article that caught my attention. **Billheimer** contends that “observables are fundamental, and that the goal of statistical modeling should be to predict future observations, given the current data and other relevant information” (abstract, p.291). Rather than estimating a population parameter, we should calculate from the data a prediction interval for a data point, or sample mean, likely to be given by a replication. This strategy keeps the focus on observables and replicability, and facilitates comparisons of competing theories, in terms of the predictions they make.

This strikes me as an interesting approach, although Billheimer gives a fairly technical analysis to support his argument. A simpler approach to using predictions would be to calculate the 95% CI, then interpret this as being, in many situations, on average, approximately an 83% prediction interval. That’s one of the several ways to think about a CI that we explain in ITNS.
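That 83% figure can be checked directly. Assuming independent samples under a normal model, a replication mean differs from the original mean with standard deviation sqrt(2) × SE, so the chance that the replication mean lands inside the original 95% CI (half-width 1.96 × SE) is:

```python
import math

def phi(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Replication mean vs. original mean: difference has SD sqrt(2) * SE,
# so capture probability = P(|Z| < 1.96 / sqrt(2)).
capture = 2 * phi(1.96 / math.sqrt(2)) - 1
print(round(capture, 3))  # ≈ 0.834, i.e. roughly an 83% prediction interval
```

This is the "on average, approximately" sense in which a 95% CI serves as an 83% prediction interval; with small samples or unequal variances the figure shifts.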

**Finally**

I haven’t read every article in detail. I could easily be mistaken, or have missed things. Please let me know.

I suggest (maybe slightly tongue-in-cheek):

- Read the **editorial**, and skim the 43 brief summaries.
- Read the **comment and editorial in *Nature***.
- Read **our article** and use it as a blueprint for future practice!

Geoff

Thank you for your further comment. As in various exchanges we’ve had over the years, I suspect we agree on way more than a reader of these comments might suspect. But not everything.

I fully agree that “nullistic conventions… need to be challenged and broken”. I agree that much of current practice needs drastic improvement, in relation to CIs as well as p values and other techniques. I agree that, if using p values, it can often be valuable and revealing to calculate them for more than one value of the null. I agree that p values around .05, corresponding to null values near an end of a 95% CI, provide only weak evidence against those null values. Yes, in typical situations, a 95% CI provides strong evidence only against null values at least some little distance from the interval.

However, I still contend that a CI is more likely to prove effective as a basis for good understanding and interpretation than one or more p values. (Or than one or more single values, each some transformation of a p value.) Yes, “p values can be calculated across the entire relevant spectrum of parameter values”. In UTNS, p. 105, I included a version of Poole’s p value function that illustrates how the p value varies across and beyond a CI. Also, in Chapter 6 of ITNS we explain how a CI can be used to eyeball the p value for any value of the null that is of interest, anywhere across or beyond the interval. A CI, especially when supplemented (either in the graph, or in the reader’s mind’s eye) with the cat’s eye figure, indicates how the relative strength of evidence against any null of interest varies as that null takes any chosen value across and beyond the interval.

We emphasise in ITNS, and in our TAS article, that an essential part of interpreting any CI is to pay attention to the full extent of the interval. So, for your example CI of [0.997, 2.59], we would want any reader to consider, in particular, the meaning in the research context of each of those interval endpoints. Yes, this is not always done, but it should be, and providing the CI is a good first step to enabling and encouraging that.

Geoff

OK, maybe we’re just down to some narrow misstatements in your reply, like “a 95% CI provides strong evidence only against null values at least some little distance from the interval”.

1) What does “some little distance” mean? The CI has to be pretty far from a parameter value to provide “strong” evidence in any sense I can think of. E.g., the 5-sigma requirement in physics corresponds to falling farther from the interval than the interval limits are from the center!

2) Please, the only correct English use of “null” is for no difference, effect, or association – check your dictionary. One of the many ways Fisher screwed stats was misusing “null” for any tested value, just as Neyman screwed stats by calling CIs “confidence” intervals – a use which Arthur Bowley called a “confidence trick” in 1934. These abuses of English are every bit as misleading as “significance” for P<0.05.

3) I like cat's eye graphs, but I don't trust most readers to have an accurate mind's eye – especially when looking at ratio measures.

4) Nitpicking, but "Poole's P-value function"? Please, no: As Poole notes the P-value function is not his idea – it goes back at least to Birnbaum 1961. I just think Poole's 1987 exposition (actually in two articles in the Am J Public Health that year) is the clearest and most compelling to date.

Finally, just to emphasize: If I am seriously focusing on a single association, all I would need to see is the P-value function since all CIs and P-values can be read off that. But given that's asking for a bit much, I want to see the main results from it as given by a CI and P-values. And then I also want to see at least a fit P-value or some diagnostics for the model used to create those association-focused statistics (or at least have some assurance the analyst checked the model before giving us the focal results). So in my book the P-value remains a central concept of frequentist analyses.

Thank you again. Just quickly:

2). OK, so we need a new term to refer to the value asserted by H0 and used to calculate the p value. Maybe ‘reference value for p’, or ‘H0 assumed value’? I think it was Bruce Thompson who used the term ‘non-nil null’, which I suspect you would label a contradiction.

1). The longer the distance, the stronger the evidence, of course. If MoE (margin of error) is the half-length of the CI (assumed here for simplicity to be symmetric), then an H0 assumed value that’s one-third of MoE beyond the end of the 95% CI gives approx p=.01, and two-thirds gives approx p=.001. We could no doubt work out the corresponding LR values. (LR is approx 7 for the point estimate vs. an end of the 95% CI, so LR increases from 7 as we move further from the CI.) In summary, strength of evidence increases fairly quickly as we move away from the 95% CI. The 5-sigma, etc., standard represents very very strong evidence. But once we move much beyond, say, one MoE from an end of the CI (i.e. 4-sigma), our usual model probably is not a good guide. In practice the uncertainty due to sampling variability (as accounted for by that model) is probably overshadowed by bias or other problems not captured by that model. So in most cases we’re kidding ourselves if we report exact p values below, say, .001. (Accordingly, the APA Publication Manual recommends reporting exact, rather than relative, p values, except that p<.001 is preferred to any smaller exact value of p.)

3). A fair point that ratio measures are harder to represent well and think about clearly. Squared measures similarly.

4). Fair point. In UTNS I described my version as ‘the CI-function’ and marked the vertical axis also with corresponding p values.
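Those approximations (p ≈ .01 at one-third of MoE beyond the interval's end, p ≈ .001 at two-thirds) are easy to verify under the assumed normal model; a sketch using only the standard library:

```python
import math

def two_sided_p(z):
    """Two-sided p value for a z statistic under a normal model."""
    return math.erfc(abs(z) / math.sqrt(2))

# MoE is half the 95% CI, i.e. 1.96 * SE. A tested value one-third of MoE
# beyond the interval's end sits (4/3) * 1.96 SEs from the point estimate;
# two-thirds beyond sits (5/3) * 1.96 SEs away.
p_one_third = two_sided_p(1.96 * 4 / 3)   # ≈ 0.009
p_two_thirds = two_sided_p(1.96 * 5 / 3)  # ≈ 0.001
```

(`erfc(z/√2)` equals `2 * (1 - Φ(z))`, the usual two-sided tail probability.)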

2) No need for a new term to replace “null hypothesis” for non-null hypotheses: Just use Neyman’s term “tested hypothesis” or its abbreviation, “test hypothesis”.

1) No need for all that tortured nonintuitive normal/SD dependent tradition to measure distance from the test hypothesis: Just measure the information against the test hypothesis supplied by its P-value p by converting it to the Shannon information (now over 60 years history as “surprisal”, “logworth” and other names including S-value) s = -log(p). Unlike the P-value, the S-value is additive across independent tests (as Fisher exploited), equal-interval scaled, unbounded above so hard to confuse with a posterior probability; and when using base-2 logs has immediate translation into a coin-tossing experiment, e.g., p of 0.03 is s = -log(0.03) = 5 bits of information against the hypothesis, which is the same amount of information as 5 heads in a row supplies against fairness of a coin tossing set-up. The 1-sided 5-sigma physics criterion becomes about 22 bits or 22 heads in a row. And so on.
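Sketching that arithmetic (standard library only; the numbers are the ones quoted above):

```python
import math

def s_value(p):
    """Shannon information (bits) against the tested hypothesis: s = -log2(p)."""
    return -math.log2(p)

# p = 0.03 carries about 5 bits -- the same surprise as 5 heads in a row
# from a coin you assumed fair.
bits_03 = s_value(0.03)                       # ≈ 5.1 bits

# One-sided 5-sigma criterion: p = 1 - Phi(5), about 22 bits.
p_5sigma = 0.5 * math.erfc(5 / math.sqrt(2))
bits_5sigma = s_value(p_5sigma)               # ≈ 21.7 bits
```

The additivity across independent tests follows directly from logs turning products of p values into sums.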

Yes, what I am saying is that The New Statistics is already old and in need of an update – you should read my 2019 TAS-supplement paper and update your book accordingly:

Greenland, S. (2019). Some misleading criticisms of P-values and their resolution with S-values. The American Statistician, 73 suppl 1, 106-114, open access at

http://www.tandfonline.com/doi/pdf/10.1080/00031305.2018.1529625

No, sorry, I’m utterly unconvinced by your reply. It would be fine but for the fact it presumes the only P-value one can calculate is the null P-value. That’s the only P-value calculated in most practice but that’s an unnecessary and distortive tradition stemming from nullistic conventions which need to be challenged and broken.

P-values can be calculated across the entire relevant spectrum of parameter values to visualize the P-value as a function of the tested parameter (Birnbaum 1961; Poole 1987; Modern Epidemiology 2008 Ch. 10, see p. 158-163). Even just one alternative P-value besides the null can provide a drastically altered perspective on the results, making more difficult the kind of dichotomous treatment that confidence intervals leave unchanged.

For example Brown et al. JAMA 2017 tried to pass off a hazard-ratio CI of (0.997, 2.59) as supporting the null. I can’t help thinking how much more difficult it would have been for the authors to present this false conclusion if they had been forced to give the P-values of ~0.78 for HR=1.5 and ~0.37 for HR=2 alongside the null (HR=1) P-value of 0.051.
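Those p values can be reconstructed approximately from the published CI alone, assuming the interval is symmetric on the log hazard-ratio scale under a normal model (a sketch, not the authors' actual computation):

```python
import math

def p_for_hr(hr_null, ci_low=0.997, ci_high=2.59):
    """Approximate two-sided p for a tested hazard ratio, recovered
    from a 95% CI assumed symmetric on the log scale."""
    lo, hi = math.log(ci_low), math.log(ci_high)
    est = (lo + hi) / 2                  # point estimate (log scale)
    se = (hi - lo) / (2 * 1.96)          # standard error implied by CI width
    z = abs(est - math.log(hr_null)) / se
    return math.erfc(z / math.sqrt(2))   # two-sided p

print(round(p_for_hr(1.0), 3))  # ~0.051
print(round(p_for_hr(1.5), 2))  # ~0.78
print(round(p_for_hr(2.0), 2))  # ~0.37
```

Reporting all three side by side makes the "supports the null" reading much harder to sustain.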

So based on my reading of the med literature and the fact that a CI forces a dichotomy on the viewer, I think simply replacing P-values with CIs perpetuates the dichotomania problem and does not effectively counter nullistic bias. Presenting multiple P-values for different contextually relevant parameter values (e.g., HR=1, 1.5, 2) as well as CI does address both problems head on.

Even better would be to convert those multiple P-values into S-values (surprisals, Shannon information) s = log2(1/p) to show how weak the evidence against a parameter value is when it falls near the 95% CI, and to present CIs not as “confidence” intervals but as areas of high compatibility between parameter values and the data under the model used to generate the results. Conversion to an information scale would help avoid confusion of frequentist P-values and CIs with posterior probabilities and intervals.

Thank you for your comment.

Yes, CIs can be, and alas often are, interpreted merely in terms of ‘includes’ or ‘does not include’ the null. This impoverished dichotomous interpretation ignores much of the useful information that a CI provides.

However, I can’t see any evidence or convincing argument that it is interval estimates that perpetuate dichotomous thinking. Far more plausible, I suggest, is that NHST and the way p values and sharp cutoffs are customarily used are major reasons that dichotomous thinking and dichotomous decision making remain so prominent in statistical inference.

Conventional CIs and p values are, usually, based on the same theory, so it is not surprising that, if we make the usual statistical model assumptions, either can be converted into the other. In Chapter 6 of ITNS we give some simple heuristics that guide the eyeballing of an approximate p value, given a CI. And others that guide the approximate eyeballing of a CI, given a p value and the point estimate. The latter is probably the best way to interpret a p value—convert it (plus knowledge of the point estimate) into a CI, which makes the degree of uncertainty salient.
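As a sketch of the two conversions (made exact under a normal model, rather than the eyeball heuristics themselves; function names are mine):

```python
from statistics import NormalDist

norm = NormalDist()

def ci_from_p(estimate, p, level=0.95):
    """Recover an approximate CI from a point estimate and its
    two-sided p value against a null of zero, under a normal model."""
    z_p = norm.inv_cdf(1 - p / 2)             # z implied by the p value
    se = abs(estimate) / z_p                  # standard error it implies
    z_ci = norm.inv_cdf(1 - (1 - level) / 2)
    return estimate - z_ci * se, estimate + z_ci * se

def p_from_ci(null, low, high, level=0.95):
    """Two-sided p for any tested null value, read off a CI."""
    z_ci = norm.inv_cdf(1 - (1 - level) / 2)
    est, se = (low + high) / 2, (high - low) / (2 * z_ci)
    return 2 * norm.cdf(-abs(est - null) / se)
```

The two functions invert each other, which is the point of the paragraph above: either summary can be recovered from the other, but the CI makes the uncertainty visible at a glance.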

Yes, there is sampling variability in CIs, but the extent of the single CI calculated from our data usually gives a good indication of the extent of that variability. In stark contrast, the p value calculated from our data is a single number that gives no indication of its underlying sampling variability. An exact replication is likely to give a considerably different p value. The same holds for any single value that is a transformation of that p value.

This is just awful:

“Therefore, she concludes, a p value, when considered in context in this way, does provide information about the strength of evidence. That’s true, but why not simply calculate and interpret the CI? Once we have the CI, a p value adds nothing, and is likely to mislead by encouraging dichotomisation.”

I regard that last sentence as categorically false and in fact backwards, reflecting a self-inflicted cognitive bias in favor of CIs despite all their abuse out there:

A P-value adds the information of precisely how far the tested model is from the reference model (usually, of course, null + some regression model vs. the regression model alone, but it doesn’t have to be). A 95% CI degrades that information into a dichotomy: P>0.05 if the value is inside the interval, P<0.05 if outside. And the literature is full of studies that claim support for the null because the CI includes the null, showing how CIs perpetuate dichotomania.

Now I'm all for presenting CIs (and/or posterior intervals) but those are dichotomies and so need to be balanced out by P-values (and/or posterior probabilities) to fight the dichotomous perceptions perpetuated by interval estimates.