
Psychology Journal Bans Significance Testing

Dolphin

Senior Member
Messages: 17,567
I was aware of how unreliable p-values might be.

Another problem with the p-value is that it is not highly replicable. This is demonstrated nicely by Geoff Cumming in a video, in which he shows, using computer simulation, that if one study achieves a p-value of 0.05, this does not predict that an exact replication will also yield the same p-value. Using the p-value as the final arbiter of whether to accept or reject the null hypothesis is therefore highly unreliable.

Cumming calls this the “dance of the p-value,” because, as you can see in his video, when you repeat a virtual experiment with a phenomenon of known size, the p-values that result from the data collection dance all over the place.
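For anyone who would rather not watch the video, here is a rough Python sketch of the kind of simulation Cumming describes. The effect size, sample size, and number of runs are arbitrary choices for illustration, not his exact settings.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
effect_size = 0.5     # true standardized group difference (chosen for illustration)
n_per_group = 32      # per-group sample size (chosen for illustration)

for run in range(20):
    control = rng.normal(0.0, 1.0, n_per_group)
    treated = rng.normal(effect_size, 1.0, n_per_group)
    t_stat, p_value = stats.ttest_ind(treated, control)
    print(f"run {run + 1:2d}: p = {p_value:.3f}")
# The true effect never changes, yet the p-values typically swing from
# well below 0.001 to well above 0.05 across the runs.
```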

Regina Nuzzo, writing in Nature in 2014 (http://www.nature.com/news/scientific-method-statistical-errors-1.14700), echoes these concerns. She points out that if an experiment results in a p-value of 0.01, the probability that an exact replication will also achieve a p-value of 0.01 (assuming perfect methodology and no cheating) is 50%, not 99% as many might falsely assume.
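The 50% figure can be checked numerically. Below is a minimal sketch, assuming the true effect is exactly the one the original study observed, so that the expected test statistic sits right at the p = 0.01 boundary; the sample size is an arbitrary choice.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 30                                  # sample size per study (assumed for illustration)
z_crit = stats.norm.ppf(1 - 0.01 / 2)   # two-sided z threshold for p = 0.01
mu = z_crit / np.sqrt(n)                # true mean placed exactly at that threshold

replications = 10_000
hits = 0
for _ in range(replications):
    x = rng.normal(mu, 1.0, size=n)     # one exact replication (known sigma = 1)
    z = x.mean() * np.sqrt(n)
    p = 2 * stats.norm.sf(abs(z))
    hits += p <= 0.01

print(f"fraction of replications reaching p <= 0.01: {hits / replications:.2f}")  # close to 0.5
```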
 

CBS

Senior Member
Messages: 1,522
Yes, p-values are widely misused. They are only a part of the story. Surprised that BASP is taking this stance. I've always referred to p-values as a measure of statistical stability, not significance. Significance is a much more complex issue, but this seems in part to be a problem of editorial oversight, of a system that values quantity over quality, and of a broader ignorance of basic scientific principles.
 

Woolie

Senior Member
Messages: 3,263
I don't want to offend anyone in the field, but this journal, BASP, doesn't seem to be one of the highest-impact journals in its domain - perhaps they hope to create a bit of a stir to raise their profile?
 

anciendaze

Senior Member
Messages: 1,841
Researchers tend to follow procedures while ignoring the logical assumptions on which their validity is based. The fundamental assumption behind all of these techniques tends to be that you are dealing with normal distributions, but very few researchers go to any lengths to validate this. They will even use standard procedures in cases where there is strong evidence that the underlying distribution is far from normal.
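For what it is worth, a basic check of the assumption takes only a few lines. Here is a minimal sketch using scipy's Shapiro-Wilk test on made-up skewed data (a Q-Q plot would be the usual graphical companion):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
data = rng.exponential(scale=2.0, size=40)   # stand-in for a skewed outcome measure

w_stat, p_value = stats.shapiro(data)
print("Shapiro-Wilk p-value:", p_value)      # a small p-value casts doubt on normality
print("sample skewness:", stats.skew(data))
```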

A normal (Gaussian) distribution is one which is completely specified by two parameters: mean and variance (or standard deviation); all higher-order features, such as skewness, are assumed to be zero or negligible. The Central Limit Theorem gives some conditions under which random processes with other distributions will combine to produce normal distributions. Among the preconditions are additive combination, independence, and the existence of means and bounded variances in those other distributions. It is possible for each of these conditions to be violated in practice.
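A small numerical illustration, under conditions where the theorem does apply (independent draws, finite variance), using an arbitrary strongly skewed exponential distribution: the means of many samples drift toward normality, while any single small sample does not.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

one_sample = rng.exponential(scale=1.0, size=30)               # a single small sample
sample_means = rng.exponential(scale=1.0, size=(10_000, 30)).mean(axis=1)

print("skewness of one raw sample:", stats.skew(one_sample))
print("skewness of 10,000 sample means:", stats.skew(sample_means))
# Only the distribution of means drifts toward normality, and only because
# the draws are independent and the exponential has a finite mean and variance.
```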

What the theorem guarantees, in those cases where it applies, is that the distribution of the means (and variances) of sample sets will approach normality in the limit where you repeat the sampling many times. This is far from guaranteeing that any particular small set of samples will form a normal distribution. Even if the sampling process does produce a normal distribution, you still have the problem of estimating those two parameters. Mean values tend to be more reliably estimated than standard deviations, yet the estimates of standard deviation have an enormous effect on measures of significance. The computer experiments above illustrate very well how sensitive the results are to those estimates, yet these are experiments done in a mathematical world where the underlying assumptions are, as far as possible, true by construction!
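To see how much hangs on the standard-deviation estimate, here is a small sketch under idealized assumptions: the same observed mean difference is paired with standard deviations estimated from many different small samples, and the implied p-values swing accordingly. The sample size and the 0.5 difference are invented for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n = 10                      # small per-study sample size (assumed for illustration)
observed_diff = 0.5         # the same observed mean difference every time (assumed)

sample_sds = rng.normal(0.0, 1.0, size=(10_000, n)).std(axis=1, ddof=1)
t_stats = observed_diff / (sample_sds / np.sqrt(n))
p_values = 2 * stats.t.sf(np.abs(t_stats), df=n - 1)

print("smallest sample SD:", sample_sds.min(), " largest:", sample_sds.max())
print("implied p-values range from", p_values.min(), "to", p_values.max())
```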

In situations where one of the processes being sampled has a Lévy distribution, you cannot even count on the existence of a bounded standard deviation in the pure mathematical case. Any computed standard deviation will be an artifact determined by the bounds set on sampling and the number of samples taken. Unfortunately for those who prize convenience, Lévy distributions are quite common in many fields. They should be suspected any time you are dealing with asymmetrical distributions that exhibit a pronounced extended tail in one direction.
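This is easy to see numerically with scipy's levy distribution: the computed standard deviation typically keeps growing as more samples are drawn, rather than settling toward any fixed value.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
for n in (100, 10_000, 1_000_000):
    sample = stats.levy.rvs(size=n, random_state=rng)
    print(f"n = {n:>9,d}   computed SD = {sample.std():,.1f}")
# For a normal distribution these figures would settle down; here the
# "standard deviation" is largely an artifact of how many points were drawn.
```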

This is precisely what I saw in looking at Chalder's data on physical performance in the general population. I could even plausibly explain this as the result of a number of pathological processes combining multiplicatively to reduce performance. In the general population there was no exclusion of multiple pathological processes, and in the CFS subset the existence of any undetected pathologies was simply dismissed.
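As a purely hypothetical sketch of that multiplicative idea (none of the numbers come from Chalder's data): give everyone a roughly normal baseline score, let each of several pathologies strike at random and multiply performance down when present, and what comes out is an asymmetric distribution with an extended lower tail rather than a normal one.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n_people = 100_000

baseline = rng.normal(100.0, 5.0, n_people)             # healthy performance scores (invented)
strikes = rng.random((5, n_people)) < 0.2               # each of 5 pathologies hits ~20% of people
severity = rng.uniform(0.5, 0.95, size=(5, n_people))   # fraction of performance kept when hit
factors = np.where(strikes, severity, 1.0)
performance = baseline * factors.prod(axis=0)

print("skewness of baseline scores:", stats.skew(baseline))         # close to zero
print("skewness of reduced performance:", stats.skew(performance))  # negative: extended lower tail
```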

You can even find examples in the literature where researchers who were warned that they were not dealing with normal distributions persisted, saying "sometimes you have to work hard to get a normal distribution". Such people are completely oblivious to the obvious contradiction between the requirement of independence during sampling and "working hard" to make the resulting sample set look normal. If the resulting groupings all exhibit increased variance as a function of time, and the bounds are set in a way that exploits this, the study is guaranteed to be able to report "statistically significant results" on the basis of procedures whose assumptions are invalid. Having a number of sample points right up against one bound is a great way to do this, even if you don't adjust bounds during the study.

Even that stops short of adjusting bounds for entry to a trial without changing thresholds for recovery. Any field which allows such travesties to persist must be considered fundamentally suspect.