• Welcome to Phoenix Rising!

    Created in 2008, Phoenix Rising is the largest and oldest forum dedicated to furthering the understanding of and finding treatments for complex chronic illnesses such as chronic fatigue syndrome (ME/CFS), fibromyalgia (FM), long COVID, postural orthostatic tachycardia syndrome (POTS), mast cell activation syndrome (MCAS), and allied diseases.

Scientific method: rethinking 'gold standards' of research validity

natasa778

Senior Member
Messages
1,774
Scientific method: Statistical errors
P values, the 'gold standard' of statistical validity, are not as reliable as many scientists assume.


http://www.nature.com/news/scientific-method-statistical-errors-1.14700

... It turned out that the problem was not in the data or in Motyl's analyses. It lay in the surprisingly slippery nature of the P value, which is neither as reliable nor as objective as most scientists assume. “P values are not doing their job, because they can't,” says Stephen Ziliak, an economist at Roosevelt University in Chicago, Illinois, and a frequent critic of the way statistics are used.

For many scientists, this is especially worrying in light of the reproducibility concerns. In 2005, epidemiologist John Ioannidis of Stanford University in California suggested that most published findings are false [2]; since then, a string of high-profile replication problems has forced scientists to rethink how they evaluate results.

At the same time, statisticians are looking for better ways of thinking about data, to help scientists to avoid missing important information or acting on false alarms. “Change your statistical philosophy and all of a sudden different things become important,” says Steven Goodman, a physician and statistician at Stanford. “Then 'laws' handed down from God are no longer handed down from God. They're actually handed down to us by ourselves, through the methodology we adopt.” ...


P-hacking is especially likely, he says, in today's environment of studies that chase small effects hidden in noisy data. It is tough to pin down how widespread the problem is, but Simonsohn has the sense that it is serious. In an analysis [10], he found evidence that many published psychology papers report P values that cluster suspiciously around 0.05, just as would be expected if researchers fished for significant P values until they found one. ...
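To make the "fishing" point concrete, here is a minimal Python sketch. It is not from the Nature article. It assumes a simple z-test with a known SD of 1, independent outcome measures, and no real effect anywhere; the function names are made up for illustration. Each simulated study measures several outcomes and reports only its smallest P value.

```python
import math
import random


def two_sided_p(z):
    """Two-sided P value for a standard-normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


def study_p_values(n_outcomes, n_per_group, rng):
    """Simulate one study with NO real effect; return one P value per outcome."""
    p_values = []
    for _ in range(n_outcomes):
        treated = [rng.gauss(0, 1) for _ in range(n_per_group)]
        control = [rng.gauss(0, 1) for _ in range(n_per_group)]
        diff = sum(treated) / n_per_group - sum(control) / n_per_group
        se = math.sqrt(2 / n_per_group)  # assumption: SD is known to be 1
        p_values.append(two_sided_p(diff / se))
    return p_values


def false_positive_rate(n_outcomes, n_studies=2000, n_per_group=20, seed=1):
    """Share of null studies whose best (smallest) P value falls below 0.05."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_studies):
        if min(study_p_values(n_outcomes, n_per_group, rng)) < 0.05:
            hits += 1
    return hits / n_studies
```

Calling false_positive_rate(1) should give something near the nominal 5%, while false_positive_rate(5) should land near 1 - 0.95^5, or about 23%, give or take simulation noise. Nothing in the data changed; only the number of chances to find a "significant" result did.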

alex3619

Senior Member
Messages
13,810
Location
Logan, Queensland, Australia
Yes, P values are NOT reliable. In the hard sciences they usually like P values very very low, and distrust the analysis unless they are. In softer sciences they seem to think only somewhat low P values are good.

These kinds of arguments are things I have been looking into. P values only reflect the possibility that a result is due to chance, and probability is a tricky thing. It turns out that uncommon events are very common, as the system is stacked to favour all sorts of biases. So it's possible that over 50% of results in psych are false, and the rest of medicine is not much better. Yet the medical profession treats these results as Gold.
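A rough way to see how "over 50% false" can happen even without fraud is the arithmetic below. This is only a sketch: the prior (share of tested hypotheses that are actually true) and the power figures are illustrative assumptions, not estimates for psychology or medicine, and the function name is made up. It also ignores bias, which would only make things worse.

```python
def false_share_of_significant(prior, power, alpha=0.05):
    """Fraction of 'significant' results that are false positives.

    prior: assumed share of tested hypotheses that are really true
    power: chance a real effect reaches significance
    alpha: significance threshold (chance a null effect reaches it)
    """
    true_positives = power * prior
    false_positives = alpha * (1 - prior)
    return false_positives / (true_positives + false_positives)


# Illustrative inputs only: 1 in 10 hypotheses true.
well_powered = false_share_of_significant(prior=0.1, power=0.8)
under_powered = false_share_of_significant(prior=0.1, power=0.2)
```

With these assumed inputs, well_powered works out to 0.045 / 0.125 = 0.36, and under_powered to 0.045 / 0.065, which is roughly 0.69. So with a low prior and low power, most results that pass P < 0.05 would be false, even though each individual test was run "correctly".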
 

CBS

Senior Member
Messages
1,522
The term "statistically significant results" is often employed. Statistics are more or less likely to be stable or reproducible (that's what a p value tells you) but the results alone are never significant. Only clinical value is significant or insignificant. The best studies address issues of clinically significant outcomes BEFORE setting a budget and as an essential guide to sample size and types of measures to be collected. It is disheartening to see how few researchers in all disciplines appreciate this critical distinction.
 

barbc56

Senior Member
Messages
3,657
Here's an interesting article. It also includes some very informative links.

Basically, it comes down to prior plausibility and not just looking at one study in isolation for a theory to be valid. That's the short version. :)

http://www.sciencebasedmedicine.org/5-out-of-4-americans-do-not-understand-statistics/

Love this quote.

As I understand it as a clinician, the take home is that a p of 0.05 or even 0.01 maybe statistically significant, it is unlikely to mean the result is ‘true’, that you can reject the null hypothesis. In part this is due to the unfortunate fact than many clinical studies stink on ice :eek:.
 

alex3619

Senior Member
Messages
13,810
Location
Logan, Queensland, Australia
Yes @barbc56, that is just one of the articles I have used to come to my own conclusions.

P values cannot always ensure something is correct. It still has to make sense, be determined with good methodology and analysis, etc etc. A biased study, a fraudulent study, poor methodology, even pure chance, can give statistically significant results. P value is only a heuristic, only suggestive.

PS Some might find this amusing. I have read that article before, but my fluency in ME Typoese and not paying much attention to titles meant I autotranslated the title as 4 out of 5. This time I read it correctly, 5 out of 4.

biophile

Places I'd rather be.
Messages
8,977
Clinical significance is another can of worms. Take the PACE Trial for example, where a mere 2 points on a scale of 0-33 was regarded as a "moderate" clinically significant improvement in fatigue. How did that happen?

PACE based their definitions of clinically significant improvement on the standard deviation of the baseline scores: 0.3 SD for a minimal clinically important difference and 0.5 SD for a clinically useful difference. The SDs of the group scores were restricted by the exclusion of severely and mildly affected patients, so the thresholds for clinically significant improvement were also low.

In a worst-case scenario to expose the problem with this method: if PACE had chosen patients who all scored the same, improving by a single increment would represent an unlimited effect size, which is absurd.

PACE abandoned their generally more stringent definitions of clinically significant improvement and replaced almost all of them with post-hoc definitions. The original definitions were equally arbitrary but generally based on previous trials.

Who here would regard a mere 2 point improvement on the Chalder fatigue scale (Likert scoring) as "moderate"? Just another reason why more attention needs to be given to what is actually a clinically useful improvement to patients.
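For anyone who wants to see the SD problem in numbers, here is a small Python sketch. The scores are entirely hypothetical (not PACE data) and the function names are made up; it only shows how an SD-based threshold shrinks when the sample is restricted to a narrow band of scores, and what happens in the all-scores-identical case.

```python
import math
import statistics


def sd_threshold(scores, fraction=0.5):
    """Threshold for 'clinically useful' change, defined as fraction * baseline SD."""
    return fraction * statistics.stdev(scores)


def standardized_effect(change, scores):
    """Change in points divided by baseline SD; unbounded if everyone scores the same."""
    sd = statistics.stdev(scores)
    return math.inf if sd == 0 else change / sd


# Hypothetical baseline scores on a 0-33 scale.
wide_sample = list(range(0, 34))      # every severity level included
narrow_sample = list(range(18, 31))   # mild and severe ends excluded
identical_sample = [25] * 10          # the absurd worst case

wide_cutoff = sd_threshold(wide_sample)
narrow_cutoff = sd_threshold(narrow_sample)
```

The narrow sample has a much smaller SD than the wide one, so the same "0.5 SD" rule asks for far fewer points of improvement. In the identical sample the SD is zero, so a one-point change has no finite standardized effect size at all.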

Firestormm

Senior Member
Messages
5,055
Location
Cornwall England
Nature

12 February 2014

Number Crunch [Editorial that links to the main piece in the top post]

...When it comes to statistical analysis of experimental data, the piece says, most scientists would look at a P value of 0.01 and “say that there was just a 1% chance” of the result being a false alarm. “But they would be wrong.”

In other words, most researchers do not understand the basis for a term many use every day. Worse, scientists misuse it. In doing so, they help to bury scientific truth beneath an avalanche of false findings that fail to survive replication.

As the News Feature explains, rather than being convenient shorthand for significance, the P value is a specific measure developed to test whether results touted as evidence for an effect are likely to be observed if the effect is not real. It says nothing about the likelihood of the effect in the first place.

You knew that already, right? Of course: just as the roads are filled with bad drivers, yet no-one will admit to driving badly themselves, so bad statistics are a well-known problem in science, but one that usually undermines someone else’s findings...

...Among the most common fundamental mistakes in research papers submitted to Nature, for instance, is the failure to understand the statistical difference between technical replications and independent experiments.

Statistics can be a difficult discipline to master, particularly because there has been a historical failure to properly teach the design of experiments and the statistics that are relevant to basic research.

Attitudes are also part of the problem. Too often, statistics is seen as a service to call on where necessary — and usually too late — when, in fact, statisticians should be involved in the early stages of experiment design, as well as in teaching.

Department heads, lab chiefs and senior scientists need to upgrade a good working knowledge of statistics from the ‘desirable’ column in job specifications to ‘essential’. But that, in turn, requires universities and funders to recognize the importance of statistics and provide for it...

...[Nature] are actively recruiting statisticians to help to evaluate some papers in parallel with standard peer review — and can always do with more help. (It has been hard to find people with the right expertise, so do please get in touch.) Our sister journal Nature Methods has published a series of well-received columns, Points of Significance, on statistics and how to use them...

Most scientists use statistics. Most scientists think they do it pretty well. Are most scientists mistaken about that? In the News Feature, Nature says so. Go on, prove us wrong.


Nice to have heard that at least Columbia and Lipkin's team have a team of biostatisticians standing by from the start of the proposed Microbiome and Cytokine Study:
Sophisticated analysis will be required on the vast amount of data generated by microbiome and cytokine profiling; happily, Lipkin’s Center for Infection and Immunity have a team of biostatisticians dedicated to such work.

Sean

Senior Member
Messages
7,378
Yes, P values are NOT reliable. In the hard sciences they usually like P values very very low, and distrust the analysis unless they are.

Come on down, 6 Sigma!

(I would accept 4.5 ;) )

And add me to the list of those who regard clinical significance as a far more important and relevant standard to reach. Reaching statistical significance is one of those obviously-must-be-ticked boxes, but nothing more than that – it just gets the basic results into the main discussion, it doesn't make them 'real'.