Is science broken? The reproducibility crisis

Simon

Senior Member
Messages
3,789
Location
Monmouth, UK
Interesting blog about a meeting at UCL this week looking at scientific standards in psychology and neuroscience - with an appearance there by blogger Neuroskeptic

Is science broken? The reproducibility crisis
by Liz Bal at Biomed Central blog network

highlights:
Chris Chambers, Professor of Psychology and Neuroscience at Cardiff University, blamed this problem on the pressure to publish ‘good’ results. Too often, the quality of science is measured by the perceived level of interest, novelty and impact of the results. This leads to a number of problems in the research process – publication bias; significance chasing; ‘HARKing’ (hypothesizing after the results are known); a lack of data sharing and replication; and low statistical power.

He argues for pre-registration of studies with journals: protocols are peer-reviewed and accepted on the basis of the methodology, and results are published regardless of outcome so long as the protocol was followed. He oversees this process at the journal Cortex.

Neuroskeptic, Neuroscience, Psychology and Psychiatry researcher and blogger, said he became disillusioned by poor practices as a PhD student - referring to a “tacit decision” among scientists to accept methods that they would not dream of teaching to undergraduates.

On the other hand, Sam Schwarzkopf, Research Fellow in Experimental Psychology at UCL, argued that science is not broken and is actually working better than ever before.

read the full blog
 

anciendaze

Senior Member
Messages
1,841
This is not at all limited to psychology. I've had some interesting arguments about parametric statistics in medical research with people who should certainly know better. I am far from the first to question the meaning of published p values. Even in cases where my eyeball quickly tells me they are likely dealing with something other than a normal distribution, researchers persist in assuming they have one. When questioned, some senior researchers have said "you have to work hard to get a good normal distribution." This raises a suspicion in my mind that they are looking at samples, then rerunning sampling until they get a distribution that appears normal. Senior researchers will deny this, but if you talk to incautious graduate students you will hear that it is taking place.

These same senior researchers will invoke the Central Limit Theorem to explain how they got the normal distributions in samples when the population distribution is far from normal. There is apparently no connection in their minds between rerunning sampling until you get what you want, and the essential prerequisite for the CLT that samples be independent. This is a massive violation of that condition.

The one positive thing I can say about this approach is that their understanding of what they are doing is so defective they can't be sure which way they are biasing outcomes.
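
For anyone who wants to see this for themselves, here is a toy Python sketch (my own made-up numbers, not anyone's real data) of the "rerun sampling until it looks normal" procedure: it draws many samples from a skewed population, keeps only those that pass a normality test, and shows that the kept samples differ systematically from the full set, while no longer being the independent draws the CLT requires.

```python
# Toy simulation: draw many samples from a skewed (lognormal) population,
# keep only the ones a Shapiro-Wilk test calls "normal enough", and compare
# the kept samples to the full set. The selection itself introduces a bias.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, trials = 30, 5000
all_means, all_sds, kept_means, kept_sds = [], [], [], []

for _ in range(trials):
    sample = rng.lognormal(mean=0.0, sigma=1.0, size=n)
    all_means.append(sample.mean())
    all_sds.append(sample.std(ddof=1))
    _, p = stats.shapiro(sample)
    if p > 0.05:                      # sample "looks normal", so it gets kept
        kept_means.append(sample.mean())
        kept_sds.append(sample.std(ddof=1))

print("all samples : mean of means %.3f, mean of SDs %.3f"
      % (np.mean(all_means), np.mean(all_sds)))
print("kept samples: mean of means %.3f, mean of SDs %.3f"
      % (np.mean(kept_means), np.mean(kept_sds)))
# The kept samples are no longer independent draws from the population, so the
# usual CLT reasoning no longer applies to them, and whatever bias the selection
# introduces is not under the experimenter's control.
```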
 

Sean

Senior Member
Messages
7,378
He argues for pre-registration of studies with journals: protocols are peer-reviewed and accepted on the basis of the methodology, and results are published regardless of outcome so long as the protocol was followed.
If the methodology were sound (and followed diligently), and the results published promptly, these problems would largely be resolved.

IOW, get the methodology right, and stick to it.

(I have no in-principle problem with additional post-hoc analyses being done; they can sometimes add useful information. But the original protocol must be followed for the primary paper, and published before any altered post-hoc analyses are run.)
 

alex3619

Senior Member
Messages
13,810
Location
Logan, Queensland, Australia
Science is broken in many ways, but then it has always had issues. Today we also have the undue influence of private funding in science, and the rise of consensus views being considered science. One huge thing we need is a better-educated public, not just on science but also on politics. We also need open publishing. As many people as possible need to be able to read every paper. Transparency is one key to adequate criticism.

I am very in favour of detailed research plans being published before the study is done, and the study being considered with those plans in mind by reviewers and editors.
 

Simon

Senior Member
Messages
3,789
Location
Monmouth, UK
(I have no in-principle problem with additional post-hoc analyses being done; they can sometimes add useful information. But the original protocol must be followed for the primary paper, and published before any altered post-hoc analyses are run.)
I think that's exactly right, and there needs to be a clear separation between what was planned and what is exploratory.

Online debate erupts to ask: is science broken? : Nature News & Comment
Schwarzkopf said that he is “wary” of any system that prevents scientists from thoroughly examining their results. “I think except for the simplest designs there will always be things you can think of only when you see the data,” he wrote.
So long as the labelling is clear, that's all well and good. I also think p values and effect sizes come into it: a chance finding that only scrapes significance, or is only a small effect, is by the by, but a study that throws up a sizeable and significant finding should discuss it - I would want to know about it, though also to know what was planned and what was stumbled across.
 

user9876

Senior Member
Messages
4,556
I don't really see what is wrong with a post-hoc analysis, particularly if all the data are publicly available for others to examine. The problem is that others don't examine the validity of the analysis, and there is a culture of cherry-picking. I tend to think just having a predefined analysis plan isn't sufficient, because people don't adequately review them; as experience is gained in an area, a predefined analysis can be written that cherry-picks. In fact, that can be done unintentionally, by looking for techniques that have worked in the past.

Two things are really needed.
1) Make the data available
2) Give credit to people who spend their time picking apart other people's data and methodology.
 

anciendaze

Senior Member
Messages
1,841
Just to reiterate: common parametric statistics all depend on the assumption of normal distributions; if you are dealing with something else the meaning of the numbers you get is highly questionable. Normal (Gaussian) distributions originally arose in the context of instrumental errors, as in astronomy. If you are measuring positions of star images on a glass plate with a traveling microscope they work very well. In other contexts instrumental errors may scarcely be relevant, and the natural processes you are studying are likely to have other distributions. I keep mentioning Lévy distributions, but there are plenty of cases of power-law distributions in examples as different as earthquakes, stock market prices, reliability of machines and physiology.

The assumption of a normal distribution is equivalent to saying only the first two moments of the distribution are significant. All measures of significance depend not only on estimated mean value, which is fairly reliable, but also on estimated standard deviation or variance, which is much less reliable. You can run computer experiments to see just how sensitive this is, even in the ideal world of a mathematical model, to such things as a small number of questionable outliers. Once you realize how vulnerable a study is to manipulation by including or excluding outliers, you should be very cautious about drawing inferences from it.
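
Here is roughly what such a computer experiment looks like (toy numbers of my own, not any real study): adding or removing a couple of questionable outliers in one group is enough to shift both the estimated SD and the t-test p value.

```python
# Toy two-group comparison: the estimated SD and the t-test p value both move
# noticeably when a couple of questionable outliers are included or excluded.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
a = rng.normal(50, 10, size=30)
b = rng.normal(57, 10, size=30)

t, p = stats.ttest_ind(a, b)
print("without outliers: SD(b)=%.1f  p=%.4f" % (b.std(ddof=1), p))

b_with_outliers = np.append(b, [5.0, 120.0])   # two dubious extra observations
t, p = stats.ttest_ind(a, b_with_outliers)
print("with outliers   : SD(b)=%.1f  p=%.4f" % (b_with_outliers.std(ddof=1), p))
```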

If the underlying distribution is like a Lévy distribution, the analytical expression of the distribution will not even have a well-defined standard deviation. You can perform arithmetic on a sample set to get a number, but this will be completely dependent on the bounds on sampling and the number of samples -- factors which are entirely under the control of experimenters. The temptation to adjust these parameters to produce desired results should be obvious.
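
A quick sketch, using scipy's Lévy distribution purely as an illustration: the sample "standard deviation" never settles down, and the number you get is mostly an accident of sample size and whichever extreme values happened to be drawn.

```python
# The Lévy distribution has no finite variance; the sample SD keeps growing
# and jumping around as you take more data, instead of converging.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
for n in (100, 1_000, 10_000, 100_000):
    x = stats.levy.rvs(size=n, random_state=rng)
    print("n = %6d   sample SD = %12.1f" % (n, x.std(ddof=1)))
```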

Even in textbook cases where you really do have a normal distribution, like measured heights of army recruits, you can make nonsense of this assumption simply by mixing together the two different normal distributions for male and female heights. (Tip: you can't depend on "the law of large numbers" to save you here, because 2 is not a large number. This is one of the inside secrets I learned from advanced training in mathematics.) It is all too easy to find published examples of such blunders, and the PACE assumption that healthy and sick people are really the same is not an isolated absurdity.
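
A sketch of the recruits example with round made-up numbers (heights in cm, not real survey data): each group is normal on its own, but the pooled data are flagged as non-normal and the pooled SD is inflated.

```python
# Two clean normal distributions, pooled: the mixture is no longer normal.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
men = rng.normal(178, 7, size=5000)      # illustrative heights in cm
women = rng.normal(165, 7, size=5000)
pooled = np.concatenate([men, women])

for name, x in (("men", men), ("women", women), ("pooled", pooled)):
    _, p = stats.normaltest(x)           # D'Agostino-Pearson normality test
    print("%-6s mean=%.1f  SD=%.1f  normality p=%.4f"
          % (name, x.mean(), x.std(ddof=1), p))
# The pooled data are strongly flagged as non-normal, and the pooled SD (~9.6)
# exceeds either group's SD of 7 -- because 2 is not a large number.
```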
 

Esther12

Senior Member
Messages
13,774
I'm not sure I'm following the points you're making @anciendaze.

eg I don't get this:

If the underlying distribution is like a Lévy distribution, the analytical expression of the distribution will not even have a well-defined standard deviation. You can perform arithmetic on a sample set to get a number, but this will be completely dependent on the bounds on sampling and the number of samples -- factors which are entirely under the control of experimenters. The temptation to adjust these parameters to produce desired results should be obvious.

Admittedly, I did have to look up what a Lévy distribution was in order to even try, but can you dumb this down even more?

I haven't really thought about how routinely a normal distribution is assumed in medical papers. I need to kick myself into gear with learning some more basic stats. I keep putting this off.

re PACE combining sick and healthy people: I get what you're saying, but also, you could say that splitting sick and healthy people requires a somewhat arbitrary division be made.
 

Valentijn

Senior Member
Messages
15,786
I'm not sure I'm following the points you're making @anciendaze.
My basic understanding is that there are certain calculations which can only be done using data points which have a "normal distribution" and look similar to a standard bell curve. If there is no normal distribution, then calculations built on the standard deviation don't mean what they're supposed to mean - they're just babble.

Scatterplot graphs are great (and should be mandatory whenever applicable!), because you can glance at them and get a good idea of the distribution. If it's got a single high point and decreases similarly on both sides, then it might be a normal distribution. If there's a clump on one side and trails off to the other side, or has multiple high points, or clumps scattered around, or just has data points all over the place, it isn't a normal distribution, and standard deviation and other calculations are inappropriate and likely to be misleading. And there's the added bonus of seeing any extreme outliers which have skewed other calculations (averages).
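
Something like this is all it takes to do the eyeball check (toy data of my own; matplotlib assumed): plot the raw values before trusting any mean or SD.

```python
# Toy data only: eyeballing a histogram catches skew and ceiling effects that a
# mean and SD quietly hide.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(4)
roughly_normal = rng.normal(70, 10, size=300)
ceiling_skewed = np.clip(100 - rng.exponential(15, size=300), 0, 100)

fig, axes = plt.subplots(1, 2, figsize=(8, 3))
axes[0].hist(roughly_normal, bins=30)
axes[0].set_title("roughly normal")
axes[1].hist(ceiling_skewed, bins=30)
axes[1].set_title("skewed toward a ceiling")
plt.tight_layout()
plt.show()
```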

The SF-36 PF scale, which was used in PACE, does not have a normal distribution. It's very heavily skewed toward the higher scores around 90-100, with the percentage of the population having lower scores trickling away to the left as the scores go toward 0. Hence using a standard deviation with the SF-36 PF was extremely inappropriate, and created the problem where "very sick" in the real world was 65 or lower, and "normal/recovered" in their world of mangled statistics was 60 or higher.
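
Here's a toy demonstration of how that goes wrong (made-up 0-100 scores shaped roughly like a ceiling-skewed scale, emphatically not the actual SF-36 data): when most people pile up near the top and a minority trail down toward zero, "mean minus one SD" lands far below the typical person, so it says almost nothing about being close to normal functioning.

```python
# Made-up 0-100 scores, NOT real SF-36 data: a ceiling-skewed distribution
# drags the mean down and inflates the SD, so "mean - 1 SD" is a poor threshold.
import numpy as np

rng = np.random.default_rng(5)
healthy_bulk = np.clip(rng.normal(95, 5, size=8500), 0, 100)  # most score near 100
unwell_tail = rng.uniform(0, 80, size=1500)                   # minority score low
scores = np.concatenate([healthy_bulk, unwell_tail])

mean, sd, median = scores.mean(), scores.std(ddof=1), np.median(scores)
print("mean %.1f   median %.1f   SD %.1f" % (mean, median, sd))
print("mean - 1 SD = %.1f" % (mean - sd))
# For a true normal distribution about 16% of people would fall below
# mean - 1 SD; with this kind of skew the threshold sits far below the median
# and no longer carries that meaning.
```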
 

Valentijn

Senior Member
Messages
15,786
I haven't really thought about how routinely a normal distribution is assumed in medical papers. I need to kick myself into gear with learning some more basic stats. I keep putting this off.
Coursera has a lot of great statistics courses. One which is basic and started a few days ago is https://www.coursera.org/course/biostats

It's free to register with Coursera, pretty simple to use it, and you get a cute little certificate to download and/or print when you complete a course :smug:
 

anciendaze

Senior Member
Messages
1,841
Part of the problem with education in statistics comes from indoctrination with a particular worldview. I came up by a different path, and was already using some pretty advanced statistical ideas (probability amplitudes) which could be compared with objective measurements before I had an elementary course. This made the implicit assumptions much more obvious to me, since I already knew counterexamples.

The extent to which fundamental data on physiology depart from the conceptual model in which there are important mean values and meaningless random variation around these can be judged by such publications as these: fractal dynamics in physiology, power law RR-interval in heart, pulmonary power laws. These are really fundamental physiological variables which behave quite differently from expectations based on mean values, homeostasis and random variation. Work on gait and movement planning may even be directly applicable to people with MS, ME or Parkinson's disease.

The simplified version of my argument is that you really need evidence you are dealing with normal distributions before you blithely assume anything you encounter will conform to such expectations.

In the case of PACE, where the original population distribution is one-sided and has a long left tail, with mean, median and mode all different, the problem becomes more subtle because sampling is said to convert this into a set of normally-distributed groups in each arm of the study. The original assumption of "one standard deviation below the mean" was only used to produce an arbitrary threshold that sounded scientific. Framing the problem this way had the implicit effect of trivializing the illness, by exploiting professional readers' familiarity with the percentage of cases falling within one standard deviation of the mean of a normal distribution. This was a psychological maneuver to mislead readers and exploit known preconceptions about "CFS"; it did not actually change any numerical results, only the way they were interpreted. Because we are dealing with professional psychologists and statisticians we must assume this was deliberate.

The sampling question takes us much deeper into fundamental questions about statistical practices than busy doctors are likely to go. I've already said there are good reasons to believe the important idea of sample independence was massively violated. What distribution did individual groups have? I don't know. I can only say that the quoted p values don't mean very much.

This still limits us to the clear air of theoretical statistics, where it is easy to decide if playing the game according to pre-established rules will result in particular numbers. What sneaks up on people glancing at a paper is the extent to which these numbers are clinically meaningless. Yes, you had some effect on group means. No, it wouldn't mean much even if you were talking about improving performance of patients with heart failure. Returning patients to mean values of the healthy population for their ages would have meant adding about 300 meters to distance walked in six minutes, not less than 40 meters. We still don't know if the change in group means was caused by a few misdiagnosed individuals returning to healthy norms, or by insignificant changes in the bulk of the group. The fact that data which would decide this is being withheld invites speculation over motives.