Does p-hacking and other statistical malpractice happen in biomed as well as psych studies?

trishrhymes

Senior Member
Messages
2,158
A topic for discussion if anyone is interested.

I have just been laying down the law on a particularly bad case of p-hacking and an unjustified assumption that correlation implies causation in yet another psych study.

post #8 on this thread:
http://forums.phoenixrising.me/inde...sment-predicts-me-severity.53509/#post-889232

In summary, here's what I said:

1.
p-hacking. Carry out lots of statistical tests on a pile of data, look for any that happen to fall just below the magic p = 0.05 level, and attribute meaning to what is probably chance variation. If you do enough statistical tests on a large, completely random set of data, some of them are sure to fall below p = 0.05 by chance alone (see the simulation sketch after this summary).

That's why psychologists love p hacking - they can do lots of questionnaires, run them through stats packages they probably barely understand, search for magic numbers less than 0.05, and hey presto, a published paper. They have 'discovered' something.

WRONG.

2.
Assume correlation implies causation.

They find most of the factors they studied were not statistically significant, but luckily one was, so they build a theory around it. Hey presto, a published paper that has 'discovered' something clinically significant.

WRONG.
...........................................
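A rough way to see point 1 in action (a sketch in Python with numpy and scipy; the 'questionnaire items' here are pure random noise, not data from any real study):

```python
# Run many t-tests on pure noise and count how many come out "significant".
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_tests = 100        # e.g. 100 questionnaire items, all of them pure noise
n_per_group = 30     # hypothetical patients vs controls

false_positives = 0
for _ in range(n_tests):
    group_a = rng.normal(size=n_per_group)   # no genuine difference between groups
    group_b = rng.normal(size=n_per_group)
    _, p = stats.ttest_ind(group_a, group_b)
    if p < 0.05:
        false_positives += 1

print(f"'Significant' results found in pure noise: {false_positives} / {n_tests}")
```

Run it and you'll typically get around five 'discoveries' out of a hundred, purely by chance - which is exactly the 1-in-20 trap.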

We are all aware that this goes on a lot in psychological research. Esther Crawley is a master of the art (in the worst possible sense). That's partly why I am so against her running MEGA - it will give her access to thousands of questionnaires full of lovely data for her to hack her way through till retirement, digging out all sorts of meaningless correlations and giving her more opportunities to blame patients and parents.

So far, so uncontroversial.
..................................................

But what about all the lovely biomedical studies?

We hear, for example, that some researchers somewhere have found a biomarker, using quite a small sample of patients and a lot of data. But did they just p-hack through the data, find a p value less than 0.05 and shout 'bingo!'? I don't know.

And other small studies have come up with all sorts of interesting sounding 'significant' findings that other teams don't seem to be able to replicate.

And future research funding can depend on having a track record of finding interesting, 'significant' results. Jobs depend on publication, and publication depends on 'significant' results.

I think the good scientists are well aware of this, and make it clear that their findings are preliminary and need to be validated in further large unrelated cohorts.

But are we (and I include myself in this) so eager to hear about biological evidence that we jump on every 'significant' p value, and want to know the implications for treatment before it has even been validated?

Discuss.
 

Londinium

Senior Member
Messages
178
I'm suspicious of a number of studies that use post-hoc subgroup analysis and then find 'statistically significant' results, whether that is a biomedical or a psych study. It's tricky as it seems reasonable (to me) that ME/CFS is a heterogeneous condition that will require subgroup analysis - but the size of many of the studies means that they're just too underpowered to do this properly. I'm also a little wary of the SNP studies pending replication.

I would really like to see more testing/validation cohorts. So you've got 100 patients? Great, do a statistical analysis on 60 of them and then test the other 40 only for the variables you think might be significant based on your findings from the first 60.
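As a rough sketch of what that 60/40 approach could look like (made-up data and marker names, Python with numpy and scipy - just to illustrate the idea, not anyone's actual analysis):

```python
# Split one cohort into a discovery set and a validation set; only the markers
# flagged in discovery get re-tested in validation, at a stricter threshold.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_per_group, n_markers = 100, 50

# Made-up measurements: 50 candidate markers, none genuinely different.
patients = rng.normal(size=(n_per_group, n_markers))
controls = rng.normal(size=(n_per_group, n_markers))

# Discovery: first 60 of each group, screen every marker.
candidates = [m for m in range(n_markers)
              if stats.ttest_ind(patients[:60, m], controls[:60, m])[1] < 0.05]
print("Markers flagged in discovery:", candidates)

# Validation: remaining 40 of each group, test only the flagged markers,
# with the threshold tightened for the number of candidates carried forward.
threshold = 0.05 / max(len(candidates), 1)
for m in candidates:
    _, p = stats.ttest_ind(patients[60:, m], controls[60:, m])
    verdict = "replicates" if p < threshold else "does not replicate"
    print(f"Marker {m}: validation p = {p:.3f} ({verdict})")
```

With noise-only data like this, a few markers usually get flagged in discovery but almost never survive validation - which is the whole point of the exercise.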
 

alex3619

Senior Member
Messages
13,810
Location
Logan, Queensland, Australia
The better studies also want p values much lower than .05, often starting with a string of zeros. You can also partially counter chance associations by a variety of methods. The most important is probably replication on a new cohort, and not accepting .05 results. P values are only an estimate of the risk that the result is due to chance. They are not magical. Biased methods can lead to good p values. It's not by chance, it's by bias.

All results should be considered tentative, even from our favourite researchers. Multiple validation from multiple groups on multiple cohorts is definitely needed. Sadly there is typically very little or no money to do that.

Most science considers even a p value of 0.01 to be too high. Physics reviewers are looking for many zeros. However in less precise areas of investigation, of which psychology, psychiatry and economics are some of the worst, good p values are rare. In some of the biomedical research on ME however we sometimes see very low p values. Those are the most interesting findings.

P-hacking is most definitely found in many disciplines. It's considered a major problem; I keep reading about it. So is the publication of such papers, as it means both reviewers and editors have failed.
 

trishrhymes

Senior Member
Messages
2,158
I agree with both of you, @Londinium and @alex3619 .

When I studied statistics at University back in the dark ages (50 years ago), I'm sure I was taught that 0.05 was to be considered just a clue that something might be worth investigating further, and for things that matter, like medical studies, 0.01 or even 0.001 were more sensible levels to use.

And when I taught A level statistics, we gave the same message to students, along with the mantra 'correlation doesn't imply causation' with some daft examples to drum the point home.

Back in my Uni days, p hacking wasn't really an option in the same way it is now because computers were big things that lived in air conditioned buildings and were contacted via punch cards... God I sound old!

Now we are faced with psychs who probably last studied maths at age 16, learned mostly non-parametric stats tests on their psychology courses and hadn't a clue what they meant, and then graduated to vast stats packages on computers doing all sorts of magic things with numbers. No wonder they don't seem to realise what crap they are producing.

Then of course there's the other old mantra: garbage in, garbage out. Enough said.

I think I started this thread because I've caught myself several times ranting on about p-hacking etc on other threads, so I thought I'd give my rants a home of their own.

Do feel free to join in and rant away....
 

barbc56

Senior Member
Messages
3,657
Here is an informative article about p values. Some scientists, such as John Ioannidis, are calling for a p value threshold of .005 instead of .05.

P values are easily misinterpreted. For example, the p values in an experiment pertain only to that experiment, and a p value is not the same as the probability of making a mistake in an experiment.

It took me a long time to wrap my head around these concepts and I still have to occasionally review them.

Maybe, someone else such as @Dolphin or others can explain this better than I can.

These are explained in the below website. There are also other references cited within the article.

http://blog.minitab.com/blog/adventures-in-statistics-2/how-to-correctly-interpret-p-values
 

Barry53

Senior Member
Messages
2,391
Location
UK
This has been doing my head in, and it's your fault I'm up so late tonight, Trish :sleep: :). Stats is a weak subject of mine, and I've been trying to understand what this is really about.

So I'm well aware many of you have a very good understanding of this, and the following are notes I've written to myself in my efforts to understand it. I'd therefore be grateful if people could comment on what might be right or wrong with my understanding, based on these notes:-

If a null hypothesis is true for a population, the p-value is the probability your sample data shows it to be false, assuming the only differences between your sample data and the population are due to random variations in the sample data.

So if we take a null hypothesis that CBT has no effect in a given population, and an experimental sample has a p-value of 0.05, then if CBT has no effect in that population there is a 5% chance, due to sampling variation, that the sample data would show CBT does have an effect.

Conversely, given the same population, there is a 95% chance that the same sample data would show CBT has no effect.

Importantly, given the same population and sample, this is not saying there is a 95% chance CBT does have an effect!

The p-value therefore only gives the probability that data sampled from a population where the null hypothesis is true, does not support the null hypothesis for that population, assuming the only source of error is random sampling error. It is about the likelihood of random variation causing sampled data to mislead about the population the sample was drawn from, nothing else.

So a p-value therefore gives no probability relating to any other sources of error, such as a sample being drawn from a population different to the population it purports to represent, or the myriad other methodological errors that can occur.
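To check my intuition about that 5% figure I also tried a small simulation (a rough sketch of my own in Python with numpy and scipy; all the numbers are invented):

```python
# Simulate many trials in which CBT truly has no effect, and count how often a
# t-test still comes out "significant" at the 0.05 level.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_trials, n_per_arm = 10_000, 50

significant = 0
for _ in range(n_trials):
    cbt_arm = rng.normal(loc=0.0, scale=1.0, size=n_per_arm)      # no real effect
    control_arm = rng.normal(loc=0.0, scale=1.0, size=n_per_arm)  # same distribution
    _, p = stats.ttest_ind(cbt_arm, control_arm)
    if p < 0.05:
        significant += 1

print(f"Fraction of null trials 'significant' at 0.05: {significant / n_trials:.3f}")
```

It should come out very close to 0.05, which at least matches my reading of the 5% false-positive rate when the null hypothesis is true.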

Is the above correct, and if not where is it wrong please?
 
Last edited:

Woolie

Senior Member
Messages
3,263
@trishrhymes, I'm also very frustrated with a lot of the biomedical research into MECFS for the same reason. In cytokine studies, there are so many comparisons that can be performed, and consequently, many opportunities to obtain false positives. Sometimes the results are not internally coherent either - for example, some cytokines are found to be more abundant in severe than in mild patients, but then the same study may find no difference between patients and controls. Okay, this is not proof that the finding is meaningless, but it suggests a need for extreme caution.

My own area, neuroscience, is one of the worst offenders. It's crippled with false-positive findings - people doing too many comparisons between groups, then reporting the inevitable positive results they will get using this approach. You're virtually guaranteed to find something.

Also, again, there's often little attempt to examine whether findings are coherent. For example, one study I recently read reported that MECFS patients had selectively reduced grey matter volume in the occipital cortex. But this is inconsistent with the cognitive profile of MECFS - visual recognition deficits are very far down on the list of cognitive complaints (much bigger problems are working memory, sustained attention and cognitive control). Too often, researchers just blindly and uncritically report their findings without considering whether they even make sense.

So false positives are a huge problem generally across many domains. But still, I think they are way more dangerous when they involve psychological/social variables, because they may close down future, potentially more fruitful research avenues.
 

Woolie

Senior Member
Messages
3,263
There is a huge debate in Psychology about whether it should change from using p < .05 to p < .005. The argument is that this would cut down on the number of false positives and improve the replicability of findings.

Not everyone agrees. Some say the problem is not with the p value itself, but with the practice of null hypothesis testing. They say we should stop doing that, and instead report effect sizes, confidence intervals, or perhaps Bayes factors. Bayes factors assess the degree of evidence in support of two opposing possibilities. So they can be used to show that one thing actually doesn't affect another (you can't show this with traditional hypothesis testing: not finding significance just means you haven't found enough evidence to show an effect; it's not the same as definitively being able to conclude 'no effect').
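For anyone wondering what 'report effect sizes and confidence intervals' actually looks like, here's a minimal sketch (my own made-up data, Python with numpy and scipy; Bayes factors need a bit more machinery so I've left them out):

```python
# Report an effect size (Cohen's d) and a 95% confidence interval for a group
# difference, rather than just a p value.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
patients = rng.normal(loc=1.2, scale=2.0, size=40)   # hypothetical scores
controls = rng.normal(loc=0.0, scale=2.0, size=40)

diff = patients.mean() - controls.mean()

# Cohen's d: the mean difference in units of the pooled standard deviation.
pooled_sd = np.sqrt((patients.var(ddof=1) + controls.var(ddof=1)) / 2)
cohens_d = diff / pooled_sd

# 95% confidence interval for the difference in means (equal-variance t interval).
n1, n2 = len(patients), len(controls)
se = pooled_sd * np.sqrt(1 / n1 + 1 / n2)
half_width = stats.t.ppf(0.975, n1 + n2 - 2) * se

print(f"Mean difference = {diff:.2f}, Cohen's d = {cohens_d:.2f}, "
      f"95% CI = ({diff - half_width:.2f}, {diff + half_width:.2f})")
```

The point is that the size of the effect and the uncertainty around it are reported directly, instead of being collapsed into a single 'significant or not' verdict.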

Edit: I expect that the (largely very weak) researchers that look into psychological aspects of MECFS are not aware of this debate, or if they are, they're not interested. And frankly, these weak researchers probably do need rules. I think they would be able to abuse confidence intervals and probabilities even more than they abuse hypothesis testing methods.
 
Last edited:

Barry53

Senior Member
Messages
2,391
Location
UK
My own area, neuroscience, is one of the worst offenders. It's crippled with false-positive findings - people doing too many comparisons between groups, then reporting the inevitable positive results they will get using this approach. You're virtually guaranteed to find something.
Are you saying this is because if you do more and more comparisons of different things, then it gets harder and harder to be sure the only error-source is sampling errors, rather than some other kind of sampling (or maybe other) bias that might unwittingly creep in, without the researchers even being aware of it? The p-values showing there is an adequately low probability of sampling-error skewing the results, and not realising (or looking hard enough for) other bias that might be skewing things? So easier to fool themselves?
 
Last edited:

Londinium

Senior Member
Messages
178
Just some more thoughts: p-hacking really takes two forms. The first is where a researcher just looks at too many possibilities, doesn't allow properly for multiple comparisons and finds something entirely by chance. This, coupled with publication bias (they wouldn't have published had they not found this spurious correlation), is a semi-accidental way that incorrect stuff gets in the literature. The second is the more invidious form, where the data is tortured until it confesses to a correlation. Whilst we've definitely seen it in BPS studies on ME/CFS, it affects all realms of science and we shouldn't rule it out for biomedical research just because we favour the results. (I can think of one study in particular where I have a suspicion this was the case but as it's just a gut feeling it would be unfair for me to say which one).

On a partly-related point - and apologies if this is hijacking your thread @trishrhymes - if this hasn't already been done I wonder if it's worth pulling together some sort of checklist of questions to ask when reviewing a paper (regardless of whether it's BPS or biomedical)? I have a sort of mental checklist that I run through. For me it's along the lines of (and this isn't exhaustive):
  • Did they use proper selection criteria?
  • What is the sample size?
  • Are the controls properly matched?
  • Have they adjusted p-values for multiple comparisons?
  • Is the result dependent on a subgroup analysis? Did they decide on this subgroup before or after seeing the data?
  • Has there been any attempt at replication in a validation cohort?
  • [For clinical trials] Has there been any outcome switching? Are there objective measures? Are the primary measures consistent with secondary measures?
  • Are possible confounding factors discussed and ruled out? (This is a big bugbear of mine on many microbiome studies)
  • [For biomedical ME/CFS studies] Was testing performed in or out of serum?
I'm sure many others here do the same, and those more scientifically literate than me could do a far more comprehensive job.
 

trishrhymes

Senior Member
Messages
2,158
and apologies if this is hijacking your thread @trishrhymes

Not my thread, everyone's thread. I'm delighted you've picked up the ball and run with it.

I like your idea of a checklist. I think we probably need at least 3 separate checklists for -

Clinical trials - both drug and psychological

Epidemiological studies - biomed, e.g. genomics, and psychological, e.g. Crawley's stuff about prevalence, recovery rates, parental influence, etc.

Hypothesis-free searches for information - like the Naviaux metabolomics and Ron Davis's severely ill study.

Part of my idea with this thread was that I seem to have tried to pull lots of the small terrible psych studies apart on their individual threads and said the same stuff about p hacking and correlation/causation. Instead I can just refer anyone interested to this thread!
 

trishrhymes

Senior Member
Messages
2,158
There is a huge debate in Psychology about whether it should change from using p < .05 to p < .005. The argument is that this would cut down on the number of false positives and improve the replicability of findings.

Not everyone agrees. Some say the problem is not with the p value itself, but with the practice of null hypothesis testing. They say we should stop doing that, and instead report effect sizes, confidence intervals, or perhaps Bayes factors. Bayes factors assess the degree of evidence in support of two opposing possibilities. So they can be used to show that one thing actually doesn't affect another (you can't show this with traditional hypothesis testing: not finding significance just means you haven't found enough evidence to show an effect; it's not the same as definitively being able to conclude 'no effect').

Edit: I expect that the (largely very weak) researchers that look into psychological aspects of MECFS are not aware of this debate, or if they are, they're not interested. And frankly, these weak researchers probably do need rules. I think they would be able to abuse confidence intervals and probabilities even more than they abuse hypothesis testing methods.

That's really interesting, thanks, @Woolie. I confess my stats knowledge doesn't run to the details of effect sizes and Bayes factors. I really must update my knowledge sometime. Things have changed in 50 years!
 

Woolie

Senior Member
Messages
3,263
Are you saying this is because if you do more and more comparisons of different things, then it gets harder and harder to be sure the only error-source is sampling errors, rather than some other kind of sampling (or maybe other) bias that might unwittingly creep in, without the researchers even being aware of it? The p-values showing there is an adequately low probability of sampling-error skewing the results, and not realising (or looking hard enough for) other bias that might be skewing things? So easier to fool themselves?
Good question, @Barry53. Just to see if I can do it, I'm gonna answer without assuming any prior knowledge of stats (although I realise you know quite a lot).

Here's the basic logic:
Inferential statistics like t tests use the information about the variability within each group to assess how reliable the difference between groups is. So to illustrate, in the figure below, we would tend to be more persuaded by the results of Study 1 on the left than those of Study 2. In both studies, the difference between means for the groups is the same, but in Study 2, person-to-person variability was high, so the difference between the group means looks unimpressive. But in Study 1, the difference between groups looks much bigger than you'd expect just based on the variability from person to person.

(see table in attached file)

The t test and its close relatives (like ANOVA) use this very same logic. They generate a ratio, which expresses the between-group difference relative to the within-group variability. The higher this ratio, the higher the likelihood that participants in your two groups are actually genuinely different. If you assume the scores are distributed in a certain way, you can calculate a pretty reliable estimate of this likelihood, called p. We say that if p is less than .05, that means the chances of the two groups being no different is less than 1 in 20.

The actual inference you can make from a p below .05 is that there is less than a 1 in 20 chance that you would observe this value if the two populations you compared in fact didn't differ reliably on this measure. That is, because you just happened, by accident, to pick out a few people that were unusually high (sampling error). You can further infer that if you sampled another group of participants from each population (controls, ME patients, whatever), these would also be highly likely to be different.
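To make that logic concrete, here's a tiny made-up example (Python with numpy and scipy; nothing to do with the numbers in the attached figure): the difference between the group means is identical in both 'studies', but the p values are very different because of the within-group variability.

```python
# Same difference between group means, different person-to-person variability,
# very different t and p values.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n = 25
mean_difference = 1.0

# Use the same pattern of individual variation in both studies, so the only
# thing that changes is how spread out people are within each group.
base = rng.normal(size=n)
base -= base.mean()

for label, spread in [("Study 1 (low person-to-person variability)", 1.0),
                      ("Study 2 (high person-to-person variability)", 5.0)]:
    controls = base * spread
    patients = base * spread + mean_difference
    t, p = stats.ttest_ind(patients, controls)
    print(f"{label}: difference in means = {mean_difference:.1f}, "
          f"t = {t:.2f}, p = {p:.4f}")
```

Study 1 gives a large t and a small p; in Study 2 the very same mean difference is swamped by the within-group spread, so t is small and p is large.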

Why doing many tests is problematic:
Importantly, it's all probabilities. You could do a test and get a p value less than .05, and it could be a fluke: you simply chose people for your study who happened to be very different on this measure, when really the two populations, considered as a whole, are not reliably different.

According to the logic of the t-test, you have a 1 in 20 chance of getting a fluke like this, even if you do everything right and conduct only one t-test.

But imagine that you do 20 t-tests, each comparing a different cytokine or the like. At least one of these is quite likely to be significant by chance alone (with 20 independent tests at the .05 level, the chance of at least one false positive is about 64%). Remember, you have a 1 in 20 chance of observing a p value below .05 even if the populations you're comparing are not genuinely different.

In actual fact, the problem is even messier than this. If you perform each comparison on the same sample of people, the tests you do are not independent. If your groups are not representative of their population, this error will affect all the analyses in very similar ways.

To avoid this problem:
Researchers should reduce their criterion value for p. If you're doing 20 comparisons, then it would be more appropriate to use a criterion p value of .0025 (20 times lower).

A better way is to only do comparisons that you have a prediction about, and not do the others. Then you don't have to correct so steeply.
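As a quick sketch of that first fix - dividing the criterion by the number of comparisons, usually called a Bonferroni correction - here's what it looks like with 20 made-up p values (Python with numpy; not from any real study):

```python
# Compare 20 p values against 0.05 and against the corrected 0.05 / 20 = 0.0025.
import numpy as np

p_values = np.array([0.38, 0.04, 0.21, 0.0007, 0.11, 0.049, 0.73, 0.26,
                     0.018, 0.55, 0.31, 0.09, 0.44, 0.67, 0.003, 0.81,
                     0.12, 0.29, 0.06, 0.52])   # 20 hypothetical comparisons

alpha = 0.05
corrected = alpha / len(p_values)               # 0.0025 for 20 comparisons

print("'Significant' at the naive 0.05 level:", int(np.sum(p_values < alpha)))
print(f"Still significant after correction ({corrected}):",
      int(np.sum(p_values < corrected)))
```

With these numbers, five comparisons look 'significant' at 0.05 but only one survives the corrected threshold.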

What's p-hacking?
It's when you cheat - you do lots of comparisons, but then you never fess up to the ones that weren't significant. You only report the significant ones. As we've shown above, if you do only one test and get a significant p value, that means something. But if you do 20 and get one significant p value, it is very likely to be due to chance alone. By hiding the other tests you did, you're concealing the problem from the reader.
 

Attachments

  • Document2.jpg
    777.8 KB · Views: 11
Last edited:

Londinium

Senior Member
Messages
178
But are we (and I include myself in this) so eager to hear about biological evidence that we jump on every 'significant' p value, and want to know the implications for treatment before it has even been validated?

Because of very real flaws in scientific research from mainstream medicine when it comes to ME/CFS, there is also the trap of being more open to nonsense coming from those sitting outside mainstream medicine. Many of the flaws seen in something like the PACE trial mirror those seen in 'alternative' medicine - the fact that these got past The Lancet doesn't mean that they validate the same flaws in CAM 'research'.

Given that the MEA is this morning uncritically retweeting newspaper stories based on press releases from some deeply scientifically dubious organisation, I should add to my list some potential red flags to watch for when it comes to 'alternative' medicine:
  • Does it use poorly specified 'sciencey' sounding terms like 'energy', 'natural' or 'chemicals'? (Especially if it draws an entirely false distinction between 'natural' and 'chemicals'!!!)
  • Does the author have a deep financial interest in convincing you of their method? e.g. does the author claim serious and complicated medical disorders can be cured just by taking a few (expensive) food supplements that the author just happens to sell.
  • Is the claim accompanied by a wide-ranging antipathy towards science and the scientific method? Not specific criticisms of specific trials/treatments, but just a general hostility.
  • (My favourite) Does the author list about 400 letters after their name as if they're overcompensating when trying to convince you of their credentials?