"By J. Burmeister: Keep an Eye on Your Walitt: NIH Study Poses Dramatic Risk to Long-Term Disability

Dolphin

Senior Member
Messages
17,567
An F test for three groups just involves one test so doesn't need to be adjusted for multiple comparisons.

[As an aside, there is a lot of controversy about adjusting for multiple comparisons. In many ways it doesn't make sense].

Nothing has been said which would change my mind from thinking that having two control groups (n=20 x 2) which show similar results to each other but different results to ME/CFS would be better than having just one control group (n=20) with different results to ME/CFS.

It would be different if we or an ME/CFS charity was footing the bill for this extra data.
 

Jonathan Edwards

"Gibberish"
Messages
5,256
An F test for three groups just involves one test so doesn't need to be adjusted for multiple comparisons.

[As an aside, there is a lot of controversy about adjusting for multiple comparisons. In many ways it doesn't make sense].

Nothing has been said which would change my mind from thinking that having two control groups (n=20 x 2) which show similar results to each other but different results to ME/CFS would be better than having just one control group (n=20) with different results to ME/CFS.

It would be different if we or an ME/CFS charity was footing the bill for this extra data.

The problem I think is what question the F test is addressing. The standard question is whether your test group (ME) is compatible with being a sample of the same group as the controls. I do not follow how a three group test is better than a two group test for that question. If you have three groups then there is a risk that one of your control groups will be abnormal in the same direction as the ME group and then you will see no difference when in fact there is a difference from normals. It does not add up to me. I quite agree that more control groups allow you to ask more questions and these may be interesting questions but I cannot see how it improves the power of any particular question.

The reason for adjusting for multiple comparisons becomes very clear if you work in a lab where people do these analyses. On Monday your PhD student looks glum and says that there is no significant difference in her results. By Wednesday she says she has analysed it a different way and thinks there may be something in it. By Friday she pronounces that there is a clear difference with p value less than 0.002. It is a difference for a question nobody thought they were asking last week but who cares - you can publish!!!

In my view you need to decide what it is you really want to know at the beginning and optimise the chances of answering. If you really want to know if a result in ME is actually anything other than normal variation you want to have one big control group and one big ME group. Another control group may tell you that an abnormality in ME is specific or not to that illness but if you don't have the statistical power to know that it is even an abnormality what is the point?

My biggest worry about what I have seen about the design of this study and what has been said by the investigators is that they may not have thought through exactly how difficult it may be to even get a clear result, before moving on to detail.
 

Sidereal

Senior Member
Messages
4,856
In my opinion, designing studies that you know in advance cannot answer any question reliably due to lack of statistical power is unethical because you're putting the participants through the hassle of doing tests (potentially dangerous ones too like 2-day CPET and lumbar punctures) and you're polluting the published literature with spurious findings that might negatively influence the direction of future research and medical care of patients. Just consider the impact and the long-lasting harm of that abysmally designed IV acyclovir study the NIH published in the NEJM decades ago. ME/CFS is not the same politically as other diseases. If someone designs a shit study that doesn't find anything in MS or ALS, the negative finding isn't used by doctors and psychobabblers to literally humiliate and abuse patients or to deny them disability payments.

I worry that the design of this study will almost certainly result in its failure, just by default, due to lack of power. I have no idea why you would go on a fishing expedition like this with such small sample sizes. Since they have absolutely no idea what ME/CFS is and have no hypothesis as to what they're looking for, they also have no idea about the likely effect size, which means they cannot do a priori power calculations to determine an appropriate sample size. Therefore, the best thing to do would be to get rid of all the disease control groups and just study as many ME/CFS patients and normal controls as they can afford. That way, if something turns up, a signal of some sort, even if not statistically significant, it can be followed up in a future, more focused investigation designed to address just that one question.
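To make the point about a priori power calculations concrete, here is a minimal illustrative sketch (Python, statsmodels) of the calculation one would normally do up front; the assumed effect size is a hypothetical placeholder, and that is precisely the number nobody can supply without a hypothesis about the disease.

```python
# Illustrative only: a priori sample-size calculation for a two-sample t-test.
# The assumed effect size (Cohen's d = 0.5) is hypothetical; without a hypothesis
# about the disease there is no principled way to choose it.
from statsmodels.stats.power import TTestIndPower

n_per_group = TTestIndPower().solve_power(
    effect_size=0.5,   # assumed moderate effect (Cohen's d)
    alpha=0.05,        # two-sided significance level
    power=0.8,         # conventional 80% power
)
print(round(n_per_group))  # roughly 64 participants per group under these assumptions
```

Run the other way round, with n = 20 per group, roughly the same calculation says only quite large effects would be detected reliably, which is the power problem described above.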

Having said that, I am strongly opposed to the vehement opposition to this study. While the proposed study makes no sense to me, I think we should be trying to work with the NIH to improve it, not trying to get it cancelled. I see their willingness to do biological research on ME/CFS as a big step in the right direction and I worry that the rabid response from some corners of this community could put them off continuing with this research programme.
 

Sasha

Fine, thank you
Messages
17,863
Location
UK
I hope everybody with concerns about the study (me included) will write in with questions for Dr Nath for his Solve webinar in April. Let's make the most of our chance to ask questions.

Registration isn't open yet but should be before too long.
 

chipmunk1

Senior Member
Messages
765
Here is a comment from a former patient who was involved with Walitt, posted in the comment section of the Jeanette blog post we are discussing.

"Thank you for this article. I met Dr. Walitt years ago when he opened the Fibromyalgia Center at Georgetown University Hospital and was recruiting patients for his research. I was referred over from my rheumatologist at the time and very excited to be a part of the research. Then I went to the first day, which involved interviews, an exam, and a lecture. I couldn’t wait to leave and never went back. I was not impressed with the doctor or the program. Mostly I was offended by the continuous lecturing from Dr. Walitt which can be summed up as “you probably did too much and caused fatigue and pain and now you’re depressed and anxious and can’t ever get better because your depression and anxiety make you think you’re sicker than you are.” He was cold, direct, and impatient. I was exhausted, confused, and upset by the whole experience and cried the whole way home. It baffles me that he’s involved at the NIH level and frustrates me to no end to know that he’s involved in anything at all related to CFS/ME and FM."

So we have an investigator who does not believe the illness exists? What kind of investigation could that be?
 

duncan

Senior Member
Messages
2,240
Five will get you 10, @Sidereal, that most detractors of the NIH CFS study are detractors of the way it presently stands. They do not oppose an earnest biological study of CFS conducted by qualified and unbiased NIH investigators.

If the NIH changes the protocol so it makes sense, so that it excludes the psychological component, so that the math works... I suspect most current detractors would not only support the study but would sign up for it if they could.
 

LiveAgain

Senior Member
Messages
103
I think they may have tried to bring in everyone from the NIH that has ever published about CFS or even mentioned CFS in a publication. If so, this was obviously a bad way to recruit people, given their history in handling the disease.

I think this is exactly what happened. A question, because I'm not sure how this works: if they were willing to replace Walitt (and others), who would they put in their place, assuming the replacements have to come from the NIH since this is an NIH study, and there are no doctors there with real ME/CFS knowledge or experience?
 

Sasha

Fine, thank you
Messages
17,863
Location
UK
there are no doctors there with real ME/CFS knowledge or experience?

I'm not convinced that Walitt has any such experience, and it's better to have people with no experience who can be taught than people with the wrong experience who will introduce bias.

If they don't have NIH personnel who can do a particular job, I hope they'd have the sense to hire in. There's a brilliant French word, "psychorigide", which I think explains itself. I hope they're not going to be psychorigide about this.
 

LiveAgain

Senior Member
Messages
103
Oh believe me, I didn't mean to imply Walitt is experienced and I hate that he's involved. Maybe I didn't phrase my question well. I'm just asking who they could put in that role from NIH that would be acceptable to the patient community? I'd be okay with doctors new to the field who are unbiased, would take the necessary time to learn the condition and consult with experts. If they were able to hire in, that would be even better.
 

Dolphin

Senior Member
Messages
17,567
The reason for adjusting for multiple comparisons becomes very clear if you work in a lab where people do these analyses. On Monday your PhD student looks glum and says that there is no significant difference in her results. By Wednesday she says she has analysed it a different way and thinks there may be something in it. By Friday she pronounces that there is a clear difference with p value less than 0.002. It is a difference for a question nobody thought they were asking last week but who cares - you can publish!!!
Another way of looking at it: researcher #1 gets some abnormal test results comparing two samples in one experiment (X), p<0.04. Researcher #2 runs 20 tests, including getting the same results for experiment X. However, the same results are now not considered significant due to adjustments for multiple testing, and going forward that is seen as evidence that there wasn't a difference in experiment X.
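To put numbers on that parallel-universe scenario (the figures are the hypothetical ones above): with a Bonferroni correction across 20 tests the per-test threshold drops from 0.05 to 0.05/20 = 0.0025, so an identical result at around p = 0.04 no longer clears it.

```python
# The parallel-universe scenario with a Bonferroni correction (hypothetical numbers).
p_experiment_x = 0.04   # same raw result in both cases
alpha = 0.05

# Researcher #1: experiment X was the only test run.
print(p_experiment_x < alpha)              # True  -> reported as significant

# Researcher #2: experiment X was one of 20 tests.
n_tests = 20
print(p_experiment_x < alpha / n_tests)    # False -> 0.04 > 0.0025, "not significant"
```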


Epidemiology. 1990 Jan;1(1):43-6.
No adjustments are needed for multiple comparisons.
Rothman KJ.

Abstract
Adjustments for making multiple comparisons in large bodies of data are recommended to avoid rejecting the null hypothesis too readily. Unfortunately, reducing the type I error for null associations increases the type II error for those associations that are not null. The theoretical basis for advocating a routine adjustment for multiple comparisons is the "universal null hypothesis" that "chance" serves as the first-order explanation for observed phenomena. This hypothesis undermines the basic premises of empirical research, which holds that nature follows regular laws that may be studied through observations. A policy of not making adjustments for multiple comparisons is preferable because it will lead to fewer errors of interpretation when the data under evaluation are not random numbers but actual observations on nature. Furthermore, scientists should not be so reluctant to explore leads that may turn out to be wrong that they penalize themselves by missing possibly important findings.

PMID: 2081237
 

user9876

Senior Member
Messages
4,556
The problem I think is what question the F test is addressing. The standard question is whether your test group (ME) is compatible with being a sample of the same group as the controls. I do not follow how a three group test is better than a two group test for that question. If you have three groups then there is a risk that one of your control groups will be abnormal in the same direction as the ME group and then you will see no difference when in fact there is a difference from normals. It does not add up to me. I quite agree that more control groups allow you to ask more questions and these may be interesting questions but I cannot see how it improves the power of any particular question.

The reason for adjusting for multiple comparisons becomes very clear if you work in a lab where people do these analyses. On Monday your PhD student looks glum and says that there is no significant difference in her results. By Wednesday she says she has analysed it a different way and thinks there may be something in it. By Friday she pronounces that there is a clear difference with p value less than 0.002. It is a difference for a question nobody thought they were asking last week but who cares - you can publish!!!

In my view you need to decide what it is you really want to know at the beginning and optimise the chances of answering. If you really want to know if a result in ME is actually anything other than normal variation you want to have one big control group and one big ME group. Another control group may tell you that an abnormality in ME is specific or not to that illness but if you don't have the statistical power to know that it is even an abnormality what is the point?

My biggest worry about what I have seen about the design of this study and what has been said by the investigators is that they may not have thought through exactly how difficult it may be to even get a clear result, before moving on to detail.

I wonder about different approaches.

Firstly, if we view this as a hypothesis-finding exercise rather than one of confirming a hypothesis, then any significant tests could be viewed as interesting. The test would say there is an increased chance that this may be a marker, and a non-significant test shouldn't be seen as ruling something out, just reducing its likelihood. In essence this creates a work list of things to investigate further. This is where I feel unhappy about a simple testing approach, because consideration needs to be given to why any readings are abnormal and what that could say about the biology, and then to designing properly powered experiments in line with the theories that may be generated. There is a good paper by David Freedman called something like 'Statistics and Shoe Leather' where he basically argues against complicating models to control for too many variables, but says use statistics to give us an idea of where to look and then do more work to understand.

Another thing worries me. Hypothesis testing basically tests whether values are likely to be drawn from different distributions. But these can be overlapping distributions, so it doesn't mean they are any good as a test. I did work on classification techniques a while ago, and the idea there is to see what technique will reliably separate out different classes of data (e.g. different illnesses, or ME vs others vs healthy). We never used significance testing; results would be judged on classification performance, where we had three sets of data: a training set to develop the classifiers, a test set used to pick the best classifiers, and finally a verification set that was unseen whilst developing the classification systems. It was this last set that helped answer the question of how well the classifier would generalize to unseen data. This seems a much better approach in situations like this (although it requires more data), in part because classifiers can be built using multivariate methods (i.e. looking at many variables at once).
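For concreteness, a minimal sketch of that three-way split on synthetic data using scikit-learn; the sample size, features and candidate classifiers are all illustrative assumptions, not anything from the NIH protocol.

```python
# A minimal sketch of the train / pick / verify split described above, on synthetic data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=20, n_informative=5, random_state=0)

# Training set (60%), a test set used to pick the best classifier (20%),
# and a verification set kept unseen until the very end (20%).
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_pick, X_verify, y_pick, y_verify = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(random_state=0),
}
for clf in candidates.values():
    clf.fit(X_train, y_train)

# Pick the winner on the picking set (classification accuracy, not a p-value)...
best = max(candidates, key=lambda name: candidates[name].score(X_pick, y_pick))
# ...and only the winner is scored on the unseen verification set,
# which estimates how well it generalizes.
print(best, candidates[best].score(X_verify, y_verify))
```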
 

Valentijn

Senior Member
Messages
15,786
Rothman said:
A policy of not making adjustments for multiple comparisons is preferable because it will lead to fewer errors of interpretation when the data under evaluation are not random numbers but actual observations on nature.
It would be nice if the NIH used such an approach for this study, but they won't. Primarily because Rothman is probably in a pretty small minority who advocate against making corrections based on multiple comparisons.

The other problem is that several of the investigators have used small sample sizes and/or 2-3 control groups to bury results in the past, and the excess control groups in the present study almost certainly were proposed by some of the more questionable researchers.

Why hope that they won't take advantage of methodological flaws, when we can push to have those flaws rectified? Maybe everything will turn out okay with a trio of psychobabblers screening patients and several small control groups, but I think it's much more practical to acknowledge that these could become huge problems, and demand that the NIH rectify them. I don't want a bunch of "maybes", which could be pretty catastrophic in combination.
 

Dolphin

Senior Member
Messages
17,567
It would be nice if the NIH used such an approach for this study, but they won't. Primarily because Rothman is probably in a pretty small minority who advocate against making corrections based on multiple comparisons.
My arguments aren't contingent on this. As I said, an F-test is one test; it's not the same as running two or three t-tests. I think if one ran computer models on data, one should find that with two similar control groups (both n=20) one would get more statistically significant F-test results than with just one control group (n=20) and t-tests. When one talks about p values, one is talking about the probability of the findings happening by chance. More data in the same direction should decrease the chance that the findings were due to chance and increase one's confidence in genuine results, leading to more p-values below 0.05. I don't have the facilities to run such models to test this hypothesis, but intuitively I believe that is what should happen.
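For what it's worth, the thought experiment is straightforward to simulate; here is a rough Monte Carlo sketch, in which the effect size, the group sizes and the assumption that both control groups share one distribution are all hypothetical choices for illustration.

```python
# A rough Monte Carlo sketch of the one-control-group vs two-control-groups question.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
effect, n, runs = 0.8, 20, 10_000   # assumed shift in the ME/CFS group (in SDs)
t_hits = f_hits = 0

for _ in range(runs):
    me = rng.normal(effect, 1, n)    # ME/CFS group
    ctrl1 = rng.normal(0, 1, n)      # control group 1
    ctrl2 = rng.normal(0, 1, n)      # control group 2, similar to control group 1

    # One control group: a two-sample t-test.
    t_hits += stats.ttest_ind(me, ctrl1).pvalue < 0.05
    # Two control groups: a single F-test (one-way ANOVA) across three groups.
    f_hits += stats.f_oneway(me, ctrl1, ctrl2).pvalue < 0.05

print(f"t-test, one control group:  {t_hits / runs:.2f} of runs significant")
print(f"F-test, two control groups: {f_hits / runs:.2f} of runs significant")
```

Whatever such a run shows, the answer depends heavily on the assumed effect size and on how similar the two control groups really are.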
 

Jonathan Edwards

"Gibberish"
Messages
5,256
Another way of looking at it: researcher #1 gets some abnormal test results comparing two samples in one experiment (X), p<0.04. Researcher #2 runs 20 tests, including getting the same results for experiment X. However, the same results are now not considered significant due to adjustments for multiple testing, and going forward that is seen as evidence that there wasn't a difference in experiment X.

That situation is different, Dolphin. If one is replicating previous findings one views the data differently. Standard P value cut offs like 0.05 are not actually what determine decisions. So when at UCL the lab tried to replicate Bradley and Bansal's findings on B cells they would have taken a trend in the same direction as indicating further work was worthwhile. In fact one of the reasons why the work was set up was that Curriu had found the same as Bradley but without a significant p value - the trend was enough to make it look worth trying again.

In the NIH study they may be replicating things like NK function and I would expect a lower threshold for continuing work if a trend is found that fits. But that does not alter the general problem that with small groups you easily get type II errors because you have to draw the line somewhere.

The abstract from Rothman does not make sense to me. I doubt this is a widely held view. The bit about nature following regular laws seems to me a complete non sequitur. Sure, you get more type II errors if you do correct, but that is the price you pay for indiscriminate fishing in biological systems that obey regular laws with added random noise. Small samples give you type II errors too. The skill is to know when a sub-threshold trend is worth pursuing in either case. At some point you need to set out saying this is what we are trying to confirm, do a big enough sample to get something like p<0.001, and believe it.
 

Jonathan Edwards

"Gibberish"
Messages
5,256
I wonder about different approaches.

Firstly, if we view this as a hypothesis-finding exercise rather than one of confirming a hypothesis, then any significant tests could be viewed as interesting. The test would say there is an increased chance that this may be a marker, and a non-significant test shouldn't be seen as ruling something out, just reducing its likelihood. In essence this creates a work list of things to investigate further. This is where I feel unhappy about a simple testing approach, because consideration needs to be given to why any readings are abnormal and what that could say about the biology, and then to designing properly powered experiments in line with the theories that may be generated. There is a good paper by David Freedman called something like 'Statistics and Shoe Leather' where he basically argues against complicating models to control for too many variables, but says use statistics to give us an idea of where to look and then do more work to understand.

Another thing worries me. Hypothesis testing basically tests whether values are likely to be drawn from different distributions. But these can be overlapping distributions, so it doesn't mean they are any good as a test. I did work on classification techniques a while ago, and the idea there is to see what technique will reliably separate out different classes of data (e.g. different illnesses, or ME vs others vs healthy). We never used significance testing; results would be judged on classification performance, where we had three sets of data: a training set to develop the classifiers, a test set used to pick the best classifiers, and finally a verification set that was unseen whilst developing the classification systems. It was this last set that helped answer the question of how well the classifier would generalize to unseen data. This seems a much better approach in situations like this (although it requires more data), in part because classifiers can be built using multivariate methods (i.e. looking at many variables at once).

I absolutely agree that the reality is that you do not either pursue or not pursue based on some arbitrary p value. In this sort of look see exercise it would be false logic to do so. But the problem I see here is simply that what experience we have in ME/CFS suggests that if there is something to find in these sorts of tests it may well be something affecting a minority which will be pretty hard to eyeball in a group of 20. It has been suggested that type II errors are unlikely because something driving the disease would stick out like a sore thumb, but our experience with other immunological conditions is that this may not be the case. If it was a question of picking out a unique autoantibody it would be easy. But if it is a question of picking up some subtle shift that indirectly reflects something specific in a minority it will be tough.
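A rough simulation can illustrate how the "abnormality in a minority" scenario plays out against a group-level test; every number here is an assumption for illustration only.

```python
# Illustrative sketch of the "abnormality in a minority" problem; all numbers assumed.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, runs = 20, 10_000
affected = 5      # suppose only 5 of the 20 ME/CFS patients carry the abnormality
shift = 1.5       # and in them the marker is shifted by 1.5 standard deviations
hits = 0

for _ in range(runs):
    controls = rng.normal(0, 1, n)
    me = rng.normal(0, 1, n)
    me[:affected] += shift           # the abnormal subgroup
    hits += stats.ttest_ind(me, controls).pvalue < 0.05

print(f"Runs where the group difference reaches p < 0.05: {hits / runs:.2f}")
```

With numbers in this range the group comparison clears p < 0.05 well under half the time, which is the kind of type II error being described.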
 

Sasha

Fine, thank you
Messages
17,863
Location
UK
I absolutely agree that the reality is that you do not either pursue or not pursue based on some arbitrary p value. In this sort of look see exercise it would be false logic to do so. But the problem I see here is simply that what experience we have in ME/CFS suggests that if there is something to find in these sorts of tests it may well be something affecting a minority which will be pretty hard to eyeball in a group of 20. It has been suggested that type II errors are unlikely because something driving the disease would stick out like a sore thumb, but our experience with other immunological conditions is that this may not be the case. If it was a question of picking out a unique autoantibody it would be easy. But if it is a question of picking up some subtle shift that indirectly reflects something specific in a minority it will be tough.

Hope you're writing and telling them so!
 

Dolphin

Senior Member
Messages
17,567
Dolphin said:
Another way of looking at it: researcher #1 gets some abnormal test results comparing two samples in one experiment (X), p<0.04. Researcher #2 runs 20 tests, including getting the same results for experiment X. However, the same results are now not considered significant due to adjustments for multiple testing, and going forward that is seen as evidence that there wasn't a difference in experiment X.
That situation is different, Dolphin. If one is replicating previous findings one views the data differently.
Perhaps I wasn't clear: what I was discussing wasn't meant to be a replication. I was talking about a parallel universe in which the same experiment is run and the same results obtained, but this time other tests were run at the same time. Looks like we'll have to agree to disagree, because I don't accept that a positive result suddenly becomes a negative result because more tests were run at the same time, which is the shorthand way of reading significance testing. I do accept that more caution should be used when interpreting situations where more tests have been run, but like Rothman I don't think that adjusting for multiple comparisons is the way to go.
 

Jonathan Edwards

"Gibberish"
Messages
5,256
Perhaps I wasn't clear: what I was discussing wasn't meant to be a replication. I was talking about a parallel universe in which the same experiment is run and the same results obtained, but this time other tests were run at the same time. Looks like we'll have to agree to disagree, because I don't accept that a positive result suddenly becomes a negative result because more tests were run at the same time. I do accept that more caution should be used when interpreting situations where more tests have been run, but like Rothman I don't think that adjusting for multiple comparisons is the way to go.

Nothing ever becomes a positive or a negative. What changes is the probability that the result reflects a real biological difference rather than noise, and that depends on all sorts of hard to pin down Bayesian considerations. The real problem is people - meaning scientists - and what they will convince themselves is real biology rather than noise. Lab science is an uphill struggle against PhD students wanting to get something publishable - and in most cases fiddling the data to get that (sadly). The p value adjustment is arbitrary but as people in genome searches know so well, it is a rough and ready way of making sure we do not chase every wild goose a PhD student (or just biological variation) has cooked up for us.

The specific problem here is that if you do a three-group analysis instead of a two-group analysis you probably have at least a six-tailed instead of a two-tailed test. If you have plausible reasons for saying that only one or two of the tails are what one would reasonably be looking for, that is fine, you would have greater power, but I am not sure I see that here.

Edit: actually the p value adjustment is not arbitrary, I suspect it is metaphysically cast iron. What that does not take into account is that all the hard to pin down Bayesian factors skew what your p value threshold should be for a decision on any particular finding. For instance if you don't get a smooth dose response curve you suspect you have a lemon - on the basis of prior likelihoods relating to physical chemistry.
 

Dolphin

Senior Member
Messages
17,567
I agree that Bayesian and other methods to analyse data are better than simple yes/no hypothesis testing.

From reading the ME/CFS literature, I've just seen too many possibly promising results being discounted or discarded because they were just over some arbitrary p-value threshold, and having stricter thresholds (because one has adjusted for multiple comparisons) can make this worse.
 

jimells

Senior Member
Messages
2,009
Location
northern Maine
Why hope that they won't take advantage of methodological flaws, when we can push to have those flaws rectified? Maybe everything will turn out okay with a trio of psychobabblers screening patients and several small control groups, but I think it's much more practical to acknowledge that these could become huge problems, and demand that the NIH rectify them. I don't want a bunch of "maybes", which could be pretty catastrophic in combination.

My understanding of the PACE trial is that advocates raised objections to the study design as soon as it was announced. Those objections were of course dismissed. It is now obvious to anyone not paid to be blind that the objections were valid. It took over a decade of struggle - the trial was announced in 2003.

No severe patient can afford to spend another decade fighting against more bad research - even if they have the strength to do so.

We all laughed when Dr Collins stated that the UK psychobabblers don't have "the right skill set". I now wonder if he meant the PACE People don't have "the right skill set" to keep this illness classified as psychosomatic - but NIH does.