Measures of outcome for trials and other studies

Jonathan Edwards · Mar 25, 2015

I am interested in what people think about the design of outcome measures for trials and perhaps epidemiological research. Should they be subjective or objective and what exactly do those terms imply? Should people try to measure things like fatigue quantitatively or is this a bogus idea? Should specific neurocognitive or activity (e.g. CPET) tests be used or are they too indirect? ...

There are a mass of questions and some of them have been covered before usefully. I would like to focus on one or two specifically but I am happy for the thread to go wherever it goes. I have been trying to figure out what is wrong with the outcome measures currently being used, particularly in the context of the difficulty of blinding treatments. CBT and GET cannot be blinded. Drugs like valcyte or rituximab can in theory be blinded but it is far from certain that they really are, since patients get to know what the real drug feels like. That makes subjective outcome measures very suspect, but it can be argued that more objective surrogate measures are not measuring what is really important. And so on.

During discussions with PWME and also psychiatrists I was reminded of the structure of an outcome measure that has been used in rheumatology for decades, although it may recently have been overtaken by something else (maybe not for the better). The measure is the American College of Rheumatology Grade of Improvement and it comes in 20%, 50% and 70% versions.

The first thing to note about this measure is that it is only a measure of change. It combines various measurements that you could compare between patients, like how many joints are swollen, but it does not make that comparison. This seems to me important, especially for ME, since I personally think it is bogus to even try to compare how severe one person's illness is with another's. So a version for ME might try to measure change in fatigue but would not 'measure fatigue'.

The second thing is that although a grade is calculated using a series of point scores or lab measurements these are not added up. It measures swollen joints, tender joints, ESR, fatigue, global pain score etc. but these are not given weighting and added up. You reach the grade if a certain number of measures have made the necessary percentage improvement. A reduction of 10 swollen joints to 4 would count towards ACR20, ACR50 but not ACR70 so you can score ACR50 but not ACR70, however much other things have improved.
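
To make the threshold arithmetic concrete, here is a minimal Python sketch. It is not the official ACR definition: the function names, the choice of measures and the `required` count are assumptions for illustration only. Each measure is checked against the percentage threshold on its own and nothing is added up.

```python
def percent_improvement(baseline, follow_up):
    """Percentage fall from baseline (lower is better for every measure here)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive to compute a percentage change")
    return 100.0 * (baseline - follow_up) / baseline


def reaches_grade(baseline, follow_up, threshold, required):
    """True if at least `required` measures improved by `threshold` percent or more."""
    met = sum(
        1
        for name in baseline
        if percent_improvement(baseline[name], follow_up[name]) >= threshold
    )
    return met >= required


# Hypothetical patient; all numbers are invented for illustration.
baseline = {"swollen_joints": 10, "tender_joints": 12, "esr": 70, "pain": 80}
follow_up = {"swollen_joints": 4, "tender_joints": 5, "esr": 34, "pain": 16}

# 10 swollen joints down to 4 is a 60% improvement: enough for 20 and 50, not 70.
swollen = percent_improvement(10, 4)

# Require all four measures to meet the threshold (an assumed rule).
grades = {t: reaches_grade(baseline, follow_up, t, required=4) for t in (20, 50, 70)}
```

With these invented numbers the patient reaches the 20% and 50% grades but not the 70% grade, even though pain alone improved by 80%.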

Although this scoring system may recently have been overtaken by an additive one I think that there may be important logical reasons why the 'threshold' maths of ACR grades is more what is needed. What I am not sure of is why this should be the right maths. I have a suspicion of something like this. The reason for these composite scores is not really to measure more things or even to measure at finer grain with a wider range of numbers. It is to try to overcome the unreliability of a single measure as a guide to what a reasonable person would think is a modest, moderate or major improvement.

The point of being composite is then to increase the confidence, or probability, that you will end up with a number that reflects common sense. And in general, our confidence in decisions does not come from adding up signs but from considering whether a range of signs are all in line with the conclusion - they meet a threshold maybe. And probability is, if anything, based on multiplication and division, as in Bayes's theorem, rather than addition.

So I can say 'well she says her pain is down by 80% and only three joints seem swollen now instead of seven (as far as I can tell, because some are iffy) and her ESR has gone down from 70 to 34, so I guess it is fair to say she is at least 50% better'. And I am in a sense saying that the probability that she is just saying she is better because it is sunny outside is made pretty small by the way the various sorts of evidence against it multiply the unlikelihood (of her really being just the same). And they are good at multiplying unlikelihood because they are different sorts of measures that give us rather independent sorts of evidence.
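
As a rough illustration of 'multiplying unlikelihood', the sketch below assumes three measures that are fully independent of each other. The probabilities are invented, not taken from any data, and real measures will rarely be completely independent.

```python
import math

# Hypothetical chances that each measure would look this much better
# even though nothing has really changed. Invented numbers, not data.
chance_if_no_real_change = {
    "pain_report": 0.30,     # subjective, easily swayed by mood or expectation
    "swollen_joints": 0.20,  # examiner judgement, some joints are 'iffy'
    "esr": 0.10,             # lab value, not affected by a sunny day
}

# If the measures are independent, the chance that all three mislead
# at once is the product of the individual chances, not the sum.
joint_chance = math.prod(chance_if_no_real_change.values())
```

Here `joint_chance` comes out at about 0.006, far smaller than any single chance. If the measures were correlated (for example, two questionnaires asking much the same thing) the product would overstate the confidence gained.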

That's already a very long post but hopefully people will have some thoughts to get my ideas clearer - particularly on what we are really trying to do when we combine measurements.

Kati · Mar 25, 2015

There are a few functionality scales out there (made for ME) including Dr Lerner's EIPS

SF-36
Dr Lenny Jason could be a good help in that regard

The patient's voice (yes I improved, or no I didn't improve) (and no I am not trying to please the researcher, and no I wasn't paid to say I improved) could help but then is subjective.

Objective measures would be ideal. But what to measure?
Exercise tests induce relapse, so a measure before intervention will cause a relapse and some of us relapse for months. Is it fair to the patient, and fair to the validity of the trial, if the patient starts a drug trial in relapse mode?
Actimeter (aka Fitbit) could prove useful if used properly, not just as a one-day measure
NK cell function could be of help. (Dr Peterson linked it to severity)
leptin??? (Jarred Younger's research)

Just food for thought

Denise · Mar 25, 2015

I suspect that a mix of objective and subjective measures might be helpful. A person may feel better before the objective findings indicate that or vice versa - the objective findings may indicate improvement before the person feels more functional.

Gijs · Mar 25, 2015

I think it is important to test large blood vessel endothelial function using flow-mediated dilation, as Fluge and Mella are doing in a substudy. They will also test microvascular endothelial function using skin laser-doppler measurements.
Repeated CPET (24 or 48 hours after the first test) can be helpful for measuring VO2 max, AT and some gene expression (Light et al.). SF-36 and neurocognitive function tests (concentration) are helpful.

Kyla · Mar 25, 2015

Hi Dr. Edwards,

please excuse any mistakes, I am not a doctor, just a person with ME...and opinions

In terms of Objective measures:
1) Actigraphy - Has the advantage of measuring activity over an extended period of time, which means it will show whether patients actually have improved energy and patterns of activity or are simply displacing energy from their activities of daily living in order to complete a single test

2) 2 day CPET - while this is obviously very hard on patients, I think it is still an important measure for a few reasons:
- it shows a reproducible, objective abnormality that is currently believed to be unique to ME/CFS
- As far as I know no one has yet tried to show whether or not this is a reversible change - ie - if patients that improve on something like Rituximab for example have reversed the anaerobic dysfunction or if something else is happening to make them improve. - either way I think this would be helpful to understanding what is happening in this disease generally and what the CPET test "means" about what is happening to us.
- some patients have been willingly doing 2-day CPET tests for the purposes of insurance coverage, or to find out their aerobic threshold for the purposes of heart-rate based pacing, so I think many would be willing to undergo this for quality research

3) Orthostatic Intolerance measures (Tilt table or Standing test) - No one seems to be using this, possibly because estimates of prevalence seem to vary based on what the cut-off point is, but I have seen studies saying as high as 95%. Even if this is just a subset, it would still be valuable to see if this is improved. It is certainly used in drug trials for POTS quite frequently. if this is routinely measured at the beginning and end of studies it could help to identify whether this is a meaningful subset of patients.
I would especially like to see this become a standard measure because it would put to bed the circular reasoning of excluding any patients with "testable abnormalities". ie - if this becomes standard data to include then studies that exclude patients based on positive tilt-table tests are clearly not studying the same population of patients

4) employment and benefits usage measures - this is admittedly a tough one, as many are not on benefits, and age and various other life factors would play into whether one is returning to work. Still, in terms of meaningful recovery it is certainly more useful than slightly improved results on a 6 minute walking test.

OverTheHills · Mar 25, 2015

I like the sound of those multifactorial measures. Just thinking about how I would implement this for me (it would be much more difficult for a severe patient)

I would get right away from trying to measure fatigue which is a nebulous concept and using questionnaires which mean different things to different people and rely on our hopeless memories. Get into trying to measure energy used and/or stamina objectively wherever possible. Because what we want and what has gone is energy, and ability to use energy over time = stamina.

Patients can feel when they have more energy and naturally start to do more. This would be particularly true if we were in a trial where we have good reason to think we might have improved. If we are fooling ourselves we will get PEM.

To measure mental energy I think neurocognitive tests where you try to do as many problems as possible in a certain time (of a complex nature - involving a lot of working memory) will soon show whether mental stamina and energy have improved. These can be done on a PC at home to measure improvement and repeatedly to eliminate once off 'bad day'/'good day' issues.

For physical energy, actometer-type data needs to be analysed over the long term to show both duration and intensity of activity. They will increase if the patient is improving. When I have my regular seasonal improvement PEM is largely absent, I am active for longer each day and the intensity of my activities increases.

Periods of PEM will be easy to spot in the actometer data when combined with a patient diary. I would be prepared to wear an actometer indefinitely (like a diabetic monitoring their blood sugar). And happy to do PC cognitive tests on the same basis.

Esther12 · Mar 25, 2015

I'm not sure I have much to add, and would say similar things to the above.

My instinct is to try to measure a range of different things, be honest and clear about the limitations that all outcome measures have, and then hopefully things will slowly progress. I don't think that there is a great solution to the problem of outcome measures for CFS trials and that this is something which will need to be thought carefully about for every trial. Hopefully some progress with identifying sub-types, causes, etc will allow some progress here too.

I am now in the slightly weird position of defending PACE, in that I think that if it had been carried out as originally planned, with actometers as an outcome measure and relatively demanding interpretations of subjective questionnaire scores, it would have been able to provide some useful information (and actually still did, once you dig through the spin).

although re actometers: I still think that there is a danger of substituting activity (eg: less reading/thinking, more walking) and they are still not ideal measures of 'activity'.

re grading: I'd still want to see results for these different outcomes presented individually too. I wonder if the fact that symptoms are so varied for CFS could make using that sort of system more difficult? eg: Some people will not have problems in some areas.

@Jonathan Edwards mentioned allowing patients to set their own criteria for a positive outcome. That sounds like an interesting additional outcome measure, although I did have a concern that it could mean that an intervention could be sold to all patients as helping them achieve their own goals when it was really only able to help people with a particular type of goal, that may be of no interest to others.

A.B. · Mar 25, 2015

OverTheHills is spot on. Being able to increase activity levels and sustain them for six months or so would be a good indicator of a successful treatment (unless self reported symptoms got worse in the meantime). I believe we can safely assume that a patient will increase their activity levels if they start feeling better.

PS: I do think that self reported improvements in symptoms should always be considered as well. Relief from pain etc. is important.

Valentijn · Mar 25, 2015

If it's a blinded treatment, I'm fairly happy with the SF-36 Physical Functioning subscale. Though I'd be happier in any event with actometers being used, most especially if the treatment isn't double-blinded.

I don't know that it's worth risking a relapse via TTT or 2-day CPET to use those just to test improvement - though I think they're definitely worth using to look for correlating blood or similar abnormalities.

Sasha · Mar 25, 2015

I agree that actimetry is important. Over the short-term I think it could be 'gamed' by the patient according to their expectation but I think it would be hard to keep that up in the long-term. I don't think it's possible for us to raise our game and sustain that if there has been no underlying improvement: indeed, I think that's a defining factor of our condition.

I have concerns about the CPET testing as being 'destructive' testing. Unless there is evidence that it's safe, I don't think it's appropriate for it to be used as an outcome measure.

I agree that tilt-table testing would be good.

Poor sleep is common in ME and these days there are unobtrusive gadgets that measure how much you move around (I think) - I think some i-gadget does this in a standard programme that lots of normal people use. That would be a good objective measure because (again, especially in the long term) it's relatively hard to directly control your own sleep.

The proposed IOM diagnostic criteria are:

reduction in social, occupational etc. activities;
fatigue;
PEM;
unrefreshing sleep; and
cognitive impairment or
OI

I'm wondering if that's a helpful list to look at in trying to come up with outcome measures.

Jonathan Edwards · Mar 25, 2015

Thanks to all for the comments. Tilt table sounds interesting in that it does not involve exertion and measures an involuntary reflex that would not be biased by trying hard or not. SF36 may be useful as a source of data but I am suspicious of the pre-defined analyses that you can get out of it using the approved software! Actometry does seem sensible if done over an extended period.

The key issue for me is that although all these things may have merit individually there is a need for a single deciding measure for the question 'does the treatment really work, and enough to be worth the trouble?' In a sense we are only interested in lots of measures because we are not sure any one of them gives the answer. And you need to decide what your deciding measure is in advance, otherwise you fall foul of the statistical cost of asking the same question lots of ways (Bonferroni). What seems clear is that since no single measure is reliable we need a composite measure that we think is most reliable.
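
The statistical cost of asking the same question many ways can be sketched in a few lines of Python. The function names are made up for illustration, and the second function assumes the tests are independent, which outcome measures in one trial usually are not.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level under the Bonferroni correction."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


def chance_of_false_positive(alpha, n_tests):
    """Chance that at least one of n independent tests passes alpha by luck alone."""
    return 1.0 - (1.0 - alpha) ** n_tests


# Ten outcome measures, each tested at the usual 0.05 level.
per_test_level = bonferroni_threshold(0.05, 10)
uncorrected_risk = chance_of_false_positive(0.05, 10)
```

With ten uncorrected tests the chance of at least one spurious 'positive' is roughly 40%, and the corrected per-test level drops to 0.005. A single deciding measure chosen in advance avoids paying that price.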

And so the nitty gritty is how you design the composite? The SF36 is in itself a composite - but all subjective and it is additive, which I have grave doubts about. As a first shot I am thinking maybe something like this:

A measure of change based on thresholds a bit like the ACR score, in which you need to have all three of these change by a certain degree (could be percentage or could be descriptive - minor, moderate, major):

1. A subjective questionnaire based account of feeling improved in the most important way relevant to the illness and the person. I doubt you need much more than a single visual analogue scale in fact. The more questions you ask the more bogus maths creeps in.
2. An objective test of some physiological change that adds confidence to 1 by indicating that the body is responding differently - maybe the tilt table test would fit here. (I don't think we know enough about how to link symptoms with things like NK cells to rely on that).
3. A test of restored function relevant to daily life that adds confidence to 1 by indicating that the symptomatic improvement is reflected in a change in lifestyle relevant to the person. Actometry would be easy but might be too narrow.
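
One possible reading of 'all three must change by a certain degree' is that the overall grade is the highest level that every component reaches, in other words the weakest of the three. The sketch below assumes that reading; the level names and the function are hypothetical, not a worked-out scoring system.

```python
LEVELS = ["none", "minor", "moderate", "major"]  # ordered, assumed descriptive grades


def composite_grade(subjective, physiological, functional):
    """Highest level reached by all three components (the weakest of the three)."""
    ranks = [LEVELS.index(level) for level in (subjective, physiological, functional)]
    return LEVELS[min(ranks)]


# A patient who reports major improvement, with moderate physiological change
# but only minor change in daily function, is graded 'minor' overall.
example = composite_grade("major", "moderate", "minor")
```

Because nothing is added up, a large change on one component cannot compensate for no change on another: `composite_grade("major", "none", "major")` gives `"none"`.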

I still like the idea of some degree of personalisation of what is being assessed but I agree with Esther12 that it must not be too open ended. The ACR grading deals with this by allowing one to choose from a narrow range of options as to what you use to build your case for a percentage improvement.

One of the things about the ACR grading is that because you need several measures to meet a threshold (I think it is five) it is pretty demanding. As a result even 50% of patients getting ACR20 is considered not bad (remember that ACR20 means anything up to ACR49, just not quite ACR50). (Note that there are no mean changes here - the result is how many patients reached the target grade. That avoids the false impression that everyone gets a bit of benefit.)
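
The difference between reporting a mean change and reporting how many patients reached a target can be shown with two invented groups of ten patients. The numbers and the `responder_rate` helper are assumptions for illustration, with each patient reduced to a single percentage improvement for simplicity.

```python
from statistics import mean


def responder_rate(improvements, threshold):
    """Fraction of patients whose percentage improvement meets the threshold."""
    return sum(1 for x in improvements if x >= threshold) / len(improvements)


# Hypothetical percentage improvements for two groups of ten patients.
few_big = [0, 0, 0, 0, 0, 0, 0, 0, 70, 80]  # two patients do very well
all_small = [15] * 10                        # everyone improves a little

same_mean = mean(few_big) == mean(all_small)  # both average 15%
rate_few_big = responder_rate(few_big, 20)
rate_all_small = responder_rate(all_small, 20)
```

Both groups have the same mean improvement of 15%, but at a 20% threshold the first has two responders in ten and the second has none. The mean alone cannot tell these situations apart.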

Sasha · Mar 25, 2015

Is an issue with a composite score that different subgroups may be impaired on different variables? For example, some people might have no OI, so including an OI score in a composite would obscure any signal?

Or is avoiding the problem of multiple analysis worth that?

Kati · Mar 25, 2015

@Jonathan Edwards could you provide links to ACR grading system please?

I am not entirely sure whether enough studies have been done on the day-to-day variation of TTT results in patients with autonomic dysfunction. And in that regard, would patients taking beta blockers for POTS be excluded from the study?

OverTheHills · Mar 25, 2015

I am fine with the 3 measures outlined above in principle but in practice I'm concerned about what would get measured in (2) the symptomatic improvement.

I think it is possible to confuse things there, because we don't yet know enough about which symptoms are downstream and which are core to the disease. For example my unrefreshing sleep is controlled by Naltrexone pretty well but that doesn't affect other aspects of the disease, and I developed POTS in my teens and ME 3 decades later.

So despite what the IOM definitions say I think picking which symptoms are core is still premature. PEM is the one which seems to be the most unique and patients' gut feel is that this is the very heart of things but we do not have a safe test.

Jonathan Edwards · Mar 25, 2015

Good points - there clearly needs to be a lot of thought about how a composite score would work for subgroups. One reasonable thing is to require that you pick something for 2 and for 3 that is definitely abnormal at baseline for that individual. As for ACR grading you can have some leeway. Since 2 and 3 are more to do with just being able to show that the results of 1 are corroborated by some improvement in physiology and some evidence of functioning better I think a good deal of flexibility in choice of measure for each person is allowed.

The other thing is that it is not necessarily an advantage for these measures to be 'core' or 'unique' to ME. In RA the most specific feature is the presence of rheumatoid factor or CCP antibodies but these are never included in outcome scores because they do not change in a way that tells you if a treatment has worked or not. After rituximab rheumatoid factor goes down but not CCP much. Moreover, we would not use either as a measure of efficacy in fact, because neither is even an indication that inflammation has reduced - the CRP is more relevant there.

barbc56 · Mar 25, 2015

How do you factor in the normal relapsing/remitting aspect of our illness and its impact when measuring any study outcomes of improvement? I would think time would make this part of our illness less of an influence when attempting to get a true measure of improvement. Are there any statistical methods that also do this?

Not sure if that is clear as I'm just kind of brainstorming.

This is quite an interesting thread. Thanks for starting it.

Barb

I may have answered my own question here. Time and replication would cancel this effect as you would get regression towards the mean?
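
A toy illustration of why time helps: the sketch below invents a patient with no real change who cycles between bad and good weeks, then compares a single-day snapshot with a 28-day average. The scores are made up, and this shows only the averaging idea, not a formal statistical method.

```python
from statistics import mean

# Hypothetical daily scores (0-10, higher is better) for a patient with
# no real change, cycling between bad and good weeks.
bad_week, good_week = [3] * 7, [7] * 7
before = bad_week + good_week + bad_week + good_week  # 28 days before treatment
after = good_week + bad_week + good_week + bad_week   # 28 days after treatment

snapshot_change = after[0] - before[0]      # one day compared with one day
window_change = mean(after) - mean(before)  # 28-day averages compared
```

The single-day comparison suggests an improvement of 4 points purely because of where each day fell in the cycle, while the 28-day averages show no change.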

Kyla · Mar 25, 2015

I think a fair amount of the dilemma centers around perception and politics, as opposed to what is a "true" or accurate measurement of recovery.
Outcomes for studies on Alzheimer's for example would be totally based on subjective measures. The problem is that ME/CFS is held to a higher standard based on scepticism about the actual disease and muddiness (some manufactured) around definitions and diagnosis.
So I think a salient question is what outcome measure is the most bullet-proof to criticism?

Ask the biggest ME sceptic you can find what measure they would accept as a positive result in a drug trial?

RWP (Rest without Peace) · Mar 25, 2015

@Jonathan Edwards,

I'd like to post a question about Rituximab for you to answer (if you would like to) on a relevant thread. Can you direct me to one?

Thank you.

voner · Mar 25, 2015

@Jonathan Edwards, is one of the benefits of using an ACR-type grade of improvement system that it can more accurately measure a patient group with widely varying symptoms, as in ME/CFS? For example, some patients experience a lot of pain and others experience none. Some patients have orthostatic intolerance while others don't have any, etc.

What were the issues in RA that led to the development of this system?

I would stay far away from the "fatigue" word and measure Functionality/Ability similar to Dr. David Bell's "ME/CFS Ability Scale".

Persimmon · Mar 25, 2015

Including objective measures would be a massive step forward. It would provide credibility and bypass the politics.

To address patient heterogeneity, I'd like to see studies with narrow cohorts
Eg only include patients who meet CCC and who fail a tilt table test.
(You could try to generalise any positive findings to other ME patients at a subsequent stage. Doing so would either be successful or else provide valuable sub-typing info.)

Which tests to include?

TTT - definitely

The applicability of the two-day CPET depends on which findings prove to be robust. Keller found VO2 max to decrease unusually in ME patients. The Workwell group found the same thing in their early research, but found in their more recent (and larger) study that the ventilatory threshold (VT) is the thing that decreases in ME patients.

If VT decrease can be demonstrated to be a robust finding, then it is a viable objective trial endpoint. (Exercising to one's VT is unlikely to cause distressing relapses - some patients will hit their VT by just sitting on an exercise bike. In contrast, exercising an ME patient to their VO2 max would be expected to cause adverse health effects.)

If the Lights' metabolite findings were to be replicated, they would offer great potential as an objective endpoint; but again, there would be a question of how much exertion is needed to trigger distinctive metabolic performance.

Cognitive testing - @Jonathan Edwards, might I suggest you ask Gudrun Lange if she could recommend a suitable form of testing to use in a trial context. A full neurocognitive assessment takes two days of testing and much post-testing analysis, but that's necessary for the therapist to be able to justify a disability finding (ie for insurance or welfare payments). For a clinical trial, she might be able to suggest one or two specific tests that would serve your purposes, that focus on abnormalities she consistently sees in ME patients.
Eg she might be able to suggest one or two tests that measure the ability to multi-task in a cognitive context.

CD4, CD8 and the CD4:CD8 ratio offer potential, although we obviously don't yet know enough to rely on them as endpoint measures. Studies of these measures have yielded inconsistent findings in ME.
Eg it appears that CD8 is consistently elevated in some ME patients and consistently suppressed in others. It would be fascinating to see if one or both of these groups reverted to normal CD8 levels after a positive response to rituximab. Again this might point to sub-typing.
