Jonathan Edwards
I am interested in what people think about the design of outcome measures for trials and perhaps epidemiological research. Should they be subjective or objective, and what exactly do those terms imply? Should people try to measure things like fatigue quantitatively or is this a bogus idea? Should specific neurocognitive or activity (e.g. CPET) tests be used or are they too indirect? ...
There is a mass of questions and some of them have been usefully covered before. I would like to focus on one or two specifically but I am happy for the thread to go wherever it goes. I have been trying to figure out what is wrong with the outcome measures currently being used, particularly in the context of the difficulty of blinding treatments. CBT and GET cannot be blinded. Drugs like valcyte or rituximab can in theory be blinded but it is far from certain that they really are, since patients get to know what the real drug feels like. That makes subjective outcome measures very suspect, but it can be argued that more objective surrogate measures are not measuring what is really important. And so on.
During discussions with PWME and also psychiatrists I was reminded of the structure of an outcome measure that has been used in rheumatology for decades, although it may recently have been overtaken by something else (maybe not for the better). The measure is the American College of Rheumatology Grade of Improvement and it comes in 20%, 50% and 70% versions.
The first thing to note about this measure is that it is only a measure of change. It combines various measurements that you could compare between patients, like how many joints are swollen, but it does not make that comparison. This seems to me important, especially for ME, since I personally think it is bogus even to try to compare how severe one person's illness is with another's. So a version for ME might try to measure change in fatigue but would not 'measure fatigue'.
The second thing is that although a grade is calculated using a series of point scores or lab measurements, these are not given weightings and added up. It measures swollen joints, tender joints, ESR, fatigue, global pain score etc. You reach the grade if a certain number of measures have made the necessary percentage improvement. A reduction from 10 swollen joints to 4 (a 60% improvement) would count towards ACR20 and ACR50 but not ACR70, so you can score ACR50 but not ACR70, however much other things have improved.
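To make the 'threshold' arithmetic concrete, here is a rough sketch in Python. It is not the official ACR definition: the function names, the choice of which measures are `required`, the `min_other` count and all the follow-up numbers except the 10-to-4 swollen joint example are assumptions made up for illustration. It also assumes that a lower value is better for every measure.

```python
def percent_improvement(before, after):
    """Percentage reduction from baseline (assumes lower values are better)."""
    if before <= 0:
        return 0.0
    return 100.0 * (before - after) / before


def threshold_grade(baseline, follow_up, required, min_other,
                    thresholds=(20, 50, 70)):
    """Highest threshold met by every required measure and by at least
    min_other of the remaining measures; 0 if none is met.
    Nothing is weighted or summed: each measure only passes or fails."""
    imps = {k: percent_improvement(baseline[k], follow_up[k]) for k in baseline}
    grade = 0
    for t in sorted(thresholds):
        required_ok = all(imps[k] >= t for k in required)
        others_met = sum(imps[k] >= t for k in imps if k not in required)
        if required_ok and others_met >= min_other:
            grade = t
    return grade


# Hypothetical patient: swollen joints go from 10 to 4 (60% better),
# everything else improves by more than 70%.
baseline = {"swollen_joints": 10, "tender_joints": 12, "esr": 70, "pain": 8}
follow_up = {"swollen_joints": 4, "tender_joints": 2, "esr": 20, "pain": 1}

grade = threshold_grade(baseline, follow_up,
                        required=("swollen_joints",), min_other=2)
print(grade)
```

With swollen joints treated as a required measure, the grade stops at 50 because 60% falls short of 70%, whatever the other measures do. An additive score would let the large improvements elsewhere make up the difference.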
Although this scoring system may recently have been overtaken by an additive one, I think that there may be important logical reasons why the 'threshold' maths of ACR grades is closer to what is needed. What I am not sure of is why this should be the right maths. I have a suspicion of something like this. The reason for these composite scores is not really to measure more things or even to measure at finer grain with a wider range of numbers. It is to try to overcome the unreliability of a single measure as a guide to what a reasonable person would think is a modest, moderate or major improvement.
The point of being composite is then to increase the confidence, or probability, that you will end up with a number that reflects common sense. And in general, our confidence in decisions does not come from adding up signs but from considering whether a range of signs are all in line with the conclusion, perhaps by each meeting a threshold. And probability is, if anything, based on multiplication and division, as in Bayes's theorem, rather than addition.
So I can say 'well she says her pain is down by 80% and only three joints seem swollen now instead of seven (as far as I can tell, because some are iffy) and her ESR has gone down from 70 to 34, so I guess it is fair to say she is at least 50% better'. And I am in a sense saying that the probability that she is just saying she is better because it is sunny outside is made pretty small by the way the various sorts of evidence against it multiply the unlikelihood (of her really being just the same). And they are good at multiplying unlikelihood because they are different sorts of measures that give us rather independent sorts of evidence.
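A toy calculation may show what 'multiplying the unlikelihood' means. The probabilities below are invented purely for illustration, and the sketch assumes the three kinds of evidence are fully independent, which real measures will only approximate.

```python
import math

# Hypothetical chance that each measure would look this much better
# even though the patient is really just the same (made-up numbers).
p_by_chance = {
    "reported_pain_down_80_percent": 0.3,   # subjective, easily swayed
    "swollen_joints_7_to_3": 0.2,           # examiner judgement, some joints iffy
    "esr_70_to_34": 0.1,                    # lab value
}

# If the measures are independent, the chance that all three mislead us
# at once is the product of the individual chances.
p_all_misleading = math.prod(p_by_chance.values())

# The single most convincing measure, taken alone, for comparison.
p_best_single = min(p_by_chance.values())

print(p_all_misleading, p_best_single)
```

On these made-up figures the combined chance comes out at roughly 0.006, well below the 0.1 of the best single measure. If the measures were strongly correlated (for instance, two questionnaires both filled in on the same sunny day), the product would overstate the gain, which is why having different sorts of measure matters.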
That's already a very long post but hopefully people will have some thoughts that help me get my ideas clearer, particularly on what we are really trying to do when we combine measurements.