(v. academic) Minimal Important Change (MIC) scores research - international consensus

Dolphin · Jul 2, 2011

I don't claim this is of much general interest, but it came up in the PACE Trial, where they gave percentages "improved" based on this. I'm writing something at the moment, so I decided to research it more.

The sorts of scores proposed in the abstract below are quite a bit bigger than the 2 on the Chalder Fatigue Scale (0-33) and 8 on the SF-36 Physical Functioning scale (0-100) used in the PACE Trial.

For example, if one were to use 30% of the baseline values (mean SF-36 PF: 38.025, mean CFQ: 28.175), the values would have been:

SF-36 PF: 11.4075 (vs. 8)
CFQ: 8.4525 (vs. 2)
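
For anyone who wants to check the arithmetic, here is a minimal Python sketch of the 30%-of-baseline calculation. The baseline means and the PACE thresholds are the ones quoted above; the function name `mic_from_baseline` is just made up for illustration.

```python
# Baseline means quoted above (PACE Trial).
BASELINE_MEANS = {"SF-36 PF": 38.025, "CFQ": 28.175}

# Thresholds for "improvement" used in the PACE Trial, for comparison.
PACE_THRESHOLDS = {"SF-36 PF": 8, "CFQ": 2}


def mic_from_baseline(baseline, fraction=0.30):
    """Return the change that equals `fraction` of the baseline score."""
    return fraction * baseline


for scale, baseline in BASELINE_MEANS.items():
    mic = mic_from_baseline(baseline)
    print(f"{scale}: {mic:.4f} (vs. {PACE_THRESHOLDS[scale]})")
```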

Interpreting change scores for pain and functional status in low back pain: towards international consensus regarding minimal important change.

Spine (Phila Pa 1976). 2008 Jan 1;33(1):90-4.

Ostelo RW, Deyo RA, Stratford P, Waddell G, Croft P, Von Korff M, Bouter LM, de Vet HC.

Source

EMGO Institute, VU University Medical Centre, Amsterdam, The Netherlands. r.ostelo@vumc.nl

Abstract

STUDY DESIGN:

Literature review, expert panel, and a workshop during the "VIII International Forum on Primary Care Research on Low Back Pain" (Amsterdam, June 2006).

OBJECTIVE:

To develop practical guidance regarding the minimal important change (MIC) on frequently used measures of pain and functional status for low back pain.

SUMMARY OF BACKGROUND DATA:

Empirical studies have tried to determine meaningful changes for back pain, using different methodologies. This has led to confusion about what change is clinically important for commonly used back pain outcome measures.

METHODS:

This study covered the Visual Analogue Scale (0-100) and the Numerical Rating Scale (0-10) for pain and for function, the Roland Disability Questionnaire (0-24), the Oswestry Disability Index (0-100), and the Quebec Back Pain Disability Questionnaire (0-100). The literature was reviewed for empirical evidence. Additionally, experts and participants of the VIII International Forum on Primary Care Research on Low Back Pain were consulted to develop international consensus on clinical interpretation.

RESULTS:

There was wide variation in study design and the methods used to estimate MICs, and in values found for MIC, where MIC is the improvement in clinical status of an individual patient. However, after discussion among experts and workshop participants a reasonable consensus was achieved. Proposed MIC values are: 15 for the Visual Analogue Scale, 2 for the Numerical Rating Scale, 5 for the Roland Disability Questionnaire, 10 for the Oswestry Disability Index, and 20 for the QBDQ. When the baseline score is taken into account, a 30% improvement was considered a useful threshold for identifying clinically meaningful improvement on each of these measures.

CONCLUSION:

For a range of commonly used back pain outcome measures, a 30% change from baseline may be considered clinically meaningful improvement when comparing before and after measures for individual patients. It is hoped that these proposals facilitate the use of these measures in clinical practice and the comparability of future studies. The proposed MIC values are not the final answer but offer a common starting point for future research.

PMID: 18165753 [PubMed - indexed for MEDLINE]

Free full text at: http://journals.lww.com/spinejourna...Change_Scores_for_Pain_and_Functional.15.aspx

Esther12 · Jul 2, 2011

Thanks Dolphin.

With this sort of thing, it seems like the researchers have a lot of leeway to spin their results however they want. It's a bit ridiculous.

oceanblue · Jul 3, 2011

It's geeky but I like stuff like this...

I think it's worth noting that there is plenty of genuine debate about how to measure 'meaningful' improvement. Unfortunately this means that any researchers so-minded can exploit the ambiguity to select whichever method best suits their chosen findings.

However, I think there may be an issue over using percentage of baseline: it seems to matter if 'worse' scores high (eg Chalder) or low (eg SF36).

Using the PACE examples, 30% of baseline is:
11.4 for SF36 = 11% of the maximum scale score (0-100)
8.5 for CFQ = 38% of the maximum scale score (effectively 11-33, i.e. 22 max; it's 26% of the 0-33 scale)

So - if I've got this right - using percentage of baseline is harder on low-is-better scales like CFQ than high-is-better scales like SF36. Not sure if I've explained this very well - it's something like 30% of a high score will always give a more demanding threshold than 30% of a low score.
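
A rough Python sketch of the comparison above, in case it helps. It takes the two 30%-of-baseline figures and expresses each as a share of the scale range. The "effective" 11-33 range for the CFQ is the assumption made above, and `share_of_range` is a made-up helper name.

```python
def share_of_range(threshold, low, high):
    """Express a change threshold as a fraction of the scale's range."""
    return threshold / (high - low)


sf36_mic = 0.30 * 38.025  # 30% of mean baseline SF-36 PF
cfq_mic = 0.30 * 28.175   # 30% of mean baseline CFQ

print(f"SF-36 PF, 0-100 scale: {share_of_range(sf36_mic, 0, 100):.0%}")
print(f"CFQ, 0-33 scale: {share_of_range(cfq_mic, 0, 33):.0%}")
# Assumption: the CFQ effectively runs from 11 to 33 (22 points).
print(f"CFQ, 11-33 range: {share_of_range(cfq_mic, 11, 33):.0%}")
```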

I'm talking principles here, I still feel 2 as the threshold for 'meaningful improvement' is way, way too low on CFQ. I've had a thought on why this might be and will post more if I can knock it into something more coherent.

Looking at SF36 research on other illnesses, I'm pretty sure there was a consensus that a score of 10 counted as clinically significant improvement. That actually tallies quite well with the 30% of baseline rule and also with PACE. A nominal improvement of 8 is actually 10 for any individual since the scale is scored in 5-point increments.
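
The point about 5-point increments can be sketched like this: if an individual's score can only move in steps of 5, the smallest change that clears a threshold of 8 is 10. The step size of 5 is taken from the paragraph above; `smallest_qualifying_change` is a hypothetical name.

```python
import math


def smallest_qualifying_change(threshold, step):
    """Smallest multiple of `step` that is at least `threshold`."""
    return math.ceil(threshold / step) * step


# SF-36 PF: nominal threshold of 8, scored in 5-point steps.
print(smallest_qualifying_change(8, 5))
```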

Finally, I couldn't help noticing that the definition of clinical improvement in this pain study was based on expert opinion. It's a shame they didn't bring patients' views into it too! I think this has been done in some other studies: perhaps the PACE authors might like to try this approach too.

Dolphin · Jul 3, 2011

As there are at least a couple of people reading this, I thought I'd put a little more time into it.

Here are a few bits I underlined

Statistical significance does not necessarily mean the change is clinically important.5 For some clinical outcomes such as blood pressure, empirical research, and clinical experience may produce a general feeling whether a change is important or not. But the importance of changes on many questionnaires is less intuitively apparent.6

Seems like the language hasn't been standardised:

Empirical Evidence. The literature was reviewed for studies estimating the minimal important change (MIC) for the above questionnaires. MEDLINE was searched using a combination of Medical Subject Heading (MeSH) terms back pain and low back pain; the specific names of the questionnaires; and any of the following terms: responsiveness, minimal(ly) clinically important change, minimal(ly) clinically important difference, minimum clinically important difference, minimum detectable change, smallest detectable change and questionnaires. Studies were included that reported on the importance of the change scores.

I thought the point about time interval might be interesting e.g. one might expect a bigger change over a longer period:

As expected, there was wide variation in study design and the methods used to estimate MICs. For example, the included studies used different time intervals for the test-retest (ranging from a 1-day interval to a 1-year interval), different external criteria to define "important" were used and many different statistical techniques were used to calculate MIC.

Lots of debate still:

On the third question, some experts and workshop participants felt that 1 simple (absolute) value for MIC for each questionnaire is easier to produce from the available evidence. Furthermore, such a uniform value is more likely to be used in clinical practice. Others felt that was an oversimplification as there is evidence that MIC is baseline dependent, so initial values should be taken into account e.g., as percentage improvement from baseline. It was therefore decided to work toward consensus on both issues.

The discussions in the expert group and workshop raised also several other issues. Debate remains about the meaning and definition of a clinically important change. For example, some participants regarded slightly improved as clinically important whereas others considered this within the range of natural fluctuation. The latter reasoned that an important improvement should be greater than these (unimportant) natural fluctuations. Furthermore, patients may easily say that they are slightly improved just to please their physician or therapist. Better methodology will not resolve this; rather, these are clinical judgments that then determine the methodology used.

and

Second, the empirical evidence is limited and heterogeneous and there are no agreed scientific grounds or empirical evidence to determine the optimum method of estimating the MIC.

Interesting point about different treatments possibly requiring different MICs:

Nevertheless, different MICs may be more appropriate for different patients or contexts, e.g., children or surgical patients. Again, a smaller MIC may be appropriate to a simple, cheap, and safe intervention, whereas a larger MIC may be more appropriate to an expensive, risky procedure. Indeed, an ODI MIC of 15 points has been suggested for surgical interventions,3 compared with the 10 points proposed here (Table 3). Types of patients and treatments were not specifically taken into account in these proposals. Thus, the proposed values should be taken as generic lower limits for the MICs which can (and should) be modified when necessary.

Dolphin · Jul 3, 2011

oceanblue said:
Finally, I couldn't help noticing that the definition of clinical improvement in this pain study was based on expert opinion. It's a shame they didn't bring patients' views into it too! I think this has been done in some other studies: perhaps the PACE authors might like to try this approach too.

I agree.

However, they did a literature review first and some of the values in some of the studies there could have been from patients' views.

Esther12 · Jul 3, 2011

Thanks for the summary Dolphin.

Seems like the language hasn't been standardised

From taking the time to close-read more papers, I've come to realise how much of what is said in medical papers is so ill-defined as to be impossible for front-line staff to really understand. Very few doctors/CBT therapists/etc will take the time to really understand exactly what the figures/questionnaire scores/etc mean, so they are reduced to relying upon the vague impression created by the write-up's use of language.

CFS is particularly bad for this sort of thing - but I get the impression it's a fairly widespread problem, affecting lots of medical conditions.

Sean · Jul 3, 2011

My impression from reading peer-reviewed papers is that there is often a serious disparity between the abstract (which is all most people ever read) and the full content. I have seen others make that comment in other fields of medicine.
