PACE Trial and PACE Trial Protocol

Dolphin

Senior Member
Messages
17,567
(Nit-picking?) Briefer description of SAE type d

(section d is what I'm referring to)

In the full (unpublished) protocol we are told:
14.1.1 Serious Adverse Events (SAEs)
An adverse event (AE) is defined as serious (an SAE) if it results in one of the following outcomes:
a) Death,
b) Life-threatening (i.e., with an immediate, not hypothetical, risk of death at the time of the event),
c) Requires hospitalisation (hospitalisation for elective treatment of a pre-existing condition is not included),
d) Increased severe and persistent disability, defined as:
. severe = a significant deterioration in the participant's ability to carry out their important activities of daily living (e.g. employed person no longer able to work, caregiver no longer able to give care, ambulant participant becoming bed bound); and
. persistent = 4 weeks continuous duration
e) Any other important medical condition which, though not included in the above, may jeopardise the participant and may require medical or surgical intervention to prevent one of the outcomes listed.
f) Any episode of deliberate self-harm

If there is any doubt in the minds of the research nurse and the centre leader as to whether the AE is a serious AE, the centre leader will obtain a second opinion from one of the PIs.

Remember that one doesn't have to prove that an adverse event is caused by the treatment.

In the appendix, description d is a bit briefer:
d) Increased severe and persistent disability, defined as a significant deterioration in the participant's ability to carry out their important activities of daily living of at least four weeks continuous duration;

This is the data from the study:
d) Increase in severe and persistent significant disability/incapacity (10) Breast cancer, cerebrovascular accident, prolapsed intervertebral discs, assault leading to disability, multiple life events leading to disability, upper respiratory infection leading to disability, acutely unwell, acute allergic reaction, blackout (2).

It would have been interesting if they had told us about people who gave up work (either full-time or part-time). I would be fairly sure that some people did, partly because of the demands of the intervention and the hope that they might make a good recovery if they concentrated on the program.
 

oceanblue

Guest
Messages
1,383
Location
UK
Re quotes from submission to NICE
Draft text:
6.3.6.16 [by Guideline Development Group, GDG] When planning a programme of GET the healthcare professional
should:
. discuss with the patient ultimate goals that are important and relevant to them. This may be, for example, a 2 x 15 minute daily brisk walk to the shop, a return to a previous active hobby such as cycling or gardening, or, if more severely affected, sitting up in bed to eat a meal.
. recognise that it may take weeks, months, or even years to achieve goals, and it is essential that the therapy structure takes this pace of progress into account.

So, Peter White wants a full recovery within a year, the GDG thinks a 2 x 15 minute brisk walk to the shops might do - PACE didn't get close to either.
 

oceanblue

Guest
Huh? CBT didn't make a "clinically useful difference" to physical function scores

Don't know how I missed this, but the CBT group's SF-36 mean score only improved by 7.1 compared with the SMC group, that's less than the "clinically useful difference" of 8 (0.5 SD of baseline scores). Maybe that's why the authors said this:
Mean differences between groups on primary
outcomes almost always exceeded predefined clinically
useful differences for CBT and GET when compared
with APT and SMC

Have I made a mistake here?
 

Dolphin

Senior Member
A contact from Sweden sent me the following which he thought I might like to re-post:
I looked at the Lancet PACE trial report. One thing struck me when looking at figure 2 on page 8 (page 830 on paper). Is it not strange that the "controls" (receiving medical care only) also improve on all scales over time? I would not expect this from a randomized selection of ME patients. I think the mean would be zero improvement, unless patients have only had ME for a short time (less than 1-2 years).

My thinking is that the patients in the trial may, for example, be depressed. That would explain the improvement over time. Also, people with postinfectious fatigue (not resulting in a permanent ME state) may show a similar pattern, I believe. People with burn-out, I think, recover in this way.

I think some medical physicians and researchers should look at the curves and judge what kind of conditions might have a natural history looking like the curves in figure 2.

Many of my ME friends in Sweden are now worse compared to a few years ago. The average natural history for us is rather a worsening one.

To me it is very strange that the curves for the controls in the report show such a distinct improvement over the first 12 weeks. Very strange indeed.
 

oceanblue

Guest
A contact from Sweden sent me the following which he thought I might like to re-post:
I think the mean would be zero improvement, unless patients have only had ME for a short time (less than 1-2 years)....

...To me it is very strange that the curves for the controls in the report show such a distinct improvement over the first 12 weeks. Very strange indeed.

Agreed, that is very strange. I'm baffled, unless the previous medical care received by patients was very poor so that treating mood/sleep/pain problems made a big difference?

Also, it's worth noting that the control group had only been ill for a mean of 25 months, only just over 2 years (so presumably quite a lot of the group had been ill for under 2 years).
 

Dolphin

Senior Member
Don't know how I missed this, but the CBT group's SF-36 mean score only improved by 7.1 compared with the SMC group, that's less than the "clinically useful difference" of 8 (0.5 SD of baseline scores). Maybe that's why the authors said this:
Mean differences between groups on primary
outcomes almost always exceeded predefined clinically
useful differences for CBT and GET when compared
with APT and SMC

Have I made a mistake here?
Looks right to me.
The wording they use doesn't make clear that this was a post-hoc analysis (and we didn't get to see some of the figures they had promised, like "improvement"):
A clinically useful difference between the means of the primary outcomes was defined as 0.5 of the SD of these measures at baseline,31 equating to 2 points for Chalder fatigue questionnaire and 8 points for short form-36. A secondary post-hoc analysis compared the proportions of participants who had improved between baseline and 52 weeks by 2 or more points of the Chalder fatigue questionnaire, 8 or more points of the short form-36, and improved on both. In another post-hoc analysis
(i.e. they've admitted that the first one is a post-hoc analysis), we compared the proportions of participants who had scores of both primary outcomes within the normal range at 52 weeks. This range was defined as less than the mean plus 1 SD scores of adult attendees to UK general practice of 14.2 (+4.6) for fatigue (score of 18 or less) and equal to or above the mean minus 1 SD scores of the UK working age population of 84 (-24) for physical function (score of 60 or more).32,33
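That "normal range" definition is just mean-plus-or-minus-one-SD arithmetic, which can be checked in a couple of lines (a minimal sketch; the only inputs are the population means and SDs quoted in the paper):

```python
# Normal-range cut-offs as defined in the Lancet paper (post-hoc analysis):
# fatigue: less than mean + 1 SD of UK general-practice attendees
# physical function: at least mean - 1 SD of the UK working-age population

fatigue_mean, fatigue_sd = 14.2, 4.6   # Chalder fatigue (Likert scoring)
pf_mean, pf_sd = 84, 24                # SF-36 physical function

fatigue_cutoff = fatigue_mean + fatigue_sd   # 18.8, hence "score of 18 or less"
pf_cutoff = pf_mean - pf_sd                  # 60, hence "score of 60 or more"

print(fatigue_cutoff, pf_cutoff)
```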
 

oceanblue

Guest
Looks right to me.
The wording they use doesn't make clear that this was a post-hoc analysis (and we didn't get to see some of the figures they had promised, like "improvement"):
Well that's amazing, then: CBT didn't make the grade on physical function, even though the grade was set very low.

I'm pretty sure the use of 0.5 SD to compare means was not post-hoc; the post-hoc use you refer to was the percentage of participants who improved by more than the 'clinically useful difference'. I'm about to make a rather long - but terribly exciting - post about primary outcomes that covers this.
 

oceanblue

Guest
Understanding PACE primary outcomes - 1

Part 1 - the protocol. This is mainly a summary of earlier points I thought would be helpful; feel free to jump to the summary in bold.

Participants who improve by 50% or more are classed as improvers. The proportion of improvers in the therapy groups is compared with the proportion improving in the SMC group.

Positive outcomes (improvers) defined as:
A 50% increase from baseline in SF-36 physical function score, or a score of 75 or more
A 50% reduction in fatigue score, or a score of 3 or less (bimodal scoring)
(Participants improving in both fatigue and physical function are classed as overall improvers.)
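For concreteness, the protocol's positive-outcome rules could be sketched like this (hypothetical helper functions of my own; the thresholds are taken from the protocol text above):

```python
def sf36_improver(baseline, followup):
    """Positive outcome for physical function (SF-36 PF, 0-100):
    a 50% increase from baseline, or a score of 75 or more."""
    return followup >= 75 or followup >= 1.5 * baseline

def fatigue_improver(baseline, followup):
    """Positive outcome for fatigue (Chalder, bimodal scoring 0-11):
    a 50% reduction from baseline, or a score of 3 or less."""
    return followup <= 3 or followup <= 0.5 * baseline

def overall_improver(sf36_base, sf36_follow, fatigue_base, fatigue_follow):
    """Overall improver: positive outcome on both measures."""
    return (sf36_improver(sf36_base, sf36_follow)
            and fatigue_improver(fatigue_base, fatigue_follow))

# A jump from 40 to 60 on SF-36 counts, since it is a 50% increase:
print(sf36_improver(40, 60))  # True
```

Note that 40 to 60 qualifies via the 50% rule even though 60 is well short of the 75-point threshold.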

Note that these are quite high thresholds: a 50% increase (decrease) or a return to near-normal functioning. One of the advantages of such a high threshold is that it makes it much more likely that these reflect real changes in health, rather than subjective changes in reporting.

Nb this approach categorises participants into either improvers or not improvers, a simplification that makes interpretation easier but does lose some of the information.

The trial then defines success as significantly more improvers in the therapy groups than in the SMC control. Protocol:
We propose that a clinically important difference would be between 2 and 3 times the improvement rate of (S)SMC.

However, the sample size calculation in the protocol reveals the authors were predicting that the therapy groups would improve at up to 6x the rate of the SMC group (protocol estimates for improvement rates: CBT 60%, GET 50%, APT 25%, SMC 10%).

In effect, the protocol was only interested in big improvements for individuals, and was anticipating that CBT/GET would do much better than SMC.
 

oceanblue

Guest
Understanding PACE primary outcomes - 2

apologies, this is a bit long, but oh-so worth it :D

Part 2 - The Lancet paper. Abandoning the protocol, the paper instead compared mean changes in fatigue and physical function scores. A clinically important difference was defined as a gain of 0.5 standard deviations (SD) or more:

We used continuous scores for primary outcomes...

A clinically useful difference between the means of
the primary outcomes was defined as 0.5 of the SD of
these measures at baseline, equating to 2 points for
Chalder fatigue questionnaire and 8 points for short
form-36.
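The arithmetic behind those two thresholds is just half the baseline SD (a minimal sketch; the implied baseline SDs of roughly 4 and 16 points are worked back from the thresholds the paper states, not taken from the trial data):

```python
def clinically_useful_difference(baseline_sd):
    # The Lancet paper's definition: 0.5 of the SD of the measure at baseline.
    return 0.5 * baseline_sd

# Implied baseline SDs, working back from the stated thresholds:
print(clinically_useful_difference(4))   # 2.0 points, Chalder fatigue questionnaire
print(clinically_useful_difference(16))  # 8.0 points, SF-36 physical function
```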

Results - difference in means
Fatigue: CBT -3.4; GET -3.2 (both relative to SMC), target = -2
Physical Function: CBT +7.1; GET + 9.4 (both relative to SMC), target =+8

Problems with the way 'clinically useful difference' was calculated.
There's a lot of debate about the best definition of a 'clinically useful difference', and the choice of 0.5 standard deviations (SD) of baseline scores is itself not particularly controversial (leaving aside that they'd changed the primary outcomes...). However, the Guyatt paper PACE cites in support of the 0.5 SD method notes that if the participants are particularly homogeneous (with less variation between their baseline scores), that will lower the SD and therefore lower the threshold for a clinically useful difference.

Now, in some ways the PACE trial selected a relatively homogeneous sample for fatigue and physical function, because explicit fatigue and activity thresholds were used to recruit patients (SF-36<70, CFQ>5), along with an implicit threshold because housebound patients were too ill to take part - let's assume an SF-36 score of 30 is needed to be well enough to get to the trial for therapy.

So for SF-36, baseline participants’ scores are effectively restricted to 30-65. That leads to a lower SD and in turn to a lower threshold for clinical difference. If, for instance, the SD had come out at 20 rather than about 16, the threshold for clinical difference would have been 10 and in this case neither GET nor CBT would achieve a ‘clinically useful difference’. (CBT 7.1, GET 9.4).
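That range-restriction effect is easy to demonstrate with simulated data (entirely hypothetical numbers, not PACE's): truncating a wide distribution to a 30-65 entry window shrinks the SD, and with it the 0.5 SD threshold:

```python
import random
import statistics

random.seed(1)

# Hypothetical "SF-36-like" scores from a wide population distribution...
population = [random.gauss(55, 20) for _ in range(100_000)]
# ...then keep only those inside a 30-65 eligibility window.
restricted = [score for score in population if 30 <= score <= 65]

sd_full = statistics.pstdev(population)
sd_restricted = statistics.pstdev(restricted)

# The restricted sample has a smaller SD, so the 0.5*SD threshold
# for a "clinically useful difference" comes out lower too.
print(round(sd_full, 1), round(sd_restricted, 1))
print(round(0.5 * sd_restricted, 1))
```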

The situation is even worse for the Chalder Fatigue scale because of its well-known ceiling effect, whereby it's easy for participants to hit the maximum score of 33 (the baseline mean was about 28). Consequently, instead of participants scoring, say, 33, 34, 35 or 38, they all score 33 - this further reduces the variance in the sample, which reduces the SD and so in turn lowers the threshold for a 'clinically useful difference'. This could explain why the clinically useful score is only 2. My feeling is that the derivation of this score of 2 is not credible, because of problems with the scale itself and the relatively homogeneous group of participants.
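The ceiling effect can be illustrated the same way (again purely hypothetical data, not PACE's): capping latent scores at the scale maximum of 33 compresses the top of the distribution and shrinks the SD:

```python
import random
import statistics

random.seed(2)

# Latent "fatigue" levels clustered near the top of the scale (mean ~30)...
latent = [random.gauss(30, 5) for _ in range(100_000)]
# ...but the recorded Chalder score cannot exceed the maximum of 33.
recorded = [min(round(score), 33) for score in latent]

# Capping shrinks the spread, lowering the SD and hence the 0.5*SD threshold.
print(round(statistics.pstdev(latent), 2))
print(round(statistics.pstdev(recorded), 2))
```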

So, if the fatigue scale threshold is dodgy, that leaves the SF-36 as the only primary outcome, and CBT already fails to reach this, with a mean difference of 7.1 vs a target of 8. The relatively homogeneous sample of patients raises further doubt that even 8 may be artificially low, which could then rule out GET too as not being 'useful'.

Added to this is the fact that fatigue and physical function scores are subjective, self-rated measures prone to reporting bias. When we look to the objective measures for confirmation, it turns out that the six-minute walking test (6MWT) did not improve significantly with CBT, though it did with GET.

So the published data used more ‘sensitive’ measures of difference but even then the therapy groups barely scraped over the new, much closer finishing line.

Given the problems inherent in using subjective measures, and issues over whether the ‘clinically useful’ threshold is artificially low, there must be real doubts over whether the primary outcome figures support the authors’ claim of ‘moderate effectiveness’ for CBT & GET.
 

Dolphin

Senior Member
Nb this approach categorises participants into either improvers or not improvers, a simplification that makes interpretation easier but does lose some of the information.
Yes, but any paper would still likely include the average scores as well.

We're missing some information, as you say, by not seeing how many had "decent" (my wording) improvements/got up to a "decent" level.
 

Dolphin

Senior Member
I'm pretty sure the use of 0.5 SD to compare means was not post-hoc; the post-hoc use you refer to was the percentage of participants who improved by more than the 'clinically useful difference'. I'm about to make a rather long - but terribly exciting - post about primary outcomes that covers this.
I'm not sure exactly how "post-hoc" should be defined. I would have thought it meant not in the protocol. I can't see any mention of 0.5SD in the protocol.
 

oceanblue

Guest
I'm not sure exactly how "post-hoc" should be defined. I would have thought it meant not in the protocol. I can't see any mention of 0.5SD in the protocol.
Fair point, but I think the crucial definition of post-hoc is 'after you've had sight of the data', since that allows researchers to 'torture the data until it tells them what they want to hear'.
 

oceanblue

Guest
Yes, but any paper would still likely include the average scores as well.
We're not getting some information, as you say, by not seeing how many had "decent" (my wording) improvements/got up to a "decent" level.

I was really making the point from the thoughtful Guyatt paper that things that simplify are easier to understand but lose information; what that paper concludes is that no measure is perfect and ideally you need to use several different measures to get the full picture, as you imply.
 

Dolphin

Senior Member
Fair point, but I think the crucial definition of post-hoc is 'after you've had sight of the data', since that allows researchers to 'torture the data until it tells them what they want to hear'.
Yes, I suppose that is reasonable.

I suppose what I was trying to get at was the use of predefined in the following:
Mean differences between groups on primary
outcomes almost always exceeded predefined clinically
useful differences for CBT and GET when compared
with APT and SMC
which gives the reader the impression that these were set out in the protocol.
 

oceanblue

Guest
I suppose what I was trying to get at was the use of predefined in the following:
which gives the reader the impression that these were set out in the protocol.
I completely agree. Their crucial cop-out phrase in the paper is
The statistical analysis plan was finalised, including changes
to the original protocol, and was approved by the trial
steering committee and the data monitoring and ethics
committee before outcome data were examined.
I doubt many people will have taken that in and certainly wouldn't know which particular measures were changed from the protocol.
 

Dolphin

Senior Member
apologies, this is a bit long, but oh-so worth it :D

Part 2 - The Lancet paper: Abandoning the protocol, instead mean changes in fatigue and physical function scores were compared. [...]

Given the problems inherent in using subjective measures, and issues over whether the 'clinically useful' threshold is artificially low, there must be real doubts over whether the primary outcome figures support the authors' claim of 'moderate effectiveness' for CBT & GET.
Excellent and astute points. Well done.

Indeed, if we used the SDs for both fatigue and physical functioning at the end of the trial, the 0.5 SD threshold would have approximately doubled.

When we look to the objective measures for confirmation, it turns out that CBT 6MWT did not improve significantly, though GET did.
To remind people, it wasn't just that there was no significant increase on the 6MWT: CBT did a tiny bit worse than SMC in both the adjusted and unadjusted figures.

Again with the walking test, it would have been interesting to see how many actually reached a "reasonable" level, i.e. not just be given average figures. Because gas measurements and the like weren't used alongside the 6MWT, there is still some scope for the result to be a bit artificial, with GET participants pushing themselves a bit harder, and also having more experience with walks, so having an idea what sort of pace they can maintain.
 

Dolphin

Senior Member
So for SF-36, baseline participants’ scores are effectively restricted to 30-65. That leads to a lower SD and in turn to a lower threshold for clinical difference. If, for instance, the SD had come out at 20 rather than about 16, the threshold for clinical difference would have been 10 and in this case neither GET nor CBT would achieve a ‘clinically useful difference’. (CBT 7.1, GET 9.4).
Indeed for some of the trial, 60 was the top score possible.

Technically, one may get scores below 30 but I wonder if the person hasn't really answered the question very well.

I've now uploaded Kathy Fulcher's PhD thesis to http://rapidshare.com/files/451772356/FulcherKathy245687.pdf (this was the basis of the Fulcher & White, 1997 BMJ study that, in effect, first "launched" GET). At the end, it gives the raw data for 68 participants. I'm guessing there is data for two drop-outs in there for some reason.

Anyway, it is interesting to see raw individual data. So amongst the baseline scores, on a quick check, there were four below 30: 25, 15 (x2) and 5. As I say, I think those people (the 5-15 scorers, anyway) most likely didn't give a fair reflection of their functioning/didn't answer it as others might.

One can see how the scores progressed. Remember that the Wessely fatigue scores are on the 14-item version of the Chalder Fatigue Scale (sic).
 

Dolphin

Senior Member
(Not important)

This is an observation on one of the primary outcome measures in the published protocol which probably is not that important now as they didn't use it.
---------
This study:
Stulemeijer M, de Jong LW, Fiselier TJ, Hoogveld SW,
Bleijenberg G (2005). Cognitive behaviour therapy for
adolescents with chronic fatigue syndrome: randomised
controlled trial. British Medical Journal 330. Published
online : 7 December 2004. doi :10.1136/
bmj.38301.587106.63.
used this as one of its criteria:
Self rated improvement
†Increase of ≥50 or end score of ≥75.
-----
It referenced this study:
Powell P, Bentall R, Nye F, Edwards RH. Randomised controlled trial of patient education
to encourage graded exercise in chronic fatigue syndrome. BMJ 2001;322:387-90.
for this, which did indeed use the same cut-offs (the scoring for that study runs 10-30, so initially the scoring looks weird).

What the PACE Trial protocol used can look the same but isn't:
The SF-36 physical function sub-scale [29] measures physical function, and has often been used as a primary outcome measure in trials of CBT and GET. We will count a score of 75 (out of a maximum of 100) or more, or a 50% increase from baseline in SF-36 sub-scale score as a positive outcome. A score of 70 is about one standard deviation below the mean score (about 85, depending on the study) for the UK adult population [51,52].
So, for example, a jump from 40 to 60 would have counted under this definition, as it increased by 50%.
 

anciendaze

Senior Member
Messages
1,841
Meanwhile, back at that histogram, I showed it to another friend with experience in statistics at lunch today. While I was explaining the meaning of the various parts, I realized the histogram shows no change at all over the range from 25 to 65. No movement within that range should be taken as statistically significant in terms of the general population, even after you switch to a non-normal distribution with kurtosis and skew. Quantization of scores makes the entire PACE trial virtually meaningless.
 