PACE Trial and PACE Trial Protocol

Bob

Senior Member
Messages
16,455
Location
England (south coast)
I've been trying to find detailed percentiles of SF-36 PF scores for the adult population, but haven't been able to find any so far.

But I have come across some helpful percentiles for the six minute walking distance test, for healthy adults.

In the PACE Trial, the mean distances walked in the 6MWD test, post-treatment, for the therapy groups were as follows:

GET = 379m
CBT = 354m

I've found a study that gives the 10th percentile of 'normal' men and women...

The 6-min walk distance in healthy subjects: reference standards from seven countries
C. Casanova et al.
doi: 10.1183/09031936.00194909
ERJ, January 1, 2011, vol. 37, no. 1, 150-156 (first published June 4, 2010)
http://erj.ersjournals.com/content/37/1/150.full

10th percentile for 'normal' men (age 50-59) = ~475m (approximation based on graph - precise data not given)
10th percentile for 'normal' women (age 50-59) = ~475m (approximation based on graph - precise data not given)


And there's another paper that shows the 5th percentiles for healthy men and women...

Reference Equations for the Six-Minute Walk in Healthy Adults
PAUL L. ENRIGHT and DUANE L. SHERRILL
AM J RESPIR CRIT CARE MED 1998;158:1384–1387.
http://ajrccm.atsjournals.org/content/158/5/1384.full.pdf

5th percentile for healthy men = 399m (n = 117, median age = 59.5)
5th percentile for healthy women = 310m (n = 173, median age = 62)



So I think the post-treatment mean 6MWD distances in the PACE Trial are roughly at the 5th percentile of the healthy adult population (i.e. within the lowest-performing 5% of the healthy population).

I can't find any data for the whole population, as opposed to healthy adults only.
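
To put those numbers side by side, here's a rough Python sketch using only the figures quoted above (the percentile cut-offs are my approximations from the two papers, so treat it purely as illustration):

# Rough comparison of the PACE post-treatment 6MWD means against the
# reference percentiles quoted above (all values in metres).
pace_means = {"GET": 379, "CBT": 354}

casanova_p10 = 475                        # ~10th percentile, healthy 50-59 year olds (read off a graph)
enright_p5 = {"men": 399, "women": 310}   # 5th percentiles, healthy adults

for group, dist in pace_means.items():
    print(f"{group}: {dist} m, "
          f"{dist - casanova_p10:+} m vs ~10th percentile, "
          f"{dist - enright_p5['men']:+} m vs male 5th percentile, "
          f"{dist - enright_p5['women']:+} m vs female 5th percentile")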
 

Graham

Senior Moment
Messages
5,188
Location
Sussex, UK
Thanks Dolphin. The purchase of 12 actometers was right, but I didn't allow for the greater spread of time. My puppy-like enthusiasm on finding a fresh thought in my brain overwhelmed my duty to think it through.

But if they had hoped to use the actometers to back up the fatigue and physical function scores, they must have expected to use them three times per patient at least (at the start, at 26 weeks and at 52 weeks). It would have taken military precision to get those 12 recycled amongst 640 patients. Knowing how easily and often things like this are damaged, delayed or lost, it never would have worked. If the patients were anything like my pupils at school, I couldn't have got enough data from 60 kids with only 12 actometers.
 

Graham

Senior Moment
Messages
5,188
Location
Sussex, UK
Thanks Dolphin. I had read your entries before. Very useful. I should really have commented, and let you know that you were being appreciated! I'm just getting lazy.
 

Dolphin

Senior Member
Messages
17,567
Thanks Dolphin. I had read your entries before. Very useful. I should really have commented, and let you know that you were being appreciated! I'm just getting lazy.
No worries, Graham.
Just in case it wasn't clear, that's a new thread I started tonight.
 

Bob

Senior Member
Messages
16,455
Location
England (south coast)
Effect size

I was wondering about what is commonly considered a 'moderate' improvement?

One of Guyatt's papers (Brożek et al. 2006), gives some helpful, and perhaps authoritative, info:

How a well-grounded minimal important difference can enhance transparency of labelling claims and improve interpretation of a patient reported outcome measure
Jan L Brożek, Gordon H Guyatt, and Holger J Schünemann
27 September 2006
Health and Quality of Life Outcomes 2006, 4:69
doi:10.1186/1477-7525-4-69
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1599713/pdf/1477-7525-4-69.pdf

Extracts from Brożek et al.:

"Cohen [44] provided a rough rule of thumb to interpret the magnitude of the effect sizes.
Changes in the range of 0.2 standard deviation units represent small changes, 0.5 – moderate changes, and 0.8 – large changes."

"Some recent empirical studies suggest that Cohen's guideline may in fact be generally applicable [49], but other authors propose that the MID is in the range of 0.2 to 0.5 standard deviation unit [50] or corresponds with an effect size of 0.5 [51,52]."


So, this suggests that:

'Small' is in the range of 0.2 x SD
'Moderate' is in the range of 0.5 x SD
'Large' is in the range of 0.8 x SD

I'm not sure how outcomes between 0.3 & 0.4 SD should be defined: are they 'small' or 'moderate'? (This might be 'moderately weak' - see quote at bottom of post.)

And outcomes between 0.6 & 0.7 SD: Would they be moderate or large?
(this might be 'moderately large' - see quote at bottom of post)


Here is another source, which perhaps helps to answer these questions, although it gives slightly different info to Brożek et al. (see quote at bottom):

Research Methods
Eighth Edition
Donald H. McBurney and Theresa L. White
Cengage Learning, 2009
(Chapter: Appendix A. Review of Statistics)
(pg 412)

Quote re Effect size:
"Correlations less than 0.2 are considered weak; 0.2 to 0.4, moderately weak; 0.4 to 0.6, moderate; 0.6 to 0.8, moderately strong; and 0.8 to 1.0, strong."

Link:
http://books.google.co.uk/books?id=AUDoy-lSe_EC&pg=PA412&lpg=PA412&dq=0.4 to 0.6 common accepted moderate effect size&source=bl&ots=aLRl1mLN7o&sig=YimFkLBTGgjSGUEaB6h9wDNQdiY&hl=en&sa=X&ei=L7N6UOKXL4W-0QXnhIAo&ved=0CCoQ6AEwAg#v=onepage&q=0.4 to 0.6 common accepted moderate effect size&f=false
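
To make the two rules of thumb concrete, here's a small Python sketch. The band boundaries are my own reading of where the lines fall (the sources above don't spell out what to do with, say, 0.35), and note that Cohen's bands apply to standardised mean differences (d) while the McBurney & White bands apply to correlations (r):

# Two different rules of thumb quoted above, expressed as code.
def cohen_label(d):
    """Cohen's rough guideline for a standardised mean difference (d)."""
    d = abs(d)
    if d < 0.2:
        return "below 'small'"
    elif d < 0.5:
        return "small"
    elif d < 0.8:
        return "moderate"
    return "large"

def mcburney_white_label(r):
    """McBurney & White's bands for a correlation coefficient (r)."""
    r = abs(r)
    if r < 0.2:
        return "weak"
    elif r < 0.4:
        return "moderately weak"
    elif r < 0.6:
        return "moderate"
    elif r < 0.8:
        return "moderately strong"
    return "strong"

for d in (0.2, 0.35, 0.5, 0.65, 0.8):
    print(d, "->", cohen_label(d))

On this reading, 0.3-0.4 would count as 'small' and 0.6-0.7 as 'moderate', but as the McBurney & White quote shows, other authors slice the range differently.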
 

Graham

Senior Moment
Messages
5,188
Location
Sussex, UK
I have real problems with using the standard deviation for this. Here's why. Suppose you are short and fat. You join a weight watchers club and a height watchers club. The first has a standard deviation of 30 pounds, the second a standard deviation of 3 inches. What would impress you more, a weight loss of 20 pounds or a height increase of 2 inches? Standard deviation measures variation between people, not acceptable variation for each individual. It isn't too hard to change your weight, but to change your height even slightly would be astounding.
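
To put my own example into numbers (a quick Python sketch, nothing more):

# The same change expressed in between-person SD units.
weight_loss_lb, weight_sd_lb = 20, 30   # weight watchers club
height_gain_in, height_sd_in = 2, 3     # height watchers club

print("Weight loss in SD units:", round(weight_loss_lb / weight_sd_lb, 2))   # 0.67
print("Height gain in SD units:", round(height_gain_in / height_sd_in, 2))   # 0.67
# Both come out at about 0.67 SD - a "moderate" effect by Cohen's rule of
# thumb - yet one is commonplace and the other would be astounding.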
 

user9876

Senior Member
Messages
4,556
I have real problems with using the standard deviation for this. Here's why. Suppose you are short and fat. You join a weight watchers club and a height watchers club. The first has a standard deviation of 30 pounds, the second a standard deviation of 3 inches. What would impress you more, a weight loss of 20 pounds or a height increase of 2 inches? Standard deviation measures variation between people, not acceptable variation for each individual. It isn't too hard to change your weight, but to change your height even slightly would be astounding.

I think you are right Graham. The point is that an index just measures something, and the standard deviation gives the spread over a population. It says nothing about mechanism or about what is ideal. Hence we know there is no mechanism to increase your height, but we also know weight is something that varies quite a lot. We know this by observation, and we also have an understanding of how people lay down and use fat reserves. If I was looking at stats around a weight watchers club, I would be looking at the average difference between starting and end weights, along with the standard deviation.

One of the problems with the way psychiatrists and social scientists use stats is that they believe the figures and forget to investigate the underlying mechanisms. I read a good paper a while ago that talked about Snow's use of statistics to understand the spread of cholera, where he did the work to understand the mechanism, and compared this with someone else who had quite an accurate model relating cholera outbreaks to elevation.

More generally we need to worry about measurement systems. How accurate are the weight measurements in our weight watchers club? Perhaps people wear lighter clothes, or there are issues in reading scale values or calibrating the scale. Hence if someone appears to have lost a small amount of weight, can we credit this change to weight watchers, or is it an issue with the measurement system? Of course, if they have lost a lot then the weight loss will be much bigger than the measurement error and we can have much more confidence in our result.

With the PACE trial there seem to be many issues with the accuracy of the scales they use, and they appear to show only small gains, so measurement error should be discussed.
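
To illustrate the point, here is a toy simulation in Python (nothing to do with the actual PACE numbers - the true gain and the measurement noise are invented purely to show the principle):

# Toy simulation: a small true gain can be swamped by measurement error.
import random
random.seed(1)

true_gain = 2.0        # assumed small true improvement (arbitrary units)
measurement_sd = 5.0   # assumed noise on each individual measurement
n = 100                # number of participants

# Observed gain = true gain + (error at follow-up) - (error at baseline),
# so the noise on each observed gain has an SD of about measurement_sd * sqrt(2).
observed = [true_gain + random.gauss(0, measurement_sd) - random.gauss(0, measurement_sd)
            for _ in range(n)]

print("Mean observed gain:", round(sum(observed) / n, 2))
print("Proportion who appear to have got worse:", sum(g < 0 for g in observed) / n)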

Of course, a course of CBT could perhaps help you increase your height. Not because it changes your height, but because each time you look at the tape measure you exaggerate the reading just a little.
 

Graham

Senior Moment
Messages
5,188
Location
Sussex, UK
Of course, a course of CBT could perhaps help you increase your height. Not because it changes your height, but because each time you look at the tape measure you exaggerate the reading just a little.

I like it!

Ah but if you were PACE, you would drop the use of a tape measure once you had got the funds to go ahead, and then you would base your conclusions on whether patients thought they were getting taller.

There is another aspect of variability though. Few people have an "exactly" constant weight (in comparison to height). With ME and fatigue, I guess the fatigue is a lot more variable than weight. At the moment I am working on the results of our survey on the Chalder Fatigue Scale to produce some sort of "paper", and I am looking closely at the different variabilities. It gets pretty complicated, yet it appears to me that a lot of these studies just plug their data into sophisticated statistical packages without looking at what essential assumptions those packages make about the data.

When I get the "paper" more sorted, if anyone would like to check it through before I try to get it accepted, I'd appreciate that. Mind you, whether it can ever reach those levels is another matter altogether.
 

user9876

Senior Member
Messages
4,556
I like it!

Ah but if you were PACE, you would drop the use of a tape measure once you had got the funds to go ahead, and then you would base your conclusions on whether patients thought they were getting taller.

There is another aspect of variability though. Few people have an "exactly" constant weight (in comparison to height). With ME and fatigue, I guess the fatigue is a lot more variable than weight. At the moment I am working on the results of our survey on the Chalder Fatigue Scale to produce some sort of "paper", and I am looking closely at the different variabilities. It gets pretty complicated, yet it appears to me that a lot of these studies just plug their data into sophisticated statistical packages without looking at what essential assumptions those packages make about the data.

When I get the "paper" more sorted, if anyone would like to check it through before I try to get it accepted, I'd appreciate that. Mind you, whether it can ever reach those levels is another matter altogether.

I came across this discussion of scales which I thought was pretty good. I've not got through all their work, but the issues discussed in the early section seem like a good discussion of accuracy, error, etc.

http://www.munshi.4t.com/papers/likert.html

One key issue with the Chalder Fatigue Scale is that it is measuring multiple things, i.e. mental fatigue, physical fatigue and, I think I have read, depression. Whilst I think mental and physical fatigue levels are often highly correlated, I think they can also vary separately.

Another issue is the timing of filling out the scale. I don't know how they were processed, but it could make a difference. Basically, if people are given a choice as to when to fill the form out, then I think there is a chance that the form gets left till a better day and hence may provide better readings. It's like giving someone at Weight Watchers the choice of when to be weighed. The question is whether there is more choice at different points within the intervention. It shouldn't make any difference between the groups, though it might help explain the overall positive results.
 
Messages
13,774
Of course, a course of CBT could perhaps help you increase your height. Not because it changes your height, but because each time you look at the tape measure you exaggerate the reading just a little.

I'd genuinely like to see an RCT done for some Lightning Process style approach on this (if we ignore the obvious moral concerns, anyway).
 

Simon

Senior Member
Messages
3,789
Location
Monmouth, UK
I have real problems with using the standard deviation for this. Here's why. Suppose you are short and fat. You join a weight watchers club and a height watchers club. The first has a standard deviation of 30 pounds, the second a standard deviation of 3 inches. What would impress you more, a weight loss of 20 pounds or a height increase of 2 inches? Standard deviation measures variation between people, not acceptable variation for each individual. It isn't too hard to change your weight, but to change your height even slightly would be astounding.
I'm pretty sure that for Cohen's d the relevant standard deviation is the SD for CHANGE scores, i.e. endpoint minus baseline, calculated for each individual. So for height the mean change would be zero, hence the Cohen's d effect size is zero too*. At least that's what it says in my stats course...


*or potentially, er, 'error', if SD is zero too, but in any case it would not be a normal distribution so Cohen's d could not be computed
 

Dolphin

Senior Member
Messages
17,567
I'm pretty sure that for Cohen's d the relevant standard deviation is the SD for CHANGE scores, i.e. endpoint minus baseline, calculated for each individual. So for height the mean change would be zero, hence the Cohen's d effect size is zero too*. At least that's what it says in my stats course...


*or potentially, er, 'error', if SD is zero too, but in any case it would not be a normal distribution so Cohen's d could not be computed
I've just watched these videos and that doesn't sound right - it says that it's the pooled standard deviation that is used.

I actually thought Graham was right but the video says that is actually Glass' delta: the video explains how Glass' delta, the "real" Cohen's d, and Hedges g are all sometimes called Cohen's d!

1st two videos are easy enough - second half of third one is less basic but not really relevant:

Cohen's d video 1 (5 mins 4 secs):
Cohen's d video 2 (5 mins 3 secs):
Cohen's d video 3 (4 mins 39 secs):
He also points out that he thinks most people actually don't do the calculation for the proper Cohen's d the right way (they just add the two SDs together and divide by two). No wonder there's a bit of confusion.
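
To see why the naming matters, here's a quick Python sketch of the three denominators being talked about, as I understand them from the videos (made-up numbers, two independent groups):

# Different denominators that all get called "Cohen's d".
from statistics import mean, stdev

x = [55, 60, 62, 58, 65, 70, 59, 61]   # e.g. treatment group scores
y = [50, 52, 57, 49, 55, 53, 51, 54]   # e.g. control group scores

diff = mean(x) - mean(y)
nx, ny = len(x), len(y)
sx, sy = stdev(x), stdev(y)

# Pooled SD, weighted by degrees of freedom (the "real" Cohen's d denominator)
sd_pooled = (((nx - 1) * sx**2 + (ny - 1) * sy**2) / (nx + ny - 2)) ** 0.5

print("d with pooled SD:", round(diff / sd_pooled, 2))
print("Glass' delta (control-group SD only):", round(diff / sy, 2))
print("d with averaged SDs (the shortcut):", round(diff / ((sx + sy) / 2), 2))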
 

Simon

Senior Member
Messages
3,789
Location
Monmouth, UK
I've just watched these videos and that doesn't sound right - it says that it's the pooled standard deviation that is used.

I actually thought Graham was right but the video says that is actually Glass' delta: the video explains how Glass' delta, the "real" Cohen's d, and Hedges g are all sometimes called Cohen's d!

1st two videos are easy enough - second half of third one is less basic but not really relevant:

Cohen's d video 1 (5 mins 4 secs):
Cohen's d video 2 (5 mins 3 secs):
Cohen's d video 3 (4 mins 39 secs):
He also points out that he thinks most people actually don't do the calculation for the proper Cohen's d the right way (they just add the two SDs together and divide by two). No wonder there's a bit of confusion.
Not seen those videos, but it's correct that the SD is the pooled SD. However, it's the pooled (and weighted) SD of the change/DIFFERENCE scores for each group. So it would still be a mean change of zero for height, with a non-normal distribution.
 

Dolphin

Senior Member
Messages
17,567
Not seen those videos, but it's correct that the SD is the pooled SD. However, it's the pooled (and weighted) SD of the change/DIFFERENCE scores for each group. So it would still be a mean change of zero for height, with a non-normal distribution.
In the video he doesn't use the SD of the individual change or difference scores. Potentially he's wrong, but I'd like a link to be convinced.
 

Graham

Senior Moment
Messages
5,188
Location
Sussex, UK
I'll have a look at the videos tomorrow (with luck). Thanks for that. But I still have my Statistics course to finish as well. Cognitive indigestion.

But to be specific, unless you conduct repeated baseline assessments, how can you tell whether the baseline assessment is pretty fixed (like height), merely chance (like rolling dice), cyclical (say hormone changes), or a mix of these? (I am ignoring measurement error in this, of course). My interest is in Chalder Fatigue Scores, and I am not aware of any study carried out that repeatedly assessed individuals over a reasonable period of time to determine the pattern and extent of good/bad patches. The closest we get to this is when we have a proper control group, where then we have the opportunity to see "natural" individual variation. What bothers me is when the standard deviation across a group of individuals is used to determine the significance of a change: that is the type of standard deviation that the PACE trial appears to use as a measure to determine both normal function and clinical difference, which is where Bob's original question came about.

The last lecture on my Statistics course started to touch on this, and separated the two as systematic and unsystematic variation. Unsystematic variation is the muddling natural variations of the individual, as opposed to the systematic variations caused to individuals by some form of treatment (or at least I think that is how it went).

Following this thread and trying to be coherent is doing wonders for my understanding!
 

Simon

Senior Member
Messages
3,789
Location
Monmouth, UK
Have to say it's proving quite a work-out for my understanding of Cohen's d too; it's definitely more complex than I'd appreciated! I will try again to make some sense of this, with a clearer head this morning. Unfortunately I have a lousy internet connection so can't access any videos.

Comparing the difference between means: t-tests and Cohen's d
It's probably worth looking at the fundamentals of two related tests used to assess the difference between means. The first is the t-test, used to measure the statistical significance of a difference.

t-test formula:

t = (Observed change minus Expected change)/Standard Error

The standard error term depends on the specific t-test being conducted (more on that later), but the standard error is always strongly affected by the sample size, n: the larger the sample, the smaller the standard error, and therefore the more likely it is that t, and hence the mean difference, will be significant. This also means that very small changes can be statistically significant in large samples.

On the other hand, Cohen's d looks at how BIG the effect is: even if it is significant, is it important?

Cohen's d formula:

d = (Observed change minus Expected change)/Standard Deviation

As with the t-test, the relevant standard deviation depends on the nature of the means being tested. However, unlike the standard error, the SD is generally not affected by sample size, so Cohen's d gives a consistent measure of effect size in big and small samples (though the confidence limits of Cohen's d will be much wider in a small sample).
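
In code, that contrast looks something like this (illustrative numbers only - the point is just that the standard error shrinks as n grows while the SD does not):

# t statistic (signal / standard error) vs Cohen's d (signal / standard deviation)
import math

def t_and_d(mean_change, sd_change, n, expected_change=0.0):
    se = sd_change / math.sqrt(n)                      # shrinks as n grows
    t = (mean_change - expected_change) / se
    d = (mean_change - expected_change) / sd_change    # does not depend on n
    return t, d

for n in (20, 200, 2000):
    t, d = t_and_d(mean_change=2.0, sd_change=10.0, n=n)
    print(f"n={n:5d}  t={t:6.2f}  d={d:.2f}")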


Measuring pre/post differences in a sample (dependent t-test)
So imagine you take a bunch of people, measure their weight and height (as in Graham's example) and give them CBT. Then you can measure (as opposed to asking subjects their perception of...) height and weight after CBT, and compare the significance of the change in means, pre to post.

In this case the same subjects are in the pre and post groups, so you use a dependent t-test. The standard error term is SD/√n, and crucially the SD is the SD of the difference scores for each individual, pre to post.

In any event, the observed change for height will be post minus pre, which will be zero. The expected change is also zero as this is null hypothesis testing. Result: t=0, the difference is not significant. Similarly, Cohen's d would be 0.


Comparing independent means (independent t-test)
Say you wanted to compare two independent groups e.g. blood glucose levels in men vs women. In this case there is no 'difference' score as such. The mean difference is mean.men minus mean.women and the Standard Deviation is the pooled standard deviation of the two groups, just as Dolphin described.

However, I think there is an important difference in experimental conditions, and here I'm talking about PACE. For the SMC control group you can measure a difference score, for baseline vs 52 weeks, and the same goes for the CBT, GET and APT groups. I'm now not quite sure what comparisons PACE did. They could have:
a) compared the gain from baseline for each therapy group vs gain from baseline for SMC (i.e. using difference scores), or
b) compared the final 52-week scores for each therapy group vs the 52-week scores for SMC, adjusted for baseline differences (i.e. using the SDs of the groups)

However, with b), which SD would you use - baseline or final? I know I looked into this a year ago and according to my notes the right answer was 'use the difference SDs', but unfortunately I don't have a source for that! On the other hand, what's being asked here is 'how much better is CBT at improving patients (relative to SMC's improvement of patients)?' In both groups we are looking at improvements/change so I would argue we do need to use difference scores, but can't be sure.
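
For what it's worth, option a) would look something like the sketch below as a calculation - illustrative numbers only, and I'm not claiming this is what PACE actually did:

# Option a): compare the gain from baseline in a therapy group with the gain
# from baseline in the SMC group, standardised by the pooled SD of the
# change scores. All numbers below are invented.
from statistics import mean, stdev

cbt_change = [5, 8, -2, 10, 4, 7, 0, 6, 9, 3]   # 52 weeks minus baseline, CBT
smc_change = [2, 1, -3, 4, 0, 5, -1, 2, 3, 1]   # 52 weeks minus baseline, SMC

extra_gain = mean(cbt_change) - mean(smc_change)
n1, n2 = len(cbt_change), len(smc_change)
s1, s2 = stdev(cbt_change), stdev(smc_change)
sd_pooled = (((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)) ** 0.5

print("Mean extra gain for CBT over SMC:", extra_gain)
print("Effect size using the pooled SD of change scores:", round(extra_gain / sd_pooled, 2))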

One final point. PACE used a General Linear Model to assess the results, so the correct measure of effect size is probably eta-squared, not Cohen's d. Eta-squared is analogous to the squared correlation coefficient r² in that it is the proportion of variance explained (by the treatment in this case), varying from 0 to 1.0.
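
For a simple one-way comparison, eta-squared is just the between-group sum of squares over the total sum of squares - something like this (again, made-up numbers; the PACE model with covariates would be more involved):

# Eta-squared: proportion of the total variance in the outcome that is
# attributable to group membership (0 = none, 1 = all of it).
from statistics import mean

groups = {
    "CBT": [58, 62, 55, 64, 60],
    "GET": [57, 61, 59, 63, 56],
    "SMC": [50, 54, 52, 49, 53],
}

all_scores = [s for scores in groups.values() for s in scores]
grand_mean = mean(all_scores)

ss_between = sum(len(scores) * (mean(scores) - grand_mean) ** 2
                 for scores in groups.values())
ss_total = sum((s - grand_mean) ** 2 for s in all_scores)

print("eta-squared:", round(ss_between / ss_total, 2))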
 

Graham

Senior Moment
Messages
5,188
Location
Sussex, UK
Simon, you should volunteer to take over the lectures! Lucid, logical and structured!

The big problem with PACE is the lack of a control group. Even the SMC group had treatment. But I guess we could argue that the changes in individual scores in the SMC group between week 26 and week 52, when no treatment was given, would give us an indication of individual variability, but we do not have any information about this. As a group though, the overall pattern of results wouldn't (and didn't) change much (group mean and s.d.), and that appears to be the information that we have been given.

Could it possibly be that having access to the raw data would be useful?
 

Dolphin

Senior Member
Messages
17,567
Thanks Simon.
The example Graham had mentioned was before and after a weight watchers club or similar, with no control group, so that was what I was discussing. The video talks about two sets of SDs, while the change-score approach you mentioned would involve only one.

However, the more interesting case is the one you have then discussed, where one has four sets of numbers: before and after treatment for groups 1 and 2 (or two sets of change scores), and I'm not sure what is done in that case - you could be right, but I haven't thought it through and don't recall reading much on it. ETA: thinking about it, perhaps that is right - the change for a control group is like a baseline score.

General Linear Model is something I don't know anything about so glad to see you have some sort of understanding of it.
 

Simon

Senior Member
Messages
3,789
Location
Monmouth, UK
Thanks Simon.
The example Graham had mentioned was before and after a weight watchers club or similar, with no control group, so that was what I was discussing. The video talks about two sets of SDs, while the change-score approach you mentioned would involve only one.

However, the more interesting case is the one you have then discussed, where one has four sets of numbers: before and after treatment for groups 1 and 2 (or two sets of change scores), and I'm not sure what is done in that case - you could be right, but I haven't thought it through and don't recall reading much on it. ...

General Linear Model is something I don't know anything about so glad to see you have some sort of understanding of it.
I think that was my entire knowledge of the General Linear Model in one sentence. All else I can say is that multiple regression is a sub-type of the GLM, and ANOVA is a particular form of multiple regression. I think things like Structural Equation Modelling and Path Analysis go beyond the GLM.

My knowledge of the pre/post thing comes from the Coursera stats course run by a Princeton professor who has run the same course for years. I wouldn't bet on him being wrong :) Though I might have misunderstood him.
ETA: thinking about it, perhaps that is right - the change for a control group is like a baseline score.
That's well put. I particularly like that this approach seems to be making the most of all the available information.