New attempt to avoid releasing data on 'recovery' from PACE

Esther12 · Nov 12, 2012

Dolphin said:
It really depends if they use CFQ also. If they do, I think the SF-36 PF threshold would be easier again. I'm wondering whether not satisfying both of the entry criteria is what they are counting?

Yeah - that's what I'd guess too. Maybe they realised that having papers defining 'recovery' in a way that overlapped with their own definition for "severe and disabling fatigue" was too transparently absurd.

Bob · Nov 13, 2012

Has anyone come across a definition of 'recovery' in any other research?

alex3619 · Nov 13, 2012

http://www.gbs-cidp.org/wp-content/uploads/2012/01/GBS-survey-post-illness-EJN-20101.pdf

I haven't had time to read this yet, but it does consider recovery status, and it uses the SF-36 for a neurological disorder, Guillain-Barré syndrome.

Bob · Nov 13, 2012

http://www.forward-me.org.uk/Reports/White%20to%20Lancet%20re%20Hooper%20complaint%20(2).pdf

PACE - Response to the complaint to The Lancet of March 2011

Extract From White’s letter:

"The threshold SF36 score given in the protocol for recovery (85) was an estimated mean (without a standard deviation) derived from several population studies. We are planning to publish a paper comparing proportions meeting various criteria for recovery or remission, so more results pertinent to this concern will be available in the future."

I wonder why the mean score for the general population is considered a 'recovery'?
I would have thought that the median score (95 for SF-36 PF) would be more appropriate.
I wonder if median scores have been used in research to indicate a recovery?
Or, is any particular percentile of SF-36 physical function scores, for the general population, commonly considered a 'recovery'?

user9876 · Nov 13, 2012

Bob said:
http://www.forward-me.org.uk/Reports/White%20to%20Lancet%20re%20Hooper%20complaint%20(2).pdf

PACE - Response to the complaint to The Lancet of March 2011

Extract From White’s letter:

"The threshold SF36 score given in the protocol for recovery (85) was an estimated mean (without a standard deviation) derived from several population studies. We are planning to publish a paper comparing proportions meeting various criteria for recovery or remission, so more results pertinent to this concern will be available in the future."

I wonder why the mean score for the general population is considered a 'recovery'?
I would have thought that the median score (95 for SF-36 PF) would be more appropriate.
I wonder if median scores have been used in research to indicate a recovery?
Or, is any particular percentile of SF-36 physical function scores, for the general population, commonly considered a 'recovery'?

I have a problem with the SF-36 PF scale: I don't think it is valid to use the mean and standard deviation for it. I've been trying to formulate a couple of different arguments; they are complicated and not yet well formed, but I feel a discussion of the properties of the scales they are using as primary outcomes is important. The basic arguments I'm making also need to be better researched in terms of what is required from a scale and the validity of the statistics.

I'm thinking that, since the improvements were small, the measurement scales and the definitions of improvement or remission need to be accurate to justify the economics of the treatments they propose. Hence casting doubt on very dodgy measurement techniques becomes important.

1)

To me it seems obvious that SF-36 scores for the general population will follow a multimodal distribution: there is a distribution for healthy people along with distributions for different groups of sick and disabled people, and these individual distributions add up to a multimodal one. Statisticians will often refer to the central limit theorem and say that the sum of many independent random variables is normally distributed. However, I would argue that these are not independent, since there is a set of healthy people and each group of sick people is a particular deviation from that healthy group, and hence there is a relationship. Thus we are left with a multimodal distribution, and it makes little sense to summarise it with the mean and standard deviation.

2)
The scale has 10 questions, each of which reflects an ability to do a physical task either easily, with some difficulty, or with a lot of difficulty. The aim is to measure physical function as a single variable; hence it is based on the theory that physical function is a single measurable thing.

Let's define a person as having a physical function of x, where x is a member of X (the set of possible physical functions). Let's go a little bit further and define X as a continuous set between 0 and 1 (dead and completely able). What we are really interested in is how different values of x map onto different scores on the SF-36 PF scale. If this mapping is linear (that is, a change of y from x1 leads to the same change in score as a change of y from x2) then the scale is an interval scale and the mean and standard deviation can be used. Otherwise we have a situation where the scale function is simply monotonic (an ordinal scale) and only the median or percentiles can be used. We could even have a situation where the scale is a non-monotonic function, in which case it is only a nominal scale and it only makes sense to talk about how many of each class exist. http://www.mpopa.ro/statistica_licenta/Stevens_Measurement.pdf
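As a rough illustration of the interval-versus-ordinal point, here is a minimal sketch in Python. The two groups and the square-root relabelling are made-up assumptions, not SF-36 data. The only point is that a strictly increasing relabelling of an ordinal scale can reverse a comparison of means while leaving the comparison of medians unchanged.

```python
from math import sqrt
from statistics import mean, median

# Hypothetical scores for two groups on some ordinal scale (not real data).
group_a = [1, 1, 10]
group_b = [3, 3, 3]


def rescale(scores):
    """Apply a strictly increasing (monotonic) relabelling of the scale."""
    return [sqrt(s) for s in scores]


def compare(a, b):
    """Return which group ranks higher by mean and by median."""
    by_mean = "A" if mean(a) > mean(b) else "B"
    by_median = "A" if median(a) > median(b) else "B"
    return by_mean, by_median


original = compare(group_a, group_b)
rescaled = compare(rescale(group_a), rescale(group_b))
print("original scale (mean, median):", original)
print("rescaled scale (mean, median):", rescaled)
```

On the original labels group A has the higher mean; after the relabelling group B does, while B has the higher median both times. If the spacing of the scale steps is arbitrary, a difference in means can be an artefact of that spacing.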

Now suppose we break up the questions so that, rather than having a question q1, we have Q1e, Q1s and Q1d, which represent three sets of people: those who find q1 easy, those with some difficulty, and those with a lot of difficulty.

We can define person i as a member of Q1e when their x > q1e, and similarly for q1s and q1d, giving us three thresholds defined on the underlying variable of physical function.

If we do this for all questions we have a set of thirty thresholds which can then be placed onto our interval X (a line from 0 to 1). We do this assuming that there is a well defined ordering of how hard different activities are. We can define this in terms of a set of thresholds t1 to t30 where each maps to one of the question thresholds and ti > ti+1.

A person's score will be determined by which of these intervals they fall into.

The linearity of the scale will depend on having each of these thirty thresholds evenly distributed across X. If they are clumped or unevenly distributed, the SF-36 scale is not an interval scale and the mean and standard deviation should not be used. To my mind it is up to those using the scale to demonstrate that these intervals are even, to justify their analysis using the mean and standard deviation.

There could be some errors in that different people may consider the thresholds to have different orderings. There are two cases here. Some of it is just error, where each person perceives the ordering differently. If the points are close and the ordering is debatable, then this is not important to the argument, since it still suggests that the scale is non-linear. If people disagree strongly over the orderings and would place big differences between the thresholds, then I would argue that physical function isn't a single concept to measure, and a different analysis would apply since the scale would then be over multiple variables.
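The threshold argument above can be sketched in Python. This is only an illustration under stated assumptions: the threshold positions are invented, a person is assumed to pass every threshold that lies below their x, and the sketch uses 20 cut points (10 items with two steps between three response levels) rather than thirty, which does not change the argument. Each step is worth 5 points, as in the 0-100 SF-36 PF scoring.

```python
from bisect import bisect_right


def pf_score(x, thresholds):
    """Score a person with latent physical function x (between 0 and 1).

    Every threshold at or below x adds one 5-point step on the 0-100 scale.
    """
    return 5 * bisect_right(sorted(thresholds), x)


# Hypothetical threshold layouts (assumptions, not estimated from any data).
even = [(i + 1) / 21 for i in range(20)]         # evenly spread over (0, 1)
clumped = [0.705 + 0.01 * i for i in range(20)]  # bunched between 0.7 and 0.9

# The same change in x (0.2) applied at two different starting points.
for x_low, x_high in [(0.2, 0.4), (0.7, 0.9)]:
    gain_even = pf_score(x_high, even) - pf_score(x_low, even)
    gain_clumped = pf_score(x_high, clumped) - pf_score(x_low, clumped)
    print(f"x {x_low} -> {x_high}: even +{gain_even}, clumped +{gain_clumped}")
```

With evenly spread thresholds both changes in x give the same gain in score. With clumped thresholds the first change gives no gain and the second moves the score from 0 to 100, so equal changes in score do not mean equal changes in function.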

Esther12 · Nov 13, 2012

?

I'm sure I posted a reply to you 9876. Damn it... that took ages! My abrupt version: I thought that maybe you weren't taking sufficient account of the innate difficulty of designing a questionnaire that would assess disability. I don't know if a measure could be constructed which would be an interval scale. There are additional problems specific to the way SF36-PF data was used in PACE, and the way the normal range was defined, and I see how the points you raised also affect some of the assumptions made about these sorts of measures of disability, but I'm not sure how important a point it is in relation to PACE.

Also, I stumbled upon this blog post, and thought it seemed relevant to PACE.

http://neuroskeptic.blogspot.co.uk/2011/02/decline-and-fall-of-effects-in-science.html

The Decline And Fall of Effects In Science

Nature has a piece called Unpublished results hide the decline effect.

This refers to the fact that many scientific findings which seem to indicate something big is happening, end up getting smaller and smaller as more people try to replicate them until they, eventually, may vanish entirely.

Schooler doesn't go into detail as to how this repository would be set up, but he does cite the fact that we already have a pretty good one for clinical trials of medicines conducted in the USA. Anyone running a clinical trial is required to register it in advance, saying what they're planning to do and crucially, to spell out which statistics they are going to run on the data when it arrives.

What's really silly is that most scientists already do this when applying for funding: most grant applications include detailed statistical protocols. The problem is that these are not made public so people can ignore them when it comes to publication. Back in 2008 I suggested that scientific journals should require all studies, not just clinical trials, to be publicly pre-registered if they're to be considered for publication. This would be eminently do-able if there was a will to make it happen.

Snow Leopard · Nov 13, 2012

Given that the median and modal SF-36 scores are above the mean and the scale sharply cuts off at 100, it makes no sense to talk about normal being within 1 SD, because it is clearly not a normal distribution.
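To make the point concrete, here is a small Python sketch using an invented sample of 100 scores with a ceiling at 100. The counts are hypothetical and chosen only to be skewed in the way described. They are not taken from any survey.

```python
from statistics import mean, median, pstdev

# Hypothetical score -> number of people (100 people in total, not survey data).
counts = {100: 36, 95: 14, 90: 10, 85: 8, 80: 7, 70: 7,
          60: 5, 50: 4, 40: 3, 25: 3, 10: 2, 0: 1}
scores = [score for score, n in counts.items() for _ in range(n)]

mu, sd = mean(scores), pstdev(scores)
lower, upper = mu - sd, mu + sd

# Share of the sample falling outside mean +/- 1 SD on each side.
below = sum(s < lower for s in scores) / len(scores)
above = sum(s > upper for s in scores) / len(scores)

print(f"mean={mu:.2f} median={median(scores)} sd={sd:.2f}")
print(f"mean + 1 SD = {upper:.1f} (scale maximum is 100)")
print(f"below mean - 1 SD: {below:.0%}, above mean + 1 SD: {above:.0%}")
```

In this made-up sample the median sits well above the mean, mean + 1 SD lands above the maximum possible score so nobody can exceed it, and the two tails are nothing like the symmetric 16% each that a normal distribution would give.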

Bob · Nov 13, 2012

Snow Leopard said:
Given that the median and modal SF-36 scores are above the mean and the scale sharply cuts off at 100, it makes no sense to talk about normal being within 1 SD, because it is clearly not a normal distribution.

But if you're a PACE statistician, or a Lancet editor, then it makes perfect sense!!!

Mark · Nov 13, 2012

Has anybody tried contacting academic statisticians to put this point to them? Although I have a maths degree, stats was never really a significant part of what I did, but from the little I understand of it, this point about the distribution seems really quite clear. Surely there must be respected statisticians out there somewhere who can put their name to agreeing (with handy quote, perhaps) that it is questionable what they have done here? Or even have a conversation with us here about these issues, to put them in their proper context?

Snow Leopard · Nov 13, 2012

About 68% of a normally distributed population is within 1 SD of the mean. So we could just look at the SF-36 data for a healthy population and see the cut-off above which 68% (+16% for the upper tail, so 84%) of the population lies.

Because in the end, that is what they are trying to say.
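For reference, the shares quoted above can be checked with Python's standard library. This is only a sketch of the textbook normal-distribution figures and assumes a perfectly normal population, which is exactly the assumption in question for SF-36 PF scores.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, SD 1

within_1sd = z.cdf(1) - z.cdf(-1)  # share between mean - 1 SD and mean + 1 SD
below_1sd = z.cdf(-1)              # share below mean - 1 SD
above_1sd = 1 - z.cdf(1)           # share above mean + 1 SD

print(f"within 1 SD:       {within_1sd:.1%}")
print(f"below mean - 1 SD: {below_1sd:.1%}")
print(f"above mean + 1 SD: {above_1sd:.1%}")
```

So for a normal distribution, a lower cut-off at mean - 1 SD corresponds to roughly the 16th percentile, with about 84% of people at or above it.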

Snow Leopard · Nov 13, 2012

The raw data isn't provided, but the graph shows a sharp cutoff at 100 (and reports the ceiling effect), and the mean/SD for different age ranges:

http://health.adelaide.edu.au/pros/...llbeing_south_australian_population_norms.pdf

For the 35-44 age group, the 25th percentile PF score is 90 - and this is what I'd consider the cutoff for 'normal'.

At 45-54, it drops to 80. (increased prevalence of illness - arthritis, type 2 diabetes etc)

biophile · Nov 13, 2012

A modification:

Bob · Nov 13, 2012

Snow Leopard said:
About 68% of a population is within 1 SD of the mean, of a normally distributed set. So we could just look at the SF-36 data for a healthy population and see the cut off where 68% of the population lies.

Because in the end, that is what they are trying to say.

That's what I originally thought, but then I thought about it further, and realised that +/-1SD cuts off the top and bottom 16% of values. In which case we would need to determine the 16th percentile for the normal population.
Would you agree, or disagree, Snow Leopard?
I've looked very hard to find the 16th percentile in the normative data, but I can't find it.
In any case, it's not an appropriate analysis for many reasons, the skewed distribution being just one of them. (It's too late at night to illustrate them all at the mo.)

Mark said:
Has anybody tried contacting academic statisticians to put this point to them? Although I have a maths degree, stats was never really a significant part of what I did, but from the little I understand of it, this point about the distribution seems really quite clear. Surely there must be respected statisticians out there somewhere who can put their name to agreeing (with handy quote, perhaps) that it is questionable what they have done here? Or even have a conversation with us here about these issues, to put them in their proper context?

It's a good idea, Mark. I don't know any prominent statisticians, but it might be worth trying to find one.

Bob · Nov 13, 2012

‘Percentiles’ for the HEALTH SURVEY FOR ENGLAND 1996

http://www.archive.official-documents.co.uk/document/doh/survey96/tab5-18.htm

Health Survey for England (HSE) 1996 (ages 16+)

SF-36 Physical Function scores

All Adults (ages 16+)

Mean for all adults = 81
Median for all adults = 95

36% of adults have the highest/maximum score (100) for Physical Functioning (64th percentile = 100)
http://www.archive.official-documents.co.uk/document/doh/survey96/ehch5.htm

25th percentile = 75

50th percentile (median score) = 95

75th percentile = 100

Of course, these aren't age-matched to the PACE Trial, so they aren't exactly appropriate.

(I always advise people to check for mistakes before quoting me!)

Snow Leopard · Nov 14, 2012

Bob said:
That's what I originally thought, but then I thought about it further, and realised that +/-1SD cuts off the top and bottom 16% of values. In which case we would need to determine the 16th percentile for the normal population.

Yes, you are right. But the key is that the data for the age/sex matched regular population itself can't be used, because a variety of disabling medical conditions have been excluded from the CFS cohort, which were not excluded from the regular population data.

I note from that link,

A quarter of men (25%) and women (27%) reported a longstanding illness that limited their activities in some way, while under a fifth (18% of men and 16% of women) reported a longstanding illness that did not limit their activities.

Which is why I feel the 25th percentile for normal population is appropriate. Or 16th percentile for the population with limiting longstanding illness excluded.

Simon · Nov 14, 2012

Snow Leopard said:
Yes, you are right. But the key is that the data for the age/sex matched regular population itself can't be used, because a variety of disabling medical conditions have been excluded from the CFS cohort, which were not excluded from the regular population data.

I note from that link,

Which is why I feel the 25th percentile for normal population is appropriate. Or 16th percentile for the population with limiting longstanding illness excluded.

I checked at the time, and PACE's use of 'mean - 1SD' does in fact give almost exactly the 16th percentile on the Bowling population data. Though just as you say, the Bowling data used was for ALL adults (>30% over 65) and would include those with illnesses excluded from PACE. Using the 16th percentile on a healthy working-age population gives a threshold of around 80, not the 60 used in PACE.

user9876 · Nov 14, 2012

Esther12 said:
?

I'm sure I posted a reply to you 9876. Damn it... that took ages! My abrupt version: I thought that maybe you weren't taking sufficient account of the innate difficulty of designing a questionnaire that would assess disability. I don't know if a measure could be constructed which would be an interval scale. There are additional problems specific to the way SF36-PF data was used in PACE, and the way the normal range was defined, and I see how the points you raised also affect some of the assumptions made about these sorts of measures of disability, but I'm not sure how important a point it is in relation to PACE.

Also, I stumbled upon this blog post, and thought it seemed relevant to PACE.

http://neuroskeptic.blogspot.co.uk/2011/02/decline-and-fall-of-effects-in-science.html

I agree that it is hard to construct a measure that can be an interval scale. The point is that in many of the trials and in particular the PACE trial they treat scales as if they were interval scales when they are not. They then go on to define things like 'normal ranges' and clinically useful differences based on these assumptions.

There seem to be two approaches that could be correct: firstly, to report more dimensions for each questionnaire, or secondly, to use a utility approach such as they do in the EQ-5D survey.

The point for PACE is that if you have scales that are basically not up to supporting the analysis that is done, then the results published are not trustworthy and should be withdrawn. To me this is not a case of saying there are unpublished trials, but one of saying that trials that have been run don't have a suitable measurement framework. To use an analogy, the way the PACE trial has used its measurements is a bit like me drawing some marks at random on a bit of paper, labelling them as distances, and then using them to measure the length of stuff.

There are so many arguments as to why the PACE results cannot be trusted. As well as a mathematical argument around the nature of the scales, there need to be psychological arguments around how CBT frames people's mindset into giving more positive answers to questions (such as in Graham's latest video).

Simon · Nov 14, 2012

user9876 said:
I have a problem with the SF-36-pf scale in that I don't think it is valid to use the mean and std deviation for it.

Think you are making some very interesting points. Some comments and musings on yours:

I'm thinking that, since the improvements were small, the measurement scales and the definitions of improvement or remission need to be accurate to justify the economics of the treatments they propose. Hence casting doubt on very dodgy measurement techniques becomes important.

I think they used EuroQol for the cost-benefit analysis, not SF-36.

1. Agreed that mean and SD not appropriate when looking at the general population.

To my mind it is up to those using the scale to demonstrate that these intervals are even to justify their analysis using the mean and std.

But it should be OK for comparing the means of two patient groups, e.g. SMC vs CBT?

2. Not an interval scale
Yes, pretty tough to make an interval scale. I'm pretty sure there is no evidence SF-36 is an interval/ratio scale, and that's true of most questionnaires generally. Every now and then statisticians complain this invalidates some statistical interpretations, but this view never seems to get much traction.

I suspect it is an ordered scale rather than simply giving classes, so it has some utility. Also, even if some patients disagree about some of the ordering, the main thing measured is the within-subject score (pre/post), and patients presumably score themselves consistently there.

Also, with the SF-36, for a modest change most item scores don't change at all, i.e. they remain 'not limited at all' or 'limited a lot' (which includes 'impossible'). Probably only 2 or 3 out of 10 questions change for most people, and in each case they are moving from:
- limited a lot > limited a little, or
- limited a little > not limited
Not sure if this helps make things more consistent or not.

Snow Leopard · Nov 14, 2012

Simon said:
Using the 16th percentile on a healthy working age population gives a threshold of around 80, not the 60 used in PACE.

I'd prefer objective measures of activity and neuropsychiatric functioning for measuring recovery/remission, but if they want to use the SF-36 PF score, anything less than 80 cannot be considered a reasonable indication of remission.

Bob · Nov 14, 2012

Simon said:
I checked at the time, and PACE's use of 'mean - 1SD' does in fact give almost exactly the 16th percentile on the Bowling population data. Though just as you say, the Bowling data used was for ALL adults (>30% over 65) and would include those with illnesses excluded from PACE. Using the 16th percentile on a healthy working-age population gives a threshold of around 80, not the 60 used in PACE.

But if using a well-defined population, my understanding is that the common methodology for a 'normal range' is to use +/-2SD, which cuts off the top and bottom 2.5% of values. So it includes 95% of the population.

Edit: This seems like a sensible definition of a 'normal range' to me, but I'm not sure if the general healthy population can be considered a good example of a 'well-defined' population.
