http://www.forward-me.org.uk/Reports/White%20to%20Lancet%20re%20Hooper%20complaint%20(2).pdf
**PACE - Response to the complaint to The Lancet of March 2011**
Extract from White's letter:

"The threshold SF36 score given in the protocol for recovery (85) was an estimated mean (without a standard deviation) derived from several population studies. We are planning to publish a paper comparing proportions meeting various criteria for

**recovery or remission**, so more results pertinent to this concern will be available in the future."

I wonder why the mean score for the general population is considered a 'recovery'?

I would have thought that the median score (95 for SF-36 PF) would be more appropriate.

I wonder if median scores have been used in research to indicate a recovery?

Or, is any particular percentile of SF-36 physical function scores, for the general population, commonly considered a 'recovery'?

I have a problem with the SF-36 PF scale in that I don't think it is valid to use the mean and standard deviation for it. I've been trying to formulate a couple of different arguments; they are complicated and not yet well formed, but I feel a discussion of the properties of the scales used as primary outcomes is important. The basic arguments I'm making also need to be better researched in terms of what is required from a scale and which statistics are valid for it.

I'm thinking that, since the improvements were small, the measurement scales and the definitions of improvement or remission need to be accurate to justify the economics of the treatments they propose. Hence casting doubt on dubious measurement techniques becomes important.

1)

To me it seems obvious that SF-36 measurements for the general population will give a multimodal distribution, where there is a distribution for healthy people along with separate distributions for different groups of sick and disabled people. The population distribution is then a mixture (a weighted combination) of these subgroup distributions. Statisticians will often refer to the central limit theorem and say that the sum of many independent random variables is normally distributed. However, that theorem applies to sums of independent variables, and these subgroups are not independent: there is a set of healthy people, and each group of sick people is a particular deviation from that healthy group, so there is a relationship between them. Thus we are left with a multimodal distribution, and it makes no sense to summarise it with a mean and standard deviation.
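A quick simulation sketches this point. All subgroup sizes, means and spreads below are invented for illustration; they are not real SF-36 data. The idea is only that, in a mixture of a large healthy subgroup and smaller ill subgroups, the population mean can land in a region where hardly anyone actually scores.

```python
import random
import statistics

random.seed(1)

# Hypothetical mixture: a large healthy subgroup plus smaller ill and
# disabled subgroups (all numbers invented), scores clipped to 0-100.
healthy  = [min(100, max(0, round(random.gauss(95, 5))))  for _ in range(800)]
ill      = [min(100, max(0, round(random.gauss(40, 15)))) for _ in range(150)]
disabled = [min(100, max(0, round(random.gauss(15, 10)))) for _ in range(50)]
population = healthy + ill + disabled

mean = statistics.mean(population)
median = statistics.median(population)

# The mean is dragged well below the healthy mode into a sparsely
# populated region; the median stays near the healthy mode.
print(f"mean = {mean:.1f}, median = {median:.1f}")
print("people within 2 points of the mean:",
      sum(abs(s - mean) <= 2 for s in population))
```

Running this shows the mean sitting noticeably below the median, which is the behaviour one would expect if the real SF-36 population distribution is a mixture of this kind.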

2)

The scale has 10 questions, each of which asks whether a physical task can be done easily, with some difficulty, or with a lot of difficulty. The aim is to measure physical function as a single variable, so it rests on the theory that physical function is a single measurable thing.

Let's define a person as having a physical function of x, where x is a member of X (the set of possible physical functions). Let's go a little further and define X as a continuous interval between 0 and 1 (dead and completely able). What we are really interested in is how different values of x map onto different scores on the SF-36 PF scale. If this mapping is linear (that is, a change y in x1 leads to the same change in the scale as a change y in x2), then the scale is an interval scale and the mean and standard deviation can be used. Otherwise we have a situation where the scale function is merely monotonic and only the median or percentiles can be used. We could even have a situation where the scale is a non-monotonic function, in which case it is at best a nominal scale and it only makes sense to talk about how many people fall into each class.

http://www.mpopa.ro/statistica_licenta/Stevens_Measurement.pdf
Now, suppose we break up the questions so that rather than having a single question q1 we have Q1e, Q1s and Q1d, representing the three sets of people: those who find q1 easy, those who do it with some difficulty, and those who do it with a lot of difficulty.

We can define person i as a member of Q1e when x > q1e, and similarly for q1s and q1d, giving us three thresholds defined on the underlying variable of physical function.

If we do this for all questions we have a set of thirty thresholds, which can then be placed onto our interval X (the line from 0 to 1). We do this assuming that there is a well-defined ordering of how hard the different activities are. We can express this as a set of thresholds t1 to t30, where each maps to one of the question thresholds and ti > ti+1.

A person's score is then determined by which of these intervals they fall into.

The linearity of the scale depends on these thirty thresholds being evenly distributed across X. If they are clumped or unevenly distributed, the SF-36 scale is not an interval scale and the mean and standard deviation should not be used. To my mind it is up to those using the scale to demonstrate that these intervals are even in order to justify an analysis using the mean and standard deviation.
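The threshold picture above can be sketched in a few lines. The threshold positions here are invented; the point is only the contrast between evenly spread and clumped thresholds, with the score taken as the number of activity thresholds a person's latent function clears.

```python
def score(x, thresholds):
    # Count how many activity thresholds the person's latent function clears.
    return sum(x > t for t in thresholds)

even    = [i / 21 for i in range(1, 21)]          # spread across (0, 1)
clumped = [0.70 + 0.01 * i for i in range(20)]    # bunched near the top

# Equal steps in latent function x of 0.1 each:
xs = [i / 10 for i in range(11)]
print("x:      ", xs)
print("even:   ", [score(x, even) for x in xs])
print("clumped:", [score(x, clumped) for x in xs])
```

With even spacing the score rises by the same amount for each step in x, so it behaves like an interval scale; with clumped thresholds the score is flat for most of the range and then jumps, so equal score differences no longer mean equal differences in underlying function.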

There could be some error here, in that different people may consider the thresholds to have different orderings. There are two cases. Some of it is just noise, where each person perceives the ordering slightly differently: if the points are close and the ordering is debatable, this is not important to the argument, since it still suggests that the scale is non-linear. But if people disagree strongly over the orderings and would place big differences between the thresholds, then I would argue that physical function isn't a single concept to measure, and a different analysis would apply, since the scale would then span multiple variables.