
The Hawthorne Effect... & Overestimation of Treatment Effectiveness (2010)

Discussion in 'General ME/CFS News' started by oceanblue, Mar 6, 2012.

  1. oceanblue

    oceanblue Guest

    Evidence that self-report measures often used in clinical trials, including CBT for CFS trials, can overstate the real effectiveness of treatment.

    The Hawthorne effect, sponsored trials, and the overestimation of treatment effectiveness, Wolfe 2010

    This study, on a Rheumatoid Arthritis treatment, provides evidence that patients report better scores on questionnaires while they are in a clinical trial than when they are being treated by their own doctor, leading to an overstatement of the effectiveness of the treatment being trialed.

    Abstract

    Objective. To determine if the results of rheumatoid arthritis (RA) clinical trials are upwardly biased by the Hawthorne effect.

    Methods. We studied 264 patients with RA who completed a commercially sponsored 3-month, open-label, phase 4 trial of a US Food and Drug Administration approved RA treatment. We evaluated changes in the Health Assessment Questionnaire disability index (HAQ) and visual analog scales for pain, patient global, and fatigue during 3 periods: pretreatment in the trial, on treatment at the close of the trial, and by a trial-unrelated survey 8 months after the close of the trial, but while the patients were receiving the same treatment.

    Results. The HAQ score (0-3) improved by 41.3% during the trial, but only by 16.5% when the endpoint was the post-trial result. Similar results for the other variables were patient global (0-10) 51.9% and 34.6%, pain (0-10) 51.7% and 39.7%, fatigue (0-10) 45.6% and 24.6%. Worsening between the trial end and the first survey assessment was HAQ 0.29 units, pain 0.8 units, patient global 0.8 units, and fatigue 1.1 units.

    Conclusion. Almost half the improvement noted in the clinical trial HAQ score disappeared on entry to a non-sponsored followup study, and from 23% to 44% of improvements in pain, patient global, and fatigue also disappeared. These changes can be attributed to the Hawthorne effect. Based on these data, we hypothesize that the absolute values of RA outcome variables in clinical trials are upwardly biased, and that the treatment effect is less than observed.
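
    To make the abstract's percentages concrete, here is a quick sketch (Python, using only the figures quoted above) of the share of reported improvement that disappeared between trial end and the post-trial survey. This roughly reproduces the 23%-44% range in the conclusion (the paper's exact figures presumably differ through rounding), while the HAQ figure comes out nearer 60% than "almost half", presumably because the authors calculated the HAQ loss from raw units (the 0.29-unit worsening) rather than from these rounded percentages.

    # Improvement at trial end vs. at the post-trial survey
    # (% change from pre-treatment), taken from the abstract above.
    results = {
        "HAQ":            (41.3, 16.5),
        "patient global": (51.9, 34.6),
        "pain":           (51.7, 39.7),
        "fatigue":        (45.6, 24.6),
    }

    for measure, (in_trial, post_trial) in results.items():
        lost = (in_trial - post_trial) / in_trial * 100
        print(f"{measure}: {lost:.0f}% of the reported improvement disappeared")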
     
  2. oceanblue

    oceanblue Guest

    The Hawthorne Effect is named after a famous series of studies in the 1920s looking at worker productivity at the Hawthorne Works industrial plant. The Hawthorne effect occurs where subjects improve or modify an aspect of their behavior being experimentally measured simply in response to the fact that they know they are being studied, not in response to any particular experimental manipulation.

    From the paper (all emphasis mine):
    The putative 'type B' Hawthorne Effect, where patient-reported improvements aren't real, is highly relevant to a lot of CBT clinical trials, e.g. PACE. This paper studies the type B effect in an RA drug trial. I'll try to explain the study design, which is a little complex:

    1. The Sponsored Clinical Trial
    Stage 1 involved around 2,000 patients in an FDA-required, manufacturer-run trial of a drug* for Rheumatoid Arthritis. Patients were evaluated pre-treatment and at the conclusion of the trial using a number of measures including the Health Assessment Questionnaire Disability Index (HAQ), as well as scales for pain, fatigue and overall improvement.

    The study found large improvements (effect sizes of around 1.0) across all measures at the end of the trial. Success!

    *the drug was unnamed at the request of the manufacturer, and as a condition of its permission for access to patients in this study
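
    For context on "effect sizes of around 1.0": an effect size (Cohen's d) of 1.0 means the average score shifted by a full standard deviation, conventionally a large effect. A minimal sketch with made-up HAQ-style numbers (nothing here is from the paper, and the paper doesn't say exactly which formula it used):

    import statistics

    def cohens_d(before, after):
        # Standardised mean difference: change in means divided by the
        # pooled standard deviation (one common convention).
        mean_diff = statistics.mean(before) - statistics.mean(after)
        pooled_sd = ((statistics.variance(before) + statistics.variance(after)) / 2) ** 0.5
        return mean_diff / pooled_sd

    # Hypothetical HAQ scores (0-3 scale, lower = less disability)
    pre  = [1.8, 1.5, 2.1, 1.2, 1.9, 1.6]
    post = [1.4, 1.3, 1.8, 0.8, 1.6, 1.2]
    print(f"d = {cohens_d(pre, post):.2f}")  # ~1.0, i.e. a 'large' effect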

    2. The sneaky follow-up
    The researchers in this study, which was independent of the manufacturer, thought that patients might be inflating questionnaire scores just because of the extra attention they received in the trial. So they recruited patients at the end of the trial into the National Data Bank for Rheumatic Diseases (RDB), a large ongoing study of patients in normal clinical practice under the care of their usual physician. Crucially, the patients continued the same treatment they'd been taking in the formal clinical trial. The RDB routinely collects data from patients on a number of outcomes, including HAQ, pain, fatigue and overall status. This allowed the researchers to compare the outcomes from the clinical trial with the outcomes for the same patients on the same medication but in a normal, non-trial environment.

    The Findings
    All of the measures (HAQ disability, fatigue, pain, global health status) still showed improvement over pre-treatment levels in the follow-up, but the improvement was only about half of that recorded at the end of the formal clinical trial. The researchers hypothesised that the difference can be attributed to the Hawthorne Effect, i.e. that the clinical trial reports of improvement were overstated.


    More to follow when I have more energy
     
  3. Sean

    Sean Senior Member

    Which just emphasises the importance of using genuinely objective outcome measures.
     
  4. floydguy

    floydguy Senior Member

    Yes, questionnaires suck and shouldn't be used. There seems to be a false belief that if one tallies up the numbers and does statistical analysis this miraculously becomes objective research.
     
  5. charityfundraiser

    charityfundraiser Senior Member

    There are two things here: the data results, and the explanation, which is actually just a hypothesis that the study wasn't even set up in a way to be able to test. The only thing they can really say is that the reported improvement was higher in the first study and lower in the second. They haven't shown why this is - whether it is due, as they claim, to sponsored clinical trial vs. non-trial survey, or to any other reason one could come up with to "explain" it, such as first study vs. second study. How do they know it is due to the type of trial rather than time? They didn't do the study in a way that could distinguish that.

    They could have compared two groups side by side, one in a sponsored clinical trial and one in a non-trial survey. They could have taken some non-self-report tests that indicate level of improvement, to see if the change in the self-report variables did or did not match the non-self-report variables.

    They didn't show whether self-report variables differed from non-self-report variables. Even if they had shown that, they didn't show whether the difference was due to "being watched", time, natural adjustment of perspective to an improvement, drug effect wearing off the longer one takes it, or anything else. (As far as the abstract says. I haven't seen the full paper.)

    If you have been very unwell, even bedridden like CFS patients, and something makes you improve, initially you might feel like oh my gosh, I can take a shower, walk around, and surf the Web and call that a big improvement because it is compared to where you were before. After a couple months, you get used to the improvement and might realize, well yes that was an improvement but nowhere near normal.

    From reading various forums, this effect seems to exist in Dr. Montoya's Valcyte pilot study as well. 90% of patients initially reported 90% improvement. News article profiles of a participant and anecdotes on other boards suggest that after some time, the self-reported improvement was actually lower.
     
    oceanblue likes this.
  6. Enid

    Enid Senior Member

    Never doubted the Hawthorne effect - I've always responded to Docs/Consultants with "things do seem a bit better" for many reasons - e.g. they are trying to aid, and thinking more positively may improve the situation. Of course nothing done stopped the course of the illness. And I've known times of saying "yep, OK" just to get any more useless questioning out of one's hair and escape.

    Fact is they do not believe in illness - full stop.
     
  7. oceanblue

    oceanblue Guest

    Background
    Evidence before this study:
    Note that in the final example both groups remained in the trial; the only difference was the intensity of the follow-up.

    Discussion
    This study found worse results in patients in the post-trial follow-up than had been recorded in the same patients at the end of the trial, even though they were on the same medication and the trial readings were taken in the final open-label (unblinded) phase. This adds weight to the evidence for self-report bias from questionnaires, but is by no means definitive (as is true of most research in CFS). I thought it was worth summarising the strengths and weaknesses of the study:

    Strengths & Weaknesses:
    This study compared the same patients on the same medication but no longer in a formal clinical trial.

    It's possible that the lower effect found after the trial is due to the effect of the drug wearing off within a year. The authors comment "The alternative interpretation that true improvement occurred in the trial but is lost at the end of the trial seems untenable", and perhaps more persuasively note that "Sponsored extension studies of RA trials usually show maintenance of improvement". So drug effects wearing off is possible but seems unlikely.

    Also, the observation of lower effects outside clinical trials ties in with their earlier evidence from the National Database that drug effects in normal clinical practice are less than those in trials. They then took a new sample from the National Database and found the outcomes in this trial were very similar to those for patients already in the database:
    One significant problem with this study is that the patients from the original trial who volunteered to join the National Database (and so were measured in the 'sneaky follow-up') were not fully representative of all patients in the original trial:
    It's also possible that regression to the mean could play a role, as only the 'best' responders were followed (see the sketch below).
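
    A toy simulation of that last point (purely illustrative; nothing here is from the paper): if each questionnaire score is true state plus random noise, the patients who look like the 'best' responders at trial end are disproportionately those whose noise happened to be favourable, so on re-measurement their average falls back even if nothing real has changed.

    import random

    random.seed(1)
    N = 2000

    # Each patient has a fixed 'true' improvement (arbitrary units)...
    true_change = [random.gauss(1.0, 0.5) for _ in range(N)]

    # ...and every measurement adds independent noise on top of it.
    def measure():
        return [t + random.gauss(0, 0.5) for t in true_change]

    at_trial_end = measure()
    at_follow_up = measure()  # true state unchanged, fresh noise

    # Follow up only the apparent 'best' responders (top quarter at trial end)
    best = sorted(range(N), key=lambda i: at_trial_end[i], reverse=True)[:N // 4]
    print(sum(at_trial_end[i] for i in best) / len(best))   # high
    print(sum(at_follow_up[i] for i in best) / len(best))   # lower, purely from regression to the mean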

    So, not a perfect study, but I think that, taken with all the other studies discussed here, this does provide further evidence that subjective self-reports in clinical trials may overstate the benefits of treatment. Equally, I'm not aware of any good evidence that subjective self-report questionnaires are not subject to bias in clinical trials.
     
  8. oceanblue

    oceanblue Guest

    Hi Charityfundraiser

    You make some excellent points. Unfortunately I ran out of energy to complete my posts on the study before you replied, and I hope some of the new information I posted there, including other relevant studies discussed by the authors, helps put their findings in context.

    Certainly this study does not give a definitive answer. You've suggested some better ways to run a replication study (the authors' hands were tied to some extent by the manufacturer running the original trial, which placed a lot of limitations on how the study could operate). Similarly, comparing self-report to objective measures in trial and non-trial environments is a great idea.

    However, as I mentioned in my previous post, I'm not aware of any good evidence that self-reports are reliable measures of pre/post change in clinical trials, even though the possibility of self-report bias is often acknowledged. In fact, the only evidence I do know of is the often-cited Wiborg CFS/CBT study, which found that improvements in self-rated physical function were not matched by any improvement in objective actometer ratings.

    I very much hope that some researchers will try to get better data on this, not just in the field of CFS research but across all clinical trials. I'm always amazed at how often researchers rely on self-reported improvement data without good evidence that it accurately reflects reality.
     
    Enid likes this.
  9. oceanblue

    oceanblue Guest

    I certainly agree with that - where it is feasible. As yet there are no objective measures of either fatigue or pain, so self-report is the only option. Nonetheless, if it can be convincingly demonstrated that self-report measures which are open to objective verification lead to overestimation of therapeutic effects, the same is likely to be true for purely subjective symptoms like fatigue.


    Comparison with CFS/CBT studies
    Like CFS/CBT studies, the Wolfe RA study is open label, i.e. participants know if they are receiving the 'active' therapy. However, the Hawthorne Effect is based on the extra attention received by patients in a trial, and in the case of face-to-face therapy that attention is far greater than in a drug trial. For instance, in PACE, participants had 15 hours of contact with CBT, GET and Pacing therapists, and the therapeutic relationship that developed was independently rated as very strong. So any effect due to attention could be significantly stronger in such therapy trials than in drug trials.
     
  10. Snow Leopard

    Snow Leopard Hibernating

    There was a CBT paper which compared the results of clinical trials to those of regular practice and found the effect size in clinical trials was much larger. I can't remember it off the top of my head though.
     
  11. charityfundraiser

    charityfundraiser Senior Member

    Thanks for including those quotes from the paper. The study about ginkgo biloba and level of follow-up is interesting. With the others, one other thing that comes to mind with commercially sponsored clinical trials is that the published ones generally have more positive results than non-commercially sponsored research, due to the "file-drawer" problem.

    Personally, I don't like the self-report questionnaires because all of the questions are relative. I'd rather have a list of activities and check off which ones I can do. I've been told by the doctor/researcher whom I asked that this is even more difficult than the vague relative scales but I don't understand why.

    The actometer studies would be more interesting if they hadn't used homegrown software that hasn't been validated. I did some research in the actometer thread. If the research sucks and the actometer sucks, well, what can you make of it...
     
  12. oceanblue

    oceanblue Guest

    Thanks for that, and you're right re publication bias too.

    The SF-36 scale does measure what activities you can do (HAQ too, I think). Try it yourself. NB there are likely to be only a couple of questions for each person where the score is in doubt, e.g. 'limited a little' vs 'not limited at all', but that alone could give a 5-10 point difference in scores (out of 100) - PACE, for instance, only found an 8-point difference. See the scoring sketch below.
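
    To illustrate, a sketch of SF-36 physical function scoring as I understand it: 10 activity items, each scored 1-3, with the total rescaled to 0-100, so each one-step change on a single item moves the final score by 5 points. The responses below are hypothetical.

    # SF-36 physical functioning: 10 activity items, each answered
    # 1 = limited a lot, 2 = limited a little, 3 = not limited at all.
    def sf36_pf(responses):
        assert len(responses) == 10 and all(r in (1, 2, 3) for r in responses)
        raw = sum(responses)               # raw range 10-30
        return (raw - 10) / 20 * 100       # rescaled to 0-100

    answers = [2, 2, 1, 2, 3, 2, 2, 1, 2, 2]   # hypothetical patient
    print(sf36_pf(answers))                    # 45.0

    # Shifting just two borderline items ('limited a little' ->
    # 'not limited at all') moves the score by a full 10 points:
    answers[0], answers[1] = 3, 3
    print(sf36_pf(answers))                    # 55.0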

    Actometers are imperfect devices, whether home-grown or not. However, they are not subject to self-report bias, so where self-report measures show a change and actometers don't, it's still a cause for concern - though not definitive proof that the treatment didn't work.

    re:
    The outcomes don't report recovery but improvement, and that improvement is measured on an absolute scale, e.g. 0-100, not relative to how they felt last time, so a change in perspective like you suggest shouldn't apply. If it did, then all sponsored extension clinical trials would report a diminishing effect, whereas most of them (according to the authors of this study) show the initial gains being maintained.
     
  13. oceanblue

    oceanblue Guest

    I remember the same paper just as vaguely! BACME data published to date is similarly disappointing. CBT proponents have argued that the difference is that outside trials CBT therapists 'don't do it right'. That can't apply to this Wolfe study, as it used a drug, not a face-to-face therapy.
     
  14. Enid

    Enid Senior Member

    Not a scientist, oceanblue, but that "don't do it right" defence sounds questionable - what, not persuasive enough?
     
  15. Snow Leopard

    Snow Leopard Hibernating

    They can imagine up all the reasons they want, but the effect is real.

    See also:

    Cognitive-behaviour therapy for chronic fatigue syndrome: comparison of outcomes within and outside the confines of a randomised controlled trial.
    Quarmby L, Rimes KA, Deale A, Wessely S, Chalder T.
    http://www.ncbi.nlm.nih.gov/pubmed/17074300
     
    oceanblue likes this.
  16. Sean

    Sean Senior Member

    A minimum of 50% of primary outcome measures should be objective, in all clinical trials.

    Otherwise we will spend the rest of our days in endless frustrating arguments about the 'meaning' of subjective terms, and the psychosocial crowd will win that game hands down. Just like they have done for the last quarter century.
     
  17. oceanblue

    oceanblue Guest

    Makes sense then to have a fatigue questionnaire (maybe one of Lenny Jason's, properly validated first) and an objective measure for physical function. But I wouldn't drop the SF-36 altogether until there is good evidence that actometers (or whatever else is chosen) do a better job of measuring genuine change in physical function in clinical trials.
     
  18. oceanblue

    oceanblue Guest

    Good recall and now read - thanks.

    I had hoped this study would be useful for looking at the Hawthorne Effect, as it looks at CBT for CFS in the same clinic both within a clinical trial and in normal clinical practice. Unfortunately the two groups are too different for direct detailed comparison to be meaningful. The clinical trial was carried out in 1993-94 by one therapist, who was also the main researcher (Alicia Deale), using a written manual for the CBT, and on only 30 patients. The clinical data comes from 227 patients, with significantly different characteristics from the trial patients, treated between 1995 and 2000 by a number of different therapists not using the formal manual, and with much lower follow-up rates (and those not completing follow-up had different characteristics from those who did).

    However, it seems reasonable to use the data on this large sample of patients in normal clinical practice to gauge the effectiveness of CBT in normal use. The two outcomes reported are fatigue (Chalder Fatigue Scale) and the Work & Social Adjustment Scale (WSAS), which gives some measure of function.

    Results for CBT for CFS in normal clinical practice
    The paper only gives outcomes on a graph so data is estimated from that.

    WSAS
    Baseline 5.7; six months post-treatment 4.0 (change 1.7). NB scores above 2.0 are associated with significant functional impairment.

    Chalder Fatigue Score, bimodal scoring: 0 best, 11 worst (see the scoring sketch below)
    Baseline: 9.0; six months 5.8 (change 3.2). General population (including those with health problems): approx 3.3.

    These figures are significantly worse than those achieved in the clinical trial (6 months post-treatment WSAS = 3.3; Chalder = 4.0).
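
    For reference, a sketch of the bimodal Chalder scoring used above (11 items, each with four response options collapsed to symptom absent/present; this is my understanding of the standard method, and the responses below are hypothetical):

    # Chalder Fatigue Scale, 11 items, four response options each:
    # 0 = less than usual, 1 = no more than usual,
    # 2 = more than usual, 3 = much more than usual.
    def chalder_bimodal(responses):
        assert len(responses) == 11 and all(r in (0, 1, 2, 3) for r in responses)
        # Bimodal scoring collapses each item to 0/1: symptom present or not.
        return sum(1 for r in responses if r >= 2)   # 0 (best) to 11 (worst)

    patient = [3, 2, 2, 3, 1, 2, 2, 3, 2, 1, 2]      # hypothetical responses
    print(chalder_bimodal(patient))                  # 9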
     
  19. Snow Leopard

    Snow Leopard Hibernating

    Enid likes this.
  20. Enid

    Enid Senior Member

    No it doesn't :victory: (oh just from personal experience)
     
