"Points of significance: Importance of being uncertain" (2-page educational piece on stats)

Dolphin

Senior Member
Messages
17,566
This is going to be a monthly series.

I can't remember where I saw the link that caused me to print this out.

Anyway, it's an educational piece that explains sampling and the Central Limit Theorem. It requires little knowledge of mathematics.

But there are lots of educational pieces that make the same point, so it's not essential to read this one specifically.

I found the first column relatively dense and not really important.

I think I only really fully accepted the Central Limit Theorem when I played around with an online tool that automatically plotted distributions of the sample mean for all sorts of weird populations. It is a little counter-intuitive that when one samples from all sorts of distributions, e.g. heavily skewed ones, the distribution of the sample means tends towards a normal distribution (i.e. a nice bell-shaped curve, especially when the samples drawn are not small). The implications of this are used a lot in statistics.
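The same experiment is easy to run yourself. This is a minimal sketch using only the Python standard library: draw repeated samples from a heavily skewed population (an exponential distribution, chosen here purely as an example) and look at where the sample means land.

```python
import random
import statistics

random.seed(0)
n = 30          # size of each sample
trials = 5000   # number of samples drawn

# Heavily skewed population: exponential with mean 1.
# Record the mean of each sample of size n.
sample_means = [
    statistics.fmean(random.expovariate(1.0) for _ in range(n))
    for _ in range(trials)
]

# Individual draws are strongly skewed, but the sample means pile up
# symmetrically around the population mean (1.0), with spread close to
# the theoretical sigma / sqrt(n) = 1 / sqrt(30), about 0.18.
print(round(statistics.fmean(sample_means), 2))
print(round(statistics.stdev(sample_means), 3))
```

Plotting a histogram of `sample_means` shows the bell shape directly; increasing `n` tightens it, which is exactly the CLT at work.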


NATURE METHODS | THIS MONTH
Points of significance: Importance of being uncertain
• Martin Krzywinski
• & Naomi Altman

Nature Methods 10, 809–810 (2013) doi:10.1038/nmeth.2613

Published online: 29 August 2013

http://www.nature.com/nmeth/journal/v10/n9/full/nmeth.2613.html
 

anciendaze

Senior Member
Messages
1,841
Two points here: 1) the CLT is about sample means, not individual points in a sample; 2) it requires that the sampled data be independent and identically distributed. It is very easy to violate independence when all decisions are made by the same people; using dice for a small part of the process is not the same as preserving independence. The identically-distributed requirement means that mixing different populations may invalidate the CLT, and with the current mess in ME/CFS diagnostic criteria it is practically impossible to meet. There are weaker conditions under which the CLT holds, but the proofs get increasingly hard to follow.

One bizarre consequence of different forms of the CLT is that the sum of a sufficiently large number of independent random variables with different distributions will also approach a Gaussian distribution, provided some conditions are met. At the extremes of identical distributions and many different distributions we can depend on the CLT, but mixing two very different distributions may well invalidate it. This is precisely what seems to have happened in the PACE study, where independence is also questionable. The numerical values reported for confidence are practically meaningless.
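The mixing problem is easy to illustrate. This sketch (the group names and parameters are invented for illustration, not taken from any study) pools two hypothetical sub-populations with clearly different centres, as loose diagnostic criteria might, and shows that the pooled mean describes almost nobody:

```python
import random
import statistics

random.seed(1)

# Two hypothetical sub-populations with different centres, standing in
# for two distinct patient groups swept up by one diagnostic label.
group_a = [random.gauss(10, 1) for _ in range(500)]
group_b = [random.gauss(20, 1) for _ in range(500)]
mixture = group_a + group_b

# The pooled mean sits near 15, in the valley between the two modes,
# where almost no individual observation actually lies.
pooled_mean = statistics.fmean(mixture)
near_mean = sum(1 for x in mixture if abs(x - pooled_mean) < 2)
print(round(pooled_mean, 1))   # roughly 15
print(near_mean)               # very few observations near the pooled mean
```

A histogram of `mixture` is bimodal, so summary statistics built on the assumption of one identically-distributed population (mean, SEM, the usual confidence intervals) can be formally computable yet substantively meaningless.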

Gaussian (normal) distributions work best when you are dealing with elementary particles, where there are fundamental reasons to assume they are truly identical, and the sampling process has to be independent because no one on Earth knows how to control individual particles. Having millions of particles also helps.

Human beings are not nearly as desirable subjects for application of this theorem. At the other extreme, using Gaussian distributions to bet on the stock market is a recipe for disaster, even if advised by Nobel Prize winners.
 

Simon

Senior Member
Messages
3,789
Location
Monmouth, UK
Greetings, fellow stats geeks. Part 2 is out now; don't miss it!

Points of Significance: Error bars : Nature Methods

The article covers the different types of error bar and how to interpret them. The main types I've seen in CFS research are the SEM (standard error of the mean) and the 95% confidence interval, and you interpret them very differently. As the figure from the article shows below, 95% confidence error bars can overlap substantially (sample 1 vs sample 2) and still indicate a p value of <0.05, while SEM error bars need a substantial gap between them to be significant.

btw, 95% confidence intervals are based on roughly SEM × 1.96 (for large samples; the multiplier is a bit bigger for small ones), i.e. they are roughly twice the size of SEM bars

[Figure 3 from the article (nmeth.2659-F3.jpg): comparison of SEM and 95% CI error bars]
 

Simon

Senior Member
Messages
3,789
Location
Monmouth, UK
This series is now open access:
Points of Significance : Statistics for Biologists


I just wanted to add something that came up in the excellent free online stats course I'm doing:
Statistics and R for the Life Sciences | edX

In a lecture on how not to present data, Prof Rafael Irizarry showed this slide:

Those bars on the left often crop up in biomedical ME/CFS papers - see those tiny error bars at the top? It looks like these are really clear-cut differences, but the data on the right is more informative: the overlapping confidence intervals (green vertical lines in the right-hand graph) indicate the differences are not significant:
[Slide (compare-graphs.gif): bar chart with error bars on the left vs the full data with confidence intervals on the right]
 