SNP Data Analysis

Mark · Jan 22, 2011

I am nowhere near up to speed on the subject of GcMAF in general, so please forgive and correct any inaccuracies in what follows...

ETA: Garcia has now done so...

My understanding is: in association with the GcMAF trial of Dr de Meirleir's patients, a genetic analysis was carried out, the main headline of which is that the 20% of patients who are failing to respond to GcMAF all have the same allele on one of the tests.

ETA: What I had misunderstood was the context of the data, which is nothing to do with GcMAF or Dr De Meirleir; instead the spreadsheet is a 'proof of concept' for an ongoing project (as Garcia explains below)...so I've removed the sections of this post that were based on that false premise...the rest of my analysis seems sound though...

When I saw the dataset of these genetic tests, I got really excited at the prospect of mining this raw data for clues - and I still am: that's what this thread is for: analysis of those results. Unfortunately, the work I've done so far shows quite clearly that at least one of the six statistically significant differences found between ME patients and controls is spurious.

Background on genetics

First: I've done a lot of reading around the subject in the last day or two, and there's a little bit of background on the science I'd like to present, to help those (like me) with an incomplete background in biology, to understand some of the basics of the subject...again please do correct any inaccuracies in what follows...

The genetic analysis is a study of Single Nucleotide Polymorphisms (SNPs). SNPs ("snips") are differences of a single element in the genetic code - so for example a subsection of genetic code like "...CGTCAGCG..." might appear in 20% of the population as "...CGTCAACG..." - some people have "A" at that position, others have "G".
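As a small illustration of the idea, here is a minimal Python sketch that finds where the two example snippets above differ. The function name `snp_positions` is just made up for this example, and real SNP calling works on aligned sequencing data rather than short strings like these.

```python
def snp_positions(ref, alt):
    """Return the 0-based positions where two equal-length sequences differ."""
    if len(ref) != len(alt):
        raise ValueError("sequences must be the same length")
    return [i for i, (a, b) in enumerate(zip(ref, alt)) if a != b]


common = "CGTCAGCG"   # the form most people carry in the example above
variant = "CGTCAACG"  # the form carried by the other 20%

positions = snp_positions(common, variant)
for i in positions:
    print(f"position {i}: {common[i]} -> {variant[i]}")
```

Only one position differs between the two strings, which is what makes it a single nucleotide polymorphism.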

Since we have two copies of the code, it's possible to have one copy of each of these variants (which are called alleles). Although it's my rough guess that it may be questionable in many cases to refer to either of these alleles as a "mutation", ancestral genetic analysis does allow one of these alleles to be identified as the "wild type" - the 'typical' form - and the other allele as a "mutation". This distinction is referenced using '+' and '-', and the normal convention is that '+' denotes the 'wild type' and '-' denotes the mutation.

So: since we have two copies of the code, each of us may either have two copies of the wild type, one of each, or two copies of the mutation, denoted as +/+, +/-, or -/- respectively.

There has already been some confusion in the de Meirleir data, in that the positives and negatives are the wrong way round for one of the SNPs studied - and I believe it was Garcia who did some brilliant work in identifying and analysing this issue, and notified the researchers of that detail which caused quite a bit of confusion.

One last bit of preamble: in my reading about SNPs, I read that the average adult has about 1 million SNPs in their DNA. SNPs, then, are not generally to be seen as serious genetic deficiencies...whereas double nucleotide polymorphisms are liable to cause much more significant problems.

Understanding the spreadsheet

Next I'd like to offer a few hints for understanding the spreadsheet of data from the genetic analysis, which I found linked on one of the GcMAF threads. Here's the spreadsheet:

https://spreadsheets.google.com/ccc?key=0Ar76dNWyEQLIdEpJblFOTnU5NFVxRy1LLUFCN0dSOXc&hl=en

Across the top of the spreadsheet, the headers reference the various SNPs studied. The genes in which the SNP occurs are referenced by code names: ACE, CBS, COMT etc...and within each gene, a few known SNPs are studied, one in each column - referenced by the code number for the SNP, such as rs1799752. The next line in the spreadsheet shows, for each SNP, which allele is considered to be the '+' and which is the '-'. So: +T/-C means that the variation with a "T" is considered to be '+' (wild type) and the variation with a "C" is considered to be '-' (mutation).
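To make the header notation concrete, here is a rough Python sketch that turns a header like "+T/-C" plus a person's two alleles into the +/+, +/- or -/- label. The helper names and the idea that a genotype is stored as a two-letter string such as "TC" are my assumptions for illustration, not something taken from the spreadsheet itself, and it only handles single-letter alleles.

```python
def parse_header(header):
    """Split a header like '+T/-C' into {'T': '+', 'C': '-'}."""
    signs = {}
    for part in header.split("/"):
        sign, allele = part[0], part[1:]
        if sign not in ("+", "-"):
            raise ValueError(f"unexpected header part: {part!r}")
        signs[allele] = sign
    return signs


def label_genotype(header, genotype):
    """Translate a two-letter genotype such as 'TC' into '+/+', '+/-' or '-/-'."""
    signs = parse_header(header)
    # Put '+' before '-' so that 'TC' and 'CT' both come out as '+/-'.
    labels = sorted((signs[allele] for allele in genotype), key="+-".index)
    return "/".join(labels)


for genotype in ("TT", "CT", "CC"):
    print(genotype, "->", label_genotype("+T/-C", genotype))
```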

Reading down the spreadsheet, next follows the full results for all 49 patients studied. And then, we get to the statistical analysis...

Three rows show the percentages of each of the 3 types (+/+, +/-, -/-) in CFS patients, then the next 3 rows show the percentages for the controls. After that follow the raw numbers for this data: the total number (n) of individuals studied, and the numbers for each of the 3 types.

Finally, the p-values follow. Roughly speaking, each one is the probability that differences at least as large as those observed between the ME patients and the controls would turn up by random chance alone, if there were no real difference between the two groups. Some of those values are so low that they round to zero in the spreadsheet.

The colour coding of the spreadsheet helps to separate the genes studied, but also indicates the strengths of association observed - the headers and p-values for each column are coloured black for those SNPs where the results are statistically significant (the lowest p-values).

So it's the columns headed in black that we're most interested in...

AHCY-01 variance

I've only studied one of the SNPs so far: the data that jumped out at me as being most interesting, based on the numbers alone...

For the AHCY-01 gene, a p-value of 0.0172 is observed: about a 2% chance that this is just random.

The percentages work out as follows:

Patients: 78.8% +/+ 21.2% +/- 0% -/-
Controls: 54.6% +/+ 33.0% +/- 12.4% -/-

What's very striking is that not one of the patients carries the -/- variation, which is expected in 12.4% of the population. This becomes even more striking when you look at the rest of the AHCY genes, which have less statistical significance, but still show variance with p-values of 0.06 and 0.1 (rounded) - when you look at the numbers for those, again no patients have the -/- form, where rates of about 12% are expected.

This sort of correlation seems really exciting. When I saw it, I speculated that the -/- form is something that ME/CFS patients just don't have...it could be something that defines the population studied, that we don't have this allele...so maybe, this means that this allele protects against CFS?

Sadly, no.

My next step was to google my way to the database references for these polymorphisms - you can google 'rs1799752' and find a data sheet for that SNP. So you can see what proteins are encoded by that section of genetic code, what the gene itself does, what disease associations are known for the SNP variants, etc etc. Note that in this case, the SNPs studied relate to aspects of the methylation process, because that, of course, is what the researchers were exploring.

And my googling led me to a most disappointing explanation for this particular variation in the data...

AHCY-01 variance is explained by racial profile

Finally...the exciting, yet at the same time disappointing discovery that I made last night...

I came across this page, which I'm afraid seems to me to kill this apparent genetic variance of ME patients stone dead:

http://www.ncbi.nlm.nih.gov/SNP/snp_retrieve.cgi?subsnp_id=ss48292451

The significant data is at the bottom of the page, headed "Population Allele Frequency Batch", where one can see that this is clearly the same data set that was used for the control data in the case of AHCY-01.

The first batch under that heading clearly shows that the numbers are the same as those in the spreadsheet, where n=97 (no of chromosomes sampled 194, 2 for each subject), and the percentages match the spreadsheet exactly: 54.6%, 33%, 12.4% - the spreadsheet data comes from this P1 batch, then:

Handle|PopulationID: SNP500CANCER|P1
No. of Chromosomes Sampled: 194

Allele: A=0.711/G=0.289
Genotype: AG=0.33/AA=0.546/GG=0.124
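A quick sanity check on those P1 numbers, as a short Python sketch: the genotype frequencies should turn into whole numbers of people out of 97, and the allele frequencies should follow from the genotype frequencies (each AG person contributes one A and one G). The variable names are mine; the figures are the ones listed above.

```python
genotype_freq = {"AA": 0.546, "AG": 0.33, "GG": 0.124}
chromosomes = 194
individuals = chromosomes // 2  # two chromosomes per person

# Convert the published frequencies back into head counts.
counts = {g: round(f * individuals) for g, f in genotype_freq.items()}

# Allele frequency = homozygote frequency + half the heterozygote frequency.
freq_a = genotype_freq["AA"] + genotype_freq["AG"] / 2
freq_g = genotype_freq["GG"] + genotype_freq["AG"] / 2

print(individuals, counts)
print(round(freq_a, 3), round(freq_g, 3))
```

The rounded allele frequencies agree with the A=0.711/G=0.289 line in the batch.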

The next set of batches listed, it turns out, are subsets of the first batch, grouped by ethnicity: Cauc1 is "caucasian", Afr1 is "african african american", Hisp1 is "hispanic" and Pac1 is "pacific rim":

Handle|PopulationID: SNP500CANCER|CAUC1
No. of Chromosomes Sampled: 58

Allele: A=0.793/G=0.207
Genotype: AG=0.345/AA=0.621/GG=0.034

Handle|PopulationID: SNP500CANCER|AFR1
No. of Chromosomes Sampled: 46

Allele: A=0.391/G=0.609
Genotype: AG=0.522/AA=0.13/GG=0.348

Handle|PopulationID: SNP500CANCER|HISP1
No. of Chromosomes Sampled: 42

Allele: A=0.833/G=0.167
Genotype: AG=0.143/AA=0.762/GG=0.095

Handle|PopulationID: SNP500CANCER|PAC1
No. of Chromosomes Sampled: 48

Allele: A=0.813/G=0.187
Genotype: AG=0.292/AA=0.667/GG=0.041

And so, at last, I come to the point...

In the caucasian batch, the distribution of this AHCY-01 SNP is very similar to the distribution in the ME patients studied. Whereas in the AFR1 group the prevalence of the GG genotype is 34.8%, in the CAUC1 group it's just 3.4%. The spreadsheet is comparing with the overall level of GG - 12.4% - but the racial profile from the more detailed data for that batch tells us: GG is largely an african genotype.

That 3.4% in the CAUC1 group is 3.4% of the 29 caucasians studied: ie. 1/29.
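The same arithmetic can be run for all four subgroups to confirm that they add back up to the overall P1 batch. This is only a sketch using the chromosome counts and GG frequencies listed above; the rounding to whole people is my assumption about how the published frequencies were derived.

```python
# (chromosomes sampled, GG genotype frequency) for each ethnic subgroup
subgroups = {
    "CAUC1": (58, 0.034),
    "AFR1": (46, 0.348),
    "HISP1": (42, 0.095),
    "PAC1": (48, 0.041),
}

people = {name: chrom // 2 for name, (chrom, _) in subgroups.items()}
gg_counts = {
    name: round(freq * people[name]) for name, (_, freq) in subgroups.items()
}

total_people = sum(people.values())
total_gg = sum(gg_counts.values())
pooled_gg = total_gg / total_people

for name in subgroups:
    print(f"{name}: {gg_counts[name]}/{people[name]} with GG")
print(f"pooled: {total_gg}/{total_people} = {pooled_gg:.3f}")
```

Most of the GG individuals in the pooled figure come from the AFR1 subgroup, which is why the pooled 12.4% says little about what to expect in a caucasian sample.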

I think it's pretty obvious from the above data that the true explanation for the variation in the AHCY-01 gene found by this study lies in geographic/ethnic variance.

What does this mean for the study as a whole?

I can't begin to tell you how disappointed I am that what I've found during the course of my investigation casts doubt on the results of the study as a whole, because this is absolutely not what I was hoping to do! But this analysis does highlight a serious flaw in the data in that spreadsheet. The comparisons the spreadsheet makes between the patients and the controls are not comparisons with matched controls, but instead they are comparisons with overall data from the US. Thus it now seems to me that all of the associations found in the study are suspect: perhaps they are all explained by geographical differences, differences of ethnicity, etc.
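To show how much the choice of controls matters, here is a hedged Python sketch of a plain Pearson chi-square test on the genotype counts, run once against the pooled P1 controls and once against the CAUC1 subset. Big caveats: I don't know which test the spreadsheet actually used, and the patient counts (26, 7, 0 out of 33) are purely hypothetical, picked only because they reproduce the 78.8% / 21.2% / 0% quoted above. The control counts are the dbSNP frequencies multiplied out to whole people.

```python
import math


def chi_square_2x3(row_a, row_b):
    """Pearson chi-square test on a 2x3 table; returns (statistic, p_value).

    A 2x3 table has 2 degrees of freedom, and for 2 degrees of freedom the
    chi-square tail probability is exactly exp(-x / 2), so no statistics
    library is needed. Assumes no column total is zero.
    """
    if len(row_a) != 3 or len(row_b) != 3:
        raise ValueError("both rows must have exactly three counts")
    col_totals = [a + b for a, b in zip(row_a, row_b)]
    total = sum(col_totals)
    stat = 0.0
    for row in (row_a, row_b):
        for observed, col_total in zip(row, col_totals):
            expected = sum(row) * col_total / total
            stat += (observed - expected) ** 2 / expected
    return stat, math.exp(-stat / 2)


# Counts are ordered (+/+, +/-, -/-), i.e. (AA, AG, GG).
patients = (26, 7, 0)          # HYPOTHETICAL: matches the quoted percentages
controls_pooled = (53, 32, 12) # P1 batch, n=97
controls_cauc = (18, 10, 1)    # CAUC1 subset, n=29

_, p_pooled = chi_square_2x3(patients, controls_pooled)
_, p_cauc = chi_square_2x3(patients, controls_cauc)

print(f"vs pooled controls: p = {p_pooled:.4f}")
print(f"vs CAUC1 controls:  p = {p_cauc:.4f}")
```

With these assumed counts the difference looks significant against the pooled controls but not against the caucasian subset. Some of the expected counts are small, so the chi-square approximation is rough here and an exact test would be a better choice for a real analysis.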

So the top priority seems to me to be obtaining better matched control data...and I understand that those involved with this work are working on that issue...

ETA: Deleted the rest of this post, since it was based on the false premise that this was related to the GcMAF trial.

garcia · Jan 22, 2011

Mark said:
Over to you all now to do the same for me: corrections, criticisms, better explanations of all the above will be warmly welcomed...I'd dearly love to be wrong here...

Mark, I can help you out!

You've done some great work there. But rest assured that this dataset has nothing to do with either GcMAF or Kenny De Meirleir. Instead it is methylation genetic data collected by patients (mostly from their Yasko genetic tests), then added to by Mindy Kitei and W. L. Karns. The "control data" were added by Kitei & Karns, and it is they who performed the statistical analysis.

You have shown that the AHCY-01 gene is not in fact different between ME/CFS patients and controls, which is useful to know.

Mark · Jan 22, 2011

Oh that's fantastic news Garcia, thanks very much for that!

I was unclear where the spreadsheet had come from: that was part of why I posted that disclaimer right at the beginning. It's really great to hear that it has nothing to do with De Meirleir and GcMAF per se.

The connection I made in my mind, I think, came from the observation that the 20% who were not responding to GcMAF had a particular genetic link. Can you clarify what that 20% observation was about, and how it relates to this data? And whether KDM's dataset that gave rise to that 20% observation is public domain?...

Given what you've said, and since this genetic data is a work in progress rooted in the patient community - which is great to see - I feel much more positive about spending some more time looking into the rest of those SNPs and finding out what else I can uncover...I can see loads of interesting lines of inquiry...

Can you fill me in on what other work is being done on all this, please? Is there a definitive place where people are doing this analysis, and if not, that's the overall idea of this thread, to be that place (within PR)....hence the thread title, which I guess needs the "GcMAF" removing...can you suggest a better description and post some relevant links, or point me at any PR threads that are relevant?

Thanks.

Mark · Jan 22, 2011

Also, the implication of all this is that what we need to have in order to make good use of this data is a relevant control group, and thus the whole thing needs re-doing with different control data.

As a first stab, it looks like any caucasian subsets that are available would be better than what we have currently, but if the patient population could be even better defined (I'd like to see gender as a column in the spreadsheet for a start) then maybe we could get an even better set of control data. Maybe a western european dataset would be a good idea, if we can find any? And the strong variation in that particular SNP by ethnicity suggests we would do well to confirm the ethnic profile of the patients as well.

cigana · Jan 25, 2011

Thanks for this Mark - I particularly benefitted from the preamble! Presumably the 0% can then just be explained as a statistical inability to resolve the real 3.4% value? (Assuming the subjects were all caucasian).

Mark · Jan 25, 2011

Glad the preamble was helpful, cigana.

I'm not entirely sure I follow your question, but my reading of it is that - assuming the patients were all caucasian, maybe many of them european - the 0% in 47 patients suggests to me that the 3.4% in the control data should really have been 0% if the control samples had been correctly typed by ethnicity...it suggests that the 1/29 caucasians with GG wasn't really caucasian (or at least, had some african ancestry).

The p-value of 0 indicates a statistically strong variance from the control group, which can be explained by the fact that the control group was of the general US population, whereas the patient group was - it would seem - caucasian. That's an assumption/prediction based on the observation that the caucasian figures closely resemble those of the patient group. But I don't know that the patients were all caucasian...that's just what the numbers suggest.

The 0% means 0/47 patients had the GG genotype. Whereas the 3.4% means that, in the US control data, 1/29 of the caucasian controls had it. One would have expected therefore that 1 or 2 of the 47 patients would have had it...but the variance is nowhere near enough to even suggest an association with anything.
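Putting rough numbers on that, as a small Python sketch: if the true GG rate among caucasians really were 1 in 29, and the 47 patients were an independent sample from that population (both assumptions, not known facts), this works out the expected number of GG patients and the chance of seeing none at all.

```python
patients = 47
gg_rate = 1 / 29  # the single GG individual among 29 caucasian controls

# Expected number of GG patients under that rate.
expected_gg = patients * gg_rate

# Probability that none of the 47 has GG (binomial with zero successes).
p_none = (1 - gg_rate) ** patients

print(f"expected GG patients: {expected_gg:.2f}")
print(f"chance of seeing zero: {p_none:.3f}")
```

A result of zero out of 47 would turn up roughly one time in five under those assumptions, so it is not surprising on its own.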

If anything, I think if you knew more about the patient data, you could say that these observations add to the data that's in the control study (a cancer study from 2005) and suggest that the one caucasian patient out of 29 with GG was an anomaly. If the 47 patients were all caucasian, or all european, then the CFS patient data is adding to the existing data set and suggesting that this particular polymorphism is not a white/european/caucasian one - and maybe you might find that the 1/29 who had it in the control data, might turn out to have some mixed race ancestry as an explanation for that...or if the patient data is european, maybe it's a polymorphism that is even rarer in europe than in the US amongst caucasians.

If you look at all the rest of the ethnically-typed data, it looks clear that the G is of african origin: it's 35% in the african group, 10% in hispanic, 4% in pacific rim. Even with those last two...adding all that with the CFS patient data, what I'm seeing is: that SNP is an african polymorphism, and the other 2 or 3 samples that contradict that rule, well if you looked at their ancestry my guess is they would have african ancestry.

*GG* · Jan 25, 2011

I know very little about GcMAF, so what does SNP stand for?

GG

August59 · Jan 25, 2011

Single Nucleotide Polymorphism = SNP

cigana · Jan 26, 2011

I think I see, cheers.

*GG* · Jan 26, 2011

August59 said:
Single Nucleotide Polymorphism = SNP

Thanks for the info!

August59 · Jan 26, 2011

ggingues said:
Thanks for the info!

Oh. No problem as I was curious too. Googled it and surprisingly it took quite a few pages before one actually gave the whole name.
