ME/CFS Variant Spreadsheet with Frequencies, CADD scores, Variant Effect Prediction, etc (Citizen Science)

kday · Sep 27, 2019

I'm wondering if anyone is willing to contribute to citizen science.

I created a spreadsheet of exome variants from 25 ME/CFS 23andMe genomes hosted on OpenSNP. It includes relevance rankings, rs numbers, ref/alt alleles, ME/CFS frequencies, gnomAD population frequencies, mutation type, variant effect predictions (SIFT, PolyPhen), and CADD scores like those used in the Klimas study.

▶ View the Google Spreadsheet
(Anyone with a link can edit, so feel free to create more columns as needed and sort the data in the way you wish, but please don't delete the core data)

However, there are a lot of miscalls on 23andMe, so perhaps adding a Miscall column with a Y (yes) or N (no) would be helpful. There is a link to each variant in OpenSNP.

Things I've noticed:

23andMe doesn't appear to have included any IDO2 SNPs that SIFT and PolyPhen-2 annotate as deleterious. However, among the IDO2 SNPs that are included, those with higher CADD scores show an increased frequency. To properly assess IDO2, you need a whole genome sequence or whole exome sequence.

Also, one thing I noticed when looking at the table is that common TRPM3 variants seem to rank very high relative to their global population frequency. This gene is thought to be implicated in decreased NK cell function. Like IDO2, common TRPM3 variants seem nearly universal in the ME/CFS genomes. When I saw this I thought I might have discovered something. Nope. Both gene expression and genotypes have already been studied in ME/CFS. And I think it's prudent that researchers continue to study TRPM3 so we can find methods to get those NK cells functioning again!

https://www.me-pedia.org/wiki/Transient_Receptor_Potential_Melastatin_3

This TRPM3 SNP, for example, is predicted to be deleterious/damaging in computer models. It's a common SNP, but the vast majority of the ME/CFS genomes do not match reference, even though the ALT allele frequency is only expected to be ~57% in non-Finnish Europeans!

https://gnomad.broadinstitute.org/variant/9-73150984-C-T

One more fun fact: I can't seem to find anyone with ME/CFS who doesn't have rs16891982. My search for anyone who matches reference included several genomes that are not on this Google Spreadsheet. Yes, this SNP indicates light-skinned European ancestry, and maybe people of other ethnicities are less likely to be diagnosed with ME/CFS. But it shows either that European ancestry is nearly universal in ME/CFS, or that there is a huge cultural bias in diagnosis. That said, this SNP is expected to be at 95-98% in Europeans and 91% in Ashkenazi Jews, so if this is a predominantly European syndrome, maybe I just haven't seen a reference genotype yet by chance.

https://www.snpedia.com/index.php/Rs16891982
https://gnomad.broadinstitute.org/variant/5-33951693-C-G
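
The "by chance" possibility for rs16891982 can be checked with some rough arithmetic. This is only a sketch: it assumes Hardy-Weinberg proportions, unrelated individuals of European ancestry, and a group of 25 genomes. The function name is made up for illustration, and the allele frequencies are the ones quoted above.

```python
def prob_all_carriers(alt_freq: float, n_people: int) -> float:
    """Probability that every one of n_people carries at least one ALT allele.

    Assumes Hardy-Weinberg proportions and independent (unrelated) people.
    """
    # A person is homozygous reference with probability (1 - p)^2.
    p_hom_ref = (1.0 - alt_freq) ** 2
    # Everyone must avoid being homozygous reference.
    return (1.0 - p_hom_ref) ** n_people


# ALT allele frequencies quoted above (Ashkenazi Jewish, European low/high).
for freq in (0.91, 0.95, 0.98):
    p = prob_all_carriers(freq, 25)
    print(f"ALT allele freq {freq:.2f}: P(all 25 carry it) = {p:.3f}")
```

At an ALT allele frequency of 0.95 this comes out to roughly 0.94, so under these assumptions, seeing no reference-only genotype among 25 people of European ancestry would be the expected outcome rather than a surprise.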

Anybody has the right to use, copy, modify, or redistribute this spreadsheet for any purpose.

Judee · Sep 27, 2019

I think you should lock the sheet so someone doesn't inadvertently delete something important unless you have a backup copy. That's just IMO. Especially since it looks like you put a lot of time and energy into the project.

kday · Sep 28, 2019

Judee said:
I think you should lock the sheet so someone doesn't inadvertently delete something important unless you have a backup copy. That's just IMO. Especially since it looks like you put a lot of time and energy into the project.

I locked the sheet and left two columns for anonymous people to edit in. I also realized that I slightly miscalculated frequencies (who would have thought calculating frequencies would be so hard?).

Unfortunately, this fix means the elevated frequencies of the IDO2 SNPs (and others) are no longer very statistically significant. But the numbers are now calculated properly.

And for now, I removed X and Y chromosomes because it's a little more tricky to calculate these frequencies correctly.

Moof · Sep 28, 2019

Can I just check something, @kday? I'm a bit fat-fingered these days, so I pasted the data into a spreadsheet on my own computer before I started looking at it. I thought I'd check my own data for the highest ranking SNPs, but whilst doing it, I noticed that some of the lines are repeated. Is this because there's more than one reference / more than one gene attached to the individual SNP? ('Sketchy' is an over-statement of my understanding of the subject!)

In terms of helping with the project, what's the most useful way for us to get involved?

Moof · Sep 28, 2019

Just for fun, here's my top 25, with the alleles I carry shaded (where heterozygous, both are shaded).

The two variants with the highest CADD scores – rs7755898 and rs8176928 – don't appear at all in my data.

According to the readout on my WGS sequencing, which I ran through your online tool, I don't have any TRPM3 variants. There are four other uncommon transient receptor potential variants (TRPS1, TRPM1, and two in TRPM4), all categorised as benign.

kday · Sep 28, 2019

Moof said:
Just for fun, here's my top 25, with the alleles I carry shaded (where heterozygous, both are shaded).

The two variants with the highest CADD scores – rs7755898 and rs8176928 – don't appear at all in my data.

According to the readout on my WGS sequencing, which I ran through your online tool, I don't have any TRPM3 variants. There are four other uncommon transient receptor potential variants (TRPS1, TRPM1, and two in TRPM4), all categorised as benign.

All those very rare ones at the top are probably miscalls. Instead of comparing it to your own data, it would help to compare it to OpenSNP data. If 100% of ME/CFS patients have something that 1 out of 10,000 people have, it's a miscall and the "bad?" checkbox should be checked. Remember to multiply the Freq * 100 (move the decimal point two places to the right) to get the percentage of the population! Makes it a bit easier to understand.
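
As a rough illustration of that rule of thumb, here is a minimal sketch. The function names and the cutoffs (`high`, `rare`) are assumptions picked for the example, not values used in the spreadsheet.

```python
def as_percent(freq: float) -> float:
    """Convert a 0-1 frequency to a percentage (move the decimal two places)."""
    return freq * 100


def looks_like_miscall(mecfs_freq: float, gnomad_freq: float,
                       high: float = 0.9, rare: float = 0.001) -> bool:
    """Flag variants that are near-universal in the ME/CFS genomes
    but very rare in gnomAD. Both cutoffs are hypothetical."""
    return mecfs_freq >= high and gnomad_freq <= rare


# 100% of the cohort vs. 1 in 10,000 in the population: suspicious.
print(looks_like_miscall(1.0, 0.0001))
# A common variant with similar frequencies in both groups: not flagged.
print(looks_like_miscall(0.60, 0.57))
```

A flag like this only says "check this one against OpenSNP first"; it doesn't replace looking at the actual genotype calls.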

The TRPM3 variants generally aren't in ClinVar, I don't think. So they won't show up with conventional analysis.

kday · Sep 28, 2019

@Moof

I've noticed the multiple variant bug. Sometimes it's because there are multiple gene names, other times it's because the database is off. And I think sometimes it's because ne... It's not that big of a deal if you can just ignore it.

But maybe I should allow people to edit the gene name. (edit: actually, for now, just check the "bad?" box for the incorrect gene names and I'll manually remove or merge multiple names).

Other ways for the community to help:

Probably the most helpful way to get involved is to click on the OpenSNP link, check whether the variant is a miscall, and tick the "bad?" checkbox if it is. If the ME/CFS frequency percentages don't roughly line up with the allele frequency table on OpenSNP (OpenSNP's calculations aren't perfect, but they should be in the same ballpark), then the "bad?" checkbox needs to be checked. Multiply the frequency on the spreadsheet by 100 (move the decimal point two places to the right) to get a percentage. (Tip: most of the miscalls have a very high relevance score. If you sort the sheet by relevance score, you can easily find and evaluate the miscalls.)

Also rarer variants where you see approximately double (or more) the frequency from gnomAD should be evaluated. If the relevance score is 2, that means the CFS Frequency is double the gnomAD frequency, and so on (you may want to update your local spreadsheet as I just fixed the calculations for relevance). This may make it easier than directly looking at frequencies. I've been commenting on each rsID field to annotate the spreadsheet, and this seems to work very well. Click the comments button, and you can click the rsID to get to that area of the spreadsheet, and you can even reply to the comments so I or others who make comments get notifications.
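
For anyone working on a local copy, the relevance score as described here is just a ratio. The sketch below assumes that definition (ME/CFS frequency divided by gnomAD frequency); the exact spreadsheet formula isn't shown in this thread, and the rsIDs and numbers are placeholders.

```python
from typing import Optional


def relevance_score(mecfs_freq: float, gnomad_freq: float) -> Optional[float]:
    """Ratio of ME/CFS frequency to gnomAD frequency.

    Returns None when the gnomAD frequency is zero or missing,
    since the ratio is undefined there.
    """
    if not gnomad_freq or gnomad_freq <= 0:
        return None
    return mecfs_freq / gnomad_freq


# Placeholder rows: (rsID, ME/CFS frequency, gnomAD frequency).
rows = [
    ("rs_example_1", 0.10, 0.05),
    ("rs_example_2", 0.30, 0.28),
    ("rs_example_3", 0.08, 0.02),
]

# Sort with the highest relevance first, as you would on the sheet.
ranked = sorted(rows, key=lambda r: relevance_score(r[1], r[2]) or 0, reverse=True)
for rsid, cfs, pop in ranked:
    print(rsid, round(relevance_score(cfs, pop), 2))
```

A score of 2 means the variant is twice as frequent in the ME/CFS genomes as in gnomAD. Very high scores on rare variants are often miscalls rather than findings.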

If it's a very common variant (1 on frequency means 100%), then a difference of 20-30% or more from the gnomAD frequency may be important. It's hard to tell which differences are due to chance when the ME/CFS and gnomAD frequencies are very close.

Did I explain that in a way that makes sense?

Moof · Sep 28, 2019

kday said:
All those very rare ones at the top are probably miscalls.

I misunderstood the whole thing completely, then!

I thought the 'relevance' score meant relevance to ME. Typical of me – I have dyscalculia, so when it comes to anything involving numbers, I'm usually guessing.

I haven't got my brain around your second post yet, but it's mid-evening here. I think I'll probably have a much better chance of understanding it if I re-read it tomorrow.

Moof · Sep 29, 2019

Have now re-read your post with an awake brain, @kday, and yes – it does make sense! Thank you. I won't have much time today, but I'll certainly help when I can.

Two quick things:

For laptop users, it might be helpful to double the depth of the header row and set the text alignment to 'wrap', so that columns D to I can be reduced in width (similar to what I did on the screenshot in post #5 above). This would mean the whole line can be displayed on a smaller screen, making it much easier to work on. Would this be okay?

In terms of not duplicating effort, can people assume that the lowest 'miscall' tick means that all the lines above it have been checked? If not, we need some way of indicating this – for instance, bolding the whole line.

Thanks again!

EDIT: For people like me who struggle with numbers, it would also help to add a column next to 'gnomAD frequency' that multiplies the value by 100. Perhaps shading it light grey would help distinguish it from the actual gnomAD frequency? (I'm happy to reformat the column widths and add this extra column myself, but I don't want to do it without asking. )

kday · Sep 29, 2019

Moof said:
Have now re-read your post with an awake brain, @kday, and yes – it does make sense! Thank you. I won't have much time today, but I'll certainly help when I can.

Two quick things:

For laptop users, it might be helpful to double the depth of the header row and set the text alignment to 'wrap', so that columns D to I can be reduced in width (similar to what I did on the screenshot in post #5 above). This would mean the whole line can be displayed on a smaller screen, making it much easier to work on. Would this be okay?

In terms of not duplicating effort, can people assume that the lowest 'miscall' tick means that all the lines above it have been checked? If not, we need some way of indicating this – for instance, bolding the whole line.

Thanks again!

EDIT: For people like me who struggle with numbers, it would also help to add a column next to 'gnomAD frequency' that multiplies the value by 100. Perhaps shading it light grey would help distinguish it from the actual gnomAD frequency? (I'm happy to reformat the column widths and add this extra column myself, but I don't want to do it without asking. )

I followed many of the suggestions you made. No frequencies in percentages yet. I think I might move the miscalls to a second spreadsheet and leave checkboxes for others to flag further miscalls. That should make the spreadsheet cleaner.

I don't know how much others can edit the sheet as I have most of it locked to prevent inadvertent changes. But comments can still be made on the spreadsheet. And other data can be added on the two columns on the right.

In fact, check out my comments! The comments feature works really well actually. You can read the comments, reply to comments and even click on the rsID of the comments to get to the specific line of the spreadsheet. This comment feature is wonderful for collaboration.

kday · Sep 30, 2019

Looking through the data and doing my own queries, I have preliminarily determined that the gene MMP26 is one of the most "mutated" genes in the ME/CFS 23andMe coding region. Followed by OR51F1, and ADAMTS13.

ME/CFS patients have 9 (!) MMP26 SNPs with a relevance score greater than 2. This means 9 SNPs in this gene occur at 2x the gnomAD frequency or more in the ME/CFS genomes.

There are 6 OR51F1 variants with a relevance score greater than 2 and 4 ADAMTS13 variants with relevance scores greater than 2.

MMP26 can explain MMP-9 activation and the EDS-like symptoms in ME/CFS.

Proteins of the matrix metalloproteinase (MMP) family are involved in the breakdown of extracellular matrix in normal physiological processes, such as embryonic development, reproduction, and tissue remodeling, as well as in disease processes, such as arthritis and metastasis. The encoded preproprotein is proteolytically processed to generate the mature enzyme. This enzyme may degrade collagen type IV, fibronectin, fibrinogen, and beta-casein, and activate matrix metalloproteinase-9 by cleavage.

https://www.genecards.org/cgi-bin/carddisp.pl?gene=MMP26

Specifically, for MMP26:

May hydrolyze collagen type IV, fibronectin, fibrinogen, beta-casein, type I gelatin and alpha-1 proteinase inhibitor. Is also able to activate progelatinase B.

I'm not exactly sure what OR51F1 does. It has to do with neuronal response that triggers the perception of a smell. And ADAMTS13 has to do with blood clotting issues.

Other genes of interest (genes that have 3 variants with a relevance score > 2) include: ~~KIAA1551~~, XIRP2, ~~BLM~~, KIAA0319, ~~PIGN~~, ~~EPHA10~~, and ~~SLC10A2~~.

These findings are based on an algorithm, not hand picking genes/variants, so there can only be algorithmic bias. Findings may be different with WGS/WES sequencing as it has all the variants. 23andMe has a small number in comparison and is missing a lot of important genes and SNPs.

kday · Sep 30, 2019

Big addition/addendum to the post above.

There is strong linkage disequilibrium (LD) in MMP26. So factoring in LD, I'd give a more conservative number of 5 for the number of variants with a relevance score over 2 (which is still significant). And there are a couple more SNPs that are at 1.75x frequency. There is also LD in OR51F1, which moves the more conservative SNP count from 6 to 3. 3 is still possibly significant, as that's a high number of SNPs with a relevance score > 2.

KIAA1551 has LD and the conservative SNP count is downgraded to 1, so it's probably not a very important gene.

XIRP2 might have linkage. If it does, it moves the conservative SNP count from 3 to 2, so possibly not as important as I thought. Same goes for KIAA0319 and OR51F1.

BLM looks to have complete LD for all SNPs counted, so the more conservative number is now 1, which means it's likely not important.

PIGN, EPHA10, and SLC10A2 each have two SNPs in LD, so they are probably not as important (this moves the conservative SNP count from 3 to 2).
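
The "conservative count" idea above can be sketched as counting one hit per LD group instead of one per SNP. This is a minimal illustration, not the query I actually ran: the SNP names, LD group labels, and scores below are placeholders, and deciding which SNPs are in LD with each other still has to be done by hand.

```python
from typing import Iterable, Optional, Tuple

# (rsID, LD group label or None if not known to be in LD, relevance score)
Snp = Tuple[str, Optional[str], float]


def conservative_count(snps: Iterable[Snp], threshold: float = 2.0) -> int:
    """Count SNPs above the relevance threshold, treating all SNPs
    in the same LD group as a single signal."""
    signals = set()
    for rsid, ld_group, relevance in snps:
        if relevance > threshold:
            # A SNP with no LD group counts on its own.
            signals.add(ld_group if ld_group is not None else rsid)
    return len(signals)


# Placeholder gene with 3 SNPs over the threshold, two of them in LD.
gene_snps = [
    ("snp_a", "block_1", 2.4),
    ("snp_b", "block_1", 2.4),
    ("snp_c", None, 2.1),
]
print(conservative_count(gene_snps))  # naive count is 3, conservative count is 2
```

If every counted SNP sits in the same LD group, as seems to be the case for BLM, the conservative count drops to 1.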

TL;DR
After looking at the data more closely, I have changed the SNPs of interest.
The genes of the highest interest are still MMP26 (and should be investigated yesterday!) and ADAMTS13. Other genes of interest are possibly OR51F1, XIRP2 and KIAA0319 if they don't actually have true LD.

It's possible that ADAMTS13 has some linkage disequilibrium too, but I haven't been able to completely confirm this. Nonetheless, the most significant gene remains MMP26.

That narrows things down.

Moof · Sep 30, 2019

You might have to bear with me a bit here, @kday – for me, reading statistics is like trying to make sense of Middle Kingdom hieroglyphs without any training.

What does it mean if a SNP frequency is very high in gnomAD (98%), but much lower in ME/CFS (31%), and the relevance score is low? I'm looking at rs294777, but there are one or two others where I can see quite significant differences.

It may not mean anything at all, in which case just tell me it's irrelevant!

kday · Sep 30, 2019

Moof said:
You might have to bear with me a bit here, @kday – for me, reading statistics is like trying to make sense of Middle Kingdom hieroglyphs without any training.
What does it mean if a SNP frequency is very high in gnomAD (98%), but much lower in ME/CFS (31%), and the relevance score is low? I'm looking at rs294777, but there are one or two others where I can see quite significant differences.

It means that there is a difference in ME/CFS vs the normal population. Could be good, could be bad. Depending on how many genomes, that seems significant if it's not a miscall. I would add a comment for that SNP on the RS# cell with the comment button on the spreadsheet like I do. It's hard to know what it means. Sometimes the reference variant is almost everyone, and the alt allele is almost nobody. The orientation of many of these on the newer reference (GRCh38) genome is reversed.

I'm reading that the gene may have something to do with choline deficiency.
https://www.genecards.org/cgi-bin/carddisp.pl?gene=UGT2B10

First you note it, then you see if you can make any sense of it.

kday · Sep 30, 2019

So I've done more research on what's going on. The variants I am finding are actually not MMP26 variants. BUT, they are within olfactory pseudogenes inside the first intron of MMP26. So basically, there are a whole bunch of olfactory genes/SNPs embedded within an MMP26 intron!

I'm a bit over my head, so I don't know the implications of this. But my thought is that if there is increased variation in several olfactory receptor genes inside the MMP26 intron in ME/CFS, perhaps there could be a weird effect on the expression of MMP26.

This is a really unique region and seems very important. However, I can't find much literature on it, ME/CFS or otherwise!

Below is a picture from the UCSC Genome Browser showing these genes. In ME/CFS, there is a greatly increased frequency of variants in OR51F1, OR51G1, and OR52R1 (2-4x the frequency in the general population, depending on the gene/SNP). Since these olfactory receptor genes are being selected out, perhaps this increased frequency of variants is evolutionary? I don't know!

Again, the relevance of this region was discovered by an algorithm and database query, not by me handpicking a region of interest. It's hard to say what is happening, but I know that in ME/CFS we have changes in our perception of smell and we have MMP-9 activation with collagen breakdown (this gene degrades type IV collagen). I don't know how much MMP26 could contribute to this because of the lack of literature. While there is a HUGE intron in the MMP26 gene, the coding exons are tiny. This is very unusual, especially considering that there are coding olfactory pseudogenes in the MMP26 intron.

That all said, I think it's possible that this specific region is what creates the susceptibility to ME/CFS (not the only region, but potentially an important one). It could be the region of the genome that makes ME/CFS possible. And nobody has ever studied it.

kday · Sep 30, 2019

An excerpt from the abstract of a study titled Matrix metalloproteinase-26, a novel MMP, is constitutively expressed in the human intervertebral disc in vivo and in vitro.

Molecular analyses showed significant downregulation of expression of MMP-26 (p=0.03), and significant 9.8-fold upregulation of TGF-beta (p=0.01) in more degenerated discs vs. healthier discs. Findings document the first identification of MMP-26 in the disc at the molecular and protein levels. Results point to the potentially important role of MMP-26 in matrix modulation during disc health and degeneration.

https://www.ncbi.nlm.nih.gov/pubmed/21945733

Moof · Sep 30, 2019

kday said:
First you note it, then you see if you can make any sense of it.

Excellent, I have got the hang of spotting the zebras, then! Thank you.

I hoped I had, but as I can't grasp statistical concepts, I just have to muddle things out in my own way.

Moof · Sep 30, 2019

Been making a few annotations, where there appears to be a difference between frequency in ME and the general population. @kday, could you look at these for me, please?

rs1951708
I think this might be an example of something that looks so odd it's probably a miscall. Do you agree?

rs196912 and rs196939
Potentially interesting link to a B-cell cancer, that appears to be more frequent in ME patients. However, I don't have enough knowledge to know whether (a) the apparent increase in frequency in ME is large enough to be meaningful; or (b) whether I've added 2 + 2 and made 19, based on unconfirmed evidence from one study. Do you think this sort of frequency increase is worth chasing up and noting?

Be grateful for any thoughts!

kday · Sep 30, 2019

Moof said:
Been making a few annotations, where there appears to be a difference between frequency in ME and the general population. @kday, could you look at these for me, please?

rs1951708
I think this might be an example of something that looks so odd it's probably a miscall. Do you agree?

rs196912 and rs196939
Potentially interesting link to a B-cell cancer, that appears to be more frequent in ME patients. However, I don't have enough knowledge to know whether (a) the apparent increase in frequency in ME is large enough to be meaningful; or (b) whether I've added 2 + 2 and made 19, based on unconfirmed evidence from one study. Do you think this sort of frequency increase is worth chasing up and noting?

Be grateful for any thoughts!

OpenSNP makes rs1951708 falsely look like a miscall. But it most definitely is not when looking at all the data!

rs196912 and rs196939 are real too. But it looks like they may have what's called linkage disequilibrium, meaning they are SNPs that are inherited together rather than independently. Note that the total number of genomes with a call for that SNP is only 11. This doesn't mean the data is wrong, but the odds that these numbers are due to chance are higher than if the total genome count were 25.

You can double check linkage disequilibrium by going to https://gnomad.broadinstitute.org and looking at the population data. If the frequencies are almost identical for the two SNPs across all population groups, it's likely linkage!
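
That eyeball check can be written down as a small comparison. This is only a heuristic sketch: the function name, the tolerance, and the population labels and numbers are all made up for illustration, and near-identical frequencies only hint at LD, they don't prove it.

```python
from typing import Dict


def similar_across_populations(freqs_a: Dict[str, float],
                               freqs_b: Dict[str, float],
                               tolerance: float = 0.02) -> bool:
    """Return True if two SNPs have nearly the same allele frequency
    in every population they share. The tolerance is an arbitrary choice."""
    shared = freqs_a.keys() & freqs_b.keys()
    if not shared:
        return False
    return all(abs(freqs_a[pop] - freqs_b[pop]) <= tolerance for pop in shared)


# Made-up per-population frequencies, copied by hand from two gnomAD pages.
snp_1 = {"European (non-Finnish)": 0.21, "African": 0.05, "East Asian": 0.33}
snp_2 = {"European (non-Finnish)": 0.22, "African": 0.05, "East Asian": 0.32}
snp_3 = {"European (non-Finnish)": 0.21, "African": 0.40, "East Asian": 0.10}

print(similar_across_populations(snp_1, snp_2))  # frequencies track each other
print(similar_across_populations(snp_1, snp_3))  # frequencies diverge
```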

You are doing everything right and thanks for your contributions!

nandixon · Sep 30, 2019

kday said:
OpenSNP makes rs1951708 falsely look like a miscall. But it most definitely is not when looking at all the data!

@kday, I believe everyone should be GG for that one. OpenSNP is suggesting that 23andMe apparently had a number of no-calls for that SNP, so I think that may be why the allele frequency for the G allele (i.e., the normal allele) in the spreadsheet has the appearance of being so low for the ME/CFS patients.
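
To illustrate how no-calls could produce that effect, here is a small sketch with made-up genotypes (not taken from the spreadsheet). It assumes an autosomal SNP, two alleles per person, and 23andMe's `--` marker for a no-call. The function name is hypothetical.

```python
from typing import List, Optional

NO_CALLS = {"--", ""}


def allele_frequency(genotypes: List[str], allele: str) -> Optional[float]:
    """Frequency of `allele` among called genotypes only (autosomal, diploid)."""
    called = [g for g in genotypes if g not in NO_CALLS]
    if not called:
        return None
    allele_count = sum(g.count(allele) for g in called)
    return allele_count / (2 * len(called))


# Made-up example: 25 genomes, 8 called as GG and 17 no-calls.
genotypes = ["GG"] * 8 + ["--"] * 17

# Counting no-calls in the denominator makes G look uncommon.
naive = sum(g.count("G") for g in genotypes) / (2 * len(genotypes))
print(f"naive G frequency (no-calls in denominator): {naive:.2f}")
print(f"G frequency among called genotypes: {allele_frequency(genotypes, 'G'):.2f}")
```

In this made-up case everyone with a call is GG, yet the naive calculation reports G at only 32%.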
