Download for Rare SNP Analysis

Sea · May 18, 2015

Sounds fabulous Valentijn.

Have you looked into imputing data at all? A program that "guesses" with a good level of accuracy results for your entire genome based on the results from 23andme. I think that would be very interesting but haven't a clue how to start with it. There is so much data that the program takes days to execute.
https://mathgen.stats.ox.ac.uk/impute/impute_v2.html

Valentijn · May 19, 2015

Sea said:
Have you looked into imputing data at all? A program that "guesses" with a good level of accuracy results for your entire genome based on the results from 23andme. I think that would be very interesting but haven't a clue how to start with it. There is so much data that the program takes days to execute.
https://mathgen.stats.ox.ac.uk/impute/impute_v2.html

I think it's best to leave that to other programs for now, which people can try to figure out how to use if they feel inclined. Currently it looks like a rather complicated process to run the existing software.

It would also inevitably introduce a pretty large degree of uncertainty in results, so I'm not sure it's a good approach in general. At any rate, I definitely want to keep it separate from 23andMe results which do have a very high degree of certainty.

ukxmrv · May 19, 2015

It sounds great to me @Valentijn

Thank you for all this hard work

Snow Leopard · May 19, 2015

Valentijn said:
Additionally, we might offer the full 23andMe table correlated with the dbSNP data for download. So someone could automatically generate a new file with all of their 23andMe data in it, but with gene names, possible alleles, i numbers translated into rs numbers, mutation data, and allele frequency all automatically included. This could then make it easy for someone who is interested in specific genes to look at them all at once, instead of logging in to 23andMe and looking up data for 1 SNP at a time.

Any thoughts?

This would be brilliant!
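For anyone wondering what such a combined file could look like in practice, here is a minimal sketch. It assumes the 23andMe raw export is tab-separated (ID, chromosome, position, genotype) with `#` comment lines at the top. The annotation dictionary, its field names, and the marker IDs in the sample are all made up for illustration; they are not the actual table being built in this thread.

```python
import io

def read_23andme(handle):
    """Yield (marker_id, chromosome, position, genotype) from a raw export."""
    for line in handle:
        if line.startswith("#") or not line.strip():
            continue  # skip header comments and blank lines
        marker_id, chrom, pos, genotype = line.rstrip("\r\n").split("\t")
        yield marker_id, chrom, int(pos), genotype

def annotate(raw_handle, annotations):
    """Attach annotation fields to each genotype row when they are available."""
    rows = []
    for marker_id, chrom, pos, genotype in read_23andme(raw_handle):
        info = annotations.get(marker_id, {})
        rows.append({
            # translate an i number to its rs number if the table knows one
            "snp": info.get("rs", marker_id),
            "chrom": chrom,
            "pos": pos,
            "genotype": genotype,
            "gene": info.get("gene", ""),
            "alleles": info.get("alleles", ""),
            "maf": info.get("maf", ""),
        })
    return rows

# Hypothetical sample data: these IDs, genes and frequencies are placeholders.
SAMPLE = (
    "# rsid\tchromosome\tposition\tgenotype\n"
    "rs0000001\t1\t1000\tAG\n"
    "i0000001\t8\t2000\tCT\n"
)
ANNOTATIONS = {
    "rs0000001": {"gene": "GENE1", "alleles": "A/G", "maf": 0.004},
    "i0000001": {"rs": "rs0000002", "gene": "GENE2", "alleles": "C/T", "maf": 0.18},
}

rows = annotate(io.StringIO(SAMPLE), ANNOTATIONS)
for row in rows:
    print(row)
```

With real data you would open the downloaded raw data file instead of the `io.StringIO` sample.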

TheChosenOne · May 21, 2015

Works perfect in Linux

Eeyore · May 21, 2015

@Sea @Valentijn - I've used impute2 quite a bit to fill in holes in my 23andme results, generally focused on something interesting. Imputing a small region with accuracy (say a couple thousand base pairs) can take a day. It's useful if you know what you are looking at - but phasing the whole genome and imputing missing SNPs is very computationally intensive.

You can improve information a lot if you get data from parents. My immediate family all shares info. Having parental info and siblings (or any relatives) really helps with phasing and hence imputation vs phased haplotypes.

These programs aren't that simple to use and you really need to understand a decent amount about genetics and statistics. I don't think for most it will really reveal anything that interesting unless you know where to look. Say you impute something and it looks bad - now what? Do you pay 8k and get your whole genome sequenced to confirm it?

I'm certainly not saying not to use it - I have, and do - but it's not something to take lightly.

Valentijn · May 22, 2015

Update:
We're making good progress on creating a list with the relevant data. Of all of the V3 and V4 23andMe SNPs, we have 955,381 which have dbSNP entries and at least some associated data. The 10% file currently has 239,525 SNPs and the 1% file currently has 51,464 SNPs. Those amounts will probably go down very slightly and each file includes duplicate entries where 23andMe has multiple i and/or rs numbers for each unique location.

But 1,856 of those are either rs numbers without an exact rs match or i numbers, in both cases with multiple possible rs matches for the same chromosome location. Usually one rs at the same location is a normal A/C/G/T SNP, and one or more are deletion/insertion SNPs. Instead of looking them up one at a time on 23andMe to see which alleles are being tested for, and using that to determine the appropriate rs number match, we're making a gigantic table of the 23andMe data from about 1500 data sets which people have publicly posted on openSNP.org.

This will take 3-4 days for my elderly laptop to process.

Initially we'd planned to do this later, when we're ready to analyze the patient 23andMe data to look for patterns, but it's also looking like the easiest way to sort out unclear duplicate possibilities. It can also be used to correct for any weirdness in the dbSNP data regarding forward and reverse orientation. So basically it'll ensure that we're listing the correct alleles for each SNP, and that the minor allele is listed in the forward orientation and therefore flagged for patients when (and only when) it should be.
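As a rough illustration of the idea (not the actual script being run here), the sketch below tallies which alleles show up for each SNP across many genotype sets, then checks whether they match a reference allele pair directly or only after taking the complement. The input layout, the function names, and the placeholder ID `rsX` are assumptions. Note that A/T and C/G SNPs look identical on both strands, so allele letters alone cannot settle their orientation.

```python
from collections import Counter, defaultdict

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def tally_alleles(datasets):
    """Count every allele seen per SNP.

    datasets: iterable of dicts mapping a SNP id to a genotype string like "AG".
    """
    counts = defaultdict(Counter)
    for genotypes in datasets:
        for snp_id, genotype in genotypes.items():
            for allele in genotype:
                # Anything other than A/C/G/T (no-calls, indel codes) is skipped.
                if allele in COMPLEMENT:
                    counts[snp_id][allele] += 1
    return counts

def orientation(observed, reference_alleles):
    """Compare observed alleles with a reference pair such as "CT"."""
    observed = set(observed)
    reference = set(reference_alleles)
    flipped = {COMPLEMENT[a] for a in reference}
    if not observed:
        return "no data"
    if reference == flipped:
        return "ambiguous"  # A/T or C/G: both strands give the same letters
    if observed <= reference:
        return "same"
    if observed <= flipped:
        return "flipped"
    return "mismatch"

# Hypothetical example: three people genotyped at a placeholder SNP "rsX".
datasets = [{"rsX": "AG"}, {"rsX": "GG"}, {"rsX": "--"}]
counts = tally_alleles(datasets)
print(counts["rsX"])
print(orientation(counts["rsX"], "CT"))
```

A "mismatch" result would be the cue to check whether the i or rs number has been paired with the wrong dbSNP entry, for example an insertion/deletion at the same location.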

My hard drive folder turned bright red yesterday for the first time ... apparently in reaction to only having 20GB free. It went back to about 50 after I'd deleted some huge downloads that I no longer needed, but I do need to keep a close eye on that.

Oh, and something interesting I noticed: the V4 chip has about 11,000 useless duplicates on it. They added quite a few extra i numbers which are testing exactly the same thing as other i or rs numbers on the chip. So the total number of unique SNPs tested for is really about 590,000 now and not 600,000+. This might have been a marketing decision, since going from "almost 1 million" to "500,000+" looks like a pretty huge 50% reduction ... whereas going from "900,000+" to "600,000+" looks like a somewhat more modest 30% reduction.
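One simple way to spot candidate duplicates like these is to group markers by chromosome and position. This is only a sketch with made-up marker IDs, and sharing a position is not proof of a true duplicate: as mentioned above, a normal SNP and an insertion/deletion can sit at the same location, so the alleles being tested still need to be compared.

```python
from collections import defaultdict

def find_duplicates(markers):
    """Group marker ids that share a chromosome and position.

    markers: iterable of (marker_id, chromosome, position) tuples.
    Returns only the locations that have more than one marker.
    """
    by_location = defaultdict(list)
    for marker_id, chrom, pos in markers:
        by_location[(chrom, pos)].append(marker_id)
    return {loc: ids for loc, ids in by_location.items() if len(ids) > 1}

# Hypothetical markers, not real chip content.
markers = [
    ("rs0000001", "1", 1000),
    ("i0000001", "1", 1000),
    ("rs0000002", "1", 2500),
    ("i0000002", "X", 1000),
]
duplicates = find_duplicates(markers)
print(duplicates)
```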

Eeyore · May 22, 2015

@Valentijn - I think this will be very helpful. Having used Promethease, there is a lot of garbage data there. It's not entirely their fault, but sometimes it will flag your genotype as very rare when a bit more analysis shows that it's either normal or even very common. You really can't always trust their genotype percentage analyses or GMAFs. It happens quite a bit too, and I can see how it could needlessly worry someone who doesn't know what to look for, not to mention create mountains of data that one would have to go through to find real data.

I do have a few quick questions (which maybe you answered already) -

First, are you basing frequencies on genotypes or on haplotypes / GMAFs?

Second, are you planning to run a GWAS on all SNPs or only rare ones? Say, for example, there is a genotype with a MAF of 10%, so that, assuming Hardy-Weinberg equilibrium, you would distribute to about 1%, 18%, and 81%. It could very well be that ME is dramatically overrepresented (or even possibly only found) in the 81%, and that the minor allele is completely protective. I can actually think of some minor alleles already that would be pathogenic but would protect one from ME (but would be worse... heh). Or is this an error of study power?

Lastly, are you going to apply a Bonferroni multiple test correction, and is that the reason you are starting with rare alleles, to try to achieve statistical significance with fewer samples?
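For readers following along, the 1% / 18% / 81% split in the second question comes straight from the Hardy-Weinberg formula: with minor allele frequency q and major allele frequency p = 1 - q, the expected genotype frequencies are q², 2pq, and p². A small sketch (the function name is just for illustration):

```python
def hardy_weinberg(maf):
    """Expected genotype frequencies for a biallelic SNP under Hardy-Weinberg."""
    q = maf        # minor allele frequency
    p = 1.0 - q    # major allele frequency
    return {
        "minor homozygous": q * q,
        "heterozygous": 2 * p * q,
        "major homozygous": p * p,
    }

for maf in (0.10, 0.01):
    freqs = hardy_weinberg(maf)
    summary = ", ".join(f"{name}: {value:.2%}" for name, value in freqs.items())
    print(f"MAF {maf:.0%} -> {summary}")
```

The same formula shows why the allele-versus-genotype distinction matters: an allele at 1% frequency gives roughly 2% heterozygous carriers but only about 0.01% homozygous ones.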

Valentijn · May 22, 2015

Eeyore said:
First, are you basing frequencies on genotypes or on haplotypes / GMAFs?

It's almost all based on the general 1000 genomes data via dbSNP. Ethnic data is not included. In some cases where that data is lacking, I might add in prevalence data from large general groups submitted by other parties to dbSNP. That would be the pretty big ones with allele sample sizes of around 2000-5000.

For purposes of generating the rare 1% and 10% result, any missense mutation with no prevalence data is going to be assumed to be <= 1%. These are usually pathogenic mutations which are very rare, and for some reason seem to be excluded from 1000 genomes sampling and similar - possibly because they are so very rare. But the current plan is to leave the MAF (Minor Allele Frequency) column blank for those or to have a "??", so that the uncertainty is apparent to users.
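The rule described above could be sketched roughly like this. The function names are hypothetical, and two details are my assumptions rather than anything stated in the thread: each SNP is given the smallest bucket it qualifies for (the real 10% file may well also contain everything in the 1% file), and the MAF is stored as a fraction and shown as a percentage.

```python
def rarity_bucket(maf, is_missense):
    """Return "1%", "10%", or None for a SNP.

    maf: minor allele frequency as a fraction (0.01 means 1%), or None if unknown.
    """
    if maf is None:
        # No prevalence data: missense mutations are assumed to be <= 1%.
        return "1%" if is_missense else None
    if maf <= 0.01:
        return "1%"
    if maf <= 0.10:
        return "10%"
    return None

def maf_column(maf):
    """Text for the MAF column, keeping unknown frequencies visibly uncertain."""
    return "??" if maf is None else f"{maf * 100:.2f}"

print(rarity_bucket(None, True), maf_column(None))
print(rarity_bucket(0.0032, False), maf_column(0.0032))
print(rarity_bucket(0.05, False), maf_column(0.05))
```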

Second, are you planning to run a GWAS on all SNPs or only rare ones? Say, for example, there is a genotype with a MAF of 10%, so that, assuming Hardy-Weinberg equilibrium, you would distribute to about 1%, 18%, and 81%. It could very well be that ME is dramatically overrepresented (or even possibly only found) in the 81%, and that the minor allele is completely protective. I can actually think of some minor alleles already that would be pathogenic but would protect one from ME (but would be worse... heh). Or is this an error of study power?

I will probably look for over- or under-represented genotypes and allele counts at some point. Statistical significance might not even be possible, due to small sample size: we have 50 sets of patient data thus far, and over 960,000 SNPs for most of them.

I also want to look at larger trends. One example could be genes where we have an unusual number of mutations as a group, even if nothing is too unbalanced regarding any specific SNPs on that gene. Similarly, we should be able to apply the same concept to groups of genes which are involved in the same processes.

Lastly, are you going to apply a Bonferroni multiple test correction, and is that the reason you are starting with rare alleles, to try to achieve statistical significance with fewer samples?

I think we definitely want to correct for multiple comparisons. But Mr Valentijn or one of our PR statistics people is going to have to help with that, because I haven't been able to absorb the technical aspects of it into my ME-brain.
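To give a feel for what a Bonferroni correction does, it simply divides the usual significance cutoff by the number of tests. The sketch below assumes the conventional cutoff of 0.05 and one test per SNP, using the 960,000 figure mentioned above.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test p-value cutoff that keeps the overall false-positive rate near alpha."""
    return alpha / n_tests

threshold = bonferroni_threshold(0.05, 960_000)
print(f"{threshold:.2e}")  # prints 5.21e-08
```

Restricting the analysis to a smaller set, such as only the rare SNPs, lowers `n_tests` and makes the cutoff less severe, which is the trade-off behind Eeyore's question.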

Mostly I just think rare SNPs are the most practical starting point to look for anything interesting. On the individual level it can be used to identify pathogenic or potentially pathogenic mutations. Some of these are pretty low-impact, similar to the more common MTHFR mutations, but others can have dramatic impact, such as by causing deafness if a certain class of antibiotic is used. And some might even be potential culprits for some ME symptoms, or immune dysfunction, etc.

And on the group level the rare SNPs present a collection of SNPs which are small enough to be easily manipulated and more closely examined. For example, I currently have a compilation of <=1% results from 39 ME patients. Some of these SNPs occur much more often than expected in ME patients, and sometimes there are several such SNPs on a single gene.
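One rough way to judge "more often than expected" is a binomial tail probability: if the minor allele really has a given frequency, how likely is it to see at least this many copies among the patients' alleles? The sketch below assumes an autosomal SNP (two alleles per person), unrelated patients, and a reference frequency that fits the patients' ancestry. The count of 5 copies is a made-up example, not a result from the data.

```python
from math import comb

def binomial_tail(k, n, p):
    """Probability of seeing k or more successes in n trials with success rate p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

n_alleles = 2 * 39        # 39 patients, two alleles each
expected_frequency = 0.01 # reference minor allele frequency
observed_copies = 5       # hypothetical count, for illustration only

p_value = binomial_tail(observed_copies, n_alleles, expected_frequency)
print(f"Expected copies: {n_alleles * expected_frequency:.2f}")
print(f"P(at least {observed_copies} copies) = {p_value:.5f}")
```

A small value here is only a pointer for a closer look, since it says nothing yet about the thousands of other SNPs being checked at the same time.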

By compiling just these rare results, we can easily focus on looking into those more closely: is the minor allele really that rare, or is 23andMe over-reporting it or 1000 Genomes under-reporting it? We can compare them to 23andMe data from dozens or hundreds of random people to find out. Are they missense mutations, and thus much more likely to have an impact? I was able to look them up manually on dbSNP to find out. Is the particular missense mutation capable of having a pathogenic result, when there's no research regarding it yet? Online protein modelling programs can make a pretty good guess, and at least weed out the likely harmless ones.

Anyhow, the plan is to examine the data in several ways, and we'll probably add more to the list as time goes on and people make suggestions.

Eeyore · May 22, 2015

@Valentijn - How/When are you planning to release this data?

Valentijn · May 22, 2015

Eeyore said:
@Valentijn - How/When are you planning to release this data?

No plan to publish or anything. It'll get posted here though.

Eeyore · May 22, 2015

@Valentijn - that's all I meant really - that we'd see it here. It's not really feasible to publish unless you have MD or PhD after your name.

What may be most useful is to look for a general locus (or loci) that seem to associate with ME. This way it won't actually matter if it's coding or not coding or if it's just in LD with the actual cause, and it will still catch different mutations in the same gene. There are good centimorgan maps readily available of the whole genome, so you might be able to do a lot that way. In the past, almost all genetic discoveries were done by localization to a specific chromosome, then a specific region of a specific chromosome, etc. and narrowed down to a pretty small area, after which sequencing was done. NGS just gives so much data and so many tests that you have a ton of false positives in anything but a ridiculously large sample set. Of course, it wouldn't be that expensive to do SNP arrays on 1K ME patients ($75-80K through for-profit places like 23andme, which is chump change by government grant standards, and it would probably be much less). Think of how much more we have already wasted on psychobabble.
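A very crude version of the locus idea is to pool the flagged variants from all patients and count how many land in each stretch of a chromosome. The sketch below uses fixed 1 Mb windows by base-pair position, which is my simplification; a genetic (centimorgan) map would define the regions differently. The positions in the example are made up.

```python
from collections import Counter

def window_counts(hits, window_size=1_000_000):
    """Count flagged variants per fixed-size window.

    hits: iterable of (chromosome, position) pairs pooled across patients.
    Returns a Counter keyed by (chromosome, window index).
    """
    counts = Counter()
    for chrom, pos in hits:
        counts[(chrom, pos // window_size)] += 1
    return counts

# Hypothetical positions, for illustration only.
hits = [("6", 100), ("6", 999_999), ("6", 1_000_000), ("X", 5)]
counts = window_counts(hits)
for (chrom, window), n in counts.most_common():
    print(f"chr{chrom} window {window}: {n} flagged variants")
```

Windows with unusually high counts would then need comparing against the same counts in non-patient data before reading anything into them.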

ohtheennui · May 24, 2015

Can anyone help me make sense of this report? I'm not finding much data for rs17537486.

SNP CHRM RARE PERCENT GENOTYPE ETC
rs11587965 1 A 1.00 AG
rs3918018 1 T 1.00 CT
rs17559127 1 A 1.00 AG
rs17410855 1 T 0.32 CT
rs11803863 1 A 1.00 AG
rs17625605 1 A 1.00 AG
rs35366573 1 T 1.00 CT
rs1804645 2 T 1.00 CT
rs2020912 2 C 1.00 CT
rs17746486 2 T 1.00 CT
rs13014679 2 C 1.00 AC
rs13012441 2 G 1.00 AG
rs2228545 2 A 1.00 AG
rs17537486 3 G 1.00 GG Homozygous
rs13067016 3 A 1.00 AG
rs4361233 3 G 1.00 AG
rs17788141 3 C 1.00 CT
rs2172397 3 T 1.00 CT
rs9883197 3 G 1.00 AG
rs11717889 3 A 1.00 AC
rs11725499 4 A 1.00 AC
rs2231134 4 G 1.00 CG
rs17216887 4 G 1.00 CG
rs1546249 4 A 0.23 AC
rs13118702 4 T 1.00 CT
rs2322180 4 A 1.00 AC
rs5528 4 A 1.00 AG
rs10446841 4 G 1.00 AG
rs10512749 5 T 1.00 GT
rs6887007 5 A 1.00 AC
rs17712772 5 A 1.00 AC
rs1353885 5 T 1.00 CT
rs4151681 5 A 1.00 AG
rs7724670 5 A 1.00 AG
rs13217575 6 A 1.00 AC
rs12202093 6 C 0.41 CT
rs12211859 6 C 1.00 CT
rs12195007 6 G 1.00 GT
rs4134982 6 A 1.00 AG
rs11962520 6 T 1.00 CT
rs130060 6 C 1.00 AC
rs938152 6 A 1.00 AG
rs11155856 6 T 1.00 CT
rs7799716 7 G 1.00 AG
rs4986910 7 G 0.32 AG
rs11760237 7 C 1.00 CT
rs13231116 7 T 1.00 GT
rs10503527 8 T 1.00 CT
rs785006 8 G 1.00 AG
rs13290184 9 T 1.00 GT
rs7900194 10 A 1.00 AG
rs17884201 11 A 1.00 AG
rs35329661 11 T 1.00 CT
rs1800054 11 G 1.00 CG
rs11605530 11 C 1.00 CT
rs11055389 12 A 1.00 AG
rs11540149 12 T 1.00 CT
rs10850948 12 T 1.00 CT
rs12184642 13 T 1.00 CT
rs8187840 13 T 1.00 CT
rs17514635 15 C 1.00 CT
rs17515108 15 A 1.00 AC
rs6151562 15 A 1.00 AG
rs1886 15 T 1.00 CT
rs41278174 16 A 1.00 AG
rs11569775 16 G 1.00 CG
rs16943885 16 C 1.00 CT
rs11650007 17 A 1.00 AG
rs11869580 17 A 1.00 AG
rs4251735 17 C 1.00 CT
rs11656940 17 A 1.00 AG
rs936018 17 C 1.00 AC
rs2016648 17 T 1.00 CT
rs17651039 17 G 1.00 AG
rs6504459 17 G 1.00 AG
rs16019 19 G 1.00 GT
rs34982899 19 C 1.00 CG
rs8107027 19 A 1.00 AG
rs34934920 19 T 1.00 CT
rs35979566 19 A 0.27 AT
rs6088236 20 T 1.00 CT
rs13039715 20 T 1.00 GT
rs34716589 20 G 1.00 AG
rs8124453 20 T 1.00 CT
rs9332312 22 T 0.27 CT
rs5934953 X C 1.00 CT
rs5904487 X A 1.00 AG
rs5905625 X G 1.00 AG
rs5906681 X T 1.00 CT
rs12848140 X G 0.48 AG
rs765697 X C 1.00 CT
rs2858121 X G 1.00 AG

Valentijn · May 24, 2015

@ohtheennui - rs17537486 isn't on a gene, but rather is just near one, so it probably isn't having an impact itself. You can see some basic data at http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=rs17537486 and you can search for other SNPs there as well.

Aside from the homozygous result, the heterozygous results with a "PERCENT" under 1 (instead of at 1.00) are the ones most likely to be missense mutations. And it's usually missense mutations which have an impact.
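If the report is long, a few lines of code can pull out the rows described above. This sketch assumes the report is whitespace-separated with the columns SNP, CHRM, RARE, PERCENT, GENOTYPE and an optional note, as in the listing posted earlier; the function names are just for illustration. The three sample rows are copied from that listing.

```python
def parse_report(text):
    """Turn report lines into dictionaries, skipping the header and short lines."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 5 or parts[0] == "SNP":
            continue
        snp, chrom, rare, percent, genotype = parts[:5]
        rows.append({
            "snp": snp,
            "chrom": chrom,
            "rare": rare,
            "percent": float(percent),
            "genotype": genotype,
            "note": " ".join(parts[5:]),
        })
    return rows

def worth_a_look(rows):
    """Keep rows with PERCENT under 1 or two copies of the rare allele."""
    return [
        row for row in rows
        if row["percent"] < 1.0 or row["genotype"] == row["rare"] * 2
    ]

REPORT = """SNP CHRM RARE PERCENT GENOTYPE ETC
rs11587965 1 A 1.00 AG
rs17410855 1 T 0.32 CT
rs17537486 3 G 1.00 GG Homozygous
"""

selected = worth_a_look(parse_report(REPORT))
for row in selected:
    print(row["snp"], row["percent"], row["genotype"], row["note"])
```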

ohtheennui · May 24, 2015

Thanks for the link. Unfortunately I'm new to this and there's a sharp learning curve. I wish I knew how to make sense of this information.

Eeyore · May 24, 2015

Sorry @Valentijn, you may have answered this, but I'm still unclear on one point.

Are you flagging alleles with a frequency under 1%, or genotypes with a frequency under 1%?

Valentijn · May 24, 2015

Eeyore said:
Sorry @Valentijn, you may have answered this, but I'm still unclear on one point.

Are you flagging alleles with a frequency under 1%, or genotypes with a frequency under 1%?

Alleles.

MimiLoves · Aug 7, 2015

Hello. Sorry to crash this post, especially as I see that no one has posted on it since May and it is now August. I finally signed up for this site because, for the last 2 years when I have been googling my rare alleles, there are a couple of you who share these rare alleles with me. So I think that means that we may share a common ancestor, and in addition I have a couple of genetic conditions that some of you might be interested in. Sorry if I am crashing this post inappropriately - I am not sure where to start or how to reach out to some of you. I am a genealogist and for the last few years, I have begun to dabble in health related genetics. I am a patient at the NIH (National Institutes of Health) in Bethesda, Maryland, USA, which is the research hospital funded by the government; they also created the NCBI database. Much to the amusement of the researchers there, I have been able to provide them with countless pieces of information and clues about the conditions that I have.

MimiLoves · Aug 10, 2015

Here are my rare alleles. I have Albinism, Cystic Fibrosis, Long QT Syndrome, Fuchs Corneal Dystrophy, and Ehlers-Danlos Syndrome. I hit the genetic jackpot! Seriously though, in spite of it all, I am relatively healthy. Some of them cause no real problems daily, such as Long QT and Ehlers-Danlos - I am just aware that I have them. If anyone would like to see my family tree, which is quite extensive, please let me know.

SNP CHRM RARE PERCENT GENOTYPE
rs17887074 1 A 0.18 AG
rs9333061 1 T 1 CT
rs2225882 1 T 1 GT
rs12141657 1 G 1 AG
rs16798 1 G 1 GT
rs10494246 1 C 1 CT
rs28384834 1 A 0.46 AG
rs17625605 1 A 1 AG
rs4648310 1 C 1 CT
rs6425087 1 T 0 CT
rs35292876 1 T 0.41 CT
rs17738903 2 T 1 CT
rs45471294 2 T 1 CT
rs13004547 2 T 1 CT
rs17268945 2 C 1 CT
rs17696740 2 A 1 AG
rs1035977 2 T 1 GT
rs840964 2 A 1 AG
rs702885 2 A 1 AG
rs3188996 2 G 1 AG
rs13006003 2 A 1 AG
rs33942096 3 A 1 AG
rs35867420 3 T 1 CT
rs34166957 3 A 1 AG
rs7651172 3 T 1 CT
rs201713 3 A 1 AG
rs16822634 3 T 1 CT
rs11915975 3 C 1 CT
rs11719987 3 G 1 GT
rs5030094 3 C 1 CT
rs11724027 4 T 1 CT
rs1281138 4 A 1 AG
rs292050 4 T 1 GT
rs17216887 4 G 1 CG
rs284797 4 T 0.05 CT
rs13139935 4 A 1 AG
rs5528 4 A 1 AG
rs16903514 5 C 1 CT
rs12188139 5 G 1 AG
rs28763954 5 A 0.32 AG
rs17524123 5 G 1 AG
rs13178431 5 G 1 AG
rs3811993 5 T 1 CT
rs12195092 6 A 1 AA
rs17282871 6 A 1 AG
rs17689215 6 C 1 CT
rs7758568 6 G 1 AG
rs17856332 6 G 1 AG
rs17664201 6 A 1 AA
rs17664591 6 T 1 TT
rs11968228 6 T 1 CT
rs12191420 6 A 1 AG
rs12192967 6 T 1 CT
rs10498877 6 G 1 AG
rs17764105 6 G 1 GT
rs9401670 6 G 1 GT
rs13203014 6 C 1 CT
rs311354 6 C 1 CT
rs7799716 7 G 1 AG
rs17324153 7 C 1 CT
rs72555745 7 C 1 CT
rs52815063 7 T 1 AT
rs28371763 7 A 1 AT
rs2158576 7 T 1 CT
rs28359524 8 G 1 AG
rs7460146 8 T 1 GT
rs1484796 8 A 1 AG
rs17317857 8 A 1 AG
rs7837140 8 A 1 AG
rs13254260 8 A 1 AG
rs13281766 8 G 1 AG
rs7835203 8 G 1 AG
rs16897111 8 C 1 AC
rs6997635 8 T 1 CT
i3003038 8 G 0.18 AG
rs12114698 8 A 1 AG
rs1031946 8 G 1 AG
rs41313971 9 C 0.18 AC
rs2993177 9 A 1 AG
rs167901 9 A 1 AG
rs9410653 9 A 1 AG
rs11791546 9 T 1 CT
rs13288277 9 G 1 AG
rs12554810 9 T 1 CT
rs11596047 10 A 1 AG
rs2239659 10 A 1 AG
rs2505688 10 A 1 AG
rs17390084 10 A 1 AG
rs11592846 10 T 1 CT
rs3134609 10 A 0.32 AG
rs41310298 10 A 0.27 AG
rs12780429 10 G 1 AG
rs12783456 10 C 1 CT
rs11606345 11 C 1 CT
rs7113429 11 C 1 CT
rs831627 11 A 1 AG
rs11230648 11 T 1 CT
rs7102974 11 T 1 CT
rs17825668 11 G 1 AG
rs34584708 11 G 1 AG
rs35809865 11 A 1 AG
rs11217642 11 A 1 AG
rs1800499 11 T 1 CT
rs603996 11 A 1 AG
rs1805555 12 A 1 AG
rs17415853 12 C 1 CT
rs11179139 12 C 1 AC
rs11610000 12 C 1 CT
rs11609959 12 C 1 CT
rs3751215 12 G 1 AG
rs11112308 12 A 1 AG
rs12300729 12 C 1 AC
rs8192441 12 G 1 GT
rs3026434 12 G 1 AG
rs28613273 12 G 1 CG
rs11831037 12 T 1 GT
rs12867036 13 A 1 AG
rs17320607 13 C 1 CT
rs17220870 13 T 1 TT
rs10507792 13 A 1 AC
rs11499034 14 C 1 CT
rs3212102 14 T 1 CT
rs17759504 15 A 1 AG
rs8176928 16 G 0.37 GG
rs3093391 16 T 1 CT
rs7195131 16 G 1 AG
rs12949990 17 A 1 AG
rs201597 17 A 1 AG
rs28363284 17 C 1 CT
rs17616365 17 A 1 AG
rs17678817 17 T 1 CT
rs17679086 17 G 1 AG
rs17679361 17 C 1 CT
rs4135012 17 A 1 AG
rs17491503 17 C 1 AC
rs3887424 17 C 1 CT
rs2156840 18 G 1 AG
rs4987745 18 T 0.27 CT
rs4921 19 A 1 AC
rs4926123 19 T 1 CT
rs28371534 19 A 1 AG
rs1801272 19 T 1 AT
rs57266494 19 A 1 AG
rs2231943 19 T 1 CT
rs11536997 20 T 1 CT
rs8124453 20 T 1 CT
rs17514727 20 T 1 CT
rs2234916 21 G 0.18 AG
rs41557318 21 T 0.14 CT
rs8142331 22 A 1 AC
rs9332314 22 T 1 CT
rs139275 22 A 1 AG
rs13056402 22 T 1 CT
rs11796544 X A 1 AG
rs12557092 X T 0.24 CT
rs5914148 X G 1 AG
rs409454 X G 1 AG
rs760871 X C 1 CT

Valentijn · Aug 11, 2015

@MimiLoves - Did they ever find a specific genetic cause for your EDS (or other diagnoses)? You have a couple pretty rare heterozygous SNPs on VPS13B, one of which (rs28940272/i3003038) is known to cause Cohen's Syndrome when homozygous. Cohen's can account for joint laxity and some heart issues, and it sounds like the traditional signs of obesity and retardation are now considered much more optional. Plus there's a tendency to discover that while more severe manifestations of many diseases require homozygosity, quite a few of those can cause milder problems in some people even when heterozygous.

Anyhow, it seems excessively unlucky to have five genetic diseases all at once. Which could be an indication that there's really just one or two actual diseases which are manifesting with the same symptoms as the other diseases.
