First, are you basing frequencies on genotypes or on haplotypes / GMAFs?
It's almost all based on the general 1000 Genomes data via dbSNP. Ethnic data is not included. In some cases where that data is lacking, I might add in prevalence data from large general groups submitted by other parties to dbSNP. Those would be the fairly big ones, with allele sample sizes of around 2000-5000.
For purposes of generating the rare 1% and 10% results, any missense mutation with no prevalence data is going to be assumed to be <= 1%. These are usually pathogenic mutations which are very rare, and for some reason seem to be excluded from 1000 Genomes sampling and similar, possibly because they are so very rare. But the current plan is to leave the MAF (Minor Allele Frequency) column blank for those or to have a "??", so that the uncertainty is apparent to users.
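As a rough illustration of that rule, here is a minimal Python sketch. The function names are made up for this example, and it assumes the 10% result also includes everything in the 1% result, which may not match how the actual reports are built.

```python
from typing import Optional


def rarity_bucket(maf: Optional[float]) -> str:
    """Assign a SNP to a rarity bucket from its minor allele frequency.

    A missing MAF (None) is assumed to be <= 1%, as described above.
    The bucket labels are illustrative, not the real report headings.
    """
    if maf is None or maf <= 0.01:
        return "<=1%"
    if maf <= 0.10:
        return "<=10%"
    return "common"


def maf_display(maf: Optional[float]) -> str:
    """Text for the MAF column: '??' when there is no prevalence data."""
    return "??" if maf is None else f"{maf:.1%}"
```

So a SNP with no prevalence data would land in the <=1% bucket but still show "??" in the MAF column, which keeps the assumption visible.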
Second, are you planning to run a GWAS on all SNPs or only rare ones? Say, for example, there is a genotype with a MAF of 10%, so that, assuming Hardy-Weinberg equilibrium, you would distribute to about 1%, 18%, and 81%. It could very well be that ME is dramatically overrepresented (or even possibly only found) in the 81%, and that the minor allele is completely protective. I can actually think of some minor alleles already that would be pathogenic but would protect one from ME (but would be worse... heh). Or is this an error of study power?
I will probably look for over- or under-represented genotypes and allele counts at some point. Statistical significance might not even be possible, due to small sample size: we have 50 sets of patient data thus far, and over 960,000 SNPs for most of them.
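To make the 1% / 18% / 81% split from the question concrete, here is a small Python sketch of the Hardy-Weinberg arithmetic. The function names are invented for this example, and it only gives the expected genotype counts to compare observed counts against; it is not a significance test.

```python
def hwe_genotype_frequencies(maf: float) -> dict:
    """Expected genotype frequencies under Hardy-Weinberg equilibrium.

    q is the minor allele frequency and p = 1 - q is the major one.
    """
    q = maf
    p = 1.0 - q
    return {
        "minor/minor": q * q,      # q^2
        "minor/major": 2 * p * q,  # 2pq
        "major/major": p * p,      # p^2
    }


def expected_counts(maf: float, n_people: int) -> dict:
    """Expected number of people with each genotype in a group of n_people."""
    freqs = hwe_genotype_frequencies(maf)
    return {genotype: f * n_people for genotype, f in freqs.items()}


# The example from the question: MAF of 10%, applied to 50 patients.
for genotype, count in expected_counts(0.10, 50).items():
    print(genotype, round(count, 1))
```

With only 50 patients, the expected number of minor-allele homozygotes is well under one person, which is part of why significance is hard to reach for any single SNP.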
I also want to look at larger trends. One example could be genes where we have an unusual number of mutations as a group, even if nothing is too unbalanced regarding any specific SNPs on that gene. Similarly, we should be able to apply the same concept to groups of genes which are involved in the same processes.
Lastly, are you going to apply a Bonferroni multiple test correction, and is that the reason you are starting with rare alleles, to try to achieve statistical significance with fewer samples?
I think we definitely want to correct for multiple comparisons. But Mr Valentijn or one of our PR statistics people is going to have to help with that, because I haven't been able to absorb the technical aspects of it into my ME-brain.
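For anyone else trying to follow the Bonferroni idea, the basic version is just dividing the significance level by the number of tests. A minimal Python sketch follows; the 0.05 level is the conventional default and an assumption here, not something we have settled on, and the function names are made up.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test p-value cutoff after a Bonferroni correction."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests


def passes_bonferroni(p_values: list, alpha: float = 0.05) -> list:
    """For each p-value, whether it survives correction for len(p_values) tests."""
    threshold = bonferroni_threshold(alpha, len(p_values))
    return [p <= threshold for p in p_values]


# Testing all ~960,000 SNPs versus a much smaller rare-only subset.
# The 2,000 figure is purely hypothetical, for comparison.
print(bonferroni_threshold(0.05, 960_000))
print(bonferroni_threshold(0.05, 2_000))
```

The fewer SNPs tested, the less strict the per-SNP cutoff becomes, which is the point behind the question about starting with rare alleles.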
Mostly I just think rare SNPs are the most practical starting point to look for anything interesting. On the individual level it can be used to identify pathogenic or potentially pathogenic mutations. Some of these are pretty low-impact, similar to the more common MTHFR mutations, but others can have dramatic impact, such as by causing deafness if a certain class of antibiotic is used. And some might even be potential culprits for some ME symptoms, or immune dysfunction, etc.
And on the group level the rare SNPs present a collection of SNPs which is small enough to be easily manipulated and more closely examined. For example, I currently have a compilation of <=1% results from 39 ME patients. Some of these SNPs occur much more often than expected in ME patients, and sometimes there are several such SNPs on a single gene.
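One simple way to put a number on "more often than expected" is a binomial tail probability: given a reference allele frequency, how likely is it to see at least this many copies of the minor allele among the patients' alleles? The Python sketch below is only an illustration. It assumes the alleles are independent and the reference frequency is accurate, it ignores the multiple-comparisons issue, and the function name is invented.

```python
from math import comb


def binomial_upper_tail(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p).

    k: observed copies of the minor allele in the group
    n: total alleles examined (2 per person for autosomal SNPs)
    p: reference minor allele frequency, e.g. 0.01
    """
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# 39 patients means 78 alleles; at a 1% frequency the expected
# number of minor-allele copies is 78 * 0.01 = 0.78.
n_alleles = 2 * 39
for observed in (1, 3, 5):
    print(observed, binomial_upper_tail(observed, n_alleles, 0.01))
```

A small tail probability would only flag a SNP for a closer look, since over-reporting by 23andMe or under-reporting in 1000 Genomes could produce the same pattern.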
By compiling just these rare results, we can easily focus on looking into those more closely: is the minor allele really that rare, or is 23andMe over-reporting it or 1000 Genomes under-reporting it? We can compare them to 23andMe data from dozens or hundreds of random people to find out. Are they missense mutations, and thus much more likely to have an impact? I was able to look them up manually on dbSNP to find out. Is the particular missense mutation capable of having a pathogenic result, when there's no research regarding it yet? Online protein modelling programs can make a pretty good guess, and at least weed out the likely harmless ones.
Anyhow, the plan is to examine the data in several ways, and we'll probably add more to the list as time goes on and people make suggestions.