The downloadable rare gene program is done, aside from some minor tweaking. It's written in Java, in a version that is compatible both with modern computers and with the dinosaurs some of us have.
The program itself is about 2MB (for comparison, the 23andMe file is 23.6MB), and extremely user-friendly. Basically a box with three big buttons comes up. The top button is for selecting a rare gene database, but if there's one in the directory with the program, it'll select it automatically. This button is orange while no database is selected, and turns green once one has been selected (automatically or manually). This setup will make it easy for updated or alternate databases (prevalence cutoffs of 5%, 10%, 2%, etc.) to be downloaded and used.
The middle button is for selecting your 23andMe file. This starts off orange, and turns green when an appropriate file has been selected. It won't allow selection of a non-23andMe file.
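For anyone curious how a "non-23andMe file" might be rejected, here is a minimal sketch in Java. It is not the program's actual code: the class and method names are made up for illustration, and it only assumes the usual raw data layout (comment lines starting with "#", then tab-separated rsid, chromosome, position, genotype).

```java
import java.util.List;

// Hypothetical helper for illustration only; not taken from the program.
final class RawDataCheck {

    /**
     * Returns true if the first non-comment line looks like a 23andMe
     * raw data row: four tab-separated columns (rsid, chromosome,
     * position, genotype).
     */
    static boolean looksLike23andMe(List<String> lines) {
        for (String line : lines) {
            if (line.isEmpty() || line.startsWith("#")) {
                continue; // skip blank lines and header comments
            }
            String[] cols = line.split("\t");
            return cols.length == 4
                    && (cols[0].startsWith("rs") || cols[0].startsWith("i"))
                    && cols[2].matches("\\d+")              // position is a number
                    && cols[3].matches("[ACGTDI-]{1,2}");   // e.g. AG, A, DD, --
        }
        return false; // no data rows at all
    }
}
```

Checking only the first data row keeps the test fast on a large file; a stricter check could sample more rows.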
The bottom button is for running the program. I think it starts off yellow, then turns green when running. A progress bar appears while the analysis runs (it takes about 30 seconds), with each "hit" added to a tab-separated table as it is found. The buttons can't be pressed while the program is running.
When it's done running, you have a tab-separated list of all variations, in order by chromosome and location (sample output is attached to this post). They are listed by 23andMe rsID or "i" number, and the rare allele, its prevalence (converted to a percentage instead of a decimal number), and the user's genotype are also listed. These results can be copied and pasted into a text (or other) file, and there's a button for converting them into a .csv (comma-separated values, useful for opening with Excel or similar).
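The two conversions mentioned above (decimal to percent, and tab-separated to comma-separated) could look roughly like this in Java. This is a sketch under my own assumptions, not the program's code: the class and method names are hypothetical, and two decimal places is an arbitrary choice.

```java
import java.util.Locale;

// Hypothetical helper for illustration only; not taken from the program.
final class ResultFormat {

    /** Formats a prevalence given as a fraction (0.0042) as a percentage ("0.42%"). */
    static String toPercent(double fraction) {
        return String.format(Locale.ROOT, "%.2f%%", fraction * 100.0);
    }

    /** Converts one tab-separated row to a CSV row, quoting fields where needed. */
    static String tsvRowToCsv(String tsvRow) {
        String[] fields = tsvRow.split("\t", -1); // -1 keeps trailing empty fields
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < fields.length; i++) {
            if (i > 0) {
                sb.append(',');
            }
            String f = fields[i];
            if (f.contains(",") || f.contains("\"") || f.contains("\n")) {
                // CSV convention: wrap in quotes and double any embedded quotes
                f = "\"" + f.replace("\"", "\"\"") + "\"";
            }
            sb.append(f);
        }
        return sb.toString();
    }
}
```

Using Locale.ROOT keeps the decimal point a "." regardless of the user's regional settings, which matters if the file is later opened in Excel.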
We also need to add a flag for homozygous results (in text form), and I think we might try to add in "rs" numbers where 23andMe has used "i" numbers, to make it easier to find additional info. We should probably add location back in too, so that anyone can sort the results back into proper order after playing with them in Excel.
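A text flag for homozygous results could be worked out by counting how many copies of the rare allele appear in the genotype. The Java below is only a sketch of that idea: the class name and the label wording are placeholders, and it assumes genotypes arrive as strings like "AG", with "--" meaning no call.

```java
// Hypothetical helper for illustration only; not taken from the program.
final class Zygosity {

    /** Returns a text label describing how many copies of the rare allele are present. */
    static String flag(String genotype, char rareAllele) {
        if (genotype == null || genotype.isEmpty() || genotype.contains("-")) {
            return "no call";
        }
        int copies = 0;
        for (int i = 0; i < genotype.length(); i++) {
            if (genotype.charAt(i) == rareAllele) {
                copies++;
            }
        }
        if (copies == 0) {
            return "rare allele absent";
        }
        if (genotype.length() == 2) {
            return copies == 2 ? "homozygous" : "heterozygous";
        }
        return "single copy"; // one-letter genotype call
    }
}
```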
Insertions and deletions aren't included currently, for a few reasons: 1) the raw data for prevalence rates is pretty garbled, 2) most insertion/deletion SNPs don't have prevalence data anyhow, and, related to this, 3) 23andMe has trouble detecting insertions and deletions reliably. This is probably something we'll look into more, but it would delay the project considerably if we attempted to get it working at this stage.
But basically ... it's a small download, it's easy to use, it's fast, and thus far it seems very accurate. For the 1% and under bunch, it gives me a list of 139 rare SNPs. I've looked into the first 70 and confirmed that they are indeed rare according to other sources, and it's reporting the rare alleles and my genotype accurately.
I still need to compare it to similar programs on the internet, to see if there's something they're picking up which we're missing. This is unlikely, since we're generating more results, possibly due to using the most recent prevalence data from 1000genomes, or due to cutting off at 1% instead of under 1%. Cutting it off under 1% does keep the list much smaller, but I think that also creates an artificial distinction where there shouldn't be one: an SNP with a prevalence of 1.0% is very nearly as relevant as an SNP with a prevalence of 0.99%, so it doesn't make much sense to flag one and ignore the other.
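The difference between "at 1%" and "under 1%" comes down to a single comparison operator. A tiny Java sketch, with made-up names and prevalence given as a fraction rather than a percentage:

```java
// Hypothetical helper for illustration only; not taken from the program.
final class Cutoff {

    /**
     * Decides whether an allele counts as rare.
     * inclusive = true  means "at or under the cutoff" (prevalence <= cutoff)
     * inclusive = false means "strictly under the cutoff" (prevalence < cutoff)
     */
    static boolean isRare(double prevalence, double cutoff, boolean inclusive) {
        return inclusive ? prevalence <= cutoff : prevalence < cutoff;
    }
}
```

With a cutoff of 0.01, an allele at exactly 1.0% is kept by the inclusive test and dropped by the strict one, while an allele at 0.99% is kept by both.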
This problem disappears above the 1% level, as after that everything is rounded to a whole percentage. So you only get decimal points under 1%, but otherwise everything is rounded to 2%, 4%, etc. The downside is that you're getting a lot more SNPs: in the neighborhood of 139 instead of 53. The upside is that you're getting more homozygous results, which are the ones most likely to cause major malfunctions. In fact, the 10% cutoff is likely to produce a huge list, but will pick up on homozygous results that are essentially in the 1% or below range for genotype, yet would otherwise be missed due to each minor allele being too common.
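To put a number on that last point: if you assume Hardy-Weinberg proportions (my assumption here, and only an approximation for real populations), the expected share of people carrying two copies of an allele is the allele frequency squared. A minor allele at 10% therefore gives a homozygous genotype at roughly 0.10 x 0.10 = 0.01, or about 1%. As a small Java sketch with a hypothetical class name:

```java
// Hypothetical helper for illustration only; not taken from the program.
final class GenotypeFrequency {

    /**
     * Expected frequency of the homozygous genotype for an allele,
     * assuming Hardy-Weinberg proportions (allele frequency squared).
     */
    static double homozygous(double alleleFrequency) {
        return alleleFrequency * alleleFrequency;
    }
}
```

By the same arithmetic, a 5% allele gives a homozygous genotype at about 0.25%, and a 2% allele gives about 0.04%.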