Big Data App to Explore Genomes for Clinical Relevance, Rare Variants, Drug Response, etc (Free)

kday

Senior Member
Messages
297
Likes
128
I am an ME/CFS patient and established member of this forum. I built a service called HGPAT, which stands for Human Genome Pathogenic Allele Tool. It is ready to be tested right now! I am looking for a better name, so new name suggestions are very welcome. This is a research tool so both citizen scientists and established scientists/researchers can make sense of raw genetic data when it comes to disease and risk factors. It's optimized for Desktop/Laptop/Tablet/Mobile, but I suggest Desktop/Laptop/Tablet to look at genome files as the app is very data heavy.

It's a powerful genomics tool so it may have a learning curve, but is extremely easy to use and has a UI that is meant for humans.

Just upload your 23andMe or AncestryDNA raw data here:
➡️ https://hgpat.tinybox.io

This is not the permanent domain. This is a temporary domain for user testing. The web application will be moved when it's ready, and I will update the link when that happens.

The app shows rare variants <1% Minor Allele Frequency (MAF), uncommon variants <5% MAF, other variants that the algorithm thinks could be relevant, and lists genetic conditions that are reviewed by an NCBI assigned Expert Panel as well as variants that are in the Genetic Testing Registry (GTR).
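To make the frequency cutoffs above concrete, here is a minimal Python sketch. It is only an illustration and not the app's actual code: the function name `classify_by_maf` and the "common" label for everything at or above 5% are assumptions, and MAF is assumed to be given as a fraction (0.01 means 1%).

```python
def classify_by_maf(maf: float) -> str:
    """Bucket a variant by Minor Allele Frequency (MAF), given as a fraction.

    Thresholds follow the post: rare is below 1%, uncommon is below 5%.
    The "common" label for everything else is an assumption for illustration.
    """
    if not 0.0 <= maf <= 1.0:
        raise ValueError("MAF must be a fraction between 0 and 1")
    if maf < 0.01:
        return "rare"
    if maf < 0.05:
        return "uncommon"
    return "common"


if __name__ == "__main__":
    # Hypothetical frequencies, not taken from any real variant
    for example_maf in (0.002, 0.03, 0.25):
        print(example_maf, classify_by_maf(example_maf))
```

Note that a variant exactly at 1% falls into the "uncommon" bucket here, since the post describes rare as strictly below 1%.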

autoannotate.png

The app compares all uploaded data to ClinVar, which includes almost 500,000 submitted variants. For each variant, there are many relevant links to third party genetic websites for research. All associated diseases from ClinVar are listed for each variant, along with descriptions of the diseases when you click on them. It doesn't tell you what you definitively have and definitively don't have like Promethease. You are the interpreter. All data is pulled from databases and "curated" by algorithms instead of pulled from a hand-curated Wiki. The information, including the written summaries and written predictions for each variant, is completely automated. It assists you in research, and in some ways, could be considered more powerful than Promethease for disease/variant research and discovery. It uses Big Data techniques to predict the relevance of SNPs and tells you whether or not it thinks the variant will have a negative impact on enzyme function (no matter if it's classified as pathogenic, benign, etc). The app utilizes a database called CADD that uses machine learning and other techniques to rank how deleterious variants are. Very cool stuff.

It currently works with 23andMe and AncestryDNA. It also works with Whole Genome Sequencing (WGS/WES) files, but I don't currently have the servers configured to handle this much data. It's actually extremely efficient at Whole Genome Data processing. The biggest problem is that I currently don't have an uploader installed to send the data in chunks, as you can be uploading upwards of a gigabyte of data. If you want to process your whole genome data, send me a message!

It takes less than a minute to compute your 23andMe and Ancestry raw data and less than two minutes to process WGS/WES data after upload. However, support for WGS/WES is not currently configured, so please don't try as it will not work.

As some of you may know, I created Genetic Genie years ago. I've been working on this project for the past year. It is a complex project that took a lot of thought, brain power, and development. It's one part of the overall picture for an update of Genetic Genie. I wasn't physically or mentally capable of doing these complex tasks for many years, so sorry about the very slow updates! It is a research tool, not a diagnosis tool.

The app is set up to be tested. Please feel free to upload as many genomes as you like. Since the app consumes a lot of computing power and resources, it is set to automatically scale in the cloud. The more data that is run, the more I can evaluate how well the app scales! Your browser is confined to the same server instance, so if you run multiple files at the same time in the same web browser, processing of subsequent data will be queued until the other data is finished processing.

Oh, and please report BUGS on this thread! 🐞🐛🤢


Current Limitations and Bugs:

1) While 23andMe proprietary SNVs and indels are supported (a ton of work!), they are not always 100% accurate. This is because 23andMe uses proprietary identifiers for a lot of variants. To my knowledge, no one has completely reverse engineered 23andMe's proprietary identifiers in a reliable fashion. This is because several rsIDs and indels (insertions, deletions, etc) can co-exist at the exact same chromosome/position. Without knowing the reference and alt alleles, you cannot convert them all with 100% accuracy. 23andMe doesn't offer this data to the public. A workaround may be possible but takes a lot more programming and it is not part of this release. Often a variant and an indel at the very same location share the same clinical relevance, but this isn't always the case. [A successful workaround was implemented, and this issue is now considered resolved. Non-proprietary identifiers are now accurately called in 23andMe data even if there are multiple at the same chromosomal position.]

2) Because of the above limitation, we are not currently reporting more than one variant that is at the same exact position with 23andMe data. This may or may not change the results you see.

3) The above limitations are not limitations with AncestryDNA data as they do not use proprietary identifiers. However, because of the [previous] 23andMe limitations and trying to keep code consistent, AncestryDNA will not report multiple variants at the same location. This should be considered a bug that will be corrected.

4) Because these testing companies have used many chips over the years, their data is not always consistent and can be ambiguous between chips. For this reason, some data may not show up. For example, a founder mutation for BRCA1 that's common among Ashkenazi Jews won't show up. This is because of a combination of ambiguous data and not being able to tell exactly which mutation is being reported across different chip versions. For example, one 23andMe chip version may show the reference allele for a deletion as II, and another version may show it as DD. For deletions the reference allele should always be II and for insertions it should always be DD, but 23andMe scientists (in combination with Illumina, who makes the chips) aren't always consistent with their notation.

5) Tens of thousands of variants have been manually looked through and have had the data corrected. This doesn't mean that there won't be variants that are incorrect. However a lot of care was taken when filtering out noise and "bad variants." Information that you may find irrelevant will be returned along with data you find relevant. This is not a bug. This is by design as it's a robust research tool and some may find relevance in things that most others would consider irrelevant.

6) Drug responses may be corrected towards the risk of a good/bad response to the drug instead of the reference allele or minor allele frequency. This is intentional. However, the drug section can be confusing as sometimes a heterozygous variant carries the biggest risk, etc. This section will likely need some work to make the information more clear and easier to decipher.

7) Included is a text-based javascript filter so you can narrow down variants of interest. There are plans to create a button with a bunch of tools to filter variants in various ways. I'm considering creating "sort by" tools to reorder results as long as it proves to be efficient on the client end with javascript.

8) You currently can't download CSV files for results. This is a feature that many citizen scientists and established scientists would want. I'm trying to think of a way to maintain privacy and store as little data as possible on the servers. Right now, the servers are configured so I generally cannot see information about genetic data. I am exploring ways to dynamically generate CSV files without storing such information on the servers themselves. A bit of a puzzle headache. [I may have a potential way to solve this issue as well as maintain strict privacy.]

9) Data is deleted automatically after upload. Sometimes this fails if the query doesn't finish. In this case, these de-identified files will be automatically discarded (hopefully) within 24 hours or sooner. There are still some bugs though around discarding files. Nevertheless, files disappear by design as server instances start and stop (which is very often).

10) I updated the link names to use the word "Variants" instead of "Mutations." I also updated the language in the blue boxes on each page as it wasn't written well. However, I didn't deploy these updates properly. Oops! A minor bug I guess, but still a bug. [This was fixed, but I decided to continue to use the word "mutations" on the links.]

Privacy & Terms of Service

No official privacy policy or terms of service have been created yet. In general, the service collects about as much information about you and your data as a cheap calculator. There are internal logs, which can list things like IP addresses, but I don't generally look at them unless there are problems. Google Analytics is not used during this testing phase, so there are currently no tracking/analytics cookies either! There is only one cookie set so the load balancer can identify your web browser and know how to scale the web application for you and other users.

It's possible that I will see de-identified genomes during routine maintenance in the case they aren't already automatically scrubbed. This data will never be distributed/shared in any way. There is no identifying information on the file names, in the files, etc.

While I believe the service to be secure, hacks happen. In the case of a compromising hack, it's possible for de-identified genetic information to be re-identified (such as if your genetic information is on other databases associated with your name or alias). In the event that law enforcement or any other authority requests data, it's unlikely that we'll be able to provide them with much, if any data because of the architecture of the system. Though it's certainly possible that an authority could get access to other information such as website logs. Of course, I would never expect a genomics research tool to ever be of interest to law enforcement or other authorities.

It's certainly possible that genetic disease can be discovered with this tool. Some may find this type of information exciting and others may find it scary or worrying, especially when they are looking at their own genome or the genome of a person they care about. If you are afraid of what you might find, do not use this tool! Since you were told this is a research tool and not a clinical diagnostics tool, we hold no liability if the service makes you seek a healthcare provider or further genetic testing. As I've stated numerous times, Big Data tools like this can and will have bugs, and therefore 100% accuracy can never be guaranteed. It can show a disease-causing variant that you don't really have as well as miss variants, giving a false sense of reassurance. Doing research can help determine if the variant is accurately portrayed, but 23andMe, Ancestry, and WGS/WES data can also be wrong. We cannot vouch for the accuracy of third party data.

While it's possible that the data was misinterpreted by the service, it's also very possible for the user to misinterpret the data. This can also cause anxiety and cause one to seek medical care, especially if they are looking at their own genome or the genome of a person they are close to. Do your research, as links to many amazing research websites are provided for each variant!

Aside from the databases I created myself, databases used were generally obtained from sources such as the U.S. Government (NCBI/NIH), Universities, and other third parties that provide public access to their information. I reserve all rights to my intellectual property and methods used to create and display data. I believe the look and feel of the data is very unique to this app and consider the styling of it as a creative work of art. Users can save copies of genomes by downloading/saving a website copy of the information generated. At this point in time, I do not give rights for the data to be used for commercial purposes without explicit permission and do not give rights for the generated data to be sold or distributed without explicit permission.

Annotation 2019-05-14 032620.jpg Annotation 2019-05-14 032149.jpg Annotation 2019-05-14 031738.jpg autoannotate.png
 

ScottTriGuy

Stop the harm. Start the research and treatment.
Messages
1,253
Likes
5,061
Location
Toronto, Canada
Wow, thanks @kday.

Uploading my 23andme data was super easy, and the results came back ultra fast, so great job.

My challenge is interpreting the results. In my initial perusal, I've focused on the homo (red) results, but I am not educated enough to understand the ClinVar database.

I am particularly interested in my numerous homo results related to HIV, as I am HIV+.

For example, in the attached screenshot: Coronary artery disease, development of, in HIV

I'm assuming this means I am genetically at higher risk for coronary artery disease because I am HIV+. But does this take into account that I am taking HIV medications?

I'm also homo for: Acquired immunodeficiency syndrome, slow progression to

Which I take to mean that without HIV meds, I would be a slow progressor to AIDS.

Anyway, very interesting output with lots to explore. Thanks for sharing freely.
 

Attachments

kday

Senior Member
Messages
297
Likes
128
So the plan is to make tutorial videos (or perhaps making an online class) that teaches people some basics about genetics, raw data, and Whole Genome Sequencing (WES/WGS) data.

I've learned a lot myself just making this and would like to share what I learned with others.

My initial idea is maybe a 5-10 minute video explaining how to make sense of your results.

If you are looking for disease-causing variants (or even major risk factors), I would look for the terms "autosomal dominant" and "autosomal recessive." If it says "autosomal dominant," that means heterozygous (yellow) can cause disease. If it says "autosomal recessive," it means homozygous (red) can cause disease.
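As a rough sketch of that rule of thumb in Python (an illustration only, not how the app works internally): the function name `may_cause_disease` and the plain-text labels are hypothetical, and treating a homozygous result as relevant for a dominant condition is my own assumption, since one copy is already enough.

```python
def may_cause_disease(inheritance: str, zygosity: str) -> bool:
    """Rule-of-thumb check combining inheritance pattern and zygosity.

    inheritance: "autosomal dominant" or "autosomal recessive"
    zygosity:    "heterozygous" (yellow) or "homozygous" (red)
    """
    inheritance = inheritance.strip().lower()
    zygosity = zygosity.strip().lower()

    if inheritance == "autosomal dominant":
        # One copy is enough, so two copies are assumed to count as well
        return zygosity in {"heterozygous", "homozygous"}
    if inheritance == "autosomal recessive":
        # Both copies need to carry the variant
        return zygosity == "homozygous"
    # Any other inheritance pattern is outside this simple rule of thumb
    return False


if __name__ == "__main__":
    print(may_cause_disease("autosomal dominant", "heterozygous"))
    print(may_cause_disease("autosomal recessive", "heterozygous"))
```

This only flags what is worth a closer look. It does not account for things like penetrance or whether the variant call itself is correct.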

Certainly, variants that are more rare are more likely to cause disease. But some more common variants can cause disease too, like the HFE SNVs for Hemochromatosis or MBL SNVs for Mannose-Binding Lectin Deficiency (an extremely common immune disorder that can affect up to 10-20% of the population and double to triple that in Africa).

So what you can do, for example, is use the filter at the top right and search for MBL. If you have multiple MBL SNPs (yellow or red), this can cause the immune deficiency. And I wonder how this relates to AIDS. That's something you can research if you have MBL SNPs, because I have never looked into it.

Also pay attention to "OMIM Allelic Variant." If one of the descriptions says something is an OMIM Allelic Variant, you can click on the name of the disease, scroll to the bottom of the modal window that pops up, and click on the "OMIM Allelic Variant" link to read about that specific variant from the curators at OMIM. It's one of my favorite resources and is maintained by Johns Hopkins Medicine. I don't think you can find a better resource when it comes to literature/summaries of disease-causing variants.

A lot of the other tools are cool too. Another one of my favorites is LitVar for research. It uses PubMed data, but displays/contextualizes the information in a new way and gives you a ton of options for filtering by disease, chemical/drug, etc.

Also pay attention to the CADD score. A CADD score of 20 is the top 1% of deleterious variants, 30 is the top 0.1%, 40 is the top 0.01%, and so on. It's sort of an AI prediction score, so it's not perfect (some diseases ironically have a low score and benign things can have a high score), but if you see a high score for something that looks interesting, it might be worth taking a look at!
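The pattern in those numbers (every 10 points is another factor of ten) is a PHRED-style scale, so the "top X%" figure can be worked out from the score as 10^(-score/10). A minimal Python sketch, with `cadd_top_fraction` being a made-up helper name for illustration:

```python
def cadd_top_fraction(phred_score: float) -> float:
    """Convert a PHRED-scaled CADD score into the fraction of variants
    ranked at least that deleterious (20 -> 0.01, 30 -> 0.001, ...)."""
    if phred_score < 0:
        raise ValueError("PHRED-scaled scores are not negative")
    return 10 ** (-phred_score / 10)


if __name__ == "__main__":
    for score in (10, 20, 30, 40):
        # Print the fraction as a percentage for readability
        print(f"CADD {score}: top {cadd_top_fraction(score):.2%}")
```

So a score of 25, for example, sits between the top 1% and the top 0.1%.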
 

ScottTriGuy

Stop the harm. Start the research and treatment.
Messages
1,253
Likes
5,061
Location
Toronto, Canada
Thanks, that has been helpful in focusing my efforts - and given me a couple of things already that I'll have my doc look into.
 

bertiedog

Senior Member
Messages
1,300
Likes
2,184
Location
South East England, UK
@kday I uploaded my data OK and the message said it would take a minute for the results, but it's now about 45 minutes and still nothing.

Do you think there is a problem with my data?

Thanks so much

Pam
 

kday

Senior Member
Messages
297
Likes
128
@bertiedog

I couldn't tell you why your first upload did that. Wish I knew why, but glad it worked on the second upload!

If it sits like that, it really should never be longer than a minute or two. Usually less.
 

Dan_USAAZ

Senior Member
Messages
143
Likes
168
Location
Phoenix, AZ
@kday , this is great! The upload/processing took only 54 seconds.

While I am still learning the user interface, it appears very easy to use. It brings together much of the important information that I previously had to research by manually jumping between other sites.

Thank you!
 

kday

Senior Member
Messages
297
Likes
128
One thing I've been considering is making an app for iOS/Android that can store computed data from many genomes locally, giving you the ability to both save computed results easily and have complete control over the privacy of genetic information.

Please thumbs up 👍 if you think you are someone who would use such an app. Some near future ideas.
 

rel8ted

Senior Member
Messages
304
Likes
951
Location
VA
So the plan is to make tutorial videos (or perhaps making an online class) that teaches people some basics about genetics, raw data, and Whole Genome Sequencing (WES/WGS) data.
Please, because my mind is frazzled!
 
Messages
1,197
Likes
3,164
I built a service called HGPAT, which stands for Human Genome Pathogenic Allele Tool. It is ready to be tested right now! I am looking for a better name, so new name suggestions are very welcome.
if the acronym was HugPat: you could at least say it!

PathAnon

GenPathAll

this is very exciting, and this is reminding me to: get that genetic data out of there!
 

kday

Senior Member
Messages
297
Likes
128
@rel8ted

When working on Big Data tools, I've been focusing on questions like "how do I make this accessible to everyone" and "how do I make a user interface that interfaces with a human well."

While I'd like to think I'm solving the "how do you make tools accessible to everyone" dilemma, it's harder to solve how do you make it easy to understand for those that are less technically minded or have limited knowledge about genomics.

And I think the truth is that this is not a consumer tool. It is a tool for citizen scientists and even real researchers. Even with a good user interface, concepts around interpretation have to be learned. I think almost anyone can learn how to use a tool like this. But it might take a short instructional video (for someone with more knowledge) or a short online course (for those with little to no knowledge about genomics).

My learning style is clicking around and poking things until I figure out how they work, which may not be the most common learning style. But I really think some short videos can help people grasp how to use this tool very quickly.

I can totally see why you are overwhelmed. I can tell you that it is not as complicated as it looks, but of course that statement will probably just sound silly and won't resonate with you.
 
Messages
1,197
Likes
3,164
Just downloaded my raw DNA from Ancestry!!.

Gee: I really learned a lot when I Aced Genetics in Graduate School (not, apparently)

I will attempt to see: what happens if I try to do this. Luddite here, generationally post-modern (ie: what is a computer, something you get somebody else to do for you?).
 
Messages
1,197
Likes
3,164
Thanks for the ideas though!
I worked for the Government. Am a professional at: detesting Acronyms.:rofl::whistle:

To the degree you can SAY IT, this keeps the brain continuing to read the sentence. When one cannot say it, it becomes a Brain Stop. :bang-head::bang-head::bang-head::bang-head::bang-head: Do kinda like GenPath. :hug: It also just sounds rather uplifting.
 
Messages
1,197
Likes
3,164
Just upload your 23andMe or AncestryDNA raw data here:
Sorry: I am Cognitively fog-ridden. Need simplifications and hand holding.

So: my ancestry file looks like a long list of plain text, the file name is not specified or unique to me. So: when I send it in, does it just turn around and send back out directly the answers? where is it sending these answers?

Sorry: clearly you're capable of computer brilliance, and I really appreciate it. I am: not.
 
Messages
1,197
Likes
3,164
@kday

Success: apparently: I just did it, and something worked. So I have results....Where do we go to worry?
"This all looks pretty interesting to me"
So these are considered a Web Archive? Does that mean I cannot print this out on paper, that it somehow goes back to your site and talks to it there? (sorry, we just don't do this, here).

Does this result continue to be available? Once it's been generated? Do I need to make a hard copy so it doesn't disappear?

(for Luddites only)....somebody above commented on: just play around with it. No, Luddites: stare, and are afraid, and think the machine might blow up, or the file: evaporate. So you don't: just play.
 

kday

Senior Member
Messages
297
Likes
128
@Rufous McKinney

YESSSS!!! Me too!

I've considered removing that one, but I've enjoyed seeing whether someone is likely to tolerate milk or not. 🐄🥛

"I've got bad news for you. You have a Genetic Condition. You are a mutant that tolerates milk."