De Profundis

Blog entry posted by anciendaze, Aug 10, 2015.

Back in 2003, the Human Genome Project found essentially all of the "coding DNA", the part that is transcribed into RNA and then translated into proteins. The first draft of a human genome was declared complete, though there were certainly loose ends to be tied up.

One surprise at the time was that the protein coding genes identified were far less numerous than expected. Humans have 22,000, but mice have 23,000, even fruit flies have 17,000, and marbled lungfish have us beat by a mile at 133,000,000,000 base pairs and 30,000 genes. Even onions have a genome 5 times the size of ours. This made it more difficult to put humans at the top of the list, where all humans know they belong.

By the 10th anniversary of this event, there was a decidedly less rosy view about the medical benefits of that accomplishment. There were fewer easy genetic disorders discovered than expected, and many had already been identified without offering realistic possibilities for treatment. Even Nancy Wexler, who had tracked down the gene responsible for Huntington's disease, and developed a test for it, decided not to find out if she would develop the disease later in life. We know the cause of Huntington's pretty well, but we still have little we can do about it.

Another big-science project, ENCODE, had been launched to find out what might be going on in parts of the genome where we had not expected much activity. The goal was to build an encyclopedia of all functional DNA elements, and gain some idea of what they were doing.

Scientific disputes about what ENCODE accomplished continue, but a semi-popular account of the changes in understanding can be found in the recent book The Deeper Genome by John Parrington. This was my starting point for rethinking what we have learned from genomics.

This was the point in history where another acronym turned up in scientific literature, GWAS: Genome-Wide Association Studies. The goal of these was to associate particular elements of the genome with specific diseases. (If you don't like that acronym, I'm afraid the alternative WGAS, Whole Genome Association Study, is no better.)

Complex diseases like arthritis, asthma or schizophrenia didn't turn up single genes which might cause the disease, in the vast majority of cases. What was found instead turned out to be hundreds of genes with variants that might increase risk by a few percent. What can you do with that kind of information?

One feature common to many of these studies was reliance on SNPs, single-nucleotide polymorphisms. This is probably a weakness, because we now know that genomes are much more active in ways that don't fit common assumptions about tiny random mutations limited to single base-pairs. We now have thousands of possible associations with diseases, and questions about what they mean keep arising.

If you wanted to find one feature of humans that would return them to a position of smug superiority, you would have to change the criteria you use. What do we have a lot of? The answer is "junk DNA", about 98.8% of the genome. (One reason you may see different figures for this is that most genes themselves contain "introns", stretches which are cut out of the RNA transcript when it is spliced together, before it is passed on to protein synthesis. Do you count introns?) In mice only about 95% is non-coding "junk". What is more, a lot is going on in there that nobody had predicted.

What about all that non-coding RNA (ncRNA) stuff being transcribed from DNA in the genome which does not go on to produce proteins? Is it meaningless and nonfunctional? Hardly. For 20 years we have known about RNA interference limiting expression of genes. Part of this can be described as simple interactions between RNA sequences which pair up with single strands of RNA loose in cells, essentially jamming the machinery. This is far from the end of the story.
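The "pairing up" described above can be illustrated with a toy calculation. This is only a minimal sketch, not a model of real RNA interference: it assumes perfect pairing (A with U, G with C), ignores mismatches and the proteins involved, and uses sequences and function names invented for the example.

```python
from typing import List

# Pairing partners for RNA bases (simplifying assumption: no G-U wobble pairs).
PAIRS = {"A": "U", "U": "A", "G": "C", "C": "G"}


def reverse_complement(rna: str) -> str:
    """Return the sequence that would pair with `rna`, read in the usual direction."""
    return "".join(PAIRS[base] for base in reversed(rna))


def binding_sites(short_rna: str, target: str) -> List[int]:
    """Return 0-based start positions in `target` where `short_rna` pairs perfectly."""
    site = reverse_complement(short_rna)
    return [
        i
        for i in range(len(target) - len(site) + 1)
        if target[i : i + len(site)] == site
    ]


# Hypothetical sequences, made up for this illustration.
short_rna = "UAGCUUAUC"
target = "GGAUAAGCUACCGAUAAGCUA"

print(reverse_complement(short_rna))
print(binding_sites(short_rna, target))
```

Every position returned is a place where the short sequence could latch onto the longer strand. Real cells tolerate imperfect matches, which this sketch deliberately leaves out.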

There is a family of proteins called argonaute (for reasons that would take us way off-topic) which can use short RNA sequences to identify where to operate on longer strands.

One enzyme that works alongside them, called Dicer, is very familiar to molecular biologists because it has been a useful tool in experiments. (Yes, there is also a Slicer, the nickname for the RNA-cutting activity of some argonaute proteins. I knew you'd ask.) An argonaute protein which has been studied extensively is found in Thermus thermophilus, a species of bacteria which made an important contribution to the development of PCR. When very similar molecules are produced by DNA sequences found both in humans and in bacteria that live in hot springs, you should guess that those sequences are highly conserved, and for some good reason.

We now know about several families of RNA sequences with regulatory functions: siRNA, miRNA, rasiRNA and piRNA. These are generally quite short, as the name of siRNA, "short interfering" RNA, reveals. Another group, which operates through a different pathway, has been dubbed microRNA (miRNA). Still others, rasiRNA, are associated with repeating sequences. The final group, piRNA, is of most interest to me today.

The origin of this terminology goes back to experiments with fruit flies which showed problems with reproductive organs. The original name was "P-element induced wimpy testis". The subgroup of Argonaute proteins found here was named piwi. (And you thought molecular biologists were sober people.) The corresponding RNA sequences are piRNA, short for piwi-interacting RNA.

While these sequences are not highly conserved, the clusters of them in genomes do appear to be conserved. The astonishing thing we've learned here is the sheer variety of piRNA sequences. A lower estimate is 55,000 types, with upper bounds vaguely around 800,000. This sequence diversity is reminiscent of the diversity of antibodies.

The proposal has been put forward that these act to regulate parasitic DNA/RNA like that inserted by viruses, particularly retroviruses. They may also regulate transposons, which make analysis of genomes much more complicated by moving around. Humans have a limited number of known active transposons, and some of these are definitely associated with diseases. The species where transposons were discovered, Zea mays (maize), has 85% of its genome made up of these mobile elements. Something is keeping human genomes from going the way of corn, and piRNA sequences are reasonable candidates.

During the debate over XMRV, I had been wondering why all the retroviruses to which humans have been exposed have not completely taken over our genomes. Lots of retroviruses can infect human cells in vitro. A concentration of defenses against retroviruses in the testes makes sense from an evolutionary perspective. Keeping them out of the germline is more important in evolutionary terms than even the survival of an infected individual. Anyone who figures out how to make this work against HIV might expect a Nobel Prize.

This is like discovering a whole new immune system. Standard discussions in immunology center on antibodies, which are proteins generated by randomly rearranging small DNA sequences to generate some which happen to match particular molecular shapes identified as antigens. You might see this as evolution in action inside your own body, just at a lower level than complete organisms. Please note that those thermophilic bacteria do not have anything like a vertebrate immune system to protect them from infection, and viruses targeting bacteria (phages) are well known. We seem to be discovering biological defenses in the reverse of the order in which they arose in nature.

Immunologists are still coming to grips with evidence that immune cells can communicate via peptides, sequences of amino acids shorter than those which fold into proteins. The idea that there is a very active defense against parasitic DNA/RNA operating below the level of cells and proteins will take much longer to digest.

Does this overview exhaust the subject of ncRNA? Certainly not. There are also "long non-coding" (lncRNA) sequences which directly affect gene expression, and become involved in epigenetics. Most of you here will have heard about DNA methylation, proposed in one particular case as therapy for malfunctioning genes. You may not have heard about histone acetylation, which is another way of producing epigenetic changes without actually modifying genes. Histones are the proteins which provide the structure for nucleosomes, the spools around which DNA is wound.

Reversible modification of histones can either suppress or enhance expression of particular genes. What is more, this can be passed from one generation to the next even if the DNA remains unchanged. We don't know how far such inheritance can go, but we do know that environmental factors affecting your mother or grandmother may affect you. There are even epigenetic changes passed from male parents.

It turns out that these complicated 3-D structures of DNA on chromosomes enable DNA sequences far apart in the linear sequence to interact. Narrow concentration on sequences and things immediately adjacent in the linear sequence would completely miss this. Even worse, from the standpoint of simplicity, is the discovery that 3-D structures within the nucleus change dynamically in normal life, DNA origami. At present we know very little about dynamic changes.

There is still a great deal of controversy about what constitutes functional DNA. Some of the higher figures quoted are 80% of the genome, with even the possibility that this will ultimately turn out to be close to 100%. There are critics who claim the whole ENCODE project was mismanaged from a scientific standpoint, and that the result should be much closer to earlier estimates.

What gets overlooked in the controversy is that even the critics are saying that 9% of the DNA is active and functional. Depending on what you consider the previous percentage of non-junk DNA to be, this could be 4.5 or even 9 times earlier estimates. This illustrates a huge gap in our understanding.
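The arithmetic behind those multiples is easy to check. In the sketch below, the 9% figure comes from the text; the 2% and 1% baselines are my assumption about which earlier estimates are implied by "4.5 or even 9 times", and the 1.2% baseline is simply 100% minus the 98.8% "junk" figure quoted earlier in this post.

```python
def fold_increase(new_pct: float, old_pct: float) -> float:
    """How many times larger the new percentage is than the old one."""
    return new_pct / old_pct


critics_pct = 9.0  # functional fraction conceded by the critics, per the text

# Earlier estimates of non-junk DNA (the 2% and 1% entries are assumed baselines).
baselines = {
    "2% (assumed)": 2.0,
    "1.2% (100% minus 98.8%)": 1.2,
    "1% (assumed)": 1.0,
}

for label, pct in baselines.items():
    print(f"{label}: {fold_increase(critics_pct, pct):.1f} times")
```

Whichever baseline you prefer, the jump is several-fold, which is the point of the paragraph above.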

If you can find a match with some known problem you may save a life, and Stephen Kingsmore has done just that with technology that can sequence the whole genome of a sick infant in less than 24 hours. I've read that he was able to identify the genetic disease in 28 out of 44 cases, and could recommend a treatment in 14 cases. Perhaps 5 infants have been saved so far as a result. Unfortunately, this is still a minority of cases.

We are still very much in the dark about what a lot of this genetic material is doing. A great deal of our understanding of basic biology, including human biology, is in the process of changing.

I really wonder how the medical profession is going to deal with this sea change in understanding.

My thoughts in this regard are not original. Consider this historical figure who saw one of the first balloon ascents:

About the Author

As the name suggests, I am old and dazed. The avatar illustrates my rule of thumb: "Hang on! This ride isn't over."
  1. ambieex3
    Very well written. Great job!
  2. anciendaze
    Some private feedback on this subject has prompted thoughts on why the genome exposed in 2003 has yet to produce the dramatic opportunities for therapeutic intervention then expected. There are banal reasons which miss the fundamental flaw in reasoning exposed by ENCODE.

    The problem with this whole current line of research is that we have been concentrating on sequences that are conserved, and often highly-conserved. This makes sense for studying fundamental biology and evolution, but it has a strong tendency to show you things you can't change. Some features may have been of overwhelming importance 1.5 billion years ago, when eukaryotic cells appeared, but are now so interconnected with other parts of the genome that any attempt to alter them would be very likely to be catastrophic. We have concentrated on those things we are least likely to be able to change, as demonstrated in the case of Huntington's disease.

    When patients are generally able to walk, talk, see, hear and rationalize, even if they are not reasoning well, it is practically certain there are compensating mechanisms in action. Serious disruption of fundamental processes is more likely to result in death. Simply changing things that appear abnormal is very likely to interfere with compensation.
  3. anciendaze
    We have also, as I've emphasized before, viewed medically-important measurements as quasi-static, and ignored variation as random -- despite strong evidence of nonlinear deterministic behavior. The word games we play center on the concept of "homeostasis", but this is demonstrably wrong when applied to quantitative vital measurements like heart rate.

    Purely homeostatic behavior would say that beat intervals near the mean value would be more likely to persist than those away from that value. Actual heart rates exhibit antipersistence near mean values. The natural pacemaker is constantly "hunting" for the right beat interval to match changing conditions, even when you do your best to make those conditions constant. Loss of this "hunting" behavior is a very bad prognostic sign. This finding is far from isolated. Variations in breathing and gait also exhibit similar departures from common medical assumptions.
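One crude way to put a number on "persistence" versus "antipersistence" is to ask whether a step up in the beat interval tends to be followed by another step up or by a step down. The sketch below is only an illustration under my own assumptions: the two series are invented, not measured heart data, and the lag-1 autocorrelation of successive changes is a much simpler statistic than the ones used in serious heart-rate-variability work.

```python
from typing import List


def increments(series: List[float]) -> List[float]:
    """Successive changes: x[1]-x[0], x[2]-x[1], ..."""
    return [b - a for a, b in zip(series, series[1:])]


def lag1_autocorrelation(values: List[float]) -> float:
    """Correlation of each value with the next one (mean removed)."""
    mean = sum(values) / len(values)
    dev = [v - mean for v in values]
    denom = sum(d * d for d in dev)
    if denom == 0:
        raise ValueError("series has no variation")
    return sum(a * b for a, b in zip(dev, dev[1:])) / denom


# Hypothetical beat intervals in milliseconds, invented for this example.
hunting = [800, 812, 798, 810, 796, 809, 799, 811]    # keeps reversing direction
drifting = [800, 802, 805, 809, 812, 814, 815, 815]   # steps tend to continue

for name, series in [("hunting", hunting), ("drifting", drifting)]:
    r = lag1_autocorrelation(increments(series))
    label = "antipersistent" if r < 0 else "persistent"
    print(f"{name}: r = {r:.2f} ({label})")
```

A negative value means each change tends to be undone by the next one, which is the "hunting" pattern; a positive value means changes tend to continue in the same direction.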

    I am willing to bet that many ME/CFS patients show disturbances in gait which are not due to mechanical damage to joints. Like most other problems they experience this will be ignored.

    Classifying genomic activity which is not well conserved as random fits this pattern of intellectual blindness to dynamics. Yet dynamic processes like RNA interference are not only likely to carry information about pathology, they are also far more likely to be accessible to therapeutic manipulation.