reasoning about sequences and algorithms

anciendaze · Dec 22, 2010

The debate on XMRV and the recent papers in Retrovirology concerning contamination is going so hot and heavy I don't have time, under present conditions, to keep up. This thread is devoted to some background on the reasoning commonly used in arguments about sequence data. I hope others will supply links to appropriate parts of the debate, or other background material which will make this discussion more intelligible to general readers. Perhaps a moderator can edit this post.

Since the referenced papers generally ignore results of culture tests and transmission electron microscope imaging, I feel safe in concluding the authors are basing their arguments on sequence data. The fact that a computer actually yielded a phylogenetic tree showing likely origin of XMRV sequences in a cultured cell line is taken as proof positive, by an unbiased arbiter, that the source must be contamination.

Let me go over the origin of those algorithms, which I am old enough to remember. The earliest sequence data did not come from DNA. At the time, the genetic code was a mystery. Hard data concerning polypeptide sequences in fragments of proteins was just emerging. When human beings look at sequences of apparently meaningless symbols of modest length they quickly run into difficulty deciding which "most resembles" which. At this point workers in the nascent field of molecular biology took the problem to their resident nerds in the computer field.

This was so far back said nerds could be identified by white shirts with pocket protectors and lots of pens/pencils. They tended to carry extra (magic) JCL cards with them, just in case the card reader ate the ones on the front of the current deck of punch cards. (The story actually begins before JCL, but that is a trivial distinction.)

The first problem for the biologist was explaining simplified Mendelian genetics. (It was not clear at that time that nerds were capable of reproducing, being unfamiliar with the opposite sex, to say nothing of required behavior.) Once the hurdles of the alien concepts were surmounted, the nerd would start looking for some feature of the problem which would make the job easy enough to be handled by the feeble electronic brains available at the time, preferably before the heat death of the Universe set in.

Early sequence data were not for obscure molecules; even the most fundamental were scarcely understood. A popular example was cytochrome C. So the biologist shows the nerd example sequences which could conveniently be encoded with fewer than 26 characters, even if the associated amino acid names were jaw breakers. He explains that they think the sequences for human cytochrome C must be closely related to cytochrome C in Pan troglodytes. They have clues they are 98% the same, in some sense not yet well defined.
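One way to pin down "98% the same" is percent identity: the share of positions at which two aligned sequences carry the same letter. The sketch below is only an illustration, assuming the sequences are already aligned and of equal length. The 50-letter strings are made up and are not real cytochrome C data, and `percent_identity` is a hypothetical helper name.

```python
def percent_identity(a, b):
    """Percentage of positions at which two pre-aligned, equal-length sequences agree."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    if not a:
        raise ValueError("sequences must be non-empty")
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return 100.0 * matches / len(a)

# Made-up 50-letter "protein" and a copy with a single substitution at position 7.
seq_a = "ACDEFGHIKL" * 5
seq_b = seq_a[:7] + "W" + seq_a[8:]

print(percent_identity(seq_a, seq_b))  # 98.0
```

Real comparisons also have to decide how to score insertions, deletions and chemically similar amino acids, which is where "in some sense not yet well defined" starts to bite.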

This is what the nerd is looking for, a simplifying assumption. The alien concept of sexual transmission of information, and phylogeny, provides another. Except for combinations formed through mating, and other messy situations, all information is transmitted vertically along the diagram.

As the bits and pieces of cytochrome C were grubbed from laboratory tests, biologists went through a series of emotional ups and downs. At some points, possible differences turned out to be mistaken. Wits suggested that perhaps there was no biological distinction, and that all differences between chimpanzees and humans were due to environment. It came as a relief when clear differences finally emerged.

The resulting algorithms encode the assumptions those nerds found useful in simplifying the problem, and output results as phylogenetic trees, because that was what they were asked to provide.
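To make the point concrete, here is a minimal sketch of one simple distance-based method (average-linkage clustering on mismatch counts). It is not claimed to be what any particular paper or package used. The sequences are invented, and `average_linkage_tree` is a hypothetical name. The thing to notice is that the function returns a tree for any input whatsoever, because a tree is the only shape it knows how to produce.

```python
from itertools import combinations

def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def leaves(node):
    """Leaf names under a node: a leaf is a str, an internal node is a 2-tuple."""
    if isinstance(node, str):
        return [node]
    return leaves(node[0]) + leaves(node[1])

def average_linkage_tree(seqs):
    """Join the two closest clusters until one binary tree remains."""
    clusters = sorted(seqs)  # each sequence name starts as its own cluster

    def dist(c1, c2):
        # Average mismatch count over all leaf pairs drawn from the two clusters.
        pairs = [(p, q) for p in leaves(c1) for q in leaves(c2)]
        return sum(hamming(seqs[p], seqs[q]) for p, q in pairs) / len(pairs)

    while len(clusters) > 1:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: dist(clusters[ij[0]], clusters[ij[1]]))
        merged = (clusters[i], clusters[j])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0]

# Invented toy sequences, labelled only for readability.
toy = {"human": "ACDEFGHIKL", "chimp": "ACDEFGHIKV", "yeast": "TCDQFGHWKV"}
tree = average_linkage_tree(toy)
print(tree)  # ('yeast', ('chimp', 'human'))
```

Feeding it three completely unrelated strings still yields a tidy tree, which says something about the program and nothing about the biology.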

Now we must look at the wider field of microbiology. Here we find bacteria, which are generally described as asexual (though like nerds they may have hidden depths). Most of the time they simply divide, passing all genetic information, except some due to errors in the process, on to their offspring. Indeed, there is no clear idea of which resulting bacterium is the original and which the offspring. This would be unusually simple, if that were the whole story.

Bacteria also have a behavior known as conjugation. At unpredictable times and places they will stick together and pass plasmids through pores or tubes: bacterial sex. This is called horizontal or lateral transfer of information. It need not obey Mendel's laws. Because the entire bacterial genome is likely to be much larger than the plasmids passed back and forth, the assumption of modest changes at each stage is not entirely invalidated.

There are also some limitations on kinds of bacteria likely to be involved. The concepts of genus and species have not completely disappeared.

When we come to viruses, there is a whole new ballgame. First of all, viruses are not even complete independent organisms. Secondly, recombination can take place in ways a mink breeder would not tolerate. A virus is not restricted to "mating" with the same, or even similar, species. Horizontal transfer of information looms large, and the entire viral genome is smaller than that of virtually any independent organism. Assumptions built into algorithms are violated.

Diagrams of heredity in viruses are only trees in degenerate cases. In general the diagram is properly called a directed acyclic graph (DAG). Algorithms for computing DAGs from sequence data are not in anything like the state of development of algorithms for computing phylogenetic trees. Those algorithms remain in use because they exist, are familiar, and often provide useful insights. Algorithms should not be confused with oracles.
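The difference between the two kinds of diagram fits in a few lines. In the sketch below an ancestry diagram is stored as a mapping from each strain to its list of parents. All strain names are invented, and `is_tree` and `ancestors` are hypothetical helpers. The check assumes there are no cycles and simply asks whether any node has more than one parent.

```python
def is_tree(parents):
    """True if no node has more than one parent (assuming the graph has no cycles)."""
    return all(len(p) <= 1 for p in parents.values())

def ancestors(node, parents):
    """Every ancestor of a node, following all parent edges."""
    seen = set()
    stack = list(parents.get(node, []))
    while stack:
        p = stack.pop()
        if p not in seen:
            seen.add(p)
            stack.extend(parents.get(p, []))
    return seen

# Purely vertical inheritance: each strain has at most one parent.
vertical_only = {
    "root": [],
    "strain_a": ["root"],
    "strain_b": ["root"],
    "strain_c": ["strain_a"],
}

# Add one recombinant with two parents and the diagram stops being a tree.
with_recombinant = dict(vertical_only, recombinant=["strain_b", "strain_c"])

print(is_tree(vertical_only))     # True
print(is_tree(with_recombinant))  # False
print(sorted(ancestors("recombinant", with_recombinant)))
# ['root', 'strain_a', 'strain_b', 'strain_c']
```

A tree-building program handed the recombinant has to attach it to one parent and quietly drop the other.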

Another measure of the problem comes from data showing the XMRV sequence 94% homologous with MLV, and 95% homologous with an endogenous sequence. This means the differences we are considering amount to only about 400 nucleotides of genetic information. How fast does such a change take place?
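The arithmetic behind "about 400 nucleotides" can be checked in a couple of lines. The genome length of roughly 8,200 nucleotides used here is an assumption for illustration, not a figure given in this post, and `differing_nucleotides` is a hypothetical helper.

```python
# Assumption: an approximate genome length of ~8,200 nucleotides (not stated in the post).
GENOME_LENGTH_NT = 8_200

def differing_nucleotides(genome_length, percent_identity):
    """Positions that differ, given a genome length and a percent identity."""
    return round(genome_length * (100 - percent_identity) / 100)

print(differing_nucleotides(GENOME_LENGTH_NT, 95))  # 410
print(differing_nucleotides(GENOME_LENGTH_NT, 94))  # 492
```

So a difference of 5 to 6 percent works out to somewhere between four and five hundred positions under that assumed length.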

In mice, we know it may take place on a time scale which can be observed in laboratories. What about in humans?

The best-studied human retrovirus at present is HIV-1 (once known as LAV or HTLV-III). A great deal of work and sequence data suggests this originated about a century ago when a strain of SIV found in chimpanzees jumped to humans through the food chain. There have been arguments in favor of other origins.

At present HIV-1 is known to have 48 major strains. In general these exhibit only about 50% homology with the presumed ancestral SIV. This means there have been changes in thousands of nucleotides in about 100 years. For a human retrovirus, a change of hundreds of nucleotides in 10 years seems extremely plausible. With this degree and rate of variability, arguments based on smaller numbers of differences should be considered weak. We would need hundreds of complete sequences before we could feel confident the apparent source would not change with the next added sequence.
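The rate estimate above is a straight-line extrapolation, which can be sketched as follows. The genome length of about 9,700 nucleotides is an assumption added for illustration. The 50% and 100-year figures are the ones quoted in the paragraph. The calculation ignores repeated changes at the same site and any variation in rate over time, so treat it as back-of-envelope only.

```python
# Assumption: an approximate HIV-1 genome length of ~9,700 nucleotides (not from the post).
HIV1_GENOME_LENGTH_NT = 9_700

def changes_per_decade(genome_length, percent_identity, years):
    """Naive linear rate: differing positions divided evenly over the elapsed time."""
    differences = genome_length * (100 - percent_identity) / 100
    return differences / years * 10

print(changes_per_decade(HIV1_GENOME_LENGTH_NT, 50, 100))  # 485.0
```

Under those assumptions the figure lands in the "hundreds of nucleotides in 10 years" range.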

A second inference from observed variability of human retroviral sequences is that the presumed great age of some endogenous sequences may be suspect. If on-going retroviral infections are still inserting sequences which vary the way HIV-1 varies, a resulting endogenous sequence which is assumed to be thousands of years old may be less than a century old. Molecular clock arguments can be very slippery.

August59 · Dec 22, 2010

Man - This had fuses popping in my head

oceanblue · Dec 22, 2010

anciendaze said:
Diagrams of heredity in viruses are only trees in degenerate cases. In general the diagram is properly called a directed acyclic graph (DAG). Algorithms for computing DAGs from sequence data are not in anything like the state of development of algorithms for computing phylogenetic trees. Those algorithms remain in use because they exist, are familiar, and often provide useful insights. Algorithms should not be confused with oracles.

Thanks, anciendaze. You seem to be saying that the phylogenetic maps used to compare viruses are intrinsically unreliable. Is that right, and are you also saying that therefore the phylogenetic findings of the Hue Retrovirology paper are not as conclusive as they appear? Thanks

jewel · Dec 22, 2010

This is fascinating; thanks for the explanation.

anciendaze · Dec 22, 2010

oceanblue said:
Thanks, anciendaze. You seem to be saying that the phylogenetic maps used to compare viruses are intrinsically unreliable. Is that right, and are you also saying that therefore the phylogenetic findings of the Hue Retrovirology paper are not as conclusive as they appear? Thanks

I'm saying the phylogeny may be a tree in special cases. There is no reason to assume this in all cases, and good evidence to the contrary. Ancestral trees are traditional assumptions, not necessary truths. Building assumptions into a program or a machine does not alter the fact that they are assumptions.

To assume a particular heredity based on present data is very risky.

Sean · Dec 22, 2010

Thanks for that, anciendaze.

ukxmrv · Dec 22, 2010

That makes complete sense Anciendaze, thank you.

SOC · Dec 22, 2010

Thanks for the useful info. Too bad the biologists had to ask for help from us feeble-minded, socially inept "nerds". I guess it was better than the option they had otherwise -- sitting around in a cold, dark cave waiting for the heat death of the Universe.

anciendaze said:
The first problem for the biologist was explaining simplified Mendelian genetics. (It was not clear at that time that nerds were capable of reproducing, being unfamiliar with the opposite sex, to say nothing of required behavior.) Once the hurdles of the alien concepts were surmounted, the nerd would start looking for some feature of the problem which would make the job easy enough to be handled by the feeble electronic brains available at the time, preferably before the heat death of the Universe set in.

Surprisingly to all educational bigots, we were not only familiar with the necessary concepts and techniques (WE didn't have to be taught basic biology concepts in college), we are reproducing quite nicely, thank you.

illsince1977 · Dec 22, 2010

anciendaze said:
This was so far back said nerds could be identified by white shirts with pocket protectors and lots of pens/pencils. They tended to carry extra (magic) JCL cards with them, just in case the card reader ate the ones on the front of the current deck of punch cards. (The story actually begins before JCL, but that is a trivial distinction.)

Hey, I resent that! I never wore a pocket protector or carried lots of pens/pencils. Besides, you forgot about the slide rules and the first programmable TI calculators. My cards always ran through the reader on the first try. Guess you had to know how to handle the deck and the readers, JCL, Basic, Fortran, SPSS, Algol, Spitbol, Assembler, PL/1 and all!

I also have children: ergo I had sex at some point! At least last I checked that was the theory!

However, I really appreciate and concur with the possibility that oversimplification may be involved in translating reality into a computer-programmable set of variables. Scientists, however, are perfectly capable of failing to represent reality on their own, without the complicity of computer programmers. I believe this is what modelling is: limiting the number of variables in order to arrive at a testable hypothesis, which is the essence of the modern scientific method as I understand it. The fault lies not only with the programmers' limited brainpower, the limited processing capacity of early computers, or the lack of existing programs at the time on which to build new, more complex representations, but also with the scientific method itself and the limits of its ability to represent reality and natural processes.

George · Dec 24, 2010

Oh my Gosh! (wiping away tears of mirth) That was hilarious. Daze my friend you missed a calling in comedy. That was absolutely excellent. And true. There are not enough sequenced fragments from either the Prostate Cancer Studies (4) or the CFS (only 2) to run any kind of algorithm with any certainty. Thank you for the Christmas Cheer!

Warm Puppy Hugs from the resident Dawg.

anciendaze · Jan 10, 2011

While I can't promise to maintain the previous level of amusement, I do have more to say on this subject. Here's my latest attempt.

As I pointed out above, sequence comparison started even before the genetic code was firmly pinned down. The first sequences of important biological significance were sequences of amino acids in polypeptides. There are 20 standard amino acids in proteins (22 if you count the rarer selenocysteine and pyrrolysine), while DNA uses only 4 distinct bases to encode information. Right off the bat this makes amino acid sequences less ambiguous.

There are more complications for DNA sequencing. Triplets of bases (codons) code for specific amino acids, and for actions like start and stop translation. Shifting the starting point often causes such big changes that generally the result is biologically useless. (Not always. Nothing in biology is entirely simple.) You hear people talk about open reading frames, which indicate you are still in the middle of a sequence coding for a gene. For double strands of DNA, you also have sense and anti-sense versions of a sequence. And, if all that isn't enough, you have introns, which don't code for anything useful, and must be cut out. A great deal of complicated stuff in the nucleus of eukaryotic cells also works to correct errors in the code, which can easily arise. The result is that matching DNA sequences is trickier than matching those amino acid sequences we started comparing, irrespective of our choice of subject matter.
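A short sketch shows how much the starting point matters. It builds the standard genetic code table from its usual compact string form, then translates one made-up 12-base DNA string in all three forward frames and takes its reverse complement for the anti-sense strand. The DNA string is invented for illustration, and `translate` and `reverse_complement` are hypothetical helper names.

```python
BASES = "TCAG"
# Standard genetic code in the conventional TCAG ordering; '*' marks a stop codon.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = dict(zip((a + b + c for a in BASES for b in BASES for c in BASES), AMINO))

def translate(dna, frame=0):
    """Translate DNA starting at offset `frame`, dropping any incomplete trailing codon."""
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE[c] for c in codons)

def reverse_complement(dna):
    """The anti-sense strand, read in its own 5'-to-3' direction."""
    return dna[::-1].translate(str.maketrans("ACGT", "TGCA"))

dna = "ATGGCATTGTAA"  # made-up example: start codon, two codons, stop codon

print(translate(dna, 0))        # MAL*
print(translate(dna, 1))        # WHC
print(translate(dna, 2))        # GIV
print(reverse_complement(dna))  # TTACAATGCCAT
```

The same twelve letters give three unrelated amino acid strings depending on where reading starts, before the other strand is even considered.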

Which things you compare is also important. The example of cytochrome C was chosen because it is a highly-conserved sequence. Even a change in a single amino acid can result in a molecule with the wrong shape ("tertiary structure"), or even the wrong way of bending in reactions. Most big changes are likely to be immediately lethal, because this molecule is fundamental to metabolism. The result is that you only see a few changes over millions of years of separation between chimpanzees and humans, and these are often isolated changes in single letters of the code.

Viruses evolve within individual patients. Some, including XMRV, keep their genetic information in the form of RNA, which is generally less stable than DNA. They transfer genetic information horizontally at the drop of a (molecular) hat. They are not essential for living cells to survive. Changes in regions known as Long Terminal Repeats (LTR) are common, and apparently do help to hide the virus from immune systems. Most of the assumptions with which sequence comparison began are violated.

Battles between viruses and natural defenses, like APOBEC3 enzymes which cause hypermutation, introduce more confusion in sequence data. If you doubt this has any effect on the ability to find a sequence, you must have never lost real email because of a spam filter. This is a tiny example in which computer viruses may resemble biological viruses. Warfare with biological viruses has been going on much, much longer. We dare not assume those viruses are unsophisticated.
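Here is a deliberately exaggerated toy version of the spam-filter problem. APOBEC3-driven hypermutation is usually described as G-to-A changes. The sketch converts every G to A, which is far more extreme than anything real, and then looks for an exact match to a short probe. The sequences, the probe and the helper name `hypermutate_g_to_a` are all invented for illustration.

```python
def hypermutate_g_to_a(seq):
    """Toy model: convert every G to A (real hypermutation hits only some sites)."""
    return seq.replace("G", "A")

probe = "GGATCG"          # made-up probe sequence
target = "TTACGGATCGTTA"  # made-up target containing the probe

mutated = hypermutate_g_to_a(target)

print(probe in target)   # True
print(probe in mutated)  # False
print(mutated)           # TTACAAATCATTA
```

Any search that relies on an exact or near-exact match to a reference sequence is exposed to this kind of miss.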

Added: (Wow! This just gave me a great idea for sending secret messages across the Internet, disguised as appeals from Nigerian widows.)

Another aspect shows up when you consider effects of selecting input data. If all the input sequences were derived from cultured cell lines, the root of the phylogenetic tree constructed would necessarily land in a cultured cell line. When actual data are used there is a bias toward data from cultured cell lines for a simple reason: it is much easier to get consistent sequence data from a homogeneous culture of cells than from anything as messy as an intact organism, even if the organism is not infected with some nasty disease, let alone the multiple infections characteristic of ME/CFS.

Databases for sequence data show an inherent bias in this direction. It is a sophisticated version of the drunk looking for his keys under a lamp post where the light is better.
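The selection effect can be shown with something much simpler than a tree. The sketch below swaps the phylogenetic method for a plain nearest-match lookup: a query is assigned the label of whichever reference sequence it mismatches least. All sequences and labels are invented, and `closest_source` is a hypothetical helper. If the database holds only cell-line sequences, the answer is "cell line" no matter what the query is.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def closest_source(query, references):
    """Label of the reference sequence with the fewest mismatches to the query."""
    label, _sequence = min(references, key=lambda ref: hamming(query, ref[1]))
    return label

# Invented reference databases: (label, sequence) pairs.
cell_line_only = [("cell line", "ACGTACGT"), ("cell line", "ACGTTCGT")]
mixed = cell_line_only + [("patient isolate", "TTTTTTAT")]

query = "TTTTTTTT"  # made-up query sequence

print(closest_source(query, cell_line_only))  # cell line
print(closest_source(query, mixed))           # patient isolate
```

The first answer reflects what was in the database, not where the query came from.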
