anciendaze
The debate on XMRV and the recent papers in Retrovirology concerning contamination is going so hot and heavy I don't have time, under present conditions, to keep up. This thread is devoted to some background on the reasoning commonly used in arguments about sequence data. I hope others will supply links to appropriate parts of the debate, or other background material which will make this discussion more intelligible to general readers. Perhaps a moderator can edit this post.
Since the referenced papers generally ignore results of culture tests and transmission electron microscope imaging, I feel safe in concluding the authors are basing their arguments on sequence data. The fact that a computer actually yielded a phylogenetic tree showing likely origin of XMRV sequences in a cultured cell line is taken as proof positive, by an unbiased arbiter, that the source must be contamination.
Let me go over the origin of those algorithms, which I am old enough to remember. The earliest sequence data did not come from DNA. At the time, the genetic code was a mystery. Hard data concerning polypeptide sequences in fragments of proteins were just emerging. When human beings look at sequences of apparently meaningless symbols of modest length, they quickly run into difficulty deciding which "most resembles" which. At this point workers in the nascent field of molecular biology took the problem to their resident nerds in the computer field.
This was so far back said nerds could be identified by white shirts with pocket protectors and lots of pens/pencils. They tended to carry extra (magic) JCL cards with them, just in case the card reader ate the ones on the front of the current deck of punch cards. (The story actually begins before JCL, but that is a trivial distinction.)
The first problem for the biologist was explaining simplified Mendelian genetics. (It was not clear at that time that nerds were capable of reproducing, being unfamiliar with the opposite sex, to say nothing of required behavior.) Once the hurdles of the alien concepts were surmounted, the nerd would start looking for some feature of the problem which would make the job easy enough to be handled by the feeble electronic brains available at the time, preferably before the heat death of the Universe set in.
Early sequence data were not for obscure molecules; even the most fundamental were scarcely understood. A popular example was cytochrome C. So the biologist shows the nerd example sequences which could conveniently be encoded with fewer than 26 characters, even if the associated amino acid names were jawbreakers. He explains that they think the sequences for human cytochrome C must be closely related to cytochrome C in Pan troglodytes. They have clues that the two are 98% the same, in some sense not yet well defined.
This is what the nerd is looking for: a simplifying assumption. The alien concept of sexual transmission of information, and phylogeny, provides another. Except for combinations formed through mating, and other messy situations, all information is transmitted vertically along the diagram.
As the bits and pieces of cytochrome C were grubbed from laboratory tests, biologists went through a series of emotional ups and downs. At some points, possible differences turned out to be mistaken. Wits suggested that perhaps there was no biological distinction, and that all differences between chimpanzees and humans were due to environment. It came as a relief when clear differences finally emerged.
The resulting algorithms encode the assumptions those nerds found useful in simplifying the problem, and output results as phylogenetic trees, because that was what they were asked to provide.
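To make "98% the same, in some sense" concrete, here is a minimal sketch of the simplest possible sense: count matching positions between two sequences that have already been lined up. The function name and the toy strings are my own inventions for illustration, not real cytochrome C data, and real comparisons also have to decide how to align the sequences first, which is where much of the hidden judgment lives.

```python
def percent_identity(a: str, b: str) -> float:
    """Percent of positions at which two pre-aligned, equal-length sequences agree."""
    if len(a) != len(b):
        raise ValueError("sequences must already be aligned to the same length")
    if not a:
        raise ValueError("sequences must be non-empty")
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return 100.0 * matches / len(a)


# Toy one-letter-code strings, made up for illustration only.
fragment_1 = "ACDEFGHIKL"
fragment_2 = "ACDEFGHIKV"  # differs from fragment_1 only in the last position

print(percent_identity(fragment_1, fragment_2))
```

Even this tiny measure carries an assumption: that position 7 in one sequence corresponds to position 7 in the other. Change the alignment and the percentage changes with it.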
Now we must look at the wider field of microbiology. Here we find bacteria, which are generally described as asexual (though, like nerds, they may have hidden depths). Most of the time they simply divide, passing all genetic information, except some due to errors in the process, on to their offspring. Indeed, there is no clear idea of which resulting bacterium is the original and which the offspring. This would be unusually simple, if that were the whole story.
Bacteria also have a behavior known as conjugation. At unpredictable times and places they will stick together and pass plasmids through pores or tubes: bacterial sex. This is called horizontal or lateral transfer of information. It need not obey Mendel's laws. Because the entire bacterial genome is likely to be much larger than the plasmids passed back and forth, the assumption of modest changes at each stage is not entirely invalidated.
There are also some limitations on kinds of bacteria likely to be involved. The concepts of genus and species have not completely disappeared.
When we come to viruses, there is a whole new ballgame. First of all, viruses are not even complete independent organisms. Secondly, recombination can take place in ways a mink breeder would not tolerate. A virus is not restricted to "mating" with the same, or even similar, species. Horizontal transfer of information looms large, and the entire viral genome is smaller than that of virtually any independent organism. Assumptions built into algorithms are violated.
Diagrams of heredity in viruses are trees only in degenerate cases. In general the diagram is properly called a directed acyclic graph (DAG), because a recombinant descendant has more than one parent. Algorithms for computing DAGs from sequence data are not in anything like the state of development of algorithms for computing phylogenetic trees. The tree algorithms remain in use because they exist, are familiar, and often provide useful insights. Algorithms should not be confused with oracles.
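The difference between a tree and a more general directed acyclic graph can be sketched in a few lines. This is only an illustration under my own assumptions: heredity is written as a mapping from each descendant to its list of parents, the strain names are hypothetical, and the input is taken to be acyclic already. A diagram fits a tree only if no descendant has more than one distinct parent.

```python
from typing import Dict, List


def reticulate_nodes(parents: Dict[str, List[str]]) -> List[str]:
    """Return descendants with more than one distinct parent (e.g. recombinants)."""
    return sorted(child for child, ps in parents.items() if len(set(ps)) > 1)


def fits_a_tree(parents: Dict[str, List[str]]) -> bool:
    """True if every descendant has at most one parent (input assumed acyclic)."""
    return not reticulate_nodes(parents)


# Hypothetical lineages: purely vertical descent from a common ancestor.
vertical_only = {"A": ["ancestor"], "B": ["ancestor"], "C": ["A"]}

# Same lineages, plus a hypothetical recombinant R derived from both A and B.
with_recombinant = {"A": ["ancestor"], "B": ["ancestor"], "R": ["A", "B"]}

print(fits_a_tree(vertical_only))
print(fits_a_tree(with_recombinant), reticulate_nodes(with_recombinant))
```

A tree-building program handed the second data set has no way to draw R with two parents. It must attach R to one branch or the other, and which one it picks can depend on which stretch of sequence dominates the comparison.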
Another measure of the problem comes from data showing the XMRV sequence 94% homologous with MLV, and 95% homologous with an endogenous sequence. This means the differences we are considering amount to only about 400 nucleotides of genetic information. How fast does such a change take place?
In mice, we know it may take place on a time scale which can be observed in laboratories. What about in humans?
The best-studied human retrovirus at present is HIV-1 (once known as LAV or HTLV-III).
A great deal of work and sequence data suggests this originated about a century ago when a strain of SIV found in chimpanzees jumped to humans through the food chain. There have been arguments in favor of other origins.
At present HIV-1 is known to have 48 major strains. In general these exhibit only about 50% homology with the presumed ancestral SIV. This means there have been changes in thousands of nucleotides in about 100 years. For a human retrovirus, a change of hundreds of nucleotides in 10 years seems extremely plausible. With this degree and rate of variability, arguments based on smaller numbers of differences should be considered weak. We would need hundreds of complete sequences before we could feel confident the apparent source would not change with the next added sequence.
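The arithmetic behind these figures can be laid out as a rough sketch. The genome lengths below are my own round-number assumptions (about 8,100 nucleotides for XMRV and about 9,700 for HIV-1), not values taken from the papers under discussion, and the extrapolation is deliberately naive: it treats change as accumulating linearly and ignores repeated changes at the same site.

```python
def differing_nucleotides(genome_length: int, percent_identity: float) -> float:
    """Number of positions that differ, given a genome length and percent identity."""
    return genome_length * (100 - percent_identity) / 100


def linear_extrapolation(total_changes: float, years_elapsed: float,
                         years_of_interest: float) -> float:
    """Naive estimate: assume changes accumulate at a constant rate."""
    return total_changes / years_elapsed * years_of_interest


XMRV_LENGTH = 8100  # assumed approximate genome length, in nucleotides
HIV1_LENGTH = 9700  # assumed approximate genome length, in nucleotides

xmrv_gap = differing_nucleotides(XMRV_LENGTH, 95)        # the "about 400" figure
hiv_gap = differing_nucleotides(HIV1_LENGTH, 50)         # "thousands of nucleotides"
hiv_per_decade = linear_extrapolation(hiv_gap, 100, 10)  # "hundreds in 10 years"

print(xmrv_gap, hiv_gap, hiv_per_decade)
```

Under those assumed lengths the XMRV gap comes to roughly 405 nucleotides, while the HIV-1 figures give roughly 4,850 changes in a century, or on the order of 485 per decade. The exact numbers matter less than the comparison: the whole gap being argued over is about the size of one decade's worth of change at that rate.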
A second inference from observed variability of human retroviral sequences is that the presumed great age of some endogenous sequences may be suspect. If ongoing retroviral infections are still inserting sequences which vary the way HIV-1 varies, a resulting endogenous sequence which is assumed to be thousands of years old may be less than a century old. Molecular clock arguments can be very slippery.