5.30.2010

Proving Evolution: Post 6 - Phylogenetic Analysis

Previously we skimmed over the creation of a phylogenetic tree with a simplified example of how such trees are constructed using only a few major genetic characteristics. In the last post we touched on how much less obvious genetic characteristics can also be analyzed for phylogenetic relationships… like ERVs. As the discussion progresses the importance of the nested hierarchy and its nontrivial nature will continue to become more apparent. As in the case of ERVs, it goes significantly beyond such superficially obvious observations as “we never expect to find snakes producing orange juice”. It applies right down to the molecular level, even to genetic sequences which have absolutely no reason, from the standpoint of observing the “obvious” groupings of organisms, to display nested hierarchical patterns... except that evolutionary theory says they should because of their patterns of common ancestry.

When actually constructing a consensus phylogenetic tree such as the one shown at (Life on Earth), not only are a great many genetic traits taken into account, but a rigorous mathematical analysis of the actual DNA sequences of the organisms in question (where such DNA is available) is done to create cladograms (the branching diagrams showing patterns of descent) with the highest possible statistical confidence. These techniques have been tested in situations where the correct evolutionary relationships are already independently known with certainty, to verify that they do not simply produce an evolutionary relationship but the correct evolutionary relationship to within a very low margin of error.

One example:

http://mbe.oxfordjournals.org/cgi/reprint/19/2/170.pdf

In the paper above the researchers started with an original sample of DNA from Trypanosoma cruzi. They bred it over successive generations and allowed it to continually mutate, and every 70 generations two of the resulting DNA sequences were isolated at random and then used to found new populations. This process was repeated 4 times until 16 different descendant DNA sequences had been generated. A rough diagram illustrating the process is shown in Figure 1 on page 2 of the paper.

Now this might not sound like much… but the number of possible phylogenetic trees that can be generated for a group of N different related genetic sequences grows faster than exponentially as N increases. For rooted, bifurcating trees that number is described by the equation: (2N-3)! / (2^(N-2) × (N-2)!).

For 2 organisms this gives us only 1 possible tree (which should be obvious).

For 3 organisms it gives us 3 possible trees.

For 5 it gives us 105.

For 10 it gives us over 34 million.

For 16 organisms that gives us a total of (29!)/((2^14)(14!)) = 29!/1.428x10^15 = 6.19028x10^15 possible phylogenetic tree diagrams that can be generated. Picking the correct one isn’t something you can do by luck... unless of course you can beat odds of better than 6 quadrillion to 1. And if that's the case, why aren't you in Vegas right now?
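
The counts above can be checked with a few lines of Python. This is only a minimal sketch: the function name rooted_tree_count is an illustrative choice, and it assumes the formula is counting rooted, strictly bifurcating trees with labelled tips.

```python
from math import factorial


def rooted_tree_count(n: int) -> int:
    """Count rooted bifurcating trees for n labelled sequences.

    Implements (2n-3)! / (2^(n-2) * (n-2)!) using exact integer arithmetic.
    """
    if n < 2:
        raise ValueError("need at least 2 sequences")
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))


# Print the number of possible trees for the group sizes discussed above.
for n in (2, 3, 5, 10, 16):
    print(n, rooted_tree_count(n))
```

For 16 sequences this prints 6190283353629375, which is the 6.19x10^15 figure quoted above.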


If you have mathematical routines that can, when applied to genetic sequences from those 16 organisms, subsequently generate the correct tree or even a very close approximation of it, it can safely be concluded that it’s because the routine works and works well.

So, they subjected the 16 final (terminal) sequences to phylogenetic analysis to see what the calculated highest likelihood phylogenetic tree for the organisms was. The result is displayed in Figure 3 on page 5 of the paper. The top tree is the actual observed branching pattern during the experiment. Each of the circles represents a point at which sample sequences were isolated to found new populations… i.e. an evolutionary branching of the population into two separate groups. They are numbered to correspond to the illustrated points in Figure 1. The numbers along each branch of the diagram represent the “branch length”, a value that can be used to represent either time between nodes… or the number of genetic sequence changes between nodes. In this case, the latter. For example, between node 2.1 and 3.1 the sequence undergoes 5 changes… while between node 2.1 and 3.2 it undergoes 6. T1 through T16 are the final 16 sequences generated as the end result of the process.
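
To make “branch length as number of sequence changes” a bit more concrete, here is a small sketch that counts the positions at which two aligned sequences differ. The two sequences are made up for illustration and are not taken from the paper, and the function name count_differences is just a label chosen here. A plain difference count like this can also undercount the true number of changes if the same site mutates more than once.

```python
def count_differences(seq_a: str, seq_b: str) -> int:
    """Count aligned positions where two equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a != b for a, b in zip(seq_a, seq_b))


# Hypothetical parent and child sequences (invented for this example).
ancestor = "ACGTACGTACGTACGTACGT"
descendant = "ACGTTCGTACGAACGTACCT"

print(count_differences(ancestor, descendant))  # prints 3
```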

Displayed below that is the highest probability tree returned by the phylogenetic analysis of the sequences. Note that not only is every single node and branch correctly placed, but the predicted length of each branch is also found in 29 out of 30 cases to be within the calculated margin of error (on the branch linking the 2.2 and 3.3 nodes it missed the branch length by 1 sequence change more than its calculated margin of error).

The entire evolutionary history of all 16 terminal sequences back to their common ancestor… reconstructed completely starting only from the end product and working backwards. Just as we can do with any other living things we have DNA samples from.

In short, the method works. Very well.
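
For a feel of what “working backwards from the end product” looks like, here is a toy sketch. It is not the analysis used in the paper: it swaps in UPGMA, a very simple distance-clustering method, and the four sequences are invented so that T1/T2 share one parent and T3/T4 share another. All names here (hamming, upgma, toy) are illustrative assumptions.

```python
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Number of aligned positions where two sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def upgma(seqs: dict) -> object:
    """Cluster sequences by average pairwise distance; return a nested tuple tree."""
    # Each key is a subtree (a name or a nested tuple); the value lists its tips.
    clusters = {name: [name] for name in seqs}

    def avg_dist(x, y):
        # Average distance over every tip in x paired with every tip in y.
        pairs = [(a, b) for a in clusters[x] for b in clusters[y]]
        return sum(hamming(seqs[a], seqs[b]) for a, b in pairs) / len(pairs)

    while len(clusters) > 1:
        # Join the two closest clusters into a new internal node.
        x, y = min(combinations(clusters, 2), key=lambda p: avg_dist(*p))
        clusters[(x, y)] = clusters.pop(x) + clusters.pop(y)
    return next(iter(clusters))


# Invented terminal sequences: two lineages, each split once more.
toy = {
    "T1": "CCAACAAAAAAA",
    "T2": "CCAAACAAAAAA",
    "T3": "AAGGAACAAAAA",
    "T4": "AAGGAAACAAAA",
}

print(upgma(toy))  # prints (('T1', 'T2'), ('T3', 'T4'))
```

Even this crude method recovers the branching order of the toy data from the terminal sequences alone. Real analyses use far more careful statistical models, but the principle of inferring the tree from the end sequences is the same.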

As noted in the discussion of the previous topic there are, occasionally, some grey areas where it is not clear to within a node or so where a species should be placed in the tree. In most cases this is due to some small scale discrepancy between phylogenies based on morphological data and phylogenies based on molecular or genetic data. An example follows further down the post.

Evolution critics will often point to these regions of uncertainty as some kind of indication that evolutionary theory is incapable of explaining the evolutionary origins of some species… that evolution is “stumped” by certain species and should therefore be rejected. This is ludicrous. Even in a cladogram of only 16 organisms, if this had been true of one of them… and a single branch had been mis-located by one node… given the number of possible trees that had to be eliminated to arrive at the correct location for each of those nodes and branches it amounts to the equivalent of a margin of error in the results of 1 part in roughly 3x10^15… or a measurement inaccuracy once we reach the equivalent of the 14th decimal place. An incredibly tiny margin of error if ever there was one.

To contrast … last I checked the charge of the electron has been measured reliably to 7 decimal places. G, the gravitational constant, to 3 decimal places. Nobody in their right mind suggests that this means we need to toss out physics and start from scratch because G and the charge of the electron “stumps” us through our inability to achieve a 100% perfect correlation between experimental results and theoretical modelling. 99.99% is pretty damn good too.

99.999999999...% is extraordinary. (They don’t say that evolutionary theory is one of the (if not the) most strongly evidentially supported scientific theories in the history of science just because they think it sounds good.)

Is it frustrating on those occasions when there is one branch on the tree with a positioning uncertainty of one branch... or maybe even two on sufficiently zoomed in scales? Yes. Ideally we would like to have absolutely every last detail right down to every single individual species nailed down with absolute certainty. It is why scientific research always continues to try to narrow those uncertainties... to add just that one more decimal place to that correlated value…

Is it somehow fatal to evolutionary theory that we still require some more data and better measurements to get that one branch position nailed down once and for all? Ridiculous.

Actual example of discrepancy between two phylogenetic analyses:


These are two different phylogenies for species of crocodile. One based on the morphological data, one based on a molecular analysis of the c-myc proto-oncogene… taken from this study:

http://163.238.8.180/~fburbrink/Cour...s/gharials.pdf

Morphological data will under almost any circumstances be considered secondary to molecular and genetic analysis... because the units of biological inheritance are the genes themselves. Analyzing morphology is observing a secondary characteristic of inheritance and thus has an expected slightly larger margin of error, which can occasionally cause minor discrepancies between the two phylogenies like this one. If you scan down to the figure on page 8 of the linked paper you get a slightly better picture of the extent to which the sequences are analyzed to establish the tree in a genetic analysis. The chart shows the multiple mutations which occurred along each branch to arrive at the final c-myc sequences.

The two charts differ only in their placement of Gavialis. Based on the morphological data, Gavialis was expected to be less closely related to Tomistoma than the other crocodiles are… but the genetic analysis says the two are more closely related to each other than to the other crocodiles. Notice that with the exception of the single Gavialis branch both trees are identical.

Note that even if we are to consider only these 8 species in isolation from the much larger tree into which they fit, and in which their position is well established, a difference of a single branch position for a single member of the group between one measurement and the other is minuscule. There are over one hundred and thirty five thousand possible phylogenetic trees for a group of 8 organisms… having the morphological and genetic sequence data correlate to this degree is an impressive level of agreement. Resolving that last branch position is the same as resolving a measurement out at the 4th or 5th decimal place.
