Wednesday, September 30, 2015

Are networks actually used to explore reticulate histories?


A look at the modern literature clearly shows that many, if not most, researchers do not use network methods when exploring reticulate evolutionary histories. As examples of the range of possible approaches, I will briefly discuss two papers from a recent journal issue.

Archaic introgression
Pengfei Qin and Mark Stoneking (2015) Denisovan ancestry in East Eurasian and Native American populations. Molecular Biology and Evolution 32: 2665-2674.
The data used for this study of archaic introgression in hominids were genome-wide SNPs from 2,493 modern humans, plus a chimpanzee and two fossils, one from the only known Denisovan individual and one from a Neandertal. The data were reduced to f4 summary statistics, which assess the correlation between the allele frequency differences of two pairs of populations. (If populations A and B are consistent with forming a clade with respect to populations C and D, then the f4 statistic is expected to be 0.) The proportions of introgressions between populations were then calculated as the ratios between selected f4 statistics. Finally, the results of the series of calculations were presented as an admixture (or introgression) network.
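To make the f4 logic concrete, here is a minimal sketch, assuming that the allele frequencies for each population are already available as plain lists (one value per SNP). The function names f4 and f4_ratio are my own labels, the numbers in the usage note are toy values, and this is not the software or the exact estimator used in the paper.

```python
from typing import Sequence


def f4(p_a: Sequence[float], p_b: Sequence[float],
       p_c: Sequence[float], p_d: Sequence[float]) -> float:
    """Mean over SNPs of (pA - pB) * (pC - pD).

    A value near 0 is what one expects when A and B form a clade
    with respect to C and D.
    """
    n = len(p_a)
    if n == 0 or not (n == len(p_b) == len(p_c) == len(p_d)):
        raise ValueError("need equal-length, non-empty frequency lists")
    total = sum((a - b) * (c - d)
                for a, b, c, d in zip(p_a, p_b, p_c, p_d))
    return total / n


def f4_ratio(numerator: float, denominator: float) -> float:
    """Ratio of two f4 statistics, read as an introgression proportion.

    Which populations go into the numerator and the denominator depends
    on the assumed population tree; that choice is left to the caller.
    """
    if denominator == 0:
        raise ZeroDivisionError("denominator f4 statistic is zero")
    return numerator / denominator
```

For example, f4([0.5, 0.5], [0.25, 0.75], [1.0, 0.0], [0.0, 1.0]) is non-zero because the A-B differences are correlated with the C-D differences, whereas any call in which A and B have identical frequencies returns 0.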


There are design problems with this experiment, but at least the authors do use an explicit method to produce the introgression pattern for their phylogenetic network. They do, however, draw the network manually.

The obvious experimental problem is lack of replication, which is a basic requirement of traditional science. In this case, the work is ostensibly about archaic introgression, but there is no replication of the Denisovan, Neandertal or chimpanzee samples, which are the key ones for quantifying archaic patterns. Mind you, there are only a couple of bones of the Denisovan, so the lack of replication is hardly surprising, however regrettable it may be.

There are also technical problems, such as the artifactual arch pattern in the PCA plot (see Distortions and artifacts in Principal Components Analysis analysis of genome data).

Finally, note that the "introgression" arrows in the network do not point from the ostensible source but always from a sister taxon of that source. This is basically the argument that we cannot know ancestors, and so we must represent them as sister taxa to their putative descendants in an evolutionary diagram.

Yeast recombination
Baojun Wu, Adnan Buljic and Weilong Hao (2015) Extensive horizontal transfer and homologous recombination generate highly chimeric mitochondrial genomes in yeast. Molecular Biology and Evolution 32: 2559-2570.
The authors studied aligned sequences of 40 mitochondrial genomes from yeasts, and report "extensive, homologous-recombination-mediated, mitochondrial-to-mitochondrial HGT, leading to genomes that are highly chimeric." Recombination was evaluated using various methods from the RDP4 program. Horizontal gene transfer (HGT) was evaluated by comparing different mitochondrial genome regions (introns as well as exons). No phylogenetic network was presented to summarize the phylogenetic relationships, just a long series of incongruent gene (or locus) trees.

The lack of a network summary in HGT studies is quite common. This is in spite of the availability of programs to evaluate HGT and display the results. The focus in such studies seems to be on mechanisms rather than on the phylogenetic history.

The general experimental issue with the study of HGT is that evidence for it is solely inference from incongruence: (i) incongruent gene trees must be the result of either incomplete lineage sorting (ILS), gene duplication-loss (DL) or gene flow, and (ii) if it is the latter and the taxa are not closely related, then it is called HGT. This is not particularly strong evidence, especially when ILS and DL are not explicitly evaluated. These days, there are several methods available for doing this.

Monday, September 28, 2015

Complex hybridizations in barley and its relatives


In a former blog post I discussed the complex series of polyploid hybridizations that led to modern wheat cultivars (Complex hybridizations in wheat). A recent paper has discussed the even more complex series of polyploid hybridizations involved in the genus Hordeum, which includes cultivated barley:
Jonathan Brassac and Frank R. Blattner (2015) Species-level phylogeny and polyploid relationships in Hordeum (Poaceae) inferred by next-generation sequencing and in silico cloning of multiple nuclear loci. Systematic Biology 64: 792-808.

The authors note:
With nearly half of the species being polyploids (tetra- and hexaploids), including allo- and autopolyploids, the genus Hordeum is a good model to study speciation through polyploidization ... Studies on polyploid taxa are generally impeded by the complex evolution of these organisms, involving recurrent formation, gene loss or retention, and homoeologous recombination ... [However,] Chloroplast DNA is usually maternally inherited in angiosperms, [and thus] can be used to identify the direction of hybrid speciation in polyploids, that is, to determine maternal parents.
Here we present an analysis that is based on 12 nuclear loci, distributed on six of the seven barley chromosomes, and one chloroplast region ... Phylogenetic analyses were conducted on single loci and concatenated data from all loci ... We included 105 individuals representing all 33 species and most subspecies of the genus.
After aligning the sequences from all loci, (i) models of sequence evolution were determined for each locus. Gene trees were calculated for each locus with (ii) the sequences derived from the diploid taxa by Bayesian phylogenetic inference (BI), and (iii) sequences from all diploid plus, consecutively, single polyploid individuals were clustered by neighbor-joining analysis to determine phylogenetic affiliation (phasing) of the homoeologous gene copies found in polyploid taxa. Concatenated sequences from all loci (supermatrices) were used for BI of (iv) diploid and (v) diploid plus phased homoeologs of polyploid taxa. (vi) A MSC-based [multispecies coalescent] analysis was conducted to infer species trees from gene trees for the diploid individuals. (vii) To date nodes within the Hordeum phylogeny a molecular clock approach was conducted together with the MSC. (viii) A BCA [Bayesian concordance analysis] was conducted on the diploid taxa to estimate gene tree incongruences. Finally, (ix) chloroplast matK sequences were analyzed by BI to detect the maternal lineages in allopolyploids.
The results of this analysis were summarized into a scheme where polyploids were integrated in the modified diploid species tree. The MSC topology was modified to take into account the incongruences between the different methods and to integrate the inferred extinct lineages. The polyploid relationships could mostly be identified with confidence. The wide genetic variety found in some species probably indicates multiple origins of such polyploids.

This was obviously a rather complex procedure, and use of a MUL-tree would be simpler for much of the work. The authors ended up drawing a hybridization network manually, as explained in the legend to their figure. (Note that MSC is the multispecies coalescent and BI is Bayesian inference.)


The authors do finally note that "It could also be interesting to test the strategy suggested by Marcussen et al. (2015) to evaluate potential network topologies for such a particularly complex polyploid taxon." This would certainly be a more direct way to produce a phylogenetic network for polyploids.

References

Jakob SS, Blattner FR (2006) A chloroplast genealogy of Hordeum (Poaceae): long-term persisting haplotypes, incomplete lineage sorting, regional extinction, and the consequences for phylogenetic inference. Molecular Biology and Evolution 23: 1602-1612.

Marcussen T, Heier L, Brysting AK, Oxelman B, Jakobsen KS (2015) From gene trees to a dated allopolyploid network: insights from the angiosperm genus Viola (Violaceae). Systematic Biology 64: 84-101.

Wednesday, September 23, 2015

Uses of MUL-trees for evolutionary networks


Creating evolutionary phylogenetic networks is currently a somewhat ad hoc procedure, with a number of competing strategies based on various models of how gene flow occurs.

One possibility is to use multi-labeled trees. Here, multiple gene trees can be represented by a single multi-labeled tree (a MUL-tree), which in turn can also be represented as a reticulating network. A MUL-tree has leaves that are not uniquely labeled by a set of species (ie. each species can appear more than once). This means that multiple gene trees can be represented by a single MUL-tree, with different combinations of the leaf labels representing different gene trees.
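As a small illustration of how one MUL-tree can stand for several gene trees, the following sketch enumerates the single-labeled trees obtained by keeping exactly one copy of each repeated label. The nested-tuple tree encoding, the function names and the toy labels (A, B, C, H) are all my own assumptions; this is not how PADRE or MulRF work internally.

```python
from itertools import product


def leaf_paths(tree, path=()):
    """Yield (path, label) for every leaf of a nested-tuple tree."""
    if isinstance(tree, str):
        yield path, tree
    else:
        for i, child in enumerate(tree):
            yield from leaf_paths(child, path + (i,))


def prune(tree, keep, path=()):
    """Keep only the leaves whose paths are in `keep`.

    Nodes left with a single child are suppressed; None means that
    nothing survives below this node.
    """
    if isinstance(tree, str):
        return tree if path in keep else None
    kids = [prune(child, keep, path + (i,)) for i, child in enumerate(tree)]
    kids = [k for k in kids if k is not None]
    if not kids:
        return None
    return kids[0] if len(kids) == 1 else tuple(kids)


def single_labeled_trees(mul_tree):
    """Yield one tree per way of choosing a single copy of each label."""
    by_label = {}
    for path, label in leaf_paths(mul_tree):
        by_label.setdefault(label, []).append(path)
    labels = sorted(by_label)
    for choice in product(*(by_label[label] for label in labels)):
        yield prune(mul_tree, set(choice))


# Toy MUL-tree in which the label "H" appears twice.
toy = ((("A", "H"), "B"), ("H", "C"))
for gene_tree in single_labeled_trees(toy):
    print(gene_tree)
```

In the toy case there are two such trees, one placing H next to A and the other placing H next to C, which is exactly the kind of conflict that a reticulation is meant to summarize.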

The most obvious uses of a MUL-tree are where there are multiple copies of genes within an organism, as each gene copy can be represented independently in the MUL-tree. This will apply when there has been gene duplication, for example, or when there has been polyploidy (ie. multiple copies of the entire genome). Computer programs such as PADRE or MulRF can then be used to derive an optimal single-labeled species network from the MUL-tree.

However, this same strategy can also be used whenever there is conflict among gene trees. In this scenario, the conflicting genes are treated as different leaves in the MUL-tree. One labeled leaf would have the data for the first gene, with the second gene entered as missing data, and the second leaf would then have the inverse situation (the data for gene one are missing and those for gene two are present).
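The matrix construction described here can be sketched as follows. This is only an illustration under my own assumptions: each partition is a dictionary of already-aligned sequences, missing data are coded as "?", and the function name, taxon names and sequences are hypothetical toy values rather than anything from a published dataset.

```python
def taxon_duplication_matrix(partitions, conflicting, missing="?"):
    """Concatenate partitions, duplicating each conflicting taxon.

    partitions:  dict of partition name -> dict of taxon -> aligned sequence
    conflicting: set of taxa whose partitions give conflicting signals
    Each conflicting taxon gets one row per partition, holding the data
    for that partition only and missing-data symbols everywhere else.
    """
    names = list(partitions)
    lengths = {name: len(next(iter(seqs.values())))
               for name, seqs in partitions.items()}
    taxa = sorted({taxon for seqs in partitions.values() for taxon in seqs})

    def block(name, taxon):
        # Sequence for this taxon, or all-missing if it was not sampled.
        return partitions[name].get(taxon, missing * lengths[name])

    rows = {}
    for taxon in taxa:
        if taxon in conflicting:
            for keep in names:
                rows[f"{taxon} {keep}"] = "".join(
                    block(name, taxon) if name == keep
                    else missing * lengths[name]
                    for name in names)
        else:
            rows[taxon] = "".join(block(name, taxon) for name in names)
    return rows


# Toy data: two partitions, one taxon flagged as conflicting.
toy = {
    "CP": {"sp_one": "ACGT", "sp_hybrid": "ACGA"},
    "ITS": {"sp_one": "GG", "sp_hybrid": "GC"},
}
for label, row in taxon_duplication_matrix(toy, {"sp_hybrid"}).items():
    print(label, row)
```

The duplicated taxon ends up as two rows with complementary blocks of missing data, so the pattern of missing data is non-random by construction.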

This can be illustrated by a recent example of the Erica (heather plants) genus, from Mugrabi de Kuppler et al. (2015). The authors were interested in whether the observed gene tree conflict in Erica lusitanica could be the result of hybridisation between morphologically dissimilar species, as this has previously been suggested.

They collected sequence data for a number of plastid regions as well as the nuclear ribosomal ITS region. The observed conflict was between the plastid (chloroplast) and nuclear sequences. They note:
A targeted supermatrix strategy was employed, whereby more variable ITS and trnL-trnF spacer sequences were obtained for most samples, and the other, mostly less variable chloroplast markers were added for selected taxa in order to improve resolution of deeper nodes in the chloroplast tree. 
Where gene tree conflict was identified, the taxa with conflicting phylogenetic signals were duplicated in a combined matrix following the approach of Pirie et al. (2008, 2009) in order to infer a single multi-labelled "taxon duplication" tree. [This occurred for only one species. Thus, one leaf label for E. lusitanica has the data only for the chloroplast sequences, and the other leaf has the data only for the nuclear sequence.]


The figure shows the result of the coalescent BEAST analysis of the multi-labeled data, with E. lusitanica appearing twice in the MUL-tree. Inset is the resulting single-labeled network, with E. lusitanica appearing once, as a reticulation.

This is an interesting application of MUL-trees. However, there are two issues that I wish to highlight about the procedure.

First, the reticulation as shown in the example is not actually time-consistent, given that the horizontal axis of the MUL-tree is scaled to time. This could, for example, be resolved by having "E. lusitanica CP" attached to a ghost lineage.
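One simplified way to think about this check is that the two lineages joined by a reticulation must have existed at the same time. The sketch below tests whether two dated branches overlap; the interval representation (start and end times measured forward from the root) and the function name are my own assumptions, and a full time-consistency test for a network involves more than this pairwise comparison.

```python
from typing import Optional, Tuple

Interval = Tuple[float, float]  # (start, end), measured forward from the root


def coexistence_window(branch_a: Interval,
                       branch_b: Interval) -> Optional[Interval]:
    """Return the time window shared by two branches, or None.

    A reticulation between the two branches can only be drawn as a
    contemporaneous event if this window exists.
    """
    start = max(branch_a[0], branch_b[0])
    end = min(branch_a[1], branch_b[1])
    return (start, end) if start < end else None
```

If the function returns None, the two branches never coexisted, and something like a ghost lineage extending one of them is needed before the reticulation can be placed.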

Second, the data matrix from which the MUL-tree is created will have a non-random distribution of missing data, by definition. This non-randomness is known to have a bad effect on likelihood analyses (Simmons 2012). In the example, the non-randomness is exacerbated by further non-randomness in the acquisition of the plastid sequences. So, if this form of MUL-tree analysis is to be pursued then maybe this potential limitation should be investigated.

References

Mugrabi de Kuppler AL, Fagúndez J, Bellstedt DU, Oliver EGH, Léon J, Pirie MD (2015) Testing reticulate versus coalescent origins of Erica lusitanica using a species phylogeny of the northern heathers (Ericeae, Ericaceae). Molecular Phylogenetics and Evolution 88: 121-131.

Pirie MD, Humphreys AM, Galley C, Barker NP, Verboom GA, Orlovich D, Draffin SJ, Lloyd K, Baeza CM, Negritto M, Ruiz E, Cota Sanchez JH, Reimer E, Linder HP (2008) A novel supermatrix approach improves resolution of phylogenetic relationships in a comprehensive sample of danthonioid grasses. Molecular Phylogenetics and Evolution 48: 1106-1119.

Pirie MD, Humphreys AM, Barker NP, Linder HP (2009) Reticulation, data combination, and inferring evolutionary history: an example from Danthonioideae (Poaceae). Systematic Biology 58: 612-628.

Simmons MP (2012) Radical instability and spurious branch support by likelihood when applied to matrices with non-random distributions of missing data. Molecular Phylogenetics and Evolution 62: 472-484.

Monday, September 21, 2015

More literature on trees and networks


In a previous blog post I listed some of the Books about the history of trees and networks. While I mentioned a review paper by Pascal Tassy, I did not list his earlier book on the subject, which actually pre-dates all of the books listed.


  

• Tassy, Pascal (1991) L'Arbre à Remonter le Temps. Paris: Christian Bourgois Éditeur. First edition. (1998) Paris: Diderot Éditeur. Second edition.
In French; paperback. A conceptual history of phylogenetics up to the end of the 20th century. Covers the development of ideas for the general public, with a few illustrations. Networks are barely mentioned.


In this context, there are also other sources that illustrate at least part of the history of trees and networks (along with textbooks covering systematics). These include:

Archibald, J. David (2009) Edward Hitchcock's pre-Darwinian (1840) "Tree of Life". Journal of the History of Biology 42: 561-592.

Bigoni, Francesca and Barsanti, Giulio (2011) Evolutionary trees and the rise of modern primatology: the forgotten contribution of St. George Mivart. Journal of Anthropological Sciences 89: 93-107.

Coggon, Jennifer (2002) Quinarianism after Darwin’s Origin: the circular system of William Hincks. Journal of the History of Biology 35: 5-42.

Gaffney, Eugene S. (1984) Historical analysis of theories of chelonian relationship. Systematic Zoology 33: 283-301.

Gontier, Nathalie (2011) Depicting the Tree of Life: the philosophical and historical roots of evolutionary tree diagrams. Evolution: Education and Outreach 4: 515-538.

Lam, Herman J. (1936) Phylogenetic symbols, past and present (being an apology for genealogical trees). Acta Biotheoretica 2: 153-194.

Nelson, Gareth and Platnick, Norman (1981) Systematics and Biogeography: Cladistics and Vicariance. Columbia Uni. Press, New York.

O’Hara, Robert J. (1988) Diagrammatic classifications of birds, 1819-1901: views of the natural system in 19th-century British ornithology. In: H. Ouellet (ed.) Acta XIX Congressus Internationalis Ornithologici, pp. 2746-2759. National Museum of Natural Sciences, Ottawa.

O’Hara, Robert J. (1991) Representations of the natural system in the nineteenth century. Biology and Philosophy 6: 255-274.

O'Hara, Robert J. (1996) Trees of history in systematics and philology. Memorie della Società Italiana di Scienze Naturali e del Museo Civico di Storia Naturale di Milano 27: 81-88.

Ragan, Mark A. (2009) Trees and networks before and after Darwin. Biology Direct 4: 43.

Stevens, Peter F. (1984) Metaphors and typology in the development of botanical systematics 1690-1960, or the art of putting new wine in old bottles. Taxon 33: 169-211.

Tassy, Pascal (2011) Trees before and after Darwin. Journal of Zoological Systematics and Evolutionary Research 49: 89-101.

Willmann, Rainer (2003) From Haeckel to Hennig: the early development of phylogenetics in German-speaking Europe. Cladistics 19: 449–479.

Wednesday, September 16, 2015

Some new additions to the dataset database


Recently, I have added three new datasets to the database of "gold standards" that might be used to evaluate network algorithms. All three are different to what has previously been included, and so I will briefly discuss them here.

Pedigree data

I have included a known pedigree from a small group of thoroughbred stallions (Eclipse dataset) for which there are mitochondrial D-loop (control region) sequences. Pedigrees are networks, not trees, whenever there is inter-breeding among close relatives, and so their inclusion in the database is needed.

There are practical problems with including more pedigrees. Most of the known pedigrees do not have readily available sequence data associated with them, as the collected data have been mainly for features associated with disease syndromes. Conversely, most of the available sequence data are not associated with known pedigrees, although for humans they are often taken from known social / linguistic / geographical groups (usually based on the place of birth of all four grandparents).

Language data

The database currently contains only a few examples from the social sciences, notably some experimental manipulations from stemmatology. However, there is so far nothing from linguistics, mainly because the phylogenetic history of languages is often poorly known. Nevertheless, languages form networks whenever there is borrowing of words (ie. loan words) between languages (usually as a result of geographical contact), and so their inclusion is desirable.

I have now included one dataset (the List dataset) taken from what appears to be the best-curated source of linguistic data, the Indo-European Lexical Cognacy Database. Known loan words are explicitly tagged in this source; and the phylogenetic relationships of many Indo-European languages are also tolerably well known (eg. see Ethnologue: Languages of the World).

Simulated data

I have not previously included simulated data, for two reasons. First, such data can easily be generated anew each time a set is required; and even if this is impractical then there are readily available datasets online (eg. see the compilation at utcs Phylogenetics). Second, and more importantly, simulations are based on a model (eg. using Brownian motion, Ornstein–Uhlenbeck, or Markov chains), and therefore they model only a subset of reality. Simulations are useful for situations involving a few well-defined variables, but they are much less useful for multivariate data such as occur in phylogenetics.

Nevertheless, I have included one well-known dataset, the Caminalcules (Camin dataset). These data were simulated manually back in the 1960s, and they include morphological features for both extant and fossil organisms. Over the years, the data have been used for many pedagogic purposes in the teaching of systematics, particularly in the U.S.A. (see Pasta have no phylogeny, so don't try to give them one). The data are strictly tree-like, and they do match real datasets in a number of ways (see Sokal 1983). However, there are also known ways in which they differ detectably from real data (see Holman 1986; Wirth 1993).

References

Holman EW (1986) A taxonomic difference between the Caminalcules and real organisms. Systematic Zoology 35: 259-261.

Sokal RR (1983) A phylogenetic analysis of the Caminalcules. I. The data base. Systematic Zoology 32: 159-184.

Wirth U (1993) Caminalcules and Didaktozoa: imaginary organisms as test-examples for systematics. In: Opitz O, Lausen B, Klar R (eds) Information and Classification: Concepts, Methods and Applications, pp. 421-433. Springer, Berlin.

Monday, September 14, 2015

Multiple sequence alignment


Following a previous post on Multiple sequence alignment, celebration of the 20th anniversary of my first publication in the alignment field continues, with a new publication:

  • Morrison DA, Morgan MJ, Kelchner SA (2015) Molecular homology and multiple sequence alignment: an analysis of concepts and practice. Australian Systematic Botany 28: 46-62.

This paper places sequence alignment within the larger picture of detecting homologies in molecular data, emphasizing the hierarchical nature of homologies. Surprisingly, this relationship has not been emphasized before. It also points out why nucleotide alignments are a unique form of homology assessment, even within this framework. Indeed, the only genotypic data are nucleotides, since everything else is an expression of the nucleotide sequences, rather than being inherited.

The article is Open Access.


Wednesday, September 9, 2015

Sharing supplementary data: a linguist's perspective


The Problem of Data Sharing

In 2013, Nature launched a discussion on how to increase the reproducibility of research in the biomedical sciences. David addressed the problem of data sharing more concretely in two blog posts from 2013, one on the practice of releasing phylogenetic data, and one on its public availability. In my opinion, this topic does not only concern the sciences, but also, and in particular, the humanities. At a time when more and more data for anthropological research is being produced, and formerly manual analyses are being automated, we need to increase the awareness of scholars and publishers that publishing only the results is not enough to meet rigorous scientific standards.

When discussing these issues with colleagues, various reasons have been brought up as to why scholars would not release their data along with a publication. Apart from practical considerations (which mostly concern the publishers who do not provide the infrastructure to host supplementary material transparently), scholars often also bring up personal and legal concerns: they are afraid that their painstaking efforts in collecting a dataset will have been in vain once they release the data to the public, since other researchers might take over and run analyses they would like to run themselves in the future. Furthermore, there are situations when data cannot simply be published completely, because the compilers of the datasets do not hold the copyright on the data itself.

In my opinion, all of these problems can be solved directly, and there is no reason to publish a study in which at least a part of the data is not provided in supplementary form.

Practical Solutions: GitHub and Zenodo

Regarding practical issues, one can use GitHub to host and curate data and computer source code. The advantage of using GitHub is that it allows for distributed revision control: all changes and modifications to the data can be tracked, and all of those who contributed to the compilation of a given dataset can receive the credit they deserve. Even for the case of anonymous data submission, there is a simple solution available along with GitHub Gist: by just uploading data to a Gist (a flat repository which does not allow for a folder structure) without being logged in with a GitHub account, one can anonymously host the data for review purposes.

If one doesn't completely trust the longevity of GitHub in hosting the data forever (it might well happen that GitHub changes its payment policy at some point in the future, or limits the number of open repositories), there is Zenodo, which offers full GitHub integration and allows storage of up to 2 GB per dataset. For more information regarding the possibilities that the GitHub integration offers, see this blog post by Robert Forkel. Zenodo was developed by CERN and, although they write on their website that their sustainability plan is still in development, it is quite unlikely that they will run out of funding within the next twenty years.

As a recommended way of hosting data, one would start with an anonymous Gist when submitting a paper. This would then be converted to a full GitHub repository once the paper has been accepted. By setting up an official release of this repository, the data would be automatically transferred to Zenodo, where it is permanently stored and provided with a DOI.

Sharing Data Prevents Data Theft

Regarding the personal concerns that one's data might be "stolen" by other scholars, I think it is important to make clear that at the core of all research we build on the work of our colleagues. Nobody should own a dataset, just as nobody should own a theory. It is clear that at the stage of developing datasets (as well as theories), we may decide to be careful in sharing them with certain colleagues. But once they are finished and ready to use, we should allow our colleagues to run their own analyses on them.

What is important, and still missing, is an established practice, together with infrastructure support, for giving credit for the work of others. In linguistics, we lack journals, such as BMC Bioinformatics, that publish articles on source code or databases. There are, however, recent attempts to address these problems in linguistic research (see, for example, this blog post by Martin Haspelmath).

But even while this infrastructure is lacking, it should be made clear that scholars win more than they risk when submitting their data along with their publication. If the data turns out to be useful for additional research, then they will receive credit in the form of citations, and they will even prevent others from actually stealing their data — as with ideas, data can only be stolen by falsely associating it with another name. Once the data is out along with the publication, this is not likely to happen.

Giving Something is More than Giving Nothing

Even in those cases where there are real copyright restrictions, one can compromise by publishing an illustrative snapshot of the data and the detailed results. Computational analyses, in particular, produce a large amount of data as part of their results, and this data may well turn out to be interesting for other scholars. Instead of publishing just a tree or a network, we may want to see the individual character evolution that was inferred along with the algorithm. And when illustrating a new algorithm for homolog detection in historical linguistics, it may be interesting for one scholar or another (but maybe also for the reviewer) to have a look at the detailed results apart from the aggregated evaluation scores.

Summary and Outlook

Current research practice in historical linguistics faces serious reproducibility problems. Fortunately, solutions exist for most of the practical problems of the past. What we need now is to increase the awareness among scholars that all research based on data and source code is nothing without the data and the source code. Publishing both source code and data along with a paper is easy nowadays, especially thanks to GitHub and Zenodo. Guaranteeing that one gets the credit for one's efforts in the humanities is a bit more difficult, but not impossible, and colleagues are working on solutions.

What we need in addition to the publication of the raw data itself are explicit formats of data exchange. In historical linguistics, using only NEXUS-format files is not sufficient, since the nature of our data requires its own representation. Here again, scholars are already working on a solution by trying to define and establish specific formats for data sharing in historical linguistics and typology (see this discussion on GitHub).

In an ideal future scenario that was introduced to me by Michael Cysouw, all publications involving automatic analyses should provide not only the supplementary data, but also some kind of a MAKE file containing the code for the workflow that enables scholars to carry out the computational analyses immediately on their computer.

Monday, September 7, 2015

The Tree of Trees is a network


A couple of years ago this paper appeared:
Marie Fisler and Guillaume Lecointre (2013) Categorizing ideas about trees: a tree of trees. PLoS One 8: e68814.
The authors note:
We study the history of the use of trees in systematics to represent the diversity of life from 1766 to 1991. We apply to those ideas a method inspired from coding homologous parts of organisms. We discretize conceptual parts of ideas, writings and drawings about trees contained in 41 main writings; we detect shared parts among authors and code them into a 91-characters matrix and use a tree representation to show who shares what with whom. In other words, we propose a hierarchical representation of the shared ideas about trees among authors: this produces a "tree of trees."
The authors continue:
Why should we choose the tree that maximizes contiguity of identical character states (i.e. the most parsimonious tree) and not another one? [That is,] why should we choose the tree maximizing consistency among characters? ... Maximizing consistency among characters is just offering a rational interpretation of the character distribution across the compared entities, by using a hierarchy from the most general to the most particular. We prefer this hierarchical representation over networks in a first step because it is what we need to test for consistency of previous categories, propose new ones and exhibit sharings (even homoplastic ones if needed).
Unfortunately, "the parsimony analysis provides 279 trees of 378 steps, with a C.I. of 0.24 and a R.I. of 0.61". In other words, there is very little consistency among the characters; and there is very little hierarchical structure in the data, as shown by my NeighborNet analysis of the same data.
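For readers who want to see where such numbers come from, here is a minimal sketch of the two indices, using the usual ensemble definitions: CI = m / s and RI = (g - s) / (g - m), where s is the observed number of steps on the tree, m is the minimum possible number of steps, and g is the maximum possible number of steps. The function names are my own. The value m = 91 used in the usage note is purely my assumption (one step for each of the 91 characters, as if all were binary); the paper does not state it.

```python
def consistency_index(min_steps: int, observed_steps: int) -> float:
    """Ensemble consistency index, CI = m / s."""
    if observed_steps <= 0:
        raise ValueError("observed_steps must be positive")
    return min_steps / observed_steps


def retention_index(min_steps: int, max_steps: int,
                    observed_steps: int) -> float:
    """Ensemble retention index, RI = (g - s) / (g - m)."""
    if max_steps == min_steps:
        raise ValueError("max_steps and min_steps must differ")
    return (max_steps - observed_steps) / (max_steps - min_steps)


# Assumption (not from the paper): all 91 characters are binary, so m = 91.
print(round(consistency_index(91, 378), 2))
```

Under that assumption the CI comes out at about 0.24, which agrees with the reported value; a CI of 1 would mean that every character fits the tree without homoplasy.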


The conclusion, that "we consider that networks are not useful to represent shared ideas at the present step of the study", seems rather dubious. The tree-makers do not generally form groups, but share phylogenetic ideas in a more haphazard manner. Nevertheless, the network neighborhoods shared by the various writers sampled do actually show quite clearly who shared tree ideas with whom.

It is interesting that the tree ideas are shared in a network manner, rather than a tree, as this indicates that there are no really clear schools of phylogenetics represented. Indeed, the writers are inter-mingled in a way that shows no development of tree ideas over time, although the various neighborhoods do tend to associate writers of similar vintage. There are no real surprises among the compositions of these neighborhoods.

Perhaps the most interesting aspect of the network, as shown, is that both Wallace and Haeckel changed their ideas about trees through time, whereas most of the other writers were more "of their time".

It is also worth noting that Buffon, Duchesne and Rühling all illustrated reticulated networks, not trees; but this is not one of the characteristics included in the dataset. The paper's authors do acknowledge that Buffon's diagram is "a tree-like extension of maps", but they fail to mention that Linnaeus also likened biological relationships to a map (not a tree), but instead treat him as part of the outgroup (see An outline history of phylogenetic trees and networks).

Wednesday, September 2, 2015

Is this a "gold standard" dataset?


I have just added another dataset to our database. This one is of considerable interest, because it is a complex one. As the authors note, it is likely to contain ancient hybrid speciation, recent introgression and deep coalescence. Thus, identifying recent hybrids will be problematic.
Michael L. Moody and Loren H. Rieseberg (2012) Sorting through the chaff, nDNA gene trees for phylogenetic inference and hybrid identification of annual sunflowers (Helianthus sect. Helianthus). Molecular Phylogenetics and Evolution 64: 145–155.
There are 29 accessions from 13 species, with data for 11 loci in 5 linkage groups (a total of 8,077 aligned nucleotides). The accessions have sequences for either 1 or 2 of the alleles, and sometimes 3 (the latter are likely to be the result of PCR artifacts). The authors have also tried to identify recombinant sequences. Three of the species are previously identified hybrid taxa.

Unfortunately, adding this dataset to the database has also been problematic, because there are internal inconsistencies. For complete consistency, Figure 1 of the paper should agree with its own Table 1, and the GenBank data should agree with both of them. However, this three-way consistency exists for only 2 of the 11 loci. For the rest, in 7 instances the dataset is the odd one out, in 4 instances it is the table, and in 4 instances it is the figure. For the data discrepancies, in 2 cases a sequence is missing, in 1 case there is an extra sequence, and for the remaining 2 pairs it is likely that there is mis-labelling of the sequences.
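This kind of three-way check is easy to automate. The sketch below is only an illustration of the idea, assuming that for each locus one has already extracted the set of sequence labels from the figure, the table and the GenBank records; the function name, the return values and the labels in the test data are hypothetical.

```python
def odd_one_out(figure, table, genbank):
    """Report which of the three sources disagrees with the other two.

    Each argument is an iterable of sequence labels for one locus.
    Returns None if all three agree, the name of the single deviating
    source if exactly two agree, or "all differ" otherwise.
    """
    sources = {
        "figure": frozenset(figure),
        "table": frozenset(table),
        "genbank": frozenset(genbank),
    }
    if len(set(sources.values())) == 1:
        return None
    for name, labels in sources.items():
        others = [v for other, v in sources.items() if other != name]
        if others[0] == others[1] and labels != others[0]:
            return name
    return "all differ"
```

Running this per locus gives a quick tally of which source deviates, although deciding which source is actually correct still needs manual inspection.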

It is therefore not immediately obvious to what extent this counts as a "gold standard" dataset. I have included it because of its intrinsic interest, but obviously with a caveat emptor warning. Sadly, this sort of situation has been all too common in my search for suitable datasets.