Tuesday, March 27, 2012

An update on level-k phylogenetic networks


Picture of a galled tree, obtained from
http://carrot.mcb.uconn.edu/~olgazh/
Today we take a look at research aimed at constructing rooted level-k phylogenetic networks.

The notion of level was first introduced by Jesper Jansson and Wing-Kin Sung in 2006. They say that a binary rooted phylogenetic network has level-k if each biconnected component (tangled part) of the network contains at most k reticulations. They introduced these level-k networks as a generalization of galled trees, which were introduced by Dan Gusfield, Satish Eddhu and Charles Langley in 2003 as networks in which cycles do not overlap. The name ``galled trees'' was motivated by trees that have large swellings called galls, like the tree in the picture. Using the notion of level, galled trees are basically level-1 networks.

However, level-k networks can also just be seen as a generalization of networks with k reticulations.

It should be noted that there is a difference between searching for a network with minimum level and searching for a network with a minimum reticulation number. A network with minimum level might not have minimum reticulation number. Moreover, there might not be a minimum-level network that has minimum reticulation number (over all networks). This can be seen from a famous counter example by Gusfield, Bansal, Bafna and Song. A variant of this example appeared in the book by Huson, Rupp and Scornavacca. It gives a set of clusters and two networks that represent those clusters: a level-2 network with four reticulations and a level-3 network with three reticulations. The first network has minimum level, and minimum reticulation number over all minimum-level networks, but it does not have a minimum reticulation number over all networks. However, Gusfield et al. show that these counter examples are rare and Huson et al. argue that even in such cases the level-2 network is preferable over the level-3 network since in the latter network ``two completely unrelated parts of the phylogeny are linked together via reticulation edges''.

Constructing level-k phylogenetic networks has been studied for different inputs. An overview of results and a few open problems can be found at http://homepages.cwi.nl/~iersel/overview.html

That website shows a table with results for inputs consisting of trees, clusters and triplets. Triplets are rooted trees on three leaves each (the rooted variant of quartets). Triplets are sometimes also called three-taxon statements by biologists. Inputs consisting of triplets or clusters can for example be obtained from gene trees, or directly from DNA or character data.

An example of a real tree with a single cycle. This tree can
be seen as a network with one reticulation and thus as a
level-1 network.
For each input, four different problems are included in the table. For all problems the level k is fixed. The table shows which problems are in P (polynomial-time solvable) and which ones are NP-hard, and sometimes specifies some other results like approximation or FPT-algorithms. It can for example be seen that problems with a general set of triplets as input are mostly NP-hard. However, these problems are more tractable when the set of triplets is dense, i.e. if it contains a triplet for every combination of three taxa. For example, triplet sets obtained from binary trees are dense. Unfortunately, practical data is almost never binary and dense triplet sets are usually difficult to obtain.

In practice, of course, some of the input triplets/clusters/trees might be incorrect. Especially for triplets and clusters, it is therefore interesting to aim at finding a level-k network that is consistent with a maximum number of input triplets or clusters. This is the first row of the website table. Unfortunately, these problems are all extremely hard.

A more tractable problem is to search for a level-k network that is consistent with all elements of the input (trees, triplets or clusters), see the second row of the website table. For sets of clusters, there is an algorithm that is not only polynomial-time for fixed k, but even fixed-parameter tractable in k.

If it is possible to find a level-k network, it is also interesting to search for such a network that has a minimum number of reticulations (over all level-k networks). For dense triplet sets, we can do this in polynomial time, but whether such an algorithm is also possible for clusters is unclear. See the third row of the table.

Finally, a mathematically-interesting problem that is not directly applicable in practice is the question whether there exists a level-k network that is consistent with precisely those triplets/clusters/trees in the input. In other words, is there a level-k network N such that the set of triplets/clusters/trees represented by N is equal to the set of triplets/trees/clusters that are given in the input. There is not much known about this problem except that it is polynomial-time solvable for sets of triplets (see the last row of the table).

Any additions/corrections/comments on the table are welcome.

The picture of the galled tree (top right) was downloaded from: http://carrot.mcb.uconn.edu/~olgazh/ The second picture was taken in Queensland (Australia) by Leo van Iersel.

No comments:

Post a Comment