Phylogenetic Methods and Their Applications

A phylogenetic tree portrays the relationships among a set of species and represents a model of molecular evolution. Present-day species retain many of their ancestral features, some of which change gradually to help the species adapt to their environment. We review methods for constructing and validating phylogenetic trees and their application in cancer analysis. We also explore the need for alignment-free methods of sequence comparison as an alternative to alignment-based methods.


Introduction
Phylogenetic trees help scientists understand how species have evolved and explain the similarities and differences among them. Phylogenetic studies can also help in analysing the evolution of, and similarities among, diseases and viruses, and further help in developing vaccines against them [1]. Phylogenetic methods are broadly classified as distance-based and character-based [2]. Phylogenetics relies on information extracted from genetic material such as deoxyribonucleic acid (DNA), ribonucleic acid (RNA) or protein sequences. A species can be represented by DNA strings formed from four nucleotides: A, T, C and G (adenine, thymine, cytosine and guanine respectively). Various string processing algorithms reported in the literature can quickly analyse these DNA and RNA sequences and build a phylogeny of sequences or species based on their similarity and dissimilarity [1,3]. High similarity between two sequences usually implies significant functional or structural likeness, and such sequences are placed close together in a phylogenetic tree.
To obtain precise information about how similar a sequence is to other sequences stored in a database, we must be able to compare it quickly against a set of sequences. This requires multiple sequence comparison, which can be done using alignment-based or alignment-free methods. Alignment-based methods achieve excellent results when the sequences are closely related and can be aligned reliably, but divergence between the sequences degrades the alignment [4]. As the number of datasets in phylogenomics increases, alignment-based methods also become computationally unaffordable. Multiple alignment of related sequences often yields the most helpful information on their phylogeny; however, it can produce incorrect results when applied to more divergent or rearranged sequences [5]. Some computationally intensive multiple alignment methods align sequences strictly in the order in which they receive them, i.e. the input order, without considering how the sequences are related. Other multiple sequence alignment methods align the most closely related sequences first. As a result, sequences that are less related to one another but still share a common ancestor may be clustered separately [4,5]. The sequences may therefore be aligned more accurately, yet the resulting phylogeny may be incorrect. If the sequences differ greatly in length, the quality of the alignment significantly affects tree generation. Another crucial factor in tree construction is the choice of suitable scoring matrices and gap penalties for the set of sequences being aligned. Gaps in an alignment represent mutational changes in the sequences, including insertions, deletions, or rearrangements of genetic material.
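The effect of scoring and gap penalties on an alignment can be illustrated with a minimal sketch of Needleman-Wunsch global alignment scoring. This is not a method taken from the cited works. The function name and the flat match/mismatch/gap values are illustrative assumptions; a real analysis would normally use a substitution matrix and often affine gap penalties.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score of the best global alignment of a and b (Needleman-Wunsch).

    The scoring values are illustrative assumptions, not recommended settings.
    """
    # prev[j] is the best score for aligning the processed prefix of a with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # a[:i] aligned against an empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # a[i-1] aligned to a gap
            left = curr[j - 1] + gap  # b[j-1] aligned to a gap
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    # The same pair of sequences scores differently under different gap penalties.
    print(global_alignment_score("ACGT", "ACT", gap=-2))
    print(global_alignment_score("ACGT", "ACT", gap=-5))
```

Because the two sequences differ in length, every global alignment of them must contain at least one gap, so the score depends directly on the chosen gap penalty.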
For phylogenetic analysis, the selected sequences should either align with one another along their entire lengths or share a common set of patterns or domains that strongly indicates evolutionary relatedness. Given these limitations of alignment-based methods, alignment-free methods have been proposed in the literature.
Alignment-free sequence comparison methods use alternative metrics, including k-tuple methods (where k is the length of a subsequence) and probabilistic methods. In the k-tuple method, a genetic sequence is represented by a vector of the frequencies of its subsequences of fixed length k, and similarity or dissimilarity measures are computed from these frequency vectors [6]. Probabilistic methods represent each sequence by the transition matrix of a Markov chain of pre-specified order and compare two sequences by computing the distance between their transition matrices [7,8].
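Both representations can be sketched in a few lines. The example below is only an illustration under stated assumptions: sequences contain only A, C, G and T, the Markov chain is first order, and Euclidean and Frobenius distances stand in for whichever measures the cited methods [6-8] actually use. All function names are hypothetical.

```python
from collections import Counter
from itertools import product
from math import sqrt

ALPHABET = "ACGT"  # assumption: sequences contain only these four symbols


def ktuple_frequencies(seq, k):
    """Relative frequency of every possible k-tuple, in a fixed order."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = max(len(seq) - k + 1, 1)  # avoid dividing by zero for short inputs
    return [counts["".join(t)] / total for t in product(ALPHABET, repeat=k)]


def euclidean(u, v):
    """Euclidean distance between two frequency vectors of equal length."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def transition_matrix(seq):
    """First-order Markov transition probabilities estimated from seq."""
    counts = Counter(zip(seq, seq[1:]))
    matrix = []
    for a in ALPHABET:
        row_total = sum(counts[(a, b)] for b in ALPHABET)
        # A symbol that is never followed by another gets an all-zero row.
        matrix.append([counts[(a, b)] / row_total if row_total else 0.0
                       for b in ALPHABET])
    return matrix


def matrix_distance(p, q):
    """Frobenius distance between two transition matrices."""
    return sqrt(sum((x - y) ** 2
                    for rp, rq in zip(p, q) for x, y in zip(rp, rq)))


if __name__ == "__main__":
    s1, s2 = "ACGTACGTAC", "ACGTTTGTAC"
    print(euclidean(ktuple_frequencies(s1, 2), ktuple_frequencies(s2, 2)))
    print(matrix_distance(transition_matrix(s1), transition_matrix(s2)))
```

Neither distance requires the sequences to be aligned or to have equal length, which is the property that makes such measures attractive for divergent sequences.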
The output of sequence comparison is used to generate a phylogeny of the sequences in the form of a tree, and tree mining is then performed to extract information from this tree-structured data [9]. Tree mining helps to answer questions such as: i) whether a tree contains all the information contained in another tree or set of trees, ii) whether the constructed phylogenetic tree is correct, and iii) whether the various generated trees contain any common patterns. Several methods in the literature address these questions, among them bootstrapping, the maximum agreement subtree (MAST), and frequent pattern mining [9]. In the bootstrap method, a new alignment matrix with the same dimensions as the original multiple sequence alignment, called a bootstrap replicate, is created by resampling the columns of the alignment with replacement. Each bootstrap replicate yields a phylogenetic tree containing the same number of species [10]. Branches that recur in most bootstrap replicates have high confidence. This method has been very successful with alignment-based methods [11]. Similarity among trees can be assessed by finding common patterns within the trees, which uses the concepts of frequent pattern mining [12]. Other methods are based on finding the minimal number of changes required to transform one tree into another. One variation is the maximum agreement subtree [5], which finds the maximum common pattern among binary trees.
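The resampling step of the bootstrap can be sketched as follows. This is a deliberately reduced illustration, not a full bootstrap analysis: instead of building a tree from each replicate, it uses the pair of sequences with the smallest p-distance as a stand-in for one grouping in a tree and reports how often that pair recurs. The function names, the number of replicates, and the toy alignment are all assumptions made for the example.

```python
import random


def bootstrap_replicate(alignment, rng):
    """Resample alignment columns with replacement, keeping the dimensions."""
    n_cols = len(alignment[0])
    cols = [rng.randrange(n_cols) for _ in range(n_cols)]
    return ["".join(row[c] for c in cols) for row in alignment]


def p_distance(a, b):
    """Fraction of aligned positions at which a and b differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def closest_pair(alignment):
    """Indices (i, j), i < j, of the two sequences with the smallest p-distance.

    Ties are resolved in favour of the first pair in index order.
    """
    n = len(alignment)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return min(pairs, key=lambda ij: p_distance(alignment[ij[0]], alignment[ij[1]]))


def pair_support(alignment, pair, replicates=200, seed=0):
    """Fraction of bootstrap replicates in which `pair` is the closest pair."""
    rng = random.Random(seed)
    hits = sum(closest_pair(bootstrap_replicate(alignment, rng)) == pair
               for _ in range(replicates))
    return hits / replicates


if __name__ == "__main__":
    # Toy alignment (hypothetical data): rows are sequences, columns are sites.
    aln = ["ACGTACGTAC", "ACGTACGTTC", "TCGAACCTTG", "TCGAAGCTTG"]
    print(pair_support(aln, closest_pair(aln)))
```

In a real analysis each replicate would be passed to a tree construction method, and support would be counted per branch rather than per pair.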
Phylogeny is not limited to finding similarity and evolutionary patterns among species; it has also been used in the analysis of diseases and viruses [13]. Phylogenetic methods have also been applied to understanding cancer progression [14]. They have been widely used for tumour classification, as they generate a tumour network and infer tumour progression pathways [15]. Phylogenetic methods can solve the problem of class prediction by using a classification tree [16]. Phylogenetics is a powerful tool for grouping samples of cancer subtypes, and phylogeny inference algorithms can be used to infer how different cancer subtypes have evolved in the population [17]. Combined with deep sequencing, phylogenetic methods can help in analysing breast cancer progression. Applied to gene expression data, phylogenetic methods such as hierarchical clustering give a deeper understanding of the biological heterogeneity among cancer subtypes. Some phylogenetic methods, such as maximum parsimony, make assumptions about the rate at which characters of the sequence data change in different regions of the tree, which may give incorrect results [18].

Conclusion
This review presented three main aspects of phylogenetics. The first covers methods of sequence comparison, both alignment-based and alignment-free. The second covers methods of tree mining and tree validation. The third covers the application of phylogenetics in cancer analysis. The review has highlighted the challenges associated with sequence comparison methods. Phylogenetic methods can be an important tool in cancer analysis; however, the choice of method plays a critical role in the analysis of cancer and its progression.