Comparative analysis of DNA or protein sequences relies on an intuitively appealing model of their evolution: It starts as a random process in which every region has an equal a priori chance of mutating. However, mutations that negatively impact a functionally important region get cleared out of the population.
The mechanistic model in which functional positions on a protein sequence can accommodate only a small number of amino acid types found one of its nicest confirmations in the series of experiments led by J.H. Miller [1, 2]. In this amazing quest, spanning more than two decades, the impact of mutations in almost all positions on the DNA sequence encoding the bacterial protein LacI was examined. In the final paper of the series, Miller et al. were able to attribute almost every mutation that produced a phenotype different from the wild type to one of the several functional parts this protein possesses: folding core, hinge, inducer binding pocket, and regions responsible for dimerization and interaction with DNA.
Evolution will thus restrict the number of residue types observable at each position to the set which is allowable by the function. Nowadays, we can reproduce and trace the process in the lab. When we analyze conservation of residues or nucleotides, we are reverse engineering a nature-devised system, and looking for a plausible functional explanation for why particular residues are conserved.
Then, noting that a prominent mechanism of genome evolution is gene duplication, we may enquire which of the copies (termed paralogues) changes to acquire a new function. We can look for residues that distinguish otherwise similar groups of genes or proteins. These may, but do not need to, be conserved in both paralogous groups. After the gene duplication, the rate of evolution may stay the same in the two newly founded branches (homotachy, in the fanciful terminology of , or type II divergence ), but is in general free to proceed at different rates (heterotachy, type I divergence). As a limiting case of the former, a position may be conserved as a different residue type in each of the branches ("constant-but-different" , discriminant ), or even, as a further extreme, conserved across both groups of related proteins. In any case, locating positions with markedly different evolutionary behavior in different paralogues can be used to understand protein function or to inform its redesign.
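To make the distinction between these behaviors concrete, the following is a minimal sketch that labels a single alignment column, given the residues observed in two paralogous groups. It is an illustration under stated assumptions, not the scoring scheme of any method discussed here: within-group Shannon entropy stands in as a crude proxy for evolutionary rate, and the function names and the `max_entropy` threshold are hypothetical.

```python
import math
from collections import Counter


def entropy(residues):
    """Shannon entropy (in bits) of a list of residue characters."""
    counts = Counter(residues)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum((n / total) * math.log2(total / n) for n in counts.values())


def consensus(residues):
    """Most frequent residue in the list (ties broken arbitrarily)."""
    return Counter(residues).most_common(1)[0][0]


def classify_position(group_a, group_b, max_entropy=0.5):
    """Label one alignment column split into two paralogous groups.

    max_entropy is a hypothetical cutoff below which a group is
    treated as conserved; it is an assumption of this sketch.
    """
    a = [r for r in group_a if r != "-"]  # ignore gaps
    b = [r for r in group_b if r != "-"]
    if not a or not b:
        return "undetermined"
    a_conserved = entropy(a) <= max_entropy
    b_conserved = entropy(b) <= max_entropy
    if a_conserved and b_conserved:
        if consensus(a) == consensus(b):
            return "conserved"
        return "type II (constant-but-different)"
    if a_conserved != b_conserved:
        return "type I (conserved in one group only)"
    return "variable"


if __name__ == "__main__":
    for a, b in [("DDDD", "DDDD"), ("DDDD", "EEEE"),
                 ("DDDD", "AKGS"), ("AKGS", "TVLM")]:
        print(a, b, classify_position(a, b))
```

The sketch ignores the tree entirely, which is precisely what the likelihood-based methods described later do not do; it is meant only to fix the vocabulary.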
There are several practical problems to solve, though, to get meaningful results out of sequence comparison.
First, let us focus on the word "conserved." We might notice that it carries a hidden catch: it makes sense only when coupled with the definition of the set (or class) of sequences to which it applies. (Conserved in all protein kinases or conserved in CK1 group? Conserved in all vertebrates, or in mammals only?). The problem is twofold: we have to decide what defines the class of sequences within which we want to look for the conservation, and, then, we need to find those and only those sequences that belong to the class that we want to study.
While patterns of conservation or specialization are not hard to appreciate once they are pointed out, they might be difficult to detect by a human observer - the alignment of one hundred vertebrate genes can easily approach a megabyte of data. Therefore, we would like to have ways to detect and classify types of evolutionary behavior computationally.
When bioinformaticians develop methods for detecting any particular type of evolutionary behavior, the fundamental way in which they present their work is by publishing the method - the scoring function or the algorithm. It is the compact way, usually involving some algebra, for explaining what the method does. At this point the methods may remain nameless. The names get attached later in the process - to the implementations, and even more often, to the servers. Sometimes authors will offer the implementation for download - if well written, this is the ultimate piece of documentation for a method.
However, using an implementation directly is a task for aficionados. Servers provide shortcuts for a broader audience - they hide the implementation details from the user, and sometimes combine several sources of information. They differ widely in the way they present the output - from plain text tables that appear in the browser, to automatically generated printable reports and embedded visualization tools. It is notable, however, that the value a server is expected to add increases as the field matures.
Sometimes the involved pipeline is so complicated, prone to breaking down, difficult to completely automate, or just time-consuming to complete, that the authors decide to present their results in the form of databases of pre-calculated results. The drawback to databases is that their content is fixed, and they do not allow the interested user to inquire how a change in the input data affects the offered conclusions.
Some of the methods discussed here can be found implemented in the omnibus annotation and visualization applications [11, 12], but the discussion of these is beyond the scope of this introductory overview.
Thus, papers by Valdar [13, 14] describe the eponymous score,  describes the phylogenetic method implemented in the rate4site program,  and  integer- and real-valued evolutionary trace respectively, and  the score used for detection of functional specialization in Cube. Diverge [18, 19] is the name of the implementation of a statistical model of functional divergence described in [6, 20], and so on.
The Scorecons Server provides access to the methods described in  (with the implementation apparently not available). The SPEER score  is available both as a standalone version and as a server , while the ConSurf [23, 24] server combines structural knowledge with the conservation estimates provided by rate4site.
As examples of authors deciding to present their results as a database, we refer the reader to ConSurf-DB  for the ConSurf pipeline, ET report maker  for the real-valued trace, FunShift  for the method presented in , and Cube-DB  for .
In terms of the underlying evolutionary events they aim to capture, the available applications can be classified in broad strokes as follows.
AMAS  and evolutionary trace in its various incarnations [31, 32, 17, 26, 33, 34] seek out positions that are mutating slowly, according to the evolutionary scenario provided by the sequence similarity tree. Rate4site uses a more advanced maximum-likelihood calculation to achieve the same, at the cost of being more difficult to scale to large sequence alignments. It can be found implemented and enhanced by structural information in ConSurf [23, 24, 25] and ConSurfDB . Similarly, the INTREPID server  implements a method  that captures heuristically the variation in each subtree. Notably, it comes with a variation that makes it applicable for detection of type II events.
It is worth noting that the most straightforward methods, such as Shannon entropy and majority fraction , actually work perfectly well for the purpose, especially if a 'clean' set of orthologues is provided as an input.
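As an illustration of how little machinery these two straightforward scores need, here is a minimal sketch that computes Shannon entropy and majority fraction for each column of a toy alignment. The handling of gaps (simply ignored), the function names, and the example sequences are assumptions made for this sketch and do not reproduce any particular published implementation.

```python
import math
from collections import Counter


def column_counts(column, gap="-"):
    """Count residue types in one alignment column, ignoring gaps."""
    return Counter(ch for ch in column.upper() if ch != gap)


def shannon_entropy(column, gap="-"):
    """Shannon entropy in bits; 0 means a perfectly conserved column."""
    counts = column_counts(column, gap)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum((n / total) * math.log2(total / n) for n in counts.values())


def majority_fraction(column, gap="-"):
    """Fraction of non-gap residues that match the most common type."""
    counts = column_counts(column, gap)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return max(counts.values()) / total


if __name__ == "__main__":
    # Made-up, pre-aligned sequences of equal length (illustrative only).
    alignment = ["MKVLA", "MKILA", "MRVLG", "MKVLS"]
    for i in range(len(alignment[0])):
        column = "".join(seq[i] for seq in alignment)
        print(i + 1, column,
              round(shannon_entropy(column), 3),
              majority_fraction(column))
```

Low entropy and high majority fraction both point to conservation. Neither score uses the tree, which is why the quality of the input set of orthologues matters so much.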
Type I divergence is the type of behavior that seems to have the thinnest coverage in the literature. The FunShift  database collects results calculated using a maximum likelihood method to establish the rate of mutation across branches of a presumptive evolutionary tree . Diverge software is one of the few applications that can autonomously distinguish between type I and type II behavior [6, 20], rather than taking type II behavior as a trait to look for.
Type II divergence is the model behind the SDP prediction method , and the SDR database , implementing . They highlight the positions of functional importance that are conserved in all paralogues, a model that works well for detection of catalytic sites of enzymes [40, 41]. Treedet  implements two methods  aimed at capturing this type of behavior, as does , with the implementation available from the authors. The Multi-RELIEF method and server  apply a machine learning method (RELIEF), established in other informatics fields, to this type of behavior. The algorithm does not require the input classification of the sequences, but comes up with it on its own. SPEER-server , with the methodology described in , uses rate4site to estimate within-group conservation, and ultimately looks for positions with low mutational rates in all provided groups.
It should be noted in the light of the above discussion that Cube is not in itself a method. Rather, it is a server, providing access to several methods. Behind the server are two pieces of code (available here) implementing several conservation methods [14, 31, 17] and one specialization detection method . Its sister database of pre-calculated results is called Cube-DB, and it can be found here. The database, however, collects only the results for vertebrate proteins. The purpose of the Cube server is to enable conservation and specialization scoring for any selection of sequences provided by the user. These can be vertebrate sequences outside of the scope of ENSEMBL, other animal, fungal, bacterial, or even viral sequences.
The specialization method implemented in Cube allows description of both divergence type I and type II events. It is a lightweight application with the aim of presenting our work in a format that we have found to be practical in development and planning of experiments (mutagenesis experiments in particular): tabulation, mapping on the structure, and the sequence - the image that can further be annotated. It leaves the user fully in control over the sequences that the analysis is based on. It furthermore places side-by-side and invites the contemplation of three types of evolutionary behavior: conservation and type I and type II specialization, conserved vs. determinant and discriminating residues.
1. Miller JH, Ganem D, Lu P, Schmitz A (1977) Genetic studies of the lac repressor: I. correlation of mutational sites with specific amino acid residues: Construction of a colinear gene-protein map. Journal of molecular biology 109: 275–298.
2. Suckow J, Markiewicz P, Kleina L, Miller J, Kisters-Woike B, et al. (1996) Genetic studies of the lac repressor XV: 4000 single amino acid substitutions and analysis of the resulting phenotypes on the basis of the protein structure. J Mol Biol 261: 509-523.
3. Robins WP, Faruque SM, Mekalanos JJ (2013) Coupling mutagenesis and parallel deep sequencing to probe essential residues in a genome or gene. Proceedings of the National Academy of Sciences 110: E848–E857.
4. Adikesavan AK, Katsonis P, Marciano DC, Lua R, Herman C, et al. (2011) Separation of recombination and SOS response in Escherichia coli RecA suggests LexA interaction sites. PLoS Genetics 7: e1002244.
12. Schneider G, Sherman W, Kuchibhatla D, Ooi HS, Sirota FL, et al. (2012) Protein sequence–structure–function–network links discovered with the ANNOTATOR software suite: application to ELYS/Mel-28. In: Computational Medicine, Springer. pp. 111–143.
22. Chakraborty A, Mandloi S, Lanczycki CJ, Panchenko AR, Chakrabarti S (2012) SPEER-SERVER: a web server for prediction of protein specificity determining sites. Nucleic Acids Research 40: W242–W248.
23. Armon A, Graur D, Ben-Tal N (2001) ConSurf: an algorithmic tool for the identification of functional regions in proteins by surface mapping of phylogenetic information. Journal of Molecular Biology 307: 447–463.
24. Landau M, Mayrose I, Rosenberg Y, Glaser F, Martz E, et al. (2005) ConSurf 2005: the projection of evolutionary conservation scores of residues on protein structures. Nucleic Acids Res 33: W299-302.
37. Kalinina O, Mironov A, Gelfand M, Rakhmaninova A (2004) Automated selection of positions determining functional specificity of proteins by comparative analysis of orthologous groups in protein families. Protein Sci 13: 443-456.
45. Ye K, Anton Feenstra K, Heringa J, Ijzerman A, Marchiori E (2008) Multi-RELIEF: a method to recognize specificity determining residues from multiple sequence alignments using a Machine-Learning approach for feature weighting. Bioinformatics 24: 18-25.