Comparing and contrasting protein sequences - why and how


  • Given several groups or classes of sequences, one can look for (roughly) three types of evolutionary behavior at a single amino acid position: conserved across all groups, conserved-but-different in each group, or of varying degree of variability in each group.
  • Bioinformaticians publish their work as methods/algorithms and their implementations, servers, and databases of pre-calculated results. Each has a different purpose and target audience.
  • Cube is a server. It can be used to highlight each of the three types of evolutionary behavior at each position in a peptide.

Evolutionary behavior of biological sequences and the practical value of its analysis.

Comparative analysis of DNA or protein sequences relies on an intuitively appealing model of their evolution: evolution starts as a random process in which every region has an equal a priori chance of mutating. However, mutations that negatively impact a functionally important region get cleared out of the population.

The mechanistic model in which functional positions on a protein sequence can accommodate only a small number of amino acid types found one of its nicest confirmations in the series of experiments led by J.H. Miller [1, 2]. In this amazing quest, spanning over two decades, the impact of mutations in almost all positions on the DNA sequence encoding the bacterial protein LacI was examined. In the final paper of the series, Miller et al. were able to explain the consequence of almost all mutations resulting in a phenotype different from the wild-type as a disruption of one of the several functional parts this protein possesses: folding core, hinge, inducer binding pocket, and regions responsible for dimerization and interaction with DNA.

Evolution will thus restrict the number of residue types observable at each position to the set which is allowable by the function. Nowadays, we can reproduce and trace the process in the lab [3]. When we analyze conservation of residues or nucleotides, we are reverse engineering a nature-devised system, and looking for a plausible functional explanation for why particular residues are conserved [4].

Then, noting that a prominent mechanism of genome evolution is gene duplication, we may enquire which of the copies (termed paralogues) changes to acquire a new function [5]. We can look for residues that distinguish otherwise similar groups of genes or proteins. These may, but do not need to, be conserved in both paralogous groups [6]. After the gene duplication, the rate of evolution may stay the same in the two newly founded branches (homotachy, in the fanciful terminology of [8], or type II divergence [6]), but is in general free to proceed at different rates (heterotachy, type I divergence). As a limiting case of the former, a position may be conserved as a different residue type in each of the branches ("constant-but-different" [9], discriminant [10]), or even, as a further extreme, conserved across both groups of related proteins. In any case, locating positions with markedly different evolutionary behavior in different paralogues can be used to understand protein function or to inform its redesign [7].
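The distinctions drawn above can be made concrete with a small sketch. The Python below is illustrative only, and is not the scoring used by Cube or by any of the cited methods: it labels a single alignment column, given the residues observed in two paralogous groups, using the majority fraction with an arbitrary cutoff of 0.9. The function names, the cutoff, and the label strings are assumptions introduced here.

```python
from collections import Counter

GAP = "-"  # assumed gap symbol


def majority_fraction(residues):
    """Fraction of non-gap residues that match the most common residue type."""
    observed = [r for r in residues if r != GAP]
    if not observed:
        return 0.0
    return Counter(observed).most_common(1)[0][1] / len(observed)


def majority_residue(residues):
    """Most common non-gap residue type, or None if the column is all gaps."""
    observed = [r for r in residues if r != GAP]
    return Counter(observed).most_common(1)[0][0] if observed else None


def classify_column(group_a, group_b, threshold=0.9):
    """Label one alignment column from the residues seen in two groups.

    The threshold is an arbitrary illustrative cutoff, not a published value.
    """
    conserved_a = majority_fraction(group_a) >= threshold
    conserved_b = majority_fraction(group_b) >= threshold
    if conserved_a and conserved_b:
        if majority_residue(group_a) == majority_residue(group_b):
            return "conserved across groups"
        return "constant-but-different"
    if conserved_a != conserved_b:
        return "conserved in one group only"
    return "variable in both groups"


# Each string holds the residues found at one position in one group.
print(classify_column("KKKK", "KKKK"))  # conserved across groups
print(classify_column("KKKK", "EEEE"))  # constant-but-different
print(classify_column("KKKK", "AGST"))  # conserved in one group only
```

A hard cutoff is the crudest possible choice; the methods surveyed later in this text replace it with tree-aware or statistical scores.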

There are several practical problems to solve, though, to get meaningful results out of sequence comparison.

First, let us focus on the word "conserved." We might notice that it carries a hidden catch: it makes sense only when coupled with the definition of the set (or class) of sequences to which it applies. (Conserved in all protein kinases, or conserved in the CK1 group? Conserved in all vertebrates, or in mammals only?) The problem is twofold: we have to decide what defines the class of sequences within which we want to look for the conservation, and then we need to find those and only those sequences that belong to the class that we want to study.

While patterns of conservation or specialization are not hard to appreciate once they are pointed out, they might be difficult to detect by a human observer - the alignment of one hundred vertebrate genes can easily approach a megabyte of data. Therefore, we would like to have ways to detect and classify evolutionary behavior computationally.

Methods and their implementations; servers and databases of pre-calculated results.

When bioinformaticians develop methods for detecting any particular type of evolutionary behavior, the fundamental way in which they present their work is by publishing the method - the scoring function or the algorithm. It is the compact way, usually involving some algebra, for explaining what the method does. At this point the methods may remain nameless. The names get attached later in the process - to the implementations, and even more often, to the servers. Sometimes authors will offer the implementation for download - if well written, this is the ultimate piece of documentation for a method.

However, using an implementation directly is a task for aficionados. Servers provide shortcuts for a broader audience - they hide the implementation details from the user, and sometimes combine several sources of information. They differ widely in the way they present the output - from plain text tables that appear in the browser, to automatically generated printable reports and embedded visualization tools. It is notable, however, that the value a server is expected to add increases as the field matures.

Sometimes the involved pipeline is so complicated, prone to breaking down, difficult to completely automate, or just time-consuming to complete, that the authors decide to present their results in the form of databases of pre-calculated results. The drawback to databases is that their content is fixed, and they do not allow the interested user to inquire how a change in the input data affects the offered conclusions.

Some of the methods discussed here can be found implemented in the omnibus annotation and visualization applications [11, 12], but the discussion of these is beyond the scope of this introductory overview.

Thus, papers by Valdar [13, 14] describe the eponymous score, [15] describes the phylogenetic method implemented in the rate4site program, [16] and [17] describe the integer- and real-valued evolutionary trace respectively, and [10] describes the score used for detection of functional specialization in Cube. Diverge [18, 19] is the name of the implementation of a statistical model of functional divergence described in [6, 20], and so on.

The Scorecons Server provides access to the methods described in [14] (with the implementation apparently not available). The SPEER score [21] is available both as a standalone version and as a server [22], while the ConSurf server [23, 24] combines the structural knowledge with the conservation estimates provided by rate4site.

As an example of authors deciding to present their results as a database, we refer the reader to the ConSurf-DB  [25], serving the purpose for the ConSurf pipeline, ET report maker  [26] for real-valued trace, FunShift [27] for the method presented in  [28] or Cube-DB  [29] for  [10].

In terms of the underlying evolutionary events they aim to capture, the available applications can be classified in broad strokes as follows.

AMAS [30] and evolutionary trace in its various incarnations [31, 32, 17, 26, 33, 34] seek out positions that are mutating slowly, according to the evolutionary scenario provided by the sequence similarity tree. Rate4site uses a more advanced maximum-likelihood calculation to achieve the same, which makes the application more difficult to scale to large sequence alignments. It can be found implemented and enhanced by structural information in ConSurf [23, 24, 25] and ConSurf-DB [25]. Similarly, the INTREPID server [35] implements a method [36] that captures heuristically the variation in each subtree. Notably, it comes with a variation that makes it applicable for detection of type II events.

It is worth registering that the most straightforward methods, such as Shannon entropy and majority fraction [14], actually work perfectly well for the purpose, especially if a ’clean’ set of orthologues is provided as an input.
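As a minimal sketch of these two straightforward scores, the Python below computes the Shannon entropy (in bits) and the majority fraction for every column of a toy alignment. It is not taken from any of the cited implementations; the function names, the gap handling (gaps are simply ignored), and the example sequences are assumptions made for illustration.

```python
import math
from collections import Counter


def column_entropy(column, gap="-"):
    """Shannon entropy, in bits, of the residue frequencies in one column."""
    observed = [r for r in column if r != gap]
    if not observed:
        return 0.0
    total = len(observed)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(observed).values())


def column_majority_fraction(column, gap="-"):
    """Fraction of non-gap residues equal to the most common residue type."""
    observed = [r for r in column if r != gap]
    if not observed:
        return 0.0
    return Counter(observed).most_common(1)[0][1] / len(observed)


def score_alignment(sequences):
    """Return one (entropy, majority fraction) pair per alignment column."""
    if len({len(s) for s in sequences}) != 1:
        raise ValueError("aligned sequences must be non-empty and of equal length")
    return [(column_entropy(col), column_majority_fraction(col))
            for col in zip(*sequences)]


# Toy alignment: four sequences, three columns.
alignment = ["MKA", "MKG", "MRS", "MKT"]
for position, (entropy, fraction) in enumerate(score_alignment(alignment), start=1):
    print(position, round(entropy, 3), round(fraction, 2))
```

Low entropy and a high majority fraction both point to a conserved column; here the first column is fully conserved and the third is maximally variable for four sequences.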

Type I divergence is the type of behavior that seems to have the thinnest coverage in the literature. The FunShift database [27] collects results calculated using a maximum likelihood method to establish the rate of mutation across branches of a presumptive evolutionary tree [28]. The Diverge software is one of the few applications that can autonomously distinguish between type I and type II behavior [6, 20], rather than taking type II behavior as a trait to look for.

Type II divergence is the model behind the SDP prediction method [37] and the SDR database [38], which implements [39]. They highlight the positions of functional importance that are conserved in all paralogues, a model that works well for detection of catalytic sites of enzymes [40, 41]. Treedet [42] implements two methods [43] aimed at capturing this type of behavior, as does [44], with the implementation available from the authors. The Multi-RELIEF method and server [45] apply a machine learning method (RELIEF), established in other informatics fields, to this type of behavior. The algorithm does not require the input classification of the sequences, but comes up with it on its own. SPEER-server [22], with the methodology described in [21], uses rate4site to estimate within-group conservation, and ultimately looks for positions with low mutational rates in all provided groups.

Why Cube.

It should be noted in the light of the above discussion that Cube is not in itself a method. Rather, it is a server, providing access to several methods. Behind the server are two pieces of code (available here) implementing several conservation methods [14, 31, 17] and one specialization detection method [10]. Its sister database of pre-calculated results is called Cube-DB, and it can be found here. The database, however, collects only the results for vertebrate proteins. The purpose of the Cube server is to enable conservation and specialization scoring for any selection of sequences provided by the user. These can be vertebrate sequences outside of the scope of ENSEMBL, other animal, fungal, bacterial, or even viral sequences.

The specialization method implemented in Cube allows description of both divergence type I and type II events. It is a lightweight application with the aim of presenting our work in a format that we have found to be practical in development and planning of experiments (mutagenesis experiments in particular): tabulation, mapping on the structure, and the sequence - the image that can further be annotated. It leaves the user fully in control over the sequences that the analysis is based on. It furthermore places side-by-side and invites the contemplation of three types of evolutionary behavior: conservation and type I and type II specialization, conserved vs. determinant and discriminating residues.


1.   Miller JH, Ganem D, Lu P, Schmitz A (1977) Genetic studies of the lac repressor: I. correlation of mutational sites with specific amino acid residues: Construction of a colinear gene-protein map. Journal of molecular biology 109: 275–298.

2.   Suckow J, Markiewicz P, Kleina L, Miller J, Kisters-Woike B, et al. (1996) Genetic studies of the lac repressor XV: 4000 single amino acid substitutions and analysis of the resulting phenotypes on the basis of the protein structure. J Mol Biol 261: 509-523.

3.   Robins WP, Faruque SM, Mekalanos JJ (2013) Coupling mutagenesis and parallel deep sequencing to probe essential residues in a genome or gene. Proceedings of the National Academy of Sciences 110: E848–E857.

4.   Adikesavan AK, Katsonis P, Marciano DC, Lua R, Herman C, et al. (2011) Separation of recombination and sos response in escherichia coli reca suggests lexa interaction sites. PLoS genetics 7: e1002244.

5.   Taylor JS, Raes J (2004) Duplication and divergence: the evolution of new genes and old ideas. Annu Rev Genet 38: 615–643.

6.   Gu X (1999) Statistical methods for testing functional divergence after gene duplication. Mol Biol Evol 16: 1664-1674.

7.   Khoury GA, Fazelinia H, Chin JW, Pantazes RJ, Cirino PC, et al. (2009) Computational design of candida boidinii xylose reductase for altered cofactor specificity. Protein Science 18: 2125–2138.

8.   Lopez P, Casane D, Philippe H (2002) Heterotachy, an important process of protein evolution. Mol Biol Evol 19: 1-7.

9.   Gribaldo S, Casane D, Lopez P, Philippe H (2003) Functional divergence prediction from evolutionary analysis: a case study of vertebrate hemoglobin. Mol Biol Evol 20: 1754-1759.

10.   Bharatham K, Zhang ZH, Mihalek I (2011) Determinants, discriminants, conserved residues - a heuristic approach to detection of functional divergence in protein families. PLoS ONE 6: e24382.

11.   Porollo A, Meller J (2007) Versatile annotation and publication quality visualization of protein complexes using polyview-3d. BMC bioinformatics 8: 316.

12.   Schneider G, Sherman W, Kuchibhatla D, Ooi HS, Sirota FL, et al. (2012) Protein sequence–structure–function–network links discovered with the annotator software suite: Application to elys/mel-28. In: Computational Medicine, Springer. pp. 111–143.

13.   Valdar WS, Thornton JM (2001) Conservation helps to identify biologically relevant crystal contacts. Journal of molecular biology 313: 399–416.

14.   Valdar W (2002) Scoring residue conservation. Proteins 48: 227-241.

15.   Pupko T, Bell R, Mayrose I, Glaser F, Ben-Tal N (2002) Rate4Site: an algorithmic tool for the identification of functional regions in proteins. Bioinformatics 18: S71-77.

16.   Lichtarge O, Yamamoto K, Cohen F (1997) Identification of functional surfaces of the zinc binding domains of intracellular receptors. J Mol Biol 274: 325-337.

17.   Mihalek I, Reš I, Lichtarge O (2004) A family of evolution–entropy hybrid methods for ranking protein residues by importance. J Mol Biol 336: 1265-1282.

18.   Gu X, Vander Velden K (2002) DIVERGE: phylogeny-based analysis for functional-structural divergence of a protein family. Bioinformatics 18: 500-501.

19.   Gu X, Zou Y, Su Z, Huang W, Zhou Z, et al. (2013) An update of diverge software for functional divergence analysis of protein family. Molecular biology and evolution.

20.   Gu X (2001) Maximum-likelihood approach for gene family evolution under functional divergence. Mol Biol Evol 18: 453-464.

21.   Chakrabarti S, Bryant S, Panchenko A (2007) Functional specificity lies within the properties and evolutionary changes of amino acids. J Mol Biol 373: 801-810.

22.   Chakraborty A, Mandloi S, Lanczycki CJ, Panchenko AR, Chakrabarti S (2012) Speer-server: a web server for prediction of protein specificity determining sites. Nucleic acids research 40: W242–W248.

23.   Armon A, Graur D, Ben-Tal N (2001) Consurf: an algorithmic tool for the identification of functional regions in proteins by surface mapping of phylogenetic information. Journal of molecular biology 307: 447–463.

24.   Landau M, Mayrose I, Rosenberg Y, Glaser F, Martz E, et al. (2005) ConSurf 2005: the projection of evolutionary conservation scores of residues on protein structures. Nucleic Acids Res 33: W299-302.

25.   Goldenberg O, Erez E, Nimrod G, Ben-Tal N (2009) The consurf-db: pre-calculated evolutionary conservation profiles of protein structures. Nucleic acids research 37: D323–D327.

26.   Mihalek I, Reš I, Lichtarge O (2006) Evolutionary and structural feedback on selection of sequences for comparative analysis of proteins. Proteins 63: 87-99.

27.   Abhiman S, Sonnhammer E (2005) Funshift: a database of function shift analysis on protein subfamilies. Nucleic Acids Res 33: D197-200.

28.   Knudsen B, Miyamoto M (2001) A likelihood ratio test for evolutionary rate shifts and functional divergence among proteins. Proc Natl Acad Sci USA 98: 14512-14517.

29.   Zhang ZH, Bharatham K, Chee SM, Mihalek I (2012) Cube-db: detection of functional divergence in human protein families. Nucleic acids research 40: D490–D494.

30.   Livingstone C, Barton G (1993) Protein sequence alignments: a strategy for the hierarchical analysis of residue conservation. Bioinformatics 9: 745-756.

31.   Lichtarge O, Yamamoto K, Cohen F (1997) Identification of functional surfaces of the zinc binding domains of intracellular receptors. J Mol Biol 274: 325-337.

32.   Innis C, Shi J, Blundell T (2000) Evolutionary trace analysis of TGF-β and related growth factors: implications for site-directed mutagenesis. Protein Eng 13: 839-847.

33.   Morgan DH, Kristensen DM, Mittelman D, Lichtarge O (2006) Et viewer: an application for predicting and visualizing functional sites in protein structures. Bioinformatics 22: 2049–2050.

34.   Lua RC, Lichtarge O (2010) Pyetv: a pymol evolutionary trace viewer to analyze functional site predictions in protein complexes. Bioinformatics 26: 2981–2982.

35.   Sankararaman S, Kolaczkowski B, Sjölander K (2009) Intrepid: a web server for prediction of functionally important residues by evolutionary analysis. Nucleic acids research 37: W390–W395.

36.   Sankararaman S, Sjolander K (2008) INTREPID–INformation-theoretic TREe traversal for Protein functional site IDentification. Bioinformatics 24: 2445-2452.

37.   Kalinina O, Mironov A, Gelfand M, Rakhmaninova A (2004) Automated selection of positions determining functional specificity of proteins by comparative analysis of orthologous groups in protein families. Protein Sci 13: 443-456.

38.   Donald J, Shakhnovich E (2009) Sdr: a database of predicted specificity-determining residues in proteins. Nucleic Acids Res 37: D191-194.

39.   Donald JE, Shakhnovich EI (2005) Determining functional specificity from protein sequences. Bioinformatics 21: 2629–2635.

40.   Chakrabarty S, Panchenko A (2010) Ensemble approach to predict specificity determinants: benchmarking and validation. BMC Bioinformatics 10.

41.   Rausell A, Juan D, Pazos F, Valencia A (2010) Protein interactions and ligand binding: from protein subfamilies to functional specificity. Proc Natl Acad Sci USA 107: 1995-2000.

42.   Carro A, Tress M, De Juan D, Pazos F, Lopez-Romero P, et al. (2006) Treedet: a web server to explore sequence space. Nucleic Acids Res 34: W110-115.

43.   del Sol Mesa A, Pazos F, Valencia A (2003) Automatic methods for predicting functionally important residues. J Mol Biol 326: 1289-1302.

44.   Capra J, Singh M (2008) Characterization and prediction of residues determining protein functional specificity. Bioinformatics 24: 1473-1480.

45.   Ye K, Anton Feenstra K, Heringa J, Ijzerman A, Marchiori E (2008) Multi-RELIEF: a method to recognize specificity determining residues from multiple sequence alignments using a Machine-Learning approach for feature weighting. Bioinformatics 24: 18-25.