Royal Society Publishing

Practical and theoretical advances in predicting the function of a protein by its phylogenetic distribution

Philip R Kensche, Vera van Noort, Bas E Dutilh, Martijn A Huynen


The gap between the amount of genome information released by genome sequencing projects and our knowledge about the proteins' functions is rapidly increasing. To fill this gap, various ‘genomic-context’ methods have been proposed that exploit sequenced genomes to predict the functions of the encoded proteins. One class of methods, phylogenetic profiling, predicts protein function by correlating the phylogenetic distribution of genes with that of other genes or phenotypic characteristics. The functions of a number of proteins, including ones of medical relevance, have thus been predicted and subsequently confirmed experimentally. Additionally, various approaches to measure the similarity of phylogenetic profiles and to account for the phylogenetic bias in the data have been proposed. We review the successful applications of phylogenetic profiling and analyse the performance of various profile similarity measures with a set of one microsporidial and 25 fungal genomes. In the fungi, phylogenetic profiling yields high-confidence predictions for the highest and only the highest scoring gene pairs illustrating both the power and the limitations of the approach. Both practical examples and theoretical considerations suggest that in order to get a reliable and specific picture of a protein's function, results from phylogenetic profiling have to be combined with other sources of evidence.


1. Introduction

The approach of modern biology to understand the complexity of organisms is to analyse the cooperation of their individual components. Nevertheless, extensive experimental studies are necessary just to identify the function of a single protein in its cellular context. Owing to this, even for well-studied model organisms, the functions of many proteins are yet unknown. For example, even for the economically and scientifically important budding yeast, Saccharomyces cerevisiae, approximately 20% of the open reading frames are completely uncharacterized (Saccharomyces Genome Database, January 2007; Cherry et al. 1998) and for many other proteins, we have only knowledge of certain aspects of their function, like their subcellular location, or the phenotype after knockout of its gene.

Owing to the wealth of data accumulating in databases and by means of powerful algorithms, e.g. for aligning sequences and inferring evolutionary relations between them, it becomes increasingly interesting to predict protein functions in a comparative approach. This comparative approach is essentially based on the observation that homologous proteins retain aspects of their function over long evolutionary times. This allows for a homology-based function prediction, i.e. the transfer of knowledge about the function between homologous proteins. However, although homology frequently implies a conservation of mechanistic aspects of function, it provides less information about the functional context, i.e. about the processes a protein is involved in. For example, homologous enzymes may catalyse similar reactions, but the substrates and products involved in this reaction may be part of different pathways. The function of a protein can only be fully understood if both its mechanistic and contextual aspects are considered. Hence, homology-based function prediction can be complemented by context-based function prediction that focuses on how a protein is linked to other proteins and that uses the ‘guilt by association’ principle (Aravind 2000) to transfer knowledge between non-homologous proteins, i.e. the function of a protein can be predicted by linking it to proteins of known function. As the examples discussed in this review show, the links predicted by context-based methods like phylogenetic profiling are not very specific and generally do not provide information on the exact role of the protein in a process but frequently they are decisive to guide further analysis.

The full predictive power of the guilt by association principle unfolds with high-throughput data on co-expression, or genetic and physical interaction. For instance, for about 90% of the genes of S. cerevisiae, associations to other genes have been found (BioGRID, v. 2.0.20; Stark et al. 2006). Nevertheless, comparison with information from the literature and comparison of different protein–protein interaction datasets indicate that despite this good coverage in terms of proteins, the coverage in terms of the interactions between proteins is low (Han et al. 2005; Reguly et al. 2006). Furthermore, high-throughput data have been suggested to contain considerable levels of noise (Bader & Hogue 2002; von Mering et al. 2002). Again, the comparative approach proved to enhance the data quality, for example by considering the evolutionary conservation of co-expression (Stuart et al. 2003; van Noort et al. 2003; Bergmann et al. 2004; Snel et al. 2004).

The so-called genomic-context methods are stricter in their use of the comparative approach because they exclusively rely on comparing sequenced and annotated genomes to predict functional associations. A functional link between genes is suggested by gene fusion and fission events (Rosetta stone method; Enright et al. 1999; Marcotte et al. 1999); conservation of gene neighbourhood (Dandekar et al. 1998; Overbeek et al. 1999; Korbel et al. 2004) correlating evolutionary rates (mirror tree method; Pazos & Valencia 2001) or correlating gene occurrence (phylogenetic profiling; Gaasterland & Ragan 1998; Huynen & Bork 1998; Pellegrini et al. 1999; Date & Marcotte 2003). Gene fusion and fission are relatively rare genomic events that indicate functional links with high reliability and usually affect genes that are functionally tightly coupled (Yanai et al. 2001; von Mering et al. 2003a). Conservation of gene neighbourhood uses the genomic proximity of genes that, in particular in prokaryotes, suggests co-regulation (Yanai et al. 2002; von Mering et al. 2003a).

Since its invention, phylogenetic profiling diversified into a large number of related approaches and no systematic attempt has yet been made to sort this diversity. The original form of phylogenetic profiling uses binary vectors—the phylogenetic profiles or phyletic patterns (Tatusov et al. 1997)—that indicate in which species a homologue is present or absent (Gaasterland & Ragan 1998; Huynen & Bork 1998; Pellegrini et al. 1999). The idea is that genes that are functionally related are gained and lost together from genomes during evolution, which results in a correlation of their occurrence vectors. Nevertheless, despite the obvious simplicity and straightforwardness of the concept, it is, as so often in bioinformatics, not immediately obvious how to optimally translate the concept into an algorithm that gives the best performance. For instance, in a variant of the method, a phylogenetic profile is defined as a vector of similarities of a query gene from a query genome to its highest scoring hit in a number of subject genomes (Date & Marcotte 2003), which, in turn, is related to the mirror tree method that correlates matrices of pair-wise similarities (Pazos & Valencia 2001). Here, we first give an overview of successful applications that illustrate the principle and role of phylogenetic profiling in a simple and intuitive way and classify phylogenetic profiling as a trait correlation method. Subsequently, we describe the different variants of phylogenetic profiling and how to account for the non-independence of profile values due to the evolutionary relation between species. This discussion is concluded by pointing out links of phylogenetic profiling to similarity-based methods such as the mirror tree method (Pazos & Valencia 2001). Finally, we discuss the assumptions and limitations of phylogenetic profiling and relate them to the structure and evolution of protein function and their interactions. Although some limitations have been overcome by recent technical advances others, such as a heterogeneous distribution of evolutionary modularity over different cellular processes (Snel & Huynen 2004; Campillos et al. 2006), may inherently limit the coverage of the method. Furthermore, these limitations underpin the importance of integrating results from phylogenetic profiling, as a versatile and easily applicable method, with other types of evidence for functional associations to get a comprehensive picture of cellular systems.

2. Successful applications of phylogenetic profiling

The predictions made by phylogenetic profiling usually hint at the functional role of a protein without giving precise information about the molecular mechanisms or the nature of the functional association (Huynen et al. 2000). Frequently, complementary evidence from other context-based, homology-based or experimental methods is consulted to get more specific predictions. A number of predictions concerning various cellular processes have been made by phylogenetic profiling and were experimentally confirmed (table 1). They underscore that the role of phylogenetic profiling is frequently a pinpointing of genes that could be involved in a process about which we have only incomplete knowledge. For example, phylogenetic profiling identified enzymes of the MEP/DOXP pathway, which in plant chloroplasts, apicomplexa, cyanobacteria and a number of other bacteria produces the building blocks of isoprenoids. Cunningham et al. (2000) determined the occurrence profiles of the first five known enzymes in the MEP/DOXP pathway and found two other proteins that co-occurred with the pathway, LytB and GcpE. Although the involvement of LytB in the pathway was supported by experimental evidence, eventually further genetic experiments were necessary to determine the exact positions of LytB and GcpE within the pathway (Altincicek et al. 2001a,b). An alternative approach to phylogenetic profiling stems from the observation that a gene may be substituted during evolution by another gene of the same or similar function. Such a non-orthologous gene displacement (Koonin et al. 1996) leads to an anti-correlation of the displaced and the substitute gene's occurrence profiles. This is illustrated by the folate-dependent thymidylate synthase ThyA and its flavin-dependent counterpart ThyX (Thy1). Dynes & Firtel (1989) showed that thyX complements thymidine prototrophy in Dictyostelium discoideum but a thymidylate synthase function could not be formally proven because the exact nature of the mutation in the D. discoideum strain they used was unknown. An additional hint on the function of thyX came from the observation that the occurrence profile of thyX is complementary to that of the known folate-dependent thymidylate synthase gene thyA (Galperin & Koonin 2000). Subsequently, further complementation experiments and a biochemical (Myllykallio et al. 2002) and structural (Kuhn et al. 2002) characterization showed that ThyX is indeed a new class of flavin-dependent thymidylate synthases.

View this table:
Table 1

Phylogenetic profiling-based function predictions that have been verified in the original or by subsequent publications. Complement: anti-correlating occurrence profiles; ibid.: verification in same publication as prediction.

In many cases, we are not even aware of specific gaps in our knowledge about a cellular system. For example, a physical complex can have a different composition depending on the cellular state and an experimental exploration of all possible conditions remains impractical. Here, phylogenetic profiling can suggest new functional links. This was done, for instance, for NADH: ubiquinone oxidoreductase (Complex I), an energy-transducing multi-protein complex located in the mitochondrial inner membrane. Several of its members have paralogues that are members of Complex I itself or of other mitochondrial complexes. Gabaldon et al. (2005) noted that Complex I member N7BM (B17.2) and its paralogue B17.2L have similar occurrence profiles and were lost together from multiple independent lineages. Unfortunately, the function of N7BM within the complex was unknown and thus the co-occurrence- and homology-based link to this protein and to Complex I did not provide information on the specific function of B17.2L. It was an experimental study of Complex I assembly that showed that B17.2L is an assembly factor involved in this process and that its mutation can lead to a Complex I deficiency associated with a progressive encephalopathy in human patients (Ogilvie et al. 2005).

Generally, the examples illustrate that phylogenetic profiling derives most of its power from a large-scale approach that allows using it as an explorative method while being relatively easy to implement. After narrowing down the search space to a reasonable number of candidates, additional lines of evidence are needed. These may include results from other genomic-context methods, like gene order conservation, from published high-throughput experiments or from small-scale studies. In this process, towards the goal of discovering good candidates for further experiments, all these different sources should be considered and phylogenetic profiling is one of them. Notably, a number of web-based databases such as Predictome (Mellor et al. 2002), PLEX (Date & Marcotte 2005) or String (von Mering et al. 2007) integrate results from genomic context as well as small-scale and high-throughput experiments into unified, easily accessible interfaces and thus ease the process of gathering information for a broad public.

3. The link to the phenotype

In contrast to what may be suggested by the previous examples, co-occurrence profiling can be applied to any genotypic or phenotypic trait that can be coded as binary indicator vector. Examples of alternative genotypic traits are protein domains (Rodionov et al. 2002; Pagel et al. 2004; Liu et al. 2006), signal sequences (Haft et al. 2006), regulatory sites (Rodionov & Gelfand 2005) and restriction sites (Gelfand & Koonin 1997). Genotype/phenotype correlation has thus, for example, been applied to identify genes associated with pathogenicity (Huynen et al. 1997), hyperthermophily (Makarova et al. 2003; Jim et al. 2004), respiratory tract tropism, pili assembly (Jim et al. 2004), Gram-negativity, oxidative respiration, endospore formation, intracellular pathogenicity (Slonim et al. 2006) as well as eukaryotic and prokaryotic flagella (Jim et al. 2004; Li et al. 2004; Slonim et al. 2006). Similar to linking genes to each other via their co-occurrence in genomes, linking genes to phenotypes implicates them in a biological process. However, it does not require prior knowledge about the involvement of other genes in that process.

The link between genotype and phenotype was well exemplified by a study of the eukaryotic flagellum and basal body, which eventually identified BBS5 as a new disease gene. Li et al. (2004) selected three species—human, the aflagellate plant Arabidopsis thaliana and the flagellate alga Chlamydomonas reinhardtii. Blast searches identified genes present in human and the alga but lacking from the plant. This corresponds to a simple set union and intersection procedure that selects genes that have the same occurrence profile as the trait ‘having flagella and basal bodies’ (figure 1). Indeed, the resulting gene set included 88% of the known flagella and basal body genes of C. reinhardtii. Nevertheless, 92% of the resulting gene set was not known to be related to the flagellum and thus the reduction was not sufficient to draw conclusions about individual genes. A further intersection with a set of about 230 genes contained in a genomic locus associated with Bardet–Biedl syndrome (BBS), a deficiency of the basal body, resulted in a dramatic reduction to only two genes. Indeed, other evidence supports that one of these genes, BBS5, is related to the disease. This includes mutations that lead to premature termination of transcription of the BBS5 gene in four patients and cell-biological assays which show that that BBS5 localizes to basal bodies in flagellate cells of Caenorhabditis elegans.

Figure 1

Genotype/phenotype profiling as exemplified by the study of the eukaryotic flagellum by Li et al. (2004).

Phenotype/genotype profiling illustrates the principle of phylogenetic profiling from a new perspective by linking the environment of an organism to its molecular evolution. The environment and its change are important factors that drive the evolution because they induce more or less specific adaptations in the organism. Examples of specific adaptation that involve changes of only few proteins are the resistance to pentachlorophenol by recruitment of proteins into a new degradative pathway in Sphingomonas chlorophenolica (Copley 2000), the agglutination of lactobacilli to budding yeast, which is based on a single mannose-dependent adhesin protein that was discovered by comparing the lactobacilli's phenotypes to their genomes (Pretzer et al. 2005), or the presence of specific metabolites, such as C9-methylated glucosylceramides in fungi (Ternes et al. 2006). In contrast, other environmental factors, e.g. stress induced by extreme temperature, require adaptation of various unrelated cellular processes (Felsenstein 1985). This suggests that the diversity, specificity and correlation structure of environmental factors active on a branch of the phylogeny will determine the value of coordinated gene gains and gene losses on this branch for phylogenetic profiling and that by increasing the resolution of the phylogenetic tree also the resolution in terms of selective factors is increased. This may be one reason for the observed improvement of results from phylogenetic profiling both by increasing the number of species and by a balanced choice of species (Sun et al. 2005). The value of genotype/phenotype profiling will increase with the availability of detailed phenotypic information for many species from, for instance, literature mining (Korbel et al. 2005; Liu et al. 2006). Furthermore, the combination of phenotypic information with genotypic data from metagenomic studies (e.g. Tyson et al. 2004; Venter et al. 2004; Tringe et al. 2005) or microarrays (e.g. Pretzer et al. 2005) will probably improve the prediction of functional links not only for already sequenced model organisms but also for organisms that cannot be cultivated under laboratory conditions.

4. Co-occurrence methods

Most of the above presented success stories only considered identical profiles or used a simple distance measure, such as the number of occurrence values differing between two profiles. These ‘naive’ distance measures, however, ignore that the occurrence of a homologue in one species is not independent from its occurrence in another, probably closely related species. A number of ‘model-based’ approaches have been developed to account for this non-independence of profile values. Model-based phylogenetic profiling uses explicit models of evolution to infer gene gain and gene loss events and correlates the evolutionary processes rather than static absence/presence patterns. A complete overview and classification of all variants of phylogenetic profiling and of related methods discussed in this review is given in figure 2. For a discussion of similarity-based methods and a number of further extensions, such as methods that predict triples of functionally related genes instead of gene pairs or that focus on local regions on proteins, we refer to §§5 and 6, respectively.

Figure 2

Overview of published phylogenetic profiling methods and related matrix-similarity methods. The ‘arity’ is the number of genes between which a functional relation is predicted. ‘Localized’ methods require only a coevolution of local region on two proteins to predict a functional link.

4.1 Naive co-occurrence profiling methods

In the first application of co-occurrence profiling, Pellegrini et al. (1999) used Hamming distance between two profile vectors as similarity measure. The Hamming distance is the number of species that do not have the same absence/presence value. It is a member of a family of distance measures that also includes the Euclidean distance (§8.4.6). Notably, although these measures differ in magnitude, they produce the same order of profile pairs: if a profile pair is the Nth-closest with Hamming distance, it will also be the Nth-closest pair with Euclidean distance or any other Lp-norm. Alternatively, profiles can be compared by statistical correlation measures, such as the Pearson correlation coefficient (Glazko & Mushegian 2004), Fisher's exact test (Barker & Pagel 2005) or mutual information (Huynen et al. 2000; Wu et al. 2003). The Pearson correlation coefficient quantifies the degree of linearity of two factors and is only zero if there is no linear correlation. In contrast, mutual information is a general correlation measure that detects any kind of correlations (Steuer et al. 2002) and measures the average amount of information that one profile conveys about the other and vice versa (MacKay 2005). Fisher's exact test and mutual information have been specifically designed for categorical data such as occurrence profiles. An alternative approach is taken by Wu et al. (2003, 2006) who have derived a formula for the co-occurrence probability of two genes. Note that anti-correlating profiles are treated differently by the different similarity measures: for instance, Lp-norms assign particularly high distances to complementary profiles while Pearson correlation gives them a negative correlation coefficient. In contrast, mutual information in principle does not distinguish anti-correlating from correlating profile pairs and both just appear as pairs with high mutual information. In practice, however, anti-correlating profiles are easy to identify and can be treated separately, and have actually been used successfully to predict protein function, e.g. in the case of ThyX (see §2). Finally, phylogenetic profiles have repeatedly been compared by the Jaccard coefficient (Jaccard 1912; Glazko & Mushegian 2004; Yamada et al. 2004, 2006). The Jaccard coefficient usually accounts only for the similarity generated by co-presence by ignoring the number of genomes that do not contain any of the compared orthologues in its calculation (see §8.4.5).

4.2 Phylogeny-aware profiling

All above-mentioned statistical approaches to compare phylogenetic profiles assume statistical independence of the ‘sampled’ occurrence values in different species. However, because species are evolutionarily related the independence assumption is clearly violated, which may have negative effects on the quality of the predictions. Consistent with this, it was found that a substantial portion of modularity derived from the species distribution of orthologous genes is the results of such a phylogenetic signal (Snel & Huynen 2004). The effects of the non-independence of profile values may differ for the different naive profiling methods. For example, the similarity measures differ in how they score lineage-specific genes (table 2). Hamming distance always indicates high similarity for identical profiles independent of the lineage specificity, even for gene-families that occur only in a single genome. In contrast, mutual information is limited by the minimal information content of the profiles (see §8.4.8). Thus, for identical profiles with increasing lineage specificity, mutual information yields a lower correlation value. This results in a counter-intuitive behaviour. Consider two lineage-specific genes that are co-lost from some species within a clade (figure 3a). If a lineage encompasses less than half of the species, the additional co-losses will lead to lower mutual information, despite the additional evidence for functional linkage.

Figure 3

Negative influence of non-independence on some naive profiling methods. (a) (i) Two orthologous groups A and B occur in half of the species and have identical patterns of gains. Both Hamming distance (dH) and mutual information (MI) indicate high similarity. Although two additional co-losses in (ii) represent stronger evidence for a functional relation, the mutual information score is lower than in the previous situation (i). (b) (i) A and B are gained and lost independently but Hamming distance suggests high similarity (false positive). (ii) A single independent loss of B early in the phylogeny leads to high Hamming distance (false negative), despite two co-losses. In contrast, with differential Dollo parsimony (dP; see §8.4.3) the example of dependent evolution (ii) would correctly result in a better score than the example of independent evolution (i). dH, Hamming distance; MI, mutual information; dP, differential Dollo parsimony.

View this table:
Table 2

Naive co-occurrence measures differ in how they score ineage-specific genes. With pairs of identical profiles over 30 species both Hamming distance and Pearson correlation constantly yield scores that indicate high similarity. In contrast, Fisher's exact test (two-tailed) and mutual information yield less significance with a narrower and broader than 50% species distributions.

One way to reduce the effect of the phylogenetic bias is a reasonable, tree-guided selection of species that improves the prediction quality in comparison with a naive inclusion of all species (Sun et al. 2005). A related approach was taken for the STRING database, in which subtrees of the phylogenetic tree are collapsed and substituted by their ancestral state (von Mering et al. 2003a; figure 4). Nevertheless, both approaches still rely on naive correlation measures and thus only reduce the bias, but the remaining occurrence values are still statistically dependent.

Figure 4

Tree-guided approach implemented in the String database (von Mering et al. 2003a). A subtree is collapsed only if all its leaves have the same presence/absence pattern, i.e. if the ancestral state at the subtree's root is known with high certainty.

4.3 Model-based co-occurrence methods

A completely different approach to handle the non-independence of the profile values is to use the phylogenetic tree to correlate the evolutionary processes rather than their observed outcomes. Model-based occurrence profiling methods use a model of gene content evolution to reconstruct ancestral genomes in a phylogenetic tree based on the observed occurrence patterns. Three methodological frameworks have been applied to phylogenetic profiling: the parsimony principle, maximum likelihood and a kernel-based method that models the evolutionary process by a Bayesian tree.

According to the parsimony principle, of many alternative evolutionary histories the one with the least costs, i.e. gene gains and losses, is the most credible. Liberles et al. (2002) used Fitch's parsimony algorithm to determine the presence or absence of each gene in the ancestral species of the used phylogeny (Fitch 1971). Fitch's parsimony model allows arbitrary and equally penalized changes between character states, i.e. gain and loss of homologous groups are considered equally probable. Ambiguities arising from equally parsimonious reconstructions were resolved by branch length weighting (D. Liberles 2006, personal communication). Barker et al. (2007) used Dollo parsimony for ancestral state reconstruction. Dollo parsimony allows a gene to be gained only once throughout a phylogenetic tree, which may require an arbitrary number of subsequent gene losses (Farris 1977). Based on the parsimonious reconstruction, gain/loss profiles for branches rather than occurrence profiles for genomes are constructed. The gains and losses on different branches can be assumed as reasonably independent, which justifies an application of similarity measures like Hamming distance, Pearson correlation coefficient or mutual information. Indeed, a simple similarity measure applied to Dollo-based gain/loss profiles yielded considerably better results on a eukaryotic dataset than a naive application of Hamming distance on occurrence profiles (Barker et al. 2007).

The parsimony approach usually treats the inferred ancestral states as if they are known without error. Nevertheless, there be can considerable uncertainty in the reconstructed ancestral states even if the parsimony solution is unambiguous. A requirement for parsimony methods to accurately reconstruct ancestral states is that the rates of change are low (Omland 1999). In eukaryotes, this requirement may be met because horizontal gene transfer (HGT) from non-endosymbiotic origin is rare and mainly confined to phagotrophic protists (reviewed in Andersson 2005). In contrast, in prokaryotes, HGT provides a mechanism of repeated gene gain and is estimated to affect a fraction of 40–60% (Kunin & Ouzounis 2003; Beiko et al. 2005) or even 90% of gene families (Mirkin et al. 2003).

One way to account for the uncertainty in the estimate of the ancestral state is to use ‘parsimony intervals’ that contain a number of suboptimal solutions (Schluter et al. 1997). Recently, Zhou et al. (2006) proposed a dynamic programming algorithm to calculate such parsimony intervals for phylogenetic profile comparison. The algorithm determines the best 100 suboptimal ancestral state reconstructions for each phylogenetic profile and compares them by a similarity measure that quantifies the number of correlated events while accounting for the degree of suboptimality of the reconstructions. However, a parsimony interval of 100 reconstructions is small in comparison with the number of possible reconstructions, which is exponential in the number of ancestral species. Two alternative model-based methods account for the uncertainty in the ancestral state estimation by considering all possible reconstructions: the maximum likelihood approach of Barker et al. (2005, 2007) and the tree-kernel method of Vert (2002).

The maximum likelihood method uses continuous-time Markov models to describe the evolutionary gain and loss of two genes (Barker & Pagel 2005). In order to quantify the probability that two genes have been gained and lost together, the likelihood (i.e. the goodness of fit) of a model of contingent evolution, in which the gain and loss of one gene is influenced by the presence and absence of the other, is compared to the likelihood of a model of independent evolution. In contrast to the parsimony approaches, maximum likelihood accounts for the branch lengths in the tree. Furthermore, the likelihood values are independent from a specific ancestral state reconstruction because they are calculated over all possible combinations of ancestral states. Although this can be done in linear time (Pagel 1994), the fitting of the model remains a computationally demanding task, which has to be solved by heuristic optimization. Recently, the method was considerably improved by assuming low gain rates instead of estimating these rates for each pair of occurrence profiles (Barker et al. 2007). The global gain-rate parameter depends on the branch lengths in the phylogenetic tree and has to be estimated from the dataset itself. Furthermore, the maximum likelihood method can also account for uncertainty in the tree topology (Barker et al. 2007) that is known to negatively affect predictions made by model-based phylogenetic profiling methods (Zhou et al. 2006).

The tree-kernel method circumvents the costly estimation of the rate parameters by considering both gain and loss probabilities as priors of a Bayesian tree (Vert 2002). Each branch of the tree has two associated prior probabilities—a gain probability p(1|0) and a loss probability p(0|1). Additionally, a probability for a gene to be present at the root has to be provided. A simplifying assumption of the original publication was that the gain and loss probabilities were the same for all branches of the tree, independent of the length of the branch. However, the Bayesian tree representation allows for more complex models with branch-specific change probabilities and differing gain and loss probabilities. The virtue of the method also comes from the combination of this Bayesian tree with a kernel-based approach that allows efficient calculation of a profile distance that accounts for all possible ancestral state reconstructions.

Although in recent applications model-based methods assumed that the phylogenetic tree and the patterns are known without error, this may not always be the case. The annotation of genomes sometimes misses genes and phylogenetic profiling has successfully been used to identify these (e.g. Natale et al. 2000; Mikkelsen et al. 2005). The uncertainty in the occurrence values can be expected to be even more pronounced in metagenomic data and in data from microarray genotyping (Molenaar et al. 2005; Pretzer et al. 2005). Both the maximum likelihood method and the tree-kernel method could be modified to account for such uncertainty.

5. Similarity methods

The original phylogenetic profiling approach, which uses the co-occurrence of genes, is related to another genomic-context method that identifies functionally associated gene pairs based on correlating rates of sequence evolution. For the purpose of this review, we will refer to these two alternative approaches as ‘co-occurrence profiling’ and ‘similarity profiling’, respectively. There are a number of reasons to include a discussion of similarity-based methods in a review about phylogenetic profiling. Both co-occurrence and similarity profiling methods use the idea of coevolution of functionally related genes to predict functional associations. Furthermore, there is a significant, positive correlation between the sequence evolutionary rate (estimated as average similarity to homologues in an outgroup species) and the rate of gene loss, although both mutually explain only about 10–15% of their variation (Krylov et al. 2003). This suggests that co-occurrence profiling and similarity profiling rely on distinct but not entirely independent signals. Finally, the phylogenetic profiling method of Date & Marcotte (2003) seamlessly integrates co-occurrence profiling and similarity profiling.

The phylogenetic profiling method of Date & Marcotte (2003) correlates vectors of transformed Blastp E-values that code the similarity of a protein in a query species to their best hits in a number of subject genomes (Marcotte 2000). To make the similarity values independent from the length and sequence composition of the query protein, they can be normalized by the score of the protein's self-alignment (Enault et al. 2003). The resulting similarity vectors are compared by mutual information, which requires a binning of values. The fact that the number and boundaries of the bins can be arbitrarily chosen illustrates that the similarity vectors also contain information about the occurrence of genes. If the subject genome does not contain a significant hit, the corresponding component of the vector is set to zero but is not excluded from the profile and thus contributes to the correlation value. In the extreme situation with only two bins representing worse than cut-off (absence) and better than cut-off (presence), the Date & Marcotte method and co-occurrence profiling are equivalent (Snitkin et al. 2006). According to our definition, the method of Date & Marcotte is thus both a similarity and co-occurrence profiling method.

The data structure of pair-wise similarities of a query gene to a number of best hits in subject genomes can be extended into a matrix of pair-wise similarities between homologous genes. Goh et al. (2000) demonstrated that the correlation coefficient between the resulting similarity matrices quantifies the degree of coevolution. Comparing similarity matrices has been used by two types of methods that have different fields of application. The matrix alignment methods of Gertz et al. (2003) and Ramani & Marcotte (2003) predict pairs of interacting partners between members of two groups of paralogues that reside in a single species. For each gene family, a matrix of pair-wise similarities is constructed. Two matrices are ‘aligned’ by shuffling the columns and rows in order to make them as similar as possible, i.e. to match members of both families that are most similar in their evolutionary rates. The matrix alignment is computationally costly, which restricts the method to small gene families, preferably ones of which some members are already known to interact. In contrast, the mirror tree method of Pazos & Valencia (2001) predicts associations between two groups of orthologous genes across different species. Each matrix contains the similarities between unique representatives of a gene family from each species. Different matrices are compared by matrix correlation coefficients in order to identify pairs of gene families that have similar evolutionary rates. In contrast to the similarity-vector method of Date & Marcotte, the mirror tree method can be considered as pure similarity profiling because it only accounts for the coevolution signals of those proteins that are present and ignores the loss or gain events; organisms that do not contain a homologue are treated as missing values by skipping the respective matrix rows and columns for the calculations of matrix similarity.

To our knowledge, no model-based similarity profiling methods have yet been proposed. Nevertheless, both the similarity vector method and the mirror tree method have been modified to correct for the dependence of similarity values on the distance of the compared species. In the similarity vector method of Date & Marcotte (2003), the similarity of the query protein to its best hits in a subject genome can be expected to decrease with increasing evolutionary distance of the query and subject species. This distance effect can be corrected for by dividing each vector component by the average similarity of all proteins' hits to the subject genome (Enault et al. 2003). Similarly, the mirror tree method can be corrected for the background similarity of the genomes either by using an algebraic projection operator (a linear map; Sato et al. 2005) or by subtracting a distance matrix based on a species tree from the homologous group distance matrices (Pazos et al. 2005).

6. Limitations and extensions

Although the successful small-scale studies using phylogenetic profiling provide an intuitive introduction to the method, phylogenetic profiling remains abstract, as long as one did not go through the process of using it on real data. We analysed a set of 25 fungi and the microsporidium Encephalitozoon cuniculi with 12 naive and model-based co-occurrence profiling methods. Although we observe a low overall performance with this dataset (figure 5a), highly reliable predictions can be made if a stringent score cut-off is used and only the highest scoring orthologous group pairs are considered as positive predictions (figure 5b), which indicates the presence of a signal that reflects functional associations. The fraction of positive controls (true positives) among the positive predictions drops quickly if the cut-offs are relaxed (figure 6), an observation that has been made before (e.g. Enault et al. 2003; von Mering et al. 2003a; Barker & Pagel 2005; Snitkin et al. 2006; Zhou et al. 2006) and can be explained by a number of technical causes, such as incorrect gene occurrence values due to misannotations of genomes or incorrect sets of positive and negative controls. One likely important cause for observing a low performance of phylogenetic profiling is the low number of only 26 genomes used here. Although it is of the same order of magnitude as in multiple other studies that used eukaryotic genomes (Barker & Pagel 2005; Snitkin et al. 2006; Barker et al. 2007), a recent study on the similarity/occurrence vector method of Date & Marcotte (2003) showed that this method yields significantly better results with increasing number of bacterial genomes (Sun et al. 2005). This probably applies also to the pure co-occurrence profiling methods benchmarked here. Nevertheless, also the studies on larger sets of genomes indicated a relatively limited coverage for gene co-occurrence (Huynen et al. 2003; von Mering et al. 2003a).

Figure 5

(a) Phylogenetic profiling has low overall performance (b) but makes highly reliable predictions for the highest scoring 5% of orthologous group pairs. Plotted are the bootstrap medians of AUC and PPV0.05 estimated from a bootstrap sample (n=100) of positive and negative controls. (Results are based on a set of orthologous groups for 25 fungi and the microsporidium E. cuniculi (figure 1 of the electronic supplementary material) and functional associations for S. cerevisiae from the MIPS (Mewes et al. 2006) and KEGG (Kanehisa et al. 2004) databases. In the bootstrapping procedure, each MIPS complex (KEGG pathway) was given the same weight to account for the overrepresentation of some functional categories in the MIPS dataset, such as the large ribosomal subunit. The bootstraps for the PPV estimates were done such that positive and negative controls were sampled with the same probability as to produce an average P/(P+N) ratio of 0.5. Box plots of the ‘weighted’ as well as a ‘normal’, i.e. non-weighted, bootstrap distributions are shown in figure 2 of the electronic supplementary material.) Dashed line, performance of the random classifier; AUC, area under ROC curve; PPV0.05, positive predictive value, i.e. fraction of true positives among the 5% highest scoring predictions. Open circles, MIPS; open squares, KEGG.

Figure 6

The positive predictive value drops quickly with increasing rate of positive predictions of the full MIPS dataset. The KEGG dataset produced similar results (data not shown). The dashed horizontal line indicates the performance of a random classifier, i.e. P/(P+N).

An alternative cause for the limited coverage of phylogenetic profiling is the discrepancy between the complexity of organisms and the severe simplifications required for a phylogenetic profiling analysis. A discussion of this discrepancy may suggest limitations of phylogenetic profiling and serves to introduce some methods that can be interpreted as extensions of, or at least closely related to phylogenetic profiling. For this discussion, it is helpful to consider ‘functional similarity’ as quantifiable. For example, a quantification can be based on similarity measures for controlled vocabularies that describe protein function, such as the function description in the clusters of orthologous groups (COG) database (Tatusov et al. 2000) or the gene ontology (Ashburner et al. 2000), e.g. the co-occurrence of terms associated with two proteins could be quantified by the Jaccard coefficient or the average Resnik similarity for ontology terms (Resnik 1999).

6.1 Functional divergence after gene duplication

Already during the process of profile construction, i.e. the mapping of the complexity of organisms into simple vectors of numbers, information is lost. The construction of phylogenetic profiles from groups of homologous genes is relatively easy by standard homology detection algorithms, but ignores that genome evolution is highly dynamic and driven by events such as de novo gene genesis, gene duplication, gene loss and HGT. Consequently, multiple homologous genes may coexist in the same genome. In particular, sub- or neo-functionalization of duplicated genes (reviewed in e.g. Roth et al. 2006) represents a major complication to any homology-based and genomic-context method, because they violate the assumption that homologous genes retain their function during evolution. For instance, if the homologue of ancestral function is not maintained but lost from one genome then, in the phylogenetic profile, the presence of homologous genes in different genomes will represent the presence of different functions.

Fortunately, the impact of functional divergence after duplication on the quality of genomic context-based predictions can be reduced by splitting homologous groups into orthologous groups. Two homologous genes are called orthologues if they were separated by a speciation. In contrast, if two genes have been separated by gene duplication they are called paralogues (Fitch 1970). Reliable orthology and paralogy relations between members of a gene family can be determined either by gene tree reconciliation, i.e. by comparing the tree of homologous genes to a trusted species tree or by examining the species overlap between subtrees to infer duplications (van der Heijden et al. 2007). Nevertheless, due to the practical complications of large-scale phylogenetic analyses, orthology is usually operationally based on some clustering approach or other heuristic that uses relative levels of sequence similarity (e.g. Tatusov et al. 1997; Remm et al. 2001). Alternative to constructing the orthologous groups and phylogenetic profiles by oneself, one can also download them from public databases such as RoundUp (Deluca et al. 2006) or COG (Tatusov et al. 2003).

By switching from homologous groups to orthologous groups, the impact of functional divergence after duplication can be reduced, but the situation is not principally solved. Despite the orthology relationship, still about 40% of the bacterial orthologous groups in the COG database contain paralogues that do not originate from species-specific family expansion (Tatusov et al. 2000). Furthermore, in a study of modularity based on various genomic-context methods, it was found that 40% of the genomic context-derived modules could not be unambiguously assigned to a metabolic pathway owing to the limited resolution of their orthologous groups (von Mering et al. 2003b).

6.2 Multifunctionality

Proteins often functionally interact with or depend on various other proteins. If all these associated proteins are functionally related with each other, then the functional context of the protein is homogeneous. An interesting situation arises for ‘multifunctional’ proteins whose functional context is heterogeneous, i.e. for those proteins that have ‘long-range’ dependencies bridging different processes. In this situation, it is not obvious whether it is more advantageous to maintain the gene or to lose it. If one part of the gene's functional context is lost, the loss of the gene itself may still affect the functioning of other parts of its context. Hence, multifunctionality probably influences the gene loss rate and thus the occurrence values in the phylogenetic profiles. There are numerous examples of multifunctional and ‘moonlighting’ proteins some of which are members or important cellular processes like glycolysis or tricarboxylic acid cycle (for a review see Jeffery 1999; Moore 2004). High-throughput protein interaction experiments indicate that many proteins are shared between different complexes (Gavin et al. 2002; Krause et al. 2004), which suggests that multifunctionality may be an abundant phenomenon. Various mechanisms allow a protein to act in different functional contexts, such as broad substrate specificity, differential expression and differential localization (Jeffery 1999). One of these mechanisms, the functional specialization of different domains of the protein, allows a protein to contribute to different processes by the activity of its different domains. It is amenable to what could be called ‘localized’ phylogenetic profiling methods. Both co-occurrence and similarity profiling methods have been extended in this direction.

Pagel et al. (2004) proposed a domain-based co-occurrence profiling method. For each PFAM and SCOP domain (Bateman et al. 2002; Andreeva et al. 2004), they constructed an occurrence profile and compared profile pairs with Hamming distance in order to predict functional relations between domains. Subsequently, proteins were considered as functionally related if they contained such functionally related domains. Thus, the approach taken by Pagel and co-workers compares proteins indirectly via their domain content. In contrast, the localized similarity profiling introduced by Kim & Subramaniam (2006) directly compares the sequences of proteins by local alignments using a variant of the phylogenetic profiling method of Date & Marcotte (2003). Each query protein (from a query species) is considered as a set of overlapping 120 amino acid segments that are individually compared to the database of proteins. The coevolution between two protein segments is quantified by mutual information, just as for full length proteins in the Date and Marcotte method. The correlation between two proteins was defined as the highest correlation between any of their segments (Kim & Subramaniam 2006). In order to be independent of pre-defined segments, Kim et al. (2006) further improved this method into residue-level profiles. Both local co-occurrence profiling and similarity profiling were found to be largely complementary to their global variants (Pagel et al. 2004; Kim & Subramaniam 2006). For example, the methods of Kim and Subramaniam and of Date and Marcotte had very similar performance but shared only about 36% of their predictions or less depending on the choice of cut-off (Kim & Subramaniam 2006; Kim et al. 2006). Localized phylogenetic profiling thus provides an additional source of evidence for functional relation and adds information when combined with global profiling methods.

6.3 Higher order functional relations

Most co-occurrence profiling methods make two simplifying assumptions about the dependencies of proteins: they only search for binary, i.e. pair-wise, and symmetric relationships. ‘Symmetric’ here means that if gene A is associated with gene B then the reverse, i.e. gene B is associated with gene A, is also true. For example, for Hamming distance, it is irrelevant if we calculate the distance dH(A, B) or dH(B, A). Consequently, co-occurrence profiling may frequently fail to detect asymmetric relationships. Barker & Pagel (2005) pointed out that their maximum likelihood method can be adapted to model contingent evolution of gene pairs, i.e. to model asymmetric, binary relationships between phylogenetic profiles. For example, asymmetric relationships could occur whenever a major system, such as a complex or pathway, is modified during evolution. While the main function of a complex is evolutionarily conserved, accessory proteins could be added or removed for fine tuning. Sometimes these modifications of systems happen in a gradual fashion over longer evolutionary time-scales, as has been observed for a variety of multi-protein complexes and metabolic pathways (Fothergill-Gilmore & Michels 1993; Petsko et al. 1993; Gabaldon et al. 2005; Tanaka et al. 2005). Asymmetric, binary relationships could be used to uncover these evolutionary trends, in particular, if more sequenced genomes become available and the resolution of phylogenetic trees is increased. Nevertheless, currently a systematic assessment of binary, asymmetric relationships between occurrence profiles is lacking. In contrast, ternary relationships, i.e. for triples of gene families, have been studied in the past. In the Boolean logic formalism proposed by Bowers et al. (2004), the profiles of orthologous groups are interpreted as vectors of truth values. For example, consider an enzyme C that uses substrates from both a pathway containing enzyme A and from a pathway containing enzyme B. The relation between the three enzymes can be expressed as the logic relation ABC. Given such a relation, it can be concluded that both the activities of A and B are required for the activity of C. For example, for members A and B of two signalling cascades whose crosstalk involves the protein C. Bowers and co-workers identified eight possible ‘logic types’ that model different possible relationships between genes, excluding cases that could easily be modelled by binary relationships, and scored them using an entropy-related measure. Intriguingly, logic types that are easier to relate to our understanding of biological and evolutionary relationships tended to be more frequent than those that are difficult to interpret. Owing to the higher expressiveness of the ternary relation formalism, its results add to the results of classical, binary co-occurrence methods. Recently, the formalism has been used to model more complex relations of four genes (Zhang et al. 2006). It should, however, be noted that the identification of higher order relationships has practical limitations. The number of possible logic functions is exponential in the number of related genes. Each of these possible relations can be interpreted as an alternative hypothesis and the fitting of such a large number of hypotheses may not be possible with the currently available number of genomes. Furthermore, also the number of combinations of genes between which higher order relations are inferred grows quickly, which limits further extensions to small gene sets. Finally, with increasing complexity of the logic functions, a biological interpretation of the relationship between the involved genes becomes harder.

6.4 The evolution of functional context

If we consider binary functional dependencies and ignore that these dependencies can be asymmetric, we obtain a network of functional associations. Networks of functional associations between proteins can evolve on the level of edges, i.e. the interactions between proteins, and on the level of nodes, i.e. by gain and loss of protein-encoding genes. Obviously, both levels are linked because if a gene is lost from a genome also all functional links to it are lost. The edge-level evolution can be interpreted as a change of protein function; more specifically, by rewiring the network the functional context of a protein changes. The edge-level of the network is particularly relevant for phylogenetic profiling simply because phylogenetic profiling tries to predict the edges. However, it does this in a comparative way and thus disregards that the edges may have changed during evolution. Functional links between genes must have persisted sufficiently long during evolution to be detectable by phylogenetic profiling methods.

Unfortunately, no conclusive answer can yet be given on what is this rate of edge-level evolution, even for protein–protein interaction networks or networks derived from co-expression. The estimation of this rate is hampered by the high levels of noise in high-throughput data (Snel et al. 2004; Cesareni et al. 2005; Gandhi et al. 2006). For example, for the same dataset of tissue-specific expression in human and mouse, estimates of the fraction of gene pairs with conserved co-expression range between less than 10 and 84% (Liao & Zhang 2006; Tsaparas et al. 2006). In comparison, for gene pairs found in the relatively distantly related worm and budding yeast, the fraction of co-expressed gene pairs that share a transcription factor-binding site in yeast that is also co-expressed in worm was determined to be 76% (Snel et al. 2004), which indicates a high conservation of co-expression at least for conserved gene pairs.

A different aspect of the evolution of functional context is the ‘severity’ of change. A ‘severe’ change of functional context occurs if a protein is recruited to a new context that is unrelated to its original context. This recruitment may or may not involve a loss of the original functional links. Notably, if a protein maintains both its original and acquired functions, its functional context becomes heterogeneous, i.e. it can be considered multifunctional. However, if the original function is lost then the selective forces acting on the protein before and after the recruitment will be distinct. Recruitment appears to be a major mode in the evolution of metabolic pathways (Teichmann et al. 2001; Lecompte et al. 2002; Rison et al. 2002; Light & Kraulis 2004;). In contrast, ancestral and acquired functions can be related such that the ancestral and the divergent form of a gene may be subjected to similar selective constraints. For example, in the small-molecule metabolic network of Escherichia coli homologous genes separated by 11 or less reaction steps are overall rare but occur more frequently at very short distances (1–3), i.e. with relatively high functional similarity, than at larger distances (4–11; Rison et al. 2002). Similarly, duplicated physical complexes in most cases have similar functions (Pereira-Leal & Teichmann 2005).

6.5 Evolutionary modularity

A further abstraction from networks of functional links is to analyse the structure of organisms in terms of modularity. Usually, a module is defined as a group of proteins that have stronger associations among each other than to proteins outside of the module. The modular structure of biological networks is figured as hierarchical and overlapping (e.g. Ravasz et al. 2002; Palla et al. 2005): different modules are related to each other by combination into higher order modules or by coupling proteins. Certain topological properties of biological networks can be explained by high false positive rates in interaction data (Han et al. 2005) or by simple evolutionary models (Amoutzias et al. 2004; van Noort et al. 2004). Nevertheless, the modular organization is apparent and it remains an open question which selective forces or evolutionary mechanisms are responsible (e.g. Wagner 1996; Kashtan & Alon 2005). One would expect that the functional modularity of organisms constraints the evolutionary process, leading to a definition of evolutionary modularity that closely follows the general definition of modularity: an evolutionary module can be defined as a group of genes that have stronger co-evolved with each other than with genes outside of the module.

A number of genomic-context studies have included gene co-occurrence to find functional modules (Snel et al. 2002; Yanai & DeLisi 2002; von Mering et al. 2003b; Yamada et al. 2004, 2006; Spirin et al. 2006; Wu et al. 2006). Despite the technical and conceptual problems, genomic-context networks do indeed reflect functional modules, although the resulting modules are frequently distinct from traditional module definitions (Spirin et al. 2006; Yamada et al. 2006). The highest scoring pair-wise associations predicted by phylogenetic profiling can be considered as small co-gained and co-lost evolutionary modules and clearly contain information about function. Furthermore, also larger modular structures are reflected in the networks. Of particular value for their description are multifunctional network nodes that link different modules. For example, metabolic networks are usually modelled as graphs with nodes representing metabolites and the edges between them representing enzymes. The linkers in these networks are metabolites that are involved in diverse pathways. Indeed, in networks derived from genomic context, the evidence of a functional association between genes decreased with the degree of the metabolite node connecting them (Huynen & Snel 2003; von Mering et al. 2003b) and, consistently, for linear pathways the correspondence of the genomic-context network and the known metabolic network is highest (Spirin et al. 2006). A similar observation was made in a network based on gene order conservation. The linkers in these networks connect locally unconnected clusters and were enriched in multifunctional enzymes (Snel et al. 2002).

A lack of evolutionary modularity may be an important cause for the limited predictive coverage of phylogenetic profiling and other genomic-context methods. Studies that did not rely on genomic context to define their modules found a heterogeneous distribution of their evolutionary modularity; only about half of the functional classes, like metabolic pathways or protein complexes, evolved significantly modularly (Snel & Huynen 2004; Campillos et al. 2006). Consistent with this, genomic context is not equally informative about every cellular process (Campillos et al. 2006). For example, catabolic pathways were found to be less modular than biosynthetic pathways (Snel & Huynen 2004; Campillos et al. 2006) and showed a lower coverage in combined genomic-context networks (von Mering et al. 2003b). This suggests that the signal that genomic-context methods rely on may have only a limited coverage simply because there are many cellular processes whose evolution does not reflect a modular structure.

7. Concluding remarks

Phylogenetic profiling is a versatile method for the prediction of functional interactions, but its coverage is limited to those cellular systems that evolved in a modular fashion. Nevertheless, basically all experimental and computational methods have limits as they measure only one or few aspects of functional association. For instance, microarrays only capture co-expression on the level of mRNA abundance and thus ignore the effects of post-translational regulation. Numerous examples show that the evolutionary evidence for a functional interaction based on phylogenetic profiling as well as other genomic-context methods is frequently neither strong nor specific enough to pin down the exact function of a protein. Consequently, in virtually all examples, additional experimental evidence was used to stress the conclusions drawn from phylogenetic profiling and to understand protein function at a much higher level of detail.

The necessity to combine multiple lines of evidence extrapolates to phylogenetic profiling used as a broad-scale method that predicts functional interactions for a large number of gene pairs. Phylogenetic profiling has thus been integrated with other genomic-context methods. For prokaryotes, conservation of gene neighbourhood is usually found to have the highest coverage, followed by phylogenetic profiling and, finally, gene fusion/fission (Huynen et al. 2000; Manson McGuire & Church 2000; Yanai & DeLisi 2002; von Mering et al. 2003b). Genomic-context methods can have higher coverage than for example yeast two-hybrid at about the same accuracy (von Mering et al. 2002). Although such measurements depend somewhat on the benchmarking set (Lee et al. 2004), this shows that genomic-context methods can well compete with high-throughput experimental approaches. Frequently, one type of genomic evidence is clearly dominating (Huynen et al. 2000; Yanai & DeLisi 2002) and, thus, their contributions to the integrated network are complementary. It will be interesting to see if the predictions made by localized profiling approaches, ternary interactions and binary, asymmetric interactions will add further to the predictive coverage of phylogenetic profiling and of genomic-context methods in general.

8. Data and methods

8.1 Phylogenetic profiles

The protein sequences of the microsporidium Encephalitozoon cuniculi and of 25 fungi (Ustilago maydis, Cryptococcus neoformans, Phanerochaete chrysosporium, Stagonospora nodorum, Aspergillus nidulans, Aspergillus fumigatus, Magnaporthe grisea, Neurospora crassa, Trichoderma reesei, Fusarium graminearum, Schizosaccharomyces pombe, Yarrowia lipolytica, Debaromyces hansenii, Candida albicans, Candida glabrata, Ashbya gossypii, Kluyveromyces lactis, Kluyveromyces waltii, Saccharomyces kluyveri, Saccharomyces castellii, Saccharomyces bayanus, Saccharomyces kudriavzevii, Saccharomyces mikatae, Saccharomyces paradoxus, Saccharomyces cerevisiae) were downloaded from the corresponding sequencing projects. Similarity scores between the proteomes were computed using the Smith–Waterman P algorithm (Smith & Waterman 1981) on a TimeLogic DeCypher (matrix: Blosum62; e-value cut-off: 0.01; low-complexity filter on). The orthologous groups were constructed by an approach similar to that used for the COGs (Tatusov et al. 2000). Inparalogues specific to a clade usually are more similar to each other than to any gene outside the clade. This was used to construct species-specific inparalogous groups and inparalogous groups specific to the Saccharomyces sensu stricto clade (S. kudriavzevii, S. mikatae, S. bayanus, S. paradoxus, S. cerevisiae) whose members are closely related. Subsequently, triangles of mutual best hits between the inparalogous groups were merged if they shared two members (Tatusov et al. 2000). The resulting orthologous groups were further refined: we aligned the orthologous genes using Muscle v. 3.52 (Edgar 2004) with default parameters, calculated neighbour-joining trees with Bio-NJ (Gascuel 1997), inferred duplications with Loft (van der Heijden et al. 2007) and split orthologous groups according to ancient duplications. These refined orthologous groups were used to construct the phylogenetic profiles.

8.2 Control datasets

We retrieved the complex catalogue from the Munich Information Centre for Protein Sequences (MIPS; as of 14 November 2005; Mewes et al. 2006). From this hierarchy of categories, we removed those that had subcategories, contained the keywords ‘other’ or ‘complexes’, or referred to high-throughput studies (category 550). The procedure yielded a total of 1195 assignments of orthologues to 195 complexes. All pairs of orthologous groups that shared a MIPS complex were considered as positive controls. An alternative positive control dataset was constructed from pairs of orthologous groups with members in budding yeast that shared a KEGG map (as of 17 January 2006; 187 maps; Kanehisa et al. 2004) and are found in the same cellular location (Huh et al. 2003). We found that the latter requirement improved the benchmarking results (data not shown). The negative controls, which should be functionally unrelated, were constructed from the orthologous groups that occurred in the positive controls. To this aim, from each set of positive controls, we sampled pairs of orthologous groups that did not occur in any of the positive control datasets. Additionally, negative controls were not allowed to reside in the same cellular location in budding yeast (Huh et al. 2003). We took this approach to have the same distribution of profile entropies and loss rates in positive and negative controls because some methods, such as mutual information, are sensitive to the entropy of the profiles (see §8.4.8) or loss rate (Barker et al. 2007). Note that both control datasets only refer to a subset of the 5997 orthologous groups present in budding yeast. From all datasets, we excluded anti-correlating pairs (Pearson correlation r<0) because mutual information score gives high scores to both correlating and anti-correlating pairs and thus deviates from the other methods. Similarly, pan-orthologues, i.e. orthologues or profiles with presence values for all species in the dataset, were excluded because the maximum likelihood approach (Barker & Pagel 2005) and the Pearson correlation coefficient cannot be calculated for these. A summary of the data filtering is given in table 1 of the electronic supplementary material.

8.3 Fungal tree

The fungal tree was based on a concatenated alignment of selected sequences (Dutilh et al. 2007): only orthologous genes occurring in at least 25 of the 26 genomes were used. Thus, in order to have a maximum of residues for the tree construction, the unnecessary loss of orthologous groups that were only lacking from, for example, the degenerated genome of E. cuniculi was minimized. We started by merging inparalogous groups (see §8.1) into clusters based on the bidirectional best hits between them. Subsequently, species-specific expansions were filtered out: we aligned the genes in each cluster using Muscle (Edgar 2004) with default parameters, and calculated pair-wise protein distances with Tree-Puzzle v. 5.2 (Schmidt et al. 2002; approximate parameter estimates; parameter estimation uses neighbour-joining tree; JTT model of substitution; estimate amino acid frequencies from dataset; four gamma categories; alpha=1.00 (weak rate heterogeneity)). From the pair-wise distances, the neighbour-joining trees were constructed with Bio-NJ (Gascuel 1997) and after identification of species-specific duplications by Loft (van der Heijden et al. 2007) from each genome, only the duplicate with the shortest branch lengths to the root was retained. The underlying assumption is that the gene with the highest sequence conservation is most probably the one with the conserved function after duplication. This yielded 229 orthologous groups that contained no more than one gene in each of at least 25 species. For genes that were lacking from a genome, the concatenated Muscle alignments contained a gap. We used GBlocks (default parameters; Castresana 2000) to select blocks of unambiguously aligned amino acids, which resulted in a super-alignment of 132 409 conserved positions. Finally, we used PhyML to calculate the maximum likelihood tree depicted in figure 1 of the electronic supplementary material (JTT model of substitution; estimated proportion of invariable sites; four substitution rate categories; gamma fixed with alpha=1.00; Guindon & Gascuel 2003). This tree was used for the tree-guided and model-based profiling methods.

8.4 Phylogenetic profiling methods

8.4.1 Tree-kernel method

The C implementation of the tree-kernel method was downloaded from∼vert/publi/ismb02/index.html and run with the default gain/loss and retention probabilities (Vert 2002). We used the distances in feature space as scores with small distances corresponding to high similarity of evolutionary history.

8.4.2 Maximum likelihood method

The likelihood ratios (LhRs) from the maximum likelihood method were calculated with a version of BayesMultiState program that allows fixing the state at the trees root (Barker & Pagel 2005), which was kindly provided by Andrew Meade. It is defined as Embedded Image, with Embedded Image denoting the maximum likelihood of the null model of independent evolution and Embedded Image the maximum likelihood of the alternative model of contingent evolution. The number of tries for finding the maximum likelihood of each model was set to 100, which deviates considerably from the default (10 tries) and was chosen as a suitable trade-off between computational costs and the variance of the estimated LhR as determined in five repeats of a sample of 100 profiles (data not shown). Despite the increased number of tries in few cases, we observed negative LhRs, which indicate a problem of the algorithm to find the global likelihood optimum. In these cases, we ran the algorithm another 100 tries, which reduced the number of orthologous group pairs with LhR less than 0 to 17.

8.4.3 Differential parsimony

The distance d(A, B) between the parsimonious reconstructions of two orthologous groups A and B was calculated as (D. Liberles 2006, personal communication)Embedded Image

Here, for instance, anc(ai) denotes the ancestral state for the branch i inferred for the orthologous group A and desc(ai) denotes the corresponding descendant's state. This gives no penalty to coordinate gains, coordinate losses or if no gain or loss of either orthologous group occurred, a penalty of 1 to independent gains or losses and a penalty of 2 for coordinate gain in one orthologous groups and loss in the other. In our implementation of differential parsimony, we use Dollo parsimony instead of Fitch parsimony (cf. Liberles et al. 2002) and we restricted the dataset to orthologous groups present in budding yeast. Consequently, a penalty of 2 did not occur. Notably, an alternative representation of the ancestral state reconstruction is as a vector with components corresponding to branches of the phylogeny and values corresponding to the ‘gain’ and ‘loss’ events as well as ‘no’ event. In the absence of penalty 2, differential parsimony is simply the Hamming distance between these gain/loss vectors. As an alternative similarity measure between the gain/loss vectors, we used a two-tailed Fisher's exact test to quantify the co-loss probability on the branches of the tree.

8.4.4 Tree-guided methods

Both mutual information and Jaccard coefficient used the tree-guided collapsing of subtrees that is described in §4.2. Furthermore, a single pseudocount for each of the four possible presence/absence combinations was added (von Mering et al. 2003a).

8.4.5 Jaccard coefficient

The Jaccard coefficient of two occurrence vectors A and B (Jaccard 1912) can be defined for co-occurrence of presences (a=1) as well as for co-occurrences of absences (a=0; see below). Here, we used the co-occurrence of presences (a=1).Embedded Image

8.4.6 Lp-norms

Lp-norms are defined as Embedded Image. Frequently used Lp-norms are the Manhattan distance or Hamming distance (p=1) and the Euclidean distance (p=2). For arbitrary values of p, the orthologous group pairs will have the same order when sorted by distance Lp and we thus use only the Hamming distance.

8.4.7 Pearson correlation coefficient

We used a standard linear correlation coefficientEmbedded Image

8.4.8 Mutual information

Mutual information can be defined based on the Kullback entropy between two probability distributions {p0} and {p} asEmbedded Image

The Kullback entropy quantifies the amount of information gained when substituting distribution {p0} by distribution {p}. The index i refers to the possible values drawn from the distributions. For the purpose of comparing two profiles A and B, the possible values correspond to the four combinations of occurrence values of two genes, i.e. we define pip(a, b) and pi0p0(a, b), aA, bB. We furthermore assume that the occurrence values of the genes are statistically independent, i.e. p0(a, b)=p(a) p(b). Thus, the mutual information is defined asEmbedded Image

The probabilities p are usually estimated by the frequencies of the occurrence values (for p(a) and p(b)) or combinations of occurrence values (for p(a, b)) in the compared profiles. Note that the maximally achievable mutual information is determined by the minimum entropy of the compared profiles (Steuer et al. 2002)Embedded ImageH(A) being the entropy of a profile defined asEmbedded Image

8.5 Benchmarking

The positive predictive value (PPV) is defined asEmbedded Image

Here, TP(x) and FP(x) are the numbers of true positives and false positives, respectively, in the test set that score better than the specified fraction or number x of positive prediction. If the fraction of positive predictions is set to x=1 then the positive predictive value will be completely determined by the ratio of positive and negative controls in the dataset, i.e. PPV=P/(P+N). This ratio is equivalent to the performance of the random classifier that selects positive predictions randomly from the controls. To be able to compare positive predictive values between datasets, it is important that the datasets have the same ratio of positive to negative controls. Alternatively, identical P/N ratios can be achieved by bootstrapping (see §8.5.1).

8.5.1 Bootstrapping

We used a weighted bootstrapping procedure that assigns equal weights to all complexes/pathways and generates a ratio of positive to negative controls of 1 : 1. The members of each complex/pathway were assigned the weight 1/(complex size). Proteins shared between complexes/pathways were assigned the average of the per complex/pathway weights. The negative controls were given the same weight as the sum of the weights of the positive controls.


We are grateful to David Liberles for giving us information on the differential parsimony approach and Andrew Meade for providing the newest version of the BayesMultiState software. We thank Fiona Nielsen for valuable discussion. We also thank the three anonymous referees who helped to improve this document. This work was funded by the Netherlands Bioinformatics Centre (NBIC), which is supported by the Netherlands Genomics Initiative (NGI) and by the European Union's Sixth Framework Programme EPISTEM (CT-2005-019067). The orthologous groups were constructed by V.N. The fungal tree was constructed by B.E.D.


    • Received March 15, 2007.
    • Accepted May 5, 2007.


View Abstract