## Abstract

The analysis of molecular networks, such as transcriptional, metabolic and protein interaction networks, has progressed substantially because of the power of models from statistical physics. Increasingly, the data are becoming so detailed—though not always complete or correct—that the simple models are reaching the limits of their usefulness. Here, we will discuss how network information can be described and to some extent quantified. In particular statistics offers a range of tools, such as model selection, which have not yet been widely applied in the analysis of biological networks. We will also outline a number of present challenges posed by biological network data in systems biology, and the extent to which these can be addressed by new developments in statistics, physics and applied mathematics.

## 1. Introduction

Following the enormous advances in functional genomics and molecular biology it is now possible to at least contemplate studying cellular processes at the level of a whole cell, rather than in isolation. Molecular networks, such as protein interaction (Uetz *et al*. 2000; Maslov & Sneppen 2002; Agrafioti *et al*. 2005), metabolic (Ma & Zeng 2003) and gene regulation networks (Ronen *et al*. 2002; Evangelisti & Wagner 2004) aim to capture such sets of biological processes in a single and coherent framework. In reality, of course, these different networks are intricately connected and interwoven inside a cell; protein products will interact with each other, regulate the expression of genes as well as digesting nutrients and catalysing basic biochemical reactions in a cells metabolism. We are still far away from being able to consolidate these different networks into a realistic *in-silico* organism.

The analysis and interpretation of present network data is, however, already challenging enough. Since the late 1990s research has been aided considerably by the work of a host of physicists (see Albert & Barabasi 2002; Dorogovtsev & Mendes 2003; Newman 2003*a*; Evans 2004, for mainly physics-oriented reviews). While the models proposed have, despite their elegant simplicity, been able to explain certain aspects of complex biological networks, they increasingly reach the limit of their usefulness given the amount of data becoming available. New models, based on sound statistical principles, and informed by bioinformatics are now slowly taking their place.

The theoretical underpinnings for the analysis of networks come from statistical physics, mathematics (in particular random graph theory; Bollobás 1998) and computer science. Despite several attempts over the last few years to apply concepts from these disciplines to biological networks, success has often been modest. There are numerous examples where terminology or concepts have been taken from, e.g. statistical physics, and wrongly applied in the description of molecular networks. We will first discuss how networks can be analysed statistically and described theoretically. The statistical analysis may either focus on (structural) properties of the network itself, or on biological properties of the constituents of the network. While the former is well advanced the latter will (even in the presence of high quality data) pose a range of fascinating and challenging problems.

Below we will first introduce the biological networks that are currently attracting most interest as framework for systems biology. After that we will discuss the different theoretical models that have been used to model complex biological (among many other types) networks before introducing a set of statistical tools and, more interestingly, problems. Whenever biological examples are discussed in this exposition they will have a distinctly evolutionary perspective.

## 2. Biological network data

As already mentioned we can, very coarsely, distinguish between three types of molecular networks.

**Metabolic networks:**these aim to describe the basic biochemistry in a cell. Biologically important reactions have been described in terms of reaction pathways and metabolic networks are systematic collections of such biochemical data.**Transcriptional networks:**these consist of genes and a directed edge is added between two genes if one regulates the transcription of the other gene.**Protein interaction networks (PIN):**an undirected edge is drawn between each pair of proteins for which there is evidence of a physical or biochemical interaction.

Making these distinctions and simplifications must necessarily neglect details of the biological processes. In reality these networks will be highly and intricately interconnected and factorizing them into distinct networks will ultimately underestimate the biological complexity.

These problems are exacerbated when one considers the often woeful quality of the data: for PIN the rates for false-positive and false-negative results are estimated to be around 40% (Bader *et al*. 2004; Tong *et al*. 2004). Bioinformatics and statistics may help to clean the data to some extent but improvements in the experimental techniques offer the only real solution to this problem. Although important and interesting (Lappe & Holm 2004) we will here not be concerned with such issues of quality control. Rather we will discuss what should be included in theoretical descriptions of complex networks in a biological setting.

It has to be kept in mind, though, that present network data are highly averaged and artificial constructs: the language of graph theory may simply be too static to usefully describe complex biological networks. We may in approximation seek to understand networks as entities that change over three different time-scales: (i) they will change over evolutionary time-scales between species (millions of years); (ii) they will change during the course of an organism's development (years); and finally, (iii) connections will be formed and lost in response to physiological change and external stimuli (sub-second to minutes). For PIN experimental methods can at the moment only resolve the changes in PIN structure accumulated between species (Fraser *et al*. 2002; Jordan *et al*. 2003; Qin *et al*. 2003), but data are not yet sufficiently reliable to make meaningful comparisons.

One of the fundamental evolutionary questions underlying comparative genomics and the fledgeling discipline of systems biology is illustrated in figure 1. Within each species' PIN, interactions introduce a dependence between interacting proteins, i.e. it may no longer be possible to consider them independently (Agrafioti *et al*. 2005). The phylogeny underlying the different model organisms introduces a further level of correlation (Li 1997; Felsenstein 2003). In the functional analysis of networks we will often have to include both types of correlation, which makes the correct statistical analysis of biological network data highly non-trivial.

Below we will first briefly discuss theoretical descriptions of networks (and their ensembles). We will continue by discussing various measures that have been applied to characterize the (structural) properties of networks before considering how the network affects properties of the nodes and vice versa, e.g. when analysing PIN data we may want to evaluate the extent to which the network shapes the evolutionary rate of the constituent proteins.

## 3. Describing the structure of networks

We describe networks in terms of (static) graphs (Bollobás 1998); mathematically a graph is a pair of sets , where *V* is the set of *N* vertices or nodes and *E* the set of *M* (undirected) links or edges which connect pairs of nodes. Thus, each edge has an associated pair of vertices *N*_{i} and *N*_{j} (we will generally adopt the terminology used in the physics literature and also strive for a similar level of mathematical sophistication unless this may cause problems). Note that a node *N*_{k} may not have an associated edge, i.e. it may not be connected to any other node in the network; we also call such nodes ‘orphans’. A *connected component* is a set of nodes that is linked by edges but where no node in the component is connected to any node outside of the connected component. The largest component is often called the giant connected component.

Several representations for graphs exist but the conceptually easiest is the adjacency matrix, *A*_{g} (Bollobás 1998; Albert & Barabasi 2002). For *N* nodes the entries, *a*_{ij}, of this *N*×*N* matrix are simply the number of edges between nodes *i* and *j*. For undirected graphs *A*_{g} is symmetric, *a*_{ij}=*a*_{ji}; for so-called *simple* graphs *a*_{ij} is either 0 or 1 and *a*_{ii}=0, i.e. multiple edges and edges beginning and ending on the same node are not allowed. As far as PINs are concerned present data do not allow us to specify either a direction or a weight to an individual edge; proteins may, however, interact with themselves and, therefore, non-zero diagonal elements of the adjacency matrix are possible. If a network consists of several components then it will be possible to write the adjacency matrix in block-form. Rows in the adjacency matrix which correspond to orphaned nodes will contain only the value 0 (Valiente 2002).

## 4. Network Statistics

A number of statistics have been defined which seek to summarize structural properties of networks. These have been applied to both theoretical and real network data. We will now discuss them in some detail and outline their behaviour for both classical and scale-free random graphs. The description of networks is complicated by the fact that unlike regular lattices there is no real connection between nodes and their spatial relationship; a node has no spatial position *per se* (Evans 2004). From a statistical perspective it is perhaps interesting to note that there exists, to our knowledge, no sufficient (in a formal statistical sense; see, for example, Cox & Hinkley 1974; Silvey 1975) statistic for networks.

### 4.1 The degree distribution

The degree *k* of a node is the number of edges attached to it and the degree distribution *n*(*k*) is the number of nodes of degree *k* for all *k*≥0 (Albert & Barabasi 2002; Newman 2003a). It captures the diversity of local neighbourhoods in the network. In a regular lattice like an *d*-dimensional hypercube or Caley-tree all nodes or lattice points will have identical neighbourhoods and the degree is simply the coordination number. In the Erdös–Rényi random graph it is possible to show that the number of edges attached to a node is given by(4.1)i.e. for *N*→∞ the degree distribution takes on the form of a Poisson distribution with parameter *N*_{p}, the average degree in the network. In the scale-free network, the degree distribution takes on a power-law (Barabasi & Albert 1999),(4.2)where *ζ*(γ) is Riemann's zeta function, for *x*>1 (Abramowitz & Stegun 1974). The term scale-free follows from analogy to concepts from statistical physics (in particular the theory of second order phase transitions) and the fact that for a pure power-law degree distribution the ratio *n*(*αk*)/*n*(*k*) depends only on *α* but not on *k*; there is no natural scale to the network. In many cases power-law-like behaviour is confined to the tails of the degree distribution. In order to identify power-laws we normally require that they last over at least two to three orders of magnitude (Jensen 1998; Sornette 2003).

The degree distribution describes only one aspect of the data and graphs with hugely different architecture can exhibit similar degree distributions: a tree can have the same degree distribution as a highly reticulated graph (Bender & Canfield 1978; Dorogovtsev *et al*. 2000; Burda *et al*. 2001). Nevertheless, it is perhaps the most commonly considered network characteristic. In particular, scale-free behaviour of networks is generally inferred solely from the shape of the degree distribution.

### 4.2 The clustering coefficient

The clustering coefficient (Watts & Strogatz 1998; Newman 2003*b*) is a measure of the average local neighbourhoods in a graph/network. It is defined as the probability that two nodes *j* and *k* which are connected to node *i* are themselves connected. For a node *i* with degree *k*_{i} there are *k*_{i}(*k*_{i}−1)/2 potential links among its direct neighbours. If *K*_{i} denotes the links actually observed among *i*'s neighbours then the clustering coefficient of node *i* is defined by(4.3)for *k*<2 we define *c*_{i}=0. The clustering coefficient of the total network is then given by averaging over all nodes, . While it has become customary to show degree distributions rather than just the average degree, regrettably the same is not universally true for clustering coefficients. This is despite the fact that the clustering coefficients often vary quite considerably across a network (Newman 2001). Studying such variation, for example, reveals that most nodes have a small clustering coefficient *c*_{i}≈0. This reflects the observation that local neighbourhoods of nodes in a network are often tree-like.

### 4.3 Average path-length and network diameter

As the position of nodes in networks bears no relationship to their spatial positions the distance between two nodes *i* and *j* is defined through the minimum number of edges that have to be traversed to reach *j* starting from node *i*. For directed networks the shortest path in the network may be the distance between *i* and *j* and *j* and *i*, respectively; in undirected networks it is of course the same. If we denote the distance between nodes *i* and *j* by *l*_{ij} then the average path-length is defined by(4.4)where 〈*i*, *j*〉 indicates that the sum runs over all different pairs *i*, *j* (Valiente 2002).

The diameter of a network is given by the maximum distance in the network, i.e.(4.5)If in an undirected network there is no path connecting nodes *i* and *j* then the distance *l*_{ij} is set to ∞. This is, for example, the case if the network consists of a number of connected components whence the average path length and the network diameter are also defined to be ∞. Unlike the previous statistics average path-length and network diameter are computationally quite intensive. Calculating all shortest paths in a graph is at least of order *O*(*N*^{2} ln(*N*)) (if we use adjacency lists).

These statistics have attracted considerable interest following the work of Watts, Strogratz (Watts & Strogatz 1998) and others (Newman & Watts 1999; Newman 2000; Zhou 2002) which reevaluates Stanley Milgram's classical notion of six-degrees of separation (Milgram 1967; Travers & Milgram 1969). Briefly, social networks, and most biological networks have a much smaller diameter than would be naively expected. In a regular one-dimensional network the diameter is given by the number of nodes *N*; in a ring the diameter is equal to *N*/2 and in a square lattice the largest distance is given by . For real networks it was, however, observed that the diameter (or equivalently the average path-length) increases very slowly, e.g. *D*∝log(*N*).

There appears to have been some misunderstanding (Bader *et al*. 2004) about just how common short average pathlengths or small network diameters are in real networks as well as in theoretical network models: classical random graphs (above the structural phase-transition) have this property as do (many) scale-free networks and virtually all naturally occurring networks. Thus, small world effects are the rule, not the exception (Watts 1999; Newman 2000). In a strict sense small-world behaviour (e.g. as exemplified by the models of Watts 1999; Newman 2000) requires a logarithmically growing diameter (or average pathlength) and a high clustering coefficient.

### 4.4 Network motifs

Alon and co-workers (Milo *et al*. 2002, 2004; Shen-Orr *et al*. 2002; Kashtan *et al*. 2004) have introduced the notion of network motifs. In their definition (and the term has been applied differently by other authors) a motif is a pattern that occurs at a statistically increased frequency in the network. In part *a* of figure 2 we show the possible motifs that can occur between three nodes in a directed network; part *b* of the same figure shows the four-node motifs in an undirected network.

In the search for motifs one first counts all occurrences of the various patterns of interest. The statistical significance of each pattern is then assessed by randomizing the edges in the true network among the nodes, keeping the degree of each node fixed to its observed connectivity; the frequencies of the patterns are then determined for the randomized network. Repeating this for a sufficiently large number of times yields a frequency distribution for each pattern in the ensemble of randomized networks. From this it is possible to arrive at a *p*-value for the pattern in the true network. Those patterns that occur at a significantly increased frequency in the true network are called motifs (Milo *et al*. 2002; Maslov *et al*. 2003).

It is worthwhile to consider what the meaning of these motifs is; this is, in fact, somewhat easier to see in directed networks where a direct and intuitive analogy to logic or electronic circuits exists. For example, the pattern 9 shown in part *a* of figure 2 corresponds to a *feed-forward* loop. Scanning through a network may thus elucidate the regulatory architecture of the network. Alon *et al*. (Milo *et al*. 2004) have applied such an approach to study motif spectra of different networks and suggested that it is possible detect superfamilies of networks with similar motif-spectra, i.e. a similar local logical organization of the network. For undirected graphs, however, motifs will only tell us about the extent to which certain neighbourhoods are overrepresented in the network. In the case of PIN we may for example be able to determine how often quadruplets of proteins occur where each pair is interacting, pattern 6 in figure 2*b*.

While the notion of motifs is appealing, the interpretation of motifs and their biological relevance can be subject to some controversy. For example, Wuchty & Stadler (2003) find that proteins in highly connected motifs are more evolutionary conserved than would be expected by chance.1 Mazurie *et al*. (2005) on the other hand find no correlation between motifs and evolutionary or functional characteristics. These authors conclude that motifs cannot be analysed in isolation from the rest of the network.

### 4.5 Network spectra

The final, most detailed but perhaps hardest to interpret perspective on networks is provided by a network's spectrum (Chung 1997; Farkas *et al*. 2001; Albert & Barabasi 2002; Bu *et al*. 2003). This follows from the eigenvalues *λ* of the adjacency matrix ** A**, i.e. the solutions of (

**−**

*A**λ*

*)=0, where*

**I***is the identity matrix. For a*

**I***N*×

*N*adjacency matrix we will have

*N*eigenvalues

*λ*

_{i},

*i*=1, 2,…,

*N*and the spectrum of the adjacency matrix is defined by(4.6)with

*δ*(

*x*) the standard Dirac delta function.

There has been considerable interest in the extent to which the spectrum of a graph reflects local and global network structure (Bu *et al*. 2003; Chen & Xu 2003; Kamp & Christensen 2003) and this will probably continue to be an area of interest.

## 5. Theoretical network ensembles

The analysis of real networks is greatly aided by understanding how the various statistics discussed in the previous section behave in theoretical models of networks (Krzywicki 2001). We will briefly outline the behaviour of, what have become, the two canonical ensembles of networks, the classical or Erdös–Rényi (Erdös & Rényi 1959, 1960) random graphs and the scale-free random graphs (Aiello *et al*. 2001; Bollobás & Riordan 2003).

Graph ensembles play a central role in the theoretical analysis of networks. They are defined by the following.

A set of graphs .

A statistical weight for each graph .

If is constant then the graph ensemble will be equivalent to the microcannonical ensemble of statistical physics. Similarly, for varying , network ensembles corresponding to the canonical (*N* and *M* fixed) and grand canonical ensembles (*N* fixed *M*≥0) can be constructed (Dorogovtsev & Mendes 2003).

### 5.1 Classical random graphs

There are two definitions of a classical random graph; these become identical in the thermodynamic limit, *N*→∞. The first, due to Erdös–Rényi (Erdös & Rényi 1959, 1960)and denoted by *G*(*N*, *M*), is given by a set of *N* nodes and *M* edges which are randomly placed among the nodes; one may explicitly specify that there can be at most one edge between every pair of nodes but this is negligible until *M*≃*N*(*N*−1)/2. The second classical random graph ensemble was proposed by Gilbert (1959) and is here denoted by *G*(*N*, *p*), where *N* is again the number of nodes and *p* is the probability of a pair of nodes being connected by an edge; in this ensemble the expected number of edges, . The degree distribution of a classical random graph is given approximately a Poisson distribution (Binomial at small values of *N*) with parameter *λ* equal to the average number of edges per node.

Classical random graphs have been studied extensively in mathematics (Bollobás 1998; Janson *et al*. 1999) and statistical physics (Stauffer & Aharony 1992). Of particular interest has been the structural phase-transition in the thermodynamic limit *N*→∞. For the network or graph will consist of many separate small connected components. At one of these components grows, increasingly amalgamating with other smaller components; this is often referred to as the giant connected component. Quite generally classical random graphs exhibit the small-world property for . This point has sometimes not been appreciated in parts of the literature, where network concepts have been applied to (especially) biological network data.

### 5.2 Scale-free random graphs

Many important real networks, including the molecular networks, have degree distributions which decay much more slowly than exponentially (Alm & Arkin 2003; Evangelisti & Wagner 2004; Li *et al*. 2004; Agrafioti *et al*. 2005). In terms of the degree distribution classical random graphs are therefore unable to explain at least some aspects of real data. Barabási and Albert (Barabasi *et al*. 1999*b*) have shown that a simple probabilistic model can give rise to networks with a fat-tailed degree distribution. The so-called Barabási–Albert model has since then been shown to be mathematically ill-defined by Bollobás & Riordan (2003), but the LCD construction2, which gives rise to a properly defined graph ensemble can be used (Bollobás 1998). Essentially, scale-free random networks are generated through growing a network by adding a new node at each time step and attaching it to existing nodes proportional to their connectivity. Other growth models, in particular certain processes where existing nodes are duplicated, retaining their edges with probability 0<*π*<1, can also give rise to scale-free models in the limit where *N*→∞ (Aiello *et al*. 2001). These networks are called scale-free because the ratio Pr(*αk*)/Pr(*k*) depends only on *α* but not on *k*. Scale-free networks capture some vestiges of real networks but the great attention they have received is also due to the fact that scale-free behaviour can be generated by relatively simple models (Barabasi *et al*. 1999*b*, 2001; Dorogovtsev *et al*. 2000; Goh *et al*. 2002; Moreira *et al*. 2002). They, too, are an oversimplification of the true process underlying network evolution (Yook *et al*. 2004).

### 5.3 Other random graph ensembles

Classical and scale-free random graphs have become the canonical theoretical exampled for real complex networks. Due to their shortcomings other models have been considered, too. These other models can be loosely divided into two classes: theoretically motivated models which generate networks that capture one or more aspect of real networks, e.g. an empirical degree distribution, or evolve networks by a more flexible mechanism (Dorogovtsev *et al*. 2002) and mechanistic models which implement a model of network growth or evolution.

The former approach was pioneered by Bender & Canfield (1978), and refined by Molloy & Reed (1995) and Newman and co-workers (Newman *et al*. 2001; Newman 2003*c*). The latter approach is more recent and examples include the various models that have been proposed to generate scale-free networks. Networks are generated by defining a set of nodes and the number of edges incident on them. These are then connected randomly to form a network with a predefined degree distribution.

More recently still, biologically motivated models have been introduced (Wagner 2003; Berg *et al*. 2004) where authors have looked at the basic biological processes involved in the generation of real biological networks. These may include:

**Node duplication:**existing nodes are duplicated and the copy retains some or all of the interactions of the original node. Levels of duplication can be estimated from phylogenetic comparisons of paralogues; initially, at least, duplicated genes/proteins would fulfil similar functions.**Node attachment:**new nodes are added to the network and attached to existing nodes (preferentially or randomly); horizontal transfer may offer a corresponding biological mechanism.**Node deletion:**existing nodes and their incident edges are deleted; this may for example happen if a gene incurs a loss-of-function mutation.**Edge dyanamics:**new edges can be formed, existing edges deleted or rewired. Again, this may be caused by mutations to the coding sequence.

Based on these processes it is possible to derive properly defined network ensembles. These can either be parameterized by biological data or be used to estimate biological parameters such as the effective rate at which proteins where duplicated. It has to be kept in mind, though, that the true evolutionary process underlying networks was much more complicated and will have contained a number of unique events, e.g. the whole genome duplication event in *S. cerevisiae* about 200 million years ago. At the moment it is unclear if such contingent processes can be modelled by statistical network ensembles. As pointed our by Burda *et al*. (2001), however, even if a network ensemble does not capture the true dynamics of network evolution, the study of suitable network ensembles can provide insights into the probabilistic behaviour of networks (Berg & Lässig 2002; Burda & Krzywicki 2004). In light of the problems addressed above further studies into biologically motivated finite-size network models may offer more interesting insights into biological network than has been the case for scale-free networks.

## 6. Analysis of protein interaction networks

We will illustrate some of the challenges posed by network data using protein interaction network data. The poor quality of such data has been well documented and there have been some attempts at improving data sets using *in*-*silico* methods or labourious curation. Ultimately, more reliable experimental techniques may offer the only option to arrive at more reliable data; however in evolutionary studies the mean of an observable is frequently overwhelmed by the corresponding variance. Thus, even given more reliable interaction data considerable statistical challenges will nevertheless persist and below we will outline three such areas.

### 6.1 Structural analysis

In figure 3 we show the degree distribution of the yeast PIN (circles) and the best-fit power-law model (red). The first observation is that present data differs quite substantially from the pure power-law. The deviation is most pronounced at low and high values of *k*. This is often believed to be due to finite-size effects (Dorogovtsev & Mendes 2003). Also shown in the figure is the best-fit lognormal distribution, which, interestingly, does a rather good job at describing the empirical distribution. The improved fit is confirmed using methods from formal statistical model selection theory. The curves in figure 3 were fitted by determining the parameters of two curves in a maximum likelihood framework. If we denote the log-likelihood (Davison 2003) of a (potentially vector-valued) parameter *θ* by lk(*θ*)=Σ_{k}ln(Pr(*k*)) then the Akaike information criterion (AIC) (Akaike 1983; Burnham & Anderson 1998) for model *i* is given by AIC_{i}=−2lk(*θ*_{i})+2*ν*_{i}, where *ν*_{i} is the number of parameters of model *i*. Unlike the likelihood ratio test the AIC allows us to formally compare the expanatory power of non-nested probabilistic models. The AIC—or the related Bayesian information criterion—biases against models with more parameters that do not add significantly to a model's power to explain the data. While these methods frequently used in various branches of quantitative biology (see, for example, Strimmer & Rambaut 2002), they have not been widely applied in the analysis of network data.

The power-law has been favoured over other models, e.g. the Erdös–Rényi random graphs, because (i) it has a broader tail and (ii) simple mechanistic models asymptotically give rise to power-law degree distributions. Simulating the Barabási–Albert or LCD model we find that the AIC always favours the power–law distribution over the log–normal distribution (especially if analysis is restricted to connectivities ), even for small networks with *N*=500; for real network data this does not appear to be the case. We find, that for the yeast PIN the lognormal distribution (blue) offers a better description of the data than the pure power-law or its heuristic finite-size versions (Stumpf & Ingram 2005; Stumpf *et al*. 2005, in press). The AIC for the power–law is ≈26 920 compared to ≈25 430 for the lognormal; when translated to relative likelihoods the lognormal is 10^{5} more likely to explain the observed data than the power-law (Stumpf & Ingram 2005; Stumpf *et al*. 2005, in press). As all assertions of scale-free behaviour have been based on the degree distribution (Barabasi & Albert 1999; Wolf *et al*. 2002) this suggests that the notion of scale-free behaviour—as identified by a power-law degree distribution—may not hold up to statistical scrutiny. There have also been interesting attempts at using geometric random graphs (Penrose 2003) to describe networks; these also show that there are other network ensembles which may be better at explaining real biological networks than scale-free networks (Przulj *et al*. 2004).

The degree distribution captures only one, and by no means the most important, aspect of network data. Recently, Middendorf *et al*. (Middendorf *et al*. 2004, 2005) have developed approaches that aim to characterize networks formally and determine which ensemble best fits the data using machine learning techniques such as support vector machines. Their results confirm that there is more to networks than can be captured by scale-free ensembles; for example real networks tend to exhibit certain degree–degree correlations which standard network ensembles do not Berg & Lässig (2002)

### 6.2 Sampling properties

So far the vast majority of network analyses have considered the network data as potentially unreliable but, at least in principle, as representative of the true network. Until very recently (Stumpf *et al*. 2005) nobody had considered the fact that most network datasets, and certainly all biological network datasets, are only subnets embedded in the true but largely unobserved networks. The extent to which such subnets have identical or at least similar properties of the global network depends on the networks under different sampling schemes.

In figure 4 two different sampling schemes are illustrated. Under neighbourhood sampling, nodes are chosen (perhaps because of prior biological knowledge) and their local neighbourhood is explored; for example proteins involved in similar processes or in the same compartment may be included into the experimental set-up with higher probability. We would thus expect the local network neighbourhoods of the initial target nodes to fairly well represented. Under random sampling there is a finite probability 0<*p*<1 for a protein to be included in the experiments. This is the most parsimonious sampling scheme as it does not require prior knowledge and/or bias of the experimenter. It also has by far the nicest mathematical properties and is most amenable to mathematical analysis (Stumpf *et al*. 2005; Stumpf & Wiuf in press).

For random sampling it is possible to determine whether the subnet will have similar properties as the overall global network; if this is the case we say the network is closed under the sampling scheme. For example, it is possible to show that the degree distribution of a subnet sampled from an Erdös–Rényi network will also be described by a Poisson distribution, but with parameter *pλ* rather than *λ*. Interestingly, however, subnets from scale-free networks are not scale-free; in fact for most types of networks the subnet will be qualitatively different from the true network. The reverse is also true: if a subnet—for example, any of the current protein interaction network datasets—is shown to be scale-free, then the true global network cannot itself be scale-free. For neighbourhood sampling the situation is even more complicated: even Erdös–Rényi networks are no longer closed under this sampling scheme.

For networks of the type used by Berg *et al*. (2004) it can be shown, that they are also not closed under random sampling. This means inferences from network data, which do not take into account the fact that present molecular network data only describe subnets sampled from the whole network could be misleading. Graph spectra and inferences from motifs are even more susceptible to sampling error than the degree distribution. Interestingly, under random sampling the probability of retaining a motif in the subnet depends only on the number of nodes. Thus the probability of keeping any of the motifs in part *b* of figure 2 is *p*^{4}.

It is, however, relatively straightforward to adapt current network ensembles so that they incorporate sampling properties. If *N*′ out of *N* nodes are included in the network data, then the ensembles can be used to model the evolution of a network of size *N* and then cull the nodes until a dataset of size *N*′ is obtained. The network properties of subnets, thus constructed can then be compared to experimental data. Failure to appreciate the incomplete nature of networks, and the peculiar sampling features of networks, has compromised some published studies and some results may need to be reevaluated in light of the sampling process by which network data are collected.

### 6.3 Evolutionary and functional analysis

So far we have only discussed structural properties of networks. Here, we will assess the extent to which network structure reflects biological or biochemical processes or properties inside cells. In particular we are interested in the extent to which the network affects evolutionary properties of its constituent nodes (e.g. genes or proteins). Such system-level evolutionary information, in turn, will be informative about how transferable results of functional studies are between species.

An initial analysis of the impact of the interaction network on the evolutionary rate of proteins has been analysed by several groups (Pal *et al*. 2001; Wagner 2001; Fraser *et al*. 2002, 2003; Jordan *et al*. 2003; Fraser & Hirsh 2004; Bloom & Adami 2004) with different results. Some studies suggest that the evolutionary rate of proteins decreases with their connectivity (see, for example, Wagner 2001; Fraser *et al*. 2003) while others have suggested that this effect only applies to the most highly connected proteins, or becomes negligible once expression level differences have been accounted for (Jordan *et al*. 2003). These studies differed in the protein interaction data sets used, the species analysis and use of other data beyond interaction data. And all of these differences could have contributed to the different results obtained. A recent comparative study of the PINs in *S. cerevisiae* and *C. elegans* by Agrafioti *et al*. (2005), which used larger protein interaction data sets and larger, better resolved phylogenetic panels of closely species for rate estimation, as well as protein expression data and functional annotations reexamined this question. Expression appears to be indeed more closely correlated with changes in a protein's evolutionary rate than its connectivity. Present connectivity data does, in fact, have negligible explanatory power when other data such as gene ontology (GO) data is available. It has to be again kept in mind, though, that PIN data does not cover the whole protein interaction network but only the interactions between a subset of the proteins known to exist in *S. cerevisiae* and *C*. *elegans*.

Finally, there have been several studies which suggest that the properties, such as evolutionary rate of protein expression levels, of connected nodes are more similar than would be expected by chance (Williams & Hurst 2000; Fraser *et al*. 2002). The statistical significance of observed patterns is evaluated against a Null model of the network. Two different types of Null model have been used in the literature; the first has already been discussed in relation to network motifs (Maslov *et al*. 2003). In most evolutionary studies, however, a simpler bootstrapping (Efron & Tibshirani 1998) procedure was employed (e.g. see Fraser *et al*. 2002): if there are *M* edges in the original network then 2*M* nodes are chosen at random and their similarity calculated. Repeating this procedure a sufficiently large number of times yields an empirical Null distribution against which properties of the observed network are compared. This turn out to inflate the statistical significance of observed network data, as nodes with connectivity *k*=1000 contribute to the bootstrap samples with the same weight as nodes with connectivity *k*=1 or *k*=0. It has been shown that weighing nodes by their connectivity increases the 95% bootstrap confidence intervals considerably by up to 25% (Agrafioti *et al*. 2005).

More generally, the correct Null distribution for the statistical analysis of network data will depend not only on the structure of the network itself, but also its hierarchical organization (Yook *et al*. 2004): if proteins belonging to the same biological processes or located in the same cellular compartment are more likely to interact with proteins in the same process/compartment, then this ought to inform the Null model.

### 6.4 Comparing networks

Species comparisons have been used extensively to complement functional studies, including gene prediction and functional annotation of genes. A comparative approach, of network data from different species, for example like the one illustrated in figure 1, will provide insights into the functional organization at the system level and several groups have attempted to compare networks from different species. Most recently, Sharan *et al*. (2005) have identified conserved patterns of protein interactions in *S cerevisiae*, *C elegans* and *D. melanogaster* using sequence similarity to identify orthologous proteins. They have shown that such protein sequence-based network alignments can be used to assign protein functions and predict protein interactions. Unfortunately, such studies are plagued by the quality and the sampling nature of the data; therefore inferences are still quite limited. The numbers of orthologous proteins for which interaction data exists in more than one species is quite small (Kelley *et al*. 2003; Sharan *et al*. 2005; Wuchty & Almaas 2005). Using reciprocal BLAST searches to find putative orthologous proteins reveals that only a tiny number of patterns are shared between e.g. *D. melanogaster* and *C. elegans* where we find only a single pattern involving three nodes (see table 1). Recently, Sharan *et al*. (2005) have pointed out that straightforward reciprocal BLAST search does an insufficient job at reliably identifying shared patterns in different PIN; including additional potential orthologous proteins flagged up in the BLAST searches does, however, lead to the identification of pairs of proteins that (i) are sufficiently similar in sequence and (ii) appear to play a similar part in the PINs of different species. The identification of orthologous proteins between distantly related species—such as the species in table 1—will require the development of more powerful and rigorous inferential frameworks (Sonnhammer & Koonin 2002).

A complementary approach to such sequence-based approaches would be to align biological networks without reference to the sequence of the proteins/genes but based solely on the occurrence of certain patterns. The work by Berg & Lässig (2004) is a promising start in this direction; given the complexity of the graph isomorphism problem (Valiente 2002) and the lack of a well defined distance between graphs this is, however, a challenging problem. Combining and comparing resulting alignments will elucidate the impact of the network on the biology/biochemistry and vise versa. Biological information may also guide in the development of suitable heuristics for network alignment algorithms. One challenge here will be to construct an alignment procedure which integrates biological (sequence or GO) information, with strutural network properties.

## 7. Conclusion

Quantitative methods for the analysis of biological network data are still in their infancy. Given the scope and detail (however unreliable) simple statistical network models are reaching the limits of their applicability. Improved network models will have to be more flexible and take into account that networks are finite (and often too small for mean-field theories (Barabasi *et al*. 1999*a*; Newman *et al*. 2000) to be useful), have modular organization, and that present network data often contain only incomplete samples from the true network. Finally, they have to be more flexible at incorporating additional biological information. Unfortunately, this may entail using models with more parameters than the simple models used so far. Statistical model selection tools, like the AIC, will be useful in establishing sets of models which (i) can describe the data adequately and (ii) are not over-parameterized.

When describing biological processes at the system level (e.g. a cell) it is important to remember that the data is often very noisy, and the processes highly complex: molecular abundances and interactions change over time and in response to external stimuli as well as to dynamical intrasystem processes. The static language of graph theory used to describe todays biological networks may reach its limits when the conditionality and contingency of interactions need to be considered.

## Acknowledgements

We thank Carsten Winf and Bob May for many helpful discussions; the manuscript has greatly benefited from the comments of three anonymous referees. Financial support from the Wellcome Trust is gratefully acknowledged.

## Footnotes

↵We note here that only the observed number of motifs is cited in Wuchty & Stadler (2003), not their

*Z*-scores. Moreover in a network comprising 3183 proteins they find e.g. 3.6 million copies of motif 1 in figure 2*b*. This can only happen if motifs are counted in a highly degenerate way which raises the question as to whether such a motif definition will give rise to biologically meaningful results.↵In the LCD construction a graph is evolved as follows.

Start from an empty graph at time

*t*=0 which contains no nodes and no edges (we could also start from , a network with a single node and a single edge which starts and ends at the same node).Add a new node and attach it to node

*s*with probability(5.1)where*d*_{s}(*t*) is the connectivity of node*s*at time*t*.Return to (ii).

*m*edges are added at each time-step,; is generated from by combining the first*m*nodes, to form node*ν*_{1}, etc.- Received June 3, 2005.
- Accepted July 14, 2005.

- © 2005 The Royal Society