## Abstract

Complex networks have been successfully employed to represent different levels of biological systems, ranging from gene regulation to protein–protein interactions and metabolism. Network-based research has mainly focused on identifying unifying structural properties, such as small average path length, large clustering coefficient, heavy-tail degree distribution and hierarchical organization, viewed as requirements for efficient and robust system architectures. However, for biological networks, it is unclear to what extent these properties reflect the evolutionary history of the represented systems. Here, we show that the salient structural properties of six metabolic networks from all kingdoms of life may be inherently related to the evolution and functional organization of metabolism by employing network randomization under mass balance constraints. Contrary to the results from the common Markov-chain switching algorithm, our findings suggest the evolutionary importance of the small-world hypothesis as a fundamental design principle of complex networks. The approach may help us to determine the biologically meaningful properties that result from evolutionary pressure imposed on metabolism, such as the global impact of local reaction knockouts. Moreover, the approach can be applied to test to what extent novel structural properties can be used to draw biologically meaningful hypothesis or predictions from structure alone.

## 1. Introduction

The central findings in network-based research suggest that there exist simple mechanisms directing the evolution of both engineered and natural networks [1–10]. However, the relation between the functions of a biological system and its network properties is hardly understood. Therefore, the advantage of using network representations for positing meaningful hypotheses about biological systems remains largely debatable [11].

Properties of biological systems arise from two fundamental origins: *physical principles*, universally constraining the feasibility of biochemical processes, and *evolutionary pressure*, bearing the specific functional abilities required for an organism's vitality [12]. The former comprise well-understood physical laws, such as mass balance and thermodynamics, which constitute the basic requirements imposed on all living systems. In contrast, evolution depends on the interplay of complex phenomena, such as adaptation to environmental changes, symbiosis and biodiversity of populations [13,14], leading to diverse cellular functions. Consequently, the unique properties related to the functions of a biological system are a result of its evolutionary history.

Explaining cellular behaviour through network representations and their properties is a key challenge of modern biology. While many structural properties of metabolic networks are similar to those of other complex networks [10], it is unclear whether they are a consequence of the evolutionary history or merely arise as a result of general physical principles. Here, we apply a randomization method to determine which properties of metabolic networks, represented as bipartite metabolite-reaction graphs, may result from evolutionary pressure. This is an essential step in understanding the relation between the functional characteristics of biological systems and their network representations.

The common approach for estimating the relevance of a network property is to determine the statistical significance (*p*-value) by comparing the value of the property in the investigated network with those in the null-model distribution obtained from randomized networks [15]. Clearly, the significance of a property strongly depends on the chosen null model, which should be constrained to preserve universal network properties [16,17]. Since the *p*-value is the probability that the value of a property originates from the null-model distribution, a statistically significant property is likely to have emerged from some non-arbitrary process influencing network evolution independently of the imposed constraints.

In virtually all network-based studies [18–24], a Markov-chain switching algorithm, *switch randomization*, has been employed to determine the significance of network properties by generating randomized networks with preserved degree sequence. Its motivation stems from the finding that heavy-tail degree distributions are a universal feature of complex networks. This generic null model can be applied to any type of network, and guarantees the independence of an identified property from vertex degrees. We demonstrate how switch randomization affects the citric acid (TCA) cycle, a central respiratory metabolic pathway of outstanding importance for aerobic organisms (figure 1*a*): two reactions substrate_{1} → product_{1} and substrate_{2} → product_{2} are substituted with new reactions substrate_{1} → product_{2} and substrate_{2} → product_{1}, ensuring that the vertex degrees remain unchanged (figure 1*b*). Since chemical feasibility is disregarded, a reaction that converts *α**-ketoglutarate* into *succinyl-CoA* may be generated, where several atoms are created out of nothing. Hence, it remains hypothetical to what extent the properties, identified as significant with this method, relate to the function of the network, as they could well result from universal physical constraints imposed during network evolution.

## 2. Results

### 2.1. Measuring evolutionary significance

To identify the properties that originate from evolutionary pressure, a network should be compared with random networks that evolved free of evolutionary pressure, but persistently satisfy all relevant physical constraints. As this is practically impossible to simulate, we apply our recent method for randomizing metabolic networks while preserving mass balance of the biochemical reactions [25]. A reaction *r* with substrate set *S* and product set *P* is mass balanced if the number of substrate atoms equals the number of product atoms:
2.1where *m*_{s}, *m*_{p} are the vectors of sum formulas of *s* and *p*, respectively, and *a*_{s,r}, *a*_{p,r} their stoichiometric coefficients (see §4). The mass-balanced randomization of the TCA cycle does not violate this basic physical constraint, as shown in figure 1*c*.

Thermodynamic properties, reflecting the energy change of reactions, constitute another important physical requirement for metabolic networks. As shown in figure 2, the reactions generated by mass-balanced randomization of the *Escherichia coli* network are characterized by plausible Gibbs-free energy changes under standard conditions (pH = 7, *T* = 298.15 K, see the electronic supplementary material) [26]. In contrast, switch randomization results in unrealistic energy ranges. By preserving mass balance and thermodynamic properties during randomization, our null model imposes realistic physical constraints on the generated randomized networks. This ensures that the significant properties are independent of the fundamental physical requirements, and instead are likely to result from evolutionary pressure. Therefore, we refer to the statistically significant properties under the proposed null model as *evolutionary significant*.

For illustration, consider a landscape formed by the values of any given property over all randomized networks (figure 3). The constrained networks, obtained by mass-balanced randomization, carve out a region in the vicinity of the original network that is embedded in the region of unconstrained networks resulting from switch randomization. As these regions exhibit different distributions of values, illustrated by different magnitudes of the peaks, an evolutionary significant property may only be identified when comparing the property of the original network with the constrained region.

### 2.2. Biosynthetic capabilities

To verify our approach, first we determine the evolutionary significance of the scope size distribution in the genome-scale metabolic networks of six model organisms: *Bacillus subtilis*, *E. coli* (bacteria), *Saccharomyces cerevisiae* (fungi), *Chlamydomonas reinhardtii* (protista), *Arabidopsis thaliana* (plantae) and *Homo sapiens* (animalia; see §4). The scope represents the set of compounds that can be produced in a metabolic network from a given set of initial nutrients [27]. We determine the scope size distribution of each network by repeatedly calculating the scope for 5000 randomly chosen sets of nutrient compounds, one set at a time, according to the following procedure: (i) from the initial set of nutrients, determine the reactions for which all substrates are contained in the nutrient set; (ii) add the products of these reactions; and (iii) repeat the procedure, until no more products can be added (see electronic supplementary material, §1.3).

The scope size distribution characterizes the biosynthetic capability of a network and has been shown to exhibit a strong correlation with the evolutionary history of organisms [28,29]. After applying mass-balanced randomization to the six networks, we compare the scope size distributions of each organism and its randomized network ensemble, and determine *p*-values using the Kolmogorov–Smirnov test (see §4). We find the scope size distributions to be evolutionary significant for all studied organisms (*p*-values <10^{−49}; electronic supplementary material, table S4 and figure S1), which demonstrates that our method correctly identifies the interdependence of the network property and its evolutionary background.

### 2.3. Small-world property

In the following, we focus on determining the evolutionary significance of salient network properties that have been extensively studied in complex network research and are prominently applied in biological studies. In particular, we analyse the small-world property [30], defined by a large clustering coefficient in conjunction with small average path length, and the metabolite degree distribution [8] (electronic supplementary material, table S2). We find that the clustering coefficient is significant in all species (*p*-values < 10^{−5}), regardless of the applied null model. On the other hand, the average path length is evolutionary significant with *p*-values < 0.025 in all species (electronic supplementary material, table S4). With switch randomization, this property is significant (*p*-values < 10^{−5}) in all but *S. cerevisiae* (*p*-value =0.77; electronic supplementary material, table S5).

More importantly, we may now assess the importance of the small-world phenomenon by determining whether this property is more pronounced in the analysed networks when compared with their randomized variants. Interestingly, in each species we find that the average path length is smaller and the clustering coefficient is greater than the values of the respective properties obtained from mass-balanced randomization (figure 4). This finding indicates that the small-world property is independent of physical constraints, and thus likely to be of evolutionary importance for metabolic networks. By contrast, when comparing the networks with their switch randomized ensembles, we arrive at a contrary conclusion—larger average path lengths and smaller clustering coefficients are prominent in real-world metabolic networks. Therefore, the results from switch randomization suggest that metabolic networks are the opposite of small worlds, disproving the small-world hypothesis. Moreover, this finding hints at two major hazards of network null models: (i) the results obtained crucially depend on the model chosen, and (ii) the application of a generic null model that provides an unrealistically constrained environment may lead to counterintuitive results.

### 2.4. Degree distributions

Next, we analyse the metabolite degree distributions, where the degree of a metabolite is the number of reactions it is involved in either as a substrate or a product. The degree can be interpreted as metabolite specificity, with highly specific metabolites occurring in only few reactions. To our knowledge, the significance of degree distributions was never studied, since switch randomization is unsuited for this task. The degree distributions of all six organisms are evolutionary significant (*p*-values < 10^{−17}; electronic supplementary material, table S4 and figure S3), suggesting that the patterns of metabolite specificities across different organisms emerge as a consequence of their evolutionary history, and not from the imposed physical constraints. This finding complements the well-known evolutionary requirement of a network architecture that is robust to random errors, as exhibited by the heavy-tail degree distributions [8].

### 2.5. Reaction centrality

Finally, we propose a measure for determining the global importance of individual reactions, which is based on a centrality index previously used in sociological studies [31] (also referred to as Hubbell Index). For two reactions *r*_{i} and *r*_{j} , we define the dependence of *r*_{j} on *r*_{i} as the largest ratio by which *r*_{i} contributes to the overall production of an intermediary *c* (i.e. a compound that is produced by *r*_{i} and consumed by *r*_{j} ): *ω* (*r*_{i}, *r*_{j} ) = max_{c}*d*_{in}(*c*)^{−1}, where *d*_{in}(*c*) is the in-degree of *c*, which is the total number of reactions producing *c*. Note that this definition corresponds to the strength of impact of a knockout of *r*_{i} on *r*_{j} , where *ω* (*r*_{i}, *r*_{j} ) = 1, if *r*_{j} becomes inoperable upon knockout of *r*_{i} (e.g. if *r*_{i} and *r*_{j} are neighbours in a linear chain of reactions), and *ω*(*r*_{i}, *r*_{j} ) ∼ 0, if the intermediaries required by *r*_{j} can be produced by many other reactions in the network.

The global impact of the knockout of a reaction on the entire network, which we call *reaction centrality*, is
2.2where *R* is the set of all reactions in the network, and *ω*(*r*_{i}, *r*_{j}) = 0, if *r*_{i} and *r*_{j} do not share any intermediary compound (i.e. *r*_{i} and *r*_{j} are not directly connected). This measure accounts for the direct dependencies between reactions through their intermediary compounds, as well as the global importance of the affected reactions: a knockout may affect only few other reactions directly, but can still have a large impact on the network, if an important reaction is affected indirectly (e.g. the knockout of a reaction at the beginning of a linear chain that leads to a reaction producing many important compounds).

Equation (2.2) can be written in matrix form as *A**ν* = *ν*, where *A*_{i,j} = *ω*(*r*_{i}, *r*_{j}). In order to solve this eigenvalue problem, we need to ensure that the inverse of *A* exists, which can be achieved by the PageRank transformation [32]. In particular, the transformed matrix *A*′ is obtained by normalizing the columns of *A* and applying a damping factor *d*:
which yields the Markov chain represented by *A*′ ergodic, as the corresponding network is strongly connected, and ensures that the largest eigenvalue is 1. In order to minimize the diluting effect of the damping factor on the topology of *A*, we choose *d* = 0.99. The eigenvector *ν* corresponding to the eigenvalue 1 of *A*′ then contains the global centrality values of the reactions in the network, where *ν*(*i*) corresponds to the reaction centrality of the *i*th reaction. The calculation for large networks is tractable using a Fortran implementation of the Implicitly Restarted Arnoldi Method [33].

We determine a *p*-value for each reaction by comparing its centrality value in the original network with those obtained from mass-balanced randomized networks while preserving the reaction itself. In order to estimate the effect of evolutionary pressure towards high centrality values, we apply a one-sided test with the null hypothesis that the values obtained from randomization are at least as large as the values of the original reactions (see §4).

Table 1 shows the reactions that have a significant centrality (*p*-value ≤ 0.025) in at least three of the analysed species (see the electronic supplementary material, table S7 for a complete list). The references provide evidence that each reaction is of outstanding importance for metabolism, as demonstrated by their evolutionary ubiquity, severity of knockout or inhibition effects, and clinical applications. For instance, catalase (EC 1.11.1.6) inactivation was shown to have severe effects on the lifespan of *S. cerevisiae* cells [34]. Superoxide dismutase (EC 1.15.1.1) is essential for defense against oxygen toxicity and aerobic growth in eukaryotes [36,37], and is involved in a multitude of diseases [38]. Carbonic anhydrase (EC 4.2.1.1) fulfil diverse metabolic functions in organelles, tissues and membranes of virtually all species, is used as a drug target for various diseases and is one of the evolutionary oldest enzymes [39–45]. The numerous experimental corroborations suggest that the proposed centrality index, in conjunction with the evolutionary significance determined by using our null model, could be used to predict enzymes responsible for maintaining organismal viability solely from the network structure.

For comparison, when repeating the analysis using switch randomization, the picture is less clear. In *S. cerevisiae*, *A. thaliana* and *H. sapiens*, 89, 27 and 14 per cent of the reactions have a *p*-value of 0.0099, rendering the analysis useless at least for the first two species. Five reactions have a significant centrality in at least two of the remaining three analysed species (electronic supplementary material, tables S6 and S8). We omit a detailed statistical analysis of these initial results, which will be necessary to draw further conclusions.

## 3. Discussion

To conclude, we proposed a novel method to reveal the relation between network properties and their evolutionary background by preserving the universal physical principles that constrain the design of metabolic networks. Any property that originates from evolutionary pressure, and thus relates to an important biological function, should not be observed in artificial metabolic networks, which evolved free of evolutionary pressure, but satisfy all relevant physical constraints. This should even hold for properties evolved from complex time-dependent phenomena, if they are reflected in the ultimately observed network.

We recognize that the proposed method only preserves mass balance and thermodynamic constraints, while other physical principles, such as electric charges, may also be relevant for metabolic network properties. Nevertheless, the considered physical constraints are the most fundamental and ubiquitous ones. Therefore, we believe that the method is a reasonable first approach to extract the biological importance of metabolic network properties. Accounting for additional physical constraints is complicated by the lack of reliable data for genome-scale metabolic networks; however, we expect such extensions to become possible in the future, which should further improve the biological relevance of the significance measure and the accuracy of the resulting predictions.

In contrast to the commonly applied switch randomization, our approach provides a realistic network background, and attributes an important evolutionary role to the small-world property and heavy-tail degree distributions. Our findings shed new light on the conclusions of previous studies, and suggest that the salient network properties are indeed a product of evolutionary pressure. Therefore, these properties carry important biological information, and can be justifiably used to generate meaningful hypotheses for experimental research.

We demonstrate that the proposed centrality index is one such network property that determines reactions important for viability of organisms. The method could therefore be used to identify candidate reactions for metabolic engineering and drug development. The results provide an impetus for addressing the long-standing doubts concerning the biological relevance of network properties. In addition, the proposed null model could be employed to verify the evolutionary assumptions in constraint-based approaches [52] and to provide an interface to synthetic biology studies.

Finally, we envision that, similar to the proposed approach for metabolic networks, specifically designed null models will be developed for other physically constrained systems, represented by gene-regulatory, protein–protein interaction and signalling networks. For instance, transcription factors depend on cis-elements and DNA-binding domains, which constrain the sequence of genes by which they are encoded. Likewise, protein interactions and signalling interactions depend on functional domains and binding sites. Development of null models which integrate the governing physical constraints of such systems will likely stimulate novel insights into the structure–function relationship in complex biological networks.

## 4. Methods

### 4.1. Genome-scale metabolic networks

We conduct our analyses on the most widely used genome-scale metabolic networks of six model organisms from all kingdoms of life: *B. subtilis* [53], *E. coli* [26], *S. cerevisiae* [54], *C. reinhardtii* [55], *A. thaliana* [56] and *H. sapiens* [57]. The sizes of the networks vary according to the complexity of the represented organisms, ranging from 855 reactions and 766 compounds (*B. subtilis*) to 2819 reactions and 2691 compounds (*H. sapiens*). Resulting from the bipartite graph reconstruction, detailed in §4.2, the number of vertices and edges varies accordingly, from 1877 vertices and 5368 edges (*B. subtilis*) to 7059 vertices and 19651 edges (*H. sapiens*). The networks further differ in their quality regarding mass balance of reactions, availability of information on reversible reactions, and the number of (strongly) connected components (electronic supplementary material, table S3): only the network of *E. coli* is fully balanced and consists of one connected component.

### 4.2. Mass-balanced randomization

To estimate the evolutionary significance of network properties, we generated 10^{4} mass-balanced randomized networks for each of the six analysed genome-scale metabolic networks. A metabolic network is represented as a directed bipartite graph *G* = (*V*_{c} ∪ *V*_{r}, *E*), where *V*_{c} is the set of compound vertices, *V*_{r} the set of reaction vertices and is the set of *directed* edges denoting substrate–reaction and product–reaction relationships. For a compound *c* ∈ *V*_{c}, we denote by *m*_{c} ∈ ℕ^{n} its *mass vector*, i.e. the vector representation of *c* over *n* chemical elements. Here, we consider only the six most abundant elements in biological systems [58]: carbon (C), hydrogen (H), nitrogen (N), oxygen (O), phosphorus (P) and sulphur (S). The mass vector of water is then (0,2,0,1,0,0) · (C,H,N,O,P,S)^{T}. Reversible reactions are represented by one reaction vertex for each direction: *r*^{+} and *r*^{−}, such that *r*_{in}^{+} = *r*_{out}^{−} and *r*_{out}^{+} = *r*_{in}^{−}.

In order to uniformly randomize a network while preserving mass balance, each possible mass-balanced network has to be generated with equal probability. This requires enumeration of all possible sets of substrates and products, for which equation (2.1) is satisfied. As this problem is a special case of the Knapsack problem [59], the number of possible mass-balanced networks is at least exponential in the number of compounds.

We approach the complexity of the general problem by applying a new method for mass-balanced randomization, introduced in Basler *et al.* [25]. The set of possible solutions to equation (2.1) is restricted twofold: (i) the in- and out-degrees of reactions are preserved and (ii) the substitution of compounds is limited to certain subsets, as detailed later, which allows to easily find a solution for equation (2.1). The first restriction is in line with the observation that reaction degrees are biochemically constrained by the number of interacting compounds. The second allows to divide the randomization procedure into a precalculation step and an actual randomization. As a result, the generation of a large set of mass-balanced randomized networks becomes computationally feasible.

The randomization procedure consists of two steps: In the first step, for a given metabolic network *G*, we determine the classes of mass equivalent compounds and pairs of compounds from *V*_{c}(*G*). Two compounds are called mass equivalent, if the mass vector of one compound is a multiple of the other (e.g. CO_{2} and C_{2}O_{4}). Two pairs of compounds are called mass equivalent, if the sum of mass vectors of one pair is a multiple of the sum of mass vectors of the other pair (e.g. (CH_{2}O, CO_{2} ) and (C_{4}H_{2}O_{4}, H_{2}O_{2} )). This definition ensures that the mass vectors of two compounds (and the sums of the mass vectors of two pairs of compounds) from the same mass equivalence class differ only by rational factors (e.g. 2CH_{2}O + 2CO_{2} = C_{4}H_{2}O_{4} + H_{2}O_{2} ). The precalculation of mass equivalent compounds is to be executed only once for all subsequent randomizations of the same network and renders the generation of large sets of mass-balanced randomized networks computationally feasible (see supplementary table S1 in [25] for a performance comparison to switch randomization).

In the second step, the reactions of *G* are randomized while preserving mass balance. To randomize a reaction chosen uniformly at random from *V*_{r} (*G*), its substrates and products are replaced by randomly chosen substitutes from their corresponding mass equivalence classes. When substituting an individual substrate or product, the stoichiometric coefficients of the new reaction are obtained by multiplying the corresponding previous coefficients with the earlier mentioned factor, such that equation (2.1) is satisfied. For the substitution of a pair of substrates or products, the stoichiometric coefficients satisfying equation (2.1) are determined by solving a system of *n* linear equations with two unknowns (electronic supplementary material, table S1 for examples). In case there is no solution, the substitution is not carried out. The output from this step is an (almost) uniformly randomized network in which stoichiometric coefficients are changed, edges are replaced and, consequently, the degrees of the compounds are altered [25]. The approach is in line with the observation that some fundamental properties should be fixed while carrying out the randomization—here, these are the degrees of the reaction vertices and mass balance.

### 4.3. Calculation of *p*-values

The analysed properties are calculated in the original metabolic network and in each of the 10^{4} randomized networks. For the average path length and clustering coefficient, we derive a *z*-score, , from the original value *x*, the average randomized value , and the standard deviation of randomized values *σ*. The two-sided *p*-value is determined as *p* = 2∫_{|z|}^{∞} 𝒩(0,1).

For comparing the metabolite degree and scope size distributions of the metabolic networks with their randomized versions, we apply the two-sample Kolmogorov–Smirnov test. From the cumulative distribution *F*_{n} of the property in the original network, and the joint cumulative distribution *F*_{n′} of the randomized networks, a test statistic is derived as *d*_{n,n′} = sup_{x} |*F*_{n}(*x*) − *F*_{n′}(*x*)|, where *n* and *n*′ are the number of values in the original, respectively the joint randomized distributions. The *p*-value is .

For each reaction vertex *r*, we determine its centrality, *ν*(*r*), in the original network and in 100 randomized networks, which are obtained by preserving *r* and randomizing the remaining reactions. The *p*-value of *r* is , where *q*′_{r} is the number of randomized networks, in which the centrality of *r* is at least as large as in the original network, and *n*′ = 100.

## Acknowledgements

We thank Abdelhalim Larhlimi, Alexander Skupin and Alisdair Fernie for critical comments on the manuscript. This work was supported by German Federal Ministry of Education and Research (to G.B., S.G., J.S. and Z.N.) and by Scottish Funding Council (to O.E.).

- Received September 23, 2011.
- Accepted November 8, 2011.

- This journal is © 2011 The Royal Society

This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.