## Abstract

We discuss the ramifications of noisy and incomplete observations of network data on the existence of a giant connected component (GCC). The existence of a GCC in a random graph can be described in terms of a percolation process, and building on general results for classes of random graphs with specified degree distributions we derive percolation thresholds above which GCCs exist. We show that sampling and noise can have a profound effect on the perceived existence of a GCC and find that both processes can destroy it. We also show that the absence of a GCC puts a theoretical upper bound on the false-positive rate and relate our percolation analysis to experimental protein–protein interaction data.

## 1. Introduction

In many areas of the physical, engineering, life- and social sciences network data are now becoming increasingly abundant. In light of empirical network data, mathematical models of networks (Durrett 2006) and the statistical tools for their analysis have changed from the beginnings of random graph theory and improved considerably over the past 10 years. However, despite these advances, not much effort has been devoted to detailed investigation and modelling of the effects of erroneous network data.

While for technological networks sampling (Stumpf *et al.* 2005*b*) and noise are of relatively minor importance, for biological and social networks the situation is markedly different. Especially in biology, there appears to be a genuine trade-off between data quality and the quantity in which network data are being generated. In particular, in the context of protein-interaction network data, the false-positive and -negative rates have been well documented with reported error rates frequently exceeding 50 per cent (von Mering *et al.* 2002; Bork *et al.* 2004; Reguly *et al.* 2006). With such high error levels, it may not be automatically expected that inferences obtained from noisy networks are informative about the true network, and incorporating a detailed analysis of the effects of noise has now become widely accepted practice (Middendorf *et al.* 2004; Yook *et al.* 2004; deSilva *et al.* 2006). Yet, wherever analyses were repeated on artificially perturbed networks (noise was introduced by adding false-positive interactions and deleting reported interactions) the perturbed networks had similar properties to the original network (Middendorf *et al.* 2004; Yook *et al.* 2004; de Silva *et al.* 2006). The observation that some networks have similar properties when perturbed in this way may suggest that some balance between false positives and false negatives has already been obtained.

Of the reported genome-wide protein interaction surveys that have been published to date, only a small number (Ito *et al.* 2001; Gavin *et al.* 2002) do not show evidence for the existence of a giant connected component (GCC). This puzzling observation—*a priori* we would assume that the molecular machinery underlying living systems is highly interconnected and tightly linked—suggests that noise can induce a percolation transition (Stauffer & Aharony 1992) on real networks. Here we describe percolation transitions on general random networks due to network sampling (Han *et al.* 2005; Stumpf & Wiuf 2005; Stumpf *et al.* 2005*b*; de Silva *et al.* 2006) and noise in the network observation. This will serve to highlight some differences in the way noise and incompleteness of networks will affect our observations. We will also show that the very existence of a GCC sets an upper limit on the false-positive rate in interaction data.

## 2. Noisy and incomplete networks and percolation

Here we introduce the notion of noisy and incomplete network ensembles (NINEs), which extend the standard notion of network ensembles (Burda *et al.* 2001) in order to account for the differences between true networks, and the noisy and incomplete representations thereoff that we can observe.

### 2.1. Notation

In order to define NINEs, we first assume that we have a true and complete network described by a graph,
2.1
which consists of a set of nodes or vertices, 𝒱, and a set of edges among the nodes, ℰ. The order and size of the network are defined as the number of nodes, *N* (or the size of 𝒱), and edges, M (or the size of ℰ), respectively,

Now let 𝒱_{S} and ℰ_{S} denote subsets of 𝒱 and ℰ, respectively, with the property
2.2
and that *e*(*i,j*) ∈ ℰ_{S} iff *i,j* ∈ 𝒱_{S} and *e*(*i,j*) ∈ ℰ. Thus, the graph
2.3
is the subgraph of 𝒢 that is induced by 𝒱_{S} ⊆ 𝒱. Trivially, we have ℰ_{S} ⊆ ℰ for noiseless data.

Equally, we define noisy networks via
2.4
where 𝒱_{D} = 𝒱 and the set of edges ℰ_{D} is defined by
2.5

Thus *ρ* and *ξ* are the true-positive and false-positive rates, respectively. Trivially, *E*[*M*_{D}] = *ρ**M* + *ξ*.

We can also consider the case of a noisy–incomplete network by simply replacing 𝒱 and ℰ by 𝒱_{S} and ℰ_{S} in equations (2.4) and (2.5). We will later introduce different schemes for specifying *p*, *ρ* and *ξ*, and study their interplay in shaping observed network data.

### 2.2. Percolation on networks

Below we consider general networks and seek to derive conditions for the GCC to emerge in the context of noise and incompleteness. The relative size of a component is defined as the number of nodes in the component divided by the number of non-zero degree nodes. The GCC has non-zero relative size as the size of the network becomes large.

In the theoretical section we assume that the network is uncorrelated and large and hence it is sufficient to know the degree sequence or the probability distribution generating the degree sequence. Networks given by their degree sequence were first studied by Hakimi (1962), and later in more detail by Bender & Canfield (1978) and Molloy & Reed (1995, 1998). The latter also studied percolation processes, i.e. in the context of random graphs the emergence of the GCC (Bollobás & Riordan 2006). We shall, however, refer to these graphs as Bender–Canfield (BC) random graphs. This has since been studied repeatedly in relationship to real networks (Callaway *et al.* 2000; Newman *et al.* 2001). The recent monographs by Durrett (2006) and Chung & Lu (2006) provide excellent surveys of this area from (predominantly) probabilistic and combinatorial perspectives, respectively.

The central result of Molloy & Reed (1995), frequently referred to as the Molloy–Reed (MR) criterion is given as follows.

### Theorem 1.

*A Graph 𝒢 = (𝒱,ℰ) has a GCC with high probability as N → ∞ iff*
2.6

*where*

*z*_{1}and*z*_{2}refer to the average numbers of nearest and next-nearest neighbours, respectively.The proof of this statement can be found in Molloy & Reed (1995). Different perspectives are also offered by Durrett (2006). Here, we are content with analysing the behaviour of the GCC using equation (2.6) in the context of noisy and incomplete networks. In particular we are interested in uncorrelated large BC networks, where we need only to consider the degree distribution, Pr(*k*) (rather than the actual sequence of integers). Below, we denote the first and second moments of the degree distribution by 〈*k*〉 and 〈*k*^{2}〉, respectively.

### 2.3. Calculating *z*_{1} and *z*_{2}

Calculation of *z*_{1} and *z*_{2} is straightforward for the cases we consider and well covered in the literature, e.g. Dorogovtsev & Mendes (2003) or Burda & Krzywicki (2004); here it is only briefly repeated for the sake of completeness. The number of nearest neighbours, *z*_{1}, is given by the average degree,
2.7

In uncorrelated graphs, the probability of a random edge ending in a node of degree *l* is
2.8

We can obtain *z*_{2} by summing over the probabilities of a node having degree *k*′ multiplied by the average degrees of the *k* neighbours, i.e.
2.9

For classical or Erdös–Rényi (Erdös & Rényi 1959, 1960) random graphs, which are characterized by an approximately Poisson degree distribution, the MR criterion yields the well-known result that the GCC appears as *λ* > 1, where *λ* is the average degree in the network.

Equations (2.7) and (2.9) together with equation (2.6) allow us to study the percolation behaviour of uncorrelated BC graphs by simply determining the functional form for the degree distribution in the incomplete, 𝒢_{S}, and noisy, 𝒢_{D}, network ensembles. Below, we will study the percolation transitions in incomplete and noisy BC networks, and we will briefly comment on the validity of these results for real correlated networks (Berg & Lässig 2002).

## 3. Percolation processes due to incomplete and unreliable network data

Noise and incompleteness are known to affect present network data sets; in particular, most biological network data sets are subject to considerable uncertainty (Han *et al.* 2005; de Silva *et al.* 2006). For incomplete network data, several studies have recently made some theoretical progress (Stumpf & Wiuf 2005; Stumpf *et al.* 2005*b*; Lee *et al.* 2006; Wiuf & Stumpf 2006), while investigations of the effects of noise have thus far been confined to semi-rigorous assessments of experimental data sets, in particular regarding protein interaction data (von Mering *et al.* 2002; Hart *et al.* 2006).

### 3.1. Percolation due to network sampling

Let *p* be the probability of sampling a node to be included in the set of subnet vertices 𝒱_{S}. We can either specify *p* ∈ (0,1) or set *p* to the fraction of sampled nodes *p* = *N*_{S}/*N*. The degree distribution in the subnet Pr_{S}(*k*) is then given by (Stumpf & Wiuf 2005; Lee *et al.* 2006),
3.1

Thus, the degree distributions in the true network and random subnets are generally of a different functional form.

The number of nearest and next-nearest neighours in the subnet, *z*_{1}^{S} and *z*_{2}^{S}, respectively, are straightforward to calculate and we can find the condition for which the MR criterion, equation (2.6), is fulfilled. We obtain for the number of nearest neighbours in the subnet,
3.2
and the next nearest neighbours,
3.3

Taken together with the MR criterion we can thus derive the critical value for the sampling fraction which indicates the appearance of the GCC, 3.4

From equation (3.4), it is obvious that for degree distributions with a diverging second moment, the GCC only disappears as *p*_{c} → 0; of course, this is only relevant in the limit, *N* → ∞.

### 3.2. Percolation due to noisy network data

Here we consider how observational noise affects the GCC, i.e. if each edge is observed with (true-positive) probability *ρ* ∈ [0,1], and each non-existing edge is observed with (false-positive) probability *ξ* ∈ [0,1], do we observe the GCC or not?

We start again by determining the degree distribution in a network with given error rates, *ρ* and *ξ*. Assuming that a node has degree *k* in 𝒢, the degree of a node in 𝒢_{D} is the sum of two binomial variables, one controlling the number of false positives, the other the number of false negatives. Hence, the node has degree *l* in 𝒢_{D} with probability,
3.5
where _{2}*F*_{1} (*a, b, c, x*) is the hypergeometric function (Gradshteyn *et al.* 1994; Arfken & Weber 2005). We note that *l* > *k* may occur as the sum of the numbers of observed true positive and false positive edges may be larger than the number of true edges. For the degree distribution of the noisy network, 𝒢_{D}, we thus have
3.6

While this expression is clearly intractable, the two moments 〈*k*〉_{D} and 〈*k*^{2}〉_{D} have tractable analytic expressions in terms of 〈*k*〉 and 〈*k*^{2}〉. Here we find
3.7
and
3.8

Next, we define *δ* = *ρ*− *ξ*; then *z*_{1}^{D} and *z*_{2}^{D} are given by
3.9
and
3.10

Note, that many real networks show average degrees that are not comparable to the entire network size *N*. It suggests that it is reasonable to consider the situation *ξ* ≈ 0 with *Ξ* = *N**ξ*, and *δ* ≈ *ρ* (assuming *ρ*≫*ξ*) in which case *z*_{1}^{D} and *z*_{2}^{D} become
3.11
and
3.12

The parameter *Ξ* is essentially the expected number of false positives per node. Together with equation (2.6), equations (3.11) and (3.12) allow us to determine the critical boundary where the GCC appears. For fixed *ρ* (or *Ξ*) we can thus determine the critical value of *Ξ* (or *ρ*) where the percolation transistion occurs. In contrast to network sampling, we see that depending on the parameter values the GCC might appear even if it is absent in the true network.

One important observation is that if *Ξ* ≥ 1 then *z*_{2}^{D} > *z*_{1}^{D} always, i.e. if more than one false positive is expected per node a GCC is present. This is not surprising since the number of false positives approximately follows a Poisson distribution with parameter *Ξ* for large networks. It also transpires from equations (3.11) and (3.12) that *ρ* essentially affects the number of nearest and next-nearest neighbours in the same way as *p* does under network sampling; though in addition there is a term that also depends on *Ξ*. The quantities *z*_{1}^{S} and *z*_{2}^{S} might be obtained from *z*_{1}^{D} and *z*_{2}^{D} by letting *ρ* = *p* and *Ξ* = 0.

### 3.3. Percolation due to the combined effects of noise and sampling

Combining sampling and noise is straightforward. There is, however, a subtle—so subtle to be undetectable in practice in fact—dependence on the order of sampling and noise. If we first choose the set of nodes, 𝒱_{S} to assay for interactions, and then observe interactions subject to noise, we find
3.13
and
3.14

Should we, however, consider a subnet drawn from an already global but noisy network data set then we have 3.15 and 3.16

In practice, however, the differences are so small that the order in which noise is added to the network and nodes are sampled from it does not matter, i.e.
3.17
and
3.18
which is valid for *pN* ≫ 1. The percolation transition again occurs when *z*_{1}^{DS} = *z*_{2}^{DS}. We reiterate that the MR criterion is derived in the limit of a large network and we note that as *N* → ∞ or *N**ξ* → *Ξ* the order in which sampling and noise enter the problem becomes irrelevant. But for very small networks this order may indeed matter.

### 3.4. Examples—theoretical network ensembles

We illustrate the percolation transitions using classical random graphs (Bollobás 2001) and scale-free random graphs (Barabasi & Albert 1999; Barabasi *et al.* 2001).

#### 3.4.1. Classical random graphs

Classical random graphs are characterized by a binomial, or for sufficiently large graphs, Poisson degree distribution, 3.19

The first and second moments are given by 3.20

The percolation transition thus occurs at the critical sampling fraction
3.21
and a GCC is present if *p* > 1/*λ*. The noise-induced transition occurs at the critical values fulfilling
3.22
and a GCC is present if
3.23

Note that this condition reduces to equation (3.21) for *Ξ* = 0, as it should.

#### 3.4.2. Scale-free random graphs

Scale-free random graphs exhibit a power-law degree distribution,
3.24
where *ζ*_{x} is Riemann's zeta function and acts as a normalizing constant. For *γ* ≤ 3, the second moment diverges and therefore neither sampling nor noise will induce a percolation transition on a scale-free random network unless *γ* > 3. Below we assume that we are dealing with a finite (though potentially very large, i.e. *N* > 10^{6}, scale-free random graph) with a degree sequence generated by drawing *N* random numbers from the distribution given by equation (3.24); in order to qualify as a proper degree distribution the sum of the degree sequence has, of course, to be even.

The sampling transition occurs at a critical value of the sampling fraction given by 3.25

Likewise, the noise-induced transition occurs when
3.26
and if the left side is larger than zero for a given value of (*ρ*,*Ξ*) then a GCC appears. As stated before, this is automatically fulfilled for *Ξ* ≥ 1. The areas where a GCC exists are shown in figure 1 for different values of the exponent *γ*.

Note that for some fixed values of *ρ* and *Ξ* = 0 the GCC is present, disappears for small non-zero values of *Ξ* and then reappears for larger values of *Ξ*. The explanation for this phenomomen is that the size of the GCC is relative to the number of non-zero degree nodes: for *Ξ* = 0 there are many zero degree nodes; when *Ξ* increases they become connected in many small disconnected components and for large *Ξ*, the small components are connected to form a GCC.

## 4. Percolation analysis for finite systems

Most real networks are finite in size and we would not expect to see a sharp percolation transition (i.e. the disappearance of a GCC) for either noise or sampling processes on networks. In figure 2, we show the effect of sampling on three different networks of order *N* = 5000; each network has 15 000 edges but was generated in a different way. From each network we sampled, at random, a fraction *p* of the nodes in the network and determined the components of the resulting induced subgraph. The networks considered are an Erdös–Rényi random graph (with *λ* = 3), a network generated by the Barabasi–Albert (BA) construction (Barabasi & Albert 1999) (at each timestep a new node was added and preferentially attached to the existing graph with three edges), and a corresponding BC random graph (i.e. we took the degree sequence of a BA graph and generated a randomly rewired graph with the same degree sequence).

The fraction of nodes which are contained in the GCC is virtually identical for the BA network and its BC counterpart. This suggests that, as far as the percolation behaviour is concerned, the type of network which is grown to order *N* under the BA preferential attachment model is not different from a corresponding BC network ensemble. That is the degree sequence determines the percolation behaviour as suggested by the MR criterion (Molloy & Reed 1995). For the Erdös–Rényi random graph we observe, of course, differences, and the transition appears to be somewhat sharper. In fact, as predicted by equation (3.19), the transition occurs in the vicinity of 1/*λ* = ⅓.

For noisy networks, the lack of simple analytic expressions makes the analysis of simulations somewhat more complicated. What we find in extensive simulations studies, however, is that the GCC is remarkably stable against noise (see figure 1). Only when both the true-positive rate, *ρ*, and the false-positive rate, *ξ*, are sufficiently small, will the GCC vanish. It follows straightforwardly from equation (2.5) that the expected number of false-positive edges in a network—we always assume independence among edges—is given by
4.1
As most real networks are sparse, dominates *M*, and therefore for moderate values of *E*_{ξ}[*M**] will generally be sufficiently large to ensure that a GCC exists.

## 5. Error bounds for biological data

Protein interaction network (PIN) data have been rightly chastised for the high error rates in the experimental assays. But the very fact that a GCC is absent can be used in order to put an upper bound on the false-positive rate, *ξ*, or, equivalently, the average number of false-positive edges per node, *Ξ*. Only two high-throughput surveys (we considered all yeast and human PIN data sets deposited in the IntAct database, Kerrien *et al.* 2007) show no evidence for a GCC; some of the basic properties of these networks are summarized in table 1. Using the expressions for *z*_{1}^{DS} and *z*_{2}^{DS}, equations (3.15) and (3.16), respectively, we obtain an inequality for the existence of a GCC which depends on *Ξ*, *ρ* and *p*.

For real networks, however, we do not know the moments of the degree distribution, 〈*k*〉 and 〈*k*^{2}〉. For yeast, a range of studies has estimated the size of the true PIN and with the estimated number of edges, *M̂*, we may in turn estimate 〈*k̂*〉 = *M̂*/*N*, which still leaves us with 〈*k*^{2}〉 unknown. If we further set *p̂* = *N*_{S}/*N* we can write as:
5.1
to identify the region for which a GCC should exist according to the MR criterion.

These regions are shown in figure 3 for the only two published data sets that do not exhibit a GCC (according to the usual definitions). Depending on the second moment of the degree distribution, 〈*k*^{2}〉, the absence of a GCC indicates fairly low true- and false-positive rates. From the results in figure 4, we conclude that for those PIN data sets for which we do not observe a GCC, the fraction of false positive edges must be below 50 per cent, a number that has sometimes been suggested (von Mering *et al.* 2002; Bader *et al.* 2004).

## 6. Conclusions

Networks offer a convenient and powerful framework for the analysis of structured and complex processes and data. But network data, especially in biology, are also subject to considerable uncertainty and highly incomplete. Here we have studied the effects of incompleteness and noise on the existence of a GCC in theoretical network ensembles and experimental network data. We have introduced the notion of NINEs as a suitable conceptual framework to deal with observational vagaries of experimental network data sets. For uncorrelated networks we have derived the criteria which determine the existence of a GCC (with high probability). We have furthermore performed some simulations for finite-size network ensembles which show that the theoretical results describe the average behaviour of realistic networks.

We have seen that the GCC's persistence under sampling but especially under the effects of noise depends strongly on the second moment of the degree distribution (by virtue of the MR criterion). If networks are scale-free with exponent *γ* ≤ 3, then the divergence of 〈*k*^{2}〉 means that a GCC will persist for all levels of noise. For real data sets that are finite—and almost certainly not scale-free in the classical sense (Przulj *et al.* 2004; Stumpf *et al.* 2005*a*; Tanaka *et al.* 2005; Khanin & Wit 2006)—we can, however, use the emergence of a GCC as (weakly) indicative of the relative effects of false-positive to true-positive error rates.

We have illustrated this in the context of published high-throughput protein interaction data in yeast and humans. Only two data sets fail to exhibit a GCC. The data of Gavin *et al.* (2002) and Ho *et al.* (2002) focused on interaction in protein complexes and therefore we would not necessarily expect a GCC to be visible in the data. Previous analyses of error rates in protein interaction data have sometimes focused on overlap between different sets of data (von Mering *et al.* 2002; Hart *et al.* 2006), which we expect to systematically overestimate error rates in such data. However, we are aware that our analysis relies on errors being distributed evenly in the network. This assumption is naturally debatable and further theoretical developments might be called for.

Fruitful applications of the present analysis could arise when reversing our approach: how much can we learn from incomplete and noisy data about the true (but unobserved) network? The notion of NINEs introduced above allows us to address this question rigorously. For any realistic complex system, we may start from the assumption that observed data were sampled from a connected graph. There have been only preliminary attempts (e.g. Stumpf *et al.* (2008)) at reverse-engineering the global topology of networks from partial (often poor) data. Especially, in the social sciences Robins & Pattison (2001) where network sampling is fraught with all manner of methodological, social and ethical problems, and in systems biology, where technological problems frequently prevent us from seeing e.g. protein interactions that are conditional on post-translational modifications, we see the need and much scope for suitable inferential procedures.

## Acknowledgements

The authors thank Sophie Lèbre and Thomas Thorne for helpful discussions. MPHS is a Royal Society Wolfson Research Merit Award holder and acknowledges support from the BBSRC; CW acknowledges research support from the Danish Research Councils.

## Footnotes

- Received February 1, 2010.
- Accepted March 18, 2010.

- © 2010 The Royal Society