Royal Society Publishing

Incomplete and noisy network data as a percolation process

Michael P. H. Stumpf, Carsten Wiuf


We discuss the ramifications of noisy and incomplete observations of network data on the existence of a giant connected component (GCC). The existence of a GCC in a random graph can be described in terms of a percolation process, and building on general results for classes of random graphs with specified degree distributions we derive percolation thresholds above which GCCs exist. We show that sampling and noise can have a profound effect on the perceived existence of a GCC and find that both processes can destroy it. We also show that the absence of a GCC puts a theoretical upper bound on the false-positive rate and relate our percolation analysis to experimental protein–protein interaction data.

1. Introduction

In many areas of the physical, engineering, life- and social sciences network data are now becoming increasingly abundant. In light of empirical network data, mathematical models of networks (Durrett 2006) and the statistical tools for their analysis have changed from the beginnings of random graph theory and improved considerably over the past 10 years. However, despite these advances, not much effort has been devoted to detailed investigation and modelling of the effects of erroneous network data.

While for technological networks sampling (Stumpf et al. 2005b) and noise are of relatively minor importance, for biological and social networks the situation is markedly different. Especially in biology, there appears to be a genuine trade-off between data quality and the quantity in which network data are being generated. In particular, in the context of protein-interaction network data, the false-positive and -negative rates have been well documented with reported error rates frequently exceeding 50 per cent (von Mering et al. 2002; Bork et al. 2004; Reguly et al. 2006). With such high error levels, it may not be automatically expected that inferences obtained from noisy networks are informative about the true network, and incorporating a detailed analysis of the effects of noise has now become widely accepted practice (Middendorf et al. 2004; Yook et al. 2004; deSilva et al. 2006). Yet, wherever analyses were repeated on artificially perturbed networks (noise was introduced by adding false-positive interactions and deleting reported interactions) the perturbed networks had similar properties to the original network (Middendorf et al. 2004; Yook et al. 2004; de Silva et al. 2006). The observation that some networks have similar properties when perturbed in this way may suggest that some balance between false positives and false negatives has already been obtained.

Of the reported genome-wide protein interaction surveys that have been published to date, only a small number (Ito et al. 2001; Gavin et al. 2002) do not show evidence for the existence of a giant connected component (GCC). This puzzling observation—a priori we would assume that the molecular machinery underlying living systems is highly interconnected and tightly linked—suggests that noise can induce a percolation transition (Stauffer & Aharony 1992) on real networks. Here we describe percolation transitions on general random networks due to network sampling (Han et al. 2005; Stumpf & Wiuf 2005; Stumpf et al. 2005b; de Silva et al. 2006) and noise in the network observation. This will serve to highlight some differences in the way noise and incompleteness of networks will affect our observations. We will also show that the very existence of a GCC sets an upper limit on the false-positive rate in interaction data.

2. Noisy and incomplete networks and percolation

Here we introduce the notion of noisy and incomplete network ensembles (NINEs), which extend the standard notion of network ensembles (Burda et al. 2001) in order to account for the differences between true networks, and the noisy and incomplete representations thereoff that we can observe.

2.1. Notation

In order to define NINEs, we first assume that we have a true and complete network described by a graph, Embedded Image 2.1 which consists of a set of nodes or vertices, 𝒱, and a set of edges among the nodes, ℰ. The order and size of the network are defined as the number of nodes, N (or the size of 𝒱), and edges, M (or the size of ℰ), respectively, Embedded Image

Now let 𝒱S and ℰS denote subsets of 𝒱 and ℰ, respectively, with the property Embedded Image 2.2 and that e(i,j) ∈ ℰS iff i,j ∈ 𝒱S and e(i,j) ∈ ℰ. Thus, the graph Embedded Image 2.3 is the subgraph of 𝒢 that is induced by 𝒱S ⊆ 𝒱. Trivially, we have ℰS ⊆ ℰ for noiseless data.

Equally, we define noisy networks via Embedded Image 2.4 where 𝒱D = 𝒱 and the set of edges ℰD is defined by Embedded Image 2.5

Thus ρ and ξ are the true-positive and false-positive rates, respectively. Trivially, E[MD] = ρM + ξEmbedded Image.

We can also consider the case of a noisy–incomplete network by simply replacing 𝒱 and ℰ by 𝒱S and ℰS in equations (2.4) and (2.5). We will later introduce different schemes for specifying p, ρ and ξ, and study their interplay in shaping observed network data.

2.2. Percolation on networks

Below we consider general networks and seek to derive conditions for the GCC to emerge in the context of noise and incompleteness. The relative size of a component is defined as the number of nodes in the component divided by the number of non-zero degree nodes. The GCC has non-zero relative size as the size of the network becomes large.

In the theoretical section we assume that the network is uncorrelated and large and hence it is sufficient to know the degree sequence or the probability distribution generating the degree sequence. Networks given by their degree sequence were first studied by Hakimi (1962), and later in more detail by Bender & Canfield (1978) and Molloy & Reed (1995, 1998). The latter also studied percolation processes, i.e. in the context of random graphs the emergence of the GCC (Bollobás & Riordan 2006). We shall, however, refer to these graphs as Bender–Canfield (BC) random graphs. This has since been studied repeatedly in relationship to real networks (Callaway et al. 2000; Newman et al. 2001). The recent monographs by Durrett (2006) and Chung & Lu (2006) provide excellent surveys of this area from (predominantly) probabilistic and combinatorial perspectives, respectively.

The central result of Molloy & Reed (1995), frequently referred to as the Molloy–Reed (MR) criterion is given as follows.

Theorem 1.

A Graph 𝒢 = (𝒱,ℰ) has a GCC with high probability as N → ∞ iff Embedded Image 2.6 where z1 and z2 refer to the average numbers of nearest and next-nearest neighbours, respectively.

The proof of this statement can be found in Molloy & Reed (1995). Different perspectives are also offered by Durrett (2006). Here, we are content with analysing the behaviour of the GCC using equation (2.6) in the context of noisy and incomplete networks. In particular we are interested in uncorrelated large BC networks, where we need only to consider the degree distribution, Pr(k) (rather than the actual sequence of integers). Below, we denote the first and second moments of the degree distribution by 〈k〉 and 〈k2〉, respectively.

2.3. Calculating z1 and z2

Calculation of z1 and z2 is straightforward for the cases we consider and well covered in the literature, e.g. Dorogovtsev & Mendes (2003) or Burda & Krzywicki (2004); here it is only briefly repeated for the sake of completeness. The number of nearest neighbours, z1, is given by the average degree, Embedded Image 2.7

In uncorrelated graphs, the probability of a random edge ending in a node of degree l is Embedded Image 2.8

We can obtain z2 by summing over the probabilities of a node having degree k′ multiplied by the average degrees of the k neighbours, i.e. Embedded Image 2.9

For classical or Erdös–Rényi (Erdös & Rényi 1959, 1960) random graphs, which are characterized by an approximately Poisson degree distribution, the MR criterion yields the well-known result that the GCC appears as λ > 1, where λ is the average degree in the network.

Equations (2.7) and (2.9) together with equation (2.6) allow us to study the percolation behaviour of uncorrelated BC graphs by simply determining the functional form for the degree distribution in the incomplete, 𝒢S, and noisy, 𝒢D, network ensembles. Below, we will study the percolation transitions in incomplete and noisy BC networks, and we will briefly comment on the validity of these results for real correlated networks (Berg & Lässig 2002).

3. Percolation processes due to incomplete and unreliable network data

Noise and incompleteness are known to affect present network data sets; in particular, most biological network data sets are subject to considerable uncertainty (Han et al. 2005; de Silva et al. 2006). For incomplete network data, several studies have recently made some theoretical progress (Stumpf & Wiuf 2005; Stumpf et al. 2005b; Lee et al. 2006; Wiuf & Stumpf 2006), while investigations of the effects of noise have thus far been confined to semi-rigorous assessments of experimental data sets, in particular regarding protein interaction data (von Mering et al. 2002; Hart et al. 2006).

3.1. Percolation due to network sampling

Let p be the probability of sampling a node to be included in the set of subnet vertices 𝒱S. We can either specify p ∈ (0,1) or set p to the fraction of sampled nodes p = NS/N. The degree distribution in the subnet PrS(k) is then given by (Stumpf & Wiuf 2005; Lee et al. 2006), Embedded Image 3.1

Thus, the degree distributions in the true network and random subnets are generally of a different functional form.

The number of nearest and next-nearest neighours in the subnet, z1S and z2S, respectively, are straightforward to calculate and we can find the condition for which the MR criterion, equation (2.6), is fulfilled. We obtain for the number of nearest neighbours in the subnet, Embedded Image 3.2 and the next nearest neighbours, Embedded Image 3.3

Taken together with the MR criterion we can thus derive the critical value for the sampling fraction which indicates the appearance of the GCC, Embedded Image 3.4

From equation (3.4), it is obvious that for degree distributions with a diverging second moment, the GCC only disappears as pc → 0; of course, this is only relevant in the limit, N → ∞.

3.2. Percolation due to noisy network data

Here we consider how observational noise affects the GCC, i.e. if each edge is observed with (true-positive) probability ρ ∈ [0,1], and each non-existing edge is observed with (false-positive) probability ξ ∈ [0,1], do we observe the GCC or not?

We start again by determining the degree distribution in a network with given error rates, ρ and ξ. Assuming that a node has degree k in 𝒢, the degree of a node in 𝒢D is the sum of two binomial variables, one controlling the number of false positives, the other the number of false negatives. Hence, the node has degree l in 𝒢D with probability, Embedded Image 3.5 where 2F1 (a, b, c, x) is the hypergeometric function (Gradshteyn et al. 1994; Arfken & Weber 2005). We note that l > k may occur as the sum of the numbers of observed true positive and false positive edges may be larger than the number of true edges. For the degree distribution of the noisy network, 𝒢D, we thus have Embedded Image 3.6

While this expression is clearly intractable, the two moments 〈kD and 〈k2D have tractable analytic expressions in terms of 〈k〉 and 〈k2〉. Here we find Embedded Image 3.7 and Embedded Image 3.8

Next, we define δ = ρξ; then z1D and z2D are given by Embedded Image 3.9 and Embedded Image 3.10

Note, that many real networks show average degrees that are not comparable to the entire network size N. It suggests that it is reasonable to consider the situation ξ ≈ 0 with Ξ = Nξ, and δρ (assuming ρξ) in which case z1D and z2D become Embedded Image 3.11 and Embedded Image 3.12

The parameter Ξ is essentially the expected number of false positives per node. Together with equation (2.6), equations (3.11) and (3.12) allow us to determine the critical boundary where the GCC appears. For fixed ρ (or Ξ) we can thus determine the critical value of Ξ (or ρ) where the percolation transistion occurs. In contrast to network sampling, we see that depending on the parameter values the GCC might appear even if it is absent in the true network.

One important observation is that if Ξ ≥ 1 then z2D > z1D always, i.e. if more than one false positive is expected per node a GCC is present. This is not surprising since the number of false positives approximately follows a Poisson distribution with parameter Ξ for large networks. It also transpires from equations (3.11) and (3.12) that ρ essentially affects the number of nearest and next-nearest neighbours in the same way as p does under network sampling; though in addition there is a term that also depends on Ξ. The quantities z1S and z2S might be obtained from z1D and z2D by letting ρ = p and Ξ = 0.

3.3. Percolation due to the combined effects of noise and sampling

Combining sampling and noise is straightforward. There is, however, a subtle—so subtle to be undetectable in practice in fact—dependence on the order of sampling and noise. If we first choose the set of nodes, 𝒱S to assay for interactions, and then observe interactions subject to noise, we find Embedded Image 3.13 and Embedded Image 3.14

Should we, however, consider a subnet drawn from an already global but noisy network data set then we have Embedded Image 3.15 and Embedded Image 3.16

In practice, however, the differences are so small that the order in which noise is added to the network and nodes are sampled from it does not matter, i.e. Embedded Image 3.17 and Embedded Image 3.18 which is valid for pN ≫ 1. The percolation transition again occurs when z1DS = z2DS. We reiterate that the MR criterion is derived in the limit of a large network and we note that as N → ∞ or NξΞ the order in which sampling and noise enter the problem becomes irrelevant. But for very small networks this order may indeed matter.

3.4. Examples—theoretical network ensembles

We illustrate the percolation transitions using classical random graphs (Bollobás 2001) and scale-free random graphs (Barabasi & Albert 1999; Barabasi et al. 2001).

3.4.1. Classical random graphs

Classical random graphs are characterized by a binomial, or for sufficiently large graphs, Poisson degree distribution, Embedded Image 3.19

The first and second moments are given by Embedded Image 3.20

The percolation transition thus occurs at the critical sampling fraction Embedded Image 3.21 and a GCC is present if p > 1/λ. The noise-induced transition occurs at the critical values fulfilling Embedded Image 3.22 and a GCC is present if Embedded Image 3.23

Note that this condition reduces to equation (3.21) for Ξ = 0, as it should.

3.4.2. Scale-free random graphs

Scale-free random graphs exhibit a power-law degree distribution, Embedded Image 3.24 where ζx is Riemann's zeta function and acts as a normalizing constant. For γ ≤ 3, the second moment diverges and therefore neither sampling nor noise will induce a percolation transition on a scale-free random network unless γ > 3. Below we assume that we are dealing with a finite (though potentially very large, i.e. N > 106, scale-free random graph) with a degree sequence generated by drawing N random numbers from the distribution given by equation (3.24); in order to qualify as a proper degree distribution the sum of the degree sequence has, of course, to be even.

The sampling transition occurs at a critical value of the sampling fraction given by Embedded Image 3.25

Likewise, the noise-induced transition occurs when Embedded Image 3.26 and if the left side is larger than zero for a given value of (ρ,Ξ) then a GCC appears. As stated before, this is automatically fulfilled for Ξ ≥ 1. The areas where a GCC exists are shown in figure 1 for different values of the exponent γ.

Note that for some fixed values of ρ and Ξ = 0 the GCC is present, disappears for small non-zero values of Ξ and then reappears for larger values of Ξ. The explanation for this phenomomen is that the size of the GCC is relative to the number of non-zero degree nodes: for Ξ = 0 there are many zero degree nodes; when Ξ increases they become connected in many small disconnected components and for large Ξ, the small components are connected to form a GCC.

Figure 1.

Contour-plot of the fraction of nodes contained in the GCC for different fractions of true-positive and false-positive edges in a network (area labels refer to the whole area to the right and above of the corresponding curves). Only when the fraction of true and false positives are both very small, will the GCC disappear. On the left-hand side of the figure we show illustrative examples of cases which correspond (loosely) to areas in the (ρ, Ξ) plane indicated by the arrow-tips. Real edges are indicated in blue, whereas false-positive edges are drawn in green.

4. Percolation analysis for finite systems

Most real networks are finite in size and we would not expect to see a sharp percolation transition (i.e. the disappearance of a GCC) for either noise or sampling processes on networks. In figure 2, we show the effect of sampling on three different networks of order N = 5000; each network has 15 000 edges but was generated in a different way. From each network we sampled, at random, a fraction p of the nodes in the network and determined the components of the resulting induced subgraph. The networks considered are an Erdös–Rényi random graph (with λ = 3), a network generated by the Barabasi–Albert (BA) construction (Barabasi & Albert 1999) (at each timestep a new node was added and preferentially attached to the existing graph with three edges), and a corresponding BC random graph (i.e. we took the degree sequence of a BA graph and generated a randomly rewired graph with the same degree sequence).

Figure 2.

Average fraction of nodes in subnets of a network with 5000 nodes and 15 000 edges against sampling fraction/subnet size. Shown are results obtained for a scale-free network grown under the BA preferential attachment model (black), the corresponding BC graph (see text for details) (grey), and an Erdös–Rényi random graph. Averages were taken over 1000 independent simulated networks (black circles, scale-free RGG; crosses, scale-free BC; triangles, ER, random graph).

The fraction of nodes which are contained in the GCC is virtually identical for the BA network and its BC counterpart. This suggests that, as far as the percolation behaviour is concerned, the type of network which is grown to order N under the BA preferential attachment model is not different from a corresponding BC network ensemble. That is the degree sequence determines the percolation behaviour as suggested by the MR criterion (Molloy & Reed 1995). For the Erdös–Rényi random graph we observe, of course, differences, and the transition appears to be somewhat sharper. In fact, as predicted by equation (3.19), the transition occurs in the vicinity of 1/λ = ⅓.

For noisy networks, the lack of simple analytic expressions makes the analysis of simulations somewhat more complicated. What we find in extensive simulations studies, however, is that the GCC is remarkably stable against noise (see figure 1). Only when both the true-positive rate, ρ, and the false-positive rate, ξ, are sufficiently small, will the GCC vanish. It follows straightforwardly from equation (2.5) that the expected number of false-positive edges in a network—we always assume independence among edges—is given by Embedded Image 4.1 As most real networks are sparse, Embedded Image dominates M, and therefore for moderate values of Eξ[M*] will generally be sufficiently large to ensure that a GCC exists.

5. Error bounds for biological data

Protein interaction network (PIN) data have been rightly chastised for the high error rates in the experimental assays. But the very fact that a GCC is absent can be used in order to put an upper bound on the false-positive rate, ξ, or, equivalently, the average number of false-positive edges per node, Ξ. Only two high-throughput surveys (we considered all yeast and human PIN data sets deposited in the IntAct database, Kerrien et al. 2007) show no evidence for a GCC; some of the basic properties of these networks are summarized in table 1. Using the expressions for z1DS and z2DS, equations (3.15) and (3.16), respectively, we obtain an inequality for the existence of a GCC which depends on Ξ, ρ and p.

View this table:
Table 1.

Details of two protein–protein interaction networks. NGCC is the number of nodes which belong to the largest component in the data set. We have used recent estimates (Hart et al. 2006; Stumpf et al. 2008) of approximately 40 000 interactions among the 6000 or so protein-coding yeast genes in order to estimate and 〈〉 ≈ 6.7; for humans, the average degree is expected to be approximately 〈〉 ≈ 25. Only the data sets of Ho et al. (2002) and Gavin et al. (2002) show no evidence for a GCC.

For real networks, however, we do not know the moments of the degree distribution, 〈k〉 and 〈k2〉. For yeast, a range of studies has estimated the size of the true PIN and with the estimated number of edges, , we may in turn estimate 〈〉 = /N, which still leaves us with 〈k2〉 unknown. If we further set = NS/N we can write as: Embedded Image 5.1 to identify the region for which a GCC should exist according to the MR criterion.

These regions are shown in figure 3 for the only two published data sets that do not exhibit a GCC (according to the usual definitions). Depending on the second moment of the degree distribution, 〈k2〉, the absence of a GCC indicates fairly low true- and false-positive rates. From the results in figure 4, we conclude that for those PIN data sets for which we do not observe a GCC, the fraction of false positive edges must be below 50 per cent, a number that has sometimes been suggested (von Mering et al. 2002; Bader et al. 2004).

Figure 3.

Representations of eight networks in the IntAct database. The two early data sets generated by mass spectrometry (Ho et al. 2002; Gavin et al. 2002) do not show evidence for a GCC, as may well be expected given that they focus on protein complexes. The details of the data sets are provided in table 1.

Figure 4.

The coloured regions indicate combinations in (〈k2〉, ρ, Ξ)-space for which we will observe a GCC. (a) The corresponds to the case of in Ito et al. (2001), (b) to the case of in Gavin et al. (2002). Perhaps somewhat unexpectedly we also observe a pronounced effect of (ξ ≈ 14.5% in (a) and ≈ 12.1% in (b)) on the boundary of the parameter space allowing for a GCC with high probability.

6. Conclusions

Networks offer a convenient and powerful framework for the analysis of structured and complex processes and data. But network data, especially in biology, are also subject to considerable uncertainty and highly incomplete. Here we have studied the effects of incompleteness and noise on the existence of a GCC in theoretical network ensembles and experimental network data. We have introduced the notion of NINEs as a suitable conceptual framework to deal with observational vagaries of experimental network data sets. For uncorrelated networks we have derived the criteria which determine the existence of a GCC (with high probability). We have furthermore performed some simulations for finite-size network ensembles which show that the theoretical results describe the average behaviour of realistic networks.

We have seen that the GCC's persistence under sampling but especially under the effects of noise depends strongly on the second moment of the degree distribution (by virtue of the MR criterion). If networks are scale-free with exponent γ ≤ 3, then the divergence of 〈k2〉 means that a GCC will persist for all levels of noise. For real data sets that are finite—and almost certainly not scale-free in the classical sense (Przulj et al. 2004; Stumpf et al. 2005a; Tanaka et al. 2005; Khanin & Wit 2006)—we can, however, use the emergence of a GCC as (weakly) indicative of the relative effects of false-positive to true-positive error rates.

We have illustrated this in the context of published high-throughput protein interaction data in yeast and humans. Only two data sets fail to exhibit a GCC. The data of Gavin et al. (2002) and Ho et al. (2002) focused on interaction in protein complexes and therefore we would not necessarily expect a GCC to be visible in the data. Previous analyses of error rates in protein interaction data have sometimes focused on overlap between different sets of data (von Mering et al. 2002; Hart et al. 2006), which we expect to systematically overestimate error rates in such data. However, we are aware that our analysis relies on errors being distributed evenly in the network. This assumption is naturally debatable and further theoretical developments might be called for.

Fruitful applications of the present analysis could arise when reversing our approach: how much can we learn from incomplete and noisy data about the true (but unobserved) network? The notion of NINEs introduced above allows us to address this question rigorously. For any realistic complex system, we may start from the assumption that observed data were sampled from a connected graph. There have been only preliminary attempts (e.g. Stumpf et al. (2008)) at reverse-engineering the global topology of networks from partial (often poor) data. Especially, in the social sciences Robins & Pattison (2001) where network sampling is fraught with all manner of methodological, social and ethical problems, and in systems biology, where technological problems frequently prevent us from seeing e.g. protein interactions that are conditional on post-translational modifications, we see the need and much scope for suitable inferential procedures.


The authors thank Sophie Lèbre and Thomas Thorne for helpful discussions. MPHS is a Royal Society Wolfson Research Merit Award holder and acknowledges support from the BBSRC; CW acknowledges research support from the Danish Research Councils.

  • Received February 1, 2010.
  • Accepted March 18, 2010.


View Abstract