## Abstract

Protein interaction networks (PINs) are popular means to visualize the proteome. However, PIN datasets are known to be noisy, incomplete and biased by the experimental protocols used to detect protein interactions. This paper aims at understanding the connection between true protein interactions and the protein interaction datasets that have been obtained using the most popular experimental techniques, i.e. mass spectronomy and yeast two-hybrid. We start from the observation that the adjacency matrix of a PIN, i.e. the binary matrix which defines, for every pair of proteins in the network, whether or not there is a link, has a special form, that we call separable. This induces precise relationships between the moments of the degree distribution (i.e. the average number of links that a protein in the network has, its variance, etc.) and the number of short loops (i.e. triangles, squares, etc.) along the links of the network. These relationships provide powerful tools to test the reliability of datasets and hint at the underlying biological mechanism with which proteins and complexes recruit each other.

## 1. Introduction

Protein interactions are a biological phenomenon that controls a large part of the functionality of a cell. Protein interaction networks (PINs) are graphical representations of the complex patterns of interactions that appear in the proteome, which enable quantitative studies of the underlying biology via mathematical tools and complex networks theory.

Mathematically, a PIN is a graph where nodes *i* = 1 … *N* represent proteins and links represent their interactions. This graph is encoded in an adjacency matrix **a** = {*a _{ij}*}, whose entries denote whether there is a link between proteins

*i*and

*j*(

*a*= 1) or not (

_{ij}*a*= 0). However, there is ambiguity in its definition, arising from the non-binary nature of the underlying biochemistry. For example, three proteins may form a complex, but may not interact in pairs. Assigning binary values to intrinsically non-binary interactions requires further prescriptions, which vary across experimental protocols and lead in practice to different graphs. Moreover, different experiments measure protein interactions in different ways, which causes further biases [1–3]. For quantitative studies of the effects of sampling biases on networks see [4–10].

_{ij}In this paper, we seek to establish the connection between true biological protein interactions and protein interaction datasets produced by the most popular experimental techniques, mass spectronomy (MS) and yeast two-hybrid (Y2H). We argue that the most natural network matrix representation of the proteome has a separable form, which induces precise relationships between the degree distribution and the density of short loops. These relationships provide simple tests to assess the reliability and quality of different datasets, and provide hints on the underlying (evolutionary) mechanisms with which proteins and complexes recruit each other. Our study also provides a theoretical framework to discriminate between ‘party’ and ‘date’ hubs in PINs (e.g. [11] and references therein) and addresses several intriguing questions concerning the universality of protein and complex statistics across species. For example, given *N* protein species in a cell, what is the number of complexes they typically form, i.e. to what extent is the ratio *α* complexes/proteins conserved across different species? Is the distribution of complex sizes peaked around ‘typical’ values, or does it have long tails? How is this mirrored in the protein promiscuities, i.e. the propensities of proteins to participate in multiple complexes? Does the power-law behaviour of the degree distribution of PINs perhaps result from tails in the distribution of complex sizes and protein promiscuities?

We tackle the above questions using a mathematical approach that is entirely based on statistical properties of graph ensembles. In §2, we define our models as distinct separable graph ensembles which mimic PINs, each reflecting different possible mechanisms for complex genesis. In §3, we give an overview of the main results and their application to real PINs measured by MS and Y2H experiments. Section 4 defines the mathematical set-up of our analysis and in §§5–7, we give a full derivation of results, that are tested on synthetically generated networks in §8. We end our paper with a summary of our conclusions, and suggest pathways for further research.

## 2. Definitions and basic properties

### 2.1. The bipartite graph representation of the proteome

Proteins are large and complicated heteropolymers, which can bind in specific combinations to form stable molecular complexes. We consider a set of *N* protein species, labelled by *i* = 1 … *N*. We assume that the number of stable complexes *p* scales as *p* = *αN*, where *α* > 0, and we label the complexes by *μ* = 1 … *αN*. We can represent this system as a bipartite graph [12], with two sets of nodes (figure 1). The set *ν*_{p} represents proteins species (drawn as circles), the set *ν*_{c} represents complexes (drawn as squares), and a link between protein species and complex is drawn if protein *i* participates in complex *μ*. This graph is defined by the *N* × *αN* connectivity matrix where if there is a link between *i* and *μ*, and otherwise. For simplicity, we do not allow for complexes with more than one occurrence of any given protein species.

In the bipartite graph, one has two types of node degrees: the degree (or ‘promiscuity’) of each protein *i* gives the number of different complexes in which it is involved, and the degree (or ‘size’) of each complex *μ* gives the number of protein species of which it is formed (figure 1). For a given bipartite graph ** ξ**, we define the protein degree distribution, or promiscuity distribution, as
2.1where

*δ*is the Kronecher function, defined as 1 for

_{xy}*x*=

*y*and 0 otherwise. This counts the frequency of occurrence of a protein with promiscuity

*d*, and it is normalized by the total number of proteins

*N*. Similarly, we define the complex degree distribution, or complex size distribution, as 2.2which counts the frequency of occurrence of a complex containing

*q*different protein species, divided by the total number of complexes

*αN*.

As the number of links stemming from the proteins has to equate the number of links stemming from complexes, we have This leads to the identity where is the average promiscuity, i.e. the first moment of the distribution of promiscuities, and is the average complex size, i.e. the first moment of the distribution of complex sizes.

### 2.2. Protein interactions as detected by experiments

Protein detection experiments seek to measure for each pair (*i*, *j*) of protein species whether they interact in any complex, and assign an undirected link between nodes *i* and *j* if they do. This leads to a graphical representation of protein interactions in terms of a monopartite graph, where there is only one type of nodes, which represent proteins. The graph can be represented by an adjacency matrix ** a** = {

*a*} whose entries are

_{ij}*a*= 1 if there is a link between proteins

_{ij}*i*,

*j*and 0 otherwise. It is reasonable to expect that PIN adjacency matrices resulting from detection experiments are in the form 2.3and 2.4where

*θ*(

*x*) is a binary function that takes value 1 if

*x*> 0 and 0 otherwise. We call the form of the matrix

**‘separable’ as the dependence of its entries on the indices**

*a**i*,

*j*is factorized, as a consequence of the fact that protein interactions are mediated by complexes. Graph

**can be thought of as a marginalized version of the bipartite graph**

*a***whereby complexes are ‘integrated-out’, i.e. summed over. Figure 2 gives an illustration of the relationship between the two graphical representations, and shows that protein complexes in**

*ξ***are inextricably related to loops in**

*ξ***. In particular, the presence of large complexes will boost the number of short loops in PIN**

*a***.**

*a*If PINs detected experimentally displayed properties too far away from those observed in (2.4), this might signal the presence of strong biases in the experimental protocols for the PIN detection and one should ask what exactly the detection experiment is measuring. A key feature we will exploit in our analysis is that, due to the sparsity of links average properties of random graphs (2.4) are identical, to leading orders in *N*, to those of the related *weighted* random graphs
2.5and
2.6for which average properties are much easier to calculate. Graphs ** c** have the same structure as

**but links are weighted, with each weight representing the**

*a**number*of complexes in which proteins

*i*and

*j*participate simultaneously. We will mainly focus on the following properties of the random graphs

**:**

*c*— The degree distribution 2.7denoting the probability of observing a node in graph

having degree equal to*c**k*. This distribution should not be confused with the promiscuity distribution*p*(*d*|). The latter is a property of the bipartite graphs*ξ*, whereas the former is a property of the monopartite graphs*ξ*.*c*— The density of loops of length 3 and 4, defined as 2.8and 2.9denoting, respectively, the density of closed non-intersecting paths of length 3 and 4, along the links of network

.*c*

We do not go beyond loops of length 4 because real networks are known to be ‘small world’, with pairs of nodes typically connected by paths of small length. With larger path lengths, one can typically link every node to any other and the number of loops through any node will increase significantly as we increase the length of loops. Hence, one expects the relevant biological information to be encoded in the short loops statistics.

### 2.3. Link distribution in the bipartite graph

As we generally do not know the microscopic bipartite graph ** ξ**, we will regard the

**'s as random variables, drawn from a distribution**

*ξ**p*(

**), that we need to postulate. First of all, we will assume that the**

*ξ**ξ*'s are independent, so that their distribution factorizes over the protein and complex indices This is the easiest assumption that we can make and will need to be checked

*a posteriori*. Next, we need to postulate As each is a binary variable which can only take values 0 and 1, we only need to specify and We have three natural choices for mimicking (i) a complex-driven, (ii) a protein-driven and (iii) a mixed mechanism for complex genesis.

(i) A natural choice is to assume that complexes have given sizes {

*q*}, distributed according to which are determined, for example, by the functions that they are called to carry out inside the cell, and the likelihood of a protein_{μ}*i*making part of a complex*μ*is given by the number*q*of proteins participating in complex_{μ}*μ*divided by the total number of proteins, 2.10In random graphsbuilt according to this prescription, for large*ξ**N*, each complex size*q*(_{μ}) is a Poissonian random variable with average*ξ**q*, and all protein promiscuities_{μ}*d*(_{i}) will be Poissonian variables with the same average with (see appendix A). Hence, in this ensemble, complex sizes are prescribed on average, i.e. , and protein promiscuities are homogeneous, Poissonian variables with average determined from the average size of complexes, i.e. As in this ensemble proteins' recruitment to complexes is determined by functions, we will refer to this ensemble as ‘function-driven’ or more briefly, ‘*ξ**q*-ensemble’.(ii) An alternative choice is to assume that proteins have given propensities to interact {

*d*}, distributed according to which are determined, for example, by the number of their binding sites, polarization, etc., and the likelihood of a protein_{i}*i*making part of complex*μ*is given by the number of complexes which involve protein*i*divided by the total number of complexes, 2.11For large graphsdrawn from ensemble (2.11), each protein promiscuity*ξ**d*(_{i}) is a Poissonian variable with average*ξ**d*, whereas all complex sizes_{i}*q*(_{μ}) are Poisson variables with the same average with (appendix A). In this ensemble, it is therefore assumed that protein binding is driven by protein promiscuities, and we will refer to it as ‘protein-driven’ or ‘*ξ**d*-ensemble’.(iii) A third obvious choice is to assume that protein promiscuities {

*d*} and complex sizes {_{i}*q*} are distributed according to given_{μ}*P*(*d*) and*P*(*q*), respectively, and the likelihood of protein*i*participating in complex*μ*is controlled by both protein promiscuity and complex size 2.12Large graphsdrawn from this ensemble will have all protein promiscuities and complex sizes constrained on average, i.e. and with {*ξ**d*} and {_{i}*q*} distributed according to_{μ}*P*(*d*) and*P*(*q*). In this third scenario, protein-binding statistics is driven both by complex functionality and protein promiscuity factors, and we will refer to this as the ‘mixed ensemble’.

The mixed ensemble (2.12) reduces to (2.10) for the choice of homogeneous protein promiscuities and to (2.11) for the choice of homogeneous complex size By determining which of the above ensembles reflects better biological reality, we will learn about the mechanisms with which complexes and proteins recruit each other.

### 2.4. Accounting for binding sites

In all PINs, each protein is reduced to a simple network node, in spite of the fact that proteins are in reality complex chains of amino acids with several binding domains. Here we show that the ensembles introduced in the previous section can accommodate the presence of multiple binding sites when these are equally reactive. Let us first assume that each protein has *d* functional reactive amino acid endgroups. When two such proteins bind, the resulting dimer has 2*d*−2 unused reactive endgroups, a trimer has 3*d*−4 endgroups and a *k*-mer has *kd*−2(*k*−1) = (*d*−2)*k* + 2 endgroups. If all endgroups are equally reactive, the *a priori* probability that a protein *i* is part of a complex *μ* is given by
2.13where the last approximate equality holds for and This corresponds to ensemble (2.10), with the choice If proteins have different endgroups *d _{i}*,
2.14where

*d*=

*N*

^{−1}∑

*, leading to ensemble (2.12). If the variability of*

_{i}d_{i}*q*is small, 2.15and we retrieve (2.11). The assumption of unbiased interactions between proteins with varying individual binding affinities has been supported in [13], and in recent structural analysis on residue-type-independent interactions [14].

_{μ}## 3. Overview of results

In this section, we summarize the main results of this paper, that will be derived in full details in the next sections, suitable for readers with a mathematical or a statistical background. Readers with a less quantitative background who are interested in the biological applications of the mathematical framework introduced above, can stop at the end of this section and skip the mathematical details presented later.

### 3.1. Test relationships

The main result of this paper is that graphs with elements and
where the ** ξ** are drawn from either the ‘function-driven’ distribution (2.10) or the ‘protein-driven’ distribution (2.11), display special relationships between the moments , etc. of their degree distribution

*p*(

*k*) and the density

*m*

_{3},

*m*

_{4}, etc. of their loops of length 3, 4, etc. Remarkably, these relationships are completely independent of

*α*,

*P*(

*q*) and

*P*(

*d*) and follow solely from the separable nature of the matrix

*c*. In addition, they are identical, to orders to those found in the binary matrices

_{ij}**with elements and For the**

*a**q*-ensemble we have the following relationships: 3.1and 3.2whereas for the

*d*-ensemble we have 3.3

Remarkably, we obtain different sets of relationships for the two ensembles, meaning that the two ensembles do not represent equivalent descriptions of the bipartite representation of the proteome, as one may have naively expected. One can then check whether real PINs come closer to satisfy the test relationships from the *q*-ensemble or the ones from the *d*-ensemble. This will hint at the underlying mechanism with which proteins and complexes recruit each other. Next, we apply the above results to real publicly available protein interaction datasets, obtained via MS and Y2H experiments. The detailed quantitative features of the various datasets and their references are listed in table 1.

### 3.2. Application to mass spectrometry datasets

Seven of the experimental PIN datasets in table 1 were obtained by MS experiments, and they involved three distinct biological species, namely *Saccharomyces cerevisiae*, *Homo sapiens* and *Escherichia coli*. Each set takes the form of an *N* × *N* matrix of binary entries *a _{ij}*, but with different values of

*N*. In figure 3, we show the results of our analytical predictions for the densities of length-3 and length-4 loops, as given by the formulae for the function- and protein-driven ensembles, versus their measured values in the MS datasets. Before looking at the performance of PIN data with respect to the above test formulae, let us briefly look at what we would expect in fully random networks, of the Erdös–Rényi (ER) type, which are not in the family of separable graphs. Let us denote with

*m*and

_{ℓq}*m*the density of loop of length

_{ℓd}*ℓ*as predicted by the function- and the protein-driven ensemble, respectively, and with

*m*

_{ℓ}_{m}their measured value. In fully random graphs, the degree distribution is Poissonian and one has and Furthermore, measured values of loop densities are typically for

*ℓ*= 3,4. Hence, the r.h.s. of (3.1) and (3.2) vanish giving and whereas (3.3) gives and Hence, one has and

We now turn to the interpretation of plots in figure 3. First off, we note that the theoretical predictions from the function-driven ensemble lead to values of the number of short loops consistently higher than those predicted by the protein-driven ensemble, even for loops of length 3. This is remarkably different from the behaviour expected in random graphs of the ER type and is consistent with the behaviour of separable random graphs, where a function-driven complex genesis induces large cliques in the PINs, which boosts short loops. By contrast, a protein-driven complex genesis induces a homogeneous distribution for the complex sizes, which suppresses the presence of large cliques, hence of short loops, in the PINs. Notably, the densities of length-4 loops of all MS datasets are in between those of the *d*-ensemble (which thereby acts as a lower bound) and those of the *q*-ensemble (which acts as an upper bound). This suggests a compatibility of data from MS experiments with the expected separable form of the proteome network. However, the measured densities of length-3 loops are consistently lower than the values compatible with a separable structure of the proteome.

To shed light on this result, it is useful to compare this behaviour with those of random graphs with the same degree distribution as our PINs. As the latter is non-Poissonian, the theoretical predictions for the loop densities given by the function-driven ensemble are now and whereas both *m*_{3 m} and *m*_{4 m} are still yielding the same *m*_{3d} and *m*_{4d} as in ER networks. This now leads to and

In real PINs, we observe the same patterns of inequalities; however, the measured values of loop densities are hence much larger than their expected values in random graphs with the same degree distribution. In particular, several datasets (including *cerevisiae VI* and *cerevisiae IX*) show a loop density *m*_{3} of the same magnitude as the one predicted by the separable models and dataset *cerevisiae X* is in excellent agreement with the predicted value. This suggests a degree of compatibility with the proposed separable model, which enforces many more loops than in a typical random network with the same degree distribution, and also a large degree of noise which tends to randomize interactions.

### 3.3. Applications to yeast two-hybrid datasets

We tested similarly the compatibility of Y2H data with a separable structure of the proteome, by checking whether the measured values for the network observables *m*_{3} and *m*_{4} fall within what appeared to be (in separable graph ensembles) theoretical bounds set by the function- and protein-driven ensembles. We now used the 12 PIN datasets in table 1 that were obtained from Y2H experiments. Results are shown in figure 4. We observe that Y2H datasets exhibit generally fewer short loops than MS dataset. This may be due to the fact that, at variance with MS datasets, which use immunoprecipitations to sample from a functioning biological network, Y2H experiments sample proteins from the entire potential biophysical network, a much larger interaction space than any given one cell-type/tissue sampled by MS, leading to an undersampling of links and therefore to underestimation of connectivity and loops. Qualitatively Y2H datasets show similar trends to MS, with measured values of *m*_{4} somewhat compatible with separable models, and values for *m*_{3} that fall below. This is quite remarkable, as MS and Y2H experiments are known to measure interactions in very different ways. However, quantitatively, *m*_{3} is generally an order of magnitude less than that predicted by separable models, with the exception of dataset *jejuni* which stays closer to the predicted values. This suggests that Y2H has a lower level of compatibility with a separable structure of the proteome.

### 3.4. Origin of fat tails in the degree distribution of protein interaction network

In previous sections, we have introduced the promiscuity distribution *P*(*d*), the complex size distribution *P*(*q*) and the degree distribution *p*(*k*). The former distributions are properties of the bipartite graph ** ξ**, whereas the latter is a property of the marginalized graph

**, or**

*c***. The degree distribution**

*a**p*(

*k*) can be computed directly from the graph

**and for PINs it typically displays a fat tail, for large**

*a**k*with 2 <

*μ*< 3 [33–36].

By contrast, *P*(*d*) and *P*(*q*) cannot be measured directly, but they are related to *p*(*k*). This allows one in principle to infer the tail behaviour of the promiscuity and complex size distributions from the tail of the degree distribution of the PIN, which can be easily computed. The relationship between the above distributions depends on the mechanism driving complex genesis, i.e. (i) function-driven, (ii) protein-driven or (iii) mixed.

(i) For function-driven ensembles, one has for large

*q*3.4and This shows that the complex size distribution*P*(*q*) decays faster than the degree distribution of the associated PIN, so fat tails in the degree distribution*a**p*(*k*) of PINs can emerge from less heterogeneous complex size distributions. In particular, complex size distributions*P*(*q*) with a finite second moment (but diverging higher moments) give scale-free degree distributions*p*(*k*). This is consistent with the intuition that, while large hubs are often observed in PINs, super-complexes of the same number of proteins are unlikely to be stable. Indeed, many interactions in hubs are ‘date’ types, as opposed to ‘party’ type [11]. Our framework allows us to discriminate between different types of hub proteins, and suggests that heterogeneous (i.e. power law) behaviour in PINs may emerge from homogeneous protein ‘dating’ and moderately heterogeneous protein ‘partying’.(ii) For protein-driven ensembles, one has for large

*d*, 3.5with 3.6whereas Hence, any tail in the promiscuity distribution will produce the same tail in the degree distribution of, but with a rescaled amplitude. Fat tails in the degree distribution of PINs can thus arise from equally heterogeneous ‘dating’ interactions between proteins, combined with a homogeneous distribution of ‘party’ interactions. Short loops are boosted by broad distributions of complex sizes, as large complexes in the bipartite graph induce large cliques in the network*a*. The*a**d*-ensemble (4.2), which attributes any heterogeneity in*p*(*k*) to heterogeneity of protein binding promiscuities, generates separable PIN graphswith the least number of loops. Conversely, the*a**q*-ensemble (4.1), which attributes all heterogeneity in*p*(*k*) to heterogeneity in complex sizes, generates separable PIN graphswith the largest number of loops.*a*(iii) For the mixed ensembles, it is easier to write relationships in terms of the distribution

*W*(*q*), related to*P*(*q*) via Assuming*W*(*q*) has a power-law tail, with a finite first moment (as in both cases previously considered), i.e. with*γ*> 2, one has*P*(*d*) ∼*C*′*d*^{−}with^{μ}*μ*≤*γ*, where for*γ*>*μ*and for*γ*=*μ*. This means that if*W*(*q*) decays faster than*p*(*k*) (as for protein-driven recruitment), then the tail in*p*(*k*) must arise from the tail in*P*(*d*), although heterogeneities in*P*(*q*) will affect the amplitude of the power-law tail in*P*(*d*). Conversely, if*P*(*d*) is as broad as*W*(*q*), then both*P*(*q*) and*P*(*d*) contribute to the tail of*p*(*k*), whose amplitude will be the sum of the amplitudes of the tails of the two distributions.

## 4. Mathematical set-up

The remainder of this paper is devoted to the derivation of results presented in §3. We start by casting the objectives of our study in mathematical language. Formally, we can write the bipartite network distributions for the three different type of protein–complex recruitment (i.e. complex-driven, protein-driven and mixed) as
4.1
4.2
4.3respectively. Bipartite graphs drawn from (4.1) were found to have modular topologies, and to accomplish parallel information processing for suitable values of the parameter *α* [37,38], and their connection to PINs has been pointed out in [39]. As explained above, the three ensembles become equivalent when complex sizes and protein promiscuities are both homogeneous, i.e. and In that case, the recruitment process between proteins and complexes is fully random. We are interested in the properties of the random graph ensemble
4.4

where and *p*(** ξ**) is given by (4.1)–(4.3). Some properties of (4.4) will turn out not to depend on the choices made for the distributions of complex sizes and protein promiscuities, and this leads to powerful benchmarks against which to test available PIN datasets.

A key step of our analysis is that averages over (4.4) can often be replaced by averages over the ensemble of *weighted* graphs (2.6)
4.5

For finite *q _{μ}*,

*d*and

_{i}*α*, one finds that in large networks generated via (4.1)–(4.3), the probability of seeing

*c*> 1 is of order and the values of many macroscopic observables in the

_{ij}**a**and

**ensembles will, to leading order in**

*c**N*, be identical.

## 5. Network properties generated by the *q*-ensemble

In this section, we study the statistical properties of the ensembles (4.5) and (4.4) upon generating the bipartite protein interaction graph ** ξ** from ensemble (4.1), where complexes recruit proteins.

### 5.1. Link probabilities

For the graphs ** c** of (4.5), we find the following expectation values of individual bonds:
5.1where the brackets on the r.h.s. denote averaging over the complex size distribution

*P*(

*q*). The likelihood of an individual bond is (see appendix B) 5.2so we find for the first few probabilities: 5.3and 5.4and hence 5.5

The probability to have *c _{ij}* ≠ 0 is of order so the graphs generated by (4.5) are finitely connected. Moreover, although the graphs

**are in principle weighted, for large**

*c**N*the number of links per node that are not in {0, 1} will be vanishingly small.

### 5.2. Densities of short loops

We now turn to the calculation of expectation values for different observables in ensemble (4.5). First, we calculate the average number of ordered and oriented loops of length-3 per node, which are (see appendix B) 5.6 5.7

Calculating the density of loops *m _{L}* for lengths

*L*> 3 can be simplified by returning to the bipartite graph

**. We define a star**

*ξ**S*to be a simple (

_{n}*n*+ 1)-node tree in

**, of which the central node belongs to**

*ξ**ν*

_{c}(the complexes), and the

*n*leaves belong to

*ν*

_{p}(the proteins). Thus,

*S*

_{2}stars represent protein dimers,

*S*

_{3}stars represent protein trimers, and so on. Each link in

**corresponds to at least one**

*c**S*

_{2}star in the bipartite graph (which, in turn, can be a subset of any

*S*star with

_{n}*n*> 2). Therefore, the total number of

*S*

_{2}stars in the bipartite graph, 5.8has to equate in leading order the total number of links in graph

**, yielding 5.9which is indeed in agreement with the result of the direct calculation using (5.1). Similarly, we can obtain the number of loops of length 3, calculated earlier, by realizing that these loops arise when we have in the bipartite graph either a star**

*c**S*

_{3}(which can be a subset of any

*S*with

_{n}*n*> 3) or a combination of three

*S*

_{2}stars, where every leaf is shared by two stars. The contribution of the number of

*S*

_{3}stars per node to the number of loops of length 3 is 5.10The contribution of the combination of three

*S*

_{2}stars, where each leaf is shared by two stars, is 5.11with the square brackets [

*i*,

*j*,

*k*] denoting that the three indices are distinct. The expected density of length 3 loops is the sum of an contribution from

*S*

_{3}stars, plus an contribution from combinations of three

*S*

_{2}stars that share leaves. For large

*N*the second contribution vanishes, and we recover Likewise, the contribution to the density of length-4 loops comes from

*S*

_{4}stars in the bipartite graph, which consist of five sites (four leaves and one central node) and four links, each with probability Combinations of two

*S*

_{3}stars with two shared leaves, or of

*S*

_{2}stars, always involve a number of links at least equal to the number of nodes and therefore yield sub-leading contributions. Hence, the density of loops of length 4 is 5.12More generally, the average density of loops of arbitrary length

*L*is given by 5.13For large

*N*the ratio

*α*and the distribution

*P*(

*q*) of complex sizes apparently determine in full the statistics of loops of arbitrary length in

**, if the protein interactions are described by (4.1).**

*c*Finally, we note that if *m _{L}* gives the number of ordered and oriented loops of length

*L*per node, the number of unordered and unoriented closed paths of length

*L*equals as there are

*L*possible nodes to start a closed path from, and two possible orientations.

### 5.3. The degree distribution

It follows from (5.9) and (5.13) that by measuring the average degree and the densities *m _{L}* of loops of length

*L*we can compute all the moments of the distribution of complex sizes

*P*(

*q*): 5.14This would allow us to calculate

*P*(

*q*) in full via its generating function, provided

*α*and are known. However, counting the number of loops of arbitrary length in a graph is computationally challenging, and

*α*and are generally unknown. However, it is possible to express

*P*(

*q*) for large

*N*in terms of the degree distribution

*P*(

*k*) of

**. Specifically, in appendix C, we show that 5.15where 5.16**

*c*and is the likelihood to draw a link attached to a complex node of degree *q* in the bipartite graph ** ξ**. Formula (5.15) is easily interpreted. The degree of node

*i*in

**is given by the second neighbours of**

*c**i*in

**; the number**

*ξ**ℓ*of first neighbours of node

*i*will thus be a Poissonian variable with average and each of its

*ℓ*first neighbours will have a degree

*q*drawn from

_{r}*W*(

*q*). Clearly, any tail in the distribution

_{r}*W*(

*q*) will induce a tail in the distribution

*p*(

*k*), with (as we will show below) the same exponent, but an amplitude that is reduced by a factor

One can complement (5.15) with a reciprocal relationship that gives *P*(*q*) in terms of *p*(*k*). To achieve this, we define the generating functions and We then see from expression (5.15) for *p*(*k*) that
5.17and
5.18

The first identity can be rewritten as *Q*_{1}(−log(1−*y*)) = *Q*_{2}(*y*). Inserting this into (5.18) allows us to express the desired *Q*_{3}(*z*) as
5.19which translates into
5.20We can now extract the asymptotic form of *P*(*q*) from that of *p*(*k*). The generating functions *Q*_{1}(*z*) of degree distributions that exhibit prominent tails, i.e. for large *k* with 2 < *μ* < 3 (as observed in PINs [33–36]), are for small *z* of the form
5.21where *Γ* is Euler's gamma function [40]. For small *z*, we may use to rewrite (4.19) as
5.22Combining this with (5.21) then gives, for small *z*,
5.23Hence, for small *z*, *Q*_{3}(*z*) has the same form as *Q*_{1}(*z*),
5.24

Therefore, *W*(*q*) behaves asymptotically in the same way as *p*(*k*), i.e. This, in turn, gives
5.25

### 5.4. Relationships that are independent of *P*(*q*) and *α*

The first two moments of *p*(*k*) are given, to leading order in *N*, by (see appendix C)
5.26which is in agreement with (5.9), and
5.27The latter is easily interpreted in terms of the underlying bipartite graph: is the average density of paths of length two, so it has a contribution from because of backtracking, plus a contribution from pairs of *S*_{2} stars that share a node, whose density is
5.28plus a contribution from *S*_{3} stars, whose density is (as shown earlier). Combining (5.27) with (5.14) gives us a relationship between average and width of the degree distribution of ** c** and its density of length-3 loops. Remarkably, this relationship is completely independent of

*α*and

*P*(

*q*): 5.29

This identity and others, which all depend only on the separable underlying nature of the PIN and the assumption of complex-driven recruitment of proteins to complexes, can be derived more systematically from (5.20) by expanding both sides as power series in *z* and comparing the expansion coefficients. This gives a hierarchy of relationships between moments of *p*(*k*) and *P*(*q*), and hence (via (5.13)) between moments of *p*(*k*) and densities of loops of increasing length, that are all completely independent of *α* and *P*(*q*). At order *z*^{2}, one recovers (5.29). The next order *z*^{3} leads to
5.30

To test these asymptotic identities in finite systems, we generate random graphs ** c** of size

*N*= 3000 according to (4.1) and (4.5), and we compared the measured values of

*m*

_{3}and

*m*

_{4}in these random graphs with the predictions of formulae (5.29) and (5.30), respectively. We show the results in figure 5.

### 5.5. Link between *a* and *c* graph definitions

*a*

*c*

In conventional experimental PIN databases, one records only whether or not protein pairs interact, not the *number* of complexes in which they interact. Hence, protein interactions are normally represented in terms of the adjacency matrix ** a** = {

*a*}, which is related to the weighted matrix

_{ij}**= {**

*c**c*} via with the convention for the step function

_{ij}*θ*(0) = 0. We therefore have However, the links {

*a*} are correlated. In appendix D, we derive the relationship between the expected values of different graph observables for the two graph ensembles

_{ij}*p*(

**) and**

*a**p*(

**). Denoting averages in the**

*c***ensemble as , and using the usual notation for averages in the**

*a***ensemble, one finds that for large**

*c**N*the first two moments of the degree distributions and the first two loop densities in the two ensembles are identical: 5.31 5.32 5.33 5.34

Square brackets underneath summations again indicate distinct indices, which excludes backtracking in the counting of length-4 loops. Apparently, the ensembles *p*(** a**) and

*p*(

**) are asymptotically equivalent with regard to the statistics of these four quantities. We will see in the next section that this equivalence holds also for the ‘dual’ ensemble (4.2). To test the above claims, we compute and show in figure 6 the above observables in synthetic graphs**

*c***and**

*c***generated randomly from (4.4) and (4.5), where the random bipartite interaction graph**

*a***is drawn from (4.1).**

*ξ*## 6. Network properties generated by the *d*-ensemble

In this section, we will derive properties for the network ensembles (4.4) and (4.5) upon assuming that the statistics of the underlying bipartite PIN are given by (4.2), i.e. are protein-driven as opposed to complex-driven. Despite the superficial similarity between definitions (4.1) and (4.2), the expectations of graph observables in the two ensembles are found to be remarkably different.

### 6.1. Link probabilities

We start by calculating the link expectation values in the weighted graphs 6.1

Hence the random graphs ** c** are again finitely connected, now with
6.2

Averages over *d* refer to the distribution *P*(*d*) of protein promiscuities in the bipartite graph ** ξ**. The result (6.2) can also be written as and is thus notably different from the earlier expression found in the

*q*-ensemble. The link likelihood is calculated in appendix B, and shows again that

### 6.2. Densities of short loops

We can calculate the density of length-3 loops similar to how this was done for the *q*-ensemble in the previous section. Again these are given, to order , by the *S*_{3} stars in the bipartite graph, as the contribution from combinations of *S*_{2} stars is as before Here we obtain
6.3For loops of arbitrary length *L*, this generalizes to
6.4

Interestingly, the densities *m _{L}* of short loops and the average connectivity depend on

*P*(

*d*) only through its first moment. Promiscuity heterogeneity apparently cannot affect the densities of short loops. In the present ensemble, these densities must therefore be identical to what would be found in a randomly wired bipartite graph. This prediction will be confirmed in simulations.

### 6.3. The degree distribution

In appendix C, we calculate the asymptotic degree distribution of ** c** for the protein-driven complex recruitment model (4.2), giving
6.5

This result is again understood easily: the number of neighbours of a node *i* is a Poissonian variable *ℓ*, with average *d*, where *d* is now drawn from *P*(*d*). Each of the *ℓ* first neighbours will have a degree which is a Poissonian variable with average so the number *k* of second neighbours of *i* in the bipartite graph is a Poisson variable with average Equation (6.5) shows that a tail in the promiscuity distribution *P*(*d*) will induce a tail in the degree distribution *p*(*k*) of ** c**. The link between the two distributions is again most easily expressed via generating functions. Upon defining and we obtain from (6.5)
6.6For this gives
6.7Hence, if

*p*(

*k*) decays for large

*k*as with 2 <

*μ*< 3, then via (5.21) we infer that 6.8Equivalently, 6.9This implies that for large

*d*the promiscuity distribution will be of the form where 6.10

### 6.4. Relationships that are independent of *P*(*d*) and *α*

The first two moments of the degree distribution *p*(*k*) of the separable PIN networks ** c** are
6.11and
6.12Combination of (6.18), (6.12) and (6.13) now yields the relationship
6.13which still involves and

*α*. We can also find an alternative expression for the density of loops of length 3 by combining (6.18) and (6.13) 6.14

Unfortunately, neither of our two expressions for *m*_{3}, (6.13) nor (6.14), are useful, because the protein promiscuities distribution *P*(*d*) and the ratio *α* are generally unknown. Access to information on these quantities via future detection experiments may therefore be extremely welcome in support of theoretical modelling of protein interaction datasets. To make progress, we need to derive relationships for graph observables that are independent of *α* and *P*(*d*). We note that (6.14) yields
6.15This can be rewritten using (6.18) as
6.16

On the other hand, we know from (6.14) that Combining the above formulae allows us to establish the following relationship, that now is completely independent of *P*(*d*) and *α*:
6.17

Again we have tested the various formulae in synthetically generated graphs (figure 7).

### 6.5. Link between *a* and *c* graph definitions

*a*

*c*

As a final step, we check whether the observables *m*_{3} and *m*_{4} are indeed the same for the two PIN definitions (4.4) and (4.5), with the bipartite graph of our protein-driven ensemble (4.2), as protein detection experiments provide the binary matrix ** a** as opposed to the weighted graph

**for which (6.21) was derived. Again we denote averages relating to**

*c***as and those relating to**

*a***as For the moments of the degree distributions, we find the differences to be negligible: 6.18and 6.19The same is true for the densities of loops of length 3 and 4: 6.20and 6.21**

*c*This equivalence between the ensembles *p*(** a**) and

*p*(

**) when calculating the main average values of graph observables for large**

*c**N*implies that large protein interaction adjacency matrices can in practice be regarded as having a separable structure. Again, we check our relationships (6.12), (6.18), (6.20) and (6.21), against synthetically generated graphs and show results in figure 8.

## 7. Macroscopic observables in the mixed ensemble

The two bipartite graph ensembles (4.1) and (4.22) considered so far led to Poissonian distributions either for the protein promiscuities *d _{i}* (in the

*q*-ensemble) or for the complex sizes

*q*(in the

_{μ}*d*-ensemble). It is possible to model heterogeneity in both

*d*and

_{i}*q*using the mixed ensemble (4.3). Owing to the similarities with previous calculations we can and will be more brief in this section. For ensemble (4.3), the expectation values of individual links in the weighted graph

_{μ}**are 7.1and the average connectivity follows as 7.2**

*c*Full details are found in appendix B. As in previous ensembles, the leading contribution to the density of length-3 loops comes from the *S*_{3} stars in the bipartite graphs, now giving
7.3

As before, the heterogeneity in the *d* affects neither the average connectivity nor the density of triangles *m*_{3}, both are as they were in the *q*-ensemble. This is confirmed numerically (figure 9). The degree distribution for large *N* in the ensemble *p*(** c**) is calculated in appendix C, giving
7.4where
7.5

Again it is possible to relate the asymptotic behaviour of *p*(*k*) to that of *P*(*d*) and *W*(*q*), by inspecting the relationship between the relevant generating functions. Using our previous definitions for *Q*_{1}(*z*), *Q*_{2}(*z*), *Q*_{3}(*z*), and *Q*_{4}(*z*), we obtain via (7.4) and (7.5)
7.6and
7.7

Expanding (7.6) for small *z* tells us that Substitution into (7.7) subsequently gives
7.8

Assuming *W*(*q*) to have a power-law tail, but with a finite first moment (as in all cases previously considered), i.e. with *γ* > 2, its generating function *Q*_{3}(*z*), can be written as
7.9where *δ* = min{2,*γ* − 1}. Insertion into (7.8) then leads to
7.10

If *p*(*k*) = *Ck*^{−}* ^{μ}*, with 2 <

*μ*< 3, we may use our earlier result (5.21) and get 7.11

If *γ* > *μ* we have *δ* > *μ* − 1, so we can neglect the second term in the argument of *Q*_{4}, and conclude that the promiscuity distribution has the asymptotic form *P*(*d*) = *C*′*d*^{−}* ^{μ}* where This means that if

*W*(

*q*) decays faster than

*p*(

*k*) (as in §6), then the tail in

*p*(

*k*) must arise from the tail in

*P*(

*d*). Note, however, that heterogeneities in

*P*(

*q*) will affect the amplitude of the power-law tail in

*P*(

*d*), which will be smaller by a factor compared with the case where where we had Conversely, if

*γ*=

*μ*we have

*δ*=

*μ*− 1, and writing the term explicitly in (7.10) gives 7.12Expanding both sides in powers of

*z*and equating prefactors tells us that either

*C*′ = 0 and (i.e. which retrieves the case in §5), or

*δ*=

*μ*with Hence, if

*P*(

*d*) is as broad as

*W*(

*q*), then both contribute to the tail in

*p*(

*k*), whose amplitude will be the sum of the amplitudes of the tails in

*P*(

*q*) and

*P*(

*d*). We see in (7.12) that

*γ*<

*μ*is not possible, i.e.

*W*(

*q*) needs to decay at least as fast as

*p*(

*k*).

In appendix C, we calculate the first two moments of the degree distribution *p*(*k*) of the ensemble *p*(** c**). This recovers (7.2) for the first moment, and for the second moment gives
7.13Substituting (7.2) and (7.3) into (7.13) then leads to
7.14

The density of length-3 loops depends again on the first two moments of the degree distribution *p*(*k*), but is also seen to depend on the first two moments of the promiscuity distribution *P*(*d*), which is unknown. Hence, this relationship cannot serve as a test of PIN data quality. It is nevertheless useful for comparing the mixed ensemble with the *d*- and the *q*-ensembles in synthetically generated data.

## 8. Numerical comparison of the three bipartite generative ensembles

Here we compare the ability of our bipartite ensembles (4.1)–(4.4) to predict properties of the associated binary PIN graphs, for synthetic networks that are generated from any of these ensembles. We focus on comparing homologous formulae for the observables *m*_{3} and *m*_{4}. The synthetic matrices ** a** = {

*a*} with are defined as before via with

_{ij}*θ*(0) = 0, and the links of the bipartite graph

**are generated from the following three protocols. In the first protocol, links between nodes (**

*ξ**i*,

*μ*) are drawn randomly and independently, until their total number reaches a prescribed limit. In the second protocol, we assign the links preferentially to complexes with large sizes. In the third protocol, we assign links preferentially to proteins with large promiscuities.

In figure 9, we show along the vertical axes the values of (left) predicted by the three ensembles, via formulae (5.26), (6.2) and (7.2), the predicted values of (middle), via (5.27), (6.12) and (7.13), and the predicted triangle density *m*_{3} (right), via (5.29), (6.13) and (7.14). All are shown together with the corresponding values that were measured in ** a**, along the horizontal axes. As expected, the

*d*-ensemble outperforms the other ensembles when links are drawn according to

*d*-preferential attachment, whereas the

*q*-ensemble performs better for graphs generated via

*q*-preferential attachment. The mixed ensemble performs very similar to the

*q*-ensemble in terms of counting triangles, as expected from the reasoning in §7. Deviations between the

*q*and the mixed ensembles are most evident in the second moment of the degree distribution, where the mixed ensemble always leads to values well above those of the

*q*- and the

*d*-ensembles. We found in §6 that the

*d*-ensemble is indistinguishable from a fully random ensemble when calculating and

*m*

_{3}, which explains why the

*d*-ensemble predicts the values of these two observables perfectly. The other two ensembles are more sensitive to finite size effects, as any heterogeneity in the

*q*will boost the number of loops.

In figure 10, we show the values of *m*_{3} and *m*_{4} predicted by those formulae that involve only measurable graph observables, for the synthetically generated graphs used in figure 9. The prediction of *m*_{3} is now obtained from (5.29) and (6.21), for the *q*- and *d*-ensembles, respectively, and *m*_{4} is evaluated using (5.30) and (6.21). In figure 11, we plot the degree distribution *p*(*k*) of graphs with identical values for the number of nodes (*N* = 3000) and the number of links generated synthetically via the three chosen protocols, together with the distributions *P*(*q*) of complex sizes and *P*(*d*) of protein promiscuities. As explained in §7, tails in the degree distribution *p*(*k*) ∼ *k*^{−}* ^{μ}* can arise either from a complex size distribution

*P*(

*q*) ∼

*q*

^{−}

^{μ}^{−1}and a homogeneous promiscuity distribution, or from having an equally fat tail in the promiscuity distribution

*P*(

*d*) ∼

*d*

^{−}

*together with less heterogeneous complex sizes*

^{μ}*P*(

*q*) ∼

*q*

^{−}

^{γ}^{−1}with

*γ*>

*μ*.

## 9. Conclusion

In this paper, we propose a bipartite network representation of protein interactions, where the two node types represent proteins and complexes. A protein–protein interaction network can then be regarded as the result of a ‘marginalization’ of the bipartite network, whereby the complexes are integrated out (i.e. summed over). This leads to a weighted PIN ** c** with a separable structure. Adjacency matrices of PINs

**are then simply the binary versions of the separable**

*a***, obtained by the entry truncations**

*c**a*=

_{ij}*θ*(

*c*), with the convention

_{ij}*θ*(0) = 0. One of the central results of this work is that for sufficiently large networks there is an equivalence between the two graph ensembles

*p*(

**) and**

*c**p*(

**), inasmuch as macroscopic statistical properties are concerned, such as densities of short loops and degree distributions. This allows us to regard the conventional protein interaction adjacency matrices as if they were to have a separable structure, and induces precise relationships between expectation values of macroscopic graph observables which, remarkably, only depend on measurable quantities and on the underlying mechanism with which proteins and complexes recruit each other. They are independent of inaccessible microscopic details of proteins and their complexes.**

*a*We considered the two extreme complex recruitment scenarios, one where recruitment is driven solely by protein promiscuities, and another where it is driven by complex sizes. Preferential attachment to large complexes (the *q*-ensemble) favours the presence of large cliques in PINs, which boosts the number of short loops. Hence we can reasonably expect that the predictions on short loop densities from the *q*-ensemble will overestimate the real number of loops. Conversely, preferential attachment based only on protein promiscuities (the *d*-ensemble) leads to homogeneous complex sizes, which suppresses large cliques in PINs, leading to an underestimation of short loop densities. Remarkably, real protein interaction data from MS and Y2H experiments show a density of length-4 loops in between the predictions of the *d*-ensemble and those of the *q*-ensemble, suggesting a degree of compatibility of these experimental data with a separable structure of the proteome. By contrast, both MS and Y2H datasets show densities or length-3 loops that are consistently smaller than all our theoretical predictions and closer to expectation in random graphs with identical degree distributions, suggesting the presence of a noise level which randomizes interactions. We note that MS values generally show a higher degree of compatibility with a separable structure of the proteome than Y2H.

We believe that, by providing a systematic and practical framework for understanding protein interaction experiments, our approach may represent a valuable step towards establishing a more solid connection between protein interaction datasets and the underlying biology. Universal bounds on observables in PINs may become powerful tools for data quality testing. As future work, it would be useful to apply the present framework to datasets with different features, including ribosomal data, large-scale datasets resulting from the union of known datasets (e.g. [41]), more nuanced descriptions of PINs as those involving alternative splicing, as well as adapting the present framework to include multiple measurements from repeated experiments [42]. Improved versions of the present models, which fit the experimental data better, may open a route to infer quantities such as the ratio *α*, and the distributions of protein promiscuities and complex sizes. Such quantities are hardly available in the current PIN datasets, and are generally difficult to access experimentally. The present work has revealed that the asymptotic forms of these distributions can be extracted from the tails of the PIN degree distributions. Recent protein–complex datasets such as [43,44] may provide useful sources to test inference capabilities of the present framework. Finally, our method my shed some light on the way protein and complexes recruit one another, in particular, whether this recruitment is driven by proteins or by complexes, and may enable us to discriminate between ‘party hub’ and ‘date hub’ interactions.

## Competing interests

We declare we have no competing interests.

## Funding

A.C.C.C. is grateful for support from the UK's Biotechnology and Biological Sciences Research Council (BBSRC).

## Acknowledgements

A.A. acknowledges Alessandro Pandini and Sun Chung for providing protein interaction datasets. Kate Roberts is acknowledged for interesting discussions during the early stages of this work. Authors are grateful to all referees for many useful comments.

## Appendix A. Promiscuities and complex size distributions in ensembles *P*_{q}(*ξ*) and *P*_{d}(*ξ*)

_{q}

*ξ*

_{d}

*ξ*

For graphs drawn from ensemble (4.1), where complex sizes {*q _{μ}*} are drawn from a given distribution

*P*(

*q*), the distribution of protein promiscuities is, for large

*N*, A 1For graphs drawn from ensemble (4.2), where protein promiscuities {

*d*} are drawn from a given distribution

_{i}*P*(

*q*), the distribution of complex sizes is, for large

*N*, A 2

## Appendix B. Link probabilities in the weighted protein interaction network

In this appendix, we derive the likelihood to have a link in the weighted PIN when the are drawn from the ensembles (4.1)–(4.3).

### B.1. The *q*-ensemble

In the *q*-ensemble, we have

B 1

From this, one reads off directly the values of *p*(*c _{ij}* = 0),

*p*(

*c*= 1) and

_{ij}*p*(

*c*≥ 2). The density of triangles is obtained writing (B 2) as B 2and using B 3This gives

_{ij}B 4

### B.2. The *d*-ensemble

In the *d*-ensemble, we obtain

B 5 which gives B 6

### B.3. The mixed ensemble

For the mixed ensemble, the link likelihood is found to be

B 7 giving B 8

## Appendix C. Calculation of the degree distribution *p*(*k*)

In this appendix, we calculate the degree distribution of the weighted PIN in which the entries are drawn from the bipartite ensembles (4.1)–(4.3).

### C.1. The *q*-ensemble

In the *q*-ensemble, we can calculate *p*(*k*) as follows:

C 1

Hence, for large network sizes *N* → ∞ we obtain

C 2

We can rewrite this in terms of the distribution which denotes the likelihood to draw a link attached to a node of degree *q* in the bipartite graph,
C 3and upon defining
C 4we finally get to
C 5

The interpretation is that if we draw *ℓ* from a Poisson distribution with and then draw *ℓ* variables *q _{r}* from

*W*(

*q*), we find

_{r}*k*as a Poissonian variable with Clearly

*p*(

*k*) is normalized, and for its first moment we find C 6For the second moment, we obtain C 7This is in agreement with results from a direct calculation: C 8

### C.2. The *d*-ensemble

We can calculate the asymptotic degree distribution in the *d*-ensemble as follows:
C 9

### C.3. The mixed ensemble

In the mixed ensemble, we have the asymptotic degree distribution C 10

We can rewrite this expression in terms of the associated distribution as C 11or, equivalently, as C 12where C 13

The first two moments of *p*(*k*) are
C 14and
C 15

## Appendix D. The link between observables in the *a* and *c* networks

*a*

*c*

In this appendix, we inspect the relationship between expectation values of various observables in the ensembles *p*(** a**) and

*p*(

**).**

*c*### D.1. The *q*-ensemble

Denoting averages in the ** a** ensemble as , we have, for the

*q*-ensemble of bipartite graphs D 1and D 2where we used

D 3

For loops of length 3, we proceed in the same way, obtaining

D 4

where we used

D 5

Finally for loops of length 4, we have

D 6

where we used

D 7

and

D 8

Again, the square brackets underneath the summations indicate that all indices are different, to exclude backtracking in the counting of loops of length 4.

### D.2. The *d*-ensemble

For the *d*-ensemble, denoting averages relating to ** a** as , we have
D 9and

D 10

where we used

D 11

For loops of length 3, we have

D 12

where we used

D 13

Finally, for loops of length 4, we have

D 14

where we used

D 15

and D 16

- Received June 27, 2015.
- Accepted August 12, 2015.

- © 2015 The Author(s)

Published by the Royal Society. All rights reserved.