## Abstract

What type of connectivity structure are we seeing in protein–protein interaction networks? A number of random graph models have been mooted. After fitting model parameters to real data, the models can be judged by their success in reproducing key network properties. Here, we propose a very simple random graph model that inserts a connection according to the degree, or ‘stickiness’, of the two proteins involved. This model can be regarded as a testable distillation of more sophisticated versions that attempt to account for the presence of interaction surfaces or binding domains. By computing a range of network similarity measures, including relative graphlet frequency distance, we find that our model outperforms other random graph classes. In particular, we show that given the underlying degree information, fitting a stickiness model produces better results than simply choosing a degree-matching graph uniformly at random. Therefore, the results lend support to the basic modelling methodology.

## 1. Introduction and model

A protein–protein interaction (PPI) network is commonly viewed as an unweighted, undirected graph. Each node in the graph represents a protein and an edge between a pair of nodes indicates that these proteins have been observed to interact physically (Ito *et al*. 2000; Uetz *et al*. 2000; Giot *et al*. 2003; Li *et al*. 2004; Rual *et al*. 2005; Stelzl *et al*. 2005). The types of connectivity patterns that arise are neither completely random, in the classical Erdös–Rényi sense (henceforth denoted by ‘ER’), nor completely deterministic (Grindrod & Kibble 2004).

In an attempt to understand and describe the PPI connectivities, a number of models, i.e. formulae for generating edges in some probabilistic sense, have been proposed and tested against observed networks (Jeong *et al*. 2001; Maslov & Sneppen, 2002; Barabási *et al*. 2003; Pržulj *et al*. 2004; de Silva & Stumpf 2005). Many works have focused on matching degree distributions and recovering a scale-free law (Jeong *et al*. 2001; Maslov & Sneppen 2002; Barabási *et al*. 2003; Salathé *et al*. 2005), although whether PPI networks are really scale-free is still the subject of debate (Pržulj *et al*. 2004; Han *et al*. 2005; Dupuy *et al*. 2006; Friedel & Zimmer 2006; Khanin & Wit 2006). Our aim here is to present a new, pared-down, but biologically motivated model that simplifies previous work to the extent that fitting parameters and comparing local and global graph properties become meaningful and revealing.

Among the few existing models that incorporate some biological justification are those of Caldarelli *et al*. (2002), Thomas *et al*. (2003) and Deeds *et al*. (2006). These related models have in common the idea that proteins interact because they share complimentary physical aspects, a concept that is consistent with the underlying biochemistry. Following Thomas *et al*. (2003), we will refer to these physical aspects as binding domains. The approach in these papers is to generate graphs by assigning binding domain information to the nodes at random and then inserting links probabilistically according to some pairwise matching criterion. The aim is then to reproduce properties observed in real PPI networks, most notably the degree distribution. We also mention that a refined ‘lock-and-key’ version of the model from Thomas *et al*. (2003) has been used to extract protein-level detail from real datasets (Morrison *et al*. 2006), further justifying the modelling approach.

Presently, it would be a very challenging task to infer the number and distribution of distinct binding domains from a real PPI network (Bateman 2002; Deng *et al*. 2002), not least because the networks are known to be noisy (Sprinzak *et al*. 2003). For this reason, it is difficult to decide whether the models from (Caldarelli *et al*. 2002; Thomas *et al*. 2003; Deeds *et al*. 2006) are being tested under realistic parameter ranges. Therefore, we propose a simplified model that attempts to summarize the abundance and popularity of binding domains on a protein as a single number based on its normalized degree; we call this number the *stickiness index*. The model has the benefit of being tunable to the given degree structure of a PPI network. In this way, a benchmark model that captures the essence of Caldarelli *et al*. (2002), Deeds *et al*. (2006), Thomas *et al*. (2003) can be tested.

Our work can be motivated by two main assumptions.

*Assumption* 1. Having a high degree implies that a protein has many binding domains and/or its binding domains are commonly involved in interactions.

*Assumption* 2. A pair of proteins is more likely to interact (share complementary binding domains) if both have high stickiness indices, and correspondingly less likely to interact if one or both have a low stickiness index. Thus, we take the product of the two stickiness indices to define the probability of interaction—this borrows from the concept of an AND gate in Boolean logic (Ben-Ari 2001) and the idea of a rank-one approximation in dimension reduction (Eldén 2006).

The following pseudocode defines our model.

input , list of degrees of

*N*nodesoutput , adjacency matrix from model

for

*i*=1 to*N*end

Initialize all

*w*_{ij}=0for

*i*=1 to*N*for

*j*=1 to*N*compute a uniform (0, 1) sample,

*r*if

*r*≤*θ*_{i}*θ*_{j}*w*_{ij}=1 and*w*_{ji}=1

end if

end for

end for

This choice of stickiness index *θ*_{i} ensures that the *i*th node in the model has the *expected* degree deg_{i}. Moreover, under assumption 2, this definition of stickiness in terms of degree is the only one that captures the correct expected degree. Details are given in appendix A.

Our stickiness index coincides with the concept of *fitness* in Caldarelli *et al*. (2002), with a notable distinction that fitness in Caldarelli *et al*. (2002) is assigned at random, with a focus on the resulting degree distribution, whereas stickiness above is assigned deterministically, based on the unique choice that matches the expected degrees. Since we do not require any other parameter fitting, this approach allows us to perform a ‘proof of principle’ test of the basic idea that links can be modelled via mutual compatibility.

Note that high-degree proteins in the present PPI networks may not necessarily contain a plenty of binding domains, as implied by our assumption 1. Instead, their high connectivities may be artefacts of *technical false positives*, auto-activators or ‘sticky’ proteins, or owing to *biological false positives*, as some PPIs can occur in the experimental procedure, but not *in vivo* because protein pairs are not expressed at the same time, in the same sub-cellular compartment, or in the same tissue (Han *et al*. 2005). Thus, our assumption 1 may be a severe oversimplification for some proteins in the present PPI datasets. Nevertheless, as PPI detection biotechnologies improve to produce cleaner, higher-confidence PPI data, assumption 1 will become more descriptive of the observed networks.

A multitude of random graph models that reproduce scale-free degree distributions have been proposed, although the relevance of scale-freeness to PPI networks has been questioned (Pržulj *et al*. 2004; Han *et al*. 2005; Dupuy *et al*. 2006; Friedel & Zimmer 2006; Khanin & Wit 2006). The most notable such models are those based on biologically motivated *gene duplication and mutation* network growth principles (Vazquez *et al*. 2001; Pastor-Satorras *et al*. 2003; Wagner 2003; Goh *et al*. 2004). In these models, networks grow by duplication of nodes (genes), and as a node gets duplicated, it inherits most of the neighbours (interactions) of the parent node, but gains some new neighbours as well. Thus, a hybrid model having properties of both the gene duplication–mutation model and the stickiness index-based model is a promising future direction. In such a model, a duplicated gene would inherit the parent's stickiness index along with many of the parent's neighbours, as in a gene duplication–mutation model and it would gain new neighbours in proportion to its inherited stickiness index and stickiness indices of the nodes already in the network, as in our stickiness index-based model.

We remark that early tests on low confidence data in (Maslov & Sneppen 2002) suggest that PPI networks have a bias against connections between high-degree proteins. This is potentially at odds with the models in Caldarelli *et al*. (2002), Deeds *et al*. (2006), Thomas *et al*. (2003), where sets of proteins that share matching and commonly occurring (high fitness) physical aspects will interact and all have high degree. In our simple model, we assign edges independently, but it would be possible to add a post-processing stage in which the links were rewired in order to test various types of correlation. Hence, a further application of our model is in studying correlation effects in PPI network topology.

## 2. Experiments and results

Comparing large real-world networks is computationally intensive as it involves an NP-complete *subgraph isomorphism problem* (West 2001). Thus, simple heuristics measuring *global* and *local* network properties have been used. The most commonly examined global network properties are the *degree distribution*, *clustering coefficient* and network *diameter* (see Newman (2003) for a detailed survey). More recently, bottom-up local approaches to study a network structure have been proposed (Milo *et al*. 2002; Shen-Orr *et al*. 2002; Pržulj *et al*. 2004). Analogous to sequence motifs, *network motifs* have been defined as subgraphs that recur in a network at frequencies much higher than those found in randomized networks (Milo *et al*. 2002; Shen-Orr *et al*. 2002; Milo *et al*. 2004); they were used to uncover basic functional units in various real-world networks. To account for frequencies of occurrence of all small subgraphs rather than for only the over-represented ones, *graphlets* were defined as small connected non-isomorphic induced subgraphs of a large network and their *relative frequencies* were used to define a new *distance* measure between two networks (Pržulj *et al*. 2004).

To examine the fit of our new stickiness index-based model of PPI networks, we use all these standard global and local network parameters. The relative graphlet frequency distance is the most demanding network similarity measure, imposing 29 different constraints on the networks being compared (details in Pržulj *et al*. 2004); hence, we use it as our main comparison tool. We compared 14 large publicly available PPI networks with sample networks from five models, including the stickiness model.

We used PPI networks of the following eukaryotic organisms: yeast *Saccharomyces cerevisiae*; fruitfly *Drosophila melanogaster*; nematode worm *Caenorhabditis elegans*; and human. Several different datasets are available for yeast and human, so we analysed five yeast PPI networks of different confidence levels obtained from three different high-throughput studies (Ito *et al*. 2000; Uetz *et al*. 2000; von Mering *et al*. 2002), as well as five human PPI networks obtained from the two recent high-throughput studies (Rual *et al*. 2005; Stelzl *et al*. 2005) and three curated databases (Zanzoni *et al*. 2002; Bader *et al*. 2003; Peri *et al*. 2004). We denote by ‘YHC’ the high-confidence yeast PPI network from von Mering *et al*. (2002), by ‘Y11K’ the yeast PPI network defined by the top 11 000 interactions in the von Mering *et al*. classification (von Mering *et al*. 2002), by ‘YIC’ the Ito *et al*. ‘core’ yeast PPI network (Ito *et al*. 2000), by ‘YU’ the Uetz *et al*. yeast PPI network (Uetz *et al*. 2000), and by ‘YICU’ the union of YIC and YU yeast PPI networks (we combined them as in (Han *et al*. 2005) to increase coverage). ‘FE’ and ‘FH’ denote the fruitfly *D. melanogaster* entire and high-confidence PPI networks from (Giot *et al*. 2003). Similarly, ‘WE’ and ‘WC’ denote the worm *C. elegans* entire and ‘core’ PPI networks from (Li *et al*. 2004). Finally, ‘HS’, ‘HR’, ‘HB’, ‘HH’ and ‘HM’ stand for human PPI networks from yeast two-hybrid (Y2H) screens by Stelzl *et al*. (Stelzl *et al*. 2005) and Rual *et al*. (Rual *et al*. 2005) and from curated databases BIND (Bader *et al*. 2003), HPRD (Peri *et al*. 2004) and MINT (Zanzoni *et al*. 2002), respectively, (BIND, HPRD and MINT data were downloaded from OPHID (Brown & Jurisica 2005) on 10 February 2006). Note that YHC and Y11K networks are mainly coming from tandem affinity purifications (Gavin *et al*. 2002) and high-throughput mass spectrometric protein complex identification (Ho *et al*. 2002), while YIC, YU, YICU, FE, FH, WE, WH, HS and HR are yeast two-hybrid, and HB, HH, and HM are a result of human curation (BIND, HPRD and MINT). Thus, we are using PPI networks of different confidence levels that come from a range of high-throughput PPI detection biotechnologies as well as from human curation.

We compared these PPI networks with the following five model networks: ER random graphs (Erdös & Rényi 1959, 1960); random graphs with exactly the same degree distribution as that of a PPI network (Bender & Canfield 1978; Newman 2002) (denoted ‘R-SF’ for ‘random scale-free’); Barabasi–Albert scale-free networks (Barabási & Albert 1999) (denoted by ‘BA-SF’); three-dimensional geometric random graphs (Penrose 2003) (denoted by ‘GEO-3D’); and the stickiness model networks described previously (denoted by ‘STICKY’).

For each of the 14 PPI networks and for each of the five models, we compared the PPI network with 25 samples from the model. Each sample matched the number of nodes and edges in the corresponding PPI network.

Average relative graphlet frequency distances between the PPI and the corresponding model networks for each of the five network models are presented in figure 1. The stickiness model shows an improved fit over all other network models with respect to relative graphlet frequency distances in 10 out of 14 tested PPI networks (filled squares in figure 1); it fits as well as the GEO-3D model (open squares in figure 1) in one and is outperformed by the GEO-3D model in three PPI networks. In addition, this model reproduces global network properties, such as the degree distribution (see appendix A), the clustering coefficients (open circles in figure 2*a*) and the average diameters of PPI networks (open circles in figure 2*b*).

It is of particular note that the R-SF model does not perform as well as the stickiness model. This means that, given the degree distribution of a PPI network,

simply drawing a network uniformly at random from the class of all networks that match the degree distribution is less successful in capturing the underlying substructure than

enhancing this degree information by using the simple modelling insights summarized in assumptions 1 and 2.

## 3. Conclusions

Overall, the stickiness framework produces a convenient, parameter-free random network that is motivated by transparent modelling arguments and may be regarded as a simplified, testable distillation of more sophisticated models. The results give further justification for the modelling approaches in (Caldarelli *et al*. 2002; Thomas *et al*. 2003; Deeds *et al*. 2006). Since the model accurately reproduces all widely used quantitative measures, it also provides a benchmark against which others may be compared.

## Acknowledgments

We thank the referees for their valuable feedback.

## Footnotes

- Received June 21, 2006.
- Accepted July 14, 2006.

- © 2006 The Royal Society