Information-theoretic sensitivity analysis: a general method for credit assignment in complex networks

Most systems can be represented as networks that couple a series of nodes to each other via one or more edges, with typically unknown equations governing their quantitative behaviour. A major question then pertains to the importance of each of the elements that act as system inputs in determining the output(s). We show that any such system can be treated as a ‘communication channel’ for which the associations between inputs and outputs can be quantified via a decomposition of their mutual information into different components characterizing the main effect of individual inputs and their interactions. Unlike variance-based approaches, our novel methodology can easily accommodate correlated inputs.


INTRODUCTION
The analysis of networks represents a crucial focus of modern systems biology (Barabási & Oltvai 2004; Hwang et al. 2005; Kitano et al. 2005; Klipp et al. 2005; Wagner 2005; Alon 2006; Davidson 2006; Doyle & Stelling 2006; Kell 2006a,b; Palsson 2006). For many areas of interest, models of complex networks can be taken to have the form of a deterministic mapping from a set of $n$ inputs to one or more output(s) (figure 1). The outputs can be considered separately, so that for each output $Y_k$ there is a map $f_k : (X_1, \ldots, X_n) \mapsto Y_k$. Usually, the input-output mapping is not available in explicit form but can be evaluated numerically for any given inputs.
Global sensitivity analysis aims to rank the inputs $X_1, \ldots, X_n$ according to the degree to which they influence the output, individually and conjointly. Here, 'inputs' may also refer to intrinsic model parameters whose influence on the output is to be determined, as in figure 1b. This type of global sensitivity analysis is commonly performed in a probabilistic manner by evaluating the model for multiple sets of randomly and independently selected input values drawn, for instance, from uniform distributions over suitable intervals. The output, being a function of the randomized inputs, thus also becomes a random variable. If the inputs are sampled independently, the variance of the output distribution can be decomposed into contributions by individual inputs, pairs, triplets and so forth. This procedure is well known in statistics as 'analysis of variance' (ANOVA; e.g. Box et al. 1978), and several authors have improved its computational efficiency for sensitivity analysis (e.g. Rabitz & Aliş 1999; Sobol 2001).
Rather than analysing the variance of the output distribution, we take a different route measuring output uncertainty in terms of Shannon's entropy (Shannon & Weaver 1949). Our starting point is the concept of the 'communication channel' (Cover & Thomas 2006), which enables us to view the model as a transmitter of information between inputs and outputs (figure 1b).
The mutual information of two variables is a quantity that measures their mutual dependence (Cover & Thomas 2006). Determining the mutual information $I(X_i; Y)$ between random sampling sequences of individual inputs $X_i$ and their output counterpart can elucidate first-order input-output relations. Mutual information provides a general measure of association that is applicable regardless of the shape of the underlying distributions and, unlike linear or rank-order correlation, is not compromised by non-monotonic dependence among the random variables. Further insight can be obtained by unravelling conditional dependencies among the system inputs. Here, we define novel and general sensitivity measures of second and higher order by evaluating input correlations induced by conditioning on the output.
To our knowledge, only a first-order information-based analysis has been discussed in the literature to date (Critchfield et al. 1986; Dalle Molle & Morris 1993, pp. 402-407).
While variance adequately quantifies the variability of distributions that are symmetrical and unimodal, entropy is calculated directly from the probability distribution function and thus provides a more general measure of output variability. Therefore, we further develop an information-theoretic framework for the sensitivity measures thus derived, based on the observation that their sum is bounded from above by the output entropy $H(Y)$. From this viewpoint, the (information-theoretic) sensitivity indices quantify the amount of output uncertainty removed by the knowledge of individual inputs and combinations thereof.
Sensitivity analysis of this kind is also an analysis of the total mutual information $I(X_1, \ldots, X_n; Y)$, which subsumes all input-output associations including interactions. The resultant summation theorem for the sensitivity measures is an information balance in which the sum equals $I(X_1, \ldots, X_n; Y)$. Although in practice only effects of up to third or fourth order can easily be calculated explicitly, the joint impact of all higher order terms is provided by the remaining difference to $I(X_1, \ldots, X_n; Y)$. We can therefore assign credit or influence fully to all the parameters of a system over a wide range of operating conditions. For all variance-based approaches, the absence of input correlations is a critical prerequisite for the uniqueness of the variance decomposition (Saltelli et al. 2000, 2004). As will be demonstrated, in our methodology independent inputs merely simplify the analysis. If input correlations exist (e.g. due to non-orthogonal sampling), their effect can easily be taken into account. We apply the methodology successfully to a model of the NF-κB signalling pathway and thereby define how to modify its behaviour to provide a designed maximum effect.

METHODS
By randomly sampling the input space, a genuinely deterministic system can be analysed in stochastic terms. Random perturbation of the inputs creates a randomized output $Y$ with a probability density $p(y)$. Rather than attempting to find some parametric model of $p(y)$, the output density is approximated by a histogram, and the output becomes a discrete random variable. Its uncertainty is measured by the entropy

$$H(Y) = -\sum_{y} p(y) \log p(y). \tag{2.1}$$

Knowledge of an input $X_i$ reduces this uncertainty to the conditional entropy

$$H(Y \mid X_i) = -\sum_{x} p(x) \sum_{y} p(y \mid x) \log p(y \mid x), \tag{2.2}$$

which is the average uncertainty in $Y$ over all possible discrete values $x$ that the input variable $X_i$ can assume. The discretization of $X_i$ and $Y$ is, of course, arbitrary and should be chosen in relation to the number of system evaluations (simulation runs). The mutual information is defined as the difference in output uncertainty with and without knowledge of $X_i$, and characterizes the influence $X_i$ exerts on $Y$:

$$I(X_i; Y) = H(Y) - H(Y \mid X_i). \tag{2.3}$$

The link between uncertainty and association established by equation (2.3) is one of the fundamental concepts of Shannon's information theory and forms the basis of our framework for sensitivity analysis. Calculating the mutual information $I(X_i; Y)$ for each $X_i$ constitutes a form of first-order sensitivity analysis, assessing only the influence of individual inputs.
2.1. An information-theoretic first-order sensitivity index

Critchfield et al. (1986) defined the mutual information index (MII), which in our notation is the mutual information normalized by the entropy of the output variable:

$$\mathrm{MII}(X_i) = \frac{I(X_i; Y)}{H(Y)}. \tag{2.4}$$

A first-order sensitivity analysis can be performed by calculating the MII of all inputs, where the mutual information is obtained by computing

$$I(X_i; Y) = \sum_{x} \sum_{y} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}. \tag{2.5}$$

Though $X_i$ and $Y$ are continuous variables, equation (2.5) contains discrete sums, indicating that, in practice, the probability densities are evaluated via the joint histogram and the marginal histograms of the input and output sequences.

Figure 1. Complex systems with multiple inputs and outputs. This is a typical situation in systems biology. For instance, pathway models (a) are described by sets of coupled nonlinear ODEs (deterministic or stochastic). Input-output relations can only be elucidated by numerical evaluation of the system output, e.g. a flux, for various configurations of the input parameters. Global sensitivity analysis aims to determine the degree to which these inputs control the output, and how they interact. In most applications, the input-output mapping is nonlinear and not given in closed form; hence, the system is a 'black box' (b).
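The histogram-based estimate of equation (2.5) and the MII can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names (`entropy`, `mutual_information`, `mii`), the toy model, the sample size and the use of base-2 logarithms are all assumptions made here, and no bias correction is applied.

```python
import numpy as np

def entropy(counts):
    """Shannon entropy in bits of a histogram given as an array of counts."""
    p = counts[counts > 0] / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def mutual_information(x, y, bins=15):
    """Plug-in estimate of I(X;Y) = H(X) + H(Y) - H(X,Y) from a joint histogram."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    h_x = entropy(joint.sum(axis=1))   # marginal histogram of X
    h_y = entropy(joint.sum(axis=0))   # marginal histogram of Y
    return h_x + h_y - entropy(joint.ravel())

def mii(x, y, bins=15):
    """Mutual information index: I(X;Y) normalized by the output entropy H(Y)."""
    h_y = entropy(np.histogram(y, bins=bins)[0])
    return mutual_information(x, y, bins) / h_y

# Hypothetical toy model (not from the paper): non-monotonic in x1, weak in x2.
rng = np.random.default_rng(0)
n_samples = 50_000
x1 = rng.uniform(0.0, 1.0, n_samples)
x2 = rng.uniform(0.0, 1.0, n_samples)
y = np.sin(2 * np.pi * x1) + 0.1 * x2

mii_1, mii_2 = mii(x1, y), mii(x2, y)
print(f"MII(X1) = {mii_1:.3f}, MII(X2) = {mii_2:.3f}")
```

In this toy case a linear correlation coefficient between `x1` and `y` would understate the dependence, whereas the MII ranks `x1` well above `x2`.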

Pairwise interactions
If we assume that, by design of the simulation, random input values are drawn independently, there will be no a priori correlations among the sequences of input values. However, if inputs interact in their influence on an output, one would expect to find associations in input sequences when conditioning on a particular value of that output. We show that the output-induced conditional dependence among two inputs, characterized by the conditional mutual information

$$I(X_i; X_j \mid Y) = \sum_{y} p(y) \sum_{x_i, x_j} p(x_i, x_j \mid y) \log \frac{p(x_i, x_j \mid y)}{p(x_i \mid y)\, p(x_j \mid y)}, \tag{2.6}$$

provides a measure of the joint influence of the pair $(X_i, X_j)$ on the output $Y$, on average. To understand why this is indeed an appropriate measure, we consider the degree of association among $(X_i, X_j)$ and $Y$, that is, the mutual information $I(X_i, X_j; Y)$. Since this quantity subsumes first- and second-order effects, one has to subtract the influence of the individual inputs, $I(X_i; Y)$ and $I(X_j; Y)$, in order to obtain the pure second-order effect of $X_i$ and $X_j$ on $Y$.
Using an auxiliary formula proved in appendix A.3, one obtains

$$I(X_i, X_j; Y) - I(X_i; Y) - I(X_j; Y) = I(X_i; X_j \mid Y) - I(X_i; X_j). \tag{2.7}$$

The second term on the right-hand side subtracts the effect of any a priori input associations due to the applied sampling scheme. If inputs are sampled independently, the term vanishes and the conditional mutual information by itself captures the joint effect of $X_i$ and $X_j$ on $Y$. Note that the simple structure of equation (2.7) makes it possible to apply arbitrary input sampling schemes without having to be concerned about statistical independence. This reveals a considerable advantage of the information-theoretic approach over variance-based methods, which are not easily extended to non-orthogonal samples (Saltelli et al. 2000).
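The second-order measure can be illustrated with a small Python sketch that works on already discretized (integer-coded) sequences. The helper names (`H`, `mi`, `cmi`, `second_order`) and the XOR toy system are assumptions introduced for illustration only; entropies are plug-in estimates in bits, without bias correction.

```python
import numpy as np

def H(*cols):
    """Joint Shannon entropy (bits) of one or more integer-coded sequences."""
    _, counts = np.unique(np.column_stack(cols), axis=0, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def mi(a, b):
    """I(A;B) = H(A) + H(B) - H(A,B)."""
    return H(a) + H(b) - H(a, b)

def cmi(a, b, c):
    """I(A;B|C) = H(A,C) + H(B,C) - H(A,B,C) - H(C)."""
    return H(a, c) + H(b, c) - H(a, b, c) - H(c)

def second_order(xi, xj, y):
    """Pairwise effect: I(Xi;Xj|Y) minus the a priori input association I(Xi;Xj)."""
    return cmi(xi, xj, y) - mi(xi, xj)

# Hypothetical toy system: Y depends on X1 and X2 only jointly; X3 is irrelevant.
rng = np.random.default_rng(1)
n = 100_000
x1, x2, x3 = (rng.integers(0, 2, n) for _ in range(3))
y = x1 ^ x2

s12 = second_order(x1, x2, y)   # close to 1 bit
s13 = second_order(x1, x3, y)   # close to 0 bits

# Left-hand side of the pairwise decomposition: I(X1,X2;Y) - I(X1;Y) - I(X2;Y).
lhs = (H(x1, x2) + H(y) - H(x1, x2, y)) - mi(x1, y) - mi(x2, y)
print(f"S12 = {s12:.4f}, S13 = {s13:.4f}, lhs = {lhs:.4f}")
```

Because the identity is algebraic, `lhs` and `s12` agree to floating-point precision even though the sampled inputs are only approximately independent.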

Higher order interactions
Capturing interactions among three or more inputs in information-theoretic terms requires generalizing the concept of mutual information beyond two variables.
To characterize the genuine three-way interaction of input triplets, we apply the same rationale as in §2.2 and consider a decomposition of the mutual information of an input triplet $(X_1, X_2, X_3)$ and the output (derivation provided in appendix A.3):

$$\begin{aligned} I(X_1, X_2, X_3; Y) ={}& \underbrace{I(X_1; Y) + I(X_2; Y) + I(X_3; Y)}_{\text{first-order}} \\ &+ \underbrace{I(X_1; X_2 \mid Y) + I(X_1; X_3 \mid Y) + I(X_2; X_3 \mid Y)}_{\text{second-order}} \\ &+ \underbrace{I(X_2; X_3 \mid X_1, Y) - I(X_2; X_3 \mid Y)}_{\text{third-order}}. \end{aligned} \tag{2.8}$$

Having identified the first- and second-order terms on the right-hand side, the decomposition suggests interpreting the remainder as the genuine third-order sensitivity measure. Using the notation $I_3$, we define

$$I_3(X_1; X_2; X_3 \mid Y) := I(X_2; X_3 \mid X_1, Y) - I(X_2; X_3 \mid Y), \tag{2.9}$$

which is the conditional form of the interaction information introduced by McGill. McGill also showed that interaction information is symmetric with respect to permutations of its arguments, meaning for the conditional form that

$$I(X_2; X_3 \mid X_1, Y) - I(X_2; X_3 \mid Y) = I(X_1; X_3 \mid X_2, Y) - I(X_1; X_3 \mid Y) = I(X_1; X_2 \mid X_3, Y) - I(X_1; X_2 \mid Y). \tag{2.10}$$

Note that interaction information can be negative. By virtue of equation (2.10), this can only happen when all three pairwise interactions have non-zero conditional mutual information. Negative interaction information then indicates an inner redundancy of the triplet $(X_1, X_2, X_3)$, in the sense that the pairs $(X_1, X_2)$, $(X_2, X_3)$ and $(X_1, X_3)$ do not provide entirely independent pieces of information about $Y$. This situation rarely occurs in natural systems, although appendix A.1 presents a contrived example with negative interaction information.
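The third-order term of decomposition (2.8), and its symmetry under permutation of the inputs, can be checked numerically on a toy system. The sketch below is illustrative only: the helper names (`H`, `cmi`, `interaction_info`) and the three-way parity example are assumptions introduced here, and the estimates are uncorrected plug-in values in bits.

```python
import numpy as np

def H(*cols):
    """Joint Shannon entropy (bits) of one or more integer-coded sequences."""
    _, counts = np.unique(np.column_stack(cols), axis=0, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def cmi(a, b, *given):
    """I(A;B|given) = H(A,given) + H(B,given) - H(A,B,given) - H(given)."""
    return H(a, *given) + H(b, *given) - H(a, b, *given) - H(*given)

def interaction_info(x1, x2, x3, y):
    """Third-order term: I(X2;X3|X1,Y) - I(X2;X3|Y)."""
    return cmi(x2, x3, x1, y) - cmi(x2, x3, y)

# Hypothetical toy system: three-way parity, a pure third-order effect.
rng = np.random.default_rng(2)
n = 100_000
x1, x2, x3 = (rng.integers(0, 2, n) for _ in range(3))
y = x1 ^ x2 ^ x3

i3 = interaction_info(x1, x2, x3, y)
i3_perm_a = interaction_info(x2, x1, x3, y)   # conditioning on X2 instead of X1
i3_perm_b = interaction_info(x3, x1, x2, y)   # conditioning on X3 instead of X1

first = sum(H(x) + H(y) - H(x, y) for x in (x1, x2, x3))
second = cmi(x1, x2, y) + cmi(x1, x3, y) + cmi(x2, x3, y)
total = H(x1, x2, x3) + H(y) - H(x1, x2, x3, y)
print(f"first = {first:.4f}, second = {second:.4f}, third = {i3:.4f}, total = {total:.4f}")
```

For parity, first- and second-order terms are close to zero and the whole bit of output information appears at third order; the small residual between `total` and the sum of the three orders reflects weak chance associations among the sampled inputs.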

The information balance: a summation theorem for sensitivity indices
Having identified measures for first-, second- and higher order sensitivities, we consider a decomposition of the total mutual information for an arbitrary number of inputs. Generalizing equation (2.3), one obtains the general form of the information balance for a system with $n$ inputs:

$$I(X_1, \ldots, X_n; Y) = H(Y) - H(Y \mid X_1, \ldots, X_n). \tag{2.12}$$

In addition to $H(Y)$, which is straightforward to compute, the 'noise entropy' $H(Y \mid X_1, \ldots, X_n)$ has to be evaluated. In a deterministic system, this quantity vanishes for continuous random variables. However, computing information-theoretic quantities in a continuous fashion would require parametric models of all random variables. Since the input-output mapping is often not given in closed form, it will generally be impossible to derive such models analytically. Moreover, there is no general parametrization scheme that can be fitted to the multitude of possible empirical distributions arising in various systems. Rather, the parametric distributions must be selected on a case-by-case basis, with no guarantee of obtaining a close fit. In our experience, it proved very difficult to match the heavy-tailed output histograms arising in our particular application (cf. §3) with any standard distribution function.
Hence, to make the information-theoretic quantities measurable, we choose to discretize all variables. The noise entropy then takes the role of the residual uncertainty in $Y$ that persists, given all inputs with a finite precision determined by the imposed discretization. We shall refer to this residual uncertainty as the discretization entropy, denoted by $H_\Delta$. In §3.4, we show how $H_\Delta$ can be estimated via Monte Carlo simulation.
Normalizing by $H(Y) - H_\Delta$, the maximum amount by which the output uncertainty can be reduced by the parameters, one can rewrite the information balance for a discretized deterministic system as

$$\frac{I(X_1, \ldots, X_n; Y)}{H(Y) - H_\Delta} = 1. \tag{2.13}$$

Equation (2.13) provides the basis for a summation theorem, since it is possible to express the left-hand side in terms of the previously defined sensitivity indices, as shown in appendix A.3. Decomposition up to third order yields, for independently sampled inputs,

$$\frac{1}{H(Y) - H_\Delta} \left[ \sum_{i} I(X_i; Y) + \sum_{i<j} I(X_i; X_j \mid Y) + \sum_{i<j<k} I_3(X_i; X_j; X_k \mid Y) + \Delta I \right] = 1, \tag{2.14}$$

where $\Delta I$ collects all terms of fourth and higher order. Here, summations extend over all index combinations excluding permutations. While related decompositions of the information of an ensemble of variables have been considered previously (Watanabe 1960; Fano 1961; Panzeri et al. 1999; Amari 2001; Schneidman et al. 2006), they have never been applied in the context of sensitivity analysis (see appendix A.1 for a discussion of alternative decompositions).
Calculation of the conditional mutual information requires knowledge of the underlying marginal and joint probability density functions. In practice, these densities must be estimated empirically by means of marginal and joint histograms. In particular, the empirical estimate of a joint density can be problematic when the amount of available data is insufficient to populate all the bins in its joint histogram. This leads to a systematic error ('bias') in the limited-sampling estimation of information; the higher the dimensionality of the histograms to be sampled, the larger the bias. Thus, the estimation of higher order interactions is particularly difficult. However, reliable correction of the sampling bias is possible using advanced statistical techniques (Panzeri & Treves 1996; Nemenman et al. 2004; Montemurro et al. in press). Given the number of simulations we could produce in the particular application presented below, these techniques allowed an accurate elimination of the bias for up to third-order interactions. Only the first-order quantification would have been possible without using such bias reduction techniques.
Even though, for the practical reasons described above, sensitivity indices can only be evaluated up to a certain order, the remainder $\Delta I$, the combined effect of all higher order interactions, can be assessed since all other terms in the equation are known. If the lower order sensitivity indices capture the essence of the dependence structure, the remainder will be a small fraction of $H(Y) - H_\Delta$. A significant value of $\Delta I$ would indicate that important higher order interactions exist, which is generally not expected in most simple systems (Rabitz & Aliş 1999). In large networks, higher order interactions require an extreme number of connections, unless the degree of connectivity varies strongly across the network. Hence, one would expect to find a small number of local 'hubs' forming highly connected subnetworks. While this is still the subject of debate, we note that the complex networks arising in biological systems do indeed tend to have sparse intrinsic connectivity patterns (Wagner & Fell 2001; Barabási & Oltvai 2004; Csete & Doyle 2004).

Total sensitivity indices
A very useful concept in variance-based sensitivity analysis is the so-called total sensitivity index (Saltelli et al. 2005), which measures the overall influence that a particular input exerts on the output, comprising main effects and all interactions. In the ANOVA framework, the total sensitivity expresses the remaining output variance when all other inputs are kept fixed. The idea is to calculate this quantity without relying on the other sensitivity indices (first, second, third-order and so forth). If a total sensitivity index is zero, the corresponding input is irrelevant; if not, it is interesting to relate it to the other indices. For instance, comparing the total sensitivity index of an input with its first-order index reveals the degree to which the input is interacting with others.
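This concept can be readily applied to information-based sensitivity analysis. The information-theoretic total sensitivity index for variable $X_i$ is given by

$$S_i^{\mathrm{tot}} = \frac{H(Y \mid \{X_1, \ldots, X_n\} \setminus X_i) - H_\Delta}{H(Y) - H_\Delta}. \tag{2.15}$$

The total sensitivity index can also be expressed as the sum of all sensitivity indices involving $X_i$:

$$S_i^{\mathrm{tot}} = \frac{1}{H(Y) - H_\Delta} \left[ I(X_i; Y) + \sum_{j \neq i} I(X_i; X_j \mid Y) + \sum_{\substack{j<k \\ j,k \neq i}} I_3(X_i; X_j; X_k \mid Y) + \cdots \right]. \tag{2.16}$$

Note that the sum of all total sensitivity indices is generally greater than 1, since expansions for different input variables will share certain sensitivity indices if the variables interact.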
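A toy calculation can make the relation between first-order and total indices concrete. The sketch below assumes that the total index of $X_i$ is the conditional entropy of $Y$ given all other inputs, normalized by the output entropy, and that the inputs are discrete so that the discretization entropy is zero; these assumptions, the helper names (`H`, `first_order`, `total_index`) and the toy system are introduced here for illustration and are not taken from the paper.

```python
import numpy as np

def H(*cols):
    """Joint Shannon entropy (bits) of one or more integer-coded sequences."""
    _, counts = np.unique(np.column_stack(cols), axis=0, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def first_order(xi, y):
    """I(Xi;Y) / H(Y), assuming zero discretization entropy."""
    return (H(xi) + H(y) - H(xi, y)) / H(y)

def total_index(others, y):
    """H(Y | all inputs except Xi) / H(Y), assuming zero discretization entropy."""
    return (H(y, *others) - H(*others)) / H(y)

# Hypothetical toy system: X1 and X2 act only jointly (XOR), X3 acts alone.
rng = np.random.default_rng(3)
n = 100_000
x1, x2, x3 = (rng.integers(0, 2, n) for _ in range(3))
y = (x1 ^ x2) + 2 * x3

inputs = [x1, x2, x3]
s_first = [first_order(x, y) for x in inputs]
s_total = [
    total_index([z for k, z in enumerate(inputs) if k != i], y)
    for i in range(len(inputs))
]
print("first-order:", np.round(s_first, 3))
print("total:      ", np.round(s_total, 3))
print("sum of total indices:", round(sum(s_total), 3))
```

Here the interacting inputs have a first-order index near zero but a total index near 0.5, the non-interacting input has equal first-order and total indices, and the total indices sum to about 1.5 because the pairwise term is counted for both partners.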

INFORMATION-THEORETIC SENSITIVITY ANALYSIS OF A MODEL OF THE NF-κB SIGNALLING PATHWAY
As an example, we apply our methodology to parameter sensitivity analysis in systems biology. We consider a model of the IκB/NF-κB signalling pathway (Hoffmann et al. 2002) and investigate the interdependencies among intrinsic parameters (in this case 64 reaction rate constants) with respect to their influence on the time course of the concentration of a particular metabolite, the nuclear transcription factor NF-κB, which is a key component in early immune response. In a nutshell, the pathway model works as follows. There are three main components: NF-κB, IκB inhibitory proteins (IκBα and its isoforms IκBβ and IκBε) and the IκB kinase (IKK). The model describes the kinetics of interaction between these components, their transport between nucleus and cytoplasm, the degradation of the inhibitor IκB, as well as the NF-κB-regulated gene expression and subsequent resynthesis of the inhibitors (figure 2). NF-κB is normally bound in an IκB-NF-κB complex. Following a step increase in the concentration of IKK, which models the effect of an extracellular stimulus (e.g. tumour necrosis factor, TNFα), NF-κB is released from the IκB-NF-κB complex and enters the nucleus. The IκBs are rapidly degraded. In the nucleus, NF-κB regulates the expression of genes leading to a resynthesis of the IκB inhibitor proteins. The newly synthesized IκB binds to the nuclear NF-κB, forming an IκB-NF-κB complex, and subsequently shuttles NF-κB back to the cytoplasm, thus initiating a negative feedback loop. The cycle is repeated until all IKK has decayed. As a result of the delayed negative feedback, the concentration of nuclear NF-κB exhibits an oscillatory behaviour (figure 3) that can be characterized in terms of features such as peak amplitude, frequency or phase (Nelson et al. 2004; Kell 2006a).
To exemplify the sensitivity analysis, we select one feature as output $Y$, namely the time difference between the first two peaks of the nuclear NF-κB oscillation, $P_1 = T_2 - T_1$. Evaluating the input-output function thus involves numerically solving a system of 24 nonlinear ordinary differential equations (ODEs) corresponding to the reaction equations and subsequently determining the first two maxima in one component of the solution. The entire analysis is performed with respect to the selected feature, based on 680 000 simulations, which yield sufficiently accurate information measures up to third order. All parameters were varied simultaneously, each drawn independently from a uniform distribution over the interval from 0.9 to 2.0 times the nominal value, and discretized into 15 bins. The lower bound of the sampling interval is dictated by the empirical observation that oscillations are only guaranteed to occur for parameter values above about 0.9 of the nominal values given in Hoffmann et al. (2002).

First-order sensitivity indices
The first-order analysis reveals that only a small subset of about eight parameters out of 64 (or at most 11, if one includes parameters 1, 19 and 37) significantly influence the output feature $P_1$ (figure 4). The parameters thus identified either directly affect the amount of available NF-κB (9, cytoplasmic release; 19, nuclear import; 1, IκBα-NF-κB association) or control the strength of the negative feedback, which depends on the production of the inhibitor protein IκBα (28, transcription; 36, constitutive translation), its transport into the nucleus (38, nuclear import of IκBα) and its destruction (37, degradation; 29, IκBα mRNA degradation; 62, IKK-IκBα catalysis). In addition, the feedback is indirectly affected by the IκB kinase (IKK). Therefore, its decay rate (61) and the rate of binding IKK to the IκBα-NF-κB complex (52), a step prior to the release of NF-κB, are also important. Parameters specifically relating to the inhibitor isoforms IκBβ or IκBε are insignificant, which is in accordance with experiments indicating that only the knockout of IκBα is lethal (Gerondakis et al. 1999).
The result is consistent with previous local sensitivity analyses of a different output feature (Ihekwaba et al. 2004, 2005) and in accordance with global sensitivity analysis of the overall time course of the nuclear NF-κB concentration (Yue et al. 2006).

Second-order sensitivity indices
At the level of pairwise interactions, a rather small number of relevant pairs emerge (figure 5). No significant synergies are observed, meaning that only those pairs wherein at least one partner is individually relevant have significant interactions. The predominant contributions are by pairs where both partners have significant individual impact. The degree of interactivity, that is, the number of relevant pairs in which a parameter appears, varies strongly. For instance, parameter 29 (the IκBα mRNA degradation rate) seems to play the role of a 'super parameter', in the sense that it is involved in most, and the strongest, interactions. Note that pairs with very small yet statistically significant sensitivities are not visualized in the interaction matrix (figure 5), due to limited diagram resolution.
Statistical significance can be assessed via a bootstrap test, which involves repeated random shuffling of the parameter sampling sequences and recalculation of the conditional mutual information from these shuffled sequences. Several hundred repetitions produce a bell-shaped distribution of random sensitivity values, the mean and standard deviation of which characterize the range of 'chance values' of the particular sensitivity index under consideration. If the index calculated from the original non-shuffled data lies two or three standard deviations above its bootstrap mean, it can be considered statistically significant.
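The shuffling test described above can be sketched as follows. This is a simplified illustration: only one of the two input sequences is shuffled, the toy system and the helper names (`H`, `cmi`, `shuffle_z_score`) are assumptions introduced here, and the plug-in estimates are not bias-corrected.

```python
import numpy as np

def H(*cols):
    """Joint Shannon entropy (bits) of one or more integer-coded sequences."""
    _, counts = np.unique(np.column_stack(cols), axis=0, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def cmi(a, b, c):
    """I(A;B|C) = H(A,C) + H(B,C) - H(A,B,C) - H(C)."""
    return H(a, c) + H(b, c) - H(a, b, c) - H(c)

def shuffle_z_score(xi, xj, y, n_shuffles=200, seed=0):
    """Z-score of I(Xi;Xj|Y) relative to the values obtained after shuffling Xi."""
    rng = np.random.default_rng(seed)
    observed = cmi(xi, xj, y)
    chance = np.array([cmi(rng.permutation(xi), xj, y) for _ in range(n_shuffles)])
    return (observed - chance.mean()) / chance.std(ddof=1)

# Hypothetical toy system: X1 and X2 interact (XOR); X3 has no effect on Y.
rng = np.random.default_rng(4)
n = 5_000
x1, x2, x3 = (rng.integers(0, 2, n) for _ in range(3))
y = x1 ^ x2

z_interacting = shuffle_z_score(x1, x2, y)
z_irrelevant = shuffle_z_score(x1, x3, y)
print(f"z(X1,X2) = {z_interacting:.1f}, z(X1,X3) = {z_irrelevant:.2f}")
```

A pair would be flagged as significant when its z-score exceeds the chosen threshold of two or three standard deviations.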
While the particular parameter set identified as most relevant is very biologically plausible, the strongly varying degree of parameter interactivity is surprising. For instance, it is not obvious why the

Third-order interactions
The sparseness of the interaction structure continues at the third order (figure 6). It is again the combinations of individually relevant parameters that exhibit the strongest tripletwise interactions. The assessment of statistical significance is analogous to the procedure described in §3.2. All third-order indices are positive.

Monte Carlo estimation of discretization entropy
Let $n$ be the number of system inputs. Assume a discretization scheme where each input range is partitioned into the same number of bins, denoted by $n_{\mathrm{bins}}$. Let $j_1, \ldots, j_n$ be the bin indices of the inputs $X_1, \ldots, X_n$, with each $j_k = 1, \ldots, n_{\mathrm{bins}}$ for $k = 1, \ldots, n$, and let $\Delta x_1, \ldots, \Delta x_n$ be the corresponding bin widths. Then the bins are defined as

$$\begin{aligned} B_{j_1} &= x_{1,\min} + [(j_1 - 1)\Delta x_1,\; j_1 \Delta x_1], \\ &\;\;\vdots \\ B_{j_n} &= x_{n,\min} + [(j_n - 1)\Delta x_n,\; j_n \Delta x_n]. \end{aligned} \tag{3.1}$$

The discretization entropy is the averaged conditional entropy of $Y$, which for uniformly distributed input values is simply

$$H_\Delta = \frac{1}{(n_{\mathrm{bins}})^n} \sum_{j_1, \ldots, j_n} H(Y \mid X_1 = x_1 \in B_{j_1}, \ldots, X_n = x_n \in B_{j_n}). \tag{3.2}$$

The summation extends over the total number of input bin combinations, which is $(n_{\mathrm{bins}})^n$. Since this number can be very large, equation (3.2) cannot generally be evaluated exactly. However, Monte Carlo estimates can provide excellent approximations. Figure 7 shows the estimated discretization entropy as a function of the number of bin combinations used to compute the average. Only several hundred bin combinations are required to obtain a reliable estimate. For each bin combination, about 100 evaluations of the feature $Y$ were performed to estimate the local conditional histogram from which the conditional entropy $H(Y \mid X_1 = x_1 \in B_{j_1}, \ldots, X_n = x_n \in B_{j_n})$ in equation (3.2) is computed. That about 100 evaluations suffice can be understood as a consequence of the output (feature) uncertainty being small for a reasonably fine input discretization. Hence, the conditional histograms approximating $p(Y \mid X_1 = x_1 \in B_{j_1}, \ldots, X_n = x_n \in B_{j_n})$ tend to have only very few bins with non-vanishing probability. In our case, the output is discretized into 15 bins, of which usually only one or two have non-zero counts; these can therefore be estimated properly with about 100 data points. Thus, 600 × 100 = 60 000 evaluations provide a reasonably accurate estimate of the discretization entropy.
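The Monte Carlo procedure for equation (3.2) can be sketched in Python as follows. The three-input `toy_model` is a hypothetical stand-in for the ODE simulator, and the function names, the output bin edges and the random seed are assumptions made for illustration; only the sampling interval (0.9 to 2.0), the 15 bins, the 600 bin combinations and the 100 evaluations per combination follow the text.

```python
import numpy as np

def toy_model(x):
    """Hypothetical stand-in for the simulator; each row of x is one input vector."""
    return x[:, 0] * x[:, 1] / x[:, 2]

def hist_entropy(values, edges):
    """Shannon entropy (bits) of the histogram of values over fixed bin edges."""
    counts, _ = np.histogram(values, bins=edges)
    p = counts[counts > 0] / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def discretization_entropy(model, lo, hi, n_bins, y_edges,
                           n_combos=600, n_per_combo=100, seed=0):
    """Average conditional output entropy over randomly chosen input bin combinations."""
    rng = np.random.default_rng(seed)
    lo, hi = np.asarray(lo, float), np.asarray(hi, float)
    dx = (hi - lo) / n_bins                        # bin widths, one per input
    total = 0.0
    for _ in range(n_combos):
        j = rng.integers(0, n_bins, size=lo.size)  # random bin combination (0-based)
        # Draw points uniformly inside the selected bin of every input.
        x = lo + (j + rng.uniform(size=(n_per_combo, lo.size))) * dx
        total += hist_entropy(model(x), y_edges)
    return total / n_combos

lo, hi, n_bins = [0.9] * 3, [2.0] * 3, 15
# Output range of the toy model on [0.9, 2.0]^3, split into 15 equal bins.
y_edges = np.linspace(0.9 * 0.9 / 2.0, 2.0 * 2.0 / 0.9 + 1e-9, n_bins + 1)

h_delta = discretization_entropy(toy_model, lo, hi, n_bins, y_edges)
x_all = np.random.default_rng(1).uniform(0.9, 2.0, size=(100_000, 3))
h_y = hist_entropy(toy_model(x_all), y_edges)
print(f"H_delta = {h_delta:.3f} bits, H(Y) = {h_y:.3f} bits")
```

The difference `h_y - h_delta` is then the normalizing quantity used in the information balance.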

Information balance
The block diagram of the information balance (figure 8) shows that higher order interactions do contribute significantly to the total sensitivity. Nevertheless, only a small subset of parameter pairs and triplets interact significantly (figures 5 and 6), and we expect such sparse connectivity to continue at higher orders.

Total sensitivity indices
Total sensitivity indices consist of conditional entropies of the type $H(Y \mid \{X_1, \ldots, X_n\} \setminus X_i)$, which can therefore be estimated in a similar fashion to the discretization entropy, except that the value of the input $X_i$ under consideration is allowed to vary over its entire range. All other inputs are evaluated within their bins, and the conditional entropy is again averaged over (in theory) all bin combinations. For the system under investigation, the Monte Carlo estimates exhibit a convergence comparable to that in figure 7. Figure 9 shows the estimated total sensitivity indices of the eight most relevant parameters of the NF-κB pathway model next to their first-order indices, revealing the different degrees of interaction. The diagram leads to two main conclusions. First, parameter 29 stands out in terms of its overall significance, since it has the strongest individual impact and also the highest degree of total impact, in the sense that almost 80% of the output uncertainty is removed by the information contributions of 29 and its interactions. Second, the fractional contribution of the interactions to the total sensitivity is higher in the other parameters, but, with the exception of parameter 36, their interaction impact (the difference of total and first-order sensitivity) is lower than that of 29.
A total sensitivity index equalling unity would indicate that the corresponding input quantity is 'fully connected', in the sense that it participates in all relevant interactions; sensitivity indices not involving this parameter would be irrelevant. For parameter 29, this is almost the case.

CONCLUSIONS
With the advent of advanced estimation techniques, mutual information has become a viable means of characterizing input-output interactions in complex networks. The framework developed in this paper lays the theoretical foundations for an information-theoretic sensitivity analysis that assigns credit or influence to input variables in terms of their overall contribution to a system's output entropy. However, our method is far more than a replacement of analysis of variance by analysis of entropy. The information-theoretic approach does not rely on implicit assumptions of normally distributed outputs and is easily generalized to include non-orthogonal input sequences. Moreover, the information-theoretic approach lends itself well to the analysis of systems with an intrinsically stochastic structure, such as biochemical reactions with small numbers of molecules (Wilkinson 2006). In this case, the noise entropy provides a combined representation of the output uncertainty due to discretization and all sources of intrinsic stochasticity.

Figure 8. Block diagram of the information balance for a particular feature ($P_1$) in the NF-κB oscillation (cf. figure 3). The height of the entire block equals the output uncertainty (entropy). All contributions are normalized with respect to the total information, the amount of output uncertainty the inputs account for. The remainder $H_\Delta$ is the uncertainty due to the discretization of input values (cf. §2.4). In this case, fourth- and higher order terms contribute a significant portion of the output entropy, indicating a high degree of parameter interaction. This result is supported by the high total sensitivity indices observed in the eight most significant parameters (cf. figure 9).

Figure 9. Comparison of first-order and total sensitivity indices for the most significant parameters with respect to feature $P_1$ (cf. figure 3). The difference between the two measures indicates the parameter's degree of interaction. A total sensitivity value close to unity, as in parameter 29, indicates that the parameter and its interaction partners almost fully describe the system. Note that the total sensitivity indices do not sum to unity unless there are no interactions, in which case first-order and total sensitivity indices are equal (additive system). In the example studied here, the interactions clearly dominate.
We thank the UK BBSRC for financial support, and Mike White, Pawel Paszek and Caroline Horton for useful discussion. This is a contribution from the Manchester Centre for Integrative Systems Biology (www.mcisb.org).

A.1. The sign of interaction information
In this section, we discuss the possibility of negative conditional interaction information (CII) and examine the conditions under which this can occur. Most maps do not seem to have this property, but we provide a carefully constructed example of a simple system with three parameters that can exhibit negative CII. Consider the function of equation (A 1), where $X_1, X_2, X_3 \in [0, 2]$ and the components $g_\lambda$ and $h_\lambda$ are pulse functions built from sigmoids, of which $h_\lambda$ reads

$$h_\lambda(x) = \tfrac{1}{2} \{ \tanh[\lambda (x - 1)] - \tanh[\lambda (x - 2)] \}.$$

Figure 10 shows a plot of the auxiliary functions. The control parameter $\lambda$ determines the 'steepness' of the sigmoid components. For $\lambda \to \infty$, $g_\lambda$ and $h_\lambda$ become 'square-wave pulses'. In this limit, the random variables are weighted by indicator functions that respond to their values being in either the lower or the upper half of their interval. Figure 11 shows a schematic visualization of the two possible scenarios (different $\lambda$). If all the information-theoretic sensitivity measures capturing first-, second- and third-order effects have positive sign, these contributions, together with the discretization entropy, sum up to the output entropy (figure 11a). Under certain circumstances, the sum of first- and second-order indices exceeds the output entropy (figure 11b), where the pairwise interactions are not providing independent pieces of information. In this case, the excess information is compensated by a negative third-order contribution. Although the sum of all sensitivity indices does equal the output entropy, a meaningful interpretation of the components as second- or third-order sensitivity measures is no longer justified. Table 1 shows the information balance of the system for two choices of $\lambda$. Owing to symmetry in the inputs, the information measures within each order of interaction are the same. Therefore, only the sums of first- and second-order terms are provided.
Apparently, the combination of non-monotonicity and point symmetry of the system (A 1) leads to the negative information measure. Choosing a smaller value of the control parameter $\lambda$ destroys the symmetry, and consequently the CII becomes positive. Exact point symmetry is not likely to be a feature of natural systems such as our example from systems biology, since it would require an extreme degree of regularity. Therefore, we conclude that the example presented is a rare exception.
One potential alternative scheme that could provide an information decomposition based solely on non-negative quantities is the maximum entropy approach (Jaynes 1957; Amari 2001; Schneidman et al. 2006). For instance, Amari's elegant method of information geometry (Amari 2001) requires the calculation of surrogate maximum entropy distributions, referred to as $p^{(2)}$, $p^{(3)}$, etc. Here, $p^{(2)}$ has the same pairwise marginals as the joint density ($p^{(3)}$, the same tripletwise marginals) but contains no higher order correlations. The Kullback-Leibler divergence of $p^{(2)}$ and $p^{(1)}$ can be taken to represent the total entropy attributable to second-order interaction and so on. However, at present, the maximum entropy approach is only fully developed with respect to a decomposition of the joint entropy of a set of variables (Amari 2001; Schneidman et al. 2006). We are currently investigating how to extend this approach to obtain a decomposition of the total information for an input-output relation. One practical concern in using a maximum entropy formalism for sensitivity analysis of complex systems with tens of parameters is that the maximum entropy method is (at present) computationally prohibitive and data-intensive. The maximum entropy densities are high-dimensional and suffer from a larger sampling bias than our approach, which is based on pair- and tripletwise contributions that can easily be corrected for the bias. In addition, in order to become useful for sensitivity analysis, the maximum entropy methodology needs to be extended to provide an information decomposition that explicitly identifies the most relevant parameter interactions. Future research will also be directed at further illuminating the relation between properties of the input-output map (e.g. monotonicity) and the sign of higher order sensitivity indices.

A.2. Bias corrected estimates of mutual information
Mutual information $I(X;Y)$ between an input variable X and an output variable Y is a functional of the probability densities of input and output. In practice, these probabilities are usually not known a priori and can only be estimated empirically from a limited number N of independent joint observations ('trials') of X and Y. The statistical errors made in measuring the probabilities owing to limited sampling lead to a severe systematic error (bias) in the information measures. This section explains how we corrected for this bias. For brevity, we focus on $I(X;Y)$, which we shall simply refer to as I, but these considerations apply straightforwardly to other information quantities used in this paper, such as the conditional mutual information. For the sake of explanation, we suppose that X is an n-dimensional variable $X = \{X_1, \ldots, X_n\}$.
The bias due to sampling with N trials is defined as the difference between the limited sampling average value of information $\langle I \rangle_N$ ($\langle \cdot \rangle_N$ being a probability-weighted average over all possible (X, Y) outcomes with N trials) and the true value of information I. Subtracting the bias from the limited sampling estimate allows for a much more accurate estimation of the true information I. We observe that $I(X;Y)$ can be written as the difference between two entropies, $I(X;Y) = H(X) - H(X \mid Y)$. The bias of I is the difference between the biases of the two entropies. It is well known (Miller 1955) that entropies are biased downward (i.e. their bias is negative). $H(X)$ depends only on the marginal distribution $p(x)$. Its bias is much smaller in magnitude than that of $H(X \mid Y)$, which depends on $p(x, y)$, a distribution that is much harder to sample than $p(x)$. As a result, I is biased upward. Analytical considerations (Panzeri & Treves 1996) show that the bias decreases approximately in inverse proportion to the number of trials, and increases approximately exponentially with the number of dimensions n of the X-space. This makes it difficult to estimate I for large n.
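The upward bias of the uncorrected estimate can be illustrated with a small simulation. The Python sketch below draws X and Y independently, so that the true information is zero, and averages the plug-in estimate of I = H(X) − H(X|Y) over many repetitions. The alphabet sizes, trial numbers and function names are illustrative assumptions. The comparison value is the leading-order term obtained by applying the Miller (1955) entropy correction to each entropy, which for independent variables gives (n_x − 1)(n_y − 1)/(2N ln 2) bits; it is an asymptotic approximation, not an exact bias.

```python
import numpy as np

def plugin_entropy(counts):
    """Plug-in (maximum-likelihood) entropy in bits from a histogram of counts."""
    counts = np.asarray(counts, dtype=float).ravel()
    p = counts[counts > 0] / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def plugin_mi(x, y, n_x, n_y):
    """Plug-in estimate of I(X;Y) = H(X) - H(X|Y) from paired discrete samples."""
    joint = np.zeros((n_x, n_y))
    np.add.at(joint, (x, y), 1.0)          # joint histogram of (x, y) pairs
    h_x = plugin_entropy(joint.sum(axis=1))
    # H(X|Y) = H(X,Y) - H(Y)
    h_x_given_y = plugin_entropy(joint) - plugin_entropy(joint.sum(axis=0))
    return h_x - h_x_given_y

rng = np.random.default_rng(0)
n_x, n_y, n_rep = 8, 4, 200                # hypothetical alphabet sizes and repetitions

mean_estimates = {}
for n_trials in (50, 200, 800):
    estimates = [
        plugin_mi(rng.integers(n_x, size=n_trials),
                  rng.integers(n_y, size=n_trials), n_x, n_y)
        for _ in range(n_rep)
    ]
    mean_estimates[n_trials] = float(np.mean(estimates))
    first_order = (n_x - 1) * (n_y - 1) / (2 * n_trials * np.log(2))
    print(f"N = {n_trials:4d}   mean plug-in I = {mean_estimates[n_trials]:.4f} bits"
          f"   leading-order bias = {first_order:.4f} bits")
```

Because the true information is zero here, the printed averages are pure bias; they shrink roughly as 1/N and would grow quickly if the number of input bins were increased.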
Fortunately, the bias of the entropy can be computed approximately and eliminated by means of a number of techniques. In our experience, one of the most effective is the Bayesian technique of Nemenman et al. (2004), which uses a family of prior distributions that are weighted to produce a uniform expectation of entropy before any data are sampled. As (X, Y) data become available, the entropy estimate is updated as an average over all the possible hypothetical probability distributions weighted by their conditional probability given the data. The algorithm converges to the true value of entropy quite rapidly with the number of trials N. Unless the dimensionality n of the space is too large, residual errors left after this bias reduction are small. Since the uncorrected estimate of I is biased upward, residual errors in the estimation of I tend to be upward as well (Montemurro et al. in press).
Figure 11. (a) Block diagram of the output entropy. Its components are the discretization entropy $H_D$, the first-order sensitivity indices, the second-order indices and the third-order index given by the conditional interaction information. In this example, the CII is positive and hence an integral component of the output entropy, i.e. the sum of sensitivity indices is less than or equal to the output entropy, with equality being reached once all relevant terms have been added. (b) Block diagram with negative CII. Here, the sum of discretization entropy plus first- and second-order terms exceeds the output entropy. The magnitude of the negative CII equals the information excess. The sum of all indices does equal the output entropy, but the sensitivity indices have different signs. In this case, an interpretation of the CII as a third-order sensitivity index is not meaningful, since the input pairs do not contribute independent pieces of information, yielding a non-orthogonal information decomposition.
To check whether any residual error is small, we have developed a different way to estimate information that is biased downward rather than upward (Montemurro et al. in press). This allows one to check the reliability of information-based sensitivity measures by assessing the proximity of the upper and lower bounds. To produce a downward-biased estimate of I, we used the following procedure (Montemurro et al. in press). We considered the entropy that would be obtained if the input components were independent at fixed output, that is, $p(x \mid y) = \prod_i p(x_i \mid y)$. The corresponding entropy of this 'independent' distribution is called $H_{\mathrm{ind}}(X \mid Y)$ and typically has a very small bias because only marginal probabilities have to be sampled.
Alternatively, correlations between input variables can be removed by 'shuffling' the data at fixed Y, thus creating pseudo X-vectors obtained by randomly combining $x_i$ values from different trials in which the same value of Y was observed. The resulting entropy, called $H_{\mathrm{sh}}(X \mid Y)$, has the same asymptotic value as $H_{\mathrm{ind}}(X \mid Y)$ for an infinite number of trials, but has a much higher bias than $H_{\mathrm{ind}}(X \mid Y)$ for finite N. Following the mathematical analysis of Panzeri & Treves (1996), Montemurro et al. showed that the bias of $H_{\mathrm{sh}}(X \mid Y)$ is of the same order of magnitude as the bias of $H(X \mid Y)$ but typically slightly larger. This observation suggests computing I in the following way: $I_{\mathrm{sh}}(X;Y) = H(X) - H_{\mathrm{ind}}(X \mid Y) + H_{\mathrm{sh}}(X \mid Y) - H(X \mid Y)$. Owing to the bias cancellation created by the entropy terms added to the r.h.s., $I_{\mathrm{sh}}$ has the same value as I for an infinite number of trials, but a much smaller bias for finite N. Moreover, since $H_{\mathrm{sh}}(X \mid Y)$ is more downward biased than $H(X \mid Y)$, the resulting bias of $I_{\mathrm{sh}}$ is negative (Montemurro et al. in press). Thus, in cases when the upward-biased estimator I and the downward-biased estimator $I_{\mathrm{sh}}$ coincide, we can be confident that our information estimate is unbiased. If they do not coincide, their difference provides an idea of the order of magnitude of our uncertainty in the information estimation.
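The shuffling procedure can be sketched in a few lines of Python. The sketch assumes the estimator has the form I_sh = H(X) − H_ind(X|Y) + H_sh(X|Y) − H(X|Y), uses plain plug-in entropies rather than the Bayesian estimator mentioned earlier, and tests it on a hypothetical toy system (three inputs with four levels each, a binary output that is a noisy threshold of the first input). All function and variable names are illustrative.

```python
import numpy as np

def entropy_of_rows(rows):
    """Plug-in entropy (bits) of the empirical distribution over rows of a 2-D array."""
    _, counts = np.unique(rows, axis=0, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def info_estimates(x, y, rng):
    """Return (I, I_sh) in bits for discrete inputs x (N x n) and outputs y (N,)."""
    n_trials, n_inputs = x.shape
    h_x = entropy_of_rows(x)
    h_cond = h_ind = h_sh = 0.0
    for value in np.unique(y):
        x_y = x[y == value]                      # trials sharing this output value
        weight = len(x_y) / n_trials             # empirical p(y)
        h_cond += weight * entropy_of_rows(x_y)
        # Independent model: entropy of a product distribution is the sum
        # of the marginal entropies of the individual inputs.
        h_ind += weight * sum(entropy_of_rows(x_y[:, [i]]) for i in range(n_inputs))
        # Shuffle each input separately across trials with the same output.
        shuffled = np.column_stack([rng.permutation(x_y[:, i]) for i in range(n_inputs)])
        h_sh += weight * entropy_of_rows(shuffled)
    return h_x - h_cond, h_x - h_ind + h_sh - h_cond

rng = np.random.default_rng(1)
n_trials, flip_prob = 200, 0.1
# True information of the toy system: 1 - H_b(flip_prob) bits.
true_i = 1.0 + flip_prob * np.log2(flip_prob) + (1 - flip_prob) * np.log2(1 - flip_prob)

results = []
for _ in range(50):
    x = rng.integers(4, size=(n_trials, 3))
    base = (x[:, 0] >= 2).astype(int)
    flip = rng.random(n_trials) < flip_prob
    y = np.where(flip, 1 - base, base)           # noisy threshold of the first input
    results.append(info_estimates(x, y, rng))

mean_plain = float(np.mean([r[0] for r in results]))
mean_sh = float(np.mean([r[1] for r in results]))
print(f"true I = {true_i:.3f}   mean I = {mean_plain:.3f}   mean I_sh = {mean_sh:.3f}")
```

With plug-in entropies, I_sh can never exceed I on a given dataset, because the entropy of the shuffled vectors is bounded by the sum of the marginal entropies. In this toy setting the two averages bracket the true value, and their gap gives a feel for the remaining uncertainty.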

A.3. Decomposition of the total mutual information
The derivation rests on three elementary formulae, proofs of which can be found in standard textbooks. We next derive a decomposition of the total mutual information of three variables, which will serve as an auxiliary formula in the general decomposition with an arbitrary number of variables.

A.3.1. Decomposition with three variables
By means of theorem (A 4), the total mutual information of a pair of random variables (X, Y) and a third variable Z can also be expressed in terms of entropies, which yields the decomposition (A 5). This decomposition has previously been applied in computational neuroscience (Adelman et al. 2003).
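A three-variable decomposition of this kind can be verified numerically. The Python sketch below checks one standard identity, I(X,Y;Z) = I(X;Z) + I(Y;Z) + [I(X;Y|Z) − I(X;Y)], with every term written in terms of entropies. This identity is an assumption chosen for illustration and may differ in form from the decomposition (A 5); the function names and test distributions are likewise hypothetical.

```python
import numpy as np

def marginal_entropy(p, axes):
    """Entropy (bits) of the marginal of the joint pmf p over the listed axes."""
    drop = tuple(a for a in range(p.ndim) if a not in axes)
    m = p.sum(axis=drop) if drop else p
    m = m[m > 0]
    return float(-np.sum(m * np.log2(m)))

def decompose(p):
    """Return (total, I(X;Z), I(Y;Z), interaction) for a pmf p[x, y, z]."""
    H = lambda *axes: marginal_entropy(p, axes)
    X, Y, Z = 0, 1, 2
    total = H(X, Y) + H(Z) - H(X, Y, Z)                   # I(X,Y;Z)
    i_xz = H(X) + H(Z) - H(X, Z)
    i_yz = H(Y) + H(Z) - H(Y, Z)
    i_xy = H(X) + H(Y) - H(X, Y)
    i_xy_given_z = H(X, Z) + H(Y, Z) - H(X, Y, Z) - H(Z)  # I(X;Y|Z)
    return total, i_xz, i_yz, i_xy_given_z - i_xy

# 1. Random joint distribution: the identity should hold to rounding error.
rng = np.random.default_rng(2)
p_random = rng.random((3, 4, 2))
p_random /= p_random.sum()
total, i_xz, i_yz, interaction = decompose(p_random)
residual = total - (i_xz + i_yz + interaction)

# 2. Z = X xor Y with independent fair bits: purely synergistic, interaction = +1 bit.
p_xor = np.zeros((2, 2, 2))
for a in (0, 1):
    for b in (0, 1):
        p_xor[a, b, a ^ b] = 0.25
xor_terms = decompose(p_xor)

# 3. X = Y = Z (one fair bit copied twice): redundant, interaction = -1 bit.
p_copy = np.zeros((2, 2, 2))
p_copy[0, 0, 0] = p_copy[1, 1, 1] = 0.5
copy_terms = decompose(p_copy)

print("residual on random pmf:", residual)
print("xor  (total, I(X;Z), I(Y;Z), interaction):", xor_terms)
print("copy (total, I(X;Z), I(Y;Z), interaction):", copy_terms)
```

The two hand-built cases show that the interaction term can take either sign: it is positive when the inputs are only jointly informative and negative when they carry the same information about the output.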