## Abstract

Ordinary differential equations (ODEs) are a popular approach to quantitatively model molecular networks based on biological knowledge. However, such knowledge is typically restricted. Wrongly modelled biological mechanisms as well as relevant external influence factors that are not included into the model are likely to manifest in major discrepancies between model predictions and experimental data. Finding the exact reasons for such observed discrepancies can be quite challenging in practice. In order to address this issue, we suggest a Bayesian approach to estimate hidden influences in ODE-based models. The method can distinguish between exogenous and endogenous hidden influences. Thus, we can detect wrongly specified as well as missed molecular interactions in the model. We demonstrate the performance of our Bayesian dynamic elastic-net with several ordinary differential equation models from the literature, such as human JAK–STAT signalling, information processing at the erythropoietin receptor, isomerization of liquid *α*-Pinene, G protein cycling in yeast and UV-B triggered signalling in plants. Moreover, we investigate a set of commonly known network motifs and a gene-regulatory network. Altogether our method supports the modeller in an algorithmic manner to identify possible sources of errors in ODE-based models on the basis of experimental data.

## 1. Introduction

Mathematical models of biological systems become more and more complex and contribute important insights into various biological processes [1–7]. Since biological systems are naturally open, formulating mathematical models and specifying their boundaries is a highly non-trivial task [8,9]. Consequently, most researchers in systems biology are faced with the still unsolved issue to find a compromise between model complexity and the limited amount of knowledge, data and time [9–11]. Researchers in other fields including earth and environmental sciences are facing similar challenges [12]. In many cases, missed and unknown external influences as well as erroneous interactions in a model could lead to severely misleading results [7].

Current research is mostly focused on inference of perturbation effects and model selection [13–15]. Although, perturbation experiments are labour and cost intensive, which raises the need for a careful prioritization strategy [14–17]. On the other hand, statistical model selection and related methods require a strong knowledge about the system and its alternatives which is rarely given in practice [7,18]. Thus model selection can be very difficult, specifically if nothing is known about missing variables and their possible mechanisms.

In most situations, researchers have partial knowledge and preliminary hypotheses about their system, which needs to be integrated into a restricted but still predictive and experimentally validatable model [19]. Even if the biological system is partially known and the data are given for almost all molecular species, it is not clear how to deal with insufficient predictions [7]. Often this ends in trial-and-error approaches and does not ensure that the selected model reflects the reality rather than just fitting the given dataset. The question is, how to detect so far unknown molecules and their interactions in a more data driven manner. This could guide the modeller towards points in the given model, where the model is likely erroneous. In a second step, the modeller can then try to link these erroneous points to known mechanisms.

Recently, we proposed the dynamic elastic-net (DEN) as a more principled method to address this issue [20]. The DEN aims for estimating the dynamics of hidden influence variables in ordinary differential equations (ODE)-based systems via a penalized estimation procedure resembling elastic-net regression [21]. While our previous method was tested successfully on several applications, such as the erythropoietin receptor (EpoR)-dependent signalling network [19], it has still several shortcomings, which we address in this paper. More specifically, DEN is not a probabilistic approach and thus does not fully address the unavoidable uncertainty about estimates. DEN does not answer the question, whether estimated hidden influences could be attributed to missed or wrongly modelled interactions among the known molecular species. Here we introduce the Bayesian DEN (BDEN) as a new and fully probabilistic approach, which deals with all these aspects. In contrast with DEN, our new BDEN method does not require pre-specified hyper-parameters. We illustrate the predictive power of BDEN compared with DEN in several real biological models and test cases. The BDEN thus provides a systematic Bayesian computational method to identify target nodes and reconstruct the corresponding error signal including detection of missing and wrong molecular interactions within the assumed model. The method works for ODE-based systems even with uncertain knowledge and noisy data. In contrast with approaches based on point estimates the Bayesian framework incorporates the given uncertainty and circumvents numerical pitfalls which frequently arise from optimization methods [22,23].

## 2. Material and methods

### 2.1. Motivation

We assume the modelling process to start with an initial, potentially incomplete or partially misspecified nominal model including all known but not necessarily observable molecules [24]. Figure 1 illustrates the general idea. In most situations, the real system differs from the initially modelled nominal system, which is reflected by an insufficient fit to the given data caused by (i) hidden influences and (ii) erroneous molecular interactions. Exogenous hidden influences could, for example, be stimulatory (e.g. phosphorylation) or inhibitory (e.g. de-phosphorylation) events affecting the modelled system from outside. In addition, there could exist stimulatory or inhibitory influences *within* the system, which are not included in the model due to lack of knowledge, i.e. missing molecular interactions. Similarly, wrongly included molecular interactions could exist. In biochemical systems, molecular interactions are based on biochemical reactions, e.g. phosphorylation and binding events.

Owing to the fact that biological systems are open, the number of potential erroneous nodes (e.g. proteins or other molecules) within the nominal model is huge [9]. Without further knowledge, independent error terms have to be assigned to each node. If the respective node is in reality not directly targeted by a hidden influence, the hidden input takes the value zero. Only nodes directly affected by hidden influences have non-zero errors. Wrongly modelled or missing interactions between two nodes can be represented by two error terms, one for each of the respective nodes, which will be correlated over time. We exploited this idea to detect missing or erroneous interactions in a given ODE-based nominal model.

### 2.2. Approach

We assume the dynamical model
2.1a
2.1b
2.1cHere, denotes the time derivative of the state vector with initial value * η*. The not necessarily linear function

**represents the nominal model, which describes the current assumptions about the dynamic interactions between the state variables and in addition the effect of the known input function .**

*f*No model of a biological system can ever be totally complete and comprehensive. Therefore, we add the hidden influences to the nominal model function ** f**. The additive dynamic hidden influence

**(**

*w**t*) subsumes missing or wrong interactions between the state variables as well as exogenous influences caused by crosstalk with other biological processes (see electronic supplementary material, §3). Of course,

**(**

*w**t*) is unknown and we aim to estimate these hidden influences from the data [20]. Notably, the hidden influence

**(**

*w**t*) is not restricted to be constant or linear and thus can be any arbitrary function of time.

Often it is impossible to measure all components of the state vector ** x**, e.g. concentrations of reacting substances (due to technical limitations, e.g. non-availability of phospho-protein specific antibodies). The map from the state to the measurable output , with

*K*not necessarily equal to

*N*, is given by the measurement function

**, which we assume to be known (equation (2.1**

*h**b*)). In addition, we assume white Gaussian measurement noise with expectation zero and a noise covariance matrix , see below.

In practice, the data are given as measurements at discrete time points *t*_{l} with *l* ∈ {1, …, *T*}. We will use the notation *y*_{k,l} = *y*_{k}(*t*_{l}) for the measured output *k* ∈ {1, …, *K*} at measurement time *t*_{l} and the analogous notation for the other variables, i.e. *x*_{i}(*t*_{l}) = *x*_{i,l} and . For sake of simplicity in the following, we denote the matrix of observed measurements by and the corresponding state and hidden influence matrices by and .

From now on, we are interested in hidden influences at discrete time points. Under this assumption equation (2.1) can be rewritten as
2.2a
2.2b
2.2cConsequently, we obtain a first-order Markov process over the state variables ** x**. The function is obtained by fitting a cubic smoothing spline through each of the

*N*discrete time series of hidden influence signals

*w*

_{i,1:T}[25].

The assumption of Gaussian measurement noise can, if necessary, approximately be fulfilled after a variance-stabilizing transformation [26]. In addition, we assume uncorrelated measurement noise and thus *Ξ*_{l} = diag(*ξ*^{2}_{1,l}, …, *ξ*^{2}_{K,l}).

### 2.3. Marginal likelihood of the data

The likelihood of the observed data
2.3can be factorized due to the independence of the measurement noise with respect to time and observables. Note that is defined by equation (2.2a). In addition, and are conditionally independent from *Ξ*_{1:T}.

Since typically the number of replicate measurements per time point is small, the empirical variance is not a reliable estimator of the true measurement noise. Therefore, we impose an inverse gamma prior on the variance of the measurement noise 2.4The marginal likelihood of the data are obtained by marginalizing over the variance of the measurement noise variable 2.5This integral can analytically be calculated to yield [27] 2.6A detailed derivation of equation (2.6) is provided in electronic supplementary material, §4.

According to Bayes' theorem, the posterior density over the hidden input signals with initial value is given by 2.7Using equation (2.6), we can directly draw samples from the posterior density of the hidden influence. For this purpose, we propose a Bayesian elastic-net prior as detailed in the following sections.

### 2.4. Smoothness and sparsity via a Bayesian elastic-net prior

The hidden input signals can be understood as the statistical residuals of the nominal system, and every deviation of observations from the nominal system could thus be explained by non-zero components in . However, we are only interested in hidden input signals, which are far stronger than measurement noise. Therefore, we assume that the hidden input signal is smooth and sparse. Sparsity corresponds to the *a priori* belief that only a small subset of state variables is truly affected by unknown external or internal input signals. In addition, we assume the hidden input signal to be smooth over time. Smoothness and sparsity are encoded by a prior distribution inspired by the Bayesian elastic-net, which is here combined with a first-order Markov process over [28,29]. Overall our proposed approach is thus a hierarchical graphical model shown in figure 2. Details can be found in electronic supplementary material, §5.

Briefly, the Bayesian elastic-net defines a conditional Gaussian prior over each *w*_{i,l}|*w*_{i,l−1} (*i* = 1, …, *N*, *l* = 1, …, *T*). The scale of the variance of the Gaussian prior is a strongly decaying and smooth distribution peaking at zero, which depends on parameters *λ*_{2}, and *σ*^{2}. The parameter is itself given by an exponential distribution (one for each component of vector ) with parameters . In consequence, sparsity is dependent on the parameter vector , whereas smoothness is mainly controlled by *λ*_{2} [28,30]. These parameters are drawn from hyper-priors, which can be set in a non-informative manner or with respect to prior knowledge about the degree of shrinkage and smoothness of the hidden influences [31]. We refer the reader to electronic supplementary material, §§5 and 7, for details.

### 2.5. Estimating hidden influences from data

To estimate the hidden input and the parameters in the hierarchical model, we devise a Metropolis–Hastings algorithm with Gibbs updates of the Bayesian elastic-net hyper-parameters. The algorithm proceeds sequentially between the different time points 1, …, *T* by drawing different samples at each supporting point. At sampling step *s* and time point *l*, a random component *w*_{i,l} is selected of the hidden input vector (a node in the network) at the previous time point, which is modified by a sample from a univariate Gaussian transition kernel *π* [32]. The resulting vector is accepted with probability
2.8where is the Bayesian elastic-net prior over the hidden influences conditioned by hyper-parameters (for details, see electronic supplementary material, §§5 and 7). Because of the Gaussian measurement errors, the discrepancy between data component *y*_{k,l} and the corresponding model output in equation (2.6) is given by the quadratic error . Note that is obtained by numerically integrating the ODE system from time point *l* − 1 using as initial value according to equation (2.2a). Code for the sampling algorithm is provided in electronic supplementary material, §6.

### 2.6. Estimating endogenous hidden influences: missing and wrong reactions

After having estimated hidden influences on state components in the nominal ODE system, the question arises whether these hidden variables could in fact correspond to missing or wrongly specified interactions within the nominal system. A simple strategy, which we followed here, is to rank all state variables in the nominal system by their temporal correlation with the estimated hidden influences. The essential idea is that in case of a wrong or missing reaction the estimated hidden time courses should ‘compensate’ erroneous predictions by the nominal system (figure 3). In general, wrong or missing interactions can either have an increasing (stimulatory) or decreasing (inhibitory) influence on the target nodes. This results in a negative hidden influence with in the case of inhibition and a positive hidden influence with in the case of stimulatory events. More specifically, we distinguish several cases which are listed in table 1 and further illustrated in figure 3. Briefly, the idea is that an modelled stimulation between two state variables *x*_{1}, *x*_{4} in the ODE system yields two error signals *w*_{1} (influencing *x*_{1}) and *w*_{4} (influencing *x*_{4}), which are anticorrelated. This is because *w*_{1} and *w*_{4} capture the unmodelled dynamics of the ODE system. Similarly, an unmodelled inhibition yields *w*_{1}, and a positive correlation of *w*_{1} and *w*_{4}. A wrongly modelled stimulation results in and which are anticorrelated. A wrongly modelled inhibition yields *w*_{1}, and a positive correlation.

As exemplified in figure 3, due to differences in the expected correlations, our analysis should also allow for distinguishing missing inhibitory versus stimulating effects of missing monotonous interactions.

Different measures exist to capture the strength of correlation between time courses. Apart from the Pearson correlation, we here used the the cross-correlation coefficient
2.9to quantify temporal associations between two time series *X* and *Y* [33]. The cross-correlation *R*_{XY}(*τ*) depends on the time lag *τ* which was chosen as argmax_{τ}[*R*_{XY} (*τ*)].

## 3. Results

### 3.1. Tested mathematical models

The EpoR-induced JAK–STAT signalling pathway mediates a rapid signal transduction from the receptor to the nucleus related to cell proliferation and differentiation [19]. This pathway involves a rapid nucleocytoplasmic cycling of the signal transducer and activator of transcription 5 (STAT5) molecules which is not directly measurable [19].

The G protein cycling model quantitatively characterizes the heterotrimeric G protein activation and deactivation in yeast [34]. This model serves as a fully observed but complex test case where all states are measured.

In contrast, the model of the UV-B signalling in plants systematically links several signalling events induced by UV-B light to a comprehensive informational signalling pathway [35]. Only combinations of small amounts of the involved molecules are accessible and thus it serves as a complex and not fully observed test case.

Network motifs are thought to represent building blocks of larger biological systems [36]. It is thus informative to test BDEN with respect to these motifs to better understand the possible dependency of the performance of our models on different basic network topologies.

The dynamic EpoR model reflects the information processing at EpoR including turnover, recycling and mobilization of EpoR after stimulation with erythropoietin (Epo) at the cell membrane [37]. Consequently, it details the first part of the JAK–STAT signalling pathway. Only combinations of Epo concentrations in the medium, on the surface and in the cells are accessible and thus it represents a complex model with limited experimental data [37].

By contrast, the thermal isomerization of *α*-Pinene (*α*P) in the liquid phase has the purpose to investigate the applicability of BDEN to small compound reaction networks [38]. The model details the racemization of *α*P and its simultaneous isomerization to dipentene (dP) and allo-ocimene (aO).

To further investigate the utility of BDEN for complex systems, we used a gene-regulatory network composed of six genes and related proteins obtained from the DREAM6 challenge [39]. Further details about the described models are given in electronic supplementary material, §8.

### 3.2. Simulation study: testing the performance of Bayesian DEN

We first compared the performances of BDEN as well as our old DEN approach to correctly predict the location of single hidden influence for the JAK–STAT signalling model, the heterotrimeric G protein cycling, the U-VB signalling in plants and the aforementioned network motifs [19,34–36]. This was done on the basis of simulated data for each system. Details about the simulations are given in electronic supplementary material. For BDEN, we computed for each (where denotes the posterior mean taken over MCMC samples) the area under the predicted hidden influence curve by trapezoidal numerical integration [40]. For DEN, we applied the same method based on the provided point estimates of hidden influence curves. The area under the predicted hidden influence curve was compared against the simulated existence and non-existence of a hidden signal at that node. Consequently, we were able to compute an area under ROC (AUC) value and a corresponding Brier score (BS), i.e. the squared difference between the prediction score and the Boolean indicator of a true hidden influence [41,42]. Table 2 and electronic supplementary material, table S1, show a favourable overall performance of our new method for different levels of measurement noise and simulated errors of kinetic parameter estimates.

Next, we investigated the performance of BDEN to correctly detect more than one hidden influence. We used the comparatively large G protein cycle model for this purpose. Results can be found in electronic supplementary material, table S2. In this simulation, we randomly added hidden influences for up to 50% of the nodes and still observed a good prediction performance.

In a similar manner, we investigated the performance of BDEN to detect wrong and missing interactions (table 3). We simulated wrong model specifications of the heterotrimeric G protein cycling, the UV-B signalling in plants and the synthetic JAK–STAT signalling by randomly removing and adding interactions. As described above, the quantitative predictions of BDEN are given in terms of (cross-)correlations. By comparing these correlation values against the true existence and non-existence of a particular interaction, we were able to compute an AUC value. Notably, missing and wrong interaction detection is only possible with our new BDEN approach. Again we archived a very good performance for all systems under investigation. On average, 80% of the missing interactions are correctly detected by BDEN. Among the correctly identified missing interactions, on average 90% were correctly classified as ‘stimulating’ and ‘inhibiting’, respectively (electronic supplementary material, table S3). Details regarding the dependency on the measurement noise are given in table 3 and results in dependency of deviance of the parameter estimates are given in electronic supplementary material, tables S4 and S5.

### 3.3. Examples with real data

In the following, we further illustrate the results obtained with BDEN for the JAK–STAT signalling model, the information processing at EpoR and the isomerization of *α*P using experimental data.

#### 3.3.1. JAK–STAT signalling

The JAK–STAT signalling pathway model (§3.1) consists of four molecular species.

Unbound STAT5 molecules become phosphorylated (STAT5_{p}) catalysed by the erythropoietin receptor. Two STAT5_{p} molecules can form a dimer (STAT5_{di}) and thus are able to enter the nucleus (STAT5_{n}). Only the amount of phosphorylated STAT5 molecules, the total amount of STAT5 and the erythropoietin receptor are directly accessible. Experimental measurements are available at 16 time points [19].

Figure 4 illustrates the application of our method when ignoring the back-translocation of STAT5_{n} into the cytoplasm, which was hypothesized by the authors [19]. After parameter fitting the nominal system is not in sufficient agreement with the data. Introducing hidden influence terms leads to good agreement with the observations. Our method clearly localized two hidden influences and at STAT5 and STAT5_{n}. Subsequent analysis shows a high positive (cross-)correlation of with STAT5_{n} and a negative one with . Exactly the opposite pattern can be observed for . According to our above explained procedure we thus predict a stimulatory influence of on STAT5. This is in agreement with the claimed nucleocytoplasmic cycling of the phosphorylated STAT5 dimer [19,43].

#### 3.3.2. EpoR model

The complex core model of the EpoR regulation via receptor mobilization, turnover and recycling involves six species and eight time points. Here the ligand Epo binds to EpoR on the surface and builds a ligand–receptor complex (Epo–EpoR). In consequence, Epo–EpoR triggers the phoshorylation of the cytoplasmic EpoR and thus induces the JAK–STAT signalling pathway [37]. Several mechanisms affect the amount of active EpoR. This model covers the ligand-induced receptor endocytosis and thus the internalization of the ligand-bound receptor (Epo–EpoR_{i}), receptor recycling and degradation of the internalized ligand-bound receptor. Location-dependent degradation results in degraded Epo in cytoplasm (dEpo_{i}) and in medium (dEpo_{e}).

As a test case for BDEN, we wrongly specified a receptor-induced feedback on the amount of available Epo. In consequence, we expect to detect this wrong interaction. As shown in figure 5, BDEN allows to correctly localize and characterize this erroneous interaction in the nominal model.

#### 3.3.3. *α*P isomerization

The model of the dynamic isomerization of *α*P is composed of four molecular species. Measurements of *α*P, dP, aO and the dimer (Di) are available [38]. After heating, *α*P reacts either to dP or builds a dimer by reacting with aO. Furthermore, Di can react to aO.

To test BDEN the dimerization step was wrongly replaced with a simple reaction involving only aO. In consequence the interaction between *α*P and dP is completely independent from the interaction between aO and Di. The erroneous nominal system can be corrected by using BDEN as illustrated in figure 6. BDEN is able to correctly locate and add the falsely removed reaction.

### 3.4. Further examples with simulated data

In the following, we further illustrate the results obtained with BDEN for the G protein cycle in yeast, the UV-B signalling model and a generic gene-regulatory network using simulated data (2.5% noise level).

#### 3.4.1. G protein cycle in yeast

The heterotrimeric G protein cycle in yeast involves six species which are directly observable and coupled by several types of kinetics (for details see electronic supplementary material, §8) [34]. Experimental data at 8 time points were simulated by adding Gaussian distributed noise to the predicted values of the observable variables. We assumed a noise intensity of 2.5% relative to the mean of the related time series for each observable variable. The nominal system was generated by adding one additional ‘wrong’ interaction (between the ligand–receptor complex (LR) and the G protein_{α}-inactive (GP_{αi})). Electronic supplementary material, figure S1, illustrates the ability of BDEN to localize and recover the wrong interaction within the nominal system.

#### 3.4.2. UV-B Signalling

As a more complex example, we simulated seven data points of the photomorphogenic UV-B signalling in plants [35]. The model of the photomorphogenic UV-B signalling in the model plant *Arabidopsis thaliana* consists of 11 species coupled by several different kinetic rate expressions and five observable variables as a combination of seven different species (for details see electronic supplementary material, §8). As the nominal system, we used the literature given model and included a missing link by removing one interaction which influences two different species. Observed data were simulated by adding Gaussian distributed noise to the predicted values of the observable variables. We assumed a noise intensity of 2.5% with respect to the mean of the related time series for each observable variable. As shown in electronic supplementary material, figure S2, the BDEN is able to detect the missing molecular interaction and correctly identifies the corresponding proteins.

#### 3.4.3. DREAM6 Challenge Network

To investigate the applicability of BDEN to gene-regulatory networks we took a model from the DREAM6 challenge [39]. The model consists of six genes and six proteins coupled by mass action and hill kinetics. In this model, all proteins and one mRNA species are assumed to be directly observable (for details see electronic supplementary material, §8) [39]. As the nominal system, we used the provided model and included one inhibitory mechanism. Observed data were simulated at five time points by adding Gaussian distributed noise to the predicted values of observable variables according to the original challenge [39]. BDEN is able to detect and correct the spurious interaction, as illustrated in electronic supplementary material, figure S3.

## 4. Conclusion

Mathematical modellers in systems biology are frequently confronted with incomplete knowledge and limited understanding of a complex biochemical system [9–11]. Consequently, there is a non-negligible chance that relevant molecular species are missed or interactions are misspecified [8,9]. Our proposed method addresses this issue by adopting a Bayesian framework which allows for inferring hidden influence variables, as well as estimating missing and wrong molecular interactions. It has been successfully validated in several real models as well as common network motifs. This was done with simulated as well as experimental data. Owing to the fully Bayesian formulation all model parameters are sampled. Furthermore, the Bayesian approach allows to assign confidence levels to predictions. Besides these general features of a fully probabilistic framework our newly proposed BDEN method seems to be more stable and more robust because within the Bayesian framework, we average over a large number of parameters and do not rely on stiff integration methods.

A unique feature of our new approach is the distinction between exogenous and endogenous hidden influences in the biological system, allowing for the detection of missing and misspecified equations in the ODE system. Altogether we thus see our BDEN method as a further step towards a better automated and more objective revision of ODE-based models in systems biology and possibly other fields, such as pharmacokinetics, earth science, robotics and engineering [12,20]. In that context, we emphasize again that BDEN is *not* designed to learn ODE systems purely from data and should thus not be confused with network reverse engineering methods [44]. Much more, the utility of BDEN is to ease identification of sources of errors in mechanism-based mathematical models.

## Data accessibility

The source code (Matlab) with several examples is available online at http://www.abi.bit.uni-bonn.de/index.php?id=17.

## Authors' contributions

B.E. performed the simulations; B.E., M.K. and H.F developed the method; B.E. and H.F interpreted the data; B.E., M.K. and H.F drafted the manuscript; M.K. and H.F. designed the research. All authors reviewed the manuscript and gave final approval for publication.

## Competing interests

We declare we have no competing interests.

## Funding

B.E. was supported by the Research Training Group 1873, Deutsche Forschungsgemeinschaft (DFG).

## Footnotes

Electronic supplementary material is available online at https://dx.doi.org/10.6084/m9.figshare.c.3792295.

- Received May 8, 2017.
- Accepted May 23, 2017.

- © 2017 The Author(s).

Published by the Royal Society under the terms of the Creative Commons Attribution License http://creativecommons.org/licenses/by/4.0/, which permits unrestricted use, provided the original author and source are credited.