## Abstract

Bayesian statistics provides a framework for the integration of dynamic models with incomplete data to enable inference of model parameters and unobserved aspects of the system under study. An important class of dynamic models is discrete state space, continuous-time Markov processes (DCTMPs). Simulated via the Doob–Gillespie algorithm, these have been used to model systems ranging from chemistry to ecology to epidemiology. A new type of proposal, termed ‘model-based proposal’ (MBP), is developed for the efficient implementation of Bayesian inference in DCTMPs using Markov chain Monte Carlo (MCMC). This new method, which in principle can be applied to any DCTMP, is compared (using simple epidemiological SIS and SIR models as easy to follow exemplars) to a standard MCMC approach and a recently proposed particle MCMC (PMCMC) technique. When measurements are made on a single-state variable (e.g. the number of infected individuals in a population during an epidemic), model-based proposal MCMC (MBP-MCMC) is marginally faster than PMCMC (by a factor of 2–8 for the tests performed), and significantly faster than the standard MCMC scheme (by a factor of 400 at least). However, when model complexity increases and measurements are made on more than one state variable (e.g. simultaneously on the number of infected individuals in spatially separated subpopulations), MBP-MCMC is significantly faster than PMCMC (more than 100-fold for just four subpopulations) and this difference becomes increasingly large.

## 1. Introduction

The development, testing and consequent refinement of dynamic mathematical models plays an increasingly important role in the study of many systems. Bayesian statistics provides a general framework for inference of both model parameters and unobserved/unobservable aspects of system dynamics, and gives a rigorous basis for prediction, decision-making and model comparison [1]. As a reliable statistical technique, Bayesian inference, in the context of the type of models studied here, is a well-established methodology for extracting information from biological data (e.g. see [2–4] for epidemiological examples). However, despite the success of applying Bayesian methods to many systems, use has often been limited by computing power. In this paper, we develop novel computationally efficient tools that implement Bayesian inference for discrete state space, continuous-time Markov processes (DCTMPs). As described below, DCTMPs have proved extremely useful in modelling a huge range of biological phenomena. Moreover, the innovative computational approach introduced in this paper leverages the specific structure of the model under study, and therefore has the potential to make an impact across the same broad range of applications.

DCTMPs are readily simulated using the Doob–Gillespie algorithm [5] and have been used to model stochastic dynamics in a wide range of systems, including: a long history of application to human, animal and plant epidemics (see [6] for an overview); the study of population dynamics in ecology [7]; variation in the concentration of various compounds resulting from chemical reactions [5]; financial crises [8]; biochemical networks [9]; gene expression [10] and the behaviour of quantum systems [11]. Gibson & Renshaw [11,12] demonstrated implementation of Bayesian inference for DCTMPs by applying Markov chain Monte Carlo (MCMC) methods, and this now standard approach has subsequently been employed to conduct inference in epidemics [13], ecology [14] and systems biology [15].

The computational costs of implementing Bayesian inference using MCMC have led to the development of alternative approaches including approximate Bayesian computation (ABC) [16] and methods based on particle filtering such as particle MCMC (PMCMC) [17]. In addition, several authors have attempted to enhance Bayesian inference for *continuous* state space dynamical systems by modifying MCMC to use local gradients in the posterior distribution [18,19]. By contrast, this paper proposes a more effective MCMC proposal to implement Bayesian inference in *discrete* state space dynamical systems and specifically for DCTMPs.

We begin with a brief review of DCTMPs (§2) and Bayesian inference on these systems (§3). Section 4 details the computational approaches used to achieve this inference: a standard MCMC algorithm, PMCMC and MCMC using a new proposal type that is computationally more efficient than the standard approach. This new proposal, termed a ‘model-based proposal’ (MBP), represents the important novel aspect of this paper. These MBPs are generically applicable to DCTMPs, but for clarity of exposition are introduced and demonstrated in the context of two simple models that are widely used in epidemiology. A numerical comparison of all three techniques is given in §5, the paper is summarized in §6 and, finally, we discuss future research possibilities in §7.

## 2. Discrete state space, continuous-time Markov processes

Consider the following generic DCTMP in which, at time point *t*, the system state can be represented by a vector *n*(*t*) that contains *N* state variables. Each variable represents the population size in the *N* compartments within the model. Because populations take integer values, the space of possible configurations that the system can potentially take is discrete. Transitions of individuals between compartments are also discrete and are captured by *E* possible event types. Each event type *e* has the effect of altering the state variables as follows:
2.1where *c* denotes one or more compartments changed by the event and *Q* is a fixed matrix. The average rate at which each event type occurs is defined by the elements of vector *r*(*t*). An important simplifying assumption in DCTMPs is the Markovian property, which assumes that these rates are solely dependent on the current state vector *n*(*t*)*,* rather than on any previous state of the system.^{1} The functional dependence between *r*(*t*) and *n*(*t*) is characterized by parameters collectively referred to as *θ*.

To give an example, a commonly used compartmental model in epidemiology is the SIS model, illustrated in figure 1*a*. This divides individuals into those that are not infected and are susceptible to infection (the ‘*S*’ compartment) and those that are infected and infectious (the ‘*I*’ compartment). In this model, two possible events can occur: an infection (*S* is reduced by one and *I* increased by one) or a recovery (*I* is reduced and *S* is increased). Such a model is relevant to infections such as the common cold or influenza, which do not confer long-lasting immunity. For the SIS model in figure 1*a*, the following definitions can be made:
2.2where event 1 is infection and event 2 is recovery.

This model can be simulated using the Doob–Gillespie algorithm (subsequently abbreviated as DGA), a standard stochastic simulation technique for generating event sequences from DCTMPs (see the electronic supplementary material, appendix A, for details). Results are shown in figure 1*b* for the case in which a single infected individual enters a population of 100 susceptible individuals (corresponding to an initial state vector *n*^{init} = *n*(0) = (100, 1)). The parameters used were *θ* = (*β*, *γ*) = (0.003, 0.1), with simulation time *T* = 100. An initial rapid increase in the infected population (dotted curve) settles towards an endemic scenario, in which the rate of infection and recovery events approximately balance.

The black circles in figure 1*b* denote potential estimates of the infected population size *I*, made at discrete time intervals. For this study, these data points are used as hypothetically observed data ‘*y*’, which are collected over the time period [0, *T*]. The error bars associated with the circles in figure 1*b* represent the uncertainty associated with these measurements.^{2} In general, measurements on state variables are not necessarily regularly spaced, as in figure 1*b*, and can be made on different compartments at different times. For *m* ranging from 1 to *M*, we take *c _{m}* to represent the compartment measured with

*τ*the time of measurement,

_{m}*v*the observed value of the state variable (i.e. the population size in compartment

_{m}*c*at time

_{m}*τ*) and

_{m}*σ*the uncertainty associated with the measurement. For example, the first measurement in figure 1

_{m}*b*is defined by

*c*

_{1}=

*I*,

*τ*

_{1}= 20,

*v*

_{1}= 18 and

*σ*

_{1}= 2. The hypothetical data in figure 1

*b*is later used to provide a benchmark against which to assess the relative performance of the three computational approaches.

It should be stressed that these algorithms can be applied to arbitrary DCTMP. The simple SIS outlined above, and SIR and meta-population SIR models later used, have been selected for the purposes of clearly identifying the relative strengths and weaknesses of each approach. The implications of this analysis, however, are vitally important in developing practical Bayesian inference techniques for the wide range of applications identified in Introduction.

## 3. Bayesian framework

The data *y* from the previous section is insufficient to determine the precise sequence of events *ξ* and parameter values *θ* used to generate it. Bayesian analysis [1] assesses the probability, conditional on the model and the data, that the parameters *θ* and the so-called latent (unobserved) variables *ξ* take particular values. Using Bayes' theorem, this probability, termed the posterior, can be written as
3.1

Here, *π*(*ξ*|*θ*) represents the probability that a DGA simulation (as outlined in electronic supplementary material, appendix A) of the model using parameters *θ* generates event sequence *ξ*. It can be expressed as (see below for a brief derivation) [12,20]:
3.2where the product runs over all *N _{ξ}* events in

*ξ*. The quantities

*t*and

_{k}*ɛ*represent, respectively, the time and type (i.e. infection or recovery for the SIS model) of each event. The rates

_{k}*r*(

*θ*,

*n*), such as those given in equation (2.2), are evaluated using parameters

_{k}*θ*and the state vector

*n*immediately prior to time

_{k}*t*(

_{k}. R*θ*,

*n*) =

_{k}*Σ*

*(*

_{e}r_{e}*θ*,

*n*) is the total event rate. We define

_{k}*t*

_{0}= 0 as the time at which the initial conditions are set.

Equation (3.2) can be derived by considering the two ways in which stochasticity enters the DGA (see electronic supplementary material, appendix A): (1) event times are drawn from the exponential distribution *P*(*t _{k}*) =

*R*(

*θ*,

*n*)exp(

_{k}*−R*(

*θ*,

*n*) × (

_{k}*t*−

_{k}*t*

_{k−}_{1})) and (2) event types are selected with probability

*P*(

*ɛ*) =

_{k}*r*(

_{ɛk}*θ*,

*n*)

_{k}*/R*(

*θ*,

*n*). Multiplying these terms for all events in

_{k}*ξ*leads to equation (3.2). The first term on the right-hand side of equation (3.2), where

*n*represents the final state vector of the system, accounts for the time between the last event and the final observation time

_{T}*T*.

The probability of measuring data *y* given that the actual system has event sequence *ξ* is given by *π*(*y*|*ξ*), referred to as the observation model. If we assume a simple normal distribution^{3} in the error of a single measurement *y*_{sing}, the probability of observing a population value *v* given it actually has a value *n* is
3.3where *n* and *σ*^{2} are the mean and variance of a normal distribution (see electronic supplementary material, appendix B, for a definition of this function). If we assume that the *M* measurement errors in *y* are uncorrelated,^{4} the observation model for the complete data *y* can be written as the product
3.4This term contributes to the posterior probability equation (3.1) by giving greater weight to event sequences *ξ* that pass near to the measurements.

In the Bayesian framework, previous knowledge about model parameters is encapsulated in the prior distribution *π*(*θ*) in equation (3.1).^{5} A flat uninformative prior is assumed for the numerical results in this study.

Figure 2 shows a two-dimensional representation of the posterior distribution marginalized in parameter space *π*(*θ*|*y*), as discretized on an 80 × 80 grid, for the SIS model with *θ* = (*β*, *γ*) assuming data *y* from figure 1*b*. Darker shaded regions indicate more probable parameter combinations that are consistent with the observed data, model and prior. Point *A* represents the parameter values used to generate the true event sequence. This point lies within the shaded area, indicating that there is sufficient information within the measurements to approximately infer the true parameter values underlying the system's dynamics.^{6}

## 4. Material and methods

We now describe three different algorithms to generate the plot shown in figure 2. Standard MCMC and the PMCMC methods are briefly described below (details in the electronic supplementary material), and then the new MBP technique, which represents the novel contribution of this paper, is presented in some detail.

### 4.1. Standard Markov chain Monte Carlo

The standard method consists of generating an MCMC chain of event sequences *ξ ^{i}* and parameters values

*θ*(where

^{i}*i*denotes the position on the chain) using proposals randomly selected from four possibilities: (1) change a parameter value, (2) insert an event, (3) remove an event or (4) change the time of an event. Details of this procedure are given in electronic supplementary material, appendix C. These changes permit the algorithm to explore complete parameter and event space, and the resulting chain is guaranteed to asymptotically be distributed in accordance with the posterior probability, i.e. in the limit as the number of iterations goes to infinity.

The dotted-dashed line in figure 3*a* illustrates an example of an output from the standard MCMC algorithm, above. This shows the sequence of samples generated for *β* based on the data from the SIS model example in figure 1*b*. Observe that the curve moves up and down very slowly (note the large number of iterations on the *x*-axis). Within MCMC, the process whereby parameters explore the posterior is known as ‘mixing’. The problem of slowly varying parameters, or poor mixing, results from consecutive samples in the Markov chain being highly correlated (see [15] for another example). Mixing may be quantified through the autocorrelation function *R* (described in electronic supplementary material, appendix G). This is plotted as a function of CPU time in figure 3*b* (time is used instead of sample number to allow a direct comparison with the other methods, later). The dotted-dashed curve shows that after 1 s of computation (corresponding to around 5200 MCMC iterations), *R* = 0.96, indicating that the chain is still highly correlated over this timescale. (An intuitive explanation for this high degree of autocorrelation is presented in electronic supplementary material, appendix E.)

### 4.2. Particle Markov chain Monte Carlo

PMCMC is a recently proposed technique that exploits knowledge of the model and is naturally parallelizable. Its operation reduces the problematic correlations in event sequences found above [17], and application of PMCMC to compartmental models has recently been proposed [21,22]. The actual operation of PMCMC comes in two main varieties described below.

#### 4.2.1. Particle marginal Metropolis–Hastings

This approach treats parameters and event sequences somewhat differently. In terms of events, a sequential Monte Carlo (SMC) method [23] takes a given set of parameter values *θ* and simulates a series of possible event sequences, or ‘particles’, from the model. At measurements, a so-called ‘particle filter’ is applied to prune those generated sequences that are not in good agreement with the data. An overall measure of how well the simulated event sequences agree with the data *y* provides an unbiased estimate of the posterior distribution marginalized in parameter space In PMCMC, exploration in parameter space is achieved through an MCMC scheme that uses the approximation rather than the full posterior *π*(*θ*, *ξ*|*y*). A consequence of this approach is that event sequences become largely uncorrelated between subsequent iterations of the MCMC chain. This benefit in terms of mixing, however, comes at the significant computational cost of simulating the system multiple times. Further details of particle marginal Metropolis–Hastings (PMMH) are presented in electronic supplementary material, appendix F, and suitable optimization (e.g. selecting particle number) is described in electronic supplementary material, appendix H.

#### 4.2.2. Particle Gibbs

Iteratively, SMC is used to draw samples from *π*(*ξ*|*y*,*θ*) (by means of a Metropolis–Hastings step) and Gibbs (for cases in which it is possible) is used to directly sample from *π*(*θ*|*ξ*). For typical DCTMPs, there is a high degree of correlation between model parameters and state space. This leads to the Gibbs step only proposing relatively small jumps in parameter space (demonstrated by the fact that the distribution *π*(*θ*|*ξ*) in electronic supplementary material, figure 3, is much smaller than the full posterior shown in figure 2), resulting in highly correlated PMCMC samples. Consequently, the particle Gibbs version of PMCMC was found to be much less efficient than the PMMH procedure above, and the results presented here are obtained solely from the latter.

The dashed curve in figure 3*a* shows the variations in *β*, again based on the data from figure 1*b*. Observe that this curve moves up and down much more quickly than the standard MCMC method (note the different scales on the *x* axes). Although it takes significantly longer to run per iteration, it is a superior algorithm. This finding is emphasized by the autocorrelation function in figure 3*b*, which shows that the MCMC chain becomes largely uncorrelated after approximately half a second of computation time.

### 4.3. Model-based proposal Markov chain Monte Carlo

Above we demonstrated that the standard MCMC procedure is an inefficient means of obtaining posterior estimates for DCTMPs due to the strong autocorrelation of samples. Because event sequences and parameter values are highly correlated, any efficient scheme must *simultaneously* change parameters and events during the proposal step. Here, we present a generic method that achieves this by first proposing a large jump in parameter space *θ ^{f}* and then constructs a suitable event sequence

*ξ*based on

^{f}*θ*,

^{f}*θ*and

^{i}*ξ*. The algorithm is as follows.

^{i}#### 4.3.1. Initialization

Initial parameter values *θ*^{1} and event sequence *ξ*^{1} are set. These may be generated by first randomly assigning plausible values for *θ*^{1} (perhaps guided by a prior distribution) and then simulating *ξ*^{1} using the DGA. In this study, we simulated an event sequence to generate data *y* in the first place (figure 1*b*), so it was convenient to use this for *ξ*^{1}. (It should be stressed, however, that arbitrary choice of *ξ*^{1} would be equally valid, and this decision has no impact on the results presented in this paper because the system is sufficiently burned in before samples are taken.)

#### 4.3.2. Step 1

Generate *θ ^{f}* based on

*θ*. Many possibilities exist, but for this study we use a multivariate normal jumping distribution centred at

^{i}*θ*

^{i}scaled by a jumping size

*j*: 4.1

_{F}The covariance matrix *Σ* of the posterior distribution in parameter space is numerically approximated during the adaptation phase (see electronic supplementary material, appendix H, for details) along with an optimal choice for *j _{F}*. The advantage of using a multivariate distribution (rather than a simple normal of for each parameter separately) is because it accounts for potential correlations between parameters which would otherwise restrict jumping size.

#### 4.3.3. Step 2

Generate event sequence *ξ ^{f}* using

*ξ*,

^{i}*θ*and

^{f}*θ*. The essential concept is to construct a new sequence of events that is representative of that generated from the model using

^{i}*θ*, based on the assumption that

^{f}*ξ*is representative of that generated using

^{i}*θ*. This represents the key novel aspect of this algorithm. The approach is illustrated in figure 4. At time

^{i}*t*= 0, we initialize the state vectors for both the existing sequence and the proposed sequence

*n*(0) =

^{f}*n*(0) =

^{i}*n*

^{init}. The difference in rates between the two sequences for each event type e is calculated as: 4.2

Note, the *max* function ensures that this quantity is always non-negative and represents the excess rate for each event type at *θ ^{f}* compared with

*θ*. The DGA is run based on Δ

^{i}*r*

_{e}(

*t*) (as opposed to the usual

*r*

_{e}(

*t*)) up to the first event time (see

*t*

_{1}in the sequence

*ξ*in figure 4). This potentially adds new events to the timeline, e.g. a new infection event is added to

^{i}*ξ*in the diagram. Next, we assess whether or not to copy the event present in

^{f}*ξ*(e.g. at

^{i}*t*

_{1}) into

*ξ*(this accounts for cases when the rates at

^{f}*θ*are lower than that at

^{f}*θ*). This copying is done with probability 4.3where

^{i}*k*is the event number on sequence

*ξ*, which at

^{i}*t*

_{1}is

*k*= 1 (figure 4). The process of running the DGA based on Δ

*r*

_{e}to insert new events (represented by the curly brackets on the timeline in figure 4) and probabilistically copying events using equation (4.3) (represented by the arrows in the diagram) is repeated until the maximum observation time

*T*is reached.

It should be noted that the procedure above applies to arbitrary DCTMP, and not simply the models investigated in this paper.

#### 4.3.4. Step 3

Accept or reject the proposal. Set *ξ ^{i}*

^{+1}=

*ξ*and

^{f}*θ*

^{i+}^{1}=

*θ*with Metropolis–Hastings probability 4.4else

^{f}*ξ*

^{i}^{+1}=

*ξ*and

^{i}*θ*

^{i+}^{1}=

*θ*. Two important points can be made about this expression. Firstly, it is proportional to the ratio of observation model probabilities equation (3.4) for the two event sequences

^{i}*π*(

*y*|

*ξ*) and

^{f}*π*(

*y*|

*ξ*). Therefore, sequences in better agreement with the observations will be preferentially selected. Secondly, equation (4.4) does not contain the event sequence probability from equation (3.2), unlike the corresponding Metropolis expression used in the standard MCMC method. This is because the model, itself, is used to generate the proposal in step 2 above,

^{i}^{7}and consequently terms in the event sequence probability and proposal distribution cancel each other out (as shown in electronic supplementary material, appendix J).

#### 4.3.5. Step 4

Increment *i* and jump to step 1 or stop if sufficient accuracy has been achieved.

An example implementation of the MBP procedure written in C++ has been placed in the electronic supplementary material (download the file MBPexample.cc).

The solid curve in figure 3*a* shows how *β* varies as a function of iteration number for model-based proposal MCMC (MBP-MCMC). We observe that it does not mix quite as well as PMCMC, but because it is computationally faster per iteration,^{8} the autocorrelation function in figure 3*b* shows the MBP method to be significantly less autocorrelated than PMCMC at short computational timescales and it does not exhibit the same long tail.

A key justification for the procedure outlined in steps 1 and 2 above is that it ensures the acceptance probability in equation (4.4) does not contain the model, but rather only the data *y*, the current and proposed event sequences, and the prior. The model is incorporated into the proposal, hence the terminology ‘model-based’ proposal. This has important implications. Supposing no measurements are made on the system (i.e. *π*(*y*|*ξ*) = 1), repeated application of MBPs result in the Markov chain mapping out the prior distribution, as would be expected. Importantly, however, *j _{F}* in equation (4.1) can be chosen to propose arbitrarily large jumps in parameter space. These large jumps mean that consecutive samples on the MCMC chain are highly uncorrelated, and hence mixing is good. This is in stark contrast with proposals that use the normal Metropolis–Hastings expression, which relies of calculating event sequence probabilities. Here, even in the absence of measurements, the proposal jumping size is typically highly restricted by the model (because large jumps in either parameter space or state space have a very low probability of acceptance), leading to poor mixing. In reality, of course, any meaningful Bayesian analysis requires measurements. But because the number of observations is typically much smaller than the number of events in the system, a Metropolis–Hastings acceptance rate based on the ratio of observation model probabilities is much less restrictive that one based on the ratio of event sequence probabilities.

## 5. Results

Simulations were performed to assess the speed and accuracy of the three algorithms presented in §4 (standard MCMC method, PMCMC and MBP-MCMC). An efficient scheme for plotting the posterior distribution (based on Rao–Blackwellization) is outlined in electronic supplementary material, appendix G. Assessing the accuracy of this distribution after a certain number of iterations of a given algorithm presents difficulties because there is no analytical expression available with which to compare the results. Consider, however, running *W* different versions of the algorithm. They are each started with a different random seed^{9} and so the approximations to the posterior they generate are different. In the asymptotic limit, these approximations all converge to the actual posterior. The efficiency of each algorithm is evaluated by how quickly this convergence occurs (electronic supplementary material, appendix L, describes how this measure is defined).

Figure 5 shows how average posterior accuracy improves as a function of CPU time^{10} (based on a 2 GHz processor) for the three methods when using the data from figure 1*b* (note the log–log axes). Here, *W* = 50 chains are used and the MBP-MCMC, PMCMC and standard MCMC methods are iterated for 6 × 10^{5}, 10^{5} and 6 × 10^{6} steps, respectively. An adaptation phase of 10^{5} iterations is used to ensure each method uses optimized constants and is sufficiently burned in.

Consider the solid curve representing the MBP-MCMC results. Initially, the accuracy remains fairly flat, which can be interpreted as the time over which the chain mixes (the average autocorrelation times, as defined in electronic supplementary material, appendix G, for the three methods are denoted by the black arrows). In the long time limit, the curve approaches its theoretically expected slope^{11} represented by the thin black line. An accuracy of 10% is taken to correspond to an acceptable approximation of the posterior. We see from the black dot in figure 5 that MBP-MCMC takes around 52 s of calculation to achieve this. Compare now the dashed line for the PMCMC results in figure 5. This is shifted to right of the solid curve, hence it is slower. To achieve the same 10% accuracy requires around 295 s of computation, almost six-times longer. This increase is mainly due to the fact that it was necessary for *p* = 20 particles to be used, hence the DGA was effectively simulating the model 20 times per PMCMC step. The dotted-dashed line shows the results from the standard MCMC method. It is vastly slower than the other two methods, and it was not feasible to achieve an accuracy of 10%. However, an estimate of the time it would have taken is obtained by fitting a line of slope −1/2 to the final point on the curve and extrapolating, an approximation that is used in the subsequent analysis.

Next, we discuss how the three methods compare in various limits. Figure 6*a* shows how the computation time needed to achieve an accuracy of 10% varies as a function of measurement uncertainty (i.e. the error bar sizes in figure 1*b*, which are set to a constant value *σ _{m}*). The solid curve presents results for MBP-MCMC, showing that reducing

*σ*from moderate to small values increases CPU time. This is because it becomes necessary to reduce the proposal jumping size

_{m}*j*in step 1 of §4 to retain a good chance of acceptance, which leads to longer mixing times. The reason large jumps in parameter space are not permitted is that the method for generating

_{F}*ξ*does not take into account the data

^{f}*y*, so changing many events inevitably leads to a large discrepancy between the observation and what

*ξ*suggests really happened, i.e.

^{f}*π*(

*y*|

*ξ*) in equation (4.4) becomes small. When

^{f}*σ*increases from moderate to large values in figure 6

_{m}*a*, the simulation time also increases, but for a different reason. Due to greater uncertainty in the measurements, the size of the posterior distribution enlarges. Correspondingly, the jumping size gets larger, but not at the same rate, again leading to longer mixing times.

The situation for the PMCMC simulations, as shown by the dashed curve in figure 6*a*, is similar although the underlying mechanisms are different. In the case of small *σ _{m}*, most, if not all, the particles end up sufficiently far from the measurements that they have a very low observation probability and hence they do not represent samples from the posterior. This results in a larger number of particles being needed to achieve a sufficient accuracy for hence a longer simulation time. The dotted-dashed curve shows that the standard MCMC method is much slower. However, in this case, it becomes faster as

*σ*→ 0 because the reduction in the size of the posterior more than compensates for the reduction in acceptance rate resulting from the stricter limitations imposed by the observations.

_{m}Figure 6*b* shows the case when the initial susceptible population size *S*^{init} is varied. The initial infected population is taken to be *I*^{init} = *S*^{init}/10, rounding to the nearest integer. *S*^{init} was set to have values varying between 10 and 1000. The size of the measurement error was also adjusted to maintain a constant relative error using
5.1

The results were averaged over 10 datasets generated using DGA simulations, with measurements taken at five time points separated at 20 time unit intervals. The speed of PMCMC and MBP-MCMC were found to vary in a similar way. The increase in CPU time is less than linear with *S*^{init} because the size of the posterior distribution shrinks as a result of the scaling used in equation (5.1).

Next, we investigate how the computation time depends on the number of measurements made on the system, assuming a constant time difference between them. Figure 1*b* shows the scenario in which five measurements are made. We now consider what happens if further measurements to this system are added at times *τ*_{6} = 120, *τ*_{7} = 140, etc… (i.e. at 20 time unit intervals). Alternatively, measurements can be removed from this diagram, starting from *τ*_{5} = 100. Results from this process are summarized in figure 6*c*. We find that MBP-MCMC scales similarly to PMCMC as the number of measurements is changed.

The new MBPs can equally be applied to other compartmental models. For example, figure 7*a* shows perhaps the most well-known example, the SIR model [6]. In this case, recovered individuals no longer become susceptible, but enter a ‘recovered’ compartment. This model is applicable to diseases which confer lifelong immunity on the host, such as measles or mumps. An example of such system dynamics is given in figure 7*b* in which a single infected individual enters a susceptible population. An initial explosion in infections (i.e. an epidemic) leads to exhaustion of the susceptible population, after which the remaining infected individuals slowly recover. As in figure 1*b*, we take the circles in figure 7*b* as the basis of hypothetical data. Figure 7*c* shows the posterior distribution generated using MBP-MCMC. Investigation of the scaling shows broadly the same behaviour as in figure 6, although the inference times are much shorter due to fewer events occurring than in the previous SIS example.^{12} MBP-MCMC is marginally faster than PMCMC and they are both much quicker than the standard MCMC technique (results not shown).^{13}

Lastly, we look at the scenario in which measurements are made on more than one state variable simultaneously. Figure 8*a* shows an example of this, which represents a spatial one-dimensional meta-population model. Here, several different spatial regions are considered (two shown in the diagram) whose dynamics are locally described by SIR models. Movement of individuals, characterized by a migration parameter *μ*, result in a coupling between neighbouring regions. The diagram in figure 8*a* can be extended to include any number of linked regions in a line. This simple one-dimensional model may, for example, represent different populations along a coastline in which the habitat restricts the population to lie close to the sea. Figure 8*b* shows an example of the dynamics exhibited by this model, using three regions. An infected animal is initially assumed in the first region, and this leads to an epidemic (see the dark peak on the left). For clarity, the curves showing variation in the susceptible and recovered populations have been omitted. Due to migration of infected individuals to the second region, this initiates a second epidemic occurring at a slightly later time, as shown by the middle peak, and finally in the third region (light peak on the right).

The circles in figure 8*b* represent hypothetical measurements made on the system. As in figure 7*b*, we assume that eight periodic measurements are made, but here they are made simultaneously across all regions. Based on these observations, the posterior distribution in the (*β*,*γ*) plane can be established using the approach outlined above (equally it would be possible to investigate inference on parameter *μ*). Figure 8*c* shows how the computational time taken to calculate this distribution within 10% depends on the number of regions being observed. We find for PMCMC that this dependency is exponential, and calculation becomes computationally demanding beyond four regions. Using MBP-MCMC, however, the computational time grows much more slowly. This indicates that MBPs can be useful applied to scenarios in which it is not realistically possible for PMCMC to give acceptable results.

## 6. Summary and conclusion

This paper addresses the computational challenge of generating samples from posterior distributions which arise when conducting Bayesian inference on DCTMPs.

MCMC can be used to implement such inferences [9,11] by constructing a Markov chain to explore the joint space of parameters and events. Standard MCMC approaches typically make independent proposals to the parameters and event sequence, and changes to the event sequence use simple proposals such as adding, deleting or moving individual events. Proposals are retained or rejected according to a Metropolis–Hastings probability. However, if changes in parameter and event space are proposed independently, large changes are typically rejected due to the strong dependence of probable event sequences on parameter values. This means only small changes are likely to be accepted and the algorithm explores the posterior distribution very slowly via a diffusion-like trajectory.

Here, we proposed a generic approach applicable to any DCTMP model to mitigate this problem which generates joint proposals of parameters and event sequences by exploiting the dynamics of the underlying model. Specifically, in MBP-MCMC when a parameter change is proposed, the model dynamics are used to construct a corresponding event sequence proposal. With such MBPs, even large changes have a relatively high acceptance rate and this enables a rapid exploration of the posterior. This novel approach was compared with a standard implementation of MCMC for DCTMPs and a recently proposed PMCMC technique [12].

Often previous attempts to develop complex proposals to help ameliorate mixing problems depend sensitively on exactly how they are implemented for each new model. One of the key advantages of MBPs is that for arbitrarily complex models, only a single tuneable parameter exists (*j _{F}* defining the jumping size in parameter space) which can easily be optimized to give a good acceptance rate. Thus, the MBP-MCMC method adapts to the structure of different models.

Homogeneous SIS and SIR models and a meta-population SIR model were used as benchmark computational tests for the three methods. The following conclusions are drawn:

— standard MCMC is extremely slow for all but the simplest models. It was found to be several orders of magnitude slower than MBP-MCMC and PMCMC;

— PMCMC and MBP-MCMC had comparable performance when measurements were made on a single compartment (with MBP-MCMC having a slight advantage). They were found to scale in the same way under variation in measurement uncertainty, measurement number and population size; and

— when measurements were simultaneously made on several state variables (as in, for example, a spatial meta-population model), MBP-MCMC was much faster than PMCMC.

## 7. Discussion

This paper has presented the case for a new type of generic MCMC proposal (termed ‘MBP’) used to speed up Bayesian inference on DCTMPs. This new proposal has the advantage of having only a single tuneable parameter (making optimization straightforward) and is applicable to arbitrarily complex models. Comparisons were made with standard MCMC and PMCMC approaches, and MBP-MCMC was clearly demonstrated to be significantly computationally faster. Although most of these comparisons were done in simple non-spatial models (the SIS and SIR models), a final application looking at a meta-population model demonstrates that MBPs can, indeed, be scaled up to more complex models. Furthermore, the relative difference in speed between MBP-MCMC and PMCMC was found to increase with larger model complexity, pointing to the fact that MBPs will permit rigorous statistical inference in complex problems that have so far remained computationally infeasible.

Within this paper, MBP-MCMC is compared to the simplest and most widely used version of PMCMC, namely PMMH. This uses a simple SMC scheme that simulates directly from the model and applies a particle filter after each measurement. As pointed out in [17], more advanced SMC techniques and particle filtering procedures have been developed, and these may improve the performance of PMCMC (at the expense of increased algorithm complexity). One strong possibility to improve efficiency is to generate a modified ‘simulation’ that, to some extent at least, takes into account the next set of observations (rather than simulating directly from the model in the SMC step). This may permit PMCMC to overcome its main shortcoming, namely being unable to cope well with multiple simultaneous measurements. However, due to the nonlinear nature of the models, these improved proposals are challenging to be made sufficiently computationally fast to be of practical use.

Another approach that shares the same fundamental aims as MBPs (i.e. to simultaneously propose changes in parameter values and the event sequence) is known as non-centred MCMC [24,25]. Here, the strict separation between model parameters and event sequence times is broken through reparametrization. A simple example, presented in [24], considers the SIR model in which recovery times are known. The difference in the recovery and infection times is assumed to be exponentially distributed. One way to represent the system is to consider some new quantity *U* drawn from a unit time exponential distribution, and from this reconstruct the infection times as the recovery time minus *U*/*γ*. Thus, proposed changes in *γ* simultaneously alter infection times. In complex problems many different reparametrizations often exist, and it can be challenging to ascertain the best approach. A key advantage of MBPs is that they can be applied to arbitrary DCTMP models without the need to consider this flexibility. Furthermore, unlike non-centred approaches, the jumping probabilities in MBPs (see equation (4.4)) are independent of model complexity and constrained solely by the observation model and prior.

We now briefly discuss ABC in relation to the methods applied in this paper. In particular, we show that there is no reason to believe it can be computationally faster than PMCMC for DCTMPs. One of the key advantages of ABC is that it is a likelihood-free approach. For the models investigated here, however, calculating the likelihood (which is effectively achieved in equation (3.2)) is actually significantly faster than performing a single simulation from the model. The simplest type of ABC algorithm, called the ‘rejection algorithm’, consists of sampling parameter values from the prior, simulating the model using these values and then accepting or rejecting the resulting event sequence depending on some statistic that characterizes how near it falls to the measurements (crudely implementing the observation model in equation (3.4) by, for example, introducing a cut-off observation probability *ɛ* below which the sample is rejected). Since during this study, we assume no prior knowledge regarding parameter values, this would mean that in order to identify the posterior distribution in figure 2, it would be necessary to randomly select points within this diagram (in a scatter-gun approach), the vast majority of which would have almost zero posterior probability. To overcome this highly inefficient approach, the ABC-MCMC approach has been introduced (e.g. [26]). This is essentially the same as PMCMC but using only a single particle (with the crude implementation of the observation model), and therefore would not be expected to be faster than PMCMC, and in some circumstances (when *ɛ* becomes smaller) would be much slower.

The concepts underpinning MBPs may be applicable to coarse-grained models of DCTMPs, for example by applying the tau-leaping method [27] or through the use of stochastic differential equations to model the dynamics of the system [21]. Furthermore, extensions of MBPs to generic non-Markovian stochastic discrete event-based models should be relatively straightforward. In addition, the construction of different MBPs (i.e. modifications to step 2 in §4) that satisfy the condition equation (4.4) would permit the application of this new approach to a much wider range of statistical models. These promising developments remain the subject of future work.

## Competing Interests

We declare we have no competing interests.

## Funding

We wish to thank the Scottish Government for funding through the Strategic Partnership in Animal Science Excellence (SPASE) and the BBSRC through the Institute Strategic Programme Grant.

## Footnotes

↵1 This is not necessarily a good approximation in many situations. For example, immune system dynamics dictate that the time taken for a patient to recover is not exponentially distributed. However, the Markovian approximation is often used because, in most cases, it does not change the system's qualitative behaviour.

↵2 In most scenarios, precise measurements are not possible, e.g. when studying disease in the wild a small sample of animals may be captured and tested and the results extrapolated to the entire population.

↵3 Other measurement error functions, such as Poisson or Gamma distributions, can be substituted into equation (3.3).

↵4 This assumption of conditional independence of measurement errors is for convenience; it is not a requirement for the methods outlined in this paper.

↵5 For instance, in the SIS model where

*π*(*θ*) =*π*(*β*,*γ*), if previous studies indicate that*β*is close to a particular value, then*π*(*β*,*γ*) will be highly peaked around that value.↵6 If there were fewer measurements the shaded area in figure 2 would become larger, leading to greater uncertainty in parameter values.

↵7 The same Metropolis–Hastings expression equation (4.4) is generated if we directly simulated

*ξ*from the model using the DGA. However, this typically produces system dynamics that are not in good agreement with the data^{f}*y*which, in turn, leads to extremely low acceptance rates. The advantage of MBPs is that they contain a quantity*j*that can be tuned to give good acceptance rates._{F}↵8 MCMC generates a single event sequence per iteration, whereas PMCMC generates

*P*simulations of the model. Consequently, in this particular example, PMCMC takes around 16 times longer to perform a single iteration.↵9 So the pseudo-random number generator produces a different set of random numbers.

↵10 CPU time was considered a more appropriate quantity for comparison than the more standard MCMC iteration number. This is because the three techniques require varying amounts of computation per iteration.

↵11 We expect that since MCMC is a sampling procedure, and samples separated by more than

*τ*_{AC}are effectively uncorrelated, its accuracy should scale as 1/√CPU time. On a log–log plot, this corresponds to a line of slope −1/2.↵12 Because animals can only be infected once in the SIR model, only 100 infection events can occur in this example.

↵13 The difference is not as large as for the SIS model due to fewer events and the random walker argument scaling in proportion to the event number squared.

- Received March 12, 2015.
- Accepted April 23, 2015.

- © 2015 The Author(s) Published by the Royal Society. All rights reserved.