## Abstract

We present a new statistical approach to analyse epidemic time-series data. A major difficulty for inference is that (i) the latent transmission process is partially observed and (ii) observed quantities are further aggregated temporally. We develop a data augmentation strategy to tackle these problems and introduce a diffusion process that mimicks the susceptible–infectious–removed (SIR) epidemic process, but that is more tractable analytically. While methods based on discrete-time models require epidemic and data collection processes to have similar time scales, our approach, based on a continuous-time model, is free of such constraint. Using simulated data, we found that all parameters of the SIR model, including the generation time, were estimated accurately if the observation interval was less than 2.5 times the generation time of the disease. Previous discrete-time TSIR models have been unable to estimate generation times, given that they assume the generation time is equal to the observation interval. However, we were unable to estimate the generation time of measles accurately from historical data. This indicates that simple models assuming homogenous mixing (even with age structure) of the type which are standard in mathematical epidemiology miss key features of epidemics in large populations.

## 1. Introduction

Over the last century, mathematical epidemiology has played a critical role in our understanding of the spread of communicable diseases in human populations. Continuous-time mechanistic models constitute the backbone of the discipline. One of the most studied models is the susceptible–infectious–removed (SIR) model, in which individuals are successively susceptible to infection, infectious and removed (may no longer transmit the disease). Denoting *S*_{t} and *I*_{t}, the numbers of susceptibles and infectives in the population at time *t*, respectively, new infections occur at rate *βS*_{t}*I*_{t}; recoveries at rate *γI*_{t} (Soper 1929; Bailey 1975; Anderson & May 1991).

Likelihood-based estimation of the parameters of such a model would be relatively straightforward if the times of infection and removal were observed for each case (Becker & Britton 1999); but this detail of data is rarely obtained in practice. In general, the underlying transmission process is partially observed (e.g. times of infection are observed, but not the times of removal; some cases are not reported), and observed quantities may be further aggregated (e.g. times of detection are aggregated weekly, monthly…). In this context, calculation of the likelihood quickly becomes intractable since it requires to integrate over all unobserved quantities.

Other concepts have therefore been used in an attempt to develop easier methods of estimation. For example, Becker (1989) and Becker & Hasofer (1997) rested on martingale methods to estimate transmission parameters when observations consist of the initial state of the epidemic, plus (i) the final state of the epidemic or (ii) the whole removal process. The approach provided simple but nevertheless efficient estimators of key quantities and approximate confidence regions for the parameters. However, it would be difficult to extend it to more complex situations, such as the one we are interested in, where (i) times of detection are temporally aggregated, (ii) the initial state of the system is unknown, and (iii) we must account for under-reporting, seasonality (and possibly long-term variations) in transmission rates. It seems that only likelihood-based methods can provide an integrated framework to deal simultaneously with these issues.

Over the last decade, data augmentation methods have been extensively used to tackle the missing data problem that makes likelihood-based estimation so tedious. The idea is to augment the observed data with the pieces of information required to write easily the likelihood; here the times of infection/removal. In a Bayesian setting, the joint posterior distribution of parameters and augmented data is then explored by Markov chain Monte Carlo (MCMC) sampling (Gilks *et al.* 1996). Using reversible jump MCMC sampling (Green 1995), the methodology has been extended to the situation where the exact amount of missing data is unknown, for example owing to under-reporting (Gibson & Renshaw 1998; Auranen *et al.* 2000; Cauchemez *et al.* 2006). Although the method is flexible and allows investigation of complex models, it is essentially limited by the size of the augmented data, which increases with the number of cases. Consequently, the approach has been used only for the data collected in small communities such as households (Auranen *et al.* 2000; O'Neill *et al.* 2000; Cauchemez *et al.* 2004) or schools (Cauchemez *et al.* 2006), when the number of cases does not exceed a few thousands. Computation times would become prohibitive when dealing with larger datasets, such as those collected by surveillance systems, for which the number of cases can easily reach tens of thousands.

For large epidemics in large populations, there is therefore no option but to find approximations of the SIR model, which are analytically tractable. Consider, for example, epidemic time-series data. These data typically provide counts of cases reported daily, weekly or monthly on a local or national ground. For inference, a natural choice is to approximate continuous-time models by discrete-time models (Finkenstadt & Grenfell 2000; Morton & Finkenstadt 2005). In these latter models, each time period is made of one generation of cases; generation of period *k* is simply the offspring of the generation of period *k*−1. However, an important constraint is that one observation period must effectively capture one generation of cases. This may be achieved only if the generation time of the disease (delay between infection of a case and infection of their typical secondary case) is equal to the length *T* of observation periods, or is a multiple of *T*. In the latter case, the data must be further aggregated, which may lead to an additional loss of information. There are therefore a variety of situations where discrete-time models cannot be used, since few generations may occur during a single observation period.

In this paper, we design a statistical framework to estimate the continuous-time SIR model from time-series data, when (i) the number of cases is too large to augment the data with the times of infection/removal of each case and (ii) epidemic and data collection processes have different time scales, so that the use of discrete-time models is excluded. To tackle the problem of temporal aggregation (and missing data), the data are augmented with the latent state {*I*_{kT}, *S*_{kT}} at the beginning of each observation period *k* (=time interval ]*kT*,(*k*+1)*T*]). The main difficulty is then to define the relationship between {*I*_{kT}, *S*_{kT}}, {*I*_{(k+1)T}, S_{(k+1)T}} and the observation (number of infections reported for period *k*). This is achieved by introducing a diffusion process that mimics the SIR process, for which exact solution is readily available. The diffusion process is the Cox–Ingersoll–Ross process, which is commonly used in finance to model interest rates (Cox *et al.* 1985). The method is applied to measles time series in London in the pre-vaccination era (1948–1964).

## 2. Material and methods

### 2.1 The SIR epidemic model

#### 2.1.1 The SIR model

The SIR epidemic model is a continuous-time Markovian model that describes the spread of a communicable disease in a population. Denoting {*S*_{t},*I*_{t},*R*_{t}}, the numbers of susceptibles, infectives and removed cases at time *t*, respectively, and *H*_{t} the *σ*-algebra generated by the history {*S*_{u},*I*_{u}, *R*_{u};0≤*u*≤*t*}, the model is defined by the following equations:(2.1)where *ν* is the birth rate; *β* is the transmission rate; and 1/*γ* is the mean infectious period. In this formulation, we neglect the mortality due to disease. We also neglect the number of individuals who leave the susceptible population owing to death or migration.

In practice, this continuous-time process is only partially observed, and observed quantities are further aggregated. Surveillance data typically consist of numbers {*U*_{k}}_{k=0,} _{…,} _{K} of new infections occurring during periods of length *T*, i.e. *U*_{k} is the number of times in interval ]*kT*,(*k*+1)*T*] when *I*_{t} increases by +1.

When epidemic and data collection processes have the same time scale (1/*γ*≈*T*), a discrete approximation of the model may be considered, where the expected number of cases *E*(*U*_{k}) for period *k* is proportional to the number of cases *U*_{k−1} for period *k*−1 (Finkenstadt & Grenfell 2000; Finkenstadt *et al.* 2002). However, such relationship no longer holds when 1/*γ*≠*T* since few generations of cases may then occur during a single observation period. There is then no option but to come back to the continuous-time model.

Figure 1 shows two possible trajectories for the number of infectives *I*_{t} consistent with four new infections occurring during period *k* (*U*_{k}=4). The larger rate of recovery in figure 1*a* implies that, although the same number of new infections is observed in figure 1*a*,*b* trajectories of *I*_{t} are very different. Owing to stochastic fluctuations, important differences could be observed between trajectories, even if the recovery rate was the same.

Let us first assume that we observe {*I*_{kT}, *S*_{kT}}_{k=0,} _{…,} _{K}, the numbers of infectives and susceptibles at the beginning of each observation period (in practice, this is not the case; these terms will be considered as nuisance parameters of the final inference framework). The main issue for inference is to determine the probability *P*(*I*_{(k+1)T}, *U*_{k}|*I*_{kT}, *S*_{kT}), the joint probability of the number of infectives at the beginning of period *k*+1 and the number of new infections for period *k* given *I*_{kT}, *S*_{kT}.(2.2)

In the TSIR framework, which relies on the relatively strong assumption that the number of infectives *I*_{t} during time period *k* is constant and equal to the number of new infections (so *U*_{k}=*I*_{(k+1)T}), only the first term of equation (2.2) is needed (the second term is equal to 1 for *U*_{k}=*I*_{(k+1)T}, 0 otherwise). This simplification does not hold here. Consider, for example, the situation in figure 1*a*; there is an infinity of values for *U*_{k} consistent with *I*_{(k+1)T}=2 and *I*_{kT}=2 (any *U*_{k}≥0 is consistent *I*_{(k+1)T}=2 and *I*_{kT}=2).

Without loss of generality, let us assume *k*=0.

#### 2.1.2 Approximation: Cox–Ingersoll–Ross diffusion process

Here, we need to make two additional assumptions. We assume that length *T* of observation periods is small enough to neglect within-period changes in the number of susceptiblesand that the transmission parameter is constant during an observation periodThe value of is discussed in appendix C.

Under assumptions *H*1 and *H*2, the number of infectives *I*_{t} is a birth and death process over time period [0,*T*], with birth rate and death rate *γI*_{t} at time *t*. When there is one infective at the beginning of the period (*I*_{0}=1), the number *I*_{T} of infectives at the end of the period follows a geometric distribution (Bailey 1964; Renshaw 1991). If *I*_{0}>1, the distribution of *I*_{T} is still available, but the resulting expression is ‘really too messy to be of much practical use’ as *I*_{0} increases (Bailey 1964; Renshaw 1991). Considering the development of a population of initial size *I*_{0} as being equivalent to the development of *I*_{0} separate populations each of initial size 1, it can be shown that the distribution of *I*_{T} is negative binomial; however, it is unclear how large *I*_{0} may become before the ‘populations’ can no longer be assumed to develop independently of each other (Renshaw 1991). Here, we introduce an alternative approximation where the assumption of independence is not required.

In the models described above, the number of infectives is a discrete variable. Here, we intend to model the *effective number of infectives* as a continuous variable . A natural candidate is the diffusion process that mimics the epidemic SIR process under *H*1 and *H*2, i.e. with trend and volatilitywhich is the solution of the stochastic differential equation(2.3)where , and *W* is the Brownian motion. Equation (2.3) characterizes the Cox–Ingersoll–Ross process, which is used to model interest rates on financial markets (Cox *et al.* 1985). Interestingly, the exact solution of equation (2.3) is readily available (Cox *et al.* 1985). It has a non-central *Χ*^{2} distribution with zero d.f. (Siegel 1979; see appendix A).

Instead of equation (2.2), inference will therefore rely on(2.4)For , the first term of equation (2.4) is (appendix A)(2.5)where ; ; and *f* is defined in appendix A. The mean and variance of are and , respectively.

The second term of equation (2.4) can only be approximated. We use the fact that, given the expected number of new infections , the number *U*_{0} of infections occurring in [0,*T*] is Poisson distributed with mean *Λ*_{0}. Assuming that may be approximated by a gamma distribution with mean *M*_{0} and variance *V*_{0} to be determined, a first approximation of the distribution is(2.6)which characterizes the negative binomial distribution with mean *M*_{0} and variance *M*_{0}+*V*_{0} (Poisson distribution with gamma distributed parameter). In this first approximation, it is assumed that, given *Λ*_{0}, the number of infections in [0,*T*] is independent of . In practice, however, there is the additional constraint that if , the number *U*_{0} of persons infected in [0,*T*] must be . We therefore condition the negative binomial distribution by .

Eventually, to determine the mean *M*_{0} and variance *V*_{0} of , we rely on the linear model(2.7)where *ϵ*_{0} is the error. Denoting , the scalars that minimize (minimization is performed analytically and there is no need to use a minimization routine; see appendix B), and the variance of , we approximate the mean *M*_{0} by , and the variance *V*_{0} by . We derive from the Laplace transform of (appendix B).

#### 2.1.3 Approximation when the hypothesis of mass action is violated

When the hypothesis of mass action is violated, different authors have suggested to use the force of infection with *ϵ* close to 0 in general (Finkenstadt & Grenfell 2000; Morton & Finkenstadt 2005). Unfortunately, the results of §2.1.2 apply only for *ϵ*=0 since the exact solution of equation (2.3) is not available otherwise. However, the force of infection can be linearized for *ϵ* close to 0. Denoting , the expectation of the average number of infectives over period [0,*T*] given the force of infection may be approximated by the following term, linear in :(2.8)We can use equation (2.8) in the model described in §2.1.2 for inference. An approximation of is given in appendix C.

### 2.2 Statistical framework

#### 2.2.1 Observed and augmented data

We consider the situation where data consist of the time series of numbers of reported cases, plus birth rates . Unobserved variables required for model specification are

the total number of cases

*U*_{k}(i.e. reported+unreported cases) during observation period ,the number of infectives at the beginning of observation period , and

the number of susceptibles

*S*_{0}at the beginning of the first period.

Given *S*_{0}, , the number of susceptibles at the beginning of period *k*, is given by the deterministic relationship(2.9)In the statistical framework, observations are augmented with , which may be considered as nuisance parameters of the model.

#### 2.2.2 Joint distribution

Denoting *Θ*, the parameters of the model, the joint distribution of observations, augmented data and parameters are(2.10)The first term on the r.h.s. of equation (2.10) characterizes the reporting process. Here, we assume that the number of reported cases follows a binomial distribution(2.11)where *ρ* is the proportion of cases which are reported.

The second term has been discussed in §2.2.1

The third term models the state of the system at the beginning of the follow-up. Denoting *M*, the size of the population, we assume that and *S*_{0} are uniformly distributed in [0,*M*] and , respectively. For London in the period 1948–1964, we specified *M*=10 000 000.

The last term gives our priors for model parameters. For parameters defined on ]0,∞[, we specify an exponentially distributed prior with mean 10^{10}. This distribution is flat on the range of values possible for the parameters. The reporting rate *ρ* has a uniform prior distribution *U*[0,1].

#### 2.2.3 MCMC sampling

The joint posterior distribution of augmented data and parameters was explored by Metropolis–Hastings MCMC sampling (Gilks *et al.* 1996). The following steps were sequentially applied.

Update the parameters

*Θ*: parameters defined on ]0, +∞[ were updated by a random walk on the log scale. Power parameter*ϵ*was updated by a random walk on the real line.Update the numbers of infections

*U*_{k}, for : an independence sampler was used to update*U*_{k}(Gilks*et al.*1996). New candidate was drawn as follows: , where*X*_{k}is drawn from the negative binomial distribution with . This choice is motivated by the fact that, if*U*_{k}is Poisson distributed with gamma (*a*,*b*)-distributed parameter, the distribution of is the proposal used here.Update the numbers of susceptibles

*S*_{0}at the beginning of the follow-up: a random walk was used.Update the numbers of infectives at the beginning of period : a random walk on the log scale was used.

To reduce correlation between the transmission rate *β* and the initial number of susceptibles *S*_{0}, it was useful to reparametrize the transmission rate , with .

The standard deviations of the proposals were tuned to obtain an acceptance rate of 20–40%. We performed 4 000 000 iterations for each run of the MCMC algorithm. The first 400 000 were discarded as the burn-in period. The output was then recorded on every 200 iterations to constitute a sample from the posterior distribution. One MCMC run took roughly 20 hours on a desktop. Convergence of the MCMC was visually assessed.

### 2.3 Applications

#### 2.3.1 Simulation study

We first considered the situation where transmission rates vary every two weeks with a period of 1 year, and where data are collected every two weeks too (*T*=14 days). Epidemics were simulated from the continuous-time SIR model with mean infectious period 1/*γ* equal to 7, 14 and 21 days. We also simulated epidemics where the hypothesis of mass action was violated (*ϵ*=3, 5, 7%). Eventually, we simulated epidemics from the susceptible–exposed (infected but not infectious)–infectious–removed (SEIR) model, with constant or exponentially distributed latent period (time period during which the subject is infected but not infectious, mean *L*=2, 3.5, 7, 10 days) and exponentially distributed infectious period, with mean *I*=7 days.

To assess how the method could cope with temporal aggregation in the data, we also simulated an epidemic for 20 years from the continuous-time SIR model with mean infectious period 1/*γ*=14 days. In the simulation, two seasons with high and low transmissibility, respectively, were defined for each calendar year. We then estimated parameters of the SIR model for different levels of temporal aggregation. It is not always possible to split seasons in equally sized observation periods. For example, for a target duration of observation period of eight weeks, each season (26 weeks) is split in three observation periods with size 8, 8 and 10, respectively (average duration: 8.7 days). We investigated scenarios where the average duration of observation periods was 1, 2, 2.9, 4.3, 5.2, 6.5, 8.7 and 13 weeks. Note that the length *T* of observation periods is not constant in a dataset; the method can account for these variations in *T*.

For all scenarios, birth rate in the simulations was *B*=2152 births per month, which is the average birth rate in London between 1944 and 1964. Simulation values for the parameters were defined, based on their posterior mean given historical data on measles transmission in London, and under appropriate constraints (e.g. mean infectious period equal to 7 days for the scenario 1/*γ*=7 days). For some scenarios (e.g. SIR model with 1/*γ*=14 days), simulation values were simply equal to their posterior mean. For other scenarios (e.g. *ϵ*=−7%), parameters were adjusted so that the average number of cases per year was roughly consistent with the one observed in London between 1948 and 1964.

#### 2.3.2 Comparison with existing methods: measles epidemics in London

We also analysed the time series of the number of measles cases, collected bi-weekly in London between 1948 and 1964 (http://www.zoo.ufl.edu/bolker/measdata.html; Finkenstadt & Grenfell 2000; Morton & Finkenstadt 2005).

Previous studies have shown that, for the pre-vaccination era in the UK, models in which parameters have only seasonal variations fail to exhibit the same cyclical pattern as observed epidemics (Finkenstadt & Grenfell 2000; Morton & Finkenstadt 2005). This result, which was also observed with the method presented here, is probably due to changes in the structure of the population (through changes in birth rates) that modify transmission parameters themselves. The problem has been previously tackled by the use of local regressions, leading to the estimation of a sequence of reporting rates (Finkenstadt & Grenfell 2000). Here, we use an alternative approach, where the person-to-person transmission rate is inversely proportional to the size of the core group—group of individuals who contribute the most to the chain of transmission—(De Jong *et al.* 1995). For measles, we assumed that the core group consists of children with age below 4 years (i.e. below 104 bi-weeklies), so that the size of the core group for observation period *k* isand the contribution to the force of infection of an infective is during this period, where varies bi-weekly, with a period of 1 year.

For *ϵ*=0, the effective reproduction number *R*_{t} (average number of persons infected by a typical case at time *t*) is simply for time *t* in period *k*. When the assumption of mass action is violated (*ϵ*≠0), the effective reproduction number also depends on the number of infectives in the population . However, this quantity can easily be computed from the output of our algorithm for the sequence of times .

Two runs of the MCMC algorithm were performed: (i) all the parameters of the model are estimated, including the mean infectious period 1/*γ* and (ii) the mean infectious period of measles is known and equal to 1/*γ*=14 days.

For this dataset, the TSIR method is expected to be applicable since it is reasonable to assume that the generation time for measles is approximately two weeks. For comparison purpose, we also estimated our model with the TSIR method (Finkenstadt & Grenfell 2000; Morton & Finkenstadt 2005). Under TSIR assumptions (see above), equation (2.2) simplifies to , where the number *U*_{k} of new infections occurring during period *k* has a negative binomial distribution with mean , variance *E*_{k}/*q* and density

## 3. Results

### 3.1 Simulation study

#### 3.1.1 SIR model

Figure 2 shows convergence of the MCMC algorithm for the SIR epidemic simulation with 1/*γ*=7 days. Convergence is quickly obtained.

Table 1 gives simulation values, posterior mean and 95% credible interval of the mean infectious period 1/*γ*, the initial number of susceptibles *S*_{0} and the reporting rate *ρ*. The mean (s.d.) of the relative error for *β*/*γ* is also given. For the three simulated datasets: posterior means are close to simulation values; simulation values are always within the 95% credible interval; and the relative error of *β*/*γ* is small (less than 3%). Figure 3 shows the seasonal variations of the ratio *β*/*γ* for the epidemic simulated with 1/*γ*=14 days.

#### 3.1.2 Estimation when the hypothesis of mass action is violated

Table 2 gives the simulation values, posterior mean and 95% credible interval of *ϵ*, 1/*γ*, *S*_{0} and *ρ* when the hypothesis of mass action is violated. Power *ϵ* is correctly estimated when the simulation value is less than 5%; it is overestimated otherwise. Estimates of other parameters remain satisfying for *ϵ*≤5%. For *ϵ*=7%, both the ratio *β*/*γ* and the power coefficient *ϵ* are overestimated by 30%.

#### 3.1.3 SEIR model

Table 3 gives the simulation values, posterior mean and 95% credible interval of 1/*γ*, *S*_{0} and *ρ* when the epidemic is generated from an SEIR model. In this context, the estimate of 1/*γ* corresponds to the generation time, i.e. the sum *L*+*I* of the latent and infectious period, rather than to the effective infectious period *I*. Estimates for other parameters and the relative error of *β*/*γ* remain satisfying.

#### 3.1.4 Accuracy of estimates according to the level of temporal aggregation

Table 4 gives the posterior mean, 95% credible interval and relative error of parameters according to the level of coarseness in the data. We find that, so far as the average duration of observation periods is less than or equal to 5.2 weeks (=2.5×generation time of the disease), relative errors remain small for all parameters. For larger degrees of temporal aggregation in the data, important biases are observed.

### 3.2 Measles epidemics in London in the pre-vaccination era

#### 3.2.1 Posterior distribution

Table 5 gives the posterior mean and 95% credible interval of 1/*γ*, *S*_{0}, *ρ* and *ϵ*. Figure 4 shows the seasonality, the trend of the transmission rate and the effective reproduction number when 1/*γ*=14 days. Minimum transmission is obtained during holidays, at the end of August.

When all parameters of the model are estimated, including the mean infectious period, we find that the mean infectious period is very short (3–4 days), the number of susceptibles at the beginning of the follow-up is roughly 400 000, half of the cases are reported and the hypothesis of mass action is violated (*ϵ*>0), although the estimate of *ϵ* is close to zero (95% credible interval: 0.46 and 1.30%). When the mean infectious period 1/*γ* is assumed to be known (=14 days), the number of susceptibles at the beginning of the follow-up is halved and the estimate of *ϵ* doubles (posterior mean 2.09 instead of 0.88).

Our estimates are similar to those obtained with the TSIR approach (see table 5 and figure 4*a*). The main difference is obtained for the power coefficient *ϵ* which is twice larger for the TSIR method.

#### 3.2.2 Model checking

We simulated epidemics from the model, with parameters equal to their posterior means. Simulations started at the beginning of year 1948 and birth rates for period 1944–1964 were an input of the simulations. For the posterior distribution with 1/*γ* estimated from the data (3.38 days), simulated epidemics faded out. For 1/*γ*=14 days fixed, figure 5 compares the observed and expected (average of 40 realizations) time series. The model captures the biannual pattern of the epidemics, although predicted incidence for inter-epidemic years is slightly more important than the observed one (figure 5*a*,*c*). The model also captures the magnitude of biannual epidemics, except for two of the three very large epidemics for which the incidence is underestimated (years 1955 and 1961; figure 5*a*). Predicted trend in the number of susceptibles is relatively close to the observed one (figure 5*b*).

## 4. Discussion

We have presented a method to estimate continuous-time epidemic models from time-series data. Compared with existing methods based on discrete-time models, the approach can be used when epidemic and data collection processes have different time scales or when data are collected at irregular intervals.

A diffusion process for which an exact solution is readily available was introduced to approximate the SIR epidemic process. In large populations, modelling the number of infectives with a diffusion process does not raise major concerns. Quite obviously, this choice would be much more questionable for data collected in small communities or households.

We proposed a simple approach to capture changes in the transmission rate due to modifications in the structure of the population. The basic idea is that an individual's number of contacts is fixed, so that each infective does not contact 10% more people if the population grows 10%. This leads to assuming that the person-to-person transmission rate is inversely proportional to the size of the core group (Anderson & May 1991; De Jong *et al.* 1995). The core group for measles is clearly the group of young children, although defining a clear cut-off appears to be relatively arbitrary. Here, we specified the cut-off at 4 years because the birth rate at a delay time of 4 years has been found to have a positive effect on the number of cases and a negative effect on the fade-out probability (Finkenstadt & Grenfell 1998). This is consistent with an increase in the person-to-person transmission rate when an important number of children leave the group of children under 4 years old (core group). Under the assumption that the core group was the group of 4–6-year-old children (early school years) and that there was a 4-year delay between birth and introduction to the susceptible compartment, the fit was similar to that for our baseline scenario. When the core group was the group of 0–9 years old, there was no improvement of the fit compared with the situation where it is assumed that core group size and therefore transmission rates remain constant over time (in both cases, the expected period of epidemic cycles was 1 year). Further research could try to determine which core group gives optimal fit in a more systematic way.

In a context where the TSIR approach is expected to be applicable (generation time≈duration of the observation period), we found that our estimates were similar to those obtained with TSIR. The main difference was observed for the power coefficient *ϵ*, which was larger for the TSIR model than our diffusion method. One possible explanation is that the relatively crude way TSIR deals with temporal aggregation in the data leads to the overestimation of the ‘gap’ between mass action and historical contact patterns. Apart from this effect, our results validate the use of the TSIR approach when generation time≈duration of observation period.

When the observation interval and generation time are not approximately equal, more refined statistical models are needed, such as the one we presented here. Using simulated data, we found that our approach provided accurate estimates for all transmission parameters (including the mean infectious period) so long as the observation interval was less than or equal to 2.5-fold more than the generation time of the disease. This suggests that, using weekly surveillance data, our approach could be used to study most of the more common respiratory diseases, even those with very short generation time (for influenza, for example, two recent estimates of the generation time obtained from independent datasets are 2.6 days (Ferguson *et al.* 2005) and 2.85 days (Wallinga & Lipsitch 2007), respectively)). The relevance of the approach for rare diseases is more difficult to assess since the reporting rate may then vary dramatically with time. When the effective reproduction number is high, the method might not be able to cope with the same level of temporal aggregation as for measles, since the assumption that the number of susceptibles is roughly constant during one observation period might start to break down. When we used an SEIR model to generate the simulated data, we found that the estimate of 1/*γ* from the SIR diffusion model used to fit the data corresponded to the generation time of the SEIR model. When the assumption of mass action was violated, estimates remained accurate so far as *ϵ* was relatively small.

Although our approach provided accurate estimates of the generation time for simulated data, estimation of the generation time for measles from historical data was, disappointingly, unsuccessful. There were two contradictory observations regarding the quality of the fit for estimated *γ* (1/*γ*=3.34 days) and fixed *γ* (1/*γ*=14 days). First, likelihood comparison suggests that the model with estimated *γ* (log-lik=−7011) has a better fit than the model with fixed *γ* (log-lik=−9226). However, epidemics simulated with 1/*γ*=3.34 days fade out quickly, while those simulated with 1/*γ*=14 days gave a good fit. A possible explanation is that the long- and short-term predictions contradict each other, perhaps because more refined modelling is required to capture structural changes in contact patterns. It is also possible that the high removal rate we estimated for compartment *I* is due to an overestimated flow into this compartment. This might for example happen if the reporting rate is positively correlated with the number of cases, which is a plausible phenomenon. However, estimates of 1/*γ* were unchanged when we defined different reporting rates for high/low incidence periods.

Our inability to estimate the generation time of measles therefore has an intriguing and novel interpretation. It suggests that standard models assuming homogenous mixing (even within an age-defined core group) miss key features of epidemics in large populations. Possible ways to relax the assumption of homogenous mixing models are of course well known (Anderson & May 1991)—for example allowing for heterogeneity in susceptibility/infectiousness, the age structure of the population, spatial substructuring or network structure. Determining which of these elements are needed to improve parameter estimates would provide an important insight into which are the most important determinants of epidemic patterns in large populations. This is the topic of ongoing research.

Assuming that the generation time was known, the method we developed provided estimates of other parameters consistent with estimates from the TSIR model (when the TSIR is applicable) and the model captured both the magnitude and biannual pattern of measles epidemics. The most relevant epidemiological model for measles is the SEIR model with a latent period of 6–9 days and an infectious period of 6–7 days (Anderson & May 1991). It is therefore relatively unlikely that measles cases are effectively infectious for 14 days. We nonetheless assumed 1/*γ*=14 days because we found that, when our approach was applied to data simulated from the SEIR model, the estimate of 1/*γ* corresponded to the generation time of the disease (sum of the latent and infectious periods).

In general, it is not possible to identify both the reporting rate and the transmission parameters. For surveillance data on influenza for example, by appropriate re-scaling of the transmission parameters, it is probable that data would be consistent with a wide range of reporting rates. The fact that here we were able to obtain a precise estimate of the reporting rate is therefore relatively intriguing. Our ability to estimate the reporting rate for measles (as our inability to estimate it for influenza) is due to the fact that measles, as opposed to influenza, confers permanent immunity. In a context where the system is roughly stationary (i.e. no major change in epidemic patterns over time and relatively similar birth rates over time), we expect that the number of susceptible individuals in the population is roughly stationary too. This is possible only if the flow of births is compensated by the flow of infections. In this context, a very intuitive estimate of the reporting rate is the coefficient *a* of the linear regression: (cumulated number of infections)=*a* (cumulated number of births)+*b*, with *a*=0.478. For diseases like influenza that do not confer permanent immunity, strong assumptions on the reporting process or the history of immunity are needed to estimate transmission parameters (Finkenstadt *et al.* 2005).

Standard results on the Cox–Ingersoll–Ross model (Cox *et al.* 1985) provided the distribution . However, more work was required to relate and to the observation, i.e. the number *U*_{0} of infections in [0,*T*]. No exact solution could be obtained for ; we could derive only an approximated form of the distribution. We used that, given the expected number of new infections , the number *U*_{0} of infections should be Poisson distributed with mean *Λ*_{0}. The main issue was then to find a good predictor of *Λ*_{0} given . The analysis of the Laplace transform (appendix C) suggests that the linear predictor used here has satisfying properties

*Λ*_{0}and are highly correlatedandThe predictor explains a large part of the variance of the expected number of new infections

We therefore assumed that was gamma distributed with mean and variance . The choice of the gamma distribution had no theoretical foundation, but simplified the computation since the number of infections *U*_{0} had then a negative binomial distribution.

In the pre-vaccination era, in large cities like London, measles was endemic and it is not necessary to model introduction of cases (Morton & Finkenstadt 2005). In smaller towns, fade outs were common. The statistical framework could be extended to take into account the introduction of cases in this context. Under the assumption that cases are introduced at the beginning of each observation period, stochastic differential equation (2.3) would still apply.

In the standard SIR model, it is assumed that the duration of infectiousness is exponentially distributed. This is motivated by mathematical tractability (under this assumption, the system is Markovian) rather than biological realism. Data augmentation techniques can cope with more realistic distributions for the duration of infectiousness (Cauchemez *et al.* 2004), but those techniques are available for relatively small datasets only. For time-series data, designing estimation methods that do not rely on the Markovian assumption is the subject of further research.

## Acknowledgments

We thank the MRC, European Union FP6 SARSTRANS and INFTRANS projects, and the NIGMS MIDAS initiative for research funding.

## Footnotes

- Received November 5, 2007.
- Accepted November 29, 2007.

- © 2007 The Royal Society