## Abstract

Back-calculation is a process whereby generally unobservable features of an event leading to a disease outbreak can be inferred either in real-time or shortly after the end of the outbreak. These features might include the time when persons were exposed and the source of the outbreak. Such inferences are important as they can help to guide the targeting of mitigation strategies and to evaluate the potential effectiveness of such strategies. This article reviews the process of back-calculation with a particular emphasis on more recent applications concerning deliberate and naturally occurring aerosolized releases. The techniques can be broadly split into two themes: the simpler temporal models and the more sophisticated spatio-temporal models. The former require input data in the form of cases' symptom onset times, whereas the latter require additional spatial information such as the cases' home and work locations. A key aspect in the back-calculation process is the incubation period distribution, which forms the initial topic for consideration. Links between atmospheric dispersion modelling, within-host dynamics and back-calculation are outlined in detail. An example of how back-calculation can inform mitigation strategies completes the review by providing improved estimates of the duration of antibiotic prophylaxis that would be required in the response to an inhalational anthrax outbreak.

## 1. Background

The term ‘back-calculation’ was first coined with respect to infectious diseases in the late 1980s when a methodology for obtaining short-term projections of the acquired immune deficiency syndrome (AIDS) epidemic was proposed [1–6]. The methodology essentially follows from the concept that for each infected person the time of symptom onset is equal to the addition of the time of exposure and the incubation period. When considering the exposure of multiple persons the function of addition is generalized to a convolution of distributions, i.e. a ‘merging’ of the distribution of exposure times with the incubation period distribution. Such a convolution is assumed to reflect the distribution of symptom onset times, often referred to as the epidemic curve. However, usually this process is reversed in order to back-calculate the (generally unobservable) times of exposure by deconvolving the (generally observable) epidemic curve and the previously characterized incubation period distribution.

Back-calculation was originally developed for AIDS, which exhibits a relatively long time (of the order of years) from exposure to symptom onset. The technique was extended in the late 1990s during the bovine spongiform encephalopathy (BSE) [7–9] and variant Creutzfeldt–Jakob disease (vCJD) [10–13] epidemics, diseases that also exhibit long incubation periods. Studies undertaken during the course of these epidemics found that the susceptibility of a person to infection was age-dependent. Therefore, age structure was incorporated into the back-calculation process for these particular applications. In the first decade of the twenty-first century the focus of back-calculation shifted to bioterrorism following the US anthrax attacks in 2001. Retrospective analysis of the outbreaks in Florida, New Jersey and Washington allowed an estimation of the number of cases that might have been prevented by the distribution of antibiotics to potentially exposed persons [14,15] as well as the average inhaled dose of the causative agent [16]. Transmissible diseases of concern such as smallpox and pneumonic plague have also been evaluated by back-calculation in order to estimate properties such as the duration of infectiousness and how infectiousness varies with disease stage [17–19].

The purpose of this study is to review the more recently developed back-calculation techniques that can be applied in a real-time setting (i.e. prospectively) or shortly after the end of an outbreak (i.e. retrospectively) for non-transmissible acute infectious diseases where there is prior knowledge of the incubation period. Such a review is important, because estimating key parameters of the exposure to a pathogenic organism could help to improve situational awareness of, and responses to, the resulting outbreak.

## 2. Incubation period

The incubation period is defined as the time from pathogen exposure to the onset of symptoms [16,20]. Owing to factors such as host susceptibility, size of inoculum, strain virulence and chance, infected persons can experience different incubation periods. Therefore, the incubation period is more appropriately characterized by a distribution rather than by a single value. In the absence of an understanding of the within-host mechanisms leading to symptom onset (see §3.2.1 and §5), parametric distributions are often fitted to data gathered from disease outbreaks where it is assumed that such data are independently and identically distributed. For example, Sartwell [21] found that the lognormal distribution could be used to model the incubation period of a number of acute infectious diseases. Two other frequently considered parametric distributions are the gamma and Weibull [22–25]. Estimating the parameters of these distributions is commonly achieved by determining the parameter values that maximize the likelihood function, *L*, provided below
2.1In equation (2.1), *m** ^{n}* = {

*m*

_{1},

*m*

_{2}, … ,

*m*} represents a vector of incubation periods for each person

_{n}*i*where

*n*is the total number of exposed persons who eventually succumbed to the infection in the absence of intervention [26]. These data are used to estimate

**, a vector of parameters of the probability density function (PDF)**

*α**f*describing the incubation period distribution. Assuming that the symptom onset time is known to a relatively high degree of accuracy (e.g. to the nearest hour) then equation (2.1) is applicable in a limited number of situations where the exposure time is also known to a similar level of precision. Such situations might arise where an infected person has had only one contact of a limited duration with an infectious person or contaminated site, or is deliberately exposed as part of an experimental protocol. Unfortunately, in many situations, a person's incubation period is not known exactly, often because the exposure event has not been exactly observed. For example, infected persons might recollect the one particular day that they visited a contaminated environment but might not know the exact time during that visit when they were exposed. As a result, if it is assumed that only the time of symptom onset is known to a high degree of accuracy then the data are termed ‘singly interval-censored’ as each incubation period lies within the interval [

*l*,

_{i}*r*], where

_{i}*l*<

_{i}*r*[22]. Assuming that the exposure events are independent of time, the likelihood function for such data can then be written as follows 2.2

_{i}In equation (2.2), *F* represents the cumulative distribution function (CDF) of the incubation period. If the data consist of both exact and singly interval-censored data, then the required likelihood is simply the multiplication of equations (2.1) and (2.2) [22] with the data suitably grouped. In the common situation where both the time of exposure and symptom onset are not known exactly, then the data are termed ‘doubly interval-censored’ requiring a more sophisticated likelihood function as described in detail in [27]. Non-parametric approaches can also be applied to both exact and interval-censored data [28,29]. Table 1 shows some of the freely available R packages [30] that can be used to parametrize an incubation period distribution. In addition, an alternative method was developed in response to the SARS pandemic that requires detailed data regarding who-infected-whom [36]. The likelihood for this method actually reduces to an equation very similar to that described in equation (2.2), although this is clearly only applicable to transmissible diseases.

## 3. Temporal models

### 3.1. Retrospective analysis

#### 3.1.1. Instantaneous exposure

Assuming that the disease responsible for an outbreak has been identified and that there is a suitable *a priori* parametrization of the associated incubation period distribution then the simplest application of the back-calculation process is to the ‘common exposure’ (or point source) scenario. For example, imagine a situation where a number of people have been exposed to a food-borne pathogen at time *T* when they ate the same meal. When those infected persons start to display symptoms, standard epidemiological techniques could help to identify the causative food if those persons who became ill ate together on only one or a limited number of occasions. However, if the outbreak happened to occur where people eat together on a regular basis (i.e. within an enclosed institution [37] such as a school, barracks or prison), then it might be more difficult to distinguish between the harmful and harmless meals. To address this problem, Sartwell [38] applied the method of quantiles to estimate the date of common exposure *T* in four outbreaks where the actual time of exposure was known and found that the estimates were all accurate to within 1 day. The Japanese theoretical epidemiologists Hirayama and Horiuchi also used and extended similar techniques although such approaches are somewhat sensitive to a user-defined optional value [20]. The more robust method of maximum-likelihood estimation was described independently by Hill [39] and Tango [40], who used a lognormal distribution to estimate the date of common exposure to a smallpox and a food-borne outbreak, respectively. The likelihood function for this technique is
3.1In equation (3.1), *t** ^{n}* = {

*t*

_{1},

*t*

_{2}, … ,

*t*} represents a vector of symptom onset times, and therefore, the epidemic curve is simply the incubation period distribution shifted by time

_{n}*T*(i.e.

*m**=*

^{n}

*t**−*

^{n}*T*). If an

*a priori*parametrization of the incubation period distribution either does not exist or is no longer assumed to be applicable for any outbreak caused by the same pathogen, then equation (3.1) can similarly be used to jointly estimate both

*T*and

**[40].**

*α*#### 3.1.2. Continuous exposure

If the exposure of a number of persons is believed to have occurred over an extended period rather than at a single point in time, then exposure times can be modelled by a PDF *g*(*t*; ** β**), where

*t*is the time-dependent variable and

**is a vector of parameters.**

*β**g*(

*t*;

**) can be considered to be a generalization of the Dirac delta function represented by**

*β**T*in §3.1.1. The PDF of symptom onset times

*h*(

*t*;

**,**

*α***) is no longer equivalent to the shifted incubation period distribution but represented by the function 3.2**

*β*The likelihood function in equation (3.1) can then be generalized to the following
3.3For example, during a Legionnaires’ disease (LD) outbreak where the causative agent has been aerosolized over a number of days from, say, a poorly maintained air-conditioning unit, the resulting number of exposures might be considered to be equal for each day of the release. This process can be modelled by a uniform PDF with the two parameters of the distribution representing the start and end dates of the release [23]. The uniform model of exposure when convolved with the incubation period distribution has been shown to be a reasonable or good model for approximately half of the LD outbreaks analysed, but in a few instances, the early stages of the outbreak were not captured sufficiently well. In order to address this issue, instead of the release starting abruptly, the number of exposures were assumed to have increased over time to reflect the growth of organisms within the source. This process requires a slightly more complex exposure model involving an additional parameter and was shown to be a good extension for the LD outbreaks in Miyazaki and Barrow-in-Furness [23,41]. However, approximately one-third of LD outbreaks analysed were not well described by a convolution of the simple parametric distributions described above. In such situations, consideration should be given to a non-parametric back-calculation technique most recently applied to a large food-borne outbreak in northern Europe [42,43]. This method is based on the expectation–maximization–smoothing algorithm where the expected number of exposures for each time unit are calculated under the assumption of a Poisson distribution, but without any parametric assumption on how the exposures evolve in time [44]. The advantage of the non-parametric approach is that it allows greater flexibility in the estimation of the exposure times, but arguably with the loss of a mechanistic understanding of the exposure process. It is worth noting that the original descriptions of both the parametric [1] and non-parametric [45,46] continuous exposure approaches were first developed and applied to the AIDS pandemic. In addition, equation (3.2) would be equally applicable if the functions *h* and *g* represented the intensities of the point processes (e.g. the daily number) of symptom onsets and exposures, respectively, rather than probability densities [47].

### 3.2. Prospective analysis

#### 3.2.1. Instantaneous exposure

In a situation where the incubation period distribution extends to only a few days (e.g. food-borne botulism [48]), then public health authorities may only become aware of the outbreak after it has ended, and therefore only a retrospective analysis of the data is likely to be possible. However, if the incubation period is somewhat longer (e.g. food-borne listeriosis [49]), and the authorities become aware of the outbreak as it is progressing then prospectively analysing the situation in real-time would require slightly different statistical techniques. Let *τ* represent the time during an outbreak at which a total of *k* persons have started to display symptoms. *τ* is termed the ‘censoring time’ of the outbreak and the data are termed ‘right-censored’ (cf. ‘interval censored’ in §2). This is because with a potential outbreak size (in the absence of intervention) of *n* persons there are still *n* − *k* persons who are yet to display symptoms (i.e. they are still progressing through their incubation period to the *right* of time *τ* when visualizing the distribution). If the exposure time *T* is known, then the number of symptomatic cases *k* by time *τ* has a binomial distribution with unknown potential unmitigated outbreak size *n* and success probability *F*(*τ* − *T;* ** α**) [15]. The likelihood function is therefore given by
3.4

Equation (3.4) can be solved analytically to provide a simple equation for the maximum-likelihood estimate [26] given by 3.5

If the time of exposure *T* is also unknown then the likelihood function *L* is more complicated, because both the number of symptomatic cases *k* by time *τ* and their individual onset times *t** ^{k}* = {

*t*

_{1},

*t*

_{2}, … ,

*t*} are informative about the unknown parameters 3.6

_{k}Equation (3.6) has been suggested as a useful tool in response to an inhalational anthrax attack [50] where the release of the causative agent, and subsequent exposure, is assumed to be near-instantaneous. A slight variant to equation (3.6) has been used to estimate *n* and the incubation period parameters **α** for an assumed lognormal distribution following the inhalational anthrax outbreak in Sverdlovsk, Russia [26]. This was possible, because the time of the release was considered to be known in that particular outbreak. Note that equation (3.6) reduces to equation (3.1) when *k* = *n* which occurs as the censoring time *τ* → *∞*. For inhalational anthrax, the incubation period distribution has been shown to be dependent on the inhaled dose (i.e. the number of spores inhaled) *d*, predominantly at high inhaled doses [51,52]; thus, requiring an additional parameter (i.e. *f* = *f*(*t*,*d*; ** α**) and

*F*=

*F*(

*t*,

*d*;

**)). As a result, equation (3.6) can also be used to estimate the ‘representative dose’ of an exposed population [53], assuming that each person received the same inhaled dose**

*α**d*. However, it has been found that the representative dose is difficult to estimate with limited data, which is generally the situation during the early stages of an outbreak [53,54]. Interestingly, for transmissible diseases, accurately identifying parameters is similarly problematic when attempting to jointly characterize the generation interval (i.e. the time between a case being exposed and their subsequent transmission to another case) distribution and

*R*

_{0}, a key measure of the transmissibility of a disease, from early time-series data [55]. These are inherent parameter identifiability problems [56] where the likelihood can remain flat over a large range of parameter values. To counter this setback in the non-transmissible inhalational anthrax situation, if the exposed population and the time of common exposure are considered known, as was the situation for the postal workers in the US anthrax attacks of 2001 [16], then it is possible to obtain a more accurate estimate of the representative dose, which is usually unknown in real-life situations. This follows from the observation that the attack ratio

*AR*, which is the proportion of persons who would have succumbed to infection in the absence of intervention

*n*given an exposed population

*E*, is often found to be dependent on the inhaled dose

*d*, i.e. 3.7In equation (3.7),

*p*is often referred to as the probability of infection given an inhaled dose

*d*, or dose–response function where

**is a vector of parameters. Assuming that**

*γ**p*has been parametrized

*a priori*, by combining equation (3.7) with equation (3.5), it is possible to remove

*n*and solve for the only remaining parameter

*d*. Similarly, if the time of common exposure

*T*is unknown, then equation (3.7) can be combined with equation (3.6) instead to jointly estimate both

*d*and

*T*[16].

*n*can subsequently be estimated indirectly by using equation (3.7) with the known

*E*and the estimate of

*d*.

This latter process can be illustrated with a fictitious example (devised for this review) in which *E* = 1300 people who are situated inside the same building on 30 March are exposed to an aerosolized release of the causative agent of anthrax. Each person is assumed to inhale *d* = 1000 spores, and their probability of infection is derived from a within-host model where the inhaled spores can either germinate in the lungs of the host (resulting in infection) or be cleared by the host's defence mechanisms [16,52]. In the form of equation (3.7) with ** γ** = {

*λ*,

*θ*}, this ‘competing risks’ model is given 3.8In equation (3.8), the parameters

*λ*= 1.056 × 10

^{−5}per day and

*θ*= 0.131 per day are the germination and clearance rates of spores from the lung, respectively [52]. This particular scenario results in

*n*= 100 persons who, in the absence of intervention, progress through a lognormal incubation period distribution with parameters exp(

*µ*) = 8.73 days and exp(

*σ*) = 1.93 days [52]. If then, on 1 April, two persons start displaying symptoms followed by four persons on the next day and a further five persons on the following day, figure 1 shows the estimated time of common exposure and representative dose based on

*τ*= 2 days (day 0 is the first day of the outbreak) and knowledge of the number of exposed persons. Figure 2

*a*shows a forecast of the number of expected symptomatic casualties over the next month in the absence of any public health intervention by plotting time on the

*x*-axis against on the

*y*-axis—see §5 and §6 for further details regarding the other curves. Figure 3 provides a measure of the uncertainty associated with the forecast of symptomatic persons by sampling from the joint distribution of

*T*and

*d*. Note that the derived estimate of persons along with its confidence interval {30,140} can be somewhat appreciated by viewing the endpoint of the cumulative forecast given by

*nF*(

*t*−

*T*,

*d*;

**) on the**

*α**y*-axis of figure 3

*b*.

#### 3.2.2. Continuous exposure

In a similar manner to the retrospective analysis of §3.1, the continuous exposure model can be generalized from the instantaneous model for a prospective analysis [23]. Perhaps the most significant conceptual difference is that in this situation, the exposure could still be occurring at the time when the parameters describing the exposure process are being estimated. The CDF of symptom onset times *H*(*t*; ** α**,

**) is represented by the function 3.9**

*β*The likelihood function in equation (3.6) is then generalized to the following 3.10

In a real-time setting while some number of new cases is continuing to occur, the exposure process cannot be estimated to end beyond the time at which the inferences are being made (i.e. it is impossible to estimate when in the future the exposure process is going to stop, if it has not already done so). If the estimated end time of the exposure process is equivalent to the censoring time of the outbreak, then an appropriate interpretation is that exposures are actually still ongoing. The estimate of the end time of the exposure process affects the estimate of the potential size of the outbreak which is necessarily underestimated while persons continue to be exposed but is captured reasonably well shortly after the exposure stops. The start of the exposure process is captured reasonably well even in the early stages of the outbreak as it is the early data that predominantly dictates the estimation of this parameter which is largely unaffected by data later in the outbreak [23]. In a manner similar to §3.2.1, equation (3.10) reduces to equation (3.3) when *k* = *n* which occurs as the censoring time *τ* → ∞.

## 4. Spatio-temporal models

### 4.1. Retrospective and prospective analysis

The models described in §3 use purely temporal data, i.e. the times of symptom onset of a disease together with the number of cases at specific points in time. Spatial data are only implicitly considered in these models, for example, if there were a number of LD cases presenting at London-based hospitals and New York-based hospitals at approximately the same time, then it is probably reasonable to assume that there are two distinct outbreaks requiring separate investigation, once the possibility of transatlantic travel had been ruled out. In such a situation, the temporal models could be applied to the two independent sets of data. Even if there are concurrent cases occurring within the same country then initial epidemiological investigation possibly combined with cluster detection techniques [57] might indicate the occurrence of multiple outbreaks. Indeed, some of the temporal models described in §3 were separately applied to the three main clusters of inhalational anthrax in the 2001 US attacks [14–16].

When cases are considered to be linked together as a consequence of epidemiological investigation, then data relevant to their spatial locations during the time prior to their disease onset (captured by the incubation period distribution) can allow further analysis to help characterize certain features of the exposure process. For example, epidemiological investigations of an LD outbreak might identify that the vast majority of cases had been inside the same building at some recent time prior to illness and that this was the only common location among such cases. However, when there is no obvious common location, then it might be reasonable to assume that exposure had occurred at some distance from the source (i.e. the release location) owing to aerosolization of the causative agent and its subsequent dispersion in the atmosphere. Investigations of such community-acquired outbreaks often use attack ratios (see §3.2.1 for a definition) to highlight geographical areas where the number of infections per unit population is highest which can help to identify the source. There are two common approaches taken with regard to attack ratio analysis [58]. The first is to consider a number of potential sources that are known about in advance, such as cooling towers, and then to calculate the attack ratios in annuli around such sources; a decreasing pattern of attack ratios with distance from a potential release location might be indicative of the true source. The second is to place no *a priori* assumption regarding potential sources and instead calculate attack ratios by administrative geography; the spatial unit with the highest attack ratio might then contain the source, or at least be near to it. Table 2 provides some examples of the use of both approaches together with each study's choice of annuli distances or average population size of the spatial unit considered.

A more sophisticated technique is to retrospectively use atmospheric dispersion modelling combined with meteorological data to test whether a potential release location could have resulted in exposure to the pathogen that is consistent with the observed spatial pattern of cases [59,65]. Atmospheric dispersion modelling has similarly been applied in a retrospective manner to the Sverdlovsk inhalational anthrax outbreak to help identify the time and location of the release of the causative pathogen [69]. The authors were able to collect data regarding the residential and work addresses of cases via administrative/hospital records and household interviews with survivors or friends/relatives of those who died. Dates of symptom onset, hospital admission and death were most commonly obtained from pathologists’ notes and meteorological data were obtained from 3-hourly surface observations reported at a local airport. A re-analysis of the Sverdlovsk spatio-temporal data [70] sought to estimate the strength of the release (i.e. the number of spores released) by considering a range of potential dose–response functions (see §3.2.1 for a definition).

The process of fitting an atmospheric dispersion model to early case data has been formalized into a statistical framework [54]. This methodology has been applied to the early stages of a simulated inhalational anthrax outbreak to estimate the time, location, height and strength of a single near-instantaneous aerosolized release. Having back-calculated these parameters, the model can then be run forward to estimate the extent of the release, i.e. those areas that have been sufficiently exposed to warrant prophylactic treatment of all persons who were present in such areas at the time of the release. Input data include meteorological data and both day-time and night-time population densities together with the symptom onset times and home/work locations of early cases. Airborne dispersion following a single near-instantaneous release can be represented by a Gaussian puff model [54], which describes the concentration of released material in the atmosphere during the period following the release. In a biological release scenario a relatively simple extension to this model can allow for the exponential decay of the causative pathogen in the atmosphere [71]. Integrating this extended Gaussian puff model over time provides a formula for the total ground level inhaled dose *d* at various spatial locations
4.1or alternatively in terms of the standard normal distribution
4.2In equations (4.1) and (4.2), *S* and ** W** =

*{X*,

*Y*,

*Z}*are the strength (i.e. the number of spores released) and location vector of the release, respectively,

*x*is the distance downwind in the direction of the wind,

*y*is the crosswind distance and

*z*is the vertical distance.

**= {**

*δ**b*,

*u*,

*σ*,

_{x}*σ*,

_{y}*σ*,

_{z}*ζ*} are the input parameters, where

*b*is the breathing rate,

*u*is the wind speed and

*σ*,

_{x}*σ*,

_{y}*σ*are the downwind, crosswind and vertical standard deviations of the concentration distribution (also known as the dispersion parameters [72]), respectively, which predominantly depend on whether the release occurs during the day or at night and also on the wind speed and distance from the release location [71,73].

_{z}*ζ*is the decay rate of the pathogen which similarly varies with the time of day,

*c*

_{1}=

*σ*

_{x}

*ζ*/

*u*and

*c*

_{2}= (

*x*−

*X*)/

*σ*

_{x}.

*ϕ*and

**are the PDF and CDF of the standard normal distribution, respectively. Note that if**

*Φ**ζ*= 0, then equations (4.1) and (4.2) reduce to the ground-level inhaled dose without organism decay given by 4.3or alternatively in terms of the standard normal distribution 4.4

The inhaled dose calculated via a Gaussian *plume* model, which is generally used to describe the concentration of material in the atmosphere resulting from a *continuous* release, can be adapted to consider an instantaneous release [72]. Equations (4.3) and (4.4) are equivalent to using this modified Gaussian plume model, because by using the dispersion parameters given by the Briggs formulations [73], it can be shown that ** Φ**(

*c*

_{2}) ≈ 1 for all distances downwind of the release location; indeed, it was this version that was used in the original Sverdlovsk analysis [69,72]. The prospective likelihood function for an instantaneous aerosolized release, regardless of the form of the equation describing the inhaled dose

*d*, is given by 4.5

In equation (4.5) *q*^{k} = {*t*_{1}, *x*_{v1}, *y*_{v1}, *x*_{w1}, *y*_{w1}, *t*_{2}, *x*_{v2}, *y*_{v2}, *x*_{w2}, *y*_{w2}, … ,*t*_{k}, *x*_{vk}, *y*_{vk}, *x*_{wk}, *y*_{wk}} is a vector of early case data where *ν* and *w* represent a person's home (or day-time) and work (or night-time) location, respectively. It is important to note that for simplicity (and for the pragmatic reason that such data are more likely to be known) each exposed person is assumed to be exposed at one of these two locations and not elsewhere. For a retrospective analysis, equation (4.5) reduces to the likelihood function given by
A similar methodology that attempts to initially detect and subsequently characterize an instantaneous aerosolized release has also been developed [74,75]. This algorithm also uses equations (4.3)/(4.4) for the calculation of the inhaled dose, but the input is based on pre-diagnostic (syndromic) medical surveillance data rather than based on confirmed cases. Another technique that seeks to characterize a *continuous* release of radioactive material uses the Gaussian plume model which can be given as equations (4.3)/(4.4) but with *S* redefined as the emission rate rather than the total quantity of material released [76,77]. All three approaches find that the strength *S* and height *Z* are correlated. This appears to be a parameter identifiability problem similar to the dose-dependent version of equation (3.6) (i.e. a larger and higher release could result in an exposure pattern with the same likelihood as a smaller and lower release). For this reason, a fourth algorithm that uses environmental sampler data simplifies the problem by assuming a near ground-level release [78]. Table 3 provides a summary of the four spatio-temporal models. Note that the choice of technique implemented in each reference may have been due to the current state of each algorithm's development but may also be dependent on the frequency of the receipt of data. For example, the methodology associated with the notifiable disease system may receive data far less frequently than environmental samplers which may receive an almost continuous feed of data. It should also be noted that techniques for estimating parameters associated with an aerosolized *chemical* release based on concentration measurements have been studied in some detail but will not be discussed further here (e.g. see [79,80] and references therein). Table 4 provides a summary of all of the temporal and spatio-temporal back-calculation techniques outlined in this review.

## 5. Mitigation strategies

### 5.1. Retrospective and prospective analysis

When used in a response situation the prospective temporal and spatio-temporal models of sections 3.2 and 4.1, respectively, can potentially inform mitigation strategies. For example, in the event of an inhalational anthrax outbreak, public health authorities would be likely to seek to distribute prophylactic antibiotics to those persons who have been exposed to the causative agent in order to reduce their probability of illness and death. Prophylactic vaccination might also be considered as an adjunct to the use of antibiotics in relation to alleviating the potential problems associated with antibiotic resistance and reducing the otherwise long-term antibiotic course that is required to prevent inhalational anthrax from developing. Indeed, within-host modelling studies have indicated that the duration of the antibiotic course required to prevent illness is dependent on the inhaled dose *d* (i.e. the higher the inhaled dose, the longer the antibiotic course required). The framework that formulated this concept originally assumed that an antibiotic course would start immediately at the point of first exposure [16]. It was shown that exposed persons who have an initial probability of infection *p* would need to remain on antibiotics for *ψ* days in order to reduce their probability of developing disease to *ρ* where *ψ* is given by
5.1

Two other studies that have sought to capture more accurately the within-host mechanisms leading to inhalational anthrax have also used equation (5.1) [52,81]. Such a framework was extended to consider the impact of persons starting antibiotics at *t*_{1} days after first exposure, stopping at time *t*_{2} and perhaps starting again at time *t*_{3} [82]. More recent extensions have considered the conditional probability *p*_{c} of an exposed person becoming ill *given* that she/he is not ill at the time *τ* − *T* of starting an antibiotic course [83]:
5.2

In this situation, those persons who have been exposed and are asymptomatic would need to remain on antibiotics for *ψ*_{c} days in order to reduce their conditional probability of developing disease to *ρ*_{c} where *ψ*_{c} is given by
5.3It should be emphasized here that the following two paragraphs are the only novel extensions to previously published research outlined in this review. Equation (5.3) is particularly relevant to a real-time response situation. For example, during the US anthrax attacks of 2001, *k* = 4 cases became symptomatic on the fourth day after exposure in the Brentwood postal facility, i.e. {*τ* = 0, *T* = −4} days, out of an exposed population of *E* = 2747 [16]. Having estimated the inhaled dose *d* from a combination of equations (3.5) and (3.7), it was calculated from equation (5.1) that ‘26 and 59 days of antibiotic prophylaxis would be adequate to reduce the probability of disease to <1/1000 and 1/10 000, respectively’. However, given that the persons due to receive antibiotics were still asymptomatic *τ−T* = 4 days following exposure, a more accurate measure would be to use equation (5.3) instead. Therefore, 22 and 55 days would be a more appropriate estimate of the antibiotic course required to reduce their conditional probability of disease to less than 1/1000 and 1/10 000, respectively, assuming that such persons started to take their antibiotics immediately. Furthermore, in this particular example, the initiation of antibiotic prophylaxis did not commence until 9 days after exposure meaning that an even more accurate set of parameters for calculating both the representative dose *d* and the subsequent optimum duration of antibiotics for those exposed and disease-free persons *ψ*_{c} would be {*τ* = 5, *T* = −4}.

It is important to note that the equations in this section and the example above are based on the original ‘competing-risks’ model which ignores the time between germination of a spore and the onset of symptomatic disease [16]. However, the indoor release example provided in §3.2.1 is based on a modified competing-risks model which estimates the median time between spore germination and symptom onset *M* = 3.4 days [52]. A simplistic way of factoring this into equations (5.2) and (5.3) is to replace the *τ* − *T* terms with *τ* − *T* − *M*. This replacement is implemented in figure 4*a* which shows the antibiotic stockpile required for a range of mitigated attack ratios *ρ*_{c} given by . In other words, the antibiotic course required multiplied by the number of exposed persons who are asymptomatic at time days (3 April) following exposure. Combining equation (3.7) with the estimated number of persons who would have succumbed to infection in the absence of intervention and the number of exposed persons *E* = 1300 gives an estimated attack ratio . Then, by using the now modified equation (5.2), the conditional probability of infection is the transitional point in figure 4*a* where the negative gradient becomes zero; clearly, no antibiotics are required if . To reduce the conditional probability of developing disease to, say, *ρ*_{c} = 1/1000 each of the *E* – *k* = 1300–11 = 1289 persons who are asymptomatic at time (3 April) following exposure would need to immediately start taking antibiotics for approximately *ψ*_{c} ≈ 30 days. Using the now modified equation (5.3), this would give a required antibiotic stockpile of approximately 38 500 person-days. Figure 4*b* shows the outbreak size that would be expected to result based on the antibiotic course outlined in figure 4*a*. This is given by *k* + (*E* – *k*)*ρ*_{c} to the left of the transition point and to the right of the transition point. Therefore, if each asymptomatic person completed a *ψ*_{c} ≈ 30 day antibiotic course, then, in this particular example, the unmitigated outbreak size estimate of persons would be reduced to a mitigated outbreak size estimate of 11 + 1289/1000 = 12 persons (i.e. an estimate of 82% of cases saved).

In an outdoor release scenario, it is likely that the inhaled doses would vary significantly over space, and therefore, the required duration of antibiotics would also be likely to vary. Those receiving higher inhaled doses closer to the source might ideally need to undergo longer antibiotic courses than those receiving smaller inhaled doses further from the source in order to receive the same level of protection. Of course, logistical constraints might make such a distribution campaign undeliverable, and a more pragmatic approach might be to deliver the same antibiotic course to all sufficiently exposed persons [82,84]. The question of who has been sufficiently exposed to warrant treatment is an important one. It has been suggested that an estimated probability of infection *p* equal to 1/100 000 provided a good balance: ‘a higher threshold decreased the number of lives saved while a lower threshold did not save significantly more lives but required substantially larger numbers to be treated’ [54] (cf. figure 4). Moreover, consideration also needs to be given to the trade-off between the risk of disease versus the risk of serious adverse effects associated with long-term antibiotic therapy [15,85]. In a cost-effectiveness study of inhalational anthrax prophylaxis a baseline of 1/10 000 was attributed to the antibiotic adverse event ratio but with a wide range of 0–1/1 000 [86].

The mitigation strategies described thus far have assumed that antibiotics are distributed to all exposed and asymptomatic persons at the same time. Indeed, an alternative retrospective interpretation of equation (3.6) is to consider *τ* − *T* as the time following exposure at which completely effective prophylactic antibiotics had been distributed to the entire exposed population rather than the censoring time of the outbreak [14,15]. In this situation, is an estimate of the proportion of cases that have been prevented by the mass distribution of antibiotics (giving an estimate of 1–11/67 = 84% of cases saved for the indoor release example). However, where there are a large number of exposed persons it may not be appropriate to assume that all such persons would receive antibiotic prophylaxis at the same time. The forecast of symptomatic casualties in the absence of intervention (see §3.2.1) can be generalized to allow for the distribution of antibiotics to occur over a period of time. This mitigated forecast of symptom onset times is given by where the CDF *G*(*t; β*) is the probability that a person who would have eventually succumbed to infection has received antibiotics by time

*t*. In a similar manner to the original form of an exposure model during an LD outbreak (cf. §3.1.2), the associated PDF

*g*(

*t*;

**) has been assumed to follow a uniform distribution where the two parameters in the vector**

*β***represent the start and end days of the antibiotic distribution campaign [15,84,87]. The parameter**

*β**ε*is a summary measure that represents the effectiveness of antibiotics, i.e. the proportion of persons who would have eventually succumbed to infection and that have been identified for prophylaxis, multiplied by the proportion who subsequently adhere to an antibiotic course sufficient to prevent disease, multiplied by the efficacy of the antibiotics in preventing disease [15,83,87]. Returning to the indoor release example and assuming that antibiotics are completely effective (i.e.

*ɛ*= 1), if the distribution of antibiotics begins at time (3 April) following exposure and continues for, say, 3 successive days, then figure 2

*a*also shows a forecast of the mitigated outbreak. Under this scenario, the mitigated outbreak size estimate of 18 (giving an overall estimate of 73% of cases saved) with a 3 day antibiotic distribution campaign is approximately one-third higher than in the scenario where antibiotics are taken immediately. Note that the mitigated outbreak size estimate can be viewed as the endpoint of the mitigated cumulative forecast given by in figure 2

*b*.

## 6. Discussion

In the 1980s and 1990s, the process of back-calculation was developed to analyse diseases with long incubation periods (of the order of years) such as AIDS, BSE and vCJD. In the first decade of the twenty-first century, back-calculation was then used in a variety of bioterrorism applications. The comparatively short incubation periods (of the order of days and weeks) of many of the diseases associated with bioterrorism make *non-transmissible* diseases, such as inhalational anthrax and tularaemia [88], particularly applicable to analysis via back-calculation. However, the *prospective instantaneous* exposure techniques of sections 3.2.1 and 4.1 could also be applied to acute *transmissible* diseases providing that the case data are based only on those persons who were exposed to the source of the outbreak and not via subsequent human-to-human transmission. Therefore, those diseases with a longer mean incubation period and a longer mean serial interval (i.e. the time from symptom onset in a primary case to symptom onset in a secondary case, cf. the generation interval in §3.2.1), such as smallpox with a mean incubation period and serial interval of 12.5 days [89] and 16.0 days [18], respectively, are potentially more suitable for such analysis. Pneumonic plague, on the other hand, with its relatively short mean incubation period and serial interval of 4.5 days and 5.4 days [19], respectively, is likely to result in secondary cases quite soon after the release which could obscure the pattern of early index cases required for the analysis. Conversely, for each index case in an outbreak of smallpox, approximately five secondary cases might be expected to result (i.e. *R*_{0} ≈ 5 [90,91]), whereas it is likely that only one secondary case would occur for every index case during a pneumonic plague outbreak (i.e. *R*_{0} ≈ 1 [92–94]). It is possible that fewer non-index cases might make the back-calculation process less error-prone and more straightforward to perform. In a manner similar to the analysis of AIDS, any estimate of the size of the unmitigated outbreak *n* would need to be considered a lower bound given the potential for subsequent human-to-human transmission. Applying the *retrospective* instantaneous exposure technique of §3.1.1 to an outbreak of an acute transmissible disease would similarly be possible providing that the index and non-index cases could be disentangled from the epidemic curve. Note that equation (3.1) was actually first considered during the 1960s [39] and pre-dates the coining of the term ‘back-calculation’.

More recent modelling extensions have been applied to accidental outbreaks such as LD and certain food-borne outbreaks. For similar reasons to those described above, it would be extremely difficult to apply the *continuous* exposure techniques of sections 3.1.2 and 3.2.2 to acute transmissible diseases (i.e. differentiating those exposed via the source of the outbreak from secondary cases would be challenging). However, there are other non-transmissible diseases that might be suitable for analysis with such techniques, for example Q fever [95–98]. The causative agent of this zoonotic disease is shed from the placenta and birth products of infected small ruminants, principally sheep and goats. Aerosolization followed by inhalation of such material can result in human outbreaks in the areas near to affected farms. Back-calculating the period of exposure during such an ‘abortion wave’ [96,97] could help to identify the source of the outbreak. For example, following an outbreak of Q fever during the summer of 2007 in Cheltenham, UK [98], atmospheric dispersion modelling was used to investigate the potential airborne transport of organisms between a number of suspected sources and the residential addresses of cases during the risk period of exposure. The authors stated that ‘this period was chosen to cover the time distribution of cases from the earliest disease onset date minus maximum incubation period until onset of disease in the last detected case minus minimum incubation time’, an approach similarly taken during two distinct LD outbreaks in Spain [64,68]. Atmospheric dispersion modelling was similarly retrospectively applied to another LD outbreak in Sarpsborg, Norway [59]. The authors stated that ‘the focus time period was back-calculated to be 7–11 May, on the basis of the mid-time of illness onset (16 May) and a 7-day incubation period’. Although such relatively simple techniques can provide a useful estimate of the period during which exposure occurred, equations (3.2) and (3.3) provide more statistically robust estimates. Indeed, such recent advances were applied to an outbreak of LD during the summer of 2012 in Edinburgh [99] where the estimated period during which exposure occurred was used as an input into subsequent atmospheric dispersion modelling. Fusing these separate techniques into a common framework (i.e. generalizing the spatio-temporal likelihood of equation (4.5) from an instantaneous aerosolized release scenario to one which considers a continuous release) remains an open challenge.

Throughout this review, the symptom onset times of early cases have been shown to be a key requirement in the back-calculation process. However, these data might not be readily available during the initial stages of an outbreak and other, more immediately observable data, might be used instead. Alternatives include the times of hospital admission or death providing there was some *a priori* information regarding the distribution describing the time from exposure to hospital admission or death, respectively. For example, the time-to-death distribution for inhalational anthrax has previously been estimated directly [26] or can be estimated via the convolution of the incubation period distribution with the symptom onset to death distribution [81,100] (cf. equation (3.2); note that the latter process allowed for the additional forecasts in figure 2). In terms of the time-to-hospitalization distribution, some have assumed that this is equivalent to the incubation period distribution [87], whereas others have sought to perform a similar convolution to above but by utilizing the symptom onset to medical care presentation distribution instead [101]. These alternative distributions could prove to be an important substitute to the standard use of symptom onset times, especially during a response to an inhalational anthrax outbreak where the vast majority of infected persons would seek hospital care, be extremely ill (making it difficult to glean information from the person themselves) and most likely die in the absence of extensive treatment. However, in the situation of an LD outbreak approximately 70% of infected persons would be expected to be admitted to hospital and less than 10% of all cases would likely result in fatality [23]. Therefore, any back-calculation analysis not using symptom-onset dates would have to take into account that only a subset of the true number of infections was observable or significant biases might result. Conversely, any extension of the techniques described in this review to, say, Q fever would need to factor in that approximately 60% of infected persons are asymptomatic [102] and would therefore display no clinically observable reference in time. Clearly, the relationship between the severity pyramid of a particular disease and observable events needs to be carefully considered in the back-calculation process.

Continuing with the theme of the temporal relationship between the disease in question and the back-calculation process, it is worth noting that the epidemic curves of Q fever outbreaks tend to be reported in the literature on a weekly basis [95–98], whereas those of LD outbreaks tend to be reported on a daily basis [23]. This is most likely due to a combination of the longer mean incubation period of three weeks for Q fever [103] versus 6 days for LD and also that the abortion wave preceding a Q fever outbreak can occur over a number of months, whereas the period during which exposure occurs in an LD outbreak has rarely been longer than one month (and is often much shorter) [23]. As a result, it might seem appropriate to provide temporal estimates via the back-calculation process to the nearest week for Q fever as opposed to the nearest day for LD. This approach is somewhat supported by the back-calculation analysis of diseases with much longer incubation periods such as AIDS where estimates are reported yearly or half-yearly [1]. Thus, the time-course of the disease in question is likely to influence the reporting of the temporal estimates.

Furthermore, the model being used is also likely to have an impact on how accurately the temporal aspects of the exposure process can be estimated. For example, the spatio-temporal models of §4.1 require population data for each of the spatial units considered. These data tend to be available at a resolution of twice-daily to reflect the population of a particular spatial unit during the day and at night. As a result, the most appropriate estimate of the timing of an instantaneous release (using equation (4.5)) might be to the nearest half-day. A further consideration when using the spatio-temporal models is the choice of the size (in terms of both population and area, cf. table 2) of the spatial unit. A smaller unit could potentially provide more accurate and refined estimates, but there are a number of reasons why larger spatial units might be deemed advantageous. For example, population data tend to be snap-shots in time, and therefore, larger population aggregations might be less prone to the fluctuations which will inevitably occur on a daily basis. In addition, the puff models given by equations (4.1)/(4.2) and equations (4.3)/(4.4) do not capture small-scale changes (of the order of metres) in meteorology and are more generally applicable over larger distances (of the order of hundreds of metres upwards) [73]. Moreover, the computational run-time of the model tends to be faster when using larger spatial units, because there are fewer of them to consider, which could be critically important in a response situation. Finally, if for pragmatic reasons public health authorities intended to implement mitigation strategies (as outlined in §5) at a particular geographical scale, then there might be little point in providing a far more spatially resolved estimate of the area requiring prophylactic treatment than is required.

## 7. Conclusion

This review began by describing some relatively simple approaches for characterizing the incubation period distribution which is a fundamental requirement in the back-calculation process. Such approaches allow the utilization of the simpler temporal models which can be used to estimate the potential size of an outbreak along with the timing of the exposure process based on time-series data from early cases. If the impact of inhaled dose on the incubation period distribution is understood for a particular disease, then more sophisticated temporal models may be considered. Even more powerful inferences can be made with the temporal models if there is prior knowledge of the impact of inhaled dose on the probability of infection. The spatio-temporal models also require this level of knowledge and, in addition, the time-series data needs to be combined with, at a minimum, the home and work locations of early cases. Meteorological data and underlying population densities are also required inputs for the spatio-temporal models. Within-host modelling can potentially unify dose-dependent incubation periods with infection probabilities and can further allow the consideration of mitigation strategies. To date, this has really only been possible for inhalational anthrax, but it is hoped that further research in this area will benefit the planning and response to potential outbreaks of other pertinent diseases, such as tularaemia [104], Q fever and pneumonic plague.

## Funding statement

This research was supported by the Defence Science and Technology Laboratory (Dstl), the National Institute for Health Research Health Protection Research Unit (NIHR HPRU) in Emergency Preparedness and Response at King's College London and the NIHR HPRU in Modelling Methodology at Imperial College London in partnership with Public Health England (PHE). The views expressed are those of the authors and not necessarily those of Dstl, the NHS, the NIHR, the Department of Health or PHE. The authors declare that they have no competing interests.

## Acknowledgements

The authors are grateful to Joseph Gillard, Veronica Bowman and Steve Leach for providing helpful suggestions and comments.

- Received February 5, 2015.
- Accepted March 13, 2015.

- © 2015 The Author(s) Published by the Royal Society. All rights reserved.