## Abstract

Early theoretical work on disease invasion typically assumed large and well-mixed host populations. Many human and wildlife systems, however, have small groups with limited movement among groups. In these situations, the basic reproductive number, *R*_{0}, is likely to be a poor predictor of a disease pandemic because it typically does not account for group structure and movement of individuals among groups. We extend recent work by combining the movement of hosts, transmission within groups, recovery from infection and the recruitment of new susceptibles into a stochastic model of disease in a host metapopulation. We focus on how recruitment of susceptibles affects disease invasion and how population structure can affect the frequency of superspreading events (SSEs). We show that the frequency of SSEs may decrease with the reduced movement and the group sizes due to the limited number of susceptible individuals available. Classification tree analysis of the model results illustrates the hierarchical nature of disease invasion in host metapopulations. First, the pathogen must effectively transmit within a group (*R*_{0}>1), and then the pathogen must persist within a group long enough to allow for movement among the groups. Therefore, the factors affecting disease persistence—such as infectious period, group size and recruitment of new susceptibles—are as important as the local transmission rates in predicting the spread of pathogens across a metapopulation.

## 1. Introduction

Early epidemiological models typically assumed that host populations were large and well-mixed (e.g. Kermack & McKendrick 1927). Many human, wildlife and livestock populations, however, are structured into small groups with limited movement among the groups (Altizer *et al*. 2003; Kao *et al*. 2006). For example, communities of people that remain unvaccinated for religious or philosophical reasons constitute isolated and weakly linked patches of susceptible hosts for diseases such as measles and pertussis (Salmon *et al*. 1999; Feikin *et al*. 2000). Similarly, the ongoing spread of H5N1 influenza among wild birds underscores the need to understand whether insights derived from the theory of epidemics in large human populations can be applied accurately to diseases in wildlife. A number of studies have considered the effects of spatial or social group structures on disease invasion and persistence (e.g. Hess 1996*b*; Swinton 1998; Keeling 1999; Keeling & Gilligan 2000*a*,*b*; Thrall *et al*. 2000; Park *et al*. 2001, 2002; Fulford *et al*. 2002; Keeling & Rohani 2002; Cross *et al*. 2004; Hagenaars *et al*. 2004). Of particular importance is the research investigating the effects of population structure in the form of households on disease invasion and dynamics (e.g. Becker & Dietz 1995; Andersson 1997; Becker & Starczak 1997; Andersson & Britton 1998; Schinazi 2002). In this study, we take a novel approach to investigating disease invasion. Rather than analytically determining when a large outbreak is possible, we use hierarchical statistical methods to determine what criteria predict successful disease invasion most accurately. We then compare these results to more traditional thresholds to determine the amount of prediction error arising from the different approaches.

The basic reproductive number, *R*_{0}, is the expected number of infections caused by a typical infectious individual in a completely susceptible population. *R*_{0}*>*1 is the threshold condition traditionally applied for successful disease invasion (Anderson & May 1991; Heesterbeek 2002; Heffernan *et al*. 2005). *R*_{0}, as it is commonly used, assumes that the host population size is sufficiently large that the depletion of susceptible individuals through death or infection is negligible, and that the population is homogeneous or well-mixed (Anderson & May 1991; Keeling & Grenfell 2000). The *R*_{0} metric has been widely studied and refined to address more complex situations (e.g. multiple classes of host: Diekmann *et al*. 1990; spatial structure: Keeling 1999; depletion of the susceptible pool: Keeling & Grenfell 2000). Although some formulations of *R*_{0} use a matrix-based approach to account for spatial or group structure (e.g. Diekmann *et al*. 1990), *R*_{0} is, by definition, an individual-based rather than group-based metric. In this usage *R*_{0} may be high, reflecting within-group transmission, while the probability of between-group transmission remains low (Ball *et al*. 1997; Cross *et al*. 2005; Watts *et al*. 2005). When social groups are small, understanding the processes affecting within-group invasion becomes less important than understanding the processes regulating the spread of disease among groups.

The natural invasion metric for disease in a metapopulation is *R*_{*}, defined as the number of groups infected by individuals from the initially infected group (and hence the group-level analogue of *R*_{0}; Ball *et al*. 1997). A similar metric, *R*_{H0}, was developed by Becker & Dietz (1995) to assess the propagation of infection among households of variable sizes. In an idealized metapopulation, analytic theory has proven that must be greater than 1 for a pandemic to occur (Becker & Dietz 1995; Ball *et al*. 1997); under less restrictive assumptions, this same threshold has been demonstrated by simulation (Cross *et al*. 2005). Unfortunately, *R*_{*} is difficult to calculate analytically for any but the simplest metapopulation structures. Empirical estimation of *R*_{*} from outbreak data would require contact tracing data at a group level, a formidable challenge for wildlife or human diseases. Thus, while *R*_{*} brings conceptual clarity to the study of disease in metapopulations, its immediate utility in applied settings is limited. Therefore, we investigate the constituent parts of *R*_{*} to help focus field research on those parameters most important to disease invasion in structured populations.

Many studies addressing *R*_{0} in structured populations incorporate host movement via a phenomenological mixing approach, whereby hosts do not move among groups but simultaneously infect others locally and at a distance (Ball *et al*. 1997; Keeling 1999; Dobson & Foufopoulos 2001; Park *et al*. 2001; Fulford *et al*. 2002). Phenomenological mixing models are often analytically tractable, but they overlook the fact that between-group movements are discrete (and possibly rare) events, which can be crucial to understanding the stochastic dynamics of disease invasion (Cross *et al*. 2005) and the role of superspreaders in fuelling an epidemic (Lloyd-Smith *et al*. 2005*b*). An alternative approach is to model host movement mechanistically, explicitly tracking the movement of individuals between groups (e.g. Hess 1996*a*; Thrall *et al*. 2000; Keeling & Rohani 2002; Cross *et al*. 2005).

Previously, we used mechanistic models to show that disease invasion across a metapopulation depends crucially on the relative time-scales of host movement and recovery from disease (Cross *et al*. 2005). We showed that *R*_{0}>1 was insufficient for disease invasion when the product of the average group size and the expected number of between-group movements made by each individual while infectious (i.e. the ratio of movement rate to recovery rate) was less than 1 (Cross *et al*. 2005). This previous study addressed settings where the rate of host population turnover was negligible relative to the rate of disease processes of infection and recovery.

Here, we expand the earlier analysis to a much broader set of disease–host relationships, exploring settings where the duration of immunity ranges from transient to lifelong or where the demographic processes occur on comparable (or faster) time-scales to disease processes. Rapid replenishment of susceptibles allows qualitatively different dynamics compared to the earlier study, including the possibility for diseases to remain endemic within a local group even if movement is infrequent. Given *R*_{0}>1, we investigate additional factors that help to explain the remaining variation in whether or not a disease will become a pandemic. We also examine how these additional factors alter the structure of epidemics through their effect on the frequency of superspreading events (Lloyd-Smith *et al*. 2005*b*).

## 2. Methods

### 2.1 Model structure

We use two individual-based, stochastic, discrete-time SIR models that extend our previous work (Cross *et al*. 2005). These models differ from each other and our previous analyses only in the mechanism by which the susceptible pool is replenished. In the SIRS model immunity is transient, so recovered individuals can return to the susceptible state, and in the SIR_BD model immunity is permanent but births introduce new susceptibles, while deaths keep the population size constant. In simulations of each model, we track each individual's spatial position (group membership) and disease class (S-susceptible, I-infected, R-recovered).

In each model four processes occur: infection, recovery of infected hosts, creation of new susceptibles and movement among groups. We take disease transmission to be frequency-dependent (Getz & Pickering 1983), whereby the instantaneous rate of infection for each susceptible individual in group *i* is *βI*_{i}/*n*_{i}, where *β* is the transmission coefficient; *I*_{i} is the number of infected individuals in group *i*; and *n*_{i} is the total number of individuals in group *i*. Because our models operate in discrete time, the expression 1−exp(−*β*(*I*_{i}/*n*_{i})) is used to depict the saturating probability of infection per time-step for each susceptible individual (implicitly assuming that the force of infection is constant within each time-step). All disease transmission is assumed to occur within local groups and contact among groups occurs only by movement of individual hosts. We assume that infected individuals recover from infection to an immune class with a constant probability *γ* per time-step. We model movement among groups in a density-independent fashion such that all individuals have a constant probability *μ* of leaving their current group in each time-step. In the SIRS model, recovered individuals lose their immunity with probability *ρ* per time-step and births and deaths do not occur. In the SIR_BD model, all individuals have probability *δ* of dying and being replaced by a susceptible individual in the same group.

Groups are organized on a square lattice with periodic boundary conditions (i.e. movement is on a torus), where individuals move to one of their four nearest-neighbouring groups, chosen at random. Each simulation starts with one infected individual and all groups begin with the same number of individuals. Except where otherwise noted, we ran simulations on a 11×11 array of groups. Since our spatial model was symmetric, group sizes remained relatively constant during the course of each run. Therefore, our assumption of frequency-dependent transmission is approximately equivalent to a rescaling of density-dependent transmission.

In the continuous-time analogues of our models, *R*_{0}=*β*′/*γ*′ for SIRS and *R*_{0}=*β*′/(*γ*′+*δ*′) for SIR_BD (Anderson & May 1991; McCallum *et al*. 2001). The prime indicates that, in continuous time, these variables are rates rather than probabilities. For the discrete-time models used here, the ratio of *β*/*γ* is an approximation of *R*_{0} that works well when the time-step is small and group sizes are relatively large. These slight approximations do not change our qualitative conclusions, so for succinctness we refer to these ratios as *R*_{0}. Note that in the SIR_BD model, increasing *δ* reduces *R*_{0} because death removes individuals from the infectious class. To allow full comparison of the SIRS and SIR_BD models while varying *ρ* or *δ*, we present SIR_BD results for scenarios both where *β* is fixed (so *R*_{0} changes with *δ*) and where *β* is adjusted so that *R*_{0} remains constant.

### 2.2 Simulations and analyses

Using the models described above, we explore how different parameter interactions affect the outcome of disease introductions. Past studies of this model structure indicate that, for the parameter ranges we explore, most introductions result in extinction within the initial group or relatively complete invasion of the entire metapopulation, i.e. a ‘pandemic’ (Cross *et al*. 2005). As a binary measure of invasion success, we declare an invasion to be successful if more than 90% of groups are ever infected following a single disease introduction. This definition of a pandemic does not count disease persistence within a single patch as successful invasion, because we are focused on disease spread at the broader metapopulation scale.

To capture the effect of a finite, diminishing pool of susceptibles, we calculate empirical and values during the simulations. In contrast to the theoretical *R*_{0} values calculated from model parameters, these estimates are based upon individual simulation results. For each simulation, we calculate the individual reproductive number, *ν* (Lloyd-Smith *et al*. 2005*b*), by tracking the number of infections caused by the index case and then averaging *ν* over many simulations to calculate (Cross *et al*. 2005). Similarly, to calculate we take the average over *ν*_{*}, which in turn is calculated by tracking the number of groups infected by individuals from the index group. As estimates from model output, *ν*, *ν*_{*}, and all incorporate the effects of spatial structure, stochasticity, host movement and depletion of the susceptible pool within the infectious period of the index case (or group). We consider *ν*, *ν*_{*}, and to be ‘emergent’ quantities since they can only be estimated once the initial generations of a disease invasion have occurred. Following Lloyd-Smith *et al*. (2005*b*), we assess the frequency of SSEs in different population structures by constructing a histogram of infections caused by each index case to calculate the proportion of the distribution beyond the point corresponding to the 99th percentile of a Poisson distribution with the same mean. Since the distribution is not Poisson this tail will not necessarily contain 1% of individuals, but rather *y*%. The superspreading load (SSL) is the observed number of SSEs divided by the expected based upon a Poisson distribution that, when greater than one, predicts reduced invasion rates but more intense epidemics once invasion occurs (Lloyd-Smith *et al*. 2005*b*; Getz & Lloyd-Smith 2006).

We used classification and regression tree analyses to explore which factors influence the variation in disease invasion outcomes (Breiman *et al*. 1984). Classification tree analyses have been used extensively in clinical risk assessments (e.g. Begg 1986; Steadman *et al*. 2000) and are becoming more common in the ecological literature (e.g. De'ath & Fabricius 2000; Karels *et al*. 2004; Brose *et al*. 2005; Usio *et al*. 2006). Classification trees divide data in a hierarchical manner using binary rules based upon single predictor variables. Threshold criteria are then chosen to partition the response variable into groups that are as homogeneous as possible. We used the Gini index as the splitting criterion. Since larger trees will always predict the learning dataset better, we used 10-fold cross-validation and the 1−s.e. rule to guide in the choice of the ‘best’ tree size. This is a method to minimize the amount of prediction error on testing data (not used in the construction of the tree) while also incorporating a penalty for increasing tree size (Breiman *et al*. 1984). Since the classification analysis is intended to be heuristic, for clarity of presentation we present trees that are slightly simpler than those trees chosen according to the 1−s.e. rule, but resulted in only a minor increase in misclassification (details on alternative trees are presented in the electronic supplementary material). We explored three different sets of explanatory variables for the classification analysis: (i) six raw model parameters (*β*, *γ*, *ρ*, *δ*, *μ* and *n*), (ii) five aggregate model parameters (*β*/*γ*, *ρn*/*γ*, *μn*/*γ*, *ρ*/*γ* and *ρn*), and (iii) the five aggregate model parameters as well as *ν* and *ν*_{*}. Although we report results for all analyses in table 1, only the classification tree using the aggregate model parameters is shown in the main text; the others are illustrated in the electronic supplementary material.

We compare the criteria for invasion from the classification tree analysis with more traditional thresholds using a vocabulary taken from literature on diagnostics, where one assesses the utility of a diagnostic tool according to the proportion of times it yields false-positive and false-negative results. In the case presented here, false-positives occur when the criteria for invasion are met but the disease does not actually invade. False-negatives occur when the criteria for invasion are not met and yet the disease does invade (recall that a successful invasion is defined as the disease infecting individuals in over 90% of the groups of the metapopulation). Note that *R*_{0}>1 and *R*_{*}>1 are theoretical thresholds determining when disease invasions are possible; in stochastic models (or a stochastic world), satisfying these criteria does not guarantee that invasion will occur. The misclassification rate summarizes how well these thresholds work when used to predict invasion.

We generated simulation data for the classification tree analyses using a range of parameter values chosen to reflect a diversity of disease/host systems. The length of the time-step in the model is arbitrary, but with a time-step of 1 day in mind the average infectious periods, 1/*γ*, ranged from 10 days to 2.7 years (*γ*=0.001–0.1) Group sizes were relatively small (*n*=3–300), and rates of movement between groups ranged from once every 10 days to once (or less) in a lifetime (*μ*=0.0001–0.1). The theoretical *R*_{0} (as described in §2.1) ranged from 0 to 19, while the probability of losing immunity (*ρ*) or dying (*δ*) ranged from 0.0001 to 0.1. All parameters were sampled on a log scale to emphasize low parameter values, where the disease is more likely to be near the invasion threshold. We simulated each model with 6000 different parameter sets and ran each until the disease went extinct or every group of the metapopulation had been infected.

Because the model was stochastic, we conducted many runs of each parameter set for most analyses to determine average behaviour. For the classification tree analysis, however, we conducted only one run of each parameter set. We chose this approach to highlight the binary and stochastic nature of the invasion process; for real disease outbreaks, it is very rare to have sufficient replicates of an invasion process to estimate the probability of success. Rather, we were interested in the accuracy of different predictors in the stochastic context of single outbreaks. This strategy also allowed us to sample the parameter space more intensively since we ran each parameter set only once. Classification trees based on half as many runs were identical in structure and similar in threshold values to those presented, so we feel confident that this sampling approach was sufficient to yield robust results. All model simulations were run in Matlab v. 7.2 (Mathworks, Inc. 2006), which called spatial models written in C. Classification tree analyses were conducted in R using the Rpart package (R Core Development Team 2005; Therneau & Atkinson 2005).

## 3. Results

Successful invasion of a disease into a host metapopulation is determined by many factors in addition to the necessary, but not sufficient, threshold of *R*_{0}>1. As in our earlier study (Cross *et al*. 2005), we find that the likelihood of a pandemic exhibits a clear threshold in the ratio of movement rate to recovery rate (corresponding to the expected number of between-group movements during each individual's infectious period). However, the location of this threshold depends upon the recruitment of new susceptibles to the population (*ρ*/*γ* in the SIRS model and *δ*/*γ* in the SIR_BD model), whereby faster recruitment of susceptibles results in lower movement thresholds because the disease persists longer in each group (figure 1, top row). When *β* is fixed for the SIR_BD model, the probability of a pandemic is influenced by *δ* via its effect upon *R*_{0}, but *δ* does not alter the movement threshold (figure 1, second column). Results are generally similar between the two model structures (SIRS and SIR_BD) when *β* is scaled so that *R*_{0} values are equal between the models (figure 1, first and third columns). The SIRS and SIR_BD models also yield similar results for the classification tree analyses. Thus, we present only the SIRS model results, but provide the SIR_BD model results in the electronic supplementary material.

Inspection of figure 1 illustrates that is not a reliable predictor of pandemics when group sizes are small and movement between groups is limited, regardless of susceptible replenishment rate. In many cases , but the disease invasion fails because movement among groups is too infrequent compared to the infectious period of the disease (Cross *et al*. 2005). The quantity , on the other hand, is strongly associated with successful disease invasions across the metapopulation, for all levels of susceptible recruitment (figure 1). Note in figure 1 that is less than *R*_{0} (i.e. *β*/(*γ*+*δ*) or *β*/*γ*), primarily due to susceptible depletion effects that become important in small groups. In the first and third columns of figure 1, *R*_{0} predicts that the index case will infect five others, on average, but the realized number of infections () is lower owing to competition among infectors for the limited pool of susceptibles. Depletion of the susceptible pool also affects . When *μ*/*γ* is small, movement among groups is the limiting factor for , and increases with *μ*/*γ* (figure 1). As *μ*/*γ* approaches 10, however, declines due to competition among groups to infect other groups.

Although *R*_{*} may not be analytically tractable, we can consider its constituent parts. The probability that a disease propagates through a structured population depends upon at least two factors: the frequency of between-group movements and the total duration that the disease persists within a given group. The total infectious time (i.e. the sum of infectious host days in a single isolated group) increases with group size and with susceptible recruitment (figure 2*a*). If immune individuals are replaced by susceptibles sufficiently quickly, the disease can become endemic even in small groups. In figure 2, the average infectious period per individual (1/*γ*) is 100 time-steps. When the per capita infectious time is 1000 time-steps, each individual has been infected 10 times on average, which we use as an indication that the disease is endemic within a single group (though note that the choice of 10 infections is somewhat arbitrary; figure 2*b*).

The total infectious time within a group determines the threshold movement rate for a pandemic. For example, when *n*=10 and *ρ*/*γ* is low (say, 10^{−3}), the total infectious time is roughly 800 time-steps (figure 2*a*). In order for the expected number of between-group movements of infectious individuals to exceed 1, the movement probability per time-step for each individual (*μ*) must exceed 1/800 or 0.00125. When the recovery rate (*γ*) is 0.01, a threshold of *μ*/*γ*>0.125 is predicted, exactly as seen in figure 1 for an SIRS model with low *ρ*/*γ*. Similarly, when *n*=10 and *ρ*/*γ* is high (say, 10), the total infectious time is approximately 10^{5} time-steps, so the predicted threshold for *μ*/*γ* is 10^{−5}/0.01=10^{−3}, again corroborated by figure 1.

The classification tree analysis (figure 3*a*) indicates that disease–host combinations must satisfy several criteria for a pandemic to be likely. First, the disease must be able to spread successfully within the initially infected group. Traditionally this is assessed using the theoretical threshold *R*_{0}>1, above which invasion occurs with non-zero probability (Diekmann & Heesterbeek 2000). In the statistical context, however, a higher threshold of *R*_{0}≥2 minimizes the amount of misclassification error, although it increases the probability of a false-negative result where disease extinction is predicted but the disease actually invades (figure 3*a*, table 1). If *R*_{0} is sufficiently high to favour within-group transmission, then the disease still needs to propagate between groups, a process that depends upon group size, movement and the length of the infectious period (yielding a threshold of *μn*/*γ*≥2.7). Similar to *R*_{0}, the classification threshold for *μn*/*γ* exceeds the criterion *μn*/*γ*>1 that we proposed in an earlier simulation study (Cross *et al*. 2005). If the relative amount of movement between groups is low, then the disease may still be able to invade the entire metapopulation if the recruitment of new susceptibles (*ρn* or *δn*) scaled by the recovery rate (*γ*) is high. For the case we present, the classification threshold for *ρn*/*γ* is approximately 7.2; this can be considered a loose statistical criterion for endemicity, above which the disease persists long enough in each group that even infrequent between-group movements are sufficient to maintain the disease.

The specific thresholds presented here are likely to depend upon the model structure and parameter ranges used. Similar to previous work (Cross *et al*. 2005), we also simulated the disease model using a ‘non-spatial’ array of groups where individuals could move to any other group in one step (see electronic supplementary material). We found that the statistical threshold of *μn*/*γ* in the classification tree was lower (1.8 compared to 2.7) for the non-spatial array compared with the nearest-neighbour movement model, but the structure of the classification tree was the same (compare figure 3*a* and figure 1 of electronic supplementary material). In addition, we simulated the SIRS model with only one group and conducted a classification analysis on whether greater than 90% of that group was ever infected. The best statistical threshold for disease invasion was *R*_{0}≥2.4, which is similar to the criteria for the multi-group metapopulation model.

To investigate the effect of different parameters on the classification tree analysis, we constructed new classification trees using subsets of the data corresponding to particular ranges of certain parameter values. The relative amount of error explained by different variables depend upon the parameter space used, but the overall classification tree structure and threshold values were very similar. For example, in all 6000 runs of the SIRS model the disease invaded the metapopulation in 41.1% of the simulations. This percentage represents the total amount of error associated with a classification tree with no nodes. Inclusion of the first node, *R*_{0}≥2, decreases the error rate to 25%, for a relative error rate of 0.62 (i.e. 0.25/0.41). Adding the second node, *μn*/*γ*≥2.7, reduces the relative error rate to 0.38. The length of each branch of the classification tree is proportional to the reduction in prediction error associated with that node (figure 3). When we analysed only the subset of the data where group sizes were greater than 100, the first node alone, *R*_{0}≥1.9, became a more important predictor, reducing the error rate from 0.46 to 0.16 (relative error=0.34 compared to 0.616 with all group sizes) and the second node, *μn*/*γ*≥3.8, only led to a marginal improvement (figure 3*b*). Thus, loosely stated, the predictive ability of *R*_{0} increased with larger group sizes while the importance of movement decreased. Note, however, that the threshold values remained similar (figure 3*a*,*b*). When we analysed the subset of the dataset with shorter infectious periods (*γ*>0.01), the predictive power of *R*_{0} decreased while the importance of *μn*/*γ* increased (data not shown). Thus, for acute diseases movement becomes a more important predictor of disease invasion (Cross *et al*. 2004, 2005).

The theoretical threshold of *R*_{0}>1 determines when a disease invasion is possible in an infinite population. In a large, but finite, population this threshold holds to close approximation (Lloyd-Smith *et al*. 2005*a*), which makes it unsurprising that *R*_{0}>1 resulted in no false-negatives in our simulations. However, at least for the parameter ranges we explored, the disease did not invade in 35% of the simulations where *R*_{0} was greater than one. These invasion failures correspond to stochastic extinctions of the disease, but are counted as false-positive predictions when *R*_{0}>1 is interpreted as a predictor. Our previous rule of thumb, *μn*/*γ*>1, also resulted in few false-negatives (2%) but many false-positives (42%). The false-positive rate is reduced when using *R*_{0}>1 and *μn*/*γ*>1 in combination, but these rules still do not account for the recruitment of new susceptible individuals (table 1).

All the classification trees we analysed yielded lower misclassification rates on test data (13–18%) than either *R*_{0}>1 or *μn*/*γ*>1 (24–44%, table 1). The ‘best’ classification tree, as determined by the ‘1−s.e. rule’, was only marginally better at predicting disease invasion than the reduced tree shown in figure 3*a* (13 versus 14%, table 1). The classification tree based upon the raw model parameters *β*, *γ*, *μ*, and *n* did not perform quite as well as those based on aggregate parameters *β*/*γ*, *μn*/*γ* and *ρn*/*γ* (19 versus 14%, table 1). Threshold criteria based on the emergent quantities *ν* and *ν*_{*} produced the lowest misclassification rate, and *ν*_{*} was twice as good as *ν* (10 versus 20%, table 1). Our counting rules for *ν*_{*} did not account for the possibility that the index group could lose the infection (all infected members moving out) and then become re-infected (those same infected members moving back in, without having transmitted in their new group) before finally going on to spread the infection. As a result, a few simulations led to invasions when *ν*_{*}=0, which is at odds with the theoretical definition on *ν*_{*}, but this low probability event (33 out of 6000 simulations) does not change our overall conclusions (figure 1, table 1)

The analysis of individual reproductive numbers (figure 4) illustrates the strong influence of population structure on SSEs. Owing to the constant recovery probability assumed in our model, there is substantial individual variation in infectious periods. In a single large population, this leads an overdispersed distribution of *ν* and numerous SSEs (31 SSEs out of 500 simulations). Compared to an expected five SSEs out of 500 individuals for a homogeneous population, by our definition of an SSE, this yields a SSL of 31/5 or 6.2. In a metapopulation of small populations (*n*=10), the frequency of SSEs depends upon the movement of hosts among groups. When movement rates are high (*μ*/*γ*=10), there were 56 SSEs for a SSL of 11, whereas when *μ*/*γ* equalled 0.001 there were 12 SSEs, representing an SSL of just 2.4. The recruitment rate of new susceptibles did not have significant impact upon SSEs (data not shown).

## 4. Discussion

In socially or spatially structured host populations, *R*_{0}>1 is a necessary but not sufficient condition for a pandemic. As *R*_{0} increases beyond 1, the probability of disease invading the initially infected host group increases, but additional criteria are important to determining the probability that the disease spreads to other groups. Disease transmission among groups depends on the transmission rate among individuals (*β*), the frequency of individual host movement (*μ*) and the duration of time (measured cumulatively over all infected hosts) the disease persists within each group. Within-group persistence times increase due to longer individual infectious periods (1/*γ*), greater group sizes (*n*) or faster replenishment of the susceptible pool (Bartlett 1957; Bjornstad *et al*. 2002; Grenfell *et al*. 2002; Lloyd-Smith *et al*. 2005*a*). To synthesize, the disease is increasingly likely to invade the entire population for increasing *R*_{0}>1 and *μn*/*γ*>1; when movement is infrequent relative to host recovery (*μn*/*γ*>1), a pandemic requires that the recruitment of susceptible individuals is sufficiently fast to allow the disease to persist endemically in infected groups (figures 1 and 3).

To our knowledge, classification and regression tree analyses have not been used to understand disease invasions, yet we found that the method was naturally suited to analysing simulation results and illustrating the hierarchical nature of disease invasion criteria. After experimenting with many combinations of predictor variables (see electronic supplementary material), we focused on a set of aggregate parameters that were most informative, hence resulting in small trees, and corresponded to relevant biological processes: within-group transmission, *R*_{0} (*β*/*γ* in SIRS or *β*/(*γ*+*δ*) in SIR_BD); movement, *μn*/*γ*; and recruitment of new susceptibles, *ρn*/*γ* and *δn*/*γ*. The classification tree analyses corroborated our previous rule of thumb (Cross *et al*. 2005) that when transmission and recovery processes are fast relative to the recruitment of new susceptibles, *μn*/*γ* must exceed 1 for a pandemic to occur (figure 3*a*). Our expanded models, however, revealed that the effects of low movement rates can be compensated for by faster susceptible recruitment (e.g. *ρn*/*γ*>7, figure 3*a*).

Theoretical ecologists often search for thresholds or bifurcation points where system behaviour qualitatively changes. The threshold *R*_{0}>1 demarcates when a disease outbreak is possible, but as a predictor will lead to false-positives when the disease is predicted to invade but goes extinct due to initial stochastic events. Thus, *R*_{0}>1 is a conservative threshold for predicting disease outbreaks and circumstances exist where more accurate (but less conservative) predictions of invasion are useful. In 35% of simulations we conducted, the *R*_{0}>1 criterion was satisfied but the disease failed to invade (table 1). The combined threshold of *R*_{0}>1 and *μn*/*γ*>1 resulted in fewer misclassifications (24%) but the classification tree criteria were more reliable, misclassifying only 14±0.5% (s.d.) of all simulations that were not used in the tree construction (table 1). We emphasize, though, that all the ‘thresholds’ we describe are necessarily fuzzy due to the stochastic nature of disease invasion (Lloyd-Smith *et al*. 2005*a*).

All the criteria we applied, with the exception of *ν*_{*}>1, resulted in more false-positives than false-negatives due to the high probability of stochastic extinction in the early generations of disease invasion. The *ν*_{*} metric was the best predictor because it includes information on initial stochastic events as well as the movement of infectious individuals among groups. Predictions based on real empirical data are likely to suffer greater misclassification error rates than the simulated data we present due to process-based variation and sampling error. Despite these difficulties, our results emphasize the importance of understanding host movement and those processes that allow diseases to persist for longer in spatially or socially structured host populations.

Superspreading events (SSEs) result from heterogeneities in host, environment and parasite factors (Lloyd-Smith *et al*. 2005*b*). Our analysis focuses on the interaction between heterogeneity in the host factor of infectious period and in the environmental factor of contact with susceptible individuals. In our simulations, all infectious individuals had constant and identical probabilities per time-step of recovering from disease, as well as moving between groups, resulting in geometric distributions for the duration of infectiousness and the number of groups visited while infectious. The heterogeneities embodied by these geometrically distributed quantities create the conditions necessary for SSEs; that is, they lead to distributions of individual reproductive numbers that are overdispersed relative to the Poisson distribution predicted when all infectious individuals (and their environments) are identical. Given these individual heterogeneities, the frequency of SSEs may be constrained or facilitated by the population structure where the individual resides. In a large or panmictic population, transmission is not constrained by the supply of susceptible individuals. In contrast, when groups are small and movement is infrequent, the number of potential contacts is limited and the opportunity for SSEs is reduced even for individuals with extraordinarily long infectious periods. The same qualitative effect would arise for individual heterogeneity in transmission rates, as access to susceptibles is a prerequisite for transmission. The potential for superspreading in structured populations would be amplified if positive correlations existed between movement rates (and hence access to more susceptibles) and high transmissibility or slow recovery. Further subtleties may arise if movement itself is linked to transmission (as in SSEs aboard airliners) or increased risk of death (as in some wildlife systems).

The utility of simple, within-group calculations of *R*_{0} as a predictive measure of disease invasion is limited in systems where transmission between groups may be the primary factor regulating the probability of a pandemic. Examples include many wildlife populations (Woolhouse *et al*. 2001), livestock based on smallholdings (Keeling *et al*. 2001; Woolhouse *et al*. 2005), and human populations with small, weakly connected groups of susceptible individuals (Salmon *et al*. 1999; Feikin *et al*. 2000). While further research should aim to advance analytic theory, classification trees provide an effective means of connecting real-world, measurable variables to the likelihood of invasion, particularly in structured populations where system dynamics are governed by the hierarchy of contributing factors. Our analyses have focused on a relatively idealized system of equal group sizes and simplistic movement rules. Future work should aim to extend our findings to more realistic, heterogeneous settings and to link the ideas presented here with empirical evidence from the field.

## Acknowledgments

This research was funded by the NSF-NIH Ecology of Infectious Disease Grant DEB-0090323 to Wayne Getz, NIH-NIDA Grant R01-DA10135, the Center for Infectious Disease Dynamics (CIDD) Postdoctoral Fellowship (J.L.S.), a James S. McDonnell Foundation twenty-first Century Science Initiative Award (W.M.G.) and the US Geological Survey (P.C.C.). Many thanks to A. Shrag for her help with R and classification trees. Previous versions of this manuscript were improved by the comments of R. Plowright and several anonymous reviewers.

## Footnotes

Electronic supplementary material is available at http://dx.doi.org/10.1098/rsif.2006.0185 or via http://www.journals.royalsoc.ac.uk.

- Received September 22, 2006.
- Accepted November 3, 2006.

- © 2006 The Royal Society