## Abstract

In recent years, a number of systems capable of predicting future infectious disease incidence have been developed. As more of these systems are operationalized, it is important that the forecasts generated by these different approaches be formally reconciled so that individual forecast error and bias are reduced. Here we present a first example of such multi-system, or superensemble, forecast. We develop three distinct systems for predicting dengue, which are applied retrospectively to forecast outbreak characteristics in San Juan, Puerto Rico. We then use Bayesian averaging methods to combine the predictions from these systems and create superensemble forecasts. We demonstrate that on average, the superensemble approach produces more accurate forecasts than those made from any of the individual forecasting systems.

## 1. Introduction

Recent work has demonstrated that accurate forecasts of the timing and severity of infectious disease outbreaks can be generated using a framework combining a dynamical model of disease transmission and data-assimilation methods [1–5]. However, because no model perfectly represents transmission dynamics in the real world, infectious disease forecasts made by a single model are prone to error due to this model misspecification. In weather and climate forecasting, this problem has been addressed by combining forecasts from multiple competing models in a superensemble. The intent is that some of the biases inherent in the different models will offset so that the superensemble produces more accurate predictions than those generated by any individual model. Such improvement has indeed been observed [6–8].

Dengue is a viral mosquito-borne disease that has spread rapidly over the past 50 years, and is currently endemic in over 100 countries [9]. There are an estimated 360 million dengue infections per year, approximately 25% of which lead to apparent symptoms ranging from mild fever to life threatening haemorrhagic fever and shock [10]. In Puerto Rico, dengue virus was first isolated in 1963, and all four virus serotypes have been circulating on the island since 1982 [11]. As in the rest of the Americas [12], dengue incidence in Puerto Rico has been increasing over the past two decades, with major outbreaks occurring in 1994, 1998, 2007 and 2010 [11]. The annual cost of dengue illness in Puerto Rico has been estimated at $38.7 million, or $10.40 per person [13].

While there is no specific treatment for dengue, simple fluid replacement and case management can reduce the fatality rate for severe dengue from 20% to less than 1% [14]. However, during major dengue outbreaks, hospitals may lack the capacity to administer this basic treatment, leading to preventable deaths. It is, therefore, of great value for public health officials and healthcare facilities to have advance warning of increased dengue cases.

Dengue forecasts and early warning systems have been proposed using a number of approaches, including autoregressive integrative moving average (ARIMA) models [15,16], regression models [17–19], a spatio-temporal hierarchical Bayesian model [20], a percentile rank model [21] and an empirical Bayes model [21]. Other approaches have been used to forecast influenza, including stochastic agent-based models [22,23] and meta-population models [24]. These forecasting systems use various environmental, epidemiological and demographic predictors to generate estimates of future disease incidence.

Here, we develop three distinct forecasting systems for dengue outbreaks in San Juan, Puerto Rico, and then use Bayesian averaging methods to combine the predictions from these systems and create superensemble forecasts. We demonstrate that on average, this approach leads to more accurate forecasts than those made from any of the individual forecasting approaches.

## 2. Material and methods

### 2.1. Data

Weekly dengue incidence data from April 1990 through April 2013 for the San Juan-Carolina-Caguas Metropolitan Statistical Area in Puerto Rico were provided by the Puerto Rico Department of Health and Centers for Disease Control and Prevention. These data were collected by the Passive Dengue Surveillance System for dengue case reporting. Dengue cases were generally laboratory confirmed, with the exception of periods of high transmission when suspected cases exceeded laboratory testing capacity, or when case information was incomplete. During these events, additional positive cases from specimens that were not tested were estimated by multiplying the number of untested specimens by the fraction of tested specimens that were laboratory positive [11].

### 2.2. Forecasting targets

Dengue in Puerto Rico follows a roughly seasonal cycle, generally peaking in the late autumn or early winter months. Dengue seasons are defined as calendar week 17 (late April) through calendar week 16 of the following year. We produced forecasts of three characteristics for each dengue outbreak season: the maximum number of cases reported in a single week (peak incidence), the week during which peak incidence occurs (peak timing) and the total number of cases in the season (total incidence).

### 2.3. Forecast systems

Three distinct systems were used to produce competing retrospective forecasts of peak timing, peak incidence and total incidence. We call these: F1* _{N}* for forecasts generated using a model-inference framework, F2

*for forecasts generated using a Bayesian weighting of historical outbreaks and F3*

_{N}*for forecasts derived from historical likelihood functions (see below). Forecasts were generated every week,*

_{N}*w*, for each outbreak season,

*N*, during the ‘testing period’, defined as seasons 15 (year 2005/2006) through 23 (year 2012/2013). Training forecasts TF1, TF2 and TF3 were produced using the F1, F2 and F3 forecast methods, respectively, over the ‘training period’, prior seasons 1 through

*N −*1, and used to determine the contribution of each individual forecast to the weighted sum superensemble forecasts.

#### 2.3.1. Forecast method 1: susceptible–infectious–recovered-ensemble adjustment Kalman filter

F1 forecasts were produced using a model of disease transmission in conjunction with the ensemble adjustment Kalman filter (EAKF) data-assimilation method, as described in detail in Shaman & Karspeck [1]. The disease transmission model was a basic susceptible–infectious–recovered (SIR) compartmental model [25], commonly used for infectious disease simulation. The model assumes a perfectly mixed population and is governed by the following equations:
2.1;and
2.2;where *N* is the population size (arbitrarily 100 000 people), *S* and *I* are, respectively, the number of susceptible and infectious individuals in the population, *D* is the mean duration of infection and *β* is the contact rate. The basic reproductive number, *R*_{0}, is related to the contact rate by *R*_{0} = *βD*. For model scaling, we assumed the number of dengue cases observed in clinics represented 20% of total new infections each week.

This model is a greatly simplified representation of dengue transmission. Many processes, including infection rates within the mosquito population and interactions between dengue serotypes, are not explicitly modelled. Previous studies have used such simple compartmental models of human populations to gain insight into dengue transmission dynamics [26,27]. We chose this parsimonious model structure because we lacked data on mosquito infection rates and information on the immunological history of individual cases (e.g. whether an observed case is a primary or secondary infection). Such data would be necessary to constrain a more complex model of disease transmission for generation of reliable forecasts.

The SIR model was used in conjunction with an EAKF. The EAKF consists of an ensemble of SIR model replicates, in this study 400, initialized from a randomly drawn suite of state variable conditions and parameter values, and iteratively optimized over the course of an outbreak in a prediction-update cycle. In the prediction step, the SIR model moves each ensemble member forward until the next observation becomes available, which in this study occurred weekly. In the update step, the EAKF algorithm (see Anderson [28] for full algorithm details) assimilates new observations by adjusting the ensemble members such that their mean and variance match the posterior mean and variance predicted by Bayes' rule.

In addition to running the ensemble SIR-EAKF system forward in time and producing weekly posterior updates, we also used those posteriors to generate weekly forecasts. That is, following each new assimilation of an observation, the updated ensemble of model simulations was propagated forward using the latest parameter and state variable estimates to produce an ensemble forecast of disease incidence through the remainder of the season. Forecasts of the outbreak characteristics (peak timing, peak incidence and total incidence) were derived from the forecast trajectory of the ensemble mean.

Multiplicative inflation was included in the simulation to avoid filter divergence, which can occur in part due to differences between the simplified SIR model structure and true infection dynamics [1]. Here we assumed the dengue incidence data have normally distributed error with variance *σ*^{2} = 65.

Initial conditions for state variables and parameter values of each ensemble member were randomly selected from the following probability distributions functions: *D* ∼ Uniform [2 days, 10 days], *R*_{0} ∼ Uniform [1,4], *S*(*t* = 0) ∼ Uniform [0,0.6 × *N*] and *I*(*t* = 0) ∼ Exponential [mean = 40].

#### 2.3.2. Forecast method 2: Bayesian weighted outbreaks

The second forecasting method employs a statistical approach in which the current outbreak as observed thus far is described as a weighted average of outbreak trajectories from prior seasons. Similar approaches have been used previously to forecast influenza [29,30] and dengue [21]. Here, the respective weights for each candidate trajectory were determined using Bayesian model averaging (BMA), a statistical method that is commonly used to combine information from competing models [31–33], and was adapted by Raftery *et al*. [34] for use with dynamic weather forecasts.

The candidate trajectories used to construct forecasts of the outbreak in season *N* were the preceding seasons, 1 through *N* − 1, smoothed by applying a five-week centred moving average. More specifically, weeks *t* − 4 through *t* of the current season *N* are estimated as a weighted sum of weekly incidence during weeks *t* − 4 through *t* from prior seasons 1 through *N* − 1*.* A forecast is then generated by projecting that weighted sum for weeks *t* + 1 until the end of the season, week 52*.* The weightings of prior season incidence data are obtained using the probability distribution function (PDF)
2.3;where *y* is the weekly dengue incidence, *f _{k}* is a candidate outbreak trajectory,

*w*is the probability of trajectory

_{k}*f*being the best representation for season

_{k}*N*, and

*g*(

_{k}*y|f*) is the PDF of

_{k}*y*conditional on

*f*, given that

_{k}*f*is the best candidate model.

_{k}The conditional PDF for *y* given each candidate trajectory is assumed to be normally distributed with mean *f _{k}* and standard deviation σ. For simplicity, we assume equal

*σ*for all candidate trajectories. We use maximum-likelihood estimation over the training window of 5 weeks to obtain

*w*and

_{k}*σ*(see Raftery

*et al*. [34] for full details). The weights, which sum to 1, are then applied to the candidate trajectories to produce the weighted sum forecast. 2.4;The resulting trajectory of predicted weekly dengue incidence is used to predict peak timing, peak incidence and total incidence.

#### 2.3.3. Forecast method 3: historical likelihood

Historical likelihood forecasts, F3* _{N}*, are made by fitting probability distribution functions to historical data for each of the target outbreak characteristics, observed for seasons 1 through

*N*− 1. Peak timing is described by a normal distribution, while peak incidence and total incidence are described by gamma distributions. As the dengue season progresses and possible outcomes are eliminated, the probability distributions are updated, as described in the electronic supplementary material, Supplementary Methods. The resulting forecast is the expected value of each variable, calculated using the updated PDFs.

### 2.4. Creating the superensemble

Given a number of competing forecasts, we can produce a superensemble forecast by using a weighted average of the individual forecasts. Using the same BMA technique used to create the weighted analogue forecast, we combine first two (F1 and F2), (F1 and F3), (F2 and F3) and then three (F1, F2 and F3) competing forecasts to produced superensemble forecasts SE(F1,F2), SE(F1,F3), SE(F2,F3) and SE(F1,F2,F3), respectively.

Weights among the three individual forecasting methods were based on the performance of those forecasts over prior seasons (1 through *N* − 1). For example to produce the superensemble forecast for season 15, we used the training forecasts TF1, TF2, TF3 (described below) for seasons 1 through 14. For season 16, we repeat the process and produce a new set of historical forecasts for seasons 1 through 15, incorporating the additional information from season 15. Note that for the superensemble, BMA is applied across seasons to weight the forecast performance of each method, whereas in F2 it is used to weight how well prior season observations match the present season as thus far observed.

The forecasts for TF1* _{N}*, which are acquired independently for each season, only required evaluation of the performance over the prior

*N*− 1 seasons using the same methodology as described for the F1

*forecasts. Methods F2*

_{N}*and F3*

_{N}*use prior observations to produce a forecast for season*

_{N}*N*, and this pool of observations enlarges with each additional season. To evaluate the F2

*forecast accuracy, given the availability of*

_{N}*N*− 1 seasons, a leave-one-out (LOO) approach was used to construct TF2

*forecasts for each of those*

_{N}*N*− 1 prior seasons. For example, the set of TF2 forecasts used to inform superensemble weightings for season 15 is the set of 14 LOO forecasts for seasons 1 through 14. Each LOO forecast is the weighted average of the remaining 13 smoothed trajectories. A similar approach was used for the TF3

*forecasts.*

_{N}In using BMA to combine the individual F1, F2 and F3 forecasts into a superensemble forecast, we find the weights of the three competing forecasting systems for each outbreak characteristic over the training period seasons 1:*N* − 1, pooled over all forecast weeks 1:52. Equation (2.3) becomes
2.5;where *y* is the target outbreak characteristic, *w _{k}* is the probability that forecast method

*k*is the most accurate method and

*g*(

_{k}*y|*TF

*) is the PDF of*

_{k}*y*, conditional on TF

*, given that TF*

_{k}*is the most accurate forecast. This conditional PDF is assumed to be normal with mean TF*

_{k}*and standard deviation*

_{k}*σ*. For simplicity, we assume equal σ for all candidate trajectories. We use maximum-likelihood estimation over the entire set of training forecasts to obtain

*w*and

_{k}*σ*(see again Raftery

*et al*. [34] for full details).

The weights, which sum to 1, are then applied to the three candidate forecasts to produce the weighted sum superensemble forecast. Equation (2.4) becomes 2.6;For comparison, we produced a final set of forecasts by taking an equal-weighted average of the competing individual forecasts.

### 2.5. Forecast evaluation

The accuracy of each forecast was evaluated by the absolute error of the prediction relative to observation. We ranked the accuracy of the individual forecasts, simple averages and superensembles by comparing mean absolute error (MAE) aggregated over all seasons and forecast periods, as well as stratified by the week of forecast generation and forecast lead time.

## 3. Results

Weekly individual forecasts for each season of the initial training period, seasons 1–14, are shown in figures 1⇓–3. Forecast results during the testing period are shown grouped by the week each forecast was produced (figure 4), and by lead time with respect to the outbreak peak, defined as the difference in weeks between when the forecast was generated and the actual or predicted outbreak peak (electronic supplementary material, Supplementary Results, figures S1–S2).

Different inaccuracies are identifiable for each of the individual forecast approaches. The F1 forecasts frequently predicted false, overly small peaks early in the season when only a few dengue cases had been observed; however, later in the season, during the growth phase of the outbreak, larger outbreaks were forecast, which in some cases overestimated the true size of the outbreak (for example, figures 1⇑–3, seasons 2 and 8). The forecasts improved still later in the season as more observations were assimilated. The F1 forecasts were especially skilled at detecting when the actual peak had passed, leading to high accuracy in peak timing and peak incidence after the true peak (figures 1 and 2).

The F2 forecasts use a weighted sum of historical outbreaks. These forecasts are, therefore, constrained to the range of observed trajectories, but are able to adapt to observational data throughout the dengue season by adjusting the weights assigned to each contributing historical outbreak. While this forecasting system frequently produced better forecasts than F1 and F3, it was subject to large errors, particularly during years when observations followed an unfamiliar trajectory, such as season 3, which had unusually high numbers of observed dengue cases during the first few weeks of the season despite ultimately concluding as a relatively small outbreak. Of the three individual forecasts of peak week, F2 had the lowest MAE on average (figure 4). Like the F1 forecasts, F2 forecasts quickly detected the true peak, leading to accurate estimates of peak timing and peak incidence once the true peak had passed (figures 1 and 2).

F3 forecasts are based on long-term averages of outbreak characteristics; as a result, they did well when such characteristics were close to mean conditions (for example, figures 1 and 2, season 8; figure 3, season 3). In addition, the F3 forecasts were not prone to large error; the largest errors in these forecasts were smaller than those of F1 and F2 (figures 1⇑–3). This is an important advantage, as grossly inaccurate forecasts can have serious public health impacts. However, because this forecasting system possesses limited ability to adapt its forecasts in response to observations, it was slow to arrive at the true value for outbreaks that did not resemble the long-term average (figure 1, season 9; figures 2 and 3, season 13).

The MAE and maximum error of all forecasts made over the testing period (seasons 15–23) are shown for each individual forecast in table 1. *p*-Values reported here and in electronic supplementary material, table S1, indicate significant MAE differences among pairs of forecast approaches. Overall, F2 produced better forecasts of peak timing than F1 and F3 (*p* < 0.0001), which had similar MAE. F1 and F2 had equivalent MAE for peak and total incidence during the testing period, and both outperformed F3 (*p* < 0.016). As in the training forecasts, the maximum errors in F3 forecasts were substantially smaller than those of F1 and F2 for all three outbreak characteristics (table 1).

The contribution of F1* _{N}*, F2

*and F3*

_{N}*to the weighted sum superensemble forecasts was determined based on the performance of training forecasts TF1, TF2 and TF3 (figures 1⇑–3) during seasons 1 through*

_{N}*N*− 1 (i.e. the training period). The weights assigned to each forecast are shown in figure 5. In general, the F2 and F3 forecasts were weighted more heavily than F1 for superensemble forecasts of peak timing, while F1 and F2 dominated superensemble forecasts of peak incidence, and to a lesser extent, total incidence. The change in weights over time indicates the lack of consensus on the relative skill of each individual forecast. Each successive year incorporates an additional year of training forecasts and observations, giving a more informed estimate of the average performance of each individual forecasting system.

On average, SE(F1,F2) performed better at predicting peak timing than any of the individual forecasts (*p* < 0.023, table 1). Superensemble forecasts of peak timing that included F3 had a MAE greater than that of F2, but lower than F1 and F3 (*p* < 0.018). The outbreak peaks were closer to the long-term mean during the training period (mean difference 3.2 weeks) than during the testing period (mean difference 6.9 weeks). As a result, the F3 forecast, which is based on the long-term mean, performed well during the training period and was assigned a relatively high weight in the superensemble, but had larger errors during the testing period, which were passed on to the superensemble forecasts. The superensemble weights adjusted over time to discount the F3 forecast (figure 5). All superensemble forecasts of peak and total incidence had MAE equal to or lower than any individual forecast. Superensemble forecasts all had maximum errors greater than the maximum error in F3, but less than or equal to the maximum errors of F1 and F2 (table 1).

Among the superensemble forecasts using only two individual forecasts, SE(F1,F2) consistently had MAE lower than or equal to SE(F2,F3) and SE(F1,F3) (figure 4) and forecast lead time (electronic supplementary material, figures S1–S2). SE(F1,F2) was heavily weighted towards F2 for predictions of peak timing, and therefore had similar MAE to this forecast (figure 4). In contrast with forecasts of peak timing, where the superensemble forecasts were only as good as the best individual forecast, superensemble forecasts of peak incidence outperformed all three individual forecasts for several weeks in the early season, indicating that the biases of the individual forecasts were offset (figure 4). The superensemble forecasts did not provide a consistent advantage for predicted total incidence; both superensemble forecasts had greater MAE than one or more individual forecasts for most weeks; however, on average over all weeks the superensemble forecasts had lower MAE than the individual forecasts (table 1).

We tested the sensitivity of our results to the parameters and initial conditions of the individual forecast methods and found that the superensemble approach consistently provides greater forecast accuracy compared with the individual forecasts being averaged (electronic supplementary material, Supplemental Results, figure S3, table S2).

## 4. Discussion

Here we have presented three distinct systems for forecasting dengue incidence, each with certain strengths and weaknesses. By combining these individual forecasts, superensemble forecasts were generated that offset some individual system biases, while retaining reliable aspects of each forecast. Superensemble forecasts of peak and total incidence generally had MAE equal to or lower than any individual forecast. The superensemble forecasts also had lower maximum error than the F1 and F2 individual forecasts.

The findings here serve as a proof-of-concept for infectious disease forecast. The improved accuracy of the superensemble forecasts may have been somewhat limited by the relatively small number of outbreaks available for model training. For example, superensemble forecasts of peak timing were negatively affected by the relatively poor performance of F3 during the testing period, compared with the training period when the weights were assigned. The superensemble weights adjusted as more seasons were added to the training period, decreasing the contribution of the F3 system. We expect that the methods presented here will provide an even larger advantage for diseases, such as influenza, for which more observational data are available to inform superensemble weights.

We also expect the superensemble performance will improve further when more individual forecasting systems are used and the weightings among these candidate systems deviate more strongly from a simple average [35]. Indeed, the SE(F1,F2) and SE(F1,F2,F3) forecasts presented here performed only marginally better than a straight average (i.e. equal weighting) of individual forecasts, with the superensemble providing the greatest advantages over the straight ensemble in cases when the weights assigned to the respective forecasting systems are furthest from equal weighting (electronic supplementary material, table S3).

The methods presented here can incorporate any number and any type of competing forecasts. For example, the F1 forecast presented here used a very simplified model of disease transmission. Additional data on mosquito density and infection rates, as well as improved information on human immunology and dengue serotype interactions could be used to fit more realistic mechanistic models of dengue transmission, which could then be included in the superensemble. Similarly, we could include forecasts using stochastic models that simulate the infectious state of discrete individuals within a population while accounting for demographic noise [23,36]. The superensemble approach provides a formal method to weigh the strengths and limitations of each distinct forecast approach.

Additional training data are also expected to lead to greater advantages in using the weighted superensemble forecasts over the simple average forecasts. In this study, we used a constant weight for each season's forecasts, as the amount of data was not sufficient to justify further stratification of superensemble weighting. However, the relative strength of each forecast method varied based on circumstances such as the timing of the forecast (both with respect to the calendar week, and relative to the predicted outbreak peak, figures 1⇑⇑–4; electronic supplementary material, figures S1–S2), as well as indicators contained within the individual forecasts and in the observed data. For example, F1 forecasts have large errors early in the season but do well near and after the outbreak peak. Forecast indicators that can be used to inform superensemble weights for these F1 forecasts might include within-ensemble variance of the model-inference system and forecast streak (the number of consecutive forecasts predicting the same result), both of which have been previously shown to predict forecast accuracy [2,37]. The number of observed dengue cases in the weeks preceding a forecast relative to observations during those weeks in previous seasons might also be used to indicate whether the weighting of the historically based F2 and F3 forecasts are appropriate; for example F1 forecasts function well during a larger than usual outbreak, whereas the F2 and F3 forecasts might be prone to error and could, therefore, be discounted. If observations are available for multiple locations, as in the case of influenza, the relative performance of competing forecasts might be varied based on geographical and demographic characteristics [2]. Given sufficient data, the derivation of superensemble weights can be binned, or stratified, by metrics such as these in order to produce more specific weights.

## 5. Conclusion

In summary, we have demonstrated the use of a superensemble approach in order to combine information from multiple competing forecasts of disease incidence. As more real-time forecasts of infectious disease outbreaks are operationalized and incorporated into public health decision-making, it will be increasingly important to reconcile disparate forecasts and combine information from each in order to obtain the most accurate prediction of an unfolding disease outbreak. The work presented here is a first example of such a process.

## Data accessibility

The data used to produce forecasts are available at http://dengueforecasting.noaa.gov/.

## Authors' contributions

T.K.Y. and J.S. conceived and designed the experiments; T.K.Y. and S.K. performed the experiments; T.K.Y. and J.S. analysed the results; T.K.Y. and J.S. wrote the paper. All authors gave final approval for publication.

## Competing interests

J.S. declares partial ownership in S.K. Analytics. T.K.Y. and S.K. declare no competing interests.

## Funding

This work was supported by US NIH grant no. GM100467 (J.S.) and GM110748 (T.K.Y., S.K., J.S.), as well as NIEHS Center grant ES009089 (J.S.) and the Defense Threat Reduction Agency contract HDTRA1-15-C-0018 (T.K.Y., J.S.).

## Acknowledgements

Dengue incidence data were provided by the United States Centers for Disease Control and Prevention and the Puerto Rico Department of Health.

## Footnotes

Electronic supplementary material is available online at https://dx.doi.org/10.6084/m9.figshare.c.3491541.

- Received May 25, 2016.
- Accepted September 14, 2016.

- © 2016 The Author(s)

Published by the Royal Society. All rights reserved.