## Abstract

Many population genetic models have been developed for the purpose of inferring population size and growth rates from random samples of genetic data. We examine two popular approaches to this problem, the coalescent and the birth–death-sampling model (BDM), in the context of estimating population size and birth rates in a population growing exponentially according to the birth–death branching process. For sequences sampled at a single time, we found the coalescent and the BDM gave virtually indistinguishable results in terms of the growth rates and fraction of the population sampled, even when sampling from a small population. For sequences sampled at multiple time points, we find that the birth–death model estimators are subject to large bias if the sampling process is misspecified. Since BDMs incorporate a model of the sampling process, we show how much of the statistical power of BDMs arises from the sequence of sample times and not from the genealogical tree. This motivates the development of a new coalescent estimator, which is augmented with a model of the known sampling process and is potentially more precise than the coalescent that does not use sample time information.

## 1. Introduction

The genetic diversity of many pathogens is influenced by recent epidemiological history, and a variety of methods exist to estimate features of an epidemic history given random samples of pathogen genetic markers [1]. An issue that is central to how pathogen genetic diversity is understood is how infected individuals are sampled. A great deal of theory has been developed under the assumption of complete sampling, that is, that all infected individuals in the population are sampled and provide at least one pathogen sequence. These methods have found great utility for the study of small outbreaks [2,3], and for certain hospital-acquired infections [4]. A separate body of theory has developed for the study of epidemics where a sample of hosts is obtained for pathogen sequencing, and these methods are derived from classical population genetic models such as the coalescent [5,6] and classical population dynamics models such as the birth–death process [7,8]. This paper considers the scenario of incomplete sampling and the potentially confounding effects of non-random sampling through time on inference using the coalescent and birth–death-sampling formula [9].

The coalescent is a mathematical model of genealogies and describes the structure of genealogies generated by different demographic processes [10]. The coalescent has been the standard tool for demographic inference and is the underlying genealogical model in most phylogenetic software [11,12]. Under the neutral coalescent, the time between consecutive common ancestry events (the internode intervals) is modelled as a point process with a hazard rate *r*(*t*) that depends on the effective population size *N*_{e}(*t*) and the number of extant lineages in that interval *A*(*t*) at time *t* in the past. With time in units of the generation interval *τ*, this becomes

By relating the time of common ancestry to the population size, the coalescent enables estimation of the latter. A variety of non-parametric [13–15] and parametric [13,16,17] models have been developed for *N*_{e} as a function of time. The parametric models for *N*_{e}(*t*) tend to be deterministic functions of time, and we will consider such deterministic models in this paper, although there have been several recent attempts to fit stochastic demographic process models using the coalescent [18,19].

Birth–death processes trace their origins to work by Kendall [8], who showed how to calculate the probability that a given number of lineages will survive up to a given point of time in a stochastically growing population. Further results were developed by Thompson [20] and Gernhard [21], who showed how to calculate the probability density of genealogies generated by the birth–death process under complete sampling. These models were subsequently extended to account for incomplete sampling of the population by Stadler [9]. In order to account for incomplete sampling, the birth–death process must be combined with a model of the sampling process. Two sampling processes have thus far been considered in birth–death-sampling models (BDMs): sampling of lineages may take place at a constant rate; or, at a given point in time, a proportion of lineages may be sampled uniformly at random. These sampling processes may be combined, and recently developed methods allow sampling rates to vary through time according to a step function [22].

There are many variations on the coalescent and birth–death models that could be compared. Different coalescent and birth–death models make different assumptions about the demographic and sampling process, and each will be susceptible to different levels of bias by violation of those assumptions. We will focus on two models that have recently received considerable attention and have been used in epidemiological investigations. We use the BDM described in [9], and the coalescent model (CoM) for an unstructured population as described in [17]. Originally, CoMs were based on restrictive assumptions about the proportion of the population sampled and when taxa are sampled. CoMs were also based on strictly deterministic demographic processes, but all of these assumptions have been relaxed since the coalescent was first introduced. BDMs were originally based on census sampling at a single point in time, but that assumption has also been relaxed. Both models have been extended to consider heterogeneous structured populations [17,23].

The probability of observing a genealogy given demographic parameters may be calculated using either the CoM or the BDM, though these two models have very different mathematical foundations. The likelihood functions provided by each approach are difficult to reconcile mathematically, yet they tend to give similar results as we demonstrate below. The BDM has the advantage of accounting for stochasticity of the demographic process in an efficient and natural way. It is also possible to account for stochastically varying effective population size in the coalescent, but this has greater computational requirements [18]. A potential disadvantage of BDMs is that they require a model of the sampling process, whereas the coalescent makes no assumptions about how lineages are sampled through time. If the sampling process deviates from the simplistic processes that form the basis of current BDM theory, it is possible that estimates based on the BDM will be biased.

Both methods have particular advantages and vulnerabilities. Estimates based on CoMs may be biased by noisy demographic processes, and estimates based on the BDM may be biased by misspecification of the sampling process. In this paper, we will evaluate the vulnerability of both methods to these confounders. Because of the additional assumptions about sampling built into BDMs, it is difficult to make a direct comparison of the statistical power of BDMs and CoMs. If the sampling process is correctly specified, the observed sequence of sample times provides a great deal of information about the population size through time, which is not directly accessible with the CoM approach. Indeed, given a sequence of sample times, it is possible to estimate birth and death rates without a genealogy provided that the model of the sampling process is correctly specified (§3.1). We show that much of the statistical power of the BDM approach is derived from information in the sequence of sample times and not in the genealogy. This finding also suggests an enhancement to CoMs: if the sampling process is known, we can augment the CoM likelihood with a separate likelihood for the sequence of sample times. This augmented coalescent method is presented in §3.2.

In sampling at a single time point (homochronously), we show that estimates based on CoMs and BDMs are very similar. In §4.8, we show how the distribution of coalescent times predicted by CoM converges with large sample size to the distribution given by BDM.

## 2. The demographic and sampling processes

The population size *Y*(*t*) is modelled as a continuous-time Markov chain on [0, *∞*), which is governed by the following transition probabilities:
2.1where *λ* and *μ* are the *per capita* birth and death rates of the process, respectively. Initially, *Y*(0) = 1.

We investigated three distinct sampling processes for the reconstruction of genealogies from a simulated demographic history:

(1) Continuous sampling through time at constant rate. According to this model, after a lineage dies (with a per-lineage rate

*μ*), it is sampled with independent probability*p*.(2) Homochronous sampling. According to this model, every extant lineage at a predetermined time point is sampled with independent probability

*ρ*.(3) Weighted sampling. According to this model, each unit has a sample weight at the time of death. If {

*t*} is the set of death times for lineages indexed by_{i}*i*, the sample weights are . A sample is taken of*n*lineages without replacement with selection probabilities proportional to sampling weights.

Note that, with the exception of homochronous sampling, the lineages are only sampled at the time of death. This design is chosen for mathematical convenience, since it eliminates the possibility that a sampled lineage will be directly ancestral to another sample, which would yield genealogies with zero branch lengths [24], although more complex BDMs and CoMs can be applied in this situation.

In this paper, we use CoMs based on the following deterministic approximation *y*(*t*) to the stochastic process *Y*(*t*):
2.2with real-valued initial conditions *y*(0) that will be estimated.

Genealogies were generated by simulation of the birth–death process in continuous time using the software MASTER v. 1.7.1 [25]. For simulating genealogies with a time-dependent sampling rate, we developed a custom simulator for the birth–death process in Python (see the electronic supplementary material). We simulated 300 genealogies for each of the three sampling scenarios given above using *μ* = 1 and *λ* = 2 or *λ* = 1.25 or *λ* = 1.1. In the case of sampling through time, we terminated the simulation when 100 samples were collected and using a sampling probability of *p* = 1% or 50%. If sampling homochronously, we sampled 100 taxa after 9.21 or 25 units of time, yielding a sample proportion that varied around 1% or 20%, respectively. Simulations that failed to reach the target sample size were removed.

## 3. Estimation methods

All models are fitted by maximum likelihood (ML). The choice of ML was motivated by the simplicity of the demographic process, the speed of ML methods and the small number of free parameters. For the exponential growth process, there are four potential parameters that could be estimated: birth rates *λ*, death rates *μ*, the initial population size *y*(0) (needed for CoMs but not BDMs), and a parameter that describes sampling (needed for BDMs but not CoMs). As previously shown in the analysis of BDMs, at most two of these parameters are identifiable from a genealogy alone, and we must therefore choose which parameters to fix according to prior knowledge, and CoMs are subject to the same identifiability constraints. We focus on an epidemiologically plausible scenario, where birth rates and the number of infections are unknown, but independent clinical information provides information on death rates. Consequently, we will assume *μ* = 1 is known and will focus on the estimation of birth rate *λ* along with the nuisance parameters describing initial population size (for CoMs) or sampling rates (for BDMs). We will also consider the special case of homochronous sampling, in which we can reparametrize the CoM such that, like the BDM, estimates of the sampling fraction can be obtained.

Throughout the remainder of the paper, we use two symbols to denote time on different axes, and all dynamic variables will be defined on both axes. *t* will denote time from an arbitrary point in the past, whereas *s* will denote time before present. It will be useful to define the population genetic models in terms of the retrospective time axis *s*.

Let represent a genealogy consisting of a set of nodes , edges and a function that gives the time *s* before the present of each node. Every edge corresponds to a 2-tuple (*u*, *v*) such that and the node *u* is said to be ancestral to *v*. We will consider only rooted binary genealogies; every internal node has exactly two descendants, and all internal nodes but the root have exactly one ancestor.

For CoMs, we use the likelihood given in [17], and we denote the maximum likelihood estimator (MLE) birth rate .

This likelihood is that of a time-inhomogeneous point process with a hazard rate that depends on the population size and number of extant lineages. Specifically, following the approach in [17], the total population birth rate will be denoted *f*(*s*) = *λ**y*(*s*) and the rate of coalescence is
3.1where the first equality can be understood as the hypergeometric probability of selecting two lineages that are ancestral to the sample out of the set of *y*(*s*) lineages. Now let *x*′ denote the vector of node times (including sampled tips) in ascending order. The probability of observing the *i*‘th interval is
3.2And the likelihood is
3.3Note that the number of terms in the likelihood is the number of internode intervals 2*n* − 2 if all sampling times are distinct. One subtlety arises if more than one lineage is sampled at a single time point, such as with a homochronous sample, in which case we simply deduplicate elements in the vector *x*′ and adjust the number of terms in the likelihood.

For BDMs, we used the ML framework described in [9]. We denote the MLE birth rate . The R package expoTree [26] was used along with the implementation described here, and all results presented below are based on the best performing of the two implementations of the BDM likelihood. We simplified the likelihood equations in [9] to two situations: sampling occurs at a single time point with sample fraction *ρ*, or individuals are sampled with probability *p* at the time of death. Let *x* denote the vector of times before present for each internal node in in descending order. Note that *x*_{0} corresponds to the root of the tree. If the sampling takes place according to the homochronous process, *ρ* will denote the probability of sampling a lineage at a single point in time. Then,
3.4where *q*(·) is derived from the birth–death-sampling formula:
3.5and *c*_{1} and *c*_{2} are the following constants:
3.6and
3.7Note that the integral in the likelihood equation accounts for the unobserved time of origin of the birth–death process.

If sampling heterochronously at rate *ψ* = *μ**p*, the likelihood has a different form. Let *x* denote the vector times before present of each node as above, and let *y* denote the vector of sample times in any order.
3.8and *c*_{1}, *c*_{2} and *c*_{3} are the following constants:
3.9
3.10
3.11BDMs and CoMs were fitted according to the same numerical algorithm, with maximization of the likelihood accomplished in R using the simplex method. In order to ensure convergence to the global maximum, multiple starting conditions were drawn from a multivariate uniform distribution, and the likelihood optimized for each. The best model fit is reported among the three or five optimizations, although in general they converged to the same value.

### 3.1. Estimating birth rates using times of sampling

Consider the sequence of sample times in increasing order ** t** = (

*t*

_{1}, … ,

*t*). If the sampling process is known, the sequence of sample times is informative about population size. We will consider a simplistic sampling process such that individuals are sampled at a constant rate upon death, which is the sampling process underlying current BDMs. If sampling occurs at a constant known rate, it is straightforward to estimate the historical population size from the sample times, since the probability that a sample will be observed at some point in time is proportional to population size at that time. Therefore, it is possible to estimate the population size using sample time information alone, and not using the genealogy. We show here that it is possible to estimate the birth rate, even if the sampling rate is unknown. Two simple estimators are presented. The first is based on a simple regression with the expected cumulative number of samples through time. The second is based on treating the sample times as arising from a point process and using ML.

_{n}Let *S*(*t*) denote the cumulative number of samples collected up to time *t*. We show that the cumulative number of samples increases at the same rate as the unknown population size. According to the deterministic model, the expected change in *S* over time Δ*t* will be
3.12Consequently, the logarithm of *S*(*t*) ∝ (*λ* − *μ*)*t*. Regressing the vector log(*S*(** t**)) on the vector

**yields an estimate of the growth rate**

*t**k*=

*λ*−

*μ*, and using knowledge of

*μ*= 1 we have the regression estimator 3.13Figure 1 shows the number of samples through time, for a single simulated genealogy along with the regression line.

The likelihood approach is based on modelling the sequence of sample times as a point process and also makes use of the deterministic approximation to population size. The rate of a sample appearing at time *t* is
with *a* = *p**μ**y*(0) and *k _{s}* =

*λ*−

*μ*. The probability of

**is 3.14As with the regression estimator, 3.15**

*t*### 3.2. The augmented coalescent model

The genealogy and the sample times ** t** are conditionally independent given demographic and sampling parameters

*θ*= (

*λ*,

*μ*,

*p*,

*y*(0)). Therefore, the likelihood of both is the product of the marginal likelihoods given above (equations (3.15) and (3.3)): 3.16

We will denote the MLE birth rate as .

## 4. Results

The following results demonstrate the level of bias, precision and efficiency of different inference methods when estimating the birth rate from genealogies generated by the birth–death process.

### 4.1. Constant sampling rate

Figure 2 shows the distribution of MLEs for five estimators presented above based on 300 simulated genealogies. Simulations were based on a sampling process with constant sampling probability *p* = 1% at the time of death.

Estimators based only on the sequence of sample times ** t** perform well even though they do not use the coalescence times. The ML estimator consistently outperforms the simple regression estimator , presumably reflecting that residuals in the loglinear regression model are not i.i.d. normal.

Comparing a model that only uses sample time information (*λ*^{S}) with the CoM that only uses genealogical information (*λ*^{C}) we find that that the RMSE of is 11.5% compared with 11.8% for . In this instance, the sample time sequence is actually more informative than the coalescent times for inferring birth rates.

Comparing the BDM and coalescent, we find that the BDM is more precise (RMSE = 0.085) but slightly less accurate; the average bias of the BDM estimator was 0.022 (95% CI: (0.013, 0.031)) compared with 0.009 (95% CI: (−0.022, 0.005)) for the CoM estimator. Comparing the BDM and augmented CoM (a model that uses both coalescence and sample times), we find that the augmented CoM is slightly less precise than the BDM (RMSE of is 0.092), which may reflect the use of a misspecifed deterministic population size; however, we did not detect significant bias of (95% CI: (−0.012, 0.009)) in contrast to the BDM.

Figure 3 sheds some light on why the estimators perform differently by comparing the ML estimated by each method on each simulated genealogy. Electronic supplementary material, figure S1, shows a similar scatter plot of MLE birth rates. The BDM likelihood is highly correlated with that of all other estimators. By contrast, the CoM likelihood is almost independent of the estimators that use sample times only (Pearson correlation = 0.066). The highest correlation is found between the BDM and the augmented CoM (Pearson correlation = 0.95). This illustrates that the CoM is not using sample time information, but the BDM and augmented CoM are using information from both the sample times and genealogy.

### 4.2. Homochronous sampling

If all samples are collected at a single point in time, and if the sampling proportion is unknown, then the time of sampling and sample size confer no information about population size. The homochronous sampling case with unknown sampling rate therefore provides a fair comparison for BDMs and CoMs. Here, we consider 300 simulations of the birth–death process with a sample of *n* = 100 at *t* = 9.2, so that the sample fraction is around 1%, though it differs between replicates. The birth rate used in the simulations was *λ* = 2.

Figure 4 shows the distribution of MLE birth rates. The distributions are very similar and have similar precision (RMSE of is 0.106 and RMSE of is 0.101). The CoM estimator does not have detectable bias (95% CI of bias: (−0.0183, 0.0048)), but the BD model slightly overestimates birth rates (average bias = 0.036, 95% CI: (0.0242, 0.0470)). Figure 4 also shows that the log likelihoods of the MLEs generated by both methods are highly concordant up to a constant factor. The Pearson correlation of BDM and CoM MLs is 99.6%. The estimated birth rates also have a high correlation coefficient of 86.6%.

Comparing the RMSE of the BDM estimator in both the homochronous and constant sampling rate cases, it appears that having informative sample time information decreases the residual sums-of-squares of the BDM estimator by about 36%, but this gain in precision will certainly depend on parameters of the system and sample size.

We repeated the simulation exercise with a smaller birth rate (*λ* = 1.25) in order to assess if the CoM estimator would be less accurate if the population is growing more slowly. The MLEs are depicted in the electronic supplementary material, figure S2. With the smaller birth rate, we do not detect significant bias of the BDM estimator (average bias less than l × 10^{−3}, 95% CI: (−0.0069, 0.0077)), or with the CoM estimator (average bias 0.002, 95% CI of bias: (−0.0057, 0.0096)). The RMSE of the BDM and CoM estimators are similar (0.037 and 0.039, respectively).

### 4.3. Coverage

To assess the ability of both estimators to estimate accurate confidence intervals, we computed likelihood profiles with the bbmle package in R. We also computed confidence intervals using a parametric bootstrap method for the CoM estimator. 95% CIs were computed for each of 300 simulations with *λ* = 2 and 1.25 and homochronous sampling. The BDM estimator provides excellent coverage with profile likelihoods. When *λ* = 2, BDM has 95.3% coverage, and when *λ* = 1.25, BDM has 95.5% coverage. By contrast, when *λ* = 2, the deterministic CoM has 80.5% coverage, and when *λ* = 1.25, the deterministic CoM has 75.1% coverage using profile likelihoods.

Because the RMSE of the CoM estimator is similar to that of the BDM estimator, we hypothesized that a bootstrap method would provide more reliable confidence intervals for CoM. For each MLE based on CoM, we simulated 100 coalescent trees using MLE parameters, re-estimated *λ* for each, and computed confidence intervals based on quantiles of the bootstrap distribution. In order to maximize speed of the bootstrap algorithm, we simulated node heights using the approximate coalescent rates described in §4.8. We find that the CoM estimator has very good coverage with the parametric bootstrap method. When *λ* = 2, the deterministic CoM has 95.0% coverage, and when *λ* = 1.25, the deterministic CoM has 93.8% coverage.

### 4.4. Comparison of estimated sample rates

An alternative parametrization of the coalescent is in terms of the population size at the time of sampling in a homochronous scenario. In this case, we can calculate a deterministic approximation to the population size at time *s* in the past as
where *n* is the sample size and *ρ* is the sample proportion, and *n*/*ρ* is the population size at the time of sampling. According to this parametrization, we replace the nuisance parameter *y*(0) with *ρ*, and the coalescent estimates of the sample proportion can be directly compared to estimates with the BDM.

We fit the reparametrized CoM to the same genealogies used in §3.2 with *λ* = 1.25 and *μ* = 1. Figure 5 shows that the estimates are highly concordant with Pearson correlation of 99.7%.

### 4.5. Small reproduction number and high sample fraction

The CoM based on a deterministic demographic process may be most biased when the population size is small and subject to large stochastic fluctuations. We generated 300 trees from the BD process with *λ* = 1.1, *μ* = 1 and homochronous sampling with *n* = 100 and a variable sample fraction around 50%. The distribution of MLE birth rates is shown in figure 6.

We found small but significant bias in the estimated birth rates using both BDM and CoM methods. The mean bias of the BDM estimator was 0.019 (95% CI: (0.0148, 0.0244)), and the mean bias of the CoM was 0.021 (95% CI of bias: (0.0160, 0.0265)). The BDM had smaller RMSE (0.043 versus 0.47), and the Pearson correlation of estimated birth rates was 95%. A comparison of estimated birth rates is shown in the electronic supplementary material, figure S3.

### 4.6. Decreasing sample rate and small sample fraction

When the sampling model implicit to the BDM approach is misspecified, the BDM may yield highly biased results. Figure 7 shows the MLE birth rates for both the BDM and CoM estimators when the sampling rate changes through time according to e* ^{αt}* (see §2). One hundred and twenty simulations were carried out, and the sampling rate decreased at a rate of

*α*= −0.44. This value was chosen so that the expected sample size would be 100 if taking a weighted sample of all lineages at the time of death. Note that the sampling rate is an exponential function of time, so that the sequence of sample times still appears as though it arises from an exponentially increasing population, and there would be no warning from the sequence of sample times alone that the rate is changing. The BDM estimates are biased downwards by 0.23 (95% CI: (−0.2488, −0.2207)).

In this scenario, the CoM is robust to changing sample rate, since the CoM conditions on observed sample times. The CoM estimates did not have significant bias (95% CI of bias: (−0.0443, 0.0045)).

### 4.7. Increasing sample rate and large sample fraction

In these experiments, we examine bias in the coalescent due to sampling a large fraction of lineages from a small population growing stochastically. Three hundred genealogies with *n* = 100 were simulated from a birth–death process. Simulations were terminated when the number of deceased lineages reached 200, so that the sample fraction was 50% of deceased lineages and about 25% of all lineages. In the same experiments, we examined bias in BDMs due to a misspecified sampling process. In these experiments, the sampling rate increases from zero at time zero at a rate of *ρ* = *μ*.

Figure 8 shows the distribution of MLE birth rates. We do not find detectable bias with the CoM estimator (95% CI: (−0.0271, 0.0260)), despite using a misspecified deterministic approximation to the demographic process, and despite that a large sample of the population was taken and that the population size was only around 400 on average at the time of the last sample.

Because the BDM relies on a misspecified sampling process, the BDM estimator gives highly biased estimates in this scenario. The average bias was 0.46 (95% CI: (0.4460, 0.4920)).

### 4.8. Asymptotic distribution of coalescent times

Some insight into why CoM and BDM give similar estimates can be gained by comparing the asymptotic distribution of coalescent times predicted by both models in the case of homochronous sampling. The distribution of coalescent times in the limit of large sample size for a deterministic CoM can be easily computed, and we show that this distribution is equivalent to the marginal likelihood of a node given by the birth–death model.

In [27,28], an approximation to the lineages through time for the coalescent process was presented for a population under exponential growth:
4.1If sampling occurs at a single time point, such that *A*(0) = *n*, this has the unique solution
4.2where *y*_{0} is the population size at the time of sampling. We will call this a doubly deterministic coalescent model (DDCoM) because both the demographic and genealogical processes are modelled with deterministic approximations. The asymptotic distribution of coalescent times for the DDCoM is given by the derivative of *A*(*s*) (equation (4.1)) and expanding *y*(*s*) and normalizing
4.3
4.4The factor of *n* − 1 normalizes the distribution since there are *n* − 1 nodes in the tree. In [29], the DDCoM was found to be an excellent approximation to the stochastic coalescent for large populations.

The BDM likelihood takes the form of a product over coalescent times and sample times, including the time of origin. Conditioning on the time of origin, and given a homochronous sample, the likelihood is given by the product of marginal probabilities for each coalescent time. From equation (3.4), expanding *c*_{1},*c*_{2} and simplifying,
4.5where *c*_{1} and *c*_{2} are the following constants:
4.6and
4.7

### Theorem 4.1.

*Given a homochronous sample of a proportion* *ρ* *lineages from a population growing exponentially according to the birth–death process with birth rate* *λ*, *death rate* *μ*, *and* *λ* > *μ*,

for all times *s*.

### Proof.

By Taylor expansion of the denominator of equation (4.4), we have
4.8The limit of the numerator of equation (4.4) is
4.9Taking the large *n* limit of equation (4.8) and computing the ratio of (4.8) and (4.9), and rearranging, we have
4.10It may be verified that this is equivalent to *P*_{BDM} (equation (4.5)).▪

Note that this result applies to the DDCoM and not the CoM used elsewhere in the text. In [29,30], it was shown that the lineages through time given by DDCoMs are generally excellent approximations to lineages through time given by standard CoMs if the sample size is large.

Outside of the large-*n* limit, we can investigate the similarity of *P*_{BDM} and *P*_{DDCoM} numerically. To summarize the difference between distributions *P*_{BDM} and *P*_{DDCoM}, we compute the Kullback–Leibler divergence
Figure 9 shows the divergence as a function of sample sizes ranging from *n* = 2 to 2^{14} and with *λ* = 1.1, *μ* = 1 and *ρ* = 0.9. We find that divergence is very insensitive to birth rates and sample proportion, so results are only shown for one scenario. When *n* = 2, the divergence is quite high, but it rapidly converges to zero. We observe that, to excellent approximation, the divergence scales in a very simple way as a function of sample size: *D*(*P*_{DDCoM}, *P*_{BDM}*|**λ**μ*, *ρ*, *n*) ≈ e^{−}^{3/2}/*n*, and this is shown by the red line in figure 9.

Figure 9 also shows a comparison of the DDCoM marginal density of coalescent times with the BDM marginal likelihood with several different sample sizes and a smaller sample fraction of *ρ* = 0.01. When *n* = 2, the distributions are quite different, but when *n* = 10 they are very similar and when *n* ≥ 100 they are almost indistinguishable.

## 5. Discussion

Two distinct areas of concern have arisen related to phylodynamic inference using CoMs and BDMs. CoMs based on a deterministic demographic process may be subject to inductive bias if the deterministic process is a bad approximation to the true stochastic demographic process. Similarly, BDMs are subject to bias if the model of the sampling process is misspecified. We have found that the bias due to the deterministic approximation is generally very small for populations growing exponentially, even when sampling 50% of individuals from a small population. Furthermore, errors in CoMs due to a deterministic process can be resolved with additional computational effort, as it is possible to use the coalescent with a stochastic demographic process [19,31]. Such methods may be necessary for populations with very small and noisy population dynamics. Bias is likely to be greatest if the population is small and growing slowly such that population dynamics are relatively noisy. Indeed, we found only one situation where the BDM was noticeably more precise than CoM estimators, which occured with a small *R*_{0} of 1.1 and a large sample fraction; however, we did not find a situation where the BDM estimator was substantially less biased than the CoM estimator. Confidence intervals based on profile likelihoods have superior coverage if using BDMs rather than deterministic CoMs, which may reflect the explicit incorporation of stochastic population dynamics in the BDM estimator. However, the combined use of the CoM with confidence intervals based on parametric bootstrapping gave estimates with low bias and very good coverage.

We have found that BDMs can yield highly biased estimates if the sampling process is misspecified. It may be hard to detect if the sampling process deviates from the modelled form in many real-world situations, and most real datasets are likely to violate the BDM sampling process assumptions to some degree. An example heterogeneous sampling through time is shown in figure 10 for a dataset which has previously been analysed with BDMs in [32]. Figure 10 shows the sampling proportion through time of HIV sequence samples in the UK HIV Drug Resistance Database [33]. Typical for HIV sequence databases, the sample proportion is essentially zero throughout the 1980s, and there is a rapid increase in sampling effort throughout the late 1990s and early 2000s, followed by a plateau after 2010 due to reporting delays. In [32], a BDM susceptible–infected–recovered (SIR) model was fitted to HIV sequence data from the UK under the assumption that the sampling rate jumped from zero to a constant rate, but the time span of the estimated phylogenies ranged from 1978 to 2003 over which the true sampling rate varied greatly.

Future work should explore how violation of sampling assumptions may bias estimates of *R*_{0} when fitting BDM SIR models.

The sequence of sample times may be informative about the population size through time if the sampling process can be correctly specified. We have shown how birth rates may be estimated from the sequence of sample times if sampling occurs according to the BDM assumptions, and this is possible even if the sample rate is not known. BDMs implicitly use the sequence of sample times to estimate birth and/or death rates, and this is the case even if the sampling rate is not given, but estimated. Comparisons of CoMs and BDMs should account for the effects of sampling, and a fair comparison can be obtained in the case of homochronous or serial-homochronous sampling with unknown sample rate, so that the sample times contain no information about population size and birth rates.

Previous simulation-based studies on fitting SIR epidemiological models to sequence data [31] have purported to show increased statistical efficiency of BDMs relative to CoMs, but these studies did not control for the informativeness of sample times, and the supposed advantage of BDM in these simulations is likely to be due to the sampling model and not the genealogical model. For example, the simulation studies in [31] did not consider a homochronous sample, a misspecified sampling process, or the possibility of extending the coalescent estimators to use sample time information. The study in [31] used a Bayesian method, in contrast to our ML methods, so some differences may also be due to the choice of priors. Popinga *et al*. [31] hypothesized that the difference in performance of BDMs and CoMs was due to the latter's use of a misspecified deterministic demographic process, but in the context of exponential growth, we found very little bias due to the deterministic approximations of the coalescent, but large biases due to the effects of sampling.

Future research on BDMs may reveal ways to accommodate more realistic sampling processes. For example, in [22], a piecewise constant sampling process was presented; however, this also required the introduction of many more parameters to describe the sampling process. If the sampling process is known, a useful alternative to BDMs is to model the sampling process in tandem with the coalescent. As we have shown, the coalescent likelihood of a genealogy is approximately independent of the likelihood of the sample times, and for complex sampling processes it is much easier to model the genealogical and sampling process separately and combine likelihoods than to derive a joint likelihood. In the case where stochasticity is important but the sampling process is complex, the combined use of the CoM likelihood and parametric bootstrapping offers a means to obtain reliable parameter estimates and associated confidence intervals.

## Funding statement

E.M.V. is supported in part by US NIH Models of Infectious Disease Agent Study (MIDAS) grant U01GM110749. S.D.W.F. is supported in part by a UK Medical Research Council (MRC) Methodology Research Programme grant (MR/J013862/1). This research is jointly funded by the UK MRC and the UK Department for International Development (DFID) under the MRC/DFID Concordat agreement.

## Acknowledgements

The authors thank Alison Brown (PHE UK) for providing HIV diagnosis statistics from the UK. Xavier Didelot and Caroline Colijn (Imperial College London) provided many helpful suggestions. The funders had no role in study design, data collection and analysis, decision to publish or preparation of the paper.

- Received August 22, 2014.
- Accepted October 7, 2014.

© 2014 The Authors. Published by the Royal Society under the terms of the Creative Commons Attribution License http://creativecommons.org/licenses/by/4.0/, which permits unrestricted use, provided the original author and source are credited.