Abstract
We present an accelerated algorithm to forward-simulate origin-fixation models. Our algorithm requires, on average, only about two fitness evaluations per fixed mutation, whereas traditional algorithms require, per one fixed mutation, a number of fitness evaluations of the order of the effective population size, Ne. Our accelerated algorithm yields the exact same steady state as the original algorithm but produces a different order of fixed mutations. By comparing several relevant evolutionary metrics, such as the distribution of fixed selection coefficients and the probability of reversion, we find that the two algorithms behave equivalently in many respects. However, the accelerated algorithm yields less variance in fixed selection coefficients. Notably, we are able to recover the expected amount of variance by rescaling population size, and we find a linear relationship between the rescaled population size and the population size used by the original algorithm. Considering the widespread usage of origin-fixation simulations across many areas of evolutionary biology, we introduce our accelerated algorithm as a useful tool for increasing the computational complexity of fitness functions without sacrificing much in terms of accuracy of the evolutionary simulation.
1. Introduction
A key goal in the field of molecular evolution is to understand the processes by which proteins acquire substitutions and diverge from one another. The phenotypic effect a mutation has and whether it ultimately becomes fixed or lost in a population both depend, in part, on the biochemical and biophysical effects the mutation has on protein structure and function [1–5]. By studying the physio-chemical forces that underlie the substitution process we can begin to untangle the mechanisms that have given rise to organismal diversity, shedding light on linage-specific functional divergence [6–9].
While considerable progress has been made in characterizing the substitution process through mathematical modelling and from studying extant sequences, further understanding can be gleaned from observing evolution directly via simulation. However, simulating fitness landscapes that explicitly and accurately model the effects of individual mutations on protein structure and function can be computationally burdensome. Traditionally, this problem has been addressed by using coarse-grained protein models, such as lattice or bead models, owing to their computational tractability [3, 10]. Here, we take a different approach to solve the problem of computational tractability. Rather than relying on highly simplified fitness landscapes, we propose a more efficient algorithmic strategy for simulating evolution.
Our approach is based on the origin-fixation model, which applies in the limit of strong selection and weak mutation [11]. In this model, populations are assumed to be monomorphic and represented by a single genotype. Mutations are introduced sequentially (origin) and either accepted or rejected (fixation). This model has been applied in a number of different contexts [12–18] to simulate evolution over various fitness landscapes, and it is of course much more computationally efficient than simulating all individuals in a population. However, even origin-fixation simulations can be computationally expensive. These simulations require fitness calculations for a large number of genotypes, and the majority of these genotypes will be rejected and not influence the evolutionary path taken. For fitness landscapes where the evaluation of a single genotype is computationally expensive, for example when fitness is based on detailed, atomistic models of protein structure and dynamics, it is desirable to minimize the number of these evaluations. The fewer fitness evaluations we have to perform per accepted mutation, the more computational time we can expend on each individual fitness calculation, and hence the more realistic a model we can simulate.
In the context of protein evolution, recent work has argued, based on first principles, that most fixed mutations have to be neutral or nearly neutral [19,20]. In this scenario, we expect that for one fixed mutation, we have to evaluate and discard a number of mutations of the order of the effective population size Ne. Discarding that many mutations after evaluating their fitness is a highly inefficient way of simulating evolution, and any reduction in the number of fitness evaluations required per fixed mutation will speed up the simulation. Here, we demonstrate that, by using a modified acceptance criterion for fixed mutations, we can reduce the number of required fitness evaluations to approximately two per one fixed mutation, a speed-up of the order of Ne. Our algorithm, which uses a scaling idea originally introduced in the context of Metropolis–Hastings sampling [21,22], is guaranteed to yield the same evolutionary steady state as the original origin-fixation model. Further, we demonstrate through extensive simulations with toy models and with an all-atom, energy-based protein model that the key difference of the accelerated algorithm is reduced variance in fixed selection coefficients. The magnitude of reduced variance is small, of the order of the difference between a Wright–Fisher model and a Moran model, and it can be compensated by choosing a slightly smaller effective population size Ne.
All simulation models make assumptions about how the underlying processes operate, and the relative validity of these assumptions hinges on the specific context in which a model is used. Our accelerated algorithm presents a useful way to rapidly simulate protein evolution in the context of a detailed, atomistic model of protein structure. However, our algorithm does perturb the evolutionary process and produces data that may not be appropriate for some types of analysis. Ultimately, the balancing point between more detailed descriptions of individual fitness and the accuracy of an evolutionary model depends on the questions being posed of the data. We present our algorithm as an option along this continuum, favouring a rigorous treatment of individual fitness over evolutionary precision.
2. Results
2.1. Evolutionary simulations using Metropolis sampling
To simulate a population of size Ne, one would ordinarily have to keep track of Ne different, individual genotypes at all times. However, under the assumption that the product of mutation rate μ and population size is small, μNe ≪ 1, such a simulation can be substantially simplified. In this limit, the population is nearly entirely monomorphic at all times, and thus it can be represented at any time point by a single, resident genotype i. When a new mutation j arises, it is either rapidly lost to drift or it goes to fixation and replaces the previously resident genotype i. Thus, the evolution of the entire population can be described by the sequence of resident genotypes that get fixed over time. Models that make this simplifying assumption are referred to as origin-fixation models, because new mutations originate and either fix, in which case they become the new resident genotype or do not fix, in that case, they are discarded. Origin-fixation-type models have been used widely to study a variety of biological questions, and they have been fruitful both for efficient simulations [11–15] and for mathematical modelling [23–27].
The accelerated algorithm, we present here, is also an origin-fixation simulation. The simulation is initialized with a single genotype i representing the resident genotype at time t = 0. The algorithm then proceeds in discrete iterations. At each iteration, a novel mutation is tested and is either accepted and becomes the new resident genotype or rejected and the current resident genotype remains. The difference between our algorithm and the original origin-fixation approach is merely the specific expression used for the probability of accepting a mutation. Our formulation leads to many more accepted mutations and thus speeds up simulations substantially.
To illustrate the differences between the original algorithm and our accelerated variant, we have to first outline the theory underlying origin-fixation models. Consider a population of Ne individuals evolving in discrete time steps (i.e. Wright–Fisher sampling [28]). We denote the fitness of genotype i by fi and the mutation rate from genotype i to genotype j by μji. We assume a symmetric mutation matrix, such that μij = μji. Under these assumptions, the evolutionary process can be approximated by a discrete Markov process that describes how the resident genotypes transition from one to another over time [23]. This Markov process is defined via a matrix Wji which holds the transition probabilities from state i to state j. The matrix elements are given by
2.1The term μjiNe describes the origin of new mutations. It represents the rate at which mutations from i to j are generated in a population of size Ne. The term π(i → j) describes the fixation of mutations. It represents the probability that mutation j goes to fixation in the background of resident genotype i. The sum over k runs over all possible genotypes considered in the model.
The fixation probability π(i → j) can be written as [23]
2.2Using this expression, it is possible to analytically calculate the stationary frequencies Pi of the Markov process [23]. These frequencies provide the probabilities that the Markov process will be in state i at a randomly chosen point in time in dynamic equilibrium, and for the model, we are studying here they are given by [23]
2.3These frequencies satisfy the detailed balance equation [23]
2.4
To simulate an origin-fixation model, we start with a genotype i and generate a new mutation according to the mutation matrix μij. We then accept or reject the mutation according to the fixation probability, as given by equation (2.2) [11]. Thus, we can think of the fixation probability as an acceptance criterion for a proposed mutation. Importantly, if a fitness landscape has a large proportion of neutral or nearly neutral genotypes, such that fi ≈ fj for many pairs of i and j, then, on average, of the order of Ne mutations need to be evaluated before one is accepted, because, in this limit, π(i → j) ≈ 1/Ne.
We now describe how we can modify this algorithm to speed up simulation. Our modification is based on the fact that Wij satisfies the detailed balance condition equation (2.4). Because of this condition, we can rescale Wij with any symmetric matrix and retain the same steady-state frequencies Pi. Let W′ij denote a rescaling of Wij for all non-zero paths in the matrix, such that
2.5where τij
2.6Here τij is defined such that we rescale transitions along the (i, j) edge with the inverse of the probability of fixing the higher-fitness genotype when starting out at the lower-fitness genotype.
By comparing W′ij with Wij, we see that we can incorporate τij into a modified fixation probability,
2.7If we interpret this modified fixation probability as an acceptance criterion in an origin-fixation model, we see that, under this new acceptance criterion, we always accept mutations that increase fitness. Mutations that decrease fitness are exponentially unlikely to be accepted.
Notably, equation (2.7) looks exactly like the Metropolis–Hastings acceptance criterion widely employed in Markov chain Monte Carlo sampling [21,22], and indeed, our rescaling approach is equivalent to and was directly motivated by the original work of Metropolis et al. [21]. The advantage of equation (2.7) over equation (2.2) is that we now accept every neutral or beneficial mutation, hence the simulation becomes much more efficient. However, this advantage comes at a cost. Even though the steady state of the simulated Markov process remains unchanged, by construction, the order in which states are visited has changed. Because we use a different rescaling constant along each mutational path, the relative likelihood with which specific paths are chosen has changed between the original and the accelerated process. In particular, in the accelerated process, we have much more rapid accumulation of neutral or nearly neutral mutations relative to the rate at which highly beneficial mutations accumulate.
To evaluate the effect of this distortion, we performed a set of simulations using a simple three-state fitness landscape. The three states are denoted as A, B and C, and states A and B are assigned the same fitness (1.0) whereas state C has a higher fitness (1.25). The transition probabilities between states are shown in figure 1a for the regular acceptance criterion equation (2.2) and in figure 1b for the accelerated acceptance criterion equation (2.7). In both cases, we used a relatively small Ne = 10, to ensure the process regularly moves from the higher-fitness state to one of the lower-fitness states.
Comparison of the original and accelerated sampling algorithms on a simple three-state fitness landscape. State A has a fitness of 1.0, state B a fitness of 1.0 and state C a fitness of 1.25. (a) Directed graph showing the transition probabilities between the three states, calculated using the original fixation probability equation (2.2) and assuming Ne = 10. (b) Directed graph shows the transition probabilities between the three states, calculated using the accelerated fixation probability equation (2.7) and again assuming Ne = 10.
We simulated origin-fixation models over the three-state landscape using both acceptance criteria, and we found that the mean time spent in any state did not differ significantly between the two algorithms (figure 2a and electronic supplementary material, table S1), as expected owing to the construction of the accelerated algorithm. This observation verifies our mathematical prediction for the accelerated algorithm. In terms of simulating a biological model, it implies that the overall amount of time a population spends fixed at a specific genotype is equivalent between the two algorithms. Interestingly, the variance in the amount of time spent in any state did differ (electronic supplementary material, table S1). Figure 2b shows that the fraction of types of transitions that occurred (e.g. A–B or B–A) differed between the two algorithms as well (electronic supplementary material, table S2). This observation implies that the accelerated algorithm moves between genotypes more often than does the original algorithm. In combination, these results suggest that the two algorithms produce distinct patterns in the order states are sampled and in the frequency of transition types. We also inspected the distributions of selection coefficients produced by the two algorithms (electronic supplementary material, figure S1). Although the distributions appear similar, a Kolmogorov–Smirnov (KS) test revealed that they are not equivalent (p = 5.0 × 10−7).
Comparison of the frequencies with which states are visited and with which specific transitions occur in the original and the accelerated sampling algorithm, using the three-state fitness landscape of figure 1 and Ne = 10. (a) The mean number of times a state is visited does not differ significantly between the two algorithms (t tests, all p > 0.2, electronic supplementary material, table S1). However, the variance in the number of times a state is visited does differ (electronic supplementary material, table S1). (b) The mean fractions of specific transitions between states differ between the two algorithms (all p < 0.001, indicated by three stars, electronic supplementary material, table S2).
To assess whether the differences in results generated by the two algorithms were due to the modified acceptance criterion and not some underlying feature of the particular fitness landscape considered, we repeated this experiment on a different three-state fitness landscape. We found that these additional simulations resulted in similar dynamics to those obtained on the initial landscape (electronic supplementary material, figures S2–S4 and tables S1 and S2).
2.2. Application to protein evolution
To evaluate the performance of our accelerated algorithm in a more realistic context, we constructed simulations of protein evolution using a computationally costly all-atom model of protein structure. The simulation was initialized with a small protein as the resident genotype. At each iteration, a mutation to a non-resident amino acid was introduced to a random location and the structure was locally repacked 10 Å around the mutation. The stability (ΔG) of the mutant was evaluated with Rosetta's all-atom score function [29,30]. To convert protein stability into fitness, we used a soft-threshold model. This model assumes that the protein's fitness is given by the fraction of proteins in the ground state in thermodynamic equilibrium [31–33]. This assumption results in a sigmoidal fitness function (specifically, the Fermi function), where very stable proteins have a fitness of one and fitness declines as stability passes through a threshold value. We calculate the fitness of protein i as
2.8where β is the inverse temperature, ΔGi is the stability of protein i and ΔGthresh is the stability threshold at which the protein has lost 50% of its activity.
To simulate the accelerated algorithm, it is convenient to log-transform fitness,
2.9so that the fixation probability equation (2.7) can be expressed as (assuming Ne − 1 ≈ Ne)
2.10As in equation (2.7), the transitions from a lower fitness to a higher fitness always happen, while transitions from a higher fitness to a lower fitness are exponentially suppressed in Ne and in the difference in log-fitness xi − xj.
The soft-threshold fitness model of protein evolution has been studied previously [32,34], and the relationships it produces between protein stability, population size and inverse temperature β are well understood. As effective population size Ne increases, increasingly smaller fitness differences become visible to selection, and this results in an increase in protein stability in the steady state. The inverse temperature β determines how hard or soft the fitness threshold is. For very large values of β, the fitness function turns into a hard cut-off, producing a fitness of 0 for ΔGi > ΔGthresh and a fitness of 1 for ΔGi < ΔGthresh. For smaller β, the fitness function becomes increasingly shallow, and this more shallow fitness function allows for a larger range of ΔG values to be explored at the point of mutation–selection balance. This effect ultimately increases the amount of variance in stability sampled over the course of the simulation.
To compare the original and accelerated algorithm for this protein model, we initially simulated evolutionary trajectories in triplicate under both algorithms and recorded the first 500 accumulated substitutions. In these simulations, we set Ne = 100, β = 1 and ΔGthresh = ΔGinit/2, where ΔGinit is the initial stability of the protein. (For the protein used in these simulations, ΔGthresh = −267, measured in arbitrary units.) For both algorithms, we initialized the simulations with an initial stability ΔGinit far from the threshold stability, so that we could compare the equilibration behaviour between the two algorithms. We saw that protein stability ΔG equilibrated after about 200 substitutions in both cases, and subsequent fluctuations in stability were comparable across algorithms (figure 3a). The equilibrium stability ΔG was close to the threshold value. Notably, the original algorithm exhibited a greater amount of variation in the ΔΔG of fixed mutations than did the accelerated algorithm (figure 3b).
Stability ΔG (a) and change in stability ΔΔG caused by fixed mutations (b) for the first 500 accepted mutations when simulating the evolution of a small protein under a soft-threshold fitness model. Simulation parameters were Ne = 100, β = 1 and ΔGthresh = −267. Points represent individual measurements of ΔG or ΔΔG, pooled over replicate runs. Lines represent a local smoothing of the data, and shaded areas represent the standard error around the smoothed estimate.
The average number of proposed mutations necessary for 500 substitutions was 141 035 using the original algorithm and 1299 using the accelerated algorithm. The mean runtime of the accelerated algorithm was 2.55 min and the original algorithm required, on average, 2.73 h. Thus, we observed a speed-up of approximately 64-fold at a simulated population size of Ne = 100.
We next compared the equilibrated behaviour of the original and accelerated algorithms. We ran simulations for 1500 accumulated substitutions and discarded the first 1000 substitutions as burn-in to ensure that the stationary state was reached and only equilibrium sampling behaviour was included in the analysis. To examine whether the original and accelerated algorithms produced comparable patterns of divergence, we compared the distributions of the numbers of substitutions at individual sites. We found that both distributions were highly similar (electronic supplementary material, figure S5a, KS test; p = 1). We also compared the specific numbers of substitutions at individual sites. While these numbers are noisy, owing to limited sampling, substitution numbers at individual sites were significantly correlated (electronic supplementary material, figure S5b, Pearson's R = 0.603, p < 2.2 × 10−16). Moreover, the site-specific variation among algorithms was comparable to the site-specific variation exhibited across replicate experiments using the same algorithm (electronic supplementary material, figure S6).
We also investigated whether the accelerated algorithm had an effect on how epistatic mutations accumulate. A number of recent studies have focused on understanding the influence of epistasis at the level of position-specific substitutions along a protein sequence [6,13,19,35]. A core concept in these works is that substitutions that were neutral or nearly neutral at the time of fixation become entrenched over time, and the probability of reverting back to the substitution's predecessor diminishes as additional mutations accumulate [19]. We examined the probability of reverting to an ancestral state to assess if our accelerated algorithm caused altered fixation patterns of epistatic mutations during equilibrium behaviour. For every substitution, we measured, after each of the subsequent 15 substitutions, the ΔG of the protein with the initial mutation reversed and then calculated the probability of fixation of the reversion mutation in the current background. Because the original and accelerated algorithm produced acceptance probabilities on different orders of magnitude, we normalized these probabilities to the sum over the 15 Markov steps, to allow for comparison. Under both algorithms, as substitutions accumulate the probability of accepting a reversion mutation to the ancestral state declines (electronic supplementary material, figure S7), as expected from the recent literature [6,19]. The rate of decline seems to be slightly faster in the accelerated algorithm, but overall both algorithms show very similar behaviour.
2.3. Analysis of selection coefficients and rescaling of Ne
We also compared the distribution of fixed selection coefficients during equilibrium behaviour (figure 4). For both algorithms, these distributions have means close to zero, but the distributions themselves are significantly different (KS test; p = 4.2 × 10−3). Notably, the accelerated algorithm is more conservative and fixes more mutations with smaller selection coefficients. The bulk of mutations fixed by either algorithm are neutral or nearly neutral (defined as |s| ≤ 1/Ne [36]), 95% under the original algorithm and 97% under the accelerated algorithm.
Comparison of the distribution of fixed selection coefficients generated under the original or accelerated sampling. Simulation parameters were as in figure 3, but measurements were taken over 500 accepted mutations after an initial burn-in phase of 1000 accepted mutations. (a) Distributions of fixed selection coefficients. The two distributions differ significantly (KS test; p = 4.2 × 10−3). (b) Q–Q plot of the two distributions of fixed selection coefficients. The x = y line is given in red. The Q–Q plot highlights systematic differences between the two distributions.
We found that the larger amount of variation in selection coefficients observed in the original algorithm could be recaptured by decreasing Ne in the accelerated algorithm (electronic supplementary material, figure S8). Specifically, we could rescale Ne to Ne′ = C Ne, where we chose the constant C such that it maximized the p-value in a KS test of the two distributions of selection coefficients. We then repeated the accelerated simulations using the rescaled Ne′ and then compared the distributions of selection coefficients between the original and rescaled accelerated simulations. We found that now the KS test indicated no difference between the two distributions (p = 0.72, figure 5a).
Rescaling Ne in the accelerated algorithm minimizes differences between the two algorithms. Simulation parameters were β = 1, ΔGthresh = −267 and Ne = 50 or Ne = 37 for the original or the accelerated algorithm, respectively. (a) Distribution of fixed selection coefficients using both algorithms. These distributions are equivalent (KS test; p = 0.72). (b) ΔG of first 500 accepted mutations. (c) ΔΔG of first 500 accepted mutations. (d) Probability of accepting a reversion substitution. Probabilities are normalized to one for each original substitution analysed. Individual points show the probability averaged over 1500 substitutions (500 substitutions in three replicate experiments), the solid line represents a model of exponential decay (y = ce−ax + b) fitted to the data (see electronic supplementary material, table S3 for fitted parameter values), and the shaded area represents the standard error on the fitted curve.
To examine more systematically the effect of rescaling Ne in the accelerated algorithm, we repeated several of the previously outlined analyses after rescaling. We found that rescaling eliminated most of the previously observed differences between the original and the accelerated algorithm (figure 5b–d). The equilibrium protein stability decreased slightly, though by a barely notable amount, when using the accelerated algorithm with rescaled Ne (figure 5b). However, the amount of variation observed in ΔΔG (figure 5c) was more similar to the original algorithm. The pattern of decay in the probability of accepting a reversion mutation was also now more similar to the original algorithm (figure 5d).
We determined the optimal scaling constant C at two additional values of Ne, and we found a linear relationship between the rescaled Ne used in the accelerated algorithm and the Ne used in the original algorithm (figure 6). This linear relationship suggested that the accelerated algorithm requires an approximately two-times smaller effective population size to generate comparable genetic drift relative to the original algorithm.
Relationship between the Ne used in the original algorithm and the rescaled Ne required by the accelerated algorithm to produce similar distributions of fixed selection coefficients. Simulation parameters were β = 1 and ΔGthresh = −267. Simulations using the rescaled Ne in the accelerated algorithm were found to have equivalent distributions of selection coefficients to their original sampling simulation counterparts (KS tests; p = 0.75, 0.72, 0.19 for Ne = 100, 50, 25, respectively). The red line represents a linear fit with slope m = 0.53 and intercept b = 11.25.
3. Discussion
We have presented an accelerated algorithm to simulate origin-fixation models. Our algorithm improves run-time efficiency of simulations by approximately an order of Ne. For example, in simulations with Ne = 100 we found a 64-fold speed increase relative to the regular algorithm. Our accelerated algorithm does not change the frequency with which states are sampled, but it does alter some aspects of the evolutionary process. For example, in the case of proteins evolving under a soft threshold model, we have found that our algorithm does not substantially modify equilibrium protein stability, the distribution of accumulated substitutions or epistatic interactions. However, we have observed that the variance in selection coefficients of fixed mutations is reduced.
Notably, the reduction in variance is relatively minor, and we can recover the original variance by using a slightly reduced population size when simulating with the accelerated algorithm. The scaling between the original and reduced population size is approximately a factor of 2, of the order of the differences between haploid and diploid models or between Wright–Fisher sampling and the Moran process [37]. Of course, reducing Ne results in a slightly modified steady state, and for the soft-threshold model most notably a very minor decrease in equilibrium stability, as expected by the intrinsic relationship between population size and protein stability [34]. However, otherwise, we have observed little difference between the accelerated algorithm with rescaled population size and the original algorithm with original population size.
While our algorithm demonstrates a considerable performance gain, it also exhibits several shortcomings. First, as stated, our algorithm results in a reduction in the amount of variance sampled along evolutionary trajectories, and thus implicitly simulates a larger effective population size Ne than is used as the parameter value in the simulation. However, the implicit increase in Ne is moderate, of the order of a factor of 2. Second, we have emphasized equilibrium dynamics here. Considering that the paths sampled by the two algorithms differ, we expect that the approach to equilibrium will differ between the algorithms. Hence, the two algorithms are likely not going to perform similarly away from equilibrium. Third, we have only compared the behaviour of these algorithms in simulations with relatively small values of Ne (Ne ≤ 100). While our algorithm is not expected to perform substantially differently at larger values of Ne, we cannot test this hypothesis because of computational constraints associated with using large Ne in the original algorithm. Despite the various shortcomings of our algorithm, for questions that only deal with equilibrium behaviour and whose answer does not depend on sampling order or variance in the selection coefficients, the performance gain from using the accelerated algorithm is substantial.
It is well known in population genetics that effective population size, genetic drift and evolutionary time are closely interconnected. In fact, there are multiple ways to define effective population size based on the aspects of genetic drift in question [38,39]. Considering that our rescaling of the probability of transitions between genotypes is analogous to rescaling time, downstream effects on how genetic drift and Ne operate within our model are to be expected. Examining the reduction of variance in the distribution of selection coefficients and the reduction of variance in states sampled on a simplified landscape, we infer that under our accelerated algorithm drift is more limited than under regular sampling. Considering the inherent relationship between population size and drift, it then follows that we are able to recapture the breadth of drift by decreasing the Ne used in our accelerated algorithm. While the exact linear relationship, we found here between the Ne in the original simulation and the Ne in the accelerated simulation may depend on aspects of the particular protein we studied, we expect that similar rescalings will be possible for most fitness landscapes.
The accelerated algorithm we have presented here has been used previously [17,40] but without consideration of whether it actually represents an accurate approximation to a realistic evolutionary model. These studies simply assumed that using the Metropolis–Hastings acceptance criterion instead of the ordinary fixation probability would not substantially perturb the evolutionary process, but they did not explicitly test this assumption. We have shown here that this algorithm may indeed be suitable in a variety of different modelling contexts.
Accelerated simulation algorithms have also previously been proposed in the context of adaptive walks [41–45], which are origin-fixation models with the added assumption that deleterious mutations never fix. This assumption is justified in the limit of very strong selection, when |(fi − fj)Ne| ≫ 1 for all or most pairs of i and j ≠ i. For the special case of Gillespie's mutational landscape model [46,47], which describes the fitness landscape in terms of fitness differences from a focal genotype, strategies have been developed to efficiently simulate these models by evaluating only a subset of mutant genotypes [42,44,45].
The enhanced performance of our algorithm opens the door for a wealth of studies which could benefit from the use of more sophisticated models of protein structure or of other aspects of the fitness landscape. Further applications of our work include examining the role and source of entrenchment with a more rigorous treatment of protein structure than is currently being employed [13,19,20,35]. Our algorithm can also be applied to more detailed studies of the way in which proteins maintain interacting partners over long evolutionary periods [40]. Finally, our algorithm is flexible enough to allow for any fitness function to be used in place of the soft-threshold stability-based model we used here. For example, efficient versions of evolutionary simulations of pathway flux [48] could be constructed and the probability of reversion to an ancestral state could be examined to study the role of epistatic effects at the pathway level.
The field of molecular evolution has long sought to understand the relationship between protein structure and sequence evolution. Here, we have introduced an accelerated simulation algorithm that may serve as a useful tool in pursuing this goal. Our work demonstrates that simplified or coarse-grained models of protein structure and function are not required for simulation studies of protein evolution. Our algorithm is sufficiently fast that fitness landscapes based on full-atom approximations using empirical potentials [49] or composite statistical- and physics-based potentials [30] can easily be simulated on modern desktop computers. Even though our algorithm does perturb some aspects of the evolutionary process, these perturbations are minor and/or do not influence the main quantities of interest. Thus, our algorithm paves the way for future simulation studies that use fitness landscapes with much increased levels of realism.
4. Methods
4.1. Simulations over a three-state fitness landscape
Simulations over the three-state fitness landscape were performed using custom Python scripts. States A and B were assigned the same fitness (1.0), and state C was given a higher fitness (1.25). Ne = 10 was chosen to allow for adequate sampling of transitions to lower fitness states. At each step in the simulation, a random state excluding the state currently occupied was proposed. The proposed state was accepted or rejected based on either equation (2.2) (regular algorithm) or equation (2.7) (accelerated algorithm). Fifty replicates of each experiment were performed. Each simulation was run for 100 000 iterations, and the states visited and the types of transitions accepted were stored over the last 2500 iterations. The mean and variance exhibited by the two different acceptance criteria was evaluated by comparing the results from these replicated experiments. The results of these comparisons is given in figure 2, electronic supplementary material, tables S1 and S2.
A similar analysis was carried out on a different fitness landscape were the three states were given fitness values 1.0 (state A), 1.15 (state B) and 1.25 (state C). Data were analysed as before. The results of these experiments are given in electronic supplementary material, figure S3 and tables S1 and S2.
4.2. Simulations of protein evolution
We simulated protein evolution using the all-atom energy function implemented in the Rosetta protein-design suite (version: 2016.08.58479) [29,30]. We implemented both the original and the accelerated algorithm in Python, and we used PyRosetta [50] to mutate the protein structure and to evaluate its energy. We used the Rosetta score as a substitute for ΔG in equation (2.8). Throughout all calculations, we held the protein backbone fixed.
The protein used for all simulations was purple acid phosphatase from rat bone (PDB identifier 1QHW [51]), which is 300 amino acids long. This protein was chosen owing to its comparatively small size and its previous use in other evolutionary studies [35,52]. We first cleaned the structure using the cleanAtom function in PyRosetta [50] to remove extraneous information from the PDB file. Then, before each simulation, we minimized the Rosetta score with the Davidon–Fletcher–Powell gradient minimization method provided through PyRosetta [50] (function dfpmin_armijo_nonmonotone). Starting from the minimized structure, either simulation algorithm then proceeded as follows. At each discrete time step, we randomly chose a site in the structure and replaced the resident amino acid with one of 18 alternative amino acids, also chosen at random. The alteration of cysteine residues present in the initial structure and the addition of cysteines through mutation was not allowed to avoid modifying or adding disulfide bonds. Replacement of the resident amino acid with itself was not allowed, to improve computational efficiency. We then locally repacked the structure 10 Å around the mutation and calculated the Rosetta score. Finally, we used either equation (2.2) (for the original algorithm) or equation (2.7) (for the accelerated algorithm) to determine whether to accept or to reject the mutation.
We set Ne = 100 to keep simulations computationally tractable when using the original algorithm. The parameter β was set equal to 1, and ΔGthresh was set to half of the Rosetta score of the initial structure, which was −534, measured in arbitrary units. To assess the steady state dynamics of the two algorithms, we ran each trajectory in triplicate until we had accumulated 1500 substitutions. To ensure that steady state sampling behaviour was being exhibited we discarded the first 1000 substitutions as burn-in before preforming analysis comparing steady state behaviour. For the remaining 500 substitutions, we recorded the number of substitutions observed at each position in the protein, the probability of accepting a reversion mutation, and the selection coefficients of fixed mutations. To examine if the two algorithms differed in their initial evolutionary behaviour as they approached equilibrium, we carried out identical simulations to the ones described above but recorded only the first 500 substitutions.
The probability of accepting a reversion mutation was calculated using data from simulations run for 1500 substitutions where the first 1000 substitutions were discarded as burn-in. The probability of accepting a reversion mutation was determined by calculating the fitness of the ancestral state at each of the following 15 Markov steps after the acceptance of every new state, and then evaluating the probability of acceptance of the reversion using either model. These data were arranged in a matrix with 15 rows, one for each Markov step and 1500 columns, one for each substitution (500 substitutions for each of the three replicates). Each column in the matrix was normalized such that the probability of any specific substitution being accepted along the following 15 Markov steps summed to 1. The row means were then computed from this column normalized matrix to give the mean probability of accepting a reversion mutation at any specific Markov step.
We carried out additional simulations to test the rescaling of the effective population size of Ne′ = C Ne. First, we used the original sampling to simulate trajectories with 1500 substitutions at population sizes of Ne = 25, 50, 100, each performed in triplicate. We then simulated trajectories with the same Ne using accelerated sampling. Next, for each population size, we chose an appropriate scaling constant C that maximized the p-value in a KS test of the distributions of selection coefficients obtained under original and accelerated sampling. Finally, we simulated trajectories with the corresponding Ne′ values, again performed in triplicate.
4.3. Statistical analysis and data availability
Statistical analysis was performed with R [53]. Data visualization and model fitting were performed with ggplot [54]. All source code and processed data are available at https://github.com/a-teufel/Accelerated_Sim.
Author's contributions
C.O.W. conceived of the project. Experiments were designed by A.I.T. and C.O.W. and performed by A.I.T. The manuscript was written by A.I.T. and C.O.W.
Competing interests
We have no competing interests.
Funding
This work was supported in part by National Institutes of Health grant nos. R01 GM088344 and R01 AI120560, Army Research Office grant no. W911NF-12-1-0390 and NSF Cooperative agreement no. DBI-0939454 (BEACON Center).
Acknowledgements
We thank Austin Meyer and Julian Echave for useful discussions.
Footnotes
Electronic supplementary material is available online at https://doi.org/10.6084/m9.figshare.c.3698896.
- Received November 10, 2016.
- Accepted January 31, 2017.
- © 2017 The Author(s)
Published by the Royal Society. All rights reserved.