You're one in a googol: optimizing genes for protein expression

Mark Welch, Alan Villalobos, Claes Gustafsson, Jeremy Minshull

Abstract

A vast number of different nucleic acid sequences can all be translated by the genetic code into the same amino acid sequence. These sequences are not all equally useful, however; the exact sequence chosen can have profound effects on the expression of the encoded protein. Despite the importance of protein-coding sequences, there has been little systematic study to identify parameters that affect expression. This is probably because protein expression has largely been tackled on an ad hoc basis in many independent projects: once a sequence has been obtained that yields adequate expression for that project, there is little incentive to continue work on the problem. Synthetic biology may now provide the impetus to transform protein expression folklore into design principles, so that DNA sequences may easily be designed to express any protein in any system. In this review, we offer a brief survey of the literature, outline the major challenges in interpreting existing data and constructing robust design algorithms, and propose a way to proceed towards the goal of rational sequence engineering.

1. Introduction

At the heart of biotechnology is our ability to cause a cell to produce a protein it would not normally make. These proteins may be useful in themselves, for example as therapeutics or industrial catalysts. They may enable a cell to produce new compounds or to interact with other cells in a novel way. Whether a protein is a modified version of proteins that are naturally produced by the intended expression host or comes from another kingdom, its sequence must be encoded in a gene that the host cell recognizes as instructions to produce appropriate amounts of the specified amino acid sequence.

Synthetic biologists envision a future in which combinations of well-characterized sequence elements lead to predictable outcomes, enabling the rational design of biological circuits and novel metabolic pathways (Andrianantoandro et al. 2006; Heinemann & Panke 2006; Drubin et al. 2007; Sayut et al. 2007; Tyo et al. 2007). Attempts to characterize and employ sequences that control transcription, mRNA stability and the initiation of translation are underway in many synthetic biology laboratories (Yokobayashi et al. 2002; Sprinzak & Elowitz 2005; Heinemann & Panke 2006; Dasika & Maranas 2008; Michalodimitrakis & Isalan 2009). One aspect of the design process that needs more attention, however, is the treatment of coding sequences. As in biotechnology, these are at the centre of synthetic biology: it is proteins that will catalyse the reactions in a novel metabolic pathway or be the signal transducers or the new biomaterials. There is an implicit assumption that, because we know the genetic code, it will be straightforward to choose a DNA sequence to encode any protein. But we need to think about more than the sequences that will ensure enough mRNA and an adequate rate of translational initiation: the codon choices themselves must not limit expression under the anticipated conditions of use.

Today, researchers can obtain genes by cloning from cDNA libraries or polymerase chain reaction (PCR) amplification from the source organism. Increasingly, they are also turning to direct synthesis of genes whose sequences appear in rapidly expanding sequence databases, but whose physical location is frequently obscure (Venter et al. 2004). When a gene is synthesized, it is generally modified from the natural version. These modifications are made to simplify subsequent manipulations (adding or eliminating restriction sites, for example), but also for a much more significant reason: natural genes are often poorly expressed in heterologous hosts, even when the expression system is related to the organism from which the gene originated.

Several different prokaryotic and eukaryotic systems are now available for heterologous expression, offering flexibility for a variety of protein types and applications. For improved results, these systems can be further manipulated by changing environmental conditions such as temperature or media components, by changing the intracellular environment by altering tRNA levels, and by changing the context and copy number of the gene itself (Hannig & Makrides 1998; Baneyx 1999; Aricescu et al. 2006). Despite so many options, natural genes are still frequently recalcitrant to expression in any host system useful for an intended application. Synthetic biology applications may also be particularly inflexible in their choice of host, especially if the characterization of synthetic parts turns out to be very host specific. Successful expression of target proteins, particularly high-yield expression, may therefore only be achieved by adaptation of the gene sequence to the desired expression system.

Unfortunately, robust rules for designing a gene for heterologous expression are not available, despite much study. There are two main reasons for this: (i) synthetic genes have only been widely and cheaply available for a few years, so systematic, well-controlled studies of the relationship between gene design parameters and expression have not been practical and (ii) protein synthesis is a complex process and probably depends on multiple properties of the gene sequence in addition to host-specific variables and environmental conditions.

Design of an ‘optimal’ gene (which we define in this review as one in which the codon choices do not limit expression) requires a thorough understanding of the interaction of the gene sequence with the expression environment and specification of the desired goal (expression level, solubility, localization of expressed protein, etc.). One does not need to dig deeply into the scientific literature to realize that the relationship between sequence, host and expression properties is complex (figure 1). It is also clear that there are big gaps in the data available to map this relationship. Previous studies have generally focused on one or a few rules applied to design a single modified version of a gene (Gustafsson et al. 2004; Wu et al. 2007a,b). If expression is improved, there is a good chance that the result is described in the published literature; if not, it is probably not reported. As none of these rules has reliably improved expression, the tendency has been to layer more and more rules on top of each other, greatly complicating the gene design task and creating problems of prioritization when applying several conflicting principles. The problem is that the design rules are often only weakly supported by anecdotal data and little is known about their general applicability and relative significance. In many cases, they also have no obvious grounding in biochemical explanations of protein expression. A robust and generally applicable gene design method must instead be based on well-established relationships that are validated by thorough experimentation. The ability to create synthetic gene sets with variations in synonymous gene coding will be essential to elucidate these relationships.

Figure 1

Factors influencing protein expression. Several factors that act along the path of expression from DNA to mRNA to protein are shown, any of which could be altered by or could affect the impact of gene design. RBS, ribosome-binding site.

In this review, we discuss the difficulties with current gene design methods, focusing on the interdependence and conflicts of common design principles. The principles that are currently applied are evaluated and challenges in incorporating them into algorithms for automated design are described. Finally, we propose practical ways to resolve current uncertainties that limit the potential of synthetic genes.

2. The gene design challenge

The standard genetic code encodes the 20 ubiquitous amino acids by 61 nucleotide triplets (codons). An amino acid may be encoded by as few as one or as many as six codons. This redundancy means that a protein can be encoded by many alternative nucleic acid sequences; a 300 amino acid protein of average amino acid composition could be encoded by more than 10^100 different gene sequences. If the codon choice at each position is considered an independent variable, the possibilities would be distributed over an intractable, high-dimensional sequence space. Methodical gene optimization is thus only practical if the governing variables can be dramatically reduced and/or general rules exist to limit the considered possibilities.
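The size of this sequence space can be estimated directly from the degeneracy of the standard genetic code. The sketch below is an illustration only: the function name and the uniform-composition test protein are our assumptions, and the count considers nothing beyond synonymous coding.

```python
import math

# Number of synonymous codons per amino acid in the standard genetic code
# (61 sense codons in total; stop codons are excluded).
DEGENERACY = {
    "A": 4, "R": 6, "N": 2, "D": 2, "C": 2, "Q": 2, "E": 2, "G": 4,
    "H": 2, "I": 3, "L": 6, "K": 2, "M": 1, "F": 2, "P": 4, "S": 6,
    "T": 4, "W": 1, "Y": 2, "V": 4,
}


def log10_synonymous_sequences(protein: str) -> float:
    """Return log10 of the number of distinct coding sequences for `protein`."""
    return sum(math.log10(DEGENERACY[aa]) for aa in protein)


# Hypothetical 300-residue protein with 15 copies of each amino acid
# (an assumption; real proteins have uneven compositions).
uniform_protein = "".join(DEGENERACY) * 15
exponent = log10_synonymous_sequences(uniform_protein)
print(f"about 10^{exponent:.0f} synonymous sequences")
```

Working in logarithms avoids handling the very large integer itself and makes it easy to compare proteins of different length or composition.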

A reasonable body of published work exists in which significant changes in expression are found in genes that have been resynthesized according to simple design rules (Gustafsson et al. 2004; Wu et al. 2007a,b). This is encouraging because it suggests that the optimization problem can be reduced to a manageable number of variables to describe a gene sequence. What has not yet emerged is the identity of the most important sequence variables or their contributions to protein expression. Part of the problem is one of variable definition—we know categorically what is important but not exactly what, how and/or when. For example, codon bias is widely thought to affect expression, but which codons are important, how best they should be biased, whether the same biases are important for all proteins and how this relates to the expression host is much less clear. Compounding this lack of clarity, we do not know how to prioritize or compromise with interdependent variables. For example, consider a protein containing many tyrosine residues that is encoded by following the rule ‘maximize GC content’. The resulting expression levels may then be ascribed to the overall GC content of the gene, since that was the design rule used. However, suppose that the expression levels were primarily influenced by the choice of UAC instead of UAU to encode every tyrosine. A second gene containing no tyrosines encoded by ‘maximizing GC content’ would be completely unaffected by the choice of tyrosine codons, so that the rule may result in quite different expression properties in the second case. In §3, the case for and limitations of various gene design variables previously associated with expression level are discussed.
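The tyrosine example can be made concrete with a toy encoder. In the sketch below, the three-amino-acid codon table, the function names and the two test peptides are illustrative assumptions; codons are written in the DNA alphabet, so UAC and UAU appear as TAC and TAT.

```python
# Toy subset of the standard genetic code (DNA alphabet). Only three amino
# acids are included to keep the illustration small.
SYNONYMS = {
    "Y": ["TAT", "TAC"],
    "K": ["AAA", "AAG"],
    "E": ["GAA", "GAG"],
}


def gc_fraction(seq: str) -> float:
    """Fraction of G and C bases in a sequence."""
    return sum(base in "GC" for base in seq) / len(seq)


def max_gc_codon(aa: str) -> str:
    """Pick the synonymous codon with the highest GC fraction."""
    return max(SYNONYMS[aa], key=gc_fraction)


def encode_max_gc(protein: str) -> str:
    """Encode a peptide under the rule 'maximize GC content'."""
    return "".join(max_gc_codon(aa) for aa in protein)


with_tyr = encode_max_gc("YKYEY")       # hypothetical tyrosine-rich peptide
without_tyr = encode_max_gc("KEKEK")    # hypothetical peptide without tyrosine
print(with_tyr, gc_fraction(with_tyr))
print(without_tyr, gc_fraction(without_tyr))
```

Under this rule every tyrosine is necessarily encoded by TAC, so in the first gene the variables "GC content" and "tyrosine codon choice" cannot be separated; the second gene is built by the same rule but carries no information about tyrosine codons at all.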

3. Gene design principles

3.1 Origins of codon usage bias

Grantham et al. (1981) identified biases in the codons that were used to encode amino acids in the 161 full or partial mRNA sequences present in the nucleic acid sequence database at that time. These biases differed depending on which organism the gene came from. The authors speculated that these differences in codon bias may play a role in protein expression levels, and proposed a biophysical basis for this. We now have a much more detailed understanding of how gene expression is regulated including synthesis, processing and degradation of mRNA and initiation of translation. The role of natural codon biases, however, is still very poorly understood.

Why some organisms show marked bias and why organisms often differ dramatically has been the subject of much speculation (Holm 1986; Eyre-Walker & Bulmer 1993, 1995; Eyre-Walker 1996; Akashi 2001; Knight et al. 2001; Akashi & Gojobori 2002; Rocha 2004; Marquez et al. 2005; Suzuki et al. 2008; Yang & Nielsen 2008). Codon bias may serve to make the translational process more efficient. Biases can reduce the diversity of isoacceptor tRNAs required, perhaps reducing the metabolic load (Rocha 2004). This may particularly be beneficial to organisms that spend much of their life cycle in rapid growth. Several other constraints, not directly related to expression yield, are also likely to influence codon bias. These include altering the likelihood and directionality of amino acid substitutions (mutational bias) and selection for GC content (Eyre-Walker & Bulmer 1995; Eyre-Walker 1996; Knight et al. 2001; Marquez et al. 2005; Antezana & Jordan 2008; Yang & Nielsen 2008).

Whatever the evolutionary factors contributing to codon usage biases, the relevance of natural biases to designing genes for heterologous expression is not clear. Weak correlations between codon bias and expression of individual intragenomic genes have been observed in yeast and bacteria, but these genes are very rarely expressed at the high levels that are often desirable in biotechnological applications (more than 10 or 20% of cell protein). Furthermore, the cellular protein expression machinery has several means to control expression other than by control of translational elongation step rates. Indeed, in thorough studies where protein expression has been normalized relative to mRNA levels, correlations between expression and codon bias have all but disappeared (Gouy & Gautier 1982; Sharp & Li 1987; dos Reis et al. 2003; Jansen et al. 2003; Friberg et al. 2004; Lu et al. 2007; Wu et al. 2007a,b). Nevertheless, genes that are designed using different codon biases often do have significantly different expression properties, suggesting that bias or covariant gene variables are important (Gustafsson et al. 2004).

3.2 Biochemistry of codon usage bias

Synonymous codon choice may influence heterologous expression yield by limiting the translational elongation rate. For each codon along a message, the translational elongation step rate is probably primarily determined by the concentrations of cognate and competing EF-Tu·aa-tRNA ternary complexes in the cell and rate constants for complex selection at the ribosomal A-site (Varenne et al. 1984; Curran & Yarus 1989; Gromadski & Rodnina 2004; Wintermeyer et al. 2004; Rodnina et al. 2005). There is also evidence that the rate of tRNA selection at the A-site may be significantly influenced by the tRNA and codon occupying the ribosomal P-site, causing local context effects (Yarus & Folley 1985; Gouy 1987; Folley & Yarus 1989; Gutman & Hatfield 1989; Irwin et al. 1995; Boycheva et al. 2003; Moura et al. 2005; Buchan et al. 2006). Finally, and of particular relevance when high levels of heterologous protein are expressed, the concentrations of free amino acids and charged tRNA in the cell could change significantly, altering the relative translation step rates for different codons (Dong et al. 1995; Elf et al. 2003; Dittmar et al. 2005; Elf & Ehrenberg 2005a,b).

A detailed mechanistic model of expression incorporating accurate rates for all steps in the synthesis pathway and context dependence is not imminent even for Escherichia coli and even further off for many other potentially useful hosts. Even without complete understanding, however, the biochemical principles of expression can inspire some reasonable guesses about design criteria. In E. coli, there is some correlation between codon usage frequency and observed cognate tRNA level, which is more pronounced at higher growth rates (Ikemura 1981; Bulmer 1987; Dong et al. 1996). Translation of a gene containing many codons that are rarely used in the host organism will therefore generally use cognate tRNAs that are present at low levels in the cell in a large number of steps in translational elongation. This would be expected to impair expression, an effect that is indeed observed (Chen & Inouye 1990; Kane 1995; Cruz-Vera et al. 2004). Also consistent with this mechanistic explanation, E. coli strains expressing boosted levels of such tRNAs from plasmid-borne genes can in some cases support increased expression levels of genes containing rare codons (Kane 1995; Burgess-Brown et al. 2008).

Although rare codons may often be translated at lower rates, the relationship between their use and expression yield is not a simple one. In fact, in some cases, the inclusion of rare codons may even improve yield perhaps by controlling the ribosomal traffic along the translated message or by introducing translational pauses at strategic positions, such as domain boundaries, to help promote proper protein folding (Angov et al. 2008; Tsai et al. 2008). Also, position and sequence context of the rare codons can significantly affect their impact. In certain contexts, rare codons have been shown to increase translational errors (Del Tito et al. 1995; Kane 1995; Kurland & Gallant 1996; You et al. 1999; Kerrigan et al. 2008). In particular, consecutive rare codons within the first codons of a message may be especially deleterious, whereas in some cases rare codons may be distributed downstream of the initial coding sequence with little effect (Varenne & Lazdunski 1986; Chen & Inouye 1990, 1994).

How to abstract gene design principles from non-rare codon biases in an expression host's genome is even less clear. One idea that is commonly cited, despite the gradual evaporation of experimental support, is that higher levels of expression can be obtained by maximizing high-frequency codons within a gene. This idea is an extension of Grantham's work (Grantham et al. 1981), coupled with the observation that some highly expressed genes in E. coli and Saccharomyces cerevisiae (predominantly ribosomal proteins) are more biased in codon usage than the average for the genome (Sharp & Li 1987). In this line of reasoning, the codon that is used most frequently in these highly expressed genes is considered an ‘ideal’ codon. How closely a gene conforms to this ideal can then be quantified as the codon adaptation index (CAI; Sharp & Li 1987); a gene of maximal CAI equal to 1 is one that uses only the most frequent codon in the high expressor subset to encode each amino acid. Although the CAI of a gene has often been cited as a predictor of the expression level of a protein, there is no demonstrated causality. In E. coli and S. cerevisiae, where protein and mRNA levels have both been measured, there is no meaningful correlation between CAI and protein yield per mRNA transcript, suggesting that CAI is not a measure of translational efficiency (Friberg et al. 2004; Lu et al. 2007).
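For readers unfamiliar with the index, the CAI is the geometric mean, over the codons of a gene, of each codon's frequency relative to the most used synonymous codon in the reference set. The sketch below follows that definition; the reference counts are invented for illustration, only two amino acids are tabulated, and the function names are our own.

```python
import math

# Hypothetical codon counts from a "highly expressed" reference set.
# Only two amino acids are shown; a real table covers every degenerate one.
REFERENCE_COUNTS = {
    "E": {"GAA": 70, "GAG": 30},
    "F": {"TTC": 60, "TTT": 40},
}


def relative_adaptiveness(counts):
    """w(codon) = codon count / count of the most used synonymous codon."""
    weights = {}
    for codon_counts in counts.values():
        top = max(codon_counts.values())
        for codon, n in codon_counts.items():
            weights[codon] = n / top
    return weights


def cai(gene: str, weights) -> float:
    """Geometric mean of w over the codons of `gene` that appear in `weights`."""
    codons = [gene[i:i + 3] for i in range(0, len(gene), 3)]
    logs = [math.log(weights[c]) for c in codons if c in weights]
    if not logs:
        raise ValueError("gene contains no codons from the reference table")
    return math.exp(sum(logs) / len(logs))


WEIGHTS = relative_adaptiveness(REFERENCE_COUNTS)
print(cai("GAATTC", WEIGHTS))  # only the most frequent codons
print(cai("GAGTTT", WEIGHTS))  # only the less frequent codons
```

The calculation makes plain that the CAI measures similarity to a reference codon usage and nothing else; whether that similarity has any causal bearing on expression is a separate, empirical question.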

From a purely biochemical perspective, simply maximizing the CAI of a gene might be problematic, especially for applications where the target protein is to be expressed at much higher levels than any single natural protein. While it would favour the use of tRNAs present at higher levels in a non-expressing cell, limiting the used tRNA pool to just one or two isoacceptors per amino acid could limit the maximal synthesis flux and increase translational errors (Kane 1995; Kurland & Gallant 1996). A balance between tRNAs used that maximizes the availability of aa-tRNA for production while maintaining a non-limiting elongation rate is probably preferable.

An additional complication is that different proteins with different amino acid compositions will stress the translational process differently (Kane 1995). For example, the optimal balance of serine codons in a gene encoding a protein containing only a few serine residues may be quite different from the optimal balance of serine codons where the protein is 20 per cent serine. In the latter case, the rates at which serine tRNAs are recharged may have a significant impact on translation rate. Likewise, the effect of codon bias may depend on the expression level itself. The prevalence of serine in a protein, for example, may not matter if the protein is produced at a low level owing to other expression limiting factors, such as low mRNA level or slow translational initiation. In such a case, the stress on serine tRNAs would be low, independent of the protein serine content.

Many examples exist where changes in synonymous codon usage have a dramatic effect on the yield of heterologously expressed protein, but drawing conclusions about optimal codon biases from these data is very difficult (Gustafsson et al. 2004). Published examples differ widely in many respects including the expression host, regulatory elements associated with the gene and, most importantly, the protein being expressed. Furthermore, the sample sets in such examples are generally small, usually describing only two genes, one natural and one whose codon bias has been altered. Now that genes can be synthesized quickly and cheaply, it should be possible to construct sets that are diversified systematically to experimentally test the effects of changes in bias of codons for each amino acid, the occurrence of codon pairs, GC%, the use of rare codons and other potentially important determinants of expression level. The effects of other factors on expression can also be tested, including those described in the following sections.

3.3 Codon bias at the start of the open reading frame

Numerous lines of evidence suggest that the initial 15–25 codons of the open reading frame deserve special consideration in gene optimization (Eyre-Walker & Bulmer 1993; Chen & Inouye 1994; Stenström et al. 2001a,b; Stenström & Isaksson 2002; Gonzalez de Valdivia & Isaksson 2004, 2005). Natural E. coli genes show a distinct bias in codon usage for the initial 25 codons compared with the overall genomic bias (Eyre-Walker & Bulmer 1993; Chen & Inouye 1994; Stenström et al. 2001b). In fact, rare codons are enriched in this initial leader for reasons that are not clear. Studies have shown that the impact of rare codons on translation rate is particularly strong in these first codons, especially within the first six triplets (Chen & Inouye 1990, 1994).

Ribosomes in the initial phase of elongation appear to be particularly prone to abortive termination, perhaps owing to an increased rate of peptidyl-tRNA drop-off (Gonzalez de Valdivia & Isaksson 2004, 2005). Early rare and NGG codons may accelerate premature termination by stalling elongation (Gonzalez de Valdivia & Isaksson 2005). These codon effects appear to be independent of alterations in mRNA secondary structure that might also stall early elongation or prevent initiation. As translational initiation depends on the rates of both ribosome binding and clearing of the ribosome-binding site (RBS) after initial elongation (approx. 13–20 codons), slow translation through the initial leader may reduce or eliminate any benefits of a strong RBS sequence.

3.4 mRNA structure

Gene design strategies often seek to minimize mRNA structure. Structures that involve or otherwise occlude the RBS and/or start codon in genes expressed in prokaryotes can impair expression, presumably by interfering with ribosomal binding and translational initiation (Kozak 1986; de Smit & van Duin 1990, 1994; Griswold et al. 2003; Studer & Joseph 2006). For this reason, gene design strategies often consider such structure in coding of the first several amino acids. Voigt and co-workers have recently developed an algorithm for designing prokaryotic RBSs to achieve desired rates for initiation of translation considering the structure of the mRNA and the affinity of the RBS for the ribosome (http://www.voigtlab.ucsf.edu/software/).

As with codon bias, considerations of the effects of mRNA structure within the open reading frame are not straightforward. While some RNA structures, particularly pseudoknots, have been shown to cause translational pauses (Kontos et al. 2001; Hansen et al. 2007; Wen et al. 2008), a clear relationship of RNA structure strength, type and distribution to translation rate is lacking. Ribosomes possess an intrinsic helicase activity that allows translation through even very strong hairpins and may preclude many structures from limiting the translation rate (Takyar et al. 2005). An actively translated message can be densely packed with ribosomes, unwinding structure as they move along. For this reason and others, structures predicted by RNA folding algorithms may not reliably represent actual mRNA structures in vivo (Meyer & Miklos 2004, 2007). Relevant structures may be those restricted to windows along the mRNA where structure could form between ribosomes. The lengths and lifetimes of such windows would be dependent on translational kinetics and would probably vary significantly along the message. These many layers of uncertainty greatly obscure a rational approach to general mRNA structure optimization. As structure minimization strategies can greatly influence other gene parameters, such as codon bias, it is critical that systematic analysis of the benefits of various mRNA structure treatments be performed.

3.5 Gene design and protein structure

Although much of the foregoing discussion has implicitly assumed that maximizing the rate of translational elongation is unequivocally desirable, this is not entirely accurate. Often the expressed protein must be properly folded to be useful. There have been several recent reports describing the effects of synonymous codon changes on protein folding (Thanaraj & Argos 1996; Angov et al. 2008; Tsai et al. 2008). It has been suggested that too rapid translation may not allow for efficient ‘self’ or chaperone-aided folding and that strategically placed slower codons or codon runs, perhaps at protein domain boundaries, could maximize folding efficiency while maintaining a high overall translation rate (Angov et al. 2008). Unfortunately, there are even fewer data from which to derive rules for such designs than there are for understanding codon bias. Developing rules for designing genes to express soluble active protein should be facilitated by synthesis and testing of varied sets of genes, as described above for penetrating the mysteries of codon bias and other gene variables.

3.6 Potentially deleterious motifs

Depending on the host expression system, there are a number of sequence motifs to be avoided in gene design. These comprise an expanding list of sequence element classes that could have negative effects on expression of a target protein. For example, in an E. coli system expressing a gene under control of a T7 promoter, one would wish to avoid both class I and II transcriptional termination sites. Shine–Dalgarno-like sequences within the coding sequence may cause incorrect downstream initiation or translational pauses in prokaryotic hosts. In eukaryotic hosts, potential splice signals, polyadenylation signals and other motifs affecting mRNA processing and stability are generally to be avoided. Other classes of deleterious motifs include sequences that promote ribosomal frameshifts and pauses (Kurland & Gallant 1996; Kontos et al. 2001; Hansen et al. 2007). For many of these motifs, polyadenylation sites in particular, the relationship of sequence and impact on expression is not yet well understood. With further work, we expect that the list of toxic and regulatory motifs will grow, but also that rules for avoiding them in gene design will be better defined.

4. Integrating principles into design algorithms

Several algorithms have been developed which allow researchers to manipulate various gene design parameters (Grote et al. 2005; Jayaraj et al. 2005; Villalobos et al. 2006; Ferro et al. 2007; Wu et al. 2007a,b). Ideally, an algorithm should be based on an accurate predictive model of the relationship between design parameters and expression yield. To develop such a model, it is critical to first identify the sufficient subset of predictive design variables for explaining expression. There is good reason to hope that careful experimentation will allow reasonable quantification of the effects of codon bias, mRNA structure and other factors on heterologous expression in various expression systems.

Prioritization of the expression-determining variables is also necessary to create a robust design algorithm. It may not be sufficient or practical to simply apply standard criteria independently to a number of design parameters, as the parameters themselves may not be fully independent of each other. Avoidance of possible deleterious motifs, particularly those that are ambiguously defined or otherwise common, can conflict with codon usage and other design parameters. Common design requirements such as the removal of restriction sites, avoidance of dam or dcm methylation sites overlapping with restriction sites or elimination of extended coding sequences in other reading frames also constrain codon choices.

Optimization of multiple constraints based on anecdotal information and accepted but often unsubstantiated lore is particularly problematic. This ‘system voodoo’ can so significantly limit the available DNA sequences as to actually preclude adequate expression! It is impossible to overstate the value of experimental support for assigning importance to the impact of sequence variables on expression.

4.1 Managing constraints

Irrespective of the specific variables, gene design will always involve several, often conflicting, types of sequence constraints. Meeting these various constraints simultaneously necessitates development of sophisticated algorithms. The most useful algorithms would allow flexibility in prioritization of constraints, as appropriate for different applications and design goals. In some cases, compromises may be acceptable for some of the parameters, for example minimizing repetitive sequences instead of eliminating them completely. In other cases, the algorithm might have the choice of meeting at least one of a set of constraints. For example, the gene design may require that either EcoRI or HindIII sites not be present in the resulting DNA sequence.
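One possible way to represent such requirements is to separate sites that must all be absent from groups in which at least one member must be absent. The sketch below is a minimal illustration under that assumption; the class and function names are hypothetical. EcoRI (GAATTC) and HindIII (AAGCTT) are palindromic, so scanning one strand is enough here, whereas a non-palindromic site would also need its reverse complement checked.

```python
from dataclasses import dataclass, field

ECORI = "GAATTC"
HINDIII = "AAGCTT"


@dataclass
class DesignSpec:
    # Every site listed here must be absent from the design.
    forbidden: list = field(default_factory=list)
    # Each entry is a group of sites; at least one site per group must be absent.
    any_absent: list = field(default_factory=list)


def meets_hard_constraints(seq: str, spec: DesignSpec) -> bool:
    """Check a DNA sequence against the 'all absent' and 'any absent' rules."""
    if any(site in seq for site in spec.forbidden):
        return False
    return all(any(site not in seq for site in group)
               for group in spec.any_absent)


# "Either EcoRI or HindIII sites must not be present."
spec = DesignSpec(any_absent=[[ECORI, HINDIII]])
print(meets_hard_constraints("ATGGAATTCAAA", spec))        # EcoRI only
print(meets_hard_constraints("ATGGAATTCAAGCTTAAA", spec))  # both sites
```

Softer requirements, such as minimizing rather than eliminating repeats, would be better expressed as penalty terms in a score than as pass or fail checks of this kind.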

Another important criterion is algorithm efficiency or run time. For a complex set of design constraints, optimization can be computationally intensive, especially for large genes. In some cases, running the algorithm for a few days or weeks might be acceptable, as long as all the goals are met. In other cases, the algorithm might need to be executed quickly for a large library of genes, so that post-optimization analysis may be performed and new goals defined. Finally, there will often be cases where the optimization problem is so difficult, owing to particulars of the amino acid sequence to be encoded and/or the combination of constraints applied, that design goals cannot be reached within a reasonable time and an exhaustive search of all parameter space is unrealistic. Thus, a practical algorithm must employ some kind of heuristic, perhaps Monte Carlo random walks, simulated annealing or a genetic algorithm, to efficiently search parameter space.

Typical gene design parameters vary in nature and present different problems in optimization. Particularly challenging are parameters of a distributed nature such as codon bias or repeats in the DNA sequence, which are necessarily interdependent with other parameters. Codon bias for at least one amino acid will change whenever a new codon is chosen anywhere in the sequence to modify any other design parameter. Changing a codon to eliminate repetition of one sequence element within the gene may introduce other, different repeated elements. Design algorithms must use optimization methods that are suited to the nature of the parameters involved.
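Repeats illustrate why distributed parameters must be rescored over the whole sequence after every change. The sketch below finds direct repeats by indexing all k-mers; the function name and example sequences are our assumptions, and the default k = 8 is chosen to correspond to repeats "greater than seven nucleotides".

```python
from collections import defaultdict


def direct_repeats(seq: str, k: int = 8):
    """Map each k-mer that occurs more than once to its start positions."""
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)
    return {kmer: pos for kmer, pos in positions.items() if len(pos) > 1}


# Hypothetical five-codon segment encoding E-F-E-F-E.
before = "GAGTTCGAGTTCGAG"
# Same peptide with the third and fourth codons changed (GAG->GAA, TTC->TTT).
after = "GAGTTCGAATTTGAG"

print(direct_repeats(before))
print(direct_repeats(after))
```

In this example two synonymous substitutions remove the repeats, but the same substitutions also shift the codon usage for both E and F, which is exactly the kind of coupling between parameters described above.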

4.2 Choosing optimization methods

For any optimization method, the starting point can be very important. Ideally, it should be chosen as close as possible to the expected optimum and in a way that is not deterministic. That way, if the starting point proves unacceptable, the algorithm can be restarted with a new starting point. One way to select a starting point that is non-deterministic and focuses on search space near a typical optimization goal is to select a codon for each amino acid, based on the probability of that codon occurring as given by a target codon bias table.
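A starting point of this kind can be drawn by weighted random sampling. The sketch below is illustrative: the bias table holds invented frequencies for four amino acids only, and the function names are hypothetical.

```python
import random

# Hypothetical target codon bias table (fractions sum to 1 per amino acid).
TARGET_BIAS = {
    "M": {"ATG": 1.0},
    "E": {"GAA": 0.7, "GAG": 0.3},
    "F": {"TTC": 0.6, "TTT": 0.4},
    "K": {"AAA": 0.75, "AAG": 0.25},
}


def sample_start(protein: str, bias, rng: random.Random) -> str:
    """Draw one codon per residue with probability given by the bias table."""
    codons = []
    for aa in protein:
        options = list(bias[aa])
        weights = [bias[aa][codon] for codon in options]
        codons.append(rng.choices(options, weights=weights, k=1)[0])
    return "".join(codons)


def translate(seq: str, bias) -> str:
    """Back-translate using the same table, to confirm the peptide is unchanged."""
    codon_to_aa = {codon: aa for aa, table in bias.items() for codon in table}
    return "".join(codon_to_aa[seq[i:i + 3]] for i in range(0, len(seq), 3))


# Different seeds give different, equally valid starting sequences.
for seed in range(3):
    print(sample_start("MEFKEF", TARGET_BIAS, random.Random(seed)))
```

Passing an explicit random number generator keeps each run reproducible while still allowing a restart from a fresh starting point simply by changing the seed.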

As is common with multidimensional optimization problems, there are many ways of navigating the search space and each has its benefits depending on the optimization requirements (figure 2). The search hierarchy must accommodate the interdependencies and priorities of the constraints. Otherwise, conflicts might cause the algorithm to get cornered in a local optimum and not reach the design goals.

Figure 2

Choosing an appropriate design algorithm. A simple example is shown of how two different algorithms for the same optimization problem are affected by sequence constraints. The coding sequence encodes five peptide segments of a protein, which may or may not be contiguous. The initial starting sequence is one possibility, chosen to match the target codon bias of the gene. The optimization constraints for both algorithms are that (i) no EcoRI site is allowed, (ii) codon usage ratios for E (GAG/GAA) and F (TTC/TTT) must be equal to 1, and (iii) direct sequence repeats greater than seven nucleotides should be minimized. Iterations involve single codon replacements and a greedy search is followed. Thus, replacements are allowed only if improvement is achieved. At each step, no worsening of previously applied constraints is allowed. The algorithm in (a) begins by minimizing repeat elements and then tries to remove EcoRI sites without increasing the number of repeats. Since either possible substitution to remove the EcoRI site will add new repeats, no change is allowed and the algorithm fails to reach its goals. In (b), because the hard constraint of restriction site removal is applied first, the algorithm has two routes (red versus blue arrows) to successfully reach the goals.

If there is only one constraint (codon bias, for example) or multiple constraints with non-conflicting goals, then a ‘greedy’ algorithm may suffice to rapidly find an optimum. At each iteration of such an algorithm, the sequence is scored based on the optimization parameters. If not all of the goals have been met, a random codon position is changed and the resulting sequence is scored. If the new sequence is improved, it becomes the starting point for the next iteration. Otherwise, the previous sequence is kept and changed randomly again in the next iteration. This continues until all goals are met or a local optimum is reached, where no further single change improves the score.
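The greedy loop described above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than a reference implementation: the single constraint is the removal of EcoRI sites (GAATTC), the score is a penalty to be minimized, and the names `SYNONYMS`, `score` and `greedy_optimize` are hypothetical.

```python
import random

# Hypothetical synonymous-codon table covering only the residues in this demo.
SYNONYMS = {"E": ["GAA", "GAG"], "F": ["TTT", "TTC"], "K": ["AAA", "AAG"]}
CODON_TO_AA = {c: aa for aa, cs in SYNONYMS.items() for c in cs}

def translate(codons):
    return "".join(CODON_TO_AA[c] for c in codons)

def score(codons):
    """Penalty to minimize: number of EcoRI sites (GAATTC) in the sequence."""
    return "".join(codons).count("GAATTC")

def greedy_optimize(codons, score_fn, rng, max_iter=1000):
    current = list(codons)
    current_score = score_fn(current)
    for _ in range(max_iter):
        if current_score == 0:              # all goals met
            break
        pos = rng.randrange(len(current))   # pick a random codon position
        candidate = list(current)
        candidate[pos] = rng.choice(SYNONYMS[CODON_TO_AA[current[pos]]])
        cand_score = score_fn(candidate)
        if cand_score < current_score:      # keep only strict improvements
            current, current_score = candidate, cand_score
    return current, current_score

# K E F E F, with an EcoRI site spanning each GAA-TTC codon pair.
start = ["AAA", "GAA", "TTC", "GAA", "TTC"]
best, best_score = greedy_optimize(start, score, random.Random(1))
```

The iteration cap `max_iter` stands in for the "minimum is reached" stopping condition: a greedy search has no cheap way to prove that no improving move remains, so a fixed budget is a common practical substitute.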

If the constraints are interdependent, as most proposed gene optimization parameters are, a greedy algorithm may be prone to getting trapped in undesired local optima, as it will always work towards the optimum closest to the starting point. One way of getting around this problem is to apply simulated annealing (Kirkpatrick et al. 1983; Rodrigo et al. 2007; Rocha et al. 2008). In this method, worse-scoring sequences can also be selected as the next current state, with a probability governed by a ‘temperature’ parameter. As the iterations step forward, the temperature is dropped progressively, decreasing the probability of accepting a sequence that scores lower than its predecessor. This ‘cooling’ results in an algorithm that initially samples a broad region of search space and then slowly becomes greedier in its heuristic, eventually becoming a simple greedy algorithm once the temperature reaches zero. Generally, the ‘best state’ observed during the iterations is taken as the final result.
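A minimal Python sketch of simulated annealing for codon choice follows. Several details are assumptions made for illustration and are not specified in the text: the Metropolis acceptance rule exp(-delta/T), the geometric cooling schedule, the weighting of the `penalty` function (EcoRI sites weighted ten times more heavily than codon-ratio imbalance), and all function and parameter names.

```python
import math
import random

SYNONYMS = {"E": ["GAA", "GAG"], "F": ["TTT", "TTC"]}
CODON_TO_AA = {c: aa for aa, cs in SYNONYMS.items() for c in cs}

def penalty(codons):
    """Lower is better: EcoRI sites (weighted 10x) plus codon-ratio imbalance."""
    sites = "".join(codons).count("GAATTC")
    imbalance = (abs(codons.count("GAA") - codons.count("GAG"))
                 + abs(codons.count("TTT") - codons.count("TTC")))
    return 10 * sites + imbalance

def anneal(codons, rng, t_start=5.0, cooling=0.99, steps=2000):
    current = list(codons)
    current_pen = penalty(current)
    best, best_pen = list(current), current_pen
    temp = t_start
    for _ in range(steps):
        pos = rng.randrange(len(current))
        candidate = list(current)
        candidate[pos] = rng.choice(SYNONYMS[CODON_TO_AA[current[pos]]])
        delta = penalty(candidate) - current_pen
        # Assumed Metropolis rule: always accept moves that do not worsen the
        # penalty; accept worse moves with probability exp(-delta / temp).
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            current, current_pen = candidate, current_pen + delta
            if current_pen < best_pen:      # remember the best state seen
                best, best_pen = list(current), current_pen
        temp *= cooling                     # geometric cooling schedule
    return best, best_pen

start = ["GAA", "TTC"] * 4                  # E F E F E F E F, four EcoRI sites
best, best_pen = anneal(start, random.Random(0))
```

Here the temperature approaches zero without ever reaching it, so the late iterations behave almost exactly like the greedy search while avoiding a division by zero.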

Another method for avoiding local optima, particularly as parameter space becomes large and interdependency is high, is to simultaneously follow multiple search paths and choose those that perform best. In a ‘genetic algorithm’, for example, a population of different current states is maintained (Mitchell 1998; Patil et al. 2005; Rocha et al. 2008). With each generation, the best individual sequences are selected as parents for the next generation. These are randomly mutated and recombined. The best of the resulting progeny is then selected and iterations continue until convergence in performance of the population is reached. The combination of multiple starting points and diversification through mutation and recombination efficiently searches a large expanse of sequence space, avoids single suboptimal solutions and is more likely to find a true optimum for complex multivariate problems.
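The population-based search can be sketched as follows. This is a simplified illustration, not the method of any cited work: the `penalty` function, the single-point crossover at codon boundaries, the mutation rate, the decision to carry parents over unchanged into the next generation, and the use of a fixed generation limit in place of a convergence test are all assumptions, as are the names used.

```python
import random

SYNONYMS = {"E": ["GAA", "GAG"], "F": ["TTT", "TTC"]}
CODON_TO_AA = {c: aa for aa, cs in SYNONYMS.items() for c in cs}

def penalty(codons):
    """Lower is better: EcoRI sites (weighted 10x) plus codon-ratio imbalance."""
    sites = "".join(codons).count("GAATTC")
    imbalance = (abs(codons.count("GAA") - codons.count("GAG"))
                 + abs(codons.count("TTT") - codons.count("TTC")))
    return 10 * sites + imbalance

def random_design(protein, rng):
    return [rng.choice(SYNONYMS[aa]) for aa in protein]

def mutate(codons, rng, rate=0.1):
    """Swap each codon for a random synonym with probability `rate`."""
    return [rng.choice(SYNONYMS[CODON_TO_AA[c]]) if rng.random() < rate else c
            for c in codons]

def crossover(a, b, rng):
    """Single-point recombination at a codon boundary; the protein is unchanged."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]

def evolve(protein, rng, pop_size=30, n_parents=10, generations=50):
    population = [random_design(protein, rng) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=penalty)
        if penalty(population[0]) == 0:     # all goals met
            break
        parents = population[:n_parents]    # select the best individuals
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
            for _ in range(pop_size - n_parents)
        ]
        population = parents + children     # parents survive unchanged
    return min(population, key=penalty)

best = evolve("EFEFEFEF", random.Random(2))
```

Because every individual encodes the same protein, recombination between any two parents at a codon boundary always yields a valid synonymous design.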

4.3 Algorithms for exploring parameter space

Creating an optimization algorithm to find sequences that meet multiple interdependent constraints is only half the battle. The functionality of sequences designed by these algorithms will generally be limited by how well the constraints that the algorithm imposes match parameters that actually affect expression. Robust optimization algorithms will therefore require data and development of valid models describing the design–expression relationship. One way of approaching this problem is to coevolve the design algorithms together with these models. Initial algorithms can be used to independently vary parameters thought to be important, with experimental measurements and data modelling allowing these hypotheses to actually be tested. As more data are gathered, unimportant parameters will be discarded, new parameters may be added and remaining parameters will be reprioritized in the model. Thus, the optimization algorithm and our understanding of how to design genes for protein expression will be refined together.

5. Future prospects

Rapid expansion of sequence databases and development of gene synthesis technologies have greatly increased the repertoire of protein sequences to which biological researchers have access. Natural, derivative or novel sequences of interest may be directly obtained by researchers with minimal expertise in molecular biology. Although the rules for deciphering a DNA sequence to determine the amino acid sequence of the encoded protein were established over 40 years ago, the rules for designing DNA sequences to express an encoded protein are still not well understood. Fortunately, the methods for determining such rules are very familiar to both scientific and engineering traditions, merged in the field of synthetic biology. Reliable criteria for designing expressible genes will help to enable synthetic systems, where a gene encoding any protein may be slotted between reusable control elements, combined into new biosynthetic pathways or biological circuits without having to suffer through extensive trial and error just to get the gene to express.

Footnotes

  • One contribution to a Theme Supplement ‘Synthetic biology: history, challenges and prospects’.

  • Received December 9, 2008.
  • Accepted February 3, 2009.

References
