Challenges in the computational design of proteins

María Suárez, Alfonso Jaramillo

Abstract

Protein design has many applications not only in biotechnology but also in basic science. It uses our current knowledge in structural biology to predict, by computer simulations, an amino acid sequence that would produce a protein with targeted properties. As in other examples of synthetic biology, this approach allows the testing of many hypotheses in biology. The recent development of automated computational methods to design proteins has enabled proteins to be designed that are very different from any known ones. Moreover, some of those methods mostly rely on a physical description of atomic interactions, which allows the designed sequences not to be biased towards known proteins. In this paper, we will describe the use of energy functions in computational protein design, the use of atomic models to evaluate the free energy in the unfolded and folded states, the exploration and optimization of amino acid sequences, the problem of negative design and the design of biomolecular function. We will also consider its use together with the experimental techniques such as directed evolution. We will end by discussing the challenges ahead in computational protein design and some of their future applications.

1. Introduction

Protein folding is the physical process leading from an unstructured polypeptide chain to a functional protein with a definite structure. We expect that molecular dynamics (MD) techniques will be able to predict the structure of a protein from its sequence. On the other hand, we could ask ourselves the inverse question, also known as the inverse folding problem (Bowie et al. 1991): what are all the amino acid sequences that would fold into a given three-dimensional structure? Protein design aims to design proteins with some desired characteristics and it may proceed by trial and error or using more sophisticated methods, such as directed evolution or applying the inverse folding. Akin to the forward problem, the inverse problem in protein design could be approached by using computational techniques. In this paper, we will review the various challenges ahead posed by this problem whenever designing proteins with targeted functions.

In this review, we will analyse the progress in computational protein design which has been achieved by facing the inverse folding problem. Given the intimate relationship between a protein's structure and function, a way to design proteins with targeted properties is to start from a desired structure and find sequences able to fold into it, imposing additional constraints in the process. This problem has no unique answer, as similar sequences usually share the same structure. In addition, there are many cases of nearly identical structures sharing no sequence similarity (Brenner & Levitt 2000; Devos & Valencia 2000; Koppensteiner et al. 2000; Tian & Skolnick 2003). Nevertheless, in computational protein design, the goal is not to find all sequences folding in the given structure but just one with the required properties. Figure 1 illustrates the difference between the direct and inverse folding problems and its relationship with computational protein design.

Figure 1

(a) The folding problem. Given an amino acid sequence, the goal is to predict the final structure it will adopt after the folding process. (b) The inverse folding problem. Among the astronomically large number of possible sequences, we find those stabilizing a given fold. The sequences are ranked, for the chosen fold, by evaluating their folding free energy and only the ones with the lowest energy are kept. The goal in computational protein design is to find at least one sequence able to adopt the chosen structure.

Some simplifications can be introduced to make the problem tractable. We assume that the folding occurs without intermediates towards the target structure and that this folded structure has the lowest free energy among competing structures. When we search for the amino acid sequence that minimizes the folding free energy towards the desired structure, the folding free energy associated with competing structures will not be optimized, which means that, most probably, it will be much larger. Note that any possible folding intermediate is also a competing structure. The usual way to solve the inverse folding problem is by using computational optimization techniques that use the folding free energy as the score function. Here we choose a protein of known structure as the target, as this guarantees the designability of the fold and the existence of a folding pathway, although this later depends on the amino acid sequence. In fact, Zhu et al. (2004) designed one of the fastest folding proteins known (folding times 1–50 μs).

Computational protein design has attained significant breakthroughs, with profound biotechnological and biomedical implications: design of new biocatalysts (Bolon & Mayo 2001a; Kaplan & DeGrado 2004; Jiang et al. 2008; Röthlisberger et al. 2008; Suárez et al. in preparation) and biosensors for non-natural molecules (Looger et al. 2003), the redesign of improved protein binding affinity (Lazar et al. 2006), the redesign of protein binding specificity (Ogata et al. 2002; Joachimiak et al. 2006; Shifman et al. 2006) or the design of proteins able to bind non-biological cofactors (Cochran et al. 2005). Furthermore, the use of physico-chemical models in computational protein design has further advanced our knowledge on the biophysical interactions involved in processes such as protein structure (Jaramillo et al. 2002; Dantas et al. 2003; Kuhlman et al. 2003; López de la Paz & Serrano 2004), protein–protein interactions (Bolon et al. 2005; Joachimiak et al. 2006), DNA–protein interactions (Ashworth et al. 2006) and enzymatic activity.

In computational protein design, we search the space of amino acid sequences and rank the sequences according to their folding free energy towards a predefined folded structure. To evaluate this folding free energy, for each given sequence, we need to develop physical models of atomic interactions for the initial and final states, namely the folded and unfolded states. These models are then used to compute the free energy of each state and they are constructed to allow the fast estimation of the folding free energy for an astronomical number of sequences and the use of combinatorial optimization techniques. We will need to rank the sequences and find the sequence that minimizes the folding free energy. If we consider a small 50-residue protein, the number of possible sequences is 2050≈1065; we will have to use combinatorial optimization to explore the space of available sequences. Each of these steps involves important challenges that we are going to discuss in the subsequent sections. Firstly, we need an appropriate function to estimate free energies; this function has to be accurate enough to provide a good ranking of the sequences but it also has to be computationally efficient to be of practical use within combinatorial optimization. The required compromise between speed and accuracy involves the use of suitable approximations. Secondly, we will discuss the challenges posed by the modelling of the folded state: the reduction of the conformational space size through the use of rotamer libraries; the implications of the fixed backbone approach; and the attempts to introduce the backbone flexibility and the relevance of entropic contributions. Thirdly, we need a model of the unfolded state apt for combinatorial optimization. Fourthly, we will face the difficult problem of the search of the amino acid sequence space and the need to keep all energy terms pairwise, to allow for the use of combinatorial algorithms, which will also be described. Fifthly, we will discuss some improvements in the optimization problem implemented by destabilizing competing structures for the same sequence. This will lead us to the negative design problem.

Once we have overviewed protein design methodologies, we will review some of its achievements. We will also discuss the integration of this computational methodology with experimental techniques such as directed evolution. In fact, computational protein design can help in the construction of libraries for directed evolution experiments. Afterwards, we will review the challenges linked to the incorporation of biomolecular function, required to design functional proteins. Finally, we will discuss some other challenges ahead.

2. Using atomic energies to design protein structures

For any given amino acid sequence, we will estimate the free energy of the protein or of the protein–ligand complex using a physical model of atomic interactions and empirical data. We can follow Lazaridis & Karplus (2000) and Mendes et al. (2002) and define three types of models. (i) Statistical effective energy functions (SEEF), or knowledge-based potentials, extracted from protein structure and sequence databases. (ii) Empirical effective energy functions (EEEFs) that include terms accounting for protein stability obtained physically or empirically (Hao & Scheraga 1999). The weight of each term is optimized for protein design using a training set with empirical information. (iii) Physical effective energy functions (PEEFs): based on an accurate, first-principles description of interatomic interactions within proteins. This approach uses atomic-level MD force fields, usually specifically developed to study proteins (Brooks et al. 1983; Hermans et al. 1984; Weiner et al. 1984; Mayo et al. 1990). In contrast to what happens with EEEF and SEEF, PEEE uses the same parametrization regardless of the target structure and provides a sound grounding for methodological advances in force-field parametrization.

The physical principles that govern protein properties are the same that govern small molecules, so, if the physical model correctly describes the properties of small molecules, protein characteristics such as secondary structure propensities and the polar/non-polar amino acid pattern should emerge without the need to impose them (Koehl & Levitt 1999b; Wernisch et al. 2000).

The folding free energy takes into account the energetic competition between folded and unfolded states, so models of both the folded and unfolded states are needed ΔGfolding=GfoldedGunfolded.

2.1 Energy terms

The potential energy functions used in computational protein design have to meet certain requirements: they have to be accurate enough to capture the more relevant characteristics of the atomic interaction related to protein structure and function but, at the same time, they have to have a fast computational implementation. See Gordon et al. (1999) and Boas & Harbury (2007) for reviews on this topic.

Energies related to covalent bonds, bonding energies, are not usually considered in computational protein design, since we may assume that internal coordinates in the residues, such as bond length, angles, torsions, etc., remain unchanged along the folding process. These energetic terms play a key role in determining the conformation of the residues, but this difficulty is overcome through the use of rotamer libraries (see §2.2).

Electrostatic interaction energy between two charges qi, qj, at a distance Rij in a medium with dielectric constant ϵ, is computed using Coulomb's law: Embedded Image. Computational requirements usually impose the introduction of a long distance cut-off.

van der Waals interactions play a key role in the packing of protein cores. They are usually included by Lennard-Jones functions that included a short distance repulsion term (decreasing with Embedded Image) and a weakly attractive term that depends on Embedded Image. The values of the proportionality constants depend on the chosen force field.

In addition, some protein design protocols introduce in the force field extra terms, depending on the hybridization angles of the atoms in order to favour H-bond formation. Solvation energies accounting for the energetic interaction with the solvent play a critical role in protein design and will be addressed in §2.5.

2.2 Modelling the folded state

Protein design algorithms usually consider a given three-dimensional structure as the input and look for sequences that would fold into the given backbone; in this way, we can avoid the combinatorial complexity related to looking in structure space. By using a fixed backbone, we are assuming that the secondary structure will not be altered by side-chain mutations. This hypothesis should be carefully considered whenever performing mutations and especially if the considered amino acid is proline. The contributions from the (fixed) backbone and the fixed side-chains will equally contribute to the score of each sequence, so we do not need to include their contribution to the energy of the folded state.

To model the folded state, we have to face the complexity of the side-chain conformational space, so we will discretize the number of conformations of the residues through the use of rotamer libraries (Tuffery et al. 1991; Dunbrack & Karplus 1993). A rotamer (rotational isomer) is a side-chain conformation defined by the value of its inner dihedral angles. In naturally occurring proteins, some values of these angles are much more frequent than others (they are generally more energetically favourable), so we can consider only discrete values for these torsional angles. Other parameters such as bond length do not usually vary and can be considered to remain fixed. Rotamer libraries contain, for each residue, a collection of frequently appearing rotamers. Rotamer libraries can be backbone independent, backbone dependent (the frequency of the rotamers depend on the ϕψ backbone angles) or secondary-structure dependent (the frequencies depend on whether we consider an α-helix or a β-sheet). See Dunbrack (2002) for a review of existing rotamer libraries.

2.2.1 Towards backbone flexibility

The fixed backbone approximation biases the search on sequence space towards smaller amino acids that are able to fit in the already existing cavities, although a backbone adjustment could accommodate some small to large mutations at the core, as was shown in the systematic characterization of T4 lysozyme mutants by Hurley et al. (1992), Baldwin et al. (1993) and Harbury et al. (1993). They observed that, upon core mutations, the angular conformations of the mutated side-chains remain roughly unchanged but the main chain suffered displacements. Mooers et al. (2003) computationally redesigned the core of the T4 lysozyme with the fixed backbone hypothesis and found differences between the designed and obtained structures of up to 2.8 Å.

One approach to introduce backbone flexibility is to simultaneously consider an ensemble of closely related structures (Desjarlais & Handel 1995, 1999; Kraemer-Pecore et al. 2003). For highly symmetric proteins (e.g. Tim barrels), we can use parametric descriptions to classify the ensemble of generated structures. Using this approach, Harbury et al. (1998) designed non-natural right-handed coiled-coil folds (a dimer, a trimer and tetramer), obtaining structures with less than 0.6 Å r.m.s.d. between the predicted and crystallographic structures.

Su & Mayo (1997) redesigned the core of the protein Gβ1 domain, allowing small bulk movements of rigid secondary structure elements, and obtained core sequences similar to those obtained for the original backbone.

In the aforementioned cases, regardless of whether the backbone is flexible or not, a discrete rotamer library is used. Fung et al. (2008) have recently used flexible templates; they used a continuously varying template and NMR structure refinement techniques instead of discrete rotamers, so that all continuous Cα–Cα distance and dihedral angle values vary within a predefined range. Another way to introduce backbone flexibility is the backrub method, which is based on protein backbone motions of three-residue segments: the Cα atoms are chosen as pivot points to rotate the backbone around the axis formed by consecutive Cα's. The backbone modifications significantly alter the corresponding side-chain conformations in a similar way to that observed in crystal structures (Smith & Kortemme 2008).

One of the most impressive backbone designs was developed by Kuhlman et al. (2003): the de novo design a protein fold. Their method was based on iterative rounds of sequence design and backbone minimization and produced a protein, Top7, with a new fold and an r.m.s.d. between the designed fold and the crystallized one of 1.17 Å (§3).

2.3 Modelling the unfolded protein

For each sequence, the folding free energy takes into account the energetic competition between folded and unfolded states, so a model of the unfolded state is needed.

2.3.1 Random dipeptide model

We can assume that the energy of the denatured state depends only on its amino acid composition. We can maintain a fixed amino acid composition (Koehl & Levitt 1999a) or a hydrophilic–hydrophobic pattern (Dahiyat & Mayo 1997) to avoid this computation.

More general approaches rely on the use of reference energies for each amino acid type. SEEF approaches use experimental data (Street & Mayo 1998; Kuhlman & Baker 2000). On the other hand, first-principles methodologies construct a model of the reference state using a non-interacting gas of amino acids. Solvation energies are computed assuming that the amino acids are fully solvent accessible (Hellinga & Richards 1994) or introducing some burying (Wernisch et al. 2000). The contribution of each amino acid can be computed using dipeptides, tripeptides or by averaging its contributions in different positions in an ensemble of decoy structures (Jaramillo & Wodak 2005).

Neglecting the reference energy for the unfolded state leads to designed proteins with an overwhelming abundance of polar amino acids that tend to replace the hydrophobic ones, causing poor core packing in cavities; arginine would be the most over-represented amino acid, due to its size and high solubility.

2.3.2 Structured unfolded state

Residual structure may appear in the denatured state generating a low change in the specific heat ΔCp during the folding process, contributing to a higher free folding energy and to a higher denaturation temperature (Mok et al. 2001). Anil et al. (2006) analysed the structure of the unfolded state to design a hyperstable protein by combining two mutations that reduced the entropy of the unfolded state. Creamer et al. (1995) have suggested using fragments from native proteins to refine the unfolded state modelling. The proposed modification leads to a better correlation between the experimental and computed data of ΔΔG upon mutation (Alm & Baker 1999; Kim et al. 2000; Pokala & Handel 2005).

The structure of the unfolded state is usually neglected in protein design, since it would lead to the design of sequences that have to adopt the reference structure at some point of the folding process. In addition, the initial residual structure would have a certain number of buried residues, so that the hydrophobic effect would be given a lower weight in the energy function and, as a result, we may risk losing the hydrophilic–hydrophobic patterning.

2.4 Pairwise approximation

Large-scale protein designs require combinatorial optimization methods to explore the vast sequence space. These methods usually demand the precomputation of the different residue–residue energies, which has led to the development of pairwise decomposable energy functions. The pairwise approximation implies that, instead of evaluating simultaneously all interactions among residues, at most two different residue conformations are evaluated at a time, which may lead to unrealistic results. The advantage of the pairwise approximation is that, whenever performing a mutation or a conformational change in one of the designed positions, only the pairwise terms involving the targeted residue have to be considered. Although the number of pairwise terms in a protein grows as n2 (with n the number of residues), as we consider only the terms that involve the target residue, the number of terms involved grows only linearly with the size of the protein.

The drawback of the pairwise approximation when considering proteins is the huge effect of cooperative interactions within the protein: the effects of perturbing a given residue may propagate through an intricate network of interactions through the whole protein (Hilser et al. 1998). In fact, conformational changes on a limited number of residues (e.g. upon ligand biding) can trigger allosteric changes on the whole protein.

The use of the pairwise approximation allows the use of efficient optimization algorithms, but has biased the research on energy functions towards those that lend themselves to this type of expressions. Electrostatic and van der Waals interactions are inherently pairwise, since these energies can be expressed as a sum of interactions over atomic pairs; on the other hand, the pairwise approximation poses serious problems whenever considering the environment-dependent solvation energies. Unfortunately, even the models with solvation energies decomposable in atom–atom pairs, which we will address below, lead to unphysical results: the solvation of one side-chain pair strongly depends on the shape of the protein, which depends on the conformation of all other side-chains that, in turn, modify the solvent structure. This not only leads to a loss of accuracy when using combinatorial optimization, but also, in most cases, provides unrealistic results (Jaramillo & Wodak 2005).

2.5 Modelling the solvent

When computing free energies, we have to include the solvent's degrees of freedom. The free energy difference with respect to the vacuum, solvation free energy, accounts for the averaging out of the solvent degrees of freedom. The solvation free energy accounts for the free energy cost of transferring the solute from vacuum into the solvent. Therefore, these terms contain the entropic contributions of the solvent and the solute–solvent interactions. The large drop in heat capacity ΔCp, which usually accompanies folding in water (Woolfson et al. 1993), signals electrostatic effects due to the presence of the solvent surrounding the protein as one of the main driving forces of the folding process and, accordingly, is one of the key elements in the design of the scoring functions (Loladze et al. 1999; Domingues et al. 2000; Jaramillo et al. 2002). The alterations in the solvent structure and therefore the main contributions to the folding free energy arise from the energetic cost of burying an exposed residue.

Despite recent accomplishments in the design of small ligands in the presence of explicitly included water molecules (Mancera 2002; Ogata et al. 2002) or the simulations of small systems (Ma & Nussinov 1999), the combinatorial complexity introduced by solvent molecules forbids their use in computational protein design. Schymkowitz et al. (2005) modified the solvation model of their statistically derived force field to allow prediction of structural water molecules since, in specific instances (protein–protein interfaces, protein cavities where water molecules may be buried or docking problems; van Dijk & Bonvin 2006), the explicit consideration of the corresponding water bridges provides a more accurate estimate of the mutational free energy change. Unfortunately, these approaches cannot be easily applied to protein design with various simultaneous mutations.

Jiang et al. (2005) have developed a ‘solvated rotamer’ approach that explicitly includes water molecules in certain locations of the protein, so that we get a limited number of additional degrees of freedom. We may introduce ‘solvated rotamers’ within the designing procedure by considering, for each designed position, the solvated version of each rotamer (figure 2), computed considering the frequency of water molecules binding to specific sites along the peptide chain (Petukhov et al. 1999).

Figure 2

Placement of water molecules in the solvated rotamers. Studies on the distribution of water molecules around backbone and polar residues lead to the construction of solvated rotamers ((a) carbonyl oxygen in Asp, Glu, Gln and Asn; (b) amide nitrogen in Lys, Arg, Gln and Asn; (c) hydroxyl oxygen in Ser and Thr; (d) hydroxyl oxygen in Tyr; (e) aromatic hydrogen in His). Although only modifications to the side-chains have been included here, extra water molecules can also be included in the backbone description. The solvated rotamers can afterwards be introduced in computational protein design (Jiang et al. 2005).

On the other hand, we can incorporate the effects of the solvent through an effective free energy. This energy can be calculated using continuous electrostatics or empirical methods (see Simonson 2001 and references therein). These models use different values of the dielectric constant for solvent regions and then Poisson–Boltzmann (PB) equations are solved (Kirkwood & Westheimer 1938). Usually, these methods are too time-consuming to be of practical use. In addition, the numerical resolution of the equations meets convergence problems. Analytical and semi-analytical modifications of these models (Feig & Brooks 2004) have been reported to yield good agreement with full continuum approaches and are nowadays included in standard molecular modelling packages (Brooks et al. 1983; Weiner et al. 1984). Vizcarra et al. (2008) have developed a pairwise finite-difference PB method for its use within computational design methodologies using combinatorial optimization.

Neglecting the solvation energies, Wernisch et al. (2000) obtained protein designs with a higher abundance of polar residues than that in natural proteins. Although the core of the designed proteins appeared well packed and with no relevant cavities, they obtained a network of buried electrostatic interactions, which usually destabilize protein cores.

To treat membrane proteins, solvents other than water are to be considered. Yin et al. (2007) have succeeded in designing peptide sequences, specifically binding transmembrane proteins, using a simplified solvation model related to the depth in the membrane.

2.6 Entropic contributions

Folding free energy contains the contribution of the entropy from the unfolded and folded states. Entropic terms could arise from: (i) vibrational degrees of freedom from bonding interactions, which may be considered to remain approximately constant neglected (Karplus et al. 1987), (ii) solvation entropy, due to the interaction with the solvent, which is accounted for through the implicit solvation free energies (Lazaridis & Karplus 1999), and (iii) conformational entropy.

The entropic contribution to the folded state is a widely disputed topic: we could neglect this entropy, due to the close packing of the residues in the protein (Doig & Sternberg 1995; Dahiyat & Mayo 1997; Jaramillo et al. 2001) or include it, since a large number of side-chain configurations may stabilize the same fold (Jin et al. 2003; Berezovsky et al. 2005; Zhang & Liu 2006). Different groups working in computational protein design using PEEF approaches have achieved protein core redesigns with native-like sequence but, on the contrary, they have not been able to reproduce the same native-like sequences on the surface of the proteins (Kuhlman & Baker 2000; Jaramillo et al. 2002). This failure may be attributed to an inadequate treatment of entropy.

Hu & Kuhlman (2006) evaluated the average energy and entropy of an ensemble of side-chain conformations generated by Monte Carlo sampling around the minimum energy conformation. Sciretti et al. (2009) have produced an approximation for the conformational entropy in terms of single-site probabilities and pair occupancy probabilities, and then they obtain the most likely rotamer distribution and its conformational entropy. Although the reported approaches yield different results, not easy to compare, the same conclusion substantially holds in both cases: conformational entropy is not the main determinant of amino acid environmental preferences, although entropic effects may alter the results of the design.

3. Combinatorial optimization

Protein design involves the exploration of the gigantic space of possible sequences. The exploration of this space with appropriate search algorithms involves a trade-off between computational speed and exhaustiveness. In addition, the chosen algorithm imposes severe constraints on the score function to be used (as we have previously seen, the pairwise approximation is generally one of these constraints). We can distinguish two types of optimization methods that have been applied to computational protein design: heuristic or deterministic methods.

Heuristic methods are not guaranteed to find the global minimum; they usually sample the solution space randomly. For instance, they sample a solution, this solution is accepted or rejected and then the algorithm moves to another solution. The differences among the different heuristic algorithms reside on the criteria to accept/reject a solution and in this way the movement towards a new solution is carried out. Examples of such algorithms are those based on Monte Carlo methods, where solutions are accepted/rejected using a criterion such as Metropolis (Gardiner et al. 1956): modifications are accepted if they decrease the energy; whenever they increase it, there is still a small probability of them being accepted. Acceptance probability for unfavourable movements could be varied with an external parameter (akin to the temperature in Monte Carlo simulated annealing) or discard those movements altogether (Hellinga & Richards 1994; Kuhlman & Baker 2000; Wernisch et al. 2000; Jaramillo et al. 2002). Genetic algorithms use recombination to generate new models from the existing ‘parents’ (Desjarlais & Handel 1995). Heuristic approaches are the only possibility for larger problems since they can be used in an almost infinite space of solutions; on the other hand, the optimization algorithm may end up trapped in a local minimum far away from the global energy minimum (figure 3).

Figure 3

(a) Iterative protocol for backbone and sequence simultaneous optimization. The use of the same energy function for backbone and sequence optimization guarantees that all interactions will be equally treated. This protocol was employed by Kuhlman et al. (2003) in the design of Top7, a 93-residue α/β protein designed to have a novel sequence and fold. The iterative combinatorial optimization protocol used in this design was based on the use of heuristic methods: Monte Carlo protocol with Metropolis criterion. (b) The structure of Top7 is shown. PDB, Protein Data Bank.

Deterministic algorithms are designed to be equivalent to a systematic and exhaustive search of the solution space. If they find a minimum, they guarantee that it is the global minimum. The most common types of these algorithms start by performing a search on a subset of the space of solutions and then apply a rejection criterion to designate rotamers as incompatible with the global minimum and eliminate them. These algorithms usually apply the dead-end elimination theorem or its generalizations (Desmet et al. 1992; Goldstein 1994; Looger & Hellinga 2001). These methods have also been combined with Monte Carlo methods (Gordon & Mayo 1999) and have been widely used in computational protein design (Dahiyat & Mayo 1997; Wernisch et al. 2000; Bolon & Mayo 2001a; Looger et al. 2003). The convergence of these methods is not guaranteed, which could be an issue for larger problems.

3.1 Multi-objective optimization methods

The design of functional proteins implies the consideration of at least two, often competing, objectives: the foldability of the protein and its function. In addition, naturally occurring protein sequences are presumably selected according to multiple factors such as stability, function, solubility, foldability, no aggregation, etc. In addition, we can also see the introduction of negative design elements as the addition of an additional objective (destabilization of competing conformations; figure 4).

Figure 4

Example of positive and negative design states: schematic of competing states included in the design of ligand binding-induced allosteric changes ((a) open conformation, no binding; (b) open conformation with binding; (c) aggregated state; (d) closed conformation, no binding ligand; (e) closed conformation with binding; (f) unfolded state and ligand). The optimization has to consider two target states corresponding to the two desired conformations (a,e), guaranteeing stability of the open and closed conformations. Competing structures are included for stability (f), sensitivity (d), affinity (b) and solubility (c), and we will have to maximize their free energies simultaneously. The energies to minimize are Embedded Image and Embedded Image. Embedded Image reflects the probability of the ligand binding and not inducing the corresponding conformational change. Embedded Image reflects the probability of the conformational transition taking place in the absence of the ligand. The probability that a designed protein will adopt the target state is given by Embedded Image.

Simultaneously considering several competing objectives leads to a set of optimal solutions, each of them representing a different trade-off between the objectives. To obtain solutions with the desired characteristics, we could enforce a set of criteria to reduce the solutions (i.e. by demanding stability above a given threshold or demanding a certain structure to the network of H-bonds; Looger et al. 2003; Joachimiak et al. 2006; Sood & Baker 2006). However, the application of multi-objective optimization algorithms (Humphris & Kortemme 2007; Suárez et al. 2008) will allow simultaneous exploration of all possible sensible combinations without imposing a priori constraints.

Humphris & Kortemme (2007) applied a multi-objective computational algorithm to explore promiscuous protein interfaces and maintain interaction with known partners. They found that the sequences so selected are more native like the ones obtained by optimizing the interaction with only one partner. In this work, the multi-objective cost function is constructed given equal weight to the different interactions. On the other hand, in the work by Suárez et al. (2008), an algorithm is presented to explore all different relative weights between the different interactions, since binding to one given partner may be more relevant than binding to another one.

4. Design of sequence libraries for directed evolution

Directed evolution is able to simultaneously explore many mutations while selecting for stability and function, but the number of sequences computational optimization techniques can explore (size of the combinatorial library) is much bigger, so a combination of both methods will probably provide the best results. Computational protein design is able to rapidly screen in silico the space of sequences and eliminate sequences incompatible with the fold, generating in this way a combinatorial library with a degree of complexity low enough to be experimentally screened, but nevertheless without a relevant loss in its diversity. Voigt et al. (2001) have developed a computational algorithm to identify the residues with the highest structural tolerance in subtilisin E and T4 lysozyme, and it was used to predict positions that, upon mutation, were likely to improve specific protein properties. Following a similar strategy, Hayes et al. (2002) were able to signal 19 positions in TEM-1 β-lactamase, which were used to generate a 2×105 mutant library that, upon screening, produces a 1280-fold increase in resistance to cefotaxime (due to a specificity enhancement).

Different computational protein design algorithms can be used to generate these libraries and the results are generally not easily comparable among them or with a randomly generated library. The work by Treynor et al. (2007) compared distinct libraries generated for green fluorescent protein by the mutation of 15 positions. They compared randomly generated libraries and libraries constructed using structure-based computational methods developed by Voigt et al. (2001), Hayes et al. (2002) and Treynor et al. (2007). They judged preservation and diversity of function by evaluating the distributions of brightness and colour. Treynor et al. (2007) concluded that designed libraries showed greater diversity and greater preservation of function than the randomly generated libraries. The design of a library for directed evolution experiments can also be seen as a multi-objective optimization problem, where the competing objectives are the complexity and the diversity of the library and require the use of specifically designed tools to optimize the libraries.

Still one of the remaining challenges is to develop an algorithm to predict the mutations to introduce active sites into non-catalytic scaffolds, which will allow the introduction of desirable catalytic activities within stable and well-behaved frameworks.

5. Negative design

Computational protein design aims to find sequences that are able to fold into the given structure. The optimization process looks for sequences decreasing folding free energy (positive design). In order for the folded state of the obtained sequence to be stable (and unique), information on competing structures may be explicitly included in the sequence design (negative design). In addition, negative design can also be used to increase the solubility of the designed protein (Pokala & Handel 2005).

Negative design strategies consider aggregates and misfolded structures as alternative states, in equilibrium with the target structure, which can be sampled by a given sequence. Together with the folding free energy of the target structure, Embedded Image, there appears Embedded Image, the folding free energy of the competing structure. The goal of the designing procedure will be to minimize Embedded Image while maximizing Embedded Image. Figure 4 shows an example of some of the competing structures that can be studied when designing ligand binding-induced allosteric changes.

Negative design elements had been introduced in earlier protein design studies before computational methods were developed. Regan & DeGrado (1988) and Hecht et al. (1990) had introduced negative design elements in earlier protein design studies before computational methods were developed: they incorporated glycine or proline amino acids to fixed positions to disfavour helix formation and favour turn formation.

The design of protein–protein interfaces also includes elements from negative design. Summa et al. (2002) designed an A2B2 heterotetrameric di-iron protein and performed negative design to avoid the formation of competing homotetramers or heterotetramers; their design produced the correct conformation and showed enzymatic activity. Havranek & Harbury (2002) have developed a general method for the automated design of specificity in molecular recognition based on negative design: the algorithm selects sequences with an energetic preference for the target state over the competing states. They designed and characterized two coiled-coil sequences that associated into homodimers and did not cross-hybridize; they modelled four states: the homodimer and heterodimer conformations, the unfolded and aggregated states, to avoid sequences with poor solubility.

Negative design is also important whenever molecular specificity is an additional objective for the design: proteins able to selectively bind a given small molecule without interacting with similar compounds; proteins able to undergo an allosteric transformation upon recognition of a regulatory compound; or enzymes that show great substrate specificity. Nevertheless, there are cases where excellent specificity was achieved without explicit negative design (Looger et al. 2003; Cochran et al. 2005; Hu et al. 2008; Yosef et al. 2009).

In natural existing proteins, misfolded conformations are destabilized by the presence of strongly repulsive amino acids at distant positions (in the native structure); in the misfolded conformations, these amino acids will not be distant and will interact repulsively; thus, charged amino acids that repel each other in non-native conformations of proteins contribute to negative design (Bolon & Mayo 2001b; Berezovsky et al. 2005); in this way, optimizing the sequence for a particular structure might automatically introduce negative design elements encoded in the conformation of the protein backbone (Yosef et al. 2009).

6. Design of biomolecular function

There are processes such as dimerization, protein interaction, DNA protein interaction or enzymatic activity, which can also be studied from the protein design point of view, even though stability is not the only goal in the design.

6.1 Design of protein–ligand interaction

Specificity issues can be addressed from a negative design perspective, where not only the sequences leading to the final goal are considered but also the competing structures or ligands are taken into account. Ashworth et al. (2006) used negative design to redesign the specificity of an endonuclease. Their goal was to obtain a pair: designed protein-mutated DNA molecules were able to interact among themselves, but the designed protein did not interact with wild-type DNA and natural endonuclease was not able to cut the mutated DNA. This and other examples (Green et al. 2006; Joachimiak et al. 2006) show the usefulness of introducing negative design when considering specificity. Fortunately, affinity is linked to specificity and sometimes, when optimizing for affinity, specificity is found without the need to introduce negative design (Looger et al. 2003; Cochran et al. 2005; Hu et al. 2008; Yosef et al. 2009). Figure 5 shows the redesign by Looger et al. (2003) of the ribose-binding protein from Escherichia coli for specificity against trinitrotoluene.

Figure 5

Redesign of ribose-binding protein for specificity towards trinitrotoluene. (a) Wild-type protein binding a ribose molecule (Protein Data Bank access code 2DRI). (b) Protein redesigned to recognize trinitrotoluene by Looger et al. (2003).

When considering protein–protein/DNA interaction specificity, the free energy we minimize is usually the binding energy between the members of the complex. It is worth noting that usually these binding energies include a reference to the free energy of the unfolded state (reference energies) not to endanger foldability (Ogata et al. 2002).

Tools to redesign protein–protein interaction specificity will allow the manipulation of cellular networks involved in regulation and complexity. Kortemme et al. (2004) and Joachimiak et al. (2006) succeeded in redesigning the interaction interface between a bacterial DNase (colicin E7) and its inhibitor (immunity protein Im7) by looking for disruptive mutations at the interface that destabilized the original complex and stabilized the designed one, obtaining new partners with a 300-fold specificity switch (figure 6).

Figure 6

Redesign of the interface between the DNase E7 and immunity protein Im7. Wild-type DNase E7 has PDB access code 7CEI. The redesign of the interface produced a mutant with a 300-fold higher specificity (PDB access code 2ERH). The figure shows the following mutations: D35Y, T51Q (Im7) and T539Q, K528Q (E7) and the changes they introduce in the H-bond network in the interface between the protein and the inhibitor.

6.2 Designing an existing enzymatic function

Enzymes can enhance reaction rates by up to 23 orders of magnitude (Kraut et al. 2003). When designing enzymes, care has to be taken to define the scoring function, since multiple factors may be important: substrate accommodation; transition state binding; product release; catalytic residues orientation; protein stability/flexibility; conformational changes; etc. Different scoring functions have been used: enzyme–substrate binding energy or transition state binding energy together with geometrical constraints on active-site geometry and catalytic residue orientation (Chakrabarti et al. 2005; Lassila et al. 2005).

Considering the stabilization of the transition state, using the binding energy as the score function increases the enzyme efficiency through the acceleration of kcat. However, in this process, KM is not directly affected, even though the efficiency of the enzyme is given by kcat/KM. Defining scoring functions able to select a good kcat/KM ratio is one of the pending tasks of protein design.

6.3 Designing de novo enzymatic function

The design of new enzymes with tailored functionalities able to catalyse reactions for which non-natural catalysts occur will open new avenues for biotechnology and biomedicine (Jiang et al. 2008; Röthlisberger et al. 2008).

The design of new active sites poses several problems: an active-site location has to be found, but its surroundings have to be designed to allow room for the active site and optimize the transition state binding (Lassila et al. 2006). The placing of small molecules also needs special treatment to deal with the combinatorial problem arising when considering their possible degrees of freedom (Zanghellini et al. 2006).

To introduce a new active site into an active pre-existing enzyme, care has to be taken not to alter the original function. Residues surrounding the native active site appear to be a good choice due to their high plasticity, since they do not belong to the protein scaffold or the native catalytic machinery (Khersonsky et al. 2006). Suárez et al. (in preparation) mutated these residues to introduce esterase activity into a thioredoxin scaffold, following Bolon & Mayo (2001a), and obtained 1.5-fold enhancement of the native oxidoreductase activity.

6.4 Design of allosteric proteins

The (re)design of proteins undergoing ligand-induced conformational changes will provide new elements for biotechnological applications, since some biological mechanisms such as signal transduction across large distances are linked to allostery. The redesign of these changes requires the simultaneous consideration of various objectives (one for each considered conformation; figure 4).

Clear progress has been made in understanding and controlling the conformational changes induced by ligand binding. Shifman et al. (2006) have successfully redesigned the Ca2+ signal transducer calmodulin (CaM) to investigate the CaM-induced activation of the calmodulin-dependent protein kinase II (CaMKII) in the presence of only two Ca2+ ions (instead of the four CaMs that are able to bind). Their design was based on the inactivation of one of the binding domains by stabilizing it in the ‘closed’ conformation.

Looger et al. (2003) redesigned the closed conformation of the ribose-binding protein of E. coli to alter its substrate specificity and construct receptors that bind trinitrotoluene, l-lactate and serotonin with high selectivity and affinity. The receptors were used afterwards to construct biosensors for these elements.

7. Future challenges and applications of protein design

More research is needed to overcome the challenges posed by computational protein design. First of all, in computational protein design, it is assumed that free energies (folding or binding free energies) represent some of the desired characteristics of the protein. The way these energies are defined is a measure of our understanding of biophysical processes such as protein folding, binding or enzymatic activity. Research on enzymatic mechanisms, folding pathways, allosteric phenomena, etc. will help us develop scoring function that correctly captures the characteristics of the underlying process. The way these energies are defined will affect, among others, the way in which the unfolded state is defined, how entropic effects are treated and how far the fixed backbone approximation can be maintained. Regardless of how the scoring functions are defined, an approximation for them has to be used in computational protein design. The use of physics-based atomic potentials will help us understand the different approximations that have to be introduced in order to have accuracy and speed to compute scoring functions. Nevertheless, knowledge-based potentials are commonly used successfully. Solvent-related effects and the search for an adequate approximation for the solvation energies are hot topics in this field. Sampling is the third challenge computational protein design has to face, since we need to perform a suitable exploration of the space of available sequences. The development of combinatorial optimization algorithms that include negative design elements and multi-objective criteria will advance the field of computational protein design.

On the other hand, the applications of computational protein design can help solve some fundamental biological questions. The number of natural protein folds is estimated to be approximately 1000 (Orengo et al. 1994), much smaller than the number of folds we may think of. The open question is whether the not yet observed folds are unrealizable or have not been characterized or whether evolution has had no time to sample them yet. Computational protein design has been used for the de novo design fold cite (Kuhlman et al. 2003), providing useful insights into these questions, but still a generalized framework for the design of new protein folds is lacking.

Computational protein design can also provide useful insights into fundamental questions regarding the folding problem. Nauli et al. (2001) redesigned the folding pathway of protein G (a single α-helix packed against a four-stranded β-sheet formed by two symmetrically opposed β-hairpins; in the folding process, the second β-turn is formed and the first disrupted). They reversed the folding pathway by increasing the stability of the first hairpin and decreasing that of the second one, and they obtained a 100-fold faster folder.

The modular properties of proteins and their separation in functional domains allow the exploration of an interesting application of computational protein design: the creation of chimeric proteins by fusion of several protein modules in a single protein assembly. Computational methods would predict the modules that would allow the construction of complex multifunctional assemblies that may finally encompass, for example, all the needed enzymes on a given pathway or the needed zinc finger proteins to target a given DNA sequence (Sander et al. 2007). This treatment requires the previous analysis of the docking problem to obtain an accurate description of the model. The work by Chang et al. (2007) has been used to design a link between ferredoxin and hydrogenase by P. Silver's group and our group. The goal of this fusion would be to increase the electron flow from ferredoxin to hydrogenase and to increase the hydrogen evolution rate. To increase the electron flow to ferredoxin, Antonkine et al. (2007) linked it directly to the Photosystem I system. Ha et al. (2006) built an autoregulated modular enzyme by fusing two domains that underwent a conformational change upon ligand binding, the competing conformations of each module allowed the complex to act either as barnase (in the absence of ligand) or as a ligand-binding polypeptide. In addition, modular properties of proteins can be exploited to design protein-based logic gates with AND or XOR behaviour (Salis & Kaznessis 2006; Sallee et al. 2007).

Computational protein design methods can allow the increase in the affinity between the domains forming the complex through the redesign of the interface between two proteins, since using a flexible linker may not always be enough. An early example of this approach can be found in Chevalier et al. (2002), where a new chimeric DNA-binding protein was designed (and constructed) by fusion of two DNA-binding domains from homing endonucleases in such a way that the designed protein was able to recognize a DNA sequence formed by the two target sequences of each domain. Additionally, the chimeric protein constructed retained the catalytic activity and was able to cleave its target sequence. These applications will be greatly benefited by the development of protein docking with backbone flexibility methodologies, although this is a formidable challenge because rigid body and internal backbone degrees of freedom are mixed in the search space.

Enzyme design presents a huge challenge, not only in the de novo design of catalysts for which no natural counterpart is known, but also in the design of multi-purpose enzymes, which may have a wide range of biotechnological applications related to fields such as industrial organic synthesis and metabolic engineering (Jürgens et al. 2000; Canada et al. 2002; Bornscheuer & Kazlauskas 2004; Kazlauskas 2005). One of the actual challenges in computational protein design is the use of accumulated information on enzymatic moonlighting (Jeffery 1999) or catalytic promiscuity (O'Brien & Herschlang 1999) towards the design of multi-purpose enzymes. In nature, a huge number of reactions are mediated by metallo-enzymes, but the interactions between the metal and the protein, charge transfers and polarizations effects, pose problems of their own, which have to be solved if we want to continue along the lines of Benson et al. (1999), Oelschlaeger & Mayo (1999) and Spiegel et al. (2006).

Acknowledgements

We acknowledge financial support from EU grants BioModularH2 (FP6-NEST contract 043340), EMERGENCE (FP6-NEST contract 043338) and TARPOL (FP7 KBBE contract 212894) and the ATIGE Genopole/UEVE CR-A3405. We thank the reviewers of this manuscript for their helpful comments and suggestions.

Footnotes

  • One contribution to a Theme Supplement ‘Synthetic biology: history, challenges and prospects’.

  • Received November 30, 2008.
  • Accepted February 4, 2009.

References

View Abstract