## Abstract

The protein structure prediction (PSP) problem is concerned with the prediction of the folded, native, tertiary structure of a protein given its sequence of amino acids. It is a challenging and computationally open problem, as proven by the numerous methodological attempts and the research effort applied to it in the last few years. The potential energy functions used in the literature to evaluate the conformation of a protein are based on the calculations of two different interaction energies: *local* (bond atoms) and *non-local* (non-bond atoms). In this paper, we show experimentally that those types of interactions are in conflict, and do so by using the potential energy function Chemistry at HARvard Macromolecular Mechanics. A multi-objective formulation of the PSP problem is introduced and its applicability studied. We use a multi-objective evolutionary algorithm as a search procedure for exploring the conformational space of the PSP problem.

## 1. Introduction

Proteins are long sequences of 20 different amino acids. Proteins are known to have many important functions in the cell, such as enzymatic activity, storage and transport of material, signal transduction, antibodies and more (Whisstock & Lesk 2003). The amino acids composition of a protein will usually uniquely determine its three-dimensional structure (Anfinsen 1973), to which the protein's functionality is directly related.

The protein structure prediction (PSP) problem is currently one of the most challenging problems in Biochemistry and Bioinformatics (Nicosia 2004). It is simply defined as the task of understanding and predicting how the information coded in the amino acid sequence of proteins translates into the three-dimensional structure of the biologically active protein. If we were able to solve the PSP problem we would greatly simplify, for example, the task of understanding the mechanism of hereditary and infectious diseases, of designing drugs with specific therapeutic properties and of growing biological polymers with specific material properties.

Protein folding (PF) has to be distinguished from PSP. In the PSP, we are not interested in the folding process (dynamic aspect), but only in the attained final structure (static aspect). Common methods for finding protein three-dimensional structures (such as X-ray crystallographic and NMR—nuclear magnetic resonance) are slow and costly, and may take up to several months of lab work. As a consequence, there has been a continuously growing interest in the design of *ad hoc* algorithms for the PSP problem. There are two main types of computational strategies which are employed today: knowledge-based and *ab initio*. The hypothesis behind knowledge-based methods (homology modelling, Tramontano & Morea 2003; threading, Mirny *et al*. 2000) is that similar sequences will fold in a similar way. *Ab initio* strategies (Klepeis *et al*. 2005) are required when no homology is available so that one is forced to fold the proteins from scratch.

Successful structure prediction requires a free energy function sufficiently close to the right potential for the native state, as well as a method for exploring conformational space. Current potential functions, however, have limited accuracy and the conformational space is vast. Several algorithmic approaches have been applied to the PSP problem in the last 20 years (Nicosia 2004; Cozzetto *et al*. 2005): molecular dynamics, Metropolis Monte Carlo, simulated annealing, simulated tempering, evolutionary algorithms. In spite of all these efforts, PSP remains a challenging computationally open problem.

## 2. Multi-objective optimization

Historically, PF and PSP, both central problems in molecular biology, have been approached as a large single-objective optimization problem: given the primary sequence find the three-dimensional native conformation with minimum energy (PSP), and the pathways to reach the native conformation (PF), using a single-objective potential energy function. Molecular dynamics, Monte Carlo methods and evolutionary algorithms (Bowie & Eisemberg 1994; Hansmann & Okamoto 1997; Pendersen & Moult 1997; Simons *et al*. 1997; Cui *et al*. 1998; Nicosia 2004) are today's state of the art methodologies to tackle PF and PSP as a single-objective optimization problem.

We conjecture and partially verify by computational experiments that it could be more suitable to model the PSP problem as a multi-objective optimization problem (MOOP).

When an optimization problem involves more than one objective function, the task of finding one (or more) optimum solution, is known as multi-objective optimization (Steuer 1986). The PSP problem naturally involves multiple objectives. Different solutions, i.e. the three-dimensional conformations, may involve a trade-off (the conflicting scenario in the funnel landscape) among different objectives. An optimum solution with respect to one objective may not be optimum with respect to another objective. Consequently, one cannot choose a solution which is optimal with respect to only one objective. In general, in problems with more than one conflicting objective, there is no single optimum solution. There exists, instead, a set of solutions which are all optimal, called the *optimal Pareto front*.

Hence, for a MOOP we can define the following procedure:

find the optimal (or the observed) Pareto front with a wide range of values for objectives; and

choose one of the solutions in the Pareto front, using some higher-level information.

We would like to emphasize the fact that our major goal in using the multi-objective approach to PSP is to find the *folded ensemble* by means of the Pareto optimality concept.

We think that finding the native structure of a given protein is not equivalent ‘to finding a native state needle in a conformational space haystack’ but, instead, should be more like ‘finding a set of equivalent needles in a haystack’. Obviously, the problem is still far from being solved, but we are modelling the PSP problem in an alternative and more accurate way; that is, at any stage the molecule exists in an ensemble of conformations. We want to model such a stage as an approximated Pareto front.

Indeed, the authors of the paper (Ma *et al*. 1999) report: ‘The long-held views on lock-and-key versus induced fit in binding arose from the notion that a protein exists in a single, most stable conformation, dictated by its sequence. However, in solution proteins exist in a range of conformations, which may be described by statistical mechanical laws and their populations follow statistical distributions. Upon binding, the equilibrium will shift in favour of the bound conformation from the ensemble of conformations around the bottom of the folding funnel.’

Hence, the ensemble of equivalent conformations is crucial for proteins in solution, as well. Finally, it is evident how the multi-objective approach intends to discover the conformation populations around the bottom of the folding funnel using the Pareto optimality concept so as to study the biological activity in this stage. Indeed, conformational diversity around the bottom of the folding funnel may provide a simple yet elegant solution to a range of binding processes. We want to discover this conformational diversity around the bottom of the funnel using the Pareto optimality.

Since tackling multi-objective problems in their native form (i.e. rather than adding together objective values or treating some of them as constraints) can be actually beneficial, it should be more effective and accurate to face a problem as a MOOP when the objectives have a significant anti-correlation or conflict, although this is probably not a sufficient condition (Louis & Rawlins 1993; Knowles *et al*. 2001; Jensen 2003). In particular, Knowles *et al*. (2001) showed experimentally how in some cases transforming a single-objective problem into a multi-objective one can reduce the number of local optima and facilitate the optimization process. Moreover, they showed how this process, called *multi-objectivization*, can facilitate the search process for neighbourhood based algorithms, like Pareto archived evolutionary strategy (PAES).

In a recent article (Day *et al*. 2002), Lamont and co-authors reformulated the PSP problem as a MOOP and used a multi-objective evolutionary algorithm (MO fmGA) for the structure prediction of two small protein sequences: [Met]-enkephelin (five residues), polyalanine (14 residues). After this initial approach, this idea was applied to medium size protein sequences (46–70 residues) with promising results (see Cutello *et al*. 2005).

In the following, we formally introduce the MOOP.

### 2.1 Multi-objective optimization problems

A MOOP can be formally defined as follows.

*Find a vector* *which*

*satisfies the p equality constraints,*(2.1)*is subject to the m inequality constraints,*(2.2)*and which optimizes the vector function,*(2.3)

Hence, we have the following two hyperspaces.

*Equations* *(2.1) and (2.2)* *define the feasible region (or decision variable space)*(2.4)*and any point x*∈

*Ω defines a feasible solution*.

*The vector function f*(

*)*

**x***maps the elements of Ω into a set Λ which represents all possible values of the objective functions:*(2.5)

The evaluation function of the MOOP , maps decision variables ** x**=(

*x*

_{1},

*x*

_{2}, …,

*x*

_{n}) to vectors

**=(**

*y**y*

_{1},

*y*

_{2}, …,

*y*

_{k}).

*A point x*

^{*}∈

*Ω is Pareto optimal if for every*∈

**x***Ω and I*={1, 2, …,

*k*}

*either*(2.6)

*or there is at least one i*∈

*I such that*(2.7)

*A vector u*=(

*u*

_{1}, …,

*u*

_{k})

*is said to dominate*=(

**v***v*

_{1}, …,

*v*

_{k})

*, denoted by*

*if and only if*∈{1, …,

**u**is partially less than**v**, i.e. for all i*k*}, .

If the vector * u* dominates the vector

*, or mathematically , we also say that*

**v***is dominated by*

**v***, or*

**u***is non-dominated by*

**u***, or*

**v***is not inferior to*

**u***.*

**v***A point x*

^{*}∈

*Ω is a weakly non-dominated solution if there is no*∈

**x***Ω such that f*

_{i}(

*)<*

**x***f*

_{i}(

**x**^{*})

*, for i*=1, …,

*k*.

*A point x*

^{*}∈

*Ω is a strongly non-dominated solution if there is no*∈

**x***Ω such that f*

_{i}(

*)≤*

**x***f*

_{i}(

**x**^{*})

*, for i*=1,…,

*k and for at least one value of i, f*

_{i}(

*)<*

**x***f*

_{i}(

**x**^{*}).

Thus, if **x**^{*} is strongly non-dominated, it is also weakly non-dominated, but the converse is not necessarily true.

*For a given MOOP f*(

*)*

**x***, the Pareto optimal set,*

^{*}

*, is defined as*(2.8)

In this paper, a Pareto optimal set that truly meets this definition is called a true Pareto optimal set, . In contrast, a Pareto optimal set that is obtained by means of an optimization method is referred to as an observed Pareto optimal set, . In reality, an observed Pareto optimal set is an *estimate* (or a discrete representation) of a true Pareto optimal set.

Identifying a good estimate is the key factor for the decision-maker's selection of a compromise solution, which satisfies the objectives as much as possible.

We denote the observed Pareto optimal set at time-step *t* obtained using an optimization method by (or the current observed Pareto optimal set). Moreover, we have(2.9)where is the total number of observed Pareto solutions at time-step *t*.

Obviously, the major problem a decision-maker needs to solve, is to find the best

*For a given MOOP f*(

*)*

**x***and Pareto optimal set*

^{*}

*, the Pareto front,*

*F*

^{*}

*, is defined as*(2.10)

As for the Pareto optimal set, we can define the *observed Pareto front* at time-step *t* by an optimization method:(2.11)where is the total number of observed Pareto front solutions at time-step *t*.

The goal of our work is to estimate the observed Pareto front, , by a multi-objective evolutionary algorithm for the structure prediction of real proteins. Identifying a good estimate of is crucial for the biologist's selection of a stable fold protein near native conformation, under biological conditions, satisfying the objectives as much as possible.

## 3. Methods

The most difficult task when using a search procedure for the PSP problem is to come up with good:

*representation*of the conformations,*cost function*for evaluating conformations, and*metrics*to evaluate how similar to the native structure are the predicted conformations.

We will now introduce the problems correlated with those aspects and we will describe our choices.

### 3.1 Representation of the polypeptide chain

Few conformation-representations are commonly used:

all-atom three-dimensional coordinates;

all-heavy-atom coordinates;

backbone atom coordinates+side-chain centroids;

*C*_{α}coordinates; andbackbone and side-chain torsion angles.

Some algorithms use multiple representations and move among them for different purposes.

In this work, we use an internal coordinates representation (torsion angles), based on the fact that each residue type requires a fixed number of torsion angles to fix the three-dimensional coordinates of all atoms. Bond lengths and angles are fixed at their ideal values. All the *ω* torsion angles are fixed at their ideal value 180°. So, the degrees of freedom in this representation are the backbone and side-chain torsion angles (*ϕ*, *ψ* and *Χ*_{i}). The number of *Χ* angles depends on the residue type (see table 1).

### 3.2 Potential energy function

In order to evaluate the structure of a molecule we need to use some cost or energy functions. To come up with some good functions it would be natural to use quantum mechanics, but it is too computationally complex to be practical in modelling larger systems, so, we use classical physics. Sometimes called potential energy functions or force fields, these functions return a value for the energy based on the conformation of the molecule. They provide information on what conformations of the molecule are better or worse. The lower the energy value, then the better should be the conformation.

Most typical energy functions have the form(3.1)where * R* is the vector representing the conformation of the molecule, typically in Cartesian coordinates or in torsion angles.

The literature on cost functions is enormous (Momany *et al*. 1975; Hermans *et al*. 1984; Cornell *et al*. 1995). In this work, in order to evaluate the conformation of a protein, we use the Chemistry at HARvard Macromolecular Mechanics (CHARMM) (v.27) energy function. CHARMM is a popular all-atom force field used mainly for studying macromolecules (MacKerell *et al*. 1998; Foloppe & MacKerell 2000). It is a composite sum of several molecular mechanics equations grouped into two major types: *bonded* (stretching, bending, torsion, Urey-Bradley, impropers) and *non-bonded* (van-der-Walls, electrostatics).

The CHARMM energy function has the form(3.2)where

*b*is the bond length,*b*_{0}is the bond equilibrium distance and*k*_{b}is the bond force constant;*S*is the distance between two atoms separated by two covalent bonds (1, 3 distance),*S*_{0}is the equilibrium distance and*k*_{UB}is the Urey Bradley force constant;*θ*is the valence angle,*θ*_{0}is the equilibrium angle and*K*_{θ}is the valence angle force constant;*Χ*is the dihedral or torsion angle,*k*_{Χ}is the dihedral force constant,*n*is the multiplicity and*δ*is the phase angle;*ϕ*is the improper angle,*ϕ*_{0}is the equilibrium improper angle and*k*_{imp}is the improper force constant; and*ϵ*_{ij}is the Lennard Jones well depth,*r*_{ij}is the distance between atoms*i*and*j*,*R*min_{ij}is the minimum interaction radius,*q*_{i}is the partial atomic charges and*e*is the dielectric constant.

Typically, *ϵ*_{i} and *R* min_{i} are obtained for individual atom types and then combined to yield *ϵ*_{ij} and *R* min_{ij} for the interacting atoms via combining rules. In CHARMM, *ϵ*_{ij} values are obtained via the geometric mean , and *R* min_{ij} via the arithmetic mean, .

As a final note, we would like to underline the fact that the CHARMM energy function simply adds together bond and non-bond energies. As it will be experimentally shown later on, this produces an energy landscape which does not well correlate with the real molecule folding process.

### 3.3 Distance matrix error and root mean square deviation metrics

To evaluate how similar the predicted conformation is to the native one, we employ root mean square deviation (RMSD) coupled with another frequently used metric, the distance matrix error (DME). RMSD is given by the formula(3.3)where *r*_{ai} and *r*_{bi} are the positions of atom *i* of structure *a* and structure *b*, respectively, and where structures *a* and *b* have been optimally superimposed. Fitting was performed using the McLachlan algorithm (McLachlan 1982).

DME is given by the formula(3.4)

This calculation does not require the superposition of coordinates. RMSD, which measures the similarity of atomic positions, is usually larger than DME, which measures the similarity of inter-atomic distances.

RMSD is one of the most used instruments for structure comparison. However, using RMSD alone has some negative aspects: best alignment does not always mean minimal RMSD; significance of RMSD depends on the size of the structures; significance of RMSD varies with protein type; it is not a good measure when all equivalent parts of the proteins cannot be simultaneously superposed; all atoms are usually treated equally, though, for example, residues on the surface have a higher degree of freedom than those in the core (so weights might be used in the calculation).

## 4. The multi-objective formulation

In order to reduce the size of the conformational space, backbone torsion angles are bounded in regions derived from secondary and supersecondary structure prediction (table 2).

Supersecondary structure is defined as the combination of two secondary structural elements with a short connecting peptide between one to five residues in length. A short connecting peptide can have a large number of conformations. They play an important role in defining protein structures. The conformations of the residues in the short connecting peptides are classified into five major types, namely, *a*, *b*, *e*, *l* or *t* (Sun & Jang 1996) each represented by a region on the *ϕ*–*ψ* map. Sun *et al*. (1997) developed an artificial neural network method (ANN) to predict the 11 most frequently occurring supersecondary structures: H*–b*–H, H*–t*–H, H*–bb*–H, H*–ll*–E, E*–aa*–E, E*–ea*–E, H*–lbb*–H, H*–lba*–E, E*–aal*–E, E*–aaal*–E and H*–l*–E, where H and E represent α-helix and β-strand, respectively.

Side-chain torsion angles are constrained in regions derived from the backbone-independent rotamer library of Roland L. Dunbrack (Dunbrack & Cohen 1997). Side-chain constraint regions are of the form: [*m*−*σ*, *m*+*σ*]; where *m* and *σ* are the mean and the s.d. for each side-chain torsion angle computed from the rotamer library. Under these constraints, the conformation is still highly flexible and the structure can take on various shapes that are vastly different from the native shape.

We can think of a protein as a collection of atoms linked by a chemical bond. With the symbol *a*_{i}↔*a*_{j} we represent a chemical bond between the two atoms *a*_{i} and *a*_{j}. Using this notation we can divide all the atoms into two categories: *bond atoms* and *non-bond atoms*,(4.1)(4.2)

The bond set *A*_{bond} represents the set of all atom chains of max length four, in this way we consider only bonds, angles and torsion interactions between atoms, *local interaction*. The *A*_{non-bond} set represents all the atoms not connected by chemical bond, which are atoms separated by at least three or more covalent bonds, *non-local interaction*. This division reflects the decomposition of CHARMM in two partial sums: bonded and non-bonded atom energies, following definition (3.2),(4.3)(4.4)where symbols **C**_{bond} and **C**_{non-bond} are, respectively, the force constants involved for bond and non-bond atoms in equation (3.2).

The bond energy characterizes the interactions between residues that are neighbours along the primary sequence. The non-bond term represents the interaction between residues that are separated in the primary sequence by at least two intervening residues (one to four interactions).

These two functions represent our minimization *objectives*, the torsion angles of the protein are the *decision variables* of the multi-objective problem, and the constraint regions are the *variable bounds*.

## 5. The multi-objective evolutionary algorithm

The algorithm PAES was proposed for the first time by Knowles & Corne (1999). PAES is a multi-objective optimizer which uses a simple (1+1) local search evolution strategy. Nonetheless, it is capable of finding diverse solutions in the Pareto optimal set because it maintains an archive of non-dominated solutions which it exploits to accurately estimate the quality of new candidate solutions. At any iteration *t*, a candidate solution *c*_{t} and a mutated solution *m*_{t} must be compared for dominance. Acceptance is simple if one solution dominates the other. If neither solution dominates the other, the new candidate solution is compared with the reference population of previously archived non-dominated solutions. If the comparison fails to favour one solution over the other, the chosen solution is the one which resides in the least crowded region of the space. A maximum size of the archive is always maintained. The crowding procedure is based on recursively dividing up the *M*-dimensional objective space in 2^{d} equal-sized hypercubes, where *d* is a user defined depth parameter. The algorithm continues until a given, fixed number of *iterations* is reached.

*I-PAES* is a modified version of PAES with a different solution representation (polypeptide chain) and immune inspired operators: *cloning* and *hypermutation* (Cutello & Nicosia 2004). The algorithm starts by initializing a random conformation. The torsion angles (*ϕ*, *ψ*, *Χ*_{i}) are generated randomly from the constraint regions. After that, the energy of the conformation (a point in the landscape) is evaluated. First, the protein structure in internal coordinates (torsion angles) is transformed in Cartesian coordinates. Then the CHARMM energy potential of the structure is computed using routines from TINKER Molecular Modelling Package (http://dasher.wustl.edu/tinker/).

At this point, we have the main loop of the algorithm. From the current solution, two clones will be generated, producing the solutions which will be mutated into . After evaluation, the best clone (min *E*_{charmm}) between and is selected as new mutated solution *m*, while the other one, if possible, is added to the archive following the standard method of PAES to update the archive. From this moment on, the algorithm proceeds following the standard structure of PAES. Figure 1 shows the pseudo-code of the algorithm.

Two kinds of mutation operators were used in the affinity maturation phase (line 6 of I-PAES). The first clone is mutated using the first mutation operator and the second clone using the second mutation operator. The first mutation operator, *M*_{1}, may change the conformation dramatically. When this operator acts on a peptide chain, all the values of the backbone and side-chain torsion angles of a randomly chosen residue are re-selected from their corresponding constrained regions. The probability for the application of this operator is regulated by the following law:(5.1)where *T*_{max} is the max number of evaluation allowed and ffe is the number of fitness function evaluation done. The probability of mutation decreases as the search method proceeds. The second mutation operator, *M*_{2}, performs a local search of the conformational space. It will perturb some torsion angles (*ϕ*, *ψ*, *Χ*_{i}) of a randomly chosen residue with the law(5.2)where *θ* is the generic torsion angle, and *N*(0, 1) is a real number generated by a Gaussian distribution of mean *μ*=0 and s.d. *σ*=1. The mutation rate used is similar to the scheme presented in (Cui *et al*. 1998). The number of mutations decreases as the search method proceeds following the law(5.3)where ffe and *T*_{max} are defined as before, *L* is the number of residues and *k* is a constant set to 4.

### 5.1 Bond energy versus non-bond energy

Before we analyse the quality of the obtained results, we would like to experimentally validate such a multi-objective approach. It is based on the fact that local interaction (bond energy) and non-local interaction (non-bond energy) among atoms are in conflict. This is the typical characteristic of a MOOP. The literature on energy functions and about those two different interactions is very vast. Most of the major energy functions are based on the combined usage of bond and non-bond energies. There is no *formal proof*, however, about the conflict between them. We start by describing the simple intuition about the conflict and then we show how it is possible to verify it experimentally.

In the PF process, it was demonstrated experimentally that the native structure of a protein is at its global minimum of the thermodynamical potential (free energy) of the protein (Anfinsen 1973). This is a valid principle that governs the protein conformational search. During the pathway to reach the native structure, the protein is forced to decide what to do next. It is quite clear that it is possible to make movements that locally are able to decrease the bond energy of the system. Globally, however, this could not be true. For example, the electrostatic interactions or the short distance van der Waals interaction of atoms near in the space but far in primary sequence could be penalized although an improvement it is reach in bond, angles or torsion energies. This simple intuition is demonstrated experimentally by the plot in figure 2. In this figure, we show the typical bond and non-bond energy course during the iterations of the algorithm. It is clear that the two functions are in conflict. If one interaction decreases the other always increases but while this process goes on, there are compensation effects that will cause the minimization of the total energy.

### 5.2 Decision-making phase

As described in §2, after a Pareto front is found, to solve effectively a MOOP one has to choose a solution (or a class of solutions) in the Pareto front using some ‘higher-level information’. Such a *decision-making* phase, could be really difficult to accomplish, in particular when the number of objectives and solutions is large. Although, there is no universally accepted method, in general the most interesting solutions of the observed Pareto front are characterized by the fact that a small improvement in one objective will cause a large deterioration in at least one other objective. These solutions are called *knees* (Branke *et al*. 2004; Handl & Knowles 2005). In particular, in Branke *et al*. (2004), the authors describe an algorithm for finding the knees in the Pareto front: an angle-based method which uses the four closest neighbours. More in details, given a conformation point *P*, if we denote by *A*_{1} and *A*_{2} the two closest points from the left, and by *B*_{1} and *B*_{2} the two closest points from the right, we can form four angles: , , and . The greatest of these four angles is then assigned to *P*. The knees of the Pareto front are the angles greater than a given threshold.

Using such a simple idea, we can adopt the following decision-making scheme:

first, detect the solutions which lie in the knees of the observed Pareto front, using the angle-based method with four neighbours described above; and

select the solution with the lowest energy function value from these samples.

As we will see later, such a simple method is able to select solutions with a good trade-off between energy and metrics values (DME and RMSD).

This is just one possible approach for the decision-making phase. It is possible to use other type of higher-level information to select solutions from the Pareto front, using for instance structure stability, compactness, hydrophobic score, etc.

## 6. Results

In this section, we report the results obtained using the multi-objective approach for PSP. We applied our algorithm to a famous short peptide ([Met]-enkephalin) and then to four proteins sequences from the Protein Data Bank (PDB). Table 7 shows the results for each protein. For 1ZDD, 1ROP, 1UTG and 1CRN proteins we set the maximum number of iterations to 2.5×10^{5}; while for Met-enkephalin peptide we ran the I-PAES for 3.5×10^{5} energy functions evaluations.

*[Met]-enkephalin*. [Met]-enkephalin is a very short polypeptide, with only five amino acids (TYR–GLY–GLY–PHE–MET), 22 variable backbone and side-chain torsion (or dihedral) angles and 75 atoms. From an optimization point of view, the [Met]-enkephalin polypeptide is a paradigmatic example of multiple-minima problem. It is estimated to have more than 10^{11} locally optimal conformations. This peptide is an obvious ‘test bed’, for which a substantial amount of *in silico* experiments has been done (Li & Scheraga 1988). Figure 3 shows the dynamic of the Pareto fronts at different time-steps of the algorithms. Figure 4*a* instead shows the overlap between predicted and the Scheraga conformations (a classical benchmarks); while, figure 4*b* shows the overlap between predicted conformation and the native structure of the peptide 1PLW. After computing the Pareto front using our algorithm, firstly we detect the class of solutions in the knees of the observed Pareto front and then select the solution with lowest energy value. For the Met-enkephalin, the lowest energy value in the knees corresponds to the lowest energy value of the overall Pareto front, −20.56 kcal mol^{−1}; this conformation matches the Scheraga structure with DME_{all-atoms}=2.211 Å, RMSD_{all-atoms}=2.83 Å, and , and the crystal structure of 1PLW with DME_{all-atoms}=2.311 Å, RMSD_{all-atoms}=3.605 Å, and .

*Disulphide-stabilized mini protein A domain (1ZDD)*. 1ZDD is a two-helix peptide of 34 residues (Starovasnik *et al*. 1997). For this protein, the native secondary structure information was determined using the original PDB server, while the secondary structure constraints were predicted by the SCRATCH prediction server (Pollastri *et al*. 2002). By inspecting the Pareto front of 1ZDD protein (figure 5), we can note that there are no knees, hence we cannot use our decision-making method. In this case, we simply select the conformation with lowest energy function value −1052.09 kcal mol^{−1}; this conformation matches the crystal structure with and (see figure 6). Table 3 shows the good relationship between the best energy structure and the best RMSD conformation in the final archive. Moreover, by inspecting the conformations in the final archive, we can see that they all present good characteristics both in terms of RMSD and energy. For this protein, the algorithm is able to produce an *ensemble* of good quality structures. Figure 7*a* shows the *C*_{α} RMSD per residue for the core region (3–32) of predicted structure. One of the two α-helix is better predicted than the other. In the plots, we also report the protein sequence, the predicted secondary structures for each residue and the native one.

Figure 7*b* displays a good correlation between the RMSD and the energy, suggesting that minimizing the energy by varying the conformation will tend to drive the conformation toward the true structure. Moreover, by inspecting the plot, it is evident that the algorithm is able to make a high sampling of the conformational search space: in a range of 1/2 Å there are more than 1000 conformations near the native state.

*Repressor of primer (1ROP)*. Repressor of primer is a four-helix bundle protein that is composed of two identical monomers (Banner *et al*. 1987). Each monomer has 56 residues and forms α–turn–α structure (PDB id. 1ROP). For this protein the supersecondary structure constraints were predicted by the ANN method of Sun *et al*. (1997). The best computed structure, based on the decision-making method described in §5.2, matches the crystal structure with , and energy −797.57 kcal mol^{−1} (see figure 9). Table 4 shows the comparisons between the structures in the final archive based on different decision-making criteria.

In figure 8 we plot the observed Pareto front reporting many empty regions along the curve. These discontinuous regions show different clusters of non-dominated compact solutions near the folded state. In the inset plot of figure 8 we show the correlation between the energy and the RMSD for conformations sampled by I-PAES algorithm. It is worth to note that it is possible to produce structures with lower energy than that of their native structures: a typical phenomenon observed in the search process of any method. In particular, this appears to be the case for the repressor of primer where the native structure energy (calculated with CHARMM) has value equal to −667.0515 kcal mol^{−1}, while the inset plot displays many native-like structures with lower energy values

*Uteroglobin (1UTG)*. Uteroglobin is a 4-helix protein that has 70 residues (Morize *et al*. 1987). The predicted supersecondary structures are α–bb–α–lbb–α–bb–α. Using these supersecondary structure constraints (predicted by the ANN method of Sun *et al*. 1997) the best computed structure, using our decision-making method described in §5.2, matches the native structure with , and energy 1128.3 kcal mol^{−1} (see table 5). The observed Pareto front of 1UTG obtained by the multi-objective evolutionary algorithm is a sparse set of points (see figure 11).

*Crambin (1CRN)*. Crambin is a 46-residue protein with two α-helix and a pair of β-strands (Williams & Teeter 1984*a*,*b*). It has three disulphide bonds, whose constraints we do not use. The supersecondary structures, predicted by the ANN method of Sun *et al*. (1997), are β-loop–α–lbb–α–l–β-loop–α. The best computed structure, using our decision-making method described in §5.2, matches the crystal structure with , and energy 701.25 kcal mol^{−1} (see table 6). Figure 12 shows the relation between the Pareto front of the last iteration and the energy versus RMSD plot for 1CRN protein. Pareto front solutions are grouped into three individual clusters of non-dominated compact solutions. A wrong supersecondary structure prediction was made in the crambin at the C-terminal of the peptide chain: an incorrectly predicted α-helix (from residue 41 to 45) was imposed on the peptide chain as a constraint, this is evident from figure 13. Although, the wrong structure was formed in this terminal, the algorithm is able to reach a native like structure.

We would like to underline the fact that the protein conformation that has the minimum energy in the knees, is often better than the one from the whole obtained Pareto front (e.g. 1ROP, 1CRN and 1UTG proteins). Thus, as we mentioned above, the energy landscape produced by the CHARMM energy function does not seem to fit well the real landscape. Finally, the fact that solutions in the Pareto front that are not minimum energy solutions are better (in terms of RMSD and DME), clearly justifies a multi-objective approach (tables 4–6).

### 6.1 Comparisons with other approaches

We compared our algorithm, I-PAES, and its results to other works in literature (see table 8) and others MOEAs, in particular NSGA2 (Deb *et al*. 2002), that we implemented and tested on PSP. Two possible versions of NSGA2 were implemented. The first one uses standard low-level operators (SBX crossover and polynomial mutation; Deb 2001), and the protein is considered as a long sequence of torsion angles (real numbers). The second one uses high-level operators (naive crossover and the scheme of mutation used by I-PAES). In this case, the protein is manipulated at the amino acid level. The better performance of the high-level version is very clear, although the best RMSD found is always worse then that found by I-PAES. The best RMSD found for 1CRN by Cooper *et al*. (2003), using a Hill-climbing genetic algorithm, is 5.6 Å. Again, our method performed better in terms of best solution. Inspecting the results reported in table 9, I-PAES outperform also the good RMSD values obtained by the GA designed by Dandekar & Argos (1996). Table 9 shows the comparison between I-PAES, NSGA2, Hill-climbing GA (Cooper *et al*. 2003) and Dandekar & Argos' GA (1996) on 1CRN.

## 7. Conclusion

As reported by Plotkin & Onuchic (2002) ‘the folded state is a small ensemble of conformational structures compared to the conformational entropy present in the unfolded ensemble’. This sentence characterizes our research goal of finding a set of equivalent three-dimensional conformations inside the folded state. To reach this goal we adopt a multi-objective approach in order to obtain good observed Pareto fronts of non-dominated compact solutions near or inside the folded state. We propose a modified version of the algorithm PAES that uses immune inspired principles (clonal expansion and hypermutation operators) as a new search method for PSP.1

The multi-objective approach is used to fold a peptide, the Met-enkephalin, and medium size proteins, and the results are comparable in terms of RMSD and DME to other approaches in the literature.

In the last 50 years, the PF and the PSP problems have been faced as a large single-objective optimization problem. In this article, we conjecture by computational experiments that, instead, it could be more suitable to model the PSP problem as a MOOP. Recently, the role of non-native and native interactions has been deeply studied (McLeish 2005). Specific yet non-native interactions may be important in stabilizing the low-dimensional diffusive searches on the folding pathways, as well as native interactions. The present research work considers the bond and non-bond interactions as main forces to direct the folding toward the native state. This model is based on the fact that local interaction (bond energy) and non-local interaction (non-bond energy) between atoms are in conflict.

It is clear that, although it is possible to make movements that locally are able to decrease the bond energy of the protein conformation, globally, this could be not true. Moreover, the electrostatic interactions or the short distance van der Waals interaction of atoms near in the space but far in the protein primary sequence could be penalized although an improvement is obtained in bond, angles or torsion energies. If one interaction decreases the other always increases but while this process goes on, there are compensation effects that will cause the minimization of the total energy. Experimentally, it has been shown that the two interaction types are in conflict following a typical characteristic of the MOOPs.

## Acknowledgments

The authors would like to express their gratitude to the anonymous reviewers for their helpful comments. The computations were carried out on the Applied Computer Science Group Facility of the Ippari Research Center at the Comiso Campus of the University of Catania.

## Footnotes

One contribution of 8 to a themed supplement ‘Statistical mechanics of molecular and cellular biological systems’.

↵The I-PAES source code is available from the authors.

- Received May 4, 2005.
- Accepted August 22, 2005.

- © 2005 The Royal Society