Intelligently deciphering unintelligible designs: algorithmic algebraic model checking in systems biology

Bud Mishra

Abstract

Systems biology, as a subject, has captured the imagination of both biologists and systems scientists alike. But what is it? This review provides one researcher's somewhat idiosyncratic view of the subject, but also aims to persuade young scientists to examine the possible evolution of this subject in a rich historical context. In particular, one may wish to read this review to envision a subject built out of a consilience of many interesting concepts from systems sciences, logic and model theory, and algebra, culminating in novel tools, techniques and theories that can reveal deep principles in biology—seen beyond mere observations. A particular focus in this review is on approaches embedded in an embryonic program, dubbed ‘algorithmic algebraic model checking’, and its powers and limitations.


1. Hypotheses non fingo: Hooke and Newton

Over the last few years, Robert Hooke, a somewhat maligned but still very fascinating English experimental scientist and one-time secretary of the Royal Society, has begun to feature unexpectedly prominently in practically all my public presentations on systems biology. Initially, what attracted me to the story of Hooke was the uncanny resemblance he bore to some of our present-day scientists: in his insistence on data, observations and hypotheses, his apparently non-rigorous and intuitive approach to scientific questions and, most inexplicably, his protracted and debilitating open rivalries over questions of recognition. But as I learned more about Hooke's life and views, it also became clearer that his indirect influence on the way we think about science today is surpassed only by that of a handful of his contemporaries, with some of whom Hooke fought bitter and hopeless semi-philosophical battles. They have, thus, unwittingly lent us a useful perspective that is worth examining with some care. How the emerging field of systems biology could establish itself, how it should face its trials and tribulations along the way, and how it could be a significant component of the ‘new new’ biology could all be examined from the points of view of these seventeenth-century scientists—a perspective that remains anachronistically and peculiarly relevant even today.

Robert Hooke (1635–1703) was an experimental scientist, mathematician, architect and astronomer. He also served as Secretary of the Royal Society from 1677 to 1682, and because of his wide-ranging interests he has been described as ‘England's Da Vinci’. His work Micrographia of 1665 contained his microscopical investigations, which included the first identification of biological cells, an enduring discovery that has maintained its centrality in subsequent developments in biology for more than three centuries. In his drafts of Book II of the Principia, Newton had referred to him as the most illustrious Hooke—‘Cl[arissimus] Hookius’. However, not long after, Hooke became involved in a bitter dispute with Sir Isaac Newton over the priority of the discovery of the inverse square law of gravitation. In a letter Hooke wrote to Halley, he complained about the omission of credit for his discovery of the properties of gravity, ‘which of late Mr Newton has done me the favour to print and publish as his own inventions’. In response, Newton wrote back to Halley, ‘Now is this not very fine? Mathematicians that find out, settle & do all the business must content themselves with being nothing but dry calculators & drudges… I believe you would think him a man of a strange unsociable temper’. In a more well-known letter that Newton wrote directly to Hooke, he famously said, ‘If I have seen further than other men, it is because I have stood on the shoulders of giants’—where, of course, the giants Newton was alluding to were Kepler and Galileo, and not the dwarfish, small-minded and short-tempered likes of Hooke! When Christopher Wren was brought in to resolve this rather strangely English war of words, the disagreement—perhaps best captured by Clairaut's later characterization of ‘the great distance between a glimpsed truth and a demonstrated truth’—raised the question of the relative roles that should be ascribed to inductive, hypothesis-driven science in relation to deductive, principle-driven science.

What is the nature of ‘TRUTH’ in biology, and how is it to be sought? Hooke saw biology as an observational science; he wrote in Micrographia, ‘The truth is, the science of Nature has already been too long made only a work of the brain and the fancy. It is now high time that it should return to the plainness and soundness of observations on material and obvious things’,—a view supporting hypothesis-driven experimentation that advances science through steps of falsification or validation. Newton, on the other hand, championed a search for deep and unifying principles. Newton shunned hypotheses; his motto stated in Principia was ‘Hypotheses non fingo’. (‘I feign no hypotheses’.1) Newton's viewpoints are probably best stated by his most ardent disciple, Halley; in his rather ornately titled essay ‘The true Theory of the Tides, extracted from that Admired Treatise of Mr Isaac Newton, Intituled, Philosophiae Naturalis Principia Mathematica’, he wrote the following: ‘Truth being uniform and always the same, it is admirable to observe how easily we are enabled to make out very abstruse and difficult matters, when once true and genuine principles are obtained’.

Biology still remains an observational science; it continues to move through the toils of a vast army of scientists, each examining a small subsystem of a favoured organism, as they sharpen their intuitions, build upon guesses, conjectures and hypotheses, and refine their ideas in many small steps—occasionally interrupted by a great leap, a grand vision or a comprehensive shift in paradigm. If subtle principles are to be brought to light, they must wait for serendipity. It has been argued that life is complex, that it does not yield to a few small, neat explanations or to pigeonholing, and that if there is a unifying principle in biology, it is that there is no unifying principle in biology.

Can ideas from mathematics, computer science and systems sciences be brought to bear to systematically hunt for principles and patterns that will reveal a grand unified theory of biology? Are there design rules at play in how these systems evolve, interact and self-assemble? What tools can we steal from other sciences or engineering disciplines to see a much more global view of biology? What can be automated to make computers work on tasks that are humanly impossible? Is systems biology the answer to the problems of biology?

I will argue that systems biology should aim to occupy a middle ground where experimentation, observations, modelling, model building, model checking and ultimately the search for grand unifying principles, all must dovetail harmoniously into a whole. Given the enormity of the task, the focus should be on sound computational tools, rigorous mathematical foundations and fidelity to the underlying biology.

After one motivating example, I will focus primarily on a somewhat narrowly defined question, namely, ‘Are there suitable formalisms to create biologically faithful models that will permit devising computable procedures to explore the models' visible properties, which can be succinctly captured within certain sound and sufficiently expressive logics?’

Unfortunately, an adequately affirmative answer to this question will not satisfy our appetite completely. It will engender, as Hydra's heads, many other challenging questions: What are the trade-offs between computational complexity and approximability? Or between these and the classes of properties that we wish to infer? I will not pretend here to provide answers to all these questions—only hope that these questions will keep the attention of systems biology engaged in the near future. Systems biology, by contributing to the core questions about models of computation and computability, will reciprocate with other quantitative disciplines.

2. Models as idealization: from Turing to Odell

A particularly revealing example, described below, underlines how biology progresses through both hypotheses and principles—in many false starts and true conclusions—as it attempts to develop mathematical models: in this case, models of biological pattern formation.

Alan Turing, an English mathematician, logician and cryptographer, may be viewed as one of the most important, although somewhat neglected, proponents of the search for principles in biology,2 despite an immortalized life fuelled by an eclectic set of polymathic interests that led to such wide-ranging concepts as Turing machines—the fundamental models of computing; the Church–Turing thesis and Turing reductions—the key ingredients of computability and universality in constructive mathematics; the Turing test of artificial intelligence; and Turing patterns arising from reaction–diffusion equations. Turing worked from 1952 until his death in 1954 on mathematical biology, specifically morphogenesis. In 1952, Turing published in Phil. Trans. R. Soc. Lond. (Turing 1952) one of the most influential, and probably among the last, truly mathematical biology papers: ‘The chemical basis of morphogenesis’. There, Turing proposed reaction–diffusion equations to explain the formation of biological patterns and as the key to understanding developmental biology. His later papers on the subject remained unpublished until 1992, when the Collected Works of A. M. Turing was finally printed.

Through this research, Turing was attempting to create ‘a mathematical model for the growing embryo’, as he was intrigued by the process that could self-assemble a whole human with a complex thinking brain from a simple fertilized egg. He wished to create a very general program for modelling embryogenesis: reminiscent of a Newtonian view of science, Turing's ‘model’ was designed to be ‘a simplification and an idealization and consequently a falsification’. His model was centred on an ‘abstract’ object, which he christened as morphogen, defined as ‘simply the kind of substance concerned in this theory…’. In fact, anything that diffuses into the tissue and ‘somehow persuades it to develop along different lines from those which would have been followed in its absence’ qualifies under this definition.

In this theory, one starts with a variable a, denoting the concentration of some morphogen. The first step would be the one relating the first temporal derivative of the morphogen concentration ∂a/∂t (its rate) to its second spatial derivative ∇²a (its flux) through a diffusion constant Da. The resulting diffusion equation is

∂a/∂t = Da∇²a.

A much more interesting situation arises when one considers two interacting morphogens with their concentrations denoted by a and b, and also models their interactions through an additional reaction part augmenting the diffusion equation:

∂a/∂t = f(a,b) + Da∇²a,
∂b/∂t = g(a,b) + Db∇²b,

where the functions f and g, while nonlinear, could still be fairly simple: e.g. low-degree polynomials. For instance, one could have f(a,b)≡a(b−1)−k1 and g(a,b)≡−ab+k2.
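To make the dynamics concrete, the following is a minimal numerical sketch (not Turing's original system): it integrates the two-morphogen reaction–diffusion equations above on a one-dimensional ring of cells with an explicit Euler scheme. All parameter values, and the use of the example f and g, are illustrative assumptions; whether a stable Turing pattern emerges depends on the parameters chosen.

```python
# Illustrative sketch: explicit-Euler integration of the two-morphogen
# reaction-diffusion equations on a 1-D periodic domain. Parameter values are
# assumptions for illustration only, not values from the text.
import numpy as np

def laplacian(u):
    # discrete second spatial derivative on a periodic 1-D grid (unit spacing)
    return np.roll(u, 1) - 2.0 * u + np.roll(u, -1)

def f(a, b, k1=0.05):
    return a * (b - 1.0) - k1            # reaction term for morphogen a

def g(a, b, k2=1.0):
    return -a * b + k2                   # reaction term for morphogen b

def simulate(n=200, steps=5000, dt=0.01, Da=1.0, Db=10.0, seed=0):
    rng = np.random.default_rng(seed)
    a = 0.95 + 0.02 * rng.standard_normal(n)   # small noise around the
    b = 1.05 + 0.02 * rng.standard_normal(n)   # homogeneous steady state
    for _ in range(steps):
        da = f(a, b) + Da * laplacian(a)
        db = g(a, b) + Db * laplacian(b)
        a, b = a + dt * da, b + dt * db
    return a, b

a, b = simulate()
print("range of a across the domain:", a.min(), a.max())
```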

A simple system of reaction–diffusion equations such as this can be shown to give rise to a plethora of life-like patterns that have now come to be known as Turing patterns. Is this how the biology of animal development works? Is this how the leopard got his spots? Do we now know how the fly gets its segmentations? What roles do genes play in this? How are they related to morphogens?

Turing took the view that, since the role of genes is presumably catalytic, influencing only the rates of reactions, they ‘may be eliminated from the discussion’ unless one is interested in a comparison of organisms. His reliance on one grand sweeping principle and the accompanying elegant mathematical models, in the absence of any experimental support, may have been Turing's greatest folly.3 Developmental genes and their control of ‘morphogens’ would turn out to be extremely important in pattern formation, as they began to be discovered through a series of careful (though tedious) genetic knockout experiments on flies.

Within a year of Turing's seminal paper, Francis Crick and James Watson, both working at Cambridge's Cavendish laboratory at the time, wrote another of the seminal papers of the last century: ‘Molecular structure of nucleic acids: a structure for deoxyribose nucleic acid’ (published in the British scientific journal Nature in 1953; Watson & Crick 1953b), and followed it up within a month with another equally important paper entitled ‘Genetical implications of the structure of deoxyribonucleic acid’ (Watson & Crick 1953a). The first of these two papers not only elucidated the double-helix structure of DNA but, much more significantly, also opened the way for a deeper understanding of the most important principles governing biological processes. Watson and Crick wrote: ‘It has not escaped our notice that the specific pairing we have postulated immediately suggests a possible copying mechanism for the genetic material’. DNA molecules, and the genes they contain (inherited, transmitted and used for regulating biological processes), could be seen as nature's codebooks, written in a language that perhaps also encoded all of the fundamental principles of biology. The collection of the genes came to be known as a ‘genome’, and our efforts to read, store, compare, align, decipher, annotate and interpret genomes using massive computation, automation and biotechnological tools became an obsession of our generation. It was felt that if we could ‘mine’ this data, we could learn all the important principles of biology. The debate between ‘hypothesis-driven science’ and ‘data mining-driven science’ would emerge repeatedly in our time, completely oblivious of any historical context.

To a computational biologist, a genome is merely a long string made out of just four characters A, T, C and G (in the case of humans, roughly six billion base pairs in the diploid genome). This simple one-dimensional object (perhaps with a little auxiliary epigenomic annotation) would encode the genotypes and, indirectly, through a Darwinian unidirectional mapping, also the phenotypes. Since evolutionary, relational and ecological properties all must depend on either the genotypes directly, or the phenotypes indirectly, all biological principles must be sought in the genomes, or in the way they are moulded by the forces of variation and selection. Crick's central dogma assigned a centrality to the genomes. This dogma (due to Francis Crick in 1958) states that information flows from DNA to RNA through transcription and then to proteins through translation; these information flows are all unidirectional: ‘…Once “information” has passed into protein it cannot get out again. The transfer of information from nucleic acid to nucleic acid or from nucleic acid to protein may be possible, but transfer from protein to protein or from protein to nucleic acid is impossible. Information means here the precise determination of sequence, either of bases in the nucleic acid or of amino acid residues in the protein’. Thus, once one has performed all the experiments to determine the sequences of the genome, this dogma may suggest that much of biological research can proceed with just genomics, genome data mining and annotation, and genomic polymorphism studies.

Also, it may appear that once the genes produce the corresponding proteins through the processes of transcription and translation, they do not play any direct role in the phenotypes determined by protein interaction, except that they have a catalytic role in the rate of production of the proteins. In that case, Turing would have been perfectly right in dismissing the genes, as he struggled to create an abstract (hence idealized) view of biology without genes.

However, proteins do interact with DNA, as do many small microRNAs, and in their roles as transcriptional regulators, they collude with the genes to affect the dynamics of the entire biological system. Thus, the mechanistic picture of biology that one may carry around is the following: a specific region of the genome (DNA) that determines the synthesis of proteins (through the transcription and translation) will be called a gene, thus giving it a physical meaning, even though, originally, a gene meant something more abstract—a unit of hereditary transmission. Transcription of a gene to a messenger RNA, mRNA, is keyed by a transcriptional activator/factor, which attaches to a promoter (a specific sequence adjacent to the gene). Regulatory sequences such as silencers and enhancers control the rate of transcription. Most of these regulators themselves are proteins synthesized by other genes. These regulators also play a very critical role in the development of an organism and the pattern formation.

The important role of regulators is best appreciated in answering such morphogenetic puzzles as in the elucidation of patterning along the head to tail (anteroposterior) axis of the fruitfly Drosophila melanogaster. The circuitry understood from studying a fly applies to other organisms and could reveal some important insights into developmental biology; other multicellular organisms use similar mechanisms for axis formation, and use similar signal transfer between the earliest cells of the developing organisms. The processes important for patterning in the Drosophila embryo are controlled by gradients in the concentration of maternal gene products that arise soon after fertilization. These protein molecules establish broad domains of gene expression, which interact to establish final patterns that ultimately encode instructions for body plan (axes: anteroposterior, or dorsal–ventral) or construct germ layers (e.g. ecto-, meso- and endoderm). A simple picture of the processes involved in anteroposterior patterning is described below.

Upon fertilization, various protein (morphogen) concentration gradients spanning the egg are established, starting with the products of the genes, called maternal effect genes, that encode morphogen proteins. These gradients, in a manner that is generative rather than descriptive, establish positional values and affect the behaviour of the other genes. The model now called ‘positional information (PI)’ (or, more colloquially and colourfully, the ‘French flag model’) was first proposed by Wolpert (1969), with fairly scant experimental evidence. Wolpert, a civil engineer from South Africa then teaching cell biology at King's College in London, proposed in the 1960s that different genes would be activated differently in response to different threshold concentrations of the morphogen. For instance, in order to construct a French flag, nature may start with a linearly arrayed sequence of cells with a morphogen gradient monotonically decreasing from one end to the other, which would activate the cells in the neighbourhood differently: high concentrations (above an upper threshold) turning on a blue gene, low concentrations (below a lower threshold) turning on a red gene, and white being the default state in regions of the embryo at intermediate concentrations (between the lower and upper thresholds). If one increases the number of cells, or varies the morphogen gradient (without violating the monotonicity property), one still gets the same French flag with surprising robustness. A reader may ponder the following challenging exercise: devise a similar algorithm to construct an American flag using PI.
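The thresholding scheme just described is easy to make concrete. The following is a minimal sketch of the French flag model, assuming an arbitrary monotonically decreasing gradient and two illustrative threshold values; none of the numbers are from the original proposal.

```python
# Minimal sketch of Wolpert's positional-information ("French flag") thresholding.
# The gradient shape and the two thresholds are illustrative assumptions.
import numpy as np

def french_flag(n_cells=30, upper=0.66, lower=0.33):
    positions = np.linspace(0.0, 1.0, n_cells)
    morphogen = 1.0 - positions                       # monotonically decreasing gradient
    return np.where(morphogen > upper, "blue",
                    np.where(morphogen < lower, "red", "white"))

# Robustness: changing the number of cells, or rescaling the gradient monotonically,
# preserves the blue-white-red ordering of cell fates.
for n in (9, 30, 90):
    print(n, "".join(fate[0] for fate in french_flag(n_cells=n)))
```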

Some of the most important maternal effect genes participating in the Drosophila PI model are: bicoid and hunchback, which pattern the anterior parts (head and thorax) of the Drosophila embryo; and nanos and caudal, which form the posterior abdominal segments of the Drosophila embryo. Bicoid (Bcd) is a homeodomain-containing transcription factor that establishes and places all anterior structures in the Drosophila body plan. After the egg is fertilized, bcd (bicoid) mRNA is translated, and a gradient of Bcd protein is formed, with the highest levels near the anterior tip of the embryo and gradually lower levels towards posterior regions, determined by many parameters: the rates of translation, diffusion and degradation. The experimental evidence supporting Bcd as a morphogen comes from mutation experiments in which bcd (the bicoid gene) is knocked out or its dosage is altered, with dramatic effects on the patterns that emerge. For instance, in embryos with an increased copy number of the bcd gene (e.g. four or six copies), the cephalic furrow, one of the first distinguishable morphological features, becomes shifted posteriorly.

During the time the Bcd protein gradient is forming, zygotic nuclei concomitantly undergo many rapid division cycles and reorganize themselves along the periphery of the embryo. Almost at the same time, the zygotic transcription process gives rise to the activation of genes such as hunchback (hb) and orthodenticle (otd), which are thus turned on by bicoid (bcd). In this way, the first morphogen gradients cascade into other interesting subsequent gradients.

While I omit many of the interesting details, the readers can imagine a rather cleverly regulated process that is at the core of the patterning in biology—a process in which a very specific set of developmental genes play specific and important roles, contradicting Turing's idealization of biology.

In the picture that emerges, the next set of genes to be turned on is called the ‘gap genes’. The gradients already established by the bicoid, hunchback and caudal proteins transcriptionally regulate other zygotically expressed protein products, derived from members of the ‘gap’ family of developmental control genes. For instance, hunchback, Krüppel, giant, tailless and knirps are all gap genes. These genes, in combination with others, establish the segmentation along the anterior–posterior axis in the body plan of the fly. Thus, the gap genes are usually thought of as the first layer of a hierarchical cascade of the genes that control segmentation.

The next set of genes consists of the so-called pair-rule genes, a class of segmentation genes, expressed after the gap gene products. These genes are expressed in striped patterns of seven bands perpendicular to the anterior–posterior axis. Finally, the ultimate class of segmentation genes, the segment polarity genes, fine-tunes the process by interactions between the cells of adjacent parasegments. A rather elegant example of this process is seen through the reciprocal signalling between Wingless and Hedgehog producing cells, which stabilizes the boundary between each segment.

This tour de force of experimental ‘observational’ biological research culminated in the 1995 Nobel Prize in Physiology or Medicine, awarded to three scientists: Edward B. Lewis, Christiane Nüsslein-Volhard and Eric Wieschaus. This recognition was to be seen as a testimony to the power of careful experiments, based on genetic screening of embryo-patterning mutants, that revealed the role played in early embryonic development by homeobox genes such as bicoid. Perhaps, after all, Hooke had won the argument. It is entirely possible that the theoretical elegance of abstraction of the kind Turing was proposing is rather seductively misleading; truth must be sought in the details, no matter how confusingly ugly.

But our viewpoints shifted again with subsequent computational research reported in a paper that appeared in the July 2000 issue of Nature, where von Dassow et al. (2000) reported a mathematical analysis of a model based on the earlier experimental work that I have just described. The authors of this paper used a computer simulation to examine the robustness of a simple working model of patterning in the fly. This computational model, based upon a kinetic mass-action formulation of the biochemical reactions, consisted of a system of coupled differential algebraic equations, derived from two categories of first-order ordinary differential equations: namely, equations characterizing the transcriptional processes; and equations characterizing the translational processes. In this formulation, the rate of change of an mRNA product concentration is expressed by combining two terms: an additive (positive) term accounting for the way the transcriptional regulators synthesize the mRNA; and a subtractive (negative) term accounting for the way the mRNA degrades or is removed. The first term is expressed by a rational function involving polynomials whose degrees are determined by various cooperative interactions among the regulators (Hill coefficients in Hill-type equations), and the second term is a linear function. They all involve many free parameters, whose values are often not known a priori. Similarly, the rate of change of a protein product concentration is expressed by combining two linear terms: an additive term for protein synthesis through translation; and a subtractive term for protein degradation. However, their first model, which had faithfully encoded the accepted interactions based on the experimental work, failed and, upon debugging, was found to require the addition of some interactions that could only be vaguely substantiated. The revised model of fly segment polarity, which deviated from the experimental evidence reported so far, turned out to be highly robust, insensitive to variation in parameters and initial conditions. The revised model, developed by Odell's group, encompassed 136 coupled equations and nearly 50 free parameters such as half-lives, diffusion constants and binding coefficients, but despite this complexity, the model exhibited unusual robustness. For instance, even when the authors picked random values for the free parameters (ranging over a wide interval varying over several orders of magnitude, but still within a generally realistic range), they discovered that, for a surprisingly large fraction of these parameter sets, the revised model still exhibited the patterns seen in real flies.
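To fix ideas, the following is a schematic sketch of the general form of rate equations just described; it is not the actual von Dassow et al. system, and the function names and parameter values are assumptions chosen only for illustration.

```python
# Schematic sketch of the rate-equation form described in the text: Hill-type
# (rational-function) transcriptional synthesis minus linear decay for mRNA, and
# linear translation minus linear decay for protein. All values are illustrative.
def hill(activator, K=0.5, n=2):
    return activator**n / (K**n + activator**n)        # cooperative activation

def d_mrna_dt(mrna, activator, T_max=1.0, half_life=10.0):
    synthesis = T_max * hill(activator)                 # rational-function term
    removal = mrna * 0.693 / half_life                  # linear degradation term
    return synthesis - removal

def d_protein_dt(protein, mrna, k_translate=0.2, half_life=30.0):
    return k_translate * mrna - protein * 0.693 / half_life   # two linear terms

# one explicit Euler step for a single mRNA/protein pair
mrna, protein, activator, dt = 0.0, 0.0, 0.8, 0.1
mrna += dt * d_mrna_dt(mrna, activator)
protein += dt * d_protein_dt(protein, mrna)
print(mrna, protein)
```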

After having exhausted various mathematical, experimental and computational tools available, one may still wonder whether it is possible to be reasonably convinced of the finality of this answer(s)—even when the question is narrowed down to just the topic of biological patterning. Is this our final answer?

For instance, what can be said about the values of the system parameters that gave rise to the pattern formation? How random are they really? How did the specific values of the parameters get selected by the evolutionary processes? Even when we understand the proximate causes involved in organizing various biological processes, we have no hint of how the ultimate causes are to be found.

Ingolia (2004) studied the topology of the feasible parameter subspace, first postulating bistability in the stable gene-expression states of the different cell types in the segment polarity pattern, which required a positive feedback of gene products on their own expression. Since these requirements impose certain constraints on the values that the parameters can assume, the feasible parameter space can be expressed mathematically in terms of algebraic relationships involving equalities and inequalities. A tedious way to discover such algebraic relationships would be—perhaps guided by intuition—to repeatedly postulate various equality and inequality relationships among the parameters, and test them on the simulated parameter sets that satisfy the phenotypic trait to be characterized. Similar questions about parameter selection, as well as the algebraic relationships associated with them, also occur in the context of bacterial chemotaxis (Andrews et al. 2006), to be discussed later in §4.3.
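The postulate-and-test loop described above is straightforward to mechanize. The sketch below shows one way to score a candidate algebraic constraint against parameter sets that did or did not produce the desired phenotype in simulation; the parameter names and the candidate inequality are purely hypothetical placeholders.

```python
# Sketch of the postulate-and-test procedure: score a candidate (in)equality among
# parameters by how well it separates phenotype-producing ("feasible") parameter
# sets from the rest. Parameter names and the constraint are hypothetical.
import numpy as np

def score_constraint(constraint, feasible, infeasible):
    hit_rate = np.mean([constraint(p) for p in feasible])         # want close to 1
    false_alarm = np.mean([constraint(p) for p in infeasible])    # want close to 0
    return hit_rate, false_alarm

# hypothetical candidate: autoactivation strong enough, relative to decay, for bistability
candidate = lambda p: p["k_auto"] * p["half_life"] > 1.0

rng = np.random.default_rng(1)
sample = lambda: {"k_auto": rng.uniform(0, 2), "half_life": rng.uniform(0, 2)}

# in practice these pools come from simulating the full model and checking whether
# the segment-polarity pattern emerges; here they are random placeholders
feasible = [sample() for _ in range(100)]
infeasible = [sample() for _ in range(100)]
print(score_constraint(candidate, feasible, infeasible))
```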

Thus, teleological questions, implicit in the preceding discussion, are likely to take the centre stage as we grapple with ultimate causes in biology. For some discussion of these issues, see the essay by Lander (2004).

As these examples illustrate, this issue of truth and cognitive finality in biology remains a rather unsettled and unsettling topic. At present, it is not uncommon to assume many biological problems to have been satisfactorily solved without rigorous justification. Thus, it is hoped that rigorous mathematical models with automated tools for reasoning, simulation and computation can be of enormous help to uncover cognitive flaws, qualitative simplification or overly generalized assumptions. Some ideal candidates for such study would include: prion hypothesis, cell cycle machinery, muscle contractility, most of the processes involved in cancer (cell cycle regulation, angiogenesis, DNA repair, apoptosis, cellular senescence, tissue space modelling enzymes, etc.), signal transduction pathways and many others. The story above brings us to our next topic: What role can systems biology play in better, rigorous and refined elucidation of biological principles?

In particular, we ask: (i) What would be the nature of systems biological models that would be most appropriate for discovering not just proximate causes, but also the ultimate ones? (ii) How do we ensure that the formalism we develop for this purpose possesses both completeness and computability, and still approximates the true biology as faithfully as possible?

As in the preceding example, there is a certain appeal to using mathematical/descriptive models in terms of a system of differential algebraic equations with numerical (possibly, symbolic) coefficients. At present, such models are routinely used in simulation studies to create trajectories in the state space, which can then be examined by other auxiliary procedures. For simpler systems (e.g. linear), one could examine the asymptotic properties via various techniques already developed and applied to engineered systems by control theorists. However, in general, many of the basic mathematical problems that one may wish to solve, in an adequately general setting, ultimately lead to various thorny open problems in differential algebra (e.g. Ritt's problems related to differential ideals), and appear hopelessly impossible (unless these systems happen to be isobarizable). By contrast, simulation techniques, although currently highly popular, are extremely limiting: one would have to deal with the problems of numerical errors; relationships among parameters can be explored only through parameter-sweeping, etc.

Another alternative is to study these problems through suitably approximated computational/executable models (Fisher & Henzinger 2007), and, to do so, by using tools already developed in computer science (e.g. model checking of Kripke structures) to examine discrete finite-state models, in which state transition relationships describe the dynamics. Since various temporal properties can be formulated in terms of fixed point of the dynamics, the properties can be explored efficiently by graph search or by symbolic methods, without ever resorting to simulation. To a very limited extent, one could occasionally handle certain infinite-state models through abstract interpretation or by exploiting bisimulation maps, but they appear unsatisfactory for many biological models.

In order to avoid the limitations of either the descriptive or the executable approach, there have been attempts to combine them in a hybrid description, as we will see shortly. However, since even these hybrid models allow examining only single instances of the biological systems—one at a time—we need a bit more; it will be desirable to ‘algebraize’ the models: i.e. treat most of the variable parts of the models (e.g. initial conditions and parameters) symbolically. One can then attempt to characterize algebraically the invariant properties that endow the biological systems with the relevant phenotypic properties, most expressively stated in one of the many standard temporal logics. In this form, model-checking algorithms can be used over a single instance of the biological system or a family of systems (suitably algebraized) to discover previously uncharacterized phenotypic properties.

Tools of this nature could help answer questions such as the following: (i) questions about robustness (not just under infinitesimal perturbation, but also under large dosage variations triggered by a mutation in a single base or in copy number), (ii) questions about isotropy in evolution (related to questions about sources of genetic variations, and whether some genetic changes can be thought to be facilitated), and (iii) questions related to canalization, etc. These tools could also help in estimating, with high fidelity, the in vivo parameters of a biological system. For instance, if one considers a family of experiments (say, yeast cell cycles studied over wild-types, mutants, double mutants, etc.), one could combine the algebraic invariants constraining feasible parameters from the family of experiments to narrow down the exact values of the parameters as they occur in nature.

3. Model building and model checking: Kripke and Tarski

Systems biology is characterized as a broad and challenging field attempting to hasten the understanding of life by integrating computational, mathematical, statistical, artificial intelligence, machine learning, temporal reasoning and other so-called in silico techniques coupled with traditional wet-laboratory in vivo and in vitro experimentation. As illustrated earlier, one important strand of research has been modelling, simulation and analysis of biochemical pathways to evaluate mutual consistency of biological facts, to validate or refute plausible hypotheses, to aid experiment design and refine existing models. One may then safely posit that the fundamental systems biology problem addressing biochemical networks will continue to be reasoning about the emergent temporal properties and their modulation by external signals.

Furthermore, understanding biological models well and developing the abilities to manipulate them have important pay-offs—both in practice and in theory. A systems-level analysis (as opposed to a reductionist approach) of biology is crucial not only for giant leaps in our understanding of the ‘plausibility of life’ itself (Cornish-Bowden & Cardenas 2005), but also for more mundane applications to vaccine and drug discovery, diagnostics, genetically modified food and the nascent field of synthetic biology (Cascante et al. 2002; Cornish-Bowden & Cardenas 2003). The data generated by the recently announced American National Cancer Institute's The Cancer Genome Atlas project, when combined with proper systems biology tools modelling signal transduction pathways, can be hoped to replicate the success of cancer drugs such as Herceptin (trastuzumab, which targets Her-2) and Gleevec (imatinib mesylate), and shed light on why other cancer drugs such as Iressa (gefitinib) and Tarceva (erlotinib) have had only limited success in a few populations.

One is further inspired by the techniques of model checking, a popular computational approach, which has had many important impacts (Pnueli 1981; Clarke et al. 1986) in engineering fields because of its ability to efficiently validate temporal logic formulae describing dynamical properties of the formal models of complex real systems. These well-established techniques have been tremendously successful in VLSI (Very Large Scale Integrated) circuit design, protocol design, cyber-physical systems, robotics, embedded systems, avionics, control theory, etc., and have been applied even to hybrid systems with both continuous and discrete dynamics. For biochemical networks, several modelling tools, some with rigorous analytical capabilities, have been developed, including the Systems Biology Workbench (SBW; Hucka 2003), Virtual Cell (Slepchenko et al. 2003), E-Cell (Takahashi et al. 2003), Genomic Object Net (Nagasaki et al. 2003), Simpathica (Antoniotti et al. 2003) and BioMiner (Sirava et al. 2003).

To illustrate the power of model building and model checking (in the simplest possible setting), one could focus upon various model-checking problems centring around the verification of safety properties of a system: given a biological system H (e.g. encoded by a suitable executable model) and a safety property ϕ, one may wish to test whether ϕ holds along all of H's trajectories. Since this is the case if and only if there is no reachable state in which ϕ does not hold, the verification of safety properties naturally reduces to the problem of computing H's reachable sets. But this situation represents an exception and not the rule, since evidently not all the properties of interest can be so easily translated to the reachability problem. Thus, one must explore the situation for the most general classes of queries in the temporal logic of choice.
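For a finite-state abstraction, the reduction from safety to reachability is direct, as the following minimal sketch shows; the toy transition system here is an illustrative placeholder, not a model of any particular pathway.

```python
# Minimal sketch: verify a safety property phi on a finite-state abstraction by
# computing the reachable set with breadth-first search and checking that no
# reachable state violates phi. The toy transition system is a placeholder.
from collections import deque

def reachable(initial, transitions):
    seen, frontier = set(initial), deque(initial)
    while frontier:
        s = frontier.popleft()
        for t in transitions.get(s, ()):
            if t not in seen:
                seen.add(t)
                frontier.append(t)
    return seen

def check_safety(initial, transitions, phi):
    violations = [s for s in reachable(initial, transitions) if not phi(s)]
    return len(violations) == 0, violations

transitions = {"low": ["low", "mid"], "mid": ["low", "high"], "high": ["mid"]}
print(check_safety(["low"], transitions, phi=lambda s: s != "toxic"))
# (True, []) -- the hypothetical 'toxic' state is not reachable from 'low'
```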

One may desire to represent biological systems as transition systems and apply classical model-checking techniques (see Clarke et al. 1999). Note, however, that, in general, most interesting descriptive models of biology will be infinite-state systems and the standard model-checking techniques, which work on finite-state models, fail. To solve this problem, many authors have suggested the use of equivalence reductions based on relationships such as bisimulation. In particular, since bisimulation preserves branching-time temporal logics such as computational tree logic (CTL) and CTL*, whenever the bisimulation quotient of a system is finite, one could verify CTL and CTL* properties of the system by applying finite model-checking techniques on its bisimulation quotient. Obviously, bisimulation has the advantage of preserving more expressive logics, but in many cases it produces infinite quotients. On the other hand, one may choose to use the simpler simulation relationships mutatis mutandis. However, note that simulation preserves less expressive logics, while being more permissive as it can reduce a significantly larger class of automata to finite-state models.

As an example, consider the following robustness analysis of a model representing the purine metabolism pathway. The main metabolite in purine biosynthesis is 5-phosphoribosyl-α-1-pyrophosphate (PRPP). A linear cascade of reactions converts PRPP into inosine monophosphate (IMP). IMP is the central branch point of the purine metabolism pathway. IMP is transformed into adenosine monophosphate (AMP) and guanosine monophosphate (GMP). Guanosine, adenosine and their derivatives are recycled (unless used elsewhere) into hypoxanthine (HX) and xanthine (XA). XA is finally oxidized into uric acid. In addition to these processes, there appear to be two ‘salvage’ pathways that serve to maintain IMP levels and thus adenosine and guanosine levels as well. In these pathways, adenine phosphoribosyltransferase and HX–guanine phosphoribosyltransferase combine with PRPP to form ribonucleotides. This pathway is believed to be extremely robust—not just to small perturbations, but to significant and repeated perturbations to the input signal. In a branching-time propositional temporal logic (e.g. CTL), one could formulate a query that would ask the model checker whether the following formula is true (in a discretized simulation-preserving model):

Always (PRPP > 50 * PRPP1)
  implies
    steady_state()
    and Eventually(IMP1 > IMP2)
    and Eventually(HX < HX1)
    and Eventually(Always(IMP = IMP1))
    and Eventually(Always(HX = HX1))

However, when this particular query was examined with NYU's Simpathica/XSSYS tool, it was determined to be false, and the failure pointed to a certain incompleteness in the pathway representation of a commonly accepted model (shown in figure 1). When the pathway representation was augmented with additional back-up reactions (already known in the literature), the robustness query turned out to be true, as determined by the same model checker.

Figure 1

The pathway above depicts the classical (but incomplete) model of purine metabolism. Using Simpathica/XSSYS systems biology tools, one can automatically create a discretized model and perform model checking on the resulting automaton to show that the model fails to satisfy a particular temporal logic formula, encoding global robustness. This reasoning process and manipulations with model-checking tools pointed to a more complete model that does satisfy the robustness formula.

Building upon some of these foundations and other tools from mathematics, one must strive to develop more sophisticated, but sound, model-checking tools for biology: for instance, one may aim to devise powerful and efficient parallelizable dense-time algebraic model checkers that can be used even by novice users (e.g. biologists and biotechnologists) to reason about and manipulate important biological processes through richer and deeper symbolic queries. Concurrently, one may also seek to elicit a better understanding of complexity and decidability (Blum et al. 1997) issues in algebraic model checking. These ideas are elegantly integrated under the umbrella of ‘algorithmic algebraic model checking’ (AAMC; Casagrande et al. 2005a,b; Mysore et al. 2005, 2006; Piazza et al. 2005; Casagrande 2006; Mysore 2006; Mysore & Mishra 2006, 2007), the main topic of §3.1.

3.1 Problem domain

We start with the following taxonomy into which the cellular biochemical processes are typically organized, as described below.

3.1.1 Genetic regulation

Genetic regulation is the process of modulation of the expression of the relevant genes at the correct locations and times, and is keyed by specific proteins called transcriptional factors. Through transcriptional factors and other ancillary modulators, proteins, the products of genes, themselves partake in this genetic regulatory process, thus giving rise to complex interaction networks; such proteins interact with regions of the DNA to effect modulation of how genes are transcribed. The binding of the transcription machinery and the transcriptional factors to the DNA involves complex protein–DNA–protein interactions, where, more often than not, the structural modification of the DNA (such as euchromatin and heterochromatin regions) and the protein has to be accounted for.

The rate of gene transcription, the post-transcriptional mechanisms that affect mRNA half-life (i.e. stability) and the formation of the mRNA–ribosome complex are other aspects of genetic regulation. Similarly, there are post-translational mechanisms for protein modification such as phosphorylation of key residues, multimerization, chaperone-guided complex formation, protein-folding control and genetic control by small interfering RNA (siRNA), all of which must be faithfully included.

3.1.2 Signal transduction

The cell responds to external signals through receptors, which may be on its surface or in its cytoplasm. The signal is transmitted to the interior through messengers, which induce the desired response to the external signal. Typically, a ligand binds to a transmembrane receptor whose conformation subsequently changes. This change is detected by proteins bound to it (usually on the cytoplasmic side), or is manifested as a change in the receptor's chemical properties. Subsequently, second messenger molecules amplify the signal and communicate it to the target(s). Alternatively, the ligand can directly enter the cell through non-specific channels and then bind to the receptors inside the cell. Small molecules such as calcium often participate in these pathways, where most of the reactants are enzymatic proteins. The net result of the signal transduction pathway is an appropriate response by the specific subcellular component. Very often, the signalling pathway results in the nuclear localization of transcriptional factors, leading to the transcription (or shutting down) of corresponding genes. The binding of the signalling molecule with the receptor, the modification of the structure of the receptor and associated proteins (with the receptor sometimes acting as an enzyme) and dispatching of second messengers are the activities near the cell membrane. Receptor desensitization, internalization and regeneration are other complex subprocesses, thus altering the physical properties of binding and diffusion.

3.1.3 Metabolism

Metabolism represents almost all processes that are not genetic regulatory or signal transducing. The gigantic set of biochemicals needed by the cell is continuously produced and consumed by complex enzyme-catalysed pathways. These comprise the metabolic network. They essentially govern the matter and energy cycles of a cell—the way energy and matter are obtained, transformed and consumed by living organisms. Photosynthesis, for example, is the process by which light energy is converted into chemical energy during sugar (e.g. glucose) formation. During respiration, the oxidation of glucose transforms the energy into adenosine tri-phosphate (ATP). While the ATP cycle and photosynthesis comprise the well-known energy metabolism, carbohydrate metabolism deals with glycolysis and phosphates, lipid metabolism pertains to triacyl glycerol and fatty acids, and amino acid metabolism mostly refers to glutamate and urea.

3.1.4 Other processes

Of course, there are still more aspects to cellular biology beyond this simple trichotomic characterization. These include the biophysics of DNA packaging, siRNA, protein folding and DNA–protein interaction, cell adhesion, non-transcriptional regulatory pathways, cellular compartments and related spatio-temporal phenomena, cell proliferation and cell migration. While the modelling approaches suggested here, when further augmented with suitable stochastic and spatial formalisms, will generalize as well, I will not emphasize those applications directly in my discussions here.

3.2 Biochemical models

The different component parts and processes in the biochemical domain may be represented at different levels of abstraction (de Jong 2002; Ideker & Lauffenburger 2003). I summarize some of the major approaches below, but will guide the discussion towards hybrid automata representation, a very general and powerful model for these systems.

3.2.1 Logical modelling

The state of each reactant is captured through a finite number of abstract states (where intermediate expression levels are assumed to have the same behaviour), and functions are used to describe the new states (concentration ranges) of the chemical species, given their old states. The transitions between states can be assumed to occur synchronously or (more accurately) asynchronously. In the simplest case, only two states (‘on’ and ‘off’) are used, and Boolean algebra is used to describe the dynamics. The literature on Concurrent Transition Systems (Chabrier & Fages 2003; Chabrier et al. 2004) and Pathway (Rewrite) Logic (Eker et al. 2002) provides good expositions of logical modelling. Kappler et al. (2002) demonstrated how to extend simple Boolean networks by using ordinary differential equations to capture the concentrations, while Boolean functions continue to determine the rates of the reactions. The probability of being in a state is sometimes a more reasonable measure to estimate, as in the case of Sachs et al. (2002), who used Bayesian networks to model cell-signalling pathways. Similarly, Shmulevich et al. (2003) described the use of probabilistic Boolean networks to model genetic regulatory networks, and determined the long-term joint probabilistic behaviour of a few selected genes. Platzer & Meinzer (2002) simulated the embryonic development of Caenorhabditis elegans by assuming Boolean states for the genes and updating them synchronously at each time step based on an interaction matrix. Batt et al. (2003) have applied model-checking theory to biochemical systems modelled through qualitative simulation.
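In the Boolean special case, the dynamics reduce to iterating a set of update rules, as in the following minimal sketch; the three-gene network and its rules are illustrative, not taken from any of the cited models.

```python
# Minimal sketch of synchronous Boolean-network updating, the simplest logical
# model discussed above. The three-gene network and its rules are illustrative.
def step(state, rules):
    # synchronous update: every gene reads the old state, all genes switch together
    return {gene: rule(state) for gene, rule in rules.items()}

rules = {
    "A": lambda s: not s["C"],           # C represses A
    "B": lambda s: s["A"],               # A activates B
    "C": lambda s: s["A"] and s["B"],    # A and B jointly activate C
}

state = {"A": True, "B": False, "C": False}
for t in range(6):
    print(t, state)
    state = step(state, rules)           # revisiting a state means an attractor (cycle)
```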

3.2.2 Differential equations

If instead the concentrations are represented exactly in the real continuous domain, the ordinary differential equations of the dynamics directly follow from the law of general mass action (Keener & Sneyd 1998; Voit 2000; Cornish-Bowden 2004). For instance, in the reversible reaction aA + bB ⇌ cC + dD, the rate of the forward reaction is vf = kf[A]^a[B]^b and the rate of the backward reaction is vb = kb[C]^c[D]^d, where kf and kb are the forward and backward rate constants, respectively, and the rates of change of the individual reactants are given by −(1/a)d[A]/dt = −(1/b)d[B]/dt = (1/c)d[C]/dt = (1/d)d[D]/dt = vf − vb. As a compromise between discrete and continuous representations, qualitative differential equations can be used with qualitative states corresponding to the different concentration ranges (Batt et al. 2003; de Jong 2003). Models involving spatially distributed or probabilistic interactions among molecules require several other mathematical generalizations, e.g. partial differential equations, stochastic differential equations or reaction–diffusion equations.
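A minimal sketch of how such mass-action equations can be written down and integrated numerically is given below; the stoichiometry and rate constants are illustrative assumptions.

```python
# Minimal sketch: mass-action kinetics for the reversible reaction
# aA + bB <=> cC + dD, integrated with a simple explicit Euler scheme.
# Stoichiometric coefficients and rate constants are illustrative only.
import numpy as np

def mass_action_rhs(x, a=1, b=1, c=1, d=1, kf=2.0, kb=0.5):
    A, B, C, D = x
    vf = kf * A**a * B**b                 # forward reaction rate
    vb = kb * C**c * D**d                 # backward reaction rate
    r = vf - vb                           # net reaction flux
    return np.array([-a * r, -b * r, c * r, d * r])

def integrate(x0, dt=1e-3, steps=10000):
    x = np.array(x0, dtype=float)
    for _ in range(steps):
        x += dt * mass_action_rhs(x)
    return x

print(integrate([1.0, 1.0, 0.0, 0.0]))    # approaches the equilibrium where vf == vb
```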

3.2.3 Hybrid systems

Many biological systems, such as the cell, follow a combination of discrete and continuous behaviours, which cannot be characterized properly using either only discrete or only continuous models. On the one hand, their evolution is ruled by a continuous dynamical law concerning substance concentrations and gradients and, on the other hand, such a dynamical law may change discretely depending on the system status itself. Owing to their hybrid nature, part discrete and part continuous, such systems are named hybrid systems. To model hybrid systems, Alur et al. introduced the notion of hybrid automata in Alur et al. (1995). Intuitively, a hybrid automaton is a ‘finite-state’ automaton with continuous variables, which evolve according to a set of continuous laws characterizing each discrete mode of the automaton itself. The use of hybrid automata for modelling biomolecular networks has been described by Alur et al. (2002a) and Mishra (2002). Amonlirdviman et al. (2002) demonstrated the use of hybrid systems by modelling Drosophila planar cell polarity. Starting with the S-system formulation of Voit & Savageau (1986), Antoniotti et al. (2002) used an additional automaton to broaden the set of representable systems, subsequently using full-fledged hybrid automata (Antoniotti et al. 2004). Ghosh et al. presented both delta-notch (Ghosh & Tomlin 2001; Ghosh et al. 2003) and protein signalling network (Ghosh & Tomlin 2005) models based on the hybrid automaton formalism. Casagrande et al. (2005a) suggested a simple (and decidable) hybrid automaton model for Escherichia coli chemotaxis (see the example later). Lincoln & Tiwari (2004) detailed hybrid automaton modelling of biochemical networks, while Hu et al. (2004) described stochastic hybrid system modelling of subtilin production in Bacillus subtilis. More recently, Drulhe et al. (2006) have described piecewise-affine models of genetic regulatory networks.

3.2.4 Algebraic hybrid automata, dense-time logic and decidability

Thus, if these methodologies are to be adapted to settings suitable for systems biology, what is needed is an appropriate generalization of discrete-time systems, classical temporal logic, the possible-world models of temporal logic given by Kripke (e.g. Kripke structures), model-checking algorithms based on graph-theoretic analysis, etc. However, while the generalization must be sufficiently powerful to capture a large segment of the reasoning processes used by biologists, it should also be sufficiently constrained that the resulting systems can be reasoned about by feasible computational means. At the least, the resulting problems should be decidable (computable). We seek such a framework below through a judicious amalgamation of symbolic algebra (e.g. using decision procedures of semi-algebraic geometry), sufficiently constrained dense-time temporal logic and algebraic models-based hybrid automata. I will present a brief formal discussion of such hybrid automata and their reachability problem, and then discuss an appropriate dense-time logic and the complexity of the model-checking problem in this framework. The discussion, with a few accompanying biological examples, indicates how close these techniques are to the boundary of undecidability.

4. Algorithmic algebraic model checking

My research group introduced the subject of AAMC in an attempt to rigorously examine the connections between systems biology, dynamical systems, modal logic and computability, and how they can be useful in the biological context. Towards this aim, we started by addressing the symbolic model-checking problem for a new class of hybrid models arising in systems biology—semi-algebraic hybrid systems, introduced in the first paper of our ‘AAMC’ series (Piazza et al. 2005). There, and in its subsequent sequels, we sought to characterize the widest range of automata that admit sound, albeit expensive, mathematical techniques, as opposed to focusing on a very narrow class of systems that often prematurely sacrifice generalizability for the sake of efficiency. From this research, a new class of automata, Independent Dynamics Hybrid Automata (IDA), has emerged as one that satisfies the desiderata of systems biology.

In preparation for these systems biology models, I start by providing a formal definition of hybrid automata—a concept first introduced in Maler et al. (1991) and Alur et al. (1993) as a model and specification language for hybrid systems, i.e. systems consisting of a discrete program within a continuously changing environment.

4.1 Syntactical structure

First, some notations and conventions need to be introduced. If p=(p1, …, pk) and s=(s1, …, sk) are vectors in ℝ^k, then I will use 〈p, s〉 to denote the concatenated vector (p1, …, pk, s1, …, sk), and Z = p to indicate equational relationships such as Z1 = p1 ∧ … ∧ Zk = pk.

Indexed capital-letter variables Zm and Z′m, where m ∈ {1, …, k}, denote variables ranging over ℝ, while Z and Z′ denote the vectors of variables (Z1, …, Zk) and (Z′1, …, Z′k), respectively.

The temporal variable T models time and ranges over ℝ≥0. In the following, given a formula ψ[Z] and a model 𝓜, I will denote the set of tuples of values satisfying ψ in 𝓜 as Sat𝓜(ψ), i.e. Sat𝓜(ψ) = {p ∈ ℝ^k : 𝓜 ⊨ ψ[p]}. When 𝓜 is clear from the context, I may simply write Sat(ψ).

A formal definition of hybrid automata may proceed as follows: for each state of a discrete automaton, there is an invariant condition as well as a dynamic law. This dynamic law may depend on the initial conditions, i.e. on the values of the continuous variables at the beginning of the evolution in the state. The discrete jumps from one discrete state to another are regulated by the so-called activation and reset conditions. As an example, one may imagine a two-state hybrid automaton with the states {q1, q2}, with activation conditions from q1 to q2 (respectively, q2 to q1) given by inequality relationships G12(x)≥0 (respectively, G21(x)≥0). Furthermore, as the automaton enters these states, it resets itself to some values, say x1 in the state q1 and x2 in state q2. Starting from this skeletal description of the automaton's ‘discrete’ behaviour, the complete description augments it with more details about what transpires between the discrete switches: namely, how the automaton evolves in each of these states according to a continuous dynamics that is then described by flow and invariant functions: namely, ẋ = fi(x) and gi(x) > 0 for state qi. Thus, this automaton may start in state q1 with a value x = x1 and evolve continuously for a period according to the local flow described by f1(x), while constrained to the subspace whose boundary is defined by g1(x), until the activation condition G12(x)≥0 triggers an instantaneous jump to state q2. In the new state, it continues according to different flow and invariant rules for the next period before jumping back to q1. A classical example of a hybrid automaton is provided by a thermostat that maintains a room temperature in the range [15°C, 20°C]. The thermostat switches back and forth between two states qoff and qon depending on whether the room is cold or hot. If the room is cold (x<15°C), then the heat must be turned on, and the temperature will increase according to a law such as ẋ = K(h − x) (for some constants K and h, with h the heater temperature). On the other hand, if the room is hot (x>20°C), then the heat must be turned off, and the temperature will decrease according to a law such as ẋ = −Kx. The reset and invariant conditions are the obvious ones. I will next describe these same ideas a bit more formally.
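A minimal simulation sketch of this thermostat automaton is given below; the flow laws follow the textbook forms just quoted, and the constants K and h (and the time step) are assumptions chosen only for illustration.

```python
# Minimal sketch of the two-state thermostat hybrid automaton described above.
# Flow laws: dx/dt = K*(h - x) when the heater is on, dx/dt = -K*x when it is off.
# K, h and the integration step are illustrative assumptions.
K, H = 0.5, 30.0            # rate constant and heater temperature (assumed)
LOW, HIGH = 15.0, 20.0      # switching thresholds from the example

def flow(mode, x):
    return K * (H - x) if mode == "on" else -K * x

def simulate(x=18.0, mode="off", dt=0.01, t_end=20.0):
    trace, t = [], 0.0
    while t < t_end:
        # discrete transitions: guards on the continuous state trigger mode switches
        if mode == "off" and x <= LOW:
            mode = "on"
        elif mode == "on" and x >= HIGH:
            mode = "off"
        # continuous transition: follow the flow of the current mode for one step
        x += dt * flow(mode, x)
        t += dt
        trace.append((round(t, 2), mode, round(x, 2)))
    return trace

for sample in simulate()[::200]:    # print a sparse sample of the trajectory
    print(sample)
```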

Let 𝓛 be a first-order language over the reals and 𝓜 be a model of 𝓛. In the following definition, it will be implicitly assumed that Inv, Dyn, Act and Reset are formulae of this first-order language 𝓛.

Definition 4.1 (Hybrid automaton)

A hybrid automaton (of dimension k) H=〈Z, Z′, V, E, Inv, Dyn, Act, Reset〉 (over ℳ) consists of the following components:

  1. Z=(Z1, …, Zk) and Z′=(Z′1, …, Z′k) are two vectors of variables ranging over the reals, ℝ;

  2. 〈V, E〉 is a finite directed graph; the vertices of V are called locations, or control modes, and the directed edges in E, control switches;

  3. Each v∈V is labelled by the two formulae Inv(v)[Z] and Dyn(v)[Z,Z′,T], such that if Inv(v)[p] holds (in ℳ), then Dyn(v)[p,p,0] holds as well;

  4. Each e∈E is labelled by the formulae Act(e)[Z] and Reset(e)[Z,Z′].

The formulae Inv(v)[Z] and Dyn(v)[Z,Z′,T] are said to be invariant of v and dynamics of v, respectively, while Act(e)[Z] and Reset(e)[Z,Z′] are called activation of e and reset of e, respectively. Moreover, if a reset does not depend on Z, then it is said to be a constant reset.
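As an illustration of how definition 4.1 might be carried into software, the following is a hypothetical Python representation in which the four formula labels become Boolean-valued predicates; the names (HybridAutomaton, inv, dyn, act, reset) are mine, not the paper's. The well_formed method spot-checks, on sample points, the consistency requirement of item 3 (Inv(v)[p] implies Dyn(v)[p,p,0]).

# A hypothetical in-memory representation mirroring definition 4.1: locations V,
# edges E, and the four formula labels encoded as Python predicates.
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

State = Tuple[float, ...]          # a valuation of Z = (Z1, ..., Zk)
Edge = Tuple[str, str]             # a control switch <v, u>

@dataclass
class HybridAutomaton:
    locations: List[str]
    edges: List[Edge]
    inv: Dict[str, Callable[[State], bool]]                    # Inv(v)[Z]
    dyn: Dict[str, Callable[[State, State, float], bool]]      # Dyn(v)[Z, Z', T]
    act: Dict[Edge, Callable[[State], bool]]                   # Act(e)[Z]
    reset: Dict[Edge, Callable[[State, State], bool]]          # Reset(e)[Z, Z']

    def well_formed(self, samples: List[State]) -> bool:
        # spot-check item 3 of the definition: Inv(v)[p] implies Dyn(v)[p, p, 0]
        return all(not self.inv[v](p) or self.dyn[v](p, p, 0.0)
                   for v in self.locations for p in samples)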

From such formulae, one can define the formula Embedded Image, where e=〈v,u〉.

Henceforth, one may simply write Embedded Image and Embedded Image to mean Sat(Act(e)) and Embedded Image, respectively.

In the above definition, instead of using the classical approach based on differential equations to define the flow, one may use the formulae in DynSet to describe the continuous evolution without using derivatives. (See Brihaye et al. (2004) and also Lafferriere et al. (2000).) This simplification allows such hybrid automata to generalize several recently discovered notions in this theory. For instance, this step allows treatment of o-minimal hybrid automata (Lafferriere et al. 2000; Brihaye et al. 2004) as a special case of the resulting automata, and map rectangular hybrid automata (Puri & Varaiya 1994; Henzinger & Kopke 1996; Kopke 1996) into a subclass of the earlier definition.

On occasion, one may wish to simply express hybrid automaton dynamics using differential expressions (either equations or inclusions). Let Embedded Image be a function assigning to each vertex Embedded Image, a system of differential inclusions (that can become a system of differential equations, as a particular case). I will occasionally use the notation Embedded Image instead of Embedded Image to denote the fact that, for each vertex Embedded Image, the formula Dyn(v)[Z,Z′,T] corresponds to the solution of the differential inclusions Embedded Image when the starting point is Z.

4.2 Semantic structure: reachability and beyond

A suitable formalization of the semantics of hybrid automata is built upon the concepts of hybrid automaton's states and trajectories (traces).

Definition 4.2 (States)

Let H be a hybrid automaton (over ℳ) of dimension k. A state q of H is a pair 〈v,r〉, where v∈V is a location and r∈ℝ^k is an assignment of values for the variables of Z. A state 〈v,r〉 is said to be admissible, if Inv(v)[r] holds in ℳ.

Dynamics of a hybrid automaton can then be described via sequences of transitions from one state of the automaton to another. Hybrid automata have two kinds of transition (and reachability) relationships: continuous transition (reachability) relationships, capturing the continuous evolution of a state according to both formulae Dyn(v) and Inv(v); and discrete transition (reachability) relationship, capturing changes of location driven by the formulae Reset(e) and Act(e). More formally, the semantics of our hybrid automata is described as follows.

Definition 4.3 (Hybrid automaton—semantics)

Let H be a hybrid automaton (over Embedded Image) of dimension k. The continuous reachability transition relationships Embedded Image between admissible states are defined as follows:Embedded ImageThe discrete reachability transition relationship Embedded Image, where Embedded Image, between admissible states is similarly defined as follows:Embedded Image

I will also use the notation Embedded Image to indicate that either Embedded Image, if Embedded Image, or Embedded Image, when Embedded Image.
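Continuing the hypothetical representation sketched after definition 4.1, the two transition relations of definition 4.3 can be rendered as checks on candidate successor states. Note that, as written, the continuous check only inspects the two endpoint states, whereas the formal semantics also constrains the intermediate points of the evolution.

# A sketch of the two transition relations of definition 4.3, phrased as checks
# on candidate successor states; H is assumed to be a HybridAutomaton as in the
# earlier sketch.

def continuous_step_ok(H, v, r, s, t):
    # <v, r> evolves to <v, s> in time t >= 0: both states admissible and
    # related by Dyn(v).  (Admissibility of intermediate points is not checked
    # here, which is one place where the formal semantics is stricter.)
    return t >= 0 and H.inv[v](r) and H.inv[v](s) and H.dyn[v](r, s, t)

def discrete_step_ok(H, v, r, u, s):
    # <v, r> jumps to <u, s> along edge e = <v, u>: the activation holds at r,
    # the reset relates r to s, and the target state is admissible.
    e = (v, u)
    return (e in H.act and H.inv[v](r) and H.act[e](r)
            and H.reset[e](r, s) and H.inv[u](s))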

Example 4.4

Let H be a hybrid automaton with V={v}, E={e=〈v,v〉}, and in which Dyn(v)[Z,Z′,T] is Z′=e^T*Z, Inv(v)[Z] is 1≤Z≤e², Reset(e)[Z,Z′] is Z′=1 and Act(e)[Z] is 4≤Z≤e². One may choose the following transition sequence tr as valid for Embedded Image. If, now, one defines another automaton H′, by modifying only the invariant formula Inv(v)[Z] of H to change to 1≤Z<e², then by the semantics proposed in (Henzinger 1996; Lygeros et al. 2001), tr is still valid for H′, while it is no longer valid by our semantics. Note that 〈v,e²〉 is not a valid state since Inv(v)[e²] is now false, and is not admitted in a trace in the way I have defined it. Except for a few subtle differences such as this, I have otherwise closely followed the classical definition throughout this paper.
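A quick numerical reading of example 4.4, under the bounds as reconstructed above (my reading of the formulae, so treat them as assumptions): the flow Z′=e^T*Z carries Z=1 to Z=5 in time T=ln 5, at which point the activation window [4, e²] is entered and the constant reset may fire.

# A numerical reading of example 4.4; the bounds used here (Inv: 1 <= Z <= e^2,
# Act: 4 <= Z <= e^2) are assumptions reconstructed from the surrounding text.
import math

Z0 = 1.0
T = math.log(5.0)                 # flow Z' = e^T * Z carries 1 to 5
Z1 = math.exp(T) * Z0
assert 1.0 <= Z1 <= math.e ** 2   # still inside the invariant
assert 4.0 <= Z1 <= math.e ** 2   # activation holds, so the edge may fire
Z2 = 1.0                          # constant reset Z' = 1 re-enters the location
print(Z0, Z1, Z2)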

Without loss of generality, I focus only on hybrid automata whose formulae are satisfiable. For this reason, one can safely omit any mention of the model over which the automaton is constructed or of the dimension of the automaton, except when it would otherwise be ambiguous in the context.

Definition 4.5 (Trace)

Let H be a hybrid automaton, Embedded Image be an admissible state of H and let Embedded Image be an initial segment of Embedded Image.

A trace of H is a sequence Embedded Image of admissible states such that

  1. for all Embedded Image;

  2. for all Embedded Image, there exists an edge e in E such that either Embedded Image or Embedded Image,

where Embedded Image.

At this point, a couple of remarks are in order:

Remark 4.6

Condition 2 in the above definition admits a notion of hybrid trace analogous to the notion of trajectory defined in dynamical systems; without such a condition, a transitive dynamics must be assumed. Consider, for instance, the dynamics Dyn(v)[Z,Z′,T] defined as Embedded Image, with Z=(Z1,Z2) and Embedded Image. Intuitively, one does not expect to admit a trajectory from (0,0) passing through (4,8), satisfying Dyn. However, without condition 2, Embedded Image would be a legal trace from 〈v,(0,0)〉 passing through 〈v,(4,8)〉.

Remark 4.7

There can exist traces that spend arbitrarily little time in continuous evolution and along which, in fact, time does not advance. Hybrid automata admitting such traces are called Zeno hybrid automata. Since allowing such systems in defining models of systems biology may lead to counter-intuitive conclusions, special care is called for in such situations.

I will now formally describe the notion of reachability.

Definition 4.8 (Hybrid automaton—reachability)

Let H be a hybrid automaton of dimension k. A point r∈ℝ^k reaches a point s∈ℝ^k (in time t) if there exists a trace tr=〈v,r〉, …, 〈u,s〉, for some v, u∈V (and t is simply the sum of the elapsed times in continuous transitions).

We use ReachSet(r) to denote the set of points reachable from r. Moreover, given a region R⊆ℝ^k, we use ReachSet(R) to denote the set ∪r∈R ReachSet(r).

A naive approach to computing the reachability relationship may involve simply iterating over the computation of points locally reachable through continuous and discrete transitions. Unfortunately, this procedure is not effective in general, as it immediately faces two difficulties: transitions might be characterizable only by undecidable formulae; and, even when single transitions are computable, the global procedure may not be guaranteed to terminate. Other attempts, for instance the following fix-point result, defined in terms of certain computable ‘forward time closure operators’, 〈.〉 and [.], and presented in Alur et al. (1995), suggest computable algorithms for reachability only in certain simple situations.

Proposition 4.9

Let H be a hybrid automaton and R be a region in H's state space. The reachable region ReachSet(R) is the least fixed point of the equation Embedded Image
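The fix-point characterization suggests the familiar worklist iteration, sketched below for a finite abstraction of the state space; post stands for one application of the forward closure operators, and termination is guaranteed only because the abstract state space is assumed finite.

# A sketch of the least-fixed-point computation of proposition 4.9 on a finite
# abstraction: `post` maps a set of abstract states to the states reachable
# from it in one continuous or discrete step.

def reach(initial: frozenset, post) -> frozenset:
    R = frozenset(initial)
    while True:
        R_next = R | post(R)      # one application of the closure operators
        if R_next == R:           # least fixed point of X = initial ∪ post(X)
            return R
        R = R_next

# toy usage: a four-state abstraction with transitions a->b->c and d isolated
succ = {"a": {"b"}, "b": {"c"}, "c": set(), "d": set()}
print(reach(frozenset({"a"}), lambda S: frozenset(x for s in S for x in succ[s])))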

More generally, it has been shown in Henzinger et al. (1995) that reachability is not decidable. Nonetheless, many interesting classes of hybrid automata over which reachability is decidable have been characterized in the literature (Henzinger & Kopke 1996; Kopke 1996; Lafferriere et al. 2000; Brihaye et al. 2004). A common approach for deciding reachability of hybrid automata employs the technique of discretizing the automata either using equivalence relationships that strongly preserve reachability (e.g. bisimulation; Lafferriere et al. 2000), or using abstractions (e.g. predicate abstraction; Alur et al. 2002b; Tiwari & Khanna 2002). The classes of hybrid automata that owe their decidability to bisimulation include the following: timed automata; certain subclasses of rectangular automata; and o-minimal automata. These are described next.

Both timed and rectangular automata are special cases of linear hybrid automata, which are not immediately useful as high-fidelity models in systems biology. For a linear hybrid automaton, its dynamics, invariants and activation relationships are all defined by linear expressions over the set Z of variables. For the control modes, the dynamics is defined by a differential equation of the form dZi/dT=k, where k is a constant, one for each variable in Z, and the invariants are defined by linear equalities and inequalities (corresponding to a convex polyhedron) in Z. Also, for each transition, the set of reset assignments consists of linear formulae in Z, too. For automata of these kinds, a trace is a piecewise linear function whose values at the points of discontinuity are finite sequences of discrete changes. A particularly well-studied, but somewhat artificial, special case of a linear hybrid automaton is a timed automaton, in which each continuous variable increases uniformly with time (dZi/dT=1) and models clocks in asynchronous or self-timed systems. A discrete transition may either reset or disregard the clock.

A slightly more interesting specialization of a linear hybrid automaton, which was mentioned earlier, is a rectangular automaton. A hybrid automaton is rectangular, if the continuous dynamics are independent of the control modes and the variables are pairwise independent. Here, the dynamics is described by differential inclusions of the form dZi/dT∈[ai, bi] for each variable. The invariant condition and the transition relationship are described by linear predicates that correspond to hyper-rectangles. For rectangular automata, the reachability problem (also the more useful controller synthesis problem) is decidable, when the mode switches can be assumed to occur through discrete/integral sampling.

O-minimal systems generalize these classes of hybrid automata considerably, by allowing much more general classes of functions to describe the dynamics, invariants and activations, but constrain the reset operation to constant values. Decidability has been shown for a very general (somewhat technical) class, describable in the o-minimal theories, whose definable sets are finite unions of points and intervals (with respect to a linear order relationship over their models), and their suitable generalizations to higher dimensions. In spite of the technical nature of these definitions, these automata present a glimpse of what could be attempted in systems biology models, where discrete mode switches and differential algebraic equations abound. However, one disappointing limitation (at least from the viewpoint of biological modelling) is the requirement of constant resets.

Although, at first glance, it may appear that such restrictions must necessarily be imposed if one wishes to achieve decidability (especially using bisimulation techniques), fortunately there are still other algorithmic approaches to explore. For instance, several general and novel techniques have emerged from the themes of the AAMC program, in which computational problems on hybrid automata can be fruitfully studied via translation of the problems, such as reachability, into satisfiability of first-order formulae over decidable theories. In particular, one can make use of the following results (proof omitted).

Theorem 4.10

If a class of hybrid automata is first-order definable and its underlying first-order theory is decidable, then the membership problem for a given hybrid automaton H is decidable.

Furthermore, if the reachability problem for a given hybrid automaton H is first-order definable and the underlying theory is decidable, then the reachability problem for H is decidable.

As explained earlier, the ubiquitous tussle between complexity of the system and the ease of analysis continues to dictate which hybrid automaton subclass will ultimately prevail as the most appropriate for which application. For example, although remarkably efficient verification techniques have been perfected for timed automata (Alur et al. 1995), the very idea of modelling a biochemical system as a set of well-behaved clocks is largely untenable. At the other end of the spectrum, detailed spatio-temporal models at the atomic level, although perhaps extremely accurate, will leave simulation-based analysis as the only feasible option.

In Casagrande et al. (2005b, 2008), we introduced the Semi-Algebraic Constant reset Hybrid Automata (SACoRe), which extended o-minimal automata over the reals, in the case of flows obtained from non-autonomous systems of differential inclusions. SACoRe automata were shown to admit decision procedures for reachability and model checking for a limited fragment of CTL, which they achieved by combining Tarski's decidability result over the reals with Michael's selection theorem. However, this formalism is still quite restrictive in the biochemical domain, as the constant reset requirement that it imposes is very limiting. This constraint is rather artificial in systems biology since, when a biochemical system changes its ‘discrete’ state, it would be rather unnatural to assume that the concentrations are reset to constant values. In fact, the most common result of a state change is no perceptible transient variation in the continuous variables, because of the underlying thermodynamic model built upon a large number of molecular interactions over a short time interval. In other words, identity resets are unavoidable, if one wishes to capture such fundamental aspects of biological state transitions.

For this purpose, a new class of hybrid automata, IDA, was introduced; its characterizing conditions are based upon a decidable first-order theory over the reals (e.g. the first-order theory of (ℝ, 0, 1, +, ×, <)). The dynamics are the solutions of autonomous systems of differential equations. The reset conditions can be either constants, as in the case of o-minimal hybrid automata (Lafferriere et al. 2000), or the identity function. In particular, these models distinguish independent variables, whose resets are the identity function, from dependent variables, whose resets are constant functions. The flows and the reset functions of the dependent variables are allowed to depend on the independent ones, but not vice versa.

In the most general setting, one cannot impose any bound on the time interval over which an IDA's dynamics can be satisfactorily examined, but one is able to define such bounds4 for an interesting subclass of IDA, called ∞IDA. As a consequence, reachability is always decidable on ∞IDA. In this manner, IDA automata achieve the following two desiderata: (i) they extend o-minimal automata to make them suitable for systems biology applications and (ii) they restrain Semi-Algebraic Hybrid Automata (defined in the first AAMC paper) and make them more amenable to analysis. IDA automata accomplish these by exploiting the decidability of the first-order theory over which they are defined, and thus one can bound the time interval that must be considered to solve a reachability problem, as well as prove the decidability of reachability. When the IDAs of interest are defined over the decidable first-order theory of the reals, our approach exploits Tarski's result and quantifier elimination to characterize reachability.

The reachability problem for hybrid automata is now formulated as follows: given a hybrid automaton H and two formulae, ι and τ, denoting an initial set of points Sat(ι) and a target set of points Sat(τ), respectively, one desires to decide whether there exists a point in Sat(ι) which reaches a point in Sat(τ).
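The following sketch illustrates the 'reachability as first-order satisfiability over the reals' idea for a single location with semi-algebraic dynamics, using the z3 Python bindings; the particular formulae standing in for ι, Dyn, Inv and τ are illustrative toys, not drawn from the paper.

# A sketch of encoding a one-location reachability question as satisfiability
# of a first-order formula over the reals; assumes the z3 Python bindings.
from z3 import Real, Solver, And, sat

z0, z1, t = Real("z0"), Real("z1"), Real("t")

iota = And(z0 >= 0, z0 <= 1)            # initial set Sat(iota)
dyn = z1 == z0 + t * t                  # Dyn(v)[Z, Z', T] as a polynomial relation
inv = And(z1 >= 0, z1 <= 10)            # invariant at the endpoint
tau = z1 >= 9                           # target set Sat(tau)

s = Solver()
s.add(t >= 0, iota, dyn, inv, tau)      # do z0, z1, t exist making all of this true?
if s.check() == sat:
    print("reachable, e.g.", s.model())
else:
    print("not reachable")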

In the IDA class of hybrid automata, the components of Z can be partitioned into two sets: the independent and the dependent variables. I will denote by X the vector of independent variables, which maintains the same component ordering of Z. Similarly, I will indicate with Y the vector of dependent variables and with X′ and Y′ the primed version of X and Y, respectively. The independent variables are never reset and their dynamics are the same in all the locations. This condition is similar to that used in rectangular initialized hybrid automata (see Henzinger & Kopke 1996; Kopke 1996).

Moreover, one needs to impose conditions that will ensure the existence of a minimum amount of time, which has to be spent in a location between two jumps. In particular, I will require that the invariants are closed and bounded, and that the distance between reset and activation regions assumes a positive value.

In condition 3 of the definition, I will require that the independent variables, which are never reset, have the same dynamics in each location. The resets and the dynamics of the dependent variables are allowed to depend on the independent variables.

For this last condition, I need to consider a norm ‖.‖ on ℝ^k, and the induced distance d(.,.) between subsets of ℝ^k, defined as d(A,B)=inf{‖a−b‖ : a∈A, b∈B}. From now on, two edges e and e′ will be said to be subsequent if the target node of e is the source node of e′.

Definition 4.11 (Independent dynamics automata)

A hybrid automaton H is an independent dynamics automaton, or simply an IDA, if

  1. H is defined over a decidable theory over the reals;

  2. for each pair of subsequent edges e and e′, the following holds: Embedded Image;

  3. the vector Z of variables can be partitioned into two vectors X (independent variables) and Y (dependent variables) such that

    1. for each Embedded Image, Reset(e)[Z,Z′] is of the form Embedded Image;

    2. for each Embedded Image, Dyn(v)[Z,Z′,T] is of the form Embedded Image.

  4. the set of values satisfying Inv(v)[Z] is closed and bounded for each v∈V.

Example 4.12

Consider the hybrid automaton Embedded Image such that: the dimension k of the automaton is 2; the discrete projection Embedded Image is illustrated in figure 2; the formulae Inv(v1)[Z], Inv(v2)[Z] and Inv(v3)[Z] are equal to Embedded Image; and the remaining formulae defining the automaton are as follows: Embedded Image. The automaton H is an IDA.

Figure 2

The discrete component of the automaton of example 4.12.

It is not difficult to extend our class of automata by allowing different partitioning of the variables depending on the topology of the discrete structure. However, such an extension would involve many tedious technical details, which are preferably omitted in this short review.

Various questions about the power of IDAs can be found in Casagrande et al. (2005a); I refer the interested readers to this paper and its sequels. But I will conclude this section with the following useful positive result.

Consider a class of automata over which an effective (terminating) algorithm can be devised to test reachability (see algorithm in Casagrande et al. (2005a)).

Definition 4.13 (∞IDA)

Let H be an IDA. H is said to be in the class ∞IDA if and only if, for each edge e=〈v,v′〉 of H, it holds that fv is continuous on Embedded Image, and for each Embedded Image it holds that Embedded Image.

Only some of the components of p are used in fi(p,t), i.e. only the components of p corresponding to the independent variables.

Theorem 4.14

Let H be an ∞IDA, ι and τ be two formulae. The algorithm in Casagrande et al. (2005a) can be used to decide whether in H, Sat(ι) can reach Sat(τ).

A few remarks are in order, especially in the context of the classes of automata that my research group introduced in Casagrande et al. (2005a,b, 2008) in its effort to discover the right formalisms to take us beyond the linear hybrid systems. These, unlike many of the earlier generalizations, do not admit finite bisimulation quotients (the default mode of attack in this area). However, since for these automata it is possible to define a translation from temporal formulae (a fragment of CTL) into first-order formulae over the reals, they permit an effective model-checking procedure, a property of enormous value in computational systems biology. In a following section (see §4.4), I will touch on some questions related to model checking and hybrid automata.

However, before leaving this section, I wish to highlight two questions in this area that are likely to grow in importance: (i) in systems biology, model-checking methods for modularly and hierarchically described hybrid models will need to be developed. However, it appears that decidability in a class of hybrid automata does not imply a similar decidability result for models constructed by the composition of members (Cartesian product) of the class. This apparent lack of a closure property is at first glance rather bewildering, but could be intuitively explained by the difficulty in keeping track of how the modules within a system remain synchronized (see Casagrande et al. 2007b). (ii) The second question deals with an obvious, but neglected, problem in systems biology: namely, where do the hybrid automata come from? Thus, there is a need for algorithmic approaches to construct hybrid automata from experimental data (see Casagrande et al. 2007a).

4.3 Example

The use of an IDA model may be demonstrated through a simple, yet biologically relevant, example: namely, bacterial chemotaxis (Spiro et al. 1997; Berg 2000). Many bacteria have evolved extremely effective strategies for responding to a chemical gradient in their environment, by detecting the concentration of ligands through a number of receptors, and then reacting to the input signal by driving their flagellar motors to alter their path of motion. Such a bacterium responds in one of two ways: either it ‘runs’—moves in a straight line by rotating its flagella anticlockwise (CCW; typically lasting 1000 ms), or it ‘tumbles’—randomly changes its heading by rotating its flagella clockwise (CW; typically lasting 100 ms). The response is mediated through the molecular concentration of CheY in a phosphorylated form (YP variable in figure 3), which in turn is determined by the bound ligands at the receptors that appear in several forms (variables Embedded Image Embedded Image etc. in figure 3). The ratio y=YP/Y0 (phosphorylated concentration of CheY to its concentration in the unphosphorylated form) determines a bias with an associated probability that the flagella will exert a CW rotation; note that, in our example IDA model (figure 3), I have simplified this situation by ignoring this stochastic effect and modelling it deterministically. Thus, the most important output variable is the binary-valued angular velocity w that takes discrete values +1 for CW and −1 for CCW. The more detailed pathway involves other molecules: CheB (either with phosphorylation or without, Bp and B0), CheZ (Z), bound receptors (LT) and unbound receptors (T), while their continuous evolution is determined by a set of differential algebraic equations derived through a kinetic mass-action formulation.

Figure 3

An IDA capturing the run–tumble mechanism of a chemotactic bacterium.

This IDA model captures the essence of how a bacterium cell performs a biased random walk by transiently decreasing its tumbling frequency to move towards a region with greater ligand concentration. A second feature of this control is its sensitivity to concentration gradients and its observed dynamic range: rather than responding to absolute concentrations, the bacterium adapts quickly as it compares its environment during the immediate past to what existed a little earlier. Furthermore, it does so over a wide range of input concentrations.
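Purely as a cartoon of the switching logic (and not of the full differential algebraic system of figure 3), the following sketch flips the angular velocity w between run and tumble as the bias y=YP/Y0 crosses an assumed threshold; all numbers are illustrative.

# A cartoon of the run/tumble switching only: the bias y = YP/Y0 is fed in as a
# time series, and the angular velocity w flips between CCW (-1, run) and
# CW (+1, tumble) at an assumed threshold.

def angular_velocity(y_series, threshold=0.5):
    w = []
    for y in y_series:
        w.append(+1 if y > threshold else -1)   # CW = tumble, CCW = run
    return w

# a falling bias (e.g. swimming up a ligand gradient) suppresses tumbling
print(angular_velocity([0.8, 0.7, 0.55, 0.45, 0.3, 0.2]))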

4.4 Model checking in systems biology

While reachability analysis hints at the basic automated reasoning capability needed in systems biology, and rightfully occupies a pivotal role in reasoning about complex biological systems, it must also be abundantly clear to the reader that there exists a still bigger class of pertinent biological queries, statements and hypotheses that remain beyond the grasp of reachability. A semantically rich language for this purpose can be fashioned with the aid of a rather simple propositional temporal logic, which could in principle extend the techniques explored so far.

As examples of some straightforward biological property queries that occur naturally in the emerging field of computational systems biology, one may list the following: safety analysis, sensitivity analysis, robustness (both local and global) analysis, steady-state or homeostasis properties, flux-balance analysis, etc.

Furthermore, on a single hybrid automaton, one can consider both timed and untimed semantics, and one can compute bisimulations on both of them. For this reason, one must distinguish between the so-called time-abstract simulations/bisimulations, computed on the untimed semantics, and the timed simulations/bisimulations, evaluated on the timed semantics.

While there are many versions of timed semantics, with their individual quirks and idiosyncrasies, a particularly elegant formulation is given via Timed Computation Tree Logic (TCTL).

Definition 4.15 (TCTL; Alur et al. 1990)

TCTL has the following syntactic structure: Embedded Image. Its associated semantics is described below:

  1. z: The freeze quantification ‘z.’ binds the associated variable z to the current time. Thus, the formula z.ϕ(z) holds at time t iff ϕ(t) does.

  2. Embedded Image and Embedded Image: universal (on all paths) and existential (on at least one path) ‘until’ operators. For Embedded Image to be true on a path, ϕ2 is required to be true somewhere along the path, and ϕ1 is required to be true all along the path up to (but not necessarily at) that point.

Remark 4.16

The basic notations are often extended by the following syntactic abbreviations (Alur et al. 1990).

  1. Embedded Image and Embedded Image: ‘subscripted’ Until operators (max is the time bound).

  2. Embedded Image and (Embedded Image): ‘eventuality’ operators.

  3. Embedded Image and Embedded Image: ‘invariance’ operators.
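As a small illustration of the time-bounded ('subscripted') until operator of remark 4.16, the following sketch checks a single bounded until along one sampled trace; it therefore corresponds to the existential reading, and states are represented simply by the values of a scalar observable.

# A sketch of checking phi1 'until' phi2 within a time bound along one sampled
# trace; phi1 and phi2 are Python predicates on the observed value.

def until_within(trace, phi1, phi2, max_time):
    # trace: list of (time, value) pairs in increasing time order
    for time, value in trace:
        if time > max_time:
            break
        if phi2(value):
            return True            # phi2 reached in time, with phi1 holding so far
        if not phi1(value):
            return False           # phi1 violated before phi2 was reached
    return False

samples = [(0.0, 1.0), (0.5, 2.0), (1.0, 4.5), (1.5, 6.0)]
print(until_within(samples, lambda x: x < 5, lambda x: x >= 4, max_time=1.2))  # True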

In addition to using this logic to analyse a particular instantiation of a biological model (e.g. one with fixed values of the system parameters), one may wish to engineer it appropriately to provide symbolic model-checking facilities (e.g. the system parameters are treated as symbolic variables, and it is sought to determine various ‘equational relationships’ among these variables in order for the biological system to satisfy certain TCTL formulae). In other words, one seeks to devise algorithmic algebraic solutions to various kinds of queries (in TCTL) to examine interesting properties and invariants of the hybrid automata that model biochemical systems. As one would expect, the simplest and perhaps the most important question that one could ask about these systems is the symbolic state reachability problem: namely, can one reach a particular state from an initial state by following the dynamics of the hybrid automaton, which may be described symbolically? For instance, a very relevant biological question would be to provide a symbolic description of the initial conditions (states) from which the biological system (modelled symbolically and algebraically) can reach a desired state (say, an apoptosis state for a cancer cell), or avoid certain unsafe states. In this sense, algebraic descriptions of this form in systems biology can be a potent tool, and need focused attention. My group, as well as others, has made some progress by exploiting approximations, bounded reachability analysis, etc., or by suitably constraining the power of the family of hybrid automata studied (Casagrande et al. 2005a,b, 2008; Mysore et al. 2005, 2006; Piazza et al. 2005; Casagrande 2006; Mysore 2006; Mysore & Mishra 2006, 2007). But much remains to be done!
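To give the symbolic flavour a concrete face, the following sympy sketch derives, for an assumed toy one-mode dynamics x(t)=x0·e^(−kt), the relationship the rate parameter k must satisfy for a level theta to be reachable before a deadline Tmax; the dynamics and all names are illustrative assumptions, not a biological model.

# A sketch of symbolic (parametric) reachability analysis with sympy.
import sympy as sp

k, t, x0, theta, Tmax = sp.symbols("k t x0 theta Tmax", positive=True)
x = x0 * sp.exp(-k * t)                      # assumed one-mode dynamics x(t)

hit_times = sp.solve(sp.Eq(x, theta), t)     # time(s) at which x(t) = theta
print(hit_times)                             # expected: [log(x0/theta)/k]
# The target is reachable before the deadline iff log(x0/theta)/k <= Tmax,
# i.e. iff k >= log(x0/theta)/Tmax -- an 'equational relationship' on the
# parameter of the kind sought by symbolic model checking.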

5. Challenges of systems biology

It may be considered premature to attempt listing what problems would dominate the thinking in this emerging field for the next few years; nonetheless, we suggest a handful.

  1. Problem 1. What are the most important hybrid systems models for biology? These models should have many desirable properties: namely, high fidelity, expressivity and decidability (and computational efficiency, most probably through new developments in symbolic computation).

  2. Problem 2. How do genotypes determine phenotypes? We can take the view that the model (e.g. structure of the hybrid systems and their parameters, which may be inferred directly from the linguistic patterns in the genomes) encode the genotypes, whereas the temporal logic formulae satisfied by the model encode the phenotypes. One may then ask: How does biology relate them? Are there engineering principles governing them? What properties are important? Are some parameters more ‘flexible’, ‘facilitated’ or ‘robust’ than others? Are there symmetries? Scaling laws? Laws governing compositionality? Laws governing modularity and hierarchy? How does evolution control them? What structural changes can be accomplished by nature? What selection forces act on them? What invariants are important to biology? Are there usage functions that are being optimized? What are they? A similar, but a more detailed, viewpoint has been expounded by Fox Keller & Harel (2007).

  3. Problem 3. How can our understanding of systems biology lead to designing useful artificial biological systems? Can we create ab initio a complete artificial organism? Can we perturb the properties of an existing cell by introducing synthesized biological circuits? These questions will see application of analytical techniques of systems biology to delineate the engineering design space.

  4. Problem 4. How do we build models? What measurements are important in this process? Causal models versus phenomenological models: will it be possible now to create a causal model for any biological phenomenon, given our cognitive incompleteness of biological processes? Can we achieve a useful, albeit approximate, understanding of biology from simpler phenomenological models that may ignore details of many of the structural properties of models?

  5. Problem 5. How can we apply our understanding of systems biology to solve important biomedical problems? The application areas would involve creation of new tools for diagnostics and prognostics, drug discovery, vaccine design, etc. Systems biology models, model checking and model-based supervisory controller design could become the basis for rational drug design in the future.

6. Closing thoughts

Before closing, a reflection on the nature of scientific discovery is in order. Lest some may mistakenly conclude that I have argued parochially in favour of theories over experiments (equivalently, Newton over Hooke), I conclude this review with the following beautiful quote from Hooke:

So many are the links, upon which the true Philosophy depends, of which, if any can be loose, or weak, the whole chain is in danger of being dissolved; it is to begin with the Hands and Eyes, and to proceed on through the Memory, to be continued by the Reason; nor is it to stop there, but to come about to the Hands and Eyes again, and so, by a continual passage round from one Faculty to another, it is to be maintained in life and strength.

It is hoped that in the near future, we will see the emergence of the field of systems biology serving as a strong link between biological principles and observations—maintained in life and strength, as Hooke had prophesied!

Acknowledgments

The ideas presented here owe enormously to several, but most prominently to: Carla Piazza, Alberto Casagrande and Venkat Mysore. In addition, I wish to thank all who have wittingly or otherwise contributed to my (still unfinished) journey into systems biology: Ed Clarke, Merrick Furst, Bob Tarjan, Dana Scott, Mike Foster, Brad White, Tom Anantharaman, Randy Bryant, Jeannette Wing, Prasad Sistla, Allen Emerson, David Dill, Bob Kurshan, Amir Pnueli, Rajeev Alur, Jack Schwartz, Martin Davis, Gerardo Lafferriere, Anne Dinning, Lou Salkind, Dayton Clark, David Shaw, Pasquale Cainiello, Chee Yap, Giovanni Gallo, Pina Carra, William Sit, Tom Dube, Paul Pedersen, Bruno Buchberger, Franz Winkler, Bernd Sturmfels, Volker Weispfenning, Teo Mora, Michael Singer, Ricky Pollack, Marie Francois-Roy, Saugata Basu, Mohsen Jafari, Marco Antoniotti, Shankar Sastry, Toto Paxia, Roger Brockett, Sanjoy Mitter, Pravin Variya, Pradeep Khosla, Dan Koditschek, John Canny, Bruce Donald, Eduardo Sontag, Rohit Parikh, Laxmi Parida, Raj Reddy, Izzy Edelman, Misha Gromov, Ale Carbone, Mike Wigler, Jim Watson, Joey Zhou, Raoul Daruwala, David Schwartz, Mike Waterman, Dick Karp, Ron Shamir, Gene Myers, J. Craig Venter, Gilad Lerman, Joe McQuown, Jim Glimm, Raphy Coiffman, Charles Cantor, Lee Hood, Charles Delisi, Shree Kumar, Ashish Tiwari, Vijay Kumar, Pat Lincoln, Jim Collins, Drew Endy, George Pappas, Claire Tomlin, Pablo Parillo, Calin Belta, Nadia Ugel, Paolo Barbano, Marina Spivak, Jiawu Feng, Yuri Lazebnik, Jack Lin, Seongho Ryu, Jane Hubbard, Fang Cheng, Joan Brugge, David Harel, Marc Kirschner, Partha Mitra, John Doyle, Richard Murray, Mike Savageau, Naren Ramakrishnan, Debo Das, Bahram Parvin, Jeff Trent, R. K. Shyama Sundar, P. S. Thiagarajan, Somenath Biswas, Manindra Agarwal, Alberto Policriti, Raffaella Gentilini, Iuliana Ionita, Samantha Kleinberg, Kevin Casey, Giuseppe Narzisi, Giuseppe Nicosia, Antonina Mitrofanova, Mike Teitell, Andrew Sundstrom, Alex Lim, Pierre Franquin, Jason Reed, Jim Gimzewski, Kalim Mir, Patrick Soon-Shiong, Mickey Atwal, Raul Rabadan and Eric Aslakson. I would also like to thank four anonymous reviewers for their many helpful comments and suggestions as well as useful pointers to several relevant results in this field, which were missed in the first draft, and also for letting me to write this review in a non-traditional style, in spite of their misgivings.

Appendix A. Historical and bibliographic notes

Since computational systems biology attempts to create a consilience of many different scientific fields, it must respond to the diverse styles, traditions and history. What I provide below is a Cook's tour of that history without any pretense of completeness. One may rightfully argue that the present renaissance in computational systems biology owes great debts to the seminal paper of Watson & Crick (1953a,b) and then the efforts that went into sequencing genomes of a diverse group of organisms, starting with the human genome project (Cantor 1990). At present, it is possible to access and study, in great details, the genomes of a plethora of organisms: by the end of 2001, there existed assembled sequence data for Haemophilus influenza, E. coli and 39 other bacteria, Saccharomyces cerevisiae, C. elegans, D. melanogaster, Arabidopsis thaliana and a rough draft for Homo sapiens (Lander 2001; Venter 2001). Subsequently, many more genomes were sequenced and became publicly available for further analysis. At present, this list includes: human (appearing on the covers of Science Venter (2001) and Nature Lander (2001)), chimpanzee, dog, chicken, rat and mouse (on covers of Nature Mouse Genome Sequencing Consortium (2002), International Chicken Genome Sequencing Consortium (2004), Rat Genome Sequencing Project Consortium (2004), Lindblad-Toh (2005), The Chimpanzee Sequencing and Analysis Consortium (2005), respectively), mosquito and fly (on covers of Science Adams (2000), Holt (2002), respectively), malaria parasite and slime mold (Nature Gardner (2002), Eichinger (2005), respectively), etc. Computational approaches were developed to annotate genes and their regulators, as well as to compare orthologous genes across the species. These studies gave rise to the field of functional genomics and a qualitative version of systems biology (Mishra 2003). Researchers were able to combine various ‘omics’ (genomics to study genomes, transcriptomics to study RNA abundance, proteomics to study protein, regulomics, metabolomics, etc.) to piece together a picture of the tangled network of how these discretized elemental objects of molecular biology may be modelled to influence each other. The asymptotic topological structures of these networks and many over-abundant motifs they contained turned out to be exceptionally fascinating. They hinted not only at the elegance of the nature's solution at a static level, but also at the exquisite beauty of the evolutionary dynamics, which sculpted these biochemical networks; it seemed appropriate that the field of systems biology should turn its attention to a quantitative analysis of these systems as various biomolecules interact within and across the cells—usually, categorized into three classes: regulatory, metabolic and signal-transduction pathways. The goal of this emerging field became a better understanding of the foundational design principles that persisted across the evolutionary scales. In this setting, systems biology discovered its roots in quantitative modelling of enzyme kinetics, a discipline that flourished between 1900 and 1970, but also in the simulations developed to study neurophysiology, the control and communication theories and game theory. A particularly interesting approach to understand the tangles of biochemical reactions came from the formalisms of S-system and its extensions to XS-systems (Voit & Savageau 1986; Voit 2000; Antoniotti et al. 2002). 
The complex networks of biochemical reactions could then be created out of elemental reactions such as: reversible and irreversible reactions, synthesis and degradation reaction (possibly, involving stoichiometric constraints) and enzyme or enzyme–coenzyme-based reactions. Of course, such modules could then be abstracted at different levels of hierarchy and composed to create a hierarchical modular structure. Other approaches focusing on systems biology markup language (SBML; Hucka 2003) and related SBW also have become widely influential, because of the ease with which they could be used by working biologists.

The catalytic model of gene activation by transcriptional activators (either protein or micro-RNA factors) played a crucial role in genetic regulatory models and the resulting systems of ordinary differential equations employed by systems biology. These models closely followed the classical Michaelis–Menten equations and their generalization in the form of Hill's equations (Segel 1993). The word catalyst was coined in 1836 by the Swedish chemist Jöns Jakob Berzelius, who collected a number of examples of catalysis. In 1892, the British chemist Adrian Brown had studied the rate of fermentation of sucrose in the presence of yeast to discover that the rate primarily depended on the invertase molecules that formed a complex with the sucrose, and had been influenced by the work of Cornelius O'Sullivan & Frederick Tompson (1890) and Adolphe Wurtz (1880). Along the way, the field has been invigorated by several seminal discoveries, as listed below: Fischer's ‘lock and key hypothesis’ of 1894 (specificity of enzyme action, explained in terms of the precise fitting together of enzyme and substrate molecules); Victor Henri's 1903 research elucidating the relationship between substrate and enzyme concentration; the work of Michaelis & Menten (1913), closely related to the work by Hill (1910), where the kinetic consequences of an enzyme–substrate complex were used to give a precise formulation of the Michaelis–Menten equation; and, finally, a more general formulation (of the Michaelis–Menten equation) by George Briggs and the British geneticist John Burdon Sanderson Haldane (in 1925).
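For the reader's convenience, the standard textbook forms of the Michaelis–Menten rate law and its Hill-type generalization (not specific to any particular model in this review) are, in the usual notation,

\[
  v \;=\; \frac{V_{\max}\,[S]}{K_m + [S]},
  \qquad
  v \;=\; \frac{V_{\max}\,[S]^{n}}{K^{n} + [S]^{n}},
\]

where [S] is the substrate concentration, Vmax the limiting rate, Km the Michaelis constant and n the Hill coefficient.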

Some of the steps in the metamorphosis of general systems theory (as proposed by Ludwig von Bertalanffy) into several key ideas in the present-day systems biology include the following: work of the British neurophysiologists Alan Hodgkin and Andrew Huxley (1952) on a mathematical model of the action potential propagating along the axon of a neuronal cell; Hutter & Noble's (1960) computer model of a beating heart; development in the 1960s and 1970s of several approaches to study complex molecular systems, such as Metabolic Control Analysis and the Biochemical Systems Theory (Fell 1992); many successes of molecular biology throughout the 1980s, (although hindered by a scepticism toward theoretical biology that then promised more than it achieved); the birth of functional genomics in the 1990s with accompanying large quantity of good quality data; Tomita and his group's (1999) quantitative model of the metabolism of a whole (hypothetical) cell, and creation of (around the year 2000) institutes and departments of systems biology in Seattle, Tokyo and Boston. Systems biology may have finally found acceptance as a movement in its own right, spurred on by the completion of various genome projects, the large increase in data from the omics (e.g. genomics and proteomics) and the accompanying advances in high-throughput experiments and bioinformatics.

The field of model checking is likely to assume as important a role in systems biology as it currently plays in many engineering and computational areas; this field is much younger, but possesses an interesting history. Model checking, as a tool for systems verification, was proposed as an alternative to automatic proof-theoretic approaches, which are ‘in general quite tedious’ and require ‘a good deal of ingenuity… to organize the proof in a manageable fashion (Clarke & Emerson 1981a)’. These highly practical and popular techniques are used to ‘mechanically determine whether the system meets a specification expressed in propositional temporal logic’. It has been shown, through ample examples, that model checking enjoys many advantages over its rivals: namely, it requires no proofs; it is fast (compared with other rigorous methods); it produces diagnostic counterexamples if the system's ‘real’ behaviour does not match the believed (specified) behaviour; it is not hampered by partial specifications; and its underlying logics can easily express the main properties of interest in many domains of application.

In parallel to the work of Clarke et al. and Pnueli et al., there appeared several techniques that share a common philosophy in their approach to the problem of reasoning about large complex systems: namely, the independent work of Murata and Jensen on Petri net tools (late 1970s); Bochmann's work on protocol verification (1978); as well as the work of Holzmann, also on protocol verification (1978–79).

The interest in temporal logic as a language for specifying computational protocols and programs dates back to the mid-1970s: e.g. the work of Burstall (1974) and Pnueli (1977), all proposing temporal logic for reasoning about computer programs. However, Amir Pnueli first used temporal logic for reasoning about concurrency, as he proved program properties from a set of axioms that described the behaviour of the individual statements. The method was extended to sequential circuits by others (e.g. Bochmann in 1982 and Owicki & Malachi in 1981). In 1981, Clarke & Emerson (1981) wrote an influential paper showing that their temporal logic model-checking algorithms allowed this type of reasoning to be automated. The success of this approach can be attributed to the simple fact that ‘Checking that a single model satisfies a formula is much easier than proving the validity of a formula for all models’. Their algorithm was practical (with polynomial time complexity). In mid-1980s, the present author and Edmund Clarke showed its use for hardware verification (and discovered a thorny hardware bug in a published circuit designed by Charles Seitz to implement a FIFO queue in hardware).

Another innovation, with a high practical payoff, in some of the early model checkers (e.g. Carnegie-Mellon University's Extended Model Checker) was the facilities for giving counterexamples for universal CTL properties that were false or witnesses for existential properties that were true. This was added by Michael Browne in 1984 to an enhanced implementation, called the model checker B (MCB) model checker. Further enhancements include: automata-theoretic techniques of Aggarwal et al. (1983), also explored in Dill's thesis (1987); and two major breakthroughs in the field in the form of ‘symbolic model checking’ techniques due to two teams of researchers, Coudert et al. (1991), and Burch et al. (1994), and the ‘partial order reduction’ techniques due to Godefroid (1991), Valmari (1991) and Peled (1994).

However, the field is far from mature, as it still needs to deal with many problems that arise naturally in very complex systems such as those appearing in biology. There is a strong impetus to explore non-traditional techniques: namely, techniques based on (i) compositional reasoning, (ii) abstraction, (iii) symmetry reduction, and (iv) induction and parametrized verification. New model-checking techniques have pushed the field in novel directions: timed and hybrid automata (a topic likely to be very important to systems biology), bounded model checking, localization reduction, compositional model checking and learning and infinite-state systems. The field continues to grow in momentum, as many of its early practitioners begin to be recognized for the field's impact.5

A rigorous attempt to create tools for systems biology would most probably depend on various algebraic techniques: both from commutative as well as differential algebra. These techniques must be effective (i.e. algorithmic) and efficient (i.e. have low computational complexity) in order for the tools to be practical, rigorous and scalable. Interestingly, the history of algorithms and algebra is somewhat intertwined: Abū 'Abd Allāh Muhammad ibn Mūsā al-Khwārizmī (780–850 AD), a mathematician in the court of Caliph Harun al-Rashid of the Abbasid Dynasty, is credited as the father of both of these subjects. Two of his books, Al-Kitab al-Mukhtasar fi-hisab al-Jabr al-Muqabalah (algebra) and Kitab al-Jam'a wal-Tafreeq bil-Hisab al-Hindi (algorithm), established the foundations for these two fields. The latter of the two, translated into Latin in the twelfth century as Algoritmi de numero Indorum, and incorporating many of the ideas in Aryabhatta's Siddhanta (also translated by al-Khwārizmī into Arabic as SindHind), created a revolutionary amalgamation of Indian and Greek mathematics.

Soon after that, the field of algebra saw a profusion of breakthroughs; a short list of some of the important milestones in the history of algebra would include: circa 850: Persian mathematician al-Mahani conceived the idea of reducing geometrical problems such as duplicating the cube to problems in algebra; circa 850: Indian mathematician Mahavira solved various quadratic, cubic, quartic, quintic and higher order equations, as well as indeterminate quadratic, cubic and higher order equations; circa 990: Persian Abu Bakr al-Karaji further developed algebra by replacing geometrical operations of algebra with modern arithmetical operations, and defining the monomials; circa 1050: Chinese mathematician Jia Xian discovered numerical solutions of polynomial equations; circa 1072: Persian mathematician Omar Khayyam developed algebraic geometry, and gave a complete classification of cubic equations; circa 1114: Indian mathematician Bhaskara, in his Bijaganita (algebra), solved various cubic, quartic and higher-order polynomial equations, as well as the general quadratic indeterminate equation; circa 1202: Leonardo Fibonacci of Pisa, through his work Liber Abaci, introduced the subject to Europe; circa 1300: Chinese mathematician Zhu Shijie created polynomial algebra, and solved simultaneous equations etc.; circa 1400: Indian mathematician Madhava of Sangamagrama gave iterative methods for approximate solution of non-linear equations; circa 1545: Girolamo Cardano published Ars Magna, which elucidated Fontana's (Tartaglia's) solution of the general cubic equation and Ferrari's solution of the quartic equation; circa 1591: François Viète developed an improved symbolic notation in his In artem analyticam isagoge; circa 1682: Gottfried Leibniz developed his notion of symbolic manipulation with formal rules, called characteristica generalis; circa 1750: Gabriel Cramer stated Cramer's rule and studied algebraic curves, matrices and determinants; circa 1824: Niels Henrik Abel proved that the general quintic equation is insoluble by radicals; circa 1832: Évariste Galois developed Galois theory; circa 1950: Tarski gave a decision method for elementary algebra and geometry, for which the first elementary recursive method was found by Collins using the technique of cylindrical algebraic decomposition, with a better complexity (doubly exponential). The field of algorithmic algebra, which combined computational methods with algebraic manipulation to analyse systems of algebraic equations, inequations and inequalities, flourished with the development of many beautiful algorithms: Buchberger's algorithm for Gröbner bases; Ritt's characteristic sets; Tarski's algorithm for semi-algebraic geometry; and the methods based on resultants. As we described, many of these techniques have begun to influence the development of effective techniques to decide various properties of hybrid automata, and thus of simpler but immensely informative models of biochemical processes. The emergence of the field of ‘Algebraic Biology’ is a first step in combining ideas from algebra, algorithms and automata to provide tools to understand biology better. This young field will continue to be inspired by this consilience, which has begun to attract practitioners from many different fields and with many different backgrounds. It is hoped that these pioneers will build upon these foundations that have survived millennia and will thus create an enduring structure.

Footnotes

  • This paper is dedicated in memory of a mentor and close friend, Jack Schwartz (1930–2009), a ‘gourmet diner at the feast of science’, who with an omnivorous appetite enjoyed every morsel of his art.

  • 1 It may be more innocuous to translate Newton's motto as ‘I make no hypotheses’ or ‘I imagine no hypotheses’, as one of my anonymous referees has pointed out. However, since there is some interesting history surrounding this quote, I have retained the more controversial translation of what Newton wrote in his Scholium, which may be defended as follows: quoting Cohen (see ‘Science rules: a historical introduction to scientific methods’, by Peter Achinstein; p. 105, ch. 9, sec. 1, The general Scholium: ‘Hypothese non fingo’.) ‘Alexandre Kroye suggested that by ‘fingo’ Newton probably intended ‘I feign’, in the sense of ‘inventing a fiction’, since the Latin version (1706) of the Opticks (1704) used the cognate ‘confingere’ for the English ‘feign’. Thus, Newton would be saying that he does not invent or contrive fictions (or ‘hypotheses’) to be offered in place of sound explanations based on phenomena’. Also, few sentences later Cohen writes, ‘From the manuscript drafts we learn that Newton chose the verb ‘fingo’ with care; it was not the first word to leap into his mind. He tried ‘fugio’ as in ‘I flee from [or shun] hypotheses’; he tried ‘sequor’, as in ‘I do not follow [or, perhaps I am not a follower of] hypotheses…’.

  • 2 Other important contemporary mathematical biologists, with similar goals, include Nicolas Rashevsky, Robert Rosen and Conrad Waddington. Rashevsky and Rosen's work on abstract relational biology led to Embedded Image systems that may be argued to have anticipated a systems view, especially in terms of algebraic automata, for biology. Waddington's work on canalization and a proposed role of gene regulation in development anticipated many of the spatio-temporal models of biological pattern formation that we will discuss shortly. I will surely have other occasions to say more about these scientists, as their views (particularly, those of Rosen’s) will very likely play an important role in this burgeoning field.

  • 3 A computer science colleague, Dr S. Biswas of IIT Kanpur, wrote in response to an early draft: ‘You can see that I'm somewhat unhappy that you say that Turing was wrong—not because of some thing is wrong with your account, but because Turing is Turing’! At some deeper level, there is some truth in this view.

  • 4 Most importantly, observe that it is not necessary to explicitly compute these time bounds, but only to check their existence, again, by solving a satisfiability problem.

  • 5 For instance, in recognition of model checking's impact on both theoretical and practical computer science, Amir Pnueli, Ed Clarke, Allen Emerson, Joe Sifakis, Ken McMillan, Randy Bryant, Robert Kurshan, David Harel, etc. have been awarded some of the highest awards in computer science (Turing and Kanellakis awards, for instance).

    • Received December 31, 2008.
    • Accepted March 11, 2009.

References
