
COVER FEATURE

March 2012. Published by the IEEE Computer Society. 0018-9162/12/$31.00 © 2012 IEEE

Redesigning Viral Genomes

Steven Skiena, Stony Brook University

The advent of low-cost, large-scale DNA synthesis, coupled with advanced algorithmic tools, opens the door to a host of exciting new biotechnology applications, including vaccines.

Synthetic biology is an emerging field that extends traditional genetic engineering techniques to create novel forms of life. In contrast to DNA sequencing, which consists of decoding the linear order of the nucleotide bases A, C, T, and G in existing DNA molecules, DNA synthesis involves constructing new DNA molecules from a given genome sequence.

In May 2010, a team led by pioneering geneticist and entrepreneur Craig Venter synthesized a 1.08-megabase bacterial chromosome, arguably the first synthetic life form.1 Since then, advances in synthetic biology have made it possible to cost-effectively manufacture commercially important proteins (for example, for use in therapeutic drugs or biofuels) or construct biological circuits to better understand cellular functions. Vendors today charge less than 40 cents per base to synthesize a DNA sequence to specification, which works out to only a few thousand dollars for a custom virus-length sequence, and prices are rapidly dropping.

The advent of low-cost, large-scale synthesis, coupled with advanced algorithmic tools, opens the door to a host of exciting new biotechnology applications. In addition to facilitating the engineering of new biological structures and functions, computationally driven synthesis could revolutionize the study of natural organisms by providing the power to answer questions such as, what would happen if we change a particular naturally occurring DNA sequence?

Our research group combines algorithmic design with cutting-edge synthesis technologies to improve experimental techniques for solving classical biological problems, particularly in the area of viruses. We have collaborated with biologists to design attenuated viruses to serve as vaccines against important pathogens, locate biologically significant regions within genes, and refactor viral genomes to make them more amenable to manipulation.

SYNTHESIS TECHNOLOGIES

Short sequences of DNA bases, or oligonucleotides (“oligos”), can be chemically synthesized from building blocks in rounds of single-base extensions. Each round adds one base to a chain, but with less-than-perfect fidelity and yield. For example, a 1 percent per-base error rate would imply that only 60.5 percent of all 50-base sequences, and only 36.6 percent of 100-base sequences, are correctly synthesized, and this fraction rapidly decreases with longer sequences. The synthesis of oligos is completely automated and forms the basis of the polymerase chain reaction (PCR), a fundamental experimental technique in molecular biology.

Synthesis of longer DNA molecules to specification is perhaps the most fundamental technology in synthetic biology. Researchers can assemble oligos into longer sequences through hybridization. Because each DNA base binds to its complementary base (A with T and C with G), two oligos with appropriate sequences will tend to hybridize. Careful design can thus yield a set of oligos that will readily self-assemble into longer target sequences of hundreds or even thousands of bases, as Figure 1 shows. Researchers can construct even longer sequences by piecing together previously assembled sequences using innovative but laborious cloning strategies.
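The oligo yield figures quoted earlier in this section (60.5 percent for 50 bases, 36.6 percent for 100 bases) follow from a simple model in which every single-base extension fails independently with the same probability. The short Python sketch below reproduces that arithmetic; the function name `full_length_yield` is an illustrative choice, and the independence assumption is a simplification rather than a description of any particular synthesis chemistry.

```python
def full_length_yield(n_bases: int, per_base_error: float = 0.01) -> float:
    """Fraction of chains with no error in any of n_bases extension rounds.

    Assumes errors are independent and equally likely at every base,
    which is a modeling simplification.
    """
    return (1.0 - per_base_error) ** n_bases


if __name__ == "__main__":
    for n in (50, 100, 200, 500):
        print(f"{n:>4} bases: {full_length_yield(n):.1%} correct")
```

The geometric decay is why long targets are assembled from many short oligos instead of being synthesized in a single pass.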

Ensuring perfect accuracy of the final assembled sequence is challenging, which explains why longer sequences have a higher per-base synthesis cost than smaller sequences. Synthesizing a megabase bacterial chromosome is currently a multimillion-dollar operation, but costs are dropping rapidly, and promising new technologies are on the horizon.

Genes and even entire viral genomes are relatively short, often less than 10,000 bases, making them amenable to whole genome synthesis. Consequently, synthetic virology is rapidly emerging in tandem with declining synthesis costs. For example, the initial synthesis of poliovirus by Jeronimo Cello, Aniko Paul, and Eckard Wimmer cost roughly $300,000, but could be replicated today for less than $5,000. Indeed, our group has designed and synthesized more than a dozen novel poliovirus and influenza strains to date in our quest to develop vaccines and better understand the fundamental mechanisms of DNA transcription and translation. Different groups have synthesized other viral strains, including SARS and the 1918 flu, for biomedical research.2 This freedom of design can provide tremendous power to perform large-scale redesign of coding sequences for various experimental applications.

THE TRIPLET CODE

Genes are DNA sequences that code for proteins, which are sequences of amino acids. The triplet code, shown in Figure 2, defines how the 4 × 4 × 4 = 64 triples of DNA bases, or codons, map to the 20 amino acids and a stop symbol. For example, the amino acid leucine can be encoded as TTA, TTG, CTA, CTC, CTG, or CTT. Coding sequence design involves selecting one codon from the stack of possibilities representing each desired amino acid. Because of the triplet code’s redundancy, there are roughly $3^n$ informationally equivalent DNA sequences coding for any n-residue protein. If we interpret Figure 2 from left to right as a specified sequence of 20 amino acids followed by a stop symbol, selecting one codon per column implies there would be

$$4 \times 6 \times 2^5 \times 4 \times 2 \times 3 \times 6 \times 2 \times 1 \times 2 \times 4 \times 6 \times 4 \times 1 \times 2 \times 4 \times 3 = 1{,}019{,}215{,}872$$

distinct synonymous genes coding for the same tiny protein.
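The count above can be checked mechanically. The following Python sketch builds the standard genetic code table, counts how many codons map to each amino acid, and multiplies those counts along a protein. The names `CODON_TABLE`, `DEGENERACY`, and `synonymous_count` are illustrative, and the one-letter string `FIGURE_2_ORDER` is simply the Figure 2 column order (Ala through Val, then stop written as `*`).

```python
from collections import Counter
from itertools import product
from math import prod

# Standard genetic code in the conventional T, C, A, G ordering; "*" is stop.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Number of synonymous codons available for each amino acid (and for stop).
DEGENERACY = Counter(CODON_TABLE.values())


def synonymous_count(protein: str) -> int:
    """Number of distinct DNA sequences that encode the given protein."""
    return prod(DEGENERACY[aa] for aa in protein)


# Figure 2 read left to right: Ala Arg Asp Asn Cys Glu Gln Gly His Ile
# Leu Lys Met Phe Pro Ser Thr Trp Tyr Val stop.
FIGURE_2_ORDER = "ARDNCEQGHILKMFPSTWYV*"

if __name__ == "__main__":
    print(f"{synonymous_count(FIGURE_2_ORDER):,}")  # 1,019,215,872
```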

Synonymous DNA sequences greatly vary in the speed and efficiency with which they produce protein. A typical synthetic biology application thus involves maximizing the production or expression of a desired protein by designing and then synthesizing an optimized coding sequence. Although gene expression is not completely understood despite more than 40 years of research, careful sequence design generally results in higher expression levels.

Conventional wisdom dictates that maximum expression results from designing a gene to match its host’s codon preferences. For example, at 39.6 percent, the CTG codon is the most common of the six leucine codons in human genes, but the yeast Saccharomyces cerevisiae employs CTG only 11 percent of the time. Hence, a gene designed to maximize expression in humans might employ additional CTG codons whenever possible.
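As a minimal sketch of the "match the host's codon preferences" heuristic, the Python fragment below back-translates a protein by picking the most-used codon for each residue. Everything here is hypothetical scaffolding: the table `HYPOTHETICAL_HUMAN_USAGE` covers only four symbols, and apart from the leucine CTG (39.6 percent) and CTA (7.2 percent) values quoted in this article, its numbers are illustrative placeholders, not measured codon usage.

```python
# Relative usage of synonymous codons per amino acid (fractions sum to 1).
# Only CTG = 0.396 and CTA = 0.072 come from the article; all other values
# are illustrative placeholders and should not be treated as real data.
HYPOTHETICAL_HUMAN_USAGE = {
    "L": {"CTG": 0.396, "CTC": 0.196, "CTT": 0.132,
          "TTG": 0.129, "TTA": 0.075, "CTA": 0.072},
    "K": {"AAG": 0.57, "AAA": 0.43},
    "M": {"ATG": 1.0},
    "*": {"TGA": 0.47, "TAA": 0.30, "TAG": 0.23},
}


def encode_for_host(protein: str, usage: dict) -> str:
    """Back-translate a protein using the host's most-used codon per residue."""
    return "".join(max(usage[aa], key=usage[aa].get) for aa in protein)


if __name__ == "__main__":
    # Met-Leu-Lys-Leu-stop, with every leucine encoded as CTG.
    print(encode_for_host("MLKL*", HYPOTHETICAL_HUMAN_USAGE))
```

A real design tool would load a full 64-codon usage table for the host and would typically weigh other criteria as well; this sketch only shows the greedy per-residue choice.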

Figure 1. Researchers can assemble large DNA sequences from small synthesized fragments.

VIRAL GENOME SYNTHESIS TERMINOLOGY

Attenuated virus: A virus strain that is substantially less infectious than the naturally occurring (wildtype) virus.

Codon bias: The observed statistical distribution governing which codons are over- or underrepresented in the genome of any particular organism.

Codon-pair bias: The (more subtle) observed statistical distribution governing which pairs of neighboring codons are over- or underrepresented in the genome of any particular organism.

DNA synthesis: The construction of customized DNA molecules encoding a particular desired pattern of {A,C,G,T}.

Gene expression: The turning on of a particular gene to produce a large number of specified protein molecules.

Live vaccine: A virus strain that confers immunity to a particular pathogen that is so weakened that it does not cause disease. FluMist is a popular live vaccine against influenza.

Phenotype: An observable property of a particular organism (for example, blue eyes).

Sequence signals: Specific patterns in DNA/RNA sequences that the cell interprets to trigger some particular biological activity. These are analogous to how computer systems use end-of-file characters and bit patterns to delimit file boundaries.

Wildtype: The naturally occurring virus circulating in the environment, as opposed to variant strains created only in the laboratory.


Full understanding of the complexities of gene expression will only come as a result of large-scale synthesis studies, but such studies to date have been limited by the high cost of synthesis. Joshua Plotkin’s group3 employed a clever strategy to create 140 synonymous genes at minimal cost, but because these are not independent designs, the resulting encodings cover only a very narrow portion of the gene design landscape. Larger, more definitive studies will clearly be forthcoming as synthesis becomes less expensive.

SYNTHETIC VIROLOGY AND VACCINE DESIGN

Viruses have always been one of the main causes of human death and disease. Unlike bacterial diseases, viral diseases remain difficult to treat. Thus, vaccination has been the main defense against viruses. Live vaccines work by exposing the immune system to a strain of the disease-causing or wildtype virus to build up useful antibodies. Yet this strain must be weak enough that the body can safely fight off the vaccine-produced infection.

Producing attenuated viruses to serve as safe live vaccines is traditionally done by forcing the wildtype to evolve in an unfamiliar and hostile environment, typically a nonhuman organism. The mutations accumulated in doing so typically make the virus less well adapted to humans. However, this process is expensive and time-consuming. For example, the Sabin 1 poliovirus vaccine was derived from 52 rounds of monkey infections and 16 rounds of monkey kidney cell passages, requiring 100,000 monkeys and several years of work.

Live vaccines have many advantages, often being easy to manufacture and administer. Sometimes the residual growth of the vaccine in recipients allows “herd” immunization (immunization of people in close contact with the primary vaccine recipient). These advantages are particularly important in an emergency, when a vaccine is needed rapidly. However, reversing a small number of mutations (just five in the case of Sabin 1) can cause a live attenuated vaccine to revert to wildtype virulence. For this reason, the Sabin polio vaccine is no longer used in the US.

One of our goals is to design viruses that grow more slowly than would occur naturally, so that the immune system can readily fight off any infection while building antibodies to protect against future infections from the wildtype pathogen.

Toward that end, we have proposed synthetic attenuated virus engineering (SAVE), a systematic approach for generating attenuated live vaccines on demand that has yielded promising vaccine candidates for poliovirus and influenza. SAVE uses computer algorithms to design disadvantageous coding sequences to ensure that viruses will be slow-growing. We never alter a virus’s amino acid sequences, so SAVE produces the exact same proteins as the wildtype and thus elicits the same immune system response. Because attenuation is based on many hundreds of nucleotide changes across the viral genome, reversion of the attenuated variant to a virulent form by evolving in the host is unlikely.

Our initial experiments exploited codon bias: the over- or underrepresentation of certain codons in an organism’s genome. For example, since the CTA codon encodes only 7.2 percent of the occurrences of leucine in human genes, it is natural to employ it frequently to create an attenuated virus. In particular, we synthesized two new poliovirus strains: one heavily biased toward unpopular codons and the other designed (using bipartite matching algorithms) to maximally rearrange the codons without changing their frequency distribution. To avoid causing a loss of function, we maintained the original amino acid sequence of the protein coded by the gene. As anticipated, the design employing unpopular codons was significantly attenuated, while the codon-shuffled design grew essentially as the wildtype virus despite hundreds of base changes. However, restricting encodings to the least-frequent codon at each position left no remaining design freedom.
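Both designs described above must satisfy checkable invariants: the recoded gene has to translate to the same protein, and the codon-shuffled design additionally has to keep the same codon counts. The Python sketch below shows one way such checks might look. The helper names (`translate`, `compare_designs`) and the toy sequences are illustrative assumptions, not the tooling used in the experiments.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def split_codons(dna: str) -> list:
    """Split a coding sequence into codons, ignoring any trailing partial codon."""
    return [dna[i:i + 3] for i in range(0, len(dna) - 2, 3)]


def translate(dna: str) -> str:
    """Translate a coding sequence into one-letter amino acids ('*' is stop)."""
    return "".join(CODON_TABLE[c] for c in split_codons(dna))


def compare_designs(wildtype: str, design: str):
    """Return (same_protein, same_codon_usage, base_changes) for two sequences."""
    if len(wildtype) != len(design):
        raise ValueError("sequences must have equal length")
    same_protein = translate(wildtype) == translate(design)
    same_usage = Counter(split_codons(wildtype)) == Counter(split_codons(design))
    base_changes = sum(a != b for a, b in zip(wildtype, design))
    return same_protein, same_usage, base_changes


if __name__ == "__main__":
    wildtype = "ATGCTGAAGCTATAA"      # Met Leu Lys Leu stop
    deoptimized = "ATGCTAAAACTATAA"   # rarer codons, same protein
    shuffled = "ATGCTAAAGCTGTAA"      # same codons, different order
    print(compare_designs(wildtype, deoptimized))
    print(compare_designs(wildtype, shuffled))
```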

Certain pairs of codons appear as neighbors more or less often than would be expected under an assumption of statistical independence. For example, the CTT CGA codon pair appears 1.95 times more frequently in humans than expected, while the synonymous CTT AGG codon pair is underrepresented at only 0.44 times expectation. The mechanism underlying this codon-pair bias is unknown, and the biological community has largely ignored it.
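An observed-to-expected ratio of this kind can be estimated from a set of coding sequences by counting codons and adjacent codon pairs. The Python sketch below uses the plainest possible independence model, in which the expected count of a pair is the number of pairs times the product of the two codon frequencies. This is a simplification chosen for illustration; published codon-pair scores may normalize differently (for example, relative to the encoded amino acid pair), and the function names and toy genes are assumptions.

```python
from collections import Counter


def split_codons(dna: str) -> list:
    """Split a coding sequence into codons, ignoring any trailing partial codon."""
    return [dna[i:i + 3] for i in range(0, len(dna) - 2, 3)]


def codon_pair_ratios(genes) -> dict:
    """Observed/expected ratio for every adjacent codon pair seen in `genes`.

    Expected counts assume neighboring codons are chosen independently.
    """
    codon_counts, pair_counts = Counter(), Counter()
    for gene in genes:
        codons = split_codons(gene)
        codon_counts.update(codons)
        pair_counts.update(zip(codons, codons[1:]))

    total_codons = sum(codon_counts.values())
    total_pairs = sum(pair_counts.values())
    ratios = {}
    for (first, second), observed in pair_counts.items():
        expected = (total_pairs
                    * codon_counts[first] / total_codons
                    * codon_counts[second] / total_codons)
        ratios[(first, second)] = observed / expected
    return ratios


if __name__ == "__main__":
    toy_genes = ["CTTCGACTTCGA", "CTTAGG"]  # toy input, far too small to be meaningful
    for pair, ratio in sorted(codon_pair_ratios(toy_genes).items()):
        print(pair, round(ratio, 2))
```

A ratio above 1 marks an overrepresented pair and a ratio below 1 an underrepresented one; meaningful estimates need a genome-scale set of genes.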

We employed simulated-annealing-based optimization algorithms to rearrange codons to construct unfavorable codon-pair bias designs that preserved protein sequences. Our unfavorable codon-pair design was strongly attenuated, proving codon-pair bias to be a biologically significant phenomenon instead of a mere statistical artifact. Further, our codon-pair deoptimized strains performed well as a polio vaccine in mice.4

Ala Arg Asp Asn Cys Glu Gln Gly His Ile Leu Lys Met Phe Pro Ser Thr Trp Tyr Val stop
GCU CGU GAU AAU UGU GAG CAG GGU CAU AUU CUU AAG AUG UUU CCU UCU ACU UGG UAU GUU UGA
GCG CGG GAC AAC UGC GAA CAA GGG CAC AUC CUG AAA --- UUC CCG UCG ACG --- UAC GUG UAG
GCC CGC --- --- --- --- --- GGC --- AUA CUC --- --- --- CCC UCC ACC --- --- GUC UAA
GCA CGA --- --- --- --- --- GGA --- --- CUA --- --- --- CCA UCA ACA --- --- GUA ---
--- AGG --- --- --- --- --- --- --- --- UUG --- --- --- --- AGU --- --- --- --- ---
--- AGA --- --- --- --- --- --- --- --- UUA --- --- --- --- AGC --- --- --- --- ---

(--- marks an empty cell.)

Figure 2. The DNA triplet code defines how the 4 × 4 × 4 = 64 triples of DNA bases, or codons, map to the 20 amino acids and a stop symbol.
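The codon-rearrangement idea can be illustrated with a small simulated-annealing loop. In the Python sketch below, a move swaps two codons that encode the same amino acid, so the protein and the overall codon counts never change; only the neighboring pairs do. The cost is the sum of per-pair scores, where negative scores stand for underrepresented pairs. The cooling schedule, the step count, and the names `anneal_codon_pairs` and `PAIR_SCORE` are assumptions for illustration; the two scores are the logarithms of the 1.95 and 0.44 ratios quoted earlier, with every other pair treated as neutral.

```python
import math
import random
from collections import defaultdict
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Log observed/expected ratios; unlisted pairs score 0 (assumed neutral).
PAIR_SCORE = {("CTT", "CGA"): math.log(1.95), ("CTT", "AGG"): math.log(0.44)}


def pair_cost(codons, pair_score):
    """Sum of codon-pair scores over all adjacent pairs (lower = more unfavorable)."""
    return sum(pair_score.get(p, 0.0) for p in zip(codons, codons[1:]))


def anneal_codon_pairs(codons, pair_score, steps=2000, t_start=1.0, t_end=0.01, seed=0):
    """Minimize pair_cost by swapping synonymous codons; returns (best, best_cost)."""
    rng = random.Random(seed)
    by_amino = defaultdict(list)
    for i, codon in enumerate(codons):
        by_amino[CODON_TABLE[codon]].append(i)
    swappable = [idx for idx in by_amino.values() if len(idx) >= 2]

    current = list(codons)
    cost = pair_cost(current, pair_score)
    best, best_cost = current[:], cost
    if not swappable:
        return best, best_cost

    for step in range(steps):
        temp = t_start * (t_end / t_start) ** (step / steps)  # geometric cooling
        i, j = rng.sample(rng.choice(swappable), 2)
        current[i], current[j] = current[j], current[i]
        new_cost = pair_cost(current, pair_score)  # full recompute, fine for a sketch
        if new_cost <= cost or rng.random() < math.exp((cost - new_cost) / temp):
            cost = new_cost
            if cost < best_cost:
                best, best_cost = current[:], cost
        else:
            current[i], current[j] = current[j], current[i]  # reject: undo the swap
    return best, best_cost


if __name__ == "__main__":
    start = "ATG CTT CGA CTG AGG CTT AGG CTG CGA TAA".split()
    design, cost = anneal_codon_pairs(start, PAIR_SCORE)
    print(" ".join(design), round(cost, 3))
```

Because only synonymous positions are exchanged, the result differs from the input in codon order alone, which isolates the effect of codon-pair bias from that of codon bias.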

Most recently, we demonstrated that codon-pair bias designs provide safe and effective vaccines in mice for influenza, a significantly different virus than polio that claims 250,000 to 500,000 lives annually worldwide.

That SAVE can rapidly create new vaccines is particularly important in dealing with seasonal epidemics and pandemic threats, such as H5N1 (avian flu) or the 2009 H1N1 influenza. In addition, our vaccine design approach provides an enormous margin of safety: it conferred flu immunity in mice at doses 1,000 times lower than that causing disease.5

SIGNAL LOCATION SEARCH

Poliovirus is not only the causative agent of a terrible disease but also serves as a model organism to study how RNA viruses work. Literally thousands of papers have been published on the behavior of this 7,500-base sequence, reflecting years of painstaking laboratory work to create new strains through directed evolution or explicit manipulation. This research has led to the identification of several important sequence signals within the poliovirus genome.

Might there be other, undiscovered signals lurking within the genome’s coding regions?

Over the course of evolution, organisms have developed a wide variety of mechanisms to regulate gene expression outside the triplet code’s boundaries. However, efficient assays to perform genome-wide scans for such sequence-critical signals have been elusive. Because bioinformatic methods are useful but not perfect signal predictors, researchers use laboratory experiments to prove the significance of signals in encoding sequences.

Through site-directed mutagenesis, molecular biologists can make small changes (typically one to five bases) at specific positions in a genomic sequence and observe any phenotypic changes (changes in observable properties). Although widely used, site-directed mutagenesis is a very slow, tedious process. Searching for a putative signal is a repeated guess-and-test procedure that is impractical for searching regions of even a few hundred bases.

Synthetic sequence design holds the key to more efficient signal detection. Suppose we synthesize a virus with large numbers of synonymous codon substitutions over a 1,600-base region of the wildtype. The viability of this synthesized virus proves that no critical signal in the wildtype has been removed. If our synthetic design is unexpectedly dead, however, it means we have accidentally eliminated an essential element in the genome. How might we find the precise location of this critical signal in the large altered region?

Group testing, Gray codes, and signal search

In response to this challenge, we have developed a novel signal-search procedure that couples large-scale synthesis with sophisticated sequence designs based on discrete mathematics, specifically combinatorial group testing and balanced Gray codes.

Suppose we replace the right half of the coding sequence of a viable gene with the coding sequence of a dead, signal-deficient strain, as shown in design I in Figure 3. If this hybrid gene is viable, the critical signal must occur in the left half of the gene, whereas a dead gene implies it must occur in the right half. A binary search process involving additional synthesis or subcloning can locate the signal in one of n regions in $\log_2 n$ sequential rounds. However, each round takes on the order of weeks to complete, making this a slow and laborious process.

Instead, suppose we simultaneously synthesize all four of the designs in Figure 3. Note that each column defined by these designs consists of a distinct red-and-green pattern. Thus the viability/nonviability of the four synthetic designs uniquely identifies the critical signal’s position in only one experimental round. In this case, gene I is defective and the other three are viable, pinpointing the signal’s location in the third region from the right.
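The single-round idea can be sketched in a few lines of Python. Each region is assigned a distinct k-bit signature, design j takes region i from the dead (red) sequence exactly when bit j of that signature is set, and a design is nonviable exactly when it replaced the region holding the signal. The observed dead/alive pattern is then the signature of the signal's region. This sketch uses plain binary signatures and assumes exactly one signal; it does not reproduce the specific layout of Figure 3, and all function names are illustrative.

```python
from math import ceil, log2


def make_designs(n_regions: int, k: int):
    """designs[j][i] is True when design j takes region i from the dead sequence."""
    if n_regions > 2 ** k:
        raise ValueError("need at least log2(n_regions) designs")
    return [[bool((region >> j) & 1) for region in range(n_regions)]
            for j in range(k)]


def run_experiment(designs, signal_region: int):
    """Simulated lab result: a design is dead iff it replaced the signal's region."""
    return [design[signal_region] for design in designs]


def locate_signal(designs, dead):
    """Return the unique region whose column matches the dead/alive pattern."""
    n_regions = len(designs[0])
    matches = [i for i in range(n_regions)
               if all(design[i] == outcome for design, outcome in zip(designs, dead))]
    return matches[0] if len(matches) == 1 else None


if __name__ == "__main__":
    n_regions, k = 16, 4
    designs = make_designs(n_regions, k)
    outcome = run_experiment(designs, signal_region=5)
    print("dead designs:", outcome, "-> region", locate_signal(designs, outcome))
    print("sequential binary search would need", ceil(log2(n_regions)), "rounds")
```

The number of syntheses is the same as for binary search, but all of them run in parallel, so the elapsed time is one experimental round instead of several.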

We have validated this efficient, single-round methodology in poliovirus as well as adenovirus,6 and believe it will ultimately become a standard technique for understanding viral biology. Certainly, such genome scans will be an essential part of any effort to apply our SAVE vaccine design technology outside well-studied model organisms. The likelihood of stumbling upon a sequence-specific signal and creating dead phenotypes dooms this approach unless we can efficiently identify the position of these signals when we encounter them.

Figure 3. Designs of four synthetic genes to locate a specific sequence signal. The green regions are drawn from a viable sequence, while the red regions are drawn from a lethally defective sequence. Genes II, III, and IV are viable while gene I is defective, an outcome that can only be explained by a lethal signal in the red region located third from the right.

Design improvements

The freedom of large-scale synthesis makes it possible to add extra features to gene designs to make experiments more robust. The key observation is that each column represents one of $s = 2^k$ distinct subsets of k designs: the alive or dead signatures of k strains pinpoint the signal location to one of s possibilities. Yet the designs in Figure 3 contain only 14 regions rather than $2^4 = 16$. Eliminating the columns that are all green or all red ensures at least one viable strain and at least one dead strain, an effective control against certain classes of experimental errors.

Unfortunately, at each transition point between wildtype and synonymous sequence a signal can span two or more regions, as Figure 4 shows, producing ambiguous results. To address this problem, we constructed the designs in Figure 3 such that each column differs from its neighbors in only one position. This Gray code ordering minimizes the number of transitions between red and green, and thus the potential of the signal occurring at a transition point.
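To see why a Gray code ordering reduces the number of red/green boundaries, the Python sketch below compares plain binary column order with the standard reflected Gray code. Each column is an integer whose bit j gives the color of that region in design j, so the number of bits that differ between neighboring columns is the number of designs with a transition at that boundary. Note the assumptions: this is the ordinary reflected Gray code, not the balanced Gray code used for the actual designs, and it keeps the all-green and all-red columns for simplicity.

```python
def binary_order(k: int):
    """Columns in plain counting order: 0, 1, 2, ..."""
    return list(range(2 ** k))


def reflected_gray_code(k: int):
    """Standard reflected Gray code: neighboring columns differ in one bit."""
    return [i ^ (i >> 1) for i in range(2 ** k)]


def count_transitions(columns) -> int:
    """Total red/green boundaries over all designs, summed across column borders."""
    return sum(bin(a ^ b).count("1") for a, b in zip(columns, columns[1:]))


if __name__ == "__main__":
    k = 4
    print("binary order:", count_transitions(binary_order(k)), "transitions")
    print("Gray order:  ", count_transitions(reflected_gray_code(k)), "transitions")
```

With k = 4, the Gray ordering has exactly one transition at each of the 15 column boundaries, the minimum possible when all columns are distinct, whereas counting order has 26. A balanced Gray code additionally spreads those transitions evenly across the k designs.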

The combinatorial search procedure above fails when two or more unknown signals contribute to a phenotype. Researchers have extensively analyzed the general case of d arbitrary positives,7 but such designs require prohibitively many tests even when d = 2.

A more compelling cost/performance tradeoff exists in designs identifying at most d consecutive positives, corresponding to long signals spanning multiple neighboring regions as in Figure 4. We have developed an efficient set of such designs that uses 10 tests to detect up to three consecutive positives for 105 regions. The previous best design had a capacity of only 16 regions, so our designs offer six times the resolution with no additional cost.8

SEQUENCE DESIGN FOR GENOME REFACTORING

Traditionally, molecular biologists study wildtype organisms extracted from nature or laboratory strains resulting from directed evolution or small-scale manipulation. Synthesis technology makes it possible to refactor a genome sequence so that it is functionally equivalent (it behaves the same in its natural environment) while being easier to manipulate experimentally. (The term refactor is imported from software engineering, where it means to redesign a program to improve its internal structure for better ease of maintenance, while leaving its external behavior unchanged.)

George Church’s Molecular Technology group at Harvard University recently refactored the E. coli genome by replacing all 314 occurrences of the TAG stop codon with the equivalent TAA, thus freeing up the TAG symbol to be redeployed in many exciting ways.9 For example, it could be used to code for a novel amino acid, thus enabling the synthesis of new types of proteins never before possible. Alternatively, it could be used to recode an existing amino acid, enabling the design of viruses that thrive in the altered cells yet could never grow in natural cells. Even more ambitious is Johns Hopkins University’s Synthetic Yeast 2.0 project,10 which is engineering a computer-designed yeast chromosome with several attractive characteristics.

The freedom to engineer genomes to specification opens up levels of design complexity to the synthetic biologist similar to those that the advent of large-scale integration offered the circuit designer. For this reason, synthetic biology is moving to sophisticated computer-aided design tools such as GenoCAD11 to optimize sequences in the same way that silicon chip design now relies on algorithms rather than people to optimize circuit layouts.

Optimizing unique restriction sites

Although large-scale synthesis costs are decreasing rapidly, it generally remains more cost-effective to replace targeted regions in genomic DNA through the process of cloning rather than to synthesize an entire genome from scratch.

Cloning uses pairs of unique restriction sites, distinct sequences cut by particular enzymes. For example, the enzyme EcoRI cuts at the pattern GAATTC. More than 600 distinct restriction enzymes are available commercially. Unique restriction sites within a given target are particularly prized, because they cut the sequence unambiguously in exactly one place.

The redundancy of the genetic code offers the freedom to insert new restriction sites at certain places and remove them from others without changing the protein coded by a given gene. The challenge is finding well-spaced unique placements for many different enzymes to facilitate laboratory manipulation of synthesized sequences.
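The two quantities this problem cares about, which enzymes cut a sequence exactly once and how far apart those unique cuts are, are easy to measure for a given sequence. The Python sketch below does only that measurement; it is not the placement algorithm itself. The three recognition sites listed are well-known palindromic sites, so searching one strand is sufficient for them; the toy sequence and the function names are illustrative assumptions.

```python
# Recognition sequences of three common restriction enzymes (all palindromic).
ENZYMES = {"EcoRI": "GAATTC", "BamHI": "GGATCC", "HindIII": "AAGCTT"}


def site_positions(seq: str, site: str):
    """All start indices of `site` in `seq`, including overlapping occurrences."""
    positions, start = [], seq.find(site)
    while start != -1:
        positions.append(start)
        start = seq.find(site, start + 1)
    return positions


def unique_sites(seq: str, enzymes=ENZYMES) -> dict:
    """Map each enzyme that cuts `seq` exactly once to the position of its site."""
    hits = {name: site_positions(seq, site) for name, site in enzymes.items()}
    return {name: pos[0] for name, pos in hits.items() if len(pos) == 1}


def max_gap(seq_len: int, positions) -> int:
    """Longest stretch between consecutive unique sites (or a sequence end)."""
    cuts = [0] + sorted(positions) + [seq_len]
    return max(b - a for a, b in zip(cuts, cuts[1:]))


if __name__ == "__main__":
    # Toy sequence: one EcoRI site, two BamHI sites, one HindIII site.
    seq = "ATG" + "GAATTC" + "AAA" + "GGATCC" + "TTT" + "GGATCC" + "AAGCTT" + "TAA"
    unique = unique_sites(seq)
    print(unique)                                 # BamHI is excluded: it cuts twice
    print(max_gap(len(seq), unique.values()))
```

A refactoring tool would use synonymous codon changes to remove the duplicate BamHI site or to create new unique sites inside the largest gap.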

PRESTO

We developed PRESTO (Placing REstriction SiTes Optimally)12 to heuristically solve the unique restriction site placement problem, after proving the general problem NP-complete. This genome refactoring tool uses an $O(n^2 2^r)$-time dynamic programming algorithm, which is quite practical for designs when the number of enzymes r is small. It also employs heuristics for site placement based on weighted bipartite matching, which is polynomial in both n and r and results in good designs in practice.

Figure 4. Long sequence signals can span multiple consecutive regions in synthetic genes, potentially producing ambiguous search results. Identifying the locations of such signals motivates more advanced sequence designs incorporating Gray codes and consecutive positive group tests. (Figure labels: DNA sequence, selected cut points, vital region.)

PRESTO produces genomes with three- to fourfold more unique restriction enzymes than the baseline algorithm and reduces the maximum gap size between restriction sites three- to ninefold. Figure 5 shows a PRESTO refactoring of poliovirus, taking it from 35 unique restriction sites to 110, all positioned to eliminate large gaps between sites.

The synthetic biology era is just beginning. Emerging DNA synthesis technologies will remake our world, leading to breakthroughs in medicine, manufacturing, and fundamental biological research. Building complex biological systems will ultimately require CAD tools just as complex electronic circuitry demands today.

The gene coding design space for any given protein is enormous: a typical 300-amino-acid sequence can be encoded in about $10^{151}$ ways. Sequence design algorithms must optimize a range of complex criteria including many not discussed here, such as pattern frequencies, RNA secondary structure, and minimum-length restrictions. Our research on viral genome synthesis highlights the need for more active collaborations between life and computational scientists to develop such algorithms to support biological research.

Acknowledgments

None of the work described in this article would be possible without the experimentalists on the SAVE team, with whom I have had many stimulating discussions: Steffen Mueller, Rob Coleman, Bruce Futcher, Yutong Song, Molly Arabov, Anjaruwee Nimnual, Chen Yang, Aniko Paul, and Eckard Wimmer. In our work on adenovirus therapeutics, we collaborated with Wadie Bahou, Varsha Sitaraman, and Pat Hearing. I also thank my computational colleagues on the projects reported in this article: Charles Ward, Dimitris Papamichail, Barry Cohen, Bei Wang, Pablo Montes, Heraldo Memelli, Joondong Kim, and Joe Mitchell. This research was partially supported by NIH Grant AI075219 and NSF Grants DBI-1060572 and IIS-1017181.

Figure 5. Comparison of the set of unique restriction sites available in wildtype poliovirus (left) versus PRESTO-refactored poliovirus (right). Each box corresponds to the name of a particular restriction enzyme with a unique site within the virus sequence.

References

1. D.G. Gibson et al., “Creation of a Bacterial Cell Controlled by a Chemically Synthesized Genome,” Science, 2 July 2010, pp. 52-56.

2. E. Wimmer et al., “Synthetic Viruses: A New Opportunity to Understand and Prevent Viral Disease,” Nature Biotechnology, Dec. 2009, pp. 1163-1172.

3. J.B. Plotkin and G. Kudla, “Synonymous but Not the Same: The Causes and Consequences of Codon Bias,” Nature Reviews Genetics, Jan. 2011, pp. 32-42.

4. J.R. Coleman et al., “Virus Attenuation by Genome-Scale Changes in Codon-Pair Bias,” Science, June 2008, pp. 1784-1787.

5. S. Mueller et al., “Live Attenuated Influenza Vaccines by Computer-Aided Rational Design,” Nature Biotechnology, July 2010, pp. 723-726.

6. V. Sitaraman et al., “Computationally Designed Adeno-Associated Virus (AAV) Rep 78 Is Efficiently Maintained within an Adenovirus Vector,” Proc. National Academy of Sciences, 23 Aug. 2011, pp. 14294-14299.

7. D.-Z. Du and F.K. Hwang, Pooling Designs and Nonadaptive Group Testing: Important Tools for DNA Sequencing, World Scientific, 2006.

8. Y.-L. Lin, C. Ward, and S. Skiena, “Synthetic Sequence Design for Signal Location Search,” Proc. 16th Ann. Int’l Conf. Research in Computational Molecular Biology (RECOMB 12), Springer, 2012; http://recomb2012.crg.eu.

9. F.J. Isaacs et al., “Precise Manipulation of Chromosomes in Vivo Enables Genome-Wide Codon Replacement,” Science, 15 July 2011, pp. 348-353.

10. J.S. Dymond et al., “Synthetic Chromosome Arms Function in Yeast and Generate Phenotypic Diversity by Design,” Nature, 22 Sept. 2011, pp. 471-476.

11. M.J. Czar, Y. Cai, and J. Peccoud, “Writing DNA with GenoCAD,” Nucleic Acids Research, July 2009, pp. W40-W47.

12. P. Montes et al., “Optimizing Restriction Site Placement for Synthetic Genomes,” Proc. 21st Ann. Conf. Combinatorial Pattern Matching (CPM 10), LNCS 6129, Springer, 2010, pp. 323-337.

Selected CS articles and columns are available for free at http://ComputingNow.computer.org.

Steven Skiena is Distinguished Teaching Professor of Computer Science at Stony Brook University as well as cofounder and chief scientist of General Sentiment (www.generalsentiment.com), a social media and news media analytics company based in Long Island, New York. His research interests include the design of graph, string, and geometric algorithms and their applications, particularly to biology. Skiena received a PhD in computer science from the University of Illinois at Urbana-Champaign. He is a member of ACM and the American Association for the Advancement of Science. Contact him at [email protected].
