Genome Evolution Guest Editors: Izabela Makałowska, Igor B. Rogozin, and Wojciech Makałowski Advances in Bioinformatics



Copyright © 2010 Hindawi Publishing Corporation. All rights reserved.

This is a special issue published in volume 2010 of “Advances in Bioinformatics.” All articles are open access articles distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


Advances in Bioinformatics

Editorial Board

Shandar Ahmad, Japan; T. Akutsu, Japan; Rolf Backofen, Germany; Craig Benham, USA; Mark Borodovsky, USA; Alvis Brazma, UK; Janusz M. Bujnicki, Poland; Rita Casadio, Italy; David Corne, UK; Bhaskar Dasgupta, USA; Ramana Davuluri, USA; J. Dopazo, Spain; Anton Enright, UK; Stavros J. Hamodrakas, Greece

Paul Harrison, USA; Huixiao Hong, USA; David Jones, UK; George Karypis, USA; Jack Leunissen, The Netherlands; Jie Liang, USA; Guohui Lin, Canada; Pietro Lio, UK; Dennis Livesay, USA; Satoru Miyano, Japan; Burkhard Morgenstern, Germany; Masha Niv, Israel; Zoran Obradovic, USA; Florencio Pazos, Spain

David Posada, Spain; Jagath Rajapakse, Singapore; Marcel J. T. Reinders, The Netherlands; P. Rouze, Belgium; Alejandro A. Schaffer, USA; E. L. Sonnhammer, Sweden; Sandor Vajda, USA; Y. Van de Peer, Belgium; Antoine van Kampen, The Netherlands; Alexander Zelikovsky, USA; Zhongming Zhao, USA; Albert Zomaya, Australia


Contents

Genome Evolution, Izabela Makałowska, Igor B. Rogozin, and Wojciech Makałowski
Volume 2010, Article ID 643701, 2 pages

Evolution and Diversity of the Human Hepatitis D Virus Genome, Chi-Ruei Huang and Szecheng J. Lo
Volume 2010, Article ID 323654, 9 pages

Adaptive Evolution Hotspots at the GC-Extremes of the Human Genome: Evidence for Two Functionally Distinct Pathways of Positive Selection, Clara S. M. Tang and Richard J. Epstein
Volume 2010, Article ID 856825, 7 pages

Testing the Coding Potential of Conserved Short Genomic Sequences, Jing Wu
Volume 2010, Article ID 287070, 8 pages

Algorithmic Assessment of Vaccine-Induced Selective Pressure and Its Implications on Future Vaccine Candidates, Mones S. Abu-Asab, Majid Laassri, and Hakima Amri
Volume 2010, Article ID 178069, 6 pages

Applying Small-Scale DNA Signatures as an Aid in Assembling Soybean Chromosome Sequences, Myron Peto, David M. Grant, Randy C. Shoemaker, and Steven B. Cannon
Volume 2010, Article ID 976792, 7 pages

EREM: Parameter Estimation and Ancestral Reconstruction by Expectation-Maximization Algorithm for a Probabilistic Model of Genomic Binary Characters Evolution, Liran Carmel, Yuri I. Wolf, Igor B. Rogozin, and Eugene V. Koonin
Volume 2010, Article ID 167408, 4 pages


Hindawi Publishing Corporation
Advances in Bioinformatics
Volume 2010, Article ID 643701, 2 pages
doi:10.1155/2010/643701

Editorial

Genome Evolution

Izabela Makałowska,1 Igor B. Rogozin,2 and Wojciech Makałowski3

1 Laboratory of Bioinformatics, Faculty of Biology, A. Mickiewicz University, 61-614 Poznan, Poland
2 National Center for Biotechnology Information, NIH, Bethesda, MD 20892, USA
3 Institute of Bioinformatics, University of Munster, 48149 Munster, Germany

Correspondence should be addressed to Izabela Makałowska, [email protected]

Received 31 December 2010; Accepted 31 December 2010

Copyright © 2010 Izabela Makałowska et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Evolution is a central concept of biology; it explains both the diversity and the origin of all living organisms. It is based on the observation that change is a universal feature of nature. This idea is rooted in the philosophy of Heraclitus (535–475 BC) and is best expressed by the famous phrase, panta rei, coined by Simplicius in the sixth century AD. However, only modern biology has been able to explain how changes in biological systems occur. Genetic information is stored in long molecules of deoxyribonucleic acid (DNA). The complement of this information is called a genome and may consist of one or more DNA molecules; for instance, the human nuclear genome consists of twenty-three such molecules, called chromosomes. Interestingly, almost identical information in our closest relatives (chimpanzees) is organized in twenty-four chromosomes. It is clear that, during evolution, genomes can undergo major rearrangements. These changes can be categorized into inversions, translocations, insertions, and deletions of genetic material, that is, pieces of DNA. Many of these events are driven by repetitive sequences, most notably transposable elements. It should be noted, however, that minute changes at the DNA level, such as nucleotide substitutions and single nucleotide insertions or deletions, dominate the landscape of genomic changes. Nevertheless, it is fascinating to study all these changes and be able to infer the ancestral state of genomic content.

With ever-improving sequencing technology and the declining cost of sequencing, biologists are faced with a “data tsunami.” On the one hand, this constantly growing quantity of sequences and related information creates a real problem of how to store and analyze such an amount of data. On the other hand, it gives us unprecedented opportunities to work on biological problems that until recently were unsolvable. One such problem, which is heavily data driven, is genome evolution. At the moment of this writing (June 2011), there are over 4000 complete genome sequences listed in NCBI’s Entrez Genome: 2668 viral, 1656 microbial, and 42 eukaryotic (http://www.ncbi.nlm.nih.gov/genome/). Many more are under way, for instance, over 700 eukaryotic genomes, including the first genome-scale population studies: the 1000 Genomes Project for humans (http://www.1000genomes.org) and the Drosophila Population Genomics Project (http://www.dpgp.org/). This is indeed an exciting time for those who study genome evolution. The last decade already witnessed enormous progress in understanding the structure and dynamics of genomes, and with the current progress in molecular biology technologies we may expect another revolution in evolution research.

This special issue is dedicated to genome evolution and consists of six papers: one review, four research articles, and a resource review. The issue starts with a paper about the simplest organisms: viruses. C.-R. Huang and S. J. Lo discuss the evolution of the human hepatitis delta virus (HDV) genome, which, with a length of 1.7 kb, is the smallest known virus genome. HDV is not an autonomous virus, since its genome does not code for the capsid protein; instead, it uses an envelope protein of the hepatitis B virus (HBV) for its virion assembly. Hence, it is sometimes called a satellite virus of HBV. Interestingly, HBV is the smallest known DNA virus, with a genome spanning 3.2 kb. The authors explore a range of hypotheses on the origin, evolution, and divergence of HDV.

Research papers in this issue of Advances in Bioinformatics cover a wide range of topics. C. S. M. Tang and R. J. Epstein searched the human genome for adaptive evolutionary hotspots and found two separate classes that correlate with the two extremes of GC content. Interestingly, these two extremes share many features, for example, intron length and gene expression level, with the genome isochores discovered by Bernardi in the 1970s. Based on these findings, they put forward a hypothesis about two mechanisms mediating adaptive evolution at the molecular level: “(1) intron lengthening and reduced repair in hypermethylated lowly-transcribed genes and (2) duplication and/or insertion events affecting highly-transcribed genes, creating low-essentiality satellite daughter genes in nearby regions of active chromatin.”

Annotating genes in newly sequenced genomes is one of the basic tasks in genome analysis. Yet current statistical methods fail to find the complete set of genes in a genome. J. Wu from Carnegie Mellon University presents a new method to test the protein-coding potential of conserved short genomic sequences and applies it to the human genome. Adding conservation information to the statistical models of codons increases the number of candidate regions that may code for peptides while keeping the false positive rate relatively low. The method was tested on the human genome with conservation information taken from human/mouse alignments. The procedure detected eighty-three percent of the human exons annotated in the RefSeq collection at a false positive rate of less than three percent. J. Wu was able to identify 12,688 new short regions with protein-coding potential, most of which lie in the intergenic regions of the human genome. This is a promising observation, since recent years have witnessed rapidly growing interest in long noncoding RNAs (lncRNAs), a relatively new actor on the genomic stage. However, despite many efforts, lncRNAs still hold the status of genomic “dark matter.” Indeed, while other noncoding RNA molecules, that is, ribosomal, transfer, small nuclear, antisense, small nucleolar, micro-, and Piwi-interacting RNAs, have already been assigned well-defined functional roles, the origin and function of lncRNAs remain largely unknown. Even their definition is somewhat uncertain: lncRNAs are defined as noncoding transcripts longer than ∼200 nucleotides. In addition, the evolutionary conservation of many lncRNAs is poor, they do not appear to be under direct selection, and their expression levels are low. It cannot be excluded that at least some lncRNAs encode unknown short proteins; thus, prediction of protein-coding regions is still an important avenue of research.

The fast-growing field of evolutionary medicine promises a better understanding of infectious diseases. After all, medicine is based on biology, and the two fields can only be fully integrated within an evolutionary framework. Developing a new vaccine is not a trivial task. Some fast-evolving pathogens, for example, HIV, notoriously escape our efforts to develop an effective approach. M. S. Abu-Asab et al. explore the selective pressure induced by a vaccine on infecting bacterial strains and its implications for vaccine design. They developed a phylogenetic approach to understand why a vaccine had not worked. They used predicted pilin sequences on a phylogenetic tree to assess the vaccine’s effect on Neisseria strains, in particular whether the vaccine has caused increased selection pressure on the pathogen. This method should help to reformulate vaccine design for the next round of trials. This paper clearly shows the importance of basic science in any applied field, and in medicine in particular.

One of the first tasks after obtaining sequences is to assemble them into longer pieces, with the ultimate goal of obtaining a complete genome. However, nowadays, when the whole-genome shotgun strategy dominates, the order of the sequenced pieces is unknown, making assembly challenging. The usual strategy is to assemble sequences based on sequence overlaps and clone-size information. In a new approach, M. Peto et al. explore the usefulness of DNA signatures, defined as the distribution of dinucleotides, in the assembly of chromosome sequences. This method aims at overcoming difficulties in the assembly of genomic sequences in the centromeric and pericentromeric regions caused by a lack of recombination events in these areas. The authors used dinucleotide signatures and binding energy to aid soybean genome assembly. This interesting method should be especially useful in the detection of misassembly and may be further improved by the incorporation of other genomic signals, for example, nucleosome binding potential.
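As a rough illustration of what a dinucleotide-based DNA signature might look like in practice, the following Python sketch counts overlapping dinucleotides and compares two sequences. It is a minimal sketch under stated assumptions: the function names, the use of plain relative frequencies, and the mean-absolute-difference distance are all hypothetical choices made here, and the normalization used by M. Peto et al. may differ.

```python
from collections import Counter
from itertools import product


def dinucleotide_signature(seq):
    """Relative frequencies of the 16 overlapping dinucleotides in seq."""
    seq = seq.upper()
    pairs = (seq[i:i + 2] for i in range(len(seq) - 1))
    # Ignore dinucleotides that contain ambiguous bases such as N.
    counts = Counter(p for p in pairs if set(p) <= set("ACGT"))
    total = sum(counts.values())
    keys = ["".join(p) for p in product("ACGT", repeat=2)]
    return {k: (counts[k] / total if total else 0.0) for k in keys}


def signature_distance(sig_a, sig_b):
    """Mean absolute difference between two signatures (an assumed metric)."""
    return sum(abs(sig_a[k] - sig_b[k]) for k in sig_a) / len(sig_a)


# Toy sequences, not real soybean data.
sig_one = dinucleotide_signature("ACGTACGT")
sig_two = dinucleotide_signature("AAAATTTT")
print(signature_distance(sig_one, sig_two))
```

In an assembly setting, two neighboring fragments with very different signatures could be flagged for inspection, which is one plausible way such a measure might help detect misassembly.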

This issue concludes with a paper by L. Carmel and colleagues that discusses the EREM software, which uses maximum likelihood to estimate the parameters of a probabilistic model of binary character evolution on a bifurcating phylogenetic tree. This program was successfully applied to sets of conserved genes from nineteen eukaryotic species. It was inferred that a relatively high intron density was reached early; that is, the last common ancestor of eukaryotes contained more than 2.2 introns per kilobase, a greater intron density than in many extant fungi and some animals. The rates of intron gain and intron loss appear to have been dropping during approximately the last one billion years, with the decline in the gain rate being much steeper. It seems that intron gain has been episodic and, perhaps, associated with major evolutionary transitions, for example, the origin of animals, as opposed to the more uniform (even if lineage-specific) intron loss process. Indeed, it appears certain that, for example, during the evolution of mammals (∼100 million years) and, probably, during the evolution of vertebrates (over 400 million years), there has been virtually no intron gain. Other eukaryotic lineages might have a higher intron gain rate, though, as illustrated by the evidence of apparent recent gain in nematodes. In addition to the analysis of introns, EREM can be applied to various binary characters, for example, gene content and morphological characters.
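To make the idea of a probabilistic model of binary character evolution more concrete, the sketch below computes branch transition probabilities for a generic two-state (absent/present) continuous-time Markov chain with a gain rate and a loss rate. This is a textbook model used here only for illustration; it is not EREM's code, and EREM's actual parameterization is not reproduced. The function name and rate values are hypothetical.

```python
import math


def two_state_transition(gain, loss, t):
    """Transition matrix P[i][j] over branch length t.

    State 0 = character absent (e.g., no intron), state 1 = present.
    gain is the 0 -> 1 rate and loss is the 1 -> 0 rate.
    """
    total = gain + loss
    decay = math.exp(-total * t)
    pi1 = gain / total  # stationary probability of presence
    pi0 = loss / total  # stationary probability of absence
    return [
        [pi0 + pi1 * decay, pi1 * (1.0 - decay)],
        [pi0 * (1.0 - decay), pi1 + pi0 * decay],
    ]


# Hypothetical rates: loss four times faster than gain.
P = two_state_transition(gain=0.2, loss=0.8, t=1.0)
print(P)
```

A maximum likelihood method would combine such matrices along the branches of a tree and search for the rates that best explain the observed presence/absence patterns at the leaves.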

It is worth noting that all the papers presented here were written over a year ago, at the dawn of new sequencing methods, which no doubt will bring new computational challenges that will need to be addressed by the genomic community to successfully utilize the accumulated data. However, we have no doubt that the approaches described in this special issue will be widely used by the scientific community in the near future.

Izabela Makałowska
Igor B. Rogozin
Wojciech Makałowski


Hindawi Publishing Corporation
Advances in Bioinformatics
Volume 2010, Article ID 323654, 9 pages
doi:10.1155/2010/323654

Review Article

Evolution and Diversity of the Human Hepatitis D Virus Genome

Chi-Ruei Huang1 and Szecheng J. Lo1, 2

1 Institute of Microbiology and Immunology, National Yang Ming University, Taipei 112, Taiwan
2 Department of Life Science, Chang Gung University, TaoYuan 333, Taiwan

Correspondence should be addressed to Szecheng J. Lo, [email protected]

Received 26 August 2009; Accepted 11 December 2009

Academic Editor: Izabela Makalowska

Copyright © 2010 C.-R. Huang and S. J. Lo. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Human hepatitis delta virus (HDV) is the RNA virus with the smallest known genome. The HDV genome is divided into a viroid-like sequence and a protein-coding sequence, which could have originated from different sources, and the HDV genome was eventually constituted through RNA recombination. The genome subsequently diversified through the accumulation of mutations selected by interactions of the mutated RNA and proteins with host factors to successfully form infectious virions. Therefore, we propose that the conservation of the HDV nucleotide sequence is closely related to its functionality. Genome analysis of known HDV isolates shows that the C-terminal coding sequences of the large delta antigen (LDAg) are more diverse than other regions of the protein-coding sequence, but variants that retain the biological function of interacting with the heavy chain of clathrin can be selected and maintained. Since viruses interact with many host factors, including those involved in escaping the host immune response, designing a program to predict RNA genome evolution is a great challenge.

1. Introduction

Viruses are a heterogeneous class of agents that parasitize every form of life, including not only animals and plants but also bacteria, archaea, and fungi. Viruses vary greatly in particle size and morphology and in genetic complexity and host range. DNA and RNA viruses are defined by the type of nucleic acid present in the mature virion particles. The genome size of the virus, be it a DNA or RNA virus, determines the number of proteins encoded by the viral genome. The minimal set of viral proteins includes the capsid and envelope structural proteins required for the assembly of the virion particles and a DNA or RNA polymerase for replication of the viral genome. A subclass of viruses, containing only a few members, has been called satellite viruses, since their genomes can be encapsidated by their own coat proteins but require helper viruses to provide envelopes for assembling mature virions.

The length of the genome of a DNA virus varies from a few kilobases (kb) to several hundred kb. The smallest known DNA virus is the human hepatitis B virus (HBV), which is 3.2 kb long and contains four open reading frames (ORFs) that encode the surface antigens, the core proteins, a polymerase, and an X protein [1, 2]. Together with the duck HBV (DHBV), the woodchuck hepatitis virus (WHV), and the ground squirrel hepatitis virus (GSHV), it forms a family of DNA viruses called the hepadnaviruses [3]. The genomes of RNA viruses are generally shorter than those of DNA viruses and range approximately from 2 to 31 kb. The smallest RNA virus identified to date is the human hepatitis D virus (HDV), which is about 1.7 kb in size and contains only one ORF [4–7]. HDV requires the coexistence of HBV to supply envelope proteins for its assembly into mature virions and is, hence, a defective virus, also called a satellite virus of HBV [4, 6]. Although HBV and other hepadnaviruses are found in a number of mammals, HDV has thus far been found only in humans [8]. There is another unique class of RNA-containing infectious agents called viroids. Viroid RNA is a circular genome a few hundred nucleotides long that carries no discernible coding sequences. Unlike RNA viruses, viroids exist as naked RNA without a capsid or coat-protein armor. Viroids are known to infect many species of plants and have not been found in any other life forms [9].


The origin of viruses remains elusive and debatable because there are no fossils of viruses. Currently, there are two popular hypotheses to explain the possible origin of viruses, namely, the “regressive” or “degeneracy hypothesis” and the “cellular origin” or “vagrancy hypothesis” [10, 11]. The “regressive hypothesis” is similar to the “endosymbiosis hypothesis,” which explains the origin of mitochondria and chloroplasts. The latter proposes that these two indispensable cellular organelles, which harbor their own genetic content, originated from small prokaryotic cells that came to reside within ancestral eukaryotic cells and gradually degenerated to become organelles with specific biological functions. By the same reasoning, vaccinia viruses, each of which harbors a DNA genome of about 200 kb, might have originated from smaller parasitic cells through successive reduction of the genome size within the host cells. The fact that “cells within cells,” such as Rickettsia and Chlamydia, can replicate only within host cells supports the “regressive hypothesis.”

The “cellular origin hypothesis” is based on the presence of plasmids and various types of mobile elements, such as transposons and retrotransposons, in current prokaryotes and eukaryotes. When present in ancient cells, these elements could have evolved into viruses once they had gained, through evolutionary divergence, new properties enabling them to escape from the entrapping host cells and to enter and propagate in new hosts. Retroviruses and HBV, despite having RNA and DNA genomes, respectively, might have originated from the same family of ancestral retrotransposons, since replication of both viral genomes involves reverse transcription [12, 13]. Hence, the distinguishing feature between the “regressive” and the “cellular origin” hypotheses lies in the “loss” or “gain” of genetic material, respectively. Nonetheless, the hypotheses share the prerequisite that viruses evolved subsequent to the emergence of self-sustaining cells on Earth. A third hypothesis, called the “coevolution hypothesis,” postulates that viruses evolved from the ancient soup of complex protein and nucleic acid molecules independently of the self-sustaining host cells and at the same time as the appearance of the first cells on Earth [14].

However, no single hypothesis can satisfactorily explain the origin of all known viruses. In this paper, we use the smallest human RNA virus, HDV, to illustrate how this virus could have evolved and diverged. The evolution of RNA viruses with larger genomes, such as retroviruses, flaviviruses, picornaviruses, and coronaviruses, is not a subject of this review.

2. Molecular Biology of HDV

HDV was first discovered in HBV-infected patients by an Italian physician, Mario Rizzetto, in 1977. It was originally thought to be a new nuclear antigen associated with HBV [15]. It was later proved to be a new virus that requires the surface antigens of HBV (HBsAgs) to support its life cycle and infectivity, as clearly demonstrated in experimental animals (chimpanzee and woodchuck) [8, 16, 17]. Co- or superinfection by HDV in HBV patients is closely correlated with more severe symptoms of liver disease than HBV infection alone [18, 19]. Studies have established that the HDV genome is a negative-sense circular RNA about 1.7 kb long. Electron microscopic observation has further revealed that the HDV genome appears as a rod-shaped structure under nondenaturing conditions but presents itself in a circular form under denaturing conditions [20]. Extensive intramolecular base-pairing of the single-stranded RNA molecule results in the formation of the double-stranded rod-like structure [20].

Most RNA viruses encode their own replicases, or RNA-dependent RNA polymerases (RdRp), essential for viral genome replication. However, the HDV genome does not carry a replicase gene, rendering HDV totally dependent on the host replication machinery for its propagation [21]. It has been demonstrated that HDV uses the host DNA-dependent RNA polymerases (DdRp) to replicate its genome and antigenome [21] through a double rolling-circle mechanism (Figure 1), as the majority of viroids do [22]. Unlike viroids, however, HDV requires the HDV genome-encoded small delta antigen, SDAg, for replication. Furthermore, the HDV genome is 4- to 5-fold larger than that of a viroid. In addition to the coding sequence, the HDV genome contains a viroid-like sequence in both the genome and the antigenome [23–25]. A ribozyme, a self-cleaving RNA sequence, resides in the viroid-like sequence of the HDV genome and cleaves the linear, multimeric copies of the viral genome or antigenome into monomeric units that are then circularized to complete the replication cycle (Figure 1).

The HDV genome has only one ORF, which encodes two isoforms of hepatitis delta antigen (HDAg) during replication. The small hepatitis delta antigen, SDAg, is a 24-kDa protein composed of 195 amino acid residues; the large hepatitis delta antigen, LDAg, is 27 kDa and consists of 214 residues [26, 27]. The two antigens are identical in the N-terminal 195 residues. The extra 19 residues (or 20 residues in HDV isolates from South America) at the C-terminus of LDAg result from RNA editing catalyzed by the host enzyme adenosine deaminase acting on RNA (ADAR), which converts the stop codon (UAG) of the SDAg ORF into a tryptophan codon (UGG), thus extending the coding sequence to a downstream termination codon [28–34]. In the life cycle of HDV, SDAg supports viral genome replication, while LDAg inhibits replication and promotes interaction of the HDV genome with HBsAgs for virion packaging [35–37]. Although SDAg and LDAg share the same sequence of 195 amino acid residues, the additional 19 to 20 residues of LDAg have a distinct biological function. This peptide sequence contains a nuclear export signal (NES) that facilitates nuclear export of HDAg and an isoprenylation recognition motif that interacts with HBsAgs to form the HDV virions [37–41].
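The effect of converting the UAG stop codon into a UGG tryptophan codon can be illustrated with a toy translation example. The sketch below is an assumption-laden illustration: the 15-nucleotide sequence is invented (it is not HDV sequence), the helper names are hypothetical, and the code models only the net change at the codon level, not the ADAR reaction itself. It uses the standard genetic code.

```python
# Standard genetic code, codons ordered with bases U, C, A, G at each position.
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(mrna):
    """Translate from the first codon until the first in-frame stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        residue = CODON_TABLE[mrna[i:i + 3]]
        if residue == "*":
            break
        protein.append(residue)
    return "".join(protein)


def edit_amber_stop(mrna):
    """Rewrite the first in-frame UAG as UGG (net effect of the editing)."""
    for i in range(0, len(mrna) - 2, 3):
        if mrna[i:i + 3] == "UAG":
            return mrna[:i] + "UGG" + mrna[i + 3:]
    return mrna


toy_orf = "AUGGCUUAGGGUUGA"  # invented sequence: AUG GCU UAG GGU UGA
print(translate(toy_orf))                   # short product
print(translate(edit_amber_stop(toy_orf)))  # extended product
```

In the toy case the unedited message stops at UAG, while the edited message reads through to the downstream UGA, in the same way that the text describes the 195-residue SDAg being extended to the 214-residue LDAg.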


Figure 1: HDV genome and replication. (a) Structural features of the HDV genome. The viral genome is a single-stranded circular RNA molecule composed of the viroid-like and the HDAg-coding domains. The ribozyme sequence (Rz) in the viroid-like domain is shown as a blue box. “A/G” in the HDAg-coding sequence indicates the nucleotide edited by the host ADAR, leading to the synthesis of LDAg. The beginning and the end of the circular genome are labeled as 1/1683. (b) Replication of the HDV genome by the double rolling-circle replication model. The genome and antigenome of HDV are represented by blue or red circles and labeled as “G” or “αG”, respectively. The open blue or red lines represent the primary transcription products composed of multimeric units of the viral genome or antigenome, respectively. The location of Rz is denoted by the green or blue boxes in the genome and antigenome, respectively. Arrowheads indicate the self-cleavage site of the ribozyme.

3. Diversity of HDV Genome Sequences

Based on the percentage of nucleotide identity of the genome, HDV was initially classified into three genotypes, designated genotypes I-III [4]. The distribution of the various HDV genotypes is closely associated with geographic origins and disease outcomes [18]. Genotype I HDV is distributed worldwide and causes hepatitis with a wide range of clinical severity. Genotype II HDV is found mainly in East and North Asia, including Taiwan, Japan, and Siberia, and infection usually leads to less severe clinical manifestations than genotype I infection. On the other hand, genotype III is reported to cause a severe form of fulminant hepatitis and has only been isolated in the northern area of South America. In addition, the nucleotide sequence of genotype III is the most divergent amongst all the isolated HDV sequences [42].

Recently, an increasing number of HDV isolates have been identified and sequenced. The classification of HDV has thus changed from three genotypes into eight clades, HDV1 to HDV8 [42, 43]. In the new classification scheme, HDV1 to HDV3 have replaced genotypes I to III, respectively. HDV isolates that belonged to genotype II subgroup b are now regrouped into the HDV4 clade. The most recently isolated HDV sequences, derived from African patients, are grouped into HDV5 through HDV8 [42, 43]. HDV sequences isolated from various geographic locations in Africa are found to be more divergent than sequences isolated from other continents, suggesting that the first HDV might have arisen in Africa. Two related questions are then raised: (1) How did the first HDV originate and how did it evolve? (2) Is HDV diversity correlated with the established waves of human migration across continents?

Analysis of the nucleotide sequences of 43 HDV isolates mined from public databases has revealed that the percentage nucleotide sequence identity of the complete HDV genome ranges from 64.1% to 76.4% in pairwise comparisons of the out-groups [42]. The percentage identity is lowest (64.1%) between HDV3 and HDV5; the highest value is 76.4%. The percentage identity of the SDAg coding sequence, however, ranges from 72.4% to 83.6% in the same comparisons, the lowest (72.4%) being found between HDV3 and HDV7 and the highest (83.6%) between HDV2 and HDV5 [42]. The higher percentage identity of the SDAg coding sequence compared with the complete viral genomic sequence indicates that nucleotide sequence diversity is more restricted in the functional region than in the noncoding region. Profiling of nucleotide identities from the 43 HDV isolates shows that, compared with the HDAg coding sequence, the more conserved nucleotide sequence is found in the viroid-like sequence that contains a ribozyme sequence (Figure 2). The lower diversity in the viroid-like region than in the HDAg coding sequence could be due to the usage of wobble codons in HDAg translation, which confers a higher degree of tolerance of nucleotide substitutions without affecting the amino acid residue. On the other hand, the ribozyme in the viroid-like region exerts its function directly through the RNA sequence and is, hence, less tolerant of nucleotide changes. This proposition is supported by a base substitution study that revealed a 2.4-fold higher rate of C-to-U substitution in the HDV coding sequence than in the full-length genome, echoing the frequent C-to-U codon degeneracy in the third base of codons [42].

Figure 2: Profiling of nucleotide identities of 43 HDV isolates. The vertical axis represents percentage nucleotide identity (0 to 100%) and the horizontal axis displays the whole viral genome sequence, which has a relative length of 1800 nucleotides due to the alignment. The ribozyme (Rz) in the genome and antigenome is indicated by the green or blue box, respectively. The relative genomic position of the HDAg-coding sequence is also shown (red arrow-bar).
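The percentage-identity comparisons and the identity profile along the genome discussed in this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the sequences are toy strings rather than HDV isolates, the function names are hypothetical, and the treatment of gaps (a column with a gap in only one sequence counts as a mismatch) is an arbitrary choice that may differ from the procedure used in the cited analysis [42].

```python
def percent_identity(a, b):
    """Percent identical columns between two aligned sequences ('-' = gap)."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for x, y in zip(a, b):
        if x == "-" and y == "-":
            continue  # skip columns that are gaps in both sequences
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def identity_profile(a, b, window, step):
    """Percent identity in successive windows along the alignment."""
    return [
        (start, percent_identity(a[start:start + window], b[start:start + window]))
        for start in range(0, len(a) - window + 1, step)
    ]


# Toy aligned sequences, not real isolates.
print(percent_identity("ACGU-A", "ACGAUA"))
print(identity_profile("AAAACCCC", "AAAACCGG", window=4, step=4))
```

A windowed profile of this kind is one simple way to see that some regions of an alignment, such as a ribozyme, are more conserved than others.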

4. The Origin of HDV

The current hypothesis of the origin of HDV favors the “cellular origin” concept because numerous studies have found similarities in structures and sequences between the nucleotide and protein sequences of cellular genes and the coding and viroid-like sequences of HDV.

In a genomewide search for ribozyme sequences, an HDV-like sequence was found in the human CPEB3 gene, which encodes the cytoplasmic polyadenylation element-binding protein 3 [44]. CPEB3 is a member of a family of proteins that regulate mRNA polyadenylation and are highly conserved among mammals. The ribozyme resides in the second intron of CPEB3 but is dissimilar in primary nucleotide sequence to the HDV ribozyme. However, the secondary structure of the CPEB3 ribozyme is similar to that of the HDV ribozyme, and both serve the function of self-cleavage of the multimeric precursor. These findings provide a clue that the viroid-like sequence of HDV could have arisen from an ancestral CPEB3 ribozyme in mammals. Two other HDV sequences show similarity to a cellular RNA: the HDV genomic sequence from nucleotide (nt) 683–724 is complementary to nt 10–55 of the 7SL RNA sequence, and the antigenomic sequence from nt 858–899 is complementary to nt 188–233, also of the 7SL RNA [45, 46]. Both complementary regions of the HDV sequence are located adjacent to the viroid-like region and have 73% to 77% nucleotide identity to the 7SL RNA [45].
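Complementarity of the kind reported between HDV and the 7SL RNA can be scored by pairing one segment, read 5′ to 3′, against the other read 3′ to 5′. The sketch below uses short made-up sequences, not the HDV or 7SL sequences, and assumes equal-length, gap-free segments; the regions quoted above differ in length, so a real comparison would first need a gapped alignment.

```python
WATSON_CRICK = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}
GU_WOBBLE = {("G", "U"), ("U", "G")}


def complementarity(seq_a: str, seq_b: str, allow_gu: bool = False) -> float:
    """Percentage of positions at which seq_a (5'->3') pairs with seq_b read 3'->5'."""
    a = seq_a.upper().replace("T", "U")
    b = seq_b.upper().replace("T", "U")[::-1]  # reverse to align antiparallel strands
    if len(a) != len(b) or not a:
        raise ValueError("segments must be non-empty and of equal length")
    allowed = WATSON_CRICK | GU_WOBBLE if allow_gu else WATSON_CRICK
    paired = sum((x, y) in allowed for x, y in zip(a, b))
    return 100.0 * paired / len(a)


if __name__ == "__main__":
    # Hypothetical 8-nt segments for illustration only.
    print(complementarity("GGCAUCGA", "UCGAUGCC"))  # perfect reverse complement
    print(complementarity("GGCAUCGA", "UCGAUGCA"))  # one mismatched position
```

Whether G-U wobble pairs count as complementary is a choice left to the caller through `allow_gu`; the source does not say how the published percentages treated them.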

A cellular protein, termed delta interacting protein A (DIPA), is reported to interact with HDAg and to affect HDV replication [47]. Alignment of the protein sequences of DIPA and HDAg has revealed a sequence identity of 24% and a sequence similarity of 56% [47]. DIPA and HDAg are similar in size, and both form oligomers through a coiled-coil domain. These authors have further proposed that DIPA is a cellular homolog of HDAg and that the capture of some DIPA transcripts by a viroid-like sequence could have initiated the evolution of the very first HDV. Findings of RNA recombination between different HDV genotypes in patients or in cell cultures support the idea that DIPA transcripts and a viroid-like sequence could join together [41, 48, 49]. Two different groups of researchers have further demonstrated that RNA recombination occurs through a switch in the transcription templates [50, 51].
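Identity and similarity percentages such as those quoted for DIPA and HDAg are computed column by column over an alignment. The sketch below illustrates the distinction on a toy alignment. The residue groups used for "similar" are an assumption made here for illustration; published values usually depend on a substitution matrix and on how gapped columns are counted, neither of which is specified in the text.

```python
# Hypothetical conservative-substitution groups (assumption for this sketch).
SIMILAR_GROUPS = ["AVLIM", "FYW", "KRH", "DE", "STNQ", "G", "P", "C"]
GROUP_OF = {aa: index for index, group in enumerate(SIMILAR_GROUPS) for aa in group}


def identity_and_similarity(aln_a: str, aln_b: str) -> tuple:
    """Return (% identity, % similarity) over ungapped columns of two aligned sequences."""
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have the same length")
    compared = identical = similar = 0
    for x, y in zip(aln_a.upper(), aln_b.upper()):
        if x == "-" or y == "-":
            continue  # columns with a gap are skipped in this sketch
        compared += 1
        if x == y:
            identical += 1
            similar += 1  # identical residues also count as similar
        elif x in GROUP_OF and GROUP_OF[x] == GROUP_OF.get(y):
            similar += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * identical / compared, 100.0 * similar / compared


if __name__ == "__main__":
    # Toy alignment, not the DIPA/HDAg alignment.
    print(identity_and_similarity("MKV-LE", "MRIALD"))
```

Similarity is always at least as high as identity under this definition, which matches the ordering of the two values reported in [47].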

This “cellular origin hypothesis” then raises the interesting questions of when and in what animal host(s) such RNA recombination first occurred. Since HDV has to coexist with HBV for propagation, a first guess would be that HDV evolved in animals that were susceptible to infection by hepadnaviruses. The most primitive HBV is thought to be DHBV, found in ducks, the viral genome of which contains three ORFs but lacks the X-protein ORF that is found in the human HBV [2, 52]. However, no naturally occurring HDV has so far been found in ducks, nor in two mammals, woodchucks and ground squirrels, despite the presence of the X-protein ORF in the corresponding hepadnaviruses. Taken together, it is highly likely that the first HDV originated in ancestral humans. Thereafter, HDV coevolved with HBV and selected human cellular factors to generate sequence diversity along with the many waves of human migration to different geographic localities.

5. HDV Sequence Diversity through Mutations Followed by Host Factors’ Selection

Mutations frequently occur as a result of nucleotide mismatches during DNA or RNA replication. In general, the mutation rate of RNA viruses is higher than that of DNA viruses because the fidelity of base-pairing is lower during RNA replication than during DNA replication. Diversity of the RNA genome of HDV should, therefore, be derived primarily from the accumulation of mutations. Mutations in essential sequences that impaired virion formation would have been eliminated during viral replication. In contrast, mutations in nonessential sequences could have been preserved.

In addition to the ribozyme sequence in the viroid-like region and the HDAg coding sequence, there are several cis-elements important for HDV replication. An essential cis-element is the promoter sequence that is recognized by cellular DdRps for transcription. Another important cis sequence is recognized by ADARs for RNA editing to result in LDAg production.

Among the essential sequences for HDV replication and maturation, the sequence encoding the C-terminal

Table 1: Alignment of the nucleotide sequences that encode the carboxyl terminal 19-20 residues of LDAg. The sequences are displayed in three different groups (HDV1 as Group I, HDV2 and HDV4 to HDV8 as Group II, and HDV3 as Group III). The amino acid residues that constitute the clathrin box-binding domain and the isoprenylation signal are indicated above the nucleotide sequence in a single-letter amino acid abbreviation. The nucleotide substitution percentage is indicated at right, in which the total nucleotide variations within in-groups are indicated as “All” in the first column. The percentage of nucleotide variation corresponding to the clathrin-box binding domain and the isoprenylation site is shown in the second and third columns, respectively. The accession number of each HDV isolate is indicated at bottom.

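Columns: Group; Isolates∗; Nucleotide substitution percentage (%) for All, Clathrin box, and Isoprenylation motif

I   All: 3.24 (24/741); Clathrin box: 3.60 (7/195); Isoprenylation motif: 0.19 (4/156)

Residues indicated above the sequence: L F P A D (CTCTTCCCAGCCGAT, clathrin box); C R P Q (TGTCGACCCCAG, isoprenylation motif)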

Ethiopia TGGGATATA.........T.....CCTCCCTTTTCTCCCCAGAGT---.....T......

US-1 .......................C.......CC............---............

US-2 ................................C............---.....T......

Nagasaki-2 ..............T....T.........................---............

Taiwan ......T......................................---............

TW2667 .............................................---............

China .............................................---............

Italy ..........................G..................---............

Nauru ..................T.........................C---.....T......

Cagliari ......C.....................................C---............

HDV-Iran ..................T.......A.....C............---.....T......

Lebanon ..................T....C........C............---............

Somalia ..................T..........................---............

II   Residues indicated above the sequence: L P L L E (clathrin box); C T P Q (isoprenylation motif)

Japan TGGGTAAGCCCATCTCCCCCCCAACAACGCCTTCCACTCCTCGAG---TGTACCCCCCAA

Taiwan-3 ......CA.....................................---........T...

Miyako-37 ......C......................................---.....T......

TW2476 ......C......................................---............

Yakut-26 .......AT..GGT.......GGG..G..................---............

Yakut-62 .....C.AT..GG........GGG..G..................---............

Miyako .....TGA...TGGG.G....TCT.CC..................---............

L215 .....TG....TGGGAG....TCT.CC...T..............---............

Miyako-36 .....T.A...TCAG.....T.CT.CC..............T...---............

Taiwan-Tw-2b .....T.....TCAG.....T.CT.CC..................---............

AF209859 .....T.....TCAG.....T.CT.CC..................---............

Tokyo .....T.A...TCAG.....T.CT.CC..................---............

dFr2600 .......A...GGGA....SS.CT.CC..................---.....T......

dFr2005 ...........GGGAT....GTCT.CC..................---.....T......

dFr47 ......RR...GGGA....GG.CT.CC..................---.....T.....G

dFr910 ......GA...GGGA....GG.CT.CC..................---.....T.....G

dFr73 ...........GGGAG....GTCT.CC..................---.....T......

dFr2703 ...........GGRM....SS.CT.CC..................---.....T......

dFr48 ....GT.A.A.TC......G..CT.CC..................---.....T......

dFr2139 ....GT.A.GGTC......G.TCT.CC..............A...---..C.........

dFr2627 ....GT.A.A.TC......G..CT.CC..................---.....T......

dFr45 ....G.CCTAGTC.GA......CT.CC..................---............

dFr2158 ....G.CCTAGTC.GA......CT.CC..................---............

dFr2072 ....G.CAGAGTC.G.......CT.CC..................---............

dFr644 ....G.CAAAG.C.G.......CT.CC..................---............

dFr2736 ....G.CAA.A.C.G.......CT.CC..................---............

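Group II nucleotide substitution percentage (%): All: 7.31 (25/342); Clathrin box: 0.77 (3/390); Isoprenylation motif: 4.5 (12/312). Further values printed in the All column, in the order shown: 5.26 (18/342), 3.80 (13/342), 3.51 (6/171), 0.00 (0/114), 2.34 (4/171).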

III   All: 0.42 (1/180); Clathrin box: 0; Isoprenylation motif: 0

Residues indicated above the sequence: Y Y W V P; C T Q Q

Peru-1 TGGTATGGGTTTACCCCGCCTCCCCCCGGGTATTACTGGGTCCCAGGGTGCACCCAACAA

VnzD8624 ...........C................................................

VnzD8375 ............................................................

VnzD8349 ............................................................

∗The accession numbers of the sequences used were AF209859, AF209859; Cagliari, X85253; China, X77627; dFr45, AX741144; dFr47, AX741149; dFr48, AX741164; dFr73, AX741154; dFr644, AX741169; dFr910, AX741159; dFr2005, AM183331; dFr2072, AM183330; dFr2139, AM183332; dFr2158, AM183333; dFr2600, AM183326; dFr2627, AM183329; dFr2703, AM183328; dFr2736, AM183327; Ethiopia, U81989; HDV-Iran, AY633627; Italy, X04451; Japan, X60193; L215, AB088679; Lebanon, M84917; Miyako, AF309420; Miyako-36, AB118845; Miyako-37, AB118846; Nagasaki-2, AB118849; Nauru, M58629; Peru-1, L22063; Somalia, U81988; Taiwan, M92448; Taiwan-3, U19598; Taiwan-Tw-2b, AF018077; Tokyo, AB118847; TW2476, AF104264; TW2667, AF104263; US-1, D01075; US-2, L22066; VnzD8349, AB037948; VnzD8375, AB037947; VnzD8624, AB037949; Yakut-26, AJ309879; and Yakut-62, AJ309880.
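Each entry in the substitution columns of Table 1 is written as a count of variant positions over the number of aligned nucleotides compared, for example 24/741. The sketch below shows one way such a ratio could be tallied from dot-notation rows, in which "." marks agreement with the reference and "-" marks a gap. The rows are made up for illustration, and treating every non-dot character, including ambiguity codes such as R or S, as one substitution is an assumption, since the counting rules are not stated.

```python
def substitution_percentage(rows, gap="-"):
    """Tally substitutions in dot-notation alignment rows.

    A '.' means the isolate matches the reference at that column; any other
    non-gap character is counted as one substitution. Gap columns are not
    counted as compared sites. Returns (substitutions, sites, percentage).
    """
    substitutions = 0
    sites = 0
    for row in rows:
        for symbol in row:
            if symbol == gap:
                continue
            sites += 1
            if symbol != ".":
                substitutions += 1
    if sites == 0:
        raise ValueError("no aligned sites to compare")
    return substitutions, sites, 100.0 * substitutions / sites


if __name__ == "__main__":
    # Hypothetical rows: 15 columns each, 3 of them gaps.
    toy_rows = [
        "....T....---...",
        ".........---...",
        "..C......---.G.",
    ]
    count, total, percent = substitution_percentage(toy_rows)
    print(f"{percent:.2f} ({count}/{total})")
```

Printing the result as "percentage (count/total)" mirrors the layout used in the table and makes it easy to check a reported percentage against its own numerator and denominator.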

Table 2: Alignment of the amino acid sequences of the capsid protein of the West Nile virus (WNV) with the Japanese encephalitis virus (JEV). Sequences derived from ten different isolates of WNV and JEV, respectively, were randomly selected from the public domain database for alignment. The GenBank accession number of each viral sequence is shown on the left. Red box shows the clathrin box present in the capsid of WNV but not in JEV.

WNV (alignment positions 1 to 127; the clathrin box is marked above the alignment in the original table)

AAA48498

MSKKPGGPGKNRAVNMLKRGMPRGLSLIGLKRA-MLSLIDGKGPIRFVLALLAFFRFTAIAPTRAVLDRWRGVNKQTAMKHLLSFKKELGTLTSAINRRSTKQKKRGGTAGFTILLG----LIACAGA

NP041724

MSKKPGGPGKNRAVNMLKRGMPRGLSLIGLKRA-MLSLIDGKGPIRFVLALLAFFRFTAIAPTRAVLDRWRGVNKQTAMKHLLSFKKELGTLTSAINRRSTKQKKRGGTAGFTILLG----LIACAGA

P06935

MSKKPGGPGKNRAVNMLKRGMPRGLSLIGLKRA-MLSLIDGKGPIRFVLALLAFFRFTAIAPTRAVLDRWRGVNKQTAMKHLLSFKKELGTLTSAINRRSTKQKKRGGTAGFTILLG----LIACAGA

AAT02759

MSKKPGGPGKNRAVNMLKRGMPRGLSLIGLKRA-MLSLIDGKGPIRFVLALLAFFRFTAIAPTRAVLDRWRGVNKQTAMKHLLSFKKELGTLTSAINRRSTKQKKRGGTAGFTILLG----LIACAGA

ABC49716

MSKKPGGPGKNRAVNMLKRGMPRGLSLIGLKRA-MLSLIDGKGPIRFVLALLAFFRFTAIAPTRAVLDRWRGVNKQTAMKHLLSFKKELGTLTSAINRRSTKQKKRGGTAGFTILLG----LIACAGA

ABR19636

MSKKPGGPGKNRAVNMLKRGMPRGLSLIGLKRA-MLSLIDGKGPIRFVLALLAFFRFTAIAPTRAVLDRWRGVNKQTAMKHLLSFKKELGTLTSAINRRSTKQKKRGGTAGFTILLG----LIACAGA

ABU41789

MSKKPGGPGKNRAVNMLKRGMPRGLSLIGLKRA-MLSLIDGKGPIRFVLALLAFFRFTAIAPTRAVLDRWRGVNKQTAMKHLLSFKKELGTLTSAINRRSTKQKKRGGTAGFTILLG----LIACAGA

ABR19637

MSKKPGGPGKNRAVNMLKRGMPRGLSLIGLKRA-MLSLIDGKGPIRFVLALLAFFRFTAIAPTRAVLDRWRGVNKQTAMKHLLSFKKELGTLTSAINRRSTKQKKRGGTAGFTILLG----LIACAGA

ABC49717

MSKKPGGPGKNRAVNMLKRGMPRGLSLIGLKRA-MLSLIDGKGPIRFVLALLAFFRFTAIAPTRAVLDRWRGVNKQTAMKHLLSFKKELGTLTSAINRRSTKQKKRGGTAGFTILLG----LIACAGA

ABR19639

MSKKPGGPGKNRAVNMLKRGMPRGLSLIGLKRA-MLSLIDGKGPIRFVLALLAFFRFTAIAPTRAVLDRWRGVNKQTAMKHLLSFKKELGTLTSAINRRSTKQKKRGGTAGFTILLG----LIACAGA

JEV

AAA21436

MTKKPGGPGKNRAINMLKRGLPRVFPLVGVKRVVM-SLLDGRGPVRFVLALITFFKFTALAPTKALLGRWKAVEKSVAMKHLTSFKRELGTLIDAVNKRGRKQNKRGGNEGSIMWLASLAVVIACAGA

AAA46248

MTKKPGGPGKNRAINMLKRGLPRVFPLVGVKRVVM-SLLDGRGPVRFVLALITFFKFTALAPTKALLGRWKAVEKSVAMKHLTSFKRELGTLIDAVNKRGRKQNKRGGNEGSIMWLASLAVVIACAGA

P27395

MTKKPGGPGKNRAINMLKRGLPRVFPLVGVKRVVM-SLLDGRGPVRFVLALITFFKFTALAPTKALLGRWKAVEKSVAMKHLTSFKRELGTLIDAVNKRGRKQNKRGGNEGSIMWLASLAVVIACAGA

AAA64561

MTKKPGGPGKNRAINMLKRGLPRVFPLVGVKRVVM-SLLDGRGPVRFVLALITFFKFTALAPTKALLGRWKAVEKSVAMKHLTSFKRELGTLIDAVNKRGRKQNKRGGNEGSIMWLASLAVVIACAGA

AAK11279

MTKKPGGPGKNRAINMLKRGLPRVFPLVGVKRVVM-SLLDGRGPVRFVLALITFFKFTALAPTKALSGRWKAVEKSVAMKHLTSFKRELGTLIDAVNKRGRKQNKRGGNEGSIMWLASLAVVIACAGA

BAA14219

MTKKPGGPGKNRAINMLKRGLPRVFPLVGVKRVVM-SLLDGRGPVRFVLALITFFKFTALAPTKALSGRWKAVEKSVAMKHLTSFKRELGTLIDAVNKRGRKQNKRGGNEGSIMWLASLAVVIACAGA

AAL08020

MTKKPGGPGKSRAINMLKRGLPRVFPLVGVKRVVM-SLLDGRGPVRFVLALITFFKFTALAPTKALSGRWKAVERSVAMKHLTSFKRELGTLIDTVNKRGRKQNKRGGNEGSIMWLASLAVVIACAGA

BAA14218

MTKKPGGPGKNRAINMLKRGLPRVFPLVGVKRVVM-SLLDGRGPVRFVLALITFFKFTALAPTKALLGRWKAVEKSVAMKHLTSFKRELGTLIDAVNKRGRKQNKRGGNEGSIMWLASLAVVIACAGA

P19110

MTKKPGGPGKNRAINMLKRGLPRVFPLVGVKRVVM-SLLDGRGPVRFVLALITFFKFTALAPTKALLGRWKAVEKSVAMKHLTSFKRELGTLIDAVNKRGRKQNKRGGNEGSIMWLASLAVVIACAGA

AAD16276

MTKKPGGPGKNRAINMLKRGLPRVFPLVGVKRVVM-SLLDGRGPVRFVLALITFFKFTALAPTKALLGRWKAVEKSVAMKHLTSFKRELGTLIDAVNKRGRKQNKRGGNEGSIMWLASLAVVIACAGA

peptide of LDAg is highly variable among HDV1 to HDV8: 198ILFPADPPFSPQSCRPQ214 in HDV1 (as Group I), 198GPSPPQQRLPLLECTPQ214 in HDV2 and HDV4-8 (as Group II), and 198FTPPPPGYYWVPGCTQQ214 in HDV3 (as Group III) (Table 1). The identity of the amino acid sequence within the in-groups is almost 100%; however, the nucleotide substitution within in-groups ranges from 0.42% to 7.31% (Table 1). There are three functional domains in this short peptide of LDAg: the nuclear exporting signal, the clathrin heavy chain (CHC) interacting domain, and the isoprenylation signal [39, 53–55]. No nucleotide substitution occurs in the clathrin box or the isoprenylation motif within Group III, whereas the corresponding values are 3.60% and 0.19% within Group I and 0.77% and 4.5% within Group II (Table 1). The lower identity among the various clades could be attributed to the relaxation of the amino acid sequence participating in interaction with its cellular protein counterparts and to the flexibility of cellular location. For example, the last four amino acids, called the CaaX box (C: cysteine; a: aliphatic amino acids; X: any amino acid except leucine and phenylalanine), form an isoprenylation signal required for interaction with HBsAgs for virion maturation. The isoprenylation signal of HDV1 (Group I) is 211CRPQ214, that of HDV2 and HDV4 through HDV8 (Group II) is 211CTPQ214, and that of HDV3 (Group III) is 211CTQQ214. The original sequence encoding the CaaX box may yet be unknown, but one could envisage that any nucleotide variations leading to the coding of a varied CaaX box would be maintained, as has indeed been observed in various HDV sequences.
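The CaaX definition quoted above can be written as a pattern and applied to the three C-terminal tetrapeptides. The sketch below uses two patterns: a relaxed one that enforces only the cysteine and the X rule from the text, and a stricter one whose aliphatic set (A, V, L, I, M) is an assumption made here, since the text does not list which residues count as aliphatic. The function and pattern names are illustrative.

```python
import re

# Relaxed: Cys, any two residues, then any residue except Leu (L) or Phe (F).
RELAXED_CAAX = re.compile(r"^C..[^LF]$")
# Strict: the two middle residues must come from an assumed aliphatic set.
STRICT_CAAX = re.compile(r"^C[AVLIM]{2}[^LF]$")

# Isoprenylation signals as given in the text.
HDV_SIGNALS = {
    "HDV1 (Group I)": "CRPQ",
    "HDV2 and HDV4-8 (Group II)": "CTPQ",
    "HDV3 (Group III)": "CTQQ",
}


def caax_status(protein: str) -> dict:
    """Test the last four residues of `protein` against both patterns."""
    tail = protein.upper()[-4:]
    return {
        "tail": tail,
        "relaxed": bool(RELAXED_CAAX.match(tail)),
        "strict": bool(STRICT_CAAX.match(tail)),
    }


if __name__ == "__main__":
    for clade, signal in HDV_SIGNALS.items():
        print(clade, caax_status(signal))
```

All three HDV signals satisfy the relaxed pattern but not the strict one, because R, T, P, and Q are not in the assumed aliphatic set. This is in line with the point made in the text that the signal tolerates variation at the middle positions.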

Our previous findings have shown a greater extent of diversity in the sequence encoding the CHC-interacting domain, which can vary in sequence and location [54]. HDV1 (Group I) and HDV2 and HDV4-8 (Group II) have the clathrin box sequences 199LFPAD203 and 206LPLLE210, respectively, whereas HDV3 (Group III) does not have such a sequence (Table 1). Instead, HDV3 could still form a complex with CHC through the sequence 205YYWV208 or 206YWVP209 via association with a CHC-adaptor protein, AP-2. Such binding flexibility could have further provided higher tolerance of nucleotide polymorphism in the HDV sequences.
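The residue numbers attached to these motifs follow from their offsets within the C-terminal peptides, which start at residue 198. The sketch below locates each motif named in the text by plain substring search and converts the offset to that numbering. The Group I peptide is written with a single cysteine before RPQ, matching the 17 residues implied by positions 198 to 214; the dictionary keys and function name are illustrative.

```python
FIRST_RESIDUE = 198  # numbering of the first residue of each peptide, as in the text

PEPTIDES = {
    "Group I (HDV1)": "ILFPADPPFSPQSCRPQ",
    "Group II (HDV2, HDV4-8)": "GPSPPQQRLPLLECTPQ",
    "Group III (HDV3)": "FTPPPPGYYWVPGCTQQ",
}
MOTIFS = ["LFPAD", "LPLLE", "YYWV", "YWVP"]


def locate(peptide: str, motif: str, first_residue: int = FIRST_RESIDUE):
    """Return (start, end) residue numbers of the first occurrence of `motif`, or None."""
    offset = peptide.find(motif)
    if offset < 0:
        return None
    start = first_residue + offset
    return start, start + len(motif) - 1


if __name__ == "__main__":
    for group, peptide in PEPTIDES.items():
        hits = {m: locate(peptide, m) for m in MOTIFS if locate(peptide, m)}
        print(group, hits)
```

A literal search is enough here because the text names the exact motifs; scanning unrelated proteins for clathrin boxes would need a consensus pattern, which the text does not specify.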

6. Conclusion and Perspective

Interestingly, increasing lines of evidence show that the capsid protein or genome of many other RNA viruses binds to the same cellular factors as HDV does. For example, a pull-down assay of the capsid protein of the mosquito- and blood-borne West Nile virus (WNV) has shown that the CHC is one of the interacting proteins of the viral capsid protein [56]. Although the authors did not demonstrate that the interaction between the capsid protein of WNV and CHC is important for virion packaging, we could identify a consensus sequence of the clathrin box motif in the capsid proteins of all WNV isolates but not in the capsid proteins of Japanese encephalitis virus (JEV), a Flavivirus closely related to WNV (Table 2). Another example is the binding of the HDV genome to glyceraldehyde 3-phosphate dehydrogenase (GAPDH), a key enzyme in glycolysis [57, 58]. The genomes of JEV and the tomato bushy stunt virus, a plant virus, are also found to bind to GAPDH [59, 60]. Whether the sequence of the HDV genome responsible for interaction with GAPDH also contributes to sequence diversity of the HDV genome awaits further investigation.

In contrast to many other satellite viruses, which are derived from a deletion of their helper viruses, HDV lacks any nucleotide sequence similarity to its helper virus, HBV, and evolved from two distinct nucleotide components through recombination. In addition to discussing the possible origin of HDV, in this review we proposed that HDV nucleotide diversity could result from selection through interaction with host factors. By using the programs PIST and PLATO, Anisimova and Yang analyzed all three codon positions of HDAg from 33 HDV isolates and explained that the HDV sequence diversity could result from positive selection for escape from the host immune response [61]. Nevertheless, it remains a great challenge for bioinformatic scientists to formulate a set of general rules for the interpretation of genome diversity in other RNA viruses.

Acknowledgments

The authors would like to thank Drs. Mei Chao and Kong-Bung Choo for their critical comments on this manuscript. This work was supported by Grants from the National Institute of Health Research (EX9521BI, EX9521BII, and EX9521BIII) to S. J. Lo.

References

[1] P. J. Chen and D. S. Chen, “Hepatitis B virus infection,” The New England Journal of Medicine, vol. 338, pp. 1312–1313, 1998.

[2] P. Tiollais, C. Pourcel, and A. Dejean, “The hepatitis B virus,” Nature, vol. 317, pp. 489–495, 1985.

[3] I. D. Gust, C. J. Burrell, A. G. Coulepis, W. S. Robinson, and A. J. Zuckerman, “Taxonomic classification of human hepatitis B virus,” Intervirology, vol. 25, no. 1, pp. 14–29, 1986.

[4] M. M. C. Lai, “The molecular biology of hepatitis delta virus,” Annual Review of Biochemistry, vol. 64, pp. 259–286, 1995.

[5] S. Makino, M.-F. Chang, C.-K. Shieh, et al., “Molecular cloning and sequencing of a human hepatitis delta (δ) virus RNA,” Nature, vol. 329, no. 6137, pp. 343–346, 1987.

[6] J. M. Taylor, “Hepatitis delta virus,” Virology, vol. 344, no. 1, pp. 71–76, 2006.

[7] K.-S. Wang, Q.-L. Choo, A. J. Weiner, et al., “Structure, sequence and expression of the hepatitis delta (δ) viral genome,” Nature, vol. 323, pp. 508–514, 1986.

[8] M. Rizzetto, B. Hoyer, M. G. Canese, J. W. Shih, R. H. Purcell, and J. L. Gerin, “delta agent: association of delta antigen with hepatitis B surface antigen and RNA in serum of delta-infected chimpanzees,” Proceedings of the National Academy of Sciences of the United States of America, vol. 77, no. 10, pp. 6124–6128, 1980.

[9] E. M. Tsagris, A. E. Martinez de Alba, M. Gozmanova, and K. Kalantidis, “Viroids,” Cellular Microbiology, vol. 10, no. 11, pp. 2168–2179, 2008.

[10] L. M. Iyer, S. Balaji, E. V. Koonin, and L. Aravind, “Evolutionary genomics of nucleo-cytoplasmic large DNA viruses,” Virus Research, vol. 117, no. 1, pp. 156–184, 2006.

[11] E. J. Strauss, “Intracellular pathogens: a virus joins the movement,” Current Biology, vol. 6, no. 5, pp. 504–507, 1996.

[12] G. F. Joyce, “The antiquity of RNA-based evolution,” Nature, vol. 418, no. 6894, pp. 214–221, 2002.

[13] R. H. Miller and W. S. Robinson, “Common evolutionary origin of hepatitis B virus and retroviruses,” Proceedings of the National Academy of Sciences of the United States of America, vol. 83, no. 8, pp. 2531–2535, 1986.

[14] J. Cracraft and M. J. Donoghue, Assembling the Tree of Life, Oxford University Press, New York, NY, USA, 2004.

[15] M. Rizzetto, M. G. Canese, S. Arico, et al., “Immunofluorescence detection of new antigen-antibody system (δ/anti-δ) associated to hepatitis B virus in liver and in serum of HBsAg carriers,” Gut, vol. 18, no. 12, pp. 997–1003, 1977.

[16] A. Ponzetto, P. J. Cote, H. Popper, et al., “Transmission of the hepatitis B virus-associated δ agent to the eastern woodchuck,” Proceedings of the National Academy of Sciences of the United States of America, vol. 81, no. 7, pp. 2208–2212, 1984.

[17] M. Rizzetto, M. G. Canese, J. L. Gerin, W. T. London, D. L. Sly, and R. H. Purcell, “Transmission of the hepatitis B virus-associated delta antigen to chimpanzees,” Journal of Infectious Diseases, vol. 141, no. 5, pp. 590–602, 1980.

[18] J. L. Casey, T. L. Brown, E. J. Colan, F. S. Wignall, and J. L. Gerin, “A genotype of hepatitis D virus that occurs in northern South America,” Proceedings of the National Academy of Sciences of the United States of America, vol. 90, no. 19, pp. 9016–9020, 1993.

[19] R.-N. Chien, K.-W. Chiu, C.-M. Chu, and Y.-F. Liaw, “Acute hepatitis in HBsAg carriers: comparisons among clinical features due to HDV superinfection and other etiologies,” Chinese Journal of Gastroenterology, vol. 8, pp. 8–12, 1991.

[20] A. Kos, R. Dijkema, A. C. Arnberg, P. H. van der Meide, and H. Schellekens, “The hepatitis delta (δ) virus possesses a circular RNA,” Nature, vol. 323, no. 6088, pp. 558–560, 1986.

[21] E. Lehmann, F. Brueckner, and P. Cramer, “Molecular basis of RNA-dependent RNA polymerase II activity,” Nature, vol. 450, no. 7168, pp. 445–449, 2007.

[22] P.-J. Chen, G. Kalpana, and J. Goldberg, “Structure and replication of the genome of the hepatitis δ virus,” Proceedings of the National Academy of Sciences of the United States of America, vol. 83, no. 22, pp. 8774–8778, 1986.

[23] M. H. Kolk, H. A. Heus, and C. W. Hilbers, “The structure of the isolated, central hairpin of the HDV antigenomic ribozyme: novel structural features and similarity of the loop in the ribozyme and free in solution,” EMBO Journal, vol. 16, no. 12, pp. 3685–3692, 1997.

[24] M. Y.-P. Kuo, L. Sharmeen, G. Dinter-Gottlieb, and J. Taylor, “Characterization of self-cleaving RNA sequences on the genome and antigenome of human hepatitis delta virus,” Journal of Virology, vol. 62, no. 12, pp. 4439–4444, 1988.

[25] A. T. Perrotta and M. D. Been, “A pseudoknot-like structure required for efficient self-cleavage of hepatitis delta virus RNA,” Nature, vol. 350, no. 6317, pp. 434–436, 1991.

[26] M.-F. Chang, S. C. Baker, L. H. Soe, et al., “Human hepatitis delta antigen is a nuclear phosphoprotein with RNA-binding activity,” Journal of Virology, vol. 62, no. 7, pp. 2403–2410, 1988.

[27] M. Y.-P. Kuo, J. Goldberg, L. Coates, W. Mason, J. Gerin, and J. Taylor, “Molecular cloning of hepatitis delta virus RNA from an infected woodchuck liver: sequence, structure, and applications,” Journal of Virology, vol. 62, no. 6, pp. 1855–1861, 1988.

[28] J. L. Casey and J. L. Gerin, “Hepatitis D virus RNA editing: specific modification of adenosine in the antigenomic RNA,” Journal of Virology, vol. 69, no. 12, pp. 7593–7600, 1995.

[29] X. Nie, J. Chang, and J. M. Taylor, “Alternative processing of hepatitis delta virus antigenomic RNA transcripts,” Journal of Virology, vol. 78, no. 9, pp. 4517–4524, 2004.

[30] A. G. Polson, B. L. Bass, and J. L. Casey, “RNA editing of hepatitis delta virus antigenome by dsRNA-adenosine deaminase,” Nature, vol. 380, no. 6573, pp. 454–456, 1996.

[31] A. G. Polson, H. L. Ley III, B. L. Bass, and J. L. Casey, “Hepatitis delta virus RNA editing is highly specific for the amber/W site and is suppressed by hepatitis delta antigen,” Molecular and Cellular Biology, vol. 18, pp. 1919–1926, 1998.

[32] S. Sato, S. K. Wong, and D. W. Lazinski, “Hepatitis delta virus minimal substrates competent for editing by ADAR1 and ADAR2,” Journal of Virology, vol. 75, no. 18, pp. 8547–8555, 2001.

[33] S. K. Wong and D. W. Lazinski, “Replicating hepatitis delta virus RNA is edited in the nucleus by the small form of ADAR1,” Proceedings of the National Academy of Sciences of the United States of America, vol. 99, no. 23, pp. 15118–15123, 2002.

[34] T.-T. Wu, V. V. Bichko, W.-S. Ryu, S. M. Lemon, and J. M. Taylor, “Hepatitis delta virus mutant: effect on RNA editing,” Journal of Virology, vol. 69, no. 11, pp. 7226–7231, 1995.

[35] F.-L. Chang, P.-J. Chen, S.-J. Tu, C.-J. Wang, and D.-S. Chen, “The large form of hepatitis δ antigen is crucial for assembly of hepatitis δ virus,” Proceedings of the National Academy of Sciences of the United States of America, vol. 88, no. 19, pp. 8490–8494, 1991.

[36] M. Chao, S.-Y. Hsieh, and J. Taylor, “Role of two forms of hepatitis delta virus antigen: evidence for a mechanism of self-limiting genome replication,” Journal of Virology, vol. 64, no. 10, pp. 5066–5069, 1990.

[37] S. B. Hwang and M. M. C. Lai, “Isoprenylation mediates direct protein-protein interactions between hepatitis large delta antigen and hepatitis B virus surface antigen,” Journal of Virology, vol. 67, no. 12, pp. 7659–7662, 1993.

[38] M.-F. Chang, C.-J. Chen, and S. C. Chang, “Mutational analysis of delta antigen: effect on assembly and replication of hepatitis delta virus,” Journal of Virology, vol. 68, no. 2, pp. 646–653, 1994.

[39] J. S. Glenn, J. A. Watson, C. M. Havel, and J. M. White, “Identification of a prenylation site in delta virus large antigen,” Science, vol. 256, no. 5061, pp. 1331–1333, 1992.

[40] C.-Z. Lee, P.-J. Chen, M. M. C. Lai, and D.-S. Chen, “Isoprenylation of large hepatitis delta antigen is necessary but not sufficient for hepatitis delta virus assembly,” Virology, vol. 199, no. 1, pp. 169–175, 1994.

[41] T.-C. Wang and M. Chao, “RNA recombination of hepatitis delta virus in natural mixed-genotype infection and transfected cultured cells,” Journal of Virology, vol. 79, no. 4, pp. 2221–2229, 2005.

[42] F. Le Gal, E. Gault, M.-P. Ripault, et al., “Eighth major clade for hepatitis delta virus,” Emerging Infectious Diseases, vol. 12, no. 9, pp. 1447–1450, 2006.

[43] N. Radjef, E. Gordien, V. Ivaniushina, et al., “Molecular phylogenetic analyses indicate a wide and ancient radiation of African hepatitis delta virus, suggesting a deltavirus genus of at least seven major clades,” Journal of Virology, vol. 78, pp. 2537–2544, 2004.

[44] K. Salehi-Ashtiani, A. Luptak, A. Litovchick, and J. W. Szostak, “A genomewide search for ribozymes reveals an HDV-like sequence in the human CPEB3 gene,” Science, vol. 313, no. 5794, pp. 1788–1792, 2006.

[45] F. Negro, J. L. Gerin, R. H. Purcell, and R. H. Miller, “Basis of hepatitis delta virus disease?” Nature, vol. 341, article 111, 1989.

[46] B. Young and B. Hicke, “Delta virus as a cleaver,” Nature, vol. 343, article 28, 1990.

[47] R. Brazas and D. Ganem, “A cellular homolog of hepatitis delta antigen: implications for viral replication and evolution,” Science, vol. 274, no. 5284, pp. 90–94, 1996.

[48] S. O. Gudima, J. Chang, and J. M. Taylor, “Reconstitution in cultured cells of replicating HDV RNA from pairs of less than full-length RNAs,” RNA, vol. 11, pp. 90–98, 2005.

[49] J.-C. Wu, T.-Y. Chiang, W.-K. Shine, et al., “Recombination of hepatitis D virus RNA sequences and its implications,” Molecular Biology and Evolution, vol. 16, no. 11, pp. 1622–1632, 1999.

[50] M. Chao, “RNA recombination in hepatitis delta virus: implications regarding the abilities of mammalian RNA polymerases,” Virus Research, vol. 127, no. 2, pp. 208–215, 2007.

[51] S. O. Gudima, J. Chang, and J. M. Taylor, “Restoration in vivo of defective hepatitis delta virus RNA genomes,” RNA, vol. 12, no. 6, pp. 1061–1073, 2006.

[52] M. A. Feitelson and R. H. Miller, “X gene-related sequences in the core gene of duck and heron hepatitis B viruses,” Proceedings of the National Academy of Sciences of the United States of America, vol. 85, no. 16, pp. 6162–6166, 1988.

[53] C. Huang, S. C. Chang, I.-C. Yu, Y.-G. Tsay, and M.-F. Chang, “Large hepatitis delta antigen is a novel clathrin adaptor-like protein,” Journal of Virology, vol. 81, no. 11, pp. 5985–5994, 2007.

[54] Y. C. Wang, C. R. Huang, M. Chao, and S. J. Lo, “The C-terminal sequence of the large hepatitis delta antigen is variable but retains the ability to bind clathrin,” Virology Journal, vol. 6, article 31, 2009.

[55] Y.-H. Wang, S. C. Chang, C. Huang, Y.-P. Li, C.-H. Lee, and M.-F. Chang, “Novel nuclear export signal-interacting protein, NESI, critical for the assembly of hepatitis delta virus,” Journal of Virology, vol. 79, no. 13, pp. 8113–8120, 2005.

[56] T. A. Hunt, M. D. Urbanowski, K. Kakani, L.-M. J. Law, M. A. Brinton, and T. C. Hobman, “Interactions between the West Nile virus capsid protein and the host cell-encoded phosphatase inhibitor, I2PP2A,” Cellular Microbiology, vol. 9, no. 11, pp. 2756–2766, 2007.

[57] S.-S. Lin, S. C. Chang, Y.-H. Wang, C.-Y. Sun, and M.-F. Chang, “Specific interaction between the hepatitis delta virus RNA and glyceraldehyde 3-phosphate dehydrogenase: an enhancement on ribozyme catalysis,” Virology, vol. 271, no. 1, pp. 46–57, 2000.

[58] D. Sikora, V. S. Greco-Stewart, P. Miron, and M. Pelchat, “The hepatitis delta virus RNA genome interacts with eEF1A1, p45nrb, hnRNP-L, GAPDH and ASF/SF2,” Virology, vol. 390, no. 1, pp. 71–78, 2009.

[59] R. Y.-L. Wang and P. D. Nagy, “Tomato bushy stunt virus co-opts the RNA-binding function of a host metabolic enzyme for viral genomic RNA synthesis,” Cell Host and Microbe, vol. 3, no. 3, pp. 178–187, 2008.

[60] S. H. Yang, M. L. Liu, C. F. Tien, S. J. Chou, and R. Y. Chang, “Glyceraldehyde-3-phosphate dehydrogenase (GAPDH) interaction with 3′ ends of Japanese encephalitis virus RNA and colocalization with the viral NS5 protein,” Journal of Biomedical Science, vol. 16, article 40, 2009.

[61] M. Anisimova and Z. Yang, “Molecular evolution of the hepatitis delta virus antigen gene: recombination or positive selection?” Journal of Molecular Evolution, vol. 59, pp. 815–826, 2004.

Hindawi Publishing Corporation

Advances in Bioinformatics

Volume 2010, Article ID 856825, 7 pages

doi:10.1155/2010/856825

Research Article

Adaptive Evolution Hotspots at the GC-Extremes of the Human Genome: Evidence for Two Functionally Distinct Pathways of Positive Selection

Clara S. M. Tang1 and Richard J. Epstein1, 2

1 Laboratory of Computational Oncology, Department of Medicine, The University of Hong Kong, Pokfulam, Hong Kong

2 Department of Medicine, Queen Mary Hospital, University of Hong Kong, Pokfulam Rd, Pokfulam, Hong Kong

Correspondence should be addressed to Richard J. Epstein, [email protected]

Received 25 August 2009; Revised 31 December 2009; Accepted 10 February 2010

Academic Editor: Igor B. Rogozin

Copyright © 2010 C. S. M. Tang and R. J. Epstein. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

We recently reported that the human genome is “splitting” into two gene subgroups characterised by polarised GC content (Tang et al, 2007), and that such evolutionary change may be accelerated by programmed genetic instability (Zhao et al, 2008). Here we extend this work by mapping the presence of two separate high-evolutionary-rate (Ka/Ks) hotspots in the human genome: one characterized by low GC content, high intron length, and low gene expression, and the other by high GC content, high exon number, and high gene expression. This finding suggests that at least two different mechanisms mediate adaptive genetic evolution in higher organisms: (1) intron lengthening and reduced repair in hypermethylated lowly transcribed genes, and (2) duplication and/or insertion events affecting highly transcribed genes, creating low-essentiality satellite daughter genes in nearby regions of active chromatin. Since the latter mechanism is expected to be far more efficient than the former in generating variant genes that increase fitness, these results also provide a potential explanation for the controversial value of sequence analysis in defining positively selected genes.

1. Introduction

The genomes of higher species are under negative selection to maintain complexity, yet must also remain adaptable in order to defer extinction in changing environments. The genetic mechanisms that facilitate environmental adaptation, evolvability, and/or speciation in higher organisms remain unclear [2–5]; equally controversial are the criteria for defining and/or identifying positive selection, and for distinguishing adaptive evolution from neutral divergence and genetic drift [6–8]. Geographical isolation and inbreeding accelerate positive selection [9], particularly for genes related to sexual pheromones, mate choice, fertility, or neurodevelopment, many of which have been implicated by sequence (Ka/Ks) analysis [10–12]. Whether such analyses suffice for sensitive and specific detection of positively selected genes, however, is debated [13, 14].

Positive selection does not occur randomly [15]. Relevant to this, we used methylation-sensitive dinucleotide and Ka/Ks analyses to show that promoter CpG islands act as evolutionary oscillators; that is, they are associated with increased transcription and low evolutionary rate when hypomethylated, but with low transcription and high evolutionary rate when hypermethylated [16]. Prior to this we reported a positive correlation between intron length and 3′ gene evolutionary rate, suggesting that this association reflected DNA misrepair due to intron-dependent transcriptional attrition [17]. In the present study, we have combined these experimental approaches to quantify the relative contributions of intron lengthening and methylation-dependent transcriptional silencing/mutation to gene evolutionary rates. Unexpectedly, the results implicate two separate pathways to adaptive evolution, at least one of which seems likely to involve gene duplication and/or exon insertion events affecting highly-transcribed, high-essentiality genes.

[Figure 1, panels: (a) cumulative gene count (k) against Ka/Ks; (b) frequency against GC (%) for all genes and for genes with Ka/Ks > 0.2, with 41% and 64% GC marked.]

Figure 1: Ka/Ks profile of the human genome, showing that 75% of all genes are characterized by a Ka/Ks < 0.2; that is, most are under negative selection, whereas only a small percentage is characterised by very high Ka/Ks.

2. Materials and Methods

2.1. Sequence Data. We retrieved the human genomic sequence from the University of California, Santa Cruz (UCSC) Table Browser (http://genome.ucsc.edu/) [18]. The hg18 genome assembly (NCBI build 36.1, March 2006) was used. Sequence analyses were carried out using the entire data set of approximately 24,000 RefSeq genes, of which 15,409 were informative. To prevent interspersed repeats such as Alu sequences from biasing nucleotide composition, RepeatMask sequences were used. Genes not commencing with ATG codons, or not terminating with canonical stop codons, were excluded in order to obtain the most homogeneous set of coding genes. When several genes contained identical exonic sequences, only the one with the longest genomic length was retained.
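The filtering rules in Section 2.1 can be illustrated with a minimal sketch. This is not the authors' code: the record layout (keys `name`, `cds`, `genomic_length`) and the function names are hypothetical, and identical coding sequences are used here as a stand-in for "identical exonic sequences".

```python
# Illustrative sketch only; record keys and function names are assumptions.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def is_canonical_cds(cds):
    """Return True if the coding sequence starts with ATG and ends with a stop codon."""
    cds = cds.upper()
    return len(cds) >= 6 and cds.startswith("ATG") and cds[-3:] in STOP_CODONS


def filter_genes(genes):
    """Keep canonical genes; among genes sharing a coding sequence, keep the longest.

    `genes` is a list of dicts with hypothetical keys 'name', 'cds', 'genomic_length'.
    """
    best = {}
    for gene in genes:
        if not is_canonical_cds(gene["cds"]):
            continue  # drop genes without an ATG start or a canonical stop codon
        key = gene["cds"].upper()
        if key not in best or gene["genomic_length"] > best[key]["genomic_length"]:
            best[key] = gene  # retain the copy with the longest genomic length
    return list(best.values())
```

Under these assumptions, two records with the same coding sequence collapse to the one with the larger `genomic_length`, and records failing either codon check are discarded.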

2.2. Distribution of GC Content. Distributions of coding GC % were best-fitted using the NOCOM program (http://www.genemapping.cn/nocom.htm), which is based on a counting (EM) algorithm. Under no transformation (exponent = 1), the mean, standard deviation, and proportion of each population were estimated.
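To make the mixture-fitting step concrete, the sketch below fits a two-component normal mixture by a textbook EM iteration and returns the proportion, mean, and standard deviation of each component. It is a generic illustration rather than the NOCOM implementation; the function names, the initialisation, and the fixed iteration count are assumptions.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and standard deviation."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def fit_two_gaussians(values, n_iter=200):
    """Fit a two-component normal mixture by EM.

    Returns [(proportion, mean, sd), (proportion, mean, sd)] sorted by mean.
    Generic sketch: initialisation and stopping rule are assumptions.
    """
    xs = sorted(values)
    n = len(xs)
    half = n // 2
    # Crude start: means of the lower and upper halves, common overall spread.
    means = [sum(xs[:half]) / half, sum(xs[half:]) / (n - half)]
    overall = sum(xs) / n
    sd0 = math.sqrt(sum((x - overall) ** 2 for x in xs) / n) or 1.0
    sds = [sd0, sd0]
    props = [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: posterior probability that each value belongs to each component.
        resp = []
        for x in xs:
            w = [props[k] * normal_pdf(x, means[k], sds[k]) for k in range(2)]
            total = sum(w) or 1e-300
            resp.append([wk / total for wk in w])
        # M-step: re-estimate proportion, mean, and sd from the weighted data.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            if nk < 1e-12:
                continue
            means[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, xs)) / nk
            sds[k] = max(math.sqrt(var), 1e-6)
            props[k] = nk / n
    return sorted(zip(props, means, sds), key=lambda t: t[1])
```

Applied to a list of per-gene GC percentages, the two returned tuples would correspond to a lower-GC and a higher-GC population.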

2.3. Gene Expression. The SAGEmap (Nov 2005, ftp://ftp.ncbi.nlm.nih.gov/pub/sage) of NCBI was used for quantitative evaluation of gene expression. SAGE libraries were grouped according to 26 tissue types: brain, blood, bone, bone marrow, cervix, cartilage, colon, eye, heart, kidney, liver, lung, lymph node, mammary gland, muscle, ovary, pancreas, peripheral nervous system, placenta, prostate, skin, stem cell, stomach, thyroid, vascular, and esophagus. Reliable tag-to-gene mapping of NlaIII SAGE tags to UniGene clusters was obtained from SAGEmap, and each cluster was represented by the longest RefSeq gene. Ambiguous tags mapping to more than one RefSeq gene were excluded. If a tag had been counted only once in only one tissue, it was regarded as likely due to sequencing error and was thus discounted. SAGE tags of each RefSeq gene were counted for each tissue type and normalized to counts per million. The normalized counts of each tissue were averaged across all tissue types for fair comparison between organs with different mean expression levels.
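The normalisation described in Section 2.3 can be sketched as follows. The data layout (`tag_counts` as a tissue-to-gene-to-count mapping, `library_sizes` as total tags per tissue) and the function name are hypothetical, and the singleton rule is implemented under one reading of the text: a gene whose tags total exactly one count across all tissues is discarded.

```python
def mean_cpm(tag_counts, library_sizes):
    """Average counts-per-million across tissues for each gene.

    tag_counts:    {tissue: {gene: raw tag count}}, tags already mapped unambiguously.
    library_sizes: {tissue: total number of tags sequenced for that tissue}.
    Both structures are assumptions made for this sketch.
    """
    tissues = list(tag_counts)
    # Total raw count per gene over all tissues, used for the singleton filter.
    totals = {}
    for tissue in tissues:
        for gene, count in tag_counts[tissue].items():
            totals[gene] = totals.get(gene, 0) + count
    result = {}
    for gene, total in totals.items():
        if total <= 1:
            continue  # seen once in a single tissue: treated as sequencing error
        cpm = [
            tag_counts[t].get(gene, 0) * 1e6 / library_sizes[t]
            for t in tissues
        ]
        result[gene] = sum(cpm) / len(cpm)  # unweighted mean over tissue types
    return result
```

Averaging the per-tissue values without weighting gives each tissue type equal influence, which matches the stated aim of comparing organs with different mean expression levels.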

2.4. Evolutionary Rate Determination. Homologue data in XML format were obtained from the NCBI HomoloGene database (ftp://ftp.ncbi.nih.gov/pub/HomoloGene/). Orthologous gene pairs between human and mouse, together with their synonymous substitution rate (Ks), nonsynonymous substitution rate (Ka), and their ratio (Ka/Ks), were isolated.
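As a small illustration of how the Ka/Ks values are grouped in the Results (0, 0–0.1, 0.1–0.2, and >0.2), the sketch below computes the ratio and assigns a bin. The function names are hypothetical, and the treatment of bin boundaries and of Ks = 0 is an assumption, since the text does not specify either.

```python
def ka_ks_ratio(ka, ks):
    """Return Ka/Ks, or None when Ks is zero and the ratio is undefined."""
    return None if ks == 0 else ka / ks


def ka_ks_group(ratio):
    """Assign a Ka/Ks ratio to one of the four groups used in the Results.

    Boundary handling (upper bounds inclusive) is an assumption of this sketch.
    """
    if ratio is None:
        return None
    if ratio == 0:
        return "0"
    if ratio <= 0.1:
        return "0-0.1"
    if ratio <= 0.2:
        return "0.1-0.2"
    return ">0.2"
```

Genes in the last group are the "faster-evolving" subset (Ka/Ks > 0.2) on which the subsequent analyses focus.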

3. Results

3.1. Two Separate GC-Content Peaks Are Demonstrable for Faster-Evolving Genes. To explore the finding of an overall inverse trend between GC content and Ka/Ks noted in our last study [16], we first sought to determine the nature of this relationship using a specific gene set. To this end, we used the superfamily of human genes encoding G-protein-coupled receptors, including gene subsets encoding olfactory receptors, (putative) taste receptors, and putative vomeronasal receptors. Since many members of these gene families are believed to be transcriptionally inactive in humans, we expected a higher-than-usual proportion of high Ka/Ks (“pseudogenizing”) genes. Supplementary Figure 1 (Supplementary Material available at doi:10.1155/2010/856825) suggests a negative relationship between GC content and Ka/Ks within this gene superfamily, consistent with an evolutionary role for methylation-dependent transcriptional inactivation and mutation. To extend our earlier finding of two GC-content gene modes within the human genome as a whole [16], we focused subsequent genomic analysis on a subset of genes with Ka/Ks > 0.2. This shows that most of these faster-evolving genes are characterized by GC contents less than 41%, with a relative scarcity of such genes in the 41–55% GC content range; but an additional fast-evolving gene subset is also detectable within the GC content range of 55–75% (Figure 1).

[Figure 2, panels: (a) GC < 41%; (b) GC > 64%. Each row plots intron length (kb), number of exons, and intron length per exon (kb) against Ka/Ks group (0, 0–0.1, 0.1–0.2, >0.2).]

Figure 2: Comparative relationship between low- (upper rows) and high-GC gene groups (lower rows) and intron length (left) and exon number (middle), and their ratio (right).

3.2. High-GC-Content Genes with Higher Ka/Ks Are Characterized by Relatively Higher Exon Numbers, Corrected for Gene Length, than Low-GC-Content High Ka/Ks Genes. The “golden middle” (highly regulated, intermediate-expressing genes) of the genome is reported to contain the longest genes [19], but this analysis has not been corrected for GC content. We find that subsets of rapidly evolving (Ka/Ks > 0.2) genes with low gene expression levels and breadth are identifiable within both low-GC (<41% GC content; n = 346) and high-GC (>64% GC content; n = 365) gene populations (P < 2.2 × 10⁻¹⁶ and P < .001, respectively; Table 1). In contrast, more rapidly evolving high-GC genes exhibit an increase in exon number that is disproportionate to gene length, whereas low-GC genes do not (Figure 2). This difference raises the novel possibility that faster evolution of some high-GC genes could be mediated through exon insertion events, consistent with the notion that high-GC genes tend to be located within regions of accessible chromatin.
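The P-values cited from Table 1 come from Spearman rank correlations between Ka/Ks and expression. As a reminder of what that statistic measures, the following sketch computes Spearman's rho as the Pearson correlation of average ranks. It is a generic illustration with hypothetical function names, not the authors' analysis script, and it omits the significance test.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5
```

A negative rho between per-gene Ka/Ks and expression level, as reported for both GC subsets, means that genes ranking higher in Ka/Ks tend to rank lower in expression.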

3.3. Both Low-GC and High-GC Ka/Ks Peaks Are Associated with Gene Lengthening as Transcription Declines. Three-dimensional genomic heat mapping was then used to characterise the foregoing Ka/Ks “twin peaks” in greater detail. Figure 3(a) confirms the negative relationship between GC content and gene length, while Figure 3(c) again suggests the existence of two discrete gene populations (a higher GC subgroup with shorter length, and a lower GC subgroup with higher length). The most transcribed genes tend to be those characterized by shorter gene length and intermediate-to-high GC content, with expression levels generally falling in association with longer gene/intron length (Figure 3(e)). Interestingly, genes with the highest Ka/Ks values are most obvious at lower GC and higher gene lengths (Figure 3(d), left panel), but at lower cutoffs are seen to track in a C-shaped distribution that overlies short, highly-transcribed genes and extends rightwards (i.e., in association with higher gene/intron lengths) when the two GC-extremes of the gene census are reached. Considered together with Table 1 and Figure 2, these data suggest that highly-transcribed genes (which, presumably, tend to be under strong negative selection) may give rise to less essential gene progeny via two different processes: either by gene methylation associated with reduced transcription, reduced repair of methylation damage (i.e., progressive CpG loss), and intron lengthening, or by duplication and/or exon insertions affecting stably hypomethylated (high-GC) genes.

[Figure 3, panels (a)–(e): each panel plots GC content (0.2–0.8) against LOG (intron length) (0–8); colour scales show frequency, Ka/Ks, or SAGE expression (0–100).]

Figure 3: Distribution of genes by GC content and intron length. (a) Dot plot. (b) Contour map with nearest-neighbour smoothing. (c) Contour maps with fixed-neighbour smoothing (left, 1%; right, 5%). (d, e) Contour maps of (d) Ka/Ks and (e) SAGE expression levels of genes, using different sensitivity cutoffs (left, 1%; right, 5%).

Table 1: Mean expression score (breadth and SAGE) of varying Ka/Ks groups for low and high GC genes. The data confirm that the different Ka/Ks groups so defined vary significantly in terms of gene expression levels for both low-GC (correlation coefficient −0.32, P < 2.2 × 10⁻¹⁶) and high-GC gene subsets (correlation coefficient −0.10, P = .00033), as well as in terms of expression breadth (correlation coefficient −0.35, P < 2.2 × 10⁻¹⁶, and correlation coefficient −0.098, P = .00067, respectively), using Spearman correlation.

Ka/Ks      Breadth               SAGE
           Low GC    High GC     Low GC    High GC
0          15.85     11.83       163.93    103.38
0–0.1      14.03     10.74       58.22     67.84
0.1–0.2    11.52     10.40       39.42     74.04
>0.2       9.10      8.86        32.37     57.75

3.4. Gene Evolutionary Rate Tends to Be More Rapid in High-GC Genes with Higher Ratios of Exon Number to Intron Length. A weak positive correlation exists between intron number and intron length, as expected, and two groups of outlier genes from the central distribution can be identified: shorter genes with relatively higher ratios of intron (exon) number to intron length and longer genes of relatively low exon:intron length ratio (Supplementary Figure 2). When compared using three-dimensional mapping, these latter two gene subsets are seen to differ in terms of gene expression levels and evolutionary rate, both of which appear higher in the shorter, high-exon group (Table 2; P < .03). The bimodality of high Ka/Ks genes when analysed in this way, independent of GC content, again suggests two distinct gene-altering pathways, one of which favors exon insertion over intron lengthening as a presumed adaptive mechanism.
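The group comparison cited from Table 2 uses a nonparametric Mann-Whitney test. The sketch below shows the U statistic with a two-sided P-value from the large-sample normal approximation. It is a generic illustration with a hypothetical function name: it applies no tie or continuity correction, so it will not reproduce the output of a statistics package exactly.

```python
import math


def mann_whitney_u(x, y):
    """Return (U, two-sided P) for samples x and y.

    U counts pairs (a, b) with a from x and b from y where a > b, with ties
    counted as one half. The P-value uses the normal approximation without
    tie or continuity correction (a simplification made for this sketch).
    """
    n1, n2 = len(x), len(y)
    u = 0.0
    for a in x:
        for b in y:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    mean_u = n1 * n2 / 2.0
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u - mean_u) / sd_u
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided tail probability
    return u, p
```

Here `x` and `y` would hold, for example, the Ka/Ks values of the two outlier gene subsets being compared.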

4. Discussion

Biology depends upon an environmentally-modulated balance between genetic conservation and variation [20–25], implying, paradoxically, that genetic “variability” is somehow “conserved” at the species level so that fitness may be maintained. Evolutionary devices that may fulfil this need include introns and DNA methylation [26, 27]; by promoting both transcriptional inhibition and gene sequence mutation, the latter mechanism expedites rapid structural alterations of “underperforming” (i.e., less essential, pseudogenizing) genes [28]. The efficiency of such putative random mutations in producing selectable genes that confer a biological advantage can reasonably be predicted to be low [29], however, prompting the question whether more direct adaptive pathways to genetic novelty exist.

Relevant to this issue, horizontal gene transfer is increasingly recognized as a critical contributor to adaptive genomic evolution in prokaryotes [30]. In sexually reproducing organisms, analogous “horizontal” pathways to genomic change include not only retrotransposition, but also recombination, insertional mutagenesis (including exon swapping), and gene duplication/conversion or amplification [31]. The latter mechanism is attractive from a theoretical standpoint since prior conservation of an active gene per se implies functional conferral of a fitness advantage to a complex organism [32], thereby increasing the probability that a duplicated variant will offer further survival benefits [33, 34]. Consistent with this, human segmental duplications tend to occur around core duplicons which encode primate-specific genes under positive selection [35, 36]; similarly, duplications have been reported to be centred on positive selection hotspots for mating-specific genes [11]. Moreover, just as cellular stress has been shown to facilitate gene amplification [37, 38], it is tempting to postulate that transcriptional frequency and associated chromatin accessibility could directly promote adaptive gene duplication/conversion events [39].

Table 2: Characterisation of gene subsets with differing intron/exon numbers and intron length, in terms of evolutionary rate and gene expression. The Spearman correlation coefficient (0.58, P < 2.0 × 10⁻¹⁶) was calculated for gene subgroups greater than 2 SD (intron length/number and intron number/length) from the mean.

           Short and higher intron    Long and higher intron    P-value†
           length/number              number/length
           Mean      Median           Mean      Median
Ka/Ks      0.19      0.17             0.054     0.080           <2 × 10⁻¹⁶
Breadth    10.27     9                23        11.52           0.019
SAGE       114.44    34.66            30.05     39.91           0.021

† P-value of nonparametric Mann-Whitney test.

The findings of the present study are pertinent to the latter possibility. Our unexpected identification of a rapidly evolving human gene subgroup characterised by high GC content, relatively short gene length, but high ratio of exon number to intron length compared to slowly evolving genes of similar GC content, supports the view that positive selection may occur not only through passive release of negative selection constraints, but also via a more accelerated and direct mechanism involving, say, exon insertion into GC-rich duplicates of ancestral genes characterized by high expression and tight conservation. Of note, this putative pathway of positive selection is quantitatively underestimated by studies based on point mutation (Ka/Ks) data alone, since most of the functional novelty is predicted to arise either from changes in chromosomal gene location affecting expression [39] or from exon insertion events unassociated with sequence variation. Indeed, recent work from Drummond and Wilke [40] suggests that protein misfolding may be the dominant selection pressure in metazoan evolution, casting further doubt on the equation of Ka/Ks with evolutionary rate. Interestingly, Jordan et al. have shown that gene essentiality selectively correlates with evolutionary conservation in bacterial genomes, though not in mammalian genomes [41]. These and other reports emphasise that evolutionary rate is likely influenced by many complex and heterogeneous factors.

The conclusions of our study remain limited by their inferential and non-specific nature. More direct evidence of positive selection based on experimental manipulation of gene duplication and related processes (conversion, amplification, recombination) is needed before any firm conclusions are drawn. Nonetheless, the prospect of accelerating species evolution by using global genomic techniques to promote gene duplication, even if only on an experimental basis initially, is exciting. Conversely, the possibility that maladaptive somatic processes such as cancer may be driven in part by positive selection secondary to such global genomic changes [42] is important to consider. Chromatin-based therapeutic interventions, either at the cellular (germline) or tissue (somatic) level, could be the long-term deliverable from this line of evolutionary investigation.

Conflict of Interest

There is no conflict of interest.

Authors’ Contributions

Clara S. M. Tang performed the calculations and experiments, and helped finalize the manuscript. Richard J. Epstein designed the experiments and wrote the paper.

Acknowledgment

The authors thank Dr Yongzhong Zhao, Dr David Smith, and Professors Karen Lam and Raymond Liang for assistance and support.

References

[1] Y. Zhao and R. J. Epstein, “Programmed genetic instability: a tumor-permissive mechanism for maintaining the evolvability of higher species through methylation-dependent mutation of DNA repair genes in the male germ line,” Molecular Biology and Evolution, vol. 25, no. 8, pp. 1737–1749, 2008.

[2] A. Levasseur, L. Orlando, X. Bailly, M. C. Milinkovitch, E. G. J. Danchin, and P. Pontarotti, “Conceptual bases for quantifying the role of the environment on gene evolution: the participation of positive selection and neutral evolution,” Biological Reviews, vol. 82, no. 4, pp. 551–572, 2007.

[3] M. Camps, A. Herman, E. Loh, and L. A. Loeb, “Genetic constraints on protein evolution,” Critical Reviews in Biochemistry and Molecular Biology, vol. 42, no. 5, pp. 313–326, 2007.

[4] T. L. O’Loughlin, W. M. Patrick, and I. Matsumura, “Natural history as a predictor of protein evolvability,” Protein Engineering, Design and Selection, vol. 19, no. 10, pp. 439–442, 2006.

[5] A. R. Templeton, “The reality and importance of founder speciation in evolution,” BioEssays, vol. 30, no. 5, pp. 470–479, 2008.

[6] J. C. Fay and P. J. Wittkopp, “Evaluating the role of natural selection in the evolution of gene regulation,” Heredity, vol. 100, no. 2, pp. 191–199, 2008.

[7] J. D. Jensen, A. Wong, and C. F. Aquadro, “Approaches for identifying targets of positive selection,” Trends in Genetics, vol. 23, no. 11, pp. 568–577, 2007.

[8] A. L. Hughes, “Near neutrality: leading edge of the neutral theory of molecular evolution,” Annals of the New York Academy of Sciences, vol. 1133, pp. 162–179, 2008.

[9] H. Kokko and I. Ots, “When not to avoid inbreeding,” Evolution, vol. 60, no. 3, pp. 467–475, 2006.

[10] N. L. Clark and W. J. Swanson, “Pervasive adaptive evolution in primate seminal proteins,” PLoS Genetics, vol. 1, no. 3, article no. e35, 2005.

[11] L. Horth, “Sensory genes and mate choice: evidence that duplications, mutations, and adaptive evolution alter variation in mating cue genes and their receptors,” Genomics, vol. 90, no. 2, pp. 159–175, 2007.

[12] N. L. Clark, G. D. Findlay, X. Yi, M. J. MacCoss, and W. J. Swanson, “Duplication and selection on abalone sperm lysin in an allopatric population,” Molecular Biology and Evolution, vol. 24, no. 9, pp. 2081–2090, 2007.

[13] J. Zhang, “On the evolution of codon volatility,” Genetics, vol. 169, no. 1, pp. 495–501, 2005.

[14] A. L. Hughes, “Looking for Darwin in all the wrong places: the misguided quest for positive selection at the nucleotide sequence level,” Heredity, vol. 99, no. 4, pp. 364–373, 2007.

[15] D. L. Stern and V. Orgogozo, “Is genetic evolution predictable?” Science, vol. 323, no. 5915, pp. 746–751, 2009.

[16] C. S. Tang and R. J. Epstein, “A structural split in the human genome,” PLoS ONE, vol. 2, no. 7, article no. e603, 2007.

[17] C. S. Tang, Y. Z. Zhao, D. K. Smith, and R. J. Epstein, “Intron length and accelerated 3′ gene evolution,” Genomics, vol. 88, no. 6, pp. 682–689, 2006.

[18] D. Karolchik, A. S. Hinrichs, T. S. Furey, et al., “The UCSC Table Browser data retrieval tool,” Nucleic Acids Research, vol. 32, database issue, pp. D493–D496, 2004.

[19] A. E. Vinogradov, “‘Genome design’ model and multicellular complexity: golden middle,” Nucleic Acids Research, vol. 34, no. 20, pp. 5906–5914, 2006.

[20] A. Wagner, “Robustness, evolvability, and neutrality,” FEBS Letters, vol. 579, no. 8, pp. 1772–1778, 2005.

[21] A. B. Reams and E. L. Neidle, “Selection for gene clustering by tandem duplication,” Annual Review of Microbiology, vol. 58, pp. 119–142, 2004.

[22] H. Philippe, D. Casane, S. Gribaldo, P. Lopez, and J. Meunier, “Heterotachy and functional shift in protein evolution,” IUBMB Life, vol. 55, no. 4-5, pp. 257–265, 2003.

[23] M. Lynch and J. S. Conery, “The evolutionary demography of duplicate genes,” Journal of Structural and Functional Genomics, vol. 3, no. 1–4, pp. 35–44, 2003.

[24] R. Frankham, “Stress and adaptation in conservation genetics,” Journal of Evolutionary Biology, vol. 18, no. 4, pp. 750–755, 2005.

[25] R. R. Copley, L. Goodstadt, and C. Ponting, “Eukaryotic domain evolution inferred from genome comparisons,” Current Opinion in Genetics and Development, vol. 13, no. 6, pp. 623–628, 2003.

[26] J. S. Mattick and M. J. Gagen, “The evolution of controlled multitasked gene networks: the role of introns and other noncoding RNAs in the development of complex organisms,” Molecular Biology and Evolution, vol. 18, no. 9, pp. 1611–1630, 2001.

[27] E. Beutler, T. Gelbart, J. Han, J. A. Koziol, and B. Beutler, “Evolution of the genome and the genetic code: selection at the dinucleotide level by methylation and polyribonucleotide cleavage,” Proceedings of the National Academy of Sciences of the United States of America, vol. 86, no. 1, pp. 192–196, 1989.

[28] N. G. C. Smith and L. D. Hurst, “Molecular evolution of an imprinted gene: repeatability of patterns of evolution within the mammalian insulin-like growth factor type II receptor,” Genetics, vol. 150, no. 2, pp. 823–833, 1998.

[29] J. W. Drake, “Mutations in clusters and showers,” Proceedings of the National Academy of Sciences of the United States of America, vol. 104, no. 20, pp. 8203–8204, 2007.

[30] E. V. Koonin and Y. I. Wolf, “Genomics of bacteria and archaea: the emerging dynamic view of the prokaryotic world,” Nucleic Acids Research, vol. 36, no. 21, pp. 6688–6719, 2008.

[31] M. Lynch and J. S. Conery, “The evolutionary fate and consequences of duplicate genes,” Science, vol. 290, no. 5494, pp. 1151–1155, 2000.

[32] X. He and J. Zhang, “Gene complexity and gene duplicability,” Current Biology, vol. 15, no. 11, pp. 1016–1021, 2005.

[33] R. P. Sugino and H. Innan, “Selection for more of the same product as a force to enhance concerted evolution of duplicated genes,” Trends in Genetics, vol. 22, no. 12, pp. 642–644, 2006.

[34] U. Bergthorsson, D. I. Andersson, and J. R. Roth, “Ohno’s dilemma: evolution of new genes under continuous selection,” Proceedings of the National Academy of Sciences of the United States of America, vol. 104, no. 43, pp. 17004–17009, 2007.

[35] Z. Jiang, H. Tang, M. Ventura, et al., “Ancestral reconstruction of segmental duplications reveals punctuated cores of human genome evolution,” Nature Genetics, vol. 39, no. 11, pp. 1361–1368, 2007.

[36] G. H. Perry, F. Yang, T. Marques-Bonet, et al., “Copy number variation and evolution in humans and chimpanzees,” Genome Research, vol. 18, no. 11, pp. 1698–1710, 2008.

[37] P. J. Hastings, A. Slack, J. F. Petrosino, and S. M. Rosenberg, “Adaptive amplification and point mutation are independent mechanisms: evidence for various stress-inducible mutation mechanisms,” PLoS Biology, vol. 2, no. 12, article no. e399, 2004.

[38] E. S. Slechta, K. L. Bunny, E. Kugelberg, E. Kofoid, D. I. Andersson, and J. R. Roth, “Adaptive mutation: general mutagenesis is not a programmed response to stress but results from rare coamplification of dinB with lac,” Proceedings of the National Academy of Sciences of the United States of America, vol. 100, no. 22, pp. 12847–12852, 2003.

[39] S. N. Rodin and D. V. Parkhomchuk, “Position-associated GC asymmetry of gene duplicates,” Journal of Molecular Evolution, vol. 59, no. 3, pp. 372–384, 2004.

[40] D. A. Drummond and C. O. Wilke, “Mistranslation-induced protein misfolding as a dominant constraint on coding-sequence evolution,” Cell, vol. 134, no. 2, pp. 341–352, 2008.

[41] I. K. Jordan, I. B. Rogozin, Y. I. Wolf, and E. V. Koonin, “Essential genes are more evolutionarily conserved than are nonessential genes in bacteria,” Genome Research, vol. 12, no. 6, pp. 962–968, 2002.

[42] G. V. Glazko, V. N. Babenko, E. V. Koonin, and I. B. Rogozin, “Mutational hotspots in the TP53 gene and, possibly, other tumor suppressors evolve by positive selection,” Biology Direct, vol. 1, p. 4, 2006.

Hindawi Publishing Corporation
Advances in Bioinformatics
Volume 2010, Article ID 287070, 8 pages
doi:10.1155/2010/287070

Research Article

Testing the Coding Potential of Conserved Short Genomic Sequences

Jing Wu

Department of Statistics, Carnegie Mellon University, PA 15213, USA

Correspondence should be addressed to Jing Wu, [email protected]

Received 21 September 2009; Accepted 2 January 2010

Academic Editor: Igor B. Rogozin

Copyright © 2010 Jing Wu. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

We propose a procedure to test whether a genomic sequence contains coding DNA, in which case it is called a coding potential region. The procedure tests the coding potential of conserved short genomic sequences while relaxing the assumptions on the probability models of gene structures. Thus, it is expected to add candidate regions that contain coding DNA to the current genomic database. The procedure was applied to the set of highly conserved human-mouse sequences in the genome database at the University of California at Santa Cruz. For sequences containing RefSeq coding exons, the procedure detected coding potential in 91.3% of the regions in this set, which covers 83% of the human RefSeq coding exons, at a 2.6% false positive rate. The procedure detected 12,688 novel short regions with coding potential at a false discovery rate <0.05; 65.7% of the novel regions are between annotated genes.

1. Introduction

A popular computational strategy for identifying coding DNA in the human genome is to use probability models. For example, for a single genome, one approach is to use probability models to delineate a DNA sequence into a gene composed of several parts, such as promoter regions, UTR regions, splicing sites, exons, and so forth [1]. Alternatively, by considering a genome (e.g., human) together with the genome of a suitably related species (e.g., mouse), one can combine the conserved information of the two species to develop more refined probability models for the gene portions (ROSETTA [2], CEM [3], TWINSCAN [4], SLAM [5], and SGP2 [6–8]). While these approaches have been effective in predicting genes, a noticeable drawback is that the more refined a probability model is, the more constraints a DNA sequence must satisfy to be called a gene. In effect, a highly refined probability model tends to overparameterize the problem, and thus inevitably restrains the ability of a gene prediction algorithm to identify genes, especially those that do not fit well with the “prescribed characters” delineated by the probability model; see, for example, [9]. To compensate for such restraint, some algorithms report genes that are not the best fit to the model (e.g., suboptimal genes in GENSCAN).

The limitations of existing approaches motivated our interest in identifying coding potential regions. That is, to localize regions that contain coding DNA, we develop procedures that determine the coding potential of short regions. Instead of slightly relaxing the restraints on gene structure, as in the prediction of suboptimal genes in GENSCAN, the proposed method makes as few probabilistic assumptions on gene structure as possible. The approach employs a locally smooth function, namely the lowess function [10]. The key idea is that the signal contained in each codon is generally faint and not strong enough to stand out from the background noise; fortunately, each coding exon in a gene is made of a block of codons, so a locally smooth function can pool the strength of these faint signals to determine the coding potential of the region. The proposed procedure is mainly based on probability models for the nucleotide dependency in codons and the dependency of nucleotide triplets across different sequences. A log-odds ratio is calculated for each triplet in the human genome, to measure the likelihood of the triplet being random or a codon [7, 11–13]. The intuition is that when there is a coding exon in the aligned sequences, there is an associated peak in the log-odds ratio. Therefore, the coding potential of a region can be viewed as the presence of a peak in the sequence of log-odds ratios, for which a locally smooth function is expected to be useful. The difference between the proposed method and existing gene prediction methods is that it tries to tell whether a sequence contains a coding region or not, instead of trying to obtain the boundaries of a coding exon in the sequence. The nonparametric nature of such an approach is expected to reveal regions in genes with novel structure.

[Figure 1, flowchart: (1) select the strand and the starting position of the alignment; (2) calculate the log-odds ratio $Y_i$ for each aligned triplet, triplet by triplet, $i = 1, \ldots, L$; (3) take the average of the log-odds scores over a window of size $w$: $S_{i,w} = \frac{1}{w}\sum_{j=1}^{w} Y_{i+j-1}$; (4) lowess estimation: $S_{i,w} \to \hat{S}_{i,w}$; (5) select the peak $S$ of $\hat{S}_{i,w}$, $i = 1, \ldots, L-w+1$; (6) calculate the P-value of $S$.]

Figure 1: Summary of the proposed statistical procedure.

2. Method

The proposed procedure is detailed schematically in Figure 1. First, given the likelihood of an aligned triplet pair from a codon, the aligned sequence pair is segmented into aligned triplet pairs and transformed into log-odds ratios. Second, a window frame with a given size slides through the series of log-odds ratios and the average log-odds ratio in each window frame is obtained. Third, the average log-odds ratio is smoothed by a locally smooth method [10], that is, the lowess method, which is a robust locally weighted regression. Finally, the largest local maximum of the corresponding lowess function is selected as the test statistic and an approximate p-value of the test statistic is proposed. The proposed method brings statistical tools such as the locally smooth function to the coding potential detection problem. It treats the coding potential problem as a peak hunting problem. The proposed method not only realizes the optimal accuracy suggested by [12], but also detects novel regions with high coding potential.

2.1. Hypotheses. The proposed procedure is based on the observations that functional elements, such as the codons of exons, tend to be more strongly conserved in evolution than random genomic sequences and that adjacent codons tend to depend on each other. The method is applicable to data that consists of genomic sequences of interest, called the target sequence, and sequences from a related species that are aligned to the target sequence, called the information sequence. The test of the alignment discriminates between the following hypotheses:

(H0) none of the DNA in the target sequence is coding,

(H1) a proportion of the DNA in the target sequence is coding.

Thus, a region has coding potential when (H0) is rejected.

2.2. Model. The approach to determine a region's coding potential is to use information provided by the log-odds ratio of the aligned triplet pairs in the given alignment. The log-odds ratio is defined as follows. Denote a pair of aligned sequences $X = \{h_1, \ldots, h_L; m_1, \ldots, m_L\}$, where the $h_i$ are nonoverlapping triplets in the target sequence and $m_i$ is the triplet in the information sequence aligned to $h_i$. The log-odds ratio (LOD) at each position $i$, $i = 2, \ldots, L$, is

$$\mathrm{LOD}_i = \log \frac{P_A(h_i \mid h_{i-1})\, P_B(m_i \mid h_i)}{Q_A(h_i \mid h_{i-1})\, Q_B(m_i \mid h_i)}, \qquad (1)$$

where probability matrix $P_A$ gives the conditional probability of observing codon $h_i$ given the previous codon $h_{i-1}$, $P_B$ gives the conditional probability of observing an aligned triplet $m_i$ given codon $h_i$, $Q_A$ gives the conditional probability of observing a triplet $h_i$ from noncoding regions given the previous triplet $h_{i-1}$, and $Q_B$ gives the conditional probability of observing an aligned triplet $m_i$ given $h_i$ from noncoding regions.
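As a minimal sketch of how (1) could be evaluated, the snippet below computes the log-odds ratio for positions 2 through L. The dictionary layout of the four tables, the function name `lod_scores`, the use of the natural logarithm, and the toy probabilities are all assumptions made for illustration; they are not taken from the released code.

```python
import math

def lod_scores(h, m, PA, PB, QA, QB):
    """Return [LOD_2, ..., LOD_L] for aligned triplet lists h (target) and m (information).

    Each table maps (given_triplet, observed_triplet) -> conditional probability.
    This dictionary layout is an assumption made for the sketch.
    """
    if len(h) != len(m):
        raise ValueError("h and m must contain the same number of triplets")
    scores = []
    # Position 1 has no previous triplet, so the first score is LOD_2.
    for i in range(1, len(h)):
        coding = PA[(h[i - 1], h[i])] * PB[(h[i], m[i])]
        noncoding = QA[(h[i - 1], h[i])] * QB[(h[i], m[i])]
        scores.append(math.log(coding / noncoding))  # natural log assumed
    return scores

# Toy tables with made-up probabilities, for illustration only.
PA = {("ATG", "GCC"): 0.04}
PB = {("GCC", "GCT"): 0.50}
QA = {("ATG", "GCC"): 0.01}
QB = {("GCC", "GCT"): 0.25}
toy = lod_scores(["ATG", "GCC"], ["ATG", "GCT"], PA, PB, QA, QB)
```

A positive score means the aligned triplet pair is more likely under the codon model than under the noncoding model.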

The concept of constructing a test statistic that identifies an exon based on the log-odds score is that, for a target sequence containing an exon, when the partitioning of the alignment into aligned triplets is correct, there is a position $l_0$ and a position $l_1$ such that $h_{l_0}, \ldots, h_{l_1}$ are codons while $h_1, \ldots, h_{l_0-1}$ and $h_{l_1+1}, \ldots, h_L$ are not codons. Therefore, $l_0$ and $l_1$ are the two points where the underlying distribution of $X_j = (h_j, m_j)$ switches between that of the random triplet-triplet alignment and the codon-triplet alignment, thus resulting in the log-odds ratios between $l_0$ and $l_1$ with a higher mean. When using a nonparametric method to smooth the log-odds ratios, the corresponding curve of the smoothed log-odds score versus its location in the alignment will show a peak between $l_0$ and $l_1$.


To obtain the value of the test statistic from a given alignment, the first step is to partition the alignment into aligned triplets so that the codons are in the correct frame and the correct DNA strand when the alignment contains a coding exon. To obtain the segmentation, the average log-odds ratio, $S_{i,w_0} = \sum_{j=1}^{w_0} \mathrm{LOD}_{i+j-1}/w_0$, is calculated for each block of $w_0$ aligned triplet pairs for both the alignment and the reverse complement of the alignment. The block that attains the maximum $S_{i,w_0}$ is extended toward both ends of the alignment in units of aligned triplet pairs. After removing any partial triplet pairs at both ends of the alignment, the segmentation and the strand of the alignment are obtained and denoted by $X = \{h_1, \ldots, h_L; m_1, \ldots, m_L\}$.
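The strand and frame selection described above can be sketched as an exhaustive search over two strands and three frame offsets, keeping the combination whose best block of $w_0$ triplet pairs has the highest average score. This is only an illustration under stated assumptions: `score_fn` is a hypothetical callback that returns one log-odds score per aligned triplet pair, both sequences are assumed to be gapped to equal length, and the toy scoring rule is invented.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    # Gap characters such as "-" pass through translate() unchanged.
    return seq.translate(_COMPLEMENT)[::-1]

def triplets(seq):
    # Split into nonoverlapping triplets and drop any partial triplet at the end.
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]

def best_segmentation(target, info, score_fn, w0=20):
    """Return (best_average, strand, offset) over 2 strands x 3 frames, or None."""
    best = None
    strands = (("+", target, info),
               ("-", reverse_complement(target), reverse_complement(info)))
    for strand, t, s in strands:
        for offset in range(3):
            scores = score_fn(triplets(t[offset:]), triplets(s[offset:]))
            # Average score of every block of w0 consecutive triplet pairs.
            for i in range(len(scores) - w0 + 1):
                avg = sum(scores[i:i + w0]) / w0
                if best is None or avg > best[0]:
                    best = (avg, strand, offset)
    return best

# Hypothetical scoring rule, used only to exercise the search.
def toy_score(h, m):
    return [1.0 if a == b == "ATG" else -1.0 for a, b in zip(h, m)]

seq = "C" + "ATG" * 4 + "GG"
result = best_segmentation(seq, seq, toy_score, w0=3)
```

In this toy case the forward strand with an offset of one nucleotide puts every `ATG` in frame, so it wins the search.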

Given the selected segmentation and strand, $X = \{h_1, \ldots, h_L; m_1, \ldots, m_L\}$, the average log-odds scores, $S_{i,w} = \sum_{j=1}^{w} \mathrm{LOD}_{i+j-1}/w$, are obtained for the $i$th aligned triplet pair, where $\mathrm{LOD}_k$ is defined in (1) and $w$ is a parameter. Because the nucleotides in the noncoding region are less conserved in evolution, the nucleotides in noncoding regions are assumed to be independent, so $S_{i,w}$ is approximately normally distributed when $w$ is large enough.
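The window averages $S_{i,w}$ can be computed in a single pass with a running sum. The sketch below assumes the log-odds scores are already available as a Python list; the function name is illustrative.

```python
def window_averages(lod, w):
    """Return the averages of every window of w consecutive scores.

    For n input scores the result has n - w + 1 entries.
    """
    if w <= 0 or w > len(lod):
        return []
    total = sum(lod[:w])
    averages = [total / w]
    for i in range(w, len(lod)):
        # Slide the window one position: add the new score, drop the oldest.
        total += lod[i] - lod[i - w]
        averages.append(total / w)
    return averages
```

For example, `window_averages([1, 2, 3, 4, 5], 2)` yields four averages, one per window position.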

The function lowess() in the R standard package (http://www.r-project.org/) is used to smooth the average log-odds scores. A smoothing parameter f determines the fraction of neighboring data points to be used in smoothing. Since longer exons tend to have longer alignments, f is fixed for all alignments so that the length of the exon is taken into account. Based on this smoothing, detecting the exon in the alignment is transformed into detecting a significant peak in the profile of the smoothed average log-odds scores.
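To make the role of the fraction f concrete, the following is a simplified stand-in for lowess, not a reimplementation of R's lowess(): it fits a tricube-weighted local line at each position using the nearest fraction f of the points and omits the robustness iterations of [10]. The function name and the handling of very short inputs are assumptions.

```python
import math

def local_linear_smooth(y, f=1/3):
    """Tricube-weighted local linear fit at each index (no robustness iterations)."""
    n = len(y)
    if n < 3:
        return [float(v) for v in y]
    k = min(n, max(3, math.ceil(f * n)))  # number of neighbours in each local fit
    fitted = []
    for i in range(n):
        # Bandwidth: distance from i to its k-th nearest index (i itself included).
        h = sorted(abs(i - j) for j in range(n))[k - 1]
        sw = swx = swy = swxx = swxy = 0.0
        for j in range(n):
            d = abs(j - i) / h
            if d >= 1.0:
                continue  # outside the local neighbourhood
            wt = (1.0 - d ** 3) ** 3  # tricube weight
            dx = j - i
            sw += wt
            swx += wt * dx
            swy += wt * y[j]
            swxx += wt * dx * dx
            swxy += wt * dx * y[j]
        denom = sw * swxx - swx * swx
        if abs(denom) < 1e-12:
            fitted.append(swy / sw)  # too few points for a slope: weighted mean
        else:
            # Intercept of the weighted least-squares line, i.e. the fit at dx = 0.
            fitted.append((swy * swxx - swx * swxy) / denom)
    return fitted
```

A larger f widens each neighbourhood, which flattens narrow humps; this is the trade-off behind the choice of f discussed in Section 2.4.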

The largest of the local maxima of the lowess estimation, denoted by S, is selected as the test statistic. The selection of the local maxima is performed by the function ppc.peaks() in the R package ppc developed by Tibshirani et al. [14], in which the parameter span is set to the same value as f in the lowess function.
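The peak selection step can be illustrated with a plain neighbour comparison. This is a simplified stand-in and does not reproduce ppc.peaks() or its span parameter; treating only interior points as candidate peaks is an assumption of the sketch.

```python
def largest_local_maximum(smoothed):
    """Return (index, value) of the highest interior local maximum, or None."""
    best = None
    for i in range(1, len(smoothed) - 1):
        is_peak = smoothed[i] > smoothed[i - 1] and smoothed[i] >= smoothed[i + 1]
        if is_peak and (best is None or smoothed[i] > best[1]):
            best = (i, smoothed[i])
    return best
```

When the function returns `None`, no peak was found, which corresponds to the case where the p-value is set to 1.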

Finally, the p-value of the test statistic S is approximated by the extreme-value distribution of normal random variables. Specifically, since the scores $S_{i,w}$ are normally distributed, the lowess-smoothed scores are also normally distributed [10]. Moreover, although the smoothed scores are locally dependent, for simplicity they are treated as if they were independent under the null hypothesis. Denoting $P_0$ as the probability that a peak exists in the alignment and assuming that the smoothed score is from a normal distribution, by Bayes' rule, the approximate p-value for S is

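$$p = P\left(\max\left(S_{1,w}, \ldots, S_{L-w+1,w}\right) > S \mid \text{peak exists under } H_0\right) P_0 \approx \left(1 - P\left(Z < \frac{S - \mu}{\sigma}\right)^{L-w+1}\right) P_0, \qquad (2)$$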

where $Z \sim N(0, 1)$. The p-value is set as p = 1 when no peak is found. Given a significance level α, when the p-value of an alignment is less than α, the alternative that the alignment contains coding DNA is supported. When testing k alignments, the p-values, $p_1, \ldots, p_k$, are transformed into q-values to control the false discovery rate [15, 16], where the false discovery rate is the proportion of false rejections of $H_0$ among the total number of rejections of $H_0$. That is, denote $r_i$ as the rank of $p_i$ with the smallest p-value ranked as 1 and let

$$q_i = \min\left(\frac{k\, p_i}{r_i},\ 1\right), \qquad (3)$$

then the expected number of false positives is $\le r_{i_0}\alpha$, where $r_{i_0} = \max\{r_i : q_i < \alpha\}$.
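The transformation in (3) can be sketched directly from its definition. The snippet applies the formula as stated, without any further monotonicity adjustment; breaking ties between equal p-values by input order is an assumption.

```python
def q_values(p_values):
    """Apply q_i = min(k * p_i / r_i, 1), with r_i the rank of p_i (smallest = 1)."""
    k = len(p_values)
    # Indices sorted by p-value; ties keep their input order (an assumption).
    order = sorted(range(k), key=lambda i: p_values[i])
    q = [0.0] * k
    for rank, i in enumerate(order, start=1):
        q[i] = min(k * p_values[i] / rank, 1.0)
    return q
```

For instance, with three p-values 0.01, 0.04, and 0.03 the ranks are 1, 3, and 2, so the q-values are 3(0.01)/1, 3(0.04)/3, and 3(0.03)/2.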

2.3. Datasets. The proposed method is assessed on the set of highly conserved human-mouse pairwise alignments, that is, the axtTight directory of the UCSC genome database in Human May 2004 (hg17) (http://hgdownload.cse.ucsc.edu/goldenPath/hg17/vsMm5/axtTight/). This axtTight folder contains the latest version of a highly conserved subset of the best alignments with mouse sequences for any part of the human genome; it remains the same although the genome database has been updated to hg19. The alignments are quite short; about 95% of the human sequences in this set are <597 bps. An interesting feature of this set is that, although it was obtained without the knowledge of gene structure, it contains a subset that heavily overlaps with the set of human RefSeq coding exons [17, 18] (http://www.ncbi.nih.gov/RefSeq/) in the genome database at UCSC (http://genome.ucsc.edu/cgi-bin/hgTables), May 2004, which has 172,042 exons nonoverlapping with each other. The human sequences in the axtTight folder overlap with 91.2% of human RefSeq coding exons; among these sequences, 94.8% overlap with only one RefSeq coding exon each, 4.0% overlap with only two RefSeq coding exons, and the average percentage of coding DNA in the human sequences that overlap with human RefSeq coding exons is 67%. Thus, the human sequences in this folder were used both for evaluating the procedure and for determining novel regions with coding potential. To be consistent with the coordinates of the sequences in the axtTight folder, the parameters for the proposed method were estimated from the sequences in the assembly of hg17.

Since the proposed method tests whether coding DNAs are embedded in the target sequence, the positive set consists of alignments whose target sequence contains a coding exon with noncoding DNA flanking it. The negative set consists of alignments whose target sequence does not have evidence of coding DNA.

In order to determine regions with coding potential in the axtTight folder, the human sequences were extracted from the alignments in the axtTight folder, and each sequence was extended 50 bps on each end and paired with the mouse sequence according to the alignments in the axtNet folder (http://hgdownload.cse.ucsc.edu/goldenPath/hg17/vsMm5/axtNet/). The alignments that are longer than 150 bps were kept. The human sequences of the alignments (before extension) overlapping with RefSeq coding exons are called the conserved coding potential regions. Among these alignments, 3,000 were randomly selected as a training set. The human sequences in the axtTight folder whose extended alignments are longer than 150 bps but do not overlap with the human RefSeq coding exons are called candidate coding potential regions. The total number of conserved coding potential regions is 146,254, which corresponds to 3.9 × 10^7 bps and includes 156,928 RefSeq coding exons. The average percentage of coding DNA in the human sequence of the extended alignment of the conserved coding potential region is 43%. The total number of candidate coding potential regions is 751,313, corresponding to 1.2 × 10^8 bps. To show the robustness of the proposed method, the human-dog alignments of the extended conserved coding potential regions were also extracted from hg17. In this set, the average percentage of coding DNA in the human sequence of the extended alignment of the conserved coding potential region is 38% since more noncoding flanking DNAs are conserved between human and dog.

To simulate aligned conserved noncoding regions, we first estimated the conditional probability of the adjacent nucleotide triplet pair in human, the aligned nucleotide triplet pair between human and mouse, and the length distribution of conserved noncoding regions from the set of aligned human-mouse sequences called the alignment of potential nonexons [7]. These sequences do not overlap with any known genes or ESTs. The coordinates of the potential nonexons from [7] were lifted from hg12 to the assembly of hg17 in UCSC's genome database by the batch coordinate conversion (http://genome.ucsc.edu/cgi-bin/hgLiftOver). The alignments of potential nonexons were then extracted from the axtNet folder in UCSC's genome database (hg17) and 20,000 alignments were randomly selected as a training set. Based on the estimated probabilities and the length distribution from the training set for the alignment of noncoding regions, 15,062 paired sequences were simulated. Among them, 10,305 paired sequences are longer than 150 bps and are used as noncoding regions to evaluate the proposed procedure.

Finally, to analyze the coding potential regions detected from the axtTight folder, the predictions of existing gene and pseudogene prediction algorithms listed in Table 1 from the genes and gene prediction tracks in UCSC's genome database (http://genome.ucsc.edu/cgi-bin/hgTables, human, May 2004) were downloaded.

2.4. Training the Model. In order to apply the testing procedure, the probabilities under the codon model and the noncoding region model in (1) were estimated. The conditional probability of two triplets is estimated by the joint counts from the alignments in the training sets. That is,

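$$
\begin{aligned}
P_A(h \mid h') &= \frac{\text{Number of pairs } (h'h) + e}{\text{Number of } h' + 125e}, \\
P_B(m \mid h) &= \frac{\text{Number of pairs } (hm) + e}{\text{Number of } h + 125e}, \\
Q_A(a \mid a') &= \frac{\text{Number of pairs } (a'a) + e}{\text{Number of } a' + 125e}, \\
Q_B(b \mid a) &= \frac{\text{Number of pairs } (ab) + e}{\text{Number of } a + 125e},
\end{aligned}
\qquad (4)
$$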

Table 1: The following tables in UCSC's genome database are used to analyze the coding potential regions detected from the human sequences in the axtTight folder in UCSC's genome database.

| Tracks | URL |
|---|---|
| RefSeq [17, 18] | http://www.ncbi.nih.gov/Refseq/ |
| Known genes [19] | |
| TWINSCAN [4] | |
| GENSCAN [1] | |
| SGP [20] | http://nemo.imim.es/grib/ |
| ENSEMBL | http://www.ensembl.org/ |
| GENEID [21] | http://www1.imim.es/software/geneid/index.html |
| AUGUSTUS [22] | |
| ECgene [23] | http://genome.ewha.ac.kr/ECgene/ |
| MGC [24] | |
| AceView [25] | http://www.ncbi.nih.gov/IEB/Research/Acembly/index.html |
| CCDS [18, 26] | |
| Nonhuman RefSeq [24] | |
| Retropose [27] | |
| Yale Pseudo [28] | http://www.pseudogene.org/ |
| Vega | http://vega.sanger.ac.uk/ |
| Vega pseudogenes | http://vega.sanger.ac.uk/ |
| UniGene [29] | |

where e = 1 is the pseudocount added, $h$ and $h'$ are adjacent codons in conserved coding regions, $m$ is the triplet aligned to $h$, $a$ and $a'$ are adjacent triplets in potential nonexons, and $b$ is the triplet aligned to $a$. Each probability matrix is of dimension 125 × 125. The probability matrices can be downloaded from http://www.stat.cmu.edu/∼jwu/axtTight/probs/. For any two nucleotide triplets $c_1c_2c_3$ and $d_1d_2d_3$, with $c_k, d_k \in \{A, C, G, T, \text{indel}\}$, the nucleotides are coded as A = 0, T = 1, G = 2, C = 3, indel = 4, and $P(d_1d_2d_3 \mid c_1c_2c_3)$ corresponds to the $(i, j)$th entry with $i = 25c_1 + 5c_2 + c_3$ and $j = 25d_1 + 5d_2 + d_3$.
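The estimator in (4) and the index encoding can be sketched as follows. The coding A = 0, T = 1, G = 2, C = 3, indel = 4 follows the text; writing the indel as `"-"`, the function names, and the nested-list matrix layout are assumptions for illustration.

```python
from collections import Counter

# Nucleotide codes from the text; "-" as the indel symbol is an assumption.
CODE = {"A": 0, "T": 1, "G": 2, "C": 3, "-": 4}

def triplet_index(triplet):
    """Map a triplet c1c2c3 to 25*c1 + 5*c2 + c3, a value in 0..124."""
    c1, c2, c3 = (CODE[ch] for ch in triplet)
    return 25 * c1 + 5 * c2 + c3

def estimate_matrix(pairs, e=1.0):
    """Estimate P(observed | given) from (given, observed) triplet pairs.

    Returns a 125 x 125 nested list whose row i holds the distribution
    conditional on the triplet with index i, using pseudocount e.
    """
    pair_counts = Counter()
    given_counts = Counter()
    for given, observed in pairs:
        gi, oi = triplet_index(given), triplet_index(observed)
        pair_counts[(gi, oi)] += 1
        given_counts[gi] += 1
    return [[(pair_counts[(i, j)] + e) / (given_counts[i] + 125 * e)
             for j in range(125)]
            for i in range(125)]
```

With this pseudocount, every row sums to one, and a conditioning triplet that never occurs in the training data gets the uniform value 1/125 in each column.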

The window sizes are set at $w_0 = 20$ and $w = 9$, which correspond, respectively, to the 10th and the 2nd percentile of the length distribution (in units of triplets) of the exons in the training set. The normal qq-plot in Figure 2 indicates that the distribution of the score $S_{i,w}$ is approximately normal, which is consistent with the assumption for the p-value calculation.

The estimated mean and variance of the log-odds scores for the simulated triplets are −0.66 and 1.58, respectively. Since $w = 9$, the estimated parameters in (2) are $\mu = -0.66$ and $\sigma = 1.58/3 = 0.527$. For each alignment in the test sets, the p-value is $p \approx \left(1 - P(Z < (S + 0.66)/0.527)^{L-8}\right) \times P_0$, where $Z$ is from the standard normal $N(0, 1)$ and $L$ is the number of log-odds scores.

Lastly, the parameter f in lowess() and span = f in ppc.peaks() are selected by testing the alignment in the training set of conserved coding regions and potential noncoding DNAs. An appropriate f uses as many of the


Figure 2: A normal qq-plot of the averaged log-odds scores from the simulated sequences. (Axes: Sample, horizontal; Theoretical, vertical.)

neighboring scores as possible to smooth the averaged log-odds score in the center of the exon in the coding region but includes few scores from noncoding DNAs. Since in the extended alignment of conserved coding regions, on average, each alignment contains 43% coding DNAs, only f ≤ 0.5 were considered. To select f, values 1/4, 1/3, and 1/2 were evaluated on the training datasets. For each f, $P_0$ is estimated by the observed relative frequency of the potential nonexon alignments having a peak and then the p-value in (2) is obtained for each alignment. Among them, the p-values from f = 1/3 best separate the extended alignments in the training set of conserved coding regions from potential nonexons. Thus, the parameter f in lowess() is set as f = 1/3 and then the estimated probability of observing at least one peak in noncoding regions is $P_0 = 0.04$. For each alignment in the test sets, the p-value is $p \approx \left(1 - P(Z < (S + 0.66)/0.527)^{L-8}\right) \times 0.04$, where $Z$ is from $N(0, 1)$ and $L$ is the number of log-odds scores.
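With the trained values above, the approximate p-value in (2) reduces to a few lines. The sketch below uses the reported $\mu = -0.66$, $\sigma = 0.527$, $P_0 = 0.04$, and $w = 9$ as defaults; the function names and the convention of passing `None` when no peak is found are assumptions.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def peak_p_value(S, L, w=9, mu=-0.66, sigma=0.527, p0=0.04):
    """Approximate p-value of the peak statistic S for L log-odds scores.

    S is None when no peak was found, in which case the p-value is 1.
    """
    if S is None:
        return 1.0
    z = (S - mu) / sigma
    n_windows = L - w + 1  # number of window averages, L - 8 when w = 9
    return (1.0 - normal_cdf(z) ** n_windows) * p0
```

Under this approximation the p-value shrinks as the peak height grows and rises with the number of window positions, since a longer alignment offers more chances for a high peak under the null hypothesis.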

3. Results

The procedure is tested on the human sequences in the axtTight folder in UCSC's genome database (http://hgdownload.cse.ucsc.edu/goldenPath/hg17/vsMm5/axtTight/). From this set, the procedure detected 91.1% of the conserved coding potential regions using human-mouse alignments, with an estimated 2.6% false positive rate, covering 83% of the entire human RefSeq coding exons. At the same false positive rate, it also detects 90.7% of the conserved coding potential regions using human-dog alignments. Among the detected conserved coding potential regions from human-mouse alignments, many contain short coding exons and coding exons with alternative splicing sites, which existing gene prediction algorithms tend to miss. In addition, the procedure identified 12,688 human sequences at the false discovery rate <0.05; among them, 57 overlap with nonhuman RefSeq coding exons [24], 65.7% are between annotated genes, and 41.4% have UniGene [29] matches, indicating that these regions may contain novel coding exons.

Figure 3: Identifying a coding potential region in chromosome 1: 1058121-1058365 from assembly hg17. The position is in units of triplets. The codons are at position 25–56. (Panel title: "Example: an extended coding region"; horizontal axis: Position; vertical axis: Average log-odds score; legend: Scores, Lowess fit, Peak.)

3.1. Detecting Coding Potential Regions from the Datasets. Figure 3 illustrates an example of identifying a coding potential region of human chromosome 1: 1058121-1058365, in which 1058195-1058290 is a coding exon.

The plot in Figure 3 shows the 74 averaged log-odds scores from a selected segmentation of the alignment of that conserved region. From the lowess fit and peak selection, as indicated by the solid curve and the cross patch, respectively, the value of the test statistic is obtained at the peak S = 1.199 having P = 0.0004.

The performance of the proposed procedure is compared with the results of Nekrutenko et al. (2002) [12]. Their study shows that, when an aligned sequence is either an aligned coding exon with codon frame known (true positive) or an aligned random sequence (true negative), the likelihood ratio test attains the true positive rate (TP) of 90.5% and the false positive rate (FP) of 2.6%. This result can be viewed as the best accuracy that coding potential region detection methods can attain using only conservation information since the true positive set assumes that the coding exon frames are known. Our negative set includes 10,305 simulated paired sequences that are at least 150 bps. This set is comparable to the number of simulated paired sequences used in [12], which is 24,000 without length limitation. To detect the coding potential region when the coding exon frames are unknown, the error rates of the proposed method are calculated as follows. Given a threshold α, the true positive rate is the fraction of the total number of conserved coding potential regions whose alignment has p < α and the false positive rate is the fraction of the total number of simulated alignments having p < α. The results are summarized in Table 2.
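The two error rates defined above are simple fractions of p-values below the threshold. A minimal sketch, assuming the p-values of the conserved coding potential regions and of the simulated alignments are held in two non-empty lists (the function and argument names are illustrative):

```python
def detection_rates(coding_p, simulated_p, alpha):
    """Return (true_positive_rate, false_positive_rate) at threshold alpha.

    coding_p: p-values of alignments of conserved coding potential regions.
    simulated_p: p-values of simulated noncoding alignments.
    Both lists are assumed to be non-empty.
    """
    tp_rate = sum(p < alpha for p in coding_p) / len(coding_p)
    fp_rate = sum(p < alpha for p in simulated_p) / len(simulated_p)
    return tp_rate, fp_rate
```

Sweeping `alpha` until the false positive rate matches a reference value is one way to obtain a threshold comparable to that of another method, as is done for Table 2.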

To further study the coding potential regions detected in the test set, we compared the detection on RefSeq coding exons with GENSCAN and TWINSCAN with regard to the type of exons and summarized the results in Table 3. For single exons, because the gene structure is simple, GENSCAN


Table 2: The detection of coding potential regions in the human-mouse conserved regions. The table lists the number of alignments and the corresponding base pairs of the human sequences in each test set. The true positive rates and false positive rates correspond to the number of alignments that have p-value less than α = 0.0387 by the present method, where the method with the parameters estimated from human-mouse training sets was applied both to the human-mouse alignments and human-dog alignments. The row of KA/KS is cited from [12]. The threshold is set so that the false positive rate of the proposed method is the same as that of [12].

| | Conserved coding regions (TP) | Simulated random sequence pair (FP) |
|---|---|---|
| Size | 146,254 (3.9 × 10^7 bps) | 10,305 (6.8 × 10^6 bps) |
| Peak p < 0.0387 (mouse) | 91.3% | 2.6% |
| Peak p < 0.0387 (dog) | 90.7% | 2.6% |
| KA/KS | 90.5% | 2.6% |

Table 3: The distribution of RefSeq coding exons contained in the regions detected by the proposed method compared with those predicted by GENSCAN and TWINSCAN according to the types of exons: initial, internal, final, and single, where single refers to exons of single exon genes.

| Exon type | Initial | Internal | Final | Single |
|---|---|---|---|---|
| Peak p < 0.0387 (mouse) | 90.1% | 94.3% | 81.7% | 80.1% |
| GENSCAN | 81.8% | 86.8% | 78.3% | 91.5% |
| TWINSCAN | 30.4% | 29.9% | 42.4% | 73.9% |

can take the full advantage of the gene structure without the conservation limit; it is able to identify most single exons. Using sequence conservation limited the ability to identify unconserved genes as shown by the predictions from TWINSCAN and the proposed method.

We also compared the results with the internal exons predicted by MZEF [30], as listed in Table 1 of [30]. We identified the locations of 22 genes in UCSC's genome database. Since the genomic region has expanded over the years, we compare the percentage of the internal exons identified relative to the number of internal exons available to both methods per gene. Among these genes, the proposed method had a higher call rate than MZEF on internal exons in 9 genes and had a lower call rate on those in another 10 genes. The average call rate for the proposed method on the 22 genes is 76% while that of MZEF is 83%. On the other hand, when only counting the regions available in the test set, the average call rate for the proposed method is 88.6%.

We examined the regions that are conserved noncoding regions defined by PhastCons [31]. PhastCons defined 39% of the sequences in the axtTight set as conserved noncoding regions, and in the subset of sequences with coding potential with p < 0.0387, only 22% are defined as conserved noncoding regions. We also evaluated the structured RNAs in the ENCODE [32] regions, that is, Vienna RNAz [33]. We downloaded the encodeUViennaRnaz table from UCSC's genome database. Among the total 3,346 conserved RNA regions in the encodeUViennaRnaz table, our dataset axtTight overlaps with 489 regions and 251 of them have a p-value < 0.038. We also examined closely the regions that were not predicted by those computational algorithms in Table 1 and found that most of those regions contain coding exons of alternative splicing sites or very short coding exons. For example, the region chr1:198070085-198070137, ...tagccaGAGCAGGAAGgacat..., contains one internal coding exon indicated by the upper case. The p-value is 0.007. It is not predicted by any of the algorithms mostly because this exon lacks the proper flanking dinucleotides (GT/AG or GC/AG). Another example is the region chr1:211644829-211645072; it only contains a coding exon which is the “A” of the start codon. The p-value is 0.002. This coding exon is only predicted by AceView, which considers alternative splicing.

3.2. Detecting Novel Coding Potential Regions in the Human Genome. The proposed method is also applied to the alignments of candidate coding potential regions to detect novel coding potential regions. To adjust for multiple hypothesis testing, the p-value is adjusted to the q-value according to (3) to control the false discovery rate. By setting q ≤ 0.05, which corresponds to p < 0.01, we detected 46,188 coding potential regions. Among them, 12,688 are absent from the predictions listed in Table 1 (excluding nonhuman RefSeq genes and UniGene genes). Among the human segments containing novel coding exons, 57 overlap with nonhuman RefSeq coding exons [24] and 5,259 (41.4%) have UniGene matches. This evidence indicates the existence of 12,688 novel coding potential regions in human. The coordinates of the human segments of these regions can be downloaded from http://www.stat.cmu.edu/∼jwu/axtTightCoding/.

The novel coding potential regions detected are compared with those by Nekrutenko et al. [13], in which they reported 13,700 novel coding exons; 61% of which lay within annotated genes and 38% lay between annotated genes, and among those between annotated genes, 25% had UniGene matches. Among the 12,688 novel coding potential regions reported here, 34.3% are within annotated genes and 65.7% are between annotated genes according to the annotation in Table 1, and among the novel coding potential regions in between annotated genes, 35.1% have UniGene matches. The difference shows that the proposed method is more sensitive to genes with unknown structure.

4. Discussion

A statistical procedure is proposed to detect regions containing coding exons in conserved human sequences. It reveals coding potential regions from genes that do not fit the structure prescribed by existing methods. The success of the procedure depends on a locally smooth function (i.e., the lowess function) to address the problem of localizing coding potential regions. Furthermore, the prediction method is sensitive to codons but insensitive to noncoding DNA. As seen from the results from human-mouse alignments and human-dog alignments (Table 2), the method is also not sensitive to the alignments used.


The proposed method is an effective tool to analyze short conserved regions. Although it does not predict gene structures from sequences, it identifies those conserved regions that overlap with genes. A direct application of the proposed method is to improve the accuracy of the existing gene or coding exon prediction algorithms. The proposed method could be used as a filtering procedure to provide input sequences to these exon prediction algorithms. For example, when applied to the data in short HMM [7], with the same parameters except that the probability matrices (4) were estimated from the training data in [7], it reduced the false positive rate from 0.77% to 0.49% at the same true positive rate by filtering out the alignments with large p-values. It could also be used as an additional criterion for the alternative genes predicted by GENSCAN. In addition, the proposed method would also benefit algorithms that predict single-exon genes. Specifically, by increasing the window size w and applying to the sets with longer flanking noncoding regions, the peak in the hump in the long coding exon emerges while the peak in the humps in other short exons becomes less significant. Then, using the detected coding potential regions as the input data for algorithms that only predict the single-exon gene, because of the gene structure, one would expect that most long exons from multiexon genes would be filtered out.

A more interesting feature of the proposed method is that it provides new data for methods that predict gene structures. As shown in Section 3.1, from the comparison with GENSCAN, the proposed method detects more coding potential regions from multiple-exon genes. Moreover, it is sensitive to coding potential regions containing short exons and exons with alternative splicing sites as shown in Section 3.1. Thus, the proposed method could be used to reveal novel gene structure by studying the coding potential regions that failed to be predicted by the existing algorithms.

There is a possibility that the proposed method could be biased toward pseudogenes simply because there is a relaxation of the whole gene structure. However, such a bias is not obvious since the percentage of coding potential regions predicted overlapping with known pseudogenes is within the range of those from existing gene prediction algorithms. As a matter of fact, 2% of the coding potential regions predicted from the human sequences in the axtTight folder overlap with the database of Yale pseudogenes (http://www.pseudogene.org/), corresponding to 4% in length. Both percentages are lower than those of GENEID, GENSCAN, Augustus, and SGP and are higher than those of the remaining 7 gene prediction methods in Table 1 (excluding RefSeq genes, nonhuman RefSeq genes, vega genes, vega pseudogenes, retro genes, and Yale pseudogenes).

The proposed statistical procedure is not sensitive to the parameters used since the lowess function smoothes out the sudden changes in the log-odds scores from the randomness. However, there still are some general rules for selecting the parameters. Specifically, the window size $w_0$ for selecting the strand and segmentation of the alignment should be large enough to include more codons, but not too large so that few noncoding DNAs are included when the window frame is on the coding exon. The window size $w$ for obtaining the normally distributed scores should be small so that the dependency among the scores is weak and the alignment has ample scores for the lowess estimation and the peak selection. On the other hand, $w$ should also be large enough to ensure the distribution of the average log-odds ratios in the window frame is approximately normal. The method is not sensitive to the parameter f in the lowess function or the parameter span in ppc.peaks() due to the nonparametric nature of these two functions. Moreover, the lowess function could be replaced by similar locally smooth functions such as the spline method; other peak selection functions could also be used instead of ppc.peaks(). However, the smoothing parameter does affect the prediction sensitivity. The larger the f, the larger the p-value for a given alignment. On the other hand, as shown in Table 2, for a dataset that is not dramatically different from the one used in this paper in DNA composition and sequence length distribution, the threshold for the p-value, say 0.01, remains a good indication of whether the sequence contains coding DNA or not.

One limitation of the proposed method is that it is only applicable to alignments that are not too short, say longer than 150 bps. This limitation excluded 3.5% of human RefSeq coding exons that overlap with the alignments in the axtTight folder from the analysis, as these RefSeq coding exons do not have enough conserved flanking noncoding regions after the extension. One justification of the length constraint is to ensure that the alignment has adequate log-odds scores for the peak selection function ppc.peaks(). Furthermore, the proposed method is expected to have limited statistical power in detecting coding potential regions from alignments ≤150 bps. As shown by Nekrutenko et al. [12], even with gene structure given, only 42% of coding exons are detected from the conserved RefSeq coding exons with length ≤50 bps. The power of the proposed method on the short aligned sequences (<150 bps) is about 40%. Also, the power of the proposed approach decreases when the length of the alignment increases to thousands of base pairs or more since the p-value increases with the length of the alignment.

The code that realizes the proposed procedure and the predicted coding potential regions can be downloaded from http://www.stat.cmu.edu/∼jwu/axtTightCoding/, in which the code to calculate the log-odds score is written in C++ and the code to calculate the p-value is written in R.

Acknowledgments

The author is grateful to Dr. David Haussler for introducing her to the subject of comparative genomics and for many inspiring discussions. Thanks to Dr. R. W. Doerge and Dr. Wen-Hsiung Li for reading and editing the manuscript and for their encouragement.

References

[1] C. Burge and S. Karlin, “Prediction of complete gene structures in human genomic DNA,” Journal of Molecular Biology, vol. 268, no. 1, pp. 78–94, 1997.

[2] S. Batzoglou, L. Pachter, J. P. Mesirov, B. Berger, and E. S. Lander, “Human and mouse gene structure: comparative analysis and application to exon prediction,” Genome Research, vol. 10, no. 7, pp. 950–958, 2000.

[3] V. Bafna and D. H. Huson, “The conserved exon method for gene finding,” in Proceedings of the 8th International Conference on Intelligent Systems for Molecular Biology (ISMB ’00), vol. 8, pp. 3–12, AAAI Press, 2000.

[4] I. Korf, P. Flicek, D. Duan, and M. R. Brent, “Integrating genomic homology into gene structure prediction,” Bioinformatics, vol. 17, supplement 1, pp. S140–S148, 2001.

[5] S. Cawley, L. Pachter, and M. Alexandersson, “SLAM web server for comparative gene finding and alignment,” Nucleic Acids Research, vol. 31, no. 13, pp. 3507–3509, 2003.

[6] R. Guigo, E. T. Dermitzakis, P. Agarwal, et al., “Comparison of mouse and human genomes followed by experimental verification yields an estimated 1,019 additional genes,” Proceedings of the National Academy of Sciences of the United States of America, vol. 100, no. 3, pp. 1140–1145, 2003.

[7] J. Wu and D. Haussler, “Coding exon detection using comparative sequences,” Journal of Computational Biology, vol. 13, no. 6, pp. 1148–1164, 2006.

[8] M. Re, G. Pesole, and D. S. Horner, “Accurate discrimination of conserved coding and non-coding regions through multiple indicators of evolutionary dynamics,” BMC Bioinformatics, vol. 10, article 282, 2009.

[9] S. Rogic, A. K. Mackworth, and F. B. F. Ouellette, “Evaluation of gene-finding programs on mammalian sequences,” Genome Research, vol. 11, no. 5, pp. 817–832, 2001.

[10] W. S. Cleveland, “Robust locally weighted regression and smoothing scatterplots,” Journal of the American Statistical Association, vol. 74, pp. 829–836, 1979.

[11] I. B. Rogozin, D. D’Angelo, and L. Milanesi, “Protein-coding regions prediction combining similarity searches and conservative evolutionary properties of protein-coding sequences,” Gene, vol. 226, no. 1, pp. 129–137, 1999.

[12] A. Nekrutenko, K. D. Makova, and W.-H. Li, “The KA/KS ratio test for assessing the protein-coding potential of genomic regions: an empirical and simulation study,” Genome Research, vol. 12, no. 1, pp. 198–202, 2002.

[13] A. Nekrutenko, W.-Y. Chung, and W.-H. Li, “An evolutionary approach reveals a high protein-coding capacity of the human genome,” Trends in Genetics, vol. 19, no. 6, pp. 306–310, 2003.

[14] R. Tibshirani, T. Hastie, B. Narasimhan, et al., “Sample classification from protein mass spectrometry, by “peak probability contrasts”,” Bioinformatics, vol. 20, no. 17, pp. 3034–3044, 2004.

[15] Y. Benjamini and Y. Hochberg, “Controlling the false discovery rate: a practical and powerful approach to multiple testing,” Journal of the Royal Statistical Society. Series B, vol. 57, no. 1, pp. 289–300, 1995.

[16] J. D. Storey, “The positive false discovery rate: a Bayesian interpretation and the q-value,” Annals of Statistics, vol. 31, no. 6, pp. 2013–2035, 2003.

[17] K. D. Pruitt, K. S. Katz, H. Sicotte, and D. R. Maglott, “Introducing RefSeq and LocusLink: curated human genome resources at the NCBI,” Trends in Genetics, vol. 16, no. 1, pp. 44–47, 2000.

[18] K. D. Pruitt, T. Tatusova, and D. R. Maglott, “NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins,” Nucleic Acids Research, vol. 33, pp. D501–D504, 2005.

[19] D. A. Benson, I. Karsch-Mizrachi, D. J. Lipman, J. Ostell, and D. L. Wheeler, “GenBank: update,” Nucleic Acids Research, vol. 32, pp. D23–D26, 2004.

[20] T. Wiehe, S. Gebauer-Jung, T. Mitchell-Olds, and R. Guigo, “SGP-1: prediction and validation of homologous genes based on sequence alignments,” Genome Research, vol. 11, no. 9, pp. 1574–1583, 2001.

[21] R. Guigo, “Assembling genes from predicted exons in linear time with dynamic programming,” Journal of Computational Biology, vol. 5, no. 4, pp. 681–702, 1998.

[22] M. Stanke and S. Waack, “Gene prediction with a hidden Markov model and a new intron submodel,” Bioinformatics, vol. 19, supplement 2, pp. ii215–ii225, 2003.

[23] P. Kim, N. Kim, Y. Lee, B. Kim, Y. Shin, and S. Lee, “ECgene: genome annotation for alternative splicing,” Nucleic Acids Research, vol. 33, pp. D75–D79, 2005.

[24] W. J. Kent, “BLAT—the BLAST-like alignment tool,” Genome Research, vol. 12, no. 4, pp. 656–664, 2002.

[25] D. Thierry-Mieg, J. Thierry-Mieg, M. Potdevin, and M. Sienkiewicz, “AceView: identification and functional annotation of cDNA-supported genes in higher organisms,” Genome Biology, vol. 7, supplement 1, p. S12, 2006.

[26] T. Hubbard, D. Barker, E. Birney, et al., “The Ensembl genome database project,” Nucleic Acids Research, vol. 30, no. 1, pp. 38–41, 2002.

[27] W. J. Kent, R. Baertsch, A. Hinrichs, W. Miller, and D. Haussler, “Evolution’s cauldron: duplication, deletion, and rearrangement in the mouse and human genomes,” Proceedings of the National Academy of Sciences of the United States of America, vol. 100, no. 20, pp. 11484–11489, 2003.

[28] Z. Zhang, P. M. Harrison, Y. Liu, and M. Gerstein, “Millions of years of evolution preserved: a comprehensive catalog of the processed pseudogenes in the human genome,” Genome Research, vol. 13, no. 12, pp. 2541–2558, 2003.

[29] A. E. Lash, C. M. Tolstoshev, L. Wagner, et al., “SAGEmap: a public gene expression resource,” Genome Research, vol. 10, no. 7, pp. 1051–1060, 2000.

[30] M. Q. Zhang, “Identification of protein coding regions in the human genome by quadratic discriminant analysis,” Proceedings of the National Academy of Sciences of the United States of America, vol. 94, no. 2, pp. 565–568, 1997.

[31] A. Siepel and D. Haussler, “Phylogenetic hidden Markov models,” in Statistical Methods in Molecular Evolution, R. Nielsen, Ed., pp. 325–351, Springer, New York, NY, USA, 2005.

[32] E. Birney, J. A. Stamatoyannopoulos, A. Dutta, et al., “Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project,” Nature, vol. 447, no. 7146, pp. 799–816, 2007.

[33] S. Washietl, J. S. Pedersen, J. O. Korbel, et al., “Structured RNAs in the ENCODE selected regions of the human genome,” Genome Research, vol. 17, no. 6, pp. 852–864, 2007.


Hindawi Publishing Corporation
Advances in Bioinformatics
Volume 2010, Article ID 178069, 6 pages
doi:10.1155/2010/178069

Research Article

Algorithmic Assessment of Vaccine-Induced Selective Pressure and Its Implications on Future Vaccine Candidates

Mones S. Abu-Asab,1 Majid Laassri,2 and Hakima Amri3

1 Laboratory of Pathology, National Cancer Institute, National Institutes of Health, Bethesda, MD 20892, USA

2 Laboratory of Methods Development, Center for Biologics Evaluation and Research, Food and Drug Administration, Rockville, MD 20852, USA

3 Department of Physiology and Biophysics, School of Medicine, Georgetown University, Washington, DC 20007, USA

Correspondence should be addressed to Hakima Amri, [email protected]

Received 21 August 2009; Accepted 4 November 2009

Academic Editor: Wojciech Makalowski

Copyright © 2010 Mones S. Abu-Asab et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Posttrial assessment of a vaccine’s selective pressure on infecting strains may be realized through a bioinformatic tool such as parsimony phylogenetic analysis. Following a failed gonococcal pilus vaccine trial of Neisseria gonorrhoeae, we conducted a phylogenetic analysis of pilin DNA and predicted peptide sequences from clinical isolates to assess the extent of the vaccine’s effect on the type of field strains that the volunteers contracted. Amplified pilin DNA sequences from infected vaccinees, placebo recipients, and vaccine specimens were phylogenetically analyzed. Cladograms show that the vaccine peptides have diverged substantially from their parental isolate, clustering distantly from it. Pilin genes of the field clinical isolates were heterogeneous, and their peptides produced clades comprising strains from both vaccinated and placebo recipients, indicating that the pilus vaccine did not exert any significant selective pressure on gonorrhea field strains. Furthermore, sequences of the semivariable and hypervariable regions pointed to heterotachous rates of mutation and substitution.

1. Introduction

The recent failure of the HIV vaccine’s STEP Study is a reminder that there is often no apparent reason that explains a trial’s demise [1, 2]. Only basic research will provide an understanding of why a vaccine did not work and guidance for the design of better candidates [2]. As a step in this direction, we sought to provide a bioinformatic tool that is capable of gauging whether a vaccine has exerted any selective pressure on infectious field strains, as this may aid in reformulating the vaccine or in the design of other candidates. A comparative algorithmic model for establishing the extent of a vaccine’s efficacy is currently lacking, although it may contribute to improving the formulation and implementation of future vaccine hypotheses.

We are presenting a new analytical model that applies the principles of phylogenetics, such as parsimony, to assess whether a vaccine has affected the selection of infectious strains during a trial. Our approach relies on the robust parsimonious modeling of fast arising genetic variation to discriminate between two groups that are under different selective pressures [3, 4]. If a vaccine is shown to exert a selective pressure, then its formulation can be modified to broaden its effective range. Although phylogenetic algorithms have been applied in the classification of microorganisms and to detect recombination in a multiple sequence alignment, they have not been used in vaccine trial assessment [5, 6].

This study is a follow-up on a field trial conducted among U.S. personnel stationed in the Republic of South Korea [7]. For the trial, a purified pilus preparation was isolated from the Pgh 3-2 Neisseria gonorrhoeae strain and tested as a vaccine in 3123 men and 127 women volunteers [7, 8]. Among male volunteers, 108 vaccine and 102 placebo recipients contracted gonorrhea 15 or more days after vaccination. None of the women volunteers developed gonococcal infections. Samples of clinical isolates from all infected participants were plated on selective media, identified, and stored at the Department of Bacterial Diseases (Walter Reed Army Institute of Research, Washington, DC, USA). The authors of the trial concluded that the pilus vaccine failed to protect men against gonococcal urethritis during the field trial [7].

The gonococcal type IV pilus is a filamentous proteinaceous surface structure responsible for initial bacterial attachment and is associated with virulence of N. gonorrhoeae (the gonococcus) [9, 10]. The pilus is a polymer comprised of pilin subunits; the latter share a common distinctive structure that also occurs in the pilins of other genera and is termed T4 pilin. The T4 pilin of N. gonorrhoeae is comprised of a highly conserved domain (C: amino acids 1–53), a semivariable domain (SV: amino acids 54–114), a hypervariable region (HV: variable number of amino acids) flanked by two conserved regions, each containing a cysteine residue, and a variable COOH-terminal region of irregular length following the second cysteine region [11].

Genetic variation that occurs at the SV and HV regions of the pilin involves a multigene system and has antigenic implications [12, 13]. Within a gonococcal genome, a structural gene (pilE) encodes the pilin subunits. In addition to pilE, the genome contains several silent pilin genes (pilS); each pilS has one or more incomplete pilin gene(s) arranged in tandem and connected by intervening sequences [14]. Partial pilin copies of pilS lack the conserved region of pilE but have the same arrangement of SV and HV [14]. Recombination events between silent and expressed sites result in variations in the expressed pilin [15]. Thus, pilE replaces some, but not all, of its variable sites from any of the silent copies.

The most suitable method for analyzing fast arising mutations, such as those in the SV and HV regions of the pilin, is sequencing followed by a parsimony phylogenetic analysis [3]. Our analysis examines the pilin composition of the vaccine and several clinical isolates from the vaccine trial to assess whether the vaccine had any selective effect on field strains that infected the vaccinated participants in spite of its failure to protect participants from infection. We applied a maximum parsimony phylogenetic algorithm to classify the pilin sequences according to their phyletic relatedness [3, 4]; this approach can model fast changing DNA and recent divergence of genes better than maximum likelihood or clustering [3].

2. Materials and Methods

2.1. Vaccine Strains and Clinical Isolates. Bacterial strains from the vaccine trial were obtained from the depository of the Department of Bacterial Diseases at Walter Reed Army Institute of Research, Washington, DC, where they were kept at −80°C [7]. To our knowledge, the vaccine strain did not undergo any further passages between vaccine preparation and this study. All the isolates used in this work were chosen randomly from positive samples; 40 isolates coded from 1 to 40 were used for hybridization analysis, and 12 strains (Table 1) were used for the sequencing of their pilin gene. Although the number of trial strains included in the sequencing and phylogenetic analysis was restricted to 12 strains, it was still sufficient to test our hypothesis.

Table 1: Strains of Neisseria gonorrhoeae used in the study.

Strain        Origin                                 GenBank accession
Pgh 3-2       clinical isolate [8]                   EU379154
P32           derived from Pgh 3-2 [7]               EU379152
P32brntn      vaccine strain derived from Pgh 3-2    EU379153
P32brntn18    vaccine strain derived from Pgh 3-2    U16742
68            from a vaccinated participant          EU340030
1009          from a vaccinated participant          EU360770
2132          from a vaccinated participant          EU379148
2184          from a vaccinated participant          EU379150
2968          from a vaccinated participant          EU379151
446           from a placebo recipient               EU346893
854           from a placebo recipient               EU360769
2136          from a placebo recipient               EU379149

2.2. Pilin Gene Amplification. Bacterial cells from the frozen stock were used without subculture and lysed by heating in 100 µL of 5% Chelex (Bio-Rad, Hercules, CA) for 5 min at 95°C. For PCR, 5 µL of the Chelex solution was used. Primer selection was based on published sequences of pilin genes [16, 17]; forward (TACATTGCATGATGCCGATGG) and reverse (CGTTCCGCCCGCCCCAGCAGGC) primers amplified only the expressed pilin gene (pilE) and not the silent homologous copies.
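As a simple check on the primer sequences quoted above, basic properties such as length, GC content, and the reverse complement can be computed directly. This is only an illustrative sketch; the helper names are hypothetical and are not part of the original protocol.

```python
# Primer sequences as given in Section 2.2.
FORWARD_PRIMER = "TACATTGCATGATGCCGATGG"
REVERSE_PRIMER = "CGTTCCGCCCGCCCCAGCAGGC"

# Translation table mapping each base to its complement.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def gc_count(seq):
    """Number of G or C bases in a DNA sequence."""
    return sum(base in "GC" for base in seq.upper())


def gc_fraction(seq):
    """Fraction of bases that are G or C."""
    return gc_count(seq) / len(seq)


def reverse_complement(seq):
    """Reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


for name, primer in (("forward", FORWARD_PRIMER), ("reverse", REVERSE_PRIMER)):
    print(name, len(primer), round(gc_fraction(primer), 2),
          reverse_complement(primer))
```

The reverse complement is useful when locating the reverse primer on the same strand as the forward primer in a reference pilE sequence.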

2.3. Hybridization Experiments. To detect whether the expressed pilin genes from the isolated field strains were homologous or heterologous to that of the vaccine strain, P32brntn, the strains’ amplicons were probed with oligonucleotides corresponding to the semivariable (SV) segments and the hypervariable (HV) regions of the vaccine pilin. Based on the pilin sequences of P32brntn, oligonucleotides corresponding to variable segments of the SV (GCTTTCAAAAATCAT and CAAATGGCTTCAAGCAA) and the total lengths of the HV (CCGACAACGACGACGTCAAA and GAGGCCGCCAACAACGGC) were synthesized and labeled with the 35S isotope. Pilin gene amplicons from the 40 trial isolates were loaded onto a nylon membrane and probed with the synthetic oligonucleotides. The hybridizations were carried out at different stringency levels (50, 46, and 42°C) to detect the presence of closely homologous sequences and the degree of heterogeneity within the field strains.
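An in silico analogue of the probing step can be sketched as an ungapped search for the probe in a target sequence while tolerating a fixed number of mismatches. Treating lower stringency as a larger mismatch allowance is an assumption made here for illustration; it is not a calibrated model of hybridization, and the target sequence below is a made-up toy, not one of the trial isolates.

```python
def best_match(probe, target):
    """Return (fewest mismatches, offset) over all ungapped placements.

    Both sequences are compared on the same strand; ties go to the
    leftmost offset.
    """
    if len(probe) > len(target):
        raise ValueError("probe is longer than target")
    best = None
    for offset in range(len(target) - len(probe) + 1):
        window = target[offset:offset + len(probe)]
        mismatches = sum(a != b for a, b in zip(probe, window))
        if best is None or mismatches < best[0]:
            best = (mismatches, offset)
    return best


def would_hybridize(probe, target, max_mismatches):
    """True if some placement has at most max_mismatches mismatches."""
    return best_match(probe, target)[0] <= max_mismatches


# One of the HV probes quoted in the text.
HV_PROBE = "CCGACAACGACGACGTCAAA"
# Hypothetical target: the probe with one substitution, plus flanking bases.
TOY_TARGET = "TTT" + "CCGACAACGACGTCGTCAAA" + "GGG"

print(best_match(HV_PROBE, TOY_TARGET))
print(would_hybridize(HV_PROBE, TOY_TARGET, max_mismatches=0))
print(would_hybridize(HV_PROBE, TOY_TARGET, max_mismatches=1))
```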

2.4. Amplicons Cloning, Sequencing, and Translation. The PCR-produced amplicons of the pilin gene from 12 strains (Table 1) were cloned into an M13 vector (Applied Biosystems, Foster City, CA). The cloned pilin genes were sequenced and translated into their predicted amino acids using GeneDoc [18].
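The translation step can be illustrated with a short sketch. This is not GeneDoc; it assumes the standard genetic code, reads frame 1 only, and uses hypothetical names. Unrecognized codons are written as X and a trailing incomplete codon is ignored.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
_BASES = "TCAG"
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(_BASES, repeat=3), _AMINO)
}


def translate(dna):
    """Translate a DNA (or RNA) string in frame 1; '*' marks a stop codon."""
    dna = dna.upper().replace("U", "T")
    usable = len(dna) - len(dna) % 3  # drop a trailing partial codon
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, usable, 3)
    )


print(translate("ATGGCCTAA"))
```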

2.5. Parsimony Phylogenetic Analysis. We used Protpars from the PHYLIP package to carry out the parsimony phylogenetic analysis [19]. Three sets of parsimony analyses, using amino acid sequences, were carried out: first, the whole sequence of the gene; second, SV regions alone (amino acids 51–127); and third, HV regions alone (amino acids 109–166). The latter two regions were analyzed to find out whether their sequences produced similar results to that of the whole sequence and whether the two regions’ phylogenies were congruent with each other. This provided a test for the strain-specific pilin hypothesis since different regions of the peptides should not produce substantially varying hypotheses of relationships if the pilin is strain specific.
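The idea of scoring a tree by parsimony over the whole alignment or over a sub-region can be sketched with the classical Fitch count of state changes. This is a simplification and not the Protpars model, which scores amino acid changes differently; the tree, the four toy sequences, and the function names are all hypothetical, and gaps or missing data are not handled.

```python
def fitch(tree, states):
    """Fitch pass for one alignment column.

    tree is a leaf name (str) or a pair of subtrees; states maps each
    leaf name to its residue. Returns (candidate state set, changes).
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch(left, states)
    right_set, right_cost = fitch(right, states)
    common = left_set & right_set
    if common:
        return common, left_cost + right_cost
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_length(tree, alignment, start=1, end=None):
    """Total Fitch changes over columns start..end (1-based, inclusive)."""
    n_columns = len(next(iter(alignment.values())))
    end = n_columns if end is None else end
    total = 0
    for col in range(start - 1, end):
        column = {name: seq[col] for name, seq in alignment.items()}
        total += fitch(tree, column)[1]
    return total


# Toy aligned peptides and two candidate rooted binary trees.
toy_alignment = {"A": "KSAV", "B": "KSAV", "C": "KTGV", "D": "KTGI"}
tree_1 = (("A", "B"), ("C", "D"))
tree_2 = (("A", "C"), ("B", "D"))

print(parsimony_length(tree_1, toy_alignment))        # whole sequence
print(parsimony_length(tree_2, toy_alignment))
print(parsimony_length(tree_1, toy_alignment, 2, 3))  # a sub-region
```

Under this criterion the tree requiring fewer changes is preferred, and restricting the column range mirrors analyzing the SV or HV region alone.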

3. Results

First, the variability of the pilE gene in 40 clinical isolates was analyzed by hybridization with the selected oligonucleotide probes of the vaccine strain, P32brntn. The results were negative at all stringency levels. This indicated the absence of homologous or partially homologous pilin genes in the clinical isolates of infected participants.

To confirm this result, sequencing analysis was performed for the 12 samples presented in Table 1, including the vaccine strain. We found that the vaccine strain contained two pilin gene sequences (P32brntn and P32brntn18, Table 1, Figure 1) instead of the single pilin gene sequence assumed by the authors of the trial [7]. These two pilins seem very closely related, as they grouped together in three different cladograms (Figures 2–4).

The sequences of the 12 specimens used in the study were congruent with the published structure of Neisseria pilins (GenBank accession numbers are listed in Table 1). However, the SV and HV regions (DNA and peptide sequences) of field strains were dissimilar to those of the vaccine (Figure 1). The variation among the sequences is shown phenetically (i.e., overall similarity, Figure 1) and phylogenetically (their phyletic relatedness, Figures 2–4).

Maximum parsimony analysis with Protpars [19] using whole peptide sequences produced one parsimonious cladogram (Figure 2). The SVs produced 4 equally parsimonious cladograms (Figure 3 shows the consensus cladogram); the HVs produced 12 equally parsimonious cladograms (Figure 4 shows the consensus cladogram).
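When several equally parsimonious cladograms are obtained, one common way to summarize them is a strict consensus, which keeps only the clades present in every tree. The text does not state which consensus rule was used, so the strict rule here is an assumption; the nested-tuple tree encoding and the function names are also illustrative.

```python
def clades(tree):
    """Return (leaf set, set of clades) for a nested-tuple rooted tree.

    A clade is the frozenset of leaf names below an internal node.
    """
    if isinstance(tree, str):
        return frozenset([tree]), set()
    leaves = frozenset()
    found = set()
    for child in tree:
        child_leaves, child_clades = clades(child)
        leaves |= child_leaves
        found |= child_clades
    found.add(leaves)
    return leaves, found


def strict_consensus_clades(trees):
    """Clades shared by every tree in the collection."""
    shared = None
    for tree in trees:
        _, tree_clades = clades(tree)
        shared = tree_clades if shared is None else shared & tree_clades
    return shared if shared is not None else set()


# Two hypothetical equally parsimonious trees on four leaves.
tree_a = ((("a", "b"), "c"), "d")
tree_b = ((("a", "b"), "d"), "c")

for clade in sorted(strict_consensus_clades([tree_a, tree_b]), key=len):
    print(sorted(clade))
```

Only the grouping of a with b and the full leaf set survive, since the two trees disagree on whether c or d joins that pair first.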

None of the three parsimony phylogenetic analyses assembled separate groups for the strains isolated from the vaccinee cohort and those isolated from placebo recipients. The strains of both groups were very closely related. This suggests that the vaccine exerted no immunological selective pressure on the isolates.
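The grouping criterion used here can be made explicit: a cohort forms a separate group on a rooted cladogram if some internal node has exactly that cohort's strains as its descendants. The sketch below uses the cohort labels from Table 1, but both example topologies are hypothetical and are not the published cladograms.

```python
# Cohort membership as listed in Table 1.
VACCINATED = {"68", "1009", "2132", "2184", "2968"}
PLACEBO = {"446", "854", "2136"}


def subtree_leaf_sets(tree, out):
    """Add the leaf set of every subtree to out; return this subtree's set."""
    if isinstance(tree, str):
        leaves = frozenset([tree])
    else:
        leaves = frozenset().union(
            *(subtree_leaf_sets(child, out) for child in tree)
        )
    out.add(leaves)
    return leaves


def is_monophyletic(tree, group):
    """True if some subtree contains exactly the strains in group."""
    seen = set()
    subtree_leaf_sets(tree, seen)
    return frozenset(group) in seen


# Hypothetical topology in which the two cohorts are intermixed.
mixed_tree = ((("68", "446"), ("854", "2184")),
              (("2136", "1009"), ("2132", "2968")))
# Hypothetical topology in which each cohort forms its own clade.
separated_tree = ((("68", "1009"), ("2132", ("2184", "2968"))),
                  ("446", ("854", "2136")))

print(is_monophyletic(mixed_tree, VACCINATED))
print(is_monophyletic(separated_tree, VACCINATED))
```

Intermixed cohorts, as in the first topology, correspond to the pattern interpreted in the text as an absence of selective pressure.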

4. Discussion

Postvaccine trial analysis beyond success or failure is a rarity due to a lack of analytical methods. We are not aware of any existing models for carrying out such an analysis. As the HIV vaccine STEP Study has shown, a vaccine failure is sometimes an enigma, with no obvious reasons at hand to explain it [1, 2]. Here we attempt to introduce parsimony phylogenetic analysis as an analytical paradigm for posttrial examination (it may also be used for the formulation of future vaccine candidates). There are several goals of such analysis: first, to assess the heterogeneity of field strains in relation to vaccine strains; second, to evaluate the phyletic relationships among all the strains; and third, to find out if the vaccine exerts any immunological selective pressure at the gene level of the field strains that may affect the type of infecting strain.

The pilin gene sequence was not known at the time of the vaccine trial, and attempts to sequence the pilus peptide’s subunits were not completely successful. Our sequencing results from the stored P32brntn strain revealed two distinct pilE genes, indicating that the culture has some heterogeneity (P32brntn and P32brntn18, Table 1, Figure 1), which contrasts with the vaccine trial authors’ assumption of a single-type pilus [7]. The exact composition of the vaccine (whether it was a single-type or multiple-type pilus) is significant for assessing its implications on the outcome of the trial.

The efficacy of a pilus vaccine in preventing gonorrhea infections was the subject of a long debate fueled by contradicting evidence [7, 20]. On one hand, the pili are associated with gonorrhea’s virulence [21]; pilus vaccines have been effective in protecting suckling piglets and cattle against infections of E. coli and Moraxella bovis, respectively [7, 22]; and these vaccines were immunogenic [23]. On the other hand, the pilus vaccine was ineffective beyond the homology of its pilus strain, and even its homologous protection was overcome with larger challenge inocula [20]. The authors of the vaccine trial argued that human challenge experiments do not always predict the outcome in a natural setting and embarked on a large placebo-controlled, double-blinded field trial of pilus vaccine [7]. Although the vaccine elicited a good immune response in vaccinated recipients, it failed to protect them [24]. This work examined the extent of pilin diversity among infected participants and pilin phylogeny as indicators of the vaccine’s selective pressure. We explored a new analytical model that uses the phylogeny of pilin sequences to infer whether the vaccine exerted a selective pressure on the gonorrhea strains that infected the vaccinated participants.

The heterogeneity of the vaccine inoculum (two pilin types: P32brntn and P32brntn18) did not seem to confer any additional effectiveness on the vaccine. This could be attributed to the close sequence similarity of the two; the two types have shared sequences and grouped together in all three sets of the analyses (Figures 2–4).

In order to test the validity of our hypothesis, which is based on the phylogeny of the pilin genes, the ancestral strains, Pgh 3-2, a clinical isolate from which the vaccine strain was derived [8], and a strain derived from it (P32), were sequenced and included in the analyses. The two ancestral strains clustered together in all three cladograms (Figures 2–4), while the vaccine pilins clustered distantly from them. On two of the cladograms, the ancestral pilin and the vaccine pilins were separated from each other by all the other isolates (Figures 2 and 3). The vaccine pilins consistently paired with the participants’ sequences. Since the phylogenetic history of these strains is well known to us, one can conclude that the pilin sequences of the vaccine have diverged from their ancestral strains to a point where their true phylogeny is not reflected in their pilin sequences. Furthermore, it seems that because of recombination events as well as high mutation rate, particularly at the HV region, a strain-specific pilin appears to be an inaccurate term.

[Figure 1 image: shaded alignment, not recoverable from the extracted text. Sequences, top to bottom in each block: Pgh 3-2, P32, P32brntn, P32brntn18, 68, 446, 854, 1009, 2132, 2136, 2184, 2968. Alignment blocks: positions 1–50, 51–100, 101–150, 151–173.]

Figure 1: Multiple sequence alignment of predicted pilus peptides from the 12 strains used in the analysis (Table 1). These peptide sequences were produced by translating DNA sequences (see Table 1 for GenBank accession numbers). There are three domains in the pilus peptide: a conserved domain (C: amino acids 1–53), a semivariable domain (SV: amino acids 54–114), and a hypervariable region (HV: variable number of amino acids starting at amino acid 132). The shading (white, gray, and black) indicates the variability of the sequence: white, high variability; gray, slightly variable; black, highly conserved.

[Figure 2 image: cladogram, topology not recoverable from the extracted text. Leaves, top to bottom: Pgh 3-2, P32, 854, 2136, 68∗, 2184∗, 446∗, 1009∗, 2132∗, 2968∗, P32brntn18, P32brntn.]

Figure 2: Most parsimonious cladogram of full-length predicted peptides. Pgh 3-2 was used as an outgroup since it is the ancestral strain of the vaccine strains. Strains from infected vaccinees are marked by ∗. For a few strains, small sequence segments at the beginning of the gene were not obtained and were treated as missing values in the analysis.

[Figure 3 image: cladogram, topology not recoverable from the extracted text. Leaves, top to bottom: Pgh 3-2, P32, 854, 2968∗, 446, 2132∗, 2136, 68∗, 2184∗, 1009∗, P32brntn18, P32brntn.]

Figure 3: Consensus cladogram of the semivariable (SV) region peptides (amino acids 51–127 included). Pgh 3-2 was used as an outgroup. Strains from vaccinated individuals are marked by ∗.

[Figure 4 image: cladogram, topology not recoverable from the extracted text. Leaves, top to bottom: P32, Pgh 3-2, 2184∗, 446, 68∗, 854, 2968∗, P32brntn18, P32brntn, 2132∗, 1009∗, 2136.]

Figure 4: Consensus cladogram of the hypervariable (HV) region peptides (amino acids 109–166 included). Pgh 3-2 was used as an outgroup. Strains from vaccinated individuals are marked by ∗.

The phylogenetic analysis indicates that the vaccine did not influence the strain type in the vaccinated group. This is inferred from the groupings of the sequences of the placebo and vaccinated groups, which appear together in mixed groups (Figures 2–4). If the vaccine had exerted any selective pressure against gonorrhea strains, the placebo and vaccinee groups would have been expected to group separately from one another on the cladograms.

This study provided a clear insight into the magnitude of antigenic variation of pilin exhibited among field strains and therefore permits an evaluation of the feasibility of pili as a vaccine against gonorrhea, one of the most frequently reported infections in the US [25]. This high heterogeneity of pilin provides a strong argument against a single-type pilus vaccine and lends support to multitype pilus formulations for future vaccine candidates.

Variation within the expressed pilin gene is partially derived from intragenomic recombination events between it and copies of the silent pilin genes (pilS) [26]. Therefore, in light of the results obtained from phylogenetically assessing the three segments of the pilin gene (Figures 2–4), it will be important to assess the degree to which silent copies in the clinical isolates have contributed to the variation within the expressed pilin gene. This step is postponed for a future study.

Acknowledgment

We thank Dr. Margaret Bash for her suggestions and critical review of this article.


References

[1] M. J. McElrath, S. C. De Rosa, Z. Moodie, et al., “HIV-1 vaccine-induced immunity in the test-of-concept Step Study: a case-cohort analysis,” The Lancet, vol. 372, no. 9653, pp. 1894–1905, 2008.

[2] E. Check Hayden, “HIV: the next shot,” Nature, vol. 454, no. 7204, pp. 565–569, 2008.

[3] M. Abu-Asab, M. Chaouchi, and H. Amri, “Evolutionary medicine: a meaningful connection between omics, disease, and treatment,” Proteomics—Clinical Applications, vol. 2, no. 2, pp. 122–134, 2008.

[4] P. A. Goloboff and D. Pol, “Parsimony and Bayesian phylogenetics,” in Parsimony, Phylogeny, and Genomics, V. A. Albert, Ed., Oxford University Press, Oxford, UK, 2005.

[5] D. Paraskevis, E. Magiorkinis, G. Magiorkinis, et al., “Increasing prevalence of HIV-1 subtype A in Greece: estimating epidemic history and origin,” Journal of Infectious Diseases, vol. 196, no. 8, pp. 1167–1176, 2007.

[6] W.-H. Lee and W.-K. Sung, “RB-finder: an improved distance-based sliding window method to detect recombination breakpoints,” Journal of Computational Biology, vol. 15, no. 7, pp. 881–898, 2008.

[7] J. W. Boslego, E. C. Tramont, R. C. Chung, et al., “Efficacy trialof a parenteral gonococcal pilus vaccine in men,” Vaccine, vol.9, no. 3, pp. 154–162, 1991.

[8] C. C. Brinton, J. Bryan, J.-A. Dillon, et al., “Uses of pili ingonorrhea control: role of bacterial pili in disease, purifica-tion and properties of gonococcal pili, and progress in thedevelopment of a gonococcal pilus vaccine for gonorrhea,” inImmunobiology of Neisseria Gonorrhoeae, G. E. Brooks, E. C.Gotschlich, K. K. Holmes, W. D. Sawyer, and F. E. Young, Eds.,pp. 155–178, American Society for Microbiology, Washington,DC, USA, 1978.

[9] D. S. Kellogg Jr., I. R. Cohen, L. C. Norins, A. L. Schroeter,and G. Reising, “Neisseria gonorrhoeae. II. Colonial variationand pathogenicity during 35 months in vitro,” Journal ofBacteriology, vol. 96, no. 3, pp. 596–605, 1968.

[10] J. Swanson, K. Robbins, O. Barrera, et al., “Gonococcal pilinvariants in experimental gonorrhea,” Journal of ExperimentalMedicine, vol. 165, no. 5, pp. 1344–1357, 1987.

[11] M. S. Rohrer, M. P. Lazio, and H. S. Seifert, “A real-timesemi-quantitative RT-PCR assay demonstrates that the pilEsequence dictates the frequency and characteristics of pilinantigenic variation in Neisseria gonorrhoeae,” Nucleic AcidsResearch, vol. 33, no. 10, pp. 3363–3371, 2005.

[12] E. V. Sechman, M. S. Rohrer, and H. S. Seifert, “A geneticscreen identifies genes and sites involved in pilin antigenicvariation in Neisseria gonorrhoeae,” Molecular Microbiology,vol. 57, no. 2, pp. 468–483, 2005.

[13] A. K. Criss, K. A. Kline, and H. S. Seifert, “The frequencyand rate of pilin antigenic variation in Neisseria gonorrhoeae,”Molecular Microbiology, vol. 58, no. 2, pp. 510–519, 2005.

[14] R. Haas and T. F. Meyer, “The repertoire of silent pilus genesin Neisseria gonorrhoeae: evidence for gene conversion,” Cell,vol. 44, no. 1, pp. 107–115, 1986.

[15] T. F. Meyer and J. P. M. Van Putten, “Genetic mechanismsand biological implications of phase variation in pathogenicneisseriae,” Clinical Microbiology Reviews, vol. 2, supplement,pp. S139–S145, 1989.

[16] T. F. Meyer, E. Billyard, R. Haas, S. Storzbach, and M. So, “Pilusgenes of Neisseria gonorrheae: chromosomal organizationand DNA sequence,” Proceedings of the National Academy of

Sciences of the United States of America, vol. 81, no. 19, pp.6110–6114, 1984.

[17] A. C. F. Perry, I. J. Nicolson, and J. R. Saunders, “Structuralanalysis of the pilE region of Neisseria gonorrhoeae P9,” Gene,vol. 60, no. 1, pp. 85–92, 1987.

[18] K. B. Nicohas and H. B. Nicholad, “GeneDoc: a tool for editingand annotating multiple sequence alignments,” 1997.

[19] J. Felsenstein, “PHYLIP: phylogeny inference package (version3.2),” Cladistics, vol. 5, pp. 164–166, 1989.

[20] C. C. Brinton, S. W. Wood, A. Brown, et al., “The developmentof a Neisserial pilus vaccine for gonorrhea and meningitis,” inSeminars in Infectious Disease: Bacterial Vaccines, J. B. Robbins,J. C. Hill, and J. C. Sadoff, Eds., vol. 4, pp. 140–159, Thieme-Stratton, New York, NY, USA, 1982.

[21] J. Swanson, S. J. Kraus, and E. C. Gotschlich, “Studies ongonococcus infection. I. Pili and zones of adhesion: their rela-tion to gonococcal growth patterns,” Journal of ExperimentalMedicine, vol. 134, no. 4, pp. 886–906, 1971.

[22] B. Nagy, H. W. Moon, R. E. Isaacson, C. C. To, and C.C. Brinton, “Immunization of suckling pigs against entericenterotoxigenic Escherichia coli infection by vaccinating damswith purified pili,” Infection and Immunity, vol. 21, no. 1, pp.269–274, 1978.

[23] M. Siegel, D. Olsen, C. Critchlow, and T. M. Buchanan,“Gonococcal pili: safety and immunogenicity in humans andantibody function in vitro,” Journal of Infectious Diseases, vol.145, no. 3, pp. 300–310, 1982.

[24] S. C. Johnson, R. C. Y. Chung, C. D. Deal, et al., “Humanimmunization with Pgh 3-2 gonococcal pilus results in cross-reactive antibody to the cyanogen bromide fragment-2 ofpilin,” Journal of Infectious Diseases, vol. 163, no. 1, pp. 128–134, 1991.

[25] National Center for Health Statistics (U.S.), Health, UnitedStates, 2006, Department of Health and Human Services,Centers for Disease Control and Prevention, Atlanta, Ga, USA;Public Health Service, Washington, DC, USA, 2006.

[26] H. S. Seifert, “Questions about gonococcal pilus phase- andantigenic variation,” Molecular Microbiology, vol. 21, no. 3, pp.433–440, 1996.


Hindawi Publishing Corporation
Advances in Bioinformatics
Volume 2010, Article ID 976792, 7 pages
doi:10.1155/2010/976792

Research Article

Applying Small-Scale DNA Signatures as an Aid in Assembling Soybean Chromosome Sequences

Myron Peto, David M. Grant, Randy C. Shoemaker, and Steven B. Cannon

USDA-ARS-CICGR Unit and Department of Agronomy, Iowa State University, Ames, IA 50011, USA

Correspondence should be addressed to Steven B. Cannon, [email protected]

Received 19 November 2009; Accepted 28 June 2010

Academic Editor: Izabela Makalowska

Copyright © 2010 Myron Peto et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Previous work has established a genomic signature based on relative counts of the 16 possible dinucleotides. Until now, it has been generally accepted that the dinucleotide signature is characteristic of a genome and is relatively homogeneous across a genome. However, we found some local regions of the soybean genome with a signature differing widely from that of the rest of the genome. Those regions were mostly centromeric and pericentromeric, and enriched for repetitive sequences. We found that DNA binding energy also presented large-scale patterns across soybean chromosomes. These two patterns were helpful during assembly and quality control of soybean whole genome shotgun scaffold sequences into chromosome pseudomolecules.

1. Introduction

The soybean (Glycine max (L.) Merr.) genome sequencing project was conducted using the whole genome shotgun strategy [1], with the DOE’s Joint Genome Institute producing sequence and primary assemblies, and NSF- and USDA-funded groups providing genetic and physical map resources to integrate the genome into chromosome-scale assemblies [2]. In a whole genome shotgun (WGS) strategy, overlapping paired-end reads are assembled into scaffolds on the basis of sequence overlaps and clone-size information. More than five thousand sequence-based markers were used in the soybean genome assembly to help order and orient scaffolds [3, 4]. Despite the large number of available markers, a significant hurdle in the assembly of scaffolds into pseudomolecules is that genetic markers give poor resolution in the centromeric and pericentromeric regions due to the lack of recombination events [5, 6]. We present two techniques, dinucleotide signature and binding energy, which were useful in assessing the soybean chromosome assemblies and may be of use for other genome assembly projects.

2. Results

A WGS sequencing project is often divided into two phases: (1) assembly of the reads into scaffolds based on sequence overlap and (2) construction of chromosome pseudomolecules by placing and orienting the scaffolds using other information (i.e., genetic and physical maps). We found that the genetic map, while generally collinear with the genomic sequence, showed widely varying rates of recombination. Figure 1 shows chromosome 6 of soybean (formerly linkage group C2), which illustrates the phenomenon. The horizontal section in the middle of the chromosome covers the centromeric and pericentromeric regions where a large physical distance corresponds to a small genetic distance. The relative lack of recombination in this region results in poor resolution and difficulties in ordering and orienting those scaffolds. In contrast, the euchromatic regions at either end display high genetic-to-physical ratios (in the range of 1 centimorgan per 200,000 bases [1 cM/200 kb]), enabling confident placement of most scaffolds. We were able to use chromosomal-scale signals in both dinucleotide signature differences and binding energies as an aid in ordering and orienting scaffolds in the soybean genome.
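The contrast between high-recombination arms and the flat pericentromeric stretch can be made concrete by converting marker pairs into local recombination rates. The following is a minimal sketch, not the procedure used for the soybean assembly; the function name `local_rates` and the marker coordinates in the usage note are hypothetical.

```python
def local_rates(markers):
    """Recombination rate between consecutive markers.

    markers: iterable of (position_bp, cM) pairs (hypothetical input format).
    Returns a list of (start_bp, end_bp, cM_per_Mb) tuples, one per interval.
    """
    ordered = sorted(markers)
    rates = []
    for (p0, c0), (p1, c1) in zip(ordered, ordered[1:]):
        if p1 > p0:  # skip markers that share a physical position
            rates.append((p0, p1, (c1 - c0) / ((p1 - p0) / 1e6)))
    return rates
```

With this sketch, an interval of 1 cM over 200 kb gives 5 cM/Mb, whereas a made-up pericentromeric interval of 0.5 cM over 10 Mb gives 0.05 cM/Mb, which is the kind of region where marker order carries little information.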

Plots of binding energy and dinucleotide differences were overlaid with scaffold boundaries. Figure 2 gives an example of a dinucleotide plot along with the scaffold boundaries for chromosome 6. The darkened scaffolds show peaks that we believe correspond to the centromere, based on concentrated arrays of 91–92 base satellite repeats [5, 6] at those locations. Gradients shown in Figure 2, as well as other supporting information (below), led to the chromosomal build shown in Figure 2(b).

[Figure 1 plot: genetic distance (cM, 0 to 160) on the vertical axis versus physical distance (100 kb, 0 to 600) on the horizontal axis.]

Figure 1: Physical (horizontal) versus genetic (vertical) distance for soybean chromosome 6. Note the flat region in the middle of the chromosome, corresponding to a portion of the chromosome with few recombination events. This hinders accurate marker-based placement of scaffolds in that region. The genetic distances are taken from the Soybean Consensus Map 4.0 [3, 4].

The highlighted scaffolds in Figure 2 are scaffold 58 and scaffold 35 (from the Arachne build preceding the Glyma1.01 assembly release [1]). Scaffold 58 contains 14 mapped markers, ranging from 102.9 to 103.9 cM on the 4.0 consensus genetic map [4], tentatively indicating that this scaffold should have a positive orientation. Scaffold 35 contains 7 mapped markers, ranging from 103.1 to 103.5 cM, also tentatively indicating a positive orientation. The scaffolds were initially placed with scaffold 58 first (cM 103.37), then scaffold 35 (103.41) in the orientations mentioned. These cM differences are below the resolution of the map, however, so the placements are open to re-evaluation. The dinucleotide plot, with peaks at the edges of both scaffolds, suggested that a reversal of orientation of both scaffolds was appropriate. This change was also supported by two FPC contigs [7] that span the boundaries between scaffolds.

The contig WmFPC Contig240 spans the boundary between scaffold 35 and scaffold 882, the scaffold directly to the right of scaffold 35 (see the SoyBase genome browser at http://soybase.org/). WmFPC Contig240 also spans the boundary between scaffold 882 and scaffold 3195, the scaffold directly to the right of scaffold 882. This strongly suggests that the position and orientation of scaffold 35 shown in Figure 2(b) is indeed correct. WmFPC Contig6136 spans the boundary of scaffold 58 and scaffold 35 across the centromere. Integrity of this centromere-spanning scaffold is suspect (Will Nelson, personal communication), but together with the evidence above, the physical map provides some supporting evidence for the orientation of both scaffold 35 and scaffold 58.

Plots of dinucleotide binding energy along the chromosome versus genetic position were similarly useful in pseudomolecule assembly. We calculated the binding energy of 50 kb segments by adding up the energy of all the individual dinucleotides and dividing by the total count. When we plotted the averages across a whole chromosome, we observed large-scale patterns. The binding energy and variability tended to increase in the centromeric and pericentromeric regions. The average and standard deviation of binding energy from the beginning 17.5 million bases (Mb) and last 10 Mb of DNA from chromosome 6 (delineated by vertical lines in Figure 1) were 1.21 and 0.02 kcal/mol, respectively. Figure 3 shows a plot of binding energy with vertical lines separating the regions. The average and standard deviation of the binding energy from the remaining middle section of the chromosome were 1.27 and 0.04 kcal/mol. In addition to a larger variation, there tended to be large-scale oscillations present in the middle pericentromeric and centromeric regions of chromosomes. It was those large-scale patterns that we were able to exploit in assembling and orienting scaffolds in a manner similar to our use of the dinucleotide signature.

[Figure 2 plots: Gm06 dinucleotide difference (vertical axis, 0 to 0.4) versus position on chromosome, panels (a) and (b).]

Figure 2: The plots in (a) and (b) show the dinucleotide difference of two assemblies of chromosome 6. The vertical lines correspond to scaffold boundaries. The dinucleotide signature of the two darkened scaffolds provided information about their orientation. Note the peaks at the edges of those two scaffolds. In (b) the orientation of both scaffolds has been reversed in order to unify the assumed centromere. (Other scaffold-order changes outside of the shaded areas were made on the basis of other information, including marker and synteny analyses.)

Figure 4 provides an example of the use of binding energy plots in chromosomal assembly for chromosome 2 (formerly D1b). The orientation of the darkened scaffold, scaffold 34, was provisionally reversed, on the assumption that a break in this gradient was unlikely to occur by chance precisely at the scaffold boundaries. The binding energy plot of the resulting assembly is shown in Figure 4(b).
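The reasoning behind a provisional flip can be expressed as a comparison of the signal step at the two scaffold boundaries under each orientation. This is a minimal sketch under the assumption that each scaffold is summarized by an ordered list of per-window values (binding energy or dinucleotide difference); the function names and the numbers in the usage note are hypothetical. Because both signals are symmetric under reverse complementation, flipping a scaffold is approximated here by reversing the order of its window values.

```python
def boundary_jump(left, scaffold, right):
    """Sum of absolute signal steps across the two boundaries of a scaffold.

    left, scaffold, right: lists of per-window signal values in chromosome
    order; left or right may be empty at a chromosome end.
    """
    jump = 0.0
    if left:
        jump += abs(scaffold[0] - left[-1])
    if right:
        jump += abs(right[0] - scaffold[-1])
    return jump


def flip_reduces_jump(left, scaffold, right):
    """True if reversing the scaffold gives a smaller total boundary step."""
    flipped = scaffold[::-1]
    return boundary_jump(left, flipped, right) < boundary_jump(left, scaffold, right)
```

For example, with made-up neighbours `[1.20, 1.22]` and `[1.31, 1.33]`, a scaffold with windows `[1.30, 1.28, 1.24]` has a total boundary step of 0.15 as placed and 0.03 when reversed, so the sketch would flag it for review. Such a flag is only a hint and would still need marker or physical-map support.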


[Figure 3 plot: Binding energy Gm06; energy (kcal/mol, 1.0 to 1.5) versus position on chromosome.]

Figure 3: Binding energy versus chromosomal location for soybean chromosome 6. The two vertical lines correspond to boundaries between euchromatic and heterochromatic regions, as determined from Figure 1.

When we further examined the marker data for that scaffold, we found suggestive evidence that the orientation shown in Figure 4(b) is correct. Figure 5 shows a plot of cM values versus physical location for scaffold 34 in the changed orientation. We note that the cM values of the first two markers, 68.1 and 71.7, are significantly less than the cM values of the last three markers, 79.2, 81.5, and 82.6. We also note that there is a flattening of the graph as we move into the pericentromeric region. This is what we expect as recombination events become rarer and marker resolution decreases. This provides additional evidence that the orientation of scaffold 34, shown in Figure 4(b), is indeed correct.
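The marker argument above amounts to asking whether cM values tend to rise or fall along the scaffold. One simple way to sketch that check is to count concordant and discordant marker pairs; this is an illustrative assumption, not the method used in the study, and the function names and the marker positions in the usage note are hypothetical (only the cM values come from the text).

```python
def pair_votes(markers):
    """Count marker pairs whose cM order agrees or disagrees with physical order.

    markers: iterable of (physical_position, cM) pairs.
    Returns (concordant, discordant); pairs with equal cM are ignored.
    """
    ordered = sorted(markers)
    concordant = discordant = 0
    for i in range(len(ordered)):
        for j in range(i + 1, len(ordered)):
            diff = ordered[j][1] - ordered[i][1]
            if diff > 0:
                concordant += 1
            elif diff < 0:
                discordant += 1
    return concordant, discordant


def suggested_orientation(markers):
    """'+' if cM mostly increases with position, '-' if it mostly decreases."""
    concordant, discordant = pair_votes(markers)
    if concordant > discordant:
        return "+"
    if discordant > concordant:
        return "-"
    return "?"
```

Using the five cM values quoted above at evenly spaced, made-up positions, every pair is concordant and the sketch returns `"+"`; reversing the positions returns `"-"`. When markers sit within the resolution of the map, the vote is close to a tie and should not be trusted on its own.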

[Figure 4 plots: Gm02 binding energy; energy (kcal/mol, 1.0 to 1.5) versus position on chromosome, panels (a) and (b).]

Figure 4: The plots in (a) and (b) show the binding energy of 50 kb segments of two assemblies of chromosome 2. The vertical lines again correspond to scaffold boundaries. The darkened scaffold in (a) showed discontinuities in the connection to scaffolds at both ends. In (b) the orientation of that scaffold has been reversed, resulting in a less disrupted binding energy plot. (Other scaffold-order changes outside of the shaded areas were made on the basis of other information, including marker and synteny analyses.)

After observing and utilizing the small-scale signals outlined above, we decided to further characterize the DNA in order to better understand the biological meaning behind the signals. For chromosome 6, we examined the individual ρ∗XY counts that were used to compute the dinucleotide differences shown in Figure 2. In the centromeric region, some dinucleotides increase in relative count while others decrease, leading to the aggregate differences shown in Figure 2. As an example, Figure 6 shows a plot of ρ∗CG and ρ∗CC/GG counts. This analysis shows which dinucleotide frequencies differ along the chromosome but does not offer an underlying biological reason for those differences. Using the etandem repeat finding software, part of the EMBOSS software suite [8], we identified many tandem repeats of dominant length 91 in that centromeric region. Those repeats have been characterized using FISH in previous studies [9]. We also searched for a representative 91-length repeat in chromosome 6 using wublast [10], retaining matches with at least 90% identity. Tandem arrays were then identified by counting hits that occurred within 910 base pairs (10x the repeat length) of each other. These arrays are found at exclusively one location: the centromere. ρ∗CG and ρ∗CC/GG counts, calculated directly from the 91-base repeat, were 2.85 and 0.567, respectively, which differ significantly from the values of 0.540 and 1.21 for the entire soybean genome. Those CG and CC/GG count differences by themselves, when viewed in the context of how we determine δ∗ values, are enough to explain dinucleotide differences of ∼0.2. This is approximately the height of the peak in Figure 2.
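A back-of-the-envelope check of the claim that the CG and CC/GG differences alone account for a dinucleotide difference of about 0.2, assuming that only the CG, CC, and GG terms of the δ∗ sum (equation (3) in Methods) deviate from the genome-wide values and that CC and GG share one ρ∗ value because they are reverse complements:

$$
\delta^{*} \approx \frac{1}{16}\Bigl(\lvert 2.85 - 0.540\rvert + 2\,\lvert 0.567 - 1.21\rvert\Bigr)
= \frac{2.310 + 1.286}{16} \approx 0.22
$$

The other 13 terms are taken as zero here, which is a simplification; in a real window of mixed sequence the repeat would be diluted and the remaining dinucleotides would contribute as well.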

In attempting to explain the broad pericentromeric peaks in the binding energy plot of chromosome 2 (and all soybean chromosomes, data not shown), we analyzed the GC content of the euchromatic and heterochromatic regions as well as the GC content of a collection of LTR retrotransposons. The GC content of the LTR retrotransposons was 0.39, that of the euchromatic regions of chromosome 2 was 0.32, and that of the heterochromatic region of chromosome 2 was 0.37. When we removed the repeat content (LTRs and satellite repeats) from the heterochromatic region and recalculated, we saw a GC content of 0.33. This is enough to explain the broad peaks of Figure 4 and strongly suggests that it is the increased GC content of LTRs and repeats that leads to the patterns in binding energy.
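The recalculation described above (GC content of a region with and without its annotated repeats) can be sketched as follows. This is an illustration only; the function name `gc_content`, the half-open interval convention, and the toy sequence in the usage note are assumptions, and the study's own scripts were written in Perl.

```python
def gc_content(seq, masked=()):
    """(C+G) fraction of seq, ignoring Ns and any masked intervals.

    masked: iterable of half-open (start, end) intervals, 0-based, for
    example annotated LTR retrotransposons or satellite arrays
    (hypothetical input). Returns None if no A/C/G/T base remains.
    """
    s = seq.upper()
    hidden = [False] * len(s)
    for start, end in masked:
        for i in range(max(0, start), min(len(s), end)):
            hidden[i] = True
    gc = at = 0
    for base, skip in zip(s, hidden):
        if skip:
            continue
        if base in "GC":
            gc += 1
        elif base in "AT":
            at += 1
    total = gc + at
    return gc / total if total else None
```

On the toy sequence `"AATT" + "GGCC" + "AATT"`, the unmasked value is 4/12, while masking the GC-rich insert with `masked=[(4, 8)]` gives 0.0, mirroring in miniature how removing GC-rich repeats lowers the heterochromatic value toward the euchromatic one.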

We calculated similar binding energy and dinucleotide plots for chromosomes of grape, Arabidopsis, poplar, and rice to determine whether the patterns we observed in soybean were a general phenomenon or were specific to this species. Although we saw a few large-scale patterns along the chromosomes from those species, they appeared rarely and the patterns were in general more subdued than those seen in soybean (data not shown). Rice, poplar, and Arabidopsis showed relatively homogeneous peaks in the dinucleotide plots, with a noisy background consisting of narrow (∼50–200 kb) peaks. The dinucleotide difference plot for the grape genome showed only minor, infrequent peaks. We are unsure whether the differences in the dinucleotide and binding energy plots were a result of differences between the genomes of those species or, rather, a difference in sequencing strategies used for the comparison genomes. The rice and Arabidopsis genomes were sequenced using clone-by-clone techniques, which did not determine the sequence of all pericentromeric regions [11, 12]. The poplar genome was sequenced using a WGS strategy, but a large proportion (∼75 Mb) of the estimated genome was not included in the pseudomolecule assemblies, and this excluded fraction was repeat dense [13].

[Figure 5 plot: CM values versus scaffold location, super33; genetic distance (cM, 60 to 85) versus position on scaffold (x100000, 0 to 60). Marker labels as extracted: 82.6, 81.5, 79.2, 79.2, 79.5, 76.9, 81.8, 82.4, 71.7, 68.1.]

Figure 5: Position versus cM values for markers along super33.

[Figure 6 plot: Dinucleotide signatures, Gm06; CG counts and CC/GG counts (vertical axis, 0 to 2) versus position on chromosome.]

Figure 6: ρ∗CG and ρ∗CC/GG counts for 50 kb stretches along chromosome 6. The counts stay relatively stable until the centromere, where they differ significantly from the rest of the genome.

3. Discussion

Marker data provided the bulk of the information necessary to order and orient the soybean WGS scaffolds, particularly in euchromatic regions. In addition, other tools were also useful across the genome, including FPC contigs, synteny plots, and gene- and retrotransposon-density data. However, in pericentromeric regions, the final assembly often required judgment calls after examining several pieces of inconclusive evidence. Thus, chromosomal assembly is not an exact science, particularly in centromeric and pericentromeric regions, where repeat arrays and a lack of marker resolution make higher-order assemblies problematic.

One property of a genome sequence, termed a dinucleotide signature, has been used to infer evolutionary history and structural organization of the genome. Dinucleotide signature data [14–17] were first used as a means to show that the two strands of DNA are of opposite rather than similar polarity [18]. Since then they have been used to clarify phylogenetic relationships [14, 19–22] and have provided evidence of horizontal transfer events between organisms [23–25]. In the latter application, it is the distinctive, relatively homogeneous signature of an organism’s genome that allows putative foreign DNA to be identified. More recent work has suggested a correlation between changes in genomic signature and changes in DNA replication and repair machinery [26]. The evolutionary distances between DNA repair and recombination orthologs in a group of proteobacteria correlated very highly with dinucleotide signature differences [26].

Until now, it has been generally accepted that, for any 50 kb stretch of a genome, ρ∗ for that segment varies little when compared to other 50 kb segments of the same genome. Differences between ρ∗ values for different organisms have been reported to be larger than differences between ρ∗ values for segments of the same organism [19, 27]. The soybean genome appears to challenge that conventional wisdom.

Binding energy and dinucleotide difference plots provided additional information for the soybean assembly, but their utility was predicated on the existence of large-scale chromosomal patterns for both of these measures. Broad patterns were not evident in poplar, Arabidopsis, or rice, but smaller-scale features were evident. Although we did not have the scaffold boundaries as an aid, this suggests that our technique could be used to guide some scaffold placements in other species.

That the (C+G) content of euchromatic DNA (0.32) matched so closely the (C+G) content of heterochromatic DNA without LTR retrotransposons (0.33), coupled with the high (C+G) content of the LTR retrotransposons themselves (0.39), suggests that the broad peaks we see in the dinucleotide binding energy plot of chromosome 6 are connected to LTR retrotransposons. A strikingly high proportion (approximately 87%) of the LTR retrotransposons in soybean are located in pericentromeric regions [28]. Remaining variability in (C+G) content and dinucleotide signature in the pericentromere may be due to various features of the pericentromere, including ribosomal arrays and other genes (approximately 22% of genes predicted in the soybean genome occur within pericentromeric boundaries [1]). The ρ∗CG and ρ∗CC/GG values and localization of the length-91 repeats in the centromere suggest that the centromeric peaks are a result of the repeats.


We note that binding energy in the soybean data correlates very strongly with (C+G) content, defined as the ratio of the total G and C count to the total nucleotide count. For the entire soybean genome the correlation was 0.999. Similarly, the correlation between binding energy and (C+G) content in parts of the human genome was calculated to be 0.998 [29]. This suggests that in soybean (C+G) content and dinucleotide binding energy could be used interchangeably. We chose to use dinucleotide binding energy because we plan to compare this genome feature between soybean and other plant species.

4. Conclusions

We described a new technique for evaluating the placement of sequence scaffolds into linkage groups in areas of the chromosome where marker resolution is poor because of infrequent recombination events. The technique can highlight shifts in gradients and can identify possibly problematic scaffold placements; nevertheless, it should be used with other sources of information such as genetic and physical map data. There are other signals in DNA that could serve as additional pieces of information, such as nucleosome binding potential [29]. Many signals correlate strongly with (C+G) content, suggesting they would add little additional information because dinucleotide binding energy correlates so strongly with (C+G).

5. Methods

5.1. Genome Sequences. The soybean genomic sequence assemblies shown in Figures 3 and 4 were built from scaffolds generated with the Arachne [10] assembler as part of the soybean genome consortium project [2]. Those draft assemblies led to the Glyma1.01 assembly, available at http://www.phytozome.net/soybean.php. The poplar (JGI, v1.0 (June 2004)) [13] and Arabidopsis (version TAIR 9.0) [11] genomes were downloaded from NCBI. The grape genome (assembly version 1, 2007) [30] was downloaded from the Grape Genome Browser (http://www.genoscope.cns.fr/externe/GenomeBrowser/Vitis/).

5.2. Dinucleotide Signature. The dinucleotide signature is based on the frequencies of the individual dinucleotides, normalized for the frequencies of the nucleotides. Let f_X be the frequency of nucleotide X in a genome and let f_XY be the frequency of dinucleotide XY. We define

$$\rho_{XY} = \frac{f_{XY}}{f_X f_Y} \tag{1}$$

as the signature of dinucleotide XY, normalized for the percentages of the component nucleotides. Since genomic DNA is double stranded, we generalize

$$\rho^{*}_{XY} = \tilde{\rho}_{XY} = \frac{\tilde{f}_{XY}}{\tilde{f}_X \tilde{f}_Y} \tag{2}$$

to include the reverse complement of a single-stranded sequence; here the tilde marks frequencies computed over the sequence combined with its reverse complement. Since there are 16 possible dinucleotides, ρ∗ constitutes a vector signature for any given genome, consisting of the 16 individual dinucleotide signatures. We define ρ∗(f) and ρ∗(g) to be the vector signatures of organisms f and g (or of regions f and g in the same genome). A coarse-grained measurement of the difference between the two organisms’ signatures is thus defined by Karlin and Mrazek [21] as

$$\delta^{*}(f, g) = \frac{1}{16} \sum_{XY} \left| \rho^{*}_{XY}(f) - \rho^{*}_{XY}(g) \right| \tag{3}$$

Table 1: Free energy of binding at 37°C for all of the dinucleotide pairs [38]. The reverse-complement pairs are shown together, resulting in 10 total unique pairs.

Dinucleotide pair    ΔG° (kcal/mol)
AA/TT                −1.00
AC/GT                −1.44
AG/CT                −1.28
AT                   −0.88
CA/TG                −1.45
CC/GG                −1.84
CG                   −2.17
GA/TC                −1.30
GC                   −2.24
TA                   −0.58

ρ∗ was calculated for the soybean genome as a whole, taking into account total nucleotide and dinucleotide counts for all chromosomes. Ns in the sequence were not included in either the nucleotide or dinucleotide counts. The vector was then compared with the vector from nonoverlapping 50 kb stretches of a chromosome, generating δ∗ values for all 50 kb stretches. This was done using custom Perl scripts (available on request). For random sequences of DNA, values of ρ∗XY greater than 1.23 or less than 0.78 were found to occur less than one time in a thousand [21, 31]. These values have been used to identify an over- and under-representation of a dinucleotide, respectively [14, 25, 31].
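The signature and difference calculations described in this section can be sketched in Python as follows. This is an illustrative reimplementation under stated assumptions, not the authors' Perl scripts: the function names are hypothetical, windows are cut at fixed offsets, and an undefined ratio (a base absent from the window) is reported as 0.

```python
from collections import Counter
from itertools import product

BASES = "ACGT"
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement; symbols other than A/C/G/T (e.g. N) are kept."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def rho_star(seq):
    """Strand-symmetrized dinucleotide signature of one sequence.

    Counts are pooled over the sequence and its reverse complement.
    Bases and dinucleotides containing anything but A/C/G/T are skipped.
    """
    mono, di = Counter(), Counter()
    for strand in (seq.upper(), reverse_complement(seq)):
        for i, base in enumerate(strand):
            if base not in BASES:
                continue
            mono[base] += 1
            nxt = strand[i + 1:i + 2]
            if nxt and nxt in BASES:
                di[base + nxt] += 1
    n_mono, n_di = sum(mono.values()), sum(di.values())
    signature = {}
    for x, y in product(BASES, repeat=2):
        denom = mono[x] * mono[y]
        if n_di == 0 or denom == 0:
            signature[x + y] = 0.0  # undefined ratio; 0 in this sketch
        else:
            signature[x + y] = (di[x + y] / n_di) / (denom / n_mono ** 2)
    return signature


def delta_star(sig_f, sig_g):
    """Mean absolute difference over the 16 dinucleotides, as in equation (3)."""
    return sum(abs(sig_f[k] - sig_g[k]) for k in sig_f) / 16


def window_deltas(chromosome, reference_sig, window=50_000):
    """delta* of each nonoverlapping window against a reference signature."""
    return [
        (start, delta_star(rho_star(chromosome[start:start + window]), reference_sig))
        for start in range(0, len(chromosome), window)
    ]
```

A typical use would compute `rho_star` once over the concatenated genome and pass it as `reference_sig` to `window_deltas` for each chromosome. Pooling both strands makes reverse-complement pairs such as AC and GT share one value, which is why a flipped scaffold keeps the same per-window signal.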

5.3. Dinucleotide Binding Energy. Thermodynamic stability of DNA has been used as a means of predicting coding regions and promoter locations of a genome [32–34]. The success of this method is largely dependent on the difference in (C+G) content between the regions of interest and the (C+G) content of the rest of the genome. Nearest neighbor (NN) free energy values can be used to calculate thermodynamic stability of DNA. Numerous studies have measured and exploited NN free energy values of the various dinucleotide pairs [35–38]. Table 1 gives a consensus for the free energy of binding of each of the 16 dinucleotide pairs [38].

DNA binding energy was calculated using the Nearest Neighbor (NN) free energy values in Table 1. For a 50 kb segment of DNA, the average free energy of binding was calculated by summing the free energy of all overlapping dinucleotides and dividing by the total number of dinucleotides. Ns were ignored in both the energy calculation and the nucleotide count. As with dinucleotide signature, the average binding energy was calculated, in 50 kb stretches, for all chromosomes. This was done using custom Perl scripts (available on request).
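The windowed average described above can be sketched with the Table 1 values expanded to all 16 dinucleotides. This is an illustration rather than the authors' Perl code; the function names are hypothetical, and reporting the negated mean ΔG° is an assumption made so that the output is positive like the energies plotted in Figures 3 and 4.

```python
# Table 1 values (kcal/mol at 37 C), expanded so each dinucleotide and its
# reverse complement share one entry.
NN_FREE_ENERGY = {
    "AA": -1.00, "TT": -1.00,
    "AC": -1.44, "GT": -1.44,
    "AG": -1.28, "CT": -1.28,
    "AT": -0.88,
    "CA": -1.45, "TG": -1.45,
    "CC": -1.84, "GG": -1.84,
    "CG": -2.17,
    "GA": -1.30, "TC": -1.30,
    "GC": -2.24,
    "TA": -0.58,
}


def mean_binding_energy(seq):
    """Negated mean NN free energy over all overlapping dinucleotides.

    Dinucleotides containing N (or any non-ACGT symbol) are skipped.
    Returns None if the sequence has no valid dinucleotide.
    """
    s = seq.upper()
    total, count = 0.0, 0
    for i in range(len(s) - 1):
        dg = NN_FREE_ENERGY.get(s[i:i + 2])
        if dg is not None:
            total += dg
            count += 1
    return -total / count if count else None


def windowed_energy(chromosome, window=50_000):
    """(start, mean energy) for each nonoverlapping window of a chromosome."""
    return [
        (start, mean_binding_energy(chromosome[start:start + window]))
        for start in range(0, len(chromosome), window)
    ]
```

In this sketch the dinucleotide that straddles two windows is dropped, which is negligible at 50 kb. Because the table is symmetric under reverse complementation, a sequence and its reverse complement give the same mean.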


5.4. Other. Tandem repeats were found using the etandem repeat finding software that is part of the EMBOSS software package [8]. Counts of (C+G) were calculated using custom Perl scripts (available on request).
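The tandem-array step described in the Results (grouping similarity-search hits that lie within 910 base pairs of each other) can be sketched as a single pass over sorted hit positions. The function name, the `min_hits` threshold, and the coordinates in the usage note are assumptions for illustration; only the 910 bp gap comes from the text.

```python
def tandem_arrays(hit_starts, max_gap=910, min_hits=2):
    """Group hit start positions into arrays of closely spaced hits.

    Consecutive hits at most max_gap bases apart are placed in one array.
    Returns a list of (first_start, last_start, n_hits) tuples for arrays
    with at least min_hits members.
    """
    arrays = []
    current = []
    for pos in sorted(hit_starts):
        if current and pos - current[-1] > max_gap:
            if len(current) >= min_hits:
                arrays.append((current[0], current[-1], len(current)))
            current = []
        current.append(pos)
    if len(current) >= min_hits:
        arrays.append((current[0], current[-1], len(current)))
    return arrays
```

For made-up hit starts `[100, 191, 282, 5000, 100000, 100091]`, the sketch reports two arrays, one of three hits and one of two, and discards the isolated hit at 5000.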

Acknowledgments

The authors would like to thank Nathan Weeks for helpful programming suggestions during the collection and preparation of data for this paper. Funding for MP was provided by the United Soybean Board.

References

[1] J. Schmutz, S. B. Cannon, J. Schlueter et al., “Genome sequence of the palaeopolyploid soybean,” Nature, vol. 463, no. 7278, pp. 178–183, 2010.

[2] S. A. Jackson, D. Rokhsar, G. Stacey, R. C. Shoemaker, J. Schmutz, and J. Grimwood, “Toward a reference sequence of the soybean genome: a multiagency effort,” Crop Science, vol. 46, no. 1, pp. 55–61, 2006.

[3] D. L. Hyten, S. B. Cannon, Q. Song et al., “High-throughput SNP discovery through deep resequencing of a reduced representation library to anchor and orient scaffolds in the soybean whole genome sequence,” BMC Genomics, vol. 11, no. 1, article 38, 2010.

[4] D. L. Hyten, I.-Y. Choi, Q. Song et al., “A high density integrated genetic linkage map of soybean and the development of a 1536 universal soy linkage panel for quantitative trait locus mapping,” Crop Science, vol. 50, no. 3, pp. 960–968, 2010.

[5] L. K. Anderson, N. Salameh, H. W. Bass et al., “Integrating genetic linkage maps with pachytene chromosome structure in maize,” Genetics, vol. 166, no. 4, pp. 1923–1933, 2004.

[6] M. I. Tenaillon, M. C. Sawkins, L. K. Anderson, S. M. Stack, J. Doebley, and B. S. Gaut, “Patterns of diversity and recombination along chromosome 1 of maize (Zea mays ssp. mays L.),” Genetics, vol. 162, no. 3, pp. 1401–1413, 2002.

[7] W. Nelson and C. Soderlund, “Integrating sequence with FPC fingerprint maps,” Nucleic Acids Research, vol. 37, no. 5, article e36, 2009.

[8] P. Rice, L. Longden, and A. Bleasby, “EMBOSS: the European molecular biology open software suite,” Trends in Genetics, vol. 16, no. 6, pp. 276–277, 2000.

[9] N. Gill, S. Findley, J. G. Walling et al., “Molecular and chromosomal evidence for allopolyploidy in soybean,” Plant Physiology, vol. 151, no. 3, pp. 1167–1174, 2009.

[10] S. Batzoglou, D. B. Jaffe, K. Stanley et al., “ARACHNE: a whole-genome shotgun assembler,” Genome Research, vol. 12, no. 1, pp. 177–189, 2002.

[11] S. Kaul, H. L. Koo, J. Jenkins et al., “Analysis of the genome sequence of the flowering plant Arabidopsis thaliana,” Nature, vol. 408, no. 6814, pp. 796–815, 2000.

[12] International Rice Genome Sequencing Project, “The map-based sequence of the rice genome,” Nature, vol. 436, no. 7052, pp. 793–800, 2005.

[13] G. A. Tuskan, S. DiFazio, S. Jansson et al., “The genome of black cottonwood, Populus trichocarpa (Torr. & Gray),” Science, vol. 313, no. 5793, pp. 1596–1604, 2006.

[14] S. Karlin and C. Burge, “Dinucleotide relative abundance extremes: a genomic signature,” Trends in Genetics, vol. 11, no. 7, pp. 283–290, 1995.

[15] C. Burge, A. M. Campbell, and S. Karlin, “Over- and under-representation of short oligonucleotides in DNA sequences,” Proceedings of the National Academy of Sciences of the United States of America, vol. 89, no. 4, pp. 1358–1362, 1992.

[16] A. J. Gentles and S. Karlin, “Genome-scale compositional comparisons in Eukaryotes,” Genome Research, vol. 11, no. 4, pp. 540–546, 2001.

[17] S. Karlin, L. Brocchieri, J. Trent, B. E. Blaisdell, and J. Mrazek, “Heterogeneity of genome and proteome content in bacteria, archaea, and eukaryotes,” Theoretical Population Biology, vol. 61, no. 4, pp. 367–390, 2002.

[18] J. Josse, A. D. Kaiser, and A. Kornberg, “Enzymatic synthesis of deoxyribonucleic acid. VIII. Frequencies of nearest neighbor base sequences in deoxyribonucleic acid,” Journal of Biological Chemistry, vol. 236, pp. 864–875, 1961.

[19] A. Campbell, J. Mrazek, and S. Karlin, “Genome signature comparisons among prokaryote, plasmid, and mitochondrial DNA,” Proceedings of the National Academy of Sciences of the United States of America, vol. 96, no. 16, pp. 9184–9189, 1999.

[20] S. Karlin and I. Ladunga, “Comparisons of eukaryotic genomic sequences,” Proceedings of the National Academy of Sciences of the United States of America, vol. 91, no. 26, pp. 12832–12836, 1994.

[21] S. Karlin and J. Mrazek, “Compositional differences within and between eukaryotic genomes,” Proceedings of the National Academy of Sciences of the United States of America, vol. 94, no. 19, pp. 10227–10232, 1997.

[22] M. W. J. van Passel, E. E. Kuramae, A. C. M. Luyf, A. Bart, and T. Boekhout, “The reach of the genome signature in prokaryotes,” BMC Evolutionary Biology, vol. 6, article 84, 2006.

[23] F. Collyn, L. Guy, M. Marceau, M. Simonet, and C.-A. H.Roten, “Describing ancient horizontal gene transfers at thenucleotide and gene levels by comparative pathogenicityisland genometrics,” Bioinformatics, vol. 22, no. 9, pp.1072–1079, 2006.

[24] B. Fertil, M. Massin, S. Lespinats, C. Devic, P. Dumee, andA. Giron, “GENSTYLE: exploration and analysis of DNAsequences with genomic signature,” Nucleic Acids Research,vol. 33, no. 2, pp. W512–W515, 2005.

[25] S. Karlin, “Detecting anomalous gene clusters andpathogenicity islands in diverse bacterial genomes,” Trends inMicrobiology, vol. 9, no. 7, pp. 335–343, 2001.

[26] A. Paz, V. Kirzhner, E. Nevo, and A. Korol, “Coevolution ofDNA-interacting proteins and genome ”dialect”,” MolecularBiology and Evolution, vol. 23, no. 1, pp. 56–64, 2006.

[27] S. Karlin, J. Mrazek, and A. M. Campbell, “Compositionalbiases of bacterial genomes and evolutionary implications,”Journal of Bacteriology, vol. 179, no. 12, pp. 3899–3913, 1997.

[28] J. Du, Z. Tian, C. S. Hans et al., “Evolutionary conservation,diversity and specificity of LTR retrotransposons in floweringplants: new insights from genome-wide analysis and multi-specific comparison,” The Plant Journal, vol. 63, no. 4, pp.584–598, 2010.

[29] W. Li and P. Miramontes, “Large-scale oscillation of structure-related DNA sequence features in human chromosome 21,”Physical Review E, vol. 74, no. 2, part 1, Article ID 021912,2006.

[30] O. Jaillon, J.-M. Aury, B. Noel et al., “The grapevine genomesequence suggests ancestral hexaploidization in majorangiosperm phyla,” Nature, vol. 449, no. 7161, pp. 463–467,2007.

[31] S. Karlin and L. R. Cardon, “Computational DNA sequenceanalysis,” Annual Review of Microbiology, vol. 48, pp. 619–654,1994.

Page 45: Genome Evolutiondownloads.hindawi.com/journals/specialissues/732768.pdf · Advances in Bioinformatics EditorialBoard Shandar Ahmad, Japan T. Akutsu, Japan Rolf Backofen, Germany Craig

Advances in Bioinformatics 7

[32] C. J. Benham and C. Bi, “The analysis of stress-induced duplexdestabilization in long genomic DNA sequences,” Journal ofComputational Biology, vol. 11, no. 4, pp. 519–543, 2004.

[33] E. Yeramian, S. Bonnefoy, and G. Langsley, “Physics-basedgene identification: proof of concept for plasmodiumfalciparum,” Bioinformatics, vol. 18, no. 1, pp. 190–193, 2002.

[34] E. Yeramian and L. Jones, “GeneFizz: a web tool to comparegenetic (coding/non-coding) and physical (helix/coil)segmentations of DNA sequences. Gene discovery andevolutionary perspectives,” Nucleic Acids Research, vol. 31, no.13, pp. 3843–3849, 2003.

[35] K. J. Breslauer, R. Frank, H. Blocker, and L. A. Marky,“Predicting DNA duplex stability from the base sequence,”Proceedings of the National Academy of Sciences of the UnitedStates of America, vol. 83, no. 11, pp. 3746–3750, 1986.

[36] R. Gonzalez, Y. Zeng, V. Ivanov, and G. Zocchi, “Bubbles inDNA melting,” Journal of Physics Condensed Matter, vol. 21,no. 3, Article ID 034102, 9 pages, 2009.

[37] W. A. Kibbe, “OligoCalc: an online oligonucleotide propertiescalculator,” Nucleic Acids Research, vol. 35, pp. W43–W46,2007.

[38] J. SantaLucia Jr., “A unified view of polymer, dumbbell, andoligonucleotide DNA nearest-neighbor thermodynamics,”Proceedings of the National Academy of Sciences of the UnitedStates of America, vol. 95, no. 4, pp. 1460–1465, 1998.

Page 46: Genome Evolutiondownloads.hindawi.com/journals/specialissues/732768.pdf · Advances in Bioinformatics EditorialBoard Shandar Ahmad, Japan T. Akutsu, Japan Rolf Backofen, Germany Craig

Hindawi Publishing Corporation
Advances in Bioinformatics
Volume 2010, Article ID 167408, 4 pages
doi:10.1155/2010/167408

Resource Review

EREM: Parameter Estimation and Ancestral Reconstruction by Expectation-Maximization Algorithm for a Probabilistic Model of Genomic Binary Characters Evolution

Liran Carmel,1 Yuri I. Wolf,2 Igor B. Rogozin,2 and Eugene V. Koonin2

1 Department of Genetics, The Alexander Silberman Institute of Life Sciences, Faculty of Science, The Hebrew University of Jerusalem, Edmond J. Safra Campus, Givat Ram, Jerusalem 91904, Israel

2 National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA

Correspondence should be addressed to Liran Carmel, [email protected] and Eugene V. Koonin, [email protected]

Received 19 September 2009; Accepted 2 March 2010

Academic Editor: Wojciech Makalowski

Copyright © 2010 Liran Carmel et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Evolutionary binary characters are features of species or genes, indicating the absence (value zero) or presence (value one) of some property. Examples include eukaryotic gene architecture (the presence or absence of an intron in a particular locus), gene content, and morphological characters. In many studies, the acquisition of such binary characters is assumed to represent a rare evolutionary event, and consequently, their evolution is analyzed using various flavors of parsimony. However, when gain and loss of the character are not rare enough, a probabilistic analysis becomes essential. Here, we present a comprehensive probabilistic model to describe the evolution of binary characters on a bifurcating phylogenetic tree. A fast software tool, EREM, is provided, using maximum likelihood to estimate the parameters of the model and to reconstruct ancestral states (presence and absence in internal nodes) and events (gain and loss events along branches).

1. Introduction

Reconstruction of genome evolution critically depends on underlying evolutionary models. Many such models, whether implicit or explicit, have been proposed, reflecting different approaches to model design [1]. Probabilistic models, which describe the likelihoods of evolutionary events along the branches of a phylogenetic tree, are among the most commonly used. With the accumulation of genomic data and the advances in computational power, these models are becoming increasingly detailed and powerful. However, the models that have been most extensively explored relate to sequence evolution, whereas evolution of binary characters has received much less attention. Here, we propose a general probabilistic model of evolution of binary characters. While originally developed to study the evolution of eukaryotic gene architecture, the model is formulated in general notation and applies to diverse classes of binary characters. For example, the model can be used to study the evolution of gene content among species, or the evolution of genomic markers, morphological characters, and the like. The model assumes that the events along a particular branch at a given site depend on the properties of both the site and the branch. Moreover, the model can accommodate rate variability between sites. A fast and flexible software implementation is offered that can be used to analyze any submodel of the general model. In order to estimate the parameters of the model and to reconstruct ancestral states, an efficient expectation-maximization algorithm, which allows for missing data in the input, was developed. The software can also be used in a simulation mode, generating simulated input data.

2. The Model of Evolution

For the sake of concreteness, we present the model in terms of gene architecture, so the binary characters designate the presence or absence of an intron in a particular locus. However, with trivial notational changes, the model is equally applicable for describing evolution of other types of binary characters.

0 0 1 0 0 0 0 1 1 0
0 1 1 ∗ ∗ 0 0 1 1 1
0 0 0 1 0 0 1 0 0 0
0 1 1 1 1 0 1 1 1 1

(rows: 4 species; columns: 10 sites)

Figure 1: A fragment of a hypothetical input for a 4-species analysis. This is a segment of a multiple alignment of 4 orthologous genes. Notice the gap in one of the genes, designated by ∗ for unknown character. Among the 10 sites in this alignment fragment, there are 6 unique patterns: $\omega_1 = (0, 0, 0, 0)^T$, $\omega_2 = (0, 1, 0, 1)^T$, $\omega_3 = (1, 1, 0, 1)^T$, $\omega_4 = (0, 0, 1, 1)^T$, $\omega_5 = (0, \ast, 1, 1)^T$, and $\omega_6 = (0, \ast, 0, 1)^T$. This portion of the alignment, therefore, contains 2 copies of $\omega_1$, 2 copies of $\omega_2$, 3 copies of $\omega_3$, and one copy each of $\omega_4$, $\omega_5$, and $\omega_6$.

Figure 2: A bifurcating tree with 4 terminal nodes and 3 internal nodes. Branches are numbered by the node into which they lead.

A length-$n_g$ multiple alignment of a gene across $S$ species can be described by a matrix of size $S \times n_g$. For analysis of binary characters, each such matrix is defined on the alphabet $\{0, 1, \ast\}$, with a star ($\ast$) indicating a missing value. Each column represents the presence or absence of the binary character at this site (Figure 1). For $G$ genes aligned for the same set of species, we have a collection of $G$ matrices. Any column in one of the multiple alignments is a vector of length $S$ and is called a pattern. Let $T$ be the phylogenetic tree whose tips match these $S$ species, and let us index its nodes by $0, 1, \ldots, 2S - 2$, with the convention that 0 is the index of the root. We use the same indexing for the branches, with the convention that branch $i$ leads into node $i$ (Figure 2). We assume that the topology of $T$, as well as all branch lengths, $\Delta_1, \ldots, \Delta_{2S-2}$, is known.
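The notion of a pattern can be made concrete with a short Python sketch that recovers the unique patterns of the Figure 1 fragment and their copy numbers. The function name `count_patterns` and the string encoding of the rows are illustrative assumptions, not part of EREM; an ASCII `*` stands in for the missing-value symbol.

```python
from collections import Counter

# Alignment fragment from Figure 1: one row per species, one column per site.
# "*" marks a missing value (ASCII stand-in for the star used in the text).
ALIGNMENT = [
    "0010000110",
    "011**00111",
    "0001001000",
    "0111101111",
]

def count_patterns(rows):
    """Return a Counter mapping each column (pattern) to its number of copies.

    Hypothetical helper for illustration; not an EREM function.
    """
    if len({len(row) for row in rows}) != 1:
        raise ValueError("all rows must have the same number of sites")
    # zip(*rows) walks the matrix column by column.
    return Counter("".join(column) for column in zip(*rows))

patterns = count_patterns(ALIGNMENT)
for pattern, copies in patterns.items():
    print(pattern, copies)
```

Each key is one pattern read from the first species to the last, so `"0*11"` corresponds to $\omega_5$ in the caption of Figure 1.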

Let $q_t$ denote the state of node $t$ (i.e., 0 or 1), and let $q_t^P$ be the state of its parent node. We denote by $T(g, t)$ the transition matrix, such that $T_{ij}(g, t) = \Pr(q_t = j \mid q_t^P = i, g)$ is the probability of finding node $t$ in state $j$ given that its parent node is in state $i$, and given that we are looking at gene $g$. The evolutionary model assumes

$$
T(g, t) =
\begin{pmatrix}
1 - \xi_t \left(1 - e^{-\eta_g \Delta_t}\right) & \xi_t \left(1 - e^{-\eta_g \Delta_t}\right) \\
1 - (1 - \phi_t)\, e^{-\theta_g \Delta_t} & (1 - \phi_t)\, e^{-\theta_g \Delta_t}
\end{pmatrix}.
\tag{1}
$$

Here, $\xi_t$ and $\phi_t$ are the intron gain and loss coefficients of branch $t$, and $\eta_g$ and $\theta_g$ are the intron gain and loss rates of gene $g$. Clearly, the overall probability of gain, loss, or retention of an intron in gene $g$ along branch $t$ depends on both the particular gene and the particular branch. For instance, the probability that an intron is gained in gene $g$ along branch $t$ includes a contribution of the branch ($\xi_t$) and a contribution of the gene ($1 - e^{-\eta_g \Delta_t}$). To complete the probabilistic model on the phylogenetic tree, we define $\pi_0$ as the probability that the root is in state 0 (i.e., lacks an intron) at a particular site.
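As a minimal sketch, the transition matrix of equation (1) can be written as a small Python function. The function name, argument names, and numerical values are illustrative assumptions; EREM itself is a C++ program and its internal representation is not described here.

```python
import math

def transition_matrix(xi_t, phi_t, eta_g, theta_g, delta_t):
    """Return T(g, t) of equation (1) as a 2x2 nested list.

    T[i][j] is the probability that node t is in state j given that its
    parent is in state i. All names are hypothetical, chosen to mirror
    the symbols in the text.
    """
    gain = xi_t * (1.0 - math.exp(-eta_g * delta_t))          # 0 -> 1
    retention = (1.0 - phi_t) * math.exp(-theta_g * delta_t)  # 1 -> 1
    return [[1.0 - gain, gain],
            [1.0 - retention, retention]]

# Illustrative (made-up) parameter values.
T = transition_matrix(xi_t=0.3, phi_t=0.1, eta_g=0.5, theta_g=1.2, delta_t=0.8)
for row in T:
    print(row)
```

Each row sums to one by construction, and setting `xi_t = 1` and `phi_t = 0` leaves only the gene-specific exponential terms.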

Note that, in the absence of the branch-specific coefficients ($\xi_t = 1$ and $\phi_t = 0$), the transition matrix defines a two-state continuous-time Markovian process. Such a model is popular in evolutionary studies and is implemented in widely used software, such as PAML [2], PHYLIP [3], and PAUP∗ [4]; a number of expectation-maximization algorithms have been designed to analyze similar Markov processes on phylogenetic trees [5–7]. However, these models cannot take into account the combined influences of the branches and the genes, so the corresponding algorithms cannot be applied to our model.

In sequence evolution, rate variability among sites is customarily accommodated by introducing a random variable $r$ with unit mean, known as the rate coefficient. This coefficient is used to scale the time units of the phylogenetic tree for each particular site, reflecting a distinction between fast-evolving sites and slow-evolving ones [8, 9]. We use a similar idea here, but in order to keep the gain and loss processes independent, two independent rate coefficients are introduced, $r^{\eta}$ and $r^{\theta}$, to scale the gene-specific gain and loss rate parameters, $\eta_g \leftarrow r^{\eta} \cdot \eta_g$ and $\theta_g \leftarrow r^{\theta} \cdot \theta_g$. The rate coefficients are assumed to be distributed according to the following distributions:

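$$
\begin{aligned}
r^{\eta} &\sim \nu\, \delta(\eta) + (1 - \nu)\, \Gamma(\eta; \lambda_{\eta}), \\
r^{\theta} &\sim \Gamma(\theta; \lambda_{\theta}).
\end{aligned}
\tag{2}
$$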

Here, $\Gamma(x; \lambda)$ is the unit-mean gamma distribution of variable $x$ with shape parameter $\lambda$, $\delta(x)$ is the Dirac delta function, and $\nu$ is the fraction of sites that are incapable of gaining the character, which are termed zero sites.
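To illustrate equation (2), the following Python sketch draws rate coefficients from the two distributions using only the standard library. The function name and the parameter values are assumptions made for illustration; how EREM handles these distributions internally (for example, whether the gamma distribution is discretized) is not stated here.

```python
import random

def sample_rate_coefficients(nu, lambda_eta, lambda_theta, rng):
    """Draw one pair (r_eta, r_theta) from the distributions in equation (2).

    Hypothetical helper for illustration only.
    """
    # With probability nu the site is a zero site: its gain rate is scaled by 0.
    if rng.random() < nu:
        r_eta = 0.0
    else:
        # gammavariate(shape, scale) with scale = 1 / shape has mean 1.
        r_eta = rng.gammavariate(lambda_eta, 1.0 / lambda_eta)
    r_theta = rng.gammavariate(lambda_theta, 1.0 / lambda_theta)
    return r_eta, r_theta

rng = random.Random(0)
draws = [sample_rate_coefficients(0.2, 2.0, 3.0, rng) for _ in range(20000)]
zero_fraction = sum(1 for r_eta, _ in draws if r_eta == 0.0) / len(draws)
mean_r_theta = sum(r_theta for _, r_theta in draws) / len(draws)
print(zero_fraction, mean_r_theta)
```

With enough draws, the fraction of zero sites should approach $\nu$ and the mean of the loss coefficient should approach one, which is what the unit-mean parameterization is meant to ensure.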

We have developed an expectation-maximization algorithm to find the maximum likelihood estimators of the model parameters [10, 11]. The full model is described by $2G + 4S$ parameters. This number might become exceedingly high, resulting in high variance of the estimates. To circumvent this problem, we have developed a two-phase analysis procedure [12, 13]. In the first, homogeneous phase, the genes are assumed to have the same gain and loss rates; thus, they are all treated as one concatenated supergene. This sets $G = 1$, reducing the number of parameters to $2 + 4S$. In the second, heterogeneous phase, all the parameters estimated in the homogeneous phase are frozen, and only the gene-specific parameters are estimated. Extensive simulations have demonstrated the effectiveness of this procedure (see below).
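One way to account for the $2G + 4S$ total, offered as a reading of the model above rather than a breakdown stated by the authors, is to count two rates per gene, two coefficients per branch, and four global parameters ($\pi_0$, $\nu$, $\lambda_\eta$, $\lambda_\theta$):

```latex
\underbrace{2G}_{\eta_g,\ \theta_g}
\;+\; \underbrace{2(2S - 2)}_{\xi_t,\ \phi_t,\ t = 1, \ldots, 2S-2}
\;+\; \underbrace{4}_{\pi_0,\ \nu,\ \lambda_\eta,\ \lambda_\theta}
\;=\; 2G + 4S .
```

Under this reading, setting $G = 1$ in the homogeneous phase gives $2 + 4S$, in agreement with the count quoted in the text.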

As indicated above, we used this model to study the evolution of the intron-exon gene architecture in eukaryotes. In this case, a coding nucleotide position is marked with 1 if an intron is inserted in the sequence immediately after it. We compiled a set of $G = 391$ orthologous genes on which we marked the presence or absence of introns. We applied the model to these data and computed intron gain and loss coefficients for the different lineages [12], and intron gain and loss rates of the different genes [13].

3. The Software Tool

The software, named EREM after Evolutionary Reconstruction by Expectation-Maximization, was written in C++ using Microsoft Visual Studio .NET 2003, and is available from ftp://ftp.ncbi.nlm.nih.gov/pub/koonin/erem/ or http://carmelab.huji.ac.il/software.html. On the web site we provide the source code in both Windows and Unix versions (the latter version is courtesy of Stajich et al. [14]), a compiled executable tested under Windows XP, and full documentation. EREM is designed to perform three different tasks. First, based on the input data, it can estimate the parameters of the model by likelihood maximization. Second, it can reconstruct ancestral states and events along the phylogenetic tree. Third, one can run the software in a simulation mode, where the phylogenetic tree and/or the input data matrices are randomly generated.

To empirically assess the speed of EREM on a large data set, we performed simulations on a 19-species tree with genes containing a total of $10^5$, $3 \cdot 10^5$, or $5 \cdot 10^5$ sites. On a Pentium 3 GHz machine, this took, on average, 4.2, 9.6, and 12.3 minutes, respectively.

4. Parameter Estimation

The maximum likelihood estimation of the model parameters is computed by an expectation-maximization algorithm [10, 11]. This algorithm is a general iterative scheme that guarantees that the likelihood does not decrease from one iteration to the next. Every iteration consists of two computational procedures, called the expectation step (E-step) and the maximization step (M-step). Briefly, the E-step in our case is carried out using a series of recursions along the phylogenetic tree. In the M-step, the auxiliary function computed at the E-step is maximized in a parameter-by-parameter fashion, using low-tolerance one-dimensional maximization procedures.
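To give a flavor of what a recursion along the tree looks like, the sketch below computes the likelihood of a single pattern with the standard upward (pruning) recursion, treating a missing value as unconstrained. It is not EREM's E-step, whose recursions are given in [10, 11]. The tree topology, function names, and parameter values are assumptions made for illustration.

```python
import math

# Hypothetical 4-species topology (an assumption for illustration):
# root 0 -> internal nodes 1, 2; node 1 -> leaves 3, 4; node 2 -> leaves 5, 6.
CHILDREN = {0: (1, 2), 1: (3, 4), 2: (5, 6)}
LEAVES = (3, 4, 5, 6)

def transition_matrix(xi, phi, eta, theta, delta):
    """Equation (1) for one branch and one gene."""
    gain = xi * (1.0 - math.exp(-eta * delta))
    keep = (1.0 - phi) * math.exp(-theta * delta)
    return [[1.0 - gain, gain], [1.0 - keep, keep]]

def conditional_likelihood(node, pattern, T):
    """Return [L0, L1]: probability of the observed leaves below `node`
    given that `node` is in state 0 or 1."""
    if node not in CHILDREN:                 # terminal node
        observed = pattern[LEAVES.index(node)]
        if observed == "*":                  # missing value: no constraint
            return [1.0, 1.0]
        return [1.0 if observed == "0" else 0.0,
                1.0 if observed == "1" else 0.0]
    result = [1.0, 1.0]
    for child in CHILDREN[node]:
        below = conditional_likelihood(child, pattern, T)
        for i in (0, 1):
            # Sum over the child's state j, weighted by T_ij of its branch.
            result[i] *= sum(T[child][i][j] * below[j] for j in (0, 1))
    return result

def pattern_likelihood(pattern, T, pi0):
    """Likelihood of one pattern, averaging over the root state."""
    at_root = conditional_likelihood(0, pattern, T)
    return pi0 * at_root[0] + (1.0 - pi0) * at_root[1]

# Illustrative values; the same matrix is used on every branch for brevity.
T = {t: transition_matrix(0.5, 0.1, 0.4, 0.7, 1.0) for t in range(1, 7)}
print(pattern_likelihood("0*11", T, pi0=0.8))
```

Because a missing value contributes a factor of one for either state, the likelihood of `"0*11"` equals the sum of the likelihoods of `"0011"` and `"0111"`, which is one simple way to check such a recursion.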

For the full model, some combinations of parameters form invariants (yielding the same ancestral reconstructions); thus, the values estimated for some individual parameters are hard to interpret. The fundamental output to be analyzed is therefore the ancestral reconstruction (see next section) and not the estimated values of the model parameters.

5. Ancestral Reconstruction

Given the model parameters, EREM computes two types of ancestral reconstructions: (a) the average occupancy fraction (average number of sites with state 1 in any particular gene and in any particular node); (b) the average number of character gains, losses, or retentions that occurred for each gene along each branch. Simulations have shown that the reconstructions are highly accurate. For an intron-exon data set, we obtained relative errors of 1%, 3%, and 11% in estimating the number of introns in internal nodes, the number of loss events along each branch, and the number of gain events along each branch, respectively [12].
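For a tree as small as the 4-species example, the quantities behind both types of reconstruction can be obtained by brute-force enumeration of all node states, as in the Python sketch below. This is a didactic stand-in, not the algorithm EREM uses; the topology, names, and parameter values are assumptions. Summing the per-site values over the sites of a gene would give expected counts of the kind described above.

```python
import itertools
import math

# Hypothetical topology: root 0 -> 1, 2; node 1 -> leaves 3, 4; node 2 -> leaves 5, 6.
CHILDREN = {0: (1, 2), 1: (3, 4), 2: (5, 6)}
PARENT = {child: parent for parent, kids in CHILDREN.items() for child in kids}
LEAVES = (3, 4, 5, 6)
N_NODES = 7

def transition_matrix(xi, phi, eta, theta, delta):
    """Equation (1) for one branch and one gene."""
    gain = xi * (1.0 - math.exp(-eta * delta))
    keep = (1.0 - phi) * math.exp(-theta * delta)
    return [[1.0 - gain, gain], [1.0 - keep, keep]]

def joint_probability(states, T, pi0):
    """Probability of one complete assignment of states to all nodes."""
    p = pi0 if states[0] == 0 else 1.0 - pi0
    for node in range(1, N_NODES):
        p *= T[node][states[PARENT[node]]][states[node]]
    return p

def posterior_summary(pattern, T, pi0):
    """Return per-site posterior (occupancy, gains, losses).

    occupancy[n] = Pr(node n is in state 1 | pattern);
    gains[t], losses[t] = Pr(0->1 change, 1->0 change on branch t | pattern).
    """
    occupancy = [0.0] * N_NODES
    gains = [0.0] * N_NODES
    losses = [0.0] * N_NODES
    total = 0.0
    for states in itertools.product((0, 1), repeat=N_NODES):
        # Skip assignments that contradict an observed (non-missing) leaf.
        if any(obs != "*" and int(obs) != states[leaf]
               for leaf, obs in zip(LEAVES, pattern)):
            continue
        p = joint_probability(states, T, pi0)
        total += p
        for node in range(N_NODES):
            occupancy[node] += p * states[node]
            if node != 0:
                before, after = states[PARENT[node]], states[node]
                if before == 0 and after == 1:
                    gains[node] += p
                elif before == 1 and after == 0:
                    losses[node] += p
    return ([x / total for x in occupancy],
            [x / total for x in gains],
            [x / total for x in losses])

T = {t: transition_matrix(0.5, 0.1, 0.4, 0.7, 1.0) for t in range(1, 7)}
occupancy, gains, losses = posterior_summary("0101", T, pi0=0.8)
print(occupancy[:3])
```

Enumeration costs $2^{2S-1}$ evaluations per pattern, so it is only practical for toy trees; it is useful mainly as a reference against which a faster recursive implementation can be checked.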

EREM can output both a detailed gene-by-gene ancestral reconstruction report and an overall report that sums the results of all the individual genes.

6. Simulation Mode

One can run EREM in a simulation mode, in which it can randomly generate either of two types of input: (a) a phylogenetic tree, for which the user provides only the number of terminal nodes and the time span of the tree; (b) alignment matrices for any number of genes. In the latter case, a model is either provided by the user or randomly generated, and is then used to simulate evolution along the given or simulated phylogenetic tree.
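Simulating alignment matrices under the model amounts to drawing the root state from $\pi_0$ and then drawing each child state from the transition matrix of its branch. The Python sketch below illustrates the idea on a fixed, hypothetical 4-species tree; the function names and parameter values are assumptions and do not reflect EREM's interface.

```python
import math
import random

# Hypothetical topology: root 0 -> 1, 2; node 1 -> leaves 3, 4; node 2 -> leaves 5, 6.
CHILDREN = {0: (1, 2), 1: (3, 4), 2: (5, 6)}
LEAVES = (3, 4, 5, 6)

def transition_matrix(xi, phi, eta, theta, delta):
    """Equation (1) for one branch and one gene."""
    gain = xi * (1.0 - math.exp(-eta * delta))
    keep = (1.0 - phi) * math.exp(-theta * delta)
    return [[1.0 - gain, gain], [1.0 - keep, keep]]

def simulate_site(T, pi0, rng):
    """Return {node: state} for one site, drawn from the root downward."""
    states = {0: 0 if rng.random() < pi0 else 1}
    stack = [0]
    while stack:
        parent = stack.pop()
        for child in CHILDREN.get(parent, ()):
            # Probability that the child is in state 1 given the parent's state.
            p_one = T[child][states[parent]][1]
            states[child] = 1 if rng.random() < p_one else 0
            stack.append(child)
    return states

def simulate_alignment(n_sites, T, pi0, rng):
    """Return (rows, sites): one 0/1 string per species, plus all node states."""
    sites = [simulate_site(T, pi0, rng) for _ in range(n_sites)]
    rows = ["".join(str(site[leaf]) for site in sites) for leaf in LEAVES]
    return rows, sites

rng = random.Random(1)
T = {t: transition_matrix(0.5, 0.1, 0.4, 0.7, 1.0) for t in range(1, 7)}
rows, sites = simulate_alignment(50, T, 0.8, rng)
print("\n".join(rows))
```

Keeping the full `sites` record, including internal nodes, makes it possible to compare a reconstruction against the true simulated ancestral states.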

7. Input Files

EREM requires an input file in the form of a formatted text file. The web site contains a full description, with examples, of this file and all other files discussed here. If the phylogenetic tree is not simulated, the user should provide it (see the web site for the acceptable formats).

If the set of alignment matrices is not simulated, it should be supplied by the user in the form of two text files. The first file lists all the unique patterns in the data, and the second file lists, for each gene, the counts of each of the unique patterns.

8. Output Files

The primary output of EREM is a formatted text file that summarizes the input, the values of the estimated parameters, and the overall ancestral reconstruction. Also, for each estimated parameter, EREM produces a history file, which records how the value of this parameter changed over the iterations.

If a detailed, gene-by-gene ancestral reconstruction was requested by the user, EREM outputs a special text file providing this information. Additionally, if the phylogenetic tree was simulated, the information about it is kept in another text file. If the input data were simulated, the two summary files describing them (see previous section) are stored as text files. In this simulation mode, the user can also request to store the actual state of each site in each node of the tree, for the purposes of comparison with the ancestral reconstruction computed by EREM.


9. Matlab Auxiliary Functions

We use Matlab as the chief tool to analyze the results. Consequently, we have written a number of Matlab functions (written for Matlab version R2006a) to facilitate the interaction between Matlab users and EREM. The set of functions can be downloaded from the web site. In particular, these functions allow a Matlab user to generate input files, to read output files, to analyze the ancestral reconstructions, to visualize some of the results, and to compute error bars for the estimates. We emphasize that these utilities are provided on our web site to facilitate analysis of EREM output and are not an integral or necessary part of EREM.

10. Availability and Requirements

The web site (ftp://ftp.ncbi.nlm.nih.gov/pub/koonin/erem/ or http://carmelab.huji.ac.il/software.html) contains a detailed description of the formats of all input and output files, as well as a number of examples. Also, a full description of all Matlab auxiliary functions is provided.

The web site includes an executable for Windows XP Professional, as well as C++ source code in Microsoft Visual Studio .NET 2003 and in Unix (the latter version is courtesy of Stajich et al. [14]).

Acknowledgments

The authors thank Stajich et al. [14] for providing the Unix source code. L. Carmel is supported by the European Union Marie Curie International Reintegration Grant (PIRG05-GA-2009-248639).

References

[1] J. Felsenstein, Inferring Phylogenies, Sinauer Associates, Sunderland, Mass, USA, 2004.

[2] Z. Yang, “PAML 4: phylogenetic analysis by maximum likelihood,” Molecular Biology and Evolution, vol. 24, no. 8, pp. 1586–1591, 2007.

[3] J. Felsenstein, PHYLIP (Phylogeny Inference Package): Version 3.6, Department of Genome Sciences, University of Washington, Seattle, Wash, USA, 2005.

[4] D. L. Swofford, PAUP∗. Phylogenetic Analysis Using Parsimony (∗and Other Methods). Version 4, Sinauer Associates, Sunderland, Mass, USA, 2003.

[5] W. J. Bruno, “Modeling residue usage in aligned protein sequences via maximum likelihood,” Molecular Biology and Evolution, vol. 13, no. 10, pp. 1368–1374, 1996.

[6] I. Holmes and G. M. Rubin, “An expectation maximization algorithm for training hidden substitution models,” Journal of Molecular Biology, vol. 317, no. 5, pp. 753–764, 2002.

[7] A. Siepel and D. Haussler, “Phylogenetic estimation of context-dependent substitution rates by maximum likelihood,” Molecular Biology and Evolution, vol. 21, no. 3, pp. 468–488, 2004.

[8] M. Nei, R. Chakraborty, and P. A. Fuerst, “Infinite allele model with varying mutation rate,” Proceedings of the National Academy of Sciences of the United States of America, vol. 73, no. 11, pp. 4164–4168, 1976.

[9] T. Uzzell and K. W. Corbin, “Fitting discrete probability distributions to evolutionary events,” Science, vol. 172, no. 3988, pp. 1089–1096, 1971.

[10] L. Carmel, I. B. Rogozin, Y. I. Wolf, and E. V. Koonin, “An expectation-maximization algorithm for analysis of evolution of exon-intron structure of eukaryotic genes,” in Comparative Genomics, vol. 3678, pp. 35–46, 2005.

[11] L. Carmel, I. B. Rogozin, Y. I. Wolf, and E. V. Koonin, “Patterns of intron gain and conservation in eukaryotic genes,” BMC Evolutionary Biology, vol. 7, article 192, 2007.

[12] L. Carmel, Y. I. Wolf, I. B. Rogozin, and E. V. Koonin, “Three distinct modes of intron dynamics in the evolution of eukaryotes,” Genome Research, vol. 17, no. 7, pp. 1034–1044, 2007.

[13] L. Carmel, I. B. Rogozin, Y. I. Wolf, and E. V. Koonin, “Evolutionarily conserved genes preferentially accumulate introns,” Genome Research, vol. 17, no. 7, pp. 1045–1050, 2007.

[14] J. E. Stajich, F. S. Dietrich, and S. W. Roy, “Comparative genomic analysis of fungal genomes reveals intron-rich ancestors,” Genome Biology, vol. 8, no. 10, article R223, 2007.