J. Craig Venter, et al., The Sequence of the Human Genome. Science 291, 1304 (2001). DOI: 10.1126/science.1058040


The following resources related to this article are available online at www.sciencemag.org (this information is current as of January 25, 2012):

Updated information and services, including high-resolution figures, can be found in the online version of this article at:
http://www.sciencemag.org/content/291/5507/1304.full.html

A correction has been published for this article at:
http://www.sciencemag.org/content/292/5523/1838.full.html
http://www.sciencemag.org/content/295/5559/1466.2.full.html

Supporting Online Material can be found at:
http://www.sciencemag.org/content/suppl/2001/02/10/291.5507.1304.DC1.html
http://www.sciencemag.org/content/suppl/2001/02/10/291.5507.1304.DC2.html
http://www.sciencemag.org/content/suppl/2001/04/27/291.5507.1304.DC3.html


The Sequence of the Human Genome

J. Craig Venter,1* Mark D. Adams,1 Eugene W. Myers,1 Peter W. Li,1 Richard J. Mural,1

Granger G. Sutton,1 Hamilton O. Smith,1 Mark Yandell,1 Cheryl A. Evans,1 Robert A. Holt,1

Jeannine D. Gocayne,1 Peter Amanatides,1 Richard M. Ballew,1 Daniel H. Huson,1

Jennifer Russo Wortman,1 Qing Zhang,1 Chinnappa D. Kodira,1 Xiangqun H. Zheng,1 Lin Chen,1

Marian Skupski,1 Gangadharan Subramanian,1 Paul D. Thomas,1 Jinghui Zhang,1

George L. Gabor Miklos,2 Catherine Nelson,3 Samuel Broder,1 Andrew G. Clark,4 Joe Nadeau,5

Victor A. McKusick,6 Norton Zinder,7 Arnold J. Levine,7 Richard J. Roberts,8 Mel Simon,9

Carolyn Slayman,10 Michael Hunkapiller,11 Randall Bolanos,1 Arthur Delcher,1 Ian Dew,1 Daniel Fasulo,1

Michael Flanigan,1 Liliana Florea,1 Aaron Halpern,1 Sridhar Hannenhalli,1 Saul Kravitz,1 Samuel Levy,1

Clark Mobarry,1 Knut Reinert,1 Karin Remington,1 Jane Abu-Threideh,1 Ellen Beasley,1 Kendra Biddick,1

Vivien Bonazzi,1 Rhonda Brandon,1 Michele Cargill,1 Ishwar Chandramouliswaran,1 Rosane Charlab,1

Kabir Chaturvedi,1 Zuoming Deng,1 Valentina Di Francesco,1 Patrick Dunn,1 Karen Eilbeck,1

Carlos Evangelista,1 Andrei E. Gabrielian,1 Weiniu Gan,1 Wangmao Ge,1 Fangcheng Gong,1 Zhiping Gu,1

Ping Guan,1 Thomas J. Heiman,1 Maureen E. Higgins,1 Rui-Ru Ji,1 Zhaoxi Ke,1 Karen A. Ketchum,1

Zhongwu Lai,1 Yiding Lei,1 Zhenya Li,1 Jiayin Li,1 Yong Liang,1 Xiaoying Lin,1 Fu Lu,1

Gennady V. Merkulov,1 Natalia Milshina,1 Helen M. Moore,1 Ashwinikumar K. Naik,1

Vaibhav A. Narayan,1 Beena Neelam,1 Deborah Nusskern,1 Douglas B. Rusch,1 Steven Salzberg,12

Wei Shao,1 Bixiong Shue,1 Jingtao Sun,1 Zhen Yuan Wang,1 Aihui Wang,1 Xin Wang,1 Jian Wang,1

Ming-Hui Wei,1 Ron Wides,13 Chunlin Xiao,1 Chunhua Yan,1 Alison Yao,1 Jane Ye,1 Ming Zhan,1

Weiqing Zhang,1 Hongyu Zhang,1 Qi Zhao,1 Liansheng Zheng,1 Fei Zhong,1 Wenyan Zhong,1

Shiaoping C. Zhu,1 Shaying Zhao,12 Dennis Gilbert,1 Suzanna Baumhueter,1 Gene Spier,1

Christine Carter,1 Anibal Cravchik,1 Trevor Woodage,1 Feroze Ali,1 Huijin An,1 Aderonke Awe,1

Danita Baldwin,1 Holly Baden,1 Mary Barnstead,1 Ian Barrow,1 Karen Beeson,1 Dana Busam,1

Amy Carver,1 Angela Center,1 Ming Lai Cheng,1 Liz Curry,1 Steve Danaher,1 Lionel Davenport,1

Raymond Desilets,1 Susanne Dietz,1 Kristina Dodson,1 Lisa Doup,1 Steven Ferriera,1 Neha Garg,1

Andres Gluecksmann,1 Brit Hart,1 Jason Haynes,1 Charles Haynes,1 Cheryl Heiner,1 Suzanne Hladun,1

Damon Hostin,1 Jarrett Houck,1 Timothy Howland,1 Chinyere Ibegwam,1 Jeffery Johnson,1

Francis Kalush,1 Lesley Kline,1 Shashi Koduru,1 Amy Love,1 Felecia Mann,1 David May,1

Steven McCawley,1 Tina McIntosh,1 Ivy McMullen,1 Mee Moy,1 Linda Moy,1 Brian Murphy,1

Keith Nelson,1 Cynthia Pfannkoch,1 Eric Pratts,1 Vinita Puri,1 Hina Qureshi,1 Matthew Reardon,1

Robert Rodriguez,1 Yu-Hui Rogers,1 Deanna Romblad,1 Bob Ruhfel,1 Richard Scott,1 Cynthia Sitter,1

Michelle Smallwood,1 Erin Stewart,1 Renee Strong,1 Ellen Suh,1 Reginald Thomas,1 Ni Ni Tint,1

Sukyee Tse,1 Claire Vech,1 Gary Wang,1 Jeremy Wetter,1 Sherita Williams,1 Monica Williams,1

Sandra Windsor,1 Emily Winn-Deen,1 Keriellen Wolfe,1 Jayshree Zaveri,1 Karena Zaveri,1

Josep F. Abril,14 Roderic Guigo,14 Michael J. Campbell,1 Kimmen V. Sjolander,1 Brian Karlak,1

Anish Kejariwal,1 Huaiyu Mi,1 Betty Lazareva,1 Thomas Hatton,1 Apurva Narechania,1 Karen Diemer,1

Anushya Muruganujan,1 Nan Guo,1 Shinji Sato,1 Vineet Bafna,1 Sorin Istrail,1 Ross Lippert,1

Russell Schwartz,1 Brian Walenz,1 Shibu Yooseph,1 David Allen,1 Anand Basu,1 James Baxendale,1

Louis Blick,1 Marcelo Caminha,1 John Carnes-Stine,1 Parris Caulk,1 Yen-Hui Chiang,1 My Coyne,1

Carl Dahlke,1 Anne Deslattes Mays,1 Maria Dombroski,1 Michael Donnelly,1 Dale Ely,1 Shiva Esparham,1

Carl Fosler,1 Harold Gire,1 Stephen Glanowski,1 Kenneth Glasser,1 Anna Glodek,1 Mark Gorokhov,1

Ken Graham,1 Barry Gropman,1 Michael Harris,1 Jeremy Heil,1 Scott Henderson,1 Jeffrey Hoover,1

Donald Jennings,1 Catherine Jordan,1 James Jordan,1 John Kasha,1 Leonid Kagan,1 Cheryl Kraft,1

Alexander Levitsky,1 Mark Lewis,1 Xiangjun Liu,1 John Lopez,1 Daniel Ma,1 William Majoros,1

Joe McDaniel,1 Sean Murphy,1 Matthew Newman,1 Trung Nguyen,1 Ngoc Nguyen,1 Marc Nodell,1

Sue Pan,1 Jim Peck,1 Marshall Peterson,1 William Rowe,1 Robert Sanders,1 John Scott,1

Michael Simpson,1 Thomas Smith,1 Arlan Sprague,1 Timothy Stockwell,1 Russell Turner,1 Eli Venter,1

Mei Wang,1 Meiyuan Wen,1 David Wu,1 Mitchell Wu,1 Ashley Xia,1 Ali Zandieh,1 Xiaohong Zhu1

THE HUMAN GENOME

16 FEBRUARY 2001 VOL 291 SCIENCE www.sciencemag.org 1304


A 2.91-billion base pair (bp) consensus sequence of the euchromatic portion of the human genome was generated by the whole-genome shotgun sequencing method. The 14.8-billion bp DNA sequence was generated over 9 months from 27,271,853 high-quality sequence reads (5.11-fold coverage of the genome) from both ends of plasmid clones made from the DNA of five individuals. Two assembly strategies, a whole-genome assembly and a regional chromosome assembly, were used, each combining sequence data from Celera and the publicly funded genome effort. The public data were shredded into 550-bp segments to create a 2.9-fold coverage of those genome regions that had been sequenced, without including biases inherent in the cloning and assembly procedure used by the publicly funded group. This brought the effective coverage in the assemblies to eightfold, reducing the number and size of gaps in the final assembly over what would be obtained with 5.11-fold coverage. The two assembly strategies yielded very similar results that largely agree with independent mapping data. The assemblies effectively cover the euchromatic regions of the human chromosomes. More than 90% of the genome is in scaffold assemblies of 100,000 bp or more, and 25% of the genome is in scaffolds of 10 million bp or larger. Analysis of the genome sequence revealed 26,588 protein-encoding transcripts for which there was strong corroborating evidence and an additional ~12,000 computationally derived genes with mouse matches or other weak supporting evidence. Although gene-dense clusters are obvious, almost half the genes are dispersed in low G+C sequence separated by large tracts of apparently noncoding sequence. Only 1.1% of the genome is spanned by exons, whereas 24% is in introns, with 75% of the genome being intergenic DNA. Duplications of segmental blocks, ranging in size up to chromosomal lengths, are abundant throughout the genome and reveal a complex evolutionary history. Comparative genomic analysis indicates vertebrate expansions of genes associated with neuronal function, with tissue-specific developmental regulation, and with the hemostasis and immune systems. DNA sequence comparisons between the consensus sequence and publicly funded genome data provided locations of 2.1 million single-nucleotide polymorphisms (SNPs). A random pair of human haploid genomes differed at a rate of 1 bp per 1250 on average, but there was marked heterogeneity in the level of polymorphism across the genome. Less than 1% of all SNPs resulted in variation in proteins, but the task of determining which SNPs have functional consequences remains an open challenge.

Decoding of the DNA that constitutes the human genome has been widely anticipated for the contribution it will make toward understanding human evolution, the causation of disease, and the interplay between the environment and heredity in defining the human condition. A project with the goal of determining the complete nucleotide sequence of the human genome was first formally proposed in 1985 (1). In subsequent years, the idea met with mixed reactions in the scientific community (2). However, in 1990, the Human Genome Project (HGP) was officially initiated in the United States under the direction of the National Institutes of Health and the U.S. Department of Energy with a 15-year, $3 billion plan for completing the genome sequence. In 1998 we announced our intention to build a unique genome-sequencing facility, to determine the sequence of the human genome over a 3-year period. Here we report the penultimate milestone along the path toward that goal, a nearly complete sequence of the euchromatic portion of the human genome. The sequencing was performed by a whole-genome random shotgun method with subsequent assembly of the sequenced segments.

The modern history of DNA sequencing began in 1977, when Sanger reported his method for determining the order of nucleotides of DNA using chain-terminating nucleotide analogs (3). In the same year, the first human gene was isolated and sequenced (4). In 1986, Hood and co-workers (5) described an improvement in the Sanger sequencing method that included attaching fluorescent dyes to the nucleotides, which permitted them to be sequentially read by a computer. The first automated DNA sequencer, developed by Applied Biosystems in California in 1987, was shown to be successful when the sequences of two genes were obtained with this new technology (6). From early sequencing of human genomic regions (7), it became clear that cDNA sequences (which are reverse-transcribed from RNA) would be essential to annotate and validate gene predictions in the human genome. These studies were the basis in part for the development of the expressed sequence tag (EST) method of gene identification (8), which is a random selection, very high throughput sequencing approach to characterize cDNA libraries. The EST method led to the rapid discovery and mapping of human genes (9). The increasing numbers of human EST sequences necessitated the development of new computer algorithms to analyze large amounts of sequence data, and in 1993 at The Institute for Genomic Research (TIGR), an algorithm was developed that permitted assembly and analysis of hundreds of thousands of ESTs. This algorithm permitted characterization and annotation of human genes on the basis of 30,000 EST assemblies (10).

The complete 49-kbp bacteriophage lambda genome sequence was determined by a shotgun restriction digest method in 1982 (11). When considering methods for sequencing the smallpox virus genome in 1991 (12), a whole-genome shotgun sequencing method was discussed and subsequently rejected owing to the lack of appropriate software tools for genome assembly. However, in 1994, when a microbial genome-sequencing project was contemplated at TIGR, a whole-genome shotgun sequencing approach was considered possible with the TIGR EST assembly algorithm. In 1995, the 1.8-Mbp Haemophilus influenzae genome was completed by a whole-genome shotgun sequencing method (13). The experience with several subsequent genome-sequencing efforts established the broad applicability of this approach (14, 15).

A key feature of the sequencing approach used for these megabase-size and larger genomes was the use of paired-end sequences (also called mate pairs), derived from subclone libraries with distinct insert sizes and cloning characteristics. Paired-end sequences are sequences 500 to 600 bp in length from both ends of double-stranded DNA clones of prescribed lengths. The success of using end sequences from long segments (18 to 20 kbp) of DNA cloned into bacteriophage lambda in assembly of the microbial genomes led to the suggestion (16) of an approach to simultaneously map and sequence the human genome by means of end sequences from 150-kbp bacterial artificial chromosomes (BACs) (17, 18). The end sequences spanned by known distances provide long-range continuity across the genome. A modification of the BAC end-sequencing (BES) method was applied successfully to complete chromosome 2 from the Arabidopsis thaliana genome (19).

1 Celera Genomics, 45 West Gude Drive, Rockville, MD 20850, USA.
2 GenetixXpress, 78 Pacific Road, Palm Beach, Sydney 2108, Australia.
3 Berkeley Drosophila Genome Project, University of California, Berkeley, CA 94720, USA.
4 Department of Biology, Penn State University, 208 Mueller Lab, University Park, PA 16802, USA.
5 Department of Genetics, Case Western Reserve University School of Medicine, BRB-630, 10900 Euclid Avenue, Cleveland, OH 44106, USA.
6 Johns Hopkins University School of Medicine, Johns Hopkins Hospital, 600 North Wolfe Street, Blalock 1007, Baltimore, MD 21287-4922, USA.
7 Rockefeller University, 1230 York Avenue, New York, NY 10021-6399, USA.
8 New England BioLabs, 32 Tozer Road, Beverly, MA 01915, USA.
9 Division of Biology, 147-75, California Institute of Technology, 1200 East California Boulevard, Pasadena, CA 91125, USA.
10 Yale University School of Medicine, 333 Cedar Street, P.O. Box 208000, New Haven, CT 06520-8000, USA.
11 Applied Biosystems, 850 Lincoln Centre Drive, Foster City, CA 94404, USA.
12 The Institute for Genomic Research, 9712 Medical Center Drive, Rockville, MD 20850, USA.
13 Faculty of Life Sciences, Bar-Ilan University, Ramat-Gan, 52900 Israel.
14 Grup de Recerca en Informatica Medica, Institut Municipal d'Investigacio Medica, Universitat Pompeu Fabra, 08003-Barcelona, Catalonia, Spain.

*To whom correspondence should be addressed. E-mail: [email protected]

In 1997, Weber and Myers (20) proposed whole-genome shotgun sequencing of the human genome. Their proposal was not well received (21). However, by early 1998, as less than 5% of the genome had been sequenced, it was clear that the rate of progress in human genome sequencing worldwide was very slow (22), and the prospects for finishing the genome by the 2005 goal were uncertain.

In early 1998, PE Biosystems (now Applied Biosystems) developed an automated, high-throughput capillary DNA sequencer, subsequently called the ABI PRISM 3700 DNA Analyzer. Discussions between PE Biosystems and TIGR scientists resulted in a plan to undertake the sequencing of the human genome with the 3700 DNA Analyzer and the whole-genome shotgun sequencing techniques developed at TIGR (23). Many of the principles of operation of a genome-sequencing facility were established in the TIGR facility (24). However, the facility envisioned for Celera would have a capacity roughly 50 times that of TIGR, and thus new developments were required for sample preparation and tracking and for whole-genome assembly. Some argued that the required 150-fold scale-up from the H. influenzae genome to the human genome with its complex repeat sequences was not feasible (25). The Drosophila melanogaster genome was thus chosen as a test case for whole-genome assembly on a large and complex eukaryotic genome. In collaboration with Gerald Rubin and the Berkeley Drosophila Genome Project, the nucleotide sequence of the 120-Mbp euchromatic portion of the Drosophila genome was determined over a 1-year period (26-28). The Drosophila genome-sequencing effort resulted in two key findings: (i) that the assembly algorithms could generate chromosome assemblies with highly accurate order and orientation with substantially less than 10-fold coverage, and (ii) that undertaking multiple interim assemblies in place of one comprehensive final assembly was not of value.

These findings, together with the dramatic changes in the public genome effort subsequent to the formation of Celera (29), led to a modified whole-genome shotgun sequencing approach to the human genome. We initially proposed to do 10-fold sequence coverage of the genome over a 3-year period and to make interim assembled sequence data available quarterly. The modifications included a plan to perform random shotgun sequencing to ~5-fold coverage and to use the unordered and unoriented BAC sequence fragments and subassemblies published in GenBank by the publicly funded genome effort (30) to accelerate the project. We also abandoned the quarterly announcements in the absence of interim assemblies to report.

Although this strategy provided a reasonable result very early that was consistent with a whole-genome shotgun assembly with eightfold coverage, the human genome sequence is not as finished as the Drosophila genome was with an effective 13-fold coverage. However, it became clear that even with this reduced coverage strategy, Celera could generate an accurately ordered and oriented scaffold sequence of the human genome in less than 1 year. Human genome sequencing was initiated 8 September 1999 and completed 17 June 2000. The first assembly was completed 25 June 2000, and the assembly reported here was completed 1 October 2000. Here we describe the whole-genome random shotgun sequencing effort applied to the human genome. We developed two different assembly approaches for assembling the ~3 billion bp that make up the 23 pairs of chromosomes of the Homo sapiens genome. Any GenBank-derived data were shredded to remove potential bias to the final sequence from chimeric clones, foreign DNA contamination, or misassembled contigs. Insofar as a correctly and accurately assembled genome sequence with faithful order and orientation of contigs is essential for an accurate analysis of the human genetic code, we have devoted a considerable portion of this manuscript to the documentation of the quality of our reconstruction of the genome. We also describe our preliminary analysis of the human genetic code on the basis of computational methods. Figure 1 (see fold-out chart associated with this issue; files for each chromosome can be found in Web fig. 1 on Science Online at www.sciencemag.org/cgi/content/full/291/5507/1304/DC1) provides a graphical overview of the genome and the features encoded in it. The detailed manual curation and interpretation of the genome are just beginning.

To aid the reader in locating specific analytical sections, we have divided the paper into seven broad sections. A summary of the major results appears at the beginning of each section.

1 Sources of DNA and Sequencing Methods
2 Genome Assembly Strategy and Characterization
3 Gene Prediction and Annotation
4 Genome Structure
5 Genome Evolution
6 A Genome-Wide Examination of Sequence Variations
7 An Overview of the Predicted Protein-Coding Genes in the Human Genome
8 Conclusions

1 Sources of DNA and Sequencing Methods

Summary. This section discusses the rationale and ethical rules governing donor selection to ensure ethnic and gender diversity along with the methodologies for DNA extraction and library construction. The plasmid library construction is the first critical step in shotgun sequencing. If the DNA libraries are not uniform in size, are chimeric, or do not randomly represent the genome, then the subsequent steps cannot accurately reconstruct the genome sequence. We used automated high-throughput DNA sequencing and the computational infrastructure to enable efficient tracking of enormous amounts of sequence information (27.3 million sequence reads; 14.9 billion bp of sequence). Sequencing and tracking from both ends of plasmid clones from 2-, 10-, and 50-kbp libraries were essential to the computational reconstruction of the genome. Our evidence indicates that the accurate pairing rate of end sequences was greater than 98%.

Various policies of the United States and the World Medical Association, specifically the Declaration of Helsinki, offer recommendations for conducting experiments with human subjects. We convened an Institutional Review Board (IRB) (31) that helped us establish the protocol for obtaining and using human DNA and the informed consent process used to enroll research volunteers for the DNA-sequencing studies reported here. We adopted several steps and procedures to protect the privacy rights and confidentiality of the research subjects (donors). These included a two-stage consent process, a secure random alphanumeric coding system for specimens and records, circumscribed contact with the subjects by researchers, and options for off-site contact of donors. In addition, Celera applied for and received a Certificate of Confidentiality from the Department of Health and Human Services. This Certificate authorized Celera to protect the privacy of the individuals who volunteered to be donors as provided in Section 301(d) of the Public Health Service Act 42 U.S.C. 241(d).

Celera and the IRB believed that the initial version of a completed human genome should be a composite derived from multiple donors of diverse ethnic backgrounds. Prospective donors were asked, on a voluntary basis, to self-designate an ethnogeographic category (e.g., African-American, Chinese, Hispanic, or Caucasian). We enrolled 21 donors (32).

Three basic items of information from each donor were recorded and linked by confidential code to the donated sample: age, sex, and self-designated ethnogeographic group. From females, ~130 ml of whole, heparinized blood was collected. From males, ~130 ml of whole, heparinized blood was collected, as well as five specimens of semen, collected over a 6-week period. Permanent lymphoblastoid cell lines were created by Epstein-Barr virus immortalization. DNA from five subjects was selected for genomic DNA sequencing: two males and three females, comprising one African-American, one Asian-Chinese, one Hispanic-Mexican, and two Caucasians (see Web fig. 2 on Science Online at www.sciencemag.org/cgi/content/291/5507/1304/DC1). The decision of whose DNA to sequence was based on a complex mix of factors, including the goal of achieving diversity as well as technical issues such as the quality of the DNA libraries and availability of immortalized cell lines.

1.1 Library construction and sequencing

Central to the whole-genome shotgun sequencing process is preparation of high-quality plasmid libraries in a variety of insert sizes so that pairs of sequence reads (mates) are obtained, one read from each end of each plasmid insert. High-quality libraries have an equal representation of all parts of the genome, a small number of clones without inserts, and no contamination from such sources as the mitochondrial genome and Escherichia coli genomic DNA. DNA from each donor was used to construct plasmid libraries in one or more of three size classes: 2 kbp, 10 kbp, and 50 kbp (Table 1) (33).

In designing the DNA-sequencing process, we focused on developing a simple system that could be implemented in a robust and reproducible manner and monitored effectively (Fig. 2) (34).

Current sequencing protocols are based on the dideoxy sequencing method (35), which typically yields only 500 to 750 bp of sequence per reaction. This limitation on read length has made monumental gains in throughput a prerequisite for the analysis of large eukaryotic genomes. We accomplished this at the Celera facility, which occupies about 30,000 square feet of laboratory space and produces sequence data continuously at a rate of 175,000 total reads per day. The DNA-sequencing facility is supported by a high-performance computational facility (36).

The process for DNA sequencing was modular by design and automated. Intermodule sample backlogs allowed four principal modules to operate independently: (i) library transformation, plating, and colony picking; (ii) DNA template preparation; (iii) dideoxy sequencing reaction set-up and purification; and (iv) sequence determination with the ABI PRISM 3700 DNA Analyzer. Because the inputs and outputs of each module have been carefully matched and sample backlogs are continuously managed, sequencing has proceeded without a single day's interruption since the initiation of the Drosophila project in May 1999. The ABI 3700 is a fully automated capillary array sequencer and as such can be operated with a minimal amount of hands-on time, currently estimated at about 15 min per day. The capillary system also facilitates correct associations of sequencing traces with samples through the elimination of manual sample loading and lane-tracking errors associated with slab gels. About 65 production staff were hired and trained, and were rotated on a regular basis through the four production modules. A central laboratory information management system (LIMS) tracked all sample plates by unique bar code identifiers. The facility was supported by a quality control team that performed raw material and in-process testing and a quality assurance group with responsibilities including document control, validation, and auditing of the facility. Critical to the success of the scale-up was the validation of all software and instrumentation before implementation, and production-scale testing of any process changes.

1.2 Trace processing

An automated trace-processing pipeline has been developed to process each sequence file (37). After quality and vector trimming, the average trimmed sequence length was 543 bp, and the sequencing accuracy was exponentially distributed with a mean of 99.5% and with less than 1 in 1000 reads being less than 98% accurate (26). Each trimmed sequence was screened for matches to contaminants including sequences of vector alone, E. coli genomic DNA, and human mitochondrial DNA. The entire read for any sequence with a significant match to a contaminant was discarded. A total of 713 reads matched E. coli genomic DNA and 2114 reads matched the human mitochondrial genome.
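The screening rule described above (discard the entire read when it matches a contaminant) can be sketched as follows. This is only an illustration and not the pipeline of reference (37): the function names `kmers` and `screen_reads`, the exact shared k-mer test, and the default of 20 bases are assumptions standing in for the "significant match" criterion, which the text does not define. Reverse-complement matches are not handled in this sketch.

```python
from typing import Dict, Iterable, List, Set


def kmers(seq: str, k: int) -> Set[str]:
    """Return the set of all length-k substrings of seq (upper-cased)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def screen_reads(reads: Dict[str, str],
                 contaminants: Iterable[str],
                 k: int = 20) -> List[str]:
    """Return the IDs of reads that pass contaminant screening.

    A read is discarded in full if it shares any exact k-mer with a
    contaminant sequence. The exact k-mer test is a hypothetical stand-in
    for a real similarity search.
    """
    contaminant_kmers: Set[str] = set()
    for contaminant in contaminants:
        contaminant_kmers |= kmers(contaminant, k)

    kept = []
    for read_id, seq in reads.items():
        # Keep the read only if none of its k-mers occur in a contaminant.
        if kmers(seq, k).isdisjoint(contaminant_kmers):
            kept.append(read_id)
    return kept
```

Discarding the whole read, rather than masking only the matching bases, mirrors the text's statement that the entire read was removed when any significant match was found.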

1.3 Quality assessment and control

The importance of the base-pair level accuracy of the sequence data increases as the size and repetitive nature of the genome to be sequenced increases. Each sequence read must be placed uniquely in the genome, and even a modest error rate can reduce the effectiveness of assembly. In addition, maintaining the validity of mate-pair information is absolutely critical for the algorithms described below. Procedural controls were established for maintaining the validity of sequence mate-pairs as sequencing reactions proceeded through the process, including strict rules built into the LIMS. The accuracy of sequence data produced by the Celera process was validated in the course of the Drosophila genome project (26). By collecting data for the entire human genome in a single facility, we were able to ensure uniform quality standards and the cost advantages associated with automation, an economy of scale, and process consistency.

Table 1. Celera-generated data input into assembly.

No. of sequencing reads (number of reads for different insert libraries)

Individual   2 kbp        10 kbp       50 kbp      Total        Total number of base pairs
A            0            0            2,767,357   2,767,357    1,502,674,851
B            11,736,757   7,467,755    66,930      19,271,442   10,464,393,006
C            853,819      881,290      0           1,735,109    942,164,187
D            952,523      1,046,815    0           1,999,338    1,085,640,534
F            0            1,498,607    0           1,498,607    813,743,601
Total        13,543,099   10,894,467   2,834,287   27,271,853   14,808,616,179

Fold sequence coverage (2.9-Gb genome)

Individual   2 kbp   10 kbp   50 kbp   Total
A            0       0        0.52     0.52
B            2.20    1.40     0.01     3.61
C            0.16    0.17     0        0.32
D            0.18    0.20     0        0.37
F            0       0.28     0        0.28
Total        2.54    2.04     0.53     5.11

Fold clone coverage

Individual   2 kbp   10 kbp   50 kbp   Total
A            0       0        18.39    18.39
B            2.96    11.26    0.44     14.67
C            0.22    1.33     0        1.54
D            0.24    1.58     0        1.82
F            0       2.26     0        2.26
Total        3.42    16.43    18.84    38.68

Insert size and mate statistics (average per library)

Measure               2 kbp      10 kbp      50 kbp
Insert size* (mean)   1,951 bp   10,800 bp   50,715 bp
Insert size* (SD)     6.10%      8.10%       14.90%
% Mates               74.50      80.80       75.60

*Insert size and SD are calculated from assembly of mates on contigs. % Mates is based on laboratory tracking of sequencing runs.

Fig. 2. Flow diagram for sequencing pipeline. Samples are received, selected, and processed in compliance with standard operating procedures, with a focus on quality within and across departments. Each process has defined inputs and outputs with the capability to exchange samples and data with both internal and external entities according to defined quality guidelines. Manufacturing pipeline processes, products, quality control measures, and responsible parties are indicated and are described further in the text.

2 Genome Assembly Strategy and Characterization

Summary. We describe in this section the two approaches that we used to assemble the genome. One method involves the computational combination of all sequence reads with shredded data from GenBank to generate an independent, nonbiased view of the genome. The second approach involves clustering all of the fragments to a region or chromosome on the basis of mapping information. The clustered data were then shredded and subjected to computational assembly. Both approaches provided essentially the same reconstruction of assembled DNA sequence with proper order and orientation. The second method provided slightly greater sequence coverage (fewer gaps) and was the principal sequence used for the analysis phase. In addition, we document the completeness and correctness of this assembly process and provide a comparison to the public genome sequence, which was reconstructed largely by an independent BAC-by-BAC approach. Our assemblies effectively covered the euchromatic regions of the human chromosomes. More than 90% of the genome was in scaffold assemblies of 100,000 bp or greater, and 25% of the genome was in scaffolds of 10 million bp or larger.
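To make the idea of shredding concrete, the sketch below cuts an assembled sequence into fixed-length, evenly spaced, overlapping segments. The 550-bp segment length is taken from the abstract; the function name `shred`, the even-tiling scheme, and the `depth` parameter (with its default of 2.0) are assumptions for illustration, since the exact tiling procedure is not described in this part of the paper.

```python
from typing import List


def shred(sequence: str, segment_len: int = 550, depth: float = 2.0) -> List[str]:
    """Cut `sequence` into overlapping segments of `segment_len` bases.

    Segment starts are spaced about segment_len / depth bases apart, so each
    interior base is covered by roughly `depth` segments. A final segment is
    anchored to the end so the tail of the sequence is not lost.
    """
    if depth <= 0 or segment_len <= 0:
        raise ValueError("segment_len and depth must be positive")
    if len(sequence) <= segment_len:
        # Sequences no longer than one segment are returned whole.
        return [sequence] if sequence else []

    step = max(1, int(segment_len / depth))
    last_start = len(sequence) - segment_len
    starts = list(range(0, last_start + 1, step))
    if starts[-1] != last_start:
        starts.append(last_start)
    return [sequence[s:s + segment_len] for s in starts]
```

Treating the shredded segments as if they were ordinary reads lets the assembler use the public data without inheriting the order and orientation of the original BAC assemblies, which is the motivation the text gives for shredding.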

Shotgun sequence assembly is a classic example of an inverse problem: given a set of reads randomly sampled from a target sequence, reconstruct the order and the position of those reads in the target. Genome assembly algorithms developed for Drosophila have now been extended to assemble the ~25-fold larger human genome. Celera assemblies consist of a set of contigs that are ordered and oriented into scaffolds that are then mapped to chromosomal locations by using known markers. The contigs consist of a collection of overlapping sequence reads that provide a consensus reconstruction for a contiguous interval of the genome. Mate pairs are a central component of the assembly strategy. They are used to produce scaffolds in which the size of gaps between consecutive contigs is known with reasonable precision. This is accomplished by observing that a pair of reads, one of which is in one contig, and the other of which is in another, implies an orientation and distance between the two contigs (Fig. 3). Finally, our assemblies did not incorporate all reads into the final set of reported scaffolds. This set of unincorporated reads is termed "chaff," and typically consisted of reads from within highly repetitive regions, data from other organisms introduced through various routes as found in many genome projects, and data of poor quality or with untrimmed vector.
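The way a mate pair constrains the distance between two contigs can be illustrated with a small calculation. The sketch below is a simplified model and not the Celera scaffolding algorithm: the class `MateLink`, its field names, and the function `estimate_gap` are hypothetical, and it assumes contig A lies to the left of contig B with the forward read in A and the reverse read in B. Because the clone spans the tail of A, the gap, and the head of B, subtracting the two known pieces from the library's mean insert size leaves an estimate of the gap.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class MateLink:
    """One mate pair bridging contig A (left) and contig B (right).

    a_start: 0-based start of the forward read in contig A.
    b_end:   0-based exclusive end of the reverse read in contig B.
    Both field names are hypothetical and chosen for this sketch.
    """
    a_start: int
    b_end: int


def estimate_gap(links: List[MateLink], len_a: int,
                 insert_mean: float, insert_sd: float) -> Tuple[float, float]:
    """Estimate the gap between contigs A and B from bridging mate pairs.

    Each clone covers (len_a - a_start) bases of A, the gap, and b_end bases
    of B, so gap = insert_mean - (len_a - a_start) - b_end.
    Returns (mean gap estimate, standard error of that mean).
    """
    if not links:
        raise ValueError("at least one bridging mate pair is required")
    estimates = [insert_mean - (len_a - link.a_start) - link.b_end
                 for link in links]
    mean_gap = sum(estimates) / len(estimates)
    std_err = insert_sd / math.sqrt(len(estimates))
    return mean_gap, std_err
```

Averaging over several independent links narrows the uncertainty by the square root of their number, which is one reason a tightly sized library helps. A negative estimate in this model would suggest that the two contigs overlap rather than being separated by a gap.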

2.1 Assembly data sets

We used two independent sets of data for our assemblies. The first was a random shotgun data set of 27.27 million reads of average length 543 bp produced at Celera. This consisted largely of mate-pair reads from 16 libraries constructed from DNA samples taken from five different donors. Libraries with insert sizes of 2, 10, and 50 kbp were used. By looking at how mate pairs from a library were positioned in known sequenced stretches of the genome, we were able to characterize the range of insert sizes in each library and determine a mean and standard deviation. Table 1 details the number of reads, sequencing coverage, and clone coverage achieved by the data set. The clone coverage is the coverage of the genome in cloned DNA, considering the entire insert of each clone that has sequence from both ends. The clone coverage provides a measure of the amount of physical DNA coverage of the genome. Assuming a genome size of 2.9 Gbp, the Celera trimmed sequences gave a 5.1× coverage of the genome, and clone coverage was 3.42×, 16.40×, and 18.84× for the 2-, 10-, and 50-kbp libraries, respectively, for a total of 38.7× clone coverage.
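The coverage figures quoted above follow from simple arithmetic. The sketch below recomputes the sequence coverage from the read count, average read length, and assumed genome size given in the text, and sums the reported per-library clone coverages. The variable names and the helper `clone_cov` are illustrative only; the per-library mate-pair counts needed to recompute clone coverage from scratch are in Table 1 and are not reproduced here.

```python
# Figures quoted in the text; names are illustrative, not from the paper.
reads = 27.27e6          # Celera shotgun reads
mean_read_len = 543      # average trimmed read length (bp)
genome_size = 2.9e9      # assumed genome size (bp)

# Sequence coverage = total sequenced bases / genome size.
sequence_coverage = reads * mean_read_len / genome_size

# Clone coverage per library as reported (keys are insert sizes in kbp).
clone_coverage = {2: 3.42, 10: 16.40, 50: 18.84}
total_clone_coverage = sum(clone_coverage.values())


def clone_cov(mate_pairs: float, mean_insert_bp: float, genome_bp: float) -> float:
    """Clone coverage of one library: total insert span divided by genome size."""
    return mate_pairs * mean_insert_bp / genome_bp


print(f"sequence coverage    ~ {sequence_coverage:.2f}x")
print(f"total clone coverage ~ {total_clone_coverage:.2f}x")
```

Clone coverage is much larger than sequence coverage because each mate pair accounts for the whole insert between its two reads, not only the sequenced ends.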

The second data set was from the publicly funded Human Genome Project (PFP) and is primarily derived from BAC clones (30). The BAC data input to the assemblies came from a download of GenBank on 1 September 2000 (Table 2) totaling 4443.3 Mbp of sequence. The data for each BAC are deposited at one of four levels of completion. Phase 0 data are a set of generally unassembled sequencing reads from a very light shotgun of the BAC, typically less than 1×. Phase 1 data are unordered assemblies of contigs, which we call BAC contigs or bactigs. Phase 2 data are ordered assemblies of bactigs. Phase 3 data are complete BAC sequences. In the past 2 years the PFP has focused on a product of lower quality and completeness, but on a faster time-course, by concentrating on the production of Phase 1 data from a 3× to 4× light-shotgun of each BAC clone.

We screened the bactig sequences for contaminants by using the BLAST algorithm against three data sets: (i) vector sequences in Univec core (38), filtered for a 25-bp match at 98% sequence identity at the ends of the sequence and a 30-bp match internal to the sequence; (ii) the nonhuman portion of the High Throughput Genomic (HTG) Sequences division of GenBank (39), filtered at 200 bp at 98%; and (iii) the nonredundant nucleotide sequences from GenBank without primate and human virus entries, filtered at 200 bp at 98%. Whenever 25 bp or more of vector was found within 50 bp of the end of a contig, the tip up to the matching vector was excised. Under these criteria we removed 2.6 Mbp of possible contaminant and vector from the Phase 3 data, 61.0 Mbp from the Phase 1 and 2 data, and 16.1 Mbp from the Phase 0 data (Table 2). This left us with a total of 4363.7 Mbp of PFP sequence data: 20% finished, 75% rough-draft (Phase 1 and 2), and 5% single sequencing reads (Phase 0). An additional 104,018 BAC end-sequence mate pairs were also downloaded and included in the data sets for both assembly processes (18).

2.2 Assembly strategies

Two different approaches to assembly were pursued. The first was a whole-genome assembly process that used Celera data and the PFP data in the form of additional synthetic shotgun data, and the second was a compartmentalized assembly process that first partitioned the Celera and PFP data into sets localized to large chromosomal segments and then performed ab initio shotgun assembly on each set. Figure 4 gives a schematic of the overall process flow.

For the whole-genome assembly, the PFP data was first disassembled or "shredded" into a synthetic shotgun data set of 550-bp reads that form a perfect 2× covering of the bactigs. This resulted in 16.05 million "faux" reads that were sufficient to cover the genome 2.96× because of redundancy in the BAC data set, without incorporating the biases inherent in the PFP assembly process. The combined data set of 43.32 million reads (8×), and all associated mate-pair information, were then subjected to our whole-genome assembly algorithm to produce a reconstruction of the genome. Neither the location of a BAC in the genome nor its assembly of bactigs was used in this process. Bactigs were shredded into reads because we found strong evidence that 2.13% of them were misassembled (40). Furthermore, BAC location

Fig. 3. Anatomy of whole-genome assembly. Overlapping shredded bactig fragments (red lines) and internally derived reads from five different individuals (black lines) are combined to produce a contig and a consensus sequence (green line). Contigs are connected into scaffolds (red) by using mate pair information. Scaffolds are then mapped to the genome (gray line) with STS (blue star) physical map information.




information was ignored because some BACs were not correctly placed on the PFP physical map and because we found strong evidence that at least 2.2% of the BACs contained sequence data that were not part of the given BAC (41), possibly as a result of sample-tracking errors (see below). In short, we performed a true, ab initio whole-genome assembly in which we took the expedient of deriving additional sequence coverage, but not mate pairs, assembled bactigs, or genome locality, from some externally generated data.
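The shredding of bactigs into 550-bp faux reads at 2× coverage, described above, can be illustrated with a short sketch. The text does not specify the exact tiling scheme, so the stride of half a read length and the end-anchored final read are assumptions made here for illustration; `shred` is a hypothetical helper name.

```python
def shred(bactig: str, read_len: int = 550, coverage: int = 2) -> list[tuple[int, str]]:
    """Cut a bactig into overlapping faux reads.

    Reads start every read_len // coverage bases, so interior bases are
    covered `coverage` times. A final read is anchored at the end of the
    sequence so that no tail bases are lost. Returns (offset, read) pairs.
    """
    if len(bactig) <= read_len:
        return [(0, bactig)]
    stride = read_len // coverage
    last_start = len(bactig) - read_len
    starts = list(range(0, last_start + 1, stride))
    if starts[-1] != last_start:
        starts.append(last_start)
    return [(s, bactig[s:s + read_len]) for s in starts]


# Example: a 2200-bp toy bactig yields seven 550-bp reads at stride 275.
toy_bactig = "ACGT" * 550
faux_reads = shred(toy_bactig)
print(len(faux_reads), [offset for offset, _ in faux_reads])
```

Because the faux reads carry no record of how the bactig was originally assembled, an assembler that consumes them is free to disagree with the original bactig layout, which is the property the text relies on.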

In the compartmentalized shotgun assembly (CSA), Celera and PFP data were partitioned into the largest possible chromosomal segments or "components" that could be determined with confidence, and then shotgun assembly was applied to each partitioned subset wherein the bactig data were again shredded into faux reads to ensure an independent ab initio assembly of the component. By subsetting the data in this way, the overall computational effort was reduced and the effect of interchromosomal duplications was ameliorated. This also resulted in a reconstruction of the genome that was relatively independent of the whole-genome assembly results so that the two assemblies could be compared for consistency. The quality of the partitioning into components was crucial so that different genome regions were not mixed together. We constructed components from (i) the longest scaffolds of the sequence from each BAC and (ii) assembled scaffolds of data unique to Celera's data set. The BAC assemblies were obtained by a combining assembler that used the bactigs and the 5× Celera data mapped to those bactigs as input. This effort was undertaken as an interim step solely because the more accurate and complete the scaffold for a given sequence stretch, the more accurately one can tile these scaffolds into contiguous components on the basis of sequence overlap and mate-pair information. We further visually inspected and curated the scaffold tiling of the components to further increase its accuracy. For the final CSA assembly, all but the partitioning was ignored, and an independent, ab initio reconstruction of the sequence in each component was obtained by applying our whole-genome assembly algorithm to the partitioned, relevant Celera data and the shredded, faux reads of the partitioned, relevant bactig data.

2.3 Whole-genome assembly

The algorithms used for whole-genome assembly (WGA) of the human genome were enhancements to those used to produce the sequence of the Drosophila genome reported in detail in (28).

The WGA assembler consists of a pipeline composed of five principal stages: Screener, Overlapper, Unitigger, Scaffolder, and Repeat Resolver, respectively. The Screener finds and marks all microsatellite repeats with less than a 6-bp element, and screens out all known interspersed repeat elements, including Alu, Line, and ribosomal DNA. Marked regions get searched for overlaps, whereas screened regions do not get searched, but can be part of an overlap that involves unscreened matching segments.

Table 2. GenBank data input into assembly. Columns give values by completion phase of the sequence.

| Center | Statistic | Phase 0 | Phases 1 and 2 | Phase 3 |
|---|---|---|---|---|
| Whitehead Institute/MIT Center for Genome Research, USA | Number of accession records | 2,825 | 6,533 | 363 |
| | Number of contigs | 243,786 | 138,023 | 363 |
| | Total base pairs | 194,490,158 | 1,083,848,245 | 48,829,358 |
| | Total vector masked (bp) | 1,553,597 | 875,618 | 2,202 |
| | Total contaminant masked (bp) | 13,654,482 | 4,417,055 | 98,028 |
| | Average contig length (bp) | 798 | 7,853 | 134,516 |
| Washington University, USA | Number of accession records | 19 | 3,232 | 1,300 |
| | Number of contigs | 2,127 | 61,812 | 1,300 |
| | Total base pairs | 1,195,732 | 561,171,788 | 164,214,395 |
| | Total vector masked (bp) | 21,604 | 270,942 | 8,287 |
| | Total contaminant masked (bp) | 22,469 | 1,476,141 | 469,487 |
| Baylor College of Medicine, USA | Number of accession records | 0 | 1,626 | 363 |
| | Number of contigs | 0 | 44,861 | 363 |
| | Total base pairs | 0 | 265,547,066 | 49,017,104 |
| | Total vector masked (bp) | 0 | 218,769 | 4,960 |
| | Total contaminant masked (bp) | 0 | 1,784,700 | 485,137 |
| Production Sequencing Facility, DOE Joint Genome Institute, USA | Number of accession records | 135 | 2,043 | 754 |
| | Number of contigs | 7,052 | 34,938 | 754 |
| | Total base pairs | 8,680,214 | 294,249,631 | 60,975,328 |
| | Total vector masked (bp) | 22,644 | 162,651 | 7,274 |
| | Total contaminant masked (bp) | 665,818 | 4,642,372 | 118,387 |
| | Average contig length (bp) | 1,231 | 8,422 | 80,867 |
| The Institute of Physical and Chemical Research (RIKEN), Japan | Number of accession records | 0 | 1,149 | 300 |
| | Number of contigs | 0 | 25,772 | 300 |
| | Total base pairs | 0 | 182,812,275 | 20,093,926 |
| | Total vector masked (bp) | 0 | 203,792 | 2,371 |
| | Total contaminant masked (bp) | 0 | 308,426 | 27,781 |
| | Average contig length (bp) | 0 | 7,093 | 66,978 |
| Sanger Centre, UK | Number of accession records | 0 | 4,538 | 2,599 |
| | Number of contigs | 0 | 74,324 | 2,599 |
| | Total base pairs | 0 | 689,059,692 | 246,118,000 |
| | Total vector masked (bp) | 0 | 427,326 | 25,054 |
| | Total contaminant masked (bp) | 0 | 2,066,305 | 374,561 |
| | Average contig length (bp) | 0 | 9,271 | 94,697 |
| Others* | Number of accession records | 42 | 1,894 | 3,458 |
| | Number of contigs | 5,978 | 29,898 | 3,458 |
| | Total base pairs | 5,564,879 | 283,358,877 | 246,474,157 |
| | Total vector masked (bp) | 57,448 | 279,477 | 32,136 |
| | Total contaminant masked (bp) | 575,366 | 1,616,665 | 1,791,849 |
| All centers combined† | Number of accession records | 3,021 | 21,015 | 9,137 |
| | Number of contigs | 258,943 | 409,628 | 9,137 |
| | Total base pairs | 209,930,983 | 3,360,047,574 | 835,722,268 |
| | Total vector masked (bp) | 1,655,293 | 2,438,575 | 82,284 |
| | Total contaminant masked (bp) | 14,918,135 | 16,311,664 | 3,365,230 |

*Other centers contributing at least 0.1% of the sequence include: Chinese National Human Genome Center; Genomanalyse Gesellschaft fuer Biotechnologische Forschung mbH; Genome Therapeutics Corporation; GENOSCOPE; Chinese Academy of Sciences; Institute of Molecular Biotechnology; Keio University School of Medicine; Lawrence Livermore National Laboratory; Cold Spring Harbor Laboratory; Los Alamos National Laboratory; Max-Planck Institut fuer Molekulare Genetik; Japan Science and Technology Corporation; Stanford University; The Institute for Genomic Research; The Institute of Physical and Chemical Research, Gene Bank; The University of Oklahoma; University of Texas Southwestern Medical Center; University of Washington.
†The 4,405,700,825 bases contributed by all centers were shredded into faux reads resulting in 2.96× coverage of the genome.




The Overlapper compares every read against every other read in search of complete end-to-end overlaps of at least 40 bp and with no more than 6% differences in the match. Because all data are scrupulously vector-trimmed, the Overlapper can insist on complete overlap matches. Computing the set of all overlaps took roughly 10,000 CPU hours with a suite of four-processor Alpha SMPs with 4 gigabytes of RAM. This took 4 to 5 days in elapsed time with 40 such machines operating in parallel.
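The overlap criterion above (at least 40 bp, no more than 6% differences, extending to the ends of both reads) can be made concrete with a deliberately naive sketch. It is not the Overlapper itself: it counts substitutions only (no insertions or deletions), ignores reverse complements, and tries every overlap length by brute force, which would be far too slow at genome scale. The function name `best_overlap` and the toy data are assumptions for illustration.

```python
import random


def best_overlap(a: str, b: str, min_len: int = 40, max_diff: float = 0.06):
    """Longest suffix of `a` matching a prefix of `b` within `max_diff`.

    Only substitutions are counted (Hamming distance). Returns a tuple
    (overlap_length, mismatches), or None if no overlap qualifies.
    """
    for length in range(min(len(a), len(b)), min_len - 1, -1):
        suffix, prefix = a[-length:], b[:length]
        mismatches = sum(x != y for x, y in zip(suffix, prefix))
        if mismatches <= max_diff * length:
            return length, mismatches
    return None


# Toy data: two reads sampled from a random 200-bp "genome", sharing 50 bp.
rng = random.Random(0)
genome = "".join(rng.choice("ACGT") for _ in range(200))
read_a, read_b = genome[0:120], genome[70:190]
print(best_overlap(read_a, read_b))
```

Requiring the match to run to the ends of both reads is what the text means by a complete overlap; it is only safe because vector sequence has already been trimmed from the read ends.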

Every overlap computed above is statistically a 1-in-10^17 event and thus not a coincidental event. What makes assembly combinatorially difficult is that while many overlaps are actually sampled from overlapping regions of the genome, and thus imply that the sequence reads should be assembled together, even more overlaps are actually from two distinct copies of a low-copy repeated element not screened above, thus constituting an error if put together. We call the former "true overlaps" and the latter "repeat-induced overlaps." The assembler must avoid choosing repeat-induced overlaps, especially early in the process.

We achieve this objective in the Unitigger. We first find all assemblies of reads that appear to be uncontested with respect to all other reads. We call the contigs formed from these subassemblies unitigs (for uniquely assembled contigs). Formally, these unitigs are the uncontested interval subgraphs of the graph of all overlaps (42). Unfortunately, although empirically many of these assemblies are correct (and thus involve only true overlaps), some are in fact collections of reads from several copies of a repetitive element that have been overcollapsed into a single subassembly. However, the overcollapsed unitigs are easily identified because their average coverage depth is too high to be consistent with the overall level of sequence coverage. We developed a simple statistical discriminator that gives the logarithm of the odds ratio that a unitig is composed of unique DNA or of a repeat consisting of two or more copies. The discriminator, set to a sufficiently stringent threshold, identifies a subset of the unitigs that we are certain are correct. In addition, a second, less stringent threshold identifies a subset of remaining unitigs very likely to be correctly assembled, of which we select those that will consistently scaffold (see below), and thus are again almost certain to be correct. We call the union of these two sets U-unitigs. Empirically, we found from a 6× simulated shotgun of human chromosome 22 that we get U-unitigs covering 98% of the stretches of unique DNA that are >2 kbp long. We are further able to identify the boundary of the start of a repetitive element at the ends of a U-unitig and leverage this so that U-unitigs span more than 93% of all singly interspersed Alu elements and other 100- to 400-bp repetitive segments.
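One way to picture a coverage-based discriminator of this kind is with a Poisson model of read arrivals. The sketch below is an assumption made for illustration, not the formula used in the paper: it supposes that read starts fall on unique sequence at a uniform rate, that a collapsed two-copy repeat collects reads at twice that rate, and it returns the natural-log odds of the two hypotheses. The function name and the example read counts are hypothetical, and the text gives no threshold values, so none are shown.

```python
import math


def unique_log_odds(unitig_len: int, n_reads: int,
                    total_reads: float, genome_size: float) -> float:
    """Log-odds that a unitig is single-copy rather than a collapsed 2-copy repeat.

    Assumed model: the number of reads k in a unitig of length L is
    Poisson with mean mu = L * total_reads / genome_size if the unitig is
    unique, and Poisson with mean 2 * mu if two copies were collapsed.
    Then ln P(k | mu) - ln P(k | 2 mu) = mu - k * ln 2.
    """
    expected = unitig_len * total_reads / genome_size
    return expected - n_reads * math.log(2)


# A 10-kbp unitig with roughly the expected number of reads scores positive;
# one with about twice as many reads scores negative.
typical = unique_log_odds(10_000, 94, 27.27e6, 2.9e9)
too_deep = unique_log_odds(10_000, 190, 27.27e6, 2.9e9)
print(f"{typical:.1f} {too_deep:.1f}")
```

Under this model the score grows with unitig length, so long unitigs at normal depth are classified confidently while short ones remain ambiguous, which is consistent with the text's observation that U-unitigs cover mainly the longer stretches of unique DNA.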

The result of running the Unitigger was thus a set of correctly assembled subcontigs covering an estimated 73.6% of the human genome. The Scaffolder then proceeded to use mate-pair information to link these together into scaffolds. When there are two or more mate pairs that imply that a given pair of U-unitigs are at a certain distance and orientation with respect to each other, the probability of this being wrong is again roughly 1 in 10^10, assuming that mate pairs are false less than 2% of the time. Thus, one can with high confidence link together all U-unitigs that are linked by at least two 2- or 10-kbp mate pairs producing intermediate-sized scaffolds that are then recursively linked together by confirming 50-kbp mate pairs and BAC end sequences. This process yielded scaffolds that are on the order of megabase pairs in size with gaps between their contigs that generally correspond to repetitive elements and occasionally to small sequencing gaps. These scaffolds reconstruct the majority of the unique sequence within a genome.

For the Drosophila assembly, we engaged in a three-stage repeat resolution strategy where each stage was progressively more aggressive and thus more likely to make a mistake. For the human assembly, we continued to use the first "Rocks" substage where all unitigs with a good, but not definitive, discriminator score are placed in a scaffold gap. This was done with the condition that two or more mate pairs with one of their reads already in the scaffold unambiguously place the unitig in the given gap. We estimate the probability of inserting a unitig into an incorrect gap with this strategy to be less than 10^-7 based on a probabilistic analysis.

We revised the ensuing "Stones" substage of the human assembly, making it more like the mechanism suggested in our earlier work (43). For each gap, every read R that is placed in the gap by virtue of its mated pair M being in a contig of the scaffold and implying R's placement is collected. Celera's mate-pairing information is correct more than 99% of the time. Thus, almost every, but not all, of the reads in the set belong in the gap, and when a read does not belong it rarely agrees with the remainder of the reads. Therefore, we simply assemble this set of reads within the gap, eliminating any reads that conflict with the assembly. This operation proved much more reliable than the one it replaced for the Drosophila assembly; in the assembly of a simulated shotgun data set of human chromosome 22, all stones were placed correctly.

The final method of resolving gaps is to fill them with assembled BAC data that cover the gap. We call this external gap "walking." We did not include the very aggressive "Pebbles" substage described in our Drosophila work, which made enough mistakes so as to produce repeat reconstructions for long interspersed elements whose quality was only 99.62% correct. We decided that for the human genome it was philosophically better not to introduce a step that was certain to produce less than 99.99% accuracy. The cost was a somewhat larger number of gaps of somewhat larger size.

Fig. 4. Architecture of Celera's two-pronged assembly strategy. Each oval denotes a computation process performing the function indicated by its label, with the labels on arcs between ovals describing the nature of the objects produced and/or consumed by a process. This figure summarizes the discussion in the text that defines the terms and phrases used.

At the final stage of the assembly process, and also at several intermediate points, a consensus sequence of every contig is produced. Our algorithm is driven by the principle of maximum parsimony, with quality-value-weighted measures for evaluating each base. The net effect is a Bayesian estimate of the correct base to report at each position. Consensus generation uses Celera data whenever it is present. In the event that no Celera data cover a given region, the BAC data sequence is used.
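To illustrate what a quality-value-weighted, Bayesian base call can look like, the sketch below scores a single alignment column. It is a generic textbook-style model, not the consensus algorithm of the paper: it assumes Phred-scaled quality values, independent errors, a uniform prior over the four bases, and errors spread evenly over the three wrong bases. The function name `call_base` and the example columns are hypothetical.

```python
import math


def call_base(column):
    """Pick the most probable base for one alignment column.

    column: list of (base, phred_quality) pairs, one per read covering
    the position. Returns (best_base, posterior_probability).
    """
    log_like = {}
    for candidate in "ACGT":
        total = 0.0
        for base, q in column:
            # Phred quality -> error probability; capped at 0.75 so that a
            # zero-quality call is uninformative instead of log(0).
            err = min(10 ** (-q / 10), 0.75)
            total += math.log(1 - err) if base == candidate else math.log(err / 3)
        log_like[candidate] = total
    peak = max(log_like.values())
    weights = {b: math.exp(v - peak) for b, v in log_like.items()}
    best = max(weights, key=weights.get)
    return best, weights[best] / sum(weights.values())


# Agreement among good reads gives a confident call.
print(call_base([("A", 30), ("A", 20), ("C", 10)]))
# One high-quality read can outvote two low-quality reads.
print(call_base([("A", 5), ("A", 5), ("C", 40)]))
```

The second example shows why quality weighting matters: a simple majority vote would report A, whereas the weighted estimate favors the single high-quality C.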

A key element of achieving a WGA of the human genome was to parallelize the Overlapper and the central consensus sequence-constructing subroutines. In addition, memory was a real issue: a straightforward application of the software we had built for Drosophila would have required a computer with a 600-gigabyte RAM. By making the Overlapper and Unitigger incremental, we were able to achieve the same computation with a maximum of instantaneous usage of 28 gigabytes of RAM. Moreover, the incremental nature of the first three stages allowed us to continually update the state of this part of the computation as data were delivered and then perform a 7-day run to complete Scaffolding and Repeat Resolution whenever desired. For our assembly operations, the total compute infrastructure consists of 10 four-processor SMPs with 4 gigabytes of memory per cluster (Compaq's ES40, Regatta) and a 16-processor NUMA machine with 64 gigabytes of memory (Compaq's GS160, Wildfire). The total compute for a run of the assembler was roughly 20,000 CPU hours.

The assembly of Celera's data, together with the shredded bactig data, produced a set of scaffolds totaling 2.848 Gbp in span and consisting of 2.586 Gbp of sequence. The chaff, or set of reads not incorporated in the assembly, numbered 11.27 million (26%), which is consistent with our experience for Drosophila. More than 84% of the genome was covered by scaffolds >100 kbp long, and these averaged 91% sequence and 9% gaps with a total of 2.297 Gbp of sequence. There were a total of 93,857 gaps among the 1637 scaffolds >100 kbp. The average scaffold size was 1.5 Mbp, the average contig size was 24.06 kbp, and the average gap size was 2.43 kbp, where the distribution of each was essentially exponential. More than 50% of all gaps were less than 500 bp long, >62% of all gaps were less than 1 kbp long, and no gap was >100 kbp long. Similarly, more than 65% of the sequence is in contigs >30 kbp, more than 31% is in contigs >100 kbp, and the largest contig was 1.22 Mbp long. Table 3 gives detailed summary statistics for the structure of this assembly with a direct comparison to the compartmentalized shotgun assembly.
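The summary figures in the preceding paragraph can be cross-checked against the raw counts reported in Table 3 for WGA scaffolds >100 kbp. The sketch below only rearranges those published counts; the variable names are illustrative.

```python
# Whole-genome assembly, scaffolds > 100 kbp (counts as reported in Table 3).
span_bp = 2_525_334_447      # bp in scaffolds, including intrascaffold gaps
contig_bp = 2_297_678_935    # bp in contigs
n_scaffolds = 1_637
n_contigs = 95_494
n_gaps = 93_857

# Each scaffold with c contigs has c - 1 internal gaps.
assert n_gaps == n_contigs - n_scaffolds

sequence_fraction = contig_bp / span_bp          # share of span that is sequence
avg_scaffold = span_bp / n_scaffolds             # bp
avg_contig = contig_bp / n_contigs               # bp
avg_gap = (span_bp - contig_bp) / n_gaps         # bp

print(f"sequence {sequence_fraction:.1%}, gaps {1 - sequence_fraction:.1%}")
print(f"avg scaffold {avg_scaffold:,.0f} bp, contig {avg_contig:,.0f} bp, gap {avg_gap:,.0f} bp")
```

The gap count equals the contig count minus the scaffold count, which is a quick sanity check that the table counts only gaps internal to scaffolds.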

2.4 Compartmentalized shotgun assembly

In addition to the WGA approach, we pursued a localized assembly approach that was intended to subdivide the genome into segments, each of which could be shotgun assembled individually. We expected that this would help in resolution of large interchromosomal duplications and improve the statistics for calculating U-unitigs. The compartmentalized assembly process involved clustering Celera reads and bactigs into large, multiple megabase regions of the genome, and then running the WGA assembler on the Celera data and shredded, faux reads obtained from the bactig data.

The first phase of the CSA strategy was to separate Celera reads into those that matched the BAC contigs for a particular PFP BAC entry, and those that did not match any public data. Such matches must be guaranteed to

Table 3. Scaffold statistics for whole-genome and compartmentalized shotgun assemblies. Columns give values by scaffold size class.

Compartmentalized shotgun assembly

| Statistic | All | >30 kbp | >100 kbp | >500 kbp | >1000 kbp |
|---|---|---|---|---|---|
| No. of bp in scaffolds (including intrascaffold gaps) | 2,905,568,203 | 2,748,892,430 | 2,700,489,906 | 2,489,357,260 | 2,248,689,128 |
| No. of bp in contigs | 2,653,979,733 | 2,524,251,302 | 2,491,538,372 | 2,320,648,201 | 2,106,521,902 |
| No. of scaffolds | 53,591 | 2,845 | 1,935 | 1,060 | 721 |
| No. of contigs | 170,033 | 112,207 | 107,199 | 93,138 | 82,009 |
| No. of gaps | 116,442 | 109,362 | 105,264 | 92,078 | 81,288 |
| No. of gaps ≤1 kbp | 72,091 | 69,175 | 67,289 | 59,915 | 53,354 |
| Average scaffold size (bp) | 54,217 | 966,219 | 1,395,602 | 2,348,450 | 3,118,848 |
| Average contig size (bp) | 15,609 | 22,496 | 23,242 | 24,916 | 25,686 |
| Average intrascaffold gap size (bp) | 2,161 | 2,054 | 1,985 | 1,832 | 1,749 |
| Largest contig (bp) | 1,988,321 | 1,988,321 | 1,988,321 | 1,988,321 | 1,988,321 |
| % of total contigs | 100 | 95 | 94 | 87 | 79 |

Whole-genome assembly

| Statistic | All | >30 kbp | >100 kbp | >500 kbp | >1000 kbp |
|---|---|---|---|---|---|
| No. of bp in scaffolds (including intrascaffold gaps) | 2,847,890,390 | 2,574,792,618 | 2,525,334,447 | 2,328,535,466 | 2,140,943,032 |
| No. of bp in contigs | 2,586,634,108 | 2,334,343,339 | 2,297,678,935 | 2,143,002,184 | 1,983,305,432 |
| No. of scaffolds | 118,968 | 2,507 | 1,637 | 818 | 554 |
| No. of contigs | 221,036 | 99,189 | 95,494 | 84,641 | 76,285 |
| No. of gaps | 102,068 | 96,682 | 93,857 | 83,823 | 75,731 |
| No. of gaps ≤1 kbp | 62,356 | 60,343 | 59,156 | 54,079 | 49,592 |
| Average scaffold size (bp) | 23,938 | 1,027,041 | 1,542,660 | 2,846,620 | 3,864,518 |
| Average contig size (bp) | 11,702 | 23,534 | 24,061 | 25,319 | 25,999 |
| Average intrascaffold gap size (bp) | 2,560 | 2,487 | 2,426 | 2,213 | 2,082 |
| Largest contig (bp) | 1,224,073 | 1,224,073 | 1,224,073 | 1,224,073 | 1,224,073 |
| % of total contigs | 100 | 90 | 89 | 83 | 77 |




properly place a Celera read, so all reads were first masked against a library of common repetitive elements, and only matches of at least 40 bp to unmasked portions of the read constituted a hit. Of Celera's 27.27 million reads, 20.76 million matched a bactig and another 0.62 million reads, which did not have any matches, were nonetheless identified as belonging in the region of the bactig's BAC because their mate matched the bactig. Of the remaining reads, 2.92 million were completely screened out and so could not be matched, but the other 2.97 million reads had unmasked sequence totaling 1.189 Gbp that were not found in the GenBank data set. Because the Celera data are 5.11× redundant, we estimate that 240 Mbp of unique Celera sequence is not in the GenBank data set.

In the next step of the CSA process, a combining assembler took the relevant 5× Celera reads and bactigs for a BAC entry, and produced an assembly of the combined data for that locale. These high-quality sequence reconstructions were a transient result whose utility was simply to provide more reliable information for the purposes of their tiling into sets of overlapping and adjacent scaffold sequences in the next step. In outline, the combining assembler first examines the set of matching Celera reads to determine if there are excessive pileups indicative of unscreened repetitive elements. Wherever these occur, reads in the repeat region whose mates have not been mapped to consistent positions are removed. Then all sets of mate pairs that consistently imply the same relative position of two bactigs are bundled into a link and weighted according to the number of mates in the bundle. A "greedy" strategy then attempts to order the bactigs by selecting bundles of mate-pairs in order of their weight. A selected mate-pair bundle can tie together two formative scaffolds. It is incorporated to form a single scaffold only if it is consistent with the majority of links between contigs of the scaffold. Once scaffolding is complete, gaps are filled by the "Stones" strategy described above for the WGA assembler.
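The bundling and greedy selection outlined above can be sketched as follows. This is a simplified illustration under stated assumptions, not the combining assembler: each mate pair is represented as a hypothetical tuple `(bactig_a, bactig_b, placement)` with a canonical bactig order and a hashable placement label, bundles are counted with `Counter`, and a union-find structure tracks which bactigs already share a scaffold. The paper's test that a bundle be consistent with the majority of links between the two scaffolds is reduced here to skipping any bundle whose bactigs are already joined.

```python
from collections import Counter


def greedy_scaffolds(mate_links):
    """Greedily join bactigs using mate-pair bundles, heaviest bundle first.

    mate_links: iterable of (bactig_a, bactig_b, placement) tuples, one
    per mate pair. Returns (accepted_bundles, scaffolds).
    """
    bundles = Counter(mate_links)   # bundle weight = number of mates in it
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    accepted = []
    for (a, b, placement), weight in sorted(bundles.items(), key=lambda kv: -kv[1]):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:                # bundle ties two scaffolds together
            parent[root_a] = root_b
            accepted.append((a, b, placement, weight))

    groups = {}
    for x in list(parent):
        groups.setdefault(find(x), set()).add(x)
    return accepted, sorted(sorted(g) for g in groups.values())


links = ([("b1", "b2", "fwd+2k")] * 3 + [("b2", "b3", "fwd+1k")] * 2
         + [("b1", "b2", "rev+9k")] + [("b4", "b5", "fwd+3k")])
accepted, scaffolds = greedy_scaffolds(links)
print(scaffolds)
```

Processing the heaviest bundles first means that a lone, possibly erroneous mate pair (the `rev+9k` link in the example) is considered only after the well-supported placement of the same bactigs has been fixed.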

The GenBank data for the Phase 1 and 2 BACs consisted of an average of 19.8 bactigs per BAC of average size 8099 bp. Application of the combining assembler resulted in individual Celera BAC assemblies being put together into an average of 1.83 scaffolds (median of 1 scaffold) consisting of an average of 8.57 contigs of average size 18,973 bp. In addition to defining order and orientation of the sequence fragments, there were 57% fewer gaps in the combined result. For Phase 0 data, the average GenBank entry consisted of 91.52 reads of average length 784 bp. Application of the combining assembler resulted in an average of 54.8 scaffolds consisting of an average of 58.1 contigs of average size 873 bp. Basically, some small amount of assembly took place, but not enough Celera data were matched to truly assemble the 0.5× to 1× data set represented by the typical Phase 0 BACs. The combining assembler was also applied to the Phase 3 BACs for SNP identification, confirmation of assembly, and localization of the Celera reads. The phase 0 data suggest that a combined whole-genome shotgun data set and 1× light-shotgun of BACs will not yield good assembly of BAC regions; at least 3× light-shotgun of each BAC is needed.

The 5.89 million Celera fragments not matching the GenBank data were assembled with our whole-genome assembler. The assembly resulted in a set of scaffolds totaling 442 Mbp in span and consisting of 326 Mbp of sequence. More than 20% of the scaffolds were >5 kbp long, and these averaged 63% sequence and 27% gaps with a total of 302 Mbp of sequence. All scaffolds >5 kbp were forwarded along with all scaffolds produced by the combining assembler to the subsequent tiling phase.

At this stage, we typically had one or two scaffolds for every BAC region constituting at least 95% of the relevant sequence, and a collection of disjoint Celera-unique scaffolds. The next step in developing the genome components was to determine the order and overlap tiling of these BAC and Celera-unique scaffolds across the genome. For this, we used Celera's 50-kbp mate-pairs information, and BAC-end pairs (18) and sequence tagged site (STS) markers (44) to provide long-range guidance and chromosome separation. Given the relatively manageable number of scaffolds, we chose not to produce this tiling in a fully automated manner, but to compute an initial tiling with a good heuristic and then use human curators to resolve discrepancies or missed join opportunities. To this end, we developed a graphical user interface that displayed the graph of tiling overlaps and the evidence for each. A human curator could then explore the implication of mapped STS data, dot-plots of sequence overlap, and a visual display of the mate-pair evidence supporting a given choice. The result of this process was a collection of "components," where each component was a tiled set of BAC and Celera-unique scaffolds that had been curator-approved. The process resulted in 3845 components with an estimated span of 2.922 Gbp.

In order to generate the final CSA, we assembled each component with the WGA algorithm. As was done in the WGA process, the bactig data were shredded into a synthetic 2× shotgun data set in order to give the assembler the freedom to independently assemble the data. By using faux reads rather than bactigs, the assembly algorithm could correct errors in the assembly of bactigs and remove chimeric content in a PFP data entry. Chimeric or contaminating sequence (from another part of the genome) would not be incorporated into the reassembly of the component because it did not belong there. In effect, the previous steps in the CSA process served only to bring together Celera fragments and PFP data relevant to a large contiguous segment of the genome, wherein we applied the assembler used for WGA to produce an ab initio assembly of the region.

WGA assembly of the components resulted in a set of scaffolds totaling 2.906 Gbp in span and consisting of 2.654 Gbp of sequence. The chaff, or set of reads not incorporated into the assembly, numbered 6.17 million, or 22%. More than 90.0% of the genome was covered by scaffolds spanning >100 kbp long, and these averaged 92.2% sequence and 7.8% gaps with a total of 2.492 Gbp of sequence. There were a total of 105,264 gaps among the 107,199 contigs that belong to the 1940 scaffolds spanning >100 kbp. The average scaffold size was 1.4 Mbp, the average contig size was 23.24 kbp, and the average gap size was 2.0 kbp where each distribution of sizes was exponential. As such, averages tend to be underrepresentative of the majority of the data. Figure 5 shows a histogram of the bases in scaffolds of various size ranges. Consider also that more than 49% of all gaps were <500 bp long, more than 62% of all gaps were <1 kbp, and all gaps are <100 kbp long. Similarly, more than 73% of the sequence is in contigs >30 kbp, more than 49% is in contigs >100 kbp, and the largest contig was 1.99 Mbp long. Table 3 provides summary statistics for the structure of this assembly with a direct comparison to the WGA assembly.

2.5 Comparison of the WGA and CSA scaffolds

Having obtained two assemblies of the human genome via independent computational processes (WGA and CSA), we compared scaffolds from the two assemblies as another means of investigating their completeness, consistency, and contiguity. From each assembly, a set of reference scaffolds containing at least 1000 fragments (Celera sequencing reads or bactig shreds) was obtained; this amounted to 2218 WGA scaffolds and 1717 CSA scaffolds, for a total of 2.087 Gbp and 2.474 Gbp, respectively. The sequence of each reference scaffold was compared to the sequence of all scaffolds from the other assembly with which it shared at least 20 fragments or at least 20% of the fragments of the smaller scaffold. For each such comparison, all matches of at least 200 bp with at most 2% mismatch were tabulated.
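The selection rules in this paragraph (reference scaffolds of at least 1000 fragments, compared against scaffolds sharing at least 20 fragments or at least 20% of the smaller scaffold's fragments) translate directly into set operations. In the sketch below each scaffold is represented as a set of fragment identifiers; that representation and the function names are assumptions for illustration.

```python
def reference_scaffolds(scaffolds: dict, min_frags: int = 1000) -> dict:
    """Keep scaffolds containing at least `min_frags` fragments."""
    return {name: frags for name, frags in scaffolds.items() if len(frags) >= min_frags}


def should_compare(ref_frags: set, other_frags: set,
                   min_shared: int = 20, min_frac: float = 0.20) -> bool:
    """True if the two scaffolds share enough fragments to be compared.

    The pair qualifies when it shares at least `min_shared` fragments, or
    at least `min_frac` of the fragments of the smaller scaffold.
    """
    shared = len(ref_frags & other_frags)
    smaller = min(len(ref_frags), len(other_frags))
    return shared >= min_shared or (smaller > 0 and shared >= min_frac * smaller)


ref = set(range(1500))                           # a reference scaffold
few_shared = set(range(1490, 1600))              # shares 10 of its 110 fragments
enough_shared = set(range(1480, 1600))           # shares 20 fragments
small = set(range(8)) | set(range(9000, 9012))   # shares 8 of its 20 fragments
print(should_compare(ref, few_shared),
      should_compare(ref, enough_shared),
      should_compare(ref, small))
```

The fractional rule lets a small scaffold qualify even when it cannot reach 20 shared fragments in absolute terms.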

From this tabulation, we estimated the amount of unique sequence in each assembly in two ways. The first was to determine the number of bases of each assembly that were




not covered by a matching segment in the other assembly. Some 82.5 Mbp of the WGA (3.95%) was not covered by the CSA, whereas 204.5 Mbp (8.26%) of the CSA was not covered by the WGA. This estimate did not require any consistency of the assemblies or any uniqueness of the matching segments. Thus, another analysis was conducted in which matches of less than 1 kbp between a pair of scaffolds were excluded unless they were confirmed by other matches having a consistent order and orientation. This gives some measure of consistent coverage: 1.982 Gbp (95.00%) of the WGA is covered by the CSA, and 2.169 Gbp (87.69%) of the CSA is covered by the WGA by this more stringent measure.

The comparison of WGA to CSA also permitted evaluation of scaffolds for structural inconsistencies. We looked for instances in which a large section of a scaffold from one assembly matched only one scaffold from the other assembly, but failed to match over the full length of the overlap implied by the matching segments. An initial set of candidates was identified automatically, and then each candidate was inspected by hand. From this process, we identified 31 instances in which the assemblies appear to disagree in a nonlocal fashion. These cases are being further evaluated to determine which assembly is in error and why.

In addition, we evaluated local inconsistencies of order or orientation. The following results exclude cases in which one contig in one assembly corresponds to more than one overlapping contig in the other assembly (as long as the order and orientation of the latter agree with the positions they match in the former). Most of these small rearrangements involved segments on the order of hundreds of base pairs and rarely >1 kbp. We found a total of 295 kbp (0.012%) in the CSA assemblies that were locally inconsistent with the WGA assemblies, whereas 2.108 Mbp (0.11%) in the WGA assembly were inconsistent with the CSA assembly.

The CSA assembly was a few percentage points better in terms of coverage and slightly more consistent than the WGA, because it was in effect performing a few thousand shotgun assemblies of megabase-sized problems, whereas the WGA is performing a shotgun assembly of a gigabase-sized problem. When one considers the increase of two-and-a-half orders of magnitude in problem size, the information loss between the two is remarkably small. Because CSA was logistically easier to deliver and the better of the two results available at the time when downstream analyses needed to be begun, all subsequent analysis was performed on this assembly.

2.6 Mapping scaffolds to the genome

The final step in assembling the genome was to order and orient the scaffolds on the chromosomes. We first grouped scaffolds together on the basis of their order in the components from CSA. These grouped scaffolds were reordered by examining residual mate-pairing data between the scaffolds. We next mapped the scaffold groups onto the chromosome using physical mapping data. This step depends on having reliable high-resolution map information such that each scaffold will overlap multiple markers. There are two genome-wide types of map information available: high-density STS maps and fingerprint maps of BAC clones developed at Washington University (45). Among the genome-wide STS maps, GeneMap99 (GM99) has the most markers and therefore was most useful for mapping scaffolds. The two different mapping approaches are complementary to one another. The fingerprint maps should have better local order because they were built by comparison of overlapping BAC clones. On the other hand, GM99 should have a more reliable long-range order, because the framework markers were derived from well-validated genetic maps. Both types of maps were used as a reference for human curation of the components that were the input to the regional assembly, but they did not determine the order of sequences produced by the assembler.

In order to determine the effectiveness of the fingerprint maps and GM99 for mapping scaffolds, we first examined the reliability of these maps by comparison with large scaffolds. Only 1% of the STS markers on the 10 largest scaffolds (those >9 Mbp) were mapped on a different chromosome on GM99. Two percent of the STS markers disagreed in position by more than five framework bins. However, for the fingerprint maps, a 2% chromosome discrepancy was observed, and on average 23.8% of BAC locations in the scaffold sequence disagreed with fingerprint map placement by more than five BACs. When further examining the source of discrepancy, it was found that most of the discrepancy came from 4 of the 10 scaffolds, indicating that there is variation in the quality of either the map or the scaffolds. All four scaffolds were assembled as well as the other six, as judged by clone coverage analysis, and showed the same low discrepancy rate to GM99, and thus we concluded that the fingerprint map global order in these cases was not reliable. Smaller scaffolds had a higher discordance rate with GM99 (4.21% of STSs were discordant by more than five framework bins), but a lower discordance rate with the fingerprint maps (11% of BACs disagreed with fingerprint maps by more than five BACs). This observation agrees with the clone coverage analysis (46) that Celera scaffold construction was better supported by long-range mate pairs in larger scaffolds than in small scaffolds.

Fig. 5. Distribution of scaffold sizes of the CSA. For each range of scaffold sizes, the percent of total sequence is indicated.

We created two orderings of Celera scaffolds on the basis of the markers (BAC or STS) on these maps. Where the order of scaffolds agreed between GM99 and the WashU BAC map, we had a high degree of confidence that that order was correct; these scaffolds were termed "anchor scaffolds." Only scaffolds with a low overall discrepancy rate with both maps were considered anchor scaffolds. Scaffolds in GM99 bins were allowed to permute in their order to match WashU ordering, provided they did not violate their framework orders. Orientation of individual scaffolds was determined by the presence of multiple mapped markers with consistent order. Scaffolds with only one marker have insufficient information to assign orientation. We found 70.1% of the genome in anchored scaffolds, more than 99% of which are also oriented (Table 4). Because GM99 is of lower resolution than the WashU map, a number of scaffolds without STS matches could be ordered relative to the anchored scaffolds because they included sequence from the same or adjacent BACs on the WashU map. On the other hand, because of occasional WashU global ordering discrepancies, a number of scaffolds determined to be unmappable on the WashU map could be ordered relative to the anchored scaffolds with GM99. These scaffolds were termed "ordered scaffolds." We found that 13.9% of the assembly could be ordered by these additional methods, and thus 84.0% of the genome was ordered unambiguously.
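The orientation rule stated above (multiple mapped markers with consistent order; a single marker is insufficient) can be sketched in Python as follows. The text does not specify how "consistent order" was tested; the strict monotonicity check used here, the function name, and the input layout are assumptions made only for illustration.

```python
from typing import List, Tuple


def infer_orientation(markers: List[Tuple[int, int]]) -> str:
    """Infer scaffold orientation from (scaffold_position, map_position) pairs.

    Returns "+" if map positions rise along the scaffold, "-" if they fall,
    and "unoriented" if there are fewer than two markers, if all markers fall
    in the same map position, or if their order is inconsistent.
    """
    if len(markers) < 2:
        return "unoriented"  # a single marker carries no direction
    ordered = sorted(markers)  # walk along the scaffold
    map_positions = [m for _, m in ordered]
    steps = [b - a for a, b in zip(map_positions, map_positions[1:])]
    if all(d >= 0 for d in steps) and any(d > 0 for d in steps):
        return "+"
    if all(d <= 0 for d in steps) and any(d < 0 for d in steps):
        return "-"
    return "unoriented"


forward = infer_orientation([(100, 5), (2000, 9), (5000, 12)])
reverse = infer_orientation([(100, 12), (2000, 9)])
single = infer_orientation([(100, 5)])
conflict = infer_orientation([(100, 5), (200, 9), (300, 7)])
```

A production procedure would presumably tolerate a small number of discordant markers rather than demand strict monotonicity; that refinement is omitted here.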

Next, all scaffolds that could be placed, but not ordered, between anchors were assigned to the interval between the anchored scaffolds and were deemed to be "bounded" between them. For example, small scaffolds having STS hits from the same GeneMap bin or hitting the same BAC cannot be ordered relative to each other, but can be assigned a placement boundary relative to other anchored or ordered scaffolds. The remaining scaffolds either had no localization information, conflicting information, or could only be assigned to a generic chromosome location. Using the above approaches, ~98% of the genome was anchored, ordered, or bounded.

Finally, we assigned a location for each scaffold placed on the chromosome by spreading out the scaffolds per chromosome. We assumed that the remaining unmapped scaffolds, constituting 2% of the genome, were distributed evenly across the genome. By dividing the sum of unmapped scaffold lengths by the total number of mapped scaffolds, we arrived at an estimate of interscaffold gap of 1483 bp. This gap was used to separate all the scaffolds on each chromosome and to assign an offset in the chromosome.

During the scaffold-mapping effort, we encountered many problems that resulted in additional quality assessment and validation analysis. At least 978 (3% of 33,173) BACs were believed to have sequence data from more than one location in the genome (47). This is consistent with the bactig chimerism analysis reported above in the Assembly Strategies section. These BACs could not be assigned to unique positions within the CSA assembly and thus could not be used for ordering scaffolds. Likewise, it was not always possible to assign STSs to unique locations in the assembly because of genome duplications, repetitive elements, and pseudogenes.

Because of the time required for an exhaustive search for a perfect overlap, CSA generated 21,607 intrascaffold gaps where the mate-pair data suggested that the contigs should overlap, but no overlap was found. These gaps were defined as a fixed 50 bp in length and make up 18.6% of the total 116,442 gaps in the CSA assembly.

We chose not to use the order of exons implied in cDNA or EST data as a way of ordering scaffolds. The rationale for not using this data was that doing so would have biased certain regions of the assembly by rearranging scaffolds to fit the transcript data and made validation of both the assembly and gene definition processes more difficult.

2.7 Assembly and validation analysis

We analyzed the assembly of the genome from the perspectives of completeness (amount of coverage of the genome) and correctness (the structural accuracy of the order and orientation and the consensus sequence of the assembly).

Completeness. Completeness is defined as the percentage of the euchromatic sequence represented in the assembly. This cannot be known with absolute certainty until the euchromatin sequence has been completed. However, it is possible to estimate completeness on the basis of (i) the estimated sizes of intrascaffold gaps; (ii) coverage of the two published chromosomes, 21 and 22 (48, 49); and (iii) analysis of the percentage of an independent set of random sequences (STS markers) contained in the assembly. The whole-genome libraries contain heterochromatic sequence and, although no attempt has been made to assemble it, there may be instances of unique sequence embedded in regions of heterochromatin as were observed in Drosophila (50, 51).

The sequences of human chromosomes 21 and 22 have been completed to high quality and published (48, 49). Although this sequence served as input to the assembler, the finished sequence was shredded into a shotgun data set so that the assembler had the opportunity to assemble it differently from the original sequence in the case of structural polymorphisms or assembly errors in the BAC data. In particular, the assembler must be able to resolve repetitive elements at the scale of components (generally multimegabase in size), and so this comparison reveals the level to which the assembler resolves repeats. In certain areas, the assembly structure differs from the published versions of chromosomes 21 and 22 (see below). The flexibility to assemble finished sequence differently on the basis of Celera data resulted in an assembly with more segments than the chromosome 21 and 22 sequences. We examined the reasons why there are more gaps in the Celera sequence than in chromosomes 21 and 22 and expect that they may be typical of gaps in other regions of the genome. In the Celera assembly, there are 25 scaffolds, each containing at least 10 kb of sequence, that collectively span 94.3% of chromosome 21. Sixty-two scaffolds span 95.7% of chromosome 22. The total length of the gaps remaining in the Celera assembly for these two chromosomes is 3.4 Mbp. These gap sequences were analyzed by RepeatMasker and by searching against the entire genome assembly (52). About 50% of the gap sequence consisted of common repetitive elements identified by RepeatMasker; more than half of the remainder was lower copy number repeat elements.

A more global way of assessing completeness is to measure the content of an independent set of sequence data in the assembly. We compared 48,938 STS markers from GeneMap99 (51) to the scaffolds. Because these markers were not used in the assembly processes, they provided a truly independent measure of completeness. ePCR (53) and BLAST (54) were used to locate STSs on the assembled genome. We found 44,524 (91%) of the STSs in the mapped genome. An additional 2648 markers (5.4%) were found by searching the unassembled data or "chaff." We identified 1283 STS markers (2.6%) not found in either Celera sequence or BAC data as of September 2000, raising the possibility that these markers may not be of human origin. If that were the case, the Celera assembled sequence would represent 93.4% of the human genome and the unassembled data 5.5%, for a total of 98.9% coverage. Similarly, we compared CSA against 36,678 TNG radiation hybrid markers (55a) using the same method. We found that 32,371 markers (88%) were located in the mapped CSA scaffolds, with 2055 markers (5.6%) found in the remainder. This gave a 94% coverage of the genome through another genome-wide survey.
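The marker-based percentages quoted above follow directly from the reported counts, and a short Python check makes the arithmetic explicit. The counts are those given in the text; the variable and function names are ours, and the recomputed values agree with the published figures to within rounding in the last digit.

```python
def pct(part: int, whole: int) -> float:
    """Percentage of whole represented by part."""
    return 100.0 * part / whole


# GeneMap99 STS markers (counts as reported in the text).
sts_total = 48_938
sts_in_mapped = 44_524      # found in the mapped genome
sts_in_chaff = 2_648        # found only in unassembled data ("chaff")
sts_not_found = 1_283       # absent from both Celera and BAC data

sts_mapped_pct = pct(sts_in_mapped, sts_total)
sts_chaff_pct = pct(sts_in_chaff, sts_total)
sts_missing_pct = pct(sts_not_found, sts_total)

# If the markers found nowhere are not of human origin, drop them
# from the denominator before recomputing coverage.
sts_adjusted_total = sts_total - sts_not_found
assembled_pct = pct(sts_in_mapped, sts_adjusted_total)
unassembled_pct = pct(sts_in_chaff, sts_adjusted_total)

# TNG radiation hybrid markers (counts as reported in the text).
tng_total = 36_678
tng_mapped_pct = pct(32_371, tng_total)
tng_rest_pct = pct(2_055, tng_total)
tng_coverage_pct = tng_mapped_pct + tng_rest_pct

print(f"STS in mapped genome: {sts_mapped_pct:.1f}%")
print(f"Assembled, adjusted:  {assembled_pct:.1f}%")
print(f"TNG total coverage:   {tng_coverage_pct:.1f}%")
```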

Correctness. Correctness is defined as the structural and sequence accuracy of the assembly. Because the source sequences for the Celera data and the GenBank data are from different individuals, we could not directly compare the consensus sequence of the assembly against other finished sequence for determining sequencing accuracy at the nucleotide level, although this has been done for identifying polymorphisms as described in Section 6. The accuracy of the consensus sequence is at least 99.96% on the basis of a statistical estimate derived from the quality values of the underlying reads.

Table 4. Summary of scaffold mapping. Scaffolds were mapped to the genome with different levels of confidence (anchored scaffolds have the highest confidence; unmapped scaffolds have the lowest). Anchored scaffolds were consistently ordered by the WashU BAC map and GM99. Ordered scaffolds were consistently ordered by at least one of the following: the WashU BAC map, GM99, or component tiling path. Bounded scaffolds had order conflicts between at least two of the external maps, but their placements were adjacent to a neighboring anchored or ordered scaffold. Unmapped scaffolds had, at most, a chromosome assignment. The scaffold subcategories are given below each category.

Mapped scaffold category    Number    Length (bp)      % Total length
Anchored                    1,526     1,860,676,676    70
  Oriented                  1,246     1,852,088,645    70
  Unoriented                280       8,588,031        0.3
Ordered                     2,001     369,235,857      14
  Oriented                  839       329,633,166      12
  Unoriented                1,162     39,602,691       2
Bounded                     38,241    368,753,463      14
  Oriented                  7,453     274,536,424      10
  Unoriented                30,788    94,217,039       4
Unmapped                    11,823    55,313,737       2
  Known chromosome          281       2,505,844        0.1
  Unknown chromosome        11,542    52,807,893       2

The structural consistency of the assembly can be measured by mate-pair analysis. In a correct assembly, every mated pair of sequencing reads should be located on the consensus sequence with the correct separation and orientation between the pairs. A pair is termed "valid" when the reads are in the correct orientation and the distance between them is within the mean ± 3 standard deviations of the distribution of insert sizes of the library from which the pair was sampled. A pair is termed "misoriented" when the reads are not correctly oriented, and is termed "misseparated" when the distance between the reads is not in the correct range but the reads are correctly oriented. The mean ± the standard deviation of each library used by the assembler was determined as described above. To validate these, we examined all reads mapped to the finished sequence of chromosome 21 (48) and determined how many incorrect mate pairs there were as a result of laboratory tracking errors and chimerism (two different segments of the genome cloned into the same plasmid), and how tight the distribution of insert sizes was for those that were correct (Table 5). The standard deviations for all Celera libraries were quite small, less than 15% of the insert length, with the exception of a few 50-kbp libraries. The 2- and 10-kbp libraries contained less than 2% invalid mate pairs, whereas the 50-kbp libraries were somewhat higher (~10%). Thus, although the mate-pair information was not perfect, its accuracy was such that measuring valid, misoriented, and misseparated pairs with respect to a given assembly was deemed to be a reliable instrument for validation purposes, especially when several mate pairs confirm or deny an ordering.
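The three mate-pair categories defined above can be expressed as a small Python sketch. It assumes the conventional arrangement for plasmid end reads, in which the leftmost read lies on the forward strand and the rightmost on the reverse strand, and it measures separation from the start of the left read to the end of the right read; both conventions, and all names, are assumptions for illustration. The library parameters in the example are those of library 1 on chromosome 21 in Table 5 (mean 2,081 bp, SD 106 bp).

```python
from dataclasses import dataclass
from typing import Tuple

Read = Tuple[int, int, str]  # (start, end, strand) on the consensus; strand is "+" or "-"


@dataclass
class Library:
    mean: float  # mean insert size (bp)
    sd: float    # standard deviation of insert size (bp)


def classify_mate_pair(read_a: Read, read_b: Read, lib: Library,
                       n_sd: float = 3.0) -> str:
    """Classify a mate pair as "valid", "misoriented", or "misseparated"."""
    left, right = sorted([read_a, read_b], key=lambda r: r[0])
    # Assumed convention: mates face each other (left on "+", right on "-").
    if not (left[2] == "+" and right[2] == "-"):
        return "misoriented"
    separation = right[1] - left[0]
    low = lib.mean - n_sd * lib.sd
    high = lib.mean + n_sd * lib.sd
    return "valid" if low <= separation <= high else "misseparated"


lib_2kbp = Library(mean=2081, sd=106)  # accepted range: 1763 to 2399 bp
ok_pair = classify_mate_pair((1000, 1500, "+"), (2600, 3100, "-"), lib_2kbp)
far_pair = classify_mate_pair((1000, 1500, "+"), (5000, 5500, "-"), lib_2kbp)
flipped_pair = classify_mate_pair((1000, 1500, "+"), (2600, 3100, "+"), lib_2kbp)
```

Orientation is tested first, so a pair that is both wrongly oriented and wrongly spaced is reported as misoriented, in keeping with the definition of misseparated pairs as correctly oriented.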

The clone coverage of the genome was 39×, meaning that any given base pair was, on average, contained in 39 clones or, equivalently, spanned by 39 mate-paired reads. Areas of low clone coverage or areas with a high proportion of invalid mate pairs would indicate potential assembly problems. We computed the coverage of each base in the assembly by valid mate pairs (Table 6). In summary, for scaffolds >30 kbp in length, less than 1% of the Celera assembly was in regions of less than 3× clone coverage. Thus, more than 99% of the assembly, including order and orientation, is strongly supported by this measure alone.
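Per-base clone coverage of the kind described above can be computed with a difference array, as in the following Python sketch. The input representation (half-open spans of clones whose mate pairs are valid) and the function names are assumptions for illustration; only the 3× threshold is taken from the text.

```python
from typing import List, Tuple


def clone_coverage(length: int, clone_spans: List[Tuple[int, int]]) -> List[int]:
    """Number of clones spanning each base, given half-open clone spans."""
    diff = [0] * (length + 1)
    for start, end in clone_spans:
        diff[start] += 1   # coverage rises where a clone begins
        diff[end] -= 1     # and falls just past where it ends
    coverage = []
    depth = 0
    for i in range(length):
        depth += diff[i]
        coverage.append(depth)
    return coverage


def fraction_below(coverage: List[int], threshold: int = 3) -> float:
    """Fraction of bases with clone coverage below the threshold."""
    return sum(1 for c in coverage if c < threshold) / len(coverage)


toy_cov = clone_coverage(10, [(0, 6), (2, 8), (4, 10), (0, 10)])
toy_low = fraction_below(toy_cov, threshold=3)
```

The difference array keeps the cost linear in the sequence length plus the number of clones, which matters at genome scale; a real implementation would likely use a compact array type rather than Python lists.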

Table 5. Mate-pair validation. Celera fragment sequences were mapped to the published sequence of chromosome 21. Each mate pair uniquely mapped was evaluated for correct orientation and placement (number of mate pairs tested). If the two mates had incorrect relative orientation or placement, they were considered invalid (number of invalid mate pairs).

Library  Lib.  Chromosome 21                                                           Genome
type     no.   Mean insert  SD (bp)  SD/mean  No. of mate   No. of invalid  % invalid  Mean insert  SD (bp)  SD/mean
               size (bp)             (%)      pairs tested  mate pairs                 size (bp)             (%)
2 kbp    1     2,081        106      5.1      3,642         38              1.0        2,082        90       4.3
         2     1,913        152      7.9      28,029        413             1.5        1,923        118      6.1
         3     2,166        175      8.1      4,405         57              1.3        2,162        158      7.3
10 kbp   4     11,385       851      7.5      4,319         80              1.9        11,370       696      6.1
         5     14,523       1,875    12.9     7,355         156             2.1        14,142       1,402    9.9
         6     9,635        1,035    10.7     5,573         109             2.0        9,606        934      9.7
         7     10,223       928      9.1      34,079        399             1.2        10,190       777      7.6
50 kbp   8     64,888       2,747    4.2      16            1               6.3        65,500       5,504    8.4
         9     53,410       5,834    10.9     914           170             18.6       53,311       5,546    10.4
         10    52,034       7,312    14.1     5,871         569             9.7        51,498       6,588    12.8
         11    52,282       7,454    14.3     2,629         213             8.1        52,282       7,454    14.3
         12    46,616       7,378    15.8     2,153         215             10.0       45,418       9,068    20.0
         13    55,788       10,099   18.1     2,244         249             11.1       53,062       10,893   20.5
         14    39,894       5,019    12.6     199           7               3.5        36,838       9,988    27.1
BES      15    48,931       9,813    20.1     144           10              6.9        47,845       4,774    10.0
         16    48,130       4,232    8.8      195           14              7.2        47,924       4,581    9.6
         17    106,027      27,778   26.2     330           16              4.8        152,000      26,600   17.5
         18    160,575      54,973   34.2     155           8               5.2        161,750      27,000   16.7
         19    164,155      19,453   11.9     642           44              6.9        176,500      19,500   11.05
Sum                                           102,894       2,768           2.7 (mean = 2.7)

We examined the locations and number of all misoriented and misseparated mates. In addition to doing this analysis on the CSA assembly (as of 1 October 2000), we also performed a study of the PFP assembly as of 5 September 2000 (30, 55b). In this latter case, Celera mate pairs had to be mapped to the PFP assembly. To avoid mapping errors due to high-fidelity repeats, the only pairs mapped were those for which both reads matched at only one location with less than 6% differences. A threshold was set such that sets of five or more simultaneously invalid mate pairs indicated a potential breakpoint, where the construction of the two assemblies differed. The graphic comparison of the CSA chromosome 21 assembly with the published sequence (Fig. 6A) serves as a validation of this methodology. Blue tick marks in the panels indicate breakpoints. There were a similar (small) number of breakpoints on both chromosome sequences. The exception was 12 sets of scaffolds in the Celera assembly (a total of 3% of the chromosome length in 212 single-contig scaffolds) that were mapped to the wrong positions because they were too small to be mapped reliably. Figures 6 and 7 and Table 6 illustrate the mate-pair differences and breakpoints between the two assemblies. There was a higher percentage of misoriented and misseparated mate pairs in the large-insert libraries (50 kbp and BAC ends) than in the small-insert libraries in both assemblies (Table 6). The large-insert libraries are more likely to identify discrepancies simply because they span a larger segment of the genome. The graphic comparison between the two assemblies for chromosome 8 (Fig. 6, B and C) shows that there are many more breakpoints for the PFP assembly than for the Celera assembly. Figure 7 shows the breakpoint map (blue tick marks) for both assemblies of each chromosome in a side-by-side fashion. The order and orientation of Celera's assembly shows substantially fewer breakpoints except on the two finished chromosomes. Figure 7 also depicts large gaps (>10 kbp) in both assemblies as red tick marks. In the CSA assembly, the sizes of all gaps have been estimated on the basis of the mate-pair data. Breakpoints can be caused by structural polymorphisms, because the two assemblies were derived from different human genomes. They also reflect the unfinished nature of both genome assemblies.
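The breakpoint criterion (sets of five or more simultaneously invalid mate pairs) can be illustrated with a deliberately simple Python sketch. The text does not describe how invalid pairs were grouped into sets; here we assume, purely for illustration, that invalid pairs are represented by a single coordinate each and are grouped when consecutive coordinates lie within a fixed window. The window size, the function name, and the toy coordinates are all hypothetical; only the minimum support of five comes from the text.

```python
from typing import List, Tuple


def candidate_breakpoints(invalid_positions: List[int],
                          min_support: int = 5,
                          window: int = 10_000) -> List[Tuple[int, int, int]]:
    """Group nearby invalid mate pairs and keep groups with enough support.

    Returns (first_position, last_position, count) for each group of at
    least min_support invalid pairs whose consecutive positions are no
    more than window bp apart.
    """
    clusters: List[Tuple[int, int, int]] = []
    current: List[int] = []
    for pos in sorted(invalid_positions):
        if current and pos - current[-1] > window:
            # Gap too large: close the current group before starting a new one.
            if len(current) >= min_support:
                clusters.append((current[0], current[-1], len(current)))
            current = []
        current.append(pos)
    if len(current) >= min_support:
        clusters.append((current[0], current[-1], len(current)))
    return clusters


toy_invalid = [100, 500, 900, 1500, 2000, 50_000, 51_000, 200_000]
toy_breaks = candidate_breakpoints(toy_invalid)
```

Requiring several concordant invalid pairs reflects the point made earlier that individual mate pairs can be wrong because of tracking errors or chimerism, whereas several pairs denying the same ordering are unlikely to be coincidental.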

3 Gene Prediction and Annotation

Summary. To enumerate the gene inventory, we developed an integrated, evidence-based approach named Otto. The evidence used to increase the likelihood of identifying genes includes regions conserved between the mouse and human genomes, similarity to ESTs or other mRNA-derived data, or similarity to other proteins. A comparison of Otto (combined Otto-RefSeq and Otto homology) with Genscan, a standard gene-prediction algorithm, showed greater sensitivity (0.78 versus 0.50) and specificity (0.93 versus 0.63) of Otto in the ability to define gene structure. Otto-predicted genes were complemented with a set of genes from three gene-prediction programs that exhibited weaker, but still significant, evidence that they may be expressed. Conservative criteria, requiring at least two lines of evidence, were used to define a set of 26,383 genes with good confidence that were used for more detailed analysis presented in the subsequent sections. Extensive manual curation to establish precise characterization of gene structure will be necessary to improve the results from this initial computational approach.

3.1 Automated gene annotation

A gene is a locus of cotranscribed exons. A single gene may gi
