Post on 13-Jan-2016
Data Management in a High-Throughput, Science-based Genome Center
NIGMS Protein Structure Initiative Workshop on Data Management
Steven L. Salzberg
The Institute for Genomic Research and Johns Hopkins University
• How can you run 50 projects in parallel and:
– Maintain production
– Generate consistent, high-quality data
– Share data and software with the scientific community
– Publish research of the highest quality
– Adapt quickly to new technologies
Genomes completed and published by TIGR and our collaborators, 1995-present
Organism | Reference
Arabidopsis thaliana | Lin et al., Nature 402: 761-8 (2000)
Archaeoglobus fulgidus | Klenk et al., Nature 390: 364-370 (1997)
Bacillus anthracis Ames | Read et al., Nature 423: 81-86 (2003)
Bacillus anthracis Florida | Read et al., Science 296: 2028-33 (2002)
Borrelia burgdorferi | Fraser et al., Nature 390: 580-586 (1997)
Brucella suis | Paulsen et al., PNAS 99 (2002)
Caulobacter crescentus | Nierman et al., PNAS 98 (2001)
Chlamydia pneumoniae | Read et al., Nucl. Acids Res. 28 (2000)
Chlamydia muridarum | Read et al., Nucl. Acids Res. 28 (2000)
Chlamydophila caviae | Read et al., Nucl. Acids Res. 31 (2003)
Chlorobium tepidum | Eisen et al., PNAS 99: 9509-9514 (2002)
Coxiella burnetii RSA 493 | Seshadri et al., PNAS 100: 5455-60 (2003)
Deinococcus radiodurans | White et al., Science 286 (1999)
Enterococcus faecalis | Paulsen et al., Science 299: 2071-2074 (2003)
Haemophilus influenzae | Fleischmann et al., Science 269 (1995)
Helicobacter pylori | Tomb et al., Nature 388: 539-547 (1997)
Methanococcus jannaschii | Bult et al., Science 273: 1058-1073 (1996)
Mycobacterium tuberculosis | Fleischmann et al., J. Bact. 184 (2002)
Mycoplasma genitalium | Fraser et al., Science 270: 397-403 (1995)
Neisseria meningitidis | Tettelin et al., Science 287 (2000)
Oryza sativa (rice) chr 10 | Wing et al., Science 300: 1566-1569 (2003)
Plasmodium falciparum | Gardner et al., Nature 419: 531-534 (2002)
Plasmodium yoelii | Carlton et al., Nature 419: 512-519 (2002)
Porphyromonas gingivalis | Nelson et al., J. Bact., in revision
Pseudomonas putida | Nelson et al., Envir. Microbiol. (2002)
Shewanella oneidensis | Heidelberg et al., Nat. Biotech. 20 (2002)
Streptococcus agalactiae | Tettelin et al., PNAS 99 (2002)
Streptococcus pneumoniae | Tettelin et al., Science 293 (2001)
Sulfolobus islandicus virus | Arnold et al., Virology 15: 252-66 (2000)
Thermotoga maritima | Nelson et al., Nature 399: 323-329 (1999)
Treponema pallidum | Fraser et al., Science 281: 375-388 (1998)
Vibrio cholerae | Heidelberg et al., Nature 406 (2000)
Genomes in progress or recently completed
Fibrobacter succinogenes, Prevotella intermedia, Pseudomonas fluorescens, Silicibacter pomeroyi DSS-3, Streptococcus agalactiae A909, Streptococcus gordonii, Streptococcus mitis, Streptococcus pneumoniae 670, Acidobacterium capsulatum, Bacillus anthracis A01055, Bacillus anthracis A0402, Bacillus anthracis Ames 0581, Burkholderia thailandensis, Campylobacter coli RM2228, Campylobacter upsaliensis RM3195, Clostridium perfringens SM101, Epulopiscium fishelsoni, Hyphomonas neptunium, Listeria monocytogenes F6854, Listeria monocytogenes H7858, Mycoplasma arthritidis, Mycoplasma capricolum, Myxococcus xanthus, Prevotella ruminicola, Pyrococcus furiosus, Verrucomicrobium spinosum, Actinomyces naeslundii
Bacillus anthracis A0071, Bacillus anthracis Kruger B, Erwinia chrysanthemi, Gemmata obscuriglobus, Mycobacterium tuberculosis, Ruminococcus albus, Streptococcus sobrinus, Aspergillus fumigatus, Brugia malayi, Coccidioides immitis, Cryptococcus neoformans, Entamoeba histolytica, Oryza sativa Chromosome 3 & 10, Plasmodium vivax, Schistosoma mansoni, Solanum spp., Tetrahymena thermophila, Toxoplasma gondii, Theileria parva, Trichomonas vaginalis, Trypanosoma brucei, Trypanosoma cruzi
Acidithiobacillus ferrooxidans, Bacillus anthracis Kruger B, Burkholderia mallei, Clostridium perfringens ATCC13124, Dehalococcoides ethenogenes, Desulfovibrio vulgaris, Ehrlichia chaffeensis, Ehrlichia sennetsu, Geobacter sulfurreducens, Listeria monocytogenes, Methylococcus capsulatus, Mycobacterium avium 104, Mycobacterium smegmatis, Pseudomonas syringae, Staphylococcus aureus, Staphylococcus epidermidis, Treponema denticola, Wolbachia sp., Anaplasma phagocytophila, Bacillus cereus 10987, Bacteroides forsythus, Brucella ovis, Baumannia cicadellinicola, Campylobacter jejuni, Carboxydothermus hydrogenoformans, Colwellia sp. 34H, Dichelobacter nodosus
A Whole-Genome Shotgun Sequencing Project
Shotgun sequencing
Genome assembly
Annotation
Data release
Downstream research
Library construction
Colony picking
Template preparation
Sequencing reactions
Base calling
Sequence files
Assembler -> Genome scaffold
Ordered contig set
Gap closure, sequence editing
Re-assembly
ONE ASSEMBLY!
(per molecule)
Combinatorial PCR, POMP
Gene finding
Homology searches
Function assignments
Metabolic pathways
Gene families
Comparative genomics
Transcriptional/translational regulatory elements
Repetitive sequences
Publication
www.tigr.org
LIMS entry point
Microarray studies
Vaccine, drug development
Human disease studies
Sequence Data Management
• Professional software engineers
– Continual contact with lab staff
• Separate research staff
– Computational research, separate from production “pipeline”
  • Genome assembly
  • Gene finding
  • Sequence alignment
• Biology/genomics research staff
Joint Technology Center
• TIGR doubled its sequencing capacity in a 2-month period, December 2002 to January 2003
• We moved our entire facility to a new building and tripled its capacity in June-July 2003
• All databases, network connections, LIMS software continued operating smoothly throughout
Sequence LIMS Processes at TIGR
Colony Plate -> Culture Plate -> DNA Plate -> Reaction Plate
DNA Sequencer (ABI 3730xl)
Chromatogram Files
LIMS-Database Interactions at TIGR (circa 2001)
[Diagram: LIMS applications (Uploader, Tracker, Create Gel Sheet, GelSheetMaker, Map Ricky, Tracker, Create/Edit Rxn Sheet) reading and writing the project database tables: library, template, sample, reaction, gel, sequence, feature, bases]
One database per sequencing project....
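The table names on this slide (library, template, sample, reaction, gel, sequence) suggest a chain of parent-child records. The sketch below is only an illustration of that idea using Python's built-in sqlite3 module. Every column name, the foreign-key layout, and the helper functions `open_project_db` and `trace_sequence` are assumptions made for the example, not TIGR's actual schema.

```python
import sqlite3

# Table names come from the slide; all columns and relationships are hypothetical.
SCHEMA = """
CREATE TABLE library  (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE template (id INTEGER PRIMARY KEY,
                       library_id INTEGER NOT NULL REFERENCES library(id));
CREATE TABLE sample   (id INTEGER PRIMARY KEY,
                       template_id INTEGER NOT NULL REFERENCES template(id));
CREATE TABLE gel      (id INTEGER PRIMARY KEY, run_date TEXT);
CREATE TABLE reaction (id INTEGER PRIMARY KEY,
                       sample_id INTEGER NOT NULL REFERENCES sample(id),
                       gel_id INTEGER REFERENCES gel(id));
CREATE TABLE sequence (id INTEGER PRIMARY KEY,
                       reaction_id INTEGER NOT NULL REFERENCES reaction(id),
                       bases TEXT NOT NULL);
"""


def open_project_db(path=":memory:"):
    """Create one database for one sequencing project (in memory by default)."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA foreign_keys = ON")  # enforce the parent-child links
    conn.executescript(SCHEMA)
    return conn


def trace_sequence(conn, sequence_id):
    """Walk from a sequence back to the name of the library it came from."""
    row = conn.execute(
        """
        SELECT library.name
        FROM sequence
        JOIN reaction ON reaction.id = sequence.reaction_id
        JOIN sample   ON sample.id = reaction.sample_id
        JOIN template ON template.id = sample.template_id
        JOIN library  ON library.id = template.library_id
        WHERE sequence.id = ?
        """,
        (sequence_id,),
    ).fetchone()
    return row[0] if row else None
```

With one database file per project, as the slide describes, a call such as `open_project_db("project_x.db")` would create an isolated set of tables for that project; the file name here is purely illustrative.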
Finishing Center – Sequencing Center Data Interchange (mid-2003)
DNA
Data
• Reads
– Bases
– Quality
– Chromatograms + positions
– Revision
– Trimming info
– Insert mapping/pairing
– Chemistry, read end, etc.
• Library info (size estimators)
• Vectors used
QC
• Yield
• Randomness
• Percent good quality
• Percent contaminant
Sequencing Center (SC)
Reaction lists on existing clones
– Insert Id
– Primer
Finishing Center (FC)
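Two of the QC measures listed above, percent good quality and percent contaminant, can be illustrated with a small summary function. This is a minimal sketch under stated assumptions: the `Read` record, the `qc_summary` function, and the thresholds (a read counts as "good" if at least 100 bases have quality 20 or higher) are hypothetical and are not taken from the slides. Yield and randomness are left out because the slides do not say how they are measured.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Read:
    """Hypothetical read record exchanged between the two centers."""
    name: str
    quality: List[int] = field(default_factory=list)  # per-base quality values
    is_contaminant: bool = False  # set by some upstream screen (not shown here)


def qc_summary(reads, min_quality=20, min_good_bases=100):
    """Return the percentage of good-quality and contaminant reads.

    A read is counted as good when at least `min_good_bases` of its bases
    have a quality value of `min_quality` or more (assumed definition).
    """
    total = len(reads)
    if total == 0:
        return {"reads": 0, "percent_good": 0.0, "percent_contaminant": 0.0}
    good = sum(
        1 for r in reads
        if sum(1 for q in r.quality if q >= min_quality) >= min_good_bases
    )
    contaminant = sum(1 for r in reads if r.is_contaminant)
    return {
        "reads": total,
        "percent_good": 100.0 * good / total,
        "percent_contaminant": 100.0 * contaminant / total,
    }
```

A per-library summary of this kind is one way the QC figures could travel alongside the read data; the actual interchange format is not described in the slides.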
IT Support
• High-quality computers and systems support is absolutely critical
• At the same time, IT support should be invisible (ideally)
• TIGR has 15 full-time, professional IT staff
– Systems administrators
– Database administrators
– Web administrators
– Network administrators
– Desktop support
IT Infrastructure
• 10 Compaq Alpha ES40s, max 32 GB RAM
– high-end computing
• 15 UltraSPARC and SunFire servers
– database and web services
• 400 Pentium-based Linux computers
– grid computing
• Gigabit backbone network
• Network-attached file storage
– NetApp, EMC
IT example: Grid computing facility
[Chart: pool compute cycles vs. owner compute cycles, January 2001, 1-week snapshot]
High-throughput, automated annotation
• 10 bioinformatics engineers maintain software pipeline
• Can completely process a bacterial genome in one day
• Manage all data uploads to GenBank
• Specialized analyses for publications
Manual annotation: ~10 genes / day
• Eight bacterial genome annotators
• Inspection of:
– Search results
– TIGRFam matches
– Experimentally characterized gene
– Literature references – abstracts and more
• Assignment of:
– Common name
– Role category
– Genetic name
– EC number
Genome Annotation Processes
Owen White, Director of Bioinformatics
Charles Lu, Bioinformatics Engineer II
William C. Nelson, Bioinformatics Analyst
Todd Creasy, Bioinformatics Engineer II
Jaideep P. Sundaram, Bioinformatics Engineer III
Christopher R. Hauser, Bioinformatics Engineer II
Kelly S. Moffat, Laboratory Data Specialist
Hean L. Koo, Laboratory Data Specialist
Sean Daugherty, Bioinformatics Analyst I
Lauren Brinkac, Bioinformatics Analyst I
Robert Dodson, Bioinformatics Analyst III
Robert DeBoy, Bioinformatics Analyst II
Michelle Giglio, Staff Scientist
TBA, Bioinformatics Engineer II
TBA, Bioinformatics Engineer II
Tanja Davidsen, Bioinformatics Engineer II
Nikhat Zafar, Bioinformatics Engineer I
TBA, Bioinformatics Engineer II
Steven L. Salzberg, Senior Director, Bioinformatics
Martin Shumway, Software Eng Manager
Arthur Delcher, Sr Bioinformatics Scientist
Corina Antonescu, Bioinformatics Engineer
Software Development
Web Content
Data Curation
Software Maintenance
Assembly, SNPs
Anup Mahurkar, Software Engineer Manager
Michael Schatz, Software Engineer
Daniel Kosack, Software Engineer
Samuel Angiuoli, Bioinformatics Supervisor
Ian Paulsen, Associate Investigator
Qinghu Ren, Post-doctoral Researcher
Jonathan Eisen, Investigator
Phylogenetic Analyses
Karen Nelson, Associate Investigator
Metabolism
Transporters
Jeremy Peterson, Bioinformatics Manager
TBA, Bioinformatics Engineer II
Pawel Gajer, Staff Scientist
...plus 10 more annotators
Website dbase
Manatee: a collaborative tool
• Manual Annotation Tool, Etc Etc…
• Open Source: manatee.sourceforge.net
• Based on Chado relational schema
• Several installations
– one week to install
• Fully documented
– API
– User manual
– Installation
• Testing
– Unit, integration testing
– Deployment
• Quarterly training classes
Gene Information Page
Gene Identification Information
Gene Ontology and Cellular Role
Graphical Display of Analyses
Textual Display of Analyses
Pair-wise Alignment Summary
Experimentally characterized proteins indicated by color
Summary of Genome Information
Gene Information Page
Online help system
• Published: 33
• Completed: 18
• Closure: 20
• High-throughput sequencing: 22
• Library construction: 19
• Trend: more closely related genomes
Annotation pipeline
TigrDB
Gene coords
Seq/pep files
Search results
Families
ChuggaChugga
GenBank
Annotation research example: position effect
[Figure: genome-wide overview of predicted metabolism and membrane transport, with gene identifiers attached to each step. Metabolic panels: glycolysis, ED and PPP, TCA and glyoxylate bypass, and utilization of amino acids, sugars, and sulfur compounds. Transport panels: H+ and Na+ coupled transporters, ATP-dependent transporters, the fructose PTS, F-type and P-type ATPases, mechanosensitive ion channels, LysE family exporters, BenF/PhaK/OprD porins, TonB-dependent receptors, and OprB-like and OprP porins.]
TIGRFams
• Heavily curated multiple alignments based on protein families of the same function
• Proposed “cure” for transitive annotation
• Based on Hidden Markov Models (HMMs)
• Approaching 2,000 families
• Complete assignments to Gene Ontology
• Cutoff scores for each family
– Trusted (automated name assignment)
– Noise (manual inspection required)
• Downloadable
• Fully integrated into the InterPro database
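The two per-family cutoffs described for TIGRFams can be read as a simple three-way decision on an HMM hit score. The sketch below is one possible reading, not the pipeline's actual code: the function name `classify_hit`, the label strings, and the treatment of scores below the noise cutoff as "no match" are assumptions added for illustration.

```python
def classify_hit(score, trusted_cutoff, noise_cutoff):
    """Decide what to do with one HMM hit against one family.

    score >= trusted_cutoff         -> name can be assigned automatically
    noise_cutoff <= score < trusted -> send to an annotator for inspection
    score < noise_cutoff            -> treated as no match (assumption)
    """
    if noise_cutoff > trusted_cutoff:
        raise ValueError("noise cutoff must not exceed trusted cutoff")
    if score >= trusted_cutoff:
        return "auto_assign"
    if score >= noise_cutoff:
        return "manual_review"
    return "no_match"
```

For a hypothetical family with a trusted cutoff of 250 and a noise cutoff of 100, a hit scoring 300 would be named automatically, while a hit scoring 150 would be queued for manual inspection.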
TIGRFAMs: Genome coverage
[Bar chart, y-axis 0 to 2000: TIGRFAMs matches and identified genes for 16 genomes]
TIGRFAMS: 174 190 247 0 256 216 189 332 302 386 298 148 338 370 186 450
Identified genes: 522 393 1894 381 768 899 559 1338 947 1024 889 263 1210 961 504 1676
Genome labels (two staggered rows, as extracted): M. j., Chl tr, Strep, Chl tr, Therm; A. f., B. b., Caul, D. r., H. i., H.p., M. g., TB, Neis., Trep, Vib.
Multiple Genome Annotation
• Genes go through the usual pipeline
– BLAST, Pfam, COG, TIGRFam, InterPro, etc.
• Cluster genes based on common properties using hierarchical clustering
• Display
• Can annotators select subsets of genes for reliable assignment?
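Clustering genes on shared evidence can be sketched with a small single-linkage agglomerative procedure. This is an illustrative stand-in, not the method used at TIGR: the slides do not name the distance measure or linkage rule, so Jaccard distance over sets of evidence labels and single linkage are assumptions, as are the function names and the `max_distance` stopping rule.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """1 - |a & b| / |a | b|; two empty sets are treated as identical."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def cluster_genes(evidence, max_distance):
    """Single-linkage agglomerative clustering of genes.

    evidence: dict mapping a gene id to the set of evidence labels
              (for example family or domain hits) found for that gene.
    Repeatedly merges the two closest clusters and stops once the closest
    pair is farther apart than max_distance.
    """
    clusters = [[gene] for gene in sorted(evidence)]

    def linkage(c1, c2):
        # Single linkage: distance between the closest pair of members.
        return min(jaccard_distance(evidence[x], evidence[y])
                   for x in c1 for y in c2)

    while len(clusters) > 1:
        (i, j), dist = min(
            (((i, j), linkage(clusters[i], clusters[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if dist > max_distance:
            break
        clusters[i] = sorted(clusters[i] + clusters[j])
        del clusters[j]  # j > i, so index i is unaffected
    return sorted(clusters)
```

Genes that end up in a tight cluster share most of their evidence, which is the kind of subset an annotator might review and name together. The merge order also gives the tree that a display could draw.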
Malate dehydrogenase
Lactate dehydrogenase
Sybil Comparative System
• Open Source software
• Complete, portable data management system for genome annotation
• Chado relational database (developed by FlyBase and other collaborators)
• Extensive graphical interface
• Priority “use case”: management of genes and genomes for identification of pathogen-related genes
Sybil Overview
A B C
Tabular views of conserved synteny, orthologs, and BLAST matches across multiple genomes
Open Source Software at TIGR
• Manatee
• Sybil (prototypes)
• Annotation Engine
• MUMmer (large-scale genome alignment)
• BAMBUS (Assembly/scaffolding)
• Glimmer, GlimmerM, Exonomy (Gene finding)
• TM4 (Microarray tools)
• Chado/BSML
Perl Artistic License
Conclusions
• Professional software and support staff are critical to a large, high-throughput research project
• Scientists benefit from frequent interactions with production line staff
• High-quality support allows scientific staff to devote more effort to scientific discovery