church nhgri 2012
@deannachurch
Deanna M. Church, NCBI
The Evolution of Genome Data
Collins FS et al, 1998
Throughput: 500 Mb/year
Cost: < $0.25 per base
Variation: 100,000 SNPs mapped
[Figure: Twenty Two Years of Growth: NCBI Data and User Services. X axis: year, 1989 to 2011. Y axes: GenBank base pairs (millions) and users per weekday (average).]
BLAST
Entrez, GenBank at NCBI, dbEST
3D Structure, Network Entrez
WWW, dbSTS
BankIt, Genomes, Taxonomy
OMIM, GeneMap, Cn3D, UniGene
PubMed, PSI-BLAST, VAST, ePCR
Microbial Genomes, PHI-BLAST, CGAP
Human Genome, LinkOut, LocusLink, RefSeq, dbSNP
PubMed Central, BLINK, MapViewer, GEO, GeneRIFs
WGS, HLA Haplotypes, Human Genome-TPA
dbMHC, BookShelf, Human Genome-Transcripts Alignments
Entrez Genes, Mouse Composite Genome, Gnomon
PubChem, Trace Archive, CCDS, Cancer Chromosomes, Environmental Samples
Public Access, Influenza Seqs., GenSAT, GeneTests
Genome-Wide Association Studies dbGap, Entrez Portal
Seq Read Archive, UniSTS, RefSeqGene, Genome Reference Consortium
Discovery Initiative, Entrez Sensors, Primer BLAST
Peptidome, BioSystems, Flu H1N1
dbVar, Epigenomics, MyNCBI, 1000 Genomes Project
ClinVar, GTR, Genome Remapping Service, PubMed Health, CloneDB, Genome Decoration Page
Steve Sherry, NCBI
[Figure: NCBI dbSNP database growth, human variations, 1999 to 2011 (dbSNP build 135, November 2011). Panel 1: non-redundant annotations, in millions of rs-ids, by class (STR & Indel, SNP, Ambiguous mapping). Panel 2: submissions by project, in millions of submissions (1000 Genomes, HapMap, TSC, Other projects).]
Kidd et al, 2007 APOBEC cluster
Black: Deletion
White: Insertion
http://www.ncbi.nlm.nih.gov/dbvar
Church et al., 2011 PLoS
http://genomereference.org
Distributed data
Genome not in INSDC Database
Old Assembly Model
GRC Beginnings
Build sequence contigs based on contigs defined in TPF.
Check for orientation consistencies
Select switch points
Instantiate sequence for further analysis
Switch point
Consensus sequence
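The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the GRC's actual pipeline: the `Component` record, its field names, and the toy clone names and sequences are all assumptions made for the example. Each component is taken in tiling-path order, oriented, and then contributes only the bases between its switch points.

```python
from dataclasses import dataclass

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


@dataclass
class Component:
    """One clone in a tiling path (hypothetical record layout)."""
    accession: str   # made-up accession.version for illustration
    sequence: str    # clone sequence as deposited
    strand: str      # "+" or "-": orientation relative to the contig
    start: int       # switch point into this clone (0-based, inclusive)
    end: int         # switch point out of this clone (exclusive)


def build_contig(tiling_path):
    """Orient each component, cut it at its switch points, and join the pieces."""
    pieces = []
    for comp in tiling_path:
        if comp.strand not in ("+", "-"):
            raise ValueError(f"{comp.accession}: bad strand {comp.strand!r}")
        oriented = (comp.sequence if comp.strand == "+"
                    else reverse_complement(comp.sequence))
        if not 0 <= comp.start < comp.end <= len(oriented):
            raise ValueError(f"{comp.accession}: switch points out of range")
        pieces.append(oriented[comp.start:comp.end])
    return "".join(pieces)


# Two toy clones that overlap on "CCCC"; the switch happens mid-overlap.
path = [
    Component("CLONE_A.1", "AAAACCCC", "+", 0, 6),
    Component("CLONE_B.1", "TTTTGGGG", "-", 2, 8),
]
contig = build_contig(path)
print(contig)
```

The orientation check and the range check stand in for "check for orientation consistencies"; a real pipeline would also verify that the overlapping bases agree before choosing a switch point.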
Community Input
Distributed data
Genome not in INSDC Database
Old Assembly Model
Centralized Data
Large-Scale Variation Complicates Genome Assembly
Sequences from haplotype 1
Sequences from haplotype 2
Old Assembly model: compress into a consensus
New Assembly model: represent both haplotypes
NCBI36 (hg18)
UGT2B17 Region
AC074378.4, AC079749.5
AC134921.2, AC147055.2
AC140484.1, AC019173.4
AC093720.2, AC021146.7
NCBI36 NC_000004.10 (chr4) Tiling Path
Xue Y et al, 2008
TMPRSS11E TMPRSS11E2
GRCh37 NC_000004.11 (chr4) Tiling Path
AC074378.4, AC079749.5
AC134921.1, AC147055.2
AC093720.2, AC021146.7
TMPRSS11E
GRCh37: NT_167250.1 (UGT2B17 alternate locus)
AC074378.4, AC140484.1
AC019173.4, AC226496.2
AC021146.7
TMPRSS11E2
UGT2B17 Region
GRCh37 (hg19)
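One way to see what changed between the two builds is to compare the component lists as sets. The sketch below uses only the accessions listed on these slides; the function name is an assumption, and because the order of accessions in the transcript may not match the real tiling-path order, the comparison deliberately ignores order and version.

```python
def base_accessions(components):
    """Strip the .version suffix so different versions of a clone compare equal."""
    return {acc.split(".", 1)[0] for acc in components}


ncbi36_chr4 = ["AC074378.4", "AC079749.5", "AC134921.2", "AC147055.2",
               "AC140484.1", "AC019173.4", "AC093720.2", "AC021146.7"]
grch37_chr4 = ["AC074378.4", "AC079749.5", "AC134921.1", "AC147055.2",
               "AC093720.2", "AC021146.7"]
grch37_alt = ["AC074378.4", "AC140484.1", "AC019173.4", "AC226496.2",
              "AC021146.7"]  # NT_167250.1, the UGT2B17 alternate locus

old = base_accessions(ncbi36_chr4)
new_chr = base_accessions(grch37_chr4)
new_alt = base_accessions(grch37_alt)

# Clones dropped from the chromosome path but kept on the alternate locus.
moved_to_alt = sorted((old - new_chr) & new_alt)
# Clones on the alternate locus that were in neither chromosome path.
added_for_alt = sorted(new_alt - old - new_chr)

print("moved to alternate locus:", moved_to_alt)
print("new on alternate locus:", added_for_alt)
```

Read this way, the mixed NCBI36 path is split in GRCh37 into a chromosome path and a separate alternate-locus path.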
7 alternate haplotypes at the MHC
Alternate loci released as:
FASTA
AGP
Alignment to chromosome
UGT2B17 MHC MAPT
Assembly (e.g. GRCh37)
Primary Assembly
Non-nuclear assembly unit
(e.g. MT)
ALT 1
ALT 2
ALT 3
ALT 4
ALT 5
ALT 6
ALT 7
ALT 8
ALT 9
PAR
Genomic Region (MHC)
Genomic Region (UGT2B17)
Genomic Region (MAPT)
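The assembly model in this diagram can be written down as a small data structure. The sketch below is illustrative only: the class and field names are assumptions, and so is the assignment of specific ALT units to regions (the slides state seven alternate haplotypes at the MHC plus UGT2B17 and MAPT, but not which ALT unit holds which).

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class AssemblyUnit:
    name: str
    region: Optional[str] = None  # genomic region, set only for alternate loci


@dataclass
class Assembly:
    name: str
    primary: AssemblyUnit
    non_nuclear: AssemblyUnit
    alternates: List[AssemblyUnit] = field(default_factory=list)

    def alternates_per_region(self):
        """Count how many alternate loci each genomic region has."""
        counts = {}
        for unit in self.alternates:
            counts[unit.region] = counts.get(unit.region, 0) + 1
        return counts


# Assumed layout: one alternate locus per ALT unit, MHC haplotypes first.
regions_in_order = ["MHC"] * 7 + ["UGT2B17", "MAPT"]
grch37 = Assembly(
    name="GRCh37",
    primary=AssemblyUnit("Primary Assembly"),
    non_nuclear=AssemblyUnit("MT"),
    alternates=[AssemblyUnit(f"ALT {i}", region)
                for i, region in enumerate(regions_in_order, start=1)],
)
print(grch37.alternates_per_region())
```

Keeping the alternates separate from the primary unit is what lets a region carry more than one haplotype without compressing them into a consensus.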
Richa Agarwala
MHC Alternate locus
Alignment to chr6
Oh No! Not a new version of the human genome!
Assembly (e.g. GRCh37.p5)
Primary Assembly
Non-nuclear assembly unit
(e.g. MT)
ALT 1
ALT 2
ALT 3
ALT 4
ALT 5
ALT 6
ALT 7
ALT 8
ALT 9
PAR
…
Genomic Region (MHC)
Genomic Region (UGT2B17)
Genomic Region (MAPT)
Patches
Genomic Region (ABO)
Genomic Region (SMA)
Genomic Region (PECAM1)
TBC1D3C TBC1D3
TBC1D3C
TBC1D3H
Myo19 region (17q21)
70 Fix PATCHES: Chromosome will update in GRCh38
71 Novel PATCHES: Additional sequence added
(adds >1 Mb of novel sequence to the assembly)
(adds >800K of novel sequence to the assembly)
Releasing patches quarterly
Old Assembly Model
Distributed data
Genome not in INSDC Database
Updated Assembly Model
Centralized Data
Genome in INSDC Database
GenBank
Data Archives
Data in a common format
Data in a single location (and mirrored)
Most quality checked prior to deposition
Robust data tracking mechanism (accession.version)
Data owned by submitter
Data tracking
ABC14-1065514J1   Gaps   Phase   Length   Date
FP565796.1   1   1   21-Oct-2009
FP565796.2   1   0   14-Oct-2010
FP565796.3   3   0   07-Nov-2010
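The accession.version convention makes this kind of tracking easy to automate. As a minimal sketch (the function names are assumptions, not an NCBI API), the following splits identifiers into accession and version and keeps the newest version of each record.

```python
def parse_accession(acc_ver: str):
    """Split 'FP565796.3' into ('FP565796', 3)."""
    accession, sep, version = acc_ver.rpartition(".")
    if not sep or not accession or not version.isdigit():
        raise ValueError(f"not an accession.version identifier: {acc_ver!r}")
    return accession, int(version)


def latest_versions(identifiers):
    """Return the highest-versioned identifier seen for each accession."""
    latest = {}
    for identifier in identifiers:
        accession, version = parse_accession(identifier)
        if version > latest.get(accession, 0):
            latest[accession] = version
    return {acc: f"{acc}.{ver}" for acc, ver in latest.items()}


seen = ["FP565796.1", "FP565796.3", "FP565796.2", "AC074378.4"]
print(latest_versions(seen))
```

Because the accession stays fixed while the version increments with each sequence update, an analysis that records the full identifier can always be traced back to the exact sequence it used.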
Mouse chrX: 34,800,000-34,890,000
NC_000086.123456 CM001013.17 2
Mouse chrX: 35,000,000-36,000,000
X
MGSCv3 MGSCv36
hg19 GRCh37
mm8 MGSCv37
NCBIM37
danRer5 Zv7
What’s in a name?
By any other name…
chr21:8,913,216-9,246,964
Zv7 chr21:8,913,216-9,246,964 X Mouse Build 36 chrX
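The point of these slides is that coordinates mean nothing without an unambiguous assembly name. A minimal sketch of one defensive habit follows; the alias table holds only the hg19/GRCh37 and danRer5/Zv7 pairs shown above, and the `assembly:region` string format is an assumption chosen for illustration.

```python
# Aliases taken from the slides; extend with care, since names are easy to confuse.
ASSEMBLY_ALIASES = {
    "hg19": "GRCh37",
    "danRer5": "Zv7",
}


def qualify(assembly: str, region: str) -> str:
    """Attach a canonical assembly name to a bare coordinate string."""
    canonical = ASSEMBLY_ALIASES.get(assembly, assembly)
    return f"{canonical}:{region}"


region = "chr21:8,913,216-9,246,964"
human = qualify("hg19", region)
zebrafish = qualify("danRer5", region)
print(human)
print(zebrafish)
```

The same coordinate string resolves to two unrelated locations once the assembly is attached, while `hg19` and `GRCh37` resolve to the same one.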
By any other name…
http://www.ncbi.nlm.nih.gov/genome/assembly
GRCh37 / hg19
Assembly (e.g. GRCh37.p5): GCA_000001405.6 / GCF_000001405.17
Primary Assembly: GCA_000001305.1 / GCF_000001305.13
ALT 1: GCA_000001315.1 / GCF_000001315.1
ALT 2: GCA_000001325.1 / GCF_000001325.2
ALT 3: GCA_000001335.1 / GCF_000001335.1
ALT 4: GCA_000001345.1 / GCF_000001345.1
ALT 5: GCA_000001355.1 / GCF_000001355.1
ALT 6: GCA_000001365.1 / GCF_000001365.2
ALT 7: GCA_000001375.1 / GCF_000001375.1
ALT 8: GCA_000001385.1 / GCF_000001385.1
ALT 9: GCA_000001395.1 / GCF_000001395.1
Patches: GCA_000005045.5 / GCF_000005045.4
Non-nuclear assembly unit (e.g. MT): GCA_000006015.1 / GCF_000006015.1
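In the accessions above, the GenBank (GCA) and RefSeq (GCF) identifiers for the same assembly or assembly unit share their numeric part, while the versions advance independently. A small sketch of checking that pairing, with function names that are assumptions rather than any NCBI API:

```python
import re

# GCA_ (GenBank) or GCF_ (RefSeq), nine digits, then a version number.
ASSEMBLY_ACCESSION = re.compile(r"^(GC[AF])_(\d{9})\.(\d+)$")


def parse_assembly_accession(accession: str):
    """Break an assembly accession into source, numeric part, and version."""
    match = ASSEMBLY_ACCESSION.match(accession)
    if match is None:
        raise ValueError(f"not an assembly accession: {accession!r}")
    prefix, number, version = match.groups()
    return {
        "source": "GenBank" if prefix == "GCA" else "RefSeq",
        "number": number,
        "version": int(version),
    }


def is_paired(gca: str, gcf: str) -> bool:
    """True when a GCA and a GCF accession share the same numeric part."""
    a = parse_assembly_accession(gca)
    b = parse_assembly_accession(gcf)
    return (a["source"] == "GenBank" and b["source"] == "RefSeq"
            and a["number"] == b["number"])


print(is_paired("GCA_000001405.6", "GCF_000001405.17"))   # full assembly
print(is_paired("GCA_000001405.6", "GCF_000001305.13"))   # assembly vs primary unit
```

Comparing versions across the pair is not meaningful: GRCh37.p5 is version 6 on the GenBank side and version 17 on the RefSeq side.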
GenBank vs RefSeq
GenBank | RefSeq
Submitter Owned | RefSeq Owned
Redundancy | Non-Redundant
Updated rarely | Curated
INSDC | Not INSDC
BRCA1: 83 genomic records, 31 mRNA records, 27 protein records | BRCA1: 3 genomic records, 5 mRNA records, 1 RNA record, 5 protein records
RefSeq for Assemblies
Typical assembly edits
Addition of non-nuclear (e.g. MT) assembly units
Removal of contamination
Drop unlocalized/unplaced scaffolds
Mask contamination that is placed on chromosome
http://www.ncbi.nlm.nih.gov/genome
Understanding relationships between assemblies using alignments
First Pass
Second Pass
Reciprocal best hit
Non-reciprocal, duplicative hits
No second pass alignments in GRCh37.p5
NCBI36
GRCh37.p5
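The slides do not spell out how the first-pass and second-pass alignments are computed, so the following is only a generic sketch of the reciprocal-best-hit idea named above, not NCBI's procedure. The region names and scores are made up for illustration.

```python
def best_hits(alignments):
    """Map each query to its highest-scoring target.

    `alignments` is an iterable of (query, target, score) tuples.
    """
    best = {}
    for query, target, score in alignments:
        if query not in best or score > best[query][1]:
            best[query] = (target, score)
    return {query: target for query, (target, _score) in best.items()}


def reciprocal_best_hits(a_to_b, b_to_a):
    """Pairs (a, b) where a's best hit is b and b's best hit is a."""
    forward = best_hits(a_to_b)
    reverse = best_hits(b_to_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


# Hypothetical alignments between regions of an old and a new assembly.
old_to_new = [("old_1", "new_1", 98), ("old_1", "new_2", 90),
              ("old_2", "new_2", 95), ("old_3", "new_2", 80)]
new_to_old = [("new_1", "old_1", 98), ("new_2", "old_2", 95),
              ("new_2", "old_3", 80)]

reciprocal = reciprocal_best_hits(old_to_new, new_to_old)
non_reciprocal = sorted(set(best_hits(old_to_new)) - {a for a, _ in reciprocal})
print(reciprocal)
print(non_reciprocal)
```

In this toy data `old_3` aligns best to `new_2`, but `new_2` prefers `old_2`, which is the kind of non-reciprocal, duplicative hit the legend distinguishes from a reciprocal best hit.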
http://www.ncbi.nlm.nih.gov/tools/gbench/
Genome Data is MORE than just the Genome
ATGCGTGCAAAATGCAGTGAGT
ATGCGTGCAAAATGCAGTGAGT
ATGCGTGCAAAATGCAGTGAGT
ATGCGTGCAAAATGCAGTGAGT
NM_000336.2:c.800C>T
NC_000001.10:g.(?_20700513)_(21062644_?)del
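Variant descriptions like these only make sense relative to a specific reference sequence and version. The sketch below pulls those parts out of a simple substitution such as `NM_000336.2:c.800C>T`. It is deliberately narrow: the pattern and function name are assumptions for illustration, it handles single-base substitutions only, and it is not a full HGVS parser (the deletion with uncertain breakpoints is rejected).

```python
import re

# accession.version : coordinate type (c or g) . position ref > alt
HGVS_SUBSTITUTION = re.compile(
    r"^(?P<accession>[A-Z]{2}_\d+)\.(?P<version>\d+)"
    r":(?P<kind>[cg])\.(?P<position>\d+)"
    r"(?P<ref>[ACGT])>(?P<alt>[ACGT])$"
)


def parse_substitution(expression: str):
    """Parse a single-base substitution into its named parts."""
    match = HGVS_SUBSTITUTION.match(expression)
    if match is None:
        raise ValueError(f"unsupported expression: {expression!r}")
    parts = match.groupdict()
    parts["version"] = int(parts["version"])
    parts["position"] = int(parts["position"])
    return parts


variant = parse_substitution("NM_000336.2:c.800C>T")
print(variant)
```

Keeping the version as its own field makes it explicit that position 800 is defined on version 2 of the transcript record and may not be the same base on another version.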
http://www.ncbi.nlm.nih.gov/variation/tools/1000genomes
http://www.ncbi.nlm.nih.gov/education/
http://www.youtube.com/NCBINLM @NCBI http://www.facebook.com/ncbi.nlm
Thanks!
For Slides: Francoise Thibaud-Nissen, Evan Eichler, Steve Sherry
The Genome Reference Consortium
The Genome Center at Washington University
The Wellcome Trust Sanger Institute
The European Bioinformatics Institute
The National Center for Biotechnology Information
Church group at NCBI: Valerie Schneider, Nathan Bouk, Hsiu-Chuan Chen, Peter Meric, Victor Ananiev, Chao Chen, John Lopez, John Garner, Tim Hefferon, Cliff Clausen
NCBI