church gmod2012 pt1
DESCRIPTION
Part one of my talk at the GMOD 2012 meeting
TRANSCRIPT
@deannachurch
Navigating Genome Resources at NCBI
Deanna M. Church, NCBI
The Evolution of the Reference Human Genome
Part 1
NCBI
BLAST PubMed GenBank
[Figure: Twenty Two Years of Growth: NCBI Data and User Services. Plots GenBank base pairs (millions) and average users per weekday by year, 1989 to 2011.]
BLAST
Entrez, GenBank at NCBI, dbEST
3D Structure, Network Entrez
WWW, dbSTS
BankIt, Genomes, Taxonomy
OMIM, GeneMap, Cn3D, UniGene
PubMed, PSI-BLAST, VAST, ePCR
Microbial Genomes, PHI-BLAST, CGAP
Human Genome, LinkOut, LocusLink, RefSeq, dbSNP
PubMed Central, BLINK, MapViewer, GEO, GeneRIFs
WGS, HLA Haplotypes, Human Genome-TPA
dbMHC, BookShelf, Human Genome-Transcripts Alignments
Entrez Genes, Mouse Composite Genome, Gnomon
PubChem, Trace Archive, CCDS, Cancer Chromosomes, Environmental Samples
Public Access, Influenza Seqs., GenSAT, GeneTests
Genome-Wide Association Studies, dbGaP, Entrez Portal
Seq Read Archive, UniSTS, RefSeqGene, Genome Reference Consortium
Discovery Initiative, Entrez Sensors, Primer BLAST
Peptidome, BioSystems, Flu H1N1
dbVar, Epigenomics, MyNCBI, 1000 Genomes Project
ClinVar, GTR, Genome Remapping Service, PubMed Health, CloneDB, Genome Decoration Page
NCBI
Tools: Blast, GBench, Splign, Cn3D, e-PCR, e-Utilities…
Literature: PubMed, PubMed Central, Bookshelf, MeSH, Gene Reviews…
Data: GenBank, Protein DB, SRA, GEO, dbSNP, Gene, RefSeq…
Entrez: Pathway to Discovery
Amino acid sequence similarity
Coding region features
Nucleotide sequence similarity
Term frequency statistics
Literature citations in sequence databases
Literature citations in sequence databases
MEDLINE abstracts
Nucleotide sequences
Protein sequences
http://www.ncbi.nlm.nih.gov/books/NBK25501/
Programmatic access
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=science[journal]+AND+breast+cancer+AND+2008[pdat]&usehistory=y
<eSearchResult><Count>6</Count><RetMax>6</RetMax><RetStart>0</RetStart><IdList>
<Id>19008416</Id><Id>18927361</Id><Id>18787170</Id><Id>18487186</Id><Id>18239126</Id><Id>18239125</Id>
</IdList>…
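The ESearch request and response on this slide can be reproduced with a few lines of Python. What follows is only a minimal sketch using the standard library: the helper names `build_esearch_url` and `parse_id_list` are hypothetical, the base URL is the one shown on the slide, and the sample response is the slide's truncated XML with a closing root tag added (an assumption) purely so that it parses.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Base URL exactly as shown on the slide.
ESEARCH_URL = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(db, term, usehistory=True):
    """Return an ESearch URL for a database and query term (hypothetical helper)."""
    params = {"db": db, "term": term}
    if usehistory:
        params["usehistory"] = "y"
    # urlencode percent-encodes the brackets and turns spaces into '+'.
    return ESEARCH_URL + "?" + urlencode(params)


def parse_id_list(xml_text):
    """Extract the record IDs from an eSearchResult document (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    return [elem.text for elem in root.findall("./IdList/Id")]


# Response text from the slide. The slide cuts off after </IdList>;
# the closing </eSearchResult> is an assumption added so the sample parses.
SAMPLE_RESPONSE = """<eSearchResult><Count>6</Count><RetMax>6</RetMax><RetStart>0</RetStart><IdList>
<Id>19008416</Id><Id>18927361</Id><Id>18787170</Id><Id>18487186</Id><Id>18239126</Id><Id>18239125</Id>
</IdList></eSearchResult>"""

if __name__ == "__main__":
    query = "science[journal] AND breast cancer AND 2008[pdat]"
    print(build_esearch_url("pubmed", query))
    print(parse_id_list(SAMPLE_RESPONSE))
```

To run it against the live service, the URL could be passed to `urllib.request.urlopen` and the response body handed to `parse_id_list`; the sketch leaves that step out so it works offline.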
http://www.ncbi.nlm.nih.gov/education/
http://www.youtube.com/NCBINLM @NCBI http://www.facebook.com/ncbi.nlm
Collins FS et al, 1998
Throughput: 500 Mb/year
Cost: < $0.25 per base
Variation: 100,000 SNPs mapped
Steve Sherry, NCBI
[Figure: NCBI dbSNP database growth, 1999 to 2011 (dbSNP build 135, November 2011). One panel shows non-redundant annotations of human variations in millions of rs-ids, split into STR & Indel, SNP, and ambiguous mapping. The other shows submissions by project in millions of submissions: 1000 Genomes, HapMap, TSC, and other projects.]
Kidd et al, 2007 APOBEC cluster
Black: Deletion
White: Insertion
http://www.ncbi.nlm.nih.gov/dbvar
Church et al., 2011 PLoS
http://genomereference.org
Distributed data
Genome not in INSDC Database
Old Assembly Model
GRC Beginnings
Build sequence contigs based on contigs defined in TPF.
Check for orientation consistencies
Select switch points
Instantiate sequence for further analysis
Switch point
Consensus sequence
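The contig-building steps listed on this slide (ordered components, orientation, switch points, instantiated sequence) can be illustrated with a toy model. This is a sketch under stated assumptions, not the GRC's actual pipeline: the `Component` class, the `build_contig` function, the clone names, and the representation of a switch point as a pair of offsets are all hypothetical.

```python
from dataclasses import dataclass

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


@dataclass
class Component:
    """One clone in the tiling path (hypothetical toy structure)."""
    accession: str
    sequence: str
    orientation: str = "+"  # "+" or "-" relative to the contig

    def oriented(self):
        """Sequence as it reads along the contig."""
        if self.orientation == "+":
            return self.sequence
        return reverse_complement(self.sequence)


def build_contig(components, switch_points):
    """Join ordered components at the given switch points.

    switch_points[i] = (left_end, right_start): use component i up to
    left_end (exclusive), then continue in component i + 1 from right_start.
    Offsets are 0-based on the oriented sequences (an assumption of this sketch).
    """
    if len(switch_points) != len(components) - 1:
        raise ValueError("need one switch point per adjacent component pair")
    pieces = []
    start = 0
    for i, comp in enumerate(components):
        seq = comp.oriented()
        last = i == len(components) - 1
        end = len(seq) if last else switch_points[i][0]
        pieces.append(seq[start:end])
        if not last:
            start = switch_points[i][1]
    return "".join(pieces)


if __name__ == "__main__":
    # Two toy clones overlapping by "CGGTT"; the second is stored on the minus strand.
    clone_a = Component("CLONE_A", "ACGTACGGTT", "+")
    clone_b = Component("CLONE_B", reverse_complement("CGGTTAGCCA"), "-")
    print(build_contig([clone_a, clone_b], [(8, 3)]))
```

In this toy case the switch point sits inside the overlap, so the joined sequence is the same wherever in the overlap the switch is placed. In a real assembly the two clones may differ within the overlap, which is why the choice of switch point matters.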
ftp://ftp.ncbi.nlm.nih.gov/pub/grc/human/
Community Input
Distributed data
Genome not in INSDC Database
Old Assembly Model
Centralized Data
Large-Scale Variation Complicates Genome Assembly
Sequences from haplotype 1
Sequences from haplotype 2
Old Assembly model: compress into a consensus
New Assembly model: represent both haplotypes
NCBI36 (hg18)
UGT2B17 Region
AC074378.4 AC079749.5
AC134921.2 AC147055.2
AC140484.1 AC019173.4
AC093720.2 AC021146.7
NCBI36 NC_000004.10 (chr4) Tiling Path
Xue Y et al, 2008
TMPRSS11E TMPRSS11E2
GRCh37 NC_000004.11 (chr4) Tiling Path
AC074378.4 AC079749.5
AC134921.1 AC147055.2
AC093720.2 AC021146.7
TMPRSS11E
GRCh37: NT_167250.1 (UGT2B17 alternate locus)
AC074378.4 AC140484.1
AC019173.4 AC226496.2
AC021146.7
TMPRSS11E2
UGT2B17 Region
GRCh37 (hg19)
http://genomereference.org
7 alternate haplotypes at the MHC
Alternate loci released as:
FASTA
AGP
Alignment to chromosome
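AGP, one of the release formats named on this slide, is a tab-delimited text format that describes how an object such as an alternate locus scaffold is built from component sequences and gaps. The sketch below reads the nine standard columns. The `AgpRecord` class and `parse_agp` function are hypothetical names, and the sample rows use invented identifiers and coordinates rather than real GRC data.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class AgpRecord:
    """One AGP line: either a sequence component or a gap."""
    obj: str
    obj_beg: int
    obj_end: int
    part_number: int
    component_type: str
    component_id: Optional[str] = None
    component_beg: Optional[int] = None
    component_end: Optional[int] = None
    orientation: Optional[str] = None
    gap_length: Optional[int] = None
    gap_type: Optional[str] = None

    @property
    def is_gap(self) -> bool:
        # Component types N and U mark gaps of known and unknown size.
        return self.component_type in ("N", "U")


def parse_agp(text: str) -> List[AgpRecord]:
    """Parse AGP text into records, skipping comments and blank lines."""
    records = []
    for raw in text.splitlines():
        if not raw.strip() or raw.startswith("#"):
            continue
        f = raw.split("\t")
        if len(f) < 9:
            raise ValueError("expected 9 tab-separated columns: %r" % raw)
        rec = AgpRecord(f[0], int(f[1]), int(f[2]), int(f[3]), f[4])
        if rec.is_gap:
            rec.gap_length = int(f[5])
            rec.gap_type = f[6]
        else:
            rec.component_id = f[5]
            rec.component_beg = int(f[6])
            rec.component_end = int(f[7])
            rec.orientation = f[8]
        records.append(rec)
    return records


# Invented example rows (not real GRC data): two clones joined across a gap.
SAMPLE_AGP = "\n".join("\t".join(row) for row in [
    ["alt_locus_1", "1", "1000", "1", "F", "clone_A.1", "1", "1000", "+"],
    ["alt_locus_1", "1001", "1100", "2", "N", "100", "scaffold", "yes", "paired-ends"],
    ["alt_locus_1", "1101", "1900", "3", "F", "clone_B.2", "201", "1000", "-"],
])

if __name__ == "__main__":
    for rec in parse_agp(SAMPLE_AGP):
        label = "gap" if rec.is_gap else rec.component_id
        print(rec.obj, rec.obj_beg, rec.obj_end, label)
```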
UGT2B17 MHC MAPT
Assembly (e.g. GRCh37)
Primary Assembly
Non-nuclear assembly unit (e.g. MT)
ALT 1
ALT 2
ALT 3
ALT 4
ALT 5
ALT 6
ALT 7
ALT 8
ALT 9
PAR
Genomic Region (MHC)
Genomic Region (UGT2B17)
Genomic Region (MAPT)
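The assembly model in this diagram can be read as a small data structure: an assembly holds a primary unit, a non-nuclear unit, and a set of ALT units, while genomic regions tie alternate loci back to the primary assembly. The sketch below is one possible way to write that down. All class and function names are hypothetical, the PAR is left out for brevity, and the assignment of ALT 1 to 7 to the MHC, ALT 8 to UGT2B17, and ALT 9 to MAPT is an assumption made for illustration (the slides state only that the MHC has 7 alternate haplotypes).

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AssemblyUnit:
    """A set of sequences released together within an assembly."""
    name: str
    kind: str  # "primary", "alt", or "non-nuclear"


@dataclass
class GenomicRegion:
    """A named region with alternate representations in ALT units."""
    name: str
    alt_unit_names: List[str] = field(default_factory=list)


@dataclass
class Assembly:
    name: str
    units: List[AssemblyUnit] = field(default_factory=list)
    regions: List[GenomicRegion] = field(default_factory=list)

    def alt_units(self) -> List[AssemblyUnit]:
        """Return only the alternate-loci units."""
        return [u for u in self.units if u.kind == "alt"]

    def units_for_region(self, region_name: str) -> List[str]:
        """Return the names of ALT units that represent the given region."""
        for region in self.regions:
            if region.name == region_name:
                return list(region.alt_unit_names)
        raise KeyError(region_name)


def build_grch37_sketch() -> Assembly:
    """Build a toy GRCh37 layout; the ALT numbering per region is assumed."""
    units = [AssemblyUnit("Primary Assembly", "primary"),
             AssemblyUnit("MT", "non-nuclear")]
    units += [AssemblyUnit("ALT %d" % i, "alt") for i in range(1, 10)]
    regions = [
        GenomicRegion("MHC", ["ALT %d" % i for i in range(1, 8)]),  # assumed
        GenomicRegion("UGT2B17", ["ALT 8"]),                        # assumed
        GenomicRegion("MAPT", ["ALT 9"]),                           # assumed
    ]
    return Assembly("GRCh37", units, regions)


if __name__ == "__main__":
    grch37 = build_grch37_sketch()
    print(len(grch37.alt_units()), "ALT units")
    print("MHC:", grch37.units_for_region("MHC"))
```

Keeping regions separate from units reflects the diagram: a region is a stretch of the primary assembly, and each ALT unit supplies at most one alternate sequence for it.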
Richa Agarwala
MHC Alternate locus
Alignment to chr6
Oh No! Not a new version of the human genome!
http://genomereference.org
Assembly (e.g. GRCh37.p5)
Primary Assembly
Non-nuclear assembly unit (e.g. MT)
ALT 1
ALT 2
ALT 3
ALT 4
ALT 5
ALT 6
ALT 7
ALT 8
ALT 9
PAR
…
Genomic Region (MHC)
Genomic Region (UGT2B17)
Genomic Region (MAPT)
Patches
Genomic Region (ABO)
Genomic Region (SMA)
Genomic Region (PECAM1)
TBC1D3C TBC1D3
TBC1D3C
TBC1D3H
Myo19 region (17q21)
60 Fix PATCHES: Chromosome will update in GRCh38
70 Novel PATCHES: Additional sequence added
(adds >1 Mb of novel sequence to the assembly)
(adds >800K of novel sequence to the assembly)
Releasing patches quarterly
Distributed data
Genome not in INSDC Database
Old Assembly Model
Centralized Data
Updated Assembly Model
Genome in INSDC Database
Genome not in INSDC Database