church gmod2012 pt1

@deannachurch Navigating Genome Resources at NCBI Deanna M. Church, NCBI The Evolution of the Reference Human Genome Part 1

Upload: deanna-church

Post on 26-Jan-2015

DESCRIPTION

Part one of my talk at the GMOD 2012 meeting

TRANSCRIPT

Page 1: Church gmod2012 pt1

@deannachurch

Navigating Genome Resources at NCBI

Deanna M. Church, NCBI

The Evolution of the Reference Human Genome

Part 1

Page 2: Church gmod2012 pt1

NCBI

BLAST PubMed GenBank

Page 3: Church gmod2012 pt1

Twenty-Two Years of Growth: NCBI Data and User Services

[Figure: GenBank base pairs (millions) and average users per weekday, plotted by year from 1989 to 2011, annotated with the following NCBI resources and services]

BLAST

Entrez, GenBank at NCBI, dbEST

3D Structure, Network Entrez

WWW, dbSTS

BankIt, Genomes, Taxonomy

OMIM, GeneMap, Cn3D, UniGene

PubMed, PSI-BLAST, VAST, ePCR

Microbial Genomes, PHI-BLAST, CGAP

Human Genome, LinkOut, LocusLink, RefSeq, dbSNP

PubMed Central, BLINK, MapViewer, GEO, GeneRIFs

WGS, HLA Haplotypes, Human Genome-TPA

dbMHC, BookShelf, Human Genome- Transcripts Alignments

Entrez Genes, Mouse Composite Genome, Gnomon

PubChem, Trace Archive, CCDS, Cancer Chromosomes, Environmental Samples

Public Access, Influenza Seqs., GenSAT, GeneTests

Genome-Wide Association Studies, dbGaP, Entrez Portal

Seq Read Archive, UniSTS, RefSeqGene, Genome Reference Consortium

Discovery Initiative, Entrez Sensors, Primer BLAST

Peptidome, BioSystems, Flu H1N1

dbVar, Epigenomics, MyNCBI, 1000 Genomes Project

ClinVar, GTR, Genome Remapping Service, PubMed Health, CloneDB, Genome Decoration Page

Page 4: Church gmod2012 pt1

NCBI

Tools: Blast, GBench, Splign, Cn3D, e-PCR, e-Utilities…

Literature: PubMed, PubMed Central, Bookshelf, MeSH, Gene Reviews…

Data: GenBank, Protein DB, SRA, GEO, dbSNP, Gene, RefSeq…

Page 5: Church gmod2012 pt1

Entrez: Pathway to Discovery

Amino acid sequence similarity

Coding region features

Nucleotide sequence similarity

Term frequency statistics

Literature citations in sequence databases

Literature citations in sequence databases

MEDLINE abstracts

Nucleotide sequences

Protein sequences

Page 6: Church gmod2012 pt1

http://www.ncbi.nlm.nih.gov/books/NBK25501/

Programmatic access

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=science[journal]+AND+breast+cancer+AND+2008[pdat]&usehistory=y

<eSearchResult>
<Count>6</Count>
<RetMax>6</RetMax>
<RetStart>0</RetStart>
<IdList>
<Id>19008416</Id>
<Id>18927361</Id>
<Id>18787170</Id>
<Id>18487186</Id>
<Id>18239126</Id>
<Id>18239125</Id>
</IdList>
…
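
The ESearch request and response on this slide can be reproduced with a few lines of standard-library Python. This is a minimal sketch, not NCBI's own client code: the function names `build_esearch_url` and `parse_esearch` are hypothetical, and `SAMPLE_RESPONSE` is the slide's XML with the closing `</eSearchResult>` tag added as an assumption so that it parses (the slide elides the rest of the response). No network call is made.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Endpoint exactly as shown on the slide.
ESEARCH_URL = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(db, term, use_history=True):
    """Build an ESearch URL; urlencode percent-escapes the brackets in the term."""
    params = {"db": db, "term": term}
    if use_history:
        params["usehistory"] = "y"
    return ESEARCH_URL + "?" + urlencode(params)


def parse_esearch(xml_text):
    """Return (hit count, list of record IDs) from an ESearch XML response."""
    root = ET.fromstring(xml_text)
    count = int(root.findtext("Count"))
    ids = [node.text for node in root.findall("IdList/Id")]
    return count, ids


# ASSUMPTION: the closing </eSearchResult> tag is added here; the slide
# truncates the response after </IdList>.
SAMPLE_RESPONSE = (
    "<eSearchResult><Count>6</Count><RetMax>6</RetMax><RetStart>0</RetStart>"
    "<IdList>"
    "<Id>19008416</Id><Id>18927361</Id><Id>18787170</Id>"
    "<Id>18487186</Id><Id>18239126</Id><Id>18239125</Id>"
    "</IdList></eSearchResult>"
)

url = build_esearch_url(
    "pubmed", "science[journal] AND breast cancer AND 2008[pdat]"
)
count, ids = parse_esearch(SAMPLE_RESPONSE)
print(url)
print(count, ids)
```

The generated URL escapes `[` and `]` as `%5B` and `%5D`, which is equivalent to the unescaped form on the slide.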

Page 7: Church gmod2012 pt1

http://www.ncbi.nlm.nih.gov/education/

http://www.youtube.com/NCBINLM @NCBI http://www.facebook.com/ncbi.nlm

Page 8: Church gmod2012 pt1

Collins FS et al, 1998

Throughput: 500 Mb/year

Cost: < $0.25 per base

Variation: 100,000 SNPs mapped

Page 9: Church gmod2012 pt1

Steve Sherry, NCBI

NCBI dbSNP database growth: human variations

[Figure, two panels covering 1999 to 2011. "Non-redundant annotations": millions of rs-ids, broken down into SNP, STR & Indel, and Ambiguous mapping. "Submissions by project": millions of submissions from TSC, HapMap, 1000 Genomes, and other projects.]

dbSNP build 135. November 2011

Page 10: Church gmod2012 pt1

Kidd et al, 2007

APOBEC cluster

Black: Deletion

White: Insertion

Page 11: Church gmod2012 pt1

http://www.ncbi.nlm.nih.gov/dbvar

Page 12: Church gmod2012 pt1
Page 13: Church gmod2012 pt1

Church et al., 2011 PLoS

http://genomereference.org

Page 14: Church gmod2012 pt1

Distributed data

Genome not in INSDC Database

Old Assembly Model

GRC Beginnings

Page 15: Church gmod2012 pt1
Page 16: Church gmod2012 pt1
Page 17: Church gmod2012 pt1
Page 18: Church gmod2012 pt1

Build sequence contigs based on contigs defined in TPF.

Check for orientation consistencies

Select switch points

Instantiate sequence for further analysis

Switch point

Consensus sequence
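
The steps on this slide (orient each component, pick switch points, instantiate the contig sequence) can be illustrated with a toy example. This is a sketch under stated assumptions, not the GRC pipeline: the `Component` class, the `instantiate_contig` function, the clone names, and the short sequences are all hypothetical, and a real TPF carries clone accessions rather than sequence strings.

```python
from dataclasses import dataclass

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


@dataclass
class Component:
    name: str      # hypothetical clone label
    sequence: str  # sequence as stored for the clone
    strand: str    # "+" if stored in contig orientation, "-" otherwise
    start: int     # first base used, 0-based, after orienting
    end: int       # one past the last base used (the switch point)


def instantiate_contig(components):
    """Orient each component, cut at its switch points, and join the pieces."""
    pieces = []
    for comp in components:
        if comp.strand not in ("+", "-"):
            raise ValueError(f"{comp.name}: strand must be '+' or '-'")
        oriented = (
            comp.sequence if comp.strand == "+"
            else reverse_complement(comp.sequence)
        )
        if not 0 <= comp.start < comp.end <= len(oriented):
            raise ValueError(f"{comp.name}: switch points outside the clone")
        pieces.append(oriented[comp.start:comp.end])
    return "".join(pieces)


# Two made-up clones that overlap by four bases ("TAAG") once oriented.
# clone_b is stored on the minus strand, so it is flipped before use.
clone_a = Component("clone_a", "ATGCCGTAAG", "+", 0, 8)
clone_b = Component("clone_b", "GTCAAGCTTA", "-", 2, 10)

contig = instantiate_contig([clone_a, clone_b])
print(contig)
```

Placing the switch point inside the overlap means each base of the contig comes from exactly one clone, which is the point of choosing switch points rather than merging the overlapping reads.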

Page 19: Church gmod2012 pt1
Page 20: Church gmod2012 pt1

ftp://ftp.ncbi.nlm.nih.gov/pub/grc/human/

Page 21: Church gmod2012 pt1
Page 22: Church gmod2012 pt1

Community Input

Page 23: Church gmod2012 pt1

Distributed data

Genome not in INSDC Database

Old Assembly Model

Centralized Data

Page 24: Church gmod2012 pt1

Large-Scale Variation Complicates Genome Assembly

Sequences from haplotype 1

Sequences from haplotype 2

Old Assembly model: compress into a consensus

New Assembly model: represent both haplotypes

Page 25: Church gmod2012 pt1

NCBI36 (hg18)

UGT2B17 Region

Page 26: Church gmod2012 pt1

Xue Y et al, 2008

NCBI36 NC_000004.10 (chr4) Tiling Path

AC074378.4, AC079749.5, AC134921.2, AC147055.2, AC140484.1, AC019173.4, AC093720.2, AC021146.7

TMPRSS11E, TMPRSS11E2

GRCh37 NC_000004.11 (chr4) Tiling Path

AC074378.4, AC079749.5, AC134921.1, AC147055.2, AC093720.2, AC021146.7

TMPRSS11E

GRCh37: NT_167250.1 (UGT2B17 alternate locus)

AC074378.4, AC140484.1, AC019173.4, AC226496.2, AC021146.7

TMPRSS11E2

UGT2B17 Region

Page 27: Church gmod2012 pt1

GRCh37 (hg19)

http://genomereference.org

7 alternate haplotypes at the MHC

Alternate loci released as:

FASTA

AGP

Alignment to chromosome

UGT2B17 MHC MAPT

Page 28: Church gmod2012 pt1
Page 29: Church gmod2012 pt1

Assembly (e.g. GRCh37)

Primary Assembly

Non-nuclear assembly unit (e.g. MT)

ALT 1

ALT 2

ALT 3

ALT 4

ALT 5

ALT 6

ALT 7

ALT 8

ALT 9

PAR

Genomic Region (MHC)

Genomic Region (UGT2B17)

Genomic Region (MAPT)

Page 30: Church gmod2012 pt1

Richa Agarwala

MHC Alternate locus

Alignment to chr6

Page 31: Church gmod2012 pt1

Oh No! Not a new version of the human genome!

http://genomereference.org

Page 32: Church gmod2012 pt1
Page 33: Church gmod2012 pt1

Assembly (e.g. GRCh37.p5)

Primary Assembly

Non-nuclear assembly unit (e.g. MT)

ALT 1

ALT 2

ALT 3

ALT 4

ALT 5

ALT 6

ALT 7

ALT 8

ALT 9

PAR

Genomic Region (MHC)

Genomic Region (UGT2B17)

Genomic Region (MAPT)

Patches

Genomic Region (ABO)

Genomic Region (SMA)

Genomic Region (PECAM1)
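
The assembly model in this diagram can be written down as a small data structure: one assembly holds several assembly units, and each genomic region points at the non-primary units that represent it. The sketch below is illustrative only. The class and function names are hypothetical, PAR is left out, and the assignment of specific ALT unit numbers to regions is an assumption (the slides state only that there are seven alternate haplotypes at the MHC, plus the UGT2B17 and MAPT alternate loci).

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class AssemblyUnit:
    name: str
    kind: str  # "primary", "non-nuclear", "alt", or "patches"


@dataclass
class Assembly:
    name: str
    units: List[AssemblyUnit] = field(default_factory=list)
    # Region name -> names of the non-primary units that represent it.
    regions: Dict[str, List[str]] = field(default_factory=dict)

    def units_of_kind(self, kind):
        """Return all assembly units of the given kind."""
        return [unit for unit in self.units if unit.kind == kind]

    def units_for_region(self, region):
        """Return the unit names attached to a region (empty if unknown)."""
        return self.regions.get(region, [])


def build_example_assembly():
    """Build a toy version of the GRCh37.p5 layout shown in the diagram."""
    units = [
        AssemblyUnit("Primary Assembly", "primary"),
        AssemblyUnit("MT", "non-nuclear"),
    ]
    units += [AssemblyUnit(f"ALT {i}", "alt") for i in range(1, 10)]
    units.append(AssemblyUnit("Patches", "patches"))
    regions = {
        # ASSUMPTION: which ALT numbers belong to which region is made up
        # for illustration; only the count of seven MHC units is from the talk.
        "MHC": [f"ALT {i}" for i in range(1, 8)],
        "UGT2B17": ["ALT 8"],
        "MAPT": ["ALT 9"],
        "ABO": ["Patches"],
        "SMA": ["Patches"],
        "PECAM1": ["Patches"],
    }
    return Assembly("GRCh37.p5", units, regions)


assembly = build_example_assembly()
print(len(assembly.units_of_kind("alt")))
print(assembly.units_for_region("MHC"))
```

Keeping regions separate from units reflects the diagram: a region is a location on the primary assembly, while the alternate loci and patches are additional sequences attached to that location.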

Page 34: Church gmod2012 pt1

TBC1D3C TBC1D3

TBC1D3C

TBC1D3H

Myo19 region (17q21)

Page 35: Church gmod2012 pt1

60 Fix PATCHES: Chromosome will update in GRCh38

70 Novel PATCHES: Additional sequence added

(adds >1 Mb of novel sequence to the assembly)

(adds >800K of novel sequence to the assembly)

Releasing patches quarterly

Page 36: Church gmod2012 pt1

Distributed data

Genome not in INSDC Database

Old Assembly Model

Centralized Data

Updated Assembly Model

Genome in INSDC Database