church gmod2012 pt1

Post on 26-Jan-2015

Category:

Technology

DESCRIPTION

Part one of my talk at the GMOD 2012 meeting

TRANSCRIPT

@deannachurch

Navigating Genome Resources at NCBI

Deanna M. Church, NCBI

The Evolution of the Reference Human Genome

Part 1

NCBI

BLAST PubMed GenBank

[Figure: Twenty-Two Years of Growth: NCBI Data and User Services. GenBank base pairs (millions) and average users per weekday, 1989-2011.]

BLAST

Entrez, GenBank at NCBI, dbEST

3D Structure, Network Entrez

WWW, dbSTS

BankIt, Genomes, Taxonomy

OMIM, GeneMap, Cn3D, UniGene

PubMed, PSI-BLAST, VAST, ePCR

Microbial Genomes, PHI-BLAST, CGAP

Human Genome, LinkOut, LocusLink, RefSeq, dbSNP

PubMed Central, BLINK, MapViewer, GEO, GeneRIFs

WGS, HLA Haplotypes, Human Genome-TPA

dbMHC, BookShelf, Human Genome-Transcripts Alignments

Entrez Genes, Mouse Composite Genome, Gnomon

PubChem, Trace Archive, CCDS, Cancer Chromosomes, Environmental Samples

Public Access, Influenza Seqs., GenSAT, GeneTests

Genome-Wide Association Studies, dbGaP, Entrez Portal

Seq Read Archive, UniSTS, RefSeqGene, Genome Reference Consortium

Discovery Initiative, Entrez Sensors, Primer BLAST

Peptidome, BioSystems, Flu H1N1

dbVar, Epigenomics, MyNCBI, 1000 Genomes Project

ClinVar, GTR, Genome Remapping Service, PubMed Health, CloneDB, Genome Decoration Page

NCBI

Tools: BLAST, GBench, Splign, Cn3D, e-PCR, e-Utilities, …

Literature: PubMed, PubMed Central, Bookshelf, MeSH, Gene Reviews, …

Data: GenBank, Protein DB, SRA, GEO, dbSNP, Gene, RefSeq, …

Entrez: Pathway to Discovery

Amino acid sequence similarity

Coding region features

Nucleotide sequence similarity

Term frequency statistics

Literature citations in sequence databases

Literature citations in sequence databases

MEDLINE abstracts

Nucleotide sequences

Protein sequences

http://www.ncbi.nlm.nih.gov/books/NBK25501/

Programmatic access

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=science[journal]+AND+breast+cancer+AND+2008[pdat]&usehistory=y

<eSearchResult>
  <Count>6</Count>
  <RetMax>6</RetMax>
  <RetStart>0</RetStart>
  <IdList>
    <Id>19008416</Id>
    <Id>18927361</Id>
    <Id>18787170</Id>
    <Id>18487186</Id>
    <Id>18239126</Id>
    <Id>18239125</Id>
  </IdList>
  …
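The query and response above can be reproduced in a few lines of Python. What follows is a minimal sketch, not NCBI's own client code: it builds the same ESearch URL and pulls the count and PubMed IDs out of the XML. The function names `build_esearch_url` and `parse_esearch` are made up for illustration, the `https` scheme is an assumption (the slide shows `http`), and the sample response closes the root element that the slide truncates with "…" so that it parses.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Endpoint as shown on the slide; the https scheme is an assumption.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term, db="pubmed", usehistory=True):
    """Return an ESearch URL for the query term (hypothetical helper)."""
    params = {"db": db, "term": term}
    if usehistory:
        params["usehistory"] = "y"
    # urlencode escapes the brackets and turns spaces into '+'.
    return ESEARCH_URL + "?" + urlencode(params)


def parse_esearch(xml_text):
    """Return (count, list of ID strings) from an ESearch XML response."""
    root = ET.fromstring(xml_text)
    count = int(root.findtext("Count"))
    ids = [node.text for node in root.findall("./IdList/Id")]
    return count, ids


# Response copied from the slide. The slide truncates it after </IdList>;
# the closing </eSearchResult> is added here only so the sample parses.
SAMPLE_RESPONSE = (
    "<eSearchResult><Count>6</Count><RetMax>6</RetMax><RetStart>0</RetStart>"
    "<IdList>"
    "<Id>19008416</Id><Id>18927361</Id><Id>18787170</Id>"
    "<Id>18487186</Id><Id>18239126</Id><Id>18239125</Id>"
    "</IdList></eSearchResult>"
)

if __name__ == "__main__":
    print(build_esearch_url("science[journal] AND breast cancer AND 2008[pdat]"))
    print(parse_esearch(SAMPLE_RESPONSE))
```

To run the query live, the URL from `build_esearch_url` could be passed to `urllib.request.urlopen` and the response body handed to `parse_esearch`; the sketch keeps to the canned response so it works offline.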

http://www.ncbi.nlm.nih.gov/education/

http://www.youtube.com/NCBINLM @NCBI http://www.facebook.com/ncbi.nlm

Collins FS et al, 1998

Throughput: 500 Mb/year

Cost: < $0.25 per base

Variation: 100,000 SNPs mapped

Steve Sherry, NCBI

[Figure: NCBI dbSNP database growth, human variations, 1999-2011. One panel plots millions of rs-ids (non-redundant annotations) by class: SNP, STR & indel, ambiguous mapping. The other plots millions of submissions by project: TSC, HapMap, 1000 Genomes, other projects.]

dbSNP build 135, November 2011

Kidd et al, 2007 APOBEC cluster

Black: deletion. White: insertion.

http://www.ncbi.nlm.nih.gov/dbvar

Church et al., 2011 PLoS

http://genomereference.org

Distributed data

Genome not in INSDC Database

Old Assembly Model

GRC Beginnings

Build sequence contigs based on contigs defined in TPF.

Check for orientation consistencies

Select switch points

Instantiate sequence for further analysis

Switch point

Consensus sequence
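The contig-building steps above (orient the components, pick a switch point, instantiate the sequence) can be illustrated with a toy example. This is a sketch under stated assumptions, not the GRC pipeline: the `Component` class, the `join_at_switch_point` function, and the two short sequences are all invented for illustration, and the switch point is supplied by hand rather than derived from an alignment or read from a TPF (tiling path file).

```python
from dataclasses import dataclass

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


@dataclass
class Component:
    """A clone sequence in the tiling path (hypothetical structure)."""
    name: str
    sequence: str
    orientation: str = "+"  # "+" as stored, "-" needs reverse complement

    def oriented(self):
        """Sequence in the orientation used by the contig."""
        if self.orientation == "+":
            return self.sequence
        return reverse_complement(self.sequence)


def join_at_switch_point(left, right, left_end, right_start):
    """Use left[:left_end], then switch to right[right_start:].

    left_end and right_start must point at the same position inside
    the overlap between the two oriented components.
    """
    return left.oriented()[:left_end] + right.oriented()[right_start:]


# Toy data: the components overlap by 5 bases ("CAGGT") once clone_B,
# which is stored on the opposite strand, is reverse complemented.
clone_a = Component("clone_A", "TTGACCAGGT", "+")
clone_b = Component("clone_B", "TCGTTACCTG", "-")

if __name__ == "__main__":
    # Switch 3 bases into the overlap: 8 bases from A, then B from offset 3.
    print(join_at_switch_point(clone_a, clone_b, left_end=8, right_start=3))
```

Because the two components agree across the overlap in this toy case, any switch point inside the overlap yields the same contig. When the components come from different haplotypes, the choice of switch point decides which variant ends up in the instantiated sequence.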

ftp://ftp.ncbi.nlm.nih.gov/pub/grc/human/

Community Input

Distributed data

Genome not in INSDC Database

Old Assembly Model

Centralized Data

Large-Scale Variation Complicates Genome Assembly

Sequences from haplotype 1

Sequences from haplotype 2

Old Assembly model: compress into a consensus

New Assembly model: represent both haplotypes

NCBI36 (hg18)

UGT2B17 Region

AC074378.4, AC079749.5

AC134921.2, AC147055.2

AC140484.1, AC019173.4

AC093720.2, AC021146.7

NCBI36 NC_000004.10 (chr4) Tiling Path

Xue Y et al, 2008

TMPRSS11E TMPRSS11E2

GRCh37 NC_000004.11 (chr4) Tiling Path

AC074378.4, AC079749.5

AC134921.1, AC147055.2

AC093720.2, AC021146.7

TMPRSS11E

GRCh37: NT_167250.1 (UGT2B17 alternate locus)

AC074378.4, AC140484.1

AC019173.4, AC226496.2

AC021146.7

TMPRSS11E2

UGT2B17 Region

GRCh37 (hg19)

http://genomereference.org

7 alternate haplotypes at the MHC

Alternate loci released as:

FASTA

AGP

Alignment to chromosome

UGT2B17 MHC MAPT

Assembly (e.g. GRCh37)

Primary Assembly

Non-nuclear assembly unit (e.g. MT)

ALT 1, ALT 2, ALT 3, ALT 4, ALT 5, ALT 6, ALT 7, ALT 8, ALT 9

PAR

Genomic Region (MHC)

Genomic Region (UGT2B17)

Genomic Region (MAPT)

Richa Agarwala
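The assembly model in the diagram can be read as a small hierarchy: an assembly holds a primary assembly unit, a non-nuclear unit, and genomic regions that each carry one or more alternate loci. The sketch below is one possible way to represent that, not the GRC's or NCBI's data model. The class and field names are invented, the PAR is left out, and the MHC and MAPT locus labels are placeholders. Only NT_167250.1 and the count of seven MHC alternates come from the slides; a single MAPT alternate is inferred from the nine ALT units shown.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class GenomicRegion:
    """A region of the primary assembly that has alternate loci."""
    name: str
    alt_loci: List[str] = field(default_factory=list)


@dataclass
class Assembly:
    """Hypothetical container mirroring the boxes in the diagram."""
    name: str
    primary_unit: str = "Primary Assembly"
    non_nuclear_unit: str = "MT"
    regions: List[GenomicRegion] = field(default_factory=list)

    def alt_locus_count(self):
        """Total number of alternate loci across all regions."""
        return sum(len(region.alt_loci) for region in self.regions)


grch37 = Assembly(
    name="GRCh37",
    regions=[
        # Seven MHC alternates per the slides; the labels are placeholders.
        GenomicRegion("MHC", [f"MHC alternate {i}" for i in range(1, 8)]),
        # Accession taken from the UGT2B17 slide.
        GenomicRegion("UGT2B17", ["NT_167250.1"]),
        # One MAPT alternate is inferred (9 ALT units - 7 - 1); placeholder label.
        GenomicRegion("MAPT", ["MAPT alternate 1"]),
    ],
)

if __name__ == "__main__":
    for region in grch37.regions:
        print(region.name, len(region.alt_loci))
    print("total alternate loci:", grch37.alt_locus_count())
```

Keeping the alternate loci under their region, rather than in one flat list, makes it straightforward to ask which alternates exist for a given part of the chromosome, which is the question an alignment of an alternate locus to the primary assembly answers.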

MHC Alternate locus

Alignment to chr6

Oh No! Not a new version of the human genome!

http://genomereference.org

Assembly (e.g. GRCh37.p5)

Primary Assembly

Non-nuclear assembly unit (e.g. MT)

ALT 1, ALT 2, ALT 3, ALT 4, ALT 5, ALT 6, ALT 7, ALT 8, ALT 9

PAR

Genomic Region (MHC)

Genomic Region (UGT2B17)

Genomic Region (MAPT)

Patches

Genomic Region (ABO)

Genomic Region (SMA)

Genomic Region (PECAM1)

TBC1D3C TBC1D3

TBC1D3C

TBC1D3H

Myo19 region (17q21)

60 Fix PATCHES: Chromosome will update in GRCh38

70 Novel PATCHES: Additional sequence added

(adds >1 Mb of novel sequence to the assembly)

(adds >800K of novel sequence to the assembly)

Releasing patches quarterly

Distributed data

Genome not in INSDC Database

Old Assembly Model

Centralized Data

Updated Assembly Model

Genome in INSDC Database

Genome not in INSDC Database
