church nhgri 2012

@deannachurch Deanna M. Church, NCBI The Evolution of Genome Data


Post on 11-May-2015


TRANSCRIPT

Page 1: Church nhgri 2012

@deannachurch

Deanna M. Church, NCBI

The Evolution of Genome Data

Page 2: Church nhgri 2012

Collins FS et al, 1998

Throughput: 500 Mb/year

Cost: < $0.25 per base

Variation: 100,000 SNPs mapped

Page 3: Church nhgri 2012

Twenty Two Years of Growth: NCBI Data and User Services

[Chart, 1989 to 2011. Axes: GenBank base pairs in millions (0 to 140,000) and average users per weekday (0 to 2,500,000).]

BLAST

Entrez, GenBank at NCBI, dbEST

3D Structure, Network Entrez

WWW, dbSTS

BankIt, Genomes, Taxonomy

OMIM, GeneMap, Cn3D, UniGene

PubMed, PSI-BLAST, VAST, ePCR

Microbial Genomes, PHI-BLAST, CGAP

Human Genome, LinkOut, LocusLink, RefSeq, dbSNP

PubMed Central, BLINK, MapViewer, GEO, GeneRIFs

WGS, HLA Haplotypes, Human Genome-TPA

dbMHC, BookShelf, Human Genome-Transcripts Alignments

Entrez Genes, Mouse Composite Genome, Gnomon

PubChem, Trace Archive, CCDS, Cancer Chromosomes, Environmental Samples

Public Access, Influenza Seqs., GenSAT, GeneTests

Genome-Wide Association Studies, dbGaP, Entrez Portal

Seq Read Archive, UniSTS, RefSeqGene, Genome Reference Consortium

Discovery Initiative, Entrez Sensors, Primer BLAST

Peptidome, BioSystems, Flu H1N1

dbVar, Epigenomics, MyNCBI, 1000 Genomes Project

ClinVar, GTR, Genome Remapping Service, PubMed Health, CloneDB, Genome Decoration Page

Page 4: Church nhgri 2012

Steve Sherry, NCBI

NCBI dbSNP database growth: human variations

dbSNP build 135, November 2011

[Chart, 1999 to 2011: non-redundant annotations in millions of rs-ids (0 to 60), by class: SNP, STR & Indel, Ambiguous mapping]

[Chart: submissions by project in millions of submissions (0 to 175): TSC, HapMap, 1000 Genomes, Other projects]

Page 5: Church nhgri 2012

Kidd et al, 2007

APOBEC cluster

Black: Deletion

White: Insertion

Page 6: Church nhgri 2012

http://www.ncbi.nlm.nih.gov/dbvar

Page 7: Church nhgri 2012
Page 8: Church nhgri 2012

Church et al., 2011 PLoS

http://genomereference.org

Page 9: Church nhgri 2012

Distributed data

Genome not in INSDC Database

Old Assembly Model

GRC Beginnings

Page 10: Church nhgri 2012

Build sequence contigs based on contigs defined in TPF.

Check for orientation consistencies

Select switch points

Instantiate sequence for further analysis

Switch point

Consensus sequence
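
The contig-building steps on this slide can be sketched in a few lines. This is a toy model rather than the GRC pipeline: it assumes each TPF component is a clone sequence with an orientation, and that a switch point is expressed as the 1-based range of the oriented component that contributes to the contig. The `Component` class, the `build_contig` function, and the clone names are hypothetical.

```python
from dataclasses import dataclass

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


@dataclass
class Component:
    accession: str    # made-up identifier, for illustration only
    sequence: str     # clone sequence as deposited
    orientation: str  # "+" or "-" relative to the contig
    start: int        # 1-based first base used, after orienting
    end: int          # 1-based last base used, after orienting


def build_contig(components):
    """Concatenate the used range of each component, in tiling-path order."""
    parts = []
    for comp in components:
        # Orientation check: only "+" and "-" are meaningful here.
        if comp.orientation not in ("+", "-"):
            raise ValueError(f"{comp.accession}: bad orientation {comp.orientation!r}")
        oriented = (
            comp.sequence
            if comp.orientation == "+"
            else reverse_complement(comp.sequence)
        )
        # Switch points must fall inside the oriented component.
        if not 1 <= comp.start <= comp.end <= len(oriented):
            raise ValueError(f"{comp.accession}: switch points out of range")
        parts.append(oriented[comp.start - 1:comp.end])
    return "".join(parts)


tiling_path = [
    Component("CLONE_A.1", "AAAACCCC", "+", 1, 6),
    Component("CLONE_B.1", "AAAACCGG", "-", 3, 8),
]
contig = build_contig(tiling_path)
```

A real tiling path would also verify that the overlap between neighbouring components agrees before a switch point is accepted; that check is left out here.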

Page 11: Church nhgri 2012
Page 12: Church nhgri 2012

http://genomereference.org

Page 13: Church nhgri 2012
Page 14: Church nhgri 2012

Community Input

Page 15: Church nhgri 2012

Distributed data

Genome not in INSDC Database

Old Assembly Model

Centralized Data

Page 16: Church nhgri 2012

Large-Scale Variation Complicates Genome Assembly

Sequences from haplotype 1

Sequences from haplotype 2

Old Assembly model: compress into a consensus

New Assembly model: represent both haplotypes

Page 17: Church nhgri 2012

NCBI36 (hg18)

UGT2B17 Region

Page 18: Church nhgri 2012

AC074378.4, AC079749.5

AC134921.2, AC147055.2

AC140484.1, AC019173.4

AC093720.2, AC021146.7

NCBI36 NC_000004.10 (chr4) Tiling Path

Xue Y et al, 2008

TMPRSS11E TMPRSS11E2

GRCh37 NC_000004.11 (chr4) Tiling Path

AC074378.4, AC079749.5

AC134921.1, AC147055.2

AC093720.2, AC021146.7

TMPRSS11E

GRCh37: NT_167250.1 (UGT2B17 alternate locus)

AC074378.4, AC140484.1

AC019173.4, AC226496.2

AC021146.7

TMPRSS11E2

UGT2B17 Region
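
One way to see what changed in this region is to compare the component lists of the three tiling paths. The sketch below copies the accessions from this slide; the listing order is the order in which they appear in the transcript, which may not match the physical order on the chromosome. Versions are ignored in the comparison, and the helper names are hypothetical.

```python
# Accessions as listed on the slide (order as extracted, not verified).
ncbi36_chr4 = [
    "AC074378.4", "AC079749.5", "AC134921.2", "AC147055.2",
    "AC140484.1", "AC019173.4", "AC093720.2", "AC021146.7",
]
grch37_chr4 = [
    "AC074378.4", "AC079749.5", "AC134921.1", "AC147055.2",
    "AC093720.2", "AC021146.7",
]
grch37_alt_locus = [
    "AC074378.4", "AC140484.1", "AC019173.4", "AC226496.2", "AC021146.7",
]


def base_accessions(path):
    """Strip the version suffix so that AC134921.1 and AC134921.2 compare equal."""
    return {acc.split(".")[0] for acc in path}


def component_changes(old_path, new_path):
    """Return (dropped, added) component accessions between two tiling paths."""
    old, new = base_accessions(old_path), base_accessions(new_path)
    return sorted(old - new), sorted(new - old)


dropped_from_primary, added_to_primary = component_changes(ncbi36_chr4, grch37_chr4)
dropped_on_alt, added_on_alt = component_changes(ncbi36_chr4, grch37_alt_locus)

# Components that left the chromosome path but are still represented on the alternate locus.
moved_to_alt = sorted(set(dropped_from_primary) & base_accessions(grch37_alt_locus))
```

With these lists, the two components that dropped out of the chromosome path between NCBI36 and GRCh37 both reappear on the alternate locus, consistent with the slide's point that the mixed haplotypes were separated instead of discarded.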

Page 19: Church nhgri 2012

GRCh37 (hg19)

http://genomereference.org

7 alternate haplotypes at the MHC

Alternate loci released as:

FASTA

AGP

Alignment to chromosome

UGT2B17 MHC MAPT

Page 20: Church nhgri 2012
Page 21: Church nhgri 2012

Assembly (e.g. GRCh37)

Primary Assembly

Non-nuclear assembly unit (e.g. MT)

ALT 1, ALT 2, ALT 3, ALT 4, ALT 5, ALT 6, ALT 7, ALT 8, ALT 9

PAR

Genomic Region (MHC)

Genomic Region (UGT2B17)

Genomic Region (MAPT)

Page 22: Church nhgri 2012

Richa Agarwala

MHC Alternate locus

Alignment to chr6

Page 23: Church nhgri 2012
Page 24: Church nhgri 2012

Oh No! Not a new version of the human genome!

http://genomereference.org

Page 25: Church nhgri 2012
Page 26: Church nhgri 2012

Assembly (e.g. GRCh37.p5)

Primary Assembly

Non-nuclear assembly unit (e.g. MT)

ALT 1, ALT 2, ALT 3, ALT 4, ALT 5, ALT 6, ALT 7, ALT 8, ALT 9

PAR

Genomic Region (MHC)

Genomic Region (UGT2B17)

Genomic Region (MAPT)

Patches

Genomic Region (ABO)

Genomic Region (SMA)

Genomic Region (PECAM1)

Page 27: Church nhgri 2012

TBC1D3C TBC1D3

TBC1D3C

TBC1D3H

Myo19 region (17q21)

Page 28: Church nhgri 2012

70 Fix PATCHES: Chromosome will update in GRCh38

71 Novel PATCHES: Additional sequence added

(adds >1 Mb of novel sequence to the assembly)

(adds >800K of novel sequence to the assembly)

Releasing patches quarterly

Page 29: Church nhgri 2012

Distributed data

Genome not in INSDC Database

Old Assembly Model

Centralized Data

Updated Assembly Model

Genome in INSDC Database

Genome not in INSDC Database

Page 30: Church nhgri 2012

GenBank

Data Archives

Data in a common format

Data in a single location (and mirrored)

Most quality checked prior to deposition

Robust data tracking mechanism (accession.version)

Data owned by submitter

Page 31: Church nhgri 2012

Data tracking

ABC14-1065514J1

Gaps Phase Length Date

FP565796.1 1 1 21-Oct-2009

FP565796.2 1 0 14-Oct-2010

FP565796.3 3 0 07-Nov-2010
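
The accession.version mechanism shown here lends itself to a short sketch: the accession stays fixed while the version number increases with each sequence update. The records below use the three identifiers from this slide, with dates as read from the flattened table (an interpretation of the extracted text). The function names are hypothetical.

```python
def parse_accession_version(identifier: str):
    """Split 'FP565796.3' into ('FP565796', 3)."""
    accession, sep, version = identifier.rpartition(".")
    if not sep or not accession or not version.isdigit():
        raise ValueError(f"not an accession.version identifier: {identifier!r}")
    return accession, int(version)


# (identifier, date as read from the slide)
history = [
    ("FP565796.1", "21-Oct-2009"),
    ("FP565796.2", "14-Oct-2010"),
    ("FP565796.3", "07-Nov-2010"),
]


def latest_version(records, accession):
    """Return the record with the highest version for the given accession."""
    matching = [
        rec for rec in records
        if parse_accession_version(rec[0])[0] == accession
    ]
    if not matching:
        raise KeyError(accession)
    return max(matching, key=lambda rec: parse_accession_version(rec[0])[1])


current = latest_version(history, "FP565796")
```

Because every older version remains a distinct identifier, an analysis that cites FP565796.1 can still be traced to the exact sequence it used.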

Page 32: Church nhgri 2012

Mouse chrX: 34,800,000-34,890,000

NC_000086.123456 CM001013.17 2

Page 33: Church nhgri 2012

Mouse chrX: 35,000,000-36,000,000

X

MGSCv3 MGSCv36

Page 34: Church nhgri 2012

hg19, GRCh37

mm8, MGSCv37, NCBIM37

danRer5, Zv7

What’s in a name?

Page 35: Church nhgri 2012

By any other name…

chr21:8,913,216-9,246,964

Page 36: Church nhgri 2012

Zv7 chr21:8,913,216-9,246,964 X Mouse Build 36 chrX

By any other name…
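
The point of these two slides, that a coordinate range means nothing without the assembly it refers to, can be illustrated by making the assembly name a required part of a region. This is a minimal sketch; the `Region` type and `parse_region` function are hypothetical and do not correspond to any NCBI tool.

```python
import re
from typing import NamedTuple


class Region(NamedTuple):
    assembly: str
    chrom: str
    start: int
    end: int


_REGION_RE = re.compile(r"^(?P<chrom>[^:\s]+):(?P<start>[\d,]+)-(?P<end>[\d,]+)$")


def parse_region(text: str, assembly: str) -> Region:
    """Parse 'chr21:8,913,216-9,246,964' and tie it to a named assembly."""
    if not assembly:
        raise ValueError("a coordinate range needs an assembly name")
    match = _REGION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"cannot parse region: {text!r}")
    # Thousands separators are accepted and removed before conversion.
    start = int(match["start"].replace(",", ""))
    end = int(match["end"].replace(",", ""))
    if start > end:
        raise ValueError("start is greater than end")
    return Region(assembly, match["chrom"], start, end)


zebrafish = parse_region("chr21:8,913,216-9,246,964", "Zv7")
human = parse_region("chr21:8,913,216-9,246,964", "GRCh37")
```

The two regions carry the same coordinate string but do not compare equal, because the assembly is part of the identity.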

Page 37: Church nhgri 2012

http://www.ncbi.nlm.nih.gov/genome/assembly

GRCh37

hg19

Page 38: Church nhgri 2012
Page 39: Church nhgri 2012

Assembly (e.g. GRCh37.p5): GCA_000001405.6 / GCF_000001405.17

Primary Assembly: GCA_000001305.1 / GCF_000001305.13

ALT 1: GCA_000001315.1 / GCF_000001315.1

ALT 2: GCA_000001325.1 / GCF_000001325.2

ALT 3: GCA_000001335.1 / GCF_000001335.1

ALT 4: GCA_000001345.1 / GCF_000001345.1

ALT 5: GCA_000001355.1 / GCF_000001355.1

ALT 6: GCA_000001365.1 / GCF_000001365.2

ALT 7: GCA_000001375.1 / GCF_000001375.1

ALT 8: GCA_000001385.1 / GCF_000001385.1

ALT 9: GCA_000001395.1 / GCF_000001395.1

Patches: GCA_000005045.5 / GCF_000005045.4

Non-nuclear assembly unit (e.g. MT): GCA_000006015.1 / GCF_000006015.1
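
The accessions on this slide follow a visible pattern: each unit has a GenBank (GCA) and a RefSeq (GCF) identifier that share the numeric part but are versioned independently. The sketch below checks that pattern for a few of the pairs listed above. The regular expression is an assumption derived from these examples, not an official grammar, and the function names are hypothetical.

```python
import re

# Assumed shape: prefix, underscore, nine digits, dot, version.
_ASSEMBLY_ACC_RE = re.compile(r"^(GCA|GCF)_(\d{9})\.(\d+)$")

# A subset of the (GenBank, RefSeq) pairs shown on the slide.
units = {
    "GRCh37.p5": ("GCA_000001405.6", "GCF_000001405.17"),
    "Primary Assembly": ("GCA_000001305.1", "GCF_000001305.13"),
    "ALT 2": ("GCA_000001325.1", "GCF_000001325.2"),
    "Patches": ("GCA_000005045.5", "GCF_000005045.4"),
    "Non-nuclear (MT)": ("GCA_000006015.1", "GCF_000006015.1"),
}


def parse_assembly_accession(accession: str):
    """Split 'GCF_000001405.17' into ('GCF', '000001405', 17)."""
    match = _ASSEMBLY_ACC_RE.match(accession)
    if match is None:
        raise ValueError(f"unexpected accession format: {accession!r}")
    prefix, number, version = match.groups()
    return prefix, number, int(version)


def is_paired(genbank_acc: str, refseq_acc: str) -> bool:
    """True when a GCA and a GCF accession share the same numeric part."""
    gb_prefix, gb_number, _ = parse_assembly_accession(genbank_acc)
    rs_prefix, rs_number, _ = parse_assembly_accession(refseq_acc)
    return (gb_prefix, rs_prefix) == ("GCA", "GCF") and gb_number == rs_number


all_paired = all(is_paired(gb, rs) for gb, rs in units.values())
```

The version numbers are deliberately left out of the pairing test, since the slide shows them diverging (for example .6 against .17 for the full assembly).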

Page 40: Church nhgri 2012

GenBank vs RefSeq

GenBank: Submitter Owned; Redundancy; Updated rarely; INSDC

RefSeq: RefSeq Owned; Non-Redundant; Curated; Not INSDC

BRCA1

GenBank: 83 genomic records, 31 mRNA records, 27 protein records

RefSeq: 3 genomic records, 5 mRNA records, 1 RNA record, 5 protein records

Page 41: Church nhgri 2012
Page 42: Church nhgri 2012

RefSeq for Assemblies

Typical assembly edits

Addition of non-nuclear (e.g. MT) assembly units

Removal of contamination

Drop unlocalized/unplaced scaffolds

Mask contamination that is placed on chromosome

Page 43: Church nhgri 2012

http://www.ncbi.nlm.nih.gov/genome

Page 44: Church nhgri 2012

Understanding relationships between assemblies using alignments

First Pass

Second Pass

Reciprocal best hit

Non-reciprocal, duplicative hits
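
The two-pass idea can be sketched with a generic reciprocal-best-hit filter. The slide does not say how the passes are actually computed at NCBI, so this is only an illustration: it assumes each alignment has a single numeric score, keeps as first pass the alignments that are the best hit in both directions, and treats everything else as second pass. The `Alignment` type and the region names are hypothetical.

```python
from typing import NamedTuple


class Alignment(NamedTuple):
    source: str   # region on the first assembly
    target: str   # region on the second assembly
    score: int    # assumed: higher is better


def best_by(alignments, field):
    """Map each value of `field` to its highest-scoring alignment."""
    best = {}
    for aln in alignments:
        key = getattr(aln, field)
        if key not in best or aln.score > best[key].score:
            best[key] = aln
    return best


def split_passes(alignments):
    """Return (first_pass, second_pass) lists of alignments."""
    best_source = best_by(alignments, "source")
    best_target = best_by(alignments, "target")
    first, second = [], []
    for aln in alignments:
        reciprocal = best_source[aln.source] == aln and best_target[aln.target] == aln
        (first if reciprocal else second).append(aln)
    return first, second


alignments = [
    Alignment("old_r1", "new_r1", 950),
    Alignment("old_r2", "new_r1", 400),   # duplicative hit onto new_r1
    Alignment("old_r2", "new_r2", 900),
    Alignment("old_r3", "new_r2", 300),   # best for old_r3, but not for new_r2
]
first_pass, second_pass = split_passes(alignments)
```

In this toy input the first pass contains the two one-to-one pairings, and the second pass holds the remaining non-reciprocal hits.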

Page 45: Church nhgri 2012
Page 46: Church nhgri 2012

No second pass alignments in GRCh37.p5

NCBI36

GRCh37.p5

http://www.ncbi.nlm.nih.gov/tools/gbench/

Page 47: Church nhgri 2012

Genome Data is MORE than just the Genome

Page 48: Church nhgri 2012

Genome Data is MORE than just the Genome

ATGCGTGCAAAATGCAGTGAGT

ATGCGTGCAAAATGCAGTGAGT

ATGCGTGCAAAATGCAGTGAGT

ATGCGTGCAAAATGCAGTGAGT

NM_000336.2:c.800C>T

Page 49: Church nhgri 2012

ATGCGTGCAAAATGCAGTGAGT

ATGCGTGCAAAATGCAGTGAGT

ATGCGTGCAAAATGCAGTGAGT

ATGCGTGCAAAATGCAGTGAGT

NM_000336.2:c.800C>T

NC_000001.10:g.(?_20700513)_(21062644_?)del
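
The variant descriptions on these slides tie a change to a specific versioned reference sequence. A small sketch can show how the parts separate. It handles only the reference/coordinate-type split and simple substitutions, not the full HGVS nomenclature, and the function names are hypothetical.

```python
import re

_SUBSTITUTION_RE = re.compile(r"^(\d+)([ACGT])>([ACGT])$")


def split_hgvs(expression: str):
    """Split into (reference accession.version, coordinate type, description)."""
    reference, sep, rest = expression.partition(":")
    if not sep or len(rest) < 3 or rest[1] != ".":
        raise ValueError(f"cannot split expression: {expression!r}")
    return reference, rest[0], rest[2:]


def parse_substitution(expression: str):
    """Parse a simple substitution such as 'NM_000336.2:c.800C>T'."""
    reference, coord_type, description = split_hgvs(expression)
    match = _SUBSTITUTION_RE.match(description)
    if match is None:
        raise ValueError(f"not a simple substitution: {expression!r}")
    position, ref_base, alt_base = match.groups()
    return {
        "reference": reference,
        "coordinate_type": coord_type,  # "c" = coding, "g" = genomic
        "position": int(position),
        "ref": ref_base,
        "alt": alt_base,
    }


snv = parse_substitution("NM_000336.2:c.800C>T")
deletion_parts = split_hgvs("NC_000001.10:g.(?_20700513)_(21062644_?)del")
```

The deletion with uncertain breakpoints is split but not interpreted further, since its `(?_…)` syntax is outside what this sketch covers.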

Page 50: Church nhgri 2012

http://www.ncbi.nlm.nih.gov/variation/tools/1000genomes

Page 51: Church nhgri 2012

http://www.ncbi.nlm.nih.gov/education/

http://www.youtube.com/NCBINLM @NCBI http://www.facebook.com/ncbi.nlm

Page 52: Church nhgri 2012

Thanks!

For Slides: Francoise Thibaud-Nissen, Evan Eichler, Steve Sherry

The Genome Reference Consortium: The Genome Center at Washington University, The Wellcome Trust Sanger Institute, The European Bioinformatics Institute, The National Center for Biotechnology Information

Church group at NCBI: Valerie Schneider, Nathan Bouk, Hsiu-Chuan Chen, Peter Meric, Victor Ananiev, Chao Chen, John Lopez, John Garner, Tim Hefferon, Cliff Clausen

NCBI