church nhgri 2012

Post on 11-May-2015


@deannachurch

Deanna M. Church, NCBI

The Evolution of Genome Data

Collins FS et al, 1998

Throughput: 500 Mb/year

Cost: < $0.25 per base

Variation: 100,000 SNPs mapped

Twenty-Two Years of Growth: NCBI Data and User Services

[Figure: GenBank base pairs (millions; axis 0 to 140,000) and average users per weekday (axis 0 to 2,500,000), plotted by year from 1989 to 2011.]

BLAST

Entrez, GenBank at NCBI, dbEST

3D Structure, Network Entrez

WWW, dbSTS

BankIt, Genomes, Taxonomy

OMIM, GeneMap, Cn3D, UniGene

PubMed, PSI-BLAST, VAST, ePCR

Microbial Genomes, PHI-BLAST, CGAP

Human Genome, LinkOut, LocusLink, RefSeq, dbSNP

PubMed Central, BLINK, MapViewer, GEO, GeneRIFs

WGS, HLA Haplotypes, Human Genome-TPA

dbMHC, BookShelf, Human Genome-Transcripts Alignments

Entrez Genes, Mouse Composite Genome, Gnomon

PubChem, Trace Archive, CCDS, Cancer Chromosomes, Environmental Samples

Public Access, Influenza Seqs., GenSAT, GeneTests

Genome-Wide Association Studies, dbGaP, Entrez Portal

Seq Read Archive, UniSTS, RefSeqGene, Genome Reference Consortium

Discovery Initiative, Entrez Sensors, Primer BLAST

Peptidome, BioSystems, Flu H1N1

dbVar, Epigenomics, MyNCBI, 1000 Genomes Project

ClinVar, GTR, Genome Remapping Service, PubMed Health, CloneDB, Genome Decoration Page

Steve Sherry, NCBI

NCBI dbSNP database growth

[Figure, two panels, 1999 to 2011. Panel 1: millions of rs-ids for human variations (non-redundant annotations), split into SNP, STR & indel, and ambiguous mapping. Panel 2: millions of submissions by project: TSC, HapMap, 1000 Genomes, other projects.]

dbSNP build 135. November 2011

Kidd et al, 2007 APOBEC cluster

Black: Deletion

White: Insertion

http://www.ncbi.nlm.nih.gov/dbvar

Church et al., 2011 PLoS

http://genomereference.org

Distributed data

Genome not in INSDC Database

Old Assembly Model

GRC Beginnings

Build sequence contigs based on contigs defined in TPF.

Check for orientation consistencies

Select switch points

Instantiate sequence for further analysis

Switch point

Consensus sequence
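The build steps above (order components from the TPF, check orientation, pick switch points, instantiate the sequence) can be sketched roughly as follows. This is only a toy illustration, not the GRC pipeline: the `Component` fields, the clone names, and the toy sequences are all hypothetical, and the switch points are assumed to have been chosen already.

```python
from dataclasses import dataclass

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string (A/C/G/T only)."""
    return seq.translate(_COMPLEMENT)[::-1]


@dataclass
class Component:
    accession: str    # hypothetical clone accession.version from the tiling path
    sequence: str     # clone sequence as deposited
    start: int        # 0-based start of the span used, in clone coordinates
    end: int          # exclusive end of the span used (a switch point)
    orientation: str  # "+" or "-" relative to the contig


def build_contig(path: list[Component]) -> str:
    """Concatenate the chosen span of each component, in tiling-path order."""
    pieces = []
    for comp in path:
        if comp.orientation not in ("+", "-"):
            raise ValueError(f"bad orientation for {comp.accession}")
        span = comp.sequence[comp.start:comp.end]
        # Minus-strand components are sliced first, then reverse complemented.
        pieces.append(span if comp.orientation == "+" else reverse_complement(span))
    return "".join(pieces)


# Toy data: two made-up clones that overlap by four bases ("TGCA").
# CLONE_B.1 was deposited on the opposite strand.
toy_path = [
    Component("CLONE_A.1", "ACGTTGCA", 0, 6, "+"),
    Component("CLONE_B.1", "GCCTTGCA", 0, 6, "-"),
]
toy_contig = build_contig(toy_path)
```

Because each component contributes only the span up to its switch point, the overlap between neighbouring clones appears once in the contig rather than twice.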


Community Input

Distributed data

Genome not in INSDC Database

Old Assembly Model

Centralized Data

Large-Scale Variation Complicates Genome Assembly

Sequences from haplotype 1

Sequences from haplotype 2

Old Assembly model: compress into a consensus

New Assembly model: represent both haplotypes

NCBI36 (hg18)

UGT2B17 Region

AC074378.4 AC079749.5

AC134921.2 AC147055.2

AC140484.1 AC019173.4

AC093720.2 AC021146.7

NCBI36 NC_000004.10 (chr4) Tiling Path

Xue Y et al, 2008

TMPRSS11E TMPRSS11E2

GRCh37 NC_000004.11 (chr4) Tiling Path

AC074378.4 AC079749.5

AC134921.1 AC147055.2

AC093720.2 AC021146.7

TMPRSS11E

GRCh37: NT_167250.1 (UGT2B17 alternate locus)

AC074378.4 AC140484.1

AC019173.4 AC226496.2

AC021146.7

TMPRSS11E2

UGT2B17 Region

GRCh37 (hg19)


7 alternate haplotypes at the MHC

Alternate loci released as:

FASTA

AGP

Alignment to chromosome
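Since alternate loci are distributed as AGP files, a reader may want to see what one AGP row carries. The sketch below splits a tab-delimited AGP line into named fields, following the column order of the public AGP specification as I understand it (component rows versus gap rows of type N or U). The object and component names in the sample lines are made up for illustration.

```python
from typing import NamedTuple, Optional


class AgpRow(NamedTuple):
    obj: str                       # object being built (e.g. a scaffold)
    obj_beg: int                   # 1-based, inclusive
    obj_end: int                   # 1-based, inclusive
    part_number: int
    component_type: str            # e.g. "F" finished, "N"/"U" gap
    component_id: Optional[str]    # None for gap rows
    component_beg: Optional[int]
    component_end: Optional[int]
    orientation: Optional[str]
    gap_length: Optional[int]      # None for component rows


def parse_agp_line(line: str) -> Optional[AgpRow]:
    """Parse one AGP line; return None for blank lines and '#' comments."""
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        return None
    f = line.split("\t")
    if len(f) < 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(f)}")
    obj, beg, end, part, ctype = f[0], int(f[1]), int(f[2]), int(f[3]), f[4]
    if ctype in ("N", "U"):
        # Gap row: column 6 is the gap length.
        return AgpRow(obj, beg, end, part, ctype, None, None, None, None, int(f[5]))
    return AgpRow(obj, beg, end, part, ctype, f[5], int(f[6]), int(f[7]), f[8], None)


# Hypothetical sample rows (names and coordinates are illustrative only).
component_row = parse_agp_line("ALT_EXAMPLE\t1\t1000\t1\tF\tCLONE_A.1\t1\t1000\t+")
gap_row = parse_agp_line("ALT_EXAMPLE\t1001\t1100\t2\tN\t100\tscaffold\tyes\tpaired-ends")
```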

UGT2B17 MHC MAPT

Assembly (e.g. GRCh37)

Primary Assembly

Non-nuclear assembly unit (e.g. MT)

ALT 1

ALT 2

ALT 3

ALT 4

ALT 5

ALT 6

ALT 7

ALT 8

ALT 9

PAR

Genomic Region (MHC)

Genomic Region (UGT2B17)

Genomic Region (MAPT)

Richa Agarwala

MHC Alternate locus

Alignment to chr6

Oh No! Not a new version of the human genome!


Assembly (e.g. GRCh37.p5)

Primary Assembly

Non-nuclear assembly unit (e.g. MT)

ALT 1

ALT 2

ALT 3

ALT 4

ALT 5

ALT 6

ALT 7

ALT 8

ALT 9

PAR

Genomic Region (MHC)

Genomic Region (UGT2B17)

Genomic Region (MAPT)

Patches

Genomic Region (ABO)

Genomic Region (SMA)

Genomic Region (PECAM1)

TBC1D3C TBC1D3

TBC1D3C

TBC1D3H

Myo19 region (17q21)

70 Fix PATCHES: Chromosome will update in GRCh38

71 Novel PATCHES: Additional sequence added

(adds >1 Mb of novel sequence to the assembly)

(adds >800K of novel sequence to the assembly)

Releasing patches quarterly

Distributed data

Genome not in INSDC Database

Old Assembly Model

Centralized Data

Updated Assembly Model

Genome in INSDC Database

Genome not in INSDC Database

GenBank

Data Archives

Data in a common format

Data in a single location (and mirrored)

Most quality checked prior to deposition

Robust data tracking mechanism (accession.version)

Data owned by submitter

Data tracking

ABC14-1065514J1

Gaps Phase Length Date

FP565796.1 1 1 21-Oct-2009

FP565796.2 1 0 14-Oct-2010

FP565796.3 3 0 07-Nov-2010
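The accession.version mechanism shown above lends itself to a small worked example. The sketch below splits an identifier into its accession and integer version and keeps the highest version seen per accession. The function names are my own, not part of any NCBI tool.

```python
def split_accession(acc_ver: str) -> tuple[str, int]:
    """Split 'FP565796.3' into ('FP565796', 3)."""
    accession, sep, version = acc_ver.rpartition(".")
    if not sep or not accession or not version.isdigit():
        raise ValueError(f"not an accession.version identifier: {acc_ver!r}")
    return accession, int(version)


def latest_versions(identifiers: list[str]) -> dict[str, int]:
    """Map each accession to the highest version number seen."""
    latest: dict[str, int] = {}
    for ident in identifiers:
        accession, version = split_accession(ident)
        if version > latest.get(accession, 0):
            latest[accession] = version
    return latest


# The three versions of the clone record listed on the slide.
clone_history = ["FP565796.1", "FP565796.2", "FP565796.3"]
newest = latest_versions(clone_history)
```

Comparing versions as integers rather than strings matters once a record passes version 9, since "10" sorts before "9" as text.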

Mouse chrX: 34,800,000-34,890,000

NC_000086.123456 CM001013.17 2

Mouse chrX: 35,000,000-36,000,000

X

MGSCv3 MGSCv36

hg19 GRCh37

mm8 MGSCv37

NCBIM37

danRer5 Zv7

What’s in a name?

By any other name…

chr21:8,913,216-9,246,964

Zv7 chr21:8,913,216-9,246,964 X Mouse Build 36 chrX
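The point of these slides is that a coordinate string means nothing without the assembly it refers to. One way to make that concrete is a region type that refuses to exist without an assembly name, as in the sketch below. The `Region` type and `parse_region` function are illustrative names of my own, not an NCBI API.

```python
import re
from typing import NamedTuple


class Region(NamedTuple):
    assembly: str  # e.g. "Zv7"; required, never inferred
    chrom: str
    start: int
    end: int


_REGION = re.compile(r"^(?P<chrom>[^:\s]+):(?P<start>[\d,]+)-(?P<end>[\d,]+)$")


def parse_region(assembly: str, text: str) -> Region:
    """Parse 'chr21:8,913,216-9,246,964' in the context of a named assembly."""
    if not assembly:
        raise ValueError("a region is ambiguous without an assembly name")
    match = _REGION.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognised region string: {text!r}")
    start = int(match["start"].replace(",", ""))
    end = int(match["end"].replace(",", ""))
    if start > end:
        raise ValueError("start must not exceed end")
    return Region(assembly, match["chrom"], start, end)


zebrafish = parse_region("Zv7", "chr21:8,913,216-9,246,964")
# Same text, different assembly label: a different region.
other = parse_region("GRCh37", "chr21:8,913,216-9,246,964")
```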


http://www.ncbi.nlm.nih.gov/genome/assembly

GRCh37 hg19

Assembly (e.g. GRCh37.p5): GCA_000001405.6 / GCF_000001405.17

Primary Assembly: GCA_000001305.1 / GCF_000001305.13

ALT 1: GCA_000001315.1 / GCF_000001315.1

ALT 2: GCA_000001325.1 / GCF_000001325.2

ALT 3: GCA_000001335.1 / GCF_000001335.1

ALT 4: GCA_000001345.1 / GCF_000001345.1

ALT 5: GCA_000001355.1 / GCF_000001355.1

ALT 6: GCA_000001365.1 / GCF_000001365.2

ALT 7: GCA_000001375.1 / GCF_000001375.1

ALT 8: GCA_000001385.1 / GCF_000001385.1

ALT 9: GCA_000001395.1 / GCF_000001395.1

Patches: GCA_000005045.5 / GCF_000005045.4

Non-nuclear assembly unit (e.g. MT): GCA_000006015.1 / GCF_000006015.1
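The assembly model on this slide, where an assembly is a set of units and each unit has paired GenBank (GCA) and RefSeq (GCF) accessions, maps naturally onto a small data structure. The sketch below is one possible representation, with class and method names of my own choosing. Only three of the units from the slide are included to keep it short.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class AssemblyUnit:
    name: str
    genbank: str  # GCA_ accession.version
    refseq: str   # GCF_ accession.version


@dataclass
class Assembly:
    name: str
    genbank: str
    refseq: str
    units: list[AssemblyUnit] = field(default_factory=list)

    def find_unit(self, accession: str) -> Optional[AssemblyUnit]:
        """Look a unit up by either its GenBank or its RefSeq accession."""
        for unit in self.units:
            if accession in (unit.genbank, unit.refseq):
                return unit
        return None


# Accessions copied from the slide; the unit list is deliberately partial.
grch37_p5 = Assembly(
    name="GRCh37.p5",
    genbank="GCA_000001405.6",
    refseq="GCF_000001405.17",
    units=[
        AssemblyUnit("Primary Assembly", "GCA_000001305.1", "GCF_000001305.13"),
        AssemblyUnit("Patches", "GCA_000005045.5", "GCF_000005045.4"),
        AssemblyUnit("Non-nuclear (MT)", "GCA_000006015.1", "GCF_000006015.1"),
    ],
)
```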

GenBank vs RefSeq

GenBank              RefSeq
Submitter owned      RefSeq owned
Redundancy           Non-redundant
Updated rarely       Curated
INSDC                Not INSDC

BRCA1 in GenBank: 83 genomic records, 31 mRNA records, 27 protein records

BRCA1 in RefSeq: 3 genomic records, 5 mRNA records, 1 RNA record, 5 protein records

RefSeq for Assemblies

Typical assembly edits

Addition of non-nuclear (e.g. MT) assembly units

Removal of contamination

Drop unlocalized/unplaced scaffolds

Mask contamination that is placed on chromosome

http://www.ncbi.nlm.nih.gov/genome

Understanding relationships between assemblies using alignments

First Pass

Second Pass

Reciprocal best hit

Non-reciprocal, duplicative hits

No second pass alignments in GRCh37.p5
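To illustrate the distinction drawn here, the sketch below separates a list of scored alignments into reciprocal best hits and everything else. It assumes, as a simplification, that the first pass corresponds to reciprocal best hits and the second pass to the remaining non-reciprocal hits. The `Hit` type, the region labels, and the scores are hypothetical, and this is not the NCBI alignment pipeline.

```python
from typing import Callable, NamedTuple


class Hit(NamedTuple):
    query: str    # region on the first assembly (e.g. NCBI36)
    subject: str  # region on the second assembly (e.g. GRCh37.p5)
    score: float


def best_by(hits: list[Hit], key: Callable[[Hit], str]) -> dict[str, Hit]:
    """Keep the highest-scoring hit per key; the first seen wins ties."""
    best: dict[str, Hit] = {}
    for hit in hits:
        k = key(hit)
        if k not in best or hit.score > best[k].score:
            best[k] = hit
    return best


def split_passes(hits: list[Hit]) -> tuple[list[Hit], list[Hit]]:
    """Return (reciprocal best hits, all other hits)."""
    best_for_query = best_by(hits, lambda h: h.query)
    best_for_subject = best_by(hits, lambda h: h.subject)
    reciprocal, other = [], []
    for hit in hits:
        # Reciprocal: best for its query AND best for its subject.
        if best_for_query[hit.query] is hit and best_for_subject[hit.subject] is hit:
            reciprocal.append(hit)
        else:
            other.append(hit)
    return reciprocal, other


toy_hits = [
    Hit("a1", "b1", 100.0),
    Hit("a1", "b2", 80.0),   # duplicative: a1 already pairs best with b1
    Hit("a2", "b2", 90.0),
]
first_pass, second_pass = split_passes(toy_hits)
```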

NCBI36

GRCh37.p5

http://www.ncbi.nlm.nih.gov/tools/gbench/

Genome Data is MORE than just the Genome


ATGCGTGCAAAATGCAGTGAGT

ATGCGTGCAAAATGCAGTGAGT

ATGCGTGCAAAATGCAGTGAGT

ATGCGTGCAAAATGCAGTGAGT

NM_000336.2:c.800C>T

NC_000001.10:g.(?_20700513)_(21062644_?)del
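Variant names like these tie an observation to a specific versioned reference sequence. As a rough illustration, the sketch below pulls the parts out of a simple substitution such as the first expression. It is not an HGVS parser: it handles only single-base substitutions at plain integer positions, and returns None for anything else, including the uncertain-breakpoint deletion.

```python
import re
from typing import NamedTuple, Optional


class Substitution(NamedTuple):
    reference: str   # accession without version, e.g. "NM_000336"
    version: int     # reference sequence version
    coord_type: str  # "c" coding, "g" genomic, "n" non-coding, "m" mitochondrial
    position: int
    ref: str
    alt: str


_SUBSTITUTION = re.compile(
    r"^(?P<acc>[A-Z]{2}_\d+)\.(?P<ver>\d+):"
    r"(?P<type>[cgnm])\.(?P<pos>\d+)(?P<ref>[ACGT])>(?P<alt>[ACGT])$"
)


def parse_substitution(expr: str) -> Optional[Substitution]:
    """Parse a simple substitution; return None for anything more complex."""
    match = _SUBSTITUTION.match(expr)
    if match is None:
        return None
    return Substitution(
        match["acc"], int(match["ver"]), match["type"],
        int(match["pos"]), match["ref"], match["alt"],
    )


snv = parse_substitution("NM_000336.2:c.800C>T")
deletion = parse_substitution("NC_000001.10:g.(?_20700513)_(21062644_?)del")
```

Keeping the version as its own field makes it explicit that the same position on a later version of the reference may not describe the same base.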

http://www.ncbi.nlm.nih.gov/variation/tools/1000genomes

http://www.ncbi.nlm.nih.gov/education/

http://www.youtube.com/NCBINLM @NCBI http://www.facebook.com/ncbi.nlm

Thanks!

For Slides: Francoise Thibaud-Nissen, Evan Eichler, Steve Sherry

The Genome Reference Consortium: The Genome Center at Washington University, The Wellcome Trust Sanger Institute, The European Bioinformatics Institute, The National Center for Biotechnology Information

Church group at NCBI: Valerie Schneider, Nathan Bouk, Hsiu-Chuan Chen, Peter Meric, Victor Ananiev, Chao Chen, John Lopez, John Garner, Tim Hefferon, Cliff Clausen

NCBI
