church emory2013

Post on 24-Jun-2015

DESCRIPTION

Seminar at Emory Sep 2013

TRANSCRIPT

Deanna M. Church, Staff Scientist, NCBI

@deannachurch

The intersection of genome assembly and variation management.

http://genomereference.org

Valerie Schneider, NCBI

Variation Resources Team at NCBI

Ming Ward, Lon Phan, Brad Holmes, Anna Glodek, Michael Kholodov, Rama Maiti, Juliana Sampson, David Shao, Eugene Shekhtman, Qiang Wang, Hua Zhang

Donna Maglott, Melissa Landrum, Jennifer Lee, George Riley, Ray Tully, Craig Wallin, Shanmuga Chitipiralla, Douglas Hoffman, Wonhee Jang, Ken Katz, Michael Ovetsky, Ricardo Villamarin

Tim Hefferon, John Lopez, John Garner, Chao Chen

Learning Objectives

Why the reference assembly matters for your analysis

How the reference assembly is changing

Tools and Resources to find data

Why should you care about the Reference Assembly?

Genes, NCBI Homo sapiens Annotation Release 105

Transcript

CDS

dbSNP Build 138 using annotation release 104

http://www.bioplanet.com/gcat

What is the Reference Assembly?

An assembly is a MODEL of the genome

BAC insert

BAC vector

Shotgun sequence

Assemble

GAPS

“finishers” go in to manually fill the gaps, often by PCR

http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/issue_detail.cgi?id=HG-1012

http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/issue_detail.cgi?id=HG-1321

RP11-34P13 64E8 RP4-669L17 RP5-857K21 RP11-206L10 RP11-54O7

Gaps

NCBI36 (hg18)

GRCh37 (hg19)

NCBI35 (hg17)

GRCh37 (hg19)

AL139246.20

AL139246.21

Build sequence contigs based on contigs defined in TPF (Tiling Path File).

Check for orientation consistencies

Select switch points

Instantiate sequence for further analysis
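The contig-building steps above can be illustrated with a toy example. This is a minimal sketch, not the GRC pipeline: the clone names, sequences, and the `Component` and `build_contig` names are all hypothetical, and it assumes every clone is already in contig orientation with its switch points chosen.

```python
from dataclasses import dataclass


@dataclass
class Component:
    """One clone in a tiling path (all values here are made up)."""
    name: str      # clone accession.version
    sequence: str  # full clone sequence, already in contig orientation
    start: int     # 0-based index where this clone starts contributing
    end: int       # 0-based exclusive index where it stops contributing


def build_contig(path):
    """Concatenate the contributing slice of each clone, in path order."""
    return "".join(c.sequence[c.start:c.end] for c in path)


# The two toy clones overlap on "GGTTCC". The switch point sits inside the
# overlap: cloneA contributes up to and including "GGT", cloneB takes over
# from the base after its own copy of "GGT".
path = [
    Component("cloneA.1", "AACCGGTTCC", 0, 7),
    Component("cloneB.1", "GGTTCCAATT", 3, 10),
]
contig = build_contig(path)
print(contig)
```

Moving the switch point within the overlap changes which clone supplies the overlapping bases, which matters when the two clones come from different haplotypes.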

Switch point

Consensus sequence

NCBI36

nsv832911 (nstd68) Submitted on NCBI35 (hg17)

NCBI35 (hg17) Tiling Path

GRCh37 (hg19) Tiling Path

Gap Inserted

Moved approximately 2 Mb distal on chr15

NC_000015.8 (chr15)

NC_000015.9 (chr15)

Removed from assembly

Added to assembly

HG-24

Sequences from haplotype 1

Sequences from haplotype 2

Old Assembly model: compress into a consensus

New Assembly model: represent both haplotypes

AC074378.4, AC079749.5

AC134921.2, AC147055.2

AC140484.1, AC019173.4

AC093720.2, AC021146.7

NCBI36 NC_000004.10 (chr4) Tiling Path

Xue Y et al, 2008

TMPRSS11E TMPRSS11E2

GRCh37 NC_000004.11 (chr4) Tiling Path

AC074378.4, AC079749.5

AC134921.1, AC147055.2

AC093720.2, AC021146.7

TMPRSS11E

GRCh37: NT_167250.1 (UGT2B17 alternate locus)

AC074378.4, AC140484.1

AC019173.4, AC226496.2

AC021146.7

TMPRSS11E2

nsv532126 (nstd37)

GRCh37 (hg19)

7 alternate haplotypes at the MHC

Alternate loci released as:

FASTA

AGP

Alignment to chromosome
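Since alternate loci are released as AGP files alongside FASTA, it can help to see how an AGP line is read. The sketch below assumes the tab-separated, nine-column AGP layout in which component types `N` and `U` mark gaps. The object and clone names and all coordinates are invented for illustration, and `parse_agp_line` is a hypothetical helper, not part of any NCBI tool.

```python
def parse_agp_line(line):
    """Parse one tab-separated AGP line into a dict."""
    fields = line.rstrip("\n").split("\t")
    record = {
        "object": fields[0],
        "object_beg": int(fields[1]),
        "object_end": int(fields[2]),
        "part_number": int(fields[3]),
        "component_type": fields[4],
    }
    if fields[4] in ("N", "U"):
        # Gap line: columns 6-8 hold gap length, gap type, and linkage.
        record.update(is_gap=True, gap_length=int(fields[5]),
                      gap_type=fields[6], linkage=fields[7])
    else:
        # Component line: columns 6-9 describe the clone slice used.
        record.update(is_gap=False, component_id=fields[5],
                      component_beg=int(fields[6]),
                      component_end=int(fields[7]),
                      orientation=fields[8])
    return record


# Made-up example: two finished clones separated by a 100 bp gap.
agp_text = (
    "chrA\t1\t5000\t1\tF\tcloneA.1\t1\t5000\t+\n"
    "chrA\t5001\t5100\t2\tN\t100\tcontig\tno\tna\n"
    "chrA\t5101\t9100\t3\tF\tcloneB.2\t201\t4200\t-\n"
)
records = [parse_agp_line(l) for l in agp_text.splitlines()]
gap_bases = sum(r["gap_length"] for r in records if r["is_gap"])
clones = [r["component_id"] for r in records if not r["is_gap"]]
print(gap_bases, clones)
```

Reading the AGP this way shows which clone, and which version of it, supplies each stretch of the assembled sequence.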

UGT2B17 MHC MAPT

MHC (chr6)

Chr 6 representation (PGF)

Alt_Ref_Locus_2 (COX)

Data management and the Reference Assembly?

NC_000086.123456 CM001013.17 2

Mouse chrX: 34,800,000-34,890,000

Mouse chrX: 35,000,000-36,000,000

X

MGSCv3 MGSCv36

ABC14-1065514J1

Gaps   Phase   Length   Date

FP565796.1   1   1   21-Oct-2009

FP565796.2   1   0   14-Oct-2010

FP565796.3   3   0   07-Nov-2010
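The clone record above keeps one accession while its version number climbs as the sequence changes. A minimal sketch of handling that convention in code follows; the helper names `split_accession` and `same_record_different_sequence` are hypothetical, and the only assumption is the `accession.version` form shown on the slides.

```python
def split_accession(acc_ver):
    """Split 'FP565796.3' into ('FP565796', 3)."""
    accession, sep, version = acc_ver.partition(".")
    if not sep or not version.isdigit():
        raise ValueError(f"expected accession.version, got {acc_ver!r}")
    return accession, int(version)


def same_record_different_sequence(a, b):
    """True when two identifiers share an accession but differ in version."""
    acc_a, ver_a = split_accession(a)
    acc_b, ver_b = split_accession(b)
    return acc_a == acc_b and ver_a != ver_b


print(split_accession("FP565796.3"))
print(same_record_different_sequence("AL139246.20", "AL139246.21"))
```

Storing the version alongside the accession is what lets a coordinate be traced back to the exact sequence it was reported on.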

hg19: GRCh37

mm8: MGSCv37

NCBIM37

danRer5: Zv7

chr21:8,913,216-9,246,964

Zv7 chr21:8,913,216-9,246,964 X Mouse Build 36 chrX
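A coordinate string such as chr21:8,913,216-9,246,964 does not say which assembly it refers to. One way to guard against that is to refuse to parse a region without an assembly name. This is a sketch under assumptions: `Region`, `parse_region`, and the `ALIASES` table are hypothetical, and the table holds only the hg19/GRCh37 and danRer5/Zv7 pairs shown above.

```python
import re
from typing import NamedTuple


class Region(NamedTuple):
    assembly: str
    chrom: str
    start: int
    end: int


# Browser-style names mapped to assembly names (pairs taken from the slide).
ALIASES = {"hg19": "GRCh37", "danRer5": "Zv7"}

REGION_RE = re.compile(r"^(?P<chrom>[^:\s]+):(?P<start>[\d,]+)-(?P<end>[\d,]+)$")


def parse_region(text, assembly):
    """Parse 'chr:start-end' and attach an explicit assembly name."""
    match = REGION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a region string: {text!r}")
    start = int(match["start"].replace(",", ""))
    end = int(match["end"].replace(",", ""))
    if start > end:
        raise ValueError("start is greater than end")
    return Region(ALIASES.get(assembly, assembly), match["chrom"], start, end)


region = parse_region("chr21:8,913,216-9,246,964", "danRer5")
print(region, region.end - region.start + 1)
```

Two regions with identical chromosome and coordinates compare as different here whenever their assemblies differ, which is the behavior the slide argues for.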

http://www.ncbi.nlm.nih.gov/genome/assembly

GenBank vs RefSeq

GenBank: Submitter Owned; Redundancy; Updated rarely; INSDC

RefSeq: RefSeq Owned; Non-Redundant; Curated; Not INSDC

BRCA1 in GenBank: 83 genomic records, 31 mRNA records, 27 protein records

BRCA1 in RefSeq: 3 genomic records, 5 mRNA records, 1 RNA record, 5 protein records

http://www.ncbi.nlm.nih.gov/refseq/rsg

http://www.lrg-sequence.org/

RefSeq Gene

L R

http://www.ncbi.nlm.nih.gov/genome/tools/remap

From Assembly 1 <-> Assembly 2

Assembly <-> RefSeqGene/LRG

Primary Assembly <-> Alternate loci
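The idea behind moving a coordinate between two sequences can be shown with a toy alignment. This sketch is not the NCBI Remap algorithm or its API. It assumes ungapped, same-strand alignment blocks, and the `Block` type, `remap_position`, and all coordinates are hypothetical.

```python
from typing import NamedTuple, Optional


class Block(NamedTuple):
    """An ungapped, same-strand alignment block (1-based, inclusive)."""
    src_start: int  # first aligned base on the source sequence
    src_end: int    # last aligned base on the source sequence
    dst_start: int  # base on the target that src_start aligns to


def remap_position(pos, blocks) -> Optional[int]:
    """Return the target position for pos, or None if pos is unaligned."""
    for block in blocks:
        if block.src_start <= pos <= block.src_end:
            return block.dst_start + (pos - block.src_start)
    return None


# Made-up alignment: the target gained 500 bp after source base 1000,
# and source bases 2001-2500 have no counterpart on the target.
blocks = [
    Block(1, 1000, 1),
    Block(1001, 2000, 1501),
]
for pos in (500, 1200, 2200):
    print(pos, "->", remap_position(pos, blocks))
```

A position that returns `None` corresponds to sequence that was removed or has no alignment, which is why some features cannot be carried from one assembly to the next.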

Variant Calling and the Reference Assembly

Kidd et al, 2007 APOBEC cluster

Part of chr22 assembly

Alternate locus for chr22

White: Insertion

Black: Deletion

http://www.ncbi.nlm.nih.gov/variation/tools/1000genomes

Hydin: chr16 (16q22.2)

Hydin2: chr1 (1q21.1)

Missing in NCBI35/NCBI36

Unlocalized in GRCh37

Finished in GRCh38

Alignment to Hydin2 Genomic, 300 Kb, 99.4% ID (Paralogous)

Alignment to Hydin1 CHM1_1.0, >99.9% ID (Allelic)

Doggett et al., 2006

http://www.ncbi.nlm.nih.gov/variation/tools/1000genomes

CDC27

1KG Phase 1 Strict accessibility mask

SNP (all)

SNP (not 1KG)

Sudmant et al., 2010

Issues with the Reference Assembly

Dennis et al., 2012

1q32 1q21 1p21

1p21 patch alignment to chromosome 1

Fixing Rare/Incorrect Bases

Adding Novel Sequence

Karen Miga and Jim Kent arXiv:1307.0035

Preview of GRCh38 (scheduled Fall 2013)

TEX28 TKTL1

LOC101060233 (opsin related)

LOC101060234 (TEX28 related)

GRCh37 (current reference assembly)

NC_000023.10 (chrX)

NW_003871103.3

FAM23_MRC1 Region, chr10

Segmental Duplications

1KG accessibility Mask

Novel Patch 250 kb of artificial duplication

Adding Novel Sequence

GRCh37.p13

120 Fix Patches

60 Novel

Human Resolved for GRCh38

How to identify problem regions in the Reference Assembly

1000 Genomes Browser: http://www.ncbi.nlm.nih.gov/variation/tools/1000genomes

GeT-RM Browser: http://www.ncbi.nlm.nih.gov/variation/tools/getrm

Variation Viewer: http://www.ncbi.nlm.nih.gov/variation/view (coming Oct 2013!)

Tiling Path

Sequence Bar

Segmental Duplications, Eichler Lab

1000 Genomes strict accessibility mask

Annotated clone assembly problems

dbSNP Build 138 based on annotation run 104

Model based paralogous sequence differences, NCBI annotation run #

Paralogous/pseudo gene alignments, NCBI annotation run #

Singly Unique Nucleotide (SUN) map, Sudmant 2010

ClinVar Long Variations

GRC Curation Issues

ClinVar Short Variations
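The tracks listed above are, in effect, sets of intervals on the reference, so a variant call can be screened against them with an interval lookup. The sketch below assumes half-open intervals on a single chromosome. The coordinates are invented, and `build_index` and `in_problem_region` are hypothetical helpers, not part of any NCBI browser.

```python
from bisect import bisect_right


def build_index(intervals):
    """Sort and merge overlapping half-open [start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    starts = [start for start, _ in merged]
    return starts, merged


def in_problem_region(pos, index):
    """True if pos falls inside any merged interval."""
    starts, merged = index
    i = bisect_right(starts, pos) - 1
    return i >= 0 and merged[i][0] <= pos < merged[i][1]


# Made-up intervals standing in for segmental duplication or mask tracks.
index = build_index([(100, 200), (150, 300), (500, 600)])
for pos in (50, 250, 300, 500):
    print(pos, in_problem_region(pos, index))
```

Keeping one index per track, rather than merging all tracks together, would let a caller report which kind of annotation a variant overlaps.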
