TOOLS FOR HTS ANALYSIS Michael Brudno and Marc Fiume Department of Computer Science University of Toronto


Page 1: Tools For HTS Analysis

TOOLS FOR HTS ANALYSIS

Michael Brudno and Marc Fiume
Department of Computer Science
University of Toronto

Page 2: Tools For HTS Analysis

Outline
• Lab focus
• Our tools
  • SHRiMP: read mapper
  • VARiD: SNP and indel finder
  • Savant: genome browser
• Discussion

Page 3: Tools For HTS Analysis

Our Tools

• READ MAPPING (SHRiMP)
• SNP DETECTION (VARiD)
• INDEL DETECTION (MODiL)
• CNV DETECTION (CNVer)
• ASSEMBLY (UNNAMED)
• VISUALIZATION (SAVANT)

Page 4: Tools For HTS Analysis

SHRIMP – SHORT READ MAPPING PACKAGE

Savant Genome Browser - http://compbio.cs.toronto.edu/savant/

Page 5: Tools For HTS Analysis

Key SHRiMP Features
• High sensitivity
• Support for common formats (SAM, FASTQ, etc.)
• Flexible seeding framework
• Multi-threading
• Full support for SOLiD and Illumina (and 454) reads

Page 6: Tools For HTS Analysis

Sensitivity/Specificity Comparison


Page 7: Tools For HTS Analysis

Runtime comparison

Mapping 6 million reads to C. savignyi (180 Mb)

[Figure: runtime comparison; panels: Unpaired 50bp Reads, Paired 75bp Reads]

Page 8: Tools For HTS Analysis

VARID – SNP AND INDEL DETECTION


Page 9: Tools For HTS Analysis

motivation | methods | results | summary

Variation detection from NGS reads

Reference:     TCAGCATCGGCATCGACTGCACAGGACCAGTCGATCGAC
Donor:         ???????????????????????????????????????
Aligned reads: GCATCGACTGCA, CGGGATCGACTG, ATCCATTGCA, GATCCACTGCAC

• Determine differences (variation) between reference and donor using NGS reads of the donor

Page 10: Tools For HTS Analysis

Motivation
Color-space and Letter-space platforms: bring them together

Methods

Results

Summary

Page 11: Tools For HTS Analysis

Sequencing Platforms

• letter-space: Sanger, 454, Illumina, etc.

  > NC_005109.2 | BRCA1 SX3
  TCAGCATCGGCATCGACTGCACAGG

• color-space: AB SOLiD (fewer software tools available)

  > NC_005109.2 | BRCA1 AF3
  T212313230313232121311120

• many differences -> useful to combine this information
  • sequencing biases
  • inherent errors
  • advantages

Page 12: Tools For HTS Analysis

Color Space

Translation Matrix

      A  C  G  T
  A   0  1  2  3
  C   1  0  3  2
  G   2  3  0  1
  T   3  2  1  0

Translation Automata

[Figure: four states A, C, G, T; each edge between two states is labeled with the color (0-3) of that base pair]
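The color-space translation can be sketched in a few lines of Python. This is an illustrative sketch only, not code from SHRiMP or VARiD; the names `encode` and `decode` are hypothetical. It relies on the property that, with the 2-bit codes A=0, C=1, G=2, T=3, the color of two adjacent bases is the XOR of their codes.

```python
# Hypothetical helper names; a sketch of the color-space translation matrix.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASE = "ACGT"

def encode(letters):
    """Letter-space -> color-space: keep the first base, then one color per adjacent pair."""
    colors = [str(CODE[a] ^ CODE[b]) for a, b in zip(letters, letters[1:])]
    return letters[0] + "".join(colors)

def decode(read):
    """Color-space -> letter-space: walk the automaton starting from the leading base."""
    letters = [read[0]]
    for c in read[1:]:
        letters.append(BASE[CODE[letters[-1]] ^ int(c)])
    return "".join(letters)

print(encode("TCAGCATCGGCATCGACTGCACAGG"))  # T212313230313232121311120
print(decode("T212313230313232121311120"))  # TCAGCATCGGCATCGACTGCACAGG
```

Because every decoded base depends on the previous one, a single wrong color changes every letter after it, which is why color reads are usually not translated before analysis.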

Page 13: Tools For HTS Analysis

Color Space

Translating
> T212313230313232121311120
> TCAGCATCGGCATCGACTGCACAGG

Sequencing Error vs SNP

Sequencing Error
> T212313230313232121311120
> T212313230310232121311120
> TCAGCATCGGCAAGCTGACGTGTCC

SNP
> TCAGCATCGGCATCGACTGCACAGG
> TCAGCATCGGCAGCGACTGCACAGG
> T212313230312332121311120

Page 14: Tools For HTS Analysis

Color Space

• clear distinction between a sequencing error and a SNP
• can this help us in SNP detection? sounds like it!

single color change: error; 2 colors changed: (likely) SNP.

[Figure: reads aligned by position against the reference in color-space; annotated regions: Easy SNP call, Well covered bases, Difficult Case]
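The "one color vs. two colors" rule can be sketched as a small classifier. This is a simplified illustration, not VARiD's method (VARiD uses an HMM, described later); the function name `classify` and its labels are hypothetical. The extra XOR check uses the fact that a real SNP changes two adjacent colors but leaves their XOR unchanged, since the bases on either side of the SNP stay the same.

```python
def classify(ref_colors, read_colors):
    """Compare two equal-length color strings and label the difference.

    An isolated single mismatch looks like a sequencing error; two adjacent
    mismatches whose XOR is preserved are consistent with a SNP.
    """
    mism = [i for i, (a, b) in enumerate(zip(ref_colors, read_colors)) if a != b]
    if not mism:
        return "match"
    if len(mism) == 1:
        return "likely sequencing error"
    if len(mism) == 2 and mism[1] - mism[0] == 1:
        i, j = mism
        same_xor = (int(ref_colors[i]) ^ int(ref_colors[j])) == (
            int(read_colors[i]) ^ int(read_colors[j])
        )
        if same_xor:
            return "likely SNP"
    return "unclear"

ref = "212313230313232121311120"
print(classify(ref, "212313230310232121311120"))  # likely sequencing error
print(classify(ref, "212313230312332121311120"))  # likely SNP
```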

Page 15: Tools For HTS Analysis

Motivation

• variation caller to handle both letter-space & color-space reads

VARiD
• system to make inferences on the donor bases
• variation detection

Detection
• Heterozygous SNPs
• Homozygous SNPs
• Tri-allelic SNPs
• small indels
• account for various errors, quality values & misalignments

Page 16: Tools For HTS Analysis

Motivation

Methods
• Simple HMM Model: states, emissions, transitions, FB
• Extended HMM Model: gaps, diploids, exceptions

Results

Summary

Page 17: Tools For HTS Analysis

Hidden Markov Model (HMM)

Statistical model for a system with states
• Assume that the system is a Markov process whose state is unobserved.
• Markov process: the next state depends only on the current state.
• We can observe the state’s emission (output): each state has a probability distribution over outputs.

[Figure: chain of states S1, S2, S3, each emitting e1 or e2]

Page 18: Tools For HTS Analysis

Hidden Markov Model (HMM)

Apply HMM to variation detection:
• we don’t know the state (donor), but
• we can observe some output determined by the state (reads)

Page 19: Tools For HTS Analysis

Hidden Markov Model (HMM)

[Figure: HMM over the unknown donor bases ... B6 B7 B8 B9 ...; the hidden states are adjacent base pairs (B6 B7, B7 B8, B8 B9); each pair takes a value such as AA or AC and emits a color such as color 0 or color 1]

Page 20: Tools For HTS Analysis

States

Why pairs of letters? To handle colors.
• AA and TT give the same color, so we can’t just model colors.

The donor pair (B6 B7) could be:
• letters: AA, color 0
• letters: AC, color 1
  :
• letters: TT, color 0
16 combinations

[Figure: color-space translation matrix and translation automata, repeated from the Color Space slide]

Page 21: Tools For HTS Analysis

States and Transitions

[Figure: possible states for the adjacent donor pairs B6 B7 and B7 B8 within ... B6 B7 B8 ...; each column lists the pair states AA, CA, AT, TT, ..., GA, CT, ..., with arrows for the allowed transitions]

States
• 16 possible states
• only look at second letter

Transitions
• only certain transitions allowed
• when allowed, p(Xt|Xt-1) = freq(Xt)
• each state depends only on the previous state (Markov process)
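The state space and transition rule can be sketched as follows. This is a sketch under stated assumptions, not VARiD's implementation: it assumes a transition is allowed only when consecutive pairs overlap (the second letter of the previous pair equals the first letter of the next), and it reads freq(Xt) as the background frequency of the newly added second letter, set here to a uniform 0.25. The names `STATES`, `FREQ`, `color` and `transition` are hypothetical.

```python
from itertools import product

BASES = "ACGT"
CODE = {b: i for i, b in enumerate(BASES)}

# The 16 hidden states: every ordered pair of adjacent donor bases.
STATES = ["".join(p) for p in product(BASES, repeat=2)]

# Assumption: uniform background base frequencies.
FREQ = {b: 0.25 for b in BASES}

def color(state):
    """Color emitted by a base pair (XOR of the 2-bit base codes)."""
    return CODE[state[0]] ^ CODE[state[1]]

def transition(prev, cur):
    """p(cur | prev): zero unless the pairs overlap on the shared base."""
    if prev[1] != cur[0]:
        return 0.0
    return FREQ[cur[1]]

allowed = [(p, c) for p in STATES for c in STATES if transition(p, c) > 0]
print(len(STATES), len(allowed))              # 16 64
print(color("AA"), color("TT"), color("AC"))  # 0 0 1
```

Only 64 of the 256 state pairs have non-zero probability, so the transition matrix is sparse, and each row sums to 1 under the assumed frequencies.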

Page 22: Tools For HTS Analysis

Emissions

Unknown genome: ..... B6 B7 B8 B9 .....

Color reads:
T01020100311223
T1030101311223
T20100311223

Letter reads:
ATTGCGCAATGCG
TTGGGCAATGCGA
GCGCACTGCGAC

[Figure: the state for donor pair B7 B8 (e.g. AA or AC) emits the colors (e.g. color 0, color 1) and letters observed in the reads at that position]

Page 23: Tools For HTS Analysis

Emission Probabilities

Emission probability p(em|AA) for state AA:

emission  | probability
color 0   | 1 - 3ε
color 1   | ε
color 2   | ε
color 3   | ε
letters A | 1 - 3ξ
letters C | ξ
letters G | ξ
letters T | ξ

• State TT: same color emission distribution as AA
• State TT: different letter emission distribution from AA

Page 24: Tools For HTS Analysis

Emission Probabilities

Combining emission probabilities
• probability that this state emitted these reads.

Unknown genome: ..... B6 B7 B8 B9 .....

T01020100311223
T1030101311223
T20100311223

ATTGCGCAATGCG
TTGGGCAATGCGA
GCGCACTGCGAC

E.g. for state CC:

$p(E \mid CC) = [\varepsilon^{2}(1-3\varepsilon)^{1}]\,[\xi^{1}(1-3\xi)^{2}]$
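Combining the per-read emission probabilities for one state can be sketched as a product over the observed colors and letters. This is a minimal sketch, not VARiD's code: the function name `emission`, the default error rates `eps` (color error, ε) and `xi` (letter error, ξ), and the observations in the example are all hypothetical, and the letter emission is assumed to be compared against the second letter of the state.

```python
import math

CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def emission(state, colors, letters, eps=0.01, xi=0.01):
    """Probability that `state` (a base pair such as 'CC') emitted the
    observed colors and letters at this position, assuming independent reads."""
    expected_color = CODE[state[0]] ^ CODE[state[1]]
    expected_letter = state[1]  # assumption: letters report the second base
    p = 1.0
    for c in colors:
        p *= (1 - 3 * eps) if c == expected_color else eps
    for l in letters:
        p *= (1 - 3 * xi) if l == expected_letter else xi
    return p

# Hypothetical observations: one matching and two mismatching colors,
# two matching letters and one mismatching letter.
p_cc = emission("CC", colors=[0, 1, 2], letters=["C", "C", "G"], eps=0.1, xi=0.2)
print(math.isclose(p_cc, (0.7 * 0.1**2) * (0.2 * 0.4**2)))  # True
```

In practice the product would be accumulated in log space to avoid underflow at high coverage.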

Page 25: Tools For HTS Analysis

Simple HMM

Summary
• unknown state
  • donor pair at location
• transitions
  • transition probabilities
• emissions
  • reads at location
  • emission probabilities

[Figure: donor pair B6 B7 with candidate states AA and AC emitting color 0 and color 1]

Page 26: Tools For HTS Analysis

Forward-Backward Algorithm

• Have set up a form of an HMM
• run the Forward-Backward algorithm
• get a probability distribution over states at some position

[Figure: column of possible states (AA, CA, AT, TT, ..., GA, CT, ...) with the likely state highlighted]

• Variation detection: compare the most likely state with the reference:

ref: GCTATCCA
don: ...AT...
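The Forward-Backward step can be sketched on a toy model. This is a generic textbook version of the algorithm, not VARiD's code; the two-state model below (states AA and AC, their start, transition and emission probabilities) is invented purely for illustration and ignores the 16-state structure and the allowed-transition rule.

```python
def forward_backward(obs, states, start, trans, emit):
    """Return, for each position, the posterior distribution over states."""
    n = len(obs)
    fwd = [{} for _ in range(n)]
    bwd = [{} for _ in range(n)]
    for s in states:
        fwd[0][s] = start[s] * emit[s][obs[0]]
        bwd[n - 1][s] = 1.0
    # Forward pass: probability of the observations so far, ending in state s.
    for t in range(1, n):
        for s in states:
            fwd[t][s] = emit[s][obs[t]] * sum(fwd[t - 1][r] * trans[r][s] for r in states)
    # Backward pass: probability of the remaining observations, given state s.
    for t in range(n - 2, -1, -1):
        for s in states:
            bwd[t][s] = sum(trans[s][r] * emit[r][obs[t + 1]] * bwd[t + 1][r] for r in states)
    posteriors = []
    for t in range(n):
        z = sum(fwd[t][s] * bwd[t][s] for s in states)
        posteriors.append({s: fwd[t][s] * bwd[t][s] / z for s in states})
    return posteriors

# Toy model (all numbers are illustrative assumptions).
states = ("AA", "AC")
start = {"AA": 0.5, "AC": 0.5}
trans = {"AA": {"AA": 0.9, "AC": 0.1}, "AC": {"AA": 0.1, "AC": 0.9}}
emit = {"AA": {0: 0.97, 1: 0.03}, "AC": {0: 0.03, 1: 0.97}}

post = forward_backward([0, 0, 1], states, start, trans, emit)
for t, dist in enumerate(post):
    print(t, max(dist, key=dist.get))
```

Taking the most likely state at each position and comparing it with the reference gives the variant call; a production version would rescale or work in log space to avoid underflow on long sequences.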

Page 27: Tools For HTS Analysis

Motivation

Methods
• Simple HMM Model: states, emissions, transitions, FB
• Extended HMM Model: gaps, diploids, exceptions

Results

Summary

Page 28: Tools For HTS Analysis

Extended HMM

Simple HMM
• only detects homozygous SNPs

Extended HMM:
• short indels
• heterozygous SNPs
• complex error profiles & quality values

Page 29: Tools For HTS Analysis

Expansion: Gaps and het. SNPs

Expand states
• Have states that include gaps
  • emit: gap or color
• Have larger states, for diploids

[Figure: example states: A---, -G, AGTG, T-T-]

• Transitions built in similar fashion as before
• Same algorithm, but in all we have 1600 states with very sparse transitions

Page 30: Tools For HTS Analysis

Expansion

• Emission probabilities
  o Support quality values
  o Use variable error rates for emissions

• Translate through the first letter
  o first color is incorrect
  o letter-space signal

Donor:   ACAGCATCGGCATCGACTGC
          1123132303123321213
read:  > T2123132303123321213
       > C123132303123321213

• Post-process putative SNPs
  o uncorrelated adjacent errors may support het SNPs
  o check putative SNPs
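Translating through the first letter can be sketched as follows. This is an illustration only; the function name `translate_first_color` is hypothetical. The leading base of a color read is the primer base, so combining it with the first color yields the first real base of the read, which can then be used as a letter-space signal.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASE = "ACGT"

def translate_first_color(read):
    """Replace the primer base and first color with the first sequenced base."""
    primer, first_color, rest = read[0], int(read[1]), read[2:]
    first_base = BASE[CODE[primer] ^ first_color]
    return first_base + rest

print(translate_first_color("T2123132303123321213"))  # C123132303123321213
```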

Page 31: Tools For HTS Analysis

Summary

[Figure: pipeline diagram; blue: VARiD steps]

Page 32: Tools For HTS Analysis

Motivation

Methods

Results

Summary

Page 33: Tools For HTS Analysis

Results

• Human dataset from Harismendy et al., 2009 (NA17156, 17275, 17460, 17773)

Color-space dataset:
• Compare random subsets:
  • Corona (with AB mapper)
  • VARiD (with SHRiMP)
  • VARiD (with AB mapper)

Conclusions:
• Using F-measure, the three pipelines perform very similarly.
• High-coverage results are as good as can be achieved
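For readers unfamiliar with the F-measure used in these comparisons, here is a minimal sketch of the standard definition (the harmonic mean of precision and recall). The function name and the counts in the example are hypothetical; they are not results from the talk.

```python
def f_measure(true_pos, false_pos, false_neg, beta=1.0):
    """F-measure of a set of variant calls against a truth set."""
    if true_pos == 0:
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Hypothetical counts: 90 correct calls, 10 false calls, 30 missed variants.
print(round(f_measure(90, 10, 30), 4))  # 0.8182
```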

Page 34: Tools For HTS Analysis

Results

• Human dataset from Harismendy et al., 2009 (NA17156, 17275, 17460, 17773)

Letter-space dataset:
• Compare random subsets:
  • GigaBayes (with Mosaik)
  • VARiD (with SHRiMP)
  • VARiD (with Mosaik)

Conclusion:
• Using F-measure, the three pipelines perform very similarly.
• High-coverage results are as good as can be achieved

Page 35: Tools For HTS Analysis

Results

VARiD: Combining Letter-space and Color-space Data to achieve increased accuracy in at-cost comparison

Page 36: Tools For HTS Analysis

Motivation

Methods

Results

Summary

Page 37: Tools For HTS Analysis

Summary

Summary of VARiD
• HMM modeling underlying donor
• Treats color-space and letter-space together in the same framework
• no translation: take advantage of each technology’s properties
• accurately calls short SNPs, indels in both color- and letter-space
• improved results with hybrid data.

• Website: http://compbio.cs.utoronto.ca/varid (VARiD freely available)
• Contact: [email protected]

Page 38: Tools For HTS Analysis

SAVANT GENOME BROWSER


Page 39: Tools For HTS Analysis

Challenge in Genomic Data Analysis
• genomic data is generated in high volumes
• interpretation and analysis challenge
• typical pipeline employs many separate tools for computation and visualization

Page 40: Tools For HTS Analysis

Tools for HTS data analysis

Tool | Cost | Computation | Visualization
Read Alignment (e.g. Bowtie, BWA) | Free | Y | N
File Format Conversion (e.g. Galaxy, SAMTools) | Free | Y | N
Other Command-line Tools (e.g. Genetic Variation Discovery, Comparative Genomics, etc.) | Free | Y | N
UCSC Genome Browser | Free | N | Y
Integrative Genomics Viewer | Free | N | Y
GBrowse | Free | N | Y
CLC Genomics Workbench | $$$ | Y | Y

• substantial disconnect between the processes of computational analysis and visualization

Page 41: Tools For HTS Analysis

Tools for HTS data analysis

Tool | Cost | Computation | Visualization
Read Alignment (e.g. Bowtie, BWA) | Free | Y | N
File Format Conversion (e.g. Galaxy, SAMTools) | Free | Y | N
Other Command-line Tools (e.g. Genetic Variation Discovery, Comparative Genomics, etc.) | Free | Y | N
UCSC Genome Browser | Free | N | Y
Integrative Genomics Viewer | Free | N | Y
GBrowse | Free | N | Y
CLC Genomics Workbench | $$$ | Y | Y
Savant Genome Browser | Free | Y | Y

• substantial disconnect between the processes of computational analysis and visualization

Page 42: Tools For HTS Analysis

Savant Genome Browser
• platform for integrated visual analysis of genomic data
  • feature-rich genome browser
  • computationally extensible via plugin framework

Page 43: Tools For HTS Analysis

(Very) Short List of Features


Page 44: Tools For HTS Analysis

FEATURE DEMONSTRATION

• INTERFACE
• HTS READ ALIGNMENTS
• EXAMPLE PLUGINS: SNP FINDER

Page 45: Tools For HTS Analysis

Power of visual analytics
• task: find the correct parameter for command-line tool

Page 46: Tools For HTS Analysis

Plugin Framework
• unlocks the potential for performing visual analytics
• mutually beneficial for both users and tool developers
  • for users: perform complex data analyses on-the-fly within a visual environment
  • for programmers: platform for simple development and deployment of various programs

Page 47: Tools For HTS Analysis

CONCLUSIONS


Page 48: Tools For HTS Analysis

Conclusions
• Savant is a platform for integrated visualization and analysis of genomic data
  • stand-alone genome browser
  • novel features: e.g. table view, visualization modes, data selection, etc.
  • computationally extensible through plugin framework
• makes interpretation and analysis of genomic data easier and more efficient

Page 49: Tools For HTS Analysis

Acknowledgements

Recep, Andrew, Vlad, Mike Brudno, Yue, Marc, Vanessa, Orion, Joe, Nilgun, Paul, Vera, Misko, Yoni

Page 50: Tools For HTS Analysis

Questions?

SHRiMP: http://compbio.cs.toronto.edu/shrimp

VARiD: http://compbio.cs.toronto.edu/varid

Savant Genome Browser: http://compbio.cs.toronto.edu/savant