gsnap: fast and snp-tolerant detection of complex variants ... · gsnap: fast and snp-tolerant...

116
GSNAP: Fast and SNP-tolerant detection of complex variants and splic- ing in short reads by Thomas D. Wu and Serban Nacu Matt Huska Freie Universität Berlin Computational Methods for High-Throughput Omics Data, WS 2011

Upload: trandung

Post on 19-Aug-2019

214 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: GSNAP: Fast and SNP-tolerant detection of complex variants ... · GSNAP: Fast and SNP-tolerant detection of complex variants and splic-ing in short reads by Thomas D. Wu and Serban

GSNAP: Fast and SNP-tolerant detection of complex variants and splic-ing in short readsby Thomas D. Wu and Serban Nacu

Matt HuskaFreie Universität Berlin

Computational Methods for High-Throughput Omics Data, WS 2011

Page 2: GSNAP: Fast and SNP-tolerant detection of complex variants ... · GSNAP: Fast and SNP-tolerant detection of complex variants and splic-ing in short reads by Thomas D. Wu and Serban

Outline

IntroductionMotivationGSNAP FeaturesExamples of Complex Variant Detection

The AlgorithmSummaryPreprocessingMethod 1: Spanning Set Generation and FilteringMethod 2: Complete Set Generation and FilteringVerification of Candidate RegionsDetecting Insertions and DeletionsDetecting Splice Junctions

ResultsSimulated ReadsTranscriptional ReadsLimitations

Conclusions

,

FU Berlin, GSNAP, Omics WS 2011 2

Page 3: GSNAP: Fast and SNP-tolerant detection of complex variants ... · GSNAP: Fast and SNP-tolerant detection of complex variants and splic-ing in short reads by Thomas D. Wu and Serban

Outline

IntroductionMotivationGSNAP FeaturesExamples of Complex Variant Detection

The AlgorithmSummaryPreprocessingMethod 1: Spanning Set Generation and FilteringMethod 2: Complete Set Generation and FilteringVerification of Candidate RegionsDetecting Insertions and DeletionsDetecting Splice Junctions

ResultsSimulated ReadsTranscriptional ReadsLimitations

Conclusions

,

FU Berlin, GSNAP, Omics WS 2011 2

Page 4: GSNAP: Fast and SNP-tolerant detection of complex variants ... · GSNAP: Fast and SNP-tolerant detection of complex variants and splic-ing in short reads by Thomas D. Wu and Serban

Outline

IntroductionMotivationGSNAP FeaturesExamples of Complex Variant Detection

The AlgorithmSummaryPreprocessingMethod 1: Spanning Set Generation and FilteringMethod 2: Complete Set Generation and FilteringVerification of Candidate RegionsDetecting Insertions and DeletionsDetecting Splice Junctions

ResultsSimulated ReadsTranscriptional ReadsLimitations

Conclusions

,

FU Berlin, GSNAP, Omics WS 2011 2

Page 5: GSNAP: Fast and SNP-tolerant detection of complex variants ... · GSNAP: Fast and SNP-tolerant detection of complex variants and splic-ing in short reads by Thomas D. Wu and Serban

Outline

IntroductionMotivationGSNAP FeaturesExamples of Complex Variant Detection

The AlgorithmSummaryPreprocessingMethod 1: Spanning Set Generation and FilteringMethod 2: Complete Set Generation and FilteringVerification of Candidate RegionsDetecting Insertions and DeletionsDetecting Splice Junctions

ResultsSimulated ReadsTranscriptional ReadsLimitations

Conclusions

,

FU Berlin, GSNAP, Omics WS 2011 2

Page 6: GSNAP: Fast and SNP-tolerant detection of complex variants ... · GSNAP: Fast and SNP-tolerant detection of complex variants and splic-ing in short reads by Thomas D. Wu and Serban

Outline

IntroductionMotivationGSNAP FeaturesExamples of Complex Variant Detection

The AlgorithmSummaryPreprocessingMethod 1: Spanning Set Generation and FilteringMethod 2: Complete Set Generation and FilteringVerification of Candidate RegionsDetecting Insertions and DeletionsDetecting Splice Junctions

ResultsSimulated ReadsTranscriptional ReadsLimitations

Conclusions

,

FU Berlin, GSNAP, Omics WS 2011 3

Page 7: GSNAP: Fast and SNP-tolerant detection of complex variants ... · GSNAP: Fast and SNP-tolerant detection of complex variants and splic-ing in short reads by Thomas D. Wu and Serban

Motivation

É New technology allows us to sequence more reads in shorter time.Increasing at an incredible rate with no signs of slowing down.

É "Why should we be happy with millions of reads, when we canhave. . .

,

FU Berlin, GSNAP, Omics WS 2011 4

Page 8: GSNAP: Fast and SNP-tolerant detection of complex variants ... · GSNAP: Fast and SNP-tolerant detection of complex variants and splic-ing in short reads by Thomas D. Wu and Serban

Motivation

É New technology allows us to sequence more reads in shorter time.Increasing at an incredible rate with no signs of slowing down.

É "Why should we be happy with millions of reads, when we canhave. . .

,

FU Berlin, GSNAP, Omics WS 2011 4

Page 9: GSNAP: Fast and SNP-tolerant detection of complex variants ... · GSNAP: Fast and SNP-tolerant detection of complex variants and splic-ing in short reads by Thomas D. Wu and Serban

Motivation

. . .billions?"

,

FU Berlin, GSNAP, Omics WS 2011 5

Page 10: GSNAP: Fast and SNP-tolerant detection of complex variants ... · GSNAP: Fast and SNP-tolerant detection of complex variants and splic-ing in short reads by Thomas D. Wu and Serban

Motivation: Why do we need another readmapping algorithm?

É The reads themselves are also getting longer. Longer reads = higherprobability for complex variants within a read.

É Current (Feb 2010) read mappers tend to either be very fast (BWA,Bowtie, SOAP2) or sensitive to variants (SOAP)

É GSNAP is intended to be fast and able to handle complex variants

,

FU Berlin, GSNAP, Omics WS 2011 6

Page 11: GSNAP: Fast and SNP-tolerant detection of complex variants ... · GSNAP: Fast and SNP-tolerant detection of complex variants and splic-ing in short reads by Thomas D. Wu and Serban

Motivation: Why do we need another readmapping algorithm?

É The reads themselves are also getting longer. Longer reads = higherprobability for complex variants within a read.

É Current (Feb 2010) read mappers tend to either be very fast (BWA,Bowtie, SOAP2) or sensitive to variants (SOAP)

É GSNAP is intended to be fast and able to handle complex variants

,

FU Berlin, GSNAP, Omics WS 2011 6

Page 12: GSNAP: Fast and SNP-tolerant detection of complex variants ... · GSNAP: Fast and SNP-tolerant detection of complex variants and splic-ing in short reads by Thomas D. Wu and Serban

Motivation: Why do we need another readmapping algorithm?

É The reads themselves are also getting longer. Longer reads = higherprobability for complex variants within a read.

É Current (Feb 2010) read mappers tend to either be very fast (BWA,Bowtie, SOAP2) or sensitive to variants (SOAP)

É GSNAP is intended to be fast and able to handle complex variants

,

FU Berlin, GSNAP, Omics WS 2011 6

Page 13: GSNAP: Fast and SNP-tolerant detection of complex variants ... · GSNAP: Fast and SNP-tolerant detection of complex variants and splic-ing in short reads by Thomas D. Wu and Serban

Outline

IntroductionMotivationGSNAP FeaturesExamples of Complex Variant Detection

The AlgorithmSummaryPreprocessingMethod 1: Spanning Set Generation and FilteringMethod 2: Complete Set Generation and FilteringVerification of Candidate RegionsDetecting Insertions and DeletionsDetecting Splice Junctions

ResultsSimulated ReadsTranscriptional ReadsLimitations

Conclusions

,

FU Berlin, GSNAP, Omics WS 2011 7

Page 14: GSNAP: Fast and SNP-tolerant detection of complex variants ... · GSNAP: Fast and SNP-tolerant detection of complex variants and splic-ing in short reads by Thomas D. Wu and Serban

Key Features of GSNAP

É Can handle short and long insertions and deletions

É Detects short and long distance splicing (includinginterchromosomal)É user-provided database of splice sites (eg. RefSeq)É probabilistic model

É SNP tolerant (given a user-provided database eg. dbSNP)É Can align bisulfite-treated DNA (for studying methlyation state)É Still pretty fast

,

FU Berlin, GSNAP, Omics WS 2011 8

Page 15: GSNAP: Fast and SNP-tolerant detection of complex variants ... · GSNAP: Fast and SNP-tolerant detection of complex variants and splic-ing in short reads by Thomas D. Wu and Serban

Key Features of GSNAP

É Can handle short and long insertions and deletionsÉ Detects short and long distance splicing (including

interchromosomal)

É user-provided database of splice sites (eg. RefSeq)É probabilistic model

É SNP tolerant (given a user-provided database eg. dbSNP)É Can align bisulfite-treated DNA (for studying methlyation state)É Still pretty fast

,

FU Berlin, GSNAP, Omics WS 2011 8

Page 16: GSNAP: Fast and SNP-tolerant detection of complex variants ... · GSNAP: Fast and SNP-tolerant detection of complex variants and splic-ing in short reads by Thomas D. Wu and Serban

Key Features of GSNAP

É Can handle short and long insertions and deletionsÉ Detects short and long distance splicing (including

interchromosomal)É user-provided database of splice sites (eg. RefSeq)

É probabilistic model

É SNP tolerant (given a user-provided database eg. dbSNP)É Can align bisulfite-treated DNA (for studying methlyation state)É Still pretty fast

,

FU Berlin, GSNAP, Omics WS 2011 8

Page 17: GSNAP: Fast and SNP-tolerant detection of complex variants ... · GSNAP: Fast and SNP-tolerant detection of complex variants and splic-ing in short reads by Thomas D. Wu and Serban

Key Features of GSNAP

É Can handle short and long insertions and deletionsÉ Detects short and long distance splicing (including

interchromosomal)É user-provided database of splice sites (eg. RefSeq)É probabilistic model

É SNP tolerant (given a user-provided database eg. dbSNP)É Can align bisulfite-treated DNA (for studying methlyation state)É Still pretty fast

,

FU Berlin, GSNAP, Omics WS 2011 8

Page 18: GSNAP: Fast and SNP-tolerant detection of complex variants ... · GSNAP: Fast and SNP-tolerant detection of complex variants and splic-ing in short reads by Thomas D. Wu and Serban

Key Features of GSNAP

É Can handle short and long insertions and deletionsÉ Detects short and long distance splicing (including

interchromosomal)É user-provided database of splice sites (eg. RefSeq)É probabilistic model

É SNP tolerant (given a user-provided database eg. dbSNP)

É Can align bisulfite-treated DNA (for studying methlyation state)É Still pretty fast

,

FU Berlin, GSNAP, Omics WS 2011 8

Page 19: GSNAP: Fast and SNP-tolerant detection of complex variants ... · GSNAP: Fast and SNP-tolerant detection of complex variants and splic-ing in short reads by Thomas D. Wu and Serban

Key Features of GSNAP

É Can handle short and long insertions and deletionsÉ Detects short and long distance splicing (including

interchromosomal)É user-provided database of splice sites (eg. RefSeq)É probabilistic model

É SNP tolerant (given a user-provided database eg. dbSNP)É Can align bisulfite-treated DNA (for studying methlyation state)

É Still pretty fast

,

FU Berlin, GSNAP, Omics WS 2011 8

Page 20: GSNAP: Fast and SNP-tolerant detection of complex variants ... · GSNAP: Fast and SNP-tolerant detection of complex variants and splic-ing in short reads by Thomas D. Wu and Serban

Key Features of GSNAP

É Can handle short and long insertions and deletionsÉ Detects short and long distance splicing (including

interchromosomal)É user-provided database of splice sites (eg. RefSeq)É probabilistic model

É SNP tolerant (given a user-provided database eg. dbSNP)É Can align bisulfite-treated DNA (for studying methlyation state)É Still pretty fast

,

FU Berlin, GSNAP, Omics WS 2011 8

Page 21: GSNAP: Fast and SNP-tolerant detection of complex variants ... · GSNAP: Fast and SNP-tolerant detection of complex variants and splic-ing in short reads by Thomas D. Wu and Serban

Outline

IntroductionMotivationGSNAP FeaturesExamples of Complex Variant Detection

The AlgorithmSummaryPreprocessingMethod 1: Spanning Set Generation and FilteringMethod 2: Complete Set Generation and FilteringVerification of Candidate RegionsDetecting Insertions and DeletionsDetecting Splice Junctions

ResultsSimulated ReadsTranscriptional ReadsLimitations

Conclusions

,

FU Berlin, GSNAP, Omics WS 2011 9

Page 22: GSNAP: Fast and SNP-tolerant detection of complex variants ... · GSNAP: Fast and SNP-tolerant detection of complex variants and splic-ing in short reads by Thomas D. Wu and Serban

Long deletion of 17nt plus mismatches

É 17nt deletion matching an entry in dbSNP, including mismatches:

,

FU Berlin, GSNAP, Omics WS 2011 10

Page 23: GSNAP: Fast and SNP-tolerant detection of complex variants ... · GSNAP: Fast and SNP-tolerant detection of complex variants and splic-ing in short reads by Thomas D. Wu and Serban

Splicing identified using the probabilisticmodel

É An intron within exon 1 of HOXA9. Is also experimentally supported.

,

FU Berlin, GSNAP, Omics WS 2011 11

Page 24: GSNAP: Fast and SNP-tolerant detection of complex variants ... · GSNAP: Fast and SNP-tolerant detection of complex variants and splic-ing in short reads by Thomas D. Wu and Serban

Splicing identified using a database of knownsplice sites

É Splicing sites identified despite having low probabilistic scores.

,

FU Berlin, GSNAP, Omics WS 2011 12

Page 25: GSNAP: Fast and SNP-tolerant detection of complex variants ... · GSNAP: Fast and SNP-tolerant detection of complex variants and splic-ing in short reads by Thomas D. Wu and Serban

Interchromosomal splicing (gene fusion)

É Splicing between BCAS4 (chr 20) and BCAS3 (chr 17).

,

FU Berlin, GSNAP, Omics WS 2011 13

Page 26: GSNAP: Fast and SNP-tolerant detection of complex variants ... · GSNAP: Fast and SNP-tolerant detection of complex variants and splic-ing in short reads by Thomas D. Wu and Serban

SNP-tolerant alignment

É SNP-tolerance allows both genotypes to align well

,

FU Berlin, GSNAP, Omics WS 2011 14

Page 27: GSNAP: Fast and SNP-tolerant detection of complex variants ... · GSNAP: Fast and SNP-tolerant detection of complex variants and splic-ing in short reads by Thomas D. Wu and Serban

Outline

IntroductionMotivationGSNAP FeaturesExamples of Complex Variant Detection

The AlgorithmSummaryPreprocessingMethod 1: Spanning Set Generation and FilteringMethod 2: Complete Set Generation and FilteringVerification of Candidate RegionsDetecting Insertions and DeletionsDetecting Splice Junctions

ResultsSimulated ReadsTranscriptional ReadsLimitations

Conclusions

,

FU Berlin, GSNAP, Omics WS 2011 15

Page 28: GSNAP: Fast and SNP-tolerant detection of complex variants ... · GSNAP: Fast and SNP-tolerant detection of complex variants and splic-ing in short reads by Thomas D. Wu and Serban

The Algorithm in Brief

É Index the genome using a hash table

É Break up short reads into shorter elements and look each up in hashtable

É Look at resulting position lists for each element and see if theysupport a common target location and have a reasonable number ofmismatches

É Verify the number of mismatches by checking the whole read againstthe reference

,

FU Berlin, GSNAP, Omics WS 2011 16

Page 29: GSNAP: Fast and SNP-tolerant detection of complex variants ... · GSNAP: Fast and SNP-tolerant detection of complex variants and splic-ing in short reads by Thomas D. Wu and Serban

The Algorithm in Brief

É Index the genome using a hash tableÉ Break up short reads into shorter elements and look each up in hash

table

É Look at resulting position lists for each element and see if theysupport a common target location and have a reasonable number ofmismatches

É Verify the number of mismatches by checking the whole read againstthe reference

,

FU Berlin, GSNAP, Omics WS 2011 16

Page 30: GSNAP: Fast and SNP-tolerant detection of complex variants ... · GSNAP: Fast and SNP-tolerant detection of complex variants and splic-ing in short reads by Thomas D. Wu and Serban

The Algorithm in Brief

É Index the genome using a hash tableÉ Break up short reads into shorter elements and look each up in hash

tableÉ Look at resulting position lists for each element and see if they

support a common target location and have a reasonable number ofmismatches

É Verify the number of mismatches by checking the whole read againstthe reference

,

FU Berlin, GSNAP, Omics WS 2011 16

Page 31: GSNAP: Fast and SNP-tolerant detection of complex variants ... · GSNAP: Fast and SNP-tolerant detection of complex variants and splic-ing in short reads by Thomas D. Wu and Serban

The Algorithm in Brief

É Index the genome using a hash tableÉ Break up short reads into shorter elements and look each up in hash

tableÉ Look at resulting position lists for each element and see if they

support a common target location and have a reasonable number ofmismatches

É Verify the number of mismatches by checking the whole read againstthe reference

,

FU Berlin, GSNAP, Omics WS 2011 16

Page 32: GSNAP: Fast and SNP-tolerant detection of complex variants ... · GSNAP: Fast and SNP-tolerant detection of complex variants and splic-ing in short reads by Thomas D. Wu and Serban

Outline

IntroductionMotivationGSNAP FeaturesExamples of Complex Variant Detection

The AlgorithmSummaryPreprocessingMethod 1: Spanning Set Generation and FilteringMethod 2: Complete Set Generation and FilteringVerification of Candidate RegionsDetecting Insertions and DeletionsDetecting Splice Junctions

ResultsSimulated ReadsTranscriptional ReadsLimitations

Conclusions

,

FU Berlin, GSNAP, Omics WS 2011 17

Page 33: GSNAP: Fast and SNP-tolerant detection of complex variants ... · GSNAP: Fast and SNP-tolerant detection of complex variants and splic-ing in short reads by Thomas D. Wu and Serban

Hash Table Space Requirements

É Hashing every overlapping oligomer in the entire reference sequencewould take too much memory (approx. 14GB for the human genome)

É We only hash 12mers every 3nt (approx. 4GB)É Optional SNP-tolerance only adds a small amount to total memory

requirements (3.8 GB → 4.0 GB)É Entire table only needs to be in memory during construction.

Afterwards it is mmap’d and only part is loaded into memory.

,

FU Berlin, GSNAP, Omics WS 2011 18

Page 34: GSNAP: Fast and SNP-tolerant detection of complex variants ... · GSNAP: Fast and SNP-tolerant detection of complex variants and splic-ing in short reads by Thomas D. Wu and Serban

Hash Table Space Requirements

É Hashing every overlapping oligomer in the entire reference sequencewould take too much memory (approx. 14GB for the human genome)

É We only hash 12mers every 3nt (approx. 4GB)

É Optional SNP-tolerance only adds a small amount to total memoryrequirements (3.8 GB → 4.0 GB)

É Entire table only needs to be in memory during construction.Afterwards it is mmap’d and only part is loaded into memory.

,

FU Berlin, GSNAP, Omics WS 2011 18

Page 35: GSNAP: Fast and SNP-tolerant detection of complex variants ... · GSNAP: Fast and SNP-tolerant detection of complex variants and splic-ing in short reads by Thomas D. Wu and Serban

Hash Table Space Requirements

É Hashing every overlapping oligomer in the entire reference sequencewould take too much memory (approx. 14GB for the human genome)

É We only hash 12mers every 3nt (approx. 4GB)É Optional SNP-tolerance only adds a small amount to total memory

requirements (3.8 GB → 4.0 GB)

É Entire table only needs to be in memory during construction.Afterwards it is mmap’d and only part is loaded into memory.

,

FU Berlin, GSNAP, Omics WS 2011 18

Page 36: GSNAP: Fast and SNP-tolerant detection of complex variants ... · GSNAP: Fast and SNP-tolerant detection of complex variants and splic-ing in short reads by Thomas D. Wu and Serban

Hash Table Space Requirements

É Hashing every overlapping oligomer in the entire reference sequencewould take too much memory (approx. 14GB for the human genome)

É We only hash 12mers every 3nt (approx. 4GB)É Optional SNP-tolerance only adds a small amount to total memory

requirements (3.8 GB → 4.0 GB)É Entire table only needs to be in memory during construction.

Afterwards it is mmap’d and only part is loaded into memory.

,

FU Berlin, GSNAP, Omics WS 2011 18

Page 37: GSNAP: Fast and SNP-tolerant detection of complex variants ... · GSNAP: Fast and SNP-tolerant detection of complex variants and splic-ing in short reads by Thomas D. Wu and Serban

Hashing the Reference Genome

,

FU Berlin, GSNAP, Omics WS 2011 19

Page 38: GSNAP: Fast and SNP-tolerant detection of complex variants ... · GSNAP: Fast and SNP-tolerant detection of complex variants and splic-ing in short reads by Thomas D. Wu and Serban

Including SNPs

,

FU Berlin, GSNAP, Omics WS 2011 20

Page 39: GSNAP: Fast and SNP-tolerant detection of complex variants ... · GSNAP: Fast and SNP-tolerant detection of complex variants and splic-ing in short reads by Thomas D. Wu and Serban

Resulting Reference "Space"

,

FU Berlin, GSNAP, Omics WS 2011 21

Page 40: GSNAP: Fast and SNP-tolerant detection of complex variants ... · GSNAP: Fast and SNP-tolerant detection of complex variants and splic-ing in short reads by Thomas D. Wu and Serban

Outline

IntroductionMotivationGSNAP FeaturesExamples of Complex Variant Detection

The AlgorithmSummaryPreprocessingMethod 1: Spanning Set Generation and FilteringMethod 2: Complete Set Generation and FilteringVerification of Candidate RegionsDetecting Insertions and DeletionsDetecting Splice Junctions

ResultsSimulated ReadsTranscriptional ReadsLimitations

Conclusions

,

FU Berlin, GSNAP, Omics WS 2011 22

Page 41: GSNAP: Fast and SNP-tolerant detection of complex variants ... · GSNAP: Fast and SNP-tolerant detection of complex variants and splic-ing in short reads by Thomas D. Wu and Serban

Spanning Set Generation and Filtering

,

FU Berlin, GSNAP, Omics WS 2011 23

Page 42: GSNAP: Fast and SNP-tolerant detection of complex variants ... · GSNAP: Fast and SNP-tolerant detection of complex variants and splic-ing in short reads by Thomas D. Wu and Serban

Spanning Set Generation and Filtering

,

FU Berlin, GSNAP, Omics WS 2011 24

Page 43: GSNAP: Fast and SNP-tolerant detection of complex variants ... · GSNAP: Fast and SNP-tolerant detection of complex variants and splic-ing in short reads by Thomas D. Wu and Serban

Spanning Set Generation and Filtering

,

FU Berlin, GSNAP, Omics WS 2011 25

Page 44: GSNAP: Fast and SNP-tolerant detection of complex variants ... · GSNAP: Fast and SNP-tolerant detection of complex variants and splic-ing in short reads by Thomas D. Wu and Serban

Spanning Set Generation and Filtering

,

FU Berlin, GSNAP, Omics WS 2011 26

Page 45: GSNAP: Fast and SNP-tolerant detection of complex variants ... · GSNAP: Fast and SNP-tolerant detection of complex variants and splic-ing in short reads by Thomas D. Wu and Serban

Spanning Set Generation and Filtering

,

FU Berlin, GSNAP, Omics WS 2011 27

Page 46: GSNAP: Fast and SNP-tolerant detection of complex variants ... · GSNAP: Fast and SNP-tolerant detection of complex variants and splic-ing in short reads by Thomas D. Wu and Serban

Spanning Set Generation and Filtering

,

FU Berlin, GSNAP, Omics WS 2011 28

Page 47: GSNAP: Fast and SNP-tolerant detection of complex variants ... · GSNAP: Fast and SNP-tolerant detection of complex variants and splic-ing in short reads by Thomas D. Wu and Serban

Spanning Set Generation and Filtering

,

FU Berlin, GSNAP, Omics WS 2011 29

Page 48: GSNAP: Fast and SNP-tolerant detection of complex variants ... · GSNAP: Fast and SNP-tolerant detection of complex variants and splic-ing in short reads by Thomas D. Wu and Serban

Spanning Set Generation and Filtering

É Generating elements: used to find supporting candidate locationsÉ Uses a multiway merging procedure (Knuth TAOCP Vol 3)É Slow: linear on the sum of the list lengths. O(l logn) runtime where n is

the number of position lists and l is the sum of their lengths.

É Filtering elements: filter the candidate locations found using thegenerating elementsÉ Fast: uses a binary search. O(log li) for position list i

É Choose the elements with the shortest position lists as generatingelements, and the longer ones as filtering elements.

É They choose K + 2 generating sets, where K is the constraint score (=max number of mismatches)

,

FU Berlin, GSNAP, Omics WS 2011 30

Page 49: GSNAP: Fast and SNP-tolerant detection of complex variants ... · GSNAP: Fast and SNP-tolerant detection of complex variants and splic-ing in short reads by Thomas D. Wu and Serban

Spanning Set Generation and Filtering

É Generating elements: used to find supporting candidate locationsÉ Uses a multiway merging procedure (Knuth TAOCP Vol 3)É Slow: linear on the sum of the list lengths. O(l logn) runtime where n is

the number of position lists and l is the sum of their lengths.É Filtering elements: filter the candidate locations found using the

generating elementsÉ Fast: uses a binary search. O(log li) for position list i

É Choose the elements with the shortest position lists as generatingelements, and the longer ones as filtering elements.

É They choose K + 2 generating sets, where K is the constraint score (=max number of mismatches)

,

FU Berlin, GSNAP, Omics WS 2011 30

Page 50: GSNAP: Fast and SNP-tolerant detection of complex variants ... · GSNAP: Fast and SNP-tolerant detection of complex variants and splic-ing in short reads by Thomas D. Wu and Serban

Spanning Set Generation and Filtering

É Generating elements: used to find supporting candidate locationsÉ Uses a multiway merging procedure (Knuth TAOCP Vol 3)É Slow: linear on the sum of the list lengths. O(l logn) runtime where n is

the number of position lists and l is the sum of their lengths.É Filtering elements: filter the candidate locations found using the

generating elementsÉ Fast: uses a binary search. O(log li) for position list i

É Choose the elements with the shortest position lists as generatingelements, and the longer ones as filtering elements.

É They choose K + 2 generating sets, where K is the constraint score (=max number of mismatches)

,

FU Berlin, GSNAP, Omics WS 2011 30

Page 51: GSNAP: Fast and SNP-tolerant detection of complex variants ... · GSNAP: Fast and SNP-tolerant detection of complex variants and splic-ing in short reads by Thomas D. Wu and Serban

Spanning Set Generation and Filtering

É Generating elements: used to find supporting candidate locationsÉ Uses a multiway merging procedure (Knuth TAOCP Vol 3)É Slow: linear on the sum of the list lengths. O(l logn) runtime where n is

the number of position lists and l is the sum of their lengths.É Filtering elements: filter the candidate locations found using the

generating elementsÉ Fast: uses a binary search. O(log li) for position list i

É Choose the elements with the shortest position lists as generatingelements, and the longer ones as filtering elements.

É They choose K + 2 generating sets, where K is the constraint score (=max number of mismatches)

,

FU Berlin, GSNAP, Omics WS 2011 30

Page 52: GSNAP: Fast and SNP-tolerant detection of complex variants ... · GSNAP: Fast and SNP-tolerant detection of complex variants and splic-ing in short reads by Thomas D. Wu and Serban

Spanning Set Generation and Filtering

,

FU Berlin, GSNAP, Omics WS 2011 31

Page 53: GSNAP: Fast and SNP-tolerant detection of complex variants ... · GSNAP: Fast and SNP-tolerant detection of complex variants and splic-ing in short reads by Thomas D. Wu and Serban

Spanning Set Generation and Filtering

,

FU Berlin, GSNAP, Omics WS 2011 32

Page 54: GSNAP: Fast and SNP-tolerant detection of complex variants ... · GSNAP: Fast and SNP-tolerant detection of complex variants and splic-ing in short reads by Thomas D. Wu and Serban

Spanning Set Generation and Filtering

,

FU Berlin, GSNAP, Omics WS 2011 33

Page 55: GSNAP: Fast and SNP-tolerant detection of complex variants ... · GSNAP: Fast and SNP-tolerant detection of complex variants and splic-ing in short reads by Thomas D. Wu and Serban

Spanning Set Generation and Filtering

É Problem: The spanning set method can only be used to detect alimited number of mismatches.

É If we want to allow K matches, for K > 1, then we need at least(K + 2) generating elements. The read length L limits the number ofelements N, and the spanning set elements are non-overlapping in allthree shifts if L = 10(mod12), so N ≤ b(L+ 2)/12c.

É So we need to satisfy N > (K + 2), and as such we can only apply thespanning set method when K ≤ b(L+ 2)/12c − 2 for L ≥ 34.

É For reads of length 100 (Illumina), the we can allow a maximum of 6mismatches.

É For reads of length 400 (454), the we can allow a maximum of 31mismatches.

É If we want to allow larger numbers of mismatches or the samenumber of mismatches in shorter reads, we need to use anothermethod. . .

,

FU Berlin, GSNAP, Omics WS 2011 34

Page 56: GSNAP: Fast and SNP-tolerant detection of complex variants ... · GSNAP: Fast and SNP-tolerant detection of complex variants and splic-ing in short reads by Thomas D. Wu and Serban

Spanning Set Generation and Filtering

É Problem: The spanning set method can only be used to detect alimited number of mismatches.

É If we want to allow K matches, for K > 1, then we need at least(K + 2) generating elements. The read length L limits the number ofelements N, and the spanning set elements are non-overlapping in allthree shifts if L = 10(mod12), so N ≤ b(L+ 2)/12c.

É So we need to satisfy N > (K + 2), and as such we can only apply thespanning set method when K ≤ b(L+ 2)/12c − 2 for L ≥ 34.

É For reads of length 100 (Illumina), the we can allow a maximum of 6mismatches.

É For reads of length 400 (454), the we can allow a maximum of 31mismatches.

É If we want to allow larger numbers of mismatches or the samenumber of mismatches in shorter reads, we need to use anothermethod. . .

,

FU Berlin, GSNAP, Omics WS 2011 34

Page 57: GSNAP: Fast and SNP-tolerant detection of complex variants ... · GSNAP: Fast and SNP-tolerant detection of complex variants and splic-ing in short reads by Thomas D. Wu and Serban

Spanning Set Generation and Filtering

É Problem: The spanning set method can only be used to detect alimited number of mismatches.

É If we want to allow K matches, for K > 1, then we need at least(K + 2) generating elements. The read length L limits the number ofelements N, and the spanning set elements are non-overlapping in allthree shifts if L = 10(mod12), so N ≤ b(L+ 2)/12c.

É So we need to satisfy N > (K + 2), and as such we can only apply thespanning set method when K ≤ b(L+ 2)/12c − 2 for L ≥ 34.

É For reads of length 100 (Illumina), the we can allow a maximum of 6mismatches.

É For reads of length 400 (454), the we can allow a maximum of 31mismatches.

É If we want to allow larger numbers of mismatches or the samenumber of mismatches in shorter reads, we need to use anothermethod. . .

,

FU Berlin, GSNAP, Omics WS 2011 34

Page 58: GSNAP: Fast and SNP-tolerant detection of complex variants ... · GSNAP: Fast and SNP-tolerant detection of complex variants and splic-ing in short reads by Thomas D. Wu and Serban

Spanning Set Generation and Filtering

É Problem: The spanning set method can only be used to detect alimited number of mismatches.

É If we want to allow K matches, for K > 1, then we need at least(K + 2) generating elements. The read length L limits the number ofelements N, and the spanning set elements are non-overlapping in allthree shifts if L = 10(mod12), so N ≤ b(L+ 2)/12c.

É So we need to satisfy N > (K + 2), and as such we can only apply thespanning set method when K ≤ b(L+ 2)/12c − 2 for L ≥ 34.

É For reads of length 100 (Illumina), the we can allow a maximum of 6mismatches.

É For reads of length 400 (454), the we can allow a maximum of 31mismatches.

É If we want to allow larger numbers of mismatches or the samenumber of mismatches in shorter reads, we need to use anothermethod. . .

,

FU Berlin, GSNAP, Omics WS 2011 34

Page 59: GSNAP: Fast and SNP-tolerant detection of complex variants ... · GSNAP: Fast and SNP-tolerant detection of complex variants and splic-ing in short reads by Thomas D. Wu and Serban

Spanning Set Generation and Filtering

É Problem: The spanning set method can only be used to detect alimited number of mismatches.

É If we want to allow K matches, for K > 1, then we need at least(K + 2) generating elements. The read length L limits the number ofelements N, and the spanning set elements are non-overlapping in allthree shifts if L = 10(mod12), so N ≤ b(L+ 2)/12c.

É So we need to satisfy N > (K + 2), and as such we can only apply thespanning set method when K ≤ b(L+ 2)/12c − 2 for L ≥ 34.

É For reads of length 100 (Illumina), the we can allow a maximum of 6mismatches.

É For reads of length 400 (454), the we can allow a maximum of 31mismatches.

É If we want to allow larger numbers of mismatches or the samenumber of mismatches in shorter reads, we need to use anothermethod. . .

,

FU Berlin, GSNAP, Omics WS 2011 34

Page 60: GSNAP: Fast and SNP-tolerant detection of complex variants ... · GSNAP: Fast and SNP-tolerant detection of complex variants and splic-ing in short reads by Thomas D. Wu and Serban

Outline

IntroductionMotivationGSNAP FeaturesExamples of Complex Variant Detection

The AlgorithmSummaryPreprocessingMethod 1: Spanning Set Generation and FilteringMethod 2: Complete Set Generation and FilteringVerification of Candidate RegionsDetecting Insertions and DeletionsDetecting Splice Junctions

ResultsSimulated ReadsTranscriptional ReadsLimitations

Conclusions

,

FU Berlin, GSNAP, Omics WS 2011 35

Page 61: GSNAP: Fast and SNP-tolerant detection of complex variants ... · GSNAP: Fast and SNP-tolerant detection of complex variants and splic-ing in short reads by Thomas D. Wu and Serban

Complete set generation and filtering

É Uses the complete set of overlapping 12mers.É Works for any number of mismatches as long as read and target have≥ 14 consecutive matches (12mer out of phase by up to two bases)

É Exhaustive for K ≤ bL/14c − 1

,

FU Berlin, GSNAP, Omics WS 2011 36

Page 62: GSNAP: Fast and SNP-tolerant detection of complex variants ... · GSNAP: Fast and SNP-tolerant detection of complex variants and splic-ing in short reads by Thomas D. Wu and Serban

Complete set generation and filtering

É Lower bound on mismatches: b(∆p+ 6)/12cwhere ∆p is the distance between start locations of consecutivesupporting 12mers.

,

FU Berlin, GSNAP, Omics WS 2011 37

Page 63: GSNAP: Fast and SNP-tolerant detection of complex variants ... · GSNAP: Fast and SNP-tolerant detection of complex variants and splic-ing in short reads by Thomas D. Wu and Serban

Complete set generation and filtering

É Lower bound on mismatches: b(∆p+ 6)/12cwhere ∆p is the distance between start locations of consecutivesupporting 12mers.

,

FU Berlin, GSNAP, Omics WS 2011 37

Page 64: GSNAP: Fast and SNP-tolerant detection of complex variants ... · GSNAP: Fast and SNP-tolerant detection of complex variants and splic-ing in short reads by Thomas D. Wu and Serban

Complete set example

,

FU Berlin, GSNAP, Omics WS 2011 38

Page 65: GSNAP: Fast and SNP-tolerant detection of complex variants ... · GSNAP: Fast and SNP-tolerant detection of complex variants and splic-ing in short reads by Thomas D. Wu and Serban

Complete set generation and filtering

,

FU Berlin, GSNAP, Omics WS 2011 39

Page 66: GSNAP: Fast and SNP-tolerant detection of complex variants ... · GSNAP: Fast and SNP-tolerant detection of complex variants and splic-ing in short reads by Thomas D. Wu and Serban

Outline

IntroductionMotivationGSNAP FeaturesExamples of Complex Variant Detection

The AlgorithmSummaryPreprocessingMethod 1: Spanning Set Generation and FilteringMethod 2: Complete Set Generation and FilteringVerification of Candidate RegionsDetecting Insertions and DeletionsDetecting Splice Junctions

ResultsSimulated ReadsTranscriptional ReadsLimitations

Conclusions

,

FU Berlin, GSNAP, Omics WS 2011 40

Page 67: GSNAP: Fast and SNP-tolerant detection of complex variants ... · GSNAP: Fast and SNP-tolerant detection of complex variants and splic-ing in short reads by Thomas D. Wu and Serban

Verification of Candidate Regions

É The spanning set and complete set methods generate candidateregions for which we know a lower bound on the number ofmismatches.

É These regions need to be verified to check the exact number ofmismatches.

,

FU Berlin, GSNAP, Omics WS 2011 41

Page 68: GSNAP: Fast and SNP-tolerant detection of complex variants ... · GSNAP: Fast and SNP-tolerant detection of complex variants and splic-ing in short reads by Thomas D. Wu and Serban

Remember: Resulting Reference "Space"

,

FU Berlin, GSNAP, Omics WS 2011 42

Page 69: GSNAP: Fast and SNP-tolerant detection of complex variants ... · GSNAP: Fast and SNP-tolerant detection of complex variants and splic-ing in short reads by Thomas D. Wu and Serban

Compressed Genome

É Text usually stored as 8 bit characters

É Because we have a reduced alphabet the reference is stored as 3 bitsper character: 2 bits for the nt + a flagÉ Flag in major-allele genome: indicates unknown or ambiguous ntÉ Flag in minor-allele genome: indicates a SNP

,

FU Berlin, GSNAP, Omics WS 2011 43

Page 70: GSNAP: Fast and SNP-tolerant detection of complex variants ... · GSNAP: Fast and SNP-tolerant detection of complex variants and splic-ing in short reads by Thomas D. Wu and Serban

Compressed Genome

É Text usually stored as 8 bit charactersÉ Because we have a reduced alphabet the reference is stored as 3 bits

per character: 2 bits for the nt + a flag

É Flag in major-allele genome: indicates unknown or ambiguous ntÉ Flag in minor-allele genome: indicates a SNP

,

FU Berlin, GSNAP, Omics WS 2011 43

Page 71: GSNAP: Fast and SNP-tolerant detection of complex variants ... · GSNAP: Fast and SNP-tolerant detection of complex variants and splic-ing in short reads by Thomas D. Wu and Serban

Compressed Genome

É Text usually stored as 8 bit charactersÉ Because we have a reduced alphabet the reference is stored as 3 bits

per character: 2 bits for the nt + a flagÉ Flag in major-allele genome: indicates unknown or ambiguous ntÉ Flag in minor-allele genome: indicates a SNP

,

FU Berlin, GSNAP, Omics WS 2011 43

Page 72: GSNAP: Fast and SNP-tolerant detection of complex variants ... · GSNAP: Fast and SNP-tolerant detection of complex variants and splic-ing in short reads by Thomas D. Wu and Serban

Verification of Candidate Regions

É Query sequence converted to the same compressed representationas the reference

É Shifted into position and bitwise XOR combined with the major- andminor-allele genomes separately

É Resulting arrays are bitwise AND’d, so mismatches at a SNP onlyoccur if both alleles do not match

,

FU Berlin, GSNAP, Omics WS 2011 44

Page 73: GSNAP: Fast and SNP-tolerant detection of complex variants ... · GSNAP: Fast and SNP-tolerant detection of complex variants and splic-ing in short reads by Thomas D. Wu and Serban

Outline

IntroductionMotivationGSNAP FeaturesExamples of Complex Variant Detection

The AlgorithmSummaryPreprocessingMethod 1: Spanning Set Generation and FilteringMethod 2: Complete Set Generation and FilteringVerification of Candidate RegionsDetecting Insertions and DeletionsDetecting Splice Junctions

ResultsSimulated ReadsTranscriptional ReadsLimitations

Conclusions

,

FU Berlin, GSNAP, Omics WS 2011 45

Page 74: GSNAP: Fast and SNP-tolerant detection of complex variants ... · GSNAP: Fast and SNP-tolerant detection of complex variants and splic-ing in short reads by Thomas D. Wu and Serban

Detecting Insertions and Deletions

,

FU Berlin, GSNAP, Omics WS 2011 46

Page 75: GSNAP: Fast and SNP-tolerant detection of complex variants ... · GSNAP: Fast and SNP-tolerant detection of complex variants and splic-ing in short reads by Thomas D. Wu and Serban

Detecting Insertions and Deletions

,

FU Berlin, GSNAP, Omics WS 2011 47

Page 76: GSNAP: Fast and SNP-tolerant detection of complex variants ... · GSNAP: Fast and SNP-tolerant detection of complex variants and splic-ing in short reads by Thomas D. Wu and Serban

Detecting Insertions and Deletions

,

FU Berlin, GSNAP, Omics WS 2011 48

Page 77: GSNAP: Fast and SNP-tolerant detection of complex variants ... · GSNAP: Fast and SNP-tolerant detection of complex variants and splic-ing in short reads by Thomas D. Wu and Serban

Detecting Insertions and Deletions

,

FU Berlin, GSNAP, Omics WS 2011 49

Page 78: GSNAP: Fast and SNP-tolerant detection of complex variants ... · GSNAP: Fast and SNP-tolerant detection of complex variants and splic-ing in short reads by Thomas D. Wu and Serban

Outline

IntroductionMotivationGSNAP FeaturesExamples of Complex Variant Detection

The AlgorithmSummaryPreprocessingMethod 1: Spanning Set Generation and FilteringMethod 2: Complete Set Generation and FilteringVerification of Candidate RegionsDetecting Insertions and DeletionsDetecting Splice Junctions

ResultsSimulated ReadsTranscriptional ReadsLimitations

Conclusions

,

FU Berlin, GSNAP, Omics WS 2011 50

Page 79: GSNAP: Fast and SNP-tolerant detection of complex variants ... · GSNAP: Fast and SNP-tolerant detection of complex variants and splic-ing in short reads by Thomas D. Wu and Serban

Detecting Splice Junctions

É Known splice sites: user-provided databaseÉ Novel splice sites: maximum entropy probabilistic model from Yeo

and Burge, 2004

,

FU Berlin, GSNAP, Omics WS 2011 51

Page 80: GSNAP: Fast and SNP-tolerant detection of complex variants ... · GSNAP: Fast and SNP-tolerant detection of complex variants and splic-ing in short reads by Thomas D. Wu and Serban

Detecting Splice Junctions

É Short-distance splice sites are on the same chromosome and < somedistance apart (default: 200,000 nt)

É Method similar to the one we used to find middle deletions earlier. . .

,

FU Berlin, GSNAP, Omics WS 2011 52

Page 81: GSNAP: Fast and SNP-tolerant detection of complex variants ... · GSNAP: Fast and SNP-tolerant detection of complex variants and splic-ing in short reads by Thomas D. Wu and Serban

Detecting Splice Junctions

É Crossover area is then searched for donor or acceptor sites (eitherknown or novel with high probability).

,

FU Berlin, GSNAP, Omics WS 2011 53

Page 82: GSNAP: Fast and SNP-tolerant detection of complex variants ... · GSNAP: Fast and SNP-tolerant detection of complex variants and splic-ing in short reads by Thomas D. Wu and Serban

Detecting Splice Junctions

É Crossover area is then searched for donor or acceptor sites (eitherknown or novel with high probability).

,

FU Berlin, GSNAP, Omics WS 2011 53

Page 83: GSNAP: Fast and SNP-tolerant detection of complex variants ... · GSNAP: Fast and SNP-tolerant detection of complex variants and splic-ing in short reads by Thomas D. Wu and Serban

Detecting Splice Junctions

É Long-distance splice sites can be on different chromosomesÉ Require higher probability scores for novel splice sites than

short-distance splice sitesÉ Candidates with matching breakpoints on the read are matched

,

FU Berlin, GSNAP, Omics WS 2011 54

Page 84: GSNAP: Fast and SNP-tolerant detection of complex variants ... · GSNAP: Fast and SNP-tolerant detection of complex variants and splic-ing in short reads by Thomas D. Wu and Serban

Detecting Splice Junctions

,

FU Berlin, GSNAP, Omics WS 2011 55

Page 85: GSNAP: Fast and SNP-tolerant detection of complex variants ... · GSNAP: Fast and SNP-tolerant detection of complex variants and splic-ing in short reads by Thomas D. Wu and Serban

Detecting Splice Junctions

É If both splice sites can not be found then GSNAP will return one site(a "half-intron")

,

FU Berlin, GSNAP, Omics WS 2011 56

Page 86: GSNAP: Fast and SNP-tolerant detection of complex variants ... · GSNAP: Fast and SNP-tolerant detection of complex variants and splic-ing in short reads by Thomas D. Wu and Serban

Outline

IntroductionMotivationGSNAP FeaturesExamples of Complex Variant Detection

The AlgorithmSummaryPreprocessingMethod 1: Spanning Set Generation and FilteringMethod 2: Complete Set Generation and FilteringVerification of Candidate RegionsDetecting Insertions and DeletionsDetecting Splice Junctions

ResultsSimulated ReadsTranscriptional ReadsLimitations

Conclusions

,

FU Berlin, GSNAP, Omics WS 2011 57

Page 87: GSNAP: Fast and SNP-tolerant detection of complex variants ... · GSNAP: Fast and SNP-tolerant detection of complex variants and splic-ing in short reads by Thomas D. Wu and Serban

Simulated Reads

É Runtime comparison between GSNAP and other alignment tools (for100,000 reads)

É Simulated increasingly complicated variantsÉ exact matches onlyÉ 1 - 3 mismatchesÉ short insertions and deletionsÉ longer insertions and deletions

,

FU Berlin, GSNAP, Omics WS 2011 58

Page 88: GSNAP: Fast and SNP-tolerant detection of complex variants ... · GSNAP: Fast and SNP-tolerant detection of complex variants and splic-ing in short reads by Thomas D. Wu and Serban

Simulated Reads: 36nt

GSNAP BWA Bowtie SOAP2 SOAP MAQ

Variant: Exact

Mapping Software

Run

time

(s)

050

010

0015

0020

0025

0030

00

,

FU Berlin, GSNAP, Omics WS 2011 59

Page 89: GSNAP: Fast and SNP-tolerant detection of complex variants ... · GSNAP: Fast and SNP-tolerant detection of complex variants and splic-ing in short reads by Thomas D. Wu and Serban

Simulated Reads: 36nt

GSNAP BWA Bowtie SOAP2 SOAP MAQ

Variant: 1 mm

Mapping Software

Run

time

(s)

050

010

0015

0020

0025

0030

00

,

FU Berlin, GSNAP, Omics WS 2011 60

Page 90: GSNAP: Fast and SNP-tolerant detection of complex variants ... · GSNAP: Fast and SNP-tolerant detection of complex variants and splic-ing in short reads by Thomas D. Wu and Serban

Simulated Reads: 36nt

GSNAP BWA Bowtie SOAP2 SOAP MAQ

Variant: 2 mm

Mapping Software

Run

time

(s)

050

010

0015

0020

0025

0030

00

,

FU Berlin, GSNAP, Omics WS 2011 61

Page 91: GSNAP: Fast and SNP-tolerant detection of complex variants ... · GSNAP: Fast and SNP-tolerant detection of complex variants and splic-ing in short reads by Thomas D. Wu and Serban

Simulated Reads: 36nt

GSNAP BWA Bowtie SOAP2 SOAP MAQ

Variant: 3 mm

Mapping Software

Run

time

(s)

050

010

0015

0020

0025

0030

00

,

FU Berlin, GSNAP, Omics WS 2011 62

Page 92: GSNAP: Fast and SNP-tolerant detection of complex variants ... · GSNAP: Fast and SNP-tolerant detection of complex variants and splic-ing in short reads by Thomas D. Wu and Serban

Simulated Reads: 36nt

GSNAP BWA Bowtie SOAP2 SOAP MAQ

Variant: Ins (1...3nt)

Mapping Software

Run

time

(s)

050

010

0015

0020

0025

0030

00

,

FU Berlin, GSNAP, Omics WS 2011 63

Page 93: GSNAP: Fast and SNP-tolerant detection of complex variants ... · GSNAP: Fast and SNP-tolerant detection of complex variants and splic-ing in short reads by Thomas D. Wu and Serban

Simulated Reads: 36nt

GSNAP BWA Bowtie SOAP2 SOAP MAQ

Variant: Del (1...3nt)

Mapping Software

Run

time

(s)

050

010

0015

0020

0025

0030

00

,

FU Berlin, GSNAP, Omics WS 2011 64

Page 94: GSNAP: Fast and SNP-tolerant detection of complex variants ... · GSNAP: Fast and SNP-tolerant detection of complex variants and splic-ing in short reads by Thomas D. Wu and Serban

Simulated Reads: 36nt

GSNAP BWA Bowtie SOAP2 SOAP MAQ

Variant: Ins (4...9nt)

Mapping Software

Run

time

(s)

050

010

0015

0020

0025

0030

00

,

FU Berlin, GSNAP, Omics WS 2011 65

Page 95: GSNAP: Fast and SNP-tolerant detection of complex variants ... · GSNAP: Fast and SNP-tolerant detection of complex variants and splic-ing in short reads by Thomas D. Wu and Serban

Simulated Reads: 36nt

GSNAP BWA Bowtie SOAP2 SOAP MAQ

Variant: Del (4...30nt)

Mapping Software

Run

time

(s)

050

010

0015

0020

0025

0030

00

,

FU Berlin, GSNAP, Omics WS 2011 66

Page 96: GSNAP: Fast and SNP-tolerant detection of complex variants ... · GSNAP: Fast and SNP-tolerant detection of complex variants and splic-ing in short reads by Thomas D. Wu and Serban

Simulated Reads: 36nt

Exa

ct

1 m

m

2 m

m

3 m

m

Ins

(1...

3nt)

Del

(1.

..3nt

)

Ins

(4...

9nt)

Del

(4.

..30n

t)

Run

time

(s)

050

010

0015

0020

0025

0030

00

GSNAPBWABowtieSOAP2SOAPMAQ

,

FU Berlin, GSNAP, Omics WS 2011 67

Page 97: GSNAP: Fast and SNP-tolerant detection of complex variants ... · GSNAP: Fast and SNP-tolerant detection of complex variants and splic-ing in short reads by Thomas D. Wu and Serban

Simulated Reads: 70nt

Exa

ct

1 m

m

2 m

m

3 m

m

4 m

m

Ins

(1...

3nt)

Del

(1.

..3nt

)

Ins

(4...

9nt)

Del

(4.

..30n

t)

Run

time

(s)

050

010

0015

0020

0025

0030

00

GSNAPBWABowtieSOAP2SOAPMAQ

,

FU Berlin, GSNAP, Omics WS 2011 68

Page 98: GSNAP: Fast and SNP-tolerant detection of complex variants ... · GSNAP: Fast and SNP-tolerant detection of complex variants and splic-ing in short reads by Thomas D. Wu and Serban

Simulated Reads: 100nt

Exa

ct

1 m

m

2 m

m

3 m

m

4 m

m

5 m

m

Ins

(1...

3nt)

Del

(1.

..3nt

)

Ins

(4...

9nt)

Del

(4.

..30n

t)

Run

time

(s)

050

010

0015

0020

0025

0030

00

GSNAPBWABowtieSOAP2SOAPMAQ

,

FU Berlin, GSNAP, Omics WS 2011 69

Page 99: GSNAP: Fast and SNP-tolerant detection of complex variants ... · GSNAP: Fast and SNP-tolerant detection of complex variants and splic-ing in short reads by Thomas D. Wu and Serban

Simulated Reads: Percent misses (on uniquereads)

É Most algorithms were able to perfectly map unique reads, with thefollowing notable exceptions:

É GSNAP: 12% misses for 36nt reads with 3 mismatchesÉ SOAP: 15% misses for 36nt reads with 3 mismatches, around 5% for

1-3nt indelsÉ BWA: 1-5% misses in 1-3nt indels

,

FU Berlin, GSNAP, Omics WS 2011 70

Page 100: GSNAP: Fast and SNP-tolerant detection of complex variants ... · GSNAP: Fast and SNP-tolerant detection of complex variants and splic-ing in short reads by Thomas D. Wu and Serban

Simulated Reads: Percent misses (on uniquereads)

É Most algorithms were able to perfectly map unique reads, with thefollowing notable exceptions:

É GSNAP: 12% misses for 36nt reads with 3 mismatches

É SOAP: 15% misses for 36nt reads with 3 mismatches, around 5% for1-3nt indels

É BWA: 1-5% misses in 1-3nt indels

,

FU Berlin, GSNAP, Omics WS 2011 70

Page 101: GSNAP: Fast and SNP-tolerant detection of complex variants ... · GSNAP: Fast and SNP-tolerant detection of complex variants and splic-ing in short reads by Thomas D. Wu and Serban

Simulated Reads: Percent misses (on uniquereads)

É Most algorithms were able to perfectly map unique reads, with thefollowing notable exceptions:

É GSNAP: 12% misses for 36nt reads with 3 mismatchesÉ SOAP: 15% misses for 36nt reads with 3 mismatches, around 5% for

1-3nt indels

É BWA: 1-5% misses in 1-3nt indels

,

FU Berlin, GSNAP, Omics WS 2011 70

Page 102: GSNAP: Fast and SNP-tolerant detection of complex variants ... · GSNAP: Fast and SNP-tolerant detection of complex variants and splic-ing in short reads by Thomas D. Wu and Serban

Simulated Reads: Percent misses (on uniquereads)

É Most algorithms were able to perfectly map unique reads, with thefollowing notable exceptions:

É GSNAP: 12% misses for 36nt reads with 3 mismatchesÉ SOAP: 15% misses for 36nt reads with 3 mismatches, around 5% for

1-3nt indelsÉ BWA: 1-5% misses in 1-3nt indels

,

FU Berlin, GSNAP, Omics WS 2011 70

Page 103: GSNAP: Fast and SNP-tolerant detection of complex variants ... · GSNAP: Fast and SNP-tolerant detection of complex variants and splic-ing in short reads by Thomas D. Wu and Serban

Memory Requirements

É GSNAP: should have access to 5 GB of memory, otherwise it will runslowly

É BWA: 2.2 GBÉ Bowtie: 1.1 GB (exact matches) or 2.2 GB (allowing mismatches)É MAQ: 302 MBÉ SOAP: 14 GBÉ SOAP2: unknown ("only provided as a binary and did not have the

required compile time flag")

,

FU Berlin, GSNAP, Omics WS 2011 71

Page 104: GSNAP: Fast and SNP-tolerant detection of complex variants ... · GSNAP: Fast and SNP-tolerant detection of complex variants and splic-ing in short reads by Thomas D. Wu and Serban

Outline

IntroductionMotivationGSNAP FeaturesExamples of Complex Variant Detection

The AlgorithmSummaryPreprocessingMethod 1: Spanning Set Generation and FilteringMethod 2: Complete Set Generation and FilteringVerification of Candidate RegionsDetecting Insertions and DeletionsDetecting Splice Junctions

ResultsSimulated ReadsTranscriptional ReadsLimitations

Conclusions

,

FU Berlin, GSNAP, Omics WS 2011 72

Page 105: GSNAP: Fast and SNP-tolerant detection of complex variants ... · GSNAP: Fast and SNP-tolerant detection of complex variants and splic-ing in short reads by Thomas D. Wu and Serban

Transcriptional Reads

É Only looked at the effect of splicing, indels and SNP tolerance

É Including known splicing information → increased yield approx. 8%É Including SNP tolerance → minor increase in yield (0.5%) but effected

about 8% of alignments

,

FU Berlin, GSNAP, Omics WS 2011 73

Page 106: GSNAP: Fast and SNP-tolerant detection of complex variants ... · GSNAP: Fast and SNP-tolerant detection of complex variants and splic-ing in short reads by Thomas D. Wu and Serban

Transcriptional Reads

É Only looked at the effect of splicing, indels and SNP toleranceÉ Including known splicing information → increased yield approx. 8%

É Including SNP tolerance → minor increase in yield (0.5%) but effectedabout 8% of alignments

,

FU Berlin, GSNAP, Omics WS 2011 73

Page 107: GSNAP: Fast and SNP-tolerant detection of complex variants ... · GSNAP: Fast and SNP-tolerant detection of complex variants and splic-ing in short reads by Thomas D. Wu and Serban

Transcriptional Reads

É Only looked at the effect of splicing, indels and SNP toleranceÉ Including known splicing information → increased yield approx. 8%É Including SNP tolerance → minor increase in yield (0.5%) but effected

about 8% of alignments

,

FU Berlin, GSNAP, Omics WS 2011 73

Page 108: GSNAP: Fast and SNP-tolerant detection of complex variants ... · GSNAP: Fast and SNP-tolerant detection of complex variants and splic-ing in short reads by Thomas D. Wu and Serban

Outline

IntroductionMotivationGSNAP FeaturesExamples of Complex Variant Detection

The AlgorithmSummaryPreprocessingMethod 1: Spanning Set Generation and FilteringMethod 2: Complete Set Generation and FilteringVerification of Candidate RegionsDetecting Insertions and DeletionsDetecting Splice Junctions

ResultsSimulated ReadsTranscriptional ReadsLimitations

Conclusions

,

FU Berlin, GSNAP, Omics WS 2011 74

Page 109: GSNAP: Fast and SNP-tolerant detection of complex variants ... · GSNAP: Fast and SNP-tolerant detection of complex variants and splic-ing in short reads by Thomas D. Wu and Serban

Limitations

É Matching is only exhaustive if there is at least a 14nt contiguousmatch, otherwise it’s a heuristic

É Limited to one indel or splice site per readÉ Does not use read quality scoresÉ Does not work with ABI SOLiD data

,

FU Berlin, GSNAP, Omics WS 2011 75

Page 110: GSNAP: Fast and SNP-tolerant detection of complex variants ... · GSNAP: Fast and SNP-tolerant detection of complex variants and splic-ing in short reads by Thomas D. Wu and Serban

Limitations

É Matching is only exhaustive if there is at least a 14nt contiguousmatch, otherwise it’s a heuristic

É Limited to one indel or splice site per read

É Does not use read quality scoresÉ Does not work with ABI SOLiD data

,

FU Berlin, GSNAP, Omics WS 2011 75

Page 111: GSNAP: Fast and SNP-tolerant detection of complex variants ... · GSNAP: Fast and SNP-tolerant detection of complex variants and splic-ing in short reads by Thomas D. Wu and Serban

Limitations

É Matching is only exhaustive if there is at least a 14nt contiguousmatch, otherwise it’s a heuristic

É Limited to one indel or splice site per readÉ Does not use read quality scores

É Does not work with ABI SOLiD data

,

FU Berlin, GSNAP, Omics WS 2011 75

Page 112: GSNAP: Fast and SNP-tolerant detection of complex variants ... · GSNAP: Fast and SNP-tolerant detection of complex variants and splic-ing in short reads by Thomas D. Wu and Serban

Limitations

É Matching is only exhaustive if there is at least a 14nt contiguousmatch, otherwise it’s a heuristic

É Limited to one indel or splice site per readÉ Does not use read quality scoresÉ Does not work with ABI SOLiD data

,

FU Berlin, GSNAP, Omics WS 2011 75

Page 113: GSNAP: Fast and SNP-tolerant detection of complex variants ... · GSNAP: Fast and SNP-tolerant detection of complex variants and splic-ing in short reads by Thomas D. Wu and Serban

TL;DR

É Comparable to other fast read alignment algorithms in terms ofspeed, but can handle more complex variants and splicing

,

FU Berlin, GSNAP, Omics WS 2011 76

Page 114: GSNAP: Fast and SNP-tolerant detection of complex variants ... · GSNAP: Fast and SNP-tolerant detection of complex variants and splic-ing in short reads by Thomas D. Wu and Serban

For Further Reading I

Knuth D.E.The Art of Computer Programming: Sorting and Searching. Vol 3.Addison-Wesley, 1973

Thomas D. Wu and Serban NacuFast and SNP-tolerant detection of complex variants and splicing inshort readsBioinformatics, 2010 Apr 1;26(7):873-81.

Yeo G and Burge CB.Maximum entropy modeling of short sequence motifs withapplications to RNA splicing signalsJ Comput Biol. 2004;11(2-3):377-94.

,

FU Berlin, GSNAP, Omics WS 2011 77

Page 115: GSNAP: Fast and SNP-tolerant detection of complex variants ... · GSNAP: Fast and SNP-tolerant detection of complex variants and splic-ing in short reads by Thomas D. Wu and Serban

Aligning Paired-End Reads

É Optional

,

FU Berlin, GSNAP, Omics WS 2011 78

Page 116: GSNAP: Fast and SNP-tolerant detection of complex variants ... · GSNAP: Fast and SNP-tolerant detection of complex variants and splic-ing in short reads by Thomas D. Wu and Serban

Bisulfite-converted DNA

É Optional

,

FU Berlin, GSNAP, Omics WS 2011 79