GSNAP: Fast and SNP-tolerant detection of complex variants and splicing in short reads, by Thomas D. Wu and Serban Nacu
Matt Huska, Freie Universität Berlin
Computational Methods for High-Throughput Omics Data, WS 2011
Outline
- Introduction
  - Motivation
  - GSNAP Features
  - Examples of Complex Variant Detection
- The Algorithm
  - Summary
  - Preprocessing
  - Method 1: Spanning Set Generation and Filtering
  - Method 2: Complete Set Generation and Filtering
  - Verification of Candidate Regions
  - Detecting Insertions and Deletions
  - Detecting Splice Junctions
- Results
  - Simulated Reads
  - Transcriptional Reads
  - Limitations
- Conclusions
FU Berlin, GSNAP, Omics WS 2011 2
Motivation
- New technology allows us to sequence more reads in less time. Throughput is increasing at an incredible rate with no signs of slowing down.
- "Why should we be happy with millions of reads, when we can have... billions?"
Motivation: Why do we need another read mapping algorithm?
- The reads themselves are also getting longer. Longer reads = higher probability of complex variants within a read.
- Current (Feb 2010) read mappers tend to be either very fast (BWA, Bowtie, SOAP2) or sensitive to variants (SOAP).
- GSNAP is intended to be both fast and able to handle complex variants.
Key Features of GSNAP
- Can handle short and long insertions and deletions
- Detects short- and long-distance splicing (including interchromosomal)
  - user-provided database of splice sites (e.g. RefSeq)
  - probabilistic model
- SNP tolerant (given a user-provided database, e.g. dbSNP)
- Can align bisulfite-treated DNA (for studying methylation state)
- Still pretty fast
Long deletion of 17 nt plus mismatches
- 17 nt deletion matching an entry in dbSNP, including mismatches (figure)

Splicing identified using the probabilistic model
- An intron within exon 1 of HOXA9, which is also experimentally supported.

Splicing identified using a database of known splice sites
- Splice sites identified despite having low probabilistic scores.

Interchromosomal splicing (gene fusion)
- Splicing between BCAS4 (chr 20) and BCAS3 (chr 17).

SNP-tolerant alignment
- SNP tolerance allows both genotypes to align well.
The Algorithm in Brief
- Index the genome using a hash table.
- Break up short reads into shorter elements and look each one up in the hash table.
- Look at the resulting position lists for each element and see whether they support a common target location with a reasonable number of mismatches.
- Verify the number of mismatches by checking the whole read against the reference.
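The four steps above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not GSNAP's implementation: it indexes every 12-mer position (no sampling), uses a plain dictionary, and votes on implied read start positions. The names `build_index`, `candidate_starts`, `verify` and `align`, and the `min_support` threshold, are hypothetical.

```python
from collections import Counter, defaultdict

K = 12  # oligomer length used on the slides


def build_index(genome):
    """Map every 12-mer to the sorted list of its start positions (toy: no sampling)."""
    index = defaultdict(list)
    for pos in range(len(genome) - K + 1):
        index[genome[pos:pos + K]].append(pos)
    return index


def candidate_starts(read, index, min_support=2):
    """Let each 12-mer hit vote for the read start position it implies."""
    votes = Counter()
    for offset in range(len(read) - K + 1):
        for pos in index.get(read[offset:offset + K], ()):
            votes[pos - offset] += 1
    return [s for s, n in votes.items() if n >= min_support and s >= 0]


def verify(read, genome, start):
    """Count mismatches of the whole read against the reference at `start`."""
    window = genome[start:start + len(read)]
    if len(window) < len(read):
        return None  # read would run off the end of the reference
    return sum(a != b for a, b in zip(read, window))


def align(read, genome, index, max_mismatches=2):
    """Return sorted (start, mismatches) pairs that pass verification."""
    hits = []
    for start in candidate_starts(read, index):
        mm = verify(read, genome, start)
        if mm is not None and mm <= max_mismatches:
            hits.append((start, mm))
    return sorted(hits)
```

Calling `align(read, genome, build_index(genome))` returns the verified locations together with their exact mismatch counts; the voting step only proposes candidates, and the final check against the reference decides.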
Hash Table Space Requirements
- Hashing every overlapping oligomer in the entire reference sequence would take too much memory (approx. 14 GB for the human genome).
- We only hash 12-mers every 3 nt (approx. 4 GB).
- Optional SNP tolerance adds only a small amount to the total memory requirements (3.8 GB → 4.0 GB).
- The entire table only needs to be in memory during construction. Afterwards it is mmap'd and only part of it is loaded into memory.
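As a rough illustration of sampling 12-mers every 3 nt, the sketch below indexes only genome positions 0, 3, 6, ... and then probes the read at every offset, so that whichever phase the read lands in, every third read 12-mer can still hit the index. It is a minimal sketch using a Python dictionary rather than GSNAP's on-disk table; `build_sampled_index` and `lookup_read` are hypothetical names.

```python
from collections import defaultdict

K = 12        # oligomer length
INTERVAL = 3  # index only genome positions 0, 3, 6, ...


def build_sampled_index(genome):
    """Store roughly one third of the 12-mer positions of a full index."""
    index = defaultdict(list)
    for pos in range(0, len(genome) - K + 1, INTERVAL):
        index[genome[pos:pos + K]].append(pos)
    return index


def lookup_read(read, index):
    """Return (read_offset, genome_pos) pairs for every 12-mer hit."""
    hits = []
    for offset in range(len(read) - K + 1):
        for pos in index.get(read[offset:offset + K], ()):
            hits.append((offset, pos))
    return hits
```

For a read that truly starts at genome position s, only the offsets with (s + offset) divisible by 3 can produce a hit at the true location, which is the price paid for the smaller table.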
Hashing the Reference Genome (figure)
Including SNPs (figure)
Resulting Reference "Space" (figure)
Spanning Set Generation and Filtering
- Generating elements: used to find supporting candidate locations
  - Uses a multiway merging procedure (Knuth, TAOCP Vol. 3)
  - Slow: linear in the sum of the list lengths. O(l log n) runtime, where n is the number of position lists and l is the sum of their lengths.
- Filtering elements: filter the candidate locations found using the generating elements
  - Fast: uses a binary search. O(log l_i) for position list i of length l_i
- Choose the elements with the shortest position lists as generating elements, and the longer ones as filtering elements.
- The authors choose K + 2 generating elements, where K is the constraint score (= max number of mismatches).
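The split into generating and filtering elements can be sketched as follows. This is an illustrative sketch, not the authors' code: it assumes each element is given as a (read offset, sorted position list) pair, uses `heapq.merge` as the multiway merge and `bisect` for the binary search, and requires a candidate to be supported by at least two generating elements (a pigeonhole assumption: K mismatches can break at most K of the K + 2 non-overlapping elements). The function names are hypothetical.

```python
import heapq
from bisect import bisect_left
from itertools import groupby


def contains(sorted_list, value):
    """Binary search: O(log n) membership test on a sorted list."""
    i = bisect_left(sorted_list, value)
    return i < len(sorted_list) and sorted_list[i] == value


def spanning_set_candidates(elements, max_mismatches):
    """elements: list of (read_offset, sorted genome positions) for non-overlapping 12-mers."""
    by_length = sorted(elements, key=lambda e: len(e[1]))
    n_generating = min(max_mismatches + 2, len(by_length))
    generating, filtering = by_length[:n_generating], by_length[n_generating:]

    # Multiway merge of the short lists, expressed as implied read start positions.
    streams = [[pos - offset for pos in positions] for offset, positions in generating]
    merged = heapq.merge(*streams)

    candidates = []
    for start, group in groupby(merged):
        support = len(list(group))
        if support < 2:
            continue
        missing = n_generating - support  # generating elements that did not match
        for offset, positions in filtering:
            if not contains(positions, start + offset):
                missing += 1
                if missing > max_mismatches:
                    break
        if missing <= max_mismatches:
            candidates.append((start, missing))
    return candidates
```

Each returned pair is a candidate read start plus the number of non-supporting elements, which is only a lower bound on the true mismatch count; the long lists are never scanned in full, only probed by binary search.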
Spanning Set Generation and Filtering
- Problem: the spanning set method can only be used to detect a limited number of mismatches.
- If we want to allow K mismatches, for K > 1, then we need at least (K + 2) generating elements. The read length L limits the number of elements N, and the spanning set elements are non-overlapping in all three shifts if L ≡ 10 (mod 12), so N ≤ ⌊(L + 2)/12⌋.
- So we need to satisfy N ≥ (K + 2), and as such we can only apply the spanning set method when K ≤ ⌊(L + 2)/12⌋ − 2, for L ≥ 34.
- For reads of length 100 (Illumina), we can allow a maximum of 6 mismatches.
- For reads of length 400 (454), we can allow a maximum of 31 mismatches.
- If we want to allow a larger number of mismatches, or the same number of mismatches in shorter reads, we need to use another method...
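The bound above is easy to check numerically. The helper below is a small sketch (the function name is hypothetical) that evaluates K ≤ ⌊(L + 2)/12⌋ − 2 and reproduces the two read lengths quoted on the slide.

```python
def max_mismatches_spanning(read_length):
    """Largest K the spanning set method supports: floor((L + 2) / 12) - 2."""
    if read_length < 34:
        raise ValueError("the slide states the bound only for L >= 34")
    return (read_length + 2) // 12 - 2


for length in (36, 100, 400):
    print(length, max_mismatches_spanning(length))
```

For L = 100 this gives ⌊102/12⌋ − 2 = 8 − 2 = 6, and for L = 400 it gives ⌊402/12⌋ − 2 = 33 − 2 = 31, matching the figures above.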
Complete set generation and filtering
- Uses the complete set of overlapping 12-mers.
- Works for any number of mismatches as long as the read and target have ≥ 14 consecutive matches (a 12-mer can be out of phase by up to two bases).
- Exhaustive for K ≤ ⌊L/14⌋ − 1
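One way to see where ⌊L/14⌋ − 1 comes from is a pigeonhole argument: K mismatches split the remaining L − K matching bases into at most K + 1 runs, so the longest run has at least ⌈(L − K)/(K + 1)⌉ bases, and the method needs that to be at least 14. The sketch below checks this reasoning; it is my reading of the bound rather than a derivation taken from the slides, and the function names are hypothetical.

```python
def max_mismatches_complete(read_length):
    """Largest K for which the complete set method is exhaustive: floor(L / 14) - 1."""
    return read_length // 14 - 1


def longest_guaranteed_run(read_length, mismatches):
    """Pigeonhole: K mismatches leave L - K matching bases in at most K + 1 runs."""
    matching = read_length - mismatches
    runs = mismatches + 1
    return -(-matching // runs)  # ceiling division


k = max_mismatches_complete(100)
print(k, longest_guaranteed_run(100, k), longest_guaranteed_run(100, k + 1))
```

For a 100 nt read, 6 mismatches still guarantee a run of 14 matching bases, while 7 mismatches guarantee only 12, so exhaustiveness stops at K = 6.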
Complete set generation and filtering
- Lower bound on mismatches: ⌊(Δp + 6)/12⌋, where Δp is the distance between the start locations of consecutive supporting 12-mers.
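A small sketch of how this bound might be applied to a whole read: given the read positions of the 12-mers that support a candidate location, evaluate ⌊(Δp + 6)/12⌋ for each consecutive pair and add the results. Summing over the gaps is an assumption made here for illustration (the mismatches implied by different gaps fall in disjoint stretches of the read), and `mismatch_lower_bound` is a hypothetical name.

```python
def mismatch_lower_bound(supporting_starts):
    """Sum floor((dp + 6) / 12) over consecutive supporting 12-mer start positions."""
    starts = sorted(supporting_starts)
    return sum((b - a + 6) // 12 for a, b in zip(starts, starts[1:]))


print(mismatch_lower_bound([0, 3, 6, 9]))    # no gap larger than the 3 nt sampling interval
print(mismatch_lower_bound([0, 3, 18, 21]))  # one gap of 15
print(mismatch_lower_bound([0, 3, 21, 24]))  # one gap of 18
```

With supporting 12-mers every 3 nt the bound is 0; a gap of 15 can be explained by a single mismatch, whereas a gap of 18 cannot, so the bound rises to 2.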
Complete set example (figure)
Verification of Candidate Regions
- The spanning set and complete set methods generate candidate regions for which we know a lower bound on the number of mismatches.
- These regions need to be verified to check the exact number of mismatches.

Remember: Resulting Reference "Space" (figure)
Compressed Genome
- Text is usually stored as 8-bit characters.
- Because we have a reduced alphabet, the reference is stored as 3 bits per character: 2 bits for the nucleotide + a flag
  - Flag in major-allele genome: indicates an unknown or ambiguous nucleotide
  - Flag in minor-allele genome: indicates a SNP
Verification of Candidate Regions
- The query sequence is converted to the same compressed representation as the reference.
- It is shifted into position and combined by bitwise XOR with the major- and minor-allele genomes separately.
- The resulting arrays are bitwise AND'd, so a mismatch at a SNP only occurs if both alleles fail to match.
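The XOR-then-AND idea can be illustrated with Python integers standing in for the packed arrays. This is a simplified sketch under stated assumptions: it uses 2 bits per nucleotide and leaves out the third flag bit, it assumes the read is already shifted into position against equal-length major- and minor-allele sequences, and the names `pack`, `mismatch_mask` and `count_mismatches` are hypothetical.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def pack(seq):
    """Pack a sequence into an integer, 2 bits per nucleotide (position i at bits 2i, 2i+1)."""
    value = 0
    for i, base in enumerate(seq):
        value |= CODE[base] << (2 * i)
    return value


def mismatch_mask(x, y, n):
    """One bit per position (at bit 2i), set where the 2-bit codes of x and y differ."""
    diff = x ^ y                     # non-zero 2-bit group wherever the bases differ
    low_bits = int("01" * n, 2)      # keeps only bit 2i of each group
    return (diff | (diff >> 1)) & low_bits


def count_mismatches(read, major, minor):
    """A position counts as a mismatch only if it differs from BOTH alleles."""
    n = len(read)
    q = pack(read)
    both = mismatch_mask(q, pack(major), n) & mismatch_mask(q, pack(minor), n)
    return bin(both).count("1")


major = "ACGTACGT"
minor = "ACGTGCGT"  # SNP at position 4: A in the major allele, G in the minor allele
print(count_mismatches("ACGTGCGT", major, minor))
print(count_mismatches("ACGTTCGT", major, minor))
```

A read carrying either allele at the SNP scores 0 mismatches, while a read with a third base at that position scores 1, because only then do both XOR results flag the position.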
Detecting Insertions and Deletions (figures)
Detecting Splice Junctions
- Known splice sites: user-provided database
- Novel splice sites: maximum entropy probabilistic model from Yeo and Burge, 2004
- Short-distance splice sites are on the same chromosome and less than some distance apart (default: 200,000 nt)
- The method is similar to the one we used to find middle deletions earlier...
- The crossover area is then searched for donor or acceptor sites (either known, or novel with high probability).
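The crossover search can be illustrated with a deliberately simplified sketch. It assumes the two ends of the read have already been anchored on the plus strand of the same chromosome, tries every breakpoint in the read, keeps those where the left and right parts together stay within a mismatch budget, and accepts a breakpoint only if the flanking intron bases are the canonical GT...AG dinucleotides. The GT/AG test is a stand-in for both the known-site database and the maximum entropy model, and all names and parameters here are hypothetical.

```python
def splice_breakpoints(read, genome, left_start, right_end, max_mismatches=0):
    """Find breakpoints b such that read[:b] aligns at left_start and read[b:] ends at right_end.

    left_start: genome coordinate aligned to read[0]
    right_end:  genome coordinate just past the base aligned to read[-1]
    Returns (b, donor_pos, acceptor_end) tuples, where the intron is genome[donor_pos:acceptor_end].
    """
    n = len(read)
    results = []
    for b in range(1, n):
        right_start = right_end - (n - b)  # genome coordinate aligned to read[b]
        left_mm = sum(read[i] != genome[left_start + i] for i in range(b))
        right_mm = sum(read[i] != genome[right_start + i - b] for i in range(b, n))
        if left_mm + right_mm > max_mismatches:
            continue  # breakpoint lies outside the crossover area
        donor = genome[left_start + b:left_start + b + 2]
        acceptor = genome[right_start - 2:right_start]
        if donor == "GT" and acceptor == "AG":
            results.append((b, left_start + b, right_start))
    return results


exon1, intron, exon2 = "ACGTACCA", "GTCCCCCCAG", "TTGACTGA"
genome = "TT" + exon1 + intron + exon2 + "TT"
read = exon1[-5:] + exon2[:5]
print(splice_breakpoints(read, genome, left_start=5, right_end=25))
```

In this toy genome the read spans the exon junction, and the only breakpoint that satisfies both the mismatch budget and the dinucleotide test is read position 5, with the intron occupying genome positions 10 to 19.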
Detecting Splice Junctions
- Long-distance splice sites can be on different chromosomes
  - Require higher probability scores for novel splice sites than short-distance splice sites
  - Candidates with matching breakpoints on the read are matched
- If both splice sites cannot be found, then GSNAP will return one site (a "half-intron").
Simulated Reads
- Runtime comparison between GSNAP and other alignment tools (for 100,000 reads)
- Simulated increasingly complicated variants:
  - exact matches only
  - 1-3 mismatches
  - short insertions and deletions
  - longer insertions and deletions
Simulated Reads: 36 nt
[Figure: run time (s, axis 0 to 3000) per mapping software (GSNAP, BWA, Bowtie, SOAP2, SOAP, MAQ), one panel per variant: Exact, 1 mm, 2 mm, 3 mm, Ins (1...3 nt), Del (1...3 nt), Ins (4...9 nt), Del (4...30 nt), followed by a combined plot of all variants.]

Simulated Reads: 70 nt
[Figure: run time (s, axis 0 to 3000) by variant (Exact, 1 mm to 4 mm, Ins (1...3 nt), Del (1...3 nt), Ins (4...9 nt), Del (4...30 nt)) for GSNAP, BWA, Bowtie, SOAP2, SOAP and MAQ.]

Simulated Reads: 100 nt
[Figure: run time (s, axis 0 to 3000) by variant (Exact, 1 mm to 5 mm, Ins (1...3 nt), Del (1...3 nt), Ins (4...9 nt), Del (4...30 nt)) for GSNAP, BWA, Bowtie, SOAP2, SOAP and MAQ.]
Simulated Reads: Percent misses (on unique reads)
- Most algorithms were able to perfectly map unique reads, with the following notable exceptions:
  - GSNAP: 12% misses for 36 nt reads with 3 mismatches
  - SOAP: 15% misses for 36 nt reads with 3 mismatches, around 5% for 1-3 nt indels
  - BWA: 1-5% misses for 1-3 nt indels
Memory Requirements
- GSNAP: should have access to 5 GB of memory, otherwise it will run slowly
- BWA: 2.2 GB
- Bowtie: 1.1 GB (exact matches) or 2.2 GB (allowing mismatches)
- MAQ: 302 MB
- SOAP: 14 GB
- SOAP2: unknown ("only provided as a binary and did not have the required compile time flag")
Transcriptional Reads
- Only looked at the effect of splicing, indels and SNP tolerance
- Including known splicing information → increased yield by approx. 8%
- Including SNP tolerance → minor increase in yield (0.5%), but affected about 8% of alignments
Limitations
- Matching is only exhaustive if there is at least a 14 nt contiguous match; otherwise it's a heuristic
- Limited to one indel or splice site per read
- Does not use read quality scores
- Does not work with ABI SOLiD data
TL;DR
- Comparable to other fast read alignment algorithms in terms of speed, but can handle more complex variants and splicing
For Further Reading
- Knuth, D.E. The Art of Computer Programming, Vol. 3: Sorting and Searching. Addison-Wesley, 1973.
- Wu, T.D. and Nacu, S. Fast and SNP-tolerant detection of complex variants and splicing in short reads. Bioinformatics, 2010 Apr 1;26(7):873-81.
- Yeo, G. and Burge, C.B. Maximum entropy modeling of short sequence motifs with applications to RNA splicing signals. J Comput Biol. 2004;11(2-3):377-94.
Aligning Paired-End Reads
- Optional

Bisulfite-converted DNA
- Optional