lecture 7 cs5661 heuristic psa “words” to describe dot-matrix analysis approaches –fasta...
Lecture 7 CS566 1
Heuristic PSA
• “Words” to describe dot-matrix analysis
• Approaches
  – FASTA
  – BLAST
• Searching databases for sequence similarities
  – PSA
  – Alternative strategies
    • Iterative searching
    • Reverse searching
“Words” for Dot-matrix analysis
• Useful ideas from DM Alignment
  – Diagonal represents local match
  – Broken diagonal = intervening mismatch
  – Displaced diagonals = Matches with gaps
• Advantage of using word-based alignment
  – Faster algorithm
    • Word-list comparison faster than sequence comparison
    • Hashes used for rapid comparison of words
• “Devil is in the details”
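The word-list idea can be sketched in a few lines of Python. This is a minimal illustration, not code from the lecture: the function names `word_index` and `shared_words` are made up here, and a plain dictionary stands in for the hash table.

```python
from collections import defaultdict

def word_index(seq, k):
    """Map each length-k word in seq to the list of positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(seq) - k + 1):
        index[seq[pos:pos + k]].append(pos)
    return dict(index)

def shared_words(seq_a, seq_b, k):
    """Words present in both sequences, found by intersecting the hash keys."""
    return set(word_index(seq_a, k)) & set(word_index(seq_b, k))
```

Comparing the two key sets costs roughly the number of distinct words, rather than one comparison per pair of positions as in a full dot matrix.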
FASTA (Fast-All)
• Motivation: Needed rapid PSA method to search databases for matches to query sequence (1:n comparisons)
• ktup (k-tuple or word) based alignment
  – Create hash tables for sequences
  – Find matching ktups (“hot-spots”/short diagonals) in pair of sequences
    • ktup size = 2 for protein (6 for DNA)
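A minimal sketch of the hot-spot step, assuming exact ktup matches and 0-based positions. The function name `find_hot_spots` and the (i, j) tuple output are choices made for this example, not part of FASTA itself.

```python
from collections import defaultdict

def find_hot_spots(query, target, ktup=2):
    """Return sorted (i, j) pairs where query[i:i+ktup] == target[j:j+ktup]."""
    # Hash table: ktup word -> start positions in the query.
    table = defaultdict(list)
    for i in range(len(query) - ktup + 1):
        table[query[i:i + ktup]].append(i)

    # Look up every ktup of the target; each table hit is one hot-spot.
    spots = []
    for j in range(len(target) - ktup + 1):
        for i in table.get(target[j:j + ktup], ()):
            spots.append((i, j))
    return sorted(spots)
```

For example, `find_hot_spots("HEAGAWGHEE", "PAWHEAE")` finds four hot-spots. Two of them, (0, 3) and (1, 4), lie on the same diagonal.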
FASTA
• Find 10 best “diagonal-runs”
  – Group hot-spots by the (i-j) diagonal they lie in
    • Main diagonal numbered 0
    • Positive diagonals lie above main diagonal, negative lie below
  – Diagonal-run = set of consecutive (not necessarily contiguous) hot-spots, penalized by size of intervening mismatch
  – Save top 10 diagonal runs
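The grouping and scoring of diagonal runs might look like the sketch below. The scoring scheme is an assumption made for illustration: a fixed `hit_score` per hot-spot and a linear `gap_penalty` per unmatched position between hot-spots. FASTA's actual scoring values differ.

```python
from collections import defaultdict
import heapq

def best_diagonal_runs(hot_spots, ktup=2, hit_score=5, gap_penalty=1, top=10):
    """Score the best run on each diagonal and return the `top` highest.

    hot_spots: iterable of (i, j) word-match start positions.
    Returns (score, diagonal, start_i, end_i) tuples, best first.
    """
    by_diag = defaultdict(list)
    for i, j in hot_spots:
        by_diag[i - j].append(i)          # diagonal number is i - j

    runs = []
    for diag, starts in by_diag.items():
        starts.sort()
        best = None
        score, run_start, prev_end = 0, None, None
        for i in starts:
            if run_start is not None:
                gap = max(0, i - prev_end)  # unmatched positions between hot-spots
                extended = score - gap_penalty * gap + hit_score
            # Either extend the current run or start a fresh run at this hot-spot.
            if run_start is None or extended < hit_score:
                score, run_start = hit_score, i
            else:
                score = extended
            prev_end = i + ktup
            if best is None or score > best[0]:
                best = (score, diag, run_start, prev_end)
        runs.append(best)
    return heapq.nlargest(top, runs)
```

Under these assumed scores, two hot-spots that are far apart on the same diagonal are kept as separate runs, because the penalty for the intervening mismatch outweighs the gain from joining them.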
FASTA
• Find init1
  – Init1 = best contiguous subsequence from top 10 diagonal runs, based on AAS (default BLOSUM50)
• Define local search space around init1
  – Include (32 / ktup) +/- diagonals in search space
    • For ktup = 2, 16 diagonals around init1
• Perform Smith-Waterman PSA in reduced space
  – Report resulting alignment as opt
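The reduced search space can be sketched as a banded Smith-Waterman that fills only cells near the init1 diagonal. To stay short, this example uses simple match/mismatch scores and a linear gap penalty instead of BLOSUM50; these scores and all names are assumptions made here, not FASTA's settings.

```python
def band_half_width(ktup):
    """Number of diagonals on each side of init1, per the 32 / ktup rule."""
    return 32 // ktup

def banded_smith_waterman(a, b, center_diag, half_width=16,
                          match=2, mismatch=-1, gap=-2):
    """Best local alignment score using only cells with
    |(i - j) - center_diag| <= half_width."""
    best = 0
    prev = {}                              # previous row: column j -> score
    for i in range(1, len(a) + 1):
        cur = {}
        lo = max(1, i - center_diag - half_width)
        hi = min(len(b), i - center_diag + half_width)
        for j in range(lo, hi + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score = max(0,
                        prev.get(j - 1, 0) + s,    # diagonal move
                        prev.get(j, 0) + gap,      # gap in b
                        cur.get(j - 1, 0) + gap)   # gap in a
            cur[j] = score
            best = max(best, score)
        prev = cur
    return best
```

Cells outside the band are never stored, so the work per row is bounded by the band width rather than by the length of the second sequence.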
BLAST (Basic Local Alignment Search Tool)
• Built upon ideas derived from FASTA, with incorporation of new elements
• For every word in query, generate set of words
  – Use AAS for similarity score between query word and all possible words of same size
  – Include all words exceeding cut-off in set
  – Example: For word DED and threshold 0, word set includes DED, DDD, EEE, EDE, etc.
• For every query word, generate hot-spots based on set of similar words
• Then merge contiguous words along same diagonal (a la FASTA) to form High-scoring Segment Pairs (HSPs)
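The word-set generation can be illustrated with a toy example restricted to the two-letter alphabet {D, E}. The scores in `TOY_AAS` are assumed values chosen to resemble typical substitution-matrix entries, and the function names are hypothetical.

```python
from itertools import product

# Assumed, illustrative similarity scores for a two-letter alphabet.
TOY_AAS = {("D", "D"): 6, ("E", "E"): 5, ("D", "E"): 2, ("E", "D"): 2}

def word_score(w1, w2, aas):
    """Sum of per-position similarity scores between two equal-length words."""
    return sum(aas[(x, y)] for x, y in zip(w1, w2))

def neighborhood(word, alphabet, aas, threshold):
    """All same-length words whose score against `word` exceeds `threshold`."""
    return sorted(
        "".join(cand)
        for cand in product(alphabet, repeat=len(word))
        if word_score(word, "".join(cand), aas) > threshold
    )
```

With threshold 0, `neighborhood("DED", "DE", TOY_AAS, 0)` returns all eight words over {D, E}, including DED, DDD, EEE and EDE. Raising the threshold to 13 leaves only DDD and DED. The cut-off therefore sets how many similar words each query word contributes.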
FASTA versus BLAST
• Word matching exact in FASTA but inexact (AAS-based) in BLAST
• Larger word size in BLAST
• FASTA more sensitive (Why?) but slower (Why?)
• BLAST handles “low-complexity” inline
  – Programs DUST and/or SEG used for filtering sequences
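To give a feel for low-complexity filtering, here is a simplified window-entropy mask. It is not the DUST or SEG algorithm. The window size, the entropy cut-off and the mask character are all assumptions chosen for the example.

```python
import math
from collections import Counter

def shannon_entropy(window):
    """Shannon entropy (bits per symbol) of the letters in `window`."""
    counts = Counter(window)
    n = len(window)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def mask_low_complexity(seq, window=12, min_entropy=2.2, mask_char="X"):
    """Replace every residue covered by a low-entropy window with mask_char."""
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < min_entropy:
            for k in range(start, start + window):
                masked[k] = mask_char
    return "".join(masked)
```

Masked positions cannot seed word matches, so repetitive stretches stop producing large numbers of uninformative hot-spots.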
Variations on BLAST-based searching
• Mapping query to different alphabets
  – Protein versus DNA
  – DNA versus protein (Multiple reading frames)
• PSI-BLAST: Position-specific iterative BLAST
  – Use query to find hits
  – Assemble hits into on-the-fly position-specific scoring matrix (PSSM)
• RPS-BLAST: Reverse position-specific BLAST
  – Query is search space
  – Database of PSSMs used to search for match
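A PSSM can be sketched as column-wise log-odds scores built from aligned hits. This is a deliberately simplified version that assumes ungapped, equal-length hits, a uniform background and a flat pseudocount. PSI-BLAST's real construction is more elaborate, and the function names here are hypothetical.

```python
import math
from collections import Counter

def build_pssm(aligned_hits, alphabet, pseudocount=1.0):
    """One dict per column: letter -> log2(observed frequency / background)."""
    length = len(aligned_hits[0])
    background = 1.0 / len(alphabet)       # assumed uniform background
    total = len(aligned_hits) + pseudocount * len(alphabet)
    pssm = []
    for col in range(length):
        counts = Counter(hit[col] for hit in aligned_hits)
        pssm.append({
            a: math.log2((counts[a] + pseudocount) / total / background)
            for a in alphabet
        })
    return pssm

def best_pssm_match(pssm, seq):
    """Slide the PSSM along seq; return (score, start) of the best ungapped window."""
    width = len(pssm)
    return max(
        (sum(pssm[k][seq[s + k]] for k in range(width)), s)
        for s in range(len(seq) - width + 1)
    )
```

In the PSI-BLAST direction, the matrix built from one round of hits is used to search again. In the RPS-BLAST direction, `best_pssm_match` would be run with each stored PSSM against the single query sequence.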