
Upload: georgina-rose

Post on 21-Jan-2016



Lecture 7 CS566 1

Heuristic PSA

• “Words” to describe dot-matrix analysis
• Approaches
  – FASTA
  – BLAST
• Searching databases for sequence similarities
  – PSA
  – Alternative strategies
    • Iterative searching
    • Reverse searching


“Words” for Dot-matrix analysis

• Useful ideas from dot-matrix (DM) alignment
  – Diagonal represents local match
  – Broken diagonal = intervening mismatch
  – Displaced diagonals = matches with gaps
• Advantage of using word-based alignment
  – Faster algorithm
    • Word-list comparison faster than sequence comparison
    • Hashes used for rapid comparison of words
• “Devil is in the details”


FASTA (Fast-All)

• Motivation: Needed rapid PSA method to search databases for matches to query sequence (1:n comparisons)

• ktup (k-tuple or word) based alignment
  – Create hash tables for sequences
  – Find matching ktups (“hot-spots”/short diagonals) in pair of sequences
    • ktup size = 2 for protein (6 for DNA)
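
The hash-table step above can be sketched in a few lines of Python. This is only an illustration of the idea, not FASTA's actual code: the function names `build_ktup_index` and `find_hot_spots` and the example sequences are made up for this sketch.

```python
from collections import defaultdict


def build_ktup_index(seq, ktup):
    """Hash table: each k-tuple in seq -> list of start positions."""
    index = defaultdict(list)
    for i in range(len(seq) - ktup + 1):
        index[seq[i:i + ktup]].append(i)
    return index


def find_hot_spots(query, subject, ktup=2):
    """Return (i, j) pairs where query[i:i+ktup] == subject[j:j+ktup]."""
    index = build_ktup_index(query, ktup)
    hot_spots = []
    for j in range(len(subject) - ktup + 1):
        # One hash lookup per subject word replaces a scan of the query.
        for i in index.get(subject[j:j + ktup], ()):
            hot_spots.append((i, j))
    return hot_spots


# Hypothetical protein fragments, ktup = 2 as on the slide.
hits = find_hot_spots("HEAGAWGHEE", "PAWHEAE", ktup=2)
print(hits)
```

Each hot-spot (i, j) lies on diagonal i - j, which is what the next step groups on.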


FASTA

• Find 10 best “diagonal-runs”
  – Group hot-spots by the (i-j) diagonal they lie in
    • Main diagonal numbered 0
    • Positive diagonals lie above main diagonal, negative lie below
  – Diagonal-run = set of consecutive (not necessarily contiguous) hot-spots, penalized by size of intervening mismatch
  – Save top 10 diagonal runs
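
A rough sketch of the diagonal-run step, assuming hot-spots are given as (i, j) pairs. The scoring values `hit_score` and `gap_penalty` are illustrative assumptions, not FASTA's real parameters, and the function name `best_diagonal_runs` is hypothetical.

```python
from collections import defaultdict


def best_diagonal_runs(hot_spots, ktup=2, hit_score=4, gap_penalty=1, keep=10):
    """Score the best run of hot-spots on each diagonal; keep the top few.

    Returns (score, diagonal, start_i, end_i) tuples, best first.
    end_i is exclusive. Scoring values are illustrative only.
    """
    by_diagonal = defaultdict(list)
    for i, j in hot_spots:
        by_diagonal[i - j].append(i)

    runs = []
    for diagonal, starts in by_diagonal.items():
        starts.sort()
        best = None
        score, run_start, prev_end = 0, None, None
        for i in starts:
            if run_start is not None:
                # Penalize residues lying between consecutive hot-spots.
                score -= gap_penalty * max(0, i - prev_end)
            if run_start is None or score <= 0:
                # Open a new run when none exists or penalties ate the gain.
                run_start, score = i, 0
            score += hit_score
            prev_end = i + ktup
            if best is None or score > best[0]:
                best = (score, diagonal, run_start, prev_end)
        runs.append(best)

    runs.sort(key=lambda r: r[0], reverse=True)
    return runs[:keep]


# Two adjacent hot-spots on diagonal -3 outscore the isolated ones.
print(best_diagonal_runs([(4, 1), (0, 3), (7, 3), (1, 4)]))
```

Restarting a run whenever its score drops to zero or below is one simple way to apply the "penalized by size of intervening mismatch" rule; other penalty schemes would fit the slide equally well.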


FASTA

• Find init1
  – Init1 = best contiguous subsequence from top 10 diagonal runs, based on AAS (default BLOSUM50)
• Define local search space around init1
  – Include ±(32 / ktup) diagonals in search space
    • For ktup = 2, 16 diagonals on either side of init1
• Perform Smith-Waterman PSA in reduced space
  – Report resulting alignment as opt
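
The reduced search space can be illustrated with a banded Smith-Waterman that only fills cells near a chosen diagonal. This is a minimal sketch under simplifying assumptions: match/mismatch scores and a linear gap penalty stand in for the substitution matrix, and the function name and parameters are hypothetical.

```python
def banded_smith_waterman(a, b, center_diag, half_width,
                          match=2, mismatch=-1, gap=-2):
    """Best local alignment score of a vs b inside a band of diagonals.

    Only cells with |(i - j) - center_diag| <= half_width are filled,
    where i indexes a and j indexes b. Scoring values are illustrative.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # Cells outside the band stay 0, which a local alignment treats
    # the same as "start a new alignment here".
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        # Columns j whose diagonal (i - j) falls inside the band.
        j_lo = max(1, i - center_diag - half_width)
        j_hi = min(cols - 1, i - center_diag + half_width)
        for j in range(j_lo, j_hi + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + s,   # (mis)match
                          H[i - 1][j] + gap,     # gap in b
                          H[i][j - 1] + gap)     # gap in a
            best = max(best, H[i][j])
    return best


ktup = 2
print(banded_smith_waterman("TTACGT", "ACGT", center_diag=2,
                            half_width=32 // ktup))
```

For clarity the sketch still allocates the full matrix; a production version would store only the band, which is where the memory and time savings come from.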


BLAST (Basic Local Alignment Search Tool)

• Built upon ideas derived from FASTA, with incorporation of new elements
• For every word in query, generate set of words
  – Use AAS for similarity score between query word and all possible words of same size
  – Include all words exceeding cut-off in set
  – Example: For word DED, and threshold 0, word set includes DED, DDD, EEE, EDE etc.
• For every query word, generate hot-spots based on set of similar words
• Then merge contiguous words along same diagonal (a la FASTA) to form High-scoring Segment Pairs (HSPs)
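
The word-set generation can be sketched as follows, reusing the slide's DED example. To keep it short, the scoring table is an excerpt of BLOSUM62 restricted to three residues (D, E, K), and the alphabet is limited to those three; both restrictions, and the function names, are assumptions of this sketch rather than how BLAST is configured.

```python
from itertools import product

# Excerpt of BLOSUM62 for residues D, E, K only (symmetric matrix,
# so each unordered pair is stored once).
SCORES = {
    ("D", "D"): 6, ("D", "E"): 2, ("D", "K"): -1,
    ("E", "E"): 5, ("E", "K"): 1,
    ("K", "K"): 5,
}


def pair_score(x, y):
    """Look up a substitution score regardless of argument order."""
    return SCORES[(x, y)] if (x, y) in SCORES else SCORES[(y, x)]


def neighborhood(word, threshold, alphabet="DEK"):
    """All same-length words whose score against word exceeds threshold."""
    result = {}
    for candidate in product(alphabet, repeat=len(word)):
        score = sum(pair_score(x, y) for x, y in zip(word, candidate))
        if score > threshold:
            result["".join(candidate)] = score
    return result


words = neighborhood("DED", threshold=0)
for w in ("DED", "DDD", "EEE", "EDE"):
    print(w, words[w])
```

With the full 20-letter alphabet the candidate set has 8,000 three-letter words, so a higher threshold than 0 is needed in practice to keep the word set small.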


FASTA versus BLAST

• Word matching exact in FASTA but inexact (AAS-based) in BLAST

• Larger word size in BLAST

• FASTA more sensitive (Why?) but slower (Why?)

• BLAST handles “low-complexity” regions inline
  – Programs DUST (nucleotide) and/or SEG (protein) used for filtering sequences
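
To give a feel for what low-complexity filtering does, here is a toy filter that masks windows with low Shannon entropy. It is not the SEG or DUST algorithm; the window size, entropy cut-off, mask character and function names are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def shannon_entropy(window):
    """Entropy in bits of the residue composition of window."""
    counts = Counter(window)
    n = len(window)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def mask_low_complexity(seq, window=12, min_entropy=2.2, mask_char="X"):
    """Replace every residue covered by a low-entropy window with mask_char."""
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < min_entropy:
            for k in range(start, start + window):
                masked[k] = mask_char
    return "".join(masked)


# Small window and cut-off so the effect is visible on a short sequence.
print(mask_low_complexity("ACGTAAAAACGT", window=4, min_entropy=1.0))
```

Masked positions cannot seed word matches, so a run such as poly-A no longer produces large numbers of uninformative hot-spots.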


Variations on BLAST-based searching

• Mapping query to different alphabets
  – Protein versus DNA
  – DNA versus protein (Multiple reading frames)
• PSI-BLAST: Position-specific iterative BLAST
  – Use query to find hits
  – Assemble hits into on-the-fly Position-specific-scoring matrix (PSSM)
• RPS-BLAST: Reverse position-specific BLAST
  – Query is search space
  – Database of PSSMs used to search for match
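
The PSSM idea can be sketched as column-wise log-odds scores built from aligned hits. This is a simplified illustration: it assumes gap-free hits of equal length, a uniform background and a flat pseudocount, whereas PSI-BLAST itself uses sequence weighting and more careful pseudocounts. The function names and the example hits are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned_hits, alphabet=AMINO_ACIDS, pseudocount=1.0):
    """One dict of log2-odds scores per alignment column.

    Assumes equal-length, gap-free hits and a uniform background.
    """
    background = 1.0 / len(alphabet)
    total = len(aligned_hits) + pseudocount * len(alphabet)
    pssm = []
    for col in range(len(aligned_hits[0])):
        counts = Counter(seq[col] for seq in aligned_hits)
        column = {}
        for a in alphabet:
            freq = (counts[a] + pseudocount) / total
            column[a] = math.log2(freq / background)
        pssm.append(column)
    return pssm


def score_with_pssm(pssm, segment):
    """Sum the position-specific scores of segment's residues."""
    return sum(column[res] for column, res in zip(pssm, segment))


hits = ["DEDK", "DEDR", "DDDK"]  # made-up aligned hits
pssm = build_pssm(hits)
print(score_with_pssm(pssm, "DEDK"), score_with_pssm(pssm, "WWWW"))
```

In the iterative setting, the matrix replaces the fixed substitution matrix for the next search round; in the reverse setting, a query is scored against a database of such matrices.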