database searches blast. basic local alignment search tool –altschul, gish, miller, myers, lipman,...

Database Searches BLAST

Upload: kelly-harmon

Post on 13-Jan-2016


Page 1: Database Searches BLAST. Basic Local Alignment Search Tool –Altschul, Gish, Miller, Myers, Lipman, J. Mol. Biol. 215 (1990) –Altschul, Madden, Schaffer,

Database Searches

BLAST


BLAST

• Basic Local Alignment Search Tool
– Altschul, Gish, Miller, Myers, Lipman, J. Mol. Biol. 215 (1990)
– Altschul, Madden, Schaffer, Zhang, Zhang, Miller, Lipman, Nucleic Acids Res. 25 (1997)

• Main ideas:
– Increase search speed by finding fewer, but better, hot spots during the initial screening phase
– Use longer word sizes
– Integrate the scoring matrix into the first phase

• Compare with FASTA, which requires exact word matches in its initial screening phase


BLAST Terminology

• Segment pair: a pair of equal-length substrings, one from sequence S1 and one from sequence S2, aligned without gaps

• Locally maximal segment pair: segment pair whose alignment score cannot be improved by extending or shortening it

• Maximal segment pair (MSP): the segment pair with the maximum score over all segment pairs in the sequences S1 and S2

• High-scoring segment pair (HSP): A segment pair with score higher than some cutoff score, s.

• w is the word-length parameter; t is the word-score threshold parameter


BLAST: Hits

• A hit is a w-length word in the database that aligns with a word from the query sequence with score > t

• BLAST looks for hits instead of exact matches
– Allows word size to be kept high for speed, without sacrificing sensitivity

• Typically, w = 3-5 for amino acids, ~11-12 for DNA

• t is the most critical parameter:
– Higher t: fewer “background” hits (faster)
– Lower t: better ability to detect more distant relationships (at the cost of increased noise)


Hits

• For each word, evaluate the score of the match (exact or not) according to BLOSUM62
– E.g., for PQG matched against itself, the score is 7+5+6 = 18

• There are 20^w possible w-length words, but considering only those with score > t greatly reduces the number of matches
– E.g., there are 20^3 = 8000 possible matches to PQG, but only 50 achieve score > t = 13
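
The word-scoring step can be sketched in Python as follows. This is a minimal illustration, not BLAST's implementation: the names `BLOSUM62_ROWS`, `word_score` and `neighborhood` are hypothetical, and only the three BLOSUM62 rows needed for the query word PQG are included, transcribed by hand. The number of neighbors it finds depends on the exact matrix values and on whether the threshold comparison is strict, so it should not be expected to reproduce the figure quoted above.

```python
from itertools import product

# Column order for the matrix rows below.
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

# BLOSUM62 rows for P, Q and G only (enough to score words against "PQG").
# Transcribed by hand for this sketch; check against an authoritative copy.
BLOSUM62_ROWS = {
    "P": [-1, -2, -2, -1, -3, -1, -1, -2, -2, -3, -3, -1, -2, -4, 7, -1, -1, -4, -3, -2],
    "Q": [-1, 1, 0, 0, -3, 5, 2, -2, 0, -3, -2, 1, 0, -3, -1, 0, -1, -2, -1, -2],
    "G": [0, -2, 0, -1, -3, -2, -2, 6, -2, -4, -4, -2, -3, -3, -2, 0, -2, -2, -3, -3],
}


def word_score(query_word, candidate):
    """Sum the per-position substitution scores of two equal-length words."""
    return sum(
        BLOSUM62_ROWS[q][AMINO_ACIDS.index(c)]
        for q, c in zip(query_word, candidate)
    )


def neighborhood(query_word, t):
    """Enumerate all 20^w words and keep those scoring above t (strict, as on the slide)."""
    words = ("".join(p) for p in product(AMINO_ACIDS, repeat=len(query_word)))
    return [w for w in words if word_score(query_word, w) > t]


if __name__ == "__main__":
    print(word_score("PQG", "PQG"))
    print(len(neighborhood("PQG", 13)), "of", 20 ** 3, "words pass the threshold")
```

Enumerating all 20^w words is affordable for w = 3; the point of the threshold is that only the surviving words need to be looked up in the database.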


Extending a hit

• After locating a hit, BLAST attempts to extend the hit in both directions, until the score drops more than X below the maximum score yet attained.

• Extension step typically accounts for > 90% of execution time.
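
The extension rule above can be sketched as an ungapped, X-drop-limited walk along the diagonal. This is a simplified illustration under stated assumptions: the function names are hypothetical, the default scoring is a toy +1/-1 match/mismatch scheme rather than a substitution matrix, and each direction is extended independently from the seed.

```python
def simple_score(a, b):
    """Toy scoring (an assumption for this sketch): +1 match, -1 mismatch."""
    return 1 if a == b else -1


def extend_one_direction(s1, s2, i, j, step, x_drop, score_fn):
    """Walk along the diagonal from (i, j); return (best gain, length at best gain)."""
    running = best = 0
    best_len = 0
    n = 0
    while 0 <= i < len(s1) and 0 <= j < len(s2):
        running += score_fn(s1[i], s2[j])
        n += 1
        if running > best:
            best, best_len = running, n
        elif best - running > x_drop:
            break  # score fell more than X below the best seen so far
        i += step
        j += step
    return best, best_len


def extend_hit(s1, s2, i, j, w, x_drop=2, score_fn=simple_score):
    """Extend a w-length hit at s1[i:i+w], s2[j:j+w] in both directions.

    Returns (score, start in s1, start in s2, length) of the resulting segment pair.
    """
    seed = sum(score_fn(s1[i + k], s2[j + k]) for k in range(w))
    right_gain, right_len = extend_one_direction(s1, s2, i + w, j + w, 1, x_drop, score_fn)
    left_gain, left_len = extend_one_direction(s1, s2, i - 1, j - 1, -1, x_drop, score_fn)
    length = left_len + w + right_len
    return seed + left_gain + right_gain, i - left_len, j - left_len, length


if __name__ == "__main__":
    a = "TTTTACGTACGTAAAA"
    b = "CCCCACGTACGTGGGG"
    print(extend_hit(a, b, 6, 6, 3))
```

The segment is trimmed back to the position where the maximum score was reached, so trailing mismatches explored before the X-drop cutoff are not part of the reported segment pair.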


Improvement: 2-hit method

• Do extensions only when there are two hits on the same diagonal within some distance A of each other (e.g., A = 40)

• On its own, this reduces sensitivity (the ability to detect distantly related sequences)
– To compensate, use a lower t value (e.g., 11 rather than 13)

• Since we only extend when there are two nearby hits, many fewer regions are extended
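
A minimal sketch of the two-hit trigger, assuming hits are given as (query position, database position) pairs. The function name and the handling of overlapping hits (a hit that overlaps the previous one on its diagonal is ignored) are assumptions of this sketch, as is treating "within distance A" as an inclusive bound.

```python
def two_hit_triggers(hits, w, a=40):
    """Return (diagonal, first_pos, second_pos) for each pair of hits that would trigger an extension.

    hits: iterable of (query_pos, db_pos) word-hit start positions.
    w:    word length, used to skip overlapping hits on the same diagonal.
    a:    maximum distance between the two hits.
    """
    last_on_diagonal = {}  # diagonal -> query position of the most recent hit
    triggers = []
    for query_pos, db_pos in sorted(hits):
        diagonal = query_pos - db_pos
        previous = last_on_diagonal.get(diagonal)
        if previous is not None:
            distance = query_pos - previous
            if distance < w:
                continue  # overlaps the previous hit; ignore it
            if distance <= a:
                triggers.append((diagonal, previous, query_pos))
        last_on_diagonal[diagonal] = query_pos
    return triggers


if __name__ == "__main__":
    example_hits = [(0, 0), (1, 1), (10, 10), (100, 100), (5, 20), (30, 45), (7, 50)]
    print(two_hit_triggers(example_hits, w=3, a=40))
```

Keeping only the most recent hit per diagonal means the check costs one dictionary lookup per hit, which is why the two-hit rule is cheap compared with the extensions it avoids.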


Gapped BLAST

• Allows local alignments with indels (similar to FASTA)

• Local alignments from different diagonals are merged: a first local alignment, followed by some indels, followed by a second local alignment, etc.
– Equivalent to a path through the dynamic programming matrix composed of alternating diagonal sections and paths connecting them


Gapped BLAST

• Original BLAST implicitly handled gaps by finding several distinct HSPs and calculating a statistical assessment of the combined result
– Two or more HSPs each below the cutoff value might in combination rise to statistical significance

• Gapped BLAST extends hits by allowing gaps when the hits are promising (score exceeds s_g):
– Advantage: we can afford to miss some HSPs as long as at least one is found

• Use dynamic programming, starting from the center of each high-scoring region, if s > s_g
– s_g is chosen such that gapped alignment is triggered in about 1/50 of the sequences compared
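
To make the dynamic programming step concrete, here is a plain local-alignment (Smith-Waterman style) score computation with a linear gap penalty. It is a stand-in, not Gapped BLAST's procedure: it fills the whole matrix rather than growing outward from the center of a high-scoring region, and the scoring values and function name are assumptions chosen for illustration.

```python
def local_alignment_score(s1, s2, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of s1 and s2, allowing indels.

    Keeps only two rows of the dynamic programming matrix at a time.
    """
    previous_row = [0] * (len(s2) + 1)
    best = 0
    for a in s1:
        current_row = [0]
        for j, b in enumerate(s2, start=1):
            diagonal = previous_row[j - 1] + (match if a == b else mismatch)
            up = previous_row[j] + gap        # gap in s2
            left = current_row[j - 1] + gap   # gap in s1
            current_row.append(max(0, diagonal, up, left))
            best = max(best, current_row[j])
        previous_row = current_row
    return best


if __name__ == "__main__":
    # "ACGTT" vs "ACTT": the best alignment skips the G with one gap.
    print(local_alignment_score("ACGTT", "ACTT"))
```

Running this on every database sequence would be far too slow, which is why the gapped step is reserved for regions whose ungapped score already exceeds s_g.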


PSI-BLAST

• Position-Specific Iterated BLAST

• Generates a multiple alignment from statistically significant alignments produced by BLAST

• Produces a position-specific score matrix (PSSM)
– Can search the database using the PSSM
– Match sequences to profile
– Generate new profiles
– Repeat (iteration)
– Search gradually extends to increasingly divergent sequences
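
The idea of a PSSM can be illustrated with a small log-odds sketch built from the columns of an ungapped multiple alignment. The assumptions here are considerable and deliberate: uniform background frequencies, a fixed pseudocount, and no sequence weighting, none of which is claimed to match PSI-BLAST's own procedure. The function names are hypothetical.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned_sequences, pseudocount=1.0):
    """Build one {residue: log-odds score} dict per alignment column.

    Assumes equal-length sequences and a uniform background (both simplifications).
    """
    background = 1.0 / len(ALPHABET)
    pssm = []
    for column in range(len(aligned_sequences[0])):
        counts = Counter(
            seq[column] for seq in aligned_sequences if seq[column] in ALPHABET
        )
        total = sum(counts.values()) + pseudocount * len(ALPHABET)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in ALPHABET
        })
    return pssm


def score_segment(pssm, segment):
    """Score a segment of the same length as the profile against the PSSM."""
    return sum(column[aa] for column, aa in zip(pssm, segment))


if __name__ == "__main__":
    profile = build_pssm(["PQG", "PEG", "PQG", "PKG"])
    print(score_segment(profile, "PQG"), score_segment(profile, "AAA"))
```

In an iterated search, the sequences matched by one round's profile would be added to the alignment and the profile rebuilt, which is how the search reaches increasingly divergent sequences.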


Flavors of BLAST

• BLASTP - protein query against protein DB

• BLASTN - DNA/RNA query against GenBank (DNA)

• BLASTX - 6-frame translated DNA query against protein DB

• TBLASTN - protein query against 6-frame GenBank translation

• TBLASTX - 6-frame translated DNA query against 6-frame GenBank translation

• PSI-BLAST - protein ‘profile’ query against protein DB

• PHI-BLAST - protein pattern against protein DB
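
The "6-frame translation" used by BLASTX, TBLASTN and TBLASTX can be sketched as follows: three reading frames on the given strand and three on its reverse complement. This is an illustrative sketch using the standard genetic code; the function names are hypothetical, and codons containing characters other than A, C, G, T are rendered as X by assumption.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, with codons enumerated in TCAG order; '*' marks stop codons.
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Reverse-complement an uppercase DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons from the start of dna; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X")
        for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna):
    """Return translations for frames +1, +2, +3, -1, -2, -3, in that order."""
    dna = dna.upper()
    reverse = reverse_complement(dna)
    return [translate(strand[offset:]) for strand in (dna, reverse) for offset in range(3)]


if __name__ == "__main__":
    print(six_frame_translation("ATGGCC"))
```

Each of the six protein strings is then searched as if it were an ordinary protein query or database entry.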