Heuristic methods for sequence alignment in practice
Sushmita RoyBMI/CS 576
www.biostat.wisc.edu/bmi576/Sushmita Roy
[email protected] 27th, 2012
BMI/CS 576
Heuristic alignment motivation
• too slow for large databases with high query traffic• heuristic methods do fast approximation to dynamic programming
– FASTA [Pearson & Lipman, 1988]– BLAST [Altschul et al., 1990; Altschul et al., Nucleic Acids
Research 1997]
)(mnO
Heuristic alignment motivation
• consider the task of searching the RefSeq collection of sequences against a query sequence:– most recent release of DB contains 17,368,769
proteins, 2,700,925 genomic sequences– Entails 17,000,000*(300*300) matrix operations
(assuming query sequence is of length 300 and avg. sequence length is 300)
• Need faster methods to search databases in practice
Heuristic alignment
• heuristic algorithm: a problem-solving method which isn’t guaranteed to find the optimal solution, but which is efficient and finds good solutions
• key heuristics in BLAST– A good alignment is made up short stretches of matches: seeds– Extend seeds to make longer alignments
• key tradeoff made: sensitivity vs. speed
Scoring matrices
• BLOSUM matrices for proteins• DNA: separate scores for match and mismatch, and
gaps.
Overview of BLAST (Basic Alignment Search Tool)
• given: query sequence q, word length w, word score threshold T, segment score threshold S– compile a list of “words” (of length w) that score at
least T when compared to words from q– scan database for matches to words in list– extend all matches to seek high-scoring alignments
• return: alignments scoring at least S
Determining query words
Given:query sequence: QLNFSAGWword length w = 2 (default for protein usually w = 3, 12 fro DNA)word score threshold T = 9
Step 1: determine all words of length w in query sequence
QL LN NF FS SA AG GW
Determining query words
Step 2: determine all words that score at least T when compared to a word in the query sequence
QL QL=9LN LN=10NF NF=12, NY=9…SA none...
words fromsequence query words w/ T≥9
Scanning the database
• search database for all occurrences of query words• approach:
– index database sequences into table of words (pre-compute this)
– index query words into table (at query time)
NPNSNTNWNY
QLNFSAGW
MFNYT, STNYD…
NPGAT, TSQRPNP…
query sequence
database sequences
Extending hits
• BLAST extends hits into local alignments
• The original version of BLAST extended each hit separately
Extending hits in original BLAST
• extend hits in both directions (without allowing gaps) • terminate extension in one direction when score falls
certain distance below best score for shorter extensions
€
score(c) ≥ score(b) −ε ?
• return segment pairs scoring at least S
initial hit
best extension b
current extension c
Query pre-processing
• Some regions of the sequence are low complexity– repeats
• Identify low complexity regions and flag before BLAST– DNA sequence: DUST– Protein:SEG
Sensitivity vs. running time
• the main parameter controlling the sensitivity vs. running-time trade-off is T (threshold for what becomes a query word)– small T: greater sensitivity, more hits to expand– large T: lower sensitivity, fewer hits to expand
DBin matchest significan #
detected matchest significan #ysensitivit =
Speeding up BLAST: the two-hit method
• extension step typically accounts for 90% of BLAST’s execution time
• key intuition: – HSPs is longer than a word and made up closely located words– extend only when two words are on the same diagonal within
distance A of each other
• to maintain sensitivity, lower T parameter– more single hits found– but only small fraction have associated 2nd hit
The two-hit method
‘+’ hits with T > 12
‘’ hits with T > 10extend thesecases (A<40)
Figure from: Altschul et al. Nucleic Acids Research 25, 1997
Gapped BLAST
• To improve sensitivity
• trigger gapped alignment if two-hit extension has a sufficiently high score Sg
• find length-11 segment with highest score; use central pair in this segment as seed
• run DP both forward & backward from seed
• prune cells when local alignment score falls a certain distance below best score yet
query
database
http://blast.ncbi.nlm.nih.gov/Blast.cgi
BLAST Results
BLAST programs
Program Query Database
BLASTP Protein Protein
BLASTN DNA DNA
BLASTXTranslated
DNAProtein
TBLASTN ProteinTranslated
DNA
TBLASTXTranslated
DNATranslated
DNA
BLAST comments
• Use word-hits to pre-filter low quality matches– May miss some good matches– Fast: empirically, 10 to 50 times faster than Smith-Waterman– Gapped BLAST: improve sensitivity– Key parameters: T, HSP score for gap extension, seed for gap
extension
• Large impact:– NCBI’s BLAST server handles more than 500,000 queries a day– most used bioinformatics program in the world
Other alignment problems
• Whole genome sequence alignment– O(n2) too high for MegaBase length alignments– Hash and extend methods still expensive– Idea 1: Align gene by gene, and concatenate alignments– MUMmer: Suffix tree, Longest increasing sequence, Smith-
Waterman Algorithm• MUM: Maximal Unique Match• Need O(n) implementation of suffix tree
– Exploits closeness of sequences to be aligned (e.g. strains)
• Aligning reads to reference genome– Burrows Wheel Aligner
Suffix Tree for atgtgtgtc$
atgtgtgtc$ $c$ gt t
c$ c$ gt
7
1 9
5 3
8
6
4 2
10
c$ c$
c$
gtgtc$
gtc$
gt
Slide credit: Adam M Phillippy