Download - Heuristic methods for sequence alignment in practice Sushmita Roy BMI/CS 576 Sushmita Roy [email protected] Sep 27 th,

Heuristic methods for sequence alignment in practice

Sushmita RoyBMI/CS 576

www.biostat.wisc.edu/bmi576/Sushmita Roy

[email protected] 27th, 2012

BMI/CS 576

mailto:[email protected]

Heuristic alignment motivation

• too slow for large databases with high query traffic• heuristic methods do fast approximation to dynamic programming

– FASTA [Pearson & Lipman, 1988]– BLAST [Altschul et al., 1990; Altschul et al., Nucleic Acids

Research 1997]

)(mnO

Heuristic alignment motivation

• consider the task of searching the RefSeq collection of sequences against a query sequence:– most recent release of DB contains 17,368,769

proteins, 2,700,925 genomic sequences– Entails 17,000,000*(300*300) matrix operations

(assuming query sequence is of length 300 and avg. sequence length is 300)

• Need faster methods to search databases in practice

Heuristic alignment

• heuristic algorithm: a problem-solving method which isn’t guaranteed to find the optimal solution, but which is efficient and finds good solutions

• key heuristics in BLAST– A good alignment is made up short stretches of matches: seeds– Extend seeds to make longer alignments

• key tradeoff made: sensitivity vs. speed

Scoring matrices

• BLOSUM matrices for proteins• DNA: separate scores for match and mismatch, and

gaps.

Overview of BLAST (Basic Alignment Search Tool)

• given: query sequence q, word length w, word score threshold T, segment score threshold S– compile a list of “words” (of length w) that score at

least T when compared to words from q– scan database for matches to words in list– extend all matches to seek high-scoring alignments

• return: alignments scoring at least S

Determining query words

Given:query sequence: QLNFSAGWword length w = 2 (default for protein usually w = 3, 12 fro DNA)word score threshold T = 9

Step 1: determine all words of length w in query sequence

QL LN NF FS SA AG GW

Determining query words

Step 2: determine all words that score at least T when compared to a word in the query sequence

QL QL=9LN LN=10NF NF=12, NY=9…SA none...

words fromsequence query words w/ T≥9

Scanning the database

• search database for all occurrences of query words• approach:

– index database sequences into table of words (pre-compute this)

– index query words into table (at query time)

NPNSNTNWNY

QLNFSAGW

MFNYT, STNYD…

NPGAT, TSQRPNP…

query sequence

database sequences

Extending hits

• BLAST extends hits into local alignments

• The original version of BLAST extended each hit separately

Extending hits in original BLAST

• extend hits in both directions (without allowing gaps) • terminate extension in one direction when score falls

certain distance below best score for shorter extensions

€

score(c) ≥ score(b) −ε ?

• return segment pairs scoring at least S

initial hit

best extension b

current extension c

Query pre-processing

• Some regions of the sequence are low complexity– repeats

• Identify low complexity regions and flag before BLAST– DNA sequence: DUST– Protein:SEG

Sensitivity vs. running time

• the main parameter controlling the sensitivity vs. running-time trade-off is T (threshold for what becomes a query word)– small T: greater sensitivity, more hits to expand– large T: lower sensitivity, fewer hits to expand

DBin matchest significan #

detected matchest significan #ysensitivit =

Speeding up BLAST: the two-hit method

• extension step typically accounts for 90% of BLAST’s execution time

• key intuition: – HSPs is longer than a word and made up closely located words– extend only when two words are on the same diagonal within

distance A of each other

• to maintain sensitivity, lower T parameter– more single hits found– but only small fraction have associated 2nd hit

The two-hit method

‘+’ hits with T > 12

‘’ hits with T > 10extend thesecases (A<40)

Figure from: Altschul et al. Nucleic Acids Research 25, 1997

Gapped BLAST

• To improve sensitivity

• trigger gapped alignment if two-hit extension has a sufficiently high score Sg

• find length-11 segment with highest score; use central pair in this segment as seed

• run DP both forward & backward from seed

• prune cells when local alignment score falls a certain distance below best score yet

query

database

http://blast.ncbi.nlm.nih.gov/Blast.cgi

BLAST Results

BLAST programs

Program Query Database

BLASTP Protein Protein

BLASTN DNA DNA

BLASTXTranslated

DNAProtein

TBLASTN ProteinTranslated

DNA

TBLASTXTranslated

DNATranslated

DNA

BLAST comments

• Use word-hits to pre-filter low quality matches– May miss some good matches– Fast: empirically, 10 to 50 times faster than Smith-Waterman– Gapped BLAST: improve sensitivity– Key parameters: T, HSP score for gap extension, seed for gap

extension

• Large impact:– NCBI’s BLAST server handles more than 500,000 queries a day– most used bioinformatics program in the world

Other alignment problems

• Whole genome sequence alignment– O(n2) too high for MegaBase length alignments– Hash and extend methods still expensive– Idea 1: Align gene by gene, and concatenate alignments– MUMmer: Suffix tree, Longest increasing sequence, Smith-

Waterman Algorithm• MUM: Maximal Unique Match• Need O(n) implementation of suffix tree

– Exploits closeness of sequences to be aligned (e.g. strains)

• Aligning reads to reference genome– Burrows Wheel Aligner

Suffix Tree for atgtgtgtc$

atgtgtgtc$ $c$ gt t

c$ c$ gt

7

1 9

5 3

8

6

4 2

10

c$ c$

c$

gtgtc$

gtc$

gt

Slide credit: Adam M Phillippy

Download - Heuristic methods for sequence alignment in practice Sushmita Roy BMI/CS 576 Sushmita Roy [email protected] Sep 27 th,

Top Related