Database Similarity Search

Upload: asher-stokes

Post on 21-Jan-2016


TRANSCRIPT

Page 1: Database Similarity Search. 2 Sequences that are similar probably have the same function Why do we care to align sequences?

Database Similarity Search

Page 2:

Sequences that are similar probably have the same function

Why do we care to align sequences?

Page 3:

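Discover the function of a new sequence

[Diagram: a new sequence of unknown function (?) is compared against a sequence database; a similar database sequence implies a similar function (≈).]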

Page 4:

Discover Function of a new sequence

Page 5:

Searching Databases for similar sequences

Naïve solution: use an exact algorithm to compare each sequence in the database to the query.

Is this reasonable?

How much time will it take to calculate?
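To make the cost of the naïve approach concrete, here is a minimal sketch of a single exact comparison. The slides do not name the exact algorithm; Smith-Waterman local alignment with a linear gap penalty is assumed here, and the function name and scoring values are illustrative only. The function also counts the matrix cells it fills, which is the quantity that grows with query length times database length.

```python
def smith_waterman_score(a, b, match=1, mismatch=-1, gap=-2):
    """Return (best local alignment score, number of matrix cells filled).

    Assumed scoring: +1 match, -1 mismatch, -2 per gap position (linear gaps).
    Only two rows of the matrix are kept in memory at any time.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    cells = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
            cells += 1
        prev = curr
    return best, cells


score, cells = smith_waterman_score("ACGCCCGGGAGCGC", "CTGGGCGTATAGCCC")
print(score, cells)
```

Every pair of positions costs one cell, so comparing a query against every database sequence this way costs (query length) × (total database length) cells.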

Page 6:

Complexity for genomes

• The human genome contains 3 × 10^9 base pairs
– Searching an mRNA against the human genome (HG) requires ~10^12 dynamic-programming cells

– Even efficient exact algorithms will be extremely slow when performed millions of times, even with parallel computing.
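The order of magnitude above can be checked with a few lines of arithmetic. This is a back-of-the-envelope sketch: the mRNA length (1,000 nucleotides) and the throughput (10^9 cells per second) are assumptions chosen for illustration and do not come from the slides.

```python
HUMAN_GENOME_BP = 3 * 10**9   # from the slide
MRNA_LEN = 1_000              # assumed typical mRNA length (illustrative)
CELLS_PER_SECOND = 10**9      # assumed throughput of one machine (illustrative)


def dp_cells(query_len, database_len):
    """Cells an exact dynamic-programming alignment must fill."""
    return query_len * database_len


cells = dp_cells(MRNA_LEN, HUMAN_GENOME_BP)
seconds = cells / CELLS_PER_SECOND
print(f"{cells:.1e} cells, about {seconds / 60:.0f} minutes per query")
```

Under these assumptions a single query already takes tens of minutes, so millions of queries are out of reach for an exact search.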

Page 7:

So what can we do?

Page 8:

Searching databases

Solution: use a heuristic (approximate) algorithm

Page 9:

Heuristic strategy

Reduce the search space

Remove regions that are not useful for meaningful alignments

Perform efficient search strategies

Preprocess the database into a new data structure to enable fast access

Page 10:

Heuristic strategy

• Reduce the search space

Remove regions that are not useful for meaningful alignments

• Preprocess the database into a new data structure to enable fast access

Page 11:

What sequences to remove?

Low-complexity sequences (JUNK???)

• AAAAAAAAAAA

• ATATATATATATA

• Transposable elements

53% of the genome is repetitive DNA

Page 12:

Low Complexity Sequences

What's wrong with them?
* Not informative
* Produce artificial high-scoring alignments

So what do we do?
We apply low-complexity masking to the database and the query sequence.

Mask
TCGATCGTATATATACGGGGGGTA
TCGATCGNNNNNNNNCNNNNNNTA
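The masking step can be sketched in a few lines. This is a deliberately simplified illustration, not the method used by BLAST itself (which relies on dedicated filters such as DUST for nucleotides and SEG for proteins). The function name, the maximum repeat period (2) and the minimum run length (6) are assumptions chosen so that the slide's example is reproduced.

```python
def mask_low_complexity(seq, max_period=2, min_len=6, mask_char="N"):
    """Replace tandem repeats of period <= max_period and length >= min_len.

    A run has period p when every residue equals the residue p positions
    before it (e.g. GGGGGG has period 1, TATATATA has period 2).
    """
    masked = list(seq)
    n = len(seq)
    for period in range(1, max_period + 1):
        i = 0
        while i + period < n:
            j = i
            # Extend while the sequence keeps repeating with this period.
            while j + period < n and seq[j] == seq[j + period]:
                j += 1
            run_len = j + period - i
            if j > i and run_len >= min_len:
                for k in range(i, i + run_len):
                    masked[k] = mask_char
                i += run_len
            else:
                i += 1
    return "".join(masked)


print(mask_low_complexity("TCGATCGTATATATACGGGGGGTA"))
```

Masked positions are written as N so that they can no longer seed or inflate an alignment, while the sequence coordinates stay unchanged.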

Page 13:

Heuristic strategy

• Remove low-complexity regions that are not useful for meaningful alignments

• Perform efficient search strategies

Preprocess the database into a new data structure to enable fast access

Page 14:

BLAST: Basic Local Alignment Search Tool

• General idea: a good alignment contains subsequences of high identity (local alignment):

ACGCCCGGGAGCGC

CTGGGCGTATAGCCC

– First, identify (as efficiently as possible) short, almost exact matches.
– Next, extend them to longer regions of similarity.
– Finally, optimize the alignment using an exact algorithm.

Altschul et al., 1990

Page 15:

DNA/RNA vs protein alphabet

DNA(4)

A T G C

RNA(4)

A U G C

Protein (20)

ACDEFGHIKLMNPQRSTVWY

A T = A G …    A T = A G …    A G >> A W …

WHY is it different?

Page 16:

The 20 Amino Acids

Page 17:

The 20 Amino Acids

[Figure: the 20 amino acids, with A, W and G labeled]

Page 18:

Scoring system for amino acid mismatches
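The contrast between the two alphabets can be shown with a small lookup. The slides do not name a particular matrix; the entries below are taken from BLOSUM62 as an assumed example, restricted to the residues A, G and W, and the DNA scores (+1/-1) are likewise illustrative.

```python
# A few symmetric entries from the BLOSUM62 matrix (assumed example matrix).
BLOSUM62_SUBSET = {
    ("A", "A"): 4, ("G", "G"): 6, ("W", "W"): 11,
    ("A", "G"): 0, ("A", "W"): -3, ("G", "W"): -2,
}


def protein_score(a, b, matrix=BLOSUM62_SUBSET):
    """Score a pair of amino acids; the matrix is symmetric."""
    return matrix[(a, b)] if (a, b) in matrix else matrix[(b, a)]


def dna_score(a, b, match=1, mismatch=-1):
    """All nucleotide mismatches are treated alike in this simple scheme."""
    return match if a == b else mismatch


print(dna_score("A", "T"), dna_score("A", "G"))          # equal penalties
print(protein_score("A", "G"), protein_score("A", "W"))  # A/G scores higher than A/W
```

Replacing alanine with the small glycine is penalized far less than replacing it with the bulky tryptophan, which is why protein scoring needs a full substitution matrix rather than a single mismatch value.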

Page 19:

BLAST: Basic Local Alignment Search Tool

• General idea: a good alignment contains subsequences of high identity (local alignment):

ACGCCCGGGAGCGC

CTGGGCGTATAGCCC

– First, identify (as efficiently as possible) short, almost exact matches.
– Next, extend them to longer regions of similarity.
– Finally, optimize the alignment using an exact algorithm.

Altschul et al., 1990

Page 20:

BLAST (Protein Sequence Example)

First, identify (as efficiently as possible) short, almost exact matches between the query sequence and the database.

Query sequence …FSGTWYA…

Words of length 3: FSG, SGT, GTW, TWY, WYA
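Splitting a sequence into overlapping words is a one-line sliding window. The function name and the default word length of 3 are chosen here to mirror the slide's example.

```python
def words(sequence, w=3):
    """Return all overlapping words of length w, in order of position."""
    return [sequence[i:i + w] for i in range(len(sequence) - w + 1)]


print(words("FSGTWYA"))
```

A sequence of length L yields L − w + 1 words, so the 7-residue query gives 5 words of length 3.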

Page 21:

BLAST

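Query words (top row) and similar words (below each):

FSG   SGT   GTW   TWY   WYA
YSG   TGT   ATW   SWY   WFA
FTG…  SVT…  GSW…  TWF…  WYS…

Preprocessing of the database

Seq 1  FSGTWYA  FSG, SGT, GTW, TWY, WYA
Seq 2  FDRTSYV  FDR, DRT, RTS, TSY, SYV
Seq 3  SWRTYVA  SWR, WRT, RTY, TYV, YVA
…

[Diagram: a BAG OF WORDS in which each word points to the sequences that contain it, e.g. Seq 1, Seq 102, Seq 3546]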
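The "bag of words" can be pictured as an inverted index from each word to the places where it occurs. The sketch below is a hypothetical illustration using a plain dictionary; the function name and the (sequence id, offset) layout are assumptions, not the actual on-disk format of any BLAST database.

```python
from collections import defaultdict


def build_word_index(database, w=3):
    """Map every word of length w to a list of (sequence id, offset) pairs."""
    index = defaultdict(list)
    for seq_id, seq in database.items():
        for pos in range(len(seq) - w + 1):
            index[seq[pos:pos + w]].append((seq_id, pos))
    return index


database = {"Seq 1": "FSGTWYA", "Seq 2": "FDRTSYV", "Seq 3": "SWRTYVA"}
index = build_word_index(database)
print(index["GTW"])
```

The index is built once, ahead of time; afterwards each query word is located with a single dictionary lookup instead of a scan over the whole database.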

Page 22:

BLAST

Query sequence …FSGTWYA…

Words of length 3: FSG, SGT, GTW, TWY, WYA…

DATABASE

FSG SGT GTW TWY WYA YSG TGT ATW SWY WFA FTG SVT GSW TWF WYS….

SEQ N INVIEIAFDGTWTCATTNAMHEWASNINETEEN
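Locating the query words in a database sequence can be sketched as follows. For brevity this hypothetical example looks for exact word matches only; as the word lists above indicate, BLAST also accepts similar (high-scoring) words, which this sketch leaves out. The function name and return layout are assumptions.

```python
def find_seeds(query, subject, w=3):
    """Return (query offset, subject offset) for every shared word of length w."""
    positions = {}
    for j in range(len(subject) - w + 1):
        positions.setdefault(subject[j:j + w], []).append(j)
    seeds = []
    for i in range(len(query) - w + 1):
        for j in positions.get(query[i:i + w], []):
            seeds.append((i, j))
    return seeds


query = "FSGTWYA"
subject = "INVIEIAFDGTWTCATTNAMHEWASNINETEEN"
seeds = find_seeds(query, subject)
print(seeds)
```

Each seed marks a position pair from which the next step, extension, can start.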

Page 23:

BLAST: Basic Local Alignment Search Tool

• General idea: a good alignment contains subsequences of high identity (local alignment):

ACGCCCGGGAGCGC

CTGGGCGTATAGCCC

– First, identify (as efficiently as possible) short, almost exact matches.
– Next, extend them to longer regions of similarity.
– Finally, optimize the alignment using an exact algorithm.

Altschul et al., 1990

Page 24:

BLAST

2. Extend word pairs as much as possible, i.e., as long as the total score increases

High-scoring Segment Pairs (HSPs)

Q: FIRSTLINIHFSGTWYAAMESIRPATRICKREAD

D: INVIEIAFDGTWTCATTNAMHEWASNINETEEN

Q= query sequence, D= sequence in database

3. Finally, optimize the alignment using an exact algorithm.
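Step 2 can be sketched as an ungapped extension in both directions from a seed. The scoring here (+2 match, -1 mismatch) and the drop-off tolerance `x_drop` are assumptions for illustration; a real protein search would score pairs with a substitution matrix. The extension keeps the best-scoring segment seen so far and gives up once the running score falls more than `x_drop` below it.

```python
def extend_seed(query, subject, q_pos, s_pos, w=3,
                match=2, mismatch=-1, x_drop=2):
    """Extend a word hit without gaps; return (score, query span, subject span).

    Spans are half-open (start, end) intervals of the high-scoring segment pair.
    """
    def pair_score(a, b):
        return match if a == b else mismatch

    best = sum(pair_score(query[q_pos + k], subject[s_pos + k]) for k in range(w))
    q_begin, q_end = q_pos, q_pos + w

    # Extend to the right of the seed.
    running = best
    i, j = q_pos + w, s_pos + w
    while i < len(query) and j < len(subject):
        running += pair_score(query[i], subject[j])
        i += 1
        j += 1
        if running > best:
            best, q_end = running, i
        elif best - running > x_drop:
            break

    # Extend to the left of the seed.
    running = best
    i, j = q_pos - 1, s_pos - 1
    while i >= 0 and j >= 0:
        running += pair_score(query[i], subject[j])
        if running > best:
            best, q_begin = running, i
        elif best - running > x_drop:
            break
        i -= 1
        j -= 1

    offset = s_pos - q_pos
    return best, (q_begin, q_end), (q_begin + offset, q_end + offset)


Q = "FIRSTLINIHFSGTWYAAMESIRPATRICKREAD"
D = "INVIEIAFDGTWTCATTNAMHEWASNINETEEN"
score, q_span, d_span = extend_seed(Q, D, q_pos=12, s_pos=9)  # seed: GTW
print(score, Q[q_span[0]:q_span[1]], D[d_span[0]:d_span[1]])
```

The seed GTW sits at offset 12 in Q and offset 9 in D; the returned spans delimit the HSP that would then be handed to the exact algorithm in step 3.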

Page 25:

Running BLAST to predict a function of a new protein

>Arrestin protein (C. elegans)
MFIANNCMPQFRWEDMPTTQINIVLAEPRCMAGEFFNAKVLLDSSDPDTVVHSFCAEIKGIGRTGWVNIHTDKIFETEKTYIDTQVQLCDSGTCLPVGKHQFPVQIRIPLNCPSSYESQFGSIRYQMKVELRASTDQASCSEVFPLVILTRSFFDDVPLNAMSPIDFKDEVDFTCCTLPFGCVSLNMSLTRTAFRIGESIEAVVTINNRTRKGLKEVALQLIMKTQFEARSRYEHVNEKKLAEQLIEMVPLGAVKSRCRMEFEKCLLRIPDAAPPTQNYNRGAGESSIIAIHYVLKLTALPGIECEIPLIVTSCGYMDPHKQAAFQHHLNRSKAKVSKTEQQQRKTRNIVEENPYFR

Page 26:
Page 27:
Page 28:

How to interpret a BLAST score:

• The score is a measure of the similarity of the query to the sequence shown.

How do we know if the score is significant?

– Statistical significance

– Biological significance

Page 29:

How to interpret a BLAST search:

For each BLAST score we can calculate an expectation value (E-value).

The expectation value (E-value) is the number of alignments with scores greater than or equal to score S that are expected to occur by chance in a database search.

page 105

Page 30:
Page 31:

BLAST E-value:

E = K m n e^{-λs}

• Increases linearly with the length of the query sequence (m)

• Increases linearly with the length of the database (n)

• Decreases exponentially with the score of the alignment (s)

– K, λ: statistical parameters dependent upon the scoring system and background residue frequencies

m = length of query; n = length of database; s = score
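The three dependencies listed above correspond to the standard Karlin-Altschul form E = K·m·n·e^(−λs). The sketch below simply evaluates that expression; the default values of `K` and `lam` are placeholders for illustration, not the parameters of any real scoring system.

```python
import math


def e_value(score, query_len, database_len, K=0.1, lam=0.3):
    """Expected number of chance alignments scoring at least `score`.

    K and lam are placeholder values (assumptions); real values depend on
    the scoring system and background residue frequencies.
    """
    return K * query_len * database_len * math.exp(-lam * score)


base = e_value(score=50, query_len=100, database_len=10**6)
print(base)
print(e_value(50, 200, 10**6) / base)   # doubling the query doubles E
print(e_value(60, 100, 10**6) / base)   # a higher score shrinks E exponentially
```

Because E scales with the database size, the same alignment score becomes less significant as the database grows.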

Page 32:

What is a good E-value? (Rule of thumb)

• E-values of less than 0.00001 show that sequences are almost always related.

• Greater E-values can represent functional relationships as well.

• Sometimes a real (biological) match has an E-value > 1

• Sometimes a similar E-value occurs for a short exact match and a long, less exact match

Page 33:

How to interpret a BLAST search:

• The score is a measure of the similarity of the query to the sequence shown.

How do we know if the score is significant?

– Statistical significance

– Biological significance

Page 34:

Treating Gaps in BLAST

>Human DNA
CATGCGACTGACcgacgtcgatcgatacgactagctagcATCGATCATA
>Human mRNA
CATGCGACTGACATCGATCATA

Sometimes corrections to the model are needed to infer biological significance

Page 35:

Gap Scores

• Standard solution: affine gap model

w_x = g + r(x - 1)

w_x: total gap penalty; g: gap open penalty; r: gap extend penalty; x: gap length

– Once-off cost for opening a gap
– Lower cost for extending the gap
– Changes required to the algorithm
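The affine formula w_x = g + r(x - 1) translates directly into code. The default penalties below (g = 10, r = 1) are illustrative assumptions, not values taken from the slides.

```python
def affine_gap_penalty(x, g=10, r=1):
    """Total penalty for one gap of length x: open once, then extend."""
    if x <= 0:
        return 0
    return g + r * (x - 1)


for length in (1, 5, 27):
    print(length, affine_gap_penalty(length))
```

With these values a single long gap, such as an intron missing from an mRNA, costs far less than the same number of separately opened one-residue gaps, which is the behavior the affine model is meant to capture.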

Page 36:

Gapped BLAST

4. Connect several HSPs by aligning the sequences in between them:

THEFIRSTLINIHFSGTWYAA____M_ESIRPATRICKREAD

INVIEIAFDGTWTCATTNAMHEW___ASNINETEEN

The Gapped BLAST algorithm allows several segments that are separated by short gaps to be connected into one alignment

Page 37:

BLAST

BLAST is a family of programs

Query: DNA Protein

Database: DNA Protein
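The query/database grid above can be written as a small lookup table. The program names are not given in the slides; they are the standard names of the BLAST family, added here for illustration, and the helper function is hypothetical.

```python
# (query type, database type) -> standard BLAST programs for that combination
BLAST_PROGRAMS = {
    ("DNA", "DNA"): ["blastn", "tblastx"],   # tblastx translates both sides
    ("protein", "protein"): ["blastp"],
    ("DNA", "protein"): ["blastx"],          # query is translated
    ("protein", "DNA"): ["tblastn"],         # database is translated
}


def programs_for(query_type, database_type):
    """Return the BLAST programs that accept this query/database pairing."""
    return BLAST_PROGRAMS[(query_type, database_type)]


print(programs_for("protein", "DNA"))
```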