
Database searching with BLAST

Celia van Gelder, CMBI

UMC Radboud, September 2013

Outline of today’s lecture

• Transfer of information
• Database searching with sequences
• Sequence Alignment
• Scoring Matrices
• Significance of alignments
• BLAST
   • method
   • parameters
   • output


Transfer of information

The main topic of this course is the transfer of information from a well-known system to a “new” system (sequence).

In the protein world that leads to the questions:

1) From which protein can I transfer information?

2) How do I transfer what information from where to where?

Today’s answer is BLAST…


BLAST - Searching with sequences

LAST WEEK:

Searching with words (Google-like)
Query = word(s)
Tool used: MRS-Search, Entrez, SRS, …

TODAY:

Searching with sequences
Query = sequence
Tool used: BLAST (MRS, NCBI, …)


Database Searching with a query sequence

Purpose:

To identify similarities between

Your query sequence (with unknown structure and function)

and

Database sequences (with elucidated structures and function)

If we identify similarity we can transfer information!


Transfer of information to corresponding residues

Your sequence: DRTGHNIPLMSTRKTYHIHIENASEERTIKLLMN

is phosphorylated on one of the two serines.

Which one? What is your approach?


Transfer of information to corresponding residues

BLAST finds two database hits that are annotated to have a phosphorylated serine.

DRT-GHNIPLMSTRK-TYHIHIENASEERTIKLLMN
DRR-GTTINLMTTKR-TYADELENASEDRTLLLNMN
AEPIYYHL---LTKRETYHIHIENASEEKIIKIVVN

“This serine is phosphorylated in a known protein from the database, so in my protein the corresponding serine is likely to be phosphorylated too.”
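The reasoning on this slide can be sketched in a few lines of Python. The sketch below walks through the alignment columns and reports which serine of the query lines up with a serine in every database hit. The function name `conserved_serines` is made up for this illustration; the sequences are copied from the slide. It assumes that the annotated phosphoserine in each hit is the only serine that hit contains, which is the case for these two sequences.

```python
# Aligned sequences copied from the slide ('-' marks a gap).
query = "DRT-GHNIPLMSTRK-TYHIHIENASEERTIKLLMN"
hits = [
    "DRR-GTTINLMTTKR-TYADELENASEDRTLLLNMN",
    "AEPIYYHL---LTKRETYHIHIENASEEKIIKIVVN",
]


def conserved_serines(query, hits):
    """Return 1-based query residue numbers of serines aligned to a serine in every hit."""
    conserved = []
    residue_number = 0  # position in the ungapped query sequence
    for column, residue in enumerate(query):
        if residue == "-":
            continue  # gaps do not count as query residues
        residue_number += 1
        if residue == "S" and all(hit[column] == "S" for hit in hits):
            conserved.append(residue_number)
    return conserved


print(conserved_serines(query, hits))  # [24]
```

Only the second serine of the query (residue 24) is aligned with the serine of both hits, so that is the one to which the phosphorylation annotation would be transferred.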


Database searching concept

– The query sequence is compared (aligned) with every sequence in the database.

– High-scoring database sequences are assumed to be evolutionarily related to the query sequence.

– If sequences are related by divergence from a common ancestor, they are said to be homologous.


Sequence Alignment

gap = insertion or deletion (indel)

[Figure: alignment of sequences A and B]


Sequence alignment is easy:

You only need three things:

1) A computer program that produces all possible alignments, and

2) A computer program that gives each alignment a score, and, the simplest,

3) A computer program that selects the highest scoring alignment from the very large number you tried.
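The three "programs" above can be sketched literally for very short sequences. This is a brute-force illustration only: the number of possible alignments explodes with sequence length, which is why real tools use smarter algorithms. The function names and the scoring values (match +1, mismatch -1, gap -2) are assumptions chosen for the example, not values from the lecture.

```python
def all_alignments(a, b):
    """Program 1: yield every gapped alignment of a and b as two equal-length strings."""
    if not a and not b:
        yield "", ""
        return
    if a and b:  # align the first residues with each other
        for x, y in all_alignments(a[1:], b[1:]):
            yield a[0] + x, b[0] + y
    if a:  # first residue of a against a gap
        for x, y in all_alignments(a[1:], b):
            yield a[0] + x, "-" + y
    if b:  # gap against the first residue of b
        for x, y in all_alignments(a, b[1:]):
            yield "-" + x, b[0] + y


def score(x, y, match=1, mismatch=-1, gap=-2):
    """Program 2: add up a score for every column (illustrative values)."""
    total = 0
    for p, q in zip(x, y):
        if p == "-" or q == "-":
            total += gap
        elif p == q:
            total += match
        else:
            total += mismatch
    return total


def best_alignment(a, b):
    """Program 3: pick the highest-scoring alignment."""
    return max(all_alignments(a, b), key=lambda pair: score(*pair))


top, bottom = best_alignment("HEAG", "HAG")
print(top)
print(bottom)
```

Even for these two sequences of length 4 and 3 there are already 129 alignments to try.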


Scoring/Substitution Matrix

• Scoring scheme for quality of an alignment

• Contains scores for every possible amino acid substitution in a sequence alignment

• For protein/protein comparisons we need a 20 x 20 matrix with scores for pairs of residues. The cell at position (X, Y) contains the score for the substitution (mutation) of amino acid X by amino acid Y.


Scores

• Positive score if corresponding amino acid residues in the two aligned sequences are identical or similar. This is a likely change.

• Negative score if corresponding amino acid residues are not similar. This is an unlikely change.

• The scores are numbers that you can add up.


Amino Acid substitutions, some thoughts

Not all 20x20 possible mutations occur equally often

• Residues mutate more easily to similar ones (e.g. Leucine and Isoleucine)
• Residues at the surface mutate more easily
• Aromatics mutate preferably into aromatics
• The core tends to be hydrophobic
• Cysteines are dangerous at the surface
• Cysteines in sulfur bridges (S-S) seldom mutate
• Some amino acids have similar codons (for example TTT & TTC for Phe, TTA & TTG for Leu)
• Etc.


PAM250 Matrix (Dayhoff Matrix)


Scoring example

Score of an alignment is the sum of the scores of all pairs of residues in the alignment

sequence 1:  T  C  C  P  S  I  V  A  R  S  N
sequence 2:  S  C  C  P  S  I  S  A  R  N  T
scores:      1 12 12  6  2  5 -1  2  6  1  0

=> score = 46
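The sum on this slide can be reproduced with a small lookup table. The dictionary below contains only the pair scores printed on the slide, so it is an excerpt for illustration and not the full 20 x 20 matrix; the names `PAIR_SCORES`, `pair_score` and `alignment_score` are made up for this sketch.

```python
# Pair scores as printed on the slide (an excerpt, not the full 20 x 20 matrix).
PAIR_SCORES = {
    ("T", "S"): 1, ("C", "C"): 12, ("P", "P"): 6, ("S", "S"): 2,
    ("I", "I"): 5, ("V", "S"): -1, ("A", "A"): 2, ("R", "R"): 6,
    ("S", "N"): 1, ("N", "T"): 0,
}


def pair_score(x, y):
    """Look up a substitution score; the matrix is symmetric, so try both orders."""
    if (x, y) in PAIR_SCORES:
        return PAIR_SCORES[(x, y)]
    return PAIR_SCORES[(y, x)]


def alignment_score(seq1, seq2):
    """Score of an ungapped alignment: the sum of the scores of all residue pairs."""
    return sum(pair_score(x, y) for x, y in zip(seq1, seq2))


print(alignment_score("TCCPSIVARSN", "SCCPSISARNT"))  # 46
```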


Scoring matrix, continued

• When you use bioinformatics tools (BLAST, CLUSTAL, etc.) the scoring matrix is often a parameter that you can choose.

• Two widely used matrices (often the default in the packages):

   PAM250 (Dayhoff et al.)
      • Based on closely similar proteins

   BLOSUM62 (Henikoff et al.)
      • Based on conserved regions
      • Considered best for distantly related proteins


Significance of alignment (1)

When is an alignment statistically significant?

In other words:

How different is the alignment score we found from the scores obtained by aligning random sequences to the query sequence?

Or:

What is the probability that an alignment with this score could have arisen by chance?


Significance of alignment (2)

Database size = 200 x 10^6 amino acids

peptide     #hits
A           10 x 10^6
AP          500 x 10^3
IAP         25000
LIAP        1250
WLIAP       62.5
KWLIAP      3.1
KWLIAPY     0.16
KWLIAPYS    0.008
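The numbers in this table follow from one simple rule: every extra residue divides the expected number of chance hits by 20. A minimal sketch, assuming (as the table implicitly does) that all 20 amino acids are equally frequent and that positions are independent; the function name `expected_hits` is made up for this example.

```python
def expected_hits(peptide, database_size=200e6, alphabet_size=20):
    """Expected number of exact chance matches of a peptide in a random database."""
    return database_size / alphabet_size ** len(peptide)


for peptide in ["A", "AP", "IAP", "LIAP", "WLIAP", "KWLIAP", "KWLIAPY", "KWLIAPYS"]:
    print(f"{peptide:<10}{expected_hits(peptide):g}")
```

In real databases amino acid frequencies are far from uniform, so these numbers are only an order-of-magnitude guide.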


Sequence similarity search

Question: What database sequences are most similar to (or contain the most similar regions to) my own sequence?

Input: Query sequence
Output: List of sequences that are similar to the query sequence


BLAST

• BLAST – Basic Local Alignment Search Tool

• BLAST finds the highest scoring locally optimal alignments between a query sequence and all database sequences.

• Very fast algorithm

• Can be used to search extremely large databases

• Sufficiently sensitive and selective for most purposes

• Robust – the default parameters can usually be used


Why use BLAST?

BLAST searching is fundamental to understanding the relatedness of any favorite query sequence to other known proteins or DNA sequences.

Applications include
• discovering new genes or proteins
• discovering variants of genes or proteins
• exploring protein structure and function
• Etc.

It is all about transfer of information!


BLAST – Algorithm

Step 1: Read/understand user query sequence.

Step 2: Use hashing technology to select several thousand likely candidates.

Step 3: Do a real alignment between the query sequence and those likely candidates. N.B. ‘Real alignment’ is a main topic of this course.

Step 4: Present result to user: list of sequences that match query sequence & their alignments
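Step 2 can be pictured as a word lookup. The sketch below is a strong simplification and not the actual BLAST implementation: it indexes every 3-letter word of each database sequence in a hash table (a Python dictionary) and keeps the sequences that share at least two exact words with the query. Real BLAST additionally considers similar, high-scoring words and extends the word hits. The database entries `hit1` and `hit2` reuse the sequences from the phosphorylation example; `unrelated`, the function names and the thresholds are assumptions made for this illustration.

```python
from collections import defaultdict


def build_word_index(database, word_size=3):
    """Map every word of length word_size to the names of the sequences containing it."""
    index = defaultdict(set)
    for name, sequence in database.items():
        for i in range(len(sequence) - word_size + 1):
            index[sequence[i:i + word_size]].add(name)
    return index


def candidate_hits(query, index, word_size=3, min_shared_words=2):
    """Count the distinct query words found in each database sequence; keep likely candidates."""
    shared = defaultdict(int)
    query_words = {query[i:i + word_size] for i in range(len(query) - word_size + 1)}
    for word in query_words:
        for name in index.get(word, ()):
            shared[name] += 1
    return {name: n for name, n in shared.items() if n >= min_shared_words}


database = {
    "hit1": "DRRGTTINLMTTKRTYADELENASEDRTLLLNMN",
    "hit2": "AEPIYYHLLTKRETYHIHIENASEEKIIKIVVN",
    "unrelated": "MKVLAAGIVGLLLAQPSFA",  # made-up sequence
}
index = build_word_index(database)
candidates = candidate_hits("DRTGHNIPLMSTRKTYHIHIENASEERTIKLLMN", index)
print(sorted(candidates))  # ['hit1', 'hit2']
```

Only the candidates that survive this cheap lookup go on to the expensive "real alignment" of step 3.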


Basic BLAST Algorithms

Program   Query            Database         Comparisons
BLASTP    protein          protein          1
BLASTN    DNA              DNA              1
BLASTX    translated DNA   protein          6
TBLASTN   protein          translated DNA   6
TBLASTX   translated DNA   translated DNA   36


DNA potentially encodes six proteins

5’ CAT CAA
5’ ATC AAC
5’ TCA ACT

5’ GTG GGT
5’ TGG GTA
5’ GGG TAG

5’ CATCAACTACAACTCCAAAGACACCCTTACACATCAACAAACCTACCCAC 3’
3’ GTAGTTGATGTTGAGGTTTCTGTGGGAATGTGTAGTTGTTTGGATGGGTG 5’

Slide from Bioinformatics and Functional Genomics by Jonathan Pevsner, Copyright © 2009
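The six reading frames on this slide (three on each strand) are what BLASTX, TBLASTN and TBLASTX translate before comparing. A minimal sketch, using the standard genetic code and the DNA fragment from the slide; the function names and the frame labels `+1` … `-3` are conventions chosen for this example.

```python
# Standard genetic code, codons ordered with bases T, C, A, G ('*' = stop).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def reverse_complement(dna):
    """Return the opposite strand, read 5' to 3'."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def codons(dna, frame):
    """Split dna into complete codons, starting at offset frame (0, 1 or 2)."""
    return [dna[i:i + 3] for i in range(frame, len(dna) - 2, 3)]


def six_frame_translation(dna):
    """Translate all three frames of both strands."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            protein = "".join(CODON_TABLE[c] for c in codons(seq, frame))
            frames[f"{strand}{frame + 1}"] = protein
    return frames


dna = "CATCAACTACAACTCCAAAGACACCCTTACACATCAACAAACCTACCCAC"
for label, protein in six_frame_translation(dna).items():
    print(label, protein)
```

The first two codons of each frame match the ones shown on the slide, for example CAT CAA for frame +1 and GGG TAG for frame -3.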


Steps in running BLAST

• Enter your query sequence (cut-and-paste)

• Select the database(s) you want to search

And, optionally:

• Choose output parameters

• Choose alignment parameters (scoring matrix, filters, …)


BLAST Input - FASTA format

>relevant_sequence_name optional comments
AFIWLLSCYALLGTTFGCGVNAIHPVLTGLSKIVNGEEAVPGTWPWQVTLQDRSGFHFCSLISEDWVVTAAHCGVRTSEILIAGEFDQGSDEDNIQVLRIAKVFKQPKYSILTVNNDITLLKLASPARYSQTISAVCLPSVDDDAGSLCATTGWGRTKYNANKSPDKLERAALPLLT
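A FASTA record is a header line starting with ">" (a name, optionally followed by comments) and then one or more lines of sequence. A minimal parser sketch, to show how little structure the format has; the function name `parse_fasta` is made up, and the example uses only the first part of the sequence above, split over two lines.

```python
def parse_fasta(text):
    """Return a list of (name, comment, sequence) tuples from FASTA-formatted text."""
    records = []
    name, comment, chunks = None, "", []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:  # store the previous record
                records.append((name, comment, "".join(chunks)))
            header = line[1:].split(maxsplit=1)
            name = header[0] if header else ""
            comment = header[1] if len(header) > 1 else ""
            chunks = []
        elif name is not None:  # sequence lines are simply concatenated
            chunks.append(line)
    if name is not None:
        records.append((name, comment, "".join(chunks)))
    return records


example = """>relevant_sequence_name optional comments
AFIWLLSCYALLGTTFGCGVNAIHPVLTG
LSKIVNGEEAVPGTWPWQVTLQDRSGFHF
"""
for name, comment, sequence in parse_fasta(example):
    print(name, "|", comment, "|", len(sequence))
```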


BLAST Output

A high score indicates a likely relationship

A low E-value indicates that a match is unlikely to have arisen by chance

Click here to go to the corresponding Swiss-Prot entry

Click here to study alignment in detail; Look here first!!


BLAST Output

Low scores with high E-values suggest that matches have arisen by chance

But remember:

Mathematical significance ≠ biological significance!


Alignment Significance in BLAST: P value (probability)

• A p value is a way of representing the significance of an alignment.

• The closer to zero, the greater the confidence that the hit is significant.

• 0 < p < 1


Alignment Significance in BLAST: E value (expect value)

• The expect value E is the number of alignments with scores greater than or equal to the current score S that are expected to occur by chance in a database search.

• E.g. an E value of 5 assigned to a hit indicates that in a database of the current size one might expect to see 5 matches with a similar score simply by chance.

• Rule of thumb: An E value of 10^-6 or better normally means that things are OK.
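The P value and the E value are two views of the same statistic. The relation used below, P = 1 - exp(-E), is the standard one from BLAST statistics (it treats the number of chance hits as Poisson distributed); it is not stated on the slides and is added here only as an illustration. The function name is made up.

```python
import math


def p_value_from_e_value(e_value):
    """P = 1 - exp(-E): probability of at least one chance hit with this score or better."""
    # expm1 keeps precision for very small E values.
    return -math.expm1(-e_value)


for e_value in [5, 1, 0.01, 1e-6]:
    print(f"E = {e_value:<8g} P = {p_value_from_e_value(e_value):.6g}")
```

For small E values the two numbers are practically identical, while for E = 5 the P value is close to 1: such a hit is almost certainly expected by chance.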


BLAST result: easy


BLAST result: less easy


BLAST result: very difficult


BLAST parameter: Low complexity filter

• Many sequences contain repeats or stretches that consist predominantly of one type of amino acid.

• We call these low-complexity regions.

• Examples:
   • Many nuclear proteins have a poly-asparagine tail (polyN)
   • Huntington's disease: polyglutamine (polyQ) repeat
   • Membrane proteins often consist mainly of hydrophobic amino acids
   • Many binding proteins have proline-rich stretches. Example: PPPPPPL/R
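To get a feeling for what a low-complexity filter does, the sketch below slides a window over the sequence, measures how varied the residues in the window are (Shannon entropy), and replaces low-entropy windows by X. This is a simplified stand-in and not the filter that BLAST itself uses; the window size, the entropy threshold, the function names and the test sequence are all assumptions made for this illustration.

```python
import math
from collections import Counter


def shannon_entropy(window):
    """Entropy in bits of the residue composition; 0 means a single residue type."""
    counts = Counter(window)
    return -sum((n / len(window)) * math.log2(n / len(window)) for n in counts.values())


def mask_low_complexity(sequence, window_size=8, max_entropy=1.5):
    """Replace every window whose entropy is at most max_entropy by X characters."""
    masked = list(sequence)
    for start in range(len(sequence) - window_size + 1):
        window = sequence[start:start + window_size]
        if shannon_entropy(window) <= max_entropy:
            for i in range(start, start + window_size):
                masked[i] = "X"
    return "".join(masked)


# Made-up sequence with a polyN stretch in the middle.
sequence = "ACDEFGHIKL" + "NNNNNNNN" + "MPQRSTVWYA"
print(sequence)
print(mask_low_complexity(sequence))
```

Note that a window-based mask like this one also hides a few residues next to the repeat, because windows that are mostly N still have a low entropy.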


BLAST - Low complexity filter

Low complexity regions influence your BLAST output

Use the low complexity filter to adapt your BLAST query sequence:

• Filter ON
• Filter OFF

[Figure: query sequence with a low-complexity stretch (NNNNNNNN), shown with the filter ON and OFF]

Choice depends on your research question!


Low complexity motifs visible


Things we discussed today

Why we want to do database searches: transfer of information!

Alignment & scoring methods

Significance of alignments

BLAST
• principle of method
• BLAST output, in particular E-value
• BLAST input parameters, in particular low complexity filter

Let's BLAST!!