July 23, 2003 Patricia Babbitt, PhD, Univ. of Calif., San Francisco

Introduction to Bioinformatics: Protein Informatics

7/23/03 NHLBI Symposium: From Genome to Disease

Patricia C. Babbitt
University of California, San Francisco

[email protected]


“ –mastics, –omens & omics” (courtesy of Cambridge Healthtech Institute: 50 & counting...)

• Biome
• Celluome
• Chronome
• Clinome
• Complexome
• Crystallome
• Cytome
• Diagnome
• Enzymome
• Epigenome
• Fluxome
• Foldome
• Functome
• Genome
• Glycome
• Infectuome
• Immunome
• Interactome
• Localizome
• Metabolome
• Methylome
• Microbiome
• Morphome
• Operome
• ORFeome
• Pathogenome
• Peptidome
• Pharmacogenomics
• Phenome
• Phylogenome
• Physiome
• Promoterome
• Proteome
• Pseudogenome
• Regulome
• Resistome
• Ribonome
• Secretome
• Signalome
• Somatonome
• Toxicome
• Transcriptome
• Translatome
• Unknome
• Vaccinomics
• Variome


Applications of Protein Informatics

• deduction of function
• tracing ancestral connections
• understanding enzyme mechanisms
• structural analysis of receptors, molecules involved in cell signaling
• identification of molecular surfaces in protein-protein, protein-DNA interactions
• protein engineering
• clustering of families, superfamilies
• metabolic computing/comparative genome analysis


Tools/Approaches for Protein Informatics

• database searching/pairwise alignments
• pattern searching and motif analysis
• multiple alignments
• phylogenetic tree construction
• sequence and structure comparison
• comparative genomics
• “metabolic computing”
• transmembrane/2° structure prediction
• 3D structure prediction/modeling
• visualization
• composition/pI/mass analysis


Protein vs. nucleic acid sequence analysis?

• Protein sequence analysis is more specific and less noisy than nucleic acid analysis due to the inherent differences in the message content of nucleic acid and amino acid codes
• 20-letter code vs 4-letter code, degeneracy of codon messaging
• But searches for many functional genomics experiments must be done at nucleotide level...


Outline: Performing your own Analyses in Protein Informatics

• Ins and Outs of database searching
– underlying assumptions
– scoring, optimization, statistical significance, caveats
• Fasta, Blast & PsiBlast
• Pattern searching & motif analysis
• Pre-computed analyses for protein families using sequence and structure information, motif databases


Database searching

• The first and most common operation in protein informatics...and the only way to access the information in large databases
• Primary tool for inference of homologous structure and function
• Improved algorithms to handle large databases quickly
• Provides an estimate of statistical significance
• Generates alignments
• Definitions of similarity can be tuned using different scoring matrices and algorithm-specific parameters


The underlying assumption used in functional inference...

…requires comparison of sequences

Sequence Conservation

Structure Conservation

Function Conservation


Formalizing the Problem

• Given: two sequences that you want to align
• Goal: find the best alignment that can be obtained by sliding one sequence along the other
• Requirements:
– a scheme for evaluating matches/mis-matches between any two characters
– a score for insertions/deletions
– a method for optimization of the total score
– a method for evaluating the significance of the alignment
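
As a minimal sketch of the problem statement above (not the method of any particular search program), the following slides one sequence along the other without gaps and keeps the best-scoring offset. The function names and the +1/0 identity scoring are assumptions made for illustration; insertions/deletions and significance are deliberately left out.

```python
def score_at_offset(a, b, offset, match=1, mismatch=0):
    """Score the ungapped overlap when b is shifted right by `offset` relative to a."""
    total = 0
    for i, residue in enumerate(b):
        j = i + offset  # position in a that b[i] lines up with
        if 0 <= j < len(a):
            total += match if a[j] == residue else mismatch
    return total


def best_ungapped_offset(a, b):
    """Try every offset with at least one overlapping position; return (offset, score)."""
    offsets = range(-(len(b) - 1), len(a))
    scored = ((off, score_at_offset(a, b, off)) for off in offsets)
    return max(scored, key=lambda pair: pair[1])


offset, score = best_ungapped_offset("SEQUENCEHOMOLOG", "SEQUENCEANALOG")
print(offset, score)
```

Swapping the identity scoring for a substitution matrix changes only `score_at_offset`; the search over offsets stays the same.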


Scoring Systems

• The degree of match between two letters can be represented in a matrix
• Changing the matrix can change the alignment
– Simplest: Identity (unitary) matrix
– Better: Definitions of similarity based on inferences about chemical or biological properties
– Examples: PAM, Blosum, Gonnet matrices
• The score should be based on the ratio $p_{ab} / (q_a q_b)$, where $p_{ab}$ is the probability that residue a is substituted by residue b, and $q_a$ and $q_b$ are the background probabilities for residues a and b respectively. In practice the logarithm of this ratio is used, so that scores add along an alignment.
• Handling gaps remains an incompletely solved problem...
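
The ratio above can be turned into an additive score by taking its logarithm. The sketch below assumes base-2 logarithms scaled by 2 and rounded to integers, which is one common convention; the function name and the example probabilities are invented for illustration and are not taken from any published matrix.

```python
import math


def log_odds_score(p_ab, q_a, q_b, scale=2.0):
    """Return scale * log2(p_ab / (q_a * q_b)), rounded to the nearest integer."""
    return round(scale * math.log2(p_ab / (q_a * q_b)))


# Toy numbers, invented for illustration only.
print(log_odds_score(p_ab=0.02, q_a=0.05, q_b=0.05))    # pair seen more often than chance
print(log_odds_score(p_ab=0.001, q_a=0.05, q_b=0.05))   # pair seen less often than chance
print(log_odds_score(p_ab=0.0025, q_a=0.05, q_b=0.05))  # pair seen about as often as chance
```

Pairs observed more often than the background predicts get positive scores, rarer pairs get negative scores, and pairs at chance level score about zero.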


BLOSUM (BLOcks SUbstitution) Matrices

• Derived from the BLOCKS database, which, in turn, is derived from the PROSITE library (see http://blocks.fhcrc.org/blocks/; http://www.expasy.ch/prosite/)
• BLOCKS generated from multiply aligned sequence segments without gaps, clustered at various similarity thresholds and corrected to avoid sampling bias
• Derived from data representing highly conserved sequence segments from divergent proteins rather than data based on very similar sequences (as with PAM matrices)


Derivation of BLOSUM matrices*

• Many sequences from aligned families are used to generate the matrices
• Sequences identical at >X% are eliminated to avoid bias from proteins over-represented in the database
• Specific matrices refer to these clustering cut-offs, i.e., BLOSUM62 reflects observed substitutions between segments <62% identical
• These matrices have become the default scoring schemes used at most primary internet search sites
• Different matrices can make a difference to your results!

*adapted from Ewens & Grant, Statistical Methods in Bioinformatics
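
The clustering cut-off described above can be sketched as follows. This is a simplified illustration, not the published procedure: segments in an ungapped block are linked whenever a pair reaches the identity threshold, and linked segments end up in one cluster. The function names and the toy block are hypothetical. How a cluster is then treated (dropped, as the slide puts it, or down-weighted so that it counts as a single sequence) is outside this sketch.

```python
def percent_identity(s, t):
    """Percent of identical positions between two equal-length, ungapped segments."""
    if len(s) != len(t):
        raise ValueError("segments in a block must have the same length")
    same = sum(1 for x, y in zip(s, t) if x == y)
    return 100.0 * same / len(s)


def cluster_segments(segments, threshold):
    """Group segments so that any pair at or above `threshold` % identity shares a cluster."""
    clusters = []
    for seg in segments:
        # Existing clusters that contain at least one segment linked to this one.
        linked = [c for c in clusters
                  if any(percent_identity(seg, member) >= threshold for member in c)]
        merged = [seg]
        for c in linked:
            merged.extend(c)
            clusters.remove(c)
        clusters.append(merged)
    return clusters


toy_block = ["ACDEF", "ACDEY", "ACHKL", "WCHKL"]  # invented segments
for cluster in cluster_segments(toy_block, threshold=62):
    print(cluster)
```

With a 62% cut-off the four toy segments collapse into two clusters, so substitutions would only be counted between segments that are less than 62% identical.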


• Scoring matrices are tailored to degree of divergence and may require a specific query length for optimal performance*

Query Length    Substitution Matrix
<35             PAM-30
35-50           PAM-70
50-85           BLOSUM-80
>85             BLOSUM-62

*adapted from information available at the NCBI Blast web site
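
The table can be written as a small lookup. The function name is hypothetical, and because the table lists 35, 50 and 85 at the edges of adjacent rows, the choice of which row owns each boundary value is an assumption made here.

```python
def suggest_matrix(query_length):
    """Return the substitution matrix the table suggests for a query of this length."""
    if query_length < 35:
        return "PAM-30"
    if query_length <= 50:   # assumption: 35 and 50 both fall in the PAM-70 row
        return "PAM-70"
    if query_length <= 85:   # assumption: 85 falls in the BLOSUM-80 row
        return "BLOSUM-80"
    return "BLOSUM-62"


for length in (20, 40, 70, 255):
    print(length, suggest_matrix(length))
```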


Scoring and optimization


• Dot matrix plots: a simple description of alignment operations illustrating types of relationships between a sequence pair

[Figure: dot matrix plot of SEQUENCEHOMOLOG (horizontal axis) against SEQUENCEANALOG (vertical axis), with a dot wherever the two letters are identical]
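
A dot matrix of this kind takes only a few lines to generate. The sketch below uses the two example words from the slide; the function name and the `*`/`.` rendering are choices made here for illustration.

```python
def dot_matrix(seq_x, seq_y):
    """Return one text row per letter of seq_y: '*' where it equals the seq_x letter, else '.'."""
    return ["".join("*" if x == y else "." for x in seq_x) for y in seq_y]


SEQ_X = "SEQUENCEHOMOLOG"
SEQ_Y = "SEQUENCEANALOG"
print("  " + SEQ_X)
for letter, row in zip(SEQ_Y, dot_matrix(SEQ_X, SEQ_Y)):
    print(letter + " " + row)
```

Runs of `*` along a diagonal mark stretches that match without gaps; scattered off-diagonal marks are the composition-dependent background.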


• The signal-to-noise ratio can be improved using filtering techniques designed to minimize the composition-dependent background
• Example of common filters: overlapping, fixed-length "windows" for sequence comparison
• To be counted, a comparison must achieve a minimum threshold score summed over the window, derived empirically or from a statistical or evolutionary model of sequence similarity
• The window size and minimum threshold score (often termed "stringency") at which the score is counted can be user-defined


Seq1 = SEQUENCEHOMOLOG
Seq2 = SEQUENCEANALOG
Window = 7, Stringency = 42% (3/7 matches)

SEQUENC
SEQUENCEANALOG   (7/7 matches)

 SEQUENC
SEQUENCEANALOG   (0/7 matches)

...

      CEHOMOL
SEQUENCEANALOG   (2/7 matches)

...

       HOMOLOG   (3/7 matches)
SEQUENCEANALOG
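
The window counts in this example can be reproduced with a short sketch. The function names are hypothetical and positions are 0-based; a window pair is kept only if it reaches the stringency of 3 matches out of 7.

```python
def window_matches(seq1, seq2, i, j, window):
    """Count identities between seq1[i:i+window] and seq2[j:j+window]."""
    return sum(1 for a, b in zip(seq1[i:i + window], seq2[j:j + window]) if a == b)


def filtered_dots(seq1, seq2, window=7, min_matches=3):
    """Return (i, j, matches) for every window pair that reaches the stringency."""
    hits = []
    for i in range(len(seq1) - window + 1):
        for j in range(len(seq2) - window + 1):
            matches = window_matches(seq1, seq2, i, j, window)
            if matches >= min_matches:
                hits.append((i, j, matches))
    return hits


SEQ1 = "SEQUENCEHOMOLOG"
SEQ2 = "SEQUENCEANALOG"
print(window_matches(SEQ1, SEQ2, 0, 0, 7))  # SEQUENC vs SEQUENC
print(window_matches(SEQ1, SEQ2, 0, 1, 7))  # SEQUENC vs EQUENCE
print(window_matches(SEQ1, SEQ2, 6, 6, 7))  # CEHOMOL vs CEANALO
print(window_matches(SEQ1, SEQ2, 8, 7, 7))  # HOMOLOG vs EANALOG
print(len(filtered_dots(SEQ1, SEQ2)), "window pairs pass the filter")
```

Raising `min_matches` removes more background at the cost of weaker true matches, which is the trade-off the stringency setting controls.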


Window = 30; Stringency = 2


Window = 30; Stringency = 11


• To measure the local similarity between 2 sequences, scores can be used in the matrix instead of dots for a sliding window comparison
– Summing the identities/similarities at each position
– For a window of 5 residues and storing the score in the position corresponding to the center of the window:

1 P R I M E 5
+1 -1 -2 +0 +4 = +2
1 S E Q U E N C E A N A L Y S I S P R I M E R 21

. . .

                                1 P R I M E 5
                                +6 +6 +5 +6 +4 = +27
1 S E Q U E N C E A N A L Y S I S P R I M E R 21
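
The two window sums above can be checked with a short sketch. The pair scores in the dictionary are copied from the slide's two worked windows; the slide does not name the matrix behind them, so none is assumed here. Scoring every other pair as 0 is an assumption made only to keep the example small, and the function names are hypothetical.

```python
# Pair scores copied from the two worked windows; unlisted pairs score 0 (assumption).
SLIDE_SCORES = {
    ("P", "S"): 1, ("R", "E"): -1, ("I", "Q"): -2, ("M", "U"): 0, ("E", "E"): 4,
    ("P", "P"): 6, ("R", "R"): 6, ("I", "I"): 5, ("M", "M"): 6,
}


def pair_score(a, b):
    """Look up a pair in either order; fall back to 0 for pairs the slide does not show."""
    return SLIDE_SCORES.get((a, b), SLIDE_SCORES.get((b, a), 0))


def window_score(query, target, start):
    """Sum the pair scores of query against target[start:start+len(query)]."""
    return sum(pair_score(q, t) for q, t in zip(query, target[start:start + len(query)]))


def sliding_scores(query, target):
    """Map the 0-based center of each window on target to its summed score."""
    half = len(query) // 2
    return {start + half: window_score(query, target, start)
            for start in range(len(target) - len(query) + 1)}


TARGET = "SEQUENCEANALYSISPRIMER"
print(window_score("PRIME", TARGET, 0))   # first window: PRIME over SEQUE
print(window_score("PRIME", TARGET, 16))  # window over the embedded PRIME
```

Only the two windows shown on the slide are fully determined by the listed scores; the other entries of `sliding_scores` depend on the zero default.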


Statistical Significance

• A good way to determine if the alignment score has statistical meaning is to compare it with the score generated from the alignment of two random sequences
• A model of ‘random’ sequences is needed. The simplest model chooses the amino acid residues in a sequence independently, with background probabilities
• For an un-gapped alignment, the score of a match to a random sequence is the sum of many similar random variables, so the sum can be approximated by a normal distribution.
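
A minimal sketch of this comparison, under stated assumptions: the score is a plain identity count for an ungapped, same-position alignment, the background probabilities are estimated from the subject sequence itself, and the result is summarized as a z-score, which presumes the normal approximation in the last bullet. All names are hypothetical.

```python
import random
import statistics
from collections import Counter


def identity_score(a, b):
    """Ungapped score: number of identical residues at the same positions."""
    return sum(1 for x, y in zip(a, b) if x == y)


def random_model_z_score(query, subject, trials=500, seed=0):
    """Compare the observed score with scores against independently drawn random sequences."""
    rng = random.Random(seed)
    counts = Counter(subject)            # background estimated from the subject (assumption)
    letters = sorted(counts)
    weights = [counts[c] for c in letters]
    observed = identity_score(query, subject)
    null_scores = [
        identity_score(query, rng.choices(letters, weights=weights, k=len(subject)))
        for _ in range(trials)
    ]
    mean = statistics.mean(null_scores)
    spread = statistics.pstdev(null_scores)
    z = (observed - mean) / spread if spread > 0 else float("inf")
    return observed, mean, z


observed, mean, z = random_model_z_score("SEQUENCEHOMOLOG", "SEQUENCEANALOG")
print(observed, round(mean, 2), round(z, 1))
```

A z-score is only a rough guide here: for the best score out of many comparisons, as in a database search, the normal approximation is not appropriate.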


– Comparing a query sequence to a set of random sequences of uniform length results in scores that obey an extreme value distribution rather than a normal distribution; assuming a normal distribution can therefore lead to overestimation of an alignment’s significance (see Altschul et al., 1994)


Caveats

• For database searches, the ONLY criterion available to judge the likelihood of a structural or evolutionary relationship between 2 sequences is an estimate of statistical significance
• Statistical significance and biological significance are NOT necessarily the same


Query= /phosphonatase/phosSt.gcg (255 letters) (10/20/99/pcb)
Database: /mol/seq/blast/db/swissprot
78,725 sequences; 28,368,147 total letters!

Sequences producing significant alignments:  Score (bits)  E Value

sp|O06995|PGMB_BACSU Begin: 93 End: 204 PUTATIVE BETA-PHOSPHOGLUCOMUTASE (BETA-PGM)  38  0.020
sp|P31467|YIEH_ECOLI Begin: 1 End: 180 HYPOTHETICAL 24.7 KD PROTEIN IN TNAB-BGLB I...  36  0.10
sp|O14165|YDX1_SCHPO Begin: 34 End: 201 HYPOTHETICAL 27.1 KD PROTEIN C4C5.01 IN CHR...  31  2.6
sp|P41277|GPP1_YEAST Begin: 133 End: 200 (DL)-GLYCEROL-3-PHOSPHATASE 1  30  4.4
sp|Q39565|DYHB_CHLRE Begin: 3911 End: 4032 DYNEIN BETA CHAIN, FLAGELLAR OUTER ARM  29  7.6
sp|P77625|YFBT_ECOLI Begin: 143 End: 187 HYPOTHETICAL 23.7 KD PROTEIN IN LRHA-ACKA I...  29  10.0
sp|Q40297|FCPA_MACPY Begin: 146 End: 176 FUCOXANTHIN-CHLOROPHYLL A-C BINDING PROTEIN...  29  13
sp|P40853|GPHP_ALCEU Begin: 94 End: 188 PHOSPHOGLYCOLATE PHOSPHATASE, PLASMID (PGP)  29  13
sp|Q40296|FCPB_MACPY Begin: 146 End: 176 FUCOXANTHIN-CHLOROPHYLL A-C BINDING PROTEIN...  29  13
sp|P52183|ANNU_SCHAM Begin: 119 End: 168 ANNULIN (PROTEIN-GLUTAMINE GAMMA-GLUTAMYLTR...  29  13
sp|P40106|GPP2_YEAST Begin: 133 End: 200 (DL)-GLYCEROL-3-PHOSPHATASE 2  28  17
sp|P37934|MAY3_SCHCO Begin: 435 End: 552 MATING-TYPE PROTEIN A-ALPHA Y3  27  29
sp|O06219|MURE_MYCTU Begin: 255 End: 371 UDP-N-ACETYLMURAMOYLALANYL-D-GLUTAMATE--2,6...  27  29
sp|P08419|EL2_PIG Begin: 182 End: 245 ELASTASE 2 PRECURSO  27  38
sp|Q11034|Y07S_MYCTU Begin: 163 End: 218 HYPOTHETICAL 69.5 KD PROTEIN CY02B10.28C  27  38
sp|P00577|RPOC_ECOLI Begin: 1290 End: 1401 DNA-DIRECTED RNA POLYMERASE BETA' CHAIN (T  27  38
sp|P32662|GPH_ECOLI Begin: 20 End: 49 PHOSPHOGLYCOLATE PHOSPHATASE (PGP)  27  38
sp|P32662|GPH_ECOLI Begin: 116 End: 224 PHOSPHOGLYCOLATE PHOSPHATASE (PGP)  27  28
sp|P32282|RIR1_BPT4 Begin: 239 End: 266 RIBONUCLEOSIDE-DIPHOSPHATE REDUCTASE ALPHA C...  27  50
sp|P17346|LEC2_MEGRO Begin: 36 End: 121 LECTIN BRA-2  27  50
sp|P54947|YXEH_BACSU Begin: 24 End: 51 HYPOTHETICAL 30.2 KD PROTEIN IN IDH-DEOR IN...  27  50
sp|P77366|PGMB_ECOLI Begin: 95 End: 190 PUTATIVE BETA-PHOSPHOGLUCOMUTASE (BETA-PGM)  27  50
sp|P30139|THIG_ECOLI Begin: 43 End: 79 THIG PROTEIN  27  50
sp|P95649|CBBY_RHOSH Begin: 96 End: 189 CBBY PROTEIN  27  50
sp|Q43154|GSHC_SPIOL Begin: 228 End: 327 GLUTATHIONE REDUCTASE, CHLOROPLAST PRECURSO...  26  66
sp|P34132|NT6A_HUMAN Begin: 191 End: 215 NEUROTROPHIN-6 ALPHA (NT-6 ALPHA)  26  66
sp|P34134|NT6G_HUMAN Begin: 115 End: 144 NEUROTROPHIN-6 GAMMA (NT-6 GAMMA)  26  66
sp|P95650|GPH_RHOSH Begin: 48 End: 114 PHOSPHOGLYCOLATE PHOSPHATASE (PGP)  26  66
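
The relation between the bit scores and E values in this listing can be approximated with the standard formula E ≈ m · n · 2^(−S'), where S' is the bit score, m the query length and n the total database length. The sketch below plugs in the query and database sizes from the listing header. The function name is hypothetical, and no correction for effective sequence lengths is applied, so the results should land near the reported E values rather than on them (the listing's bit scores are also rounded).

```python
def expected_hits(bit_score, query_len, db_letters):
    """E is approximately m * n * 2**(-S') for bit score S' (no edge-effect correction)."""
    return query_len * db_letters * 2.0 ** (-bit_score)


QUERY_LEN = 255            # letters in the query, from the listing header
DB_LETTERS = 28_368_147    # total letters in the database, from the listing header

for bits, reported in ((38, 0.020), (36, 0.10), (31, 2.6)):
    estimate = expected_hits(bits, QUERY_LEN, DB_LETTERS)
    print(f"{bits} bits: estimated E = {estimate:.3g}, reported E = {reported}")
```

Each additional bit halves E, which is why hits only a few bits apart in this listing differ so much in their E values.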


• Different proteins evolve at different rates

[Figure: changes/100 amino acids (axis 0 to 200) plotted against millions of years since divergence (axis 0 to 1000), with curves for Hemoglobin, Fibrinopeptides, and Cytochrome C]


• Different domains within a single protein evolve at different rates

[Figure: Proinsulin (B-chain, C-peptide, A-chain) and Mature insulin (A-chain, B-chain) with the C-peptide shown separately; the rates shown are r = 0.13 x 10^-9/site/year and r = 0.97 x 10^-9/site/year]


FASTA

• "Fast" search algorithm generates local alignments, allows gaps (see http://www.ebi.ac.uk/fasta33/)
• Extensively updated since first release
– added statistical analysis
– multiple variants available
– FASTA3 is the current implementation


FASTA flavors (see http://fasta.bioch.virginia.edu/)

• FASTA: Compares protein vs protein or DNA vs DNA
• FASTX/FASTY: Compares DNA query to protein sequence db, DNA translated in 3 forward (or reverse) frames; allows frameshifts
• TFASTX: Compares protein query vs DNA sequence or db, translated in all 6 reading frames; no accommodation for introns
• FASTS: Compares a set of short peptide fragments derived from mass spectrometric proteomic analysis vs protein or DNA db


BLAST

• Original "fast" search algorithm generates local alignments without gaps (Blast 1.4)
• Newer versions (Blast 2.0x) accommodate gaps
• Access at NCBI and other sites: http://www.ncbi.nlm.nih.gov/BLAST/
• Documentation
– Manual: http://www.ncbi.nlm.nih.gov/BLAST/blast_help.html
– FAQs: http://www.ncbi.nlm.nih.gov/BLAST/blast_FAQs.html
– Tutorial: http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html


BLAST flavors

• blastp compares an amino acid query sequence against a protein sequence database
• blastn compares a nucleotide query sequence against a nucleotide sequence database
• blastx compares the six-frame conceptual translation products of a nucleotide query sequence (both strands) against a protein sequence database
• tblastn compares a protein query sequence against a nucleotide sequence database dynamically translated in all six reading frames (both strands)
• tblastx compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database
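
The five flavors can be summarized as a lookup from query and database type to program name. The table below restates the list above; the dictionary layout and the function name are illustrative choices, not part of BLAST itself.

```python
# For each program: type of query, type of database, and which side is translated.
BLAST_PROGRAMS = {
    "blastp":  {"query": "protein",    "database": "protein",    "translates": ()},
    "blastn":  {"query": "nucleotide", "database": "nucleotide", "translates": ()},
    "blastx":  {"query": "nucleotide", "database": "protein",    "translates": ("query",)},
    "tblastn": {"query": "protein",    "database": "nucleotide", "translates": ("database",)},
    "tblastx": {"query": "nucleotide", "database": "nucleotide",
                "translates": ("query", "database")},
}


def candidate_programs(query_type, database_type):
    """List the programs that accept this combination of query and database types."""
    return [name for name, spec in BLAST_PROGRAMS.items()
            if spec["query"] == query_type and spec["database"] == database_type]


print(candidate_programs("protein", "nucleotide"))
print(candidate_programs("nucleotide", "nucleotide"))
```

For a nucleotide query against a nucleotide database two programs qualify; the choice between them is whether the comparison should be made at the nucleotide level or on the translated products.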


Some Generalities about Fasta, Blast

• These methods are so widely used because they really are that good...
• BUT, there are some disadvantages:
– Loss of sub-optimal alignments
– Pairwise comparisons limit information content
– Many biologically significant relationships may be lost in the "noise," i.e., hits that are not statistically significant
• BLAST is not “better” than FASTA


Psi-Blast: Extending our reach...

• Generalizes BLAST algorithm to use a position-specific score matrix in place of a query sequence and associated substitution matrix for searching the databases
• Position-specific score matrix generated from the output of a gapped Blast search, i.e., uses a profile or motif defined in the initial Blast search in place of a single query sequence and matrix for subsequent searches of the database
• Results in a database search “tuned” to the specific sequence characteristics of interest


Steps in a Psi-Blast search*

1. Constructs a multiple alignment from a Gapped Blast search and generates a profile from any significant local alignments found
2. The profile is compared to the protein database and PSI-BLAST estimates the statistical significance of the local alignments found, using "significant" hits to extend the profile for the next round
3. PSI-BLAST iterates step 2 an arbitrary number of times or until convergence

*Adapted from the PSI-BLAST tutorial at NCBI
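
The control flow of these steps, iterating until no new significant hits appear, can be sketched independently of how the profile is built. Everything below is a toy: `search` stands in for a profile-against-database comparison, the "profile" is just the set of sequences found so far, and the neighbour table is invented. It is not PSI-BLAST's implementation.

```python
def iterate_until_converged(search, query, max_rounds=5):
    """Repeat a profile search until the set of significant hits stops growing.

    `search` is a hypothetical callable: given the current set of profile
    members, it returns the set of sequences judged significant.
    """
    members = {query}
    for round_number in range(1, max_rounds + 1):
        hits = set(search(members))
        updated = members | hits
        if updated == members:      # no new sequences: converged
            return members, round_number
        members = updated
    return members, max_rounds


# Toy stand-in for a database search: each sequence "finds" its listed neighbours.
TOY_NEIGHBOURS = {"Q": {"A"}, "A": {"B"}, "B": {"C"}, "C": set(), "Z": set()}


def toy_search(members):
    found = set()
    for member in members:
        found |= TOY_NEIGHBOURS.get(member, set())
    return found


print(iterate_until_converged(toy_search, "Q"))
```

In the toy data, sequence C is never found by the query Q directly; it is reached only because earlier hits extend the profile, which is the point of iterating.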


PSI-BLAST information on the web

• Access at http://www.ncbi.nlm.nih.gov/BLAST/
• Tutorial at http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-2.html
• A short explanation of PSI-BLAST statistics at http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-3.html
• See also: Park J et al., “Sequence comparisons using multiple sequences detect three times as many remote homologs as pairwise methods,” JMB 284:1201-10, 1998


Other alternatives

• Many, many other DB searching algorithms are available
– Smith-Waterman
– Methods based on probabilistic models/profiles, e.g., Hidden Markov models
– Motif searching
• Or, you can use (or start with) pre-computed analyses of protein families


Why do motif analysis?

• Identification of very distant homologs
• May point to important functional units in a protein
• Can be used to "anchor" a multiple alignment
• Databases of motifs can be used to develop other informatics applications

Example: BLOCKS → Blosum matrices


Motif analysis

• Focuses on conserved patterns among two or more sequences to determine relationships
• Many variants of motif searching available
– Consensus-based, e.g., Prosite http://expasy.nhri.org.tw/prosite/
– Manually annotated motifs, distant relationships, e.g., PRINTS http://www.bioinf.man.ac.uk/dbbrowser/PRINTS/
– Statistical, e.g., MEME (Multiple EM for Motif Elicitation) http://meme.sdsc.edu/meme/website/
– Database searching, e.g., PHI-BLAST http://www.ncbi.nlm.nih.gov/BLAST/
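
Consensus-based patterns of the Prosite kind map naturally onto regular expressions. The sketch below translates the basic pattern elements (`x` for any residue, `[..]` for allowed residues, `{..}` for excluded residues, `(n)` or `(n,m)` for repeats, `<` and `>` for the sequence ends) into a Python regex. The function name is hypothetical and only these elements are handled. The example pattern `N-{P}-[ST]-{P}` is the familiar N-glycosylation site pattern, and the test sequences are invented.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern such as 'N-{P}-[ST]-{P}' into a Python regex."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        element = element.replace("<", "^").replace(">", "$")    # sequence-end anchors
        element = element.replace("x", ".")                      # x = any residue
        element = element.replace("{", "[^").replace("}", "]")   # {P} = anything but P
        element = re.sub(                                        # (n) or (n,m) repeats
            r"\((\d+)(?:,(\d+))?\)",
            lambda m: "{" + m.group(1) + ("," + m.group(2) if m.group(2) else "") + "}",
            element,
        )
        parts.append(element)
    return "".join(parts)


regex = prosite_to_regex("N-{P}-[ST]-{P}")
print(regex)
print(bool(re.search(regex, "AANGSAA")))  # invented sequence containing N-G-S-A
print(bool(re.search(regex, "AANPSAA")))  # invented sequence with P after N
```

Short patterns like this one match by chance in many unrelated sequences, which is one reason the statistical and profile-based variants listed above exist.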


Meme & Mast

• Meme: motif discovery tool http://meme.sdsc.edu/meme/website/intro.html
– motifs represented as position-dependent letter-probability matrices which describe the probability of each possible letter at each position in the pattern
– output can be converted to BLOCKS which can then be converted to PSSMs (position-specific scoring matrices)
• Mast: database searching tool using one or more motifs as queries
– provides a match score, represented as p-values, for each sequence in the database compared with each of the motifs in the group of motifs provided
– provides probable order and spacing of occurrences of the motifs in the sequence hits
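
To make "position-dependent letter-probability matrix" concrete, the sketch below builds one from a few aligned motif occurrences and converts it to log-odds scores against a background. This is a generic illustration, not MEME's algorithm or file format: the function names, the pseudocount of 1, the uniform background and the three example sites are all assumptions.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def letter_probability_matrix(sites, alphabet=AMINO_ACIDS, pseudocount=1.0):
    """One dict per motif position giving the probability of each letter there."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all motif occurrences must have the same length")
    total = len(sites) + pseudocount * len(alphabet)
    matrix = []
    for position in range(length):
        counts = Counter(site[position] for site in sites)
        matrix.append({a: (counts[a] + pseudocount) / total for a in alphabet})
    return matrix


def log_odds_pssm(matrix, background):
    """Convert probabilities to log2 odds against the background frequencies."""
    return [{a: math.log2(column[a] / background[a]) for a in column} for column in matrix]


sites = ["GKS", "GKT", "GRS"]  # invented aligned occurrences of a 3-residue motif
ppm = letter_probability_matrix(sites)
uniform = {a: 1.0 / len(AMINO_ACIDS) for a in AMINO_ACIDS}  # assumed background
pssm = log_odds_pssm(ppm, uniform)
print(round(ppm[0]["G"], 3), round(pssm[0]["G"], 2))
```

Each column of the probability matrix sums to 1, and letters seen at a position more often than the background get positive scores in the PSSM.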


Some pre-calculated motif/family compilations

• Prosite: Protein families/domains showing biologically important patterns (1637 different patterns, rules and profiles/matrices as of 6/03) http://us.expasy.org/prosite/
• Pfam: Multiple sequence alignments and HMMs for many protein domains (5724 families as of 5/03) http://pfam.wustl.edu/
• Prints: Conserved motifs characterizing protein families (1800 entries, encoding 10,931 individual motifs as of 4/03) http://bioinf.man.ac.uk/dbbrowser/PRINTS/
• Compilation of specific protein family websites at the MRC http://www.hgmp.mrc.ac.uk/GenomeWeb/prot-family.html


Laboratory Exercises & Resources from Baygenomics

http://baygenomics.ucsf.edu/PGAConference2003/

• Using the LDL receptor as an example
– DB searching
– TMD prediction
– Prosite, Pfam, Prints, Motif analysis
– Multiple alignment generation and interpretation
– Tree building/visualization
– 2° structure/TMD prediction
– 3D structure visualization
• Part of a 2-day hands-on workshop (& an online version)
– extensive help files
– detailed answer keys