an introduction to bioinformatics

34
An Introduction to Bioinformatics abase Searching - Pairwise Alignme

Upload: melina

Post on 16-Jan-2016

19 views

Category:

Documents


0 download

DESCRIPTION

An Introduction to Bioinformatics. Database Searching - Pairwise Alignments. AIMS. To explain the principles underlying local and global alignment programs To explain what substitution matrices are and how they are used To introduce the commonly used pairwise alignment programs - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: An Introduction to Bioinformatics

An Introduction to Bioinformatics

Database Searching - Pairwise Alignments

Page 2: An Introduction to Bioinformatics

AIMS

OBJECTIVES

To explain the principles underlying local and global alignment programs

To explain what substitution matrices are and how

they are used

To introduce the commonly used pairwise alignment programs

To explore the significance of alignment results

Carry out FastA and Blast searches

To select appropriate substitution matrices

To evaluate the significance of alignment/search results

Page 3: An Introduction to Bioinformatics

INTRODUCTION

• Sequence comparisons

• Protein v Protein• DNA v DNA• Protein v DNA• DNA v Protein

• Pair-wise comparison

• Methodology

Page 4: An Introduction to Bioinformatics

Similarity v Homology……….

“If two genes shared a common ancestor then they are homologous”

They did or they didn’t, they are or they arn’t

% Homology

Page 5: An Introduction to Bioinformatics

Definitions

Page 6: An Introduction to Bioinformatics

Similarity v Homology…….

But :-

• Comparison of two sequences complex

• Differences need to be quantified

• infer homology from degree of similarity

Page 7: An Introduction to Bioinformatics

Information theory……….?

Protein sequence = message

4.19 bits per residuebits = log2M

bit: The amount of information required to distinguish between two equally likely choices

Ref: Molecular Information theory - http://www-lmmb.ncifcrf.gov/~toms/

http://www.lecb.ncifcrf.gov/~toms/paper/nano2/latex/index.html

Page 8: An Introduction to Bioinformatics

• Average protein size of 150 residues

• Information content of 630 bits.

• Probability that two random sequences specify the same message

is 2-630 or about 10-190.

• Convergent evolution giving rise to two similar sequences would

be very rare

• If two sequences exhibit significant similarity arose from a

common ancestor and are homologous.

Are two proteins related ?

Page 9: An Introduction to Bioinformatics

• The English alphabet contains 26 letters, that of DNA 4, and that of protein 20

• Measure similarity or dissimilarity

Basic concept

Page 10: An Introduction to Bioinformatics

• Hamming Distance

• Measure No of differences between two sequences

• The answer to the above is…………..

• The proportional or p-distance. Hamming distance divided by the total sequence length, so ranges from 0 to 1. In the above example the p-distance is 10/14

AGATCTAG ACGAAGGCATCATGCAGT

Basic concept……….

10

Page 11: An Introduction to Bioinformatics

Basic concept……….

The log-odds ratio. - measure of how unlikely two sequences should

be so similar.

- based on the observed frequencies of each of the characters (bases or amino acids) in the sequences, and the probability of observing each homologous pair in the two sequences.

- positive score, measuring similarity, calculated by adding the scores from pre-calculated matrices (PAM and BLOSUM for protein, unitary for DNA).

Page 12: An Introduction to Bioinformatics

Two problems to consider:

• GAPS

• genes evolve•deletions, insertions, recombination

• give penalties for gap creations and extensions

• Global or Local Alignments

• Will sequences be similar over their whole length?• Use different algorithms

AGATCTAG-ACGA-TGCAGTAGGCATCATGCAGT

Page 13: An Introduction to Bioinformatics

Global and Local Alignments

• A global approach will attempt to align two sequences along

their entire length

• A local alignment will look for local regions of similarity or

subsequences.

Page 14: An Introduction to Bioinformatics

T H E C A T S A T O N T H E M A T T l l l l l H l l E l l R A l l l T l l l l l S l A l l l T l l l l l

O l N l T l l l l l H l l E l l

C l A l l l T l l l l l

Dotplots are the simplest form of alignment

Identical sequences, or subsequences areidentified by diaganol lines

Page 15: An Introduction to Bioinformatics

DOTTUP website does this analysis

Example of Rabbit vEmperor PenguinHaemoglobin

Page 16: An Introduction to Bioinformatics

Matrices - PAM and BLOSUM

• Certain groups of amino acids have similar physico-chemical properties e.g Lysine and Arginine

– conservative substitution

• Genetic code is degenerate - silent mutations

• Dayhoff - Point Accepted Mutation (PAM)

Page 17: An Introduction to Bioinformatics

Matrices - PAM and BLOSUM

1 PAM unit is the extent of evolutionary divergence in which 1% of amino acid residues are altered

• Alignment of 15 very closely related proteins• Calculate a matrix of probability of a mutation altering one amino acid residue to any other amino acid on the basis of 1 PAM.• Extrapolate to PAM250

– more useful for proteins not well conserved

PAM

Page 18: An Introduction to Bioinformatics

PAM250 matrix Problems: derived from proteins of only slight divergence

Page 19: An Introduction to Bioinformatics

BLOSUM

• Henikov and Henikov (1992) derived matrices based on sequences more divergent.

• The BLOSUM (BLOcks SUbstition Matrix) matrices cover sequences with 80% or more similarity (BLOSUM 80), 62% or greater similarity (BLOSUM 62) etc

• Based on local not global alignments

Page 20: An Introduction to Bioinformatics
Page 21: An Introduction to Bioinformatics

Alignments - local

• Choose one sequence to be searched against the other • Query sequence (q) and target sequence (t)• Divide the query sequence into small subsequences, called

words• For each word of q, look along t to find other words in t which

are similar• Matching words "anchors" build up a better alignment between

q and t• Assess how good this alignment is.

Basic principle

Page 22: An Introduction to Bioinformatics

FastA and BLAST

FastA

• Pearson and Lipman Method (late 80s)

• Query sequence compared to each sequence in a database•matching words (up to 6 nucleotides, or two amino acids in a row)

• Rescore best regions with matrices

• Algorithm checks concatenation

• Best sequences displayed

Page 23: An Introduction to Bioinformatics

FastA and BLAST

BLAST

• Basic Local Alignment Search Tool

•Compares query to database

– For each pair - finds maximal segment pair (using BLOSUM)

– The algorithm calculates probability of random occurrence

– Faster than FastA, less accurate, method of choice since introduction of GAP-BLAST

Page 24: An Introduction to Bioinformatics
Page 25: An Introduction to Bioinformatics

Significance?

Only Local Alignments - without gaps

HSPs/MSPs - alignment occurring by chance (p value) is derived from the observed score (S) to the expected distribution of scores

– larger databases - larger probability of a sequence match by chance– the closer the p-value to zero the more confidence can be given to the alignment

Page 26: An Introduction to Bioinformatics

Types of BLAST

• Nucleotide BLAST Standard nucleotide-nucleotide BLAST [blastn]

MEGABLAST Search for short nearly exact matches

• Protein BLAST Standard protein-protein BLAST [blastp]

PSI- and PHI-BLAST Search for short nearly exact matches

• Translated BLAST Searches Nucleotide query - Protein db [blastx]

Protein query - Translated db [tblastn] Nucleotide query - Translated db [tblastx]

Page 27: An Introduction to Bioinformatics

Example.I have a new mRNA sequence:

TGGCGGCGGCGGCGGCGGTTGTCCCGGCTGTGCCGGTTGGTGTGGCCCGTCAGCCCGCGTACCACAGCGCCCGGGCCGCGTCGAGCCCAGTACAGCCAAGCCGCTGCGGCCGGGTCCGGCGCGGGCGGCGCGCGCAGACGGAGGGCGGCGGCCGCGGCCAGGGCGGCCCGTGGGACCGCGGGCCCCCGGCGCAGCGCTGCCCGGCTCCCGGCCCTGCCGGCCTCCTCCCTTGGCGCCGCGGCCATGGCGGCCAGCGCGAAGCGGAAGCAGGAGGAGAAGCACCTGAAGATGCTGCGGGACATGACCGGCCTCCCGCGCAACCGAAAGTGCTTCGACTGCGACCAGCGCGGCCCCACCTACGTTAACATGACGGTCGGCTCCTTCGTGTGTACCTCCTGCTCCGGCAGCCTGCGAGGATTAAATCCACCACACAGGGTGAAATCTATCTCCATGACAACATTCACACAACAGGAAATTGAATTCTTACAAAAACATGGAAATGAAGTCTGTAAACAGATTTGGCTAGGATTATTTGATGATAGATCTTCAGCAATTCCAGACTTCAGGGATCCACAAAAAGTGAAAGAGTTTCTACAAGAAAAGTATGAAAAGAAAAGATGGTATGTCCCGCCAGAACAAGCCAAAGTCGTGGCATCAGTTCATGCATCTATTTCAGGGTCCTCTGCCAGTAGCACAAGCAGCACACCTGAGGTCAAACCACTGAAATCTCTTTTAGGGGATTCTGCACCAACACTGCACTTAAATAAGGGCACACCTAGTCAGTCCCCAGTTGTAGGTCGTTCTCAAGGGCAGCAGCAGGAGAAGAAGCAATTTGACCTTTTAAGTGATCTCGGCTCAGACATCTTTGCTGCTCCAGCTCCTCAGTCAACAGCTACAGCCAATTTTGCTAACTTTGCACATTTCAACAGTCATGCAGCTCAGAATTCTGCAAATGCAGATTTTGCAAACTTTGATGCATTTGGACAGTCTAGTGGTTCGAGTAATTTTGGAGGTTTCCCCACAGCAAGTCACTCTCCTTTTCAGCCCCAAACTACAGGTGGAAGTGCTGCATCAGTAAATGCTAATTTTGCTCATTTTGATAACTTCCCCAAATCCTCCAGTGCTGATTTTGGAACCTTCAATACTTCCCAGAGTCATCAAACAGCATCAGCTGTTAGTAAAGTTTCAACGAACAAAGCTGGTTTACAGACTGCAGACAAATATGCAGCACTTGCTAATTTAGACAATATCTTCAGTGCCGGGCAAGGTGGTGATCAGGGAAGTGGCTTTGGGACCACAGGTAAAGCTCCTGTTGGTTCTGTGGTTTCAGTTCCCAGTCAGTCAAGTGCATCTTCAGACAAGTATGCAGCTCTGGCAGAACTAGACAGCGTTTTCAGTTCTGCAGCCACCTCCAGTAATGCGTATACTTCCACAAGTAATGCTAGCAGCAATGTTTTTGGAACAGTGCCAGTGGTTGCTTCTGCACAGACACAGCCTGCTTCATCAAGTGTGCCTGCTCCATTTGGACGTACGCCTTCCACAAATCCATTTGTTGCTGCTGCTGGTCCTTCTGTGGCATCTTCTACAAACCCATTTCAGACCAATGCCAGAGGAGCAACAGCGGCAACCTTTGGCACTGCATCCATGAGCATGCCCACGGGATTCGGCACTCCTGCTCCCTACAGTCTTCCCACCAGCTTTAGTGGCAGCTTTCAGCAGCCTGCCTTTCCAGCCCAAGCAGCTTTCCCTCAACAGACAGCTTTTTCTCAACAGCCCAATGGTGCAGGTTTTGCAGCATTTGGACAAACAAAGCCAGTAGTAACCCCTTTTGGTCAAGTTGCAGCTGCTGGAGTATCTAGTAATCCTTTTATGACTGGTGCACCAACAGGACAATTTCCAACAGGAAGCTCATCAACCAATCCTTTCTTATAGCCTTATATAGACAATTTACTGGAACGAACTTTTATGTGGTCACATTACATCTCTCCACCTCTTGCACTGTTGTCTTGTTTCACTGATCTTAGCTTTAAACACAAGAGAAGTCTTTAAAAAGCCTGCATTGTGTATTAAACACCAGGTAATATGTGCAAAACCGAGGGCTCCAGTAACACCTTCTAACCTGTGAATTGGCAGAAAAGGGTAGCGGTATCATGTATATTAAAATTGGCTAATATTAAGTTATTGCAGATACCACATTCATTATGCTGCAGTACTGTACATATTTTTCTTAGAAATTAGCTATTTGTGCATATCAGTATTTGTAACTTTAACACATTGTTATGTGAGAAATGTTACTGGGGAAATAGATCAGCCACTTTTAAGGTGCTGTCATATATCTTGGAATGAATGACCTAAAATCATTTTAACCATTGCTACTGGAAAGTAACAGAGTCAAAATTGGAAGGTTTTATTCATTCTTGAATTTTTCCTTTCTAAAGAGCTCTTCTATTTATACATGCCTAAATTCTTTTAAAATGTAGAGGGATACCTGTCTGCATAATAAAGCTGATCATGTTTTGCTACAGTTTGCAGGTGAAAAAAAATAAATATTATAAAATAAAAAAAAAAAAAAAGAAAAAAAAAA

Page 28: An Introduction to Bioinformatics

I’ve pasted my sequence

I’ve selected the database

I hit BLAST!

Page 29: An Introduction to Bioinformatics

Record this number

Press Format!

Page 30: An Introduction to Bioinformatics
Page 31: An Introduction to Bioinformatics
Page 32: An Introduction to Bioinformatics
Page 33: An Introduction to Bioinformatics
Page 34: An Introduction to Bioinformatics

Setting up a BLAST search Step 1. Plan the search Step 2. Enter the query sequence Step 3. Choose the appropriate search parameters Step 4. Submit the query

Deciphering the BLAST output Step 1. Examine the alignment scores and statistics Step 2. Examine the alignments Step 3. Review search details to plan the next step

Post-BLAST analysis Perform a PSI-BLAST analysis Create a multiple alignment Try motif searching with PHI-BLAST