bioinformatica 27-10-2011-t4-alignments


Upload: wvcrieki

Post on 11-May-2015


Page 1: Bioinformatica 27-10-2011-t4-alignments
Page 2: Bioinformatica 27-10-2011-t4-alignments

FBW27-10-2011

Wim Van Criekinge

Page 3: Bioinformatica 27-10-2011-t4-alignments

Course Contents: Bioinformatics

• Thu 29-09-2011: 1* Bioinformatics (practical 8.30-11.00)

• Thu 06-10-2011: 2* Biological Databases (practical 9.00-11.30)

• Thu 20-10-2011: 3 Sequence Similarity (Scoring Matrices)

• Thu 27-10-2011: 4 Sequence Alignments

• Thu 10-11-2011: 5 Database Searching Fasta/Blast

• Thu 17-11-2011: 6 Phylogenetics

• Thu 24-11-2011: 7 Protein Structure

• Thu 01-12-2011: 8 Gene Prediction, Gene Ontologies & HMM

• Thu 08-12-2011: 9 ncRNA, Chip Data Analysis, AI

• Thu 15-12-2011: 10 Bio- & Cheminformatics in Drug Discovery (catch-up week)

• Note: no class on Thu 13-10-2011 and Thu 3-11-2011

Page 4: Bioinformatica 27-10-2011-t4-alignments

Rat versus mouse RBP

Rat versus bacterial lipocalin

Page 5: Bioinformatica 27-10-2011-t4-alignments

– Henikoff and Henikoff compared the BLOSUM matrices with PAM by evaluating how effectively each matrix detects known members of a protein family in a database search with the ungapped local alignment program BLAST. They conclude that, overall, the BLOSUM 62 matrix is the most effective.

• However, for a proportion of the families, all the other substitution matrices investigated perform better than BLOSUM 62. This suggests that no single matrix is the complete answer for all sequence comparisons.

• It is probably best to complement the BLOSUM 62 matrix with comparisons using PAM 250 and the structurally derived matrices of Overington.

– It seems likely that, as more three-dimensional protein structures are determined, substitution tables derived from structure comparison will give the most reliable data.

Overview

Page 6: Bioinformatica 27-10-2011-t4-alignments

Dotplots

• What is it?
– A graphical representation using two orthogonal axes and “dots” for regions of similarity.
– In a bioinformatics context, two sequences are placed on the axes and a dot is plotted wherever a given threshold is met within a given window.

• Dot-plotting is one of the best ways to see all of the structures that two sequences have in common, or to visualize all of the repeated or inverted-repeat structures in one sequence

Page 7: Bioinformatica 27-10-2011-t4-alignments

Visual Alignments (Dot Plots)

• Matrix
– Rows: characters in one sequence
– Columns: characters in second sequence

• Filling
– Loop through each row; if the characters in row and column match, fill in the cell
– Continue until all cells have been examined

Page 8: Bioinformatica 27-10-2011-t4-alignments

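Dotplot-simulator.pl

# $seq1, $seq2 and $window are assumed to be set earlier in the script
my $seq1_length = length($seq1);
my $seq2_length = length($seq2);

print " $seq1\n";    # header row; the leading space lines up with the row labels
for (my $teller = 0; $teller <= $seq2_length - $window; $teller++) {
    print substr($seq2, $teller, 1);    # row label: current residue of seq2
    my $w2 = substr($seq2, $teller, $window);
    for (my $teller2 = 0; $teller2 <= $seq1_length - $window; $teller2++) {
        my $w1 = substr($seq1, $teller2, $window);
        # plot a dot only when both windows are identical (stringency 100%)
        if ($w1 eq $w2) { print "*"; } else { print " "; }
    }
    print "\n";
}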

Page 9: Bioinformatica 27-10-2011-t4-alignments

Overview

Window size = 1, stringency 100%

Page 10: Bioinformatica 27-10-2011-t4-alignments

Noise in Dot Plots

• Nucleic Acids (DNA, RNA)
– 1 out of 4 bases matches at random

• Stringency
– Window size is considered
– Percentage of bases matching in the window is set as threshold
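The window and stringency idea can be sketched in Python as follows. This is a minimal illustration, not the course script: the function name `dotplot` and its parameters are assumptions made for this example. A dot is printed when the fraction of matching positions in the window reaches the stringency threshold.

```python
def dotplot(seq1, seq2, window=1, stringency=1.0):
    """Return the rows of a dot plot (seq2 down the side, seq1 across the top).

    A '*' is placed where at least `stringency * window` positions of the
    two windows match; all other cells are left blank.
    """
    rows = []
    for i in range(len(seq2) - window + 1):
        row = []
        for j in range(len(seq1) - window + 1):
            matches = sum(a == b for a, b in
                          zip(seq1[j:j + window], seq2[i:i + window]))
            row.append("*" if matches >= stringency * window else " ")
        rows.append("".join(row))
    return rows


if __name__ == "__main__":
    seq = "ACCTGAGCTCACCTGAGTTA"  # self-alignment example used in the slides
    for w in (1, 4):
        print(f"window = {w}, stringency = 100%")
        print("\n".join(dotplot(seq, seq, window=w)))
```

Running it with window sizes 1 and 4 shows how a larger window removes most of the isolated random matches while keeping the diagonals.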

Page 11: Bioinformatica 27-10-2011-t4-alignments

Reduction of Dot Plot Noise

Self alignment of ACCTGAGCTCACCTGAGTTA

Page 12: Bioinformatica 27-10-2011-t4-alignments

Dotplot-simulator.pl

Example: ZK822 Genomic and cDNA

Gene prediction:

How many exons ?

Confirm donor and acceptor sites ?

Remember to check the reverse complement !

Page 13: Bioinformatica 27-10-2011-t4-alignments

Chromosome Y self comparison

Page 14: Bioinformatica 27-10-2011-t4-alignments

• Regions of similarity appear as diagonal runs of dots

• Reverse diagonals (perpendicular to diagonal) indicate inversions

• Reverse diagonals crossing diagonals (Xs) indicate palindromes

• A gap is introduced by each vertical or horizontal skip

Overview

Page 15: Bioinformatica 27-10-2011-t4-alignments

• Window size changes with the goal of the analysis
– size of an average exon
– size of an average protein structural element
– size of a gene promoter
– size of an enzyme active site

Overview

Page 16: Bioinformatica 27-10-2011-t4-alignments

Rules of thumb

• Don't get too many points: about 3-5 times the length of the sequence is about right (1-2%)
• Window size: about 20 for distant proteins, 12 for nucleic acids
• Check sequence vs. itself
• Check sequence vs. sequence
• Anticipate results (e.g. “in-house” sequence vs. genomic, question)

Overview

Page 17: Bioinformatica 27-10-2011-t4-alignments

Available Dot Plot Programs

Dotlet (Java Applet) http://www.isrec.isb-sib.ch/java/dotlet/Dotlet.html

Page 18: Bioinformatica 27-10-2011-t4-alignments

Sequence Alignments

Introduction
Algorithms
    What?
    Examples
    Properties
Dynamic Programming for Pairwise Alignment
    Concept
    Example
    Needleman-Wunsch(.pl)
    Smith-Waterman(.pl)
Multiple Alignment
    MSA
    Hierarchical Pairwise Alignment
    ClustalW, PileUp
    Formatting
    Interpretation
Alternative Methods
    SIM
    Blast2
    Dali

Page 19: Bioinformatica 27-10-2011-t4-alignments

Global and local alignment

Pairwise sequence alignment can be global or local

Global: the sequences are aligned over their entire length (Needleman and Wunsch, 1970)

Local: only the best sub-regions are aligned (Smith and Waterman, 1981). BLAST uses local alignment.

Page 20: Bioinformatica 27-10-2011-t4-alignments

– To characterize protein families: identify shared regions of homology in a multiple sequence alignment (this generally happens when a sequence search has revealed homologies to several sequences)

– To determine the consensus sequence of several aligned sequences

– To help predict the secondary and tertiary structures of new sequences

– As a preliminary step in molecular evolution analysis, using phylogenetic methods to construct phylogenetic trees
    – Garbage in, garbage out
    – Chicken/egg

Why we do multiple alignments?

Page 21: Bioinformatica 27-10-2011-t4-alignments

Why we do multiple alignments?

• To find conserved regions
– Local multiple alignment reveals conserved regions
– Conserved regions usually are key functional regions
– These regions are prime targets for drug development

• To do phylogenetic analysis:
– Same protein from different species
– Optimal multiple alignment probably implies history
– Discover irregularities, such as in the Cystic Fibrosis gene

Page 22: Bioinformatica 27-10-2011-t4-alignments

VTISCTGSSSNIGAG-NHVKWYQQLPG
VTISCTGTSSNIGS--ITVNWYQQLPG
LRLSCSSSGFIFSS--YAMYWVRQAPG
LSLTCTVSGTSFDD--YYSTWVRQPPG
PEVTCVVVDVSHEDPQVKFNWYVDG--
ATLVCLISDFYPGA--VTVAWKADS--
AALGCLVKDYFPEP--VTVSWNSG---
VSLTCLVKGFYPSD--IAVEWWSNG--

Page 24: Bioinformatica 27-10-2011-t4-alignments

Algorithms and Programs

• Algorithm: a method or a process followed to solve a problem.
– A recipe.

• An algorithm takes the input to a problem (function) and transforms it to the output.
– A mapping of input to output.

• A problem can have many algorithms.

Page 25: Bioinformatica 27-10-2011-t4-alignments

Bubble Sort Algorithm

One of the simplest sorting algorithms proceeds by walking down the list, comparing adjacent elements, and swapping them if they are in the wrong order. The process is continued until the list is sorted.

More formally:

1. Initialize the size of the list to be sorted to be the actual size of the list.

2. Loop through the list until no element needs to be exchanged with another to reach its correct position.
    2.1 Loop (i) from 0 to size of the list to be sorted - 2.
        2.1.1 Compare the ith and (i + 1)st elements in the unsorted list.
        2.1.2 Swap the ith and (i + 1)st elements if not in order (ascending or descending as desired).
    2.2 Decrease the size of the list to be sorted by 1.

Each pass "bubbles" the largest element in the unsorted part of the list to its correct location.

A: 13 7 43 5 3 19 2 23 29 ?? ?? ?? ?? ??

Page 26: Bioinformatica 27-10-2011-t4-alignments

Bubble Sort Implementation

Here is an ascending-order implementation of the bubblesort algorithm for integer arrays:

void BubbleSort(int List[], int Size) {
    int tempInt;                                      // temp variable for swapping list elems
    for (int Stop = Size - 1; Stop > 0; Stop--) {
        for (int Check = 0; Check < Stop; Check++) {  // make a pass
            if (List[Check] > List[Check + 1]) {      // compare elems
                tempInt = List[Check];                // swap if in the
                List[Check] = List[Check + 1];        // wrong order
                List[Check + 1] = tempInt;
            }
        }
    }
}

Bubblesort compares and swaps adjacent elements; simple but not very efficient.

Efficiency note: the outer loop could be modified to exit if the list is already sorted.
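The efficiency note on this slide (stop as soon as a pass makes no swap) could look roughly like this in Python. It is a sketch for illustration only; the function name `bubble_sort` and the `swapped` flag are choices made here, not part of the slide code.

```python
def bubble_sort(items):
    """Ascending bubble sort that stops early when a pass makes no swap."""
    data = list(items)            # work on a copy, leave the input untouched
    stop = len(data) - 1          # last index of the unsorted part
    swapped = True
    while swapped and stop > 0:
        swapped = False
        for check in range(stop):                 # make one pass
            if data[check] > data[check + 1]:     # compare neighbours
                data[check], data[check + 1] = data[check + 1], data[check]
                swapped = True
        stop -= 1                 # the largest element is now in place
    return data


if __name__ == "__main__":
    print(bubble_sort([13, 7, 43, 5, 3, 19, 2, 23, 29]))
```

On an already sorted list the loop ends after a single pass, so the best case becomes linear while the worst case stays quadratic.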

Page 27: Bioinformatica 27-10-2011-t4-alignments

Ice cream

• 6 egg yolks + 105 g S1 granulated sugar

• Whisk for 1 minute to the “ruban” (ribbon) stage

• Meanwhile, heat 500 ml whole milk with 105 g S1 sugar

• Add vanilla and/or chocolate (cinnamon)

• Slowly whisk the nearly boiling milk into the ruban (off the heat)

• Back on the heat: “Porter à la nappe”

• Cool

• Churn (in an ice-cream machine)

• 15 seconds before it sets, add 500 ml cream

Page 28: Bioinformatica 27-10-2011-t4-alignments

Ice cream implementation

Page 29: Bioinformatica 27-10-2011-t4-alignments

"Great algorithms are the poetry of computation"

Page 30: Bioinformatica 27-10-2011-t4-alignments

"Great algorithms are the poetry of computation"

1946: The Metropolis Algorithm for Monte Carlo. Through the use of random processes, this algorithm offers an efficient way to stumble toward answers to problems that are too complicated to solve exactly.

1947: Simplex Method for Linear Programming. An elegant solution to a common problem in planning and decision-making.

1950: Krylov Subspace Iteration Method. A technique for rapidly solving the linear equations that abound in scientific computation.

1951: The Decompositional Approach to Matrix Computations. A suite of techniques for numerical linear algebra.

1957: The Fortran Optimizing Compiler. Turns high-level code into efficient computer-readable code.

1959: QR Algorithm for Computing Eigenvalues. Another crucial matrix operation made swift and practical.

1962: Quicksort Algorithms for Sorting. For the efficient handling of large databases.

1965: Fast Fourier Transform. Perhaps the most ubiquitous algorithm in use today, it breaks down waveforms (like sound) into periodic components.

1977: Integer Relation Detection. A fast method for spotting simple equations satisfied by collections of seemingly unrelated numbers.

1987: Fast Multipole Method. A breakthrough in dealing with the complexity of n-body calculations, applied in problems ranging from celestial mechanics to protein folding.

From Random Samples, Science page 799, February 4, 2000.

Page 31: Bioinformatica 27-10-2011-t4-alignments

Algorithm Properties

• An algorithm possesses the following properties:
– It must be correct.
– It must be composed of a series of concrete steps.
– There can be no ambiguity as to which step will be performed next.
– It must be composed of a finite number of steps.
– It must terminate.

• A computer program is an instance, or concrete representation, of an algorithm in some programming language.

Page 32: Bioinformatica 27-10-2011-t4-alignments

Measuring Algorithm Efficiency

• Types of complexity
– Space complexity
– Time complexity

• Analysis of algorithms
– The measuring of the complexity of an algorithm

• Cannot compute actual time for an algorithm
– We usually measure worst-case time

Page 33: Bioinformatica 27-10-2011-t4-alignments

Measuring Algorithm Efficiency

Three algorithms for computing 1 + 2 + … + n for an integer n > 0

Page 34: Bioinformatica 27-10-2011-t4-alignments

Measuring Algorithm Efficiency

The number of operations required by the algorithms

Page 35: Bioinformatica 27-10-2011-t4-alignments

Measuring Algorithm Efficiency

The number of operations required by the algorithms as a function of n

Page 36: Bioinformatica 27-10-2011-t4-alignments

Big Oh Notation

• To say "Algorithm A has a worst-case time requirement proportional to n"
– We say A is O(n)
– Read "Big Oh of n"

• For the other two algorithms
– Algorithm B is O(n²)
– Algorithm C is O(1)

• O is derived from order (of magnitude)
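The three algorithms themselves were shown as a figure that is not part of this transcript. The Python sketch below uses the usual textbook versions as an assumption: a single loop for A, a nested loop for B, and the closed formula for C. The function names `sum_a`, `sum_b` and `sum_c` are made up for this example.

```python
def sum_a(n):
    """Algorithm A, O(n): add the numbers 1..n one by one."""
    total = 0
    for i in range(1, n + 1):
        total += i
    return total


def sum_b(n):
    """Algorithm B, O(n^2): for each i, add 1 to the total i times."""
    total = 0
    for i in range(1, n + 1):
        for _ in range(i):
            total += 1
    return total


def sum_c(n):
    """Algorithm C, O(1): closed formula n(n + 1)/2."""
    return n * (n + 1) // 2


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, sum_a(n), sum_b(n), sum_c(n))
```

All three return the same value; only the number of basic operations grows differently with n.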

Page 37: Bioinformatica 27-10-2011-t4-alignments

Picturing Efficiency

O(n) algorithm

Page 38: Bioinformatica 27-10-2011-t4-alignments

Picturing Efficiency

An O(n²) algorithm.

Page 39: Bioinformatica 27-10-2011-t4-alignments

Picturing Efficiency

Another O(n²) algorithm.

Page 41: Bioinformatica 27-10-2011-t4-alignments

The best alignment:

The one with the maximum total score

Page 42: Bioinformatica 27-10-2011-t4-alignments

• Exhaustive …
– All combinations

• Algorithm
– Dynamic programming (much faster)
– Needleman-Wunsch for global alignments (Journal of Molecular Biology, 1970)
– Later adapted by Smith and Waterman for local alignment

• Heuristics

Overview

Page 43: Bioinformatica 27-10-2011-t4-alignments

• Score of an alignment: reward matches and penalize mismatches and spaces.
– E.g., each column gets a (different) value for:
    • a match: +1 (both have the same character);
    • a mismatch: -1 (both have different characters); and
    • a space in a column: -2.
– The total score of an alignment is the sum of the values assigned to its columns.

Page 44: Bioinformatica 27-10-2011-t4-alignments

A metric …

GACGGATTAG, GATCGGAATAG

GA-CGGATTAG
GATCGGAATAG

+1 (a match), -1 (a mismatch), -2 (gap)

9*1 + 1*(-1) + 1*(-2) = 6
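The column-by-column scoring above can be checked with a few lines of Python. This is a small sketch; the function name `score_alignment` and its keyword arguments are illustrative, with the defaults taken from the scheme on the slide (+1, -1, -2).

```python
def score_alignment(row1, row2, match=1, mismatch=-1, gap=-2):
    """Sum the column scores of two aligned rows ('-' marks a space)."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            total += gap          # a space in this column
        elif a == b:
            total += match        # identical characters
        else:
            total += mismatch     # different characters
    return total


if __name__ == "__main__":
    # 9 matches, 1 mismatch and 1 gap, as on the slide
    print(score_alignment("GA-CGGATTAG", "GATCGGAATAG"))
```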

Page 45: Bioinformatica 27-10-2011-t4-alignments

Dynamic programming

Reduce the problem: the solution to a large problem is simpler to find if we first know the solution to a smaller problem that is a subset of the larger problem

Overview

[Figure: problem P and its sub-problems P1, P2 and P3]

Page 46: Bioinformatica 27-10-2011-t4-alignments

Dynamic Programming

• Finds an optimal solution to a search problem

• Recursively computes the solution

• The fundamental principle is to produce optimal solutions to smaller pieces of the problem first and then glue them together

• Efficient divide-and-conquer strategy because it uses a bottom-up approach and utilizes a look-up table instead of recomputing optimal solutions to sub-problems

[Figure: problem P and its sub-problems P1, P2 and P3]

Page 47: Bioinformatica 27-10-2011-t4-alignments

Dynamic Programming

What is the best way to get from A to C ?

Rules: Three stops

Solutions: Try all and select best, requires (combin(13,3)) = 286 calculations

[Figure: start point A and destination C]

Page 48: Bioinformatica 27-10-2011-t4-alignments

Dynamic Programming

What is the best way to get from A to C ?

What if we know that B is on the optimal path ?

[Figure: A, B and C]

Page 49: Bioinformatica 27-10-2011-t4-alignments

Dynamic Programming

What is the best way to get from A to B ?

[Figure: A, B and C, with the routes from A to B numbered 1-6]

Page 50: Bioinformatica 27-10-2011-t4-alignments

Dynamic Programming

What is the best way to get from B to C ?

[Figure: A, B and C, with the routes from B to C numbered 1-6]

Page 51: Bioinformatica 27-10-2011-t4-alignments

Dynamic Programming

How many paths from A to C via B ?

6 * 6 = 36

[Figure: A, B and C, with numbered routes on both sides of B]

Page 52: Bioinformatica 27-10-2011-t4-alignments

Dynamic Programming

Solve the subproblem A to B: 6 calculations

[Figure: A, B and C, with the routes from A to B numbered 1-6]

Page 53: Bioinformatica 27-10-2011-t4-alignments

Dynamic Programming

Solve the subproblem B to C: 6 calculations

[Figure: A, B and C, with the routes from B to C numbered 1-6]

Page 54: Bioinformatica 27-10-2011-t4-alignments

Dynamic Programming

If B is on the optimal path from A to C, this optimal path = optimal path from A to B + optimal path from B to C

12 calculations needed (not 36 or 286)

[Figure: A, B and C, showing routes 5 and 3]
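The counts 286, 36 and 12 on these slides can be reproduced with a short Python sketch. The route costs below are hypothetical numbers invented for the illustration; only the counting argument comes from the slides.

```python
from itertools import product
from math import comb

# Trying every choice of 3 stops out of 13 candidates
print(comb(13, 3))                      # 286 combinations

# Hypothetical costs for the six A-to-B routes and the six B-to-C routes
ab_costs = [7, 3, 9, 4, 8, 6]
bc_costs = [5, 9, 2, 7, 6, 4]

# Brute force via B: evaluate every pair of routes (6 * 6 = 36 evaluations)
pairs = list(product(ab_costs, bc_costs))
brute_force_best = min(a + b for a, b in pairs)

# Dynamic programming idea: solve both sub-problems independently
# (6 + 6 = 12 evaluations) and glue the two optimal pieces together
staged_best = min(ab_costs) + min(bc_costs)

print(len(pairs), brute_force_best, staged_best)
```

Both approaches find the same best total, but the staged version looks at 12 values instead of 36.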

Page 55: Bioinformatica 27-10-2011-t4-alignments

the best alignment between

• a zinc-finger core sequence:
– CKHVFCRVCI

• and a sequence fragment from a viral polyprotein:
– CKKCFCKCV

Page 56: Bioinformatica 27-10-2011-t4-alignments

    C K H V F C R V C I
  +--------------------
C | 1         1     1
K |   1
K |   1
C | 1         1     1
F |         1
C | 1         1     1
K |   1
C | 1         1     1
V |       1       1

Dynamic Programming

Page 58: Bioinformatica 27-10-2011-t4-alignments

    C K H V F C R V C I
  +--------------------
C | 1         1     1 0
K |   1               0
K |   1               0
C | 1         1     1 0
F |         1         0
C | 1         1     1 0
K |   1               0
C | 1         1     1 0
V | 0 0 0 1 0 0 0 1 0 0

Dynamic Programming

Page 59: Bioinformatica 27-10-2011-t4-alignments

    C K H V F C R V C I
  +--------------------
C | 1         1     1 0
K |   1               0
K |   1               0
C | 1         1     1 0
F |         1         0
C | 1         1     1 0
K |   1               0
C | 2         1     1 0
V | 0 0 0 1 0 0 0 1 0 0

Dynamic Programming

Page 60: Bioinformatica 27-10-2011-t4-alignments

    C K H V F C R V C I
  +--------------------
C | 1         1     1 0
K |   1             0 0
K |   1             0 0
C | 1         1     1 0
F |         1       0 0
C | 1         1     1 0
K |   1             0 0
C | 2 1 1 1 1 2 1 0 1 0
V | 0 0 0 1 0 0 0 1 0 0

Dynamic Programming

Page 61: Bioinformatica 27-10-2011-t4-alignments

    C K H V F C R V C I
  +--------------------
C | 1         1   1 1 0
K |   1           1 0 0
K |   1           1 0 0
C | 1         1   1 1 0
F |         1     1 0 0
C | 1         1   1 1 0
K | 2 3 2 2 2 1 1 1 0 0
C | 2 1 1 1 1 2 1 0 1 0
V | 0 0 0 1 0 0 0 1 0 0

Dynamic Programming

Page 62: Bioinformatica 27-10-2011-t4-alignments

    C K H V F C R V C I
  +--------------------
C | 1         1 1 1 1 0
K |   1         1 1 0 0
K |   1         1 1 0 0
C | 1         1 1 1 1 0
F |         1   1 1 0 0
C | 4 2 2 2 2 2 1 1 1 0
K | 2 3 2 2 2 1 1 1 0 0
C | 2 1 1 1 1 2 1 0 1 0
V | 0 0 0 1 0 0 0 1 0 0

Dynamic Programming

Page 63: Bioinformatica 27-10-2011-t4-alignments

    C K H V F C R V C I
  +--------------------
C | 1         2 1 1 1 0
K |   1       1 1 1 0 0
K |   1       1 1 1 0 0
C | 1         2 1 1 1 0
F | 3 2 2 2 3 1 1 1 0 0
C | 4 2 2 2 2 2 1 1 1 0
K | 2 3 2 2 2 1 1 1 0 0
C | 2 1 1 1 1 2 1 0 1 0
V | 0 0 0 1 0 0 0 1 0 0

Dynamic Programming

Page 64: Bioinformatica 27-10-2011-t4-alignments

    C K H V F C R V C I
  +--------------------
C | 1       2 2 1 1 1 0
K |   1     2 1 1 1 0 0
K |   1     2 1 1 1 0 0
C | 4 3 3 3 2 2 1 1 1 0
F | 3 2 2 2 3 1 1 1 0 0
C | 4 2 2 2 2 2 1 1 1 0
K | 2 3 2 2 2 1 1 1 0 0
C | 2 1 1 1 1 2 1 0 1 0
V | 0 0 0 1 0 0 0 1 0 0

Dynamic Programming

Page 65: Bioinformatica 27-10-2011-t4-alignments

    C K H V F C R V C I
  +--------------------
C | 1     3 2 2 1 1 1 0
K |   1   3 2 1 1 1 0 0
K | 3 4 3 3 2 1 1 1 0 0
C | 4 3 3 3 2 2 1 1 1 0
F | 3 2 2 2 3 1 1 1 0 0
C | 4 2 2 2 2 2 1 1 1 0
K | 2 3 2 2 2 1 1 1 0 0
C | 2 1 1 1 1 2 1 0 1 0
V | 0 0 0 1 0 0 0 1 0 0

Dynamic Programming

Page 66: Bioinformatica 27-10-2011-t4-alignments

    C K H V F C R V C I
  +--------------------
C | 1   3 3 2 2 1 1 1 0
K | 4 4 3 3 2 1 1 1 0 0
K | 3 4 3 3 2 1 1 1 0 0
C | 4 3 3 3 2 2 1 1 1 0
F | 3 2 2 2 3 1 1 1 0 0
C | 4 2 2 2 2 2 1 1 1 0
K | 2 3 2 2 2 1 1 1 0 0
C | 2 1 1 1 1 2 1 0 1 0
V | 0 0 0 1 0 0 0 1 0 0

Dynamic Programming

Page 67: Bioinformatica 27-10-2011-t4-alignments

    C K H V F C R V C I
  +--------------------
C | 5 3 3 3 2 2 1 1 1 0
K | 4 4 3 3 2 1 1 1 0 0
K | 3 4 3 3 2 1 1 1 0 0
C | 4 3 3 3 2 2 1 1 1 0
F | 3 2 2 2 3 1 1 1 0 0
C | 4 2 2 2 2 2 1 1 1 0
K | 2 3 2 2 2 1 1 1 0 0
C | 2 1 1 1 1 2 1 0 1 0
V | 0 0 0 1 0 0 0 1 0 0

Dynamic Programming

Page 68: Bioinformatica 27-10-2011-t4-alignments

    C K H V F C R V C I
  +--------------------
C | 5 3 3 3 2 2 1 1 1 0
K | 4 4 3 3 2 1 1 1 0 0
K | 3 4 3 3 2 1 1 1 0 0
C | 4 3 3 3 2 2 1 1 1 0
F | 3 2 2 2 3 1 1 1 0 0
C | 4 2 2 2 2 2 1 1 1 0
K | 2 3 2 2 2 1 1 1 0 0
C | 2 1 1 1 1 2 1 0 1 0
V | 0 0 0 1 0 0 0 1 0 0

Dynamic Programming

Page 77: Bioinformatica 27-10-2011-t4-alignments

    C K H V F C R V C I
  +--------------------
C | 5 3 3 3 2 2 1 1 1 0
K | 4 4 3 3 2 1 1 1 0 0
K | 3 4 3 3 2 1 1 1 0 0
C | 4 3 3 3 2 2 1 1 1 0
F | 3 2 2 2 3 1 1 1 0 0
C | 4 2 2 2 2 2 1 1 1 0
K | 2 3 2 2 2 1 1 1 0 0
C | 2 1 1 1 1 2 1 0 1 0
V | 0 0 0 1 0 0 0 1 0 0

C K H V F C R V C I
C K K C F C - K C V

C K H V F C R V C I
C K K C F C K - C V

C - K H V F C R V C I
C K K C - F C - K C V

C K H - V F C R V C I
C K K C - F C - K C V

Dynamic Programming
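The filled matrix on these slides is consistent with the following rule, stated here as an interpretation rather than something the slides spell out: working from the bottom-right corner, each cell receives its own match value (1 or 0) plus the largest value found in the next row to the right of it or in the next column below it. A minimal Python sketch, with the function name `sum_matrix` chosen for this example:

```python
def sum_matrix(seq_cols, seq_rows):
    """Fill the match matrix from the bottom-right corner.

    cell(i, j) = match(i, j) + max over row i+1 (columns > j)
                               and column j+1 (rows > i)
    """
    n_rows, n_cols = len(seq_rows), len(seq_cols)
    m = [[0] * n_cols for _ in range(n_rows)]
    for i in range(n_rows - 1, -1, -1):
        for j in range(n_cols - 1, -1, -1):
            best = 0
            if i + 1 < n_rows and j + 1 < n_cols:
                best_in_row = max(m[i + 1][j + 1:])
                best_in_col = max(m[k][j + 1] for k in range(i + 1, n_rows))
                best = max(best_in_row, best_in_col)
            match = 1 if seq_rows[i] == seq_cols[j] else 0
            m[i][j] = match + best
    return m


if __name__ == "__main__":
    matrix = sum_matrix("CKHVFCRVCI", "CKKCFCKCV")
    print("    " + " ".join("CKHVFCRVCI"))
    for residue, row in zip("CKKCFCKCV", matrix):
        print(residue, "|", " ".join(str(v) for v in row))
```

The value in the top-left region (5) is the number of matches in the best alignment; the alignments listed on the slide are read off by following the highest values from the top-left towards the bottom-right.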

Page 79: Bioinformatica 27-10-2011-t4-alignments

Extensions to basic dynamic programming method

• use gap penalties
– constant gap penalty for gap > 1
– gap penalty proportional to gap size
    • one penalty for starting a gap (gap opening penalty)
    • different (lower) penalty for adding to a gap (gap extension penalty)

• for nucleic acids, can be used to mimic thermodynamics of helix formation
– two kinds of gap opening penalties
    • one for gap closed by AT, different for GC

Dynamic Programming
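The difference between a penalty proportional to gap size and an opening/extension scheme can be sketched in Python as follows. The function names and the penalty values (-2, -10, -1) are assumptions picked for illustration, and conventions differ between programs on whether the first gap position also pays the extension penalty.

```python
def linear_gap_penalty(length, per_position=-2):
    """Penalty proportional to gap size: every gap position costs the same."""
    return per_position * length


def affine_gap_penalty(length, gap_open=-10, gap_extend=-1):
    """Opening penalty for the first position, lower penalty for each extension."""
    if length <= 0:
        return 0
    return gap_open + gap_extend * (length - 1)


if __name__ == "__main__":
    for k in (1, 2, 5, 10):
        print(k, linear_gap_penalty(k), affine_gap_penalty(k))
```

With these example values one long gap is cheaper than several short ones under the affine scheme, which is the behaviour the opening/extension split is meant to produce.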

Page 80: Bioinformatica 27-10-2011-t4-alignments

• See the course notes for an example with gap penalties
– find the mistakes ;-)

• Available as a Perl program that we can experiment with

Page 81: Bioinformatica 27-10-2011-t4-alignments
Page 82: Bioinformatica 27-10-2011-t4-alignments

Needleman-Wunsch.pl

# initialization
my @matrix;
$matrix[0][0]{score}   = 0;
$matrix[0][0]{pointer} = "none";
for (my $j = 1; $j <= length($seq1); $j++) {
    $matrix[0][$j]{score}   = $GAP * $j;
    $matrix[0][$j]{pointer} = "left";
}
for (my $i = 1; $i <= length($seq2); $i++) {
    $matrix[$i][0]{score}   = $GAP * $i;
    $matrix[$i][0]{pointer} = "up";
}

Page 83: Bioinformatica 27-10-2011-t4-alignments

Needleman-Wunsch-edu.pl

The Score Matrix
----------------

Seq1 (j)       1   2   3   4   5   6   7   8   9  10
Seq2 (i)   *   C   K   H   V   F   C   R   V   C   I
       *   0  -1  -2  -3  -4  -5  -6  -7  -8  -9 -10
     1 C  -1   1   0  -1  -2  -3  -4  -5  -6  -7  -8
     2 K  -2   0   2   1   0  -1  -2  -3  -4  -5  -6
     3 K  -3  -1   1   1   0  -1  -2  -3  -4  -5  -6
     4 C  -4  -2   0   0   0  -1   0  -1  -2  -3  -4
     5 F  -5  -3  -1  -1  -1   1   0  -1  -2  -3  -4
     6 C  -6  -4  -2  -2  -2   0   2   1   0  -1  -2
     7 K  -7  -5  -3  -3  -3  -1   1   1   0  -1  -2
     8 C  -8  -6  -4  -4  -4  -2   0   0   0   1   0
     9 V  -9  -7  -5  -5  -3  -3  -1  -1   1   0   0

Page 85: Bioinformatica 27-10-2011-t4-alignments

Needleman-Wunsch.pl

# fill
for (my $i = 1; $i <= length($seq2); $i++) {
    for (my $j = 1; $j <= length($seq1); $j++) {
        my ($diagonal_score, $left_score, $up_score);

        # calculate match score
        my $letter1 = substr($seq1, $j-1, 1);
        my $letter2 = substr($seq2, $i-1, 1);
        if ($letter1 eq $letter2) {
            $diagonal_score = $matrix[$i-1][$j-1]{score} + $MATCH;
        }
        else {
            $diagonal_score = $matrix[$i-1][$j-1]{score} + $MISMATCH;
        }

        # calculate gap scores
        $up_score   = $matrix[$i-1][$j]{score} + $GAP;
        $left_score = $matrix[$i][$j-1]{score} + $GAP;

        # choose best score
        if ($diagonal_score >= $up_score) {
            if ($diagonal_score >= $left_score) {
                $matrix[$i][$j]{score}   = $diagonal_score;
                $matrix[$i][$j]{pointer} = "diagonal";
            }
            else {
                $matrix[$i][$j]{score}   = $left_score;
                $matrix[$i][$j]{pointer} = "left";
            }
        } else {
            if ($up_score >= $left_score) {
                $matrix[$i][$j]{score}   = $up_score;
                $matrix[$i][$j]{pointer} = "up";
            }
            else {
                $matrix[$i][$j]{score}   = $left_score;
                $matrix[$i][$j]{pointer} = "left";
            }
        }
    }
}

Page 86: Bioinformatica 27-10-2011-t4-alignments

Needleman-Wunsch.pl

#!e:\perl\bin -w
use strict;

# usage statement
die "usage: $0 <sequence 1> <sequence 2>\n" unless @ARGV == 2;

# get sequences from command line
my ($seq1, $seq2) = @ARGV;

# scoring scheme
my $MATCH    =  1; # +1 for letters that match
my $MISMATCH = -1; # -1 for letters that mismatch
my $GAP      = -1; # -1 for any gap

Page 87: Bioinformatica 27-10-2011-t4-alignments

Needleman-Wunsch-edu.pl

For each cell (i, j) of the score matrix, three candidate scores are computed from its neighbours, and the best of the three is stored:

A: diagonal_score = matrix(i-1,j-1) + MATCH if substr(seq1,j-1,1) eq substr(seq2,i-1,1), otherwise matrix(i-1,j-1) + MISMATCH

B: up_score = matrix(i-1,j) + GAP

C: left_score = matrix(i,j-1) + GAP

Page 89: Bioinformatica 27-10-2011-t4-alignments

Needleman-Wunsch-edu.pl

Page 90: Bioinformatica 27-10-2011-t4-alignments

Needleman-Wunsch.pl

# trace-back
my $align1 = "";
my $align2 = "";

# start at the last cell of the matrix
my $j = length($seq1);
my $i = length($seq2);

while (1) {
    last if $matrix[$i][$j]{pointer} eq "none";

    if ($matrix[$i][$j]{pointer} eq "diagonal") {
        $align1 .= substr($seq1, $j-1, 1);
        $align2 .= substr($seq2, $i-1, 1);
        $i--; $j--;
    }
    elsif ($matrix[$i][$j]{pointer} eq "left") {
        $align1 .= substr($seq1, $j-1, 1);
        $align2 .= "-";
        $j--;
    }
    elsif ($matrix[$i][$j]{pointer} eq "up") {
        $align1 .= "-";
        $align2 .= substr($seq2, $i-1, 1);
        $i--;
    }
}

# the alignment was built back to front
$align1 = reverse $align1;
$align2 = reverse $align2;
print "$align1\n";
print "$align2\n";

Page 91: Bioinformatica 27-10-2011-t4-alignments

Needleman-Wunsch-edu.pl

Seq1: CKHVFCRVCI
Seq2: CKKCFC-KCV
      ++--++--+-    score = 0
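The three stages of the Perl script (initialization, fill, trace-back) can be put together in a compact Python sketch. It is a re-implementation for illustration, not the course program: the function name `needleman_wunsch` is chosen here, and the tie-breaking is written to mirror the nested `if` statements of the Perl fill step.

```python
def needleman_wunsch(seq1, seq2, match=1, mismatch=-1, gap=-1):
    """Global alignment; returns (aligned seq1, aligned seq2, score)."""
    rows, cols = len(seq2) + 1, len(seq1) + 1
    score = [[0] * cols for _ in range(rows)]
    pointer = [["none"] * cols for _ in range(rows)]

    # initialization: first row and column hold increasing gap penalties
    for j in range(1, cols):
        score[0][j], pointer[0][j] = gap * j, "left"
    for i in range(1, rows):
        score[i][0], pointer[i][0] = gap * i, "up"

    # fill: best of diagonal (match/mismatch), up (gap) and left (gap)
    for i in range(1, rows):
        for j in range(1, cols):
            same = seq1[j - 1] == seq2[i - 1]
            diag = score[i - 1][j - 1] + (match if same else mismatch)
            up = score[i - 1][j] + gap
            left = score[i][j - 1] + gap
            if diag >= up and diag >= left:
                score[i][j], pointer[i][j] = diag, "diagonal"
            elif up > diag and up >= left:
                score[i][j], pointer[i][j] = up, "up"
            else:
                score[i][j], pointer[i][j] = left, "left"

    # trace-back from the last cell to the cell marked "none"
    a1, a2 = [], []
    i, j = rows - 1, cols - 1
    while pointer[i][j] != "none":
        if pointer[i][j] == "diagonal":
            a1.append(seq1[j - 1]); a2.append(seq2[i - 1])
            i, j = i - 1, j - 1
        elif pointer[i][j] == "left":
            a1.append(seq1[j - 1]); a2.append("-")
            j -= 1
        else:
            a1.append("-"); a2.append(seq2[i - 1])
            i -= 1
    return "".join(reversed(a1)), "".join(reversed(a2)), score[-1][-1]


if __name__ == "__main__":
    print(*needleman_wunsch("CKHVFCRVCI", "CKKCFCKCV"), sep="\n")
```

With the scoring scheme of the slides (+1, -1, -1) this reproduces the alignment and the score of 0 shown above.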

Page 92: Bioinformatica 27-10-2011-t4-alignments

• Practicum: use similarity function in initialization step -> scoring tables

• Time Complexity

• Use random proteins to generate histogram of scores from aligned random sequences

Page 93: Bioinformatica 27-10-2011-t4-alignments

Time complexity with needleman-wunsch.pl

Sequence Length (aa) Execution Time (s)

10 0

25 0

50 0

100 1

500 5

1000 19

2500 559

5000 Memory could not be written

Page 94: Bioinformatica 27-10-2011-t4-alignments

• -edu version

• Monte-carlo version

Page 95: Bioinformatica 27-10-2011-t4-alignments

Average around -64 !

-80
-78
-76
-74
-72 **
-70 *******
-68 ***************
-66 *************************
-64 ************************************************************
-60 ***********************
-58 ***************
-56 ********
-54 ****
-52 *
-50
-48
-46
-44
-42
-40
-38

Page 96: Bioinformatica 27-10-2011-t4-alignments

If the sequences are similar, the path of the best alignment should be very close to the main diagonal.

Therefore, we may not need to fill the entire matrix, rather, we fill a narrow band of entries around the main diagonal.

An algorithm that fills in a band of width 2k+1 around the main diagonal.
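A banded version of the global alignment score can be sketched as below. This is an illustration under stated assumptions: the function name `banded_nw_score` is made up, only the score (no trace-back) is computed, and for simplicity the sketch still allocates the full matrix and merely skips the cells outside the band, whereas a real implementation would store only the 2k+1 diagonals.

```python
def banded_nw_score(seq1, seq2, k, match=1, mismatch=-1, gap=-1):
    """Global alignment score using only cells with |i - j| <= k."""
    n, m = len(seq1), len(seq2)
    if abs(n - m) > k:
        raise ValueError("band too narrow to reach the last cell")
    neg_inf = float("-inf")
    score = [[neg_inf] * (n + 1) for _ in range(m + 1)]
    score[0][0] = 0
    for i in range(m + 1):
        # only columns inside the band around the main diagonal
        for j in range(max(0, i - k), min(n, i + k) + 1):
            if i == 0 and j == 0:
                continue
            best = neg_inf
            if i > 0 and j > 0:
                same = seq1[j - 1] == seq2[i - 1]
                best = score[i - 1][j - 1] + (match if same else mismatch)
            if i > 0:
                best = max(best, score[i - 1][j] + gap)   # cell above
            if j > 0:
                best = max(best, score[i][j - 1] + gap)   # cell to the left
            score[i][j] = best
    return score[m][n]


if __name__ == "__main__":
    for k in (1, 3, 10):
        print(k, banded_nw_score("CKHVFCRVCI", "CKKCFCKCV", k))
```

For the zinc-finger example the best path needs a single gap, so even a band of k = 1 already gives the same score as the full matrix.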

Page 97: Bioinformatica 27-10-2011-t4-alignments

Smith-Waterman.pl

• Three changes
– The edges of the matrix are initialized to 0 instead of increasing gap penalties
– The score in a cell is never less than 0, and no pointer is recorded unless the score is greater than 0
– The trace-back starts from the highest score in the matrix (rather than at the end of the matrix) and ends at a score of 0 (rather than the start of the matrix)

• Demonstration
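The three changes listed above can be sketched in Python as follows. This is not the Smith-Waterman.pl script from the course; the function name `smith_waterman` and the example sequences are assumptions made for this illustration.

```python
def smith_waterman(seq1, seq2, match=1, mismatch=-1, gap=-1):
    """Local alignment; returns (aligned seq1 part, aligned seq2 part, score)."""
    rows, cols = len(seq2) + 1, len(seq1) + 1
    # change 1: the edges stay at 0 instead of increasing gap penalties
    score = [[0] * cols for _ in range(rows)]
    pointer = [["none"] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    for i in range(1, rows):
        for j in range(1, cols):
            same = seq1[j - 1] == seq2[i - 1]
            diag = score[i - 1][j - 1] + (match if same else mismatch)
            up = score[i - 1][j] + gap
            left = score[i][j - 1] + gap
            # change 2: never below 0; a pointer is kept only for scores > 0
            value = max(0, diag, up, left)
            score[i][j] = value
            if value > 0:
                if value == diag:
                    pointer[i][j] = "diagonal"
                elif value == up:
                    pointer[i][j] = "up"
                else:
                    pointer[i][j] = "left"
            if value > best:
                best, best_pos = value, (i, j)

    # change 3: trace back from the highest score until a score of 0
    a1, a2 = [], []
    i, j = best_pos
    while pointer[i][j] != "none":
        if pointer[i][j] == "diagonal":
            a1.append(seq1[j - 1]); a2.append(seq2[i - 1])
            i, j = i - 1, j - 1
        elif pointer[i][j] == "left":
            a1.append(seq1[j - 1]); a2.append("-")
            j -= 1
        else:
            a1.append("-"); a2.append(seq2[i - 1])
            i -= 1
    return "".join(reversed(a1)), "".join(reversed(a2)), best


if __name__ == "__main__":
    print(smith_waterman("TTTACGTTT", "GGGACGGGG"))
```

In the example the two sequences share only the short core ACG, which is exactly what the local alignment returns while the unrelated flanks are ignored.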

Page 99: Bioinformatica 27-10-2011-t4-alignments

The best alignment:

The one with the maximum total score

Multiple Alignment: n>2

Page 100: Bioinformatica 27-10-2011-t4-alignments

2 to 3: hyperlattice

Page 101: Bioinformatica 27-10-2011-t4-alignments

On its top-left side, the cube is "covered" by the polyhedron. The edges 1, 2, 3, 6 and 7 are coming from the inside, and edges 4 and 5 can be ignored (and are therefore not labeled in the figure).

Page 102: Bioinformatica 27-10-2011-t4-alignments

• Each node in the k-dimensional hyperlattice is visited once, and therefore the running time must be proportional to the number of nodes in the lattice.
– This number is the product of the lengths of the sequences.
– E.g. the 3-dimensional lattice as visualized.

Computational Complexity of MA by standard Dynamic Programming
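How quickly the lattice grows can be made concrete with a few lines of Python. The helper name `lattice_nodes` is made up for this sketch, and it follows the slide's approximation (product of the sequence lengths, ignoring the extra border row in each dimension).

```python
from math import prod


def lattice_nodes(lengths):
    """Approximate number of lattice nodes: the product of the sequence lengths."""
    return prod(lengths)


if __name__ == "__main__":
    # 2, 3, 4 and 8 sequences of 2000 residues each
    for count in (2, 3, 4, 8):
        print(count, f"{lattice_nodes([2000] * count):.2e}")
```

Two sequences of 2000 residues need about four million cells; eight such sequences would need about 2.56e26, which is why exact dynamic programming is not practical for multiple alignment.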

Page 103: Bioinformatica 27-10-2011-t4-alignments

• The memory space requirement is even worse. To trace back the alignment, we need to store the whole lattice, a data structure the size of a multidimensional skyscraper.
– In fact, space is the No. 1 problem here, bogging down multiple alignment methods that try to achieve optimality.
– Furthermore, incorporating a realistic gap model further increases our demands on space and running time.

Size/Time limits…

• The most practical and widely used method in multiple sequence alignment is the hierarchical extension of pairwise alignment methods.

• The principle is that a multiple alignment is achieved by successive application of pairwise methods.

– First do all pairwise alignments (not just one sequence with all others)

– Then combine pairwise alignments to generate overall alignment

Multiple Alignment Method

• The steps are summarized as follows:

– Compare all sequences pairwise.

– Perform cluster analysis on the pairwise data to generate a hierarchy for alignment. This may be in the form of a binary tree or a simple ordering.

– Build the multiple alignment by first aligning the most similar pair of sequences, then the next most similar pair, and so on. Once an alignment of two sequences has been made, it is fixed. Thus, for a set of sequences A, B, C, D, having aligned A with C and B with D, the alignment of A, B, C, D is obtained by comparing the alignment of A and C with that of B and D, using averaged scores at each aligned position.

Multiple Alignment Method
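
The first two steps (all-against-all comparison, then clustering into a merge order) can be sketched as follows. This is a simplified illustration, not how ClustalW or PileUp are implemented: `difflib.SequenceMatcher` stands in for a real pairwise alignment score, average linkage is one possible clustering choice, the four toy sequences are invented, and the actual profile alignment of step three is left out.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(x, y):
    # Stand-in for a real pairwise alignment score (assumption).
    return SequenceMatcher(None, x, y, autojunk=False).ratio()


def guide_order(seqs):
    """Return the order in which sequences/clusters would be aligned."""
    # Step 1: compare all sequences pairwise.
    sim = {frozenset(p): similarity(seqs[p[0]], seqs[p[1]])
           for p in combinations(seqs, 2)}
    clusters = [(name,) for name in seqs]
    merges = []
    # Step 2: cluster analysis (average linkage), most similar pair first.
    while len(clusters) > 1:
        def linkage(pair):
            c1, c2 = pair
            total = sum(sim[frozenset((a, b))] for a in c1 for b in c2)
            return total / (len(c1) * len(c2))
        c1, c2 = max(combinations(clusters, 2), key=linkage)
        merges.append((c1, c2))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges


# Hypothetical toy sequences: A resembles C, B resembles D.
toy = {"A": "ACGTACGTAC", "B": "TTGGCCAATT", "C": "ACGTACGTAA", "D": "TTGGCCAATA"}
for step in guide_order(toy):
    print(step)
```

With these toy sequences the merge order reproduces the A-with-C, B-with-D, then AC-with-BD example from the slide.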

• Automatic multiple alignment

– extend dynamic programming (MSA - Lipman); limit: computing power, i.e. length and number of sequences (e.g. 2000^8)

– progressive alignment (Feng & Doolittle), using a “guide tree” (PileUp, ClustalW etc.)

• Dedicated alignment editing program

– Boxshade

– SeaView

– SeqPup (Java)

• Combination (Biology – Computation)

Multiple Sequence Alignment programs

• ClustalW is a general purpose multiple alignment program for DNA or proteins.

• ClustalW was produced by Julie D. Thompson and Toby Gibson of the European Molecular Biology Laboratory, Germany, and Desmond Higgins of the European Bioinformatics Institute, Cambridge, UK.

• Algorithm: it improves the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice (Nucleic Acids Research, 22:4673-4680).

ClustalW

****** MULTIPLE ALIGNMENT MENU ******

1. Do complete multiple alignment now (Slow/Accurate)
2. Produce guide tree file only
3. Do alignment using old guide tree file

4. Toggle Slow/Fast pairwise alignments = SLOW

5. Pairwise alignment parameters
6. Multiple alignment parameters

7. Reset gaps between alignments? = OFF
8. Toggle screen display = ON
9. Output format options

S. Execute a system command
H. HELP
or press [RETURN] to go back to main menu

Your choice:

Running ClustalW

• Before you run PILEUP, it is necessary to study the sequences that will be aligned.

• PILEUP is very sensitive to gaps: if the sequences in a set are of different lengths, gaps will be added to the ends of all shorter sequences to make them equal to the longest one in the set.

• If you try to align five 300-nucleotide ESTs with a single 20,000-nucleotide cosmid, you are adding 5 × 19,700 gaps to the alignment, and PILEUP will crash!

PileUp
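
The arithmetic behind this warning is easy to check before submitting a job. The helper below is a hypothetical pre-flight check (the name `end_gaps` is not part of GCG); it simply counts how many padding gaps are needed to bring every sequence up to the length of the longest one.

```python
def end_gaps(lengths):
    """Total number of gaps needed to pad every sequence to the longest length."""
    longest = max(lengths)
    return sum(longest - n for n in lengths)


# Five 300-nucleotide ESTs plus one 20,000-nucleotide cosmid.
print(end_gaps([300] * 5 + [20000]))
```

A large value relative to the total sequence length is a hint that the input set should be trimmed or split first.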

• The final product of a PILEUP run is a set of aligned sequences, which are stored in a Multiple Sequence File (called .msf by GCG). This msf file is a text file that can be formatted with a text editor, but GCG has some dedicated tools for improving the looks of msf files for easier interpretation and for publication.

• Consensus sequences can be calculated and the relationship of each character of each sequence to the consensus can be highlighted using the program PRETTY

Formatting Multiple Alignments

• Shading of regions of high homology can be created using the programs BOXSHADE and PRETTYBOX, but that goes beyond the scope of this tutorial. (Boxshade: http://www.ch.embnet.org/software/BOX_form.html)

• In addition to these programs that run on the Alpha, the output of PILEUP (or CLUSTAL) can be moved by FTP from your RCR account to a local Mac or PC.

• Since this output is a plain text file, it can be edited with any word processing program, or imported into any drawing program to add boldface text, underlining, shading, boxes, arrows, etc

Formatting Multiple Alignments

http://dot.imgen.bcm.tmc.edu:9331/multi-align/multi-align.html

VTISCTGSSSNIGAG-NHVKWYQQLPG
VTISCTGTSSNIGS--ITVNWYQQLPG
LRLSCSSSGFIFSS--YAMYWVRQAPG
LSLTCTVSGTSFDD--YYSTWVRQPPG
PEVTCVVVDVSHEDPQVKFNWYVDG--
ATLVCLISDFYPGA--VTVAWKADS--
AALGCLVKDYFPEP--VTVSWNSG---
VSLTCLVKGFYPSD--IAVEWWSNG--

An example of Multiple Alignment … immunoglobulin

• Their alignment highlights

– conserved residues (one of the cysteines forming the disulphide bridges, and the tryptophan, are notable),

– conserved regions (in particular, "Q.PG" at the end of the first 4 sequences), and

– more sophisticated patterns, like the dominance of hydrophobic residues at fragment positions 1 and 3.

• The alternating hydrophobicity pattern is typical for the surface beta-strand at the beginning of each fragment. Indeed, multiple alignments are helpful for protein structure prediction.

An example of Multiple Alignment … immunoglobulin
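
The conserved cysteine and tryptophan can be picked out mechanically by scanning the alignment column by column. The sketch below assumes the eight immunoglobulin fragments are read as rows of 27 columns each (the repeated Q.PG tails in the extracted slide text are treated as highlighting residue); the function name is made up for the example.

```python
ALIGNMENT = [
    "VTISCTGSSSNIGAG-NHVKWYQQLPG",
    "VTISCTGTSSNIGS--ITVNWYQQLPG",
    "LRLSCSSSGFIFSS--YAMYWVRQAPG",
    "LSLTCTVSGTSFDD--YYSTWVRQPPG",
    "PEVTCVVVDVSHEDPQVKFNWYVDG--",
    "ATLVCLISDFYPGA--VTVAWKADS--",
    "AALGCLVKDYFPEP--VTVSWNSG---",
    "VSLTCLVKGFYPSD--IAVEWWSNG--",
]


def conserved_columns(rows):
    """Return (1-based position, residue) for gap-free columns with a single residue."""
    if len({len(r) for r in rows}) != 1:
        raise ValueError("all rows of an alignment must have the same length")
    hits = []
    for pos, column in enumerate(zip(*rows), start=1):
        residues = set(column)
        if len(residues) == 1 and "-" not in residues:
            hits.append((pos, column[0]))
    return hits


print(conserved_columns(ALIGNMENT))  # [(5, 'C'), (21, 'W')]
```

Strict identity is a deliberately crude criterion; real tools usually score conservation with a substitution matrix instead.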

• Provided the alignment is accurate, the following may be inferred about the secondary structure from a multiple sequence alignment:

– The position of insertions and deletions (INDELS) suggests regions where surface loops exist.

– Conserved glycine or proline suggests a beta-turn.

A Practical Approach: Interpretation

– Residues with hydrophobic properties conserved at i, i+2, i+4, separated by unconserved or hydrophilic residues, suggest surface beta-strands.

– A short run of hydrophobic amino acids (4 residues) suggests a buried beta-strand.

– Pairs of conserved hydrophobic amino acids separated by pairs of unconserved or hydrophilic residues suggest an alpha-helix with one face packing in the protein core. Likewise, an i, i+3, i+4, i+7 pattern of conserved hydrophobic residues.

A Practical Approach: Interpretation
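
The i, i+2, i+4 rule for surface beta-strands can be turned into a simple column scan. This is a rough sketch under several assumptions: the hydrophobic residue set, the 75% conservation threshold and the function names are all choices made for the example, and the input is just the first six columns of the immunoglobulin fragments shown earlier in the deck.

```python
# Assumed hydrophobic set; other classifications are equally defensible.
HYDROPHOBIC = set("AVLIMFWYC")

# First six columns of the eight immunoglobulin fragments.
FRAGMENT_STARTS = ["VTISCT", "VTISCT", "LRLSCS", "LSLTCT",
                   "PEVTCV", "ATLVCL", "AALGCL", "VSLTCL"]


def hydrophobic_fraction(column):
    """Fraction of residues in one alignment column that are hydrophobic."""
    return sum(res in HYDROPHOBIC for res in column) / len(column)


def surface_strand_starts(rows, threshold=0.75):
    """1-based positions i where columns i, i+2, i+4 are hydrophobic and i+1, i+3 are not."""
    cols = [hydrophobic_fraction(c) for c in zip(*rows)]
    starts = []
    for i in range(len(cols) - 4):
        on = all(cols[i + d] >= threshold for d in (0, 2, 4))
        off = all(cols[i + d] < threshold for d in (1, 3))
        if on and off:
            starts.append(i + 1)
    return starts


print(surface_strand_starts(FRAGMENT_STARTS))  # [1]
```

The same scan with offsets (0, 3, 4, 7) would look for the helix pattern mentioned above, though any such rule is only a hint and not a structure prediction.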

• Take out noise (GAPS)

• Extra information (structure - function)

• Recursive selection

– first the most similar sequences, to get an idea about conserved regions

– manual scan for these regions in more distant members, then include these

A Practical Approach: Which sequences to use ?

Sequence Alignments

• Introduction
• Algorithms
– What ?
– Examples
– Properties
• Dynamic Programming for Pairwise Alignment
– Concept
– Example
– Needleman-Wunsch(.pl)
– Smith-Waterman(.pl)
• Multiple Alignment
– MSA
– Hierarchical Pairwise Alignment
– ClustalW, PileUp
– Formatting
– Interpretation
• Alternative Methods
– SIM
– Blast2
– Dali

L-align (2 sequences)

SIM (www.expasy.ch)

LALNVIEW is available for UNIX, Mac and PC on the ExPASy anonymous FTP server.

very nice TWEAKING tool (70% criteria)

SIM (figure; annotated fields: Length, P-value)

How can I use NCBI to compare two sequences?

Answer: Use the “BLAST 2 Sequences” program

• Go to http://www.ncbi.nlm.nih.gov/BLAST

• Choose BLAST 2 sequences

• In the program,

[1] choose blastp (protein search) or blastn (for DNA)

[2] paste in your accession numbers (or use FASTA format)

[3] select optional parameters, such as

-- the BLOSUM62 matrix is the default for proteins; try PAM250 for distantly related proteins

-- gap creation and extension penalties

[4] click “align”

Practical guide to pairwise alignment: the “BLAST 2 sequences” website

Question #2: How can I use NCBI to compare a sequence to an entire database?

BLAST!

• An introduction to Basic Concepts in Computer Science for Life Scientists

• Dotplot patterns: A Literal Look at Pattern Languages

• CpG Islands

– Download from ENSEMBL 1000 (random) promoters (3000 bp) (hint: use Biomart)

– How many times would you expect to observe CG if all nucleotides were equiprobable?

– Count the number of times CG is observed for these 1000 genes and make a histogram from these scores.

– Are there any other dinucleotides over- or underrepresented?

– CG repeats are often methylated. In order to study methylation patterns, bisulfite treatment of DNA is used. Bisulfite changes every C which is not followed by G into T. Generate computationally the bisulfite-treated version of DNA (hint: while (s/C([^G])/T$1/g) {};)

– How would you find primers that discriminate between methylated and unmethylated DNA? Given that the genome is 3×10^9 bp, how long do you need to make a primer to avoid mispriming?

Practicum 3
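
A few of the calculations in this practicum can be sketched in Python as a starting point. These are idealized back-of-the-envelope helpers, not a model answer: the function names are invented, the expected CG count assumes independent, equiprobable nucleotides, and the primer estimate only asks for the smallest n with 4^n larger than the genome size. Note that the lookahead regex also converts a C at the very end of the sequence, which the Perl hint leaves untouched.

```python
import re


def expected_cg(length):
    """Expected CG count: each of the (length - 1) adjacent pairs is CG with probability 1/16."""
    return (length - 1) / 16


def bisulfite_convert(seq):
    """Turn every C that is not followed by G into T (simplified rule from the slide)."""
    # The negative lookahead does not consume the next base, so runs like "CCA"
    # are handled in a single pass without the while-loop of the Perl hint.
    return re.sub(r"C(?!G)", "T", seq)


def min_primer_length(genome_size):
    """Smallest n such that 4**n exceeds the genome size."""
    n = 1
    while 4 ** n <= genome_size:
        n += 1
    return n


print(expected_cg(3000))                 # 187.4375
print(bisulfite_convert("ACCGTCAC"))     # ATCGTTAT
print(min_primer_length(3_000_000_000))  # 16
```

The observed CG counts from the downloaded promoters can then be compared against the expected value to judge over- or underrepresentation.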

Weblems

W4.1: Align the amino acid sequences of the acetylcholine receptor from human, rat, mouse and dog with ClustalW, T-Coffee, Dali and MSA

W4.2: Use BoxShade to create a Word file indicating the different conserved residues in colours

W4.3: Perform a local alignment using SIM and Lalign on the same sequences, and with Blast2

W4.4: Do the different methods give different results? What default settings do they use?

W4.5: How would you identify critical residues for catalytic activity ?