design and creation of multiple sequence alignments unit 15 biol221t: advanced bioinformatics for...

Design and creation of multiple sequence alignments

Unit 15

BIOL221T: Advanced Bioinformatics for Biotechnology

Irene Gabashvili, PhD

IPA 6.0 license

Need a list of e-mails to create accounts

Will have a 6-week license (instead of 2 weeks)

Problem Set 3 is Pathway Analysis; the lab of March 19 will also be on using IPA

Problem Set 2 Review

Sensitivity and Specificity

Parameters for Multiple Alignment (Databases, Search Terms, Scores)

Transfac

Dotplots

Gene prediction flowchart

Evaluation of Splice Site Prediction

Fig 5.11, Baxevanis & Ouellette 2005

What do measures really mean?

Note typo in B&O

ROC curves (plots of Sn vs (1-Sp))

A receiver operating characteristic (ROC), or simply ROC curve, is a graphical plot of the sensitivity vs. (1 - specificity) for a binary classifier system as its discrimination threshold is varied.

The sensitivity and specificity of a diagnostic test depend on more than just the "quality" of the test; they also depend on the definition of what constitutes an abnormal test.
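The threshold dependence described above can be sketched in a few lines of Python. This is only an illustration: the scores and labels are made-up toy data, and the function names `roc_points` and `auc` are hypothetical, not taken from any package used in the course.

```python
def roc_points(scores, labels):
    """Return (1 - specificity, sensitivity) pairs, one per score threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing is predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Toy data (assumption): predictor scores and true labels (1 = real site, 0 = not a site)
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0, 0]

curve = roc_points(scores, labels)
for false_positive_rate, sensitivity in curve:
    print(f"1-Sp = {false_positive_rate:.2f}   Sn = {sensitivity:.2f}")
print("AUC =", round(auc(curve), 3))
```

Lowering the threshold moves along the curve toward higher Sn at the cost of lower Sp, which is why a single (Sn, Sp) pair never fully describes a method.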

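Evaluation of Splice Site Prediction

                    Actual True     Actual False
Predicted True      TP              FP              PP = TP + FP
Predicted False     FN              TN              PN = FN + TN
                    AP = TP + FN    AN = FP + TN

• Sensitivity: Sn = TP/AP = Coverage

• Misclassification rates: α = FN/AP, β = FP/AN

• Specificity: Sp = TP/PP = (1 - α)/(1 - α + rβ), with r = AN/AP

• Normalized specificity: σ = (1 - α)/(1 - α + β)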

Careful: different definitions for "Specificity"

Brendel definitions:

• Specificity: Sp = TP/PP = TP/(TP+FP)

cf. Guigó definitions:

Sn: Sensitivity = TP/(TP+FN)

Sp: Specificity = TN/(TN+FP) = Sp-

AC: Approximate Coefficient = 0.5 x ((TP/(TP+FN)) + (TP/(TP+FP)) + (TN/(TN+FP)) + (TN/(TN+FN))) - 1

Other measures? Predictive Values, Correlation Coefficient

Best measures for comparing different methods?

• ROC curves (Receiver Operating Characteristic?!!)

http://www.anaesthetist.com/mnm/stats/roc/

"The Magnificent ROC" - has fun applets & quotes:

"There is no statistical test, however intuitive and simple, which will not be abused by medical researchers"

• Correlation coefficient (Matthews correlation coefficient, MCC)

MCC = 1 for a perfect prediction

MCC = 0 for a completely random assignment

MCC = -1 for a "perfectly incorrect" prediction

Just FYI
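As a rough sketch of how these measures relate, the snippet below computes Sn, the two "specificity" variants, AC, and MCC from the four confusion-matrix counts. The Sn, Sp, and AC expressions follow the formulas on the slides; the MCC expression is the standard textbook form, which the slides do not spell out. The function name `evaluate` and the counts are illustrative assumptions.

```python
import math


def evaluate(tp, fp, fn, tn):
    """Accuracy measures from confusion-matrix counts.

    Assumes every row and column total is non-zero for the ratio measures.
    """
    sn = tp / (tp + fn)            # sensitivity (coverage)
    sp_tp_based = tp / (tp + fp)   # "specificity" as TP/PP (positive predictive value)
    sp_tn_based = tn / (tn + fp)   # "specificity" as TN/(TN+FP)
    npv = tn / (tn + fn)           # negative predictive value
    ac = 0.5 * (sn + sp_tp_based + sp_tn_based + npv) - 1
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {"Sn": sn, "Sp(TP/PP)": sp_tp_based, "Sp(TN/AN)": sp_tn_based,
            "AC": ac, "MCC": mcc}


# Hypothetical counts for three predictors
print(evaluate(50, 0, 0, 50))     # perfect prediction
print(evaluate(25, 25, 25, 25))   # random-like assignment
print(evaluate(0, 50, 50, 0))     # "perfectly incorrect" prediction
```

Printing both specificity variants side by side makes the naming problem visible: the same predictor can look very different depending on which definition a paper uses.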


Promoters

What signals are there?

Simple ones in prokaryotes

Prokaryotic promoters

RNA polymerase complex recognizes promoter sequences located very close to & on 5’ side (“upstream”) of initiation site

RNA polymerase complex binds directly to these, with no requirement for “transcription factors”

Prokaryotic promoter sequences are highly conserved

-10 region

-35 region

Simpler view of complex promoters in eukaryotes:

Eukaryotic genes are transcribed by 3 different RNA polymerases

Recognize different types of promoters & enhancers:

Eukaryotic promoters & enhancers

Promoters located “relatively” close to initiation site (but can be located within gene, rather than upstream!)

Enhancers also required for regulated transcription (these control expression in specific cell types, developmental stages, in response to environment)

RNA polymerase complexes do not specifically recognize promoter sequences directly

Transcription factors bind first and serve as “landmarks” for recognition by RNA polymerase complexes

Eukaryotic transcription factors

Transcription factors (TFs) are DNA binding proteins that also interact with RNA polymerase complex to activate or repress transcription

TFs contain characteristic “DNA binding motifs”

http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=genomes.table.7039

TFs recognize specific short DNA sequence motifs: “transcription factor binding sites”

Several databases for these, e.g. TRANSFAC http://www.generegulation.com/cgibin/pub/databases/transfac

Zinc finger-containing transcription factors

• Common in eukaryotic proteins

• Estimated 1% of mammalian genes encode zinc-finger proteins

• In C. elegans, there are 500!

• Can be used as highly specific DNA binding modules

• Potentially valuable tools for directed genome modification (esp. in plants) & human gene therapy

Promoter prediction: Eukaryotes vs prokaryotes

Promoter prediction is easier in microbial genomes

Why?
• Highly conserved
• Simpler gene structures
• More sequenced genomes! (for comparative approaches)

Methods?
• Previously: mostly HMM-based
• Now: similarity-based, comparative methods, because so many genomes are available

Predicting promoters: Steps & Strategies

Closely related to gene prediction!
• Obtain genomic sequence
• Use sequence-similarity based comparison (BLAST, MSA) to find related genes
  But: "regulatory" regions are much less well-conserved than coding regions
• Locate ORFs
• Identify TSS (if possible!)
• Use promoter prediction programs
• Analyze motifs, etc. in sequence (TRANSFAC)

Predicting promoters: Steps & Strategies

Identify TSS, if possible?
• One of biggest problems is determining exact TSS! Not very many full-length cDNAs!
• Good starting point? (human & vertebrate genes) Use FirstEF, found within UCSC Genome Browser, or submit to FirstEF web server

Automated promoter prediction strategies

1) Pattern-driven algorithms

2) Sequence-driven algorithms

3) Combined "evidence-based"

BEST RESULTS? Combined, sequential

Promoter Prediction: Pattern-driven algorithms

• Success depends on availability of collections of annotated binding sites (TRANSFAC & PROMO)

• Tend to produce huge numbers of FPs

• Why?
  • Binding sites (BS) for specific TFs often variable
  • Binding sites are short (typically 5-15 bp)
  • Interactions between TFs (& other proteins) influence affinity & specificity of TF binding
  • One binding site often recognized by multiple TFs
  • Biology is complex: promoters often specific to organism/cell/stage/environmental condition
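A back-of-the-envelope calculation shows why short sites alone generate so many false positives. The sketch below assumes equal base frequencies, exact matching, and independent positions, which real genomes violate; the function name `expected_random_hits` and the genome size are illustrative assumptions.

```python
def expected_random_hits(genome_length, site_length, both_strands=True):
    """Expected number of exact chance matches to one fixed site.

    Assumes each base is A, C, G or T with probability 1/4, independently.
    """
    positions = genome_length - site_length + 1
    probability_per_position = 0.25 ** site_length
    strands = 2 if both_strands else 1
    return strands * positions * probability_per_position


genome = 3_000_000_000  # roughly human-sized genome (assumption for illustration)
for k in (5, 6, 8, 10, 15):
    print(f"{k:2d} bp site: ~{expected_random_hits(genome, k):,.0f} chance matches")
```

Under these assumptions a 6 bp site is expected well over a million times by chance in a 3 Gb genome, and allowing mismatches or degenerate positions only increases that number.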

Promoter Prediction: Pattern-driven algorithms

Solutions to problem of too many FP predictions?

• Take sequence context/biology into account
  • Eukaryotes: clusters of TFBSs are common
  • Prokaryotes: knowledge of factors helps
• Probability of "real" binding site increases if annotated transcription start site (TSS) nearby
  • But: What about enhancers? (no TSS nearby!) & Only a small fraction of TSSs have been experimentally mapped
• Do the wet lab experiments!
  • But: Promoter-bashing is tedious

Promoter Prediction: Sequence-driven algorithms

• Assumption: common functionality can be deduced from sequence conservation
• Alignments of co-regulated genes should highlight elements involved in regulation

Careful: How to determine co-regulation?
• Orthologous genes from different species
• Genes experimentally determined to be co-regulated (using microarrays??)
• Comparative promoter prediction: "Phylogenetic footprinting" - more later…

Promoter Prediction: Sequence-driven algorithms

Problems:
• Need sets of co-regulated genes
• For comparative (phylogenetic) methods
  • Must choose appropriate species
  • Different genomes evolve at different rates
  • Classical alignment methods have trouble with translocations, inversions in order of functional elements
  • If the entire region is highly conserved (high background conservation), comparison is useless
  • Not enough data (Prokaryotes >>> Eukaryotes)
• Biology is complex: many (most?) regulatory elements are not conserved across species!

Examples of promoter prediction/characterization software

Lab: used
• MATCH, MatInspector
• TRANSFAC
• MEME & MAST
• BLAST, etc.

Others?
• FirstEF
• Dragon Promoter Finder; also see Dragon Genome Explorer (has specialized promoter software for GC-rich DNA, finding CpG islands, etc.)
• JASPAR

TRANSFAC matrix entry: for TATA box

Fields:
• Accession & ID
• Brief description
• TFs associated with this entry
• Weight matrix
• Number of sites used to build (How many here?)
• Other info
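To make the "weight matrix" and "number of sites" fields more concrete, here is a minimal sketch of how a matrix can be built from aligned binding sites and used to scan a sequence. The five TATA-like sites and the test sequence are invented for illustration and are not the contents of any TRANSFAC entry; the pseudocount and uniform background are also assumptions.

```python
import math

BASES = "ACGT"


def count_matrix(sites):
    """Per-position base counts from equal-length aligned sites."""
    counts = [{base: 0 for base in BASES} for _ in range(len(sites[0]))]
    for site in sites:
        for position, base in enumerate(site):
            counts[position][base] += 1
    return counts


def log_odds_matrix(counts, pseudocount=1.0, background=0.25):
    """Convert counts to log2(frequency / background), with a pseudocount."""
    matrix = []
    for column in counts:
        total = sum(column.values()) + len(BASES) * pseudocount
        matrix.append({base: math.log2(((column[base] + pseudocount) / total) / background)
                       for base in BASES})
    return matrix


def best_hit(sequence, matrix):
    """Slide the matrix along the sequence; return (best score, start position)."""
    width = len(matrix)
    scored = [(sum(matrix[i][sequence[start + i]] for i in range(width)), start)
              for start in range(len(sequence) - width + 1)]
    return max(scored)


# Hypothetical aligned sites (the matrix is "built from 5 sites")
sites = ["TATAAA", "TATAAA", "TATATA", "TATAAG", "TATAAA"]
counts = count_matrix(sites)
matrix = log_odds_matrix(counts)
score, start = best_hit("GGCGCTATAAAGGCGC", matrix)
print(f"best window starts at {start} with score {score:.2f}")
```

With only a handful of sites, the pseudocount dominates rare bases, which is one reason the number of sites behind a matrix is worth checking before trusting its scores.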


Global alignment of human & mouse obese gene promoters (200 bp upstream from TSS)

GenBank IDs and Accessions

http://www.ncbi.nlm.nih.gov/RefSeq/key.html#accessions (Accession Formats: RefSeq)

http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html (GenBank Sample Record)

Why do we do multiple alignments?

– Help prediction of the secondary and tertiary structures of new sequences;

– Preliminary step in molecular evolution analysis using phylogenetic methods for constructing phylogenetic trees.

An example of Multiple Alignment

VTISCTGSSSNIGAG-NHVKWYQQLPG
VTISCTGTSSNIGS--ITVNWYQQLPG
LRLSCSSSGFIFSS--YAMYWVRQAPG
LSLTCTVSGTSFDD--YYSTWVRQPPG
PEVTCVVVDVSHEDPQVKFNWYVDG--
ATLVCLISDFYPGA--VTVAWKADS--
AALGCLVKDYFPEP--VTVSWNSG---
VSLTCLVKGFYPSD--IAVEWWSNG--

Visualization example

Other multiple alignment programs

ClustalW / ClustalX

pileup

multalign

multal

saga

hmmt

DIALIGN

SBpima

MLpima

T-Coffee

...

ClustalW - for multiple alignment

ClustalW can create multiple alignments, manipulate existing alignments, do profile analysis and create phylogenetic trees.

Alignment can be done by 2 methods:
- slow/accurate
- fast/approximate

Running ClustalW

[~]% clustalw

 **************************************************************
 ******** CLUSTAL W (1.7) Multiple Sequence Alignments ********
 **************************************************************

     1. Sequence Input From Disc
     2. Multiple Alignments
     3. Profile / Structure Alignments
     4. Phylogenetic trees

     S. Execute a system command
     H. HELP
     X. EXIT (leave program)

Your choice:

Running ClustalW

The input file for ClustalW is a file containing all sequences in one of the following formats: NBRF/PIR, EMBL/SwissProt, Pearson (Fasta), GDE, Clustal, GCG/MSF, RSF.
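Besides the interactive menu, ClustalW can be driven non-interactively. The sketch below only assembles a command line and runs it if the program is installed. It assumes the option names -INFILE, -ALIGN, -OUTFILE, -GAPOPEN and -GAPEXT as listed in ClustalW's own command-line help, so they should be verified against the installed version; the file names `tnf.fasta` and `tnf.aln` and both helper functions are hypothetical.

```python
import shutil
import subprocess


def clustalw_command(infile, outfile, gap_open=15.0, gap_ext=6.66,
                     executable="clustalw"):
    """Build an argument list for a complete multiple alignment run."""
    return [
        executable,
        f"-INFILE={infile}",      # sequences in any supported input format
        "-ALIGN",                 # do the full multiple alignment
        f"-OUTFILE={outfile}",    # where to write the alignment
        f"-GAPOPEN={gap_open}",   # multiple-alignment gap opening penalty
        f"-GAPEXT={gap_ext}",     # multiple-alignment gap extension penalty
    ]


def run_clustalw(infile, outfile, **options):
    """Run ClustalW if it is on the PATH; return True if it was run."""
    command = clustalw_command(infile, outfile, **options)
    if shutil.which(command[0]) is None:
        return False
    subprocess.run(command, check=True)
    return True


command = clustalw_command("tnf.fasta", "tnf.aln")
print(" ".join(command))
```

The default penalties mirror the values shown in the parameter menus (15.00 and 6.66), so a scripted run starts from the same settings as the interactive one.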

Using ClustalW

 ****** MULTIPLE ALIGNMENT MENU ******

    1. Do complete multiple alignment now (Slow/Accurate)
    2. Produce guide tree file only
    3. Do alignment using old guide tree file

    4. Toggle Slow/Fast pairwise alignments = SLOW

    5. Pairwise alignment parameters
    6. Multiple alignment parameters

    7. Reset gaps between alignments? = OFF
    8. Toggle screen display          = ON
    9. Output format options

    S. Execute a system command
    H. HELP
    or press [RETURN] to go back to main menu

Your choice:

Output of ClustalW

CLUSTAL W (1.7) multiple sequence alignment

HSTNFR      GGGAAGAG---TTCCCCAGGGACCTCTCTCTAATCAGCCCTCTGGCCCAG------GCAG
SYNTNFTRP   GGGAAGAG---TTCCCCAGGGACCTCTCTCTAATCAGCCCTCTGGCCCAG------GCAG
CFTNFA      -------------------------------------------TGTCCAG------ACAG
CATTNFAA    GGGAAGAG---CTCCCACATGGCCTGCAACTAATCAACCCTCTGCCCCAG------ACAC
RABTNFM     AGGAGGAAGAGTCCCCAAACAACCTCCATCTAGTCAACCCTGTGGCCCAGATGGTCACCC
RNTNFAA     AGGAGGAGAAGTTCCCAAATGGGCTCCCTCTCATCAGTTCCATGGCCCAGACCCTCACAC
OATNFA1     GGGAAGAGCAGTCCCCAGCTGGCCCCTCCTTCAACAGGCCTCTGGTTCAG------ACAC
OATNFAR     GGGAAGAGCAGTCCCCAGCTGGCCCCTCCTTCAACAGGCCTCTGGTTCAG------ACAC
BSPTNFA     GGGAAGAGCAGTCCCCAGGTGGCCCCTCCATCAACAGCCCTCTGGTTCAA------ACAC
CEU14683    GGGAAGAGCAATCCCCAACTGGCCTCTCCATCAACAGCCCTCTGGTTCAG------ACCC
                                                           **        *

ClustalW options

Your choice: 5

 ********* PAIRWISE ALIGNMENT PARAMETERS *********

     Slow/Accurate alignments:

     1. Gap Open Penalty       :15.00
     2. Gap Extension Penalty  :6.66
     3. Protein weight matrix  :BLOSUM30
     4. DNA weight matrix      :IUB

     Fast/Approximate alignments:

     5. Gap penalty            :5
     6. K-tuple (word) size    :2
     7. No. of top diagonals   :4
     8. Window size            :4

     9. Toggle Slow/Fast pairwise alignments = SLOW

     H. HELP

Enter number (or [RETURN] to exit):

ClustalW options

Your choice: 6

 ********* MULTIPLE ALIGNMENT PARAMETERS *********

     1. Gap Opening Penalty              :15.00
     2. Gap Extension Penalty            :6.66
     3. Delay divergent sequences        :40 %

     4. DNA Transitions Weight           :0.50

     5. Protein weight matrix            :BLOSUM series
     6. DNA weight matrix                :IUB
     7. Use negative matrix              :OFF

     8. Protein Gap Parameters

     H. HELP

Enter number (or [RETURN] to exit):

Blocks database and tools

Blocks are multiply aligned ungapped segments corresponding to the most highly conserved regions of proteins.

The Blocks web server tools are: Block Searcher, Get Blocks and Block Maker. These are aids to detection and verification of protein sequence homology.

They compare a protein or DNA sequence to a database of protein blocks, retrieve blocks, and create new blocks, respectively.

The BLOCKS web server

At URL: http://blocks.fhcrc.org/

The BLOCKS WWW server can be used to create blocks of a group of sequences, or to compare a protein sequence to a database of blocks.

The Blocks Searcher tool should be used for multiple alignment of distantly related protein sequences.

The Blocks Searcher tool

For searching a database of blocks, the first position of the sequence is aligned with the first position of the first block, and a score for that amino acid is obtained from the profile column corresponding to that position. Scores are summed over the width of the alignment, and then the block is aligned with the next position.

This procedure is carried out exhaustively for all positions of the sequence for all blocks in the database, and the best alignments between a sequence and entries in the BLOCKS database are noted. If a particular block scores highly, it is possible that the sequence is related to the group of sequences the block represents.
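The exhaustive scan described above can be sketched as follows. This is a toy illustration, not the Blocks Searcher implementation: the two "blocks" and their per-column scores are invented, and the names `scan_block` and `search_blocks` are hypothetical.

```python
def scan_block(sequence, profile):
    """Slide one block profile along the sequence; return (best score, start)."""
    width = len(profile)
    best_score, best_start = float("-inf"), -1
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        # Sum the column score of each residue over the width of the block
        score = sum(column.get(residue, 0) for column, residue in zip(profile, window))
        if score > best_score:
            best_score, best_start = score, start
    return best_score, best_start


def search_blocks(sequence, blocks):
    """Scan every block in the database and rank blocks by their best score."""
    hits = [(name, *scan_block(sequence, profile)) for name, profile in blocks.items()]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Toy database (assumption): each block is a list of columns, each mapping residue -> score
blocks = {
    "toy_block_1": [{"G": 5}, {"K": 4, "R": 2}, {"G": 5}],
    "toy_block_2": [{"W": 6}, {"W": 6}],
}

hits = search_blocks("MAGKGLLW", blocks)
for name, score, start in hits:
    print(f"{name}: best score {score} at position {start}")
```

A real search would also calibrate each block's scores against chance matches, which is what makes scores from different blocks comparable.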

The Blocks Searcher tool

Typically, a group of proteins has more than one region in common and their relationship is represented as a series of blocks separated by unaligned regions. If a second block for a group also scores highly in the search, the evidence that the sequence is related to the group is strengthened, and is further strengthened if a third block also scores it highly, and so on.

The BLOCKS Database

The blocks for the BLOCKS database are made automatically by looking for the most highly conserved regions in groups of proteins represented in the PROSITE database. These blocks are then calibrated against the SWISS-PROT database to obtain a measure of the chance distribution of matches. It is these calibrated blocks that make up the BLOCKS database.

The Block Maker Tool

Block Maker finds conserved blocks in a group of two or more unaligned protein sequences, which are assumed to be related, using two different algorithms.

Input file must contain at least 2 sequences.

Input sequences must be in FastA format.

Results are returned by e-mail.

Progressive Approaches

CLUSTALW
• Perform pairwise alignments
• Construct a tree, joining most similar sequences first (guide tree)
• Align sequences sequentially, using the phylogenetic tree

PILEUP
• Similar to CLUSTALW
• Uses UPGMA to produce tree (chapter 6)

Clustal method

Higgins and Sharp 1988

ref: CLUSTAL: a package for performing multiple sequence alignment on a microcomputer. Gene, 73, 237–244. [Medline]

Progressive alignment method

An approximation strategy (heuristic algorithm) yields a possible alignment, but not necessarily the best one

First step:

Sequences: A, B, C, D

Compute the pairwise alignments for all against all (6 pairwise alignments); the similarities are stored in a table

      A     B     C     D
A
B     11
C     3     1
D     2     2     10

Second step:

Cluster the sequences to create a tree (guide tree):
• Represents the order in which pairs of sequences are to be aligned
• Highly similar sequences are neighbors in the tree
• Highly distant sequences are distant from each other in the tree
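The clustering step can be illustrated with the similarity table from the first step (A-B 11, A-C 3, B-C 1, A-D 2, B-D 2, C-D 10). The sketch below uses simple average-linkage clustering on similarities as a stand-in; it is an assumption for illustration and not the exact tree-building procedure ClustalW uses. Function names are hypothetical.

```python
from itertools import combinations

# Pairwise similarities from the table in the first step
similarity = {
    ("A", "B"): 11, ("A", "C"): 3, ("B", "C"): 1,
    ("A", "D"): 2, ("B", "D"): 2, ("C", "D"): 10,
}


def leaves(tree):
    """Sequence names under a (possibly nested) cluster."""
    return [tree] if isinstance(tree, str) else leaves(tree[0]) + leaves(tree[1])


def pair_score(x, y, table):
    """Look up a similarity regardless of the order of the pair."""
    return table[(x, y)] if (x, y) in table else table[(y, x)]


def average_similarity(left, right, table):
    """Mean similarity over all cross-cluster sequence pairs."""
    pairs = [(a, b) for a in leaves(left) for b in leaves(right)]
    return sum(pair_score(a, b, table) for a, b in pairs) / len(pairs)


def guide_tree(names, table):
    """Repeatedly join the two most similar clusters until one tree remains."""
    clusters = list(names)
    while len(clusters) > 1:
        left, right = max(combinations(clusters, 2),
                          key=lambda pair: average_similarity(pair[0], pair[1], table))
        clusters = [c for c in clusters if c is not left and c is not right]
        clusters.append((left, right))
    return clusters[0]


tree = guide_tree(["A", "B", "C", "D"], similarity)
print(tree)
```

The nesting of the result gives the alignment order: A with B and C with D are aligned first, and the two resulting alignments are then aligned to each other.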

Third step:

• Align most similar pairs
• Align the alignments as if each of them was a single sequence (with the use of a consensus sequence or a profile)

Clustal programs

ClustalV

ClustalW
• Thompson et al., 1994
• Uses: sequence weighting, position-specific gap penalties and weight matrix choice
• W stands for weighted sequences

ClustalX - windows implementation

ClustalW method rules (1)

Sequence weighting
• Each sequence is weighted according to how different it is from the other sequences.
• For the case where one specific subfamily is overrepresented in the data


Weight matrix choice

The substitution matrix used for each alignment step depends on the similarity of the sequences.


Position-specific gap penalties

• Gaps found in initial alignments remain fixed through the process (ends gap)
• Hydrophobic residues have higher gap penalties than hydrophilic: they are more likely to be in the hydrophobic core, where gaps should not occur.

ClustalW method shortcomings

(1) Sequences that are similar only in sub-regions: ClustalW forces a global alignment, not local.

(2) A sequence that contains a large insertion/deletion compared to the rest will strongly affect the alignment (again global, not local).

ClustalW method shortcomings

(3) A sequence that contains a repetitive element (such as a domain), whereas all other sequences only contain one copy.

Comments

• Pairwise alignment is an optimal algorithm
• Multiple alignment is not an optimal algorithm – only a heuristic. Better alignments may exist!
• The algorithm yields a possible alignment, but not necessarily the best one.
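To illustrate what "optimal" means for the pairwise case, here is a minimal Needleman-Wunsch global alignment sketch. It is guaranteed to find a best-scoring alignment for the given scoring scheme. The scores used (match +1, mismatch -1, linear gap -2) are arbitrary assumptions, not ClustalW's defaults, and real programs use substitution matrices and affine gap penalties.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global pairwise alignment by dynamic programming.

    Returns (score, aligned_a, aligned_b) for one optimal alignment.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    def substitution(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill: each cell is the best of diagonal (align), up (gap in b), left (gap in a)
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(score[i - 1][j - 1] + substitution(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Traceback from the bottom-right corner
    top, bottom = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + substitution(i, j):
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[-1][-1], "".join(reversed(top)), "".join(reversed(bottom))


best, row_a, row_b = needleman_wunsch("ACGT", "AGT")
print(best)
print(row_a)
print(row_b)
```

The table has one cell per pair of positions, so two sequences are cheap; extending the same table to N sequences needs one dimension per sequence, which is why multiple alignment falls back on heuristics such as the progressive method.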

ClustalW in the web server

• Global multiple sequence alignment program for DNA or proteins
• Available from a number of sites
• EMBL-EBI

Results

Alignment with colors: identity, similarity

CLUSTAL format

CLUSTAL W(1.82) multiple sequence alignment

YPK1        SQLSWKRLLMKGYIPPYKPAVSD-Q--NSMDTSNFDEEFTR--SEKPIDSVVDEYLSESV
YPK2        KDISWKKLLLKGYIPPYKPIVKDTQ--SEIDTANFDQEFTK---EKPIDSVVDEYLSASI
KPCA_HUMAN  RRIDWEKLENREIQPPFKPKVC------GKGAENFDKFFTR---GQPVLTPPDQLVIANI
KPCZ_HUMAN  RSIDWDLLEKKQALPPFQPQIT---M-DDYGLDNFDTQFTS---EPVQLTPDDEDAIKRI
KAPA        KEVVWEKLLSRNIETPYEPPIQ----QGQGDTSQFDKYPE----EDINYGVQGEDPYADL
KAPC        NEVIWEKLLARYIETPYEPPIQ----QGQGDTSQFDRYPE-EVDEEFNYGIQGEDPYMDL
KAPB        SEVVWERLLAKDIETPYEPPIT----SGIGDTSLFDQYPE-DV-EQLDYGIQGDDPYAEY
KS6_HUMAN   RHINWEELLARKVEPPFKPLLQ-----SEEDVSQFDSKFTR-V-QTPVDSP-DDSTLSES
            * *.

YPK1        -----MQKQF
YPK2        ----N-QKQF
KPCA_HUMAN  D--O--QSDF
KPCZ_HUMAN  D-----QSEF
KAPA        -D----FRDF
KAPC        -D----MKEF
KAPB        --P---FQDF
KS6_HUMAN   A-----NQVF
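Because the CLUSTAL format is written in blocks, each sequence has to be stitched back together from several lines before further analysis. A minimal reader might look like the sketch below; it assumes names contain no spaces and that conservation lines start with whitespace. The function name `parse_clustal` is hypothetical, and the sample reuses two rows from the alignment shown on the slide.

```python
def parse_clustal(text):
    """Read a CLUSTAL-format alignment into {name: aligned sequence}."""
    alignment = {}
    for line in text.splitlines():
        # Skip blank lines, the header line, and conservation lines (leading whitespace)
        if not line.strip() or line.startswith("CLUSTAL") or line[0].isspace():
            continue
        fields = line.split()
        name, chunk = fields[0], fields[1]
        alignment[name] = alignment.get(name, "") + chunk
    return alignment


sample = """CLUSTAL W(1.82) multiple sequence alignment

YPK1        SQLSWKRLLMKGYIPPYKPAVSD-Q--NSMDTSNFDEEFTR--SEKPIDSVVDEYLSESV
YPK2        KDISWKKLLLKGYIPPYKPIVKDTQ--SEIDTANFDQEFTK---EKPIDSVVDEYLSASI

YPK1        -----MQKQF
YPK2        ----N-QKQF
"""

alignment = parse_clustal(sample)
for name, sequence in alignment.items():
    print(name, len(sequence))
```

Checking that every sequence ends up with the same aligned length is a quick sanity test after parsing.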

ClustalW at EMBL - Jalview

conservation

Jalview is a multiple alignment editor

Jalview

color menu:
• Taylor colors (each amino acid is colored differently)
• Zappo colors (amino acids are colored according to their physico-chemical properties)
• Hydrophobicity colors (colors amino acids according to a certain score scale that represents hydrophobicity)
• Coloring residues above a percentage identity threshold
• User defined color schemes
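The idea behind coloring above a percentage identity threshold can be sketched without any graphics. The snippet below is not how Jalview computes it internally; it is a simple illustration that counts gaps in the denominator (an assumption) and uses the last alignment block from the CLUSTAL example as input. Function names are hypothetical.

```python
from collections import Counter


def column_identity(rows):
    """For each column: (most common residue, percent of rows carrying it)."""
    profile = []
    for column in zip(*rows):
        residues = [residue for residue in column if residue != "-"]
        if not residues:
            profile.append((None, 0.0))
            continue
        residue, count = Counter(residues).most_common(1)[0]
        profile.append((residue, 100.0 * count / len(column)))
    return profile


def mask_below_threshold(rows, threshold=80.0):
    """Keep residues that match a column consensus at or above the threshold."""
    profile = column_identity(rows)
    return ["".join(residue if residue == top and percent >= threshold else "."
                    for residue, (top, percent) in zip(row, profile))
            for row in rows]


rows = ["-----MQKQF", "----N-QKQF", "D--O--QSDF", "D-----QSEF",
        "-D----FRDF", "-D----MKEF", "--P---FQDF", "A-----NQVF"]

profile = column_identity(rows)
masked = mask_below_threshold(rows, threshold=80.0)
for row in masked:
    print(row)
```

In an editor, the residues that survive the mask are the ones that would be colored; lowering the threshold brings in partially conserved columns.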

Example - Zappo colors

physico-chemical properties color-code:

Guide Tree

ClustalX

ClustalX provides a window-based user interface to the ClustalW program.

It uses the Vibrant user interface library developed by the NCBI as part of their NCBI Software Development Toolkit.

T-coffee

• Another MSA program
• Protein & nucleotide MSA program
• Uses principles similar to ClustalW
• More accurate but longer running times
• Limits the number of sequences it can align (~100)
• T-coffee at EMBnet

T-coffee results

Phylip format

 5 99
Cabd_199509   PQITLWQRPLVTIKIGGQLKEALLDTGADDTVLEEMNLPGKWKPKIIGGI
JCSB1_199401  PQITLWQRPIVTIKIGGQLKEALLDTGAD---LEEM-NPGRWKPKIIGGI
JCSB2_199401  PQITLWQRPVT-IK-GG-QLEALLDTGADDTL-EEI-LPGRW-PKMIGGI
JCSB4_199401  PQITLWQRPVT--K-GG-LKEALLDTGADDTE-----DPGRWKPKMIGGI
JCSB5_199401  PQITLWQRPIVTIKVGGQLKEALLDTGADDTVL-EMNLPGRWKPKMIGGI

              GGFIKVRQYDQVPIEICGHKAIGTVLVGPTPSNIIGRNLLTQLGCTLNF
              GGFVKVRQYDQIPIDICGHKVIGTVL-GPTPANVIGRNLLTQIGCTLNF
              GGFVKVR-YDQVPIEICGH--IGTVLVGPTPANIIGRNLMTQLGCTLNF
              GGFLKVRQYDQIPVEICGHKAIGTVL-GPTPANIIGRNLLTQIG-TLNF
              GGFVKVRQYDQIPIEICGHKAIGTVLVGPTPANIVGRNLLTQIGCTLNF
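The interleaved layout above (a header with the number of sequences and sites, names only in the first block) can be read with a short sketch like the one below. It assumes a relaxed variant of the format in which names are separated from the sequence by whitespace, as in the example shown, rather than fixed-width 10-character names. The function name is hypothetical, and the sample uses two of the sequences from the slide.

```python
def parse_phylip_interleaved(text):
    """Read a relaxed interleaved PHYLIP alignment into {name: sequence}."""
    lines = [line for line in text.splitlines() if line.strip()]
    n_sequences, n_sites = (int(value) for value in lines[0].split())

    names, sequences = [], []
    # First block: name followed by the first chunk of each sequence
    for line in lines[1:1 + n_sequences]:
        name, chunk = line.split(None, 1)
        names.append(name)
        sequences.append("".join(chunk.split()))
    # Later blocks: sequence chunks only, in the same order as the first block
    for index, line in enumerate(lines[1 + n_sequences:]):
        sequences[index % n_sequences] += "".join(line.split())

    for name, sequence in zip(names, sequences):
        if len(sequence) != n_sites:
            raise ValueError(f"{name}: expected {n_sites} sites, found {len(sequence)}")
    return dict(zip(names, sequences))


sample = """ 2 99
Cabd_199509   PQITLWQRPLVTIKIGGQLKEALLDTGADDTVLEEMNLPGKWKPKIIGGI
JCSB1_199401  PQITLWQRPIVTIKIGGQLKEALLDTGAD---LEEM-NPGRWKPKIIGGI

              GGFIKVRQYDQVPIEICGHKAIGTVLVGPTPSNIIGRNLLTQLGCTLNF
              GGFVKVRQYDQIPIDICGHKVIGTVL-GPTPANVIGRNLLTQIGCTLNF
"""

records = parse_phylip_interleaved(sample)
for name, sequence in records.items():
    print(name, len(sequence))
```

The length check against the header is worth keeping: a mismatch usually means a truncated block or a name that ran into the sequence.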

The Biology WorkBench

http://workbench.sdsc.edu/

http://www.ngbw.org/

Nucleic Acid Sequence Tools, including BLAST, CLUSTALW, MFOLD, PRIMER3

Muscle

• Protein & nucleotide MSA program
• Improvements in both accuracy and speed
  • exploiting a range of existing and new algorithmic techniques
  • combination of progressive and iterative alignment strategies
• details of the method
• web server
• downloads: Windows, Linux, Mac

Muscle web server

Editing MSA

• There are a variety of tools that can be used to modify a multiple alignment (SeaView, BioEdit, JalView)
• These programs can be very useful in formatting and annotating an alignment for publication.
• An editor can also be used to make modifications by hand to improve biologically significant regions in a multiple alignment created by one of the automated alignment programs.

MSA approaches

• Progressive approach: CLUSTALW (CLUSTALX), PileUp, T-COFFEE, MAFFT, MUSCLE
• Iterative approach: Repeatedly realign subsets of sequences. MultAlin, DIALIGN, MAFFT, MUSCLE, ProbCons
• Genetic algorithm: SAGA
• Graph algorithm: POA

Conclusion

• There is no single method that always generates the best alignment
• It may thus be wise to use more than one method
• Alignment editors can be used to correct the alignments
