design and creation of multiple sequence alignments unit 15 biol221t: advanced bioinformatics for...
TRANSCRIPT
![Page 1: Design and creation of multiple sequence alignments Unit 15 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD](https://reader030.vdocuments.us/reader030/viewer/2022032709/56649ead5503460f94bb4af0/html5/thumbnails/1.jpg)
Design and creation of multiple sequence alignments
Unit 15
BIOL221T: Advanced Bioinformatics for Biotechnology
Irene Gabashvili, PhD
![Page 2: Design and creation of multiple sequence alignments Unit 15 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD](https://reader030.vdocuments.us/reader030/viewer/2022032709/56649ead5503460f94bb4af0/html5/thumbnails/2.jpg)
IPA 6.0 license
Need a list of e-mails to create accounts
Will have a 6-week license (instead of 2 weeks)
Problem Set 3 is Pathway Analysis; the lab of March 19 will also use IPA
![Page 3: Design and creation of multiple sequence alignments Unit 15 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD](https://reader030.vdocuments.us/reader030/viewer/2022032709/56649ead5503460f94bb4af0/html5/thumbnails/3.jpg)
Problem Set 2 Review
Sensitivity and Specificity
Parameters for Multiple Alignment (Databases, Search Terms, Scores)
Transfac
Dotplots
![Page 4: Design and creation of multiple sequence alignments Unit 15 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD](https://reader030.vdocuments.us/reader030/viewer/2022032709/56649ead5503460f94bb4af0/html5/thumbnails/4.jpg)
Gene prediction flowchart
![Page 5: Design and creation of multiple sequence alignments Unit 15 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD](https://reader030.vdocuments.us/reader030/viewer/2022032709/56649ead5503460f94bb4af0/html5/thumbnails/5.jpg)
Evaluation of Splice Site Prediction
Fig 5.11, Baxevanis & Ouellette 2005
What do measures really mean?
Note typo in B&O
![Page 6: Design and creation of multiple sequence alignments Unit 15 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD](https://reader030.vdocuments.us/reader030/viewer/2022032709/56649ead5503460f94bb4af0/html5/thumbnails/6.jpg)
ROC curves (plots of Sn vs (1-Sp))
A receiver operating characteristic (ROC), or simply ROC curve, is a graphical plot of the sensitivity vs. (1 - specificity) for a binary classifier system as its discrimination threshold is varied.
The sensitivity and specificity of a diagnostic test depend on more than just the "quality" of the test: they also depend on the definition of what constitutes an abnormal test.
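To make the threshold idea concrete, here is a minimal Python sketch that sweeps a cutoff over classifier scores and records one (1 - Sp, Sn) point per cutoff. The scores and labels are invented for illustration, and the helper names `roc_points` and `auc` are assumptions, not part of any tool used in the course.

```python
def roc_points(scores, labels):
    """Return (1 - Sp, Sn) pairs, one per threshold, for a score-based classifier."""
    positives = sum(labels)
    negatives = len(labels) - positives
    # Each distinct score is tried as a cutoff; +inf means "predict nothing as positive".
    thresholds = [float("inf")] + sorted(set(scores), reverse=True)
    points = []
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the curve by the trapezoid rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical prediction scores and true labels (1 = real site, 0 = not a site).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0, 0]
curve = roc_points(scores, labels)
print(curve)
print(round(auc(curve), 3))
```

Lowering the cutoff moves along the curve toward the upper right: more true sites are found (higher Sn) at the price of more false positives (lower Sp).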
![Page 7: Design and creation of multiple sequence alignments Unit 15 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD](https://reader030.vdocuments.us/reader030/viewer/2022032709/56649ead5503460f94bb4af0/html5/thumbnails/7.jpg)
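Evaluation of Splice Site Prediction

| | Actual True | Actual False | Total |
|---|---|---|---|
| Predicted True | TP | FP | PP = TP+FP |
| Predicted False | FN | TN | PN = FN+TN |
| Total | AP = TP+FN | AN = FP+TN | |

• Sensitivity: Sn = TP/AP = Coverage
• Misclassification rates: α = FN/AP, β = FP/AN
• Specificity: Sp = TP/PP = (1-α)/(1-α+rβ), where r = AN/AP
• Normalized specificity: σ = (1-α)/(1-α+β)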
![Page 8: Design and creation of multiple sequence alignments Unit 15 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD](https://reader030.vdocuments.us/reader030/viewer/2022032709/56649ead5503460f94bb4af0/html5/thumbnails/8.jpg)
Careful: different definitions for "Specificity"

| | Actual True | Actual False | Total |
|---|---|---|---|
| Predicted True | TP | FP | PP = TP+FP |
| Predicted False | FN | TN | PN = FN+TN |
| Total | AP = TP+FN | AN = FP+TN | |

Brendel definitions
• Specificity: Sp = TP/PP = TP/(TP+FP)

cf. Guigó definitions
Sn: Sensitivity = TP/(TP+FN)
Sp: Specificity = TN/(TN+FP) = Sp-
AC: Approximate Coefficient = 0.5 x ((TP/(TP+FN)) + (TP/(TP+FP)) + (TN/(TN+FP)) + (TN/(TN+FN))) - 1
Other measures? Predictive Values, Correlation Coefficient
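The measures listed on this slide can be computed directly from the four counts of the confusion table. The sketch below is illustrative only: the counts are made up, and the function name `prediction_measures` is an assumption. It returns both quantities that get called "specificity", TN/(TN+FP) and TP/(TP+FP), so the difference is visible.

```python
def prediction_measures(tp, fp, fn, tn):
    """Accuracy measures from a 2x2 confusion table (formulas as written on the slide)."""
    sn = tp / (tp + fn)      # sensitivity (coverage)
    sp_tn = tn / (tn + fp)   # specificity as TN/(TN+FP)
    ppv = tp / (tp + fp)     # TP/(TP+FP), also called "specificity" by some authors
    npv = tn / (tn + fn)     # negative predictive value
    ac = 0.5 * (sn + ppv + sp_tn + npv) - 1   # approximate coefficient
    return {"Sn": sn, "Sp_TN": sp_tn, "PPV": ppv, "NPV": npv, "AC": ac}


# Hypothetical counts: 90 real sites, 910 non-sites.
m = prediction_measures(tp=80, fp=20, fn=10, tn=890)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

With many more true negatives than positives, TN/(TN+FP) stays close to 1 even when a fifth of the predictions are wrong, which is one reason the choice of definition matters.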
![Page 9: Design and creation of multiple sequence alignments Unit 15 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD](https://reader030.vdocuments.us/reader030/viewer/2022032709/56649ead5503460f94bb4af0/html5/thumbnails/9.jpg)
Best measures for comparing different methods?
• ROC curves (Receiver Operating Characteristic?!!)
http://www.anaesthetist.com/mnm/stats/roc/
"The Magnificent ROC" - has fun applets & quotes:
"There is no statistical test, however intuitive and simple, which will not be abused by medical researchers"
• Correlation Coefficient (Matthews correlation coefficient, MCC)
MCC = 1 for a perfect prediction
MCC = 0 for a completely random assignment
MCC = -1 for a "perfectly incorrect" prediction
Just FYI
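For reference, the standard MCC formula is (TP·TN - FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)). A small Python sketch reproduces the three landmark values above; returning 0 when the denominator is zero is a common convention and an assumption here, not something stated on the slide.

```python
from math import sqrt


def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient from confusion-table counts."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # convention when a whole row or column of the table is empty
    return (tp * tn - fp * fn) / denom


print(mcc(50, 0, 0, 50))    # perfect prediction
print(mcc(25, 25, 25, 25))  # no better than chance
print(mcc(0, 50, 50, 0))    # every call is wrong
```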
![Page 10: Design and creation of multiple sequence alignments Unit 15 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD](https://reader030.vdocuments.us/reader030/viewer/2022032709/56649ead5503460f94bb4af0/html5/thumbnails/10.jpg)
Promoters
What signals are there?
Simple ones in prokaryotes
![Page 11: Design and creation of multiple sequence alignments Unit 15 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD](https://reader030.vdocuments.us/reader030/viewer/2022032709/56649ead5503460f94bb4af0/html5/thumbnails/11.jpg)
Prokaryotic promoters
RNA polymerase complex recognizes promoter sequences located very close to & on 5’ side (“upstream”) of initiation site
RNA polymerase complex binds directly to these, with no requirement for “transcription factors”
Prokaryotic promoter sequences are highly conserved
-10 region
-35 region
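Because the -35 and -10 regions are so well conserved, even a naive consensus scan finds candidates. The sketch below assumes the textbook E. coli σ70 hexamers (TTGACA for -35, TATAAT for -10) and a 16-19 bp spacer; none of these values appear on the slide. The test sequence is invented, and `find_promoters` is a hypothetical helper, not a published tool.

```python
def mismatches(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def find_promoters(seq, box35="TTGACA", box10="TATAAT",
                   spacers=range(16, 20), max_mismatch=1):
    """Return (start of -35 box, spacer length, total mismatches) for each candidate."""
    seq = seq.upper()
    hits = []
    for i in range(len(seq) - len(box35) + 1):
        m35 = mismatches(seq[i:i + len(box35)], box35)
        if m35 > max_mismatch:
            continue
        for gap in spacers:
            j = i + len(box35) + gap          # start of the putative -10 box
            if j + len(box10) > len(seq):
                break
            m10 = mismatches(seq[j:j + len(box10)], box10)
            if m10 <= max_mismatch:
                hits.append((i, gap, m35 + m10))
    return hits


# Invented sequence: perfect boxes separated by a 17-bp spacer.
demo = "GGCC" + "TTGACA" + "A" * 17 + "TATAAT" + "GGCCGG"
print(find_promoters(demo))
```

Real promoters deviate from the consensus, so `max_mismatch` trades sensitivity against false positives, the same trade-off discussed for ROC curves.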
![Page 12: Design and creation of multiple sequence alignments Unit 15 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD](https://reader030.vdocuments.us/reader030/viewer/2022032709/56649ead5503460f94bb4af0/html5/thumbnails/12.jpg)
Simpler view of complex promoters in eukaryotes:
Fig 5.12, Baxevanis & Ouellette 2005
![Page 13: Design and creation of multiple sequence alignments Unit 15 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD](https://reader030.vdocuments.us/reader030/viewer/2022032709/56649ead5503460f94bb4af0/html5/thumbnails/13.jpg)
Eukaryotic genes are transcribed by 3 different RNA polymerases
Recognize different types of promoters & enhancers:
![Page 14: Design and creation of multiple sequence alignments Unit 15 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD](https://reader030.vdocuments.us/reader030/viewer/2022032709/56649ead5503460f94bb4af0/html5/thumbnails/14.jpg)
Eukaryotic promoters & enhancers
Promoters located “relatively” close to initiation site
(but can be located within gene, rather than upstream!)
Enhancers also required for regulated transcription
(these control expression in specific cell types, developmental stages, in response to environment)
RNA polymerase complexes do not specifically recognize promoter sequences directly
Transcription factors bind first and serve as “landmarks” for recognition by RNA polymerase complexes
![Page 15: Design and creation of multiple sequence alignments Unit 15 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD](https://reader030.vdocuments.us/reader030/viewer/2022032709/56649ead5503460f94bb4af0/html5/thumbnails/15.jpg)
Eukaryotic transcription factors
Transcription factors (TFs) are DNA binding proteins that also interact with RNA polymerase complex to activate or repress transcription
TFs contain characteristic “DNA binding motifs”
http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=genomes.table.7039
TFs recognize specific short DNA sequence motifs: “transcription factor binding sites”
Several databases for these, e.g. TRANSFAC http://www.generegulation.com/cgibin/pub/databases/transfac
![Page 16: Design and creation of multiple sequence alignments Unit 15 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD](https://reader030.vdocuments.us/reader030/viewer/2022032709/56649ead5503460f94bb4af0/html5/thumbnails/16.jpg)
Zinc finger-containing transcription factors
• Common in eukaryotic proteins
• Estimated 1% of mammalian genes encode zinc-finger proteins
• In C. elegans, there are 500!
• Can be used as highly specific DNA binding modules
• Potentially valuable tools for directed genome modification (esp. in plants) & human gene therapy
![Page 17: Design and creation of multiple sequence alignments Unit 15 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD](https://reader030.vdocuments.us/reader030/viewer/2022032709/56649ead5503460f94bb4af0/html5/thumbnails/17.jpg)
Promoter prediction: Eukaryotes vs prokaryotes
Promoter prediction is easier in microbial genomes
Why?
• Highly conserved
• Simpler gene structures
• More sequenced genomes! (for comparative approaches)
Methods?
• Previously: mostly HMM-based
• Now: similarity-based, comparative methods, because so many genomes are available
![Page 18: Design and creation of multiple sequence alignments Unit 15 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD](https://reader030.vdocuments.us/reader030/viewer/2022032709/56649ead5503460f94bb4af0/html5/thumbnails/18.jpg)
Predicting promoters: Steps & Strategies
Closely related to gene prediction!
• Obtain genomic sequence
• Use sequence-similarity based comparison (BLAST, MSA) to find related genes
  But: "regulatory" regions are much less well-conserved than coding regions
• Locate ORFs
• Identify TSS (if possible!)
• Use promoter prediction programs
• Analyze motifs, etc. in sequence (TRANSFAC)
![Page 19: Design and creation of multiple sequence alignments Unit 15 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD](https://reader030.vdocuments.us/reader030/viewer/2022032709/56649ead5503460f94bb4af0/html5/thumbnails/19.jpg)
Predicting promoters: Steps & Strategies
Identify TSS, if possible?
• One of biggest problems is determining exact TSS! Not very many full-length cDNAs!
• Good starting point? (human & vertebrate genes) Use FirstEF, found within UCSC Genome Browser, or submit to FirstEF web server
Fig 5.10, Baxevanis & Ouellette 2005
![Page 20: Design and creation of multiple sequence alignments Unit 15 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD](https://reader030.vdocuments.us/reader030/viewer/2022032709/56649ead5503460f94bb4af0/html5/thumbnails/20.jpg)
Automated promoter prediction strategies
1) Pattern-driven algorithms
2) Sequence-driven algorithms
3) Combined "evidence-based"
BEST RESULTS? Combined, sequential
![Page 21: Design and creation of multiple sequence alignments Unit 15 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD](https://reader030.vdocuments.us/reader030/viewer/2022032709/56649ead5503460f94bb4af0/html5/thumbnails/21.jpg)
Promoter Prediction: Pattern-driven algorithms
• Success depends on availability of collections of annotated binding sites (TRANSFAC & PROMO)
• Tend to produce huge numbers of FPs
• Why?
  • Binding sites (BS) for specific TFs often variable
  • Binding sites are short (typically 5-15 bp)
  • Interactions between TFs (& other proteins) influence affinity & specificity of TF binding
  • One binding site often recognized by multiple TFs
  • Biology is complex: promoters often specific to organism/cell/stage/environmental condition
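A quick back-of-the-envelope calculation shows why short sites produce so many false positives. The sketch assumes a random sequence with equal base frequencies and independent positions (real genomes are neither), and uses 3 × 10^9 bp as a round, assumed genome size.

```python
def expected_random_hits(genome_length, site_length, both_strands=True):
    """Expected chance occurrences of one exact site in random, equiprobable DNA."""
    positions = genome_length - site_length + 1
    strands = 2 if both_strands else 1
    return strands * positions * 0.25 ** site_length


genome = 3_000_000_000  # assumed round number, roughly mammalian scale
for k in (5, 8, 10, 15):
    print(f"{k:2d} bp site: about {expected_random_hits(genome, k):,.0f} chance matches")
```

Under these assumptions an exact 5-bp site is expected to occur by chance millions of times, and allowing mismatches (variable sites) inflates the count further, which is why pattern matching alone cannot identify functional binding sites.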
![Page 22: Design and creation of multiple sequence alignments Unit 15 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD](https://reader030.vdocuments.us/reader030/viewer/2022032709/56649ead5503460f94bb4af0/html5/thumbnails/22.jpg)
Promoter Prediction: Pattern-driven algorithms
Solutions to problem of too many FP predictions?
• Take sequence context/biology into account
  • Eukaryotes: clusters of TFBSs are common
  • Prokaryotes: knowledge of factors helps
• Probability of "real" binding site increases if annotated transcription start site (TSS) nearby
  • But: What about enhancers? (no TSS nearby!) & Only a small fraction of TSSs have been experimentally mapped
• Do the wet lab experiments!
  • But: Promoter-bashing is tedious
![Page 23: Design and creation of multiple sequence alignments Unit 15 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD](https://reader030.vdocuments.us/reader030/viewer/2022032709/56649ead5503460f94bb4af0/html5/thumbnails/23.jpg)
Promoter Prediction: Sequence-driven algorithms
• Assumption: common functionality can be deduced from sequence conservation
• Alignments of co-regulated genes should highlight elements involved in regulation
Careful: How to determine co-regulation?
• Orthologous genes from different species
• Genes experimentally determined to be co-regulated (using microarrays??)
• Comparative promoter prediction: "Phylogenetic footprinting" - more later….
![Page 24: Design and creation of multiple sequence alignments Unit 15 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD](https://reader030.vdocuments.us/reader030/viewer/2022032709/56649ead5503460f94bb4af0/html5/thumbnails/24.jpg)
Promoter Prediction: Sequence-driven algorithms
Problems:
• Need sets of co-regulated genes
• For comparative (phylogenetic) methods
  • Must choose appropriate species
  • Different genomes evolve at different rates
  • Classical alignment methods have trouble with translocations, inversions in order of functional elements
  • If background conservation of entire region is high, comparison is useless
  • Not enough data (Prokaryotes >>> Eukaryotes)
• Biology is complex: many (most?) regulatory elements are not conserved across species!
![Page 25: Design and creation of multiple sequence alignments Unit 15 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD](https://reader030.vdocuments.us/reader030/viewer/2022032709/56649ead5503460f94bb4af0/html5/thumbnails/25.jpg)
Examples of promoter prediction/characterization software
Lab: used MATCH, MatInspector
• TRANSFAC
• MEME & MAST
• BLAST, etc.
Others?
• FIRST EF
• Dragon Promoter Finder; also see Dragon Genome Explorer (has specialized promoter software for GC-rich DNA, finding CpG islands, etc)
• JASPAR
![Page 26: Design and creation of multiple sequence alignments Unit 15 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD](https://reader030.vdocuments.us/reader030/viewer/2022032709/56649ead5503460f94bb4af0/html5/thumbnails/26.jpg)
TRANSFAC matrix entry: for TATA box
Fields:
• Accession & ID
• Brief description
• TFs associated with this entry
• Weight matrix
• Number of sites used to build (How many here?)
• Other info
Fig 5.13, Baxevanis & Ouellette 2005
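A weight matrix like the one in a TRANSFAC entry is used by sliding it along a sequence and summing one score per position. The sketch below shows the general idea with an invented count matrix built from 20 hypothetical sites; it is NOT the TRANSFAC TATA-box entry. The pseudocount and uniform background are assumptions, and tools such as MATCH use their own scoring schemes.

```python
from math import log2

# Hypothetical base counts for a 6-bp TATA-like motif (20 sites per column).
COUNTS = {
    "A": [2, 18, 1, 19, 12, 17],
    "C": [1, 0, 1, 0, 0, 1],
    "G": [1, 0, 0, 0, 0, 1],
    "T": [16, 2, 18, 1, 8, 1],
}


def log_odds(counts, pseudocount=1.0, background=0.25):
    """Turn a count matrix into per-column log2 odds against a uniform background."""
    width = len(counts["A"])
    pwm = []
    for i in range(width):
        total = sum(counts[b][i] for b in "ACGT") + 4 * pseudocount
        pwm.append({b: log2((counts[b][i] + pseudocount) / total / background)
                    for b in "ACGT"})
    return pwm


def scan(seq, pwm):
    """Score every window of the sequence; returns (start, score) pairs."""
    width = len(pwm)
    return [(i, sum(col[seq[i + k]] for k, col in enumerate(pwm)))
            for i in range(len(seq) - width + 1)]


pwm = log_odds(COUNTS)
hits = scan("GGCTATAAAGGC", pwm)
best_pos, best_score = max(hits, key=lambda h: h[1])
print(best_pos, round(best_score, 2))
```

The number of sites behind a matrix matters: with few sites, the pseudocount dominates and the scores become less trustworthy.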
![Page 27: Design and creation of multiple sequence alignments Unit 15 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD](https://reader030.vdocuments.us/reader030/viewer/2022032709/56649ead5503460f94bb4af0/html5/thumbnails/27.jpg)
Global alignment of human & mouse obese gene promoters (200 bp upstream from TSS)
Fig 5.14, Baxevanis & Ouellette 2005
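A global alignment forces both sequences to be aligned end to end. As a rough illustration, here is a minimal Needleman-Wunsch sketch with a linear gap penalty. The scoring values (match +1, mismatch -1, gap -2) are arbitrary assumptions, and this is not the program or the parameters used to produce Fig 5.14.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


result = global_align("ACGT", "AGT")
print(result)
```

Conserved stretches in such an alignment of orthologous promoters are candidates for regulatory elements, which is the idea behind phylogenetic footprinting.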
![Page 28: Design and creation of multiple sequence alignments Unit 15 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD](https://reader030.vdocuments.us/reader030/viewer/2022032709/56649ead5503460f94bb4af0/html5/thumbnails/28.jpg)
GenBank IDs and Accessions
http://www.ncbi.nlm.nih.gov/RefSeq/key.html#accessions (Accession Formats: RefSeq)
http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html (GenBank Sample Record)
![Page 29: Design and creation of multiple sequence alignments Unit 15 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD](https://reader030.vdocuments.us/reader030/viewer/2022032709/56649ead5503460f94bb4af0/html5/thumbnails/29.jpg)
Why do we do multiple alignments?
- Help prediction of the secondary and tertiary structures of new sequences;
- Preliminary step in molecular evolution analysis using phylogenetic methods for constructing phylogenetic trees.
![Page 30: Design and creation of multiple sequence alignments Unit 15 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD](https://reader030.vdocuments.us/reader030/viewer/2022032709/56649ead5503460f94bb4af0/html5/thumbnails/30.jpg)
An example of Multiple Alignment

    VTISCTGSSSNIGAG-NHVKWYQQLPG
    VTISCTGTSSNIGS--ITVNWYQQLPG
    LRLSCSSSGFIFSS--YAMYWVRQAPG
    LSLTCTVSGTSFDD--YYSTWVRQPPG
    PEVTCVVVDVSHEDPQVKFNWYVDG--
    ATLVCLISDFYPGA--VTVAWKADS--
    AALGCLVKDYFPEP--VTVSWNSG---
    VSLTCLVKGFYPSD--IAVEWWSNG--
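One simple thing to do with an alignment like this is to ask which columns are fully conserved. The sketch below assumes the slide's alignment splits into eight rows of 27 columns each; `column_summary` is an illustrative helper, not a function from any alignment package.

```python
from collections import Counter

alignment = [
    "VTISCTGSSSNIGAG-NHVKWYQQLPG",
    "VTISCTGTSSNIGS--ITVNWYQQLPG",
    "LRLSCSSSGFIFSS--YAMYWVRQAPG",
    "LSLTCTVSGTSFDD--YYSTWVRQPPG",
    "PEVTCVVVDVSHEDPQVKFNWYVDG--",
    "ATLVCLISDFYPGA--VTVAWKADS--",
    "AALGCLVKDYFPEP--VTVSWNSG---",
    "VSLTCLVKGFYPSD--IAVEWWSNG--",
]


def column_summary(rows):
    """For each column: (most common symbol, fraction of rows that carry it)."""
    summary = []
    for column in zip(*rows):
        symbol, count = Counter(column).most_common(1)[0]
        summary.append((symbol, count / len(rows)))
    return summary


summary = column_summary(alignment)
# 0-based indices of columns where every row has the same residue (gaps excluded).
conserved = [i for i, (sym, frac) in enumerate(summary) if frac == 1.0 and sym != "-"]
print(conserved, [summary[i][0] for i in conserved])
```

In this example only a cysteine and a tryptophan column are invariant across all eight sequences, the kind of signal that makes multiple alignments useful for structure prediction.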
![Page 31: Design and creation of multiple sequence alignments Unit 15 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD](https://reader030.vdocuments.us/reader030/viewer/2022032709/56649ead5503460f94bb4af0/html5/thumbnails/31.jpg)
Visualization example
![Page 32: Design and creation of multiple sequence alignments Unit 15 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD](https://reader030.vdocuments.us/reader030/viewer/2022032709/56649ead5503460f94bb4af0/html5/thumbnails/32.jpg)
Other multiple alignment programs
ClustalW / ClustalX
pileup
multalign
multal
saga
hmmt
DIALIGN
SBpima
MLpima
T-Coffee
...
![Page 33: Design and creation of multiple sequence alignments Unit 15 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD](https://reader030.vdocuments.us/reader030/viewer/2022032709/56649ead5503460f94bb4af0/html5/thumbnails/33.jpg)
![Page 34: Design and creation of multiple sequence alignments Unit 15 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD](https://reader030.vdocuments.us/reader030/viewer/2022032709/56649ead5503460f94bb4af0/html5/thumbnails/34.jpg)
ClustalW - for multiple alignment
ClustalW can create multiple alignments, manipulate existing alignments, do profile analysis and create phylogenetic trees.
Alignment can be done by 2 methods:
- slow/accurate
- fast/approximate
![Page 35: Design and creation of multiple sequence alignments Unit 15 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD](https://reader030.vdocuments.us/reader030/viewer/2022032709/56649ead5503460f94bb4af0/html5/thumbnails/35.jpg)
Running ClustalW

    [~]% clustalw

    **************************************************************
    ******** CLUSTAL W (1.7) Multiple Sequence Alignments ********
    **************************************************************

    1. Sequence Input From Disc
    2. Multiple Alignments
    3. Profile / Structure Alignments
    4. Phylogenetic trees

    S. Execute a system command
    H. HELP
    X. EXIT (leave program)

    Your choice:
![Page 36: Design and creation of multiple sequence alignments Unit 15 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD](https://reader030.vdocuments.us/reader030/viewer/2022032709/56649ead5503460f94bb4af0/html5/thumbnails/36.jpg)
Running ClustalW
The input file for ClustalW is a file containing all sequences in one of the following formats: NBRF/PIR, EMBL/SwissProt, Pearson (Fasta), GDE, Clustal, GCG/MSF, RSF.
![Page 37: Design and creation of multiple sequence alignments Unit 15 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD](https://reader030.vdocuments.us/reader030/viewer/2022032709/56649ead5503460f94bb4af0/html5/thumbnails/37.jpg)
Using ClustalW

    ****** MULTIPLE ALIGNMENT MENU ******

    1. Do complete multiple alignment now (Slow/Accurate)
    2. Produce guide tree file only
    3. Do alignment using old guide tree file

    4. Toggle Slow/Fast pairwise alignments = SLOW

    5. Pairwise alignment parameters
    6. Multiple alignment parameters

    7. Reset gaps between alignments? = OFF
    8. Toggle screen display          = ON
    9. Output format options

    S. Execute a system command
    H. HELP
    or press [RETURN] to go back to main menu

    Your choice:
![Page 38: Design and creation of multiple sequence alignments Unit 15 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD](https://reader030.vdocuments.us/reader030/viewer/2022032709/56649ead5503460f94bb4af0/html5/thumbnails/38.jpg)
Output of ClustalW

    CLUSTAL W (1.7) multiple sequence alignment

    HSTNFR      GGGAAGAG---TTCCCCAGGGACCTCTCTCTAATCAGCCCTCTGGCCCAG------GCAG
    SYNTNFTRP   GGGAAGAG---TTCCCCAGGGACCTCTCTCTAATCAGCCCTCTGGCCCAG------GCAG
    CFTNFA      -------------------------------------------TGTCCAG------ACAG
    CATTNFAA    GGGAAGAG---CTCCCACATGGCCTGCAACTAATCAACCCTCTGCCCCAG------ACAC
    RABTNFM     AGGAGGAAGAGTCCCCAAACAACCTCCATCTAGTCAACCCTGTGGCCCAGATGGTCACCC
    RNTNFAA     AGGAGGAGAAGTTCCCAAATGGGCTCCCTCTCATCAGTTCCATGGCCCAGACCCTCACAC
    OATNFA1     GGGAAGAGCAGTCCCCAGCTGGCCCCTCCTTCAACAGGCCTCTGGTTCAG------ACAC
    OATNFAR     GGGAAGAGCAGTCCCCAGCTGGCCCCTCCTTCAACAGGCCTCTGGTTCAG------ACAC
    BSPTNFA     GGGAAGAGCAGTCCCCAGGTGGCCCCTCCATCAACAGCCCTCTGGTTCAA------ACAC
    CEU14683    GGGAAGAGCAATCCCCAACTGGCCTCTCCATCAACAGCCCTCTGGTTCAG------ACCC
                ** *
![Page 39: Design and creation of multiple sequence alignments Unit 15 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD](https://reader030.vdocuments.us/reader030/viewer/2022032709/56649ead5503460f94bb4af0/html5/thumbnails/39.jpg)
ClustalW options

    Your choice: 5

    ********* PAIRWISE ALIGNMENT PARAMETERS *********

    Slow/Accurate alignments:

    1. Gap Open Penalty       :15.00
    2. Gap Extension Penalty  :6.66
    3. Protein weight matrix  :BLOSUM30
    4. DNA weight matrix      :IUB

    Fast/Approximate alignments:

    5. Gap penalty            :5
    6. K-tuple (word) size    :2
    7. No. of top diagonals   :4
    8. Window size            :4

    9. Toggle Slow/Fast pairwise alignments = SLOW

    H. HELP

    Enter number (or [RETURN] to exit):
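The fast/approximate parameters (k-tuple size, number of top diagonals) refer to a word-matching heuristic: instead of a full dynamic-programming alignment, count short exact words shared by two sequences and keep the diagonals with the most matches. The sketch below is only in the spirit of that idea; it is not ClustalW's actual implementation, and the function names are hypothetical.

```python
from collections import defaultdict


def ktuple_diagonals(a, b, k=2):
    """Count exact k-tuple matches between a and b, grouped by diagonal (i - j)."""
    index = defaultdict(list)
    for j in range(len(b) - k + 1):
        index[b[j:j + k]].append(j)       # where each word occurs in b
    diagonals = defaultdict(int)
    for i in range(len(a) - k + 1):
        for j in index.get(a[i:i + k], []):
            diagonals[i - j] += 1
    return dict(diagonals)


def top_diagonals(a, b, k=2, top=4):
    """The `top` diagonals with the most k-tuple matches, best first."""
    diags = ktuple_diagonals(a, b, k)
    return sorted(diags.items(), key=lambda kv: kv[1], reverse=True)[:top]


counts = ktuple_diagonals("ACGTACGT", "ACGTACGT")
print(counts)
print(top_diagonals("ACGTACGT", "ACGTACGT")[0])
```

A larger k-tuple size makes the search faster but less sensitive, which is why the menu offers it as a tunable parameter.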
![Page 40: Design and creation of multiple sequence alignments Unit 15 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD](https://reader030.vdocuments.us/reader030/viewer/2022032709/56649ead5503460f94bb4af0/html5/thumbnails/40.jpg)
ClustalW options

    Your choice: 6

    ********* MULTIPLE ALIGNMENT PARAMETERS *********

    1. Gap Opening Penalty        :15.00
    2. Gap Extension Penalty      :6.66
    3. Delay divergent sequences  :40 %

    4. DNA Transitions Weight     :0.50

    5. Protein weight matrix      :BLOSUM series
    6. DNA weight matrix          :IUB
    7. Use negative matrix        :OFF

    8. Protein Gap Parameters

    H. HELP

    Enter number (or [RETURN] to exit):
![Page 41: Design and creation of multiple sequence alignments Unit 15 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD](https://reader030.vdocuments.us/reader030/viewer/2022032709/56649ead5503460f94bb4af0/html5/thumbnails/41.jpg)
Blocks database and tools
Blocks are multiply aligned ungapped segments corresponding to the most highly conserved regions of proteins.
The Blocks web server tools are: Block Searcher, Get Blocks and Block Maker. These are aids to detection and verification of protein sequence homology.
They compare a protein or DNA sequence to a database of protein blocks, retrieve blocks, and create new blocks, respectively.
![Page 42: Design and creation of multiple sequence alignments Unit 15 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD](https://reader030.vdocuments.us/reader030/viewer/2022032709/56649ead5503460f94bb4af0/html5/thumbnails/42.jpg)
The BLOCKS web server
At URL: http://blocks.fhcrc.org/
The BLOCKS WWW server can be used to create blocks of a group of sequences, or to compare a protein sequence to a database of blocks.
The Blocks Searcher tool should be used for multiple alignment of distantly related protein sequences.
![Page 43: Design and creation of multiple sequence alignments Unit 15 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD](https://reader030.vdocuments.us/reader030/viewer/2022032709/56649ead5503460f94bb4af0/html5/thumbnails/43.jpg)
The Blocks Searcher tool
For searching a database of blocks, the first position of the sequence is aligned with the first position of the first block, and a score for that amino acid is obtained from the profile column corresponding to that position. Scores are summed over the width of the alignment, and then the block is aligned with the next position.
This procedure is carried out exhaustively for all positions of the sequence for all blocks in the database, and the best alignments between a sequence and entries in the BLOCKS database are noted. If a particular block scores highly, it is possible that the sequence is related to the group of sequences the block represents.
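The sliding procedure described above can be sketched in a few lines of Python. The three-column profile and its scores are invented for illustration, the default score for unlisted residues is an assumption, and `best_block_alignment` is a hypothetical name; the real Blocks Searcher uses calibrated position-specific scoring matrices.

```python
def best_block_alignment(sequence, profile, default=-1.0):
    """Slide an ungapped block profile along a sequence; return (best start, best score).

    Returns None when the sequence is shorter than the block.
    """
    width = len(profile)
    best = None
    for start in range(len(sequence) - width + 1):
        # Sum one score per profile column over the width of the block.
        score = sum(profile[k].get(sequence[start + k], default) for k in range(width))
        if best is None or score > best[1]:
            best = (start, score)
    return best


# Hypothetical 3-column profile: one dict of residue scores per block position.
profile = [
    {"C": 5.0, "S": 1.0},
    {"L": 3.0, "I": 2.0, "V": 2.0},
    {"V": 4.0, "I": 3.0},
]
hit = best_block_alignment("MKTCLVKDY", profile)
print(hit)
```

Repeating this for every block in the database and ranking the best scores gives the list of candidate families for the query sequence.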
![Page 44: Design and creation of multiple sequence alignments Unit 15 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD](https://reader030.vdocuments.us/reader030/viewer/2022032709/56649ead5503460f94bb4af0/html5/thumbnails/44.jpg)
The Blocks Searcher tool
Typically, a group of proteins has more than one region in common and their relationship is represented as a series of blocks separated by unaligned regions. If a second block for a group also scores highly in the search, the evidence that the sequence is related to the group is strengthened, and is further strengthened if a third block also scores highly, and so on.
![Page 45: Design and creation of multiple sequence alignments Unit 15 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD](https://reader030.vdocuments.us/reader030/viewer/2022032709/56649ead5503460f94bb4af0/html5/thumbnails/45.jpg)
The BLOCKS Database
The blocks for the BLOCKS database are made automatically by looking for the most highly conserved regions in groups of proteins represented in the PROSITE database. These blocks are then calibrated against the SWISS-PROT database to obtain a measure of the chance distribution of matches. It is these calibrated blocks that make up the BLOCKS database.
![Page 46: Design and creation of multiple sequence alignments Unit 15 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD](https://reader030.vdocuments.us/reader030/viewer/2022032709/56649ead5503460f94bb4af0/html5/thumbnails/46.jpg)
The Block Maker Tool
Block Maker finds conserved blocks in a group of two or more unaligned protein sequences, which are assumed to be related, using two different algorithms.
Input file must contain at least 2 sequences.
Input sequences must be in FastA format.
Results are returned by e-mail.
![Page 47: Design and creation of multiple sequence alignments Unit 15 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD](https://reader030.vdocuments.us/reader030/viewer/2022032709/56649ead5503460f94bb4af0/html5/thumbnails/47.jpg)
Progressive Approaches
CLUSTALW
Perform pairwise alignments
Construct a tree, joining most similar sequences first (guide tree)
Align sequences sequentially, using the phylogenetic tree
PILEUP
Similar to CLUSTALW
Uses UPGMA to produce tree (chapter 6)
![Page 48: Design and creation of multiple sequence alignments Unit 15 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD](https://reader030.vdocuments.us/reader030/viewer/2022032709/56649ead5503460f94bb4af0/html5/thumbnails/48.jpg)
Clustal method
Higgins and Sharp 1988
ref: CLUSTAL: a package for performing multiple sequence alignment on a microcomputer. Gene, 73, 237–244. [Medline]
Progressive alignment method
An approximation strategy (heuristic algorithm) yields a possible alignment, but not necessarily the best one
![Page 49: Design and creation of multiple sequence alignments Unit 15 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD](https://reader030.vdocuments.us/reader030/viewer/2022032709/56649ead5503460f94bb4af0/html5/thumbnails/49.jpg)
First step:
Compute the pairwise alignments for all against all (6 pairwise alignments); the similarities are stored in a table

|   | A  | B | C  | D |
|---|----|---|----|---|
| A |    |   |    |   |
| B | 11 |   |    |   |
| C | 3  | 1 |    |   |
| D | 2  | 2 | 10 |   |
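The all-against-all step can be sketched as follows. This is a minimal illustration, not the Clustal implementation: it assumes a plain Needleman-Wunsch score with a linear gap penalty, and the scoring values and the four toy sequences are hypothetical.

```python
from itertools import combinations

def nw_score(a, b, match=1, mismatch=-1, gap=-2):
    """Global (Needleman-Wunsch) alignment score with a linear gap penalty."""
    prev = [j * gap for j in range(len(b) + 1)]        # row for the empty prefix of a
    for i in range(1, len(a) + 1):
        curr = [i * gap]                                # column for the empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]

def similarity_table(seqs):
    """Score every unordered pair: N sequences give N*(N-1)/2 alignments."""
    return {(x, y): nw_score(seqs[x], seqs[y])
            for x, y in combinations(sorted(seqs), 2)}

# Hypothetical toy sequences, only to show the shape of the table.
seqs = {"A": "GATTACA", "B": "GATTACC", "C": "CCGGTT", "D": "CCGGTA"}
table = similarity_table(seqs)
print(len(table))  # 6 pairwise alignments for 4 sequences
```

Real programs use substitution matrices and affine gap penalties, and often convert the scores to distances before building the tree.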
![Page 50: Design and creation of multiple sequence alignments Unit 15 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD](https://reader030.vdocuments.us/reader030/viewer/2022032709/56649ead5503460f94bb4af0/html5/thumbnails/50.jpg)
Second step:
Cluster the sequences to create a tree (guide tree):
• Represents the order in which pairs of sequences are to be aligned
• Highly similar sequences are neighbors in the tree
• Highly distant sequences are distant from each other in the tree
[Figure: the similarity table from the first step next to the resulting guide tree, leaves labeled A, D, C, B]
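A guide tree can be sketched from the similarity table on the slide (A-B 11, A-C 3, B-C 1, A-D 2, B-D 2, C-D 10). The sketch below assumes a simple average-linkage clustering applied directly to similarities, in the spirit of UPGMA; it is not the exact tree-building method of any Clustal version, and the function name is hypothetical.

```python
from itertools import combinations

def guide_tree(names, sim):
    """Greedy average-linkage clustering; returns a nested tuple (inner pairs are aligned first)."""
    clusters = [(n, [n]) for n in names]               # (subtree, leaf members)

    def avg(m1, m2):
        # Mean similarity between all cross-cluster leaf pairs.
        return sum(sim[frozenset((x, y))] for x in m1 for y in m2) / (len(m1) * len(m2))

    while len(clusters) > 1:
        # Pick the two clusters with the highest average similarity.
        i, j = max(combinations(range(len(clusters)), 2),
                   key=lambda p: avg(clusters[p[0]][1], clusters[p[1]][1]))
        (t1, m1), (t2, m2) = clusters[i], clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(((t1, t2), m1 + m2))
    return clusters[0][0]

# Similarities read from the table on the slide.
scores = {("A", "B"): 11, ("A", "C"): 3, ("B", "C"): 1,
          ("A", "D"): 2, ("B", "D"): 2, ("C", "D"): 10}
sim = {frozenset(pair): s for pair, s in scores.items()}
tree = guide_tree(["A", "B", "C", "D"], sim)
print(tree)  # (('A', 'B'), ('C', 'D'))
```

With these numbers A and B are joined first, then C and D, and the two pairs are joined last.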
![Page 51: Design and creation of multiple sequence alignments Unit 15 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD](https://reader030.vdocuments.us/reader030/viewer/2022032709/56649ead5503460f94bb4af0/html5/thumbnails/51.jpg)
Third step:
Align most similar pairs
Align the alignments as if each of them was a single sequence (with the use of a consensus sequence or a profile)
[Figure: guide tree, leaves labeled A, D, C, B]
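To treat an alignment "as if it was a single sequence", it is first summarized column by column. The sketch below shows two such summaries, a frequency profile and a consensus sequence. The function names and the tie handling are assumptions for illustration only.

```python
from collections import Counter

def profile(alignment):
    """Column-wise residue frequencies for equal-length aligned rows (gaps counted as a symbol)."""
    n = len(alignment)
    return [{res: cnt / n for res, cnt in Counter(col).items()}
            for col in zip(*alignment)]

def consensus(alignment, gap="-"):
    """Most frequent non-gap residue per column; a gap if the column holds only gaps."""
    out = []
    for col in zip(*alignment):
        counts = Counter(c for c in col if c != gap)
        out.append(counts.most_common(1)[0][0] if counts else gap)
    return "".join(out)

rows = ["GATTACA", "GA-TACA", "GCTTACA"]   # hypothetical aligned rows
print(consensus(rows))                      # GATTACA
print(profile(["AC", "AT"])[1])             # {'C': 0.5, 'T': 0.5}
```

A profile keeps more information than a consensus, because a column with mixed residues is scored against all of them rather than only the most common one.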
![Page 52: Design and creation of multiple sequence alignments Unit 15 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD](https://reader030.vdocuments.us/reader030/viewer/2022032709/56649ead5503460f94bb4af0/html5/thumbnails/52.jpg)
Clustal programs
ClustalV
ClustalW
Thompson et al., 1994
Uses: sequence weighting, position-specific gap penalties and weight matrix choice
W stands for weights (sequence weighting)
ClustalX - windowed (graphical) implementation
![Page 53: Design and creation of multiple sequence alignments Unit 15 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD](https://reader030.vdocuments.us/reader030/viewer/2022032709/56649ead5503460f94bb4af0/html5/thumbnails/53.jpg)
ClustalW method rules (1)
Sequence weighting
Each sequence is weighted according to how different it is from the other sequences.
This corrects for the case where one specific subfamily is overrepresented in the data.
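The idea of down-weighting near-duplicate sequences can be sketched as below. This is an assumed simplification: each weight is one minus the mean identity to the other rows, normalized to sum to 1. ClustalW itself derives its weights from the guide tree, which this sketch does not attempt to reproduce.

```python
def identity(a, b):
    """Fraction of identical positions between two equal-length aligned rows."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def sequence_weights(rows):
    """Higher weight for rows that differ more from the rest (needs at least 2 rows)."""
    raw = []
    for i, r in enumerate(rows):
        others = [identity(r, s) for j, s in enumerate(rows) if j != i]
        raw.append(1.0 - sum(others) / len(others))
    total = sum(raw)
    if total == 0:                       # all rows identical: fall back to equal weights
        return [1.0 / len(rows)] * len(rows)
    return [w / total for w in raw]

# Three near-identical rows (an overrepresented subfamily) and one divergent row.
rows = ["AAAA", "AAAA", "AAAT", "CCCC"]
weights = sequence_weights(rows)
print(weights.index(max(weights)))       # 3, the divergent row gets the largest weight
```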
![Page 54: Design and creation of multiple sequence alignments Unit 15 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD](https://reader030.vdocuments.us/reader030/viewer/2022032709/56649ead5503460f94bb4af0/html5/thumbnails/54.jpg)
ClustalW method rules (2)
Weight matrix choice
The substitution matrix used for each alignment step depends on the similarity of the sequences.
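A sketch of matrix choice by similarity: closely related sequences get a matrix built for close relatives, distant ones a matrix built for distant relatives. The function name and the percent-identity cut-offs below are illustrative assumptions, not values taken from the slides.

```python
def choose_matrix(percent_identity):
    """Pick a BLOSUM matrix name from percent identity (cut-offs are assumed for illustration)."""
    if not 0 <= percent_identity <= 100:
        raise ValueError("percent identity must be between 0 and 100")
    if percent_identity >= 80:
        return "BLOSUM80"
    if percent_identity >= 60:
        return "BLOSUM62"
    if percent_identity >= 30:
        return "BLOSUM45"
    return "BLOSUM30"

print(choose_matrix(85))  # BLOSUM80
print(choose_matrix(25))  # BLOSUM30
```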
![Page 55: Design and creation of multiple sequence alignments Unit 15 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD](https://reader030.vdocuments.us/reader030/viewer/2022032709/56649ead5503460f94bb4af0/html5/thumbnails/55.jpg)
ClustalW method rules (3)
Position-specific gap penalties
Gaps found in initial alignments remain fixed through the process (end gaps)
Hydrophobic residues have higher gap penalties than hydrophilic ones: they are more likely to be in the hydrophobic core, where gaps should not occur.
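Position-specific penalties can be sketched as one gap-opening penalty per alignment column. Everything numeric here is an assumption for illustration: the hydrophobic residue set, the base penalty, the boost for hydrophobic columns, and the discount for columns that already contain a gap. The actual ClustalW rules are more detailed.

```python
HYDROPHOBIC = set("AVLIMFWC")   # assumed hydrophobic set, for illustration only

def gap_open_penalties(rows, base=10.0, boost=0.5, gap_discount=0.3):
    """One gap-opening penalty per column of an existing alignment."""
    penalties = []
    for col in zip(*rows):
        residues = [c for c in col if c != "-"]
        frac = sum(c in HYDROPHOBIC for c in residues) / len(residues) if residues else 0.0
        penalty = base * (1 + boost * frac)     # hydrophobic columns cost more to open
        if len(residues) < len(col):
            penalty *= gap_discount             # a gap already exists here: cheaper to reuse
        penalties.append(penalty)
    return penalties

print(gap_open_penalties(["LLKD", "LV-E"]))
```

In this toy case the two hydrophobic columns get the highest penalty and the column that already holds a gap gets the lowest.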
![Page 56: Design and creation of multiple sequence alignments Unit 15 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD](https://reader030.vdocuments.us/reader030/viewer/2022032709/56649ead5503460f94bb4af0/html5/thumbnails/56.jpg)
ClustalW method shortcomings
(1) Sequences that are similar only in sub-regions: ClustalW forces a global alignment, not a local one.
(2) A sequence that contains a large insertion/deletion compared to the rest will strongly affect the alignment (again, global not local).
![Page 57: Design and creation of multiple sequence alignments Unit 15 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD](https://reader030.vdocuments.us/reader030/viewer/2022032709/56649ead5503460f94bb4af0/html5/thumbnails/57.jpg)
ClustalW method shortcomings
(3) A sequence that contains a repetitive element (such as a domain), whereas all other sequences only contain one copy.
![Page 58: Design and creation of multiple sequence alignments Unit 15 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD](https://reader030.vdocuments.us/reader030/viewer/2022032709/56649ead5503460f94bb4af0/html5/thumbnails/58.jpg)
Comments
Pairwise alignment by dynamic programming is optimal for a given scoring scheme
Progressive multiple alignment is not optimal, only a heuristic. Better alignments may exist!
The algorithm yields a possible alignment, but not necessarily the best one.
![Page 59: Design and creation of multiple sequence alignments Unit 15 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD](https://reader030.vdocuments.us/reader030/viewer/2022032709/56649ead5503460f94bb4af0/html5/thumbnails/59.jpg)
ClustalW web server
Global multiple sequence alignment program for DNA or proteins
Available from a number of sites
EMBL-EBI
![Page 60: Design and creation of multiple sequence alignments Unit 15 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD](https://reader030.vdocuments.us/reader030/viewer/2022032709/56649ead5503460f94bb4af0/html5/thumbnails/60.jpg)
Results
![Page 61: Design and creation of multiple sequence alignments Unit 15 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD](https://reader030.vdocuments.us/reader030/viewer/2022032709/56649ead5503460f94bb4af0/html5/thumbnails/61.jpg)
Results
![Page 62: Design and creation of multiple sequence alignments Unit 15 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD](https://reader030.vdocuments.us/reader030/viewer/2022032709/56649ead5503460f94bb4af0/html5/thumbnails/62.jpg)
Alignment with colors
identity, similarity
![Page 63: Design and creation of multiple sequence alignments Unit 15 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD](https://reader030.vdocuments.us/reader030/viewer/2022032709/56649ead5503460f94bb4af0/html5/thumbnails/63.jpg)
CLUSTAL format
CLUSTAL W (1.82) multiple sequence alignment

YPK1            SQLSWKRLLMKGYIPPYKPAVSD-Q--NSMDTSNFDEEFTR--SEKPIDSVVDEYLSESV
YPK2            KDISWKKLLLKGYIPPYKPIVKDTQ--SEIDTANFDQEFTK---EKPIDSVVDEYLSASI
KPCA_HUMAN      RRIDWEKLENREIQPPFKPKVC------GKGAENFDKFFTR---GQPVLTPPDQLVIANI
KPCZ_HUMAN      RSIDWDLLEKKQALPPFQPQIT---M-DDYGLDNFDTQFTS---EPVQLTPDDEDAIKRI
KAPA            KEVVWEKLLSRNIETPYEPPIQ----QGQGDTSQFDKYPE----EDINYGVQGEDPYADL
KAPC            NEVIWEKLLARYIETPYEPPIQ----QGQGDTSQFDRYPE-EVDEEFNYGIQGEDPYMDL
KAPB            SEVVWERLLAKDIETPYEPPIT----SGIGDTSLFDQYPE-DV-EQLDYGIQGDDPYAEY
KS6_HUMAN       RHINWEELLARKVEPPFKPLLQ-----SEEDVSQFDSKFTR-V-QTPVDSP-DDSTLSES
                * *.

YPK1            -----MQKQF
YPK2            ----N-QKQF
KPCA_HUMAN      D--O--QSDF
KPCZ_HUMAN      D-----QSEF
KAPA            -D----FRDF
KAPC            -D----MKEF
KAPB            --P---FQDF
KS6_HUMAN       A-----NQVF
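Reading a CLUSTAL-format file back into sequences can be sketched as below. The parser assumes the usual layout: a header line starting with "CLUSTAL", blocks of "name sequence" lines, and conservation lines that start with whitespace. The function name and the short example text are hypothetical.

```python
def parse_clustal(text):
    """Return {name: full aligned sequence} from CLUSTAL-format text."""
    lines = text.splitlines()
    if not lines or not lines[0].startswith("CLUSTAL"):
        raise ValueError("missing CLUSTAL header line")
    seqs = {}
    for line in lines[1:]:
        # Blank separators and conservation lines (leading whitespace) carry no sequence.
        if not line.strip() or line[0].isspace():
            continue
        parts = line.split()
        if len(parts) < 2:
            continue
        name, chunk = parts[0], parts[1]
        seqs[name] = seqs.get(name, "") + chunk   # blocks of the same name are concatenated
    return seqs

example = (
    "CLUSTAL W (1.82) multiple sequence alignment\n"
    "\n"
    "YPK1   SQLSW\n"
    "YPK2   KDISW\n"
    "           *\n"
    "\n"
    "YPK1   MQKQF\n"
    "YPK2   -QKQF\n"
)
parsed = parse_clustal(example)
print(parsed["YPK2"])  # KDISW-QKQF
```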
![Page 64: Design and creation of multiple sequence alignments Unit 15 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD](https://reader030.vdocuments.us/reader030/viewer/2022032709/56649ead5503460f94bb4af0/html5/thumbnails/64.jpg)
ClustalW at EMBL - Jalview
conservation
Jalview is a multiple alignment editor
![Page 65: Design and creation of multiple sequence alignments Unit 15 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD](https://reader030.vdocuments.us/reader030/viewer/2022032709/56649ead5503460f94bb4af0/html5/thumbnails/65.jpg)
Jalview
Color menu:
Taylor colors (each amino acid is colored differently)
Zappo colors (amino acids are colored according to their physico-chemical properties)
Hydrophobicity colors (colors amino acids according to a certain score scale that represents hydrophobicity)
Coloring residues above a percentage identity threshold
User defined color schemes
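Coloring above a percentage identity threshold can be sketched as a per-residue mask. This is an assumed rule for illustration (a residue is marked when it matches the most common residue of its column and that residue reaches the threshold); Jalview's exact rule may differ.

```python
from collections import Counter

def identity_mask(rows, threshold=80.0, gap="-"):
    """mask[i][j] is True when row i, column j should be colored."""
    masks = [[False] * len(r) for r in rows]
    for j, col in enumerate(zip(*rows)):
        counts = Counter(c for c in col if c != gap)
        if not counts:
            continue                               # column of gaps only
        residue, count = counts.most_common(1)[0]
        if 100.0 * count / len(col) >= threshold:  # gaps count against the percentage
            for i, c in enumerate(col):
                masks[i][j] = (c == residue)
    return masks

rows = ["ACD", "ACE", "AGE", "A-E"]                # hypothetical aligned rows
mask = identity_mask(rows, threshold=75.0)
print(mask[0])  # [True, False, False]
```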
![Page 66: Design and creation of multiple sequence alignments Unit 15 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD](https://reader030.vdocuments.us/reader030/viewer/2022032709/56649ead5503460f94bb4af0/html5/thumbnails/66.jpg)
Example - Zappo colors
physico-chemical properties color-code:
![Page 67: Design and creation of multiple sequence alignments Unit 15 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD](https://reader030.vdocuments.us/reader030/viewer/2022032709/56649ead5503460f94bb4af0/html5/thumbnails/67.jpg)
Guide Tree
![Page 68: Design and creation of multiple sequence alignments Unit 15 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD](https://reader030.vdocuments.us/reader030/viewer/2022032709/56649ead5503460f94bb4af0/html5/thumbnails/68.jpg)
ClustalX
ClustalX provides a window-based user interface to the ClustalW program.
It uses the Vibrant multi-platform user interface development library, developed by the NCBI as part of their NCBI SOFTWARE DEVELOPMENT TOOLKIT.
![Page 69: Design and creation of multiple sequence alignments Unit 15 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD](https://reader030.vdocuments.us/reader030/viewer/2022032709/56649ead5503460f94bb4af0/html5/thumbnails/69.jpg)
T-coffee
Another MSA program
Protein & nucleotide MSA program
Uses principles similar to ClustalW
More accurate but longer running times
Limits the number of sequences it can align (~100)
T-coffee at EMBnet
![Page 70: Design and creation of multiple sequence alignments Unit 15 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD](https://reader030.vdocuments.us/reader030/viewer/2022032709/56649ead5503460f94bb4af0/html5/thumbnails/70.jpg)
![Page 71: Design and creation of multiple sequence alignments Unit 15 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD](https://reader030.vdocuments.us/reader030/viewer/2022032709/56649ead5503460f94bb4af0/html5/thumbnails/71.jpg)
T-coffee results
![Page 72: Design and creation of multiple sequence alignments Unit 15 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD](https://reader030.vdocuments.us/reader030/viewer/2022032709/56649ead5503460f94bb4af0/html5/thumbnails/72.jpg)
Phylip format
5 99
Cabd_199509   PQITLWQRPLVTIKIGGQLKEALLDTGADDTVLEEMNLPGKWKPKIIGGI
JCSB1_199401  PQITLWQRPIVTIKIGGQLKEALLDTGAD---LEEM-NPGRWKPKIIGGI
JCSB2_199401  PQITLWQRPVT-IK-GG-QLEALLDTGADDTL-EEI-LPGRW-PKMIGGI
JCSB4_199401  PQITLWQRPVT--K-GG-LKEALLDTGADDTE-----DPGRWKPKMIGGI
JCSB5_199401  PQITLWQRPIVTIKVGGQLKEALLDTGADDTVL-EMNLPGRWKPKMIGGI

GGFIKVRQYDQVPIEICGHKAIGTVLVGPTPSNIIGRNLLTQLGCTLNF
GGFVKVRQYDQIPIDICGHKVIGTVL-GPTPANVIGRNLLTQIGCTLNF
GGFVKVR-YDQVPIEICGH--IGTVLVGPTPANIIGRNLMTQLGCTLNF
GGFLKVRQYDQIPVEICGHKAIGTVL-GPTPANIIGRNLLTQIG-TLNF
GGFVKVRQYDQIPIEICGHKAIGTVLVGPTPANIVGRNLLTQIGCTLNF
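An interleaved PHYLIP file like the one above can be read with a few lines of code. The sketch assumes the relaxed layout shown on the slide: a header with the number of sequences and columns, a first block of "name sequence" lines, then continuation blocks without names in the same order. The function name and the toy input are hypothetical.

```python
def parse_phylip_interleaved(text):
    """Return {name: full aligned sequence} from relaxed interleaved PHYLIP text."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    n_seq, n_col = map(int, lines[0].split())
    names, seqs = [], []
    for i, line in enumerate(lines[1:]):
        if i < n_seq:
            # First block: the name, then whitespace, then the sequence.
            name, chunk = line.split(None, 1)
            names.append(name)
            seqs.append("".join(chunk.split()))
        else:
            # Later blocks: sequence only, in the same order as the first block.
            seqs[i % n_seq] += "".join(line.split())
    if len(seqs) != n_seq or any(len(s) != n_col for s in seqs):
        raise ValueError("sequence count or length does not match the header")
    return dict(zip(names, seqs))

toy = "2 8\nseq_one ACGT\nseq_two AC-T\n\nTTGA\nTTGC\n"
records = parse_phylip_interleaved(toy)
print(records["seq_two"])  # AC-TTTGC
```

For the slide's example the header "5 99" means 5 sequences of 99 columns, split into blocks of 50 and 49.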
![Page 73: Design and creation of multiple sequence alignments Unit 15 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD](https://reader030.vdocuments.us/reader030/viewer/2022032709/56649ead5503460f94bb4af0/html5/thumbnails/73.jpg)
The Biology WorkBench
http://workbench.sdsc.edu/
http://www.ngbw.org/
Nucleic Acid Sequence Tools, including BLAST, CLUSTALW, MFOLD, PRIMER3
![Page 74: Design and creation of multiple sequence alignments Unit 15 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD](https://reader030.vdocuments.us/reader030/viewer/2022032709/56649ead5503460f94bb4af0/html5/thumbnails/74.jpg)
Muscle
Protein & nucleotide MSA program
Improvements in both accuracy and speed
exploiting a range of existing and new algorithmic techniques
combination of progressive and iterative alignment strategies
details of the method
web server
downloads: Windows, Linux, Mac
![Page 75: Design and creation of multiple sequence alignments Unit 15 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD](https://reader030.vdocuments.us/reader030/viewer/2022032709/56649ead5503460f94bb4af0/html5/thumbnails/75.jpg)
Muscle web server
![Page 76: Design and creation of multiple sequence alignments Unit 15 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD](https://reader030.vdocuments.us/reader030/viewer/2022032709/56649ead5503460f94bb4af0/html5/thumbnails/76.jpg)
Editing MSA
There are a variety of tools that can be used to modify a multiple alignment (SeaView, BioEdit, JalView)
These programs can be very useful in formatting and annotating an alignment for publication.
An editor can also be used to make modifications by hand to improve biologically significant regions in a multiple alignment created by one of the automated alignment programs.
![Page 77: Design and creation of multiple sequence alignments Unit 15 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD](https://reader030.vdocuments.us/reader030/viewer/2022032709/56649ead5503460f94bb4af0/html5/thumbnails/77.jpg)
MSA approaches
Progressive approach
CLUSTALW (CLUSTALX), PileUp, T-COFFEE, MAFFT, MUSCLE
Iterative approach: Repeatedly realign subsets of sequences.
MultAlin, DIALIGN, MAFFT, MUSCLE, ProbCons
Genetic algorithm
SAGA
Graph algorithm
POA
![Page 78: Design and creation of multiple sequence alignments Unit 15 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD](https://reader030.vdocuments.us/reader030/viewer/2022032709/56649ead5503460f94bb4af0/html5/thumbnails/78.jpg)
Conclusion
There is no single method that always generates the best alignment
It may thus be wise to use more than one method
Alignment editors can be used to correct the alignments
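When more than one method is used, the results can be compared by asking which residue pairs both alignments place in the same column. The sketch below is one assumed way to do that: shared aligned pairs divided by all aligned pairs. The function names are hypothetical, and both alignments must list the same sequences in the same order.

```python
from itertools import combinations

def aligned_pairs(rows, gap="-"):
    """Set of ((seq i, residue index), (seq j, residue index)) pairs sharing a column."""
    pairs = set()
    pos = [0] * len(rows)                  # next residue index for each sequence
    for col in zip(*rows):
        present = []
        for i, c in enumerate(col):
            if c != gap:
                present.append((i, pos[i]))
                pos[i] += 1
        pairs.update(combinations(present, 2))
    return pairs

def agreement(aln1, aln2):
    """Shared aligned pairs divided by all aligned pairs (1.0 means identical pairing)."""
    p1, p2 = aligned_pairs(aln1), aligned_pairs(aln2)
    union = p1 | p2
    return len(p1 & p2) / len(union) if union else 1.0

method_1 = ["AC-GT", "ACTGT"]              # hypothetical output of one program
method_2 = ["ACG-T", "ACTGT"]              # hypothetical output of another
print(agreement(method_1, method_2))       # 0.6
```

Regions where the methods disagree are good candidates for inspection in an alignment editor.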