advanced tools and algorithms in bioinformatics chittibabu guda summer, 2004

Advanced Tools and Algorithms

in Bioinformatics

Chittibabu Guda

Summer, 2004

UCSD Extension, Department of Biosciences

Clustering Tools

• Clustering is the grouping of related sequences based on set thresholds such as length, % identity, composition, etc.

• % identity is the most commonly used criterion to remove redundant sequences in the databases

• Clustering helps improve the speed of database searches by orders of magnitude with minimal loss of content

• The general principle in clustering is pair-wise alignment of sequences in all-to-all combination

• Most commonly used tools are

• blastclust

• cd-hit

BLASTCLUST http://www.csc.fi/molbio/progs/blast/blastclust.html

• BLAST score-based single-linkage clustering

• All sequences in the database are compared pair-wise in all-to-all combinations, based on the BLAST score

• For each pair, the top scoring alignment is evaluated based on two factors

• Length coverage- L’/L (for one or both sequences)

• Score density – I/AL

• where, L’ is length of sequence in the alignment, L is total length of the sequence, I is the number of identical residues and AL is the total alignment length (L’+gaps)

• If both these factors score above the set thresholds, the two sequences are considered as neighbors

• The default e-value is 1e-6
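The two neighbor criteria and the single-linkage step can be sketched as follows. This is a minimal illustration, not the BLASTCLUST implementation: the function names, the 0.9 thresholds and the "at or above" comparison are assumptions chosen for the example.

```python
def is_neighbor(aligned_len, seq_len, identical, alignment_len,
                min_coverage=0.9, min_density=0.9):
    """Return True if one alignment passes both thresholds.

    aligned_len   -- L', residues of the sequence inside the alignment
    seq_len       -- L, total length of the sequence
    identical     -- I, number of identical residues
    alignment_len -- AL, alignment length including gaps
    The 0.9 defaults are illustrative values only.
    """
    coverage = aligned_len / seq_len        # length coverage, L'/L
    density = identical / alignment_len     # score density, I/AL
    return coverage >= min_coverage and density >= min_density


def single_linkage(names, neighbor_pairs):
    """Put any two neighbors in the same cluster (union-find)."""
    parent = {name: name for name in names}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for a, b in neighbor_pairs:
        parent[find(a)] = find(b)           # merge the two clusters

    clusters = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in clusters.values())
```

With single linkage, a and c end up in one cluster when a-b and b-c are neighbors, even if a and c were never found to be neighbors themselves.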

• CD-HIT is 20-30 times faster than BLASTCLUST because it avoids all-to-all pair-wise alignment

• Short word filters are applied to reduce the number of pair-wise alignments

• First index tables are built for short words of 2-5 residues, in all possible combinations

• (ABC-), a 4-letter alphabet can make a maximum of 16 two-letter pairs

• AB, AC, A-, BA, CA, -A, BC, B-, CB, -B, C-, -C, AA, BB, CC, --

• So, for (20+1) amino acids, the index table size would be 21^n, where n is the word size (if n=5, the total number of words would be ~4 million)

• Program compares the type and number of identical peptides between the representative and the new sequence

• Only those pairs that meet the minimum criterion will be further aligned to confirm the identity

• Very fast algorithm for clustering larger databases like NR
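The index table size and the short word filter can be illustrated with a small sketch. It is not the CD-HIT code: the function names and the `min_shared` cutoff are hypothetical, and a real run would choose the cutoff according to the identity threshold in use.

```python
from collections import Counter


def index_table_size(alphabet_size, word_size):
    """Number of distinct words of the given size over the alphabet."""
    return alphabet_size ** word_size


def word_counts(seq, n):
    """Count every overlapping word of length n in a sequence."""
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))


def shared_words(representative, new_seq, n=2):
    """Number of identical words, counting repeats at most as often
    as they occur in both sequences."""
    common = word_counts(representative, n) & word_counts(new_seq, n)
    return sum(common.values())


def passes_filter(representative, new_seq, n=2, min_shared=3):
    """Only pairs passing this cheap test would go on to be aligned.
    min_shared is an illustrative cutoff."""
    return shared_words(representative, new_seq, n) >= min_shared


print(index_table_size(4, 2))    # 16 two-letter words for (ABC-)
print(index_table_size(21, 5))   # 4084101, i.e. ~4 million
```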

CD-HIT (http://bioinformatics.ljcrf.edu/cd-hi/)

Phylogenetic Analysis

Terminology

• Homologous: Similar because of shared evolutionary ancestry

• Paralogous : Similar sequences in the same species, originated by gene duplication

• Orthologous: Similar sequences in different species by divergent evolution

• Xenologous: Genes acquired by horizontal gene transfer

• Analogous: Similarity by convergent evolution

Methods of building phylogenetic trees

• Based on the data processing

    • Discrete methods

        • Maximum-parsimony method

        • Maximum-Likelihood method

    • Distance-based methods

• Based on the tree-building algorithm

    • Clustering methods

        • UPGMA

        • Neighbor-joining

    • Optimality criterion

Distance-based versus discrete methods

• Distance methods first convert aligned sequences into a pair-wise distance matrix and then input the matrix into a tree building method

• Discrete methods are based on characters i.e., consider each nucleotide or amino acid directly

• In distance methods, the biological information is lost once the distance matrix is built, whereas discrete methods preserve additional information such as which site contributes to the length of each branch

• Distance based methods are faster and easier to implement than discrete methods

Clustering versus optimality criteria-based methods

• Clustering methods follow a fixed set of steps and arrive at a single tree, whereas optimality methods build the set of all possible trees and select the best one based on its score

• Clustering methods do not allow us to evaluate competing hypotheses

• Clustering methods are faster, easy to implement and produce an unambiguous output while the other methods are computationally very expensive

• Optimality methods often result in good quality trees since they could be interactively corrected

• The Eck and Dayhoff method counts the number of amino acid substitutions in a phylogeny, but it treats substitutions of high and low probability (according to the genetic code) equally

• Ex: AAA (K) → CGC (R) vs AAC (N) → AGC (S)

• The Fitch method counts the minimum number of nucleotide changes required to achieve the observed variation, but it treats synonymous and non-synonymous changes equally

• Ex: UUU (F) → CUU (L) → CUA (L) → CAA (Q)

• The maximum parsimony method takes a moderate approach between the above two methods: all amino acid changes must be consistent with the genetic code, and synonymous changes are counted fewer times than non-synonymous changes

• In the above example, the number of changes from F → Q is counted as two, not three
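The F to Q example can be checked with a short sketch. The codon table holds only the four codons from the example, and giving the synonymous step zero weight is an assumption made here to reproduce the count of two; the function names are illustrative.

```python
# Only the codons used in the example (RNA alphabet, as written above).
CODON_TO_AA = {"UUU": "F", "CUU": "L", "CUA": "L", "CAA": "Q"}


def nucleotide_changes(codon_a, codon_b):
    """Number of positions at which two codons differ."""
    return sum(x != y for x, y in zip(codon_a, codon_b))


def count_changes(path):
    """Return (all nucleotide changes, non-synonymous changes only)
    along a path of codons."""
    total = non_synonymous = 0
    for a, b in zip(path, path[1:]):
        step = nucleotide_changes(a, b)
        total += step
        if CODON_TO_AA[a] != CODON_TO_AA[b]:
            non_synonymous += step   # amino acid changed at this step
    return total, non_synonymous


print(count_changes(["UUU", "CUU", "CUA", "CAA"]))   # (3, 2)
```

Counting every nucleotide change gives three, while skipping the synonymous CUU to CUA step gives two.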

Parsimony Methods :Background

Maximum Parsimony Method

• Also called minimum evolution method

• Predicts the tree(s) that minimize the number of steps required to generate the observed variation in the sequences

• For each aligned column in the multiple alignment, phylogenetic trees that require smallest number of evolutionary changes to produce the observed variation are identified

• Finally, those trees that produce the smallest number of changes overall for all sequence positions are identified

• Very time consuming, not good for large number of sequences or sequences with a large amount of variation

• For DNA: DNAPARS

• For proteins: PROTPARS
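Counting the changes a given tree needs for one alignment column can be sketched with Fitch's set-intersection procedure, a standard textbook algorithm. This is an illustration only and is not how DNAPARS or PROTPARS are implemented; the nested-tuple tree format and the function names are assumptions.

```python
def fitch(tree, states):
    """tree: a leaf name, or a 2-tuple of subtrees.
    states: dict mapping leaf name -> residue in one alignment column.
    Returns (possible ancestral residues, minimum number of changes)."""
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_cost = fitch(tree[0], states)
    right_set, right_cost = fitch(tree[1], states)
    common = left_set & right_set
    if common:
        # The subtrees agree on at least one residue: no extra change.
        return common, left_cost + right_cost
    # No agreement: one change is needed on one of the two branches.
    return left_set | right_set, left_cost + right_cost + 1


def tree_length(tree, columns):
    """Total number of changes over all aligned columns."""
    return sum(fitch(tree, column)[1] for column in columns)


column = {"A": "G", "B": "G", "C": "T", "D": "T"}
print(fitch((("A", "B"), ("C", "D")), column)[1])   # 1 change
print(fitch((("A", "C"), ("B", "D")), column)[1])   # 2 changes
```

The first tree explains the column with fewer changes, so it would be preferred for this position; summing over all columns gives the overall score used to rank trees.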

Protpars Example

Distance-based Method

• Distance between pairs of sequences is calculated based on

• Dayhoff’s PAM matrix values

• Fraction of non-identical amino acids between the two sequences

• Depending on whether the conversion of amino acids is within the group or to a different group

• An (n x n) distance matrix is calculated for all pair-wise combinations; the matrix is symmetric, so the two halves on either side of the diagonal are identical

• Distance matrix is used as input in different algorithms to calculate an optimal evolutionary tree
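As an illustration of the simplest of the distance measures listed above (the fraction of non-identical amino acids), the sketch below builds a symmetric distance matrix from aligned sequences. Skipping columns that contain a gap is an assumption of this example, and the function names are illustrative; Protdist itself offers more elaborate models.

```python
def p_distance(seq_a, seq_b):
    """Fraction of aligned positions at which the residues differ.
    Columns with a gap in either sequence are skipped (an assumption).
    Raises ZeroDivisionError if no comparable column remains."""
    pairs = [(x, y) for x, y in zip(seq_a, seq_b) if x != "-" and y != "-"]
    return sum(x != y for x, y in pairs) / len(pairs)


def distance_matrix(aligned_seqs):
    """Symmetric n x n matrix with zeros on the diagonal."""
    n = len(aligned_seqs)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = p_distance(aligned_seqs[i], aligned_seqs[j])
            matrix[i][j] = matrix[j][i] = d   # fill both halves
    return matrix


for row in distance_matrix(["ACDE", "ACDF", "A-DF"]):
    print(row)
```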

Distance Matrix generated by Protdist

HUMAN MOUSE DROME SOLTU WHEAT ARATH NEUCR YEAST

• The key is how best the pair-wise distances are made additive on a predicted evolutionary tree

• Using the distance matrix, several phylogenetic trees are built and evaluated based on the following criteria

• Goodness of fit methods seek the metric tree that best accounts for the observed pair-wise distances

• Minimum evolution method: Seeks the tree whose sum of branch lengths is the minimum (minimum evolution)

• Methods used

• FITCH: Based on Fitch-Margoliash method

• NEIGHBOR: Based on neighbor-joining or UPGMA methods

Distance method continued …

Tree building using Fitch-Margoliash method (1967)

Da = ( DAB + DAC - DBC ) / 2

Db = ( DAB + DBC - DAC ) / 2

Dc = ( DAC + DBC - DAB ) / 2

Feng-Doolittle Method …..

            Human  Chimp  Gorilla  Orang
A Human       0     88     103      160
B Chimp              0     106      170
C Gorilla                    0      166
D Orang                               0

Join the first 3 sequences

Da = ( 88 + 103 - 106 ) / 2 = 42.5

Db = ( 88 + 106 - 103 ) / 2 = 45.5

Dc = ( 103 + 106 - 88 ) / 2 = 60.5

[Figure: tree joining A, B and C with branches Da, Db and Dc; labels shown: 42.5, 45.5, 51.5, 9.0]

Feng-Doolittle Method …..

              Hum/Chimp  Gorilla  Orang
A Hum/Chimp       0       104.5    165
B Gorilla                   0      166
C Orang                              0

Join the 4th sequence to current tree

Da = ( 104.5 + 165 - 166 ) / 2 = 51.75

Db = ( 104.5 + 166 - 165 ) / 2 = 52.75

Dc = ( 165 + 166 - 104.5 ) / 2 = 113.25

[Figure: tree of A, B, C and D; labels shown: 42.5, 45.5, 52.75, 9.25, 82.5, 30.75]
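The three-point formulas and the Human/Chimp/Gorilla/Orang numbers above can be checked with a few lines. The function name is illustrative; the averaging of the Human and Chimp rows follows the Hum/Chimp values given in the example.

```python
def three_point(d_ab, d_ac, d_bc):
    """Branch lengths from A, B and C to their common node."""
    da = (d_ab + d_ac - d_bc) / 2
    db = (d_ab + d_bc - d_ac) / 2
    dc = (d_ac + d_bc - d_ab) / 2
    return da, db, dc


# Step 1: A = Human, B = Chimp, C = Gorilla
print(three_point(88, 103, 106))            # (42.5, 45.5, 60.5)

# Step 2: Human and Chimp are merged; their distances to the
# other sequences are averaged.
hc_gorilla = (103 + 106) / 2                # 104.5
hc_orang = (160 + 170) / 2                  # 165.0
print(three_point(hc_gorilla, hc_orang, 166))   # (51.75, 52.75, 113.25)
```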

Maximum-Likelihood Methods

• These are discrete methods similar to maximum parsimony (MP) methods; however, probability calculations are used to find the tree that best accounts for the variation in a set of sequences

• Analysis is performed on all columns in the multiple alignment and all possible trees are considered

• Compared to MP methods, more divergent sequences can be analyzed

• However, the main disadvantage is that these methods are computationally intensive

Genome-scale Data Analysis

[Flowchart: Sequenced Genome, Ensembl/translation, Complete Proteome, Interpro/Pfam, Known function, PDB search, Known structure, Unknown function & structure, with Yes/No decision branches]

Finding right tools for right tasks

• Finding paralogues by clustering (BLASTCLUST, CD-HIT)

• Finding homologues and orthologues (BLAST)

• Finding remote homologues (PSI-BLAST)

• Finding functional annotation (PFAM, INTERPRO)

• Finding structural annotation (Blast PDB)

• Finding low complex regions (SEG, CAST)

• Finding transmembrane regions (TMHMM)

• Finding disordered regions (COILS, PONDR)

• Finding secondary structure (JPRED, TOPpred)

• Web-based tools vs. Standalone tools

• Download

• NCBI : ftp://ftp.ncbi.nih.gov

• EBI: ftp://ftp.ebi.ac.uk

• PDB: ftp://ftp.rcsb.org

• PFAM: ftp://ftp.genetics.wustl.edu

• Local installation and configuration

Accessing Tools and Data

Structure-based Algorithms

Protein Data Bank (PDB) http://www.rcsb.org

• About 26000 structures including X-Ray, NMR and models

• Structures include 23597 proteins, 1108 protein/nucleic acid complexes, 1336 nucleic acids and 18 carbohydrates

• Sequence numbering

• PDB/Atomic numbering

• PDB ID/chain ID

Growth of PDB entries

Growth of new folds in PDB

• Midwest Center for Structural Genomics

• Northeast Structural Genomics Consortium

• New York Structural Genomics Research Consortium

• Southeast Collaboratory for Structural Genomics

• Structural Genomics Center

• Tuberculosis (TB) Structural Genomics Consortium

• Joint Center for Structural Genomics

• Center for Eukaryotic Structural Genomics

• Structural Genomics of Pathogenic Protozoa Consortium

NIGMS funded Structural Genomics Projects

Protein Structure Databases

• SCOP : Structural Classification of Proteins

• CATH : Class, Architecture, Topology & Homologous superfamily

• FSSP/DALI : Fold classification based on Structure-Structure alignment of Proteins

• HSSP: Homology-derived Secondary Structure of Proteins

• HOMSTRAD : Homologous Structure Alignment Database

• DSSP : Database of Secondary Structure Assignments

• DMAPS : Database of Multiple Alignment for Protein Structures

• Protein structures are determined by X-ray crystallography or NMR methods

• Structural alignment involves establishing equivalencies between residues in two or more proteins based on their 3D-coordinates

• 3-D coordinates of Cα atoms are most commonly used to calculate distances in structural alignments

Structure Alignments

• Dynamic programming (Taylor & Orengo, 1989)

• Combinatorial Extension (Shindyalov & Bourne, 1998)

• Monte Carlo method (Mirny & Shakhnovich, 1998; Guda et al., 2001)

• Environment profile method (Jung & Lee, 2000)

• Genetic Algorithms (May & Johnson, 1995)

Methods used for structure alignment

• CE method is based on determining Aligned Fragment Pairs (AFPs) with local similarities and joining AFPs to form a continuous path

• AFPs are based on the difference in the local geometry of structures being compared

• For ex., inter-residue distances are calculated between 8 residues in all possible combinations, except between the neighboring residues ((n-1)(n-2)/2). This is done for all candidate AFPs in each structure

• Difference(d) in the average distances is calculated and all candidate AFPs with d under some threshold are considered AFPs

• Consecutive AFPs are selected based on calculation of inter-residue distances between two AFP members in the same chain in 64 (8x8) combinations and selecting the ones with minimum average difference (d)
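The distance bookkeeping for one candidate AFP can be sketched as below. For a fragment of n = 8 residues, skipping neighboring residues leaves (n-1)(n-2)/2 = 21 distances. Using the mean absolute difference of corresponding distances as d is an assumption of this sketch, not necessarily the exact CE measure, and the function names are illustrative.

```python
import math


def intra_distances(coords):
    """Distances between all non-neighboring residues of one fragment.
    coords is a list of (x, y, z) tuples, one per residue."""
    n = len(coords)
    return [math.dist(coords[i], coords[j])
            for i in range(n) for j in range(i + 2, n)]


def afp_difference(frag_a, frag_b):
    """Mean absolute difference between corresponding intra-fragment
    distances of two equally long fragments (an assumed form of d)."""
    dist_a = intra_distances(frag_a)
    dist_b = intra_distances(frag_b)
    return sum(abs(x - y) for x, y in zip(dist_a, dist_b)) / len(dist_a)


straight = [(3.8 * i, 0.0, 0.0) for i in range(8)]
shifted = [(x + 10.0, y + 5.0, z) for x, y, z in straight]
print(len(intra_distances(straight)))       # 21
print(afp_difference(straight, shifted))    # 0.0: same local geometry
```

Because only internal distances are compared, the measure does not depend on where the two fragments sit in space, which is why the translated copy gives a difference of zero.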

Combinatorial Extension (CE) Method http://cl.sdsc.edu/ce.html









• The alignment path is constructed from AFPs selected from any position in the similarity matrix and consecutive AFPs are added in either direction such that,

• two consecutive AFPs are aligned without gaps OR

• two consecutive AFPs are aligned with gaps inserted in either of the proteins, but not in both

• The maximum allowable gap size is 30; as a consequence, similarities that require a gap larger than 30 are misrepresented by this algorithm

• A few of the best alignments are superimposed and the r.m.s.d. (root mean square deviation) is iteratively optimized using dynamic programming by adjusting gaps

• Finally, the alignment with the lowest RMSD value is selected
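The rules for adding a consecutive AFP to the path can be written as a small predicate. This is a sketch under stated assumptions: AFPs are represented by their start positions in the two proteins, the AFP length of 8 and the gap limit of 30 are the values mentioned above, and the function name is hypothetical.

```python
def can_follow(prev, nxt, afp_len=8, max_gap=30):
    """prev and nxt are (start_in_A, start_in_B) positions of two AFPs.
    Returns True if nxt may directly follow prev in the alignment path."""
    gap_a = nxt[0] - (prev[0] + afp_len)    # unaligned residues in protein A
    gap_b = nxt[1] - (prev[1] + afp_len)    # unaligned residues in protein B
    if gap_a < 0 or gap_b < 0:
        return False                        # overlapping or out of order
    if gap_a > 0 and gap_b > 0:
        return False                        # gaps in both proteins
    return max(gap_a, gap_b) <= max_gap     # no gap, or one gap within limit


print(can_follow((0, 0), (8, 8)))     # True: no gaps
print(can_follow((0, 0), (8, 20)))    # True: gap of 12 in protein B only
print(can_follow((0, 0), (10, 12)))   # False: gaps in both proteins
print(can_follow((0, 0), (8, 39)))    # False: gap of 31 exceeds the limit
```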

CE Method …

Extending the optimal path

FSSP/DALI http://www.ebi.ac.uk/dali/fssp/fssp.html

• Fold Classification based on Structure-Structure alignment of Proteins

• All structures in PDB are clustered into families based on 25% sequence identity and representatives for each family are selected

• FSSP was built using completely automatic method (DALI), based on all-against-all comparison of representative set of structures

• DALI (Distance matrix ALIgnment) is based on distance maps that contain all pair-wise distances between residue centers, i.e., Cα atoms

• The distance matrices from each protein are decomposed into hexapeptide-hexapeptide submatrices. Similar contact patterns are paired and combined into larger sets of pairs

• A Monte Carlo procedure is used to optimize similarity score

• Multiple structure alignments were built based on pair-wise comparison of representative and member within the family and between representatives
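The distance map and its hexapeptide-hexapeptide submatrices can be illustrated as follows. This only shows the data structures involved; it does not implement DALI's similarity score or its Monte Carlo optimization, and the function names are illustrative.

```python
import math


def distance_map(ca_coords):
    """All pair-wise distances between residue centers.
    ca_coords is a list of (x, y, z) tuples, one per residue."""
    return [[math.dist(p, q) for q in ca_coords] for p in ca_coords]


def submatrix(dmap, i, j, size=6):
    """size x size block relating the fragment starting at residue i
    to the fragment starting at residue j."""
    return [row[j:j + size] for row in dmap[i:i + size]]


def count_submatrices(n_residues, size=6):
    """Number of (i, j) start positions for a chain of n residues."""
    return (n_residues - size + 1) ** 2


coords = [(3.8 * k, 0.0, 0.0) for k in range(10)]
dmap = distance_map(coords)
block = submatrix(dmap, 0, 4)
print(len(block), len(block[0]))      # 6 6
print(count_submatrices(10))          # 25
```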


HOMSTRAD http://www-cryst.bioc.cam.ac.uk/homstrad/

• HOMologous STRucture Alignment Database

• 1032 families with 3454 structures

• Structures with only C-alpha values were excluded

• Structurally similar proteins were clustered into homologous families and alignments were built based on 3-D coordinate data

• Uses COMPARER and MNYFIT for building structure alignments

• Multiple alignments were calculated only for representative members of each family










Limitations of current methods

Most of the multiple alignment methods are based on master-slave or progressive alignments. These are biased towards the master structure or the initial alignment

Example:master

• The Target/Scoring function

• The Search Algorithm

• The Search Constraints

• Algorithm

Essential elements of the Method

Monte Carlo Optimization Method http://cemc.sdsc.edu http://dmaps.sdsc.edu

Problem: Most of the multiple alignment methods are based on pair-wise alignment of structures to a Master structure. This leads to biased alignments towards the master, ignoring the similarities within the other structures

• Compute a distance-based score for the current alignment

• Make a random trial change to the current alignment and compute the change in the score (S)

• If S > 0, the move is always accepted

• If S <= 0, the move may be accepted by adding an additional score of P

where,

-C is a constant

-m is the trial move count

• Once a move is accepted, the change in the alignment becomes permanent

• This procedure is iterated until there is no further change in the score, i.e., the system has converged
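The loop described above can be sketched generically. Several parts are assumptions made for illustration: a move with S <= 0 is accepted when S + P > 0, the additional score P is supplied by the caller as a function `bonus(m)` of the trial move count, and convergence is approximated by a hypothetical `patience` parameter (a run of consecutive rejected moves). The toy score and move set only demonstrate the control flow.

```python
import random


def monte_carlo(state, score, propose, bonus,
                max_moves=10000, patience=500, seed=0):
    """Generic accept/reject loop.
    score(state)        -> number, higher is better
    propose(state, rng) -> a randomly changed copy of the state
    bonus(m)            -> additional score P for trial move m (assumed form)
    """
    rng = random.Random(seed)
    current = score(state)
    rejected_in_a_row = 0
    for m in range(1, max_moves + 1):
        trial = propose(state, rng)
        delta = score(trial) - current          # change in the score, S
        if delta > 0 or delta + bonus(m) > 0:
            state, current = trial, current + delta   # change becomes permanent
            rejected_in_a_row = 0
        else:
            rejected_in_a_row += 1
            if rejected_in_a_row >= patience:   # treated as converged
                break
    return state, current


def toy_score(x):
    return -(x - 7) ** 2                        # best possible state is x = 7


def toy_propose(x, rng):
    return x + rng.choice((-1, 1))


best = monte_carlo(0, toy_score, toy_propose, bonus=lambda m: 0)
print(best)
```

With the bonus set to zero the loop only ever accepts improving moves; a positive bonus that shrinks as m grows would let the search accept some unfavorable moves early on.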

General Monte Carlo Approach

[Equation: P defined in terms of S, C and m]

Monte Carlo Simulation ...

Scoring function (Modified from Levitt & Gerstein, 1998)

- S is the total score for the alignment

- l is the total number of columns and i is the column position, in the alignment

- M = 20 (Maximum score of a column, chosen arbitrarily)

- di is the average Cα distance between residues in column i

- p and q are residues in column i

- N = m(m-1)/2 (all-to-all combinations)

- m is the residue count in column i

- d0 is a constant (the distance increase that can be tolerated)

- G is Affine gap penalty term ( G = I + pE) where, I=15, E=7. I and E are gap initiation & extension penalties, respectively, and p is the number of gap extensions

d_i = ( sum of d_pq over all residue pairs p, q in column i ) / N

A_i = 10 or 0, depending on how d_i compares with d_0
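Two of the quantities defined above, the average column distance and the affine gap penalty, can be computed directly from their definitions. The sketch does not reproduce the full scoring function S; the function names are illustrative.

```python
import math
from itertools import combinations


def column_distance(points):
    """Average Cα distance over all residue pairs in one column.
    points holds one (x, y, z) tuple per residue; needs at least two."""
    m = len(points)
    n_pairs = m * (m - 1) // 2                  # N = m(m-1)/2
    total = sum(math.dist(p, q) for p, q in combinations(points, 2))
    return total / n_pairs


def gap_penalty(extensions, initiation=15, extension=7):
    """Affine gap penalty G = I + pE with I = 15 and E = 7."""
    return initiation + extensions * extension


print(column_distance([(0, 0, 0), (3, 0, 0), (0, 4, 0)]))   # 4.0
print(gap_penalty(3))                                        # 36
```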

Search Constraints

• Minimum Block length: > 3 (3-6)

• Residue Threshold: 50 % (33-66 %)

[Figure: Free pool and Block]



1. Shift Right

2. Shift Left

3. Expand Right

4. Expand Left

5. Shrink Right

6. Shrink Left

7. Split/Shrink

Random Trial Move Set

Shift Left

Before Accepting Move: Score = 30796, Distance = 3.815

After Accepting Move: Score = 30846, Distance = 3.849


Expand Right

Before Accepting Move: Score = 30850, Distance = 3.852


Free pool of residues

Expanded fragment


Expand Left

Before Accepting Move: Score = 31093 Distance = 4.042


Free pool of residues

Expanded fragment


Shrink

Before shrinking

After shrinking


Split and Shrink

After Split and Shrinking

Before Split and Shrinking


[Figure: number of alignment columns (260 to 320) and average alignment distance (2.5 to 3.2) plotted against move count (0 to 12000)]

Typical Monte Carlo behavior


[Figure: change in the number of alignment columns (-80 to 60) plotted against change in the average alignment distance (-0.2 to 1.2)]

Relation between alignment improvement and distance increase


Example 1

Monte Carlo Simulation ...

ID A (CE) B (CE+MC) C (HOM.)

Example 2


ID A (CE) B (CE+MC) C (HOMSTRAD)

CE-MC Web Server

• Accessible at http://cemc.sdsc.edu

• A web-based facility to perform multiple structure alignments

• Users can upload local coordinate files and compare them against the PDB files

• Initial seed alignments are built with the CE algorithm and iteratively optimized using Monte Carlo optimization

• Results are emailed upon completion of the job

• Output is displayed in 4 different formats as follows

• JOY/html

• JOY/post-script

• Text

• FASTA

DMAPS Web Server

• Accessible at http://dmaps.sdsc.edu

• Stores pre-calculated multiple structure alignments for all structural families in the PDB

• All structure chains in the PDB were clustered into ~1700 families and multiple structure alignments were performed using the Monte Carlo algorithm

• Multiple structure alignment for a structure family is accessible with the PDB chain ID of any member of that family

• Results are retrieved and displayed in 4 different formats, i.e., JOY/html, JOY/post-script, Text and FASTA

Final Project Work
