
  • 7/31/2019 tics Algorithms

    1/47

BIOINFORMATICS ALGORITHMS

BASED ON THE 2009 TEACHING OF THE CAMBRIDGE COMPUTER SCIENCE PART II BIOINFORMATICS COURSE BY PIETRO LIÒ

Vaughan Eveleigh, Jesus College, Cambridge University



CONTENTS

1 DNA and Protein Sequences
  1.1 Preparation
    1.1.1 Manhattan Tourist
      1.1.1.1 Naive Algorithm
      1.1.1.2 Dynamic Algorithm
  1.2 Strings
    1.2.1 Longest Common Subsequence
    1.2.2 Needleman-Wunsch (Global Alignment)
    1.2.3 Smith-Waterman (Local Alignment)
    1.2.4 Affine Gaps
    1.2.5 Banded Dynamic Programming
    1.2.6 Computing Path with Linear Space
    1.2.7 Block Alignment
    1.2.8 Four Russians Block Alignment Speedup
    1.2.9 Four Russians Technique - Longest Common Sub-Expression
    1.2.10 Nussinov Algorithm
    1.2.11 BLAST (Multiple Alignment)
    1.2.12 Pattern Hunter (Multiple Alignment)
    1.2.13 BLAT (Multiple Alignment)
  1.3 Trees
    1.3.1 Parsimony
      1.3.1.1 Sankoff Algorithm
      1.3.1.2 Fitch's Algorithm
    1.3.2 Large Parsimony Problem
    1.3.3 Distance
      1.3.3.1 UPGMA
      1.3.3.2 Neighbour Joining
    1.3.4 Likelihood
    1.3.5 Bootstrapping Algorithm
    1.3.6 Prim's Algorithm
  1.4 Information Theory and DNA
    1.4.1 Information Content of a DNA Motif
    1.4.2 Entropy of Multiple Alignment
    1.4.3 Information Content of a String
    1.4.4 Motifs
    1.4.5 Exhaustive Search
    1.4.6 Gibbs Sampling
  1.5 Hidden Markov Models
    1.5.1 Forward Algorithm
    1.5.2 Viterbi Algorithm
    1.5.3 Backward Algorithm
2 Working with Microarray
  2.1 Clustering
    2.1.1 Lloyd Algorithm (k-means)
    2.1.2 Greedy Algorithm (k-means)
    2.1.3 CAST (Cluster Affinity Search Technique)
    2.1.4 QT Clustering
    2.1.5 Markov Clustering Algorithm
  2.2 Genetic Networks Analysis
  2.3 Systems Biology
    2.3.1 Gillespie Algorithm
Complexity Summary


1 DNA AND PROTEIN SEQUENCES

    1.1 PREPARATION

    DNA (Deoxyribonucleic acid) uses a 4-letter alphabet (A,T,C,G)

RNA (Ribonucleic acid) also uses a 4-letter alphabet (A,U,C,G)

Proteins use 20 amino acids (A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y)

1.1.1 MANHATTAN TOURIST

The problem: Given a weighted grid G, travel from the source (top left) to the sink (bottom right) along the highest-scoring path, travelling only south and east.

The solution: The problem can be generalised to finding the longest path from the source to an arbitrary destination.

1.1.1.1 Naive Algorithm

Start at the destination node and calculate which of the immediately adjacent nodes has the highest path score from the source. For each of these edges, recurse.

path(i,j)
  if (i = 0 or j = 0)
    return 0
  else
    X = path(i-1, j) + edge (i-1,j) to (i,j)
    Y = path(i, j-1) + edge (i,j-1) to (i,j)
    return max(X,Y)

T = O(2^(i+j)), S = O(1)

Although this exhaustive algorithm produces accurate results, it is not efficient. Many

    path values are repeatedly computed.

1.1.1.2 Dynamic Algorithm

    Dynamic programming improves the naive algorithm by storing the results of previous

    computations and reusing them when required at a later stage. The idea behind a

    dynamic algorithm is that unnecessary calculations are not re-computed. Although this

    significantly improves time complexity, in many cases the space complexity can be

    quite demanding.

    In the case of the Manhattan tourist problem we only need to store the values of 1 row

    and 1 column at any time.


DynamicPath(i,j)
  S[0,0] = 0
  for x=1 to i
    S[x,0] = S[x-1,0] + edge (x-1,0) to (x,0)
  for y=1 to j
    S[0,y] = S[0,y-1] + edge (0,y-1) to (0,y)
  for x=1 to i
    for y=1 to j
      A = S[x,y-1] + edge (x,y-1) to (x,y)
      B = S[x-1,y] + edge (x-1,y) to (x,y)
      S[x,y] = max(A,B)
  return S[i,j]

Where S[x,y] are stored values.

T = O(ij), S = O(i + j)

If our DAG representing the city were to also contain diagonal paths we would require a 3rd condition in the final for loop.

DynamicDiagonalPath(i,j)
  S[0,0] = 0
  for x=1 to i
    S[x,0] = S[x-1,0] + edge (x-1,0) to (x,0)
  for y=1 to j
    S[0,y] = S[0,y-1] + edge (0,y-1) to (0,y)
  for x=1 to i
    for y=1 to j
      A = S[x,y-1] + edge (x,y-1) to (x,y)
      B = S[x-1,y] + edge (x-1,y) to (x,y)
      C = S[x-1,y-1] + edge (x-1,y-1) to (x,y)
      S[x,y] = max(A,B,C)
  return S[i,j]

T = O(ij), S = O(i + j)

Many of the future algorithms will resemble the Manhattan tourist problem.
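As a concrete illustration, the diagonal variant above can be run directly. The following is a minimal Python sketch; the function name and the small example weight grids are invented for illustration, not taken from the notes.

```python
def manhattan(down, right, diag):
    """Longest path in a grid DAG moving only south, east or diagonally.

    down[x][y]  : weight of edge (x,y) -> (x+1,y)
    right[x][y] : weight of edge (x,y) -> (x,y+1)
    diag[x][y]  : weight of edge (x,y) -> (x+1,y+1); pass None for no diagonals
    """
    n = len(down)            # rows of south edges -> n+1 rows of nodes
    m = len(right[0])        # columns of east edges -> m+1 columns of nodes
    S = [[0] * (m + 1) for _ in range(n + 1)]
    for x in range(1, n + 1):                 # first column: only south moves
        S[x][0] = S[x - 1][0] + down[x - 1][0]
    for y in range(1, m + 1):                 # first row: only east moves
        S[0][y] = S[0][y - 1] + right[0][y - 1]
    for x in range(1, n + 1):
        for y in range(1, m + 1):
            best = max(S[x - 1][y] + down[x - 1][y],
                       S[x][y - 1] + right[x][y - 1])
            if diag is not None:
                best = max(best, S[x - 1][y - 1] + diag[x - 1][y - 1])
            S[x][y] = best
    return S[n][m]

# a 3x3 grid of nodes with example edge weights
down = [[1, 0, 2], [4, 6, 5]]        # 2 rows x 3 cols of south edges
right = [[3, 2], [0, 7], [3, 3]]     # 3 rows x 2 cols of east edges
print(manhattan(down, right, None))  # -> 15
```

Only the previous row is ever read, so the same loop can be run with two rows of storage to reach the S = O(i + j) bound mentioned above.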

    1.2 STRINGS

    There are several ways by which we can compare the similarity of strings.

Edit Distance (non-trivial): the minimum number of operations (insertions, deletions and substitutions) required to transform one string into another.

Hamming Distance (trivial): the number of differences when comparing a string position-by-position against another.

Consider two strings, one of length 7 and one of length 6:

ATCTGAT
TGCATA


After comparing one string against the other we can count the number of matches, insertions and deletions.

1.2.1 LONGEST COMMON SUBSEQUENCE

Although the Hamming distance is commonly used in computer science, the edit distance is of greater use in biology. By aligning two strings by their longest common sub-sequences, the minimal distance can be found.

The longest common subsequence is similar to edit distance but only uses insertions and deletions, not substitutions.

This problem can be represented as a hybrid of the Manhattan tourist problem (right) where diagonal paths represent matched characters and horizontal or vertical lines represent edits. Each of the edges is assigned a weighting.

weighting:
  Vertical: 0
  Horizontal: 0
  Diagonal: 1 where the corresponding characters match, 0 otherwise

LongestCommonSubsequence(i,j)
  S[0,0] = 0
  for x=1 to i
    S[x,0] = 0
  for y=1 to j
    S[0,y] = 0
  for x=1 to i
    for y=1 to j
      A = S[x,y-1] + 0
      B = S[x-1,y] + 0
      C = S[x-1,y-1] + edge (x-1,y-1) to (x,y)
      S[x,y] = max(A,B,C)
  return S[i,j]

T = O(ij)

If we are only concerned with returning the optimal score (and not the path) the LCS can be calculated in linear space. When computing the score of any cell in our adjacency matrix only the scores of the cells immediately above, immediately left and immediately diagonal are required. As a result, historic values


from other computations are not required.

S = O(i + j)

The algorithm can also be modified to remember the route when deciding which of the paths (A, B or C) to take. This requires allocation of an adjacency matrix of size ij populated with the score and direction of each cell. This information allows us to backtrack to find which sequence of insertions and deletions generated the score.

T = O(ij), S = O(ij)

This is the simplest form of alignment as only insertions and deletions are allowed. This algorithm is rather restrictive, awarding 1 for matches and not penalising indels (abbreviation for insertions and deletions). We will now consider ways in which mismatches can be penalised.
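The LCS recurrence and the backtracking described above can be sketched in Python as follows; the function name and the traceback tie-breaking order are my own choices. The example strings are the ones used earlier in this section.

```python
def lcs(u, v):
    """Longest common subsequence via the DP above; returns (score, subsequence)."""
    i, j = len(u), len(v)
    S = [[0] * (j + 1) for _ in range(i + 1)]
    for x in range(1, i + 1):
        for y in range(1, j + 1):
            match = 1 if u[x - 1] == v[y - 1] else 0
            S[x][y] = max(S[x][y - 1],               # A: insertion
                          S[x - 1][y],               # B: deletion
                          S[x - 1][y - 1] + match)   # C: diagonal
    # backtrack from S[i][j] to recover one optimal subsequence
    out, x, y = [], i, j
    while x > 0 and y > 0:
        if u[x - 1] == v[y - 1] and S[x][y] == S[x - 1][y - 1] + 1:
            out.append(u[x - 1]); x -= 1; y -= 1
        elif S[x][y] == S[x - 1][y]:
            x -= 1
        else:
            y -= 1
    return S[i][j], "".join(reversed(out))

score, sub = lcs("ATCTGAT", "TGCATA")
print(score)   # -> 4
```

Several different subsequences of length 4 exist for this pair; which one is returned depends on the tie-breaking order in the backtrack.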

1.2.2 NEEDLEMAN-WUNSCH (GLOBAL ALIGNMENT)

Global alignment assumes that the two proteins are basically similar over the entire length of one another. The alignment attempts to match them to each other from end to end, even though parts of the alignment may not be very convincing. E.g.

aaagcggaagtcacag
||.||.||||| |.||
aaggctgaagt-atag

Global alignment penalises insertions or deletions by decreasing the overall alignment score by the value d. We first need to initialise our scoring matrix as we did for the longest common subsequence algorithm.

weighting:
  Vertical: -d
  Horizontal: -d
  Diagonal: 1 where the i-th and j-th characters match, a mismatch penalty otherwise


Needleman-Wunsch(i,j)
  S[0,0] = 0
  for x=1 to i
    S[x,0] = -x*d
  for y=1 to j
    S[0,y] = -y*d
  for x=1 to i
    for y=1 to j
      A = S[x,y-1] - d
      B = S[x-1,y] - d
      C = S[x-1,y-1] + edge (x-1,y-1) to (x,y)
      S[x,y] = max(A,B,C)
  return S[i,j]

T = O(ij), S = O(i + j)

Again this algorithm could be modified to store the path taken through the matrix.
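The global alignment recurrence can be run directly. Below is a minimal sketch; the match/mismatch values and the gap penalty d are example parameters, not values fixed by the notes.

```python
def needleman_wunsch(u, v, match=1, mismatch=-1, d=2):
    """Global alignment score via the recurrence above."""
    i, j = len(u), len(v)
    S = [[0] * (j + 1) for _ in range(i + 1)]
    for x in range(1, i + 1):
        S[x][0] = -x * d                      # leading gaps in v
    for y in range(1, j + 1):
        S[0][y] = -y * d                      # leading gaps in u
    for x in range(1, i + 1):
        for y in range(1, j + 1):
            s = match if u[x - 1] == v[y - 1] else mismatch
            S[x][y] = max(S[x][y - 1] - d,        # A: gap in u
                          S[x - 1][y] - d,        # B: gap in v
                          S[x - 1][y - 1] + s)    # C: (mis)match
    return S[i][j]

print(needleman_wunsch("GATTACA", "GATTACA"))  # -> 7, all matches
print(needleman_wunsch("AT", "AAT"))           # -> 0, two matches minus one gap
```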

1.2.3 SMITH-WATERMAN (LOCAL ALIGNMENT)

Local alignment searches for segments of the two sequences that match well. There is no attempt to force entire sequences into an alignment, just those parts that appear to have good similarity, according to some criterion. E.g.

    The Smith Waterman algorithm is based on the Needleman-Wunsch algorithm but

    ignores badly aligning regions. It does this by assigning 0 to any cells that would have

    been allocated a negative value using Needleman-Wunsch.

Smith-Waterman(i,j)
  S[0,0] = 0
  for x=1 to i
    S[x,0] = 0
  for y=1 to j
    S[0,y] = 0
  for x=1 to i
    for y=1 to j
      A = S[x,y-1] - d
      B = S[x-1,y] - d
      C = S[x-1,y-1] + edge (x-1,y-1) to (x,y)
      D = 0
      S[x,y] = max(A,B,C,D)
  return S[i,j]

T = O(ij), S = O(ij)

    On termination we can use our alignment matrix to find string alignments.

aaagcggaagtcacag
......||||| ....
aaggctgaagt-atag
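The zero-floor modification can be sketched as follows; again the scoring parameters are illustrative, and the best score anywhere in the matrix (rather than the final cell) is returned.

```python
def smith_waterman(u, v, match=1, mismatch=-1, d=2):
    """Local alignment: like Needleman-Wunsch but scores are floored at 0
    and the answer is the best cell anywhere in the matrix."""
    i, j = len(u), len(v)
    S = [[0] * (j + 1) for _ in range(i + 1)]   # first row/column stay 0
    best = 0
    for x in range(1, i + 1):
        for y in range(1, j + 1):
            s = match if u[x - 1] == v[y - 1] else mismatch
            S[x][y] = max(S[x][y - 1] - d,
                          S[x - 1][y] - d,
                          S[x - 1][y - 1] + s,
                          0)                     # D: restart the alignment here
            best = max(best, S[x][y])
    return best

print(smith_waterman("TTTACGTTT", "GGACGGG"))  # -> 3, the local match "ACG"
```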


To find the longest match we just find the cell S[x,y] with the highest value.
To find alignments scoring above a threshold t, we find the cells with S[x,y] > t.
Valid scores can be found in O(ij) time complexity.
The alignment can be found from the scores in O(i + j) time complexity.

1.2.4 AFFINE GAPS

Needleman-Wunsch can be optimised further to score gaps more accurately. Gaps are

currently being scored uniformly, however they usually occur in clusters. The uniform gap penalty is changed to a function that considers the length of a gap. We implement this using affine gaps.

g = gap length, d = initial penalty, e = successive penalty

Here the first gap character incurs a penalty of d and subsequent gap characters incur a penalty of e, giving the gap function

γ(0) = 0
γ(g) = d + (g - 1)e

Two alignment matrices are used to record scores.

F contains scores assuming the current characters align to each other.
G contains scores assuming one of the current characters aligns to a gap.

The example below also includes the function s(x,y) that returns a constant depending on whether the x-th character of one string aligns with the y-th character of the other. This is just shorthand notation for the edge function explained earlier.


Needleman-Wunsch-Affine(i,j)
  F[0,0] = 0
  for x=1 to i
    F[x,0] = -(d+(x-1)*e)
  for y=1 to j
    F[0,y] = -(d+(y-1)*e)
  for x=1 to i
    for y=1 to j
      A = F[x-1,y-1] + s(x,y)
      B = G[x-1,y-1] + s(x,y)
      F[x,y] = max(A,B)
      L = F[x-1,y] - d
      M = F[x,y-1] - d
      N = G[x-1,y] - e
      O = G[x,y-1] - e
      G[x,y] = max(L,M,N,O)
  return max(G[i,j], F[i,j])
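A minimal Python sketch of the two-matrix formulation above, with boundary penalties stored as negative scores; the scoring parameters are illustrative choices, not values from the notes.

```python
def nw_affine(u, v, match=1, mismatch=-1, d=2, e=1):
    """Affine-gap global alignment using the two matrices of the notes:
    F[x][y] assumes u[x] aligns to v[y]; G[x][y] assumes a gap at (x,y).
    A first gap character costs d, each further one costs e."""
    NEG = float("-inf")
    i, j = len(u), len(v)
    F = [[NEG] * (j + 1) for _ in range(i + 1)]
    G = [[NEG] * (j + 1) for _ in range(i + 1)]
    F[0][0] = 0
    for x in range(1, i + 1):
        F[x][0] = -(d + (x - 1) * e)     # boundary gaps, charged affinely
    for y in range(1, j + 1):
        F[0][y] = -(d + (y - 1) * e)
    for x in range(1, i + 1):
        for y in range(1, j + 1):
            s = match if u[x - 1] == v[y - 1] else mismatch
            F[x][y] = max(F[x - 1][y - 1] + s, G[x - 1][y - 1] + s)
            G[x][y] = max(F[x - 1][y] - d,   # open a new gap
                          F[x][y - 1] - d,
                          G[x - 1][y] - e,   # extend an existing gap
                          G[x][y - 1] - e)
    return max(F[i][j], G[i][j])

print(nw_affine("AT", "AT"))  # -> 2, two matches
```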

T = O(ij), S = O(ij)

1.2.5 BANDED DYNAMIC PROGRAMMING

Provided we know that the strings are similar, the majority of

    computations can be ignored. For example, if we were

    comparing two DNA sequences from the same species, the

    optimal alignment will not deviate far from the perfect

    diagonal line. As a result we can just exclude computations

    outside of a set boundary.

Although this reduces the real running time, it does not greatly affect the asymptotic complexity.

T = O(kn), for a band of width k

1.2.6 COMPUTING PATH WITH LINEAR SPACE

    When running dynamic algorithms with quadratic space complexity and quadratic time

    complexity, memory resources usually limit computation before processor cycles.

As explained in 1.2.1 it is possible to calculate the optimal score using dynamic programming in linear space. It is possible to modify the algorithm to return the path in linear space too, but at the expense of increasing the required computations by a factor of 2.

Score only: T = O(ij), S = O(i + j)
Path without optimisation: T = O(ij), S = O(ij)
Path with optimisation: T = O(2ij), S = O(i + j)

This desired space complexity is achieved by finding where the longest path crosses the middle line before recursively subdividing the problem.


Method

1. Split the matrix into 2.

2. Run the algorithm on the first half of the matrix, remembering the values of the final column (prefix values).

3. Run the algorithm in reverse on the second half of the matrix, remembering the values of the final column (suffix values).

4. Find the greatest length where the path crosses the middle line. This is the middle vertex of the optimal path.

   Length(i) = Prefix(i) + Suffix(i)

5. Now that we have our mid-point we recurse on 2 subsections of the matrix: the upper left and lower right regions.
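Steps 1 to 4 above can be sketched with a linear-space column computation. The sketch below uses LCS scoring; the helper names are my own, and the traceback recursion of step 5 is omitted for brevity.

```python
def lcs_column(u, v):
    """Linear-space LCS: returns column c with c[x] = LCS(u[:x], v)."""
    prev = [0] * (len(u) + 1)
    for ch in v:
        cur = [0]
        for x in range(1, len(u) + 1):
            cur.append(max(prev[x], cur[x - 1],
                           prev[x - 1] + (1 if u[x - 1] == ch else 0)))
        prev = cur
    return prev

def middle_vertex(u, v):
    """Find where an optimal LCS path crosses the middle column of the matrix."""
    mid = len(v) // 2
    prefix = lcs_column(u, v[:mid])            # prefix scores into middle column
    rev = lcs_column(u[::-1], v[mid:][::-1])   # suffix scores, computed reversed
    suffix = [rev[len(u) - x] for x in range(len(u) + 1)]
    lengths = [p + s for p, s in zip(prefix, suffix)]
    row = max(range(len(lengths)), key=lengths.__getitem__)
    return row, mid, lengths[row]              # crossing point and LCS length

print(middle_vertex("ABCBDAB", "BDCABA"))  # length component is 4
```

The maximum of Prefix(i) + Suffix(i) over rows i necessarily equals the full LCS score, which is what makes the recursive subdivision of step 5 safe.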

1.2.7 BLOCK ALIGNMENT

So far all of the algorithms have required O(n²) time to align two sequences of length n. The idealistic algorithm for aligning two sequences would be O(n) in time, but this has yet to be achieved and the lower bounds of the Global Alignment Problem remain unknown.

To reduce the required computation time, blocks can be compared instead of individual letters. This will only provide an approximation to the longest common substring algorithm as only the corners of blocks are considered. The block alignment algorithm only allows the longest sub-expression path to enter a block through its corners. Accuracy is lost in favour of speed.

This is achieved by splitting the two DNA sequences, u and v say, into blocks of length t, so that

u = u_1...u_t | u_{t+1}...u_{2t} | ... | u_{n-t+1}...u_n

and

v = v_1...v_t | v_{t+1}...v_{2t} | ... | v_{n-t+1}...v_n


We create our alignment matrix of size n/t × n/t and populate the edge values. The horizontal and vertical edges will represent insertions or deletions of whole blocks and will have the usual penalty constant, d. The value of the diagonal edges will be equal to the alignment score of the two sub-blocks.

weighting:
  Vertical: -d
  Horizontal: -d
  Diagonal: Needleman-Wunsch(u_x, v_y)

Block-Alignment(u,v,n)
  for x=1 to n/t
    for y=1 to n/t
      edge (x-1,y-1) to (x,y) = Needleman-Wunsch(u_x, v_y)
  S[0,0] = 0
  for x=1 to n/t
    S[x,0] = -x*d
  for y=1 to n/t
    S[0,y] = -y*d
  for x=1 to n/t
    for y=1 to n/t
      A = S[x,y-1] - d
      B = S[x-1,y] - d
      C = S[x-1,y-1] + edge (x-1,y-1) to (x,y)
      S[x,y] = max(A,B,C)
  return S[n/t, n/t]

Where u and v are DNA sequences, n is the length of the DNA sequences, t is the length of a block, u_x is the x-th block of string u, and S is our scoring matrix.

If computations resulting from the first nested for-loop (which calculates the alignment of blocks) are ignored then the time complexity is significantly improved. This is only possible when sub-block alignment values have already been computed. This may well be the case if we are examining the same DNA strings but with different penalties for insertions and deletions.

T = O((n/t)²) = O(n²/t²)

In the case where penalties are not being adjusted, the alignment score may already be stored in a pre-computed lookup table. Access to such a table will take O(log n).

T = O((n²/t²) log n)

It may not always be the case that we have pre-computed block alignment values. In this case, the cost of running the initialisation step that calculates the diagonal edge scores makes no improvement on our initial algorithm.


T = O((n/t)² · t²) = O(n²)

This can be improved using the Four Russians technique.

1.2.8 FOUR RUSSIANS BLOCK ALIGNMENT SPEEDUP

The Four Russians technique is very similar to the block alignment algorithm, achieving a significant reduction in the time complexity.

To achieve this goal, the block length t should be approximately log(n)/4, where n is the sequence length and 4 is the number of letters in the alphabet of DNA.

t = log(n)/4

Also, instead of calculating a lookup table of alignment values of size n/t × n/t, a lookup table of size 4^t × 4^t is used, covering every possible pair of blocks.

4^t × 4^t = 2^(4t) = 2^(log n) = n

Computing the initial values of the lookup table now only takes O(n) when t is bound to log(n).

T = O(n + n²/log²(n)) = O(n²/log²(n))

1.2.9 FOUR RUSSIANS TECHNIQUE - LONGEST COMMON SUB-EXPRESSION

Recall that the block alignment algorithm only allows the longest sub-expression path to enter blocks through their corners. When alignment scores for sub-blocks are calculated, only the corner values are stored as points of interest.

The longest common sub-expression algorithm can take any path through the matrix. By extending the Four Russians block alignment speedup, making every point along the side of a block a point of interest, unrestricted entry and exit between blocks is possible.

Instead of performing dynamic programming on the corner vertices of blocks, dynamic programming is used on all edge vertices, ignoring internal vertices. This totals O(n²/t) vertices.

Again, the Four Russians technique is used to create a lookup table that stores all of these values. In essence we are interested in the following problem:


given the alignment scores in the first column and first row of a block and the two strings, compute the alignment scores in the last row and last column.

    This poses a problem. What are the scores of the first row and column? This clearly

    varies depending on the path taken through the matrix. As a result the values of all

    possible combinations of first row and column values are calculated for all

    combinations of strings. This could clearly be an enormous lookup table if there were

    a large number of possible first row and column initial value combinations.

By careful observation of the LCS problem, we can see that the initial values of the first row or column of any block are not entirely arbitrary. Recall that a match scores 1 and an insertion or deletion scores 0. The alignment scores in LCS are monotonically increasing and adjacent elements cannot differ by more than 1. Therefore there are only 2^t possible value sequences for each initial row and column. There are also only 4^t possible strings (due to the DNA alphabet size of 4). Therefore we can very efficiently compute the lookup values.

2^t × 2^t × 4^t × 4^t = 2^(6t)

Given that t = log(n)/4 due to the Four Russians technique:

2^(6t) = 2^(6 log(n)/4) = n^1.5

Our initialisation step is now sub-quadratic. As a result the overall time complexity is dominated by the dynamic programming algorithm.

T = O(n²/log n)

1.2.10 NUSSINOV ALGORITHM

The Nussinov algorithm finds the optimal secondary structure of RNA (right). This is essentially its 3D representation.¹

The Nussinov algorithm has two stages. The first fills the dynamic array with scores. The second uses these scores to trace the secondary structure of the RNA.

The first fill stage is based on the LCS dynamic algorithm. One noticeable difference is that it is not necessary to fill the entire table (see left).

The biological side of the algorithm specifies rules for

    1 http://bioinf.kvl.dk/~gorodkin/teach/bioinf2004/talk4_nov5.pdf


what is a valid pairing of letters. This is based on the Watson-Crick base pairs.

δ(i,j) = 1 if the i-th and j-th letters of the string are a valid pairing, 0 otherwise

Nussinov Fill

Given a subsequence with letters (s_1, ..., s_n):
  for i=2 to n
    N[i,i-1] = 0
  for i=1 to n
    N[i,i] = 0
  for all subsequences of length 2 to length n
    A = N[i+1,j]
    B = N[i,j-1]
    C = N[i+1,j-1] + δ(i,j)
    D = max over i<k<j of ( N[i,k] + N[k+1,j] )
    N[i,j] = max(A,B,C,D)
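The fill stage can be sketched in Python as follows; 0-based indices and the function name are my own, and the pairing rule is restricted to Watson-Crick pairs as the notes describe.

```python
def nussinov(rna, pair={("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}):
    """Fill stage of the Nussinov algorithm: returns the maximum number of
    non-crossing Watson-Crick pairs in the RNA string (0-indexed DP)."""
    n = len(rna)
    N = [[0] * n for _ in range(n)]
    for span in range(2, n + 1):                # subsequence lengths 2..n
        for i in range(0, n - span + 1):
            j = i + span - 1
            best = max(N[i + 1][j],             # A: i left unpaired
                       N[i][j - 1])             # B: j left unpaired
            inner = N[i + 1][j - 1] if i + 1 <= j - 1 else 0
            if (rna[i], rna[j]) in pair:
                best = max(best, inner + 1)     # C: i pairs with j
            for k in range(i + 1, j):           # D: bifurcation
                best = max(best, N[i][k] + N[k + 1][j])
            N[i][j] = best
    return N[0][n - 1]

print(nussinov("GGGAAACCC"))  # -> 3: three nested G-C pairs
```

A real implementation would also enforce a minimum hairpin loop size; it is omitted here to keep the sketch close to the recurrence above.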


1.2.11 BLAST (MULTIPLE ALIGNMENT)

It is often the case that we wish to align far more than just 2 strings. If we were to use the previous dynamic algorithms for the alignment of k strings, we would just require the allocation and analysis of a k-dimensional matrix.

For k sequences of length n, there are a possible 2^k - 1 paths into each cell of the matrix. If we recall the time complexity of the LCS algorithm but extend it into k dimensions we get:

Time Complexity (k-dimension LCS) = O((2^k - 1) · n^k) = O(2^k · n^k)

Unfortunately, due to the exponential running time this becomes impractical and unusable very quickly.

    This problem has led to the development of algorithms such as BLAST (basic local

    alignment search tool). BLAST greatly improves the speed at which strings can be

    compared but at the expense of approximated results.

    A BLAST search enables a researcher to compare a query sequence against a library or

    database of sequences, and identify library sequences that resemble the query sequence

    above a certain threshold. For example, following the discovery of a previously

    unknown mouse gene, a scientist will typically perform a BLAST search of the human

    genome to see if humans carry a similar gene; BLAST will identify sequences in the

    human genome that resemble the mouse gene based on similarity of sequence. BLAST

    has become one of the most widely used bioinformatics algorithms due to its emphasis

    on speed over sensitivity.

where n is the query string, m is the length of substrings (words), D is the database of known substrings and k is the threshold value.

There are many variations of the basic BLAST algorithm. The original algorithm found the substrings of length m in the dictionary by performing basic local alignment around the string until the threshold was exceeded. In the basic case, the local alignment only acknowledged matches and mismatches. The algorithm terminated when either the alignment score fell below the tolerance, or the ratio of matches to mismatches fell below the tolerance.

BLAST(n,m,D,k)
  For all words, w, of length m from the query string n
    Match w in database D
    IF match was found
      Perform local alignment around w until our score falls below threshold k

An alternative and newer approach is gapped BLAST. This has the same basic structure but the local alignment stage also allows for some insertions and deletions. We score the local alignment in the usual way and terminate when the score becomes less than the threshold value. This results in some deviation from the perfect diagonal line in our matrix.

    our matrix.

    There are many more variations on this algorithm tailored to different kinds of pattern

    matching and different biological applications. The effectiveness of the algorithm

    varies drastically with the choice of input variables. Some prior knowledge about the

strings being compared can greatly improve BLAST's results.
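The seed-and-extend loop can be sketched as follows. The word length m, threshold k and the ±1 match/mismatch scoring are illustrative simplifications of real BLAST, and the function names are my own.

```python
def blast(query, db, m=3, k=2):
    """Toy BLAST: index all m-letter words of the database, then extend each
    seed hit in both directions while the running score stays >= k."""
    index = {}
    for d in range(len(db) - m + 1):
        index.setdefault(db[d:d + m], []).append(d)
    hits = []
    for q in range(len(query) - m + 1):
        for d in index.get(query[q:q + m], []):
            score, left, right = m, 0, 0          # the seed scores +1 per match
            while q + m + right < len(query) and d + m + right < len(db):
                nxt = score + (1 if query[q + m + right] == db[d + m + right] else -1)
                if nxt < k:
                    break                          # score fell below threshold
                score, right = nxt, right + 1
            while q - left - 1 >= 0 and d - left - 1 >= 0:
                nxt = score + (1 if query[q - left - 1] == db[d - left - 1] else -1)
                if nxt < k:
                    break
                score, left = nxt, left + 1
            hits.append((q - left, d - left, m + left + right, score))
    return hits

hits = blast("ACGTACGT", "GGACGTACGTGG")
print(max(h[3] for h in hits))  # best extended hit spans the full 8-letter match
```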

1.2.12 PATTERN HUNTER (MULTIPLE ALIGNMENT)

Pattern Hunter is a variation on BLAST which provides increased sensitivity and increased speed. BLAST only matches consecutive sequences of length m during the dictionary lookup. Pattern Hunter introduces the concept of a spaced seed, providing greater flexibility.

    Consider the BLAST seed mask of length 11

    11111111111

    Here the seed represents the fact that we want to match all 11 consecutive characters

    in the search string with 11 consecutive characters in the dictionary. If there had been

a single mutation of any character in the search string, BLAST would not pick up any match in the dictionary.


Now consider the Pattern Hunter spaced seed of weight 11 (length 15)

110111001110111

Here a 0 represents "don't care". This is still matching 11 characters in the search string with 11 characters in the dictionary, but the letters are not necessarily consecutive. The spaced seed models are defined before the algorithm is run.

    This algorithm provides a higher hit probability and a lower expected number of

    random hits.
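Matching under a seed mask reduces to comparing only the '1' positions. The sketch below uses the seeds from this section; the example strings are invented, with a single mutation placed at a don't-care position of the spaced seed.

```python
def seed_hit(a, b, seed):
    """True if strings a and b agree at every position the seed marks with
    '1'; '0' positions are "don't care"."""
    return all(x == y for x, y, s in zip(a, b, seed) if s == "1")

consecutive = "1" * 11               # the BLAST seed mask
spaced = "110111001110111"           # the Pattern Hunter spaced seed (weight 11)

s1 = "AACGTTTGACCTGAC"
s2 = "AACGTTCGACCTGAC"               # single mutation, at a don't-care position

print(seed_hit(s1, s2, spaced))                                   # True
print(any(seed_hit(s1[o:o + 11], s2[o:o + 11], consecutive)
          for o in range(len(s1) - 10)))                          # False
```

The spaced seed still hits despite the mutation, while no 11-letter consecutive window does, illustrating the sensitivity gain described above.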

1.2.13 BLAT (MULTIPLE ALIGNMENT)

BLAT (BLAST-like alignment tool) is just an inversion of the BLAST algorithm.

    Instead of building an index from the query string and scanning through the database,

    we build an index from the database and scan linearly through the query sequence.

    This results in a significant runtime performance increase due to the fact that the index

fits in RAM, which has much faster access.

    1.3 TREES

    Phylogenetic trees are often used in bioinformatics to represent how species or

    sequences are related. They can represent how many organisms have evolved or how

    strings have mutated.

    There are two main types of tree. In rooted trees the root position is the common

    ancestor of all sequences (A - E in the figure). Branch lengths can be used to indicate

    the amount of divergence/change. Un-rooted trees contain no information about a

    hypothetical common ancestor. Branch lengths still reflect degree of divergence.

1.3.1 PARSIMONY

Trees can be constructed in many ways. One of the most common ways to build a tree

    is using minimum parsimony. Parsimony is a measure of how complex the tree is. In

    our case trees with fewer mutations between parent and child have a lower parsimony

    score. Minimum parsimony is the simplest set of assumptions or mutations that can

    explain an observation.


    The simplest way of generating a parsimony score is to use the Hamming distance

    (explained earlier). This is known as small parsimony. In many cases it may be

desirable to create a lookup matrix which assigns different scores to different kinds of mutations, known as weighted parsimony. This way, common mutations can

    have lower penalties than unusual mutations. Small parsimony is just a special case of

    the weighted parsimony lookup table where the diagonal values are 0 and all others are

    1.

1.3.1.1 Sankoff Algorithm

    The Sankoff Algorithm is a way of evaluating the weighted small parsimony problem.

Given a tree T with each leaf labelled with a letter from a k-letter alphabet and a scoring matrix δ, output a tree T' with the internal vertices of the tree T labelled so as to minimize the weighted parsimony score.


Sankoff Algorithm

Given a tree T = (V,E) and a string s, with leaf nodes labelled in order with the letters of s, assign to each node v integer costs c_x(v) for each letter x recursively, starting with the leaf nodes, as follows:

  at every leaf node v, let c_x(v) = 0 if v is labelled x, and c_x(v) = ∞ otherwise
  at every internal node v, let c_x(v) = sum over the children u of v of min_y ( c_y(u) + δ(x,y) )

where δ(x,y) is the cost of mutating from the letter x to the letter y based on the parsimony scoring matrix.

Once all costs have been assigned, we then label each internal node with a letter as follows (backtracking stage):

  For the root node r, let x(r) = x such that c_x(r) = min_y c_y(r)
  Then, for every already labelled node v with letter x, label each child u of v with
    x(u) = x, if min_y(c_y(u)) + 1 > c_x(u)
    x(u) = argmin_y(c_y(u)), otherwise

T = O(nk²)

If we were to use a small parsimony scoring matrix then min_y c_y(r), where r is the root node, would be the minimum number of mutations that would explain the tree. If we use weighted parsimony then min_y c_y(r), where r is the root node, would be the minimum parsimony score possible explaining the tree.

The assigned labels are one possible assignment that exhibits this minimum parsimony score. There may be many label variations that produce the same parsimony score.
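The bottom-up cost pass can be sketched in Python on nested-tuple trees; the tree encoding and the default unit-cost δ (i.e. small parsimony) are my own choices.

```python
INF = float("inf")

def sankoff(tree, alphabet="ACGT", delta=lambda a, b: 0 if a == b else 1):
    """Bottom-up cost pass of the Sankoff algorithm.
    A tree is either a leaf letter (str) or a (left, right) tuple of subtrees.
    Returns {letter: minimum cost of the subtree if its root has that letter}."""
    if isinstance(tree, str):                   # leaf: 0 for its own label
        return {c: (0 if c == tree else INF) for c in alphabet}
    left, right = (sankoff(child, alphabet, delta) for child in tree)
    return {c: min(left[y] + delta(c, y) for y in alphabet) +
               min(right[y] + delta(c, y) for y in alphabet)
            for c in alphabet}

costs = sankoff((("A", "C"), ("A", "G")))
print(min(costs.values()))   # -> 2 mutations explain the four leaves
```

Passing a different delta function gives weighted parsimony with no other change to the code.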

1.3.1.2 Fitch's Algorithm

The Fitch algorithm is very similar to Sankoff. It also finds the minimum parsimony of a tree but only for small parsimony. Fitch produces an identical node labelling to Sankoff given that a non-weighted small parsimony scoring matrix is used.


Given two strings u and v, we can find the most likely way that u mutated into v using small parsimony. We begin by creating a binary tree where only the leaf nodes are labelled with the letters in the strings.

Given a binary tree T = (V,E), with leaf nodes labelled with letters from an alphabet A, assign to each node v ∈ V a set of letters S_v ⊆ A recursively, starting with the leaf nodes, as follows:

    For a leaf node v with label t, let S_v = {t}

    For an internal node v with children u and w, let S_v = S_u ∩ S_w if this intersection is non-empty, and S_v = S_u ∪ S_w otherwise

Now label each internal node with a single letter (backtracking stage):

    Label the root node r with any t ∈ S_r. Then, for every already labelled node v with label t, label each child u of v with t if t ∈ S_u, and with any letter from S_u otherwise.

The labelling produced exhibits the minimum number of mutations that can explain the original tree. There may be label combinations that produce the same score.
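The set recursion can be sketched as follows (illustrative; the nested-tuple tree encoding is an assumption, and each union step accounts for one forced mutation):

```python
def fitch(tree):
    """Fitch's algorithm for one position: return (letter set, mutation count).
    A leaf is a letter; an internal node is a pair of subtrees."""
    if isinstance(tree, str):
        return {tree}, 0
    (s1, m1), (s2, m2) = fitch(tree[0]), fitch(tree[1])
    if s1 & s2:                       # non-empty intersection: no mutation forced here
        return s1 & s2, m1 + m2
    return s1 | s2, m1 + m2 + 1       # empty intersection: take the union, count one mutation
```

For example, `fitch((("A", "C"), ("A", "A")))` yields root set `{"A"}` with 1 mutation, matching Sankoff with the unit-cost matrix.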

1.3.2 LARGE PARSIMONY PROBLEM

This is very similar to the previous small parsimony problem, but instead of calculating the parsimony score of a single string we want to calculate the parsimony score of multiple string alignments.

Given a multiple alignment A = {a_1, . . . , a_n}, its parsimony score is defined as

s(A) = min { s(T, A) | T a tree with n leaves labelled by A }

The Large Parsimony Problem is to compute s(A). Potentially, we need to consider all (2n − 5)!! possible un-rooted trees or (2n − 3)!! possible rooted trees. Unfortunately, in general this can't be avoided and the maximum parsimony problem is known to be NP-hard.


Exhaustive enumeration of all possible tree topologies will only work for n ≤ 10, say.

    Thus, we need more efficient strategies that either solve the problem exactly or return

    good approximations with heuristic searches. These algorithms are not covered here.

1.3.3 DISTANCE

The distance between two nodes in a tree (tree distance) is always going to be greater than or equal to the edit distance between the two nodes. The tree distance between two nodes is the sum of all the edges along the shortest path between the nodes.

We use the notation d_{ij} to represent the edit distance between nodes i and j, and D_{ij}(T) to represent the tree distance between nodes i and j in tree T.

Given n strings we can easily create an n × n matrix, d, representing the edit distance between strings i and j. Our distance based algorithms will produce distance trees that best fit the distance matrix.

In an ideal (and optimal) case:

d_{ij} = D_{ij}(T) for all i, j

Such optimality is always possible when we have trees with no more than 3 leaves, but this is rarely the case when the number of leaves is > 3. Distance matrices for which such a tree exists are said to be additive and should yield a simple solution: for three leaves i, j, k the distance from i to the centre vertex c is

d_{ic} = (d_{ij} + d_{ik} − d_{jk}) / 2

A special case of tree distances is a degenerate triple. A degenerate triple is a set of three distinct elements i, j, k such that d_{ij} + d_{jk} = d_{ik}.


If a distance matrix d has a degenerate triple i, j, k then j can be removed from d, thus reducing the size of the problem. If the distance matrix does not have a degenerate triple, one can create a degenerate triple in d by shortening all hanging edges (in the tree).

1.3.3.1 UPGMA

UPGMA (Un-weighted Pair Group Method using Arithmetic averages) is the first of our best-fit-tree distance algorithms. It uses iterative clustering to create a hierarchical phylogenetic tree.

It uses a pair-wise n × n distance matrix where n is the number of sequences and d_{ij} is the distance between sequences i and j. As explained earlier, there are many ways of calculating the distance between strings but UPGMA usually uses the edit distance.

On each iteration the algorithm combines sequences to create clusters. When calculating the distance to or from a cluster we average (mean) the distance values of all possible combinations of sequence pairs, (p, q) say, where p is a sequence from the first cluster and q is a sequence from the other:

D(C_i, C_j) = (1 / (|C_i| |C_j|)) · Σ { d_{pq} : p ∈ C_i, q ∈ C_j }

where |C_i| is the number of elements in cluster C_i.

UPGMA Algorithm

1. Begin with n sequences and populate an n × n distance matrix d where d_{ij} is the distance between sequences i and j
2. Find the smallest value d_{ij} of d
3. Combine sequences i and j creating a cluster C
4. Create a new node in our tree with child nodes i and j at height d_{ij}/2
5. Repeat steps 2 to 4 with n − 1 sequences as input until only 2 sequences remain. This new input will be the same as our original with sequences i and j removed and cluster C added (the distance to a cluster is the average distance to the sequences in the cluster)
6. Place the root midway between the two remaining clusters

T(n) = O(n^2)
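The merge loop can be sketched compactly (illustrative only; clusters are encoded as tuples of leaf names and the distance dictionary is keyed by sorted leaf pairs, both assumptions):

```python
from itertools import product

def upgma(names, d):
    """names: leaf labels; d: dict mapping sorted leaf pairs to distances.
    Returns the merge history as (cluster_a, cluster_b, height) events."""
    def dist(ca, cb):
        # mean distance over all cross pairs of the two clusters
        return sum(d[min(x, y), max(x, y)] for x, y in product(ca, cb)) / (len(ca) * len(cb))
    clusters = [(n,) for n in names]
    history = []
    while len(clusters) > 1:
        ca, cb = min(((a, b) for i, a in enumerate(clusters) for b in clusters[i + 1:]),
                     key=lambda p: dist(*p))
        history.append((ca, cb, dist(ca, cb) / 2))   # new internal node at height d/2
        clusters = [c for c in clusters if c not in (ca, cb)] + [ca + cb]
    return history
```

On the toy matrix d(A,B)=2, d(A,C)=d(B,C)=4, the first node joins A and B at height 1, and the root sits at height 2.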


UPGMA assumes that:
1. all leaves are placed at the same level;
2. when two sequences are joined in a cluster, the common ancestor is equidistant from each sequence;
3. the "molecular clock" rate of evolution is the same on each branch of the tree.

    Unfortunately these assumptions can cause many inaccuracies. Trees that should look

    like (c) might end up like (d). It is the final assumption (3) that causes the erroneous

    results. This anomaly can be solved using the neighbour joining algorithm.

1.3.3.2 Neighbour Joining

The Neighbour Joining algorithm gets around the pitfalls of UPGMA as no

    assumption is made about the mutation rate. Like UPGMA it is a bottom up clustering

    algorithm that produces phylogenetic trees and calculates the length of branches. Its

    results differ by the fact that the trees are non hierarchical.

    The algorithm starts with a star tree, where each node represents a sequence. One

    chooses the two nodes with shortest distance and connects them with a new internal

    node. The distance could be the difference in percent between the two sequences.

    When this is done two new nodes with the smallest distance are picked out and

    connected with another new node. This will continue until the whole star is resolved.

Neighbour Joining populates a second matrix, D*, based on the distance matrix at each iteration. The new distance values are calculated such that:

D*_{ij} = d_{ij} − (r_i + r_j), where r_i = (1 / (n − 2)) · Σ_{k ∈ L} d_{ik}

and L represents all nodes that remain in the star.


Neighbour Joining

1. Begin with n sequences and populate our n × n distance matrix d where d_{ij} is the distance between sequences i and j
2. Populate a matrix D* such that D*_{ij} = d_{ij} − (r_i + r_j)
3. Find the smallest value D*_{ij} of D*
4. Combine sequences i and j creating a cluster
5. Create a new node u in our tree with child nodes i and j
6. Repeat steps 2 to 5 with n − 1 sequences as input until only 3 sequences or clusters remain. This new input will be the same as our previous with nodes i and j removed and node u added.

T(n) = O(n^5)

Every time we create a cluster we are required to recompute the whole of our matrix D* to account for fast evolving edges. Calculating the position of the cluster this way is very costly, yielding O(n^5) complexity: for each stage we are required to calculate n^2 values of D*, there are n stages, and for each value we are required to sum over all the elements of the matrix, O(n^2). Therefore T(n) = O(n^5).

The complexity can be reduced by introducing a new parameter:

R_i = Σ_{k=1}^{n} d_{ik}

We are only required to calculate R_i and R_j once each round, making each D*_{ij} an O(1) computation. Therefore, by using dynamic programming, it is not necessary to sum over all the elements of the matrix for each D*_{ij}. This reduces the complexity to O(n^3).
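One selection round of the optimised formulation can be sketched as follows (illustrative; it uses the equivalent criterion (n − 2)·d_{ij} − R_i − R_j, which ranks pairs identically to d_{ij} − (r_i + r_j)):

```python
def nj_step(d):
    """One neighbour-joining round on a symmetric n x n matrix d (lists of lists):
    pick the pair minimising (n-2)*d[i][j] - R_i - R_j, then return the pair
    and the branch lengths from i and j to the new internal node."""
    n = len(d)
    R = [sum(row) for row in d]                     # row sums, computed once per round
    i, j = min(((a, b) for a in range(n) for b in range(a + 1, n)),
               key=lambda p: (n - 2) * d[p[0]][p[1]] - R[p[0]] - R[p[1]])
    limb_i = d[i][j] / 2 + (R[i] - R[j]) / (2 * (n - 2))
    return (i, j), (limb_i, d[i][j] - limb_i)
```

On the additive matrix [[0,5,9,9],[5,0,10,10],[9,10,0,8],[9,10,8,0]] this joins taxa 0 and 1 with branch lengths 2 and 3.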


T(n) = O(n^3)

1.3.4 LIKELIHOOD

Maximum Likelihood evaluates a phylogenetic tree in terms of the probability that the

    proposed model of the evolutionary process would give rise to the observed data.

    Often it is the case that certain mutations are more common than others. Once a local

    phylogeny is constructed, it can be scored according to how well the tree helps explain

    the evolutionary path. A minimum distance based tree is only guaranteed to have

    maximal likelihood when all mutations have equal probability.

Given a sequence, s, of length n, I will use s_i to define an individual nucleotide position.

    position i = 1
    Sequence s_1 = ACTGTCGATCGCGCGCGCGATCG
             s_2 = ACTCGATTZCGCAATCGCGATCG
             s_3 = ACTGTCACTCCAGATCGCGCGCG

Neighbour Joining (Optimised)

1. Begin with n sequences and populate our n × n distance matrix d where d_{ij} is the distance between sequences i and j
2. Find the smallest value D*_{ij} of D*, where D*_{ij} = (n − 2) d_{ij} − R_i − R_j and R_i = Σ_{k=1}^{n} d_{ik}
3. Combine sequences i and j creating a cluster
4. Create a new node u in our tree with child nodes i and j
5. Calculate the branch lengths of i and j to u such that:
   d_{iu} = d_{ij}/2 + (R_i − R_j) / (2(n − 2)), d_{ju} = d_{ij} − d_{iu}
6. Calculate the distance between the new internal node u and each node k in the remaining star such that:
   d_{ku} = (d_{ik} + d_{jk} − d_{ij}) / 2
   remembering the R values.
7. Repeat steps 2 to 6 with n − 1 sequences as input until only 3 sequences or clusters remain.


    e.g. Given a tree:

    We root at an arbitrary internal node:

For each position i in the sequence we calculate the likelihood L_i of the tree by summing the probabilities of all the possible ancestral-state combinations:

L_i = Σ_{ancestral states a} P(observed states at i, a)

The overall likelihood of the tree representing the full sequence can be calculated by taking the product of all the individual cognate-set likelihoods:

L = L_1 · L_2 · · · L_N = Π_{i=1}^{N} L_i

As this is costly to compute and the probability of any individual observation is small, we take the natural log of the likelihood:

ln L = ln L_1 + ln L_2 + · · · + ln L_N = Σ_{i=1}^{N} ln L_i

If a tree has a relatively high likelihood score this means that, given the tree and the model of evolution, the data is a relatively likely outcome. The maximum likelihood (ML) tree is the tree or trees making the data most likely. There may be many trees with the same likelihood score.
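For the simplest case of a root with two observed leaf children, the per-position sum over ancestral states can be sketched as below (the two-letter alphabet, the symmetric substitution matrix and the uniform root prior are illustrative assumptions, not course material):

```python
import math

def site_likelihood(x1, x2, P, prior):
    """L_i for a root with two observed children: sum over each possible
    ancestral state a of prior(a) * P(a -> x1) * P(a -> x2)."""
    return sum(prior[a] * P[a][x1] * P[a][x2] for a in prior)

# hypothetical substitution model over a purine/pyrimidine alphabet
P = {"R": {"R": 0.9, "Y": 0.1}, "Y": {"R": 0.1, "Y": 0.9}}
prior = {"R": 0.5, "Y": 0.5}

# per-position likelihoods combine as a product, i.e. a sum of logs
sites = [site_likelihood(a, b, P, prior) for a, b in [("R", "R"), ("R", "Y")]]
log_L = sum(math.log(L) for L in sites)
```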

    When scoring a tree we assume that all mutations are independent allowing us to

calculate the likelihood, L_i, for each position i individually. This can be summarised with the following algorithm:


Calculate Likelihood

1. Root the tree at any internal node (models are time reversible)
2. Calculate L_i for each position i:
   L_i = Σ_{ancestral states a} P(observed states at i, a)
3. Combine the L_i values to calculate L for the whole tree:
   L = L_1 · L_2 · · · L_N = Π_{i=1}^{N} L_i

T = O(n^3)

1.3.5 BOOTSTRAPPING ALGORITHM

Due to the quadratic nature of the tree comparison algorithms it is often the case that

they are very expensive and time consuming to run. The bootstrapping algorithm increases the speed of best-fit multiple alignment algorithms by repeating the tree construction many times on small random samples and outputting the most frequent result.

    Bootstrapping Algorithm

    1.Select random columns from a multiple alignment (one

    column may appear several times)

    2.Build a phylogenetic tree based on the random sample

    3.Repeat stages 1 and 2 many times

    4.Output the tree that is constructed most frequently
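Step 1, resampling alignment columns with replacement, can be sketched as follows (an illustrative helper, not course code):

```python
import random

def bootstrap_sample(alignment):
    """Resample the columns of a multiple alignment with replacement:
    one column may appear several times, others not at all."""
    ncols = len(alignment[0])
    picks = [random.randrange(ncols) for _ in range(ncols)]
    return ["".join(row[c] for c in picks) for row in alignment]
```

Each resampled alignment is then fed to the tree-building algorithm (step 2), and the most frequently constructed tree is reported.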

1.3.6 PRIM'S ALGORITHM

Prim's algorithm finds the minimum spanning tree for a connected weighted graph. This means it finds a subset of the edges that forms a tree that includes every vertex,

    where the total weight of all the edges in the tree is minimized.

Prim's Algorithm

1. Start from a random vertex and make this the root of our tree, T
2. Add the shortest edge connecting a vertex in T to a vertex not in T
3. Repeat step 2 until all nodes are in our set T

T = O(E log V) (where E is the number of edges)

Whilst the time complexity of this algorithm is highly desirable we cannot construct

    meaningful phylogenetic trees. When constructing typical phylogenetic trees the input

    is made up of leaf nodes. The internal nodes are unknown and have to be generated

    during construction of the tree.
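A standard heap-based sketch of the three steps (the adjacency-list encoding is an assumption; the heap is what gives the O(E log V) bound quoted above):

```python
import heapq

def prim_mst_weight(adj, start):
    """adj: {vertex: [(weight, neighbour), ...]}. Grow the tree from start,
    always taking the cheapest edge to a vertex outside the tree; return
    the total weight of the minimum spanning tree."""
    seen = {start}
    frontier = list(adj[start])
    heapq.heapify(frontier)
    total = 0
    while frontier:
        w, v = heapq.heappop(frontier)   # cheapest frontier edge
        if v in seen:
            continue                     # both endpoints already in the tree
        seen.add(v)
        total += w
        for edge in adj[v]:
            heapq.heappush(frontier, edge)
    return total
```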

1.4 INFORMATION THEORY AND DNA

Information theory in the context of DNA expresses the amount of information encoded in a string or observation. The measure of information has unit bits. We use three types of comparisons:


Entropy: The entropy H(X) of a discrete random variable X is a measure of the uncertainty associated with the value of X. For example, if someone were to select a random letter from a given alphabet, how many binary questions would it take to identify the letter? The entropy of a random letter from a 32 letter alphabet can be expressed as H(X) = 5 bits.

Conditional Entropy: The conditional entropy of a variable Y is the uncertainty associated with the value of Y given a random variable X. If X is an independent random variable then the conditional entropy of Y given X, H(Y|X), will be the same as the entropy of Y, H(Y). The conditional entropy of a variable will always be less than or equal to the entropy of that variable.

Mutual Information: Mutual information measures the amount of information that can be obtained about one random variable by observing another. This is essentially the difference between learning the value of Y given X and learning the value of Y without X:

I(X; Y) = H(Y) − H(Y|X)

1.4.1 INFORMATION CONTENT OF A DNA MOTIF

The information encoded at position i in the string can be expressed as

I_i = Σ_{b=1}^{k} f_{b,i} log2 f_{b,i} − Σ_{b=1}^{k} p_b log2 p_b

where, given an alphabet of k possible characters, p_b is the background probability of letter b and f_{b,i} is the motif probability of letter b at position i. If all characters are equiprobable, the background probability for any letter will be 1/k. In the case of DNA, with a 4-letter alphabet, p_b = 1/4, giving a background contribution of 2 bits.


The information content encoded by a DNA motif of length n is the sum of all the information encoded by its individual characters:

I = Σ_{i=1}^{n} I_i

1.4.2 ENTROPY OF MULTIPLE ALIGNMENT

The entropy of a multiple alignment is a measure of the uncertainty of a single column.

Entropy of Multiple Alignment

1. Given an alignment column, we calculate the frequency of occurrence p_b of every possible letter b
2. The entropy of the column is then

   H = − Σ_b p_b log2 p_b
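As a quick illustrative sketch:

```python
from collections import Counter
from math import log2

def column_entropy(column):
    """H = -sum_b p_b * log2(p_b) over the letter frequencies of one column."""
    n = len(column)
    return -sum((c / n) * log2(c / n) for c in Counter(column).values())
```

A fully conserved column ("AAAA") has entropy 0; a maximally uncertain DNA column ("ACGT") has entropy 2 bits.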

1.4.3 INFORMATION CONTENT OF A STRING

The information content of a string is the sum of the information content of every position, i, in our string:

I = Σ_i Σ_b p_{b,i} log2 (p_{b,i} / q_b)

where p_{b,i} is the probability of letter b at position i in our positional weight matrix and q_b is the frequency of letter b by chance.

1.4.4 MOTIFS

    A sequence motif is a nucleotide or amino-acid sequence pattern. This pattern may be

    present in many different positions in many different strings. Finding the same motif

    in multiple strings often suggests a regulatory relationship between those genes.

However, all motif occurrences may not always be exactly the same, as genes may have been turned on and off by regulatory proteins or mutate at non-important bases. A

    motif logo graphically illustrates the importance of letters within a motif at each

    position. Larger letters are more important than smaller letters. This represents

    conserved and variable regions of the motif.


The cumulative size of the letters at a given position is calculated from the information content of the alignment at that position, i.e. by subtracting the column entropy from the background entropy:

height_i = H_background − H_i

The individual letter size is the information content at its position multiplied by the letter's fraction of occurrence:

size_{b,i} = height_i · (fraction of occurrence of b at i)

It is intuitive that larger letters have less variation than smaller letters.

    Given a set of motifs it is possible to find the

    motif consensus. The consensus can be thoughtof as the ancestor from which all mutated motifs

    emerged.

    First, align all patterns by their start index and

    construct a matrix containing the frequency of

    each nucleotide in each column, known as the

    profile.

    The consensus nucleotide in each position is the

    nucleotide with the highest score in each column.

    The distance between a real motif and the consensus sequence is generally less than

    that for two real motifs.

    The number of motif occurrences (sites) and how relations extend can be estimated

    using Markov Chain theory.

e.g., estimating the frequency of a word w = w_1 w_2 w_3 with chains of increasing order:

P_0(w) = P(w_1) P(w_2) P(w_3)
P_1(w) = P(w_1) P(w_2 | w_1) P(w_3 | w_2)
P_2(w) = P(w_1 w_2) P(w_3 | w_1 w_2)

All Markov chains estimate the frequency of a word from base composition alone, with increasing orders producing more accurate results. A Markov chain of order k supposes that the base present at a certain position in a sequence depends only on the bases present at the k previous positions.


1.4.5 EXHAUSTIVE SEARCH

An exhaustive search generates a motif, m, that best matches a set of sequences, S.

Given a set of sequences S = s_1 ... s_t and a motif defined m = m_1 ... m_l, find m such that the match with s_1 ... s_t is optimal. Whilst there are several ways to define an optimal match, Hamming distance is usually used:

d(m, s) = min over all l-mers w of s of d_H(m, w), and
d(m, S) = Σ_{i=1}^{t} d(m, s_i)

In the case of DNA and RNA, which use 4-letter alphabets, the number of possible motifs with length l is 4^l. Whilst this always finds the best motif it is very costly, with running time:

T = O(4^l n t), where n = |s_i|

It is possible to speed up the basic exhaustive search algorithm at the expense of accuracy. Instead of searching through all alphabet permutations of length l, only words of length l that occur in some s_i are considered. This only requires quadratic time, but does not always yield an accurate answer. If the motif is weak and doesn't occur in any s_i then a random motif may have a higher score.
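The brute-force variant, scoring every 4^l word with the Hamming-distance criterion above, can be sketched as follows (only feasible for small l; illustrative):

```python
from itertools import product

def hamming(p, q):
    """Number of mismatching positions between two equal-length words."""
    return sum(a != b for a, b in zip(p, q))

def best_motif(seqs, l):
    """Try all 4^l candidate words; a word's score is the total, over all
    sequences, of the minimum Hamming distance to any l-mer in the sequence."""
    def score(word):
        return sum(min(hamming(word, s[i:i + l]) for i in range(len(s) - l + 1))
                   for s in seqs)
    return min(("".join(w) for w in product("ACGT", repeat=l)), key=score)
```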

1.4.6 GIBBS SAMPLING

Gibbs Motif Sampling identifies motifs, conserved regions, in DNA or protein

    sequences solving the Motif Finding Problem.

Motif Finding Problem: given a set of t DNA sequences each of length n, find the motif with optimal match.

    The algorithm uses an iterative random sampling method increasing the odds that it

    will converge to the correct solution.


    Gibbs Sampling Algorithm

    Method Example

Given the length l of the motif and a set of t sequences:

1. Randomly choose starting positions s = (s_1, . . . , s_t) and form the set of l-mers associated with these starting positions.

2. Randomly choose one of the t sequences (in the example, sequence 2).

3. Create a profile P from the other t − 1 sequences.

4. For each position in the removed sequence, calculate the probability that the l-mer a starting at that position was generated by the profile P, using the likelihood

   prob(a | P) = Π_{i=1}^{l} p(a_i, i)

    AAAATTTACCTTAGAAGG 0.000732

    AAAATTTACCTTAGAAGG 0.000122

    AAAATTTACCTTAGAAGG 0

    AAAATTTACCTTAGAAGG 0

    AAAATTTACCTTAGAAGG 0

    AAAATTTACCTTAGAAGG 0

    AAAATTTACCTTAGAAGG 0

    AAAATTTACCTTAGAAGG 0.000183

    AAAATTTACCTTAGAAGG 0

    AAAATTTACCTTAGAAGG 0

    AAAATTTACCTTAGAAGG 0

5. Create a distribution of probabilities of l-mers, prob(a | P), and randomly select a new starting position based on this distribution.

a. To create this distribution, divide each probability prob(a | P) by the lowest probability:

Position 1: prob(AAAATTTA | P) = .000732/.000122 = 6
Position 2: prob(AAATTTAC | P) = .000122/.000122 = 1
Position 8: prob(ACCTTAGA | P) = .000183/.000122 = 1.5
Ratio = 6 : 1 : 1.5

    b.Define probabilities of

    starting positions according

    to computed ratios

    Probability (Position 1): 6/(6+1+1.5)= 0.706Probability (Position 2): 1/(6+1+1.5)= 0.118Probability (Position 8): 1.5/(6+1+1.5)=0.176

    c.Select the start position

    according to computed ratios:

    P(Selecting Starting Position 1): .706P(Selecting Starting Position 2): .118P(Selecting Starting Position 8): .176

6. Repeat steps 2 to 5 until there is no improvement
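Step 5, turning the l-mer probabilities into a distribution and sampling a new start position, can be sketched as below (illustrative; the trailing return guards against floating-point leftover):

```python
import random

def sample_start(probs):
    """probs: {start position: prob(a | P)}. Normalise the values into a
    distribution and sample one starting position proportionally."""
    total = sum(probs.values())
    r = random.random() * total
    for pos, p in probs.items():
        r -= p
        if r <= 0:
            return pos
    return pos  # floating-point guard

# the worked numbers from the table above: ratio 6 : 1 : 1.5,
# so position 1 is drawn with probability about 0.706
probs = {1: 0.000732, 2: 0.000122, 8: 0.000183}
```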


1.5 HIDDEN MARKOV MODELS

A Hidden Markov Model (HMM) is a statistical model in which the system being modelled is assumed to be a Markov process with unobserved state. A HMM is memory-less, with the only thing affecting the next step being the current state.

Definition

Given an alphabet (b_1, b_2, ..., b_M) and a set of states (1, ..., K), the transition probability from state k to state l is written a_{kl}.

1. The sum of all transition probabilities out of a state will equal 1:
   a_{k1} + · · · + a_{kK} = 1, for every state k = 1 ... K
2. The sum of all starting state probabilities will equal 1:
   a_{01} + · · · + a_{0K} = 1
3. Emission probability within each state (the probability that the hidden state emits the observable symbol):
   e_k(b) = P(x_i = b | π_i = k), with e_k(b_1) + · · · + e_k(b_M) = 1

Given a sequence x = x_1 ... x_N and a parse π = π_1 ... π_N (a sequence of states), we can calculate the parse likelihood:

P(x, π) = a_{0π_1} · Π_{i=1}^{N} e_{π_i}(x_i) a_{π_i π_{i+1}}

Hidden Markov models are used to solve three main questions.

1. Evaluation
   Given a HMM M and a sequence x, find P(x | M), the probability that sequence x was generated by the model:
   P(x) = Σ_π P(x, π)
   This can be calculated using the Forward dynamic algorithm.

2. Decoding
   Given a HMM M and a sequence x, find the sequence π of states that maximises P(x, π). Described algebraically as


   π* = argmax_π P(x, π)

   This is also known as the Viterbi path and is solved using the Viterbi dynamic algorithm.

3. Learning
   Given a HMM with unspecified transition/emission probabilities and a sequence x, find the HMM parameters θ = (a_{kl}, e_k(b)) that maximise P(x | θ).
   Solutions for this problem will be omitted.

1.5.1 FORWARD ALGORITHM

The Forward algorithm solves the evaluation problem: the probability of x given the HMM. This is calculated by summing over all the possible ways of generating x:

P(x) = Σ_π P(x, π) = Σ_π P(x | π) P(π)

To avoid summing over an exponential number of paths π, the forward probability is defined as

forward probability: f_k(i) = P(x_1 ... x_i, π_i = k)

the probability of emitting the prefix x_1 ... x_i and ending in state k, which satisfies the recursion

f_l(i) = e_l(x_i) · Σ_k f_k(i − 1) a_{kl}


T = O(K^2 N)

The main issue with this algorithm is that of underflow. As the numbers become

    increasingly small, they often become inaccurate. This can be solved by rescaling at

    each position by multiplying by a constant.
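The recursion can be sketched directly (illustrative; termination here simply sums the final forward values, i.e. it assumes implicit/uniform end probabilities):

```python
def forward(x, states, a0, a, e):
    """Forward algorithm: f_l(i) = e_l(x_i) * sum_k f_k(i-1) * a_{kl}.
    a0: starting probabilities, a: transitions, e: emissions."""
    f = {k: a0[k] * e[k][x[0]] for k in states}   # initialise with a_{0k} e_k(x_1)
    for sym in x[1:]:
        f = {l: e[l][sym] * sum(f[k] * a[k][l] for k in states) for l in states}
    return sum(f.values())                        # P(x)
```

For a single fair-coin state, forward("HHT", ...) is 0.5^3 = 0.125.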

1.5.2 VITERBI ALGORITHM

The Viterbi algorithm is a dynamic programming algorithm that solves the decoding problem: finding the most likely sequence of hidden states (the Viterbi path) that results in a sequence of observed events. Defining

V_k(i) = probability of the most likely path emitting x_1 ... x_i and ending in state k

gives the recursion V_l(i) = e_l(x_i) · max_k [V_k(i − 1) a_{kl}], with complexity O(K^2 N).

Viterbi Algorithm

Given a sequence x = x_1 ... x_N:

Initialise:    V_0(0) = 1; V_k(0) = 0 for k > 0
Iterate:       V_l(i) = e_l(x_i) · max_k [V_k(i − 1) a_{kl}]
               Ptr_i(l) = argmax_k [V_k(i − 1) a_{kl}]
Termination:   P(x, π*) = max_k V_k(N)
Trace back:    π*_N = argmax_k V_k(N); π*_{i−1} = Ptr_i(π*_i)

Forward Algorithm

Given a sequence x = x_1 ... x_N and a Markov model M:

Initialise:    f_0(0) = 1; f_k(0) = 0 for k > 0
Iterate:       f_l(i) = e_l(x_i) · Σ_k f_k(i − 1) a_{kl}
Termination:   P(x) = Σ_k f_k(N) a_{k0}

where a_{k0} is the probability that the terminating state is k.
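A log-space sketch of the Viterbi recursion and trace-back (illustrative; the dictionary model encoding is an assumption, and working in logs is exactly the underflow fix discussed in this section):

```python
from math import log

def viterbi(x, states, a0, a, e):
    """Log-space Viterbi: V_l(i) = log e_l(x_i) + max_k [V_k(i-1) + log a_{kl}],
    with back pointers recording the maximising predecessor."""
    V = {k: log(a0[k]) + log(e[k][x[0]]) for k in states}
    pointers = []
    for sym in x[1:]:
        back = {l: max(states, key=lambda k, l=l: V[k] + log(a[k][l])) for l in states}
        V = {l: log(e[l][sym]) + V[back[l]] + log(a[back[l]][l]) for l in states}
        pointers.append(back)
    last = max(states, key=lambda k: V[k])
    path = [last]
    for back in reversed(pointers):   # trace back through the pointers
        path.append(back[path[-1]])
    return path[::-1]

# hypothetical fair (F) / loaded (L) coin model
a0 = {"F": 0.5, "L": 0.5}
a = {"F": {"F": 0.9, "L": 0.1}, "L": {"F": 0.1, "L": 0.9}}
e = {"F": {"H": 0.5, "T": 0.5}, "L": {"H": 0.9, "T": 0.1}}
```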


T = O(K^2 N)

Again, this algorithm is subject to underflow. As the numbers become increasingly small, they often become inaccurate. This is solved by taking the log of all values during the iteration stage:

V_l(i) = log e_l(x_i) + max_k [V_k(i − 1) + log a_{kl}]

1.5.3 BACKWARD ALGORITHM

This final algorithm calculates P(π_i = k | x), the probability distribution of the state at position i given the sequence x. Like the forward probability, the backward probability can be derived:

backward probability: b_k(i) = P(x_{i+1} ... x_N | π_i = k)

b_k(i) = Σ_l P(x_{i+1} x_{i+2} ... x_N, π_{i+1} = l | π_i = k)
       = Σ_l e_l(x_{i+1}) a_{kl} P(x_{i+2} ... x_N | π_{i+1} = l)
       = Σ_l e_l(x_{i+1}) a_{kl} b_l(i + 1)

T = O(K^2 N)

Backward Algorithm

Given a sequence x = x_1 ... x_N and a Markov model M:

Initialise:    b_k(N) = a_{k0}, for all k
Iterate:       b_k(i) = Σ_l e_l(x_{i+1}) a_{kl} b_l(i + 1)
Termination:   P(x) = Σ_l a_{0l} e_l(x_1) b_l(1)

where a_{k0} is the probability that the terminating state is k.


    The underflow problem associated with the backwards algorithm is solved in the same

    way as the forwards algorithm. At each position we rescale by multiplying by a

    constant.

The probability distribution over states at a position can be derived by combining the forward and backward quantities:

P(π_i = k | x) = P(x_1 ... x_i, π_i = k) · P(x_{i+1} ... x_N | π_i = k) / P(x)
             = f_k(i) · b_k(i) / P(x)

    An implementation of this equation combining both the forwards and backwards

    algorithm is often referred to as the forwards-backwards algorithm.

2 WORKING WITH MICROARRAYS

DNA microarray is a multiplex technology consisting of

    an array containing thousands of microscopic spots of

    DNA. The microarrays measure changes to the genes

    under varying conditions, such as time, heat or pressure.

    An expression level is estimated by measuring the

amount of mRNA (Messenger RNA) for a particular gene. mRNA is a molecule of RNA which carries

    messages to the sites of protein synthesis. More mRNA

    usually indicates more gene activity.

    Microarray analysis can produce an activity diagram (right).

    2.1 CLUSTERING

    Clustering is a method of grouping functionally related genes. The genes will be related

    based on some distance metric which may combine several factors. Genes are

clustered by plotting points in n-dimensional space, comparing all gene pairs using the distance metric and grouping genes with small distances.

Given a set V consisting of n points and a parameter k, the k-means clustering problem finds a set X consisting of k points (cluster centres) that minimises the squared error distortion d(V, X) over all possible choices of X:

d(V, X) = (1/n) · Σ_{i=1}^{n} d(v_i, X)^2

where d(v_i, X) is the distance from point v_i to the closest cluster centre in X.



2.1.1 LLOYD ALGORITHM (K-MEANS)

The most common algorithm for approximating the k-means problem is the Lloyd algorithm, which uses an iterative refinement heuristic. The algorithm begins by partitioning the input points into k initial sets, either at random or using some heuristic data. It then calculates the mean point of each set. It constructs a new partition by associating each point with the closest centre, before recalculating the mean points for the new clusters. The algorithm then repeats by alternate application of these two steps until the results converge, which is obtained when the points no longer switch clusters (or alternatively the centres no longer change).

    Lloyd's algorithm uses a heuristic for solving the k-means problem which, when used

    with certain combinations of starting points, could converge to the wrong answer. The

    Lloyd algorithm has remained popular because it has a very quick running time with

    the number of iterations often far less than the number of points.

Lloyd Algorithm

- Randomly assign the k cluster centres
- While the cluster centres keep changing:
  o Assign each data point to the cluster corresponding to the closest centre x_i, where 1 ≤ i ≤ k
  o After the assignment of all data points, compute new cluster representatives according to the centre of gravity of each cluster; that is, the new cluster representative is

    x_i = (Σ_{v ∈ C_i} v) / |C_i|
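A one-dimensional sketch of the two alternating steps (illustrative only):

```python
def lloyd(points, centres):
    """1-D Lloyd sketch: alternate (1) assign each point to its nearest centre
    and (2) move each centre to its cluster's centre of gravity, stopping when
    the centres no longer change."""
    while True:
        clusters = [[] for _ in centres]
        for p in points:
            nearest = min(range(len(centres)), key=lambda i: abs(p - centres[i]))
            clusters[nearest].append(p)
        new = [sum(c) / len(c) if c else centres[i] for i, c in enumerate(clusters)]
        if new == centres:
            return centres
        centres = new
```

Starting from centres 0 and 5 on the points {1, 2, 9, 10}, the centres converge to 1.5 and 9.5 in two rounds.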


2.1.2 GREEDY ALGORITHM (K-MEANS)

The Lloyd algorithm is fast but moves many data points each iteration, possibly resulting in sub-optimal convergence. A more conservative method would be to move one point at a time, and only if it improves the overall clustering cost. The smaller the clustering cost of a partition of data points, the better the clustering. Whilst this may produce better clustering, it is usually much more costly.

Progressive Greedy K-Means(k)

Select an arbitrary partition P into k clusters
while forever
    bestChange ← 0
    for every cluster C
        for every element i not in C
            if moving i to cluster C reduces the clustering cost
                if cost(P) − cost(P_{i→C}) > bestChange
                    bestChange ← cost(P) − cost(P_{i→C})
                    i* ← i; C* ← C
    if bestChange > 0
        change partition P by moving i* to C*
    else
        return P

2.1.3 CAST (CLUSTER AFFINITY SEARCH TECHNIQUE)

CAST is a clustering algorithm that groups genes based on affinity.

    Affinity: A measure of similarity between a gene and all other genes ina cluster.

    Threshold Affinity: A user specified criterion for retaining a gene in a

    cluster defined as the percentage of the maximum affinity at

    that point.

It requires more computing power than k-means, but does not require the number of clusters to be specified beforehand. The algorithm is also consistent, returning the

    same result when run several times. The algorithm returns an optimal set of clusters

    with diameters near the threshold affinity.


    CAST ALGORITHM

    1.Create an empty cluster

    2.Set initial affinity of all genes to 0

    3.Move the two most similar genes into the new cluster

4. Update the affinities of all genes, both clustered and un-clustered:
   a(g) = a(g) + S(g, g_added)
5. Whilst there remains an un-clustered gene whose affinity value is greater than the threshold

    Add the gene with the highest affinity to thecluster

    Update the affinities of all genes6.Whilst there remains a clustered gene whose affinity is

    lower than the current threshold

    Remove the gene with the lowest affinity from thecluster

    Update the affinities of all genes

    7.Repeat steps 5 and 6 until there is no further change

    8.Save the cluster removing points in the cluster from

    further consideration.

9. Repeat steps 1 to 8 with the reduced set of points until all genes have been assigned to a cluster

T(n) = O(n^2)

2.1.4 QT CLUSTERING

QT (quality threshold) clustering is an alternative method of partitioning data. Like CAST, it requires more computing power than k-means and does not require the

    number of clusters to be specified beforehand. The algorithm is also consistent,returning the same result when run several times.

    QT Clustering ALGORITHM

    1.Choose a maximum cluster diameter.

    2.Choose a gene as the seed for a new cluster.

    3.Add the closest point, the next closest, and so on, to

    the cluster until the diameter surpasses the threshold.

    4.Repeat steps 2 and 3 for every gene.

    5.Save the cluster with the most points as the first true

    cluster, and remove all points in the cluster from

further consideration.
6. Repeat steps 2 to 5 with the reduced set of points until the last cluster formed has fewer genes than the user-specified number. All genes that are not part of a cluster are unassigned.

T(n) = O(n^3), where n is the number of genes

    The distance between a point and a group of points is computed using complete

    linkage (Jack-knifed distance). This is the maximum distance between the seed and all

    other genes in the cluster.


2.1.5 MARKOV CLUSTERING ALGORITHM

Finally, the Markov clustering algorithm randomly walks through the probability graph

    described by the similarity matrix to identify clusters of related genes. The basic idea

    underlying the algorithm is the algorithm will walk more probable routes more often.

    Dense clusters correspond to regions with a large number of paths.

    The algorithm uses three steps:

Markov Clustering Algorithm

1. Given a network with n vertices, take the corresponding n × n adjacency matrix and normalise each column to obtain a stochastic matrix.
2. Take the e-th power of this matrix (expansion).
3. Then take the r-th power of every element (inflation), renormalising each column.

The expansion parameter e is often taken equal to 2, while the granularity of the clustering is controlled by tuning the inflation parameter r.
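One expansion/inflation round on a small column-stochastic matrix, without any linear-algebra library (an illustrative sketch):

```python
def matmul(A, B):
    """Product of two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mcl_round(M, r=2):
    """One Markov-clustering round on a column-stochastic matrix M:
    expansion (M squared, i.e. e = 2) then inflation (entry-wise r-th
    power followed by column renormalisation)."""
    M = matmul(M, M)                          # expansion
    M = [[x ** r for x in row] for row in M]  # inflation
    n = len(M)
    col = [sum(M[i][j] for i in range(n)) for j in range(n)]
    return [[M[i][j] / col[j] for j in range(n)] for i in range(n)]
```

Both steps leave a fixed point unchanged, e.g. the identity matrix or the uniform two-state matrix.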

2.2 GENETIC NETWORKS ANALYSIS

A genetic network is a collection of DNA segments in a cell which interact with each

    other (indirectly through their RNA and protein expression products (mRNA)) and

    with other substances in the cell. This controls the rate at which genes in the network

    are transcribed into mRNA. This kind of network can often cause chain reactions.

    Assume there are two related genes, A and B. Neither is expressed initially but a third

    gene, X, causes A to be expressed which in turn causes B to be expressed. This kind of

    reaction can often be thought of as a circuit consisting of logic gates.

Gene activity can affect some genes directly and other genes indirectly, known as the primary and secondary targets respectively. Our aim is to reconstruct a large genetic network from n gene perturbations in fewer than n² steps. A perturbation static graph model is used. This essentially perturbs a gene network one gene at a time, monitoring the behaviour of other genes. This identifies direct and indirect gene-gene relationships. If this were a black-boxed circuit made up of logical elements it would be called bit twiddling!


    Perturbation static graph model

Given a gene network:

STEP 1. For each gene g, compare the control experiment to the perturbed experiment where g is perturbed, to identify differentially expressed genes.

STEP 2. Use the most parsimonious graph that represents the behaviour observed in STEP 1.

In more detail:

STEP 1
(1) Given a gene network
(2) Find the adjacency list
(3) Use the adjacency list to find the accessibility list

STEP 2
(4) Find the most parsimonious graph representing the accessibility list from (3)

Assuming the accessibility list is non-cyclic, producing the most parsimonious graph is relatively straightforward. Let Acc(i) be the accessibility list of an acyclic directed graph G, Adj(i) the adjacency list of its most parsimonious graph, and V the set of all nodes of G. Then the following identity holds:

    Adj(i) = Acc(i) \ ∪_{j ∈ Acc(i)} Acc(j)    for all i ∈ V
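A minimal sketch of this identity in Python (the accessibility list below encodes the X → A → B chain-reaction example from earlier, so the indirect edge X → B should be pruned):

```python
# Sketch of the parsimonious-graph identity: keep only the edges of Acc(i)
# that are not explained by a longer path through some j in Acc(i).
def parsimonious_adjacency(acc):
    adj = {}
    for i, reachable in acc.items():
        # Union of everything reachable *through* some j in Acc(i).
        indirect = set()
        for j in reachable:
            indirect |= acc[j]
        adj[i] = reachable - indirect
    return adj

# X can reach both A and B, but B is reachable through A, so the direct
# edge X -> B is pruned away.
acc = {"X": {"A", "B"}, "A": {"B"}, "B": set()}
print(parsimonious_adjacency(acc))  # → {'X': {'A'}, 'A': {'B'}, 'B': set()}
```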


Non-cyclic most parsimonious graph

    for all nodes i of G
        Adj(i) := Acc(i)
    for all nodes i of G
        if node i has not been visited
            call PRUNE(i)

    PRUNE(i)
        for all nodes j ∈ Acc(i)
            if Acc(j) = ∅
                declare j as visited
            else
                call PRUNE(j)
        for all nodes j ∈ Acc(i)
            for all nodes k ∈ Acc(j)
                if k ∈ Adj(i)
                    delete k from Adj(i)
        declare i as visited

If the accessibility list contains cycles it is not algorithmically possible to produce a unique graph for the accessibility list. All genes within a cycle affect all other genes. This is an experimental limitation. Where cycles do occur we shrink each cycle into a single node and apply the same algorithm as for the non-cyclic case. When i and j (where i ≠ j) are two nodes in a directed graph, iff j ∈ Acc(i) and i ∈ Acc(j) then they belong to the same component.
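A minimal sketch of grouping nodes by this mutual-accessibility criterion (the accessibility list is a made-up example containing one two-node cycle between a and b):

```python
# Sketch of the mutual-accessibility criterion: i and j share a component
# iff each appears in the other's accessibility list.
def components(acc):
    comps, seen = [], set()
    for i in acc:
        if i in seen:
            continue
        comp = {i} | {j for j in acc[i] if i in acc[j]}
        seen |= comp
        comps.append(comp)
    return comps

# a and b form a cycle (each reaches the other); c is reachable but
# reaches nothing, so it sits in its own component.
acc = {"a": {"b", "c"}, "b": {"a", "c"}, "c": set()}
print(components(acc))
```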

This algorithm is limited. It is unable to resolve cyclic graphs and requires far more data than conventional methods which use gene expression correlations. Also, for an accessibility list there may be many consistent networks; this algorithm only constructs the most parsimonious graph, which is not necessarily an accurate model.

    2.3 SYSTEMS BIOLOGY

    Systems biology has become a particular field of interest in recent years (2000

    onwards). It is the study of how different biological systems interact with each other

    with respect to time. Knowledge about molecules, genes, cells, tissue, organs,

    chemicals and much more can be combined to study how they interact.

    A simple example of a biological system is our digestive system. We know that when

    we eat the energy is not released into our blood stream immediately. The time taken

and the rate at which energy passes through the system has many variables (e.g. sugar content of the food, metabolic rate, blood sugar levels, etc.). Given all of the system

    dependencies, we want to be able to describe the likelihood that a given event will

    occur.

    Ideally, we want to produce a model that represents the flow of events throughout a

    system. If our system has a definite series of events we can accurately use differential


    equations as our model, but often such systems have a random element making

    stochastic algorithms more suitable.

2.3.1 GILLESPIE ALGORITHM

The Gillespie algorithm is a stochastic algorithm for the simulation of genetic networks. It uses random numbers in order to generate sequences of events and inter-event times in a biological system.

It is assumed that in a genetic network, different proteins/enzymes are related to each other: they can trigger or inhibit the production of others, or they can react with another substance to form a new substrate. Any possible reaction i will have a reaction rate, a_i, associated with it (which is proportional to the probability of this reaction happening at any step). Additionally, the number of molecules of each substance referred to by the reactions in the environment is known.
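One step of a Gillespie-style simulation can be sketched in Python. The two-reaction birth/death system and its rates k1 and k2 are made-up illustrations, not examples from the course:

```python
# Sketch of a Gillespie-style stochastic simulation on a toy system.
import math
import random

def gillespie_step(state, reactions, rng):
    """Sample the waiting time and apply one randomly chosen reaction."""
    rates = [rate(state) for rate, _ in reactions]
    total = sum(rates)
    # Inter-event time is exponentially distributed with parameter `total`.
    tau = -math.log(1.0 - rng.random()) / total
    # Choose reaction i with probability rates[i] / total.
    r = rng.random() * total
    acc = 0.0
    for (_, apply_reaction), a in zip(reactions, rates):
        acc += a
        if r < acc:
            return tau, apply_reaction(state)
    return tau, reactions[-1][1](state)  # guard against rounding at r ≈ total

# Toy system: protein production at rate k1, degradation at rate k2 * n.
k1, k2 = 5.0, 0.1
reactions = [
    (lambda n: k1,     lambda n: n + 1),  # production
    (lambda n: k2 * n, lambda n: n - 1),  # degradation
]

rng = random.Random(0)
t, n = 0.0, 0
for _ in range(10_000):
    tau, n = gillespie_step(n, reactions, rng)
    t += tau
print(n)  # fluctuates around the steady-state mean k1 / k2 = 50
```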

First a random number, say r, is generated in order to decide which reaction happens. The probability of reaction i occurring is given by

    p_i = a_i / Σ_{j=1..n} a_j

where a_j is the reaction rate for reaction j and the denominator normalises by the cumulative reaction rate of all n known reactions in the system. Reaction i will occur if

    p_1 + ... + p_{i-1} ≤ r < p_1 + ... + p_i