Algorithms for Discovering Patterns in Sequences
Raj Bhatnagar
University of Cincinnati
Outline
• Applications of Sequence Mining
  – Genomic Sequences
  – Engineering/Scientific Data Analysis
  – Market Basket Analysis
• Algorithm Goals
  – Temporal Patterns
  – Interesting Frequent Subsequences
  – Interesting Frequent Substrings (with mutations)
  – Detection of Outlier Sequences
Why Mine a Dataset?
Discover Patterns:
– Associations
  • A subsequence pattern follows another
  • Help make a prediction
– Clusters
  • Typical modes of operation
– Temporal Dependencies
  • Events related in time
– Spatial Dependencies
  • Events related in space
Example Problems
From various domains:
– Engineering
– Scientific
– Genomic
– Business
Energy Consumption
Gene Expression over Time
Multivariate Time Series
• Multivariate time series data is a series of multiple attributes observed over a period of time at equal intervals
• Examples: stocks, weather, utility, sales, scientific, sensor monitoring and bioinformatics
Spatio-Temporal Patterns
Goal: Find patterns in space-time dimensions for phenomena
Issues: Simultaneous handling of space and time dimensions
Problem: How to handle the large complexity of the algorithms
Phenomena in space-time
Discover Interesting Subsequences
actgatAAAAAAAAGGGGGGGggcgtacacattagCTGATTCCAATACAGacgt
aaAAAAAAAAGGGGGGGaaacttttccgaataCTGATTCCAATACAGgatcagt
atgacttAAAAAAAAGGGGGGGtgctctcccgattttcCTGATTCCAATACAGc
aggAAAAAAAAGGGGGGGagccctaacggacttaatCCTGATTCCAATACAGta
ggaggAAAAAAAAGGGGGGGagccctaacggacttaatCCTGATTCCAATACAG
AAAAAAAAGGGGGGG-(10,15)-CTGATTCCAATACAG
The blue pattern sequence may have up to k substitutions
• Finding frequent patterns
• Doing similarity search
• Clustering of time series
• Periodicity detection
• Finding temporal associations
• Summarization
• Prediction
Main Sequence Mining Tasks
Finding Frequent Substrings
abacebc . abcdobd . abaaebd . eoaoobd . abceoad .
GST generated considering the substrings starting at location #0 of each string is:
(0) <root> – (8) ab – (10) c – (11) dobd – (15) 8
– (35) eoad – (39) 32 – (18) a – (19) aebd – (23) 16
– (3) cebc – (7) 0 – (24) eoaoobd – (31) 24
Substring   Length size(s)   Position pos   Set P           Cardinality |P|
aba         3                0              {0, 2}          2
abc         3                0              {1, 4}          2
ab          2                0              {0, 1, 2, 4}    4
Two Main Approaches:
• Generalized Suffix Trees (linear time)
• Find Longest Common Substrings (linear time)
Finding Frequent Substrings
• Recursively generate the GST with each string truncated by removing the first character
Original strings:
abacebc . abcdobd . abaaebd . eoaoobd . abceoad .
Truncated strings:
bacebc . bcdobd . baaebd . oaoobd .bceoad .
(0) <root> – (8) b – (10) dobd – (14) 8
– (31) eoad – (35) 29 – (16) a – (17) aebd – (21) 15
– (3) cebc – (7) 1 – (22) oaoobd – (28) 24
s    |s|   pos   set P          |P|
ba   2     1     {0, 2}         2
bc   2     1     {1, 4}         2
b    1     1     {0, 1, 2, 4}   4
Results after Phase-I
• Substrings generated after Phase-I are:
Phase2: Subsequence Hypotheses
Result of Phase 1:
string    size(s)   profiles(G)   size(G)
ABa----   3         {0, 2}        2
AB-----   2         {0, 1, 2}     3
--a----   1         {0, 2, 3}     3
----*bd   3         {1, 3}        2
-----bd   2         {1, 2, 3}     3
Result after Phase 2:
subsequence   size(T)   profiles(R)   size(R)
AB---bd       4         {1, 2}        2
--a--bd       3         {2, 3}        2
Merging Profile Sets
• Two sets of profiles are considered similar and are merged together if:

  Size(Intersection(P1, P2)) / Size(Union(P1, P2)) ≥ threshold

  where P1 and P2 are any two profile sets and threshold is user-defined
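The merge criterion above can be sketched in Python. This is only a sketch: the overlap test is from the slide, while the greedy single-pass grouping and the function names are illustrative assumptions.

```python
def should_merge(P1, P2, threshold):
    """Merge two profile sets when |P1 ∩ P2| / |P1 ∪ P2| >= threshold (slide's test)."""
    P1, P2 = set(P1), set(P2)
    return len(P1 & P2) / len(P1 | P2) >= threshold

def merge_profiles(profile_sets, threshold):
    """Greedy single pass (illustrative): fold each profile set into the
    first existing group it is similar enough to, else start a new group."""
    groups = []
    for P in profile_sets:
        for g in groups:
            if should_merge(g, P, threshold):
                g |= set(P)
                break
        else:
            groups.append(set(P))
    return groups
```

For example, with threshold 0.5, {0,2} and {0,2,3} overlap in 2 of 3 elements and merge, while {1,3} stays separate.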
Merged Substrings
Generalizing Substring Patterns
• Core subsequences can be written in the form of a regular expression (regex)
• Each symbol is considered replaceable by the preceding or following letter of the alphabet
E.g. for the substring,
aba – eb - -
regex will be [ab]{1}[abc]{1}[ab]{1}.{1}[def]{1}[abc]{1}.{2}
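This generalization step can be sketched in Python. The function name and the restriction to a lower-case alphabet are illustrative assumptions; '-' positions become wildcards, and every other symbol becomes a character class of itself plus its alphabet neighbors, reproducing the regex shown above.

```python
ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def generalize(pattern):
    """Build a regex where each symbol may be replaced by its preceding
    or following letter of the alphabet; '-' is a wildcard position."""
    parts = []
    for ch in pattern:
        if ch == '-':
            if parts and parts[-1][0] == '.':
                parts[-1][1] += 1          # extend a run of wildcards
            else:
                parts.append(['.', 1])
        else:
            i = ALPHABET.index(ch)
            cls = ALPHABET[max(i - 1, 0): i + 2]   # neighbor character class
            parts.append(['[%s]' % cls, 1])
    return ''.join(f"{p}{{{n}}}" for p, n in parts)
```

`generalize("aba-eb--")` yields `[ab]{1}[abc]{1}[ab]{1}.{1}[def]{1}[abc]{1}.{2}`, matching the slide.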
Generalization of Sequence Patterns
• A hypothesis T specifies a core subsequence shared by the set of profiles R
• Core subsequences and the set R are affected by various factors, such as loss of information
• T is considered the seed of the hypothesis
Cluster #1 from Utility Dataset
Cluster #2 from Utility Dataset
What is Achieved
• Discover:– Partial-sequence temporal
hypotheses – identities of profiles for each
temporal hypothesis
Histone Cluster
ORF      Process              Function     Peak Phase   Order   Cluster Order
YBL002W  chromatin structure  histone H2B  S            424     789
YBL003C  chromatin structure  histone H2A  S            441     790
YBR009C  chromatin structure  histone H4   S            417     791
YBR010W  chromatin structure  histone H3   S            420     795
YDR224C  chromatin structure  histone H2B  S            432     794
YDR225W  chromatin structure  histone H2A  S            437     796
YNL030W  chromatin structure  histone H4   S            418     792
YNL031C  chromatin structure  histone H3   S            430     793
YPL127C  chromatin structure  histone H1   S            448     788
YBL002W  CC ab dccaCBBBabc*BCCCA* xcCBCbbcccACACCBBbaabbb* dcABCaaabbBCxBBBa cccbABCBBBBBab
YBL003C  CC ab cdcbBCBBbacaBCCC** cbCCCAbcccaCCCCBA*bcbbaA bcABBb*ab*AxABBAa cb*baBBCBBBB*b
YBR009C  xC *x ddcaCCBBaab*BCCB*a xcCCBbccccbCBCBBa*bbxaAB baB*AaA*bxdBaBBBB cccbABCCCBABAb
YBR010W  BC aA ccc*CBBAbbc*BCCBAa bbCCB*ccdbbCBCCCAAxbcaAB bbA*Ab*bbAAxaBBB* bccbaBBBBB*Bab
YDR224C  BC aa ccc*CCB*abbaBBCA*b cbCCB*cccbbCCCCCAAcccaAB caBAAaabbAAxaBBBA cccb*BBCBBBBAb
YDR225W  CC *b cdcaCCC*abbbBCCBAa cbCCC*bccbaCCCBCBAbccbaB cbAA*a*bbAAxaBBBB bccbaBCBBBBB*b
YNL030W  CC bb dccACCBAabb*CBCBAa cbCCBacccbbCCCCB*AbbcaAB b*AaAbabbBxBaA*BA cccb*BCCBBBB*b
YNL031C  CC ab cccABCB*abb*BCCAAa bbCCB*bccbbBBCBC*Abbbb*B bAAAAbabbBAxaA*B* bcbb*BBCBAA**b
YPL127C  Da bb dcaBBCBAabb*BBBBAa cbBCABbcdbaBABBCBBbcbaAB ccBBBB*bcbABBxAAb bccbABBCBBBA*a
^.{4}[cdx]{2}.{3}[BCx]{1}.{3}[abx]{1}.{3}[BCDx].{7}[BCx]{1}.{3}[bcdx]{2}.{5}[BCx]{1}.{33}[BCx]{1}[ABCx]{1}.{5}$
CC-b--c---------BCC-----C---bcc----C-------b----A-----b-----------b-B--B----b
YBL002W  CCabdccaCBBBabc*BCCCA*xcCBCbbcccACACCBBbaabbb*dcABCaaabbBCxBBBacccbABCBBBBBab
YBL003C  CCabcdcbBCBBbacaBCCC**cbCCCAbcccaCCCCBA*bcbbaAbcABBb*ab*AxABBAacb*baBBCBBBB*b
YDR225W  CC*bcdcaCCC*abbbBCCBAacbCCC*bccbaCCCBCBAbccbaBcbAA*a*bbAAxaBBBBbccbaBCBBBBB*b
YNL031C  CCabcccABCB*abb*BCCAAabbCCB*bccbbBBCBC*Abbbb*BbAAAAbabbBAxaA*B*bcbb*BBCBAA**b
++x----x+++x---x+++xxx--+++x----x+x+++xx-x--xxxxx+x-x--x++x+x+x--x-x++++++xx-
5,6,10,14,18,26,30,31,37,71,72
Fourier Results
String-Based Results
Regular Expression
Generalization
Localized Focus of Generalization = more genes included
Example 2 with Yeast Cell Data
ORF      Process               Function                Peak Phase   Order   Cluster Order
YBR009C  chromatin structure   histone H4              S            417     791
YBR010W  chromatin structure   histone H3              S            420     795
YDR224C  chromatin structure   histone H2B             S            432     794
YDR225W  chromatin structure   histone H2A             S            437     796
YLR300W  cell wall biogenesis  exo-beta-1,3-glucanase  G1           381     756
YNL030W  chromatin structure   histone H4              S            418     792
YNL031C  chromatin structure   histone H3              S            430     793
Not classified as cell cycle related:
YBR106W, YBR118W, YBR189W, YBR206W, YCLX11W, YDL014W, YDL213C,
YDR037W, YDR134C, YGL148W, YKL009W, YLR449W, YNL110C, YPR163C
^.{46}[bcx]{1}.{1}[ABx]{1}[aA*x]{1}[AB*]{1}[abx]{1}[aA*x]{1}.{1}[abcx]{1}[ABx]{1}.{1}[ABCx]{1}[ab*x]{1}[ABx]{1}.{1}[ABCx]{1}.{15}$
Original Hypothesis
New Genes Included After Localized Generalization
New Regular Expression
Find Longest Common Substrings
Algorithms for LCS
• Substring: a contiguous sequence of characters in a string
• Subsequence: obtained by deleting zero or more symbols from a given string
abcdefghia
Substrings : cdefg, efgh, abcd
Subsequences: ade , cefhi, abc, aia
• The LCS is a common subsequence of maximal length between two strings
String 1 : abcdabcefghijk
String 2: xbcaghaehijk
LCS = bcaehijk, Length of LCS = 8
Longest Common Subsequence
• Brute force has exponential time complexity in the length of the strings
• Dynamic programming can find the LCS in O(mn) time and space
• The length of the LCS can be found in O(min(m,n)) space and O(mn) time
Finding LCS
• Finding frequent patterns
• Doing similarity search
• Clustering of time series
• Periodicity detection
• Finding temporal associations
• Summarization
• Prediction
Main Sequence Mining Tasks
Recursive Formulation:
LCS[i, j] =
  0                                 if i = 0 or j = 0
  LCS[i-1, j-1] + 1                 if i, j > 0 and a_i = b_j
  max(LCS[i, j-1], LCS[i-1, j])     if i, j > 0 and a_i ≠ b_j
Finding the LCS Length
Recursive Formulation:
LCS[i, j] =
  0                                 if i = 0 or j = 0
  LCS[i-1, j-1] + 1                 if i, j > 0 and a_i = b_j
  max(LCS[i, j-1], LCS[i-1, j])     if i, j > 0 and a_i ≠ b_j
Iterative solution is more efficient than recursive
Finding the LCS Length
lcs_length(A, B) {
  // A is a string of length m, terminated by '$'
  // B is a string of length n, terminated by '$', m >= n
  // L is an array holding the intermediate values of the dynamic program
  for (i = m; i >= 0; i--)
    for (j = n; j >= 0; j--) {
      if (A[i] == '$' || B[j] == '$')  L[i,j] = 0;                       // end of string
      else if (A[i] == B[j])           L[i,j] = 1 + L[i+1, j+1];
      else                             L[i,j] = max(L[i+1, j], L[i, j+1]);
    }
  return L[0,0];
}
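The same computation can be written as a short Python sketch, filling the table bottom-up exactly as in the recurrence (index bounds replace the '$' sentinels):

```python
def lcs_length(A, B):
    """Length of the longest common subsequence of A and B via
    bottom-up dynamic programming: O(m*n) time and space."""
    m, n = len(A), len(B)
    L = [[0] * (n + 1) for _ in range(m + 1)]   # row m / column n stay 0
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if A[i] == B[j]:
                L[i][j] = 1 + L[i + 1][j + 1]
            else:
                L[i][j] = max(L[i + 1][j], L[i][j + 1])
    return L[0][0]
```

On the example strings from the earlier slide, `lcs_length("abcdabcefghijk", "xbcaghaehijk")` returns 8, the length of bcaehijk.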
Finding the LCS Length: Algorithm
Sequential Clustering
• Clustering partitions the data into equivalence classes
• The data is read once or a few times
• The resulting classification depends on the input order
• Simple and fast for large data
Basic Sequential Clustering
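A minimal sketch of basic sequential clustering, assuming a user-supplied distance function and threshold (the function name is illustrative): each item joins the nearest existing cluster if close enough, otherwise it starts a new one, so the result depends on presentation order, as noted above.

```python
def sequential_cluster(items, dist, threshold):
    """Basic sequential clustering sketch: assign each item to the
    nearest existing cluster representative if within `threshold`,
    otherwise start a new cluster (order-dependent by design)."""
    reps, clusters = [], []
    for x in items:
        if reps:
            d, k = min((dist(x, r), k) for k, r in enumerate(reps))
            if d <= threshold:
                clusters[k].append(x)
                continue
        reps.append(x)        # x becomes the representative of a new cluster
        clusters.append([x])
    return clusters
```

For instance, `sequential_cluster([1, 2, 10, 11, 2], lambda a, b: abs(a - b), 3)` groups the values into `[[1, 2, 2], [10, 11]]`.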
Sequential Clustering with Buckets
• Finding frequent patterns
• Doing similarity search
• Clustering of time series
• Periodicity detection
• Biological Sequence Problems
Main Sequence Mining Tasks
Multivariate Time Series
• Multivariate time series data is a series of multiple attributes observed over a period of time at equal intervals
• Examples: stocks, weather, utility, sales, scientific, sensor monitoring and bioinformatics
• Finding frequent patterns
• Doing similarity search
• Clustering of time series
• Periodicity detection
• Finding temporal associations
• Prediction
Time Series Analysis Tasks
Why temporal association rules?
• More information about correlations between frequent patterns
• Richer information than knowledge of frequent patterns alone
• Helps to build diagnostic and prediction tools
Finding temporal associations: recent work
• Mannila – Discovery of frequent episodes in event sequences [2], 1997
• Das – Rule discovery from time series [1], 1998
• Kam – Discovering temporal patterns for interval-based events [3], 2000
• Roddick – Discovering richer temporal association rules from interval-based data [4], 2004
• Mörchen – Discovering temporal knowledge in multivariate time series [5], 2004
Research Issues
• Finding a richer set of temporal relationships {contains, follows, overlaps, meets, equals, …} than sequence mining does {follows}
• Robustness of rules – room for noise in patterns
• Understanding temporal relationships at different levels of abstraction
• Efficient algorithms to find patterns with noise and temporal associations
Frameworks
• Kam [3] uses Allen's temporal relations to find rules
• Roddick [4] uses a state-sequence framework similar to Höppner [6]
• Mörchen [5] is based on a Unification-based Temporal Grammar
Temporal Association Rules
Allen’s relationships
The figure above is taken from [1]
What are A1 rules?
A1 rel1 A2 rel2 A3 rel3 A4 … rel(k-1) Ak
Given a multivariate time series, a minimum support on the number of occurrences, and a minimum pattern length, find all the temporal association rules (similar to A1) of size less than k
Problem
multivariate time series sequences
dimensionality reduction, discretization and symbolic representation
aaaeffdaaaaaaaaacccaaaaaaaaaaedefggcbabaacfgfc…dccbbccdccdeedcdcdeecccdddeeecdeccccddegedbcdc…dddcbbcfffeeegffcbcdeeefffffecbbaaaaaacffecbbb…
frequent patterns
{aaaaaaaa, bbbaaa, bbaaa, eeef, aaac, eaaa ...}{cdee, dccde, deed, ddeeec, eeec, ccccc,edddd …}{fff,fffe,aaaaa,fecbb,ddcb,bbbbc,bbbbc,cbbb…}
frequent pattern enumeration
clustering
clusters
{aaaaaaaa} {bbba,bba,..} {eeef…} {aaac,eaaa …}…{cdee}{dccde…}{deed,ddeeec,eeec…}{ccccc…}{edddd…}…{fff,fffe…}{fecbb…}{ddcb…}{bbbbc,bbbbc,cbbb…} …
Temporal association rule discovery
1.{aaaa,aaac,aaae,abaa…}followed by{cbb,ccbb,cccbb…}2.{bbbbb,cbbbbbc,…}followed by{baaaa,aaaaaaa,aaaaaa
…} overlaps {aaaa,aaac,aaae…}3.{dccccdd,ccccdd} contains {aaaaa,aaaa }4. …5. …….
Summarization and visualization
Temporal rules → Summarized rules
Mining in 3 steps
1.Find all frequent patterns in each dimension along with all the occurrence.
2.Cluster the similar Patterns to form equivalence classes
3.Find temporal associations between these equivalence classes using iterative algorithm
Step 1: Finding Frequent Patterns
1. The data in each dimension is quantized to form a string (equal frequency, equal interval, SAX, Persist, etc.) [8]
2. An enhanced suffix tree is constructed for this string using an O(n) algorithm [7]
3. All frequent patterns, along with their locations, are enumerated in linear time by a complete traversal of the tree
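The quantization in step 1 can be sketched with the simplest of the listed choices, equal-interval binning (a sketch only; SAX or Persist would replace this function, and the names are illustrative):

```python
def discretize(series, alphabet="abcdefg"):
    """Equal-interval quantization of a numeric series into a symbol
    string - one simple choice among equal-frequency, SAX, Persist, etc."""
    lo, hi = min(series), max(series)
    width = (hi - lo) / len(alphabet) or 1.0   # guard against a constant series
    out = []
    for v in series:
        k = min(int((v - lo) / width), len(alphabet) - 1)   # clamp the max value
        out.append(alphabet[k])
    return ''.join(out)
```

For example, `discretize([0, 1, 6.9, 7], "abcdefg")` maps the values into the seven equal-width bins and returns `"abgg"`.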
Step 1: Enhanced Suffix Tree
Enhanced Suffix Tree Construction Algorithm
• The frequent patterns are clustered together using string similarity measures
• Any reasonable string similarity measure and algorithm can be chosen
• An LCS-based similarity measure with a sequential clustering algorithm is chosen for its robustness and efficiency
Step 2: Clustering frequent patterns
Step 2: string similarity measure
• Similarity measure: Sim(s1, s2) = 2 * LCS(s1, s2) / (|s1| + |s2|)
• Distance measure: Dist(s1, s2) = 1 - Sim(s1, s2)
• LCS(s1, s2) is the length of the longest common subsequence of s1 and s2
• |s| is the length of string s
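The similarity and distance measures translate directly into Python; a compact LCS helper is included so the sketch is self-contained:

```python
def lcs_length(a, b):
    """Row-by-row DP for the length of the longest common subsequence."""
    prev = [0] * (len(b) + 1)
    for ch in a:
        cur = [0]
        for j, cb in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if ch == cb else max(prev[j], cur[-1]))
        prev = cur
    return prev[-1]

def sim(s1, s2):
    """Sim(s1, s2) = 2 * LCS(s1, s2) / (|s1| + |s2|)."""
    return 2.0 * lcs_length(s1, s2) / (len(s1) + len(s2))

def dist(s1, s2):
    """Dist(s1, s2) = 1 - Sim(s1, s2)."""
    return 1.0 - sim(s1, s2)
```

Identical strings get similarity 1.0; for "abcd" vs "abed" the LCS is "abd", so Sim = 6/8 = 0.75 and Dist = 0.25.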
Step 2. Clustering
• Sequential clustering algorithm
• Frequent patterns are divided into overlapping buckets based on their length
• Clustering is done for the patterns in each bucket to reduce the clustering complexity
• Finally, a pattern that is a member of two clusters is assigned to the closest cluster
Step 2. Clustering
An example cluster with cluster center ccccddccccccc:
ccccdcccccccc
cccccddbccccc
ccccdcccccccd
cccccdcddcccc
cccccdcdccccc
ccccddccccccc
cccccddcccccc
dccccdccccccc
ccccccddccccc
cccccdccccccc
ccccccdccdccc
ccccccdcccccc
ccccccdcdcccc
[Figure: the cluster's member strings plotted as profiles over 15 time points, with values ranging 0-7]
Step 2. Clustering Algorithm
Step 3. Temporal Relations
• Only three relations are explored here
• "Overlap" here subsumes meets, starts-with, overlaps, and overlapped-by
Temporal Relations
• Temporal relationship concept extended to interval sequence by including support
Temporal Relations
Examples
Example of terminology
Step 3. Algorithm
Step 3. Example
Step 3. Example
Summarization
• Inference of higher-level or more general knowledge from lower-level temporal dependencies
• Summarization generalizes the temporal association rules by identifying the time windows in which a rule is applicable
• Measures for summarization are coverage, average length of coverage, and maximum coverage length
Results: Utility Dataset
Results: EEG Dataset
Scalable up to 1 million rows
Results: Great Lakes dataset
Results: Power Consumption dataset
daily patterns
Weekly patterns after summarization
Spatio-Temporal Clusters
Temporal Patterns
Time axis
Observed Value
Example Domains for Temporal Profiles:
- Gene Expression: each gene's expression level (1000s of genes)
- Utility Consumption: each day's/week's utility consumption (daily profiles over multiple years)
- Social Data: number of crime incidents per month
Goal:
Cluster together profiles with similar behavior
Problem:
How to define similar behavior?
Similar over:
- Complete period
- One interval
- Multiple Intervals
Spatial Patterns
Example Domains for Spatial Profiles:
- Social Data: number of crime incidents per area
- Scientific Data: pollution level in land
[Figure: points scattered in the x-y plane]
Goal:
Cluster together profiles close to each other
Spatial Closeness:
- Euclidean distance
- Chain connectivity
[Figure: point clusters formed by Euclidean closeness and by chain connectivity]
Spatio-Temporal Patterns
Goal: Find patterns in space-time dimensions for phenomena
Issues: Simultaneous handling of space and time dimensions
Problem: How to handle the large complexity of the algorithms
Phenomena in space-time
Examples:
Crime data:
- Each layer represents a month (the t-axis)
- The x, y axes represent space coordinates
Energy Consumption
Brewery data:
- The t-axis represents weeks
- The x, y axes represent Average Production and Average Temperature

[Figure: layers stacked along the time axis, with x = Average Production and y = Average Temperature]
Clustering Algorithms
• Temporal Clustering
• Spatial Clustering
• Spatio-Temporal Clustering
• Discrete Fourier Transform [Agarwal] - Pros: Index time sequences for similarity searching
- Cons: Not able to represent subsequence temporal concepts
• Rule discovery from Time Series [Das1998] - Pros: Windows clustered according to their similarity to determine temporal rules.
- Cons: Unable to cluster profiles according to subsequences of similarity
• PAM - Pros: Works well for small datasets
- Cons: Expensive
• CLARA
- Pros: Works well for large datasets
- Cons: Expensive
• CLARANS - Motivated by PAM and CLARA
• BIRCH & CURE - Suitable for large datasets, clusters found with one scan
• R-Trees
- Often used for indexing spatio-temporal data
- Overlapping causes backtracking in search
• MR-Tree & HR-Tree – use overlapping R-Trees to represent successive states of the database
  - If the number of objects moving from one time instant to another is large, the approach degenerates to independent tree structures with no common paths
Generalizing to multiple dimensions
• Two different approaches
  – Phenomenon moving in the same direction, i.e., parallel to the time axis
  – Phenomenon moving in different directions, i.e., N, NW, NE, S, SW, SE, W, E
• Task: classify the phenomena in time, then track them in space
  – STEP I: Discovery of temporal clusters
  – STEP II: Finding spatial clusters
Phenomenon moving in different directions
Goal: To discover subsequences of the complete temporal profile that are shared by a large number of profiles and are spatial neighbors
Modified Sequences Generated
b c
d b
c d
d c
a c
c c
a b
b c
There are 4 layers, so the final sequences formed are of length 7, i.e., 2n - 1, where n is the number of layers
Complexity
• Number of strings generated moving from Layer 1 to Layer 2
• For L layers, the number of strings generated as the phenomenon moves from layer 1 to layer L
Results
- Test data with 20 profiles
- Each profile observed at 30 time points

[Figure: profile values plotted against time]
Results (Brewery Data 1)
- Energy consumption data of size 51x14
- 51 profiles, each observed over a time period of 14 days

[Figure: energy consumption plotted against time]
Results (Brewery Data 2)
- Production data of size 51x14
- 51 profiles, each observed over a time period of 14 days
Results- Sequences generated from a test data with 7 layers, each layer of size 10X10
- 4480 strings generated
[Figure: observed values plotted against time for each profile index]
Sequences and Bioinformatics
Importance of Sequences
• DNA: sequence of letters A,C,G,T
• Proteins: sequence of 20 amino acids
• Central premise of bioinformatics
  – Sequence → structure → function
• Ultimate goal: predict function based on sequence
Strings and Sequences
Exact String Matching
Solutions to Exact Matching
• O(n+m) upper bound for the worst case
• Preprocess the pattern P
• Preprocess the text T
  – Suffix tree method: Weiner, Ukkonen and others
    • Building the tree: O(m)
    • Searching: O(n)
Suffix Trees
• T = cabca
• Suffixes are:
  – cabca
  – abca
  – bca
  – ca
  – a
• P = ca

Suffix tree for cabca$:
root ─┬─ ca ─┬─ bca$   (leaf 1)
      │      └─ $      (leaf 4)
      ├─ a ──┬─ bca$   (leaf 2)
      │      └─ $      (leaf 5)
      └─ bca$          (leaf 3)
Approximate String Matching
• Errors in the experiment or observation
• Redundancy in the biological system
  – Insertions/deletions
  – Mutations
The Task: Best Alignment
• Given strings
  – HEAGAWGHE
  – PAWHE
• Many alignments are possible:

HEAGAWGHE    HEAGAWGHE    HEAGAWGHE
P-A--W-HE    --P-AW-HE    -PA--W-HE
Scoring Function
• a score for a match:
  H E A G …
  X X A X …
• a score for a mismatch:
  H E A G …
  X X P X …
• a score for alignment against a gap:
  H E A G …
  X X _ X …
Algorithms
• Needleman-Wunsch (NW): Global Alignment
• Smith-Waterman (SW): Local Alignment
• Both are based on the idea of dynamic programming
All Possible Alignments
• Represented using a table
  – The first string, a1a2, is aligned along the top
  – The second string, b1b2, is aligned along the left side

[Figure: the lattice of partial-alignment cells with path counts 1, 1, 1, 3, 5, 13 - each of the 13 lattice paths corresponds to one possible alignment of a1a2 and b1b2, e.g. a1b1a2b2 or b1b2a1a2]
Main idea of DP
• Choices:
  – Align a3 to b3: s(a3,b3) = -5, score: 5 - 5 = 0
  – Align a3 to a gap: s(a3,-) = -8, score: -3 - 8 = -11
  – Align b3 to a gap: s(-,b3) = -8, score: 12 - 8 = 4

[Figure: DP table over a1a2a3 (top) and b1b2b3 (left) with neighboring cell values 5, 12, -3 and the winning choice 4]
Main idea of DP
F(i-1,j-1)   F(i-1,j)
F(i,j-1)     F(i,j)
• Score is sum of independent piecewise scores
• Global alignment (Needleman-Wunsch):
  – F(0,0) = 0;  F(k,0) = F(0,k) = -kd
  – F(i,j) = max{ F(i-1,j-1) + s(a_i,b_j),  F(i-1,j) - d,  F(i,j-1) - d }
• Local alignment (Smith-Waterman):
  – F(0,0) = 0;  F(k,0) = F(0,k) = 0
  – F(i,j) = max{ 0,  F(i-1,j-1) + s(a_i,b_j),  F(i-1,j) - d,  F(i,j-1) - d }
Overview of DP
• Start from the upper left corner
• Assign scores for the leading gaps
• Iteratively fill in the scores in the table
• Find the maximum score, bottom-right for global alignment
• Back-track to the upper-left to find the best overall alignment
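The global-alignment recurrence sketched above translates into a few lines of Python. This is a minimal sketch with a toy match/mismatch/gap scoring scheme, not the Blosum50 matrix used in the tables that follow, and it returns only the final score (back-tracking to recover the alignment is omitted):

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment score via the Needleman-Wunsch recurrence
    (toy linear-gap scoring; returns F(m, n) only)."""
    m, n = len(a), len(b)
    F = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        F[i][0] = i * gap                      # leading gaps in b
    for j in range(1, n + 1):
        F[0][j] = j * gap                      # leading gaps in a
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,   # align a_i with b_j
                          F[i - 1][j] + gap,     # a_i against a gap
                          F[i][j - 1] + gap)     # b_j against a gap
    return F[m][n]
```

For example, aligning "GA" with "GTA" under this scheme gives G-A / GTA: two matches and one gap, for a score of 1 - 2 + 1 = 0.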
Pairwise score from Blosum50
H E A G A W G H E
P -2 -1 -1 -2 -1 -4 -2 -2 -1
A -2 -1 5 0 5 -3 0 -2 -1
W -3 -3 -3 -3 -3 15 -3 -3 -3
H 10 0 -2 -2 -2 -3 -2 10 0
E 0 6 -1 -3 -1 -3 -3 0 6
DP Table for global alignment
H E A G A W G H E
0 <-8 <-16 <-24 <-32 <-40 <-48 <-56 <-64 <-72
P ^-8 *-2 *-9 *-17 <-25 *-33 <-41 <-49 <-57 *-65
A ^-16 ^-10 *-3 *-4 <-12 *-20 <-28 <-36 <-44 <-52
W ^-24 ^-18 ^-11 *-6 *-7 *-15 *-5 <-13 <-21 <-29
H ^-32 *-14 *-18 *-13 *-8 *-9 ^-13 *-7 *-3 <-11
E ^-40 ^-22 *-8 <-16 ^-16 *-9 *-12 ^-15 *-7 *3
HEAGAWGHE
--P-AW-HE
DP Table for local alignment
H E A G A W G H E
0 0 0 0 0 0 0 0 0 0
P 0 0 0 0 0 0 0 0 0 0
A 0 0 0 *5 0 *5 0 0 0 0
W 0 0 0 0 *2 0 *20 <12 <4 0
H 0 *10 <2 0 0 0 ^12 *18 *22 <14
E 0 ^2 *16 <8 0 0 ^4 ^10 *18 *28
AWGHE
AW-HE
Multiple Alignment
• Complexity of DP: O(n^2)
• For k strings: O(n^k)
• Explore other options
  – Hidden Markov Models
Markov Chain Models
• Similar to finite automata
• Emits sequences with certain probability
[Figure: Markov chain over states A, C, G, T emitting sequences such as ATGT…]
Chain rule: P(ATGT) = P(A|TGT) P(T|GT) P(G|T) P(T)
First-order Markov assumption: P(ATGT) = P(A|T) P(T|G) P(G|T) P(T)
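The Markov-chain probability is a simple product; a sketch in Python, using the more common left-to-right factorization P(x1) * prod P(x_i | x_{i-1}) (the slide factors the same joint from the right instead):

```python
def chain_prob(seq, init, trans):
    """Probability of `seq` under a first-order Markov chain:
    P(x1) * product over i of P(x_i | x_{i-1})."""
    p = init[seq[0]]
    for prev, cur in zip(seq, seq[1:]):
        p *= trans[prev][cur]   # one transition probability per adjacent pair
    return p
```

With uniform 0.25 initial and transition probabilities over {A, C, G, T}, any length-4 sequence such as ATGT has probability 0.25^4.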
Hidden Markov Models
• Generalization of Markov Models
• Hidden states that emit sequences
[Figure: HMM with island states A*, C*, G*, T* and non-island states A, C, G, T]

Adding four more states (A*, C*, G*, T*) to represent the "island" model, as opposed to the non-island model, with unlikely transitions between the two models, one obtains a "hidden" MM for CpG islands.
HMMs for Gene Prediction
HMMs & Supervised Learning
• Input: a training set of aligned sequences
• Find optimal transition and emission probabilities
• Criteria: maximize probability of observing the training sequences
• Algorithms
  – Baum-Welch (Expectation Maximization)
  – Viterbi training algorithm
Recognition Phase
• We have optimized probabilities
• Predict the likelihood of a sequence belonging to a family
  – What's the probability that a sequence is generated by the HMM?
A Simple HMM Model
Beg Mj End… …
Example
Training sequences:
AGAAACT
AGGAATT
TGAATCT

P(AGAAACT) = 16/81
P(TGGATTT) = 1/81

     1    2    3    4    5    6    7
A   2/3   0   2/3   1   2/3   0    0
T   1/3   0    0    0   1/3  1/3   1
C    0    0    0    0    0   2/3   0
G    0    1   1/3   0    0    0    0

Each blue square represents a match state that "emits" each letter with a certain probability e_j(a), which is defined by the frequency of a at position j.
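The emission table above can be checked directly. A sketch using exact fractions (the table values are from the slide; the function name is illustrative):

```python
from fractions import Fraction as Fr

# Match-state emission probabilities e_j(a) from the slide's table
# (list index j = position 1..7)
E = {
    'A': [Fr(2, 3), Fr(0), Fr(2, 3), Fr(1), Fr(2, 3), Fr(0),    Fr(0)],
    'T': [Fr(1, 3), Fr(0), Fr(0),    Fr(0), Fr(1, 3), Fr(1, 3), Fr(1)],
    'C': [Fr(0),    Fr(0), Fr(0),    Fr(0), Fr(0),    Fr(2, 3), Fr(0)],
    'G': [Fr(0),    Fr(1), Fr(1, 3), Fr(0), Fr(0),    Fr(0),    Fr(0)],
}

def emit_prob(s):
    """Probability that the seven match states emit string s."""
    p = Fr(1)
    for j, ch in enumerate(s):
        p *= E[ch][j]   # multiply the emission probability at each position
    return p
```

This reproduces the slide's values: P(AGAAACT) = (2/3)^4 = 16/81 and P(TGGATTT) = (1/3)^4 = 1/81.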
Insertions…
Insert states emit symbols just like the match states; however, the emission probabilities are typically assumed to follow the background distribution and thus do not contribute to log-odds scores.

Transitions Ij -> Ij are allowed and account for an arbitrary number of inserted residues that are effectively unaligned (their order within an inserted region is arbitrary).
Beg Mj End
Ij
… and Deletions
Beg Mj End
Dj
Deletions are represented by silent states which do not emit any letters. A sequence of deletions (with D -> D transitions) may be used to connect any two match states, accounting for segments of the multiple alignment that are not aligned to any symbol in a query sequence (string).

The total cost of a deletion is the sum of the costs of the individual transitions (M->D, D->D, D->M) that define the deletion. As in the case of insertions, both linear and affine gap penalties can be easily incorporated in this scheme.
HMMs for Multiple Alignment
Beg Mj End
Ij
Dj
Example
AG---C
A-AG-C
AG-AA-
--AAAC
AG---C
**   *
Emission and Transition Counts

Match emissions:
     C0   C1   C2   C3
A    -    4    0    0
C    -    0    0    4
G    -    0    3    0
T    -    0    0    0

Alignment with insert columns marked by dots:
AG...C
A-AG.C
AGAA.-
--AAAC
AG...C

Insert emissions:
     C0   C1   C2   C3
A    0    0    6    0
C    0    0    0    0
G    0    0    1    0
T    0    0    0    0

[Figure: Beg-M-I-D profile-HMM diagram annotated with transition counts]
Results
Training File 2
Validation File 2
Bibliography
1. G. Das, K. Lin, H. Mannila, G. Renganathan and P. Smyth, "Rule Discovery from Time Series," in Knowledge Discovery and Data Mining, 1998, pp. 16-22.
2. H. Mannila, "Discovery of Frequent Episodes in Event Sequences," Data Mining and Knowledge Discovery, vol. 1, pp. 259, 1997.
3. P. Kam and A. W. Fu, "Discovering Temporal Patterns for Interval-Based Events," in Second International Conference on Data Warehousing and Knowledge Discovery (DaWaK 2000), Lecture Notes in Computer Science, 2000, pp. 317-326.
4. J. F. Roddick and E. Winarko, "Discovering Richer Temporal Association Rules from Interval-based Data: Extended Report," School of Informatics and Engineering, Flinders University, Adelaide, Australia, Tech. Rep. SIE-05-003, March 2005.
5. F. Mörchen and A. Ultsch, "Discovering Temporal Knowledge in Multivariate Time Series," in GfKl Dortmund, 2004.
6. F. Höppner, "Learning Temporal Rules from State Sequences," in Proceedings of the IJCAI Workshop on Learning from Temporal and Spatial Data, Seattle, USA, 2001, pp. 25-31.
7. E. Ukkonen, "On-Line Construction of Suffix Trees," Algorithmica, vol. 14, pp. 249-260, 1995.
8. C. S. Daw, C. E. A. Finney and E. R. Tracy, "A Review of Symbolic Analysis of Experimental Data," Rev. Sci. Instrum., vol. 74, pp. 915-930, Feb. 2003.
9. S. Theodoridis and K. Koutroumbas, Pattern Recognition, San Diego: Elsevier Academic Press, 2003.