Algorithms for Discovering Patterns in Sequences Raj Bhatnagar University of Cincinnati


Post on 02-Feb-2016



Page 1: Algorithms for Discovering Patterns in Sequences

Algorithms for Discovering Patterns in Sequences

Raj Bhatnagar

University of Cincinnati

Page 2: Algorithms for Discovering Patterns in Sequences

Outline

• Applications of Sequence Mining
  – Genomic Sequences
  – Engineering/Scientific Data Analysis
  – Market Basket Analysis

• Algorithm Goals
  – Temporal Patterns
  – Interesting Frequent Subsequences
  – Interesting Frequent Substrings (with mutations)
  – Detection of Outlier Sequences

Page 3: Algorithms for Discovering Patterns in Sequences

Why Mine a Dataset?

Discover Patterns:

– Associations
  • A subsequence pattern follows another
  • Help make a prediction

– Clusters
  • Typical modes of operation

– Temporal Dependencies
  • Events related in time

– Spatial Dependencies
  • Events related in space

Page 4: Algorithms for Discovering Patterns in Sequences

Example Problems

From various domains:
– Engineering
– Scientific
– Genomic
– Business

Page 5: Algorithms for Discovering Patterns in Sequences

Energy Consumption

Page 6: Algorithms for Discovering Patterns in Sequences

Gene Expression over Time

Page 7: Algorithms for Discovering Patterns in Sequences

Multivariate Time Series

• Multivariate time series data is a series of multiple attributes observed over a period of time at equal intervals

• Examples: stocks, weather, utility, sales, scientific, sensor monitoring and bioinformatics

Page 8: Algorithms for Discovering Patterns in Sequences

Spatio-Temporal Patterns

Goal:

Find patterns in space-time dimensions for phenomena

Issues:

Simultaneous handling of space and time dimensions

Problem:

How to handle large complexity of algorithms

Phenomena in space-time

Page 9: Algorithms for Discovering Patterns in Sequences

Discover Interesting Subsequences

actgatAAAAAAAAGGGGGGGggcgtacacattagCTGATTCCAATACAGacgt

aaAAAAAAAAGGGGGGGaaacttttccgaataCTGATTCCAATACAGgatcagt

atgacttAAAAAAAAGGGGGGGtgctctcccgattttcCTGATTCCAATACAGc

aggAAAAAAAAGGGGGGGagccctaacggacttaatCCTGATTCCAATACAGta

ggaggAAAAAAAAGGGGGGGagccctaacggacttaatCCTGATTCCAATACAG

AAAAAAAAGGGGGGG-(10,15)-CTGATTCCAATACAG

The pattern sequences (shown in blue on the slide) may have up to k substitutions
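The gapped pattern above (two motifs separated by 10 to 15 characters, each allowed up to k substitutions) can be searched for with a simple scan. The following is a minimal sketch, not the method used in the talk; the function names `hamming` and `find_gapped` are hypothetical and the comparison is case-sensitive by assumption.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def find_gapped(seq, left, right, min_gap, max_gap, k=0):
    """Return (left_start, right_start) pairs where `left` and `right` each
    match with at most k substitutions and are separated by a gap of
    min_gap..max_gap characters."""
    hits = []
    for i in range(len(seq) - len(left) + 1):
        if hamming(seq[i:i + len(left)], left) > k:
            continue
        left_end = i + len(left)
        for gap in range(min_gap, max_gap + 1):
            j = left_end + gap
            if j + len(right) > len(seq):
                break  # the right motif no longer fits in the sequence
            if hamming(seq[j:j + len(right)], right) <= k:
                hits.append((i, j))
    return hits


exact = find_gapped("ttAAAGGcccccCTGAtt", "AAAGG", "CTGA", 4, 6)
mutated_k0 = find_gapped("ttAAAGGcccccCTCAtt", "AAAGG", "CTGA", 4, 6, k=0)
mutated_k1 = find_gapped("ttAAAGGcccccCTCAtt", "AAAGG", "CTGA", 4, 6, k=1)
```

For the slide's example one would call `find_gapped(seq, "AAAAAAAAGGGGGGG", "CTGATTCCAATACAG", 10, 15, k)` on each sequence. The scan is quadratic in the worst case, so it only illustrates the pattern definition.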

Page 10: Algorithms for Discovering Patterns in Sequences

Main Sequence Mining Tasks

• Finding frequent patterns
• Doing similarity search
• Clustering of time series
• Periodicity detection
• Finding temporal associations
• Summarization
• Prediction

Page 11: Algorithms for Discovering Patterns in Sequences

Finding Frequent Substrings

abacebc . abcdobd . abaaebd . eoaoobd . abceoad .

The GST generated by considering the substrings that start at location #0 of each string is:

(0) <root>
  - (8) ab
      - (10) c
          - (11) dobd - (15) 8
          - (35) eoad - (39) 32
      - (18) a
          - (19) aebd - (23) 16
          - (3) cebc - (7) 0
  - (24) eoaoobd - (31) 24

Substring s   Length size(s)   Position pos   Set P          Cardinality |P|
aba           3                0              {0, 2}         2
abc           3                0              {1, 4}         2
ab            2                0              {0, 1, 2, 4}   4

Two Main Approaches:
• Generalized Suffix Trees (Linear Time)
• Find Longest Common Substrings (Linear Time)
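The table on this slide (substring, position, profile set P and its cardinality) can be reproduced with a naive dictionary-based scan. This is only an illustrative sketch: it is quadratic in the string length, whereas the generalized suffix tree named on the slide is the linear-time approach. The function name `substrings_at` and the `min_support` parameter are assumptions made for the example.

```python
from collections import defaultdict


def substrings_at(strings, pos, min_support=2):
    """Map each substring starting at `pos` to the set P of indices of the
    strings that contain it at that position; keep sets with |P| >= min_support."""
    occurrences = defaultdict(set)
    for idx, s in enumerate(strings):
        for end in range(pos + 1, len(s) + 1):
            occurrences[s[pos:end]].add(idx)
    return {sub: p for sub, p in occurrences.items() if len(p) >= min_support}


strings = ["abacebc", "abcdobd", "abaaebd", "eoaoobd", "abceoad"]
at_0 = substrings_at(strings, 0)   # substrings starting at location #0
at_1 = substrings_at(strings, 1)   # same strings with the first character skipped
```

Besides the three rows listed on the slide, `at_0` also contains the single-character prefix "a", which shares its profile set with "ab".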

Page 12: Algorithms for Discovering Patterns in Sequences

Finding Frequent Substrings

• Recursively generate the GST with each string truncated by removing the first character

Original strings:

abacebc . abcdobd . abaaebd . eoaoobd . abceoad .

Truncated strings:

bacebc . bcdobd . baaebd . oaoobd . bceoad .

(0) <root> - (8) b - (10) dobd - (14) 8 - (31) eoad - (35) 29 - (16) a - (17) aebd - (21) 15 - (3) cebc - (7) 1 - (22) oaoobd - (28) 24

s     |s|   pos   set P          |P|
ba    2     1     {0, 2}         2
bc    2     1     {1, 4}         2
b     1     1     {0, 1, 2, 4}   4

Page 13: Algorithms for Discovering Patterns in Sequences

Results after Phase-I

• Substrings generated after Phase-I are:

Page 14: Algorithms for Discovering Patterns in Sequences

Phase2: Subsequence Hypotheses

Result of Phase 1:

string     size(s)   profiles(G)   size(G)
ABa----    3         {0,2}         2
AB-----    2         {0,1,2}       3
--a----    1         {0,2,3}       3
----*bd    3         {1,3}         2
-----bd    2         {1,2,3}       3

Result after Phase 2:

subsequence   size(T)   profiles(R)   size(R)
AB---bd       4         {1,2}         2
--a--bd       3         {2,3}         2

Page 15: Algorithms for Discovering Patterns in Sequences

Merging Profile Sets

• Two sets of profiles are similar and will be merged together if:

\[ \frac{|P_1 \cap P_2|}{|P_1 \cup P_2|} \ge \text{threshold} \]

where P1 and P2 are any two profile sets and the threshold is user-defined
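The merge criterion is the ratio of intersection size to union size (the Jaccard index) compared against the user-defined threshold. A minimal sketch, with the hypothetical function name `should_merge` and the assumption that two empty sets are never merged:

```python
def should_merge(p1: set, p2: set, threshold: float) -> bool:
    """True if |P1 ∩ P2| / |P1 ∪ P2| is at least the threshold."""
    union = p1 | p2
    if not union:          # assumption: two empty profile sets are not merged
        return False
    return len(p1 & p2) / len(union) >= threshold


# Profile sets of "ABa----" and "AB-----" from the Phase 1 table: ratio 2/3.
merge_loose = should_merge({0, 2}, {0, 1, 2}, threshold=0.6)
merge_strict = should_merge({0, 2}, {0, 1, 2}, threshold=0.7)
```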

Page 16: Algorithms for Discovering Patterns in Sequences

Merged Substrings

Page 17: Algorithms for Discovering Patterns in Sequences

Generalizing Substring Patterns

• Core subsequences can be written in the form of a regular expression (regex)

• Each symbol is considered replaceable by the preceding or following letter of the alphabet

E.g. for the substring

aba-eb--

the regex will be [ab]{1}[abc]{1}[ab]{1}.{1}[def]{1}[abc]{1}.{2}
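The generalization rule can be sketched as a small function that widens every symbol to its alphabet neighbours and turns runs of gaps into wildcards. The name `generalize`, the use of "-" as the gap symbol and the lowercase alphabet are assumptions for this illustration.

```python
import re
import string


def generalize(pattern: str, alphabet: str = string.ascii_lowercase) -> str:
    """Turn a core subsequence such as 'aba-eb--' into a regex in which every
    symbol may be replaced by the preceding or following letter."""
    parts = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "-":
            run = 1                      # length of the run of gap symbols
            while i + run < len(pattern) and pattern[i + run] == "-":
                run += 1
            parts.append(".{%d}" % run)
            i += run
        else:
            k = alphabet.index(ch)
            neighbours = alphabet[max(k - 1, 0):k + 2]
            parts.append("[%s]{1}" % neighbours)
            i += 1
    return "".join(parts)


regex = generalize("aba-eb--")
matches_variant = re.fullmatch(regex, "bcaxfcyz") is not None
```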

Page 18: Algorithms for Discovering Patterns in Sequences

Generalization of Sequence Patterns

• Hypothesis T specifies a core subsequence shared by the set of profiles R

• Core subsequences and the set R are affected by various factors, such as loss of information

• T is considered the seed of the hypothesis

Page 19: Algorithms for Discovering Patterns in Sequences

Cluster #1 from Utility Dataset

Page 20: Algorithms for Discovering Patterns in Sequences

Cluster #2 from Utility Dataset

Page 21: Algorithms for Discovering Patterns in Sequences

What is Achieved

• Discover:
  – Partial-sequence temporal hypotheses
  – Identities of profiles for each temporal hypothesis

Page 22: Algorithms for Discovering Patterns in Sequences

Histone Cluster

Page 23: Algorithms for Discovering Patterns in Sequences

Histone Cluster

Fourier Results

ORF       Process               Function      Peak   Phase Order   Cluster Order
YBL002W   chromatin structure   histone H2B   S      424           789
YBL003C   chromatin structure   histone H2A   S      441           790
YBR009C   chromatin structure   histone H4    S      417           791
YBR010W   chromatin structure   histone H3    S      420           795
YDR224C   chromatin structure   histone H2B   S      432           794
YDR225W   chromatin structure   histone H2A   S      437           796
YNL030W   chromatin structure   histone H4    S      418           792
YNL031C   chromatin structure   histone H3    S      430           793
YPL127C   chromatin structure   histone H1    S      448           788

String-Based Results

YBL002W CC ab dccaCBBBabc*BCCCA* xcCBCbbcccACACCBBbaabbb* dcABCaaabbBCxBBBa cccbABCBBBBBab
YBL003C CC ab cdcbBCBBbacaBCCC** cbCCCAbcccaCCCCBA*bcbbaA bcABBb*ab*AxABBAa cb*baBBCBBBB*b
YBR009C xC *x ddcaCCBBaab*BCCB*a xcCCBbccccbCBCBBa*bbxaAB baB*AaA*bxdBaBBBB cccbABCCCBABAb
YBR010W BC aA ccc*CBBAbbc*BCCBAa bbCCB*ccdbbCBCCCAAxbcaAB bbA*Ab*bbAAxaBBB* bccbaBBBBB*Bab
YDR224C BC aa ccc*CCB*abbaBBCA*b cbCCB*cccbbCCCCCAAcccaAB caBAAaabbAAxaBBBA cccb*BBCBBBBAb
YDR225W CC *b cdcaCCC*abbbBCCBAa cbCCC*bccbaCCCBCBAbccbaB cbAA*a*bbAAxaBBBB bccbaBCBBBBB*b
YNL030W CC bb dccACCBAabb*CBCBAa cbCCBacccbbCCCCB*AbbcaAB b*AaAbabbBxBaA*BA cccb*BCCBBBB*b
YNL031C CC ab cccABCB*abb*BCCAAa bbCCB*bccbbBBCBC*Abbbb*B bAAAAbabbBAxaA*B* bcbb*BBCBAA**b
YPL127C Da bb dcaBBCBAabb*BBBBAa cbBCABbcdbaBABBCBBbcbaAB ccBBBB*bcbABBxAAb bccbABBCBBBA*a

Regular Expression

^.{4}[cdx]{2}.{3}[BCx]{1}.{3}[abx]{1}.{3}[BCDx].{7}[BCx]{1}.{3}[bcdx]{2}.{5}[BCx]{1}.{33}[BCx]{1}[ABCx]{1}.{5}$

Generalization

        CC-b--c---------BCC-----C---bcc----C-------b----A-----b-----------b-B--B----b
YBL002W CCabdccaCBBBabc*BCCCA*xcCBCbbcccACACCBBbaabbb*dcABCaaabbBCxBBBacccbABCBBBBBab
YBL003C CCabcdcbBCBBbacaBCCC**cbCCCAbcccaCCCCBA*bcbbaAbcABBb*ab*AxABBAacb*baBBCBBBB*b
YDR225W CC*bcdcaCCC*abbbBCCBAacbCCC*bccbaCCCBCBAbccbaBcbAA*a*bbAAxaBBBBbccbaBCBBBBB*b
YNL031C CCabcccABCB*abb*BCCAAabbCCB*bccbbBBCBC*Abbbb*BbAAAAbabbBAxaA*B*bcbb*BBCBAA**b
        ++x----x+++x---x+++xxx--+++x----x+x+++xx-x--xxxxx+x-x--x++x+x+x--x-x++++++xx-

5,6,10,14,18,26,30,31,37,71,72

Page 24: Algorithms for Discovering Patterns in Sequences

Localized Focus of Generalization = more genes included

Page 25: Algorithms for Discovering Patterns in Sequences

Example 2 with Yeast Cell Data

ORF Process Function Peak Phase Order Cluster Order

YBR009C chromatin structure histone H4 S 417 791

YBR010W chromatin structure histone H3 S 420 795

YDR224C chromatin structure histone H2B S 432 794

YDR225W chromatin structure histone H2A S 437 796

YLR300W cell wall biogenesis "exo-beta-1,3-glucanase" G1 381 756

YNL030W chromatin structure histone H4 S 418 792

YNL031C chromatin structure histone H3 S 430 793

Not classified as cell cycle related:

YBR106W, YBR118W, YBR189W, YBR206W, YCLX11W, YDL014W, YDL213C,

YDR037W, YDR134C, YGL148W, YKL009W, YLR449W, YNL110C, YPR163C

^.{46}[bcx]{1}.{1}[ABx]{1}[aA*x]{1}[AB*]{1}[abx]{1}[aA*x]{1}.{1}[abcx]{1}[ABx]{1}.{1}[ABCx]{1}[ab*x]{1}[ABx]{1}.{1}[ABCx]{1}.{15}$

Original Hypothesis

New Genes Included After Localized Generalization

New Regular Expression

Page 26: Algorithms for Discovering Patterns in Sequences

Find Longest Common Substrings

Page 27: Algorithms for Discovering Patterns in Sequences

Algorithms for LCS

• Substring: a contiguous sequence of characters in a string

• Subsequence: obtained by deleting zero or more symbols from a given string

abcdefghia

Substrings : cdefg, efgh, abcd

Subsequences: ade , cefhi, abc, aia
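The difference between the two definitions can be checked directly on the slide's example string. A short sketch, with hypothetical helper names:

```python
def is_substring(part: str, s: str) -> bool:
    """A substring must appear as one contiguous block."""
    return part in s


def is_subsequence(part: str, s: str) -> bool:
    """A subsequence keeps the order of symbols but may skip symbols of s."""
    remaining = iter(s)
    # `ch in remaining` consumes the iterator up to the first match,
    # so later symbols can only be found further to the right.
    return all(ch in remaining for ch in part)


s = "abcdefghia"
substring_checks = [is_substring(p, s) for p in ("cdefg", "efgh", "abcd")]
subsequence_checks = [is_subsequence(p, s) for p in ("ade", "cefhi", "abc", "aia")]
```

Every substring is also a subsequence, but not the other way round: "ade" is a subsequence of the example string without being a substring.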

Page 28: Algorithms for Discovering Patterns in Sequences

Longest Common Subsequence

• The LCS is a common subsequence of maximal length between two strings

String 1: abcdabcefghijk

String 2: xbcaghaehijk

LCS = bcaehijk, Length of LCS = 8

Page 29: Algorithms for Discovering Patterns in Sequences

Finding LCS

• Brute force has exponential time complexity in the length of the string

• Dynamic programming can find the LCS in O(mn) time and space

• The length of the LCS can be found in O(min(m,n)) space and O(mn) time
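The O(min(m,n)) space bound follows from the fact that each row of the dynamic programming table depends only on the previous row. A minimal sketch of that idea (the function name is chosen for this example), checked against the two strings from the earlier LCS slide:

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence using two rows of the DP table."""
    if len(a) < len(b):
        a, b = b, a                      # keep the shorter string along the row
    prev = [0] * (len(b) + 1)
    for ch in a:
        curr = [0]
        for j, bj in enumerate(b, start=1):
            if ch == bj:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


length = lcs_length("abcdabcefghijk", "xbcaghaehijk")
```

Only the length is recovered this way; reconstructing the subsequence itself needs the full table or a divide-and-conquer scheme.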

Page 30: Algorithms for Discovering Patterns in Sequences

Main Sequence Mining Tasks

• Finding frequent patterns
• Doing similarity search
• Clustering of time series
• Periodicity detection
• Finding temporal associations
• Summarization
• Prediction

Page 31: Algorithms for Discovering Patterns in Sequences

Finding the LCS Length

Recursive Formulation:

\[
LCS[i,j] =
\begin{cases}
0 & \text{if } i = 0 \text{ or } j = 0 \\
LCS[i-1,j-1] + 1 & \text{if } i,j > 0 \text{ and } a_i = b_j \\
\max(LCS[i,j-1],\ LCS[i-1,j]) & \text{if } i,j > 0 \text{ and } a_i \ne b_j
\end{cases}
\]

Page 32: Algorithms for Discovering Patterns in Sequences

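Finding the LCS Length

Recursive Formulation:

\[
LCS[i,j] =
\begin{cases}
0 & \text{if } i = 0 \text{ or } j = 0 \\
LCS[i-1,j-1] + 1 & \text{if } i,j > 0 \text{ and } a_i = b_j \\
\max(LCS[i,j-1],\ LCS[i-1,j]) & \text{if } i,j > 0 \text{ and } a_i \ne b_j
\end{cases}
\]

An iterative (bottom-up) solution is more efficient than naive recursion, which recomputes the same subproblems many times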

Page 33: Algorithms for Discovering Patterns in Sequences

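Finding the LCS Length: Algorithm

#include <string.h>
#define MAXLEN 1000

/* L[i][j] holds the LCS length of the suffixes A[i..] and B[j..];
   it keeps the intermediate values of the dynamic programming */
static int L[MAXLEN + 2][MAXLEN + 2];

/* A is a '$'-terminated string with length m, B is another '$'-terminated
   string with length n (lengths exclude the '$'; n <= m <= MAXLEN) */
int lcs_length(const char *A, const char *B) {
    int m = (int)strlen(A) - 1, n = (int)strlen(B) - 1;
    for (int i = m; i >= 0; i--)
        for (int j = n; j >= 0; j--) {
            if (A[i] == '$' || B[j] == '$') L[i][j] = 0;   /* end of strings */
            else if (A[i] == B[j]) L[i][j] = 1 + L[i+1][j+1];
            else L[i][j] = L[i+1][j] > L[i][j+1] ? L[i+1][j] : L[i][j+1];
        }
    return L[0][0];
}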

Page 34: Algorithms for Discovering Patterns in Sequences

Sequential Clustering

• Clustering is partitioning the data into equivalence classes

• Each data item is presented to the algorithm only once or a few times

• The resulting classification depends on the order in which the data is presented

• Simple and fast for large data

Page 35: Algorithms for Discovering Patterns in Sequences

Basic Sequential Clustering

Page 36: Algorithms for Discovering Patterns in Sequences

Sequential Clustering with Buckets

Page 37: Algorithms for Discovering Patterns in Sequences

Main Sequence Mining Tasks

• Finding frequent patterns
• Doing similarity search
• Clustering of time series
• Periodicity detection
• Biological Sequence Problems

Page 38: Algorithms for Discovering Patterns in Sequences

Multivariate Time Series

• Multivariate time series data is a series of multiple attributes observed over a period of time at equal intervals

• Examples: stocks, weather, utility, sales, scientific, sensor monitoring and bioinformatics

Page 39: Algorithms for Discovering Patterns in Sequences

Time Series Analysis Tasks

• Finding frequent patterns
• Doing similarity search
• Clustering of time series
• Periodicity detection
• Finding temporal associations
• Prediction

Page 40: Algorithms for Discovering Patterns in Sequences

Why temporal association rules?

• More information about correlations between frequent patterns

• Contains richer information than knowledge of the frequent patterns alone

• Helps to build diagnostic and prediction tools

Page 41: Algorithms for Discovering Patterns in Sequences

Finding temporal associations: recent work

• Mannila - Discovery of Frequent Episodes in Event Sequences [2], 1997

• Das - Rule Discovery from Time Series [1], 1998

• Kam - Discovering Temporal Patterns for Interval-Based Events [3], 2000

• Roddick - Discovering Richer Temporal Association Rules from Interval-based Data [4], 2004

• Mörchen - Discovering Temporal Knowledge in Multivariate Time Series [5], 2004

Page 42: Algorithms for Discovering Patterns in Sequences

Research Issues

• Finding a richer set of temporal relationships {contains, follows, overlaps, meets, equals …} than sequence mining does {follows}

• Robustness of rules - room for noise in patterns

• Understanding temporal relationships at different levels of abstraction

• Efficient algorithms to find patterns with noise and temporal associations

Page 43: Algorithms for Discovering Patterns in Sequences

Temporal Association Rules

Frameworks

• Kam [3] uses Allen's temporal relations to find rules

• Roddick [4] uses a state sequence framework similar to Höppner [6]

• Mörchen [5] is based on the Unification-based Temporal Grammar

Page 44: Algorithms for Discovering Patterns in Sequences

Allen’s relationships

[Figure: Allen's interval relationships, taken from [1]]
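Since the figure is not reproduced, the thirteen relations of Allen's interval algebra can be sketched as a classifier over two intervals. This follows the standard definitions rather than the figure itself; the function name and the representation of an interval as a `(start, end)` pair with `start < end` are assumptions.

```python
def allen_relation(a, b):
    """Return the Allen relation of interval a to interval b.
    Each interval is a (start, end) pair with start < end."""
    (s1, e1), (s2, e2) = a, b
    if e1 < s2:
        return "before"
    if e2 < s1:
        return "after"
    if e1 == s2:
        return "meets"
    if e2 == s1:
        return "met-by"
    if s1 == s2 and e1 == e2:
        return "equals"
    if s1 == s2:
        return "starts" if e1 < e2 else "started-by"
    if e1 == e2:
        return "finishes" if s1 > s2 else "finished-by"
    if s2 < s1 and e1 < e2:
        return "during"
    if s1 < s2 and e2 < e1:
        return "contains"
    # The intervals intersect properly and neither contains the other.
    return "overlaps" if s1 < s2 else "overlapped-by"


relations = [
    allen_relation((1, 3), (4, 6)),
    allen_relation((1, 3), (3, 5)),
    allen_relation((1, 4), (3, 6)),
    allen_relation((1, 6), (2, 4)),
]
```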

Page 45: Algorithms for Discovering Patterns in Sequences

What are A1 rules?

\[ A_1 \; rel_1 \; A_2 \; rel_2 \; A_3 \; rel_3 \; A_4 \; \dots \; rel_{k-1} \; A_k \]

Page 46: Algorithms for Discovering Patterns in Sequences

Problem

Given a multivariate time series, a minimum support (number of occurrences) and a minimum pattern length, find all the temporal association rules (similar to A1) of size less than k

Page 47: Algorithms for Discovering Patterns in Sequences

Multivariate time series sequences

Step: dimensionality reduction, discretization and symbolic representation

aaaeffdaaaaaaaaacccaaaaaaaaaaedefggcbabaacfgfc…
dccbbccdccdeedcdcdeecccdddeeecdeccccddegedbcdc…
dddcbbcfffeeegffcbcdeeefffffecbbaaaaaacffecbbb…

Step: frequent pattern enumeration

Frequent patterns:
{aaaaaaaa, bbbaaa, bbaaa, eeef, aaac, eaaa ...}
{cdee, dccde, deed, ddeeec, eeec, ccccc,edddd …}
{fff,fffe,aaaaa,fecbb,ddcb,bbbbc,bbbbc,cbbb…}

Step: clustering

Clusters:
{aaaaaaaa} {bbba,bba,..} {eeef…} {aaac,eaaa …}…
{cdee}{dccde…}{deed,ddeeec,eeec…}{ccccc…}{edddd…}…
{fff,fffe…}{fecbb…}{ddcb…}{bbbbc,bbbbc,cbbb…} …

Step: temporal association rule discovery

Temporal rules:
1. {aaaa,aaac,aaae,abaa…} followed by {cbb,ccbb,cccbb…}
2. {bbbbb,cbbbbbc,…} followed by {baaaa,aaaaaaa,aaaaaa…} overlaps {aaaa,aaac,aaae…}
3. {dccccdd,ccccdd} contains {aaaaa,aaaa}
4. …
5. …

Step: summarization and visualization

Summarized rule

Page 48: Algorithms for Discovering Patterns in Sequences

Mining in 3 steps

1. Find all frequent patterns in each dimension, along with all their occurrences.

2. Cluster similar patterns to form equivalence classes.

3. Find temporal associations between these equivalence classes using an iterative algorithm.

Page 49: Algorithms for Discovering Patterns in Sequences

Step 1: Finding Frequent Patterns

1. The data in each dimension is quantized to form a string (equal frequency, equal interval, SAX, Persist, etc.) [8].

2. An enhanced suffix tree is constructed for this string using an O(n) algorithm [7].

3. All frequent patterns, along with their locations, are enumerated in linear time by a complete traversal of the tree.
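The first and third points can be illustrated with a small sketch: equal-interval quantization of one dimension into a symbol string, followed by enumeration of frequent patterns with their locations. The enumeration here is a naive substring count, used only as a stand-in for the enhanced suffix tree, and the function names and parameters are hypothetical.

```python
from collections import defaultdict


def equal_interval_symbols(values, n_bins, alphabet="abcdefghijklmnopqrstuvwxyz"):
    """Quantize numeric values into n_bins equal-width intervals, one letter each."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0    # avoid a zero width for constant data
    return "".join(
        alphabet[min(int((v - lo) / width), n_bins - 1)] for v in values
    )


def frequent_patterns(text, min_len, max_len, min_support):
    """Map each substring of length min_len..max_len that occurs at least
    min_support times to the list of its start positions."""
    positions = defaultdict(list)
    for length in range(min_len, max_len + 1):
        for i in range(len(text) - length + 1):
            positions[text[i:i + length]].append(i)
    return {p: locs for p, locs in positions.items() if len(locs) >= min_support}


symbols = equal_interval_symbols([0, 1, 2, 3, 4, 5], n_bins=3)
patterns = frequent_patterns("abcabcab", min_len=2, max_len=3, min_support=2)
```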

Page 50: Algorithms for Discovering Patterns in Sequences

Step 1: Enhanced Suffix Tree

Page 51: Algorithms for Discovering Patterns in Sequences

Enhanced Suffix Tree Construction Algorithm

Page 52: Algorithms for Discovering Patterns in Sequences

Step 2: Clustering frequent patterns

• The frequent patterns are clustered together using string similarity measures.

• Any reasonable string similarity measure and algorithm can be chosen.

• An LCS-based similarity measure along with a sequential clustering algorithm is chosen because of its robustness and efficiency.

Page 53: Algorithms for Discovering Patterns in Sequences

Step 2: string similarity measure

• Similarity measure:

\[ Sim(s_1, s_2) = \frac{2 \cdot LCS(s_1, s_2)}{|s_1| + |s_2|} \]

• Distance measure: Dist(s1, s2) = 1 - Sim(s1, s2)

• LCS(s1, s2) is the length of the longest common subsequence

• |s| is the length of string s
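The similarity and distance measures translate directly into code. A minimal sketch; the helper names are chosen for this example and two empty strings are assumed to have similarity 1:

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence (row-by-row dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for ch in a:
        curr = [0]
        for j, bj in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if ch == bj else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def sim(s1: str, s2: str) -> float:
    """Sim(s1, s2) = 2 * LCS(s1, s2) / (|s1| + |s2|)."""
    if not s1 and not s2:
        return 1.0                       # assumption for the degenerate case
    return 2 * lcs_length(s1, s2) / (len(s1) + len(s2))


def dist(s1: str, s2: str) -> float:
    """Dist(s1, s2) = 1 - Sim(s1, s2)."""
    return 1 - sim(s1, s2)


example_sim = sim("abcd", "abxd")        # LCS "abd" has length 3
```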

Page 54: Algorithms for Discovering Patterns in Sequences

Step 2. Clustering

• Sequential clustering Algorithm

• Frequent patterns are divided into overlapping buckets based on their length

• Clustering is done for the patterns in each bucket to reduce the clustering complexity

• Finally, a pattern that ends up as a member of two clusters is assigned to the closest one
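A basic sequential clustering pass with the LCS-based distance can be sketched as follows. This is an illustration under stated assumptions, not the algorithm from the slides: the first member of a cluster acts as its center, a pattern joins the nearest cluster if its distance is within a user-chosen `threshold`, and the length-based buckets are left out.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence."""
    prev = [0] * (len(b) + 1)
    for ch in a:
        curr = [0]
        for j, bj in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if ch == bj else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def dist(s1: str, s2: str) -> float:
    """1 - 2 * LCS / (|s1| + |s2|); patterns are assumed to be non-empty."""
    return 1 - 2 * lcs_length(s1, s2) / (len(s1) + len(s2))


def sequential_clustering(patterns, threshold):
    """Single pass over the patterns; the result depends on the input order."""
    clusters = []                        # each cluster is a list, center first
    for p in patterns:
        best = min(clusters, key=lambda c: dist(p, c[0]), default=None)
        if best is not None and dist(p, best[0]) <= threshold:
            best.append(p)
        else:
            clusters.append([p])         # start a new cluster with p as center
    return clusters


clusters = sequential_clustering(["aaaa", "aaab", "cccc", "cccd"], threshold=0.3)
```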

Page 55: Algorithms for Discovering Patterns in Sequences

Step 2. Clustering

An example cluster with cluster center ccccddccccccc:

ccccdcccccccc
cccccddbccccc
ccccdcccccccd
cccccdcddcccc
cccccdcdccccc
ccccddccccccc
cccccddcccccc
dccccdccccccc
ccccccddccccc
cccccdccccccc
ccccccdccdccc
ccccccdcccccc
ccccccdcdcccc

Page 56: Algorithms for Discovering Patterns in Sequences

Step 2. Clustering Algorithm

Page 57: Algorithms for Discovering Patterns in Sequences

Step 3. Temporal Relations

• Only three relations are explored here

• The overlap relation here also covers meets, starts with, overlaps and overlapped by

Page 58: Algorithms for Discovering Patterns in Sequences

Temporal Relations

• Temporal relationship concept extended to interval sequence by including support

Page 59: Algorithms for Discovering Patterns in Sequences

Temporal Relations

Temporal relationship concept extended to interval sequence by including support

Page 60: Algorithms for Discovering Patterns in Sequences

Temporal Relations

Page 61: Algorithms for Discovering Patterns in Sequences

Examples

Example of terminology

Page 62: Algorithms for Discovering Patterns in Sequences

Step 3. Algorithm

Page 63: Algorithms for Discovering Patterns in Sequences

Step 3. Example

Page 64: Algorithms for Discovering Patterns in Sequences

Step 3. Example

Page 65: Algorithms for Discovering Patterns in Sequences

Summarization

• Inference of higher level or more general knowledge from lower level temporal dependencies.

• Summarization generalizes the temporal association rules by identifying the time windows in which the rule is applicable.

• Measures for summarization are coverage, average length of coverage and maximum coverage length.

Page 66: Algorithms for Discovering Patterns in Sequences

Results: Utility Dataset

Page 67: Algorithms for Discovering Patterns in Sequences

Results: EEG Dataset

Scalable up to 1 million rows

Page 68: Algorithms for Discovering Patterns in Sequences

Results: Great Lakes dataset

Page 69: Algorithms for Discovering Patterns in Sequences

Results: Power Consumption dataset

daily patterns

Weekly patterns after summarization

Page 70: Algorithms for Discovering Patterns in Sequences

Spatio-Temporal Clusters

Page 71: Algorithms for Discovering Patterns in Sequences

Temporal Patterns

[Figure: temporal profiles; axes: Time, Observed Value]

Example Domains for Temporal Profiles:

- Gene Expression
  - Each gene’s expression level (1000s of genes)

- Utility Consumption
  - Each day/week’s utility consumption (daily profile for multiple years)

- Social Data
  - Number of crime incidents/month

Goal:

Cluster together profiles with similar behavior

Problem:

How to define similar behavior?

Similar for?

- Complete period

- One interval

- Multiple Intervals

Page 72: Algorithms for Discovering Patterns in Sequences

Spatial Patterns

Example Domains for Spatial Profiles:

- Social Data
  - Number of crime incidents/area

- Scientific Data
  - Pollution level in land

[Figure: points scattered in the x-y plane]

Goal:

Cluster together profiles close to each other

Spatial Closeness:

- Euclidean distance

- Chain connectivity

Page 73: Algorithms for Discovering Patterns in Sequences

Spatio-Temporal Patterns

Goal:

Find patterns in space-time dimensions for phenomena

Issues:

Simultaneous handling of space and time dimensions

Problem:

How to handle large complexity of algorithms

Phenomena in space-time

Page 74: Algorithms for Discovering Patterns in Sequences

Examples:

Crime data:
- Each layer represents a month, i.e. the t-axis
- The x and y axes represent space coordinates

Brewery data:
- The t-axis represents weeks
- The x and y axes represent the Average Production and Average Temperature

[Figure: layered space-time data; labels: Energy Consumption, t (Time), x, y, Avg. Production, Avg. Temperature]

Page 75: Algorithms for Discovering Patterns in Sequences

Clustering Algorithms

Temporal Clustering

• Discrete Fourier Transform [Agarwal]
  - Pros: Index time sequences for similarity searching
  - Cons: Not able to represent subsequence temporal concepts

• Rule discovery from Time Series [Das1998]
  - Pros: Windows clustered according to their similarity to determine temporal rules
  - Cons: Unable to cluster profiles according to subsequences of similarity

Spatial Clustering

• PAM
  - Pros: Works well for small datasets
  - Cons: Expensive

• CLARA
  - Pros: Works well for large datasets
  - Cons: Expensive

• CLARANS
  - Motivated by PAM and CLARA

• BIRCH & CURE
  - Suitable for large datasets, clusters found with one scan

Spatio-Temporal Clustering

• R-Trees
  - Often used for indexing spatio-temporal data
  - Overlapping causes backtracking in search

• MR-Tree & HR-Tree
  - Use overlapping of R-Trees to represent successive states of the database
  - If the number of moving objects from one time instant to another is large, the approach degenerates to independent tree structures and thus no paths are common

Page 76: Algorithms for Discovering Patterns in Sequences

Generalizing to multiple dimensions

• Two different approaches
  – Phenomenon moving in the same direction, i.e. parallel to the time axis
  – Phenomenon moving in different directions, i.e. N, NW, NE, S, SW, SE, W, E

• Task: classify the phenomena in time, then track them in space
  – STEP I: Discovery of temporal clusters
  – STEP II: Finding spatial clusters

Page 77: Algorithms for Discovering Patterns in Sequences

Phenomenon moving in different directions

Goal: To discover subsequences of the complete temporal profile that are shared by a large number of profiles and are neighbors spatially

Page 78: Algorithms for Discovering Patterns in Sequences

Modified Sequences Generated

b c

d b

c d

d c

a c

c c

a b

b c

There are 4 layers, so the final sequences formed are of length 7 i.e. 2n – 1, where n is number of layers

Page 79: Algorithms for Discovering Patterns in Sequences

Complexity

• The number of strings generated moving from Layer 1 to Layer 2 is:

• For L layers, the number of strings generated when the phenomenon moves from Layer 1 to Layer L is:

Page 80: Algorithms for Discovering Patterns in Sequences

Results

- Test data with 20 profiles
- Each profile observed at 30 time points

[Figure axes: Profile, Time]

Page 81: Algorithms for Discovering Patterns in Sequences

Results (Brewery Data 1)

- Energy Consumption data of size 51X14
- 51 profiles, each observed over a time period of 14 days

[Figure axes: Time, Energy Consumption]

Page 82: Algorithms for Discovering Patterns in Sequences

Results (Brewery Data 2)

- Production data of size 51X14
- 51 profiles, each observed over a time period of 14 days

Page 83: Algorithms for Discovering Patterns in Sequences

Results

- Sequences generated from test data with 7 layers, each layer of size 10X10

- 4480 strings generated

[Figure axes: Time, Profile Index, Observed Values]

Page 84: Algorithms for Discovering Patterns in Sequences

Sequences and Bioinformatics

Page 85: Algorithms for Discovering Patterns in Sequences

Importance of Sequences

• DNA: sequence of letters A,C,G,T

• Proteins: sequence of 20 amino acids

• Central premise of bioinformatics
  – Sequence → structure → function

• Ultimate goal: predict function based on sequence

Page 86: Algorithms for Discovering Patterns in Sequences

Strings and Sequences

Page 87: Algorithms for Discovering Patterns in Sequences

Exact String Matching

Page 88: Algorithms for Discovering Patterns in Sequences

Solutions to Exact Matching

• O(n+m) upper bound for worst case

• Preprocess the pattern P

• Preprocess the text T
  – Suffix tree method: Weiner, Ukkonen and others
    • Building the tree: O(m)
    • Searching: O(n)

Page 89: Algorithms for Discovering Patterns in Sequences

Suffix Trees

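• T = cabca

• Suffixes are
  – cabca
  – abca
  – bca
  – ca
  – a

• P = ca

Suffix tree of T$ (each leaf number is the starting position of a suffix):

root
  - ca
      - bca$   (1)
      - $      (4)
  - a
      - bca$   (2)
      - $      (5)
  - bca$       (3)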
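Searching for P = ca in the tree means walking down from the root along the pattern and reading off the leaf numbers below the point reached. The sketch below uses an uncompressed suffix trie built in quadratic time, which is simpler than the linear-time suffix tree constructions of Weiner and Ukkonen named earlier; the function names and the "#" key used to store leaf numbers are assumptions (the text must not contain "#" or "$").

```python
def build_suffix_trie(text: str) -> dict:
    """Insert every suffix of text + '$' into a nested-dict trie."""
    text += "$"
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        node["#"] = start + 1            # leaf label: 1-based suffix number
    return root


def find_all(trie: dict, pattern: str) -> list:
    """Return the sorted 1-based positions at which pattern occurs."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return []
        node = node[ch]
    found, stack = [], [node]
    while stack:                         # collect all leaf labels below the node
        current = stack.pop()
        for key, child in current.items():
            if key == "#":
                found.append(child)
            else:
                stack.append(child)
    return sorted(found)


trie = build_suffix_trie("cabca")
positions = find_all(trie, "ca")
```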

Page 90: Algorithms for Discovering Patterns in Sequences

Approximate String Matching

• Error in the experiment or observation

• Redundancy in biological system
  – Insertion/deletions
  – Mutations

Page 91: Algorithms for Discovering Patterns in Sequences

The Task: Best Alignment

• Given Strings
  – HEAGAWGHE
  – PAWHE

• Many Alignments possible

HEAGAWGHE    HEAGAWGHE    HEAGAWGHE
P_A__W_HE    __P_AW_HE    _PA__W_HE

Page 92: Algorithms for Discovering Patterns in Sequences

Scoring Function

• a score for match
    H E A G …
    X X A X …

• a score for mis-match
    H E A G …
    X X P X …

• a score for alignment against gap
    H E A G …
    X X _ X …

Page 93: Algorithms for Discovering Patterns in Sequences

Algorithms

• Needleman-Wunsch (NW): Global Alignment

• Smith-Waterman (SW): Local Alignment

• Both are based on the idea of dynamic programming

Page 94: Algorithms for Discovering Patterns in Sequences

All Possible Alignments

• Represented using a table
  – First string, a1a2, is aligned along the top
  – Second string, b1b2, is aligned along the left side

       a1   a2
   0    1    1
b1 1    3    5
b2 1    5   13

[Figure: paths through the table, labeled a1a2b1b2, b1a1b2a2, b1b2a1a2, a1b1a2b2, a1b1b2a2, b1a1a2b2]
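The numbers in the table count the alignments (paths) that reach each cell: a cell can be entered from the left, from above, or diagonally, so its count is the sum of those three neighbours. A minimal sketch of this counting, assuming the usual convention that the empty alignment at the origin counts as one path (the slide prints 0 in that corner); the function name is hypothetical.

```python
def count_alignments(m: int, n: int) -> int:
    """Number of alignments of a string of length m with one of length n."""
    # Border cells can only be reached by a straight run of gaps.
    d = [[1] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = d[i - 1][j] + d[i][j - 1] + d[i - 1][j - 1]
    return d[m][n]


two_by_two = count_alignments(2, 2)
```

The counts grow exponentially with the string lengths, which is why the alignments are scored with dynamic programming rather than enumerated.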

Page 95: Algorithms for Discovering Patterns in Sequences

Main idea of DP

• Choices:
  – Align a3 to b3, s(a3,b3) = -5, score: 5 - 5 = 0
  – Align a3 to a gap, s(a3,-) = -8, score: -3 - 8 = -11
  – Align b3 to a gap, s(-,b3) = -8, score: 12 - 8 = 4

      a1   a2   a3
b1
b2          5   12
b3         -3    4

Page 96: Algorithms for Discovering Patterns in Sequences

Main idea of DP

[Diagram: cell (i,j) is computed from its neighbours F(i-1,j-1), F(i-1,j) and F(i,j-1)]

• Score is sum of independent piecewise scores

• Global alignment (Needleman-Wunsch):

\[ F(0,0) = 0; \quad F(k,0) = F(0,k) = -k\,d \]

\[ F(i,j) = \max \{\, F(i-1,j-1) + s(a_i,b_j),\ F(i-1,j) - d,\ F(i,j-1) - d \,\} \]

• Local alignment (Smith-Waterman):

\[ F(0,0) = 0; \quad F(k,0) = F(0,k) = 0 \]

\[ F(i,j) = \max \{\, 0,\ F(i-1,j-1) + s(a_i,b_j),\ F(i-1,j) - d,\ F(i,j-1) - d \,\} \]

Page 97: Algorithms for Discovering Patterns in Sequences

Overview of DP

• Start from the upper left corner

• Assign scores for the leading gaps

• Iteratively fill in the scores in the table

• Find the maximum score, bottom-right for global alignment

• Back-track to the upper-left to find the best overall alignment

Page 98: Algorithms for Discovering Patterns in Sequences

Pairwise score from Blosum50

      H    E    A    G    A    W    G    H    E
P    -2   -1   -1   -2   -1   -4   -2   -2   -1
A    -2   -1    5    0    5   -3    0   -2   -1
W    -3   -3   -3   -3   -3   15   -3   -3   -3
H    10    0   -2   -2   -2   -3   -2   10    0
E     0    6   -1   -3   -1   -3   -3    0    6

Page 99: Algorithms for Discovering Patterns in Sequences

DP Table for global alignment

(gap penalty d = 8; * = diagonal move, ^ = from the cell above, < = from the cell to the left)

             H     E     A     G     A     W     G     H     E
       0   <-8  <-16  <-24  <-32  <-40  <-48  <-56  <-64  <-72
P    ^-8   *-2   *-9  *-17  <-25  *-33  <-41  <-49  <-57  *-65
A   ^-16  ^-10   *-3   *-4  <-12  *-20  <-28  <-36  <-44  <-52
W   ^-24  ^-18  ^-11   *-6   *-7  *-15   *-5  <-13  <-21  <-29
H   ^-32  *-14  *-18  *-13   *-8   *-9  ^-13   *-7   *-3  <-11
E   ^-40  ^-22   *-8  <-16  ^-16   *-9  *-12  ^-15   *-7    *3

Best global alignment (score 3):

HEAGAWGHE
--P-AW-HE

Page 100: Algorithms for Discovering Patterns in Sequences

DP Table for local alignment

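(gap penalty d = 8; * = diagonal move, ^ = from the cell above, < = from the cell to the left)

           H    E    A    G    A    W    G    H    E
      0    0    0    0    0    0    0    0    0    0
P     0    0    0    0    0    0    0    0    0    0
A     0    0    0   *5    0   *5    0    0    0    0
W     0    0    0    0   *2    0  *20  <12   <4    0
H     0  *10   <2    0    0    0  ^12  *18  *22  <14
E     0   ^2  *16   <8    0    0   ^4  ^10  *18  *28

Best local alignment (score 28):

AWGHE
AW-HE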
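Both tables can be reproduced with one small routine that implements the Needleman-Wunsch and Smith-Waterman recurrences and then traces back from the final cell (global) or from the maximum cell (local). This is a sketch under stated assumptions: the `SCORES` dictionary only holds the BLOSUM50 excerpt shown on the earlier slide, the gap penalty is d = 8, and ties in the traceback are resolved diagonal first, then up, then left.

```python
SCORES = {  # rows: letters of PAWHE, columns: letters occurring in HEAGAWGHE
    "P": {"H": -2, "E": -1, "A": -1, "G": -2, "W": -4},
    "A": {"H": -2, "E": -1, "A": 5, "G": 0, "W": -3},
    "W": {"H": -3, "E": -3, "A": -3, "G": -3, "W": 15},
    "H": {"H": 10, "E": 0, "A": -2, "G": -2, "W": -3},
    "E": {"H": 0, "E": 6, "A": -1, "G": -3, "W": -3},
}


def align(a, b, d=8, local=False):
    """Align a (along the top) with b (along the left side).
    Returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    F = [[0] * (n + 1) for _ in range(m + 1)]
    if not local:                        # leading gaps cost d each
        for j in range(1, n + 1):
            F[0][j] = -j * d
        for i in range(1, m + 1):
            F[i][0] = -i * d
    best, best_pos = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            F[i][j] = max(F[i - 1][j - 1] + SCORES[b[i - 1]][a[j - 1]],
                          F[i - 1][j] - d, F[i][j - 1] - d)
            if local:
                F[i][j] = max(0, F[i][j])
                if F[i][j] > best:
                    best, best_pos = F[i][j], (i, j)
    i, j = best_pos if local else (m, n)
    score = F[i][j]
    top, left = [], []
    while (i > 0 or j > 0) and not (local and F[i][j] == 0):
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + SCORES[b[i - 1]][a[j - 1]]:
            top.append(a[j - 1]); left.append(b[i - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] - d:
            top.append("-"); left.append(b[i - 1]); i -= 1
        else:
            top.append(a[j - 1]); left.append("-"); j -= 1
    return score, "".join(reversed(top)), "".join(reversed(left))


global_result = align("HEAGAWGHE", "PAWHE")
local_result = align("HEAGAWGHE", "PAWHE", local=True)
```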

Page 101: Algorithms for Discovering Patterns in Sequences

Multiple Alignment

• Complexity of DP: O(n^2)

• For k strings: O(n^k)

• Explore other options
  – Hidden Markov Model

Page 102: Algorithms for Discovering Patterns in Sequences

Markov Chain Models

• Similar to finite automata

• Emits sequences with certain probability

A

C G

T

ATGT…

P(ATGT) = P(A|TGT) P(T|GT) P(G|T) P(T)

P(ATGT) = P(A|T) P(T|G) P(G|T) P(T)

Page 103: Algorithms for Discovering Patterns in Sequences

Hidden Markov Models

• Generalization of Markov Models

• Hidden states that emit sequences

A*

C* G*

T* A

C G

T

By adding four more states (A*, C*, T*, G*) to represent the “island” model, as opposed to the non-island model, with unlikely transitions between the two models, one obtains a “hidden” Markov model for CpG islands.

Page 104: Algorithms for Discovering Patterns in Sequences

HMMs for Gene Prediction

Page 105: Algorithms for Discovering Patterns in Sequences

HMMs & Supervised Learning

• Input: a training set of aligned sequences

• Find optimal transition and emission probabilities

• Criteria: maximize probability of observing the training sequences

• Algorithms
  – Baum-Welch (Expectation Maximization)
  – Viterbi training algorithm

Page 106: Algorithms for Discovering Patterns in Sequences

Recognition Phase

• We have optimized probabilities

• Predict the likelihood of a sequence belonging to a family
  – What is the probability that a sequence is generated by the HMM?

Page 107: Algorithms for Discovering Patterns in Sequences

A Simple HMM Model

Beg → … → Mj → … → End

Each blue square represents a match state that “emits” each letter with a certain probability ej(a), which is defined by the frequency of a at position j.

Example

AGAAACT
AGGAATT
TGAATCT

     1     2     3     4     5     6     7
A    2/3   0     2/3   1     2/3   0     0
T    1/3   0     0     0     1/3   1/3   1
C    0     0     0     0     0     2/3   0
G    0     1     1/3   0     0     0     0

P(AGAAACT) = 16/81
P(TGGATTT) = 1/81
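The emission table and the two probabilities on this slide can be recomputed from the three example sequences. A minimal sketch, assuming a chain of match states only (every transition has probability 1) and using exact fractions; the function names are hypothetical.

```python
from collections import Counter
from fractions import Fraction


def emission_probs(seqs):
    """e_j(a): frequency of letter a at position j over equal-length sequences."""
    n = len(seqs)
    return [
        {letter: Fraction(count, n) for letter, count in Counter(column).items()}
        for column in zip(*seqs)
    ]


def sequence_prob(seq, probs):
    """Probability that the chain of match states emits seq."""
    p = Fraction(1)
    for j, letter in enumerate(seq):
        p *= probs[j].get(letter, Fraction(0))
    return p


probs = emission_probs(["AGAAACT", "AGGAATT", "TGAATCT"])
p_first = sequence_prob("AGAAACT", probs)
p_second = sequence_prob("TGGATTT", probs)
```

Letters never seen at a position get probability 0 here; practical profile HMMs usually add pseudocounts to avoid that.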

Page 108: Algorithms for Discovering Patterns in Sequences

Insertions…

Insert states emit symbols just like the match states; however, the emission probabilities are typically assumed to follow the background distribution and thus do not contribute to log-odds scores.

Transitions Ij -> Ij are allowed and account for an arbitrary number of inserted residues that are effectively unaligned (their order within an inserted region is arbitrary).

Beg Mj End

Ij

Page 109: Algorithms for Discovering Patterns in Sequences

… and Deletions

Beg Mj End

Dj

Deletions are represented by silent states which do not emit any letters. A sequence of deletions (with D -> D transitions) may be used to connect any two match states, accounting for segments of the multiple alignment that are not aligned to any symbol in a query sequence (string).

The total cost of a deletion is the sum of the costs of the individual transitions (M->D, D->D, D->M) that define this deletion. As in the case of insertions, both linear and affine gap penalties can be easily incorporated in this scheme.

Page 110: Algorithms for Discovering Patterns in Sequences

HMMs for Multiple Alignment

Beg Mj End

Ij

Dj

Example

AG---C
A-AG-C
AG-AA-
--AAAC
AG---C
**   *

Page 111: Algorithms for Discovering Patterns in Sequences

Emission and Transition Counts

[Diagram: states Beg, Mj, Ij, Dj, End]

AG...C
A-AG.C
AGAA.-
--AAAC
AG...C

Match emissions

     C0   C1   C2   C3
A    -    4    0    0
C    -    0    0    4
G    -    0    3    0
T    -    0    0    0

Insert emissions

     C0   C1   C2   C3
A    0    0    6    0
C    0    0    0    0
G    0    0    1    0
T    0    0    0    0

4 23 4

1

1

1

21

41

12

C0 C1 C2 C3
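The match and insert emission counts can be derived from the example alignment once the match columns are fixed. The sketch below assumes the columns marked with an asterisk on the previous slide (the first, second and last) are the match columns and that "-" and "." denote gaps; the function name is hypothetical and transition counts are not computed.

```python
from collections import Counter


def emission_counts(rows, match_cols):
    """Count emitted letters per match state and per insert state.
    Index 0 is the Beg position (C0); index k belongs to the k-th match column."""
    n_states = len(match_cols) + 1
    match = [Counter() for _ in range(n_states)]
    insert = [Counter() for _ in range(n_states)]
    for row in rows:
        state = 0
        for col, ch in enumerate(row):
            if col in match_cols:
                state += 1
                if ch != "-":            # a gap in a match column is a deletion
                    match[state][ch] += 1
            elif ch not in "-.":         # a letter between match columns is an insert
                insert[state][ch] += 1
    return match, insert


rows = ["AG---C", "A-AG-C", "AG-AA-", "--AAAC", "AG---C"]
match, insert = emission_counts(rows, match_cols={0, 1, 5})
```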

Page 112: Algorithms for Discovering Patterns in Sequences

Results

Training File 2

Validation File 2

Page 113: Algorithms for Discovering Patterns in Sequences

Bibliography

1. G. Das, K. Lin, H. Mannila, G. Renganathan and P. Smyth, "Rule Discovery from Time Series," in Knowledge Discovery and Data Mining, 1998, pp. 16-22.

2. H. Mannila, "Discovery of Frequent Episodes in Event Sequences," Data Mining and Knowledge Discovery, vol. 1, pp. 259, 1997.

3. P. Kam and A.W. Fu, "Discovering Temporal Patterns for Interval-Based Events," in Second International Conference on Data Warehousing and Knowledge Discovery (DaWaK 2000); Lecture Notes in Computer Science, 2000, pp. 317-326.

4. J. F. Roddick and E. Winarko, "Discovering Richer Temporal Association Rules from Interval-based Data: Extended Report," School of Informatics and Engineering, Flinders University, Adelaide, Australia, Tech. Rep. SIE-05-003, March 2005.

5. F. Mörchen and A. Ultsch, "Discovering Temporal Knowledge in Multivariate Time Series," in GfKl Dortmund, 2004.

6. F. Höppner, "Learning temporal rules from state sequence," in Proceedings of IJCAI Workshop on Learning from Temporal and Spatial Data, Seattle, USA, 2001, pp. 25-31.

7. E. Ukkonen, "On-Line Construction of Suffix Trees," Algorithmica, vol. 14, pp. 249-260, 1995.

8. C.S. Daw, C.E.A. Finney and E.R. Tracy, "A review of symbolic analysis of experimental data," Rev. Sci. Instrum., vol. 74, pp. 915-930, Feb. 2003.

9. S. Theodoridis and K. Koutroumbas, Pattern Recognition, San Diego: Elsevier Academic Press, 2003.