Algorithms for Discovering Patterns in Sequences
Raj Bhatnagar
University of Cincinnati
Outline
• Applications of Sequence Mining
  – Genomic Sequences
  – Engineering/Scientific Data Analysis
  – Market Basket Analysis
• Algorithm Goals
  – Temporal Patterns
  – Interesting Frequent Subsequences
  – Interesting Frequent Substrings (with mutations)
  – Detection of Outlier Sequences
Why Mine a Dataset?
Discover Patterns:
– Associations
  • A subsequence pattern follows another
  • Help make a prediction
– Clusters
  • Typical modes of operation
– Temporal Dependencies
  • Events related in time
– Spatial Dependencies
  • Events related in space
Example Problems
From various domains:
– Engineering
– Scientific
– Genomic
– Business
Energy Consumption
Gene Expression over Time
Multivariate Time Series
• Multivariate time series data is a series of multiple attributes observed over a period of time at equal intervals
• Examples: stocks, weather, utility, sales, scientific, sensor monitoring and bioinformatics
Spatio-Temporal Patterns
Goal: Find patterns in space-time dimensions for phenomena
Issues: Simultaneous handling of space and time dimensions
Problem: How to handle the large complexity of the algorithms
Phenomena in space-time
Discover Interesting Subsequences
actgatAAAAAAAAGGGGGGGggcgtacacattagCTGATTCCAATACAGacgt
aaAAAAAAAAGGGGGGGaaacttttccgaataCTGATTCCAATACAGgatcagt
atgacttAAAAAAAAGGGGGGGtgctctcccgattttcCTGATTCCAATACAGc
aggAAAAAAAAGGGGGGGagccctaacggacttaatCCTGATTCCAATACAGta
ggaggAAAAAAAAGGGGGGGagccctaacggacttaatCCTGATTCCAATACAG
AAAAAAAAGGGGGGG-(10,15)-CTGATTCCAATACAG
The blue pattern sequence may have up to k substitutions
• Finding frequent patterns
• Doing similarity search
• Clustering of time series
• Periodicity detection
• Finding temporal associations
• Summarization
• Prediction
Main Sequence Mining Tasks
Finding Frequent Substrings
abacebc . abcdobd . abaaebd . eoaoobd . abceoad .
GST generated considering the substrings starting at location #0 of each string is:
(0) <root> – (8) ab – (10) c – (11) dobd – (15) 8
– (35) eoad – (39) 32 – (18) a – (19) aebd – (23) 16
– (3) cebc – (7) 0 – (24) eoaoobd – (31) 24
Substring   Length size(s)   Position pos   Set P           Cardinality |P|
aba         3                0              {0, 2}          2
abc         3                0              {1, 4}          2
ab          2                0              {0, 1, 2, 4}    4
Two Main Approaches:
• Generalized Suffix Trees (linear time)
• Find Longest Common Substrings (linear time)
Finding Frequent Substrings
• Recursively generate the GST with each string truncated by removing the first character
Original strings:
abacebc . abcdobd . abaaebd . eoaoobd . abceoad .
Truncated strings:
bacebc . bcdobd . baaebd . oaoobd .bceoad .
(0) <root> – (8) b – (10) dobd – (14) 8
– (31) eoad – (35) 29 – (16) a – (17) aebd – (21) 15
– (3) cebc – (7) 1 – (22) oaoobd – (28) 24
s    |s|   pos   set P          |P|
ba   2     1     {0, 2}         2
bc   2     1     {1, 4}         2
b    1     1     {0, 1, 2, 4}   4
Results after Phase-I
• Substrings generated after Phase-I are:
Phase2: Subsequence Hypotheses
Result of Phase 1:
string    size(s)   profiles(G)   size(G)
ABa----   3         {0, 2}        2
AB-----   2         {0, 1, 2}     3
--a----   1         {0, 2, 3}     3
----*bd   3         {1, 3}        2
-----bd   2         {1, 2, 3}     3
Result after Phase 2:
subsequence   size(T)   profiles(R)   size(R)
AB---bd       4         {1, 2}        2
--a--bd       3         {2, 3}        2
Merging Profile Sets
• Two sets of profiles are considered similar and are merged together if:

  Size(Intersection(P1, P2)) / Size(Union(P1, P2)) ≥ threshold

  where P1 and P2 are any two profile sets and threshold is user-defined
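The merge criterion above can be sketched in Python. This is only a sketch: the overlap test is from the slide, while the greedy single-pass grouping and the function names are illustrative assumptions.

```python
def should_merge(P1, P2, threshold):
    """Merge two profile sets when |P1 ∩ P2| / |P1 ∪ P2| >= threshold (slide's test)."""
    P1, P2 = set(P1), set(P2)
    return len(P1 & P2) / len(P1 | P2) >= threshold

def merge_profiles(profile_sets, threshold):
    """Greedy single pass (illustrative): fold each profile set into the
    first existing group it is similar enough to, else start a new group."""
    groups = []
    for P in profile_sets:
        for g in groups:
            if should_merge(g, P, threshold):
                g |= set(P)
                break
        else:
            groups.append(set(P))
    return groups
```

For example, with threshold 0.5, {0,2} and {0,2,3} overlap in 2 of 3 elements and merge, while {1,3} stays separate.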
Merged Substrings
Generalizing Substring Patterns
• Core subsequences can be written in the form of a regular expression (regex)
• Each symbol is considered replaceable by the preceding or following letter of the alphabet
E.g. for the substring,
aba – eb - -
regex will be [ab]{1}[abc]{1}[ab]{1}.{1}[def]{1}[abc]{1}.{2}
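This generalization step can be sketched in Python. The function name and the restriction to a lower-case alphabet are illustrative assumptions; '-' positions become wildcards, and every other symbol becomes a character class of itself plus its alphabet neighbors, reproducing the regex shown above.

```python
ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def generalize(pattern):
    """Build a regex where each symbol may be replaced by its preceding
    or following letter of the alphabet; '-' is a wildcard position."""
    parts = []
    for ch in pattern:
        if ch == '-':
            if parts and parts[-1][0] == '.':
                parts[-1][1] += 1          # extend a run of wildcards
            else:
                parts.append(['.', 1])
        else:
            i = ALPHABET.index(ch)
            cls = ALPHABET[max(i - 1, 0): i + 2]   # neighbor character class
            parts.append(['[%s]' % cls, 1])
    return ''.join(f"{p}{{{n}}}" for p, n in parts)
```

`generalize("aba-eb--")` yields `[ab]{1}[abc]{1}[ab]{1}.{1}[def]{1}[abc]{1}.{2}`, matching the slide.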
Generalization of Sequence Patterns
• A hypothesis T specifies a core subsequence shared by the set of profiles R
• Core subsequences and the set R are affected by various factors, such as loss of information
• T is considered the seed of the hypothesis
Cluster #1 from Utility Dataset
Cluster #2 from Utility Dataset
What is Achieved
• Discover:– Partial-sequence temporal
hypotheses – identities of profiles for each
temporal hypothesis
Histone Cluster
ORF      Process              Function     Peak Phase   Order   Cluster Order
YBL002W  chromatin structure  histone H2B  S            424     789
YBL003C  chromatin structure  histone H2A  S            441     790
YBR009C  chromatin structure  histone H4   S            417     791
YBR010W  chromatin structure  histone H3   S            420     795
YDR224C  chromatin structure  histone H2B  S            432     794
YDR225W  chromatin structure  histone H2A  S            437     796
YNL030W  chromatin structure  histone H4   S            418     792
YNL031C  chromatin structure  histone H3   S            430     793
YPL127C  chromatin structure  histone H1   S            448     788
YBL002W  CC ab dccaCBBBabc*BCCCA* xcCBCbbcccACACCBBbaabbb* dcABCaaabbBCxBBBa cccbABCBBBBBab
YBL003C  CC ab cdcbBCBBbacaBCCC** cbCCCAbcccaCCCCBA*bcbbaA bcABBb*ab*AxABBAa cb*baBBCBBBB*b
YBR009C  xC *x ddcaCCBBaab*BCCB*a xcCCBbccccbCBCBBa*bbxaAB baB*AaA*bxdBaBBBB cccbABCCCBABAb
YBR010W  BC aA ccc*CBBAbbc*BCCBAa bbCCB*ccdbbCBCCCAAxbcaAB bbA*Ab*bbAAxaBBB* bccbaBBBBB*Bab
YDR224C  BC aa ccc*CCB*abbaBBCA*b cbCCB*cccbbCCCCCAAcccaAB caBAAaabbAAxaBBBA cccb*BBCBBBBAb
YDR225W  CC *b cdcaCCC*abbbBCCBAa cbCCC*bccbaCCCBCBAbccbaB cbAA*a*bbAAxaBBBB bccbaBCBBBBB*b
YNL030W  CC bb dccACCBAabb*CBCBAa cbCCBacccbbCCCCB*AbbcaAB b*AaAbabbBxBaA*BA cccb*BCCBBBB*b
YNL031C  CC ab cccABCB*abb*BCCAAa bbCCB*bccbbBBCBC*Abbbb*B bAAAAbabbBAxaA*B* bcbb*BBCBAA**b
YPL127C  Da bb dcaBBCBAabb*BBBBAa cbBCABbcdbaBABBCBBbcbaAB ccBBBB*bcbABBxAAb bccbABBCBBBA*a
^.{4}[cdx]{2}.{3}[BCx]{1}.{3}[abx]{1}.{3}[BCDx].{7}[BCx]{1}.{3}[bcdx]{2}.{5}[BCx]{1}.{33}[BCx]{1}[ABCx]{1}.{5}$
CC-b--c---------BCC-----C---bcc----C-------b----A-----b-----------b-B--B----b
YBL002W  CCabdccaCBBBabc*BCCCA*xcCBCbbcccACACCBBbaabbb*dcABCaaabbBCxBBBacccbABCBBBBBab
YBL003C  CCabcdcbBCBBbacaBCCC**cbCCCAbcccaCCCCBA*bcbbaAbcABBb*ab*AxABBAacb*baBBCBBBB*b
YDR225W  CC*bcdcaCCC*abbbBCCBAacbCCC*bccbaCCCBCBAbccbaBcbAA*a*bbAAxaBBBBbccbaBCBBBBB*b
YNL031C  CCabcccABCB*abb*BCCAAabbCCB*bccbbBBCBC*Abbbb*BbAAAAbabbBAxaA*B*bcbb*BBCBAA**b
++x----x+++x---x+++xxx--+++x----x+x+++xx-x--xxxxx+x-x--x++x+x+x--x-x++++++xx-
5,6,10,14,18,26,30,31,37,71,72
Fourier Results
String-Based Results
Regular Expression
Generalization
Localized Focus of Generalization = more genes included
Example 2 with Yeast Cell Data
ORF      Process               Function                Peak Phase   Order   Cluster Order
YBR009C  chromatin structure   histone H4              S            417     791
YBR010W  chromatin structure   histone H3              S            420     795
YDR224C  chromatin structure   histone H2B             S            432     794
YDR225W  chromatin structure   histone H2A             S            437     796
YLR300W  cell wall biogenesis  exo-beta-1,3-glucanase  G1           381     756
YNL030W  chromatin structure   histone H4              S            418     792
YNL031C  chromatin structure   histone H3              S            430     793
Not classified as cell cycle related:
YBR106W, YBR118W, YBR189W, YBR206W, YCLX11W, YDL014W, YDL213C,
YDR037W, YDR134C, YGL148W, YKL009W, YLR449W, YNL110C, YPR163C
^.{46}[bcx]{1}.{1}[ABx]{1}[aA*x]{1}[AB*]{1}[abx]{1}[aA*x]{1}.{1}[abcx]{1}[ABx]{1}.{1}[ABCx]{1}[ab*x]{1}[ABx]{1}.{1}[ABCx]{1}.{15}$
Original Hypothesis
New Genes Included After Localized Generalization
New Regular Expression
Find Longest Common Substrings
Algorithms for LCS
• Substring: a contiguous sequence of characters in a string
• Subsequence: obtained by deleting zero or more symbols from a given string
abcdefghia
Substrings : cdefg, efgh, abcd
Subsequences: ade , cefhi, abc, aia
• The LCS is a common subsequence of maximal length between two strings
String 1 : abcdabcefghijk
String 2: xbcaghaehijk
LCS = bcaehijk, Length of LCS = 8
Longest Common Subsequence
• Brute force has exponential time complexity in the length of the strings
• Dynamic programming can find the LCS in O(mn) time and space
• The length of the LCS can be found in O(min(m,n)) space and O(mn) time
Finding LCS
• Finding frequent patterns
• Doing similarity search
• Clustering of time series
• Periodicity detection
• Finding temporal associations
• Summarization
• Prediction
Main Sequence Mining Tasks
Recursive Formulation:
LCS[i, j] =
  0                                 if i = 0 or j = 0
  LCS[i-1, j-1] + 1                 if i, j > 0 and a_i = b_j
  max(LCS[i, j-1], LCS[i-1, j])     if i, j > 0 and a_i ≠ b_j
Finding the LCS Length
Recursive Formulation:
LCS[i, j] =
  0                                 if i = 0 or j = 0
  LCS[i-1, j-1] + 1                 if i, j > 0 and a_i = b_j
  max(LCS[i, j-1], LCS[i-1, j])     if i, j > 0 and a_i ≠ b_j
Iterative solution is more efficient than recursive
Finding the LCS Length
lcs_length(A, B) {
  // A is a string of length m, terminated by '$'
  // B is a string of length n, terminated by '$', m >= n
  // L is an array holding the intermediate values of the dynamic program
  for (i = m; i >= 0; i--)
    for (j = n; j >= 0; j--) {
      if (A[i] == '$' || B[j] == '$')  L[i,j] = 0;                       // end of string
      else if (A[i] == B[j])           L[i,j] = 1 + L[i+1, j+1];
      else                             L[i,j] = max(L[i+1, j], L[i, j+1]);
    }
  return L[0,0];
}
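The same computation can be written as a short Python sketch, filling the table bottom-up exactly as in the recurrence (index bounds replace the '$' sentinels):

```python
def lcs_length(A, B):
    """Length of the longest common subsequence of A and B via
    bottom-up dynamic programming: O(m*n) time and space."""
    m, n = len(A), len(B)
    L = [[0] * (n + 1) for _ in range(m + 1)]   # row m / column n stay 0
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if A[i] == B[j]:
                L[i][j] = 1 + L[i + 1][j + 1]
            else:
                L[i][j] = max(L[i + 1][j], L[i][j + 1])
    return L[0][0]
```

On the example strings from the earlier slide, `lcs_length("abcdabcefghijk", "xbcaghaehijk")` returns 8, the length of bcaehijk.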
Finding the LCS Length: Algorithm
Sequential Clustering
• Clustering partitions the data into equivalence classes
• The data is read once or a few times
• The resulting classification depends on the input order
• Simple and fast for large data
Basic Sequential Clustering
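A minimal sketch of basic sequential clustering, assuming a user-supplied distance function and threshold (the function name is illustrative): each item joins the nearest existing cluster if close enough, otherwise it starts a new one, so the result depends on presentation order, as noted above.

```python
def sequential_cluster(items, dist, threshold):
    """Basic sequential clustering sketch: assign each item to the
    nearest existing cluster representative if within `threshold`,
    otherwise start a new cluster (order-dependent by design)."""
    reps, clusters = [], []
    for x in items:
        if reps:
            d, k = min((dist(x, r), k) for k, r in enumerate(reps))
            if d <= threshold:
                clusters[k].append(x)
                continue
        reps.append(x)        # x becomes the representative of a new cluster
        clusters.append([x])
    return clusters
```

For instance, `sequential_cluster([1, 2, 10, 11, 2], lambda a, b: abs(a - b), 3)` groups the values into `[[1, 2, 2], [10, 11]]`.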
Sequential Clustering with Buckets
• Finding frequent patterns
• Doing similarity search
• Clustering of time series
• Periodicity detection
• Biological Sequence Problems
Main Sequence Mining Tasks
Multivariate Time Series
• Multivariate time series data is a series of multiple attributes observed over a period of time at equal intervals
• Examples: stocks, weather, utility, sales, scientific, sensor monitoring and bioinformatics
• Finding frequent patterns
• Doing similarity search
• Clustering of time series
• Periodicity detection
• Finding temporal associations
• Prediction
Time Series Analysis Tasks
Why temporal association rules?
• More information about correlations between frequent patterns
• Richer information than knowledge of frequent patterns alone
• Helps to build diagnostic and prediction tools
Finding temporal associations: recent work
• Mannila – Discovery of frequent episodes in event sequences [2], 1997
• Das – Rule discovery from time series [1], 1998
• Kam – Discovering temporal patterns for interval-based events [3], 2000
• Roddick – Discovering richer temporal association rules from interval-based data [4], 2004
• Mörchen – Discovering temporal knowledge in multivariate time series [5], 2004
Research Issues
• Finding a richer set of temporal relationships {contains, follows, overlaps, meets, equals, …} than sequence mining does {follows}
• Robustness of rules – room for noise in patterns
• Understanding temporal relationships at different levels of abstraction
• Efficient algorithms to find patterns with noise and temporal associations
Frameworks
• Kam [3] uses Allen's temporal relations to find rules
• Roddick [4] uses a state-sequence framework similar to Höppner [6]
• Mörchen [5] is based on a Unification-based Temporal Grammar
Temporal Association Rules
Allen’s relationships
The figure above is taken from [1]
What are A1 rules?
A1 rel1 A2 rel2 A3 rel3 A4 … rel(k-1) Ak
Given a multivariate time series, a minimum support on the number of occurrences, and a minimum pattern length, find all the temporal association rules (similar to A1) of size less than k
Problem
multivariate time series sequences
dimensionality reduction, discretization and symbolic representation
aaaeffdaaaaaaaaacccaaaaaaaaaaedefggcbabaacfgfc…dccbbccdccdeedcdcdeecccdddeeecdeccccddegedbcdc…dddcbbcfffeeegffcbcdeeefffffecbbaaaaaacffecbbb…
frequent patterns
{aaaaaaaa, bbbaaa, bbaaa, eeef, aaac, eaaa ...}{cdee, dccde, deed, ddeeec, eeec, ccccc,edddd …}{fff,fffe,aaaaa,fecbb,ddcb,bbbbc,bbbbc,cbbb…}
frequent pattern enumeration
clustering
clusters
{aaaaaaaa} {bbba,bba,..} {eeef…} {aaac,eaaa …}…{cdee}{dccde…}{deed,ddeeec,eeec…}{ccccc…}{edddd…}…{fff,fffe…}{fecbb…}{ddcb…}{bbbbc,bbbbc,cbbb…} …
Temporal association rule discovery
1.{aaaa,aaac,aaae,abaa…}followed by{cbb,ccbb,cccbb…}2.{bbbbb,cbbbbbc,…}followed by{baaaa,aaaaaaa,aaaaaa
…} overlaps {aaaa,aaac,aaae…}3.{dccccdd,ccccdd} contains {aaaaa,aaaa }4. …5. …….
Summarization and visualization
Temporal rules → Summarized rules
Mining in 3 steps
1.Find all frequent patterns in each dimension along with all the occurrence.
2.Cluster the similar Patterns to form equivalence classes
3.Find temporal associations between these equivalence classes using iterative algorithm
Step 1: Finding Frequent Patterns
1. The data in each dimension is quantized to form a string (equal frequency, equal interval, SAX, Persist, etc.) [8]
2. An enhanced suffix tree is constructed for this string using an O(n) algorithm [7]
3. All frequent patterns, along with their locations, are enumerated in linear time by a complete traversal of the tree
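The quantization in step 1 can be sketched with the simplest of the listed choices, equal-interval binning (a sketch only; SAX or Persist would replace this function, and the names are illustrative):

```python
def discretize(series, alphabet="abcdefg"):
    """Equal-interval quantization of a numeric series into a symbol
    string - one simple choice among equal-frequency, SAX, Persist, etc."""
    lo, hi = min(series), max(series)
    width = (hi - lo) / len(alphabet) or 1.0   # guard against a constant series
    out = []
    for v in series:
        k = min(int((v - lo) / width), len(alphabet) - 1)   # clamp the max value
        out.append(alphabet[k])
    return ''.join(out)
```

For example, `discretize([0, 1, 6.9, 7], "abcdefg")` maps the values into the seven equal-width bins and returns `"abgg"`.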
Step 1: Enhanced Suffix Tree
Enhanced Suffix Tree Construction Algorithm
• The frequent patterns are clustered together using string similarity measures
• Any reasonable string similarity measure and algorithm can be chosen
• An LCS-based similarity measure with a sequential clustering algorithm is chosen for its robustness and efficiency
Step 2: Clustering frequent patterns
Step 2: string similarity measure
• Similarity measure: Sim(s1, s2) = 2 * LCS(s1, s2) / (|s1| + |s2|)
• Distance measure: Dist(s1, s2) = 1 - Sim(s1, s2)
• LCS(s1, s2) is the length of the longest common subsequence of s1 and s2
• |s| is the length of string s
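The similarity and distance measures translate directly into Python; a compact LCS helper is included so the sketch is self-contained:

```python
def lcs_length(a, b):
    """Row-by-row DP for the length of the longest common subsequence."""
    prev = [0] * (len(b) + 1)
    for ch in a:
        cur = [0]
        for j, cb in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if ch == cb else max(prev[j], cur[-1]))
        prev = cur
    return prev[-1]

def sim(s1, s2):
    """Sim(s1, s2) = 2 * LCS(s1, s2) / (|s1| + |s2|)."""
    return 2.0 * lcs_length(s1, s2) / (len(s1) + len(s2))

def dist(s1, s2):
    """Dist(s1, s2) = 1 - Sim(s1, s2)."""
    return 1.0 - sim(s1, s2)
```

Identical strings get similarity 1.0; for "abcd" vs "abed" the LCS is "abd", so Sim = 6/8 = 0.75 and Dist = 0.25.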
Step 2. Clustering
• Sequential clustering algorithm
• Frequent patterns are divided into overlapping buckets based on their length
• Clustering is done for the patterns in each bucket to reduce the clustering complexity
• Finally, a pattern that is a member of two clusters is assigned to the closest cluster
Step 2. Clustering
An example cluster with cluster center ccccddccccccc:
ccccdcccccccc
cccccddbccccc
ccccdcccccccd
cccccdcddcccc
cccccdcdccccc
ccccddccccccc
cccccddcccccc
dccccdccccccc
ccccccddccccc
cccccdccccccc
ccccccdccdccc
ccccccdcccccc
ccccccdcdcccc
[Figure: the cluster's member strings plotted as profiles over 15 time points, with values ranging 0-7]
Step 2. Clustering Algorithm
Step 3. Temporal Relations
• Only three relations are explored here
• "Overlap" here subsumes meets, starts-with, overlaps, and overlapped-by
Temporal Relations
• Temporal relationship concept extended to interval sequence by including support
Temporal Relations
Examples
Example of terminology
Step 3. Algorithm
Step 3. Example
Step 3. Example
Summarization
• Inference of higher-level or more general knowledge from lower-level temporal dependencies
• Summarization generalizes the temporal association rules by identifying the time windows in which a rule is applicable
• Measures for summarization are coverage, average length of coverage, and maximum coverage length
Results: Utility Dataset
Results: EEG Dataset
Scalable up to 1 million rows
Results: Great Lakes dataset
Results: Power Consumption dataset
daily patterns
Weekly patterns after summarization
Spatio-Temporal Clusters
Temporal Patterns
Time axis
Observed Value
Example Domains for Temporal Profiles:
- Gene Expression: each gene's expression level (1000s of genes)
- Utility Consumption: each day's/week's utility consumption (daily profiles over multiple years)
- Social Data: number of crime incidents per month
Goal:
Cluster together profiles with similar behavior
Problem:
How to define similar behavior?
Similar over:
- Complete period
- One interval
- Multiple Intervals
Spatial Patterns
Example Domains for Spatial Profiles:
- Social Data: number of crime incidents per area
- Scientific Data: pollution level in land
[Figure: points scattered in the x-y plane]
Goal:
Cluster together profiles close to each other
Spatial Closeness:
- Euclidean distance
- Chain connectivity
[Figure: point clusters formed by Euclidean closeness and by chain connectivity]
Spatio-Temporal Patterns
Goal: Find patterns in space-time dimensions for phenomena
Issues: Simultaneous handling of space and time dimensions
Problem: How to handle the large complexity of the algorithms
Phenomena in space-time
Examples:
Crime data:
- Each layer represents a month (the t-axis)
- The x, y axes represent space coordinates
Energy Consumption
Brewery data:
- The t-axis represents weeks
- The x, y axes represent Average Production and Average Temperature

[Figure: layers stacked along the time axis, with x = Average Production and y = Average Temperature]
Clustering Algorithms
• Temporal Clustering
• Spatial Clustering
• Spatio-Temporal Clustering
• Discrete Fourier Transform [Agarwal] - Pros: Index time sequences for similarity searching
- Cons: Not able to represent subsequence temporal concepts
• Rule discovery from Time Series [Das1998] - Pros: Windows clustered according to their similarity to determine temporal rules.
- Cons: Unable to cluster profiles according to subsequences of similarity
• PAM - Pros: Works well for small datasets
- Cons: Expensive
• CLARA
- Pros: Works well for large datasets
- Cons: Expensive
• CLARANS - Motivated by PAM and CLARA
• BIRCH & CURE - Suitable for large datasets, clusters found with one scan
• R-Trees
- Often used for indexing spatio-temporal data
- Overlapping causes backtracking in search
• MR-Tree & HR-Tree – use overlapping R-Trees to represent successive states of the database
  - If the number of objects moving from one time instant to another is large, the approach degenerates to independent tree structures with no common paths
Generalizing to multiple dimensions
• Two different approaches
  – Phenomenon moving in the same direction, i.e., parallel to the time axis
  – Phenomenon moving in different directions, i.e., N, NW, NE, S, SW, SE, W, E
• Task: classify the phenomena in time, then track them in space
  – STEP I: Discovery of temporal clusters
  – STEP II: Finding spatial clusters
Phenomenon moving in different directions
Goal: To discover subsequences of the complete temporal profile that are shared by a large number of profiles and are spatial neighbors
Modified Sequences Generated
b c
d b
c d
d c
a c
c c
a b
b c
There are 4 layers, so the final sequences formed are of length 7, i.e., 2n - 1, where n is the number of layers
Complexity
• Number of strings generated moving from Layer 1 to Layer 2
• For L layers, the number of strings generated as the phenomenon moves from layer 1 to layer L
Results
- Test data with 20 profiles
- Each profile observed at 30 time points

[Figure: profile values plotted against time]
Results (Brewery Data 1)
- Energy consumption data of size 51x14
- 51 profiles, each observed over a time period of 14 days

[Figure: energy consumption plotted against time]
Results (Brewery Data 2)
- Production data of size 51x14
- 51 profiles, each observed over a time period of 14 days
Results- Sequences generated from a test data with 7 layers, each layer of size 10X10
- 4480 strings generated
[Figure: observed values plotted against time for each profile index]
Sequences and Bioinformatics
Importance of Sequences
• DNA: sequence of letters A,C,G,T
• Proteins: sequence of 20 amino acids
• Central premise of bioinformatics
  – Sequence → structure → function
• Ultimate goal: predict function based on sequence
Strings and Sequences
Exact String Matching
Solutions to Exact Matching
• O(n+m) upper bound for the worst case
• Preprocess the pattern P
• Preprocess the text T
  – Suffix tree method: Weiner, Ukkonen and others
    • Building the tree: O(m)
    • Searching: O(n)
Suffix Trees
• T = cabca
• Suffixes are:
  – cabca
  – abca
  – bca
  – ca
  – a
• P = ca

Suffix tree for cabca$:
root ─┬─ ca ─┬─ bca$   (leaf 1)
      │      └─ $      (leaf 4)
      ├─ a ──┬─ bca$   (leaf 2)
      │      └─ $      (leaf 5)
      └─ bca$          (leaf 3)
Approximate String Matching
• Errors in the experiment or observation
• Redundancy in the biological system
  – Insertions/deletions
  – Mutations
The Task: Best Alignment
• Given strings
  – HEAGAWGHE
  – PAWHE
• Many alignments are possible:

HEAGAWGHE    HEAGAWGHE    HEAGAWGHE
P-A--W-HE    --P-AW-HE    -PA--W-HE
Scoring Function
• a score for a match:
  H E A G …
  X X A X …
• a score for a mismatch:
  H E A G …
  X X P X …
• a score for alignment against a gap:
  H E A G …
  X X _ X …
Algorithms
• Needleman-Wunsch (NW): Global Alignment
• Smith-Waterman (SW): Local Alignment
• Both are based on the idea of dynamic programming
All Possible Alignments
• Represented using a table
  – The first string, a1a2, is aligned along the top
  – The second string, b1b2, is aligned along the left side

[Figure: the lattice of partial-alignment cells with path counts 1, 1, 1, 3, 5, 13 - each of the 13 lattice paths corresponds to one possible alignment of a1a2 and b1b2, e.g. a1b1a2b2 or b1b2a1a2]
Main idea of DP
• Choices:
  – Align a3 to b3: s(a3,b3) = -5, score: 5 - 5 = 0
  – Align a3 to a gap: s(a3,-) = -8, score: -3 - 8 = -11
  – Align b3 to a gap: s(-,b3) = -8, score: 12 - 8 = 4

[Figure: DP table over a1a2a3 (top) and b1b2b3 (left) with neighboring cell values 5, 12, -3 and the winning choice 4]
Main idea of DP
F(i-1,j-1)   F(i-1,j)
F(i,j-1)     F(i,j)
• Score is sum of independent piecewise scores
• Global alignment (Needleman-Wunsch):
  – F(0,0) = 0;  F(k,0) = F(0,k) = -kd
  – F(i,j) = max{ F(i-1,j-1) + s(a_i,b_j),  F(i-1,j) - d,  F(i,j-1) - d }
• Local alignment (Smith-Waterman):
  – F(0,0) = 0;  F(k,0) = F(0,k) = 0
  – F(i,j) = max{ 0,  F(i-1,j-1) + s(a_i,b_j),  F(i-1,j) - d,  F(i,j-1) - d }
Overview of DP
• Start from the upper left corner
• Assign scores for the leading gaps
• Iteratively fill in the scores in the table
• Find the maximum score, bottom-right for global alignment
• Back-track to the upper-left to find the best overall alignment
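The global-alignment recurrence sketched above translates into a few lines of Python. This is a minimal sketch with a toy match/mismatch/gap scoring scheme, not the Blosum50 matrix used in the tables that follow, and it returns only the final score (back-tracking to recover the alignment is omitted):

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment score via the Needleman-Wunsch recurrence
    (toy linear-gap scoring; returns F(m, n) only)."""
    m, n = len(a), len(b)
    F = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        F[i][0] = i * gap                      # leading gaps in b
    for j in range(1, n + 1):
        F[0][j] = j * gap                      # leading gaps in a
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,   # align a_i with b_j
                          F[i - 1][j] + gap,     # a_i against a gap
                          F[i][j - 1] + gap)     # b_j against a gap
    return F[m][n]
```

For example, aligning "GA" with "GTA" under this scheme gives G-A / GTA: two matches and one gap, for a score of 1 - 2 + 1 = 0.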
Pairwise score from Blosum50
H E A G A W G H E
P -2 -1 -1 -2 -1 -4 -2 -2 -1
A -2 -1 5 0 5 -3 0 -2 -1
W -3 -3 -3 -3 -3 15 -3 -3 -3
H 10 0 -2 -2 -2 -3 -2 10 0
E 0 6 -1 -3 -1 -3 -3 0 6
DP Table for global alignment
H E A G A W G H E
0 <-8 <-16 <-24 <-32 <-40 <-48 <-56 <-64 <-72
P ^-8 *-2 *-9 *-17 <-25 *-33 <-41 <-49 <-57 *-65
A ^-16 ^-10 *-3 *-4 <-12 *-20 <-28 <-36 <-44 <-52
W ^-24 ^-18 ^-11 *-6 *-7 *-15 *-5 <-13 <-21 <-29
H ^-32 *-14 *-18 *-13 *-8 *-9 ^-13 *-7 *-3 <-11
E ^-40 ^-22 *-8 <-16 ^-16 *-9 *-12 ^-15 *-7 *3
HEAGAWGHE
--P-AW-HE
DP Table for local alignment
H E A G A W G H E
0 0 0 0 0 0 0 0 0 0
P 0 0 0 0 0 0 0 0 0 0
A 0 0 0 *5 0 *5 0 0 0 0
W 0 0 0 0 *2 0 *20 <12 <4 0
H 0 *10 <2 0 0 0 ^12 *18 *22 <14
E 0 ^2 *16 <8 0 0 ^4 ^10 *18 *28
AWGHE
AW-HE
Multiple Alignment
• Complexity of DP: O(n^2)
• For k strings: O(n^k)
• Explore other options
  – Hidden Markov Models
Markov Chain Models
• Similar to finite automata
• Emits sequences with certain probability
[Figure: Markov chain over states A, C, G, T emitting sequences such as ATGT…]
Chain rule: P(ATGT) = P(A|TGT) P(T|GT) P(G|T) P(T)
First-order Markov assumption: P(ATGT) = P(A|T) P(T|G) P(G|T) P(T)
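The Markov-chain probability is a simple product; a sketch in Python, using the more common left-to-right factorization P(x1) * prod P(x_i | x_{i-1}) (the slide factors the same joint from the right instead):

```python
def chain_prob(seq, init, trans):
    """Probability of `seq` under a first-order Markov chain:
    P(x1) * product over i of P(x_i | x_{i-1})."""
    p = init[seq[0]]
    for prev, cur in zip(seq, seq[1:]):
        p *= trans[prev][cur]   # one transition probability per adjacent pair
    return p
```

With uniform 0.25 initial and transition probabilities over {A, C, G, T}, any length-4 sequence such as ATGT has probability 0.25^4.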
Hidden Markov Models
• Generalization of Markov Models
• Hidden states that emit sequences
[Figure: HMM with island states A*, C*, G*, T* and non-island states A, C, G, T]

Adding four more states (A*, C*, G*, T*) to represent the "island" model, as opposed to the non-island model, with unlikely transitions between the two models, one obtains a "hidden" MM for CpG islands.
HMMs for Gene Prediction
HMMs & Supervised Learning
• Input: a training set of aligned sequences
• Find optimal transition and emission probabilities
• Criteria: maximize probability of observing the training sequences
• Algorithms
  – Baum-Welch (Expectation Maximization)
  – Viterbi training algorithm
Recognition Phase
• We have optimized probabilities
• Predict the likelihood of a sequence belonging to a family
  – What's the probability that a sequence is generated by the HMM?
A Simple HMM Model
Beg Mj End… …
Example
Training sequences:
AGAAACT
AGGAATT
TGAATCT

P(AGAAACT) = 16/81
P(TGGATTT) = 1/81

     1    2    3    4    5    6    7
A   2/3   0   2/3   1   2/3   0    0
T   1/3   0    0    0   1/3  1/3   1
C    0    0    0    0    0   2/3   0
G    0    1   1/3   0    0    0    0

Each blue square represents a match state that "emits" each letter with a certain probability e_j(a), which is defined by the frequency of a at position j.
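The emission table above can be checked directly. A sketch using exact fractions (the table values are from the slide; the function name is illustrative):

```python
from fractions import Fraction as Fr

# Match-state emission probabilities e_j(a) from the slide's table
# (list index j = position 1..7)
E = {
    'A': [Fr(2, 3), Fr(0), Fr(2, 3), Fr(1), Fr(2, 3), Fr(0),    Fr(0)],
    'T': [Fr(1, 3), Fr(0), Fr(0),    Fr(0), Fr(1, 3), Fr(1, 3), Fr(1)],
    'C': [Fr(0),    Fr(0), Fr(0),    Fr(0), Fr(0),    Fr(2, 3), Fr(0)],
    'G': [Fr(0),    Fr(1), Fr(1, 3), Fr(0), Fr(0),    Fr(0),    Fr(0)],
}

def emit_prob(s):
    """Probability that the seven match states emit string s."""
    p = Fr(1)
    for j, ch in enumerate(s):
        p *= E[ch][j]   # multiply the emission probability at each position
    return p
```

This reproduces the slide's values: P(AGAAACT) = (2/3)^4 = 16/81 and P(TGGATTT) = (1/3)^4 = 1/81.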
Insertions…
Insert states emit symbols just like the match states; however, the emission probabilities are typically assumed to follow the background distribution and thus do not contribute to log-odds scores.

Transitions Ij -> Ij are allowed and account for an arbitrary number of inserted residues that are effectively unaligned (their order within an inserted region is arbitrary).
Beg Mj End
Ij
… and Deletions
Beg Mj End
Dj
Deletions are represented by silent states which do not emit any letters. A sequence of deletions (with D -> D transitions) may be used to connect any two match states, accounting for segments of the multiple alignment that are not aligned to any symbol in a query sequence (string).

The total cost of a deletion is the sum of the costs of the individual transitions (M->D, D->D, D->M) that define the deletion. As in the case of insertions, both linear and affine gap penalties can be easily incorporated in this scheme.
HMMs for Multiple Alignment
Beg Mj End
Ij
Dj
Example
AG---C
A-AG-C
AG-AA-
--AAAC
AG---C
**   *
Emission and Transition Counts

Match emissions:
     C0   C1   C2   C3
A    -    4    0    0
C    -    0    0    4
G    -    0    3    0
T    -    0    0    0

Alignment with insert columns marked by dots:
AG...C
A-AG.C
AGAA.-
--AAAC
AG...C

Insert emissions:
     C0   C1   C2   C3
A    0    0    6    0
C    0    0    0    0
G    0    0    1    0
T    0    0    0    0

[Figure: Beg-M-I-D profile-HMM diagram annotated with transition counts]
Results
Training File 2
Validation File 2
Bibliography
1. G. Das, K. Lin, H. Mannila, G. Renganathan and P. Smyth, "Rule Discovery from Time Series," in Knowledge Discovery and Data Mining, 1998, pp. 16-22.
2. H. Mannila, "Discovery of Frequent Episodes in Event Sequences," Data Mining and Knowledge Discovery, vol. 1, pp. 259, 1997.
3. P. Kam and A. W. Fu, "Discovering Temporal Patterns for Interval-Based Events," in Second International Conference on Data Warehousing and Knowledge Discovery (DaWaK 2000), Lecture Notes in Computer Science, 2000, pp. 317-326.
4. J. F. Roddick and E. Winarko, "Discovering Richer Temporal Association Rules from Interval-based Data: Extended Report," School of Informatics and Engineering, Flinders University, Adelaide, Australia, Tech. Rep. SIE-05-003, March 2005.
5. F. Mörchen and A. Ultsch, "Discovering Temporal Knowledge in Multivariate Time Series," in GfKl Dortmund, 2004.
6. F. Höppner, "Learning Temporal Rules from State Sequences," in Proceedings of the IJCAI Workshop on Learning from Temporal and Spatial Data, Seattle, USA, 2001, pp. 25-31.
7. E. Ukkonen, "On-Line Construction of Suffix Trees," Algorithmica, vol. 14, pp. 249-260, 1995.
8. C. S. Daw, C. E. A. Finney and E. R. Tracy, "A Review of Symbolic Analysis of Experimental Data," Rev. Sci. Instrum., vol. 74, pp. 915-930, Feb. 2003.
9. S. Theodoridis and K. Koutroumbas, Pattern Recognition, San Diego: Elsevier Academic Press, 2003.