![Page 1: SPLASH: Structural Pattern Localization Analysis by Sequential Histograms A. Califano, IBM TJ Watson Presented by Tao Tao April 14 th, 2004](https://reader030.vdocuments.us/reader030/viewer/2022032709/56649ec75503460f94bd3282/html5/thumbnails/1.jpg)
SPLASH: Structural Pattern Localization Analysis by Sequential Histograms
A. Califano, IBM TJ Watson
Presented by Tao Tao
April 14th, 2004
![Page 2: SPLASH: Structural Pattern Localization Analysis by Sequential Histograms A. Califano, IBM TJ Watson Presented by Tao Tao April 14 th, 2004](https://reader030.vdocuments.us/reader030/viewer/2022032709/56649ec75503460f94bd3282/html5/thumbnails/2.jpg)
Motif: a functional region of a DNA or protein sequence
How can we discover the functional regions automatically?
Amino acid sequences
![Page 3: SPLASH: Structural Pattern Localization Analysis by Sequential Histograms A. Califano, IBM TJ Watson Presented by Tao Tao April 14 th, 2004](https://reader030.vdocuments.us/reader030/viewer/2022032709/56649ec75503460f94bd3282/html5/thumbnails/3.jpg)
Automatic Motif discovery
Problem
- Let A, B, C, … stand for different amino acids
- A protein sequence: ABABAABCDBAA…
- Motifs are certain patterns in sequences, for example: ABCA
Previous methods: small-scale discovery
- Align several sequences with similar functions
Can we use data mining to generate motif candidates first?
![Page 4: SPLASH: Structural Pattern Localization Analysis by Sequential Histograms A. Califano, IBM TJ Watson Presented by Tao Tao April 14 th, 2004](https://reader030.vdocuments.us/reader030/viewer/2022032709/56649ec75503460f94bd3282/html5/thumbnails/4.jpg)
Automatically discover motifs: what properties should a motif have?
- It has a specific function, so it is conserved and appears frequently in sequences
- Because of evolution, its occurrences are likely not contiguously identical
For example: ABCBABABA
AB--ABAB- string matching, suffix tree …
AB-BAB-B- how?
![Page 5: SPLASH: Structural Pattern Localization Analysis by Sequential Histograms A. Califano, IBM TJ Watson Presented by Tao Tao April 14 th, 2004](https://reader030.vdocuments.us/reader030/viewer/2022032709/56649ec75503460f94bd3282/html5/thumbnails/5.jpg)
Formal problem definition
Input: a string of characters S = s1s2…sL
Output: a frequent pattern from (Σ ∪ {●})*
- ●: a wild card that matches any single character
- Σ: the set of full characters
- *: repeat an arbitrary number of times
Note: NO arbitrary-length gaps.
ABCD, AED are different
![Page 6: SPLASH: Structural Pattern Localization Analysis by Sequential Histograms A. Califano, IBM TJ Watson Presented by Tao Tao April 14 th, 2004](https://reader030.vdocuments.us/reader030/viewer/2022032709/56649ec75503460f94bd3282/html5/thumbnails/6.jpg)
Regular expressions describe a certain type of pattern
- | (or): A|B means A or B
- ● (wild card): matches any single character; A●B means AAB, ABB, ACB, …
- * repeats any number of times (including 0 times): (AB)* means null, AB, ABAB, ABABAB, …
- + repeats any number of times (not including 0 times)
…
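The operators on this slide map onto Python's standard `re` module, where `.` plays the role of ●. A minimal sketch; the helper name `full` and the test strings are illustrative assumptions, not part of the talk:

```python
import re

def full(regex: str, text: str) -> bool:
    """True if the whole of `text` matches `regex`."""
    return re.fullmatch(regex, text) is not None

# "." stands in for the wild card ●.
print(full(r"A|B", "B"))       # A or B
print(full(r"A.B", "ACB"))     # A●B matches AAB, ABB, ACB, ...
print(full(r"A.B", "AB"))      # ● must consume exactly one character
print(full(r"(AB)*", ""))      # * allows zero repetitions
print(full(r"(AB)+", ""))      # + needs at least one repetition
```

Note that SPLASH patterns only use full characters and ●, so the `*` and `+` operators never appear inside a discovered pattern.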
![Page 7: SPLASH: Structural Pattern Localization Analysis by Sequential Histograms A. Califano, IBM TJ Watson Presented by Tao Tao April 14 th, 2004](https://reader030.vdocuments.us/reader030/viewer/2022032709/56649ec75503460f94bd3282/html5/thumbnails/7.jpg)
Any requirements for output patterns?
- Can a wild card be anywhere? Do we need some constraints on wild cards?
- What does “frequent” mean?
- How long should a qualified pattern be?
![Page 8: SPLASH: Structural Pattern Localization Analysis by Sequential Histograms A. Califano, IBM TJ Watson Presented by Tao Tao April 14 th, 2004](https://reader030.vdocuments.us/reader030/viewer/2022032709/56649ec75503460f94bd3282/html5/thumbnails/8.jpg)
Can a wild card be anywhere?
A pattern can have ●, for example A●BA●●B
But what about A●●●●●●●●●●●●●●●●BBA?
Probably, it cannot be too “sparse”…
Naïve solution: no more than n ●
But, for example, with n=5:
A●●●●●B : 5 ● (allowed, yet sparse)
A●BB●●A●●B●●A : 7 ● (rejected, yet denser)
![Page 9: SPLASH: Structural Pattern Localization Analysis by Sequential Histograms A. Califano, IBM TJ Watson Presented by Tao Tao April 14 th, 2004](https://reader030.vdocuments.us/reader030/viewer/2022032709/56649ec75503460f94bd3282/html5/thumbnails/9.jpg)
Density: how “sparse” do we allow?
Given a pattern P, any region of length l0 in P must have at least k0 full characters
Example: l0 = 5, k0 = 3
s1s2s3s4s5s6s7s8s9……
Two ● at most in any window of 5 characters
![Page 12: SPLASH: Structural Pattern Localization Analysis by Sequential Histograms A. Califano, IBM TJ Watson Presented by Tao Tao April 14 th, 2004](https://reader030.vdocuments.us/reader030/viewer/2022032709/56649ec75503460f94bd3282/html5/thumbnails/12.jpg)
Density: examples with l0 = 5, k0 = 3
A●●ABB●A √ (every window of 5 has at least 3 full characters)
BA●●A●BB X (the window A●●A● has only 2 full characters)
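The density constraint can be checked with a sliding window. A minimal sketch, assuming `.` stands in for ● and that a pattern shorter than l0 is judged as a single window (that edge case is an assumption, the slides do not cover it); the function name `is_dense` is hypothetical:

```python
def is_dense(pattern: str, l0: int, k0: int, wild: str = ".") -> bool:
    """True if every length-l0 window of `pattern` has at least k0 full characters."""
    # Assumption: a pattern shorter than l0 is treated as one window.
    n_windows = max(1, len(pattern) - l0 + 1)
    return all(
        sum(c != wild for c in pattern[i:i + l0]) >= k0
        for i in range(n_windows)
    )

print(is_dense("A..ABB.A", 5, 3))  # True
print(is_dense("BA..A.BB", 5, 3))  # False: the window "A..A." has only 2 full characters
```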
![Page 13: SPLASH: Structural Pattern Localization Analysis by Sequential Histograms A. Califano, IBM TJ Watson Presented by Tao Tao April 14 th, 2004](https://reader030.vdocuments.us/reader030/viewer/2022032709/56649ec75503460f94bd3282/html5/thumbnails/13.jpg)
Frequency and length
A pattern must have at least K0 full characters and repeat at least J0 times
Example: J0 = 3
ABCBABABA √
ABCBABABA X
Example: K0 = 3
ABCBABABA X
ABCBABABA √
![Page 14: SPLASH: Structural Pattern Localization Analysis by Sequential Histograms A. Califano, IBM TJ Watson Presented by Tao Tao April 14 th, 2004](https://reader030.vdocuments.us/reader030/viewer/2022032709/56649ec75503460f94bd3282/html5/thumbnails/14.jpg)
Summary of parameters for a pattern
- Sequence S and its length L
- Pattern P with K full characters, appearing J times
- Length constraint: K ≥ K0
- Frequency constraint: J ≥ J0
- Density constraint: l0, k0
![Page 15: SPLASH: Structural Pattern Localization Analysis by Sequential Histograms A. Califano, IBM TJ Watson Presented by Tao Tao April 14 th, 2004](https://reader030.vdocuments.us/reader030/viewer/2022032709/56649ec75503460f94bd3282/html5/thumbnails/15.jpg)
Apriori property
A constraint has the a priori property if:
whenever a set violates the constraint, every superset of it violates the constraint as well.
For example: max(S) < 5
The frequency constraint has the a priori property!
For example, if BA●A●BB appears fewer than J0 times, none of its super-patterns can appear J0 or more times!
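The a priori property of the frequency constraint can be seen by counting: adding full characters to a pattern can only remove occurrences. A minimal sketch, assuming `.` stands in for ●; the function name `count` and the extended patterns are illustrative choices, and the sequence is the one used in the comb example of this talk:

```python
def count(pattern: str, seq: str, wild: str = ".") -> int:
    """Number of start positions where `pattern` matches `seq`."""
    return sum(
        all(p == wild or p == s for p, s in zip(pattern, seq[i:i + len(pattern)]))
        for i in range(len(seq) - len(pattern) + 1)
    )

seq = "ABAABABBABABABBAAB"
base = count("A.A.B", seq)
# Every super-pattern (more full characters, same skeleton) occurs at most as often.
for extended in ("A.A.BA", "A.A.BB", "AAA.B", "A.A.B.A"):
    assert count(extended, seq) <= base
```

This is why a pattern that already fails J0 can be discarded without examining any of its extensions.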
![Page 16: SPLASH: Structural Pattern Localization Analysis by Sequential Histograms A. Califano, IBM TJ Watson Presented by Tao Tao April 14 th, 2004](https://reader030.vdocuments.us/reader030/viewer/2022032709/56649ec75503460f94bd3282/html5/thumbnails/16.jpg)
A whole picture of the algorithm
Form longer patterns only from short qualified patterns.
- First, generate candidates/seeds (length l0): every seed should repeat at least J0 times
- Then generate longer patterns from short patterns, iteratively
1. Two patterns are put together
2. The longer pattern must repeat at least J0 times
……
![Page 17: SPLASH: Structural Pattern Localization Analysis by Sequential Histograms A. Califano, IBM TJ Watson Presented by Tao Tao April 14 th, 2004](https://reader030.vdocuments.us/reader030/viewer/2022032709/56649ec75503460f94bd3282/html5/thumbnails/17.jpg)
Generate the seeds: enumerating …
Generate the seeds (shortest patterns) first
ABAABBCBACBDB… J0=4
A: 4, B: 6, C: 2, D: 1
Are length-1 seeds too short? How long should the seeds be?
- Too long: enumerating costs too much time
- Too short: maybe not efficient, and the density constraint is not considered
Maybe we should start from patterns of length l0.
![Page 18: SPLASH: Structural Pattern Localization Analysis by Sequential Histograms A. Califano, IBM TJ Watson Presented by Tao Tao April 14 th, 2004](https://reader030.vdocuments.us/reader030/viewer/2022032709/56649ec75503460f94bd3282/html5/thumbnails/18.jpg)
How to generate seeds with length l0?
- Given l0, k0, and a character set ABC…
- Enumerate all possible patterns of length l0
- Scan the sequence to count their frequencies
For example, l0=3, k0=2, character set ABC:
AAA, AAB, AAC, ABA, …
AA●, AB●, AC●, …
A●A, A●B, A●C, …
…
![Page 19: SPLASH: Structural Pattern Localization Analysis by Sequential Histograms A. Califano, IBM TJ Watson Presented by Tao Tao April 14 th, 2004](https://reader030.vdocuments.us/reader030/viewer/2022032709/56649ec75503460f94bd3282/html5/thumbnails/19.jpg)
Can we do it more efficiently?
Given l0 and k0, with 1: full character, 0: wild card,
enumerate all possible patterns of 1s and 0s?
Example: l0=5, k0=3, to find the combs
11111, 11110, 11101, 11100, 11011,
11010, 11001, 10111, 10110, 10101,
10011, 01111, 01110, 01101, 01011,
00111
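The list of combs can be generated mechanically: keep every 0/1 mask of length l0 with at least k0 ones. A minimal sketch; the function name `combs` is an assumption for illustration:

```python
from itertools import product

def combs(l0: int, k0: int) -> list[str]:
    """All 0/1 masks of length l0 with at least k0 ones (1 = full character, 0 = wild card)."""
    return [
        "".join(bits)
        for bits in product("10", repeat=l0)
        if bits.count("1") >= k0
    ]

print(len(combs(5, 3)))  # 16
print(combs(5, 3)[:5])   # ['11111', '11110', '11101', '11100', '11011']
```

The number of combs depends only on l0 and k0, not on the alphabet size, which is what makes this cheaper than enumerating every concrete pattern.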
![Page 20: SPLASH: Structural Pattern Localization Analysis by Sequential Histograms A. Califano, IBM TJ Watson Presented by Tao Tao April 14 th, 2004](https://reader030.vdocuments.us/reader030/viewer/2022032709/56649ec75503460f94bd3282/html5/thumbnails/20.jpg)
How to use a comb? For example, comb 10101 on the sequence
A B A A B A B B A B A B A B B A A B
A●A●B (window at position 0)
![Page 21: SPLASH: Structural Pattern Localization Analysis by Sequential Histograms A. Califano, IBM TJ Watson Presented by Tao Tao April 14 th, 2004](https://reader030.vdocuments.us/reader030/viewer/2022032709/56649ec75503460f94bd3282/html5/thumbnails/21.jpg)
How to use a comb? For example, comb 10101 on the sequence
A B A A B A B B A B A B A B B A A B
B●A●A (window at position 1)
![Page 22: SPLASH: Structural Pattern Localization Analysis by Sequential Histograms A. Califano, IBM TJ Watson Presented by Tao Tao April 14 th, 2004](https://reader030.vdocuments.us/reader030/viewer/2022032709/56649ec75503460f94bd3282/html5/thumbnails/22.jpg)
How to use a comb? For example, comb 10101 on the sequence
A B A A B A B B A B A B A B B A A B
A●B●B (window at position 2)
![Page 23: SPLASH: Structural Pattern Localization Analysis by Sequential Histograms A. Califano, IBM TJ Watson Presented by Tao Tao April 14 th, 2004](https://reader030.vdocuments.us/reader030/viewer/2022032709/56649ec75503460f94bd3282/html5/thumbnails/23.jpg)
How to use a comb? For example, comb 10101 on the sequence
A B A A B A B B A B A B A B B A A B
A●A●B 3, B●A●A 2, A●B●B 2,
B●B●A 2, B●B●B 2, A●A●A 1,
A●B●A 1, B●A●B 1
J0 = 3? Only A●A●B is left.
In the same way, use the other combs to generate other seeds; different combs won’t generate the same patterns.
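Sliding one comb along the sequence and histogramming the masked windows reproduces the counts on this slide. A minimal sketch, assuming `.` stands in for ●; the function name `seeds_for_comb` is hypothetical:

```python
from collections import Counter

def seeds_for_comb(seq: str, comb: str, j0: int, wild: str = ".") -> dict[str, int]:
    """Count every seed produced by `comb` on `seq`; keep those seen at least j0 times."""
    counts = Counter()
    for i in range(len(seq) - len(comb) + 1):
        window = seq[i:i + len(comb)]
        # Keep the character where the comb has 1, put a wild card where it has 0.
        seed = "".join(c if bit == "1" else wild for c, bit in zip(window, comb))
        counts[seed] += 1
    return {seed: n for seed, n in counts.items() if n >= j0}

seq = "ABAABABBABABABBAAB"
print(seeds_for_comb(seq, "10101", j0=3))  # {'A.A.B': 3}
```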
![Page 24: SPLASH: Structural Pattern Localization Analysis by Sequential Histograms A. Califano, IBM TJ Watson Presented by Tao Tao April 14 th, 2004](https://reader030.vdocuments.us/reader030/viewer/2022032709/56649ec75503460f94bd3282/html5/thumbnails/24.jpg)
How to get long patterns?
A long pattern comes from merging two patterns, so we need the short patterns and their locations
- Pattern: A●B●●C {A:0, B:2, C:5}
- Locus: the locations where a pattern occurs
Pattern AB in string ABBCABAB
Its locus: {0, 4, 6}
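Both representations are easy to compute. A minimal sketch, assuming `.` stands in for ●; the function names `full_positions` and `locus` are illustrative, and the offsets are returned as (offset, character) pairs so that repeated characters do not collide:

```python
def full_positions(pattern: str, wild: str = ".") -> list[tuple[int, str]]:
    """Offsets of the full characters in a pattern."""
    return [(i, c) for i, c in enumerate(pattern) if c != wild]

def locus(pattern: str, seq: str, wild: str = ".") -> list[int]:
    """Start positions (0-based) where `pattern` occurs in `seq`."""
    return [
        i
        for i in range(len(seq) - len(pattern) + 1)
        if all(p == wild or p == s for p, s in zip(pattern, seq[i:]))
    ]

print(full_positions("A.B..C"))  # [(0, 'A'), (2, 'B'), (5, 'C')]
print(locus("AB", "ABBCABAB"))   # [0, 4, 6]
```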
![Page 25: SPLASH: Structural Pattern Localization Analysis by Sequential Histograms A. Califano, IBM TJ Watson Presented by Tao Tao April 14 th, 2004](https://reader030.vdocuments.us/reader030/viewer/2022032709/56649ec75503460f94bd3282/html5/thumbnails/25.jpg)
Append operation: connect two small patterns into a longer pattern
Pattern S1: A●●B●C and S2: B●D●
S1S2: A●●B●CB●D●
on the condition that their loci, after shifting, intersect
S1 locus: {1, 20, 32, 57, …}
S2 locus: {7, 13, 38, 63, …}, shifted by -6: {1, 7, 32, 57, …}
S1S2 locus: {1, 32, 57, …}
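The append step amounts to shifting the second locus back by the length of the first pattern and intersecting. A minimal sketch using the finite prefixes of the loci shown on the slide; the function name `append` is an assumption:

```python
def append(p1: str, locus1: list[int], p2: str, locus2: list[int]) -> tuple[str, list[int]]:
    """Concatenate p1 and p2; keep the starts of p1 that are followed directly by p2."""
    shift = len(p1)  # p2 must start exactly where p1 ends
    joined_locus = sorted(set(locus1) & {j - shift for j in locus2})
    return p1 + p2, joined_locus

pattern, where = append("A..B.C", [1, 20, 32, 57], "B.D.", [7, 13, 38, 63])
print(pattern)  # A..B.CB.D.
print(where)    # [1, 32, 57]
```

The joined pattern is kept only if its locus still has at least J0 entries.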
![Page 26: SPLASH: Structural Pattern Localization Analysis by Sequential Histograms A. Califano, IBM TJ Watson Presented by Tao Tao April 14 th, 2004](https://reader030.vdocuments.us/reader030/viewer/2022032709/56649ec75503460f94bd3282/html5/thumbnails/26.jpg)
Add operation: make the patterns more “dense”
Patterns A●●B●C●●●D and ●●●B●CE
A●●B●CE●●D
on the condition that their loci (with shifting) intersect
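The add step overlays one pattern on the other and intersects their loci. A minimal sketch, assuming `.` stands in for ●, that a locus records where a pattern starts (leading wild cards included), and that `offset` says how far the second pattern starts after the first. The function name `add` and the loci below are hypothetical values chosen for illustration:

```python
def add(p1, locus1, p2, locus2, offset=0, wild="."):
    """Overlay p2 on p1; return (merged pattern, shared locus), or None on a conflict."""
    length = max(len(p1), offset + len(p2))
    a_pad = p1.ljust(length, wild)
    b_pad = (wild * offset + p2).ljust(length, wild)
    merged = []
    for a, b in zip(a_pad, b_pad):
        if wild not in (a, b) and a != b:
            return None  # two different full characters at the same position
        merged.append(b if a == wild else a)
    shared = sorted(set(locus1) & {j - offset for j in locus2})
    return "".join(merged), shared

# Hypothetical loci, for illustration only.
print(add("A..B.C...D", [2, 10, 30], "...B.CE", [2, 30, 41]))
```

For the two patterns on the slide the merged pattern is A●●B●CE●●D, and with these made-up loci the shared locus is [2, 30].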
![Page 27: SPLASH: Structural Pattern Localization Analysis by Sequential Histograms A. Califano, IBM TJ Watson Presented by Tao Tao April 14 th, 2004](https://reader030.vdocuments.us/reader030/viewer/2022032709/56649ec75503460f94bd3282/html5/thumbnails/27.jpg)
Significance: could the pattern be generated by random sampling?
Hypothesis: a pattern is not randomly generated
Given:
- character set: {A, B, C, D, E}
- sequence length: L
- a pattern: A●BA●AA●B
- its frequency: j
What is the probability of generating this pattern j times purely at random?
![Page 28: SPLASH: Structural Pattern Localization Analysis by Sequential Histograms A. Califano, IBM TJ Watson Presented by Tao Tao April 14 th, 2004](https://reader030.vdocuments.us/reader030/viewer/2022032709/56649ec75503460f94bd3282/html5/thumbnails/28.jpg)
Statistical significance
Under pure random sampling, the frequency should follow a normal distribution
Z score: (A - E[A]) / σ_A, normalized to N(0, 1)
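A Z score of this form can be sketched under a deliberately simple null model. Everything about the model here is an assumption and not necessarily the statistic used by SPLASH: characters are taken to be independent and uniform over the alphabet, and each start position is treated as an independent trial (overlaps are ignored), which gives a binomial mean and standard deviation. The function name `z_score` is hypothetical:

```python
import math

def z_score(pattern: str, j: int, seq_len: int, alphabet_size: int, wild: str = ".") -> float:
    """(observed - expected) / sd under an assumed uniform i.i.d. null model."""
    k = sum(c != wild for c in pattern)   # number of full characters
    p = (1.0 / alphabet_size) ** k        # chance of a match at one position (assumed model)
    n = seq_len - len(pattern) + 1        # number of possible start positions
    mean = n * p
    sd = math.sqrt(n * p * (1.0 - p))     # binomial approximation, ignores overlaps
    return (j - mean) / sd

# Pattern from the slide, alphabet {A, B, C, D, E}; sequence length chosen for illustration.
print(z_score("A.BA.AA.B", j=4, seq_len=15633, alphabet_size=5))
```

With this illustrative sequence length the expected count is about 1, so an observed frequency of 4 lies roughly three standard deviations above it.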
![Page 29: SPLASH: Structural Pattern Localization Analysis by Sequential Histograms A. Califano, IBM TJ Watson Presented by Tao Tao April 14 th, 2004](https://reader030.vdocuments.us/reader030/viewer/2022032709/56649ec75503460f94bd3282/html5/thumbnails/29.jpg)
Experiments
Two questions to answer:
- How efficient is this algorithm?
- How effective is this algorithm?
Baseline algorithms
- PRATT (EBI), MEME (UCSD)
![Page 30: SPLASH: Structural Pattern Localization Analysis by Sequential Histograms A. Califano, IBM TJ Watson Presented by Tao Tao April 14 th, 2004](https://reader030.vdocuments.us/reader030/viewer/2022032709/56649ec75503460f94bd3282/html5/thumbnails/30.jpg)
Efficiency
[Figure: efficiency results for SPLASH and PRATT]
![Page 31: SPLASH: Structural Pattern Localization Analysis by Sequential Histograms A. Califano, IBM TJ Watson Presented by Tao Tao April 14 th, 2004](https://reader030.vdocuments.us/reader030/viewer/2022032709/56649ec75503460f94bd3282/html5/thumbnails/31.jpg)
Effectiveness
Search against SWISS-PROT Rel. 36:
578 GPCR proteins returned, only 4 false positives
MEME could not find it; the PRATT program crashed
![Page 32: SPLASH: Structural Pattern Localization Analysis by Sequential Histograms A. Califano, IBM TJ Watson Presented by Tao Tao April 14 th, 2004](https://reader030.vdocuments.us/reader030/viewer/2022032709/56649ec75503460f94bd3282/html5/thumbnails/32.jpg)
Conclusions
Deterministic algorithm: It can discover all patterns satisfying the requirements
Efficient and scalable: It beats PRATT and MEME. More scalable …
Effective: It can discover useful patterns.
![Page 33: SPLASH: Structural Pattern Localization Analysis by Sequential Histograms A. Califano, IBM TJ Watson Presented by Tao Tao April 14 th, 2004](https://reader030.vdocuments.us/reader030/viewer/2022032709/56649ec75503460f94bd3282/html5/thumbnails/33.jpg)
Problems
All the problems that an a priori algorithm could have: too many results, cannot really avoid worst-case exponential …
Doesn’t really consider the 3D structure of proteins
The software crashes sometimes