Geometric Matching on Sequential Data
Veli Mäkinen
AG Genominformatik
Technical Fakultät
Bielefeld Universität
Stringology Haifa 2005 Geometric matching on sequential data 2
Introduction
Motivation: To study problems in the intersection of geometry and stringology.
Applications to time-series data.
Three problems
1D point set matching under translations (Akutsu, COCOON’04).
1D point set matching under translations, scaling and noise (Böcker & Mäkinen, EuroCG’05)
2D point set matching under translations (Ukkonen & Lemström & Mäkinen, 2003 + Cieliebak & Mäkinen, 2005).
1D point set matching under translations
Two point sets A and B of sizes m and n.
Problem 1a: Find the largest common point set of f(A) and B over translations f.
Problem 1b: Find the largest common point set of f(A) and a contiguous subset of B.
Let k be the number of unmatched points.
Example
[Figure: point sets B and A, and the translated set f(A).]
Problem 1a: k=3
Problem 1b: k=1
Solutions
Trivial in $O(m^2 n \log n)$ time.
Easy in $O(mn \log m)$ time.
Akutsu gives an $O(k^3 + n \log n)$ time solution.
Akutsu’s solution
Use differential encoding for A and B:
$A' = a_2 - a_1,\ a_3 - a_2,\ \ldots,\ a_m - a_{m-1}$,
$B' = b_2 - b_1,\ b_3 - b_2,\ \ldots,\ b_n - b_{n-1}$.
Construct suffix tree T of A’#B’$. Preprocess T for LCA queries.
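The differential encoding can be illustrated with a small sketch. The function name `diff_encode` and the sample points are assumptions made for this example only; the suffix tree construction is not shown.

```python
def diff_encode(points):
    """Return the gaps between consecutive points of a sorted 1D point set."""
    pts = sorted(points)
    return [q - p for p, q in zip(pts, pts[1:])]


# Two point sets that differ only by a translation (here +10)
# have identical differential encodings.
A = [2, 5, 6, 10]
B = [12, 15, 16, 20]
print(diff_encode(A))  # [3, 1, 4]
print(diff_encode(B))  # [3, 1, 4]
```

Because a translation cancels out in every difference, matching under translations turns into ordinary matching on the encoded sequences.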
Akutsu’s solution...
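Let $\mathrm{Jump}(a_i, b_j) = h$, where h is the largest integer such that $a_{i+l} - a_i = b_{j+l} - b_j$ for all $0 \le l < h$.
$\mathrm{Jump}(a_i, b_j)$ can be computed in O(1) time by an LCA query on T.
[Figure: the run $b_j, \ldots, b_{j+h-1}$ of B aligned with the run $a_i, \ldots, a_{i+h-1}$ of A.]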
Akutsu’s solution...
Observation: One of the first k+1 points in both A and B must match.
Each match defines a translation.
For each translation, one needs at most k+1 queries to Jump() to find out whether there is a large enough overlap.
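The candidate-translation idea can be sketched as follows. This is only an illustration under stated assumptions: `jump` is a naive linear scan standing in for the O(1) suffix tree and LCA query, k is taken to count unmatched points of both A and B, and all function names are hypothetical.

```python
def jump(A, B, i, j):
    """Naive stand-in for Jump(a_i, b_j): length of the longest run of
    consecutive points from A[i] and B[j] that agree under one translation."""
    h = 0
    while (i + h < len(A) and j + h < len(B)
           and A[i + h] - A[i] == B[j + h] - B[j]):
        h += 1
    return h


def unmatched_under(A, B, t, k):
    """Count unmatched points of sorted A and B under translation t.
    Returns None as soon as more than k points are unmatched."""
    i = j = miss = 0
    while i < len(A) and j < len(B):
        if A[i] + t == B[j]:
            h = jump(A, B, i, j)   # skip a whole matching run at once
            i += h
            j += h
        elif A[i] + t < B[j]:
            i += 1                 # A[i] has no partner in B
            miss += 1
        else:
            j += 1                 # B[j] has no partner in A
            miss += 1
        if miss > k:
            return None
    miss += (len(A) - i) + (len(B) - j)
    return miss if miss <= k else None


def match_with_k_unmatched(A, B, k):
    """Try the translations defined by the first k+1 points of A and B."""
    A, B = sorted(A), sorted(B)
    for a in A[:k + 1]:
        for b in B[:k + 1]:
            if unmatched_under(A, B, b - a, k) is not None:
                return b - a
    return None


print(match_with_k_unmatched([1, 2, 4, 7], [11, 12, 13, 14, 17], 1))  # 10
```

With the naive `jump` the run lengths are scanned explicitly, so this sketch does not reach the time bounds of the slides; it only shows how few candidate translations need to be tested.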
Akutsu’s solution...
Theorem 1: Problem 1a can be solved in $O(k^3 + n \log n)$ time and Problem 1b in $O(k^2 n + n \log n)$ time.
Akutsu also gives reductions from 2D/3D problems to 1D achieving good bounds.
Three problems
1D point set matching under translations (Akutsu, COCOON’04).
1D point set matching under translations, scaling and noise (Böcker & Mäkinen, EuroCG’05)
2D point set matching under translations (Ukkonen & Lemström & Mäkinen, 2003 + Cieliebak & Mäkinen, 2005).
Linear 1D point set matching
Let us consider a generalization where we also allow scaling and noise.
We search for the best linear mapping from point set A to point set B:
- the maximum number of points of A should move close to points of B.
Example
[Figure: point sets A and B.]
Example...
[Figure: point sets A and B with the mapped set f(A).]
Linear 1D point set matching...
There is an optimum mapping such that two points of A are mapped exactly at ε-distance from some points of B.
One mapping fixes the translation, the second fixes the scale around the new origin defined by the translation.
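As a worked sketch of why two such anchor pairs determine the mapping, assume the linear mapping is written $f(a) = s\,a + t$ and the noise tolerance is $\varepsilon$ (this notation is an assumption, not taken from the slides). Suppose $a_i$ lands at distance exactly $\varepsilon$ from $b_j$ and $a_{i'}$ at distance exactly $\varepsilon$ from $b_{j'}$, with signs $\sigma, \sigma' \in \{-1, +1\}$:

$$
s\,a_i + t = b_j + \sigma\varepsilon, \qquad s\,a_{i'} + t = b_{j'} + \sigma'\varepsilon .
$$

Subtracting the first equation from the second and solving gives

$$
s = \frac{(b_{j'} + \sigma'\varepsilon) - (b_j + \sigma\varepsilon)}{a_{i'} - a_i}, \qquad t = b_j + \sigma\varepsilon - s\,a_i .
$$

There are $O(m^2)$ choices of $(a_i, a_{i'})$, $O(n^2)$ choices of $(b_j, b_{j'})$ and four sign combinations, so the number of candidate mappings is $O((mn)^2)$.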
Example
[Figure: point sets A, B and f(A), with a window of width 2ε marked.]
Degenerate solution!
[Figure: point sets B, A and f(A), with a window of width 2ε marked.]
One-to-one mapping
To avoid the degenerate solution, one needs a better definition for the mapping searched for.
Hence, we search for a mapping producing maximum size one-to-one matching between the points (Problem 2).
[Figure: points of f(A) and B, with windows of width 2ε marked.]
Solving one-to-one case
Consider a fixed translation and scale.
Construct a bipartite graph having edges between points of f(A) and B that are within ε-distance.
Solve the maximum matching problem on this graph.
[Figure: points of f(A) and B, with windows of width 2ε marked.]
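For one fixed translation and scale, the matching step might look like the sketch below. It uses a simple augmenting-path (Kuhn-style) bipartite matching as a stand-in; the slides do not say which matching algorithm is meant, and the names `max_matching`, `fa`, and `eps` are assumptions of this example.

```python
def max_matching(fa, B, eps):
    """Size of a maximum one-to-one matching between the mapped points fa
    and the points B, where a pair may be matched if |x - b| <= eps."""
    # adj[i] lists the indices of points in B close enough to fa[i].
    adj = [[j for j, b in enumerate(B) if abs(x - b) <= eps] for x in fa]
    match_b = [-1] * len(B)  # match_b[j] = index in fa matched to B[j]

    def augment(i, seen):
        # Try to find an augmenting path starting from fa[i].
        for j in adj[i]:
            if j in seen:
                continue
            seen.add(j)
            if match_b[j] == -1 or augment(match_b[j], seen):
                match_b[j] = i
                return True
        return False

    return sum(1 for i in range(len(fa)) if augment(i, set()))


# A = [0, 1, 2] mapped by f(a) = 2a + 1 gives fa = [1, 3, 5].
print(max_matching([1, 3, 5], [2, 3, 9], 1))  # 2
# Degenerate case: all points collapse onto one point of B,
# but a one-to-one matching still counts only one pair.
print(max_matching([3, 3, 3], [3], 1))        # 1
```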
Solving one-to-one case...
Repeating the algorithm on each relevant translation and scale gives the optimum solution.
The overall time complexity is $O((mn)^2\, g(mn))$, where g(x) is the complexity of the maximum matching algorithm on a graph with x edges.
Solving one-to-one case faster
Consider a fixed translation, and sort the relevant scales from smallest to largest.
Observation [Alt et al. 88]: The graph $G_i$ corresponding to the i-th scale differs from the graph $G_{i-1}$ of the (i-1)-th scale by one edge.
The maximum matching on $G_i$ can be found by searching for an augmenting path in $G_{i-1}$ with one edge added or deleted.
Solving one-to-one case faster...
Incremental computation gives an $O((mn)^3)$ time solution.
Theorem 2: Problem 2 can be solved in $O((mn)^2 (m+n))$ time.
To obtain the result, we exploit the monotonicity of the match graph.
Staircase property
[Figure: staircase-shaped match matrix between $f_i(A)$ and B.]
Greedy algorithm is enough
[Figure: greedy path through the match matrix of B and $f_i(A)$.]
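The slides do not spell out the greedy rule, so the following is only one standard two-pointer form of it, assuming both point sets are sorted and a fixed tolerance `eps` is given (the function and parameter names are hypothetical). On sorted 1D points the allowed pairs form a monotone staircase, which is why a single left-to-right sweep is enough.

```python
def greedy_matching(fa, B, eps):
    """Size of a maximum one-to-one matching between sorted 1D point sets,
    pairing points that are within eps of each other, in O(m + n) time
    after sorting."""
    fa, B = sorted(fa), sorted(B)
    i = j = matched = 0
    while i < len(fa) and j < len(B):
        if abs(fa[i] - B[j]) <= eps:
            matched += 1   # pair the two leftmost compatible points
            i += 1
            j += 1
        elif fa[i] < B[j]:
            i += 1         # fa[i] is too far left to match anything later
        else:
            j += 1         # B[j] is too far left to match anything later
    return matched


print(greedy_matching([1, 3, 5], [2, 3, 9], 1))  # 2
```

Each step either matches a pair or discards a point that can no longer be matched, so no augmenting-path search is required for a fixed translation and scale.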
scale i => scale i+1 => scale i+2
[Figure sequence: B against $f_{i+1}(A)$, then B against $f_{i+2}(A)$, as the scale increases step by step.]
Observation - open question
Observation: With only translations and noise, we obtain O(mn(m+n)) time.
The staircase matrix changes only by one cell when moving from one scale to another.
Question: Can one update the greedy path incrementally?
An O(1)-time solution to the question above would imply that adding noise does not make the problem any harder.
Three problems
1D point set matching under translations (Akutsu, COCOON’04).
1D point set matching under translations, scaling and noise (Böcker & Mäkinen, EuroCG’05)
2D point set matching under translations (Ukkonen & Lemström & Mäkinen, 2003 + Cieliebak & Mäkinen, 2005).
2D point set matching
[Figure: 2D point sets B and A, with the translated set f(A).]
Solutions
Easy in $O(mn \log m)$ time by constructing the set of mn translation vectors, sorting it, and finding the most frequently repeated element.
Also possible in $O(mn)$ time by using a naive string matching type algorithm.
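The sorting-based approach can be sketched as below. The function name `best_translation` and the sample points are assumptions for this example. Note that sorting all mn vectors in one go costs $O(mn \log(mn))$; the $O(mn \log m)$ bound would need a merge of m already-sorted lists, which this sketch does not attempt.

```python
def best_translation(A, B):
    """Return (translation, count) for the translation vector b - a that
    occurs most often over all pairs (a, b) of 2D points."""
    vectors = sorted((bx - ax, by - ay) for ax, ay in A for bx, by in B)
    best, best_count = None, 0
    prev, run = None, 0
    for v in vectors:
        # Equal vectors are adjacent after sorting; count each run.
        run = run + 1 if v == prev else 1
        prev = v
        if run > best_count:
            best, best_count = v, run
    return best, best_count


A = [(0, 0), (1, 2), (3, 1)]
B = [(5, 5), (6, 7), (8, 6), (0, 9)]
print(best_translation(A, B))  # ((5, 5), 3)
```

The count of the winning vector is the number of points of A that land exactly on points of B under that translation.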
Naive point set matching
[Figure: point sets A and B.]
Remark: This is the fastest known algorithm for this problem!!
Restricted case?
Would the problem become easier if there were no other points inside the area of matches?
[Figure: f(A) placed inside B.]
Restricted case?
The restricted 1D case is extremely easy:
- Exact string matching on the differentially encoded sequences.
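A minimal sketch of the restricted 1D case follows, using Knuth-Morris-Pratt as the exact string matcher (any linear-time matcher would do; the choice of KMP and the names `diffs` and `find_occurrences` are assumptions of this example).

```python
def diffs(points):
    """Differential encoding: gaps between consecutive sorted points."""
    pts = sorted(points)
    return [q - p for p, q in zip(pts, pts[1:])]


def find_occurrences(A, B):
    """Indices i (into sorted B) such that the points B[i], ..., B[i+m-1]
    are a translated copy of A, with no other point of B in between."""
    p, t = diffs(A), diffs(B)
    if not p:                       # a single point matches every point of B
        return list(range(len(B)))
    # KMP failure function of the pattern p.
    fail = [0] * len(p)
    k = 0
    for i in range(1, len(p)):
        while k and p[i] != p[k]:
            k = fail[k - 1]
        if p[i] == p[k]:
            k += 1
        fail[i] = k
    # Scan the text t.
    out, k = [], 0
    for i, x in enumerate(t):
        while k and x != p[k]:
            k = fail[k - 1]
        if x == p[k]:
            k += 1
        if k == len(p):
            out.append(i - len(p) + 1)
            k = fail[k - 1]
    return out


print(find_occurrences([0, 2, 3], [1, 3, 4, 10, 12, 13, 15, 16]))  # [0, 3, 5]
```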
Easier on grid points
Easier on grid points...
The problem becomes a special case of two-dimensional exact string matching.
Can be solved in $O(N^2)$ time on a text grid of size $N \times N$ and a pattern grid of size $M \times M$.
Notice that the run-length encoded representation of the rows of the matrix is of size O(n).
Easier on grid points...
The algorithm of Amir & Landau & Sokol, 2002, for run-length compressed 2D search can be applied:
- Time complexity $O(M^2 + n)$ (can be reduced to $O(m^2 + n)$?).
What about Bird-Baker?
Our idea for solving the problem is to modify the Bird-Baker algorithm to work directly on point sets.
As a preliminary tool, we need an Aho-Corasick automaton that recognizes run-length encoded binary strings.
Run-length encoding
[Figure: rows of points with gaps 5.7, 12.2 and 3.1, 9.3, ...; the first row is run-length encoded as $0^{5.7}\,1\,0^{12.2} \ldots$]
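One plausible reading of this encoding is that each row of points is stored as the lengths of the empty gaps (runs of 0s) that precede each point (each 1). The sketch below follows that reading; the function name `run_length_encode`, the `origin` parameter, and the integer sample values are assumptions of this example.

```python
def run_length_encode(xs, origin=0):
    """Encode a row of points by the gap preceding each point,
    measured from `origin` for the first point."""
    gaps = []
    prev = origin
    for x in sorted(xs):
        gaps.append(x - prev)  # length of the empty run before this point
        prev = x
    return gaps


# A row with points at x = 5 and x = 17 becomes the gap sequence [5, 12],
# i.e. a run of length 5, a point, a run of length 12, a point.
print(run_length_encode([5, 17]))  # [5, 12]
```

A row with k points is stored in O(k) space regardless of how wide the row is, which is what keeps the total representation linear in the number of points.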
Modified Aho-Corasick automaton
Proposition: There is an automaton accepting a set of run-length encoded binary strings with the following properties:
- $O(m \log m)$ construction time, where m is the number of 1-bits in the set.
- Reading a fail-link in $O(\log m)$ time.
- Scanning a string with n 1-bits in $O(n \log m)$ time.
Bird-Baker on point sets
Now we can build our automaton on the rows of set A, scan it with the rows of set B.
Let R be the set of positions where a row of A was accepted inside the rows of B.
After sorting R by columns, we can test in O(|R|) time if any column of R contains the correct sequence of accepting states.
Bird-Baker on point sets
The overall running time is O(n log m +|R| log |R|).
Unfortunately, there are examples where |R| is of order mn :-(
Hence, it is still open whether (even) the restricted case has an o(mn) time solution.