Uncertain Sequence Data: Algorithms and Applications James Bailey The University of Melbourne ALSIP 2014


Uncertain Sequence Data:

Algorithms and Applications

James Bailey

The University of Melbourne

ALSIP 2014


Relationship to ALSIP

• We’ll be looking at data mining for sequential data, where the elements of the sequence are uncertain

• Fit with the theme of the workshop

– Challenges in designing efficient algorithms for this scenario

– Uncertainty can be viewed as a form of succinctness or compression

– Applications of uncertain sequence models for text, data streams and spatio/temporal scenarios


Talk Outline

• Background

• Uncertain Data Models

• Challenges

• Related Work

• Mining Probabilistic Spatio-Temporal Sequential Patterns

• Matching Substrings Over Uncertain Sequences

• Future Directions


Background – data uncertainty

• Sources of data uncertainty

– Incompleteness of data sources

– Artificial noise in privacy-sensitive applications

– Uncertainty arising from imprecision in measurements and observations.

Model        Year Made  Kilometers  Transmission  Body Type
Honda Civic  2004       -------     Auto          Sedan
Mazda 3      2002       63,357      Manual        -------

Name      Marital Status  Occupation
Jim Ross  *****           Engineer

Mobile Satellite


Data uncertainty

• Uncertainty due to compression

• Given a collection of certain sequences, summarise/collapse them into an uncertain sequence

– Sequence 1   A  A  A    A  B    C

– Sequence 2   A  A  B    A  B    A

– Sequence 3   A  A  A    A  A    A

– Summary      A  A  A|B  A  A|B  A|C

• E.g.

– Compress a set of trajectories

– Consensus description for a group of proteins

– Summarise a group of time series
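The collapsing step on this slide can be sketched in a few lines. This is only an illustration: the function names `summarise` and `render` are hypothetical, and weighting each symbol by its relative frequency is an assumption (the slide shows only which symbols are possible at each position, not their probabilities).

```python
from collections import Counter


def summarise(sequences):
    """Collapse equal-length certain sequences into one uncertain sequence.

    Returns one dict per position, mapping each symbol seen at that position
    to its relative frequency (an assumed, illustrative choice of weights).
    """
    if len({len(s) for s in sequences}) != 1:
        raise ValueError("all sequences must have the same length")
    n = len(sequences)
    return [
        {sym: cnt / n for sym, cnt in sorted(Counter(column).items())}
        for column in zip(*sequences)
    ]


def render(summary):
    """Render a summary in the slide's 'A|B' notation."""
    return " ".join("|".join(position) for position in summary)


summary = summarise(["AAAABC", "AABABA", "AAAAAA"])
print(render(summary))  # A A A|B A A|B A|C
```

Keeping the per-position weights, rather than just the set of alternatives, is what turns the summary into a probabilistic sequence of the kind used in the rest of the talk.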


Applications

• Applications with data uncertainty

– Trajectory data analysis

– Bioinformatics (DNA and protein comparison)

– Web querying

– Text recognition

A,C,G,T?


Applications

• Text mining

– Given a noisy stream of speech being parsed by a machine

• word1 word2 word3 word4 word5 word6 ………

• There may be uncertainty about each word. E.g. was word3 “likes”, “strikes” or “spikes”?

• Wish to compute the probability that the stream contains the query phrase “more strikes”


• Example query: what is the probability that the (certain) sequence query = CO is a substring?

     C    O    C    A    C    O    L    A
C   0.4  0.1  0.4  0    0.4  0.1  0    0
G   0.3  0.1  0.3  0    0.3  0.1  0    0
O   0.3  0.7  0.3  0    0.3  0.7  0    0
Q   0    0.1  0    0    0    0.1  0    0
L   0    0    0    0    0    0    1.0  0
A   0    0    0    1.0  0    0    0    1.0
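For a sequence this short, the query can be answered by brute force over all possible worlds. The sketch below assumes that positions are independent and that each column of the table is a distribution over symbols; the names `S` and `substring_probability_bruteforce` are hypothetical. It is meant to make the semantics concrete, not to be efficient.

```python
from itertools import product

# Columns of the uncertain sequence from the table above (zero entries omitted).
S = [
    {"C": 0.4, "G": 0.3, "O": 0.3},
    {"C": 0.1, "G": 0.1, "O": 0.7, "Q": 0.1},
    {"C": 0.4, "G": 0.3, "O": 0.3},
    {"A": 1.0},
    {"C": 0.4, "G": 0.3, "O": 0.3},
    {"C": 0.1, "G": 0.1, "O": 0.7, "Q": 0.1},
    {"L": 1.0},
    {"A": 1.0},
]


def substring_probability_bruteforce(query, columns):
    """Sum the probabilities of all possible worlds that contain `query`."""
    total = 0.0
    # One world = one (symbol, probability) choice per position.
    for world in product(*(col.items() for col in columns)):
        text = "".join(sym for sym, _ in world)
        if query in text:
            prob = 1.0
            for _, p in world:
                prob *= p
            total += prob
    return total


p_co = substring_probability_bruteforce("CO", S)
print(round(p_co, 4))  # 0.5032
```

Here only 432 worlds exist, but the count multiplies with every uncertain position, which is why enumeration does not scale.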


Applications

• Linear spatial anomalies

– Each person has a probability of being a “bad guy”.

– What is the probability 3 bad guys are together in a row ?
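A minimal sketch of how such a query could be answered, assuming each person is a "bad guy" independently of the others. The function name `prob_run` is hypothetical, and the probabilities in the usage line are made up for illustration since the slide gives none. The state tracks the length of the current run of bad guys, and probability mass is moved to `hit` the first time a run of length `k` is completed.

```python
def prob_run(probs, k):
    """P(at least k consecutive 'bad' persons in the line), with k >= 1."""
    # state[r]: probability that no run of length k has occurred so far
    # and the current run of bad guys has length r (0 <= r < k).
    state = [1.0] + [0.0] * (k - 1)
    hit = 0.0
    for p in probs:
        nxt = [0.0] * k
        for r, mass in enumerate(state):
            nxt[0] += mass * (1.0 - p)      # not bad: the run resets
            if r + 1 == k:
                hit += mass * p             # run of length k completed
            else:
                nxt[r + 1] += mass * p      # run grows by one
        state = nxt
    return hit


print(prob_run([0.5, 0.5, 0.5, 0.5], 3))  # 0.1875
```

The running time is proportional to the number of persons times `k`, instead of exponential in the number of persons.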


Applications: Bioinformatics

• A motif (protein) can be represented as an uncertain sequence

• Given uncertain sequence s1 and uncertain sequence s2, how similar are s1 and s2?

• Compute all possible k-mers, and use these k-mers as a feature space to compare the similarity of s1 and s2.

     1    2    3    4
A   0.2  0.5  0.1  0.7
C   0.3  0.1  0.2  0.3
T   0.4  0.1  0.2  0
G   0.1  0.3  0.5  0


Vector representation

AA AC AT AG CA ..

0.4 0.2 0.9 0.01 0.33 ..
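One plausible way to obtain such a vector is to use the expected number of occurrences of each k-mer as its feature value. This definition is an assumption, as is the independence of positions; the slide does not say how its numbers were derived, and they are not reproduced here. The sketch uses the 4-position matrix from the previous slide, and `PWM` and `expected_kmer_counts` are hypothetical names.

```python
from itertools import product

# Position weight matrix from the previous slide: one dict per position 1..4.
PWM = [
    {"A": 0.2, "C": 0.3, "T": 0.4, "G": 0.1},
    {"A": 0.5, "C": 0.1, "T": 0.1, "G": 0.3},
    {"A": 0.1, "C": 0.2, "T": 0.2, "G": 0.5},
    {"A": 0.7, "C": 0.3, "T": 0.0, "G": 0.0},
]


def expected_kmer_counts(pwm, k, alphabet="ACTG"):
    """Map every k-mer to its expected number of occurrences in the sequence."""
    counts = {"".join(t): 0.0 for t in product(alphabet, repeat=k)}
    for start in range(len(pwm) - k + 1):       # every window of length k
        for kmer in counts:
            p = 1.0
            for offset, sym in enumerate(kmer):
                p *= pwm[start + offset].get(sym, 0.0)
            counts[kmer] += p                   # linearity of expectation
    return counts


vec = expected_kmer_counts(PWM, 2)
print(round(vec["AA"], 2))  # 0.22
```

Two uncertain sequences could then be compared by any vector similarity (for example cosine similarity) over these feature vectors.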


Examples: uncertain data

• Uncertain Transactions (frequency checking problem)

• Uncertain Trajectory Data (frequency checking with gap constraints)

• Uncertain Sequence (substring/subsequence matching problem)

time  Clusters
1     {o1:0.6,o2:1.0,o3:0.8}; {o4:0.7,o5:1.0,o6:0.8}
2     {o1:0.8,o2:1.0}; {o3:1.0,o4:0.5}; {o5:0.7,o6:0.5}
6     {o1:0.4,o2:1.0}; {o3:1.0,o4:0.9,o5:0.9,o6:0.8}

TransactionID  Itemset  Probability
1              {a,b}    0.8
2              {b,c,d}  0.7

A   0.1  0    0.2  1.0
C   0.2  1.0  0.4  0
G   0.3  0    0.2  0
T   0.4  0    0.2  0


Frequency Checking vs. Substring/Subsequence Matching

• Checking the frequency of an itemset I in a set of transactions

– I occurs at least three times in a transaction database

• To match a subsequence in a (longer) reference sequence

– “III” is contained in the reference sequence (at least once) where the gap is set to infinity.

Tran 1  Tran 2  Tran 3  Tran 4  Tran 5  Tran 6
I       ¬I      I       I       ¬I      I

Treat transaction history as an ordered list; itemset I as a character in the sequence; all other characters as ¬I.

𝑠[1]  𝑠[2]  𝑠[3]  𝑠[4]  𝑠[5]  𝑠[6]
I     A     I     I     B     I


Uncertain Data Models

• Expectation models

– Treat existence probabilities as weights

– Lack an indication of confidence

– item a has an expected frequency of 1.0, but it is not possible that a has a frequency of 2.

– item b has a lower expected frequency of 0.5, but b has a probability of 0.06 of having a frequency of 2.

TransactionID  Items
1              a:1.0, b:0.2
2              b:0.3
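The numbers quoted on this slide can be checked directly from the table, assuming that an item's occurrences in different transactions are independent:

```latex
\begin{align*}
E[\mathrm{freq}(a)] &= 1.0, &
P(\mathrm{freq}(a) = 2) &= 0 \quad \text{($a$ appears in one transaction only)},\\
E[\mathrm{freq}(b)] &= 0.2 + 0.3 = 0.5, &
P(\mathrm{freq}(b) = 2) &= 0.2 \times 0.3 = 0.06 .
\end{align*}
```

The expectation alone ranks a above b, yet only b can reach a frequency of 2, which is the confidence information the expectation model hides.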


Uncertain Data Models cont.

• Probabilistic (confidence) models

– Possible world semantics

– More popular

• Probability that b occurs in at least one transaction: P(W1) + P(W2) + P(W3) = 0.44

TransactionID  Items
1              a:1.0, b:0.2
2              b:0.3

instantiated

Possible World Wi  Items             Probability
1                  T1: a,b; T2: b    0.06
2                  T1: a;   T2: b    0.24
3                  T1: a,b; T2:      0.14
4                  T1: a;   T2:      0.56
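The possible-world table can be reproduced by enumeration. The sketch below assumes every item occurrence is an independent event; `DB` and `possible_worlds` are hypothetical names, and worlds of probability zero (those in which a is missing from T1) are skipped.

```python
from itertools import product

# Uncertain database from the slide: transaction -> {item: existence probability}.
DB = {"T1": {"a": 1.0, "b": 0.2}, "T2": {"b": 0.3}}


def possible_worlds(db):
    """Yield (world, probability); a world maps transaction id -> set of items."""
    slots = [(tid, item, p) for tid, items in db.items() for item, p in items.items()]
    for choice in product([True, False], repeat=len(slots)):
        prob = 1.0
        world = {tid: set() for tid in db}
        for (tid, item, p), present in zip(slots, choice):
            prob *= p if present else 1.0 - p
            if present:
                world[tid].add(item)
        if prob > 0.0:
            yield world, prob


# Probability that b occurs in at least one transaction.
p_b = sum(
    prob
    for world, prob in possible_worlds(DB)
    if any("b" in items for items in world.values())
)
print(round(p_b, 2))  # 0.44
```

With n uncertain item occurrences the loop visits 2^n choices, which is the blow-up discussed on the next slide.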


Challenges

• As the size of uncertain sequence / time space / transaction DB increases, the number of possible worlds grows exponentially.

• In the problem of substring/subsequence matching, checking pattern characteristics for uncertain data in the presence of gaps comes with extra challenges.


Possible worlds challenge


Related Work

• Uncertain frequent itemset mining (Bernecker et al. 09)

• Top-k queries in uncertain data (Hua et al. 08, Yi et al. 10)

• Notations of source-level and event-level uncertain models for sequential pattern mining (Muhammad and Raman 10)

• Mining Probabilistic Spatio-Temporal Sequential Patterns (Li et al. 13)

• Matching Substrings over Uncertain Sequences (Li et al. 14)


Related Work

• Top-k queries in uncertain data (Hua et al. 08)

– Based on probabilistic model

– Ranking query with a probabilistic threshold p

– Returning tuples whose top-k probability values are at least p

– Three algorithms proposed:

• An exact algorithm with pruning rules: faster for a small k

• A sampling (approximation) method:

– Trade-off between accuracy and efficiency

– Generally more stable in runtime.

• A Poisson approximation based method

– Better approximation as k increases


Related Work

• Uncertain frequent itemset mining (Bernecker et al. 09)

– Based on a probabilistic model (possible world semantics)

– Given a frequency threshold and a probability threshold of an itemset, the main task is to compute the frequentness probability.

– A dynamic programming approach introduced

• Using the Poisson binomial recurrence technique

• Linear time and space complexity (assuming the frequency threshold is a constant)
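A minimal sketch of a Poisson binomial recurrence for the frequentness probability, written from the general technique rather than from the cited paper's pseudocode. It assumes the input is, for each transaction, the probability that the transaction contains the itemset, and that these events are independent; `frequentness_probability` and `minsup` are hypothetical names.

```python
def frequentness_probability(probs, minsup):
    """P(the itemset is contained in at least `minsup` transactions)."""
    # at_least[i] = P(support >= i) over the transactions processed so far.
    at_least = [1.0] + [0.0] * minsup
    for p in probs:
        # Update in descending i so that at_least[i - 1] is still the value
        # from before this transaction was processed.
        for i in range(minsup, 0, -1):
            at_least[i] = at_least[i - 1] * p + at_least[i] * (1.0 - p)
    return at_least[minsup]


# Item b from the earlier example: probabilities 0.2 and 0.3.
print(round(frequentness_probability([0.2, 0.3], 1), 2))  # 0.44
print(round(frequentness_probability([0.2, 0.3], 2), 2))  # 0.06
```

Each transaction costs `minsup` updates, so the running time is linear in the number of transactions when `minsup` is treated as a constant, in line with the complexity stated above.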


Related Work

• Notations of source-level and event-level uncertain models for sequential pattern mining

– Discussed both expectation model and probabilistic model

– Two uncertain models for probabilistic frequentness

• Event-level uncertainty

• Source-level uncertainty

      p-sequence
DXp   {a,b: 0.6} {c,d: 0.3}
DYp   {a,b: 0.4} {c,d: 0.7}

e-id  event  W
e1    (a,b)  X:0.6, Y:0.4
e2    (c,d)  X:0.3, Y:0.7


Matching Substrings over Uncertain Sequences

Y. Li, J. Bailey, L. Kulik and J. Pei. Efficient Matching of Substrings in Uncertain Sequences. In Proceedings of the 2014 SIAM International Conference on Data Mining (SDM), 2014.


Problem Definition

• Given a query substring 𝑞 and an uncertain sequence 𝑠, our main task is to calculate the substring matching probability 𝑃(𝑞 ⊑ 𝑠).

• An example query: what is the probability that the (certain) sequence AGCTCT is a substring of 𝑠?

• No gaps permitted in matching

• Challenge: the number of possible worlds increases exponentially with the size of the uncertain sequence 𝑠.
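One standard way to avoid enumerating possible worlds is to push probability mass through the KMP matching automaton of the query. The sketch below uses that textbook construction; it is not the backward-index scheme of the paper presented in this talk. It assumes independent positions, with each column given as a dict from symbol to probability, and `substring_match_probability` is a hypothetical name.

```python
def substring_match_probability(query, columns):
    """P(query occurs as a substring), for a non-empty certain query."""
    m = len(query)
    # fail[k]: length of the longest proper border of query[:k] (KMP table).
    fail = [0] * (m + 1)
    k = 0
    for i in range(1, m):
        while k and query[i] != query[k]:
            k = fail[k]
        if query[i] == query[k]:
            k += 1
        fail[i + 1] = k

    def step(state, sym):
        """Next automaton state after reading sym with `state` chars matched."""
        while state and query[state] != sym:
            state = fail[state]
        return state + 1 if query[state] == sym else 0

    # state_prob[k]: probability of having matched exactly k query characters
    # without having completed a full match so far.
    state_prob = [1.0] + [0.0] * (m - 1)
    matched = 0.0
    for col in columns:
        nxt = [0.0] * m
        for state, mass in enumerate(state_prob):
            if mass == 0.0:
                continue
            for sym, p in col.items():
                new_state = step(state, sym)
                if new_state == m:
                    matched += mass * p     # absorbed: query has occurred
                else:
                    nxt[new_state] += mass * p
        state_prob = nxt
    return matched


certain = [{c: 1.0} for c in "TTAGCTCTAA"]
print(substring_match_probability("AGCTCT", certain))  # 1.0
```

The table has one entry per automaton state, and each position of the sequence updates every state once per symbol, so the work grows with the product of query length and sequence length rather than with the number of possible worlds.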


Possible worlds challenge


A Dynamic Programming Approach

(overview)

• To split the problem of computing substring matching probability for a sequence with size of j into sub-problems of computing the substring matching probabilities for sequences with size of j − 1.

• Our approach consists of two parts:

– Backward Index Computation: to perform a top-down scan on q for computing the backward indices (for performing backward matching).

– Dynamic programming scheme: to compute the substring matching probability using a bottom-up dynamic programming scheme.

• Forward matching

• Backward matching and Tail matching

• Reset


Forward Matching

• In the step of matching $q(i)$ over $s(j)$, if $s[j] = q[i]$, then we continue to match $q(i-1)$ over $s(j-1)$.

• Example

• $q$ = AGCT, if $s[6] = q[4]$,

• then we match $q(3)$ = AGC over $s(5)$.

𝑠 [1] 𝑠 [2] 𝑠 [3] 𝑠 [4] 𝑠 [5] 𝑠 [6]

T


Backward Matching

• Move backward and match 𝑞(𝑘) over 𝑠 (𝑗 − 1).

• Example

• 𝑞 = ACTC (character positions 1 2 3 4)

𝑠 [1] 𝑠 [2] 𝑠 [3] 𝑠 [4] 𝑠 [5] 𝑠 [6]

T C T C

𝑠 [1] 𝑠 [2] 𝑠 [3] 𝑠 [4] 𝑠 [5] 𝑠 [6]

A C T C T C

𝑘 = 2


Tail Matching and Reset

• Tail Matching: match 𝑞(𝑚 − 1) over 𝑠(𝑗 − 1).

• Not constrained by gap compared to backward matching

• Reset: match 𝑞(𝑚) over 𝑠(𝑗 − 1), if the conditions for the three other scenarios are false.


Example: matching AGCTCT over 𝑠 (𝟏𝟎)

• Backward Matching index Computation

• One pass on the query, starting from the right.


Example: matching AGCTCT over 𝑠 (𝟏𝟎)

• Computing the probability

• It computes and stores the internal results column by column in a bottom-up manner.


Time Complexity and Space Complexity

• $m$: the size of the query

• $n$: the size of the uncertain reference sequence

• Total #nodes computed and stored: $(n - m + 1) \cdot m = O(m \cdot n)$


Experiments

• Scalability


Mining Probabilistic Spatio-Temporal Sequential Patterns

Y. Li, J. Bailey, L. Kulik, and J. Pei. Mining Probabilistic Frequent Spatio-Temporal Sequential Patterns with Gap Constraints from Uncertain Databases. In Proceedings of the 2013 IEEE International Conference on Data Mining (ICDM), 2013.


Background

• Spatio-temporal (sequential) patterns in certain data

– Flocks (Vieira et al. 09)

– Convoys (Jeung et al. 10)

– Swarms (Li et al. 11)

• A minimum number of moving objects $O$ stay together for a minimum number of (consecutive) timestamps $T$.

– Minimum number of objects: $|O| \ge min_o$

– Minimum number of timestamps: $|T| \ge min_t$

– Maximum gap constraint: $\sqcup T \le g$

t1  t2  t3  t4  t5  t6
√       √           √

$T = \{t_1, t_3, t_6\}$, $\sqcup T = 2$
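The timestamp constraints can be made concrete for certain data with a small sketch. It reads the gap between two consecutive timestamps of T as the number of skipped timestamps, which matches the example above (t3 to t6 skips t4 and t5, so the maximum gap is 2). The names `max_gap` and `satisfies_time_constraints` are hypothetical, and the object-count constraint is left out.

```python
def max_gap(timestamps):
    """Largest number of skipped timestamps between consecutive members of T."""
    ts = sorted(set(timestamps))
    return max((b - a - 1 for a, b in zip(ts, ts[1:])), default=0)


def satisfies_time_constraints(timestamps, min_t, g):
    """True if |T| >= min_t and the maximum gap of T is at most g."""
    return len(set(timestamps)) >= min_t and max_gap(timestamps) <= g


T = {1, 3, 6}                                   # T = {t1, t3, t6}
print(max_gap(T))                               # 2
print(satisfies_time_constraints(T, 3, 2))      # True
print(satisfies_time_constraints(T, 3, 1))      # False
```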


Examples: uncertain data

• Uncertain Trajectory Data (frequency checking with gap constraints)

time  Clusters
1     {o1:0.6,o2:1.0,o3:0.8}; {o4:0.7,o5:1.0,o6:0.8}
2     {o1:0.8,o2:1.0}; {o3:1.0,o4:0.5}; {o5:0.7,o6:0.5}
6     {o1:0.4,o2:1.0}; {o3:1.0,o4:0.9,o5:0.9,o6:0.8}


Example

• Parameters: $min_o = 2$, $min_t = 3$, $g = 1$

𝑡1 𝑡2 𝑡3 𝑡4 𝑡5 𝑡6

{𝑜1, 𝑜2} {𝑜1, 𝑜2} {𝑜1, 𝑜2}

𝑡1 𝑡2 𝑡3 𝑡4 𝑡5 𝑡6

{𝑜1, 𝑜2} {𝑜1, 𝑜2} {𝑜1, 𝑜2}

⨆𝑇 = 2 > 𝑔


Data Uncertainty in Location Data

• Location is represented by a probability-density function

• Whether objects 𝑂 stay together at 𝑡 is probabilistic.

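– Co-occurrence of objects at 𝑇 is described by a discrete probability-distribution function.

[Figure: location given as lon/lat: $f(x) = ax + b$; bar chart on a 0 to 1 axis with bars 0.2, 0.5, 0.2, 0.1 for $P(T = \{t_1, t_2\})$, $P(T = \{t_1\})$, $P(T = \emptyset)$, $P(T = \{t_2\})$]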


Problem Definition

• The main computational challenge is to calculate the frequentness probability:

o the probability that a pattern $O$ satisfies the minimum #timestamps threshold $min_t$ and the maximum gap constraint $g$.

• The problem is to find all patterns that satisfy the probability threshold.

• Challenge: the number of possible worlds increases exponentially with both object space and time space.

[Figure: timeline $t_1, \dots, t_8, \dots$ over the time space $TS$ with three marked timestamps, annotated with $g$ and $min_t$]


A Dynamic Programming Approach

(overview)

• Splitting the problem of computing the frequentness probability at the first $j$ timestamps into subproblems of computing frequentness probabilities at the first $j - 1$ timestamps.

• Question: constraints on the patterns of subproblems?

o It depends on whether $O$ occurs at $t_j$

• Two constraints need to be considered:

o Minimum number of timestamps threshold $min_t$.

o Maximum gap constraint $g$.


A Dynamic Programming Approach

• The frequentness probability $P^{g}_{\ge i,j}(O)$ at $T_j = \{t_1, \dots, t_j\}$ is computed from the frequentness probabilities at $T_{j-1}$

• $T$ = the timestamps of $T_j$ at which $O$ occurs

• $T'$ = the timestamps of $T_{j-1}$ at which $O$ occurs

• Question: constraints on $T'$ to make $|T| \ge i$ and $\sqcup T \le g$?

• For minimum #timestamps threshold $i$:

o if $O$ @ $t_j$, $O$ must occur at at least $i - 1$ timestamps of $T_{j-1}$: $|T'| \ge i - 1$

o otherwise $O$ must occur at at least $i$ timestamps of $T_{j-1}$: $|T'| \ge i$


A Dynamic Programming Approach

• Tail gap: $\vee_{T,j} = j - m$, where $t_m$ is the last timestamp in $T$

• For gap constraint $g$:

o if $O$ @ $t_j$, $T'$ must fulfill gap constraint $\sqcup T' \le g$ and tail gap constraint $\vee_{T',j-1} \le g$

o otherwise $T'$ must fulfill gap constraint $\sqcup T' \le g$

• For tail gap constraint $y$:

o if $O$ @ $t_j$, $T'$ must fulfill gap constraint $\sqcup T' \le g$ and tail gap constraint $\vee_{T',j-1} \le y$

o otherwise $T'$ must fulfill gap constraint $\sqcup T' \le g$ and tail gap constraint $\vee_{T',j-1} \le y - 1$

t1  t2  t3  t4  t5
√       √

$T = \{t_1, t_3\}$, $\vee_{T,5} = 5 - 3$
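The count and tail-gap bookkeeping above can be illustrated with a forward dynamic program. This is a sketch under stated assumptions, not the layered recurrence of the paper: it assumes the pattern O occurs at each timestamp independently with a known probability, requires `min_t >= 1`, and treats the whole set of occurrence timestamps as T. The name `frequentness_probability_with_gap` is hypothetical.

```python
def frequentness_probability_with_gap(probs, min_t, g):
    """P(|T| >= min_t and max gap of T <= g), T = timestamps where O occurs.

    probs[j] is the (assumed independent) probability that O occurs at t_{j+1}.
    """
    # State: (occurrence count capped at min_t, tail gap capped at g + 1).
    # The tail gap counts timestamps since the last occurrence and is
    # irrelevant while the count is still 0.
    states = {(0, 0): 1.0}
    for p in probs:
        nxt = {}
        for (count, tail), mass in states.items():
            # O does not occur here: the tail gap grows (once T is non-empty).
            miss = (count, min(tail + 1, g + 1)) if count else (0, 0)
            nxt[miss] = nxt.get(miss, 0.0) + mass * (1.0 - p)
            # O occurs here: only valid if the gap it closes is at most g;
            # otherwise the world violates the gap constraint and is dropped.
            if count == 0 or tail <= g:
                hit = (min(count + 1, min_t), 0)
                nxt[hit] = nxt.get(hit, 0.0) + mass * p
        states = nxt
    return sum(mass for (count, _), mass in states.items() if count >= min_t)


print(frequentness_probability_with_gap([0.5, 0.5, 0.5], 2, 0))  # 0.375
```

The number of states is bounded by (min_t + 1) × (g + 2), so the work per timestamp does not depend on the length of the time space.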


Example: Tail Gap

• Computing $P^{2}_{\ge 3,6}(O)$

• if $O$ @ $t_6$, $\sqcup T' \le 2$ and $\vee_{T',5} \le 2$

t1  t2  t3  t4  t5  t6
    √   √           √

$T' = \{t_2, t_3\}$: $\sqcup T' = 0$, $\vee_{T',5} = 2$

t1  t2  t3  t4  t5  t6
√   √               √

$T' = \{t_1, t_2\}$, $T = \{t_1, t_2, t_6\}$: $\sqcup T' = 0$, $\vee_{T',5} = 3$


Implementation

• Bottom-up approach: the internal results are stored for further calculations.

• Trade-off between time complexity and space complexity.

• The internal results are calculated layer by layer.

Example: Computing $P^{1}_{\ge 3,5}(O)$

[Figure, built up step by step over several slides: the layered table of stored results]

• Bottom layer: $P_{0,2}$, $P_{0,1}$, $P_{0,0}$

• Internal layer 1 ($i = 1$): $P^{\vee 1,1}_{\ge 1,3}$, $P^{\vee 1,1}_{\ge 1,2}$, $P^{\vee 0,1}_{\ge 1,2}$, $P^{\vee 0,1}_{\ge 1,1}$

• Internal layer 2 ($i = 2$): $P^{\vee 1,1}_{\ge 2,4}$, $P^{\vee 1,1}_{\ge 2,3}$, $P^{\vee 0,1}_{\ge 2,3}$, $P^{\vee 0,1}_{\ge 2,2}$

• Top layer ($i = 3$): $P^{1}_{\ge 3,5}$, $P^{1}_{\ge 3,4}$, $P^{0}_{\ge 3,3}$; the output is read from this layer

• Nodes labelled A, B and C, with C = A + B

• $height = min_t + 1$

• $width = |TS| - min_t + 1$

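#Nodes per Internal Layer

[Figure: internal layer $i = 1$ with nodes $P^{\vee 1,1}_{\ge 1,3}$, $P^{\vee 1,1}_{\ge 1,2}$, $P^{\vee 0,1}_{\ge 1,2}$, $P^{\vee 0,1}_{\ge 1,1}$, drawn as a parallelogram with $height = g + 1$ and $width = |TS| - min_t + 1 - g$; plot of #nodes against maximum gap]

• Parallelogram in shape

• #nodes = $(|TS| - min_t + 1 - g) \times (g + 1)$

• A quadratic function of $g$ that peaks at $g = (|TS| - min_t)/2$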


Time Complexity and Space Complexity

• Total #nodes computed and stored = $O(min_t \cdot g \cdot |TS|)$

• Linear time: $O(|TS|)$ if we assume input parameters $min_t$ and $g$ are constants.

• If $g = \infty$ ($g = |TS| - min_t$),

o equivalent to uncertain frequent itemset mining.

[Figure: timeline $t_1, \dots, t_8, \dots$ over the time space $TS$ with three marked timestamps, annotated with $g$ and $min_t$]


Experiments


Future Directions


Subsequence Matching with Arbitrary Gaps

• Substring: no gap is allowed.

• Subsequence: gap is set to infinity.

• Subsequence matching with arbitrary gaps

– Gap constraints are imposed to relax and/or restrict the distance between two adjacent characters.
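For a certain (non-probabilistic) reference sequence, the three cases can be covered by one check with a gap parameter. This is only a sketch to pin down the definitions: `matches_with_gap` is a hypothetical name, a single maximum gap is assumed for every pair of adjacent query characters, and extending it to uncertain sequences is exactly the open problem the slide refers to.

```python
def matches_with_gap(query, text, max_gap=None):
    """True if `query` occurs in `text` with at most `max_gap` skipped
    characters between adjacent query characters.

    max_gap=0 is substring matching; max_gap=None (unbounded) is
    subsequence matching.
    """
    if not query:
        return True
    # ends: positions in text where the query prefix matched so far can end.
    ends = {i for i, c in enumerate(text) if c == query[0]}
    for ch in query[1:]:
        ends = {
            j
            for j, c in enumerate(text)
            if c == ch
            and any(
                p < j and (max_gap is None or j - p - 1 <= max_gap)
                for p in ends
            )
        }
        if not ends:
            return False
    return True


reference = "IAIIBI"                                # sequence from the earlier slide
print(matches_with_gap("III", reference, 0))        # False
print(matches_with_gap("III", reference, 1))        # True
print(matches_with_gap("III", reference))           # True
```

A greedy left-to-right scan is not enough once gaps are bounded, which is why the sketch keeps the full set of possible end positions.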


Connections and Comparison

(i) Subsequence matching with arbitrary gaps

(ii) Substring matching in uncertain data

(iii) Uncertain S.T. sequential patterns

(iv) Uncertain frequent itemset mining


Future Directions

• Establishing hardness results for matching problems in uncertain sequences

• Considering richer types of queries

• Considering uncertainty in the query (in addition to uncertainty in the reference)

• Use of succinct data structures to speed up matching

• Investigation of real-world applications


Thank you!

&

Questions?