from data series indexing to big data series analytics · - complex analytics similarity search...

73
From Data Series Indexing to Big Data Series Analytics TSDays - Rennes, March 2019 Themis Palpanas Paris Descartes University French University Institute

Upload: others

Post on 24-May-2020

13 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: From Data Series Indexing to Big Data Series Analytics · - complex analytics Similarity Search Classification Clustering Outlier Detection Frequent Pattern Mining Themis Palpanas

From Data Series Indexing

to Big Data Series Analytics

TSDays - Rennes, March 2019

Themis Palpanas

Paris Descartes University

French University Institute

Page 2: From Data Series Indexing to Big Data Series Analytics · - complex analytics Similarity Search Classification Clustering Outlier Detection Frequent Pattern Mining Themis Palpanas

References 1

2

Themis Palpanas - TSDays - Mar 2019

• papers▫ Coconut Palm: Static and Streaming Data Series Exploration Now in your Palm. SIGMOD 2019

http://www.mi.parisdescartes.fr/~themisp/publications/vldb19-lernaeanhydra.pdf

▫ The Lernaean Hydra of Data Series Similarity Search: An Experimental Evaluation of the State of the Art. PVLDB 2019

http://www.mi.parisdescartes.fr/~themisp/publications/vldb19-lernaeanhydra.pdf

▫ Scalable, Variable-Length Similarity Search in Data Series: The ULISSE Approach. PVLDB 2019

http://www.mi.parisdescartes.fr/~themisp/publications/vldb19-ulisse.pdf

▫ Progressive Similarity Search on Time Series Data. BigVis 2019

http://helios.mi.parisdescartes.fr/~themisp/publications/bigvis19.pdf

▫ ParIS: The Next Destination for Fast Data Series Indexing and Query Answering. IEEE BigData 2018

http://helios.mi.parisdescartes.fr/~themisp/publications/bigdata18.pdf

▫ Massively Distributed Time Series Indexing and Querying. TKDE 2018

http://helios.mi.parisdescartes.fr/~themisp/publications/tkde18-dpisax.pdf

Page 3: From Data Series Indexing to Big Data Series Analytics · - complex analytics Similarity Search Classification Clustering Outlier Detection Frequent Pattern Mining Themis Palpanas

References 2

3

Themis Palpanas - TSDays - Mar 2019

• papers▫ Generating Data Series Query Workloads. VLDBJ 2018

http://www.mi.parisdescartes.fr/~themisp/publications/vldbj18-benchmark.pdf

▫ Coconut: A Scalable Bottom-Up Approach for Building Data Series Indexes. PVLDB 2018

http://www.mi.parisdescartes.fr/~themisp/publications/vldb18-coconut.pdf

▫ Comparing Similarity Perception in Time Series Visualizations. VIS 2018

http://www.mi.parisdescartes.fr/~themisp/publications/vis2018.pdf

▫ Data Series Management: Fulfilling the Need for Big Sequence Analytics. ICDM 2018

http://www.mi.parisdescartes.fr/~themisp/publications/icde18-sms.pdf

▫ ULISSE: ULtra compact Index for Variable-Length Similarity SEarch in Data Series. ICDE 2018

http://www.mi.parisdescartes.fr/~themisp/publications/icde18-ulisse.pdf

▫ DPiSAX: Massively Distributed Partitioned iSAX. ICDM 2017

http://www.mi.parisdescartes.fr/~themisp/publications/icdm17-dpisax.pdf

▫ ADS: The Adaptive Data Series Index. PVLDBJ 2016

http://www.mi.parisdescartes.fr/~themisp/publications/vldbj16-ads.pdf

▫ Big Sequence Management: A Glimpse on the Past, the Present, and the Future. LNCS, 2016

http://www.mi.parisdescartes.fr/~themisp/publications/sofsem16-bisem.pdf

▫ Query Workloads for Data-Series Indexes. KDD 2015

http://www.mi.parisdescartes.fr/~themisp/publications/kdd15-bends.pdf

▫ RINSE: Interactive Data Series Exploration. PVLDB 2015

http://www.mi.parisdescartes.fr/~themisp/publications/vldb15-rinse.pdf

▫ Indexing for Interactive Exploration of Big Data Series. SIGMOD 2014

http://www.mi.parisdescartes.fr/~themisp/publications/sigmod14-ads.pdf

▫ Beyond One Billion Time Series: Indexing and Mining Very Large Time Series Collections with iSAX2+. KAIS 2014

http://www.mi.parisdescartes.fr/~themisp/publications/kais14-isax2plus.pdf

▫ iSAX 2.0: Indexing and Mining One Billion Time Series. ICDM 2010

http://www.mi.parisdescartes.fr/~themisp/publications/icdm10-billiontimeseries.pdf

Page 4: From Data Series Indexing to Big Data Series Analytics · - complex analytics Similarity Search Classification Clustering Outlier Detection Frequent Pattern Mining Themis Palpanas

References

4

Themis Palpanas - TSDays - Mar 2019

• code and datasets▫ Coconut

https://github.com/kon0925/coconut

▫ ULISSE

http://helios.mi.parisdescartes.fr/~mlinardi/ULISSE.html

▫ DPiSAX

https://github.com/DPiSAX/DPiSAX.github.io

▫ ADS

https://github.com/zoumpatianos/ADS

▫ iSAX2+

http://www.mi.parisdescartes.fr/~themisp/isax2plus/

• data series toolbox▫ DSStat

https://github.com/zoumpatianos/DSStat

• demos▫ Coconut Palm

http://users.ics.forth.gr/~kondylak/Coconut_Palm/Coconut_Palm.html

▫ RINSE

http://daslab.seas.harvard.edu/rinse/

Page 5: From Data Series Indexing to Big Data Series Analytics · - complex analytics Similarity Search Classification Clustering Outlier Detection Frequent Pattern Mining Themis Palpanas

AcknowledgementsParis Descartes University

• Michele Linardi

• Anna Gogolou

• Botao Peng

• Karima Echihabi

• Paul Boniol

• Federico Roncallo

Harvard University

• Kostas Zoumpatianos

• Stratos Idreos

• Niv Dayan

Cornell University/Microsoft

• Yin Lou

• Johannes Gehrke

Inria

• Anastasia Bezerianos

• Fanis Tsandilas

• Djamel-Edine Yagoubi

• Reza Akbarinia

• Florent Masseglia

University of California at Riverside

• Jin Shieh

• Eammon Keogh

University of Crete/FORTH

• Haridimos Kondylakis

• Panagiota Fatourou

University of Trento

• Alessandro CamerraThemis Palpanas - TSDays - Mar 2019

5

Page 6: From Data Series Indexing to Big Data Series Analytics · - complex analytics Similarity Search Classification Clustering Outlier Detection Frequent Pattern Mining Themis Palpanas

Executive Summary

• data collected at unprecedented rates

• they enable data-driven scientific discovery

• lots of these data are sequences

▫ takes days-weeks to analyze big sequence collections

6

goal: analyze big sequences in minutes/secondsThemis Palpanas - TSDays - Mar 2019

Page 7: From Data Series Indexing to Big Data Series Analytics · - complex analytics Similarity Search Classification Clustering Outlier Detection Frequent Pattern Mining Themis Palpanas

Data series

7

Themis Palpanas - TSDays - Mar 2019

• Sequence of points ordered along some dimension

TimePosition

v1

v2

sequence dimension

x1x2

xnva

lue

Page 8: From Data Series Indexing to Big Data Series Analytics · - complex analytics Similarity Search Classification Clustering Outlier Detection Frequent Pattern Mining Themis Palpanas

Scientific Monitoring

• meteorology, oceanography, astronomy,

finance, sociology, …

8

Themis Palpanas - TSDays - Mar 2019

Historical stock quoteshttp://money.cnn.com/2012/04/23/markets/walmart_stock/index.htm

Time

Wind speedFrom ocean observing node projecthttp://bml.ucdavis.edu/boon/wind.html

Page 9: From Data Series Indexing to Big Data Series Analytics · - complex analytics Similarity Search Classification Clustering Outlier Detection Frequent Pattern Mining Themis Palpanas

• functional Resonance Magnetic Imaging (fMRI) data

▫ primary experimental tool of neuroscientists

▫ reveal how different parts of brain respond to stimuli

Themis Palpanas - TSDays - Mar 2019

9

Time

Page 10: From Data Series Indexing to Big Data Series Analytics · - complex analytics Similarity Search Classification Clustering Outlier Detection Frequent Pattern Mining Themis Palpanas

Themis Palpanas - TSDays - Mar 2019

10

plant membraneStylet

voltage source

input resistor

V

0 50 100 150 200

10

20

to insect

conductive glue

voltage

readingto plant

Time

Page 11: From Data Series Indexing to Big Data Series Analytics · - complex analytics Similarity Search Classification Clustering Outlier Detection Frequent Pattern Mining Themis Palpanas

Themis Palpanas - TSDays - Mar 2019

11

Schinnerer et al.

Page 12: From Data Series Indexing to Big Data Series Analytics · - complex analytics Similarity Search Classification Clustering Outlier Detection Frequent Pattern Mining Themis Palpanas

GTCAATGGCCAGGATATTAGAACAGTACTCTGTGAACCCTATTTATGGTGGCACCCCTTAGACTAAGATAACACAGGGAGCAAGAGGTTGACAGGAAAGCCAGGGGAGCAGGGAAGCCTCCTGTAAAGAGAGAAGTGCTAAGTCTCCTTTCTAAGGCACATGATGGATTCAAGGGAAAGCCACATTTGACTAAAGCCCAAGGGATTGTTGCTTCTAATCCGATTTCTTGGCAGAAGATATTACAAACTAAGAGTCAGATTAATATGTGGGTGCCAAAATAAATAAACAAATAATTGAATAATCCCTGGAGGTTTAAGTGAGGAGAAACTCCTCCACAGCTTGCTACCGAGGCAGAACCGGTTGAAACTGAAATGCATCCGCCGCCAGAGGATCTGTAAAAGAGAGGTTGTTACGAAACTGGCAACTGCCAACCAAAGTCCACCAATGGACAAGCAAAAAAGAGCACTCATCTCATGCTCCCAAGGATCAACCTTCCCAGAGTTTTCACTTAAGTGGCCACCAAGCCAGTTGTCAATCCAGGGCTTTGGACTGAAATCTAGGGCTTCATCCGCTACCTCAGAGTGTCTTCTATTTCTTCCAGCCAGTGACAAATACAACAAACATCTGAGATGTTTTAGCTATAAATCCTTTACAATTGTTATTTATGTCTTAACTTTTGTTATACCTGGAAAAGTAGGGGAAACAATAAGAACATACTGTCTTGGCCAAGCATCCAAGGTTAAATGAGTTATGGAAATTCATTTGGGAGCCAAGACATTGCACGTGGTTATTTATTAGTCACCCAAGCATGTATTTTGCATGTCCATCAGTTGTTCTTGGCCAAAAGAGCAGAATCAATGAGCCGCTGCAGATGCAGACATAGCAGCCCCTTGCAGGGACAAGTCTGCAAGATGAGCATTGAAGAGGATGCACAAGCCCGGTAGCCCGGGAAATGGCAGGCACTTACAAGAGCCCAGGTTGTTGCCATGTTTGTTTTTGCAACTTGTCTATTTAAAGAGATTTGGGCAATGGCCAGGATATTAGAACAGTACTCTGTGAACCCTATTTATGGTAGCACCCCTTAGACTAAGATAACACAGGGAGCAAGAGGTTGACAGGAAAGCCAGGGGAGCAGGGAAGCCTCCTGTAAAGAGAGAAGTGCTAAGTCTCCTTTCTAAGGCACATGATGGATCAAGGGAAAGTCACATTTGACTAAAGCCCAAGGGATTGTTGCTTCTAATCCGATTCTTGGCAGAAGATATTGCAAACTAAGAGTCAGATTAATATGTGGGTGCCAAAATAAATAAACAAATAATTGAATAATCCCTGGAGGTTTAAGTGAGGAGAAACTCCTCCACACTTGCTACCGAGGCAGAACCGGTTGAAACTGAAATGCACCCGCTGCCAGATTTATTAGTCACCCAAGCATGTATTTTGCATGTCCATCAGTTGTTCTTGGCCAAAAGAACAGAATCAATGAGCCGCTGCAGATGCAGACATAGCAGCCCCTTGCAGGAACAAGTCTGCAAGATGAGCATTGAAGAGGATGCACAAGCCCGGTAGCCCGGGAAATGGCAGGCACTTACAAGAGCCCAGGTTGTTGCCATGTTTGTTTTTGCAACTTGTCTTTTAAACAGATTTGA

Themis Palpanas - TSDays - Mar 2019

12

Position

Page 13: From Data Series Indexing to Big Data Series Analytics · - complex analytics Similarity Search Classification Clustering Outlier Detection Frequent Pattern Mining Themis Palpanas

What do we want to do with them?

- simple query answering

SimlaritySearch

select some data series

select values in time interval

select values in some

range

combinations of those

Themis Palpanas - TSDays - Mar 2019

13

Page 14: From Data Series Indexing to Big Data Series Analytics · - complex analytics Similarity Search Classification Clustering Outlier Detection Frequent Pattern Mining Themis Palpanas

• a solved(?) problem

▫ your favorite DBMS

▫ …

▫ InfluxData

▫ kx

▫ Riak TS

▫ OpenTSDB

▫ TimescaleDB

▫ KairosDB

▫ Druid

▫ …

Themis Palpanas - TSDays - Mar 2019

14

What do we want to do with them?

- simple query answering

Page 15: From Data Series Indexing to Big Data Series Analytics · - complex analytics Similarity Search Classification Clustering Outlier Detection Frequent Pattern Mining Themis Palpanas

What do we want to do with them?

- complex analytics

Similarity Search

Classification

ClusteringOutlier

Detection

Frequent Pattern Mining

Themis Palpanas - TSDays - Mar 2019

15

Page 16: From Data Series Indexing to Big Data Series Analytics · - complex analytics Similarity Search Classification Clustering Outlier Detection Frequent Pattern Mining Themis Palpanas

What do we want to do with them?

- complex analytics

Similarity Search

Classification

ClusteringOutlier

Detection

Frequent Pattern Mining

Themis Palpanas - TSDays - Mar 2019

16

similar sequences

query

sequence collection

Page 17: From Data Series Indexing to Big Data Series Analytics · - complex analytics Similarity Search Classification Clustering Outlier Detection Frequent Pattern Mining Themis Palpanas

What do we want to do with them?

- complex analytics

Similarity Search

Classification

ClusteringOutlier

Detection

Frequent Pattern Mining

Themis Palpanas - TSDays - Mar 2019

17

Euclidean

Dynamic Time Warping (DTW)

( ) ( ) −=

n

iii yxYXD

1

2,

−−

+−=

=

)1,1(

),1(

)1,(

min),(

),(),(

jif

jif

jif

yxjif

mnfYXD

ji

dtw

Page 18: From Data Series Indexing to Big Data Series Analytics · - complex analytics Similarity Search Classification Clustering Outlier Detection Frequent Pattern Mining Themis Palpanas

What do we want to do with them?

- complex analytics

Similarity Search

Classification

ClusteringOutlier

Detection

Frequent Pattern Mining

Themis Palpanas - TSDays - Mar 2019

18

Euclidean

Dynamic Time Warping (DTW)

Page 19: From Data Series Indexing to Big Data Series Analytics · - complex analytics Similarity Search Classification Clustering Outlier Detection Frequent Pattern Mining Themis Palpanas

What do we want to do with them?

- complex analytics

Similarity Search

Classification

ClusteringOutlier

Detection

Frequent Pattern Mining

Themis Palpanas - TSDays - Mar 2019

19

HARD, because of very high dimensionality:each data series has 100s-1000s of points!

even HARDER, because of very large size:millions to billions of data series (multi-TBs)!

Page 20: From Data Series Indexing to Big Data Series Analytics · - complex analytics Similarity Search Classification Clustering Outlier Detection Frequent Pattern Mining Themis Palpanas

Query answering process

data-to-query time query answering time

Query Answering ProcedureData Loading Procedure

Answers

Data Series Database/Indexing

DataRaw data

Queries

Themis Palpanas - TSDays - Mar 2019 20

these times are big!

Page 21: From Data Series Indexing to Big Data Series Analytics · - complex analytics Similarity Search Classification Clustering Outlier Detection Frequent Pattern Mining Themis Palpanas

Similarity Search via

Serial Scan

Themis Palpanas - TSDays - Mar 2019 21

Page 22: From Data Series Indexing to Big Data Series Analytics · - complex analytics Similarity Search Classification Clustering Outlier Detection Frequent Pattern Mining Themis Palpanas

Similarity Search via

Indexing

Themis Palpanas - TSDays - Mar 2019 22

Page 23: From Data Series Indexing to Big Data Series Analytics · - complex analytics Similarity Search Classification Clustering Outlier Detection Frequent Pattern Mining Themis Palpanas

Themis Palpanas - TSDays - Mar 2019 23

Traditional Approaches

answer nearest neighbor queries on a 1TB dataset:

Query Answering

Query Answering

but building the index takes too long!

indexing a 1TB dataset takes days

complex analytics in days…

a data series index can

reduce querying time

serial scan takes 45 minutes/query

Page 24: From Data Series Indexing to Big Data Series Analytics · - complex analytics Similarity Search Classification Clustering Outlier Detection Frequent Pattern Mining Themis Palpanas

Themis Palpanas - TSDays - Mar 2019

Query answering process

data-to-query time query answering time

Query Answering ProcedureData Loading Procedure

Answers

we have proposed the

state-of-the-artsolutions for both problems!

Data Series Database/Indexing

DataRaw data

Queries

24

Page 25: From Data Series Indexing to Big Data Series Analytics · - complex analytics Similarity Search Classification Clustering Outlier Detection Frequent Pattern Mining Themis Palpanas

Themis Palpanas - TSDays - Mar 2019

State of the Art Approach: ADS+

25

complex analytics in hours!

Publications

SIGMOD‘14

PVLDB‘15

VLDBJ‘16

Page 26: From Data Series Indexing to Big Data Series Analytics · - complex analytics Similarity Search Classification Clustering Outlier Detection Frequent Pattern Mining Themis Palpanas

-3

-2

-1

0

1

2

3

4 8 12 160

00

01

10

11

iSAX(T,4,4)

-3

-2

-1

0

1

2

3

4 8 12 160

A data series T

4 8 12 160

PAA(T,4)

-3

-2

-1

0

1

2

3

SAX Representation

• Symbolic Aggregate approXimation(SAX) ▫ (1) Represent data series T of length n

with w segments using Piecewise Aggregate Approximation (PAA) T typically normalized to μ = 0, σ = 1

PAA(T,w) =

where

▫ (2) Discretize into a vector of symbols Breakpoints map to small alphabet a

of symbols

wttT ,,1 =

+−=

=

i

ij

jnw

i

wn

wn

Tt1)1(

Themis Palpanas - TSDays - Mar 2019

26

Page 27: From Data Series Indexing to Big Data Series Analytics · - complex analytics Similarity Search Classification Clustering Outlier Detection Frequent Pattern Mining Themis Palpanas

iSAX Index

• non-balanced tree-based index with non-overlapping regions, and controlled fan-out rate

▫ base cardinality b (optional), segments w, threshold th

▫ hierarchically subdivides SAX space until num. entries ≤ th

• Approximate Search▫ Match iSAX representation at each level

• Exact Search▫ Leverage approximate search

▫ Prune search space

Lower bounding distance

Themis Palpanas - TSDays - Mar 2019

27

Page 28: From Data Series Indexing to Big Data Series Analytics · - complex analytics Similarity Search Classification Clustering Outlier Detection Frequent Pattern Mining Themis Palpanas

Drawback of iSAX2+

• cannot start answering queries until entire index is built!

Themis Palpanas - TSDays - Mar 2019

28

Page 29: From Data Series Indexing to Big Data Series Analytics · - complex analytics Similarity Search Classification Clustering Outlier Detection Frequent Pattern Mining Themis Palpanas

Adaptive Data Series Index:

ADS+

• novel paradigm for building a data series index

▫ do not build entire index and then answer queries

▫ start answering queries by building the part of the index needed by those queries

• still guarantee correct answers

Themis Palpanas - TSDays - Mar 2019

29

Page 30: From Data Series Indexing to Big Data Series Analytics · - complex analytics Similarity Search Classification Clustering Outlier Detection Frequent Pattern Mining Themis Palpanas

Adaptive Data Series Index:

ADS+

• intuition for proposed solution

▫ build the iSAX index using the iSAX representations▫ just like iSAX2+

▫ but start with a large leaf size▫ minimize initial cost

▫ postpone leaf materialization to query time▫ only materialize (at query time) leaves needed by queries

▫ parts that are queried more are refined more▫ use smaller leaf sizes (reduced leaf materialization and query

answering costs)Themis Palpanas - TSDays - Mar 2019

30

Publications

SIGMOD‘14

PVLDB‘15

VLDBJ‘16

Page 31: From Data Series Indexing to Big Data Series Analytics · - complex analytics Similarity Search Classification Clustering Outlier Detection Frequent Pattern Mining Themis Palpanas

ROOT

I1 I2

LBL

FBL

Raw data

DISK

RAM

Start building an index with only the iSAX representations

31Themis Palpanas - TSDays - Mar 2019

Page 32: From Data Series Indexing to Big Data Series Analytics · - complex analytics Similarity Search Classification Clustering Outlier Detection Frequent Pattern Mining Themis Palpanas

ROOT

I1 I2

LBL

FBL

Raw data

DISK

RAM

Read the data-series one by one from the raw file

32Themis Palpanas - TSDays - Mar 2019

Page 33: From Data Series Indexing to Big Data Series Analytics · - complex analytics Similarity Search Classification Clustering Outlier Detection Frequent Pattern Mining Themis Palpanas

ROOT

I1 I2

LBL

FBL

Raw data

DISK

RAM

Convert them to iSAX

33Themis Palpanas - TSDays - Mar 2019

Page 34: From Data Series Indexing to Big Data Series Analytics · - complex analytics Similarity Search Classification Clustering Outlier Detection Frequent Pattern Mining Themis Palpanas

ROOT

I1 I2

LBL

FBL

Raw data

DISK

RAM

Store only iSAX in memory (64 times smaller) ~1%

34Themis Palpanas - TSDays - Mar 2019

Page 35: From Data Series Indexing to Big Data Series Analytics · - complex analytics Similarity Search Classification Clustering Outlier Detection Frequent Pattern Mining Themis Palpanas

ROOT

I1 I2

LBL

FBL

Raw data

DISK

RAM

Discard raw data and keep pointer to raw file

35Themis Palpanas - TSDays - Mar 2019

Page 36: From Data Series Indexing to Big Data Series Analytics · - complex analytics Similarity Search Classification Clustering Outlier Detection Frequent Pattern Mining Themis Palpanas

ROOT

I1

LBL

FBL

Raw data

I2

DISK

RAM

Continue loading data until we run out of memory

36Themis Palpanas - TSDays - Mar 2019

Page 37: From Data Series Indexing to Big Data Series Analytics · - complex analytics Similarity Search Classification Clustering Outlier Detection Frequent Pattern Mining Themis Palpanas

ROOT

I1

L3 L4L1 L2

I2

LBL

FBL

Raw data

DISK

RAM

Expand each sub-tree and move data to LBL

37Themis Palpanas - TSDays - Mar 2019

Page 38: From Data Series Indexing to Big Data Series Analytics · - complex analytics Similarity Search Classification Clustering Outlier Detection Frequent Pattern Mining Themis Palpanas

Raw data

ROOT

I1

L3 L4L1 L2

I2

LBL

FBL

DISK

RAM

We flush the data to the disk to free up memory

38Themis Palpanas - TSDays - Mar 2019

Page 39: From Data Series Indexing to Big Data Series Analytics · - complex analytics Similarity Search Classification Clustering Outlier Detection Frequent Pattern Mining Themis Palpanas

Raw data

PARTIALPARTIAL

ROOT

I1

L5L1 L2

I2

LBL

FBL

PARTIAL

DISK

RAML4

PARTIAL

39Themis Palpanas - TSDays - Mar 2019

Page 40: From Data Series Indexing to Big Data Series Analytics · - complex analytics Similarity Search Classification Clustering Outlier Detection Frequent Pattern Mining Themis Palpanas

Raw data

PARTIALPARTIAL

ROOT

I1

L5L1 L2

I2

LBL

FBL

PARTIAL

DISK

RAML4

PARTIAL

Query #1

40Themis Palpanas - TSDays - Mar 2019

Page 41: From Data Series Indexing to Big Data Series Analytics · - complex analytics Similarity Search Classification Clustering Outlier Detection Frequent Pattern Mining Themis Palpanas

Raw data

PARTIALPARTIAL

ROOT

I1

L5L1 L2

I2

LBL

FBL

PARTIAL

DISK

RAML4

PARTIAL

Query #1

TOO BIG!

41Themis Palpanas - TSDays - Mar 2019

Page 42: From Data Series Indexing to Big Data Series Analytics · - complex analytics Similarity Search Classification Clustering Outlier Detection Frequent Pattern Mining Themis Palpanas

Raw data

PARTIALPARTIAL

ROOT

I1

L5L2

L1

I2

LBL

FBL

PARTIAL

DISK

RAML4

PARTIAL

Query #1

TOO BIG!

42Themis Palpanas - TSDays - Mar 2019

Page 43: From Data Series Indexing to Big Data Series Analytics · - complex analytics Similarity Search Classification Clustering Outlier Detection Frequent Pattern Mining Themis Palpanas

Raw data

PARTIAL

PARTIAL

ROOT

I1

L5

I3

L2

I2

LBL

FBL

PARTIAL

DISK

RAML4

PARTIAL

Query #1

PARTIAL

L5L4

Adaptive split

Create a smaller leaf

43Themis Palpanas - TSDays - Mar 2019

Page 44: From Data Series Indexing to Big Data Series Analytics · - complex analytics Similarity Search Classification Clustering Outlier Detection Frequent Pattern Mining Themis Palpanas

Raw data

PARTIAL

PARTIAL

ROOT

I1

L5

I3

L2

I2

LBL

FBL

PARTIAL

DISK

RAML4

PARTIAL

Query #1

PARTIAL

L5L4

Load data in LBL andanswer the query

44Themis Palpanas - TSDays - Mar 2019

Page 45: From Data Series Indexing to Big Data Series Analytics · - complex analytics Similarity Search Classification Clustering Outlier Detection Frequent Pattern Mining Themis Palpanas

Raw data

PARTIAL

PARTIAL

ROOT

I1

L5

I3

L2

I2

LBL

FBL

PARTIAL

DISK

RAML4

PARTIAL

FULL

L5L4

We spill to the disk when we run out of memory

45Themis Palpanas - TSDays - Mar 2019

Page 46: From Data Series Indexing to Big Data Series Analytics · - complex analytics Similarity Search Classification Clustering Outlier Detection Frequent Pattern Mining Themis Palpanas

Experimental Evaluation

Themis Palpanas - TSDays - Mar 2019

46

• iSAX 2.0 needs more than 35 hours to answer 100K approximate queries

• ADS+ answers 100K approximate queries in less than 5 hours

7x faster

Page 47: From Data Series Indexing to Big Data Series Analytics · - complex analytics Similarity Search Classification Clustering Outlier Detection Frequent Pattern Mining Themis Palpanas

Extensions…

• Coconut: current solution for limited memory devices

and streaming time series

▫ bottom-up, succinct index construction based on sortable summarizations

▫ outperforms state-of-the-art in terms of index space, index construction time, and query answering time

Themis Palpanas - TSDays - Mar 2019

47

Publications

PVLDB‘18

SIGMOD’19

Page 48: From Data Series Indexing to Big Data Series Analytics · - complex analytics Similarity Search Classification Clustering Outlier Detection Frequent Pattern Mining Themis Palpanas

Extensions…

• Coconut: current solution for limited memory devices

and streaming time series

▫ bottom-up, succinct index construction based on sortable summarizations

▫ outperforms state-of-the-art in terms of index space, index construction time, and query answering time

• ULISSE: current solution for variable-length queries

▫ single supports of queries of variable lengths

▫ orders of magnitude faster than competing approaches

Themis Palpanas - TSDays - Mar 2019

48

Publications

PVLDB‘18

SIGMOD’19

ICDE’18

PVLDB‘19

Page 49: From Data Series Indexing to Big Data Series Analytics · - complex analytics Similarity Search Classification Clustering Outlier Detection Frequent Pattern Mining Themis Palpanas

Massive Data Series Collections

• functional Resonance Magnetic Imaging (fMRI) data

▫ primary experimental tool of neuroscientists

▫ reveal how different parts of brain respond to stimuli

▫ single experiment (1 subject, 1 test) produces

60,000 data series of length 3,000: 12 GB

Themis Palpanas - TSDays - Mar 2019

49

Page 50: From Data Series Indexing to Big Data Series Analytics · - complex analytics Similarity Search Classification Clustering Outlier Detection Frequent Pattern Mining Themis Palpanas

Massive Data Series Collections

• ADHD-200 Global Competition

▫ classification task: detect Attention Deficit Hyperactivity Disorder

776 subjects: 9 TB

equivalent to: 4.5 billion non-overlapping data series of size 256

equivalent to: 1100 billion overlapping data series of size 256

Themis Palpanas - TSDays - Mar 2019

50

Page 51: From Data Series Indexing to Big Data Series Analytics · - complex analytics Similarity Search Classification Clustering Outlier Detection Frequent Pattern Mining Themis Palpanas

Massive Data Series Collections

Human Genome project

130 TB

NASA’s Solar Observatory

1.5 TB per day

Large Synoptic Survey Telescope (2019)

~30 TB per night

Themis Palpanas - TSDays - Mar 2019 51

data center andservices monitoring

2B data series4M points/sec

Page 52: From Data Series Indexing to Big Data Series Analytics · - complex analytics Similarity Search Classification Clustering Outlier Detection Frequent Pattern Mining Themis Palpanas

The Road Ahead

“enable practitioners and non-expert users to easily and efficiently manage and analyze massive data series collections”

Themis Palpanas - TSDays - Mar 2019

52

Publications

ICDE’18

HPCS’17

SIGREC’15

Page 53: From Data Series Indexing to Big Data Series Analytics · - complex analytics Similarity Search Classification Clustering Outlier Detection Frequent Pattern Mining Themis Palpanas

The Road Ahead

• Big Sequence Management System

▫ general purpose data series management system

Themis Palpanas - TSDays - Mar 2019

53

data sequences

Publications

ICDE’18

HPCS’17

SIGREC’15

Page 54: From Data Series Indexing to Big Data Series Analytics · - complex analytics Similarity Search Classification Clustering Outlier Detection Frequent Pattern Mining Themis Palpanas

The Road Ahead

• Big Sequence Management System

Themis Palpanas - TSDays - Mar 2019

54

Publications

ICDE’18

HPCS’17

SIGREC’15

Page 55: From Data Series Indexing to Big Data Series Analytics · - complex analytics Similarity Search Classification Clustering Outlier Detection Frequent Pattern Mining Themis Palpanas

The Road Ahead

• Big Sequence Management System

Themis Palpanas - TSDays - Mar 2019

55

Publications

ICDE’18

HPCS’17

SIGREC’15

Ho

lis

tic

Op

tim

iza

tio

n

PVLDB’19

Page 56: From Data Series Indexing to Big Data Series Analytics · - complex analytics Similarity Search Classification Clustering Outlier Detection Frequent Pattern Mining Themis Palpanas

Parallelization/Distribution?

• discussion so far assumed serial execution in a single core

▫ focus on efficient resource utilization

▫ squeeze the most out of a single core

▫ produce scalable solutions at lowest possible cost

also suitable for analysts with no access to/expertise for clusters

Themis Palpanas - TSDays - Mar 2019

56

Page 57: From Data Series Indexing to Big Data Series Analytics · - complex analytics Similarity Search Classification Clustering Outlier Detection Frequent Pattern Mining Themis Palpanas

Need for

Parallelization/Distribution

• take advantage of modern hardware!

▫ Single Instruction Multiple Data (SIMD)

natural for data series operations

▫ multi-tier CPU caches

design data structures aligned to cache lines

▫ multi-core and multi-socket architectures

use parallelism inside each computation server

▫ Graphics Processing Units (GPUs)

propose massively parallel techniques for GPUs

▫ new storage solutions: SSDs, NVRAM

develop algorithms that take these new characteristics/tradeoffs into account

▫ compute clusters

distribute operation over many machines Themis Palpanas - TSDays - Mar 2019

57

Publications

HPCS’17

Page 58: From Data Series Indexing to Big Data Series Analytics · - complex analytics Similarity Search Classification Clustering Outlier Detection Frequent Pattern Mining Themis Palpanas

Need for

Parallelization/Distribution

• further scale-up and scale-out possible!

▫ techniques inherently parallelizable

across cores, across machines

Themis Palpanas - TSDays - Mar 2019

58

Publications

HPCS’17

Page 59: From Data Series Indexing to Big Data Series Analytics · - complex analytics Similarity Search Classification Clustering Outlier Detection Frequent Pattern Mining Themis Palpanas

Need for

Parallelization/Distribution

• further scale-up and scale-out possible!

▫ techniques inherently parallelizable

across cores, across machines

• more involved solutions required when optimizing for energy

▫ reducing execution time is relatively easy

▫ minimizing total work (energy) is more challenging

Themis Palpanas - TSDays - Mar 2019

59

Publications

HPCS’17

Page 60: From Data Series Indexing to Big Data Series Analytics · - complex analytics Similarity Search Classification Clustering Outlier Detection Frequent Pattern Mining Themis Palpanas

Need for

Parallelization/Distribution

• DPiSAX: current solution for distributed processing (Spark)

▫ balances work of different worker nodes

▫ performs 2 orders of magnitude faster than centralized solution

Themis Palpanas - TSDays - Mar 2019

60

Publications

ICDM‘17

TKDE’18

Page 61: From Data Series Indexing to Big Data Series Analytics · - complex analytics Similarity Search Classification Clustering Outlier Detection Frequent Pattern Mining Themis Palpanas

Need for

Parallelization/Distribution

• DPiSAX: current solution for distributed processing (Spark)

▫ balances work of different worker nodes

▫ performs 2 orders of magnitude faster than centralized solution

• ParIS: current solution for modern hardware

▫ completely masks out the CPU cost

▫ answers exact queries in the order of ms

3 orders of magnitude faster then single-core solutions

Themis Palpanas - TSDays - Mar 2019

61

Publications

ICDM‘17

TKDE’18

BigData’18

Page 62: From Data Series Indexing to Big Data Series Analytics · - complex analytics Similarity Search Classification Clustering Outlier Detection Frequent Pattern Mining Themis Palpanas

Interactive Analytics?

• data series analytics is computationally expensive

▫ very high inherent complexity

• may not always be possible to remove delays

▫ but could try to hide them!

Themis Palpanas - TSDays - Mar 2019

62

Page 63: From Data Series Indexing to Big Data Series Analytics · - complex analytics Similarity Search Classification Clustering Outlier Detection Frequent Pattern Mining Themis Palpanas

Need for

Interactive Analytics

• interaction with users offers new opportunities

▫ progressive answers

produce intermediate results

iteratively converge to final, correct solution

provide bounds on the errors (of the intermediate results) along the way

▫ imprecise queries

enable user to specify varying accuracy requirements for different parts of the same query

Themis Palpanas - TSDays - Mar 2019

63

Publications

BigVis‘19

Page 64: From Data Series Indexing to Big Data Series Analytics · - complex analytics Similarity Search Classification Clustering Outlier Detection Frequent Pattern Mining Themis Palpanas

Need for

Interactive Analytics

• interaction with users offers new opportunities

▫ progressive answers

produce intermediate results

iteratively converge to final, correct solution

provide bounds on the errors (of the intermediate results) along the way

▫ imprecise queries

enable user to specify varying accuracy requirements for different parts of the same query

• several exciting research problems in intersection of visualization and data management

▫ frontend: HCI/visualizations for querying/results display

▫ backend: efficiently supporting these operations

Themis Palpanas - TSDays - Mar 2019

64

Publications

VIS‘18

BigVis‘19

Page 65: From Data Series Indexing to Big Data Series Analytics · - complex analytics Similarity Search Classification Clustering Outlier Detection Frequent Pattern Mining Themis Palpanas

Benchmarking Data Series Indexes?

Themis Palpanas - TSDays - Mar 2019

65

Page 66: From Data Series Indexing to Big Data Series Analytics · - complex analytics Similarity Search Classification Clustering Outlier Detection Frequent Pattern Mining Themis Palpanas

Previous Studies

• chosen from the data (with/without noise)

66

evaluate performance of indexing methods using random queries

Themis Palpanas - TSDays - Mar 2019

Page 67: From Data Series Indexing to Big Data Series Analytics · - complex analytics Similarity Search Classification Clustering Outlier Detection Frequent Pattern Mining Themis Palpanas

Previous Workloads

0

25

50

75

100

0.0 0.1 0.2 0.3 0.4 0.5Hardness

% o

f q

ue

rie

s

0

25

50

75

100

0.0 0.1 0.2 0.3 0.4 0.5Hardness

% o

f q

ue

rie

s

0

25

50

75

100

0.0 0.1 0.2 0.3 0.4 0.5Hardness

% o

f q

ue

rie

s

0

25

50

75

100

0.0 0.1 0.2 0.3 0.4 0.5Hardness

% o

f q

ue

rie

s

0

25

50

75

100

0.0 0.1 0.2 0.3 0.4 0.5Hardness

% o

f q

ue

rie

s

0

25

50

75

100

0.0 0.1 0.2 0.3 0.4 0.5Hardness

% o

f q

ue

rie

s

0

25

50

75

100

0.0 0.1 0.2 0.3 0.4 0.5Hardness

% o

f q

ue

rie

s

0

25

50

75

100

0.0 0.1 0.2 0.3 0.4 0.5Hardness

% o

f q

ue

rie

s

0

25

50

75

100

0.0 0.1 0.2 0.3 0.4 0.5Hardness

% o

f q

ue

rie

s

DNA

EEG

Ran

do

mw

alk

64 256 1024

Most previous workloads are skewed to easy queries

68Themis Palpanas - TSDays - Mar 2019

Publications

KDD ‘15

TKDE ‘18

Page 68: From Data Series Indexing to Big Data Series Analytics · - complex analytics Similarity Search Classification Clustering Outlier Detection Frequent Pattern Mining Themis Palpanas

Benchmark Workloads

69

If all queries are easy all indexes look good

If all queries are hard all indexes look bad

need methods for generating queries of varying hardness

Themis Palpanas - TSDays - Mar 2019

Publications

KDD ‘15

TKDE ‘18

Page 69: From Data Series Indexing to Big Data Series Analytics · - complex analytics Similarity Search Classification Clustering Outlier Detection Frequent Pattern Mining Themis Palpanas

Conclusions

• data series is a very common data type

▫ across several different domains and applications

• complex data series analytics are challenging

▫ have very high complexity

▫ efficiency comes from data series management/indexing techniques

• current approaches used in practice are ad-hoc

▫ waste of time and effort

▫ suboptimal solutions

• need for Sequence Management System

▫ optimize operations based on data/hardware characteristics

▫ transparent to user

Themis Palpanas - TSDays - Mar 2019

70

Page 70: From Data Series Indexing to Big Data Series Analytics · - complex analytics Similarity Search Classification Clustering Outlier Detection Frequent Pattern Mining Themis Palpanas

Conclusions

• data series is a very common data type

▫ across several different domains and applications

• complex data series analytics are challenging

▫ have very high complexity

▫ efficiency comes from data series management/indexing techniques

• current approaches used in practice are ad-hoc

▫ waste of time and effort

▫ suboptimal solutions

• need for Sequence Management System

▫ optimize operations based on data/hardware characteristics

▫ transparent to user

• several exciting research opportunitiesThemis Palpanas - TSDays - Mar 2019

71

Page 71: From Data Series Indexing to Big Data Series Analytics · - complex analytics Similarity Search Classification Clustering Outlier Detection Frequent Pattern Mining Themis Palpanas

72

Themis Palpanas - TSDays - Mar 2019

Page 72: From Data Series Indexing to Big Data Series Analytics · - complex analytics Similarity Search Classification Clustering Outlier Detection Frequent Pattern Mining Themis Palpanas

Data-Intensive and Knowledge-Oriented systems

Page 73: From Data Series Indexing to Big Data Series Analytics · - complex analytics Similarity Search Classification Clustering Outlier Detection Frequent Pattern Mining Themis Palpanas

74

thank you!

google: Themis Palpanas

work supported by: