On the role of Interactivity and Data Placement in Big Data Analytics (Srini Parthasarathy, OSU)

Post on 22-Feb-2016

Page 1: On the role of Interactivity and Data Placement in Big Data Analytics

On the role of Interactivity and Data Placement in Big Data Analytics

Srini Parthasarathy, OSU

Page 2: On the role of Interactivity and Data Placement in Big Data Analytics

The Data Deluge: Data, Data Everywhere

Page 3: On the role of Interactivity and Data Placement in Big Data Analytics

Data Storage is Cheap

$600 buys a disk drive that can store all of the world’s music.

[McKinsey Global Institute Special Report, June ’11]

Page 4: On the role of Interactivity and Data Placement in Big Data Analytics

Data does not exist in isolation.

Page 5: On the role of Interactivity and Data Placement in Big Data Analytics

Data almost always exists in connection with other data, and those connections are an integral part of the value proposition.

Page 6: On the role of Interactivity and Data Placement in Big Data Analytics

• Social networks
• Protein interactions
• Internet
• VLSI networks
• Data dependencies
• Neighborhood graphs

Page 7: On the role of Interactivity and Data Placement in Big Data Analytics

Big Data Problem: All this data is useful only if we can scalably extract knowledge from it, despite its complexity.

Page 8: On the role of Interactivity and Data Placement in Big Data Analytics

THIS TALK

• THE ROLE OF DATA PLACEMENT IN BIG DATA SYSTEMS

• THE ROLE OF VISUALIZATION AND INTERACTION IN BIG DATA ANALYSIS

Page 9: On the role of Interactivity and Data Placement in Big Data Analytics

GLOBAL GRAPHS

Page 10: On the role of Interactivity and Data Placement in Big Data Analytics

GLOBAL GRAPHS

• What?
– System for deploying applications processing complex data

• Why?
– Seeks balance between high productivity and high performance

• How?
– Built on top of PNL’s GlobalArrays
– Trees (GlobalTrees, GlobalForests)
– Relational Arrays (ArrayDB-GA)
– Graphs (GlobalGraphs)

• Data Placement is key to high performance

Page 11: On the role of Interactivity and Data Placement in Big Data Analytics

Importance of Data Placement

• Locality
– Placing related items close to each other so they may be processed together

• Mitigating Impact of Data Skew
– Reducing load imbalance in a parallel setting
– Reducing variance in partition samples

• Generating Stratified Samples
– Improving interactive performance
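The two placement goals above pull in different directions, and a small sketch may make the trade-off concrete. It assumes the data has already been grouped into strata of related items; the function names and the toy strata are hypothetical and are not taken from GlobalGraphs.

```python
def place_for_locality(strata, num_partitions):
    """Keep each stratum whole so related items are processed together.

    Largest strata are assigned first, each to the currently smallest
    partition, which also limits load imbalance.
    """
    partitions = [[] for _ in range(num_partitions)]
    for _, items in sorted(strata.items(), key=lambda kv: -len(kv[1])):
        target = min(partitions, key=len)
        target.extend(items)
    return partitions


def place_for_stratified_samples(strata, num_partitions):
    """Deal items round-robin, stratum by stratum.

    Every partition then holds a share of every stratum, so each
    partition behaves like a stratified sample of the whole data set.
    """
    partitions = [[] for _ in range(num_partitions)]
    next_slot = 0
    for key in sorted(strata):
        for item in strata[key]:
            partitions[next_slot % num_partitions].append(item)
            next_slot += 1
    return partitions


# Hypothetical strata with skewed sizes (6, 4 and 2 items).
strata = {
    "S-1": ["a1", "a2", "a3", "a4", "a5", "a6"],
    "S-2": ["b1", "b2", "b3", "b4"],
    "S-3": ["c1", "c2"],
}
locality = place_for_locality(strata, 2)
stratified = place_for_stratified_samples(strata, 2)
print([len(p) for p in locality])
print(stratified[0])
```

With locality placement, S-1 stays together on one partition; with stratified placement, each partition receives items from all three strata.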

Page 12: On the role of Interactivity and Data Placement in Big Data Analytics

Key Ideas

• Pivotization
– Convert data with complex structure into sets
– Each element of set captures features of local topology

• Hashing into Strata: Hash related sets into similar bins
– Can employ a sketch-clustering algorithm

• Partitioning: Place Strata into partitions for
– Locality
– Mitigating Data Skew
– Samples
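The first two ideas can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the talk's implementation: the pivot set of a graph node is assumed to be its closed neighborhood, the sketch is a plain minwise hash with four hash functions, and strata are formed by exact sketch match (a sketch-clustering algorithm would also group sketches that are merely similar). All function names are hypothetical.

```python
import hashlib
from collections import defaultdict


def pivot_set(graph, node):
    """Pivotization (assumed form): a node's closed neighborhood as a set."""
    return frozenset(graph[node]) | {node}


def stable_hash(value, seed):
    """Deterministic 64-bit hash of `value` under hash function number `seed`."""
    digest = hashlib.blake2b(f"{seed}:{value}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minhash_sketch(items, num_hashes=4):
    """Minwise hashing: keep the minimum hash value under each hash function."""
    return tuple(min(stable_hash(x, seed) for x in items)
                 for seed in range(num_hashes))


def hash_into_strata(sketches):
    """Place items with identical sketches into the same stratum."""
    strata = defaultdict(list)
    for key, sketch in sketches.items():
        strata[sketch].append(key)
    return dict(strata)


# Hypothetical toy graph: a triangle A-B-C and a separate edge E-F.
graph = {
    "A": ["B", "C"],
    "B": ["A", "C"],
    "C": ["A", "B"],
    "E": ["F"],
    "F": ["E"],
}
sketches = {node: minhash_sketch(pivot_set(graph, node)) for node in graph}
strata = hash_into_strata(sketches)
print(sorted(sorted(members) for members in strata.values()))
```

Nodes with the same local topology (here A, B and C) produce the same pivot set, hence the same sketch, and land in the same stratum.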

Page 13: On the role of Interactivity and Data Placement in Big Data Analytics

[Figure: the data placement pipeline in four stages]

• Pivot transformations: each data item (Δ1, …, Δ25) is converted into a pivot set (PS-1, …, PS-25) of small labeled fragments over nodes such as A, B, C, E, F, L
• Minwise hashing on pivot sets: each pivot set is reduced to a sketch (SK), e.g. SK-1 = {1050, 2020, 3130, 1800} and SK-25 = {1050, 2020, 7225, 2020}
• SketchSort or SketchCluster: sketches are grouped into strata S-1, …, S-128, e.g. S-4 holds (Δ1, SK-1), (Δ5, SK-5), (Δ12, SK-12), (Δ25, SK-25)
• Partitioning & replication: strata are placed into partitions P-1, …, P-8, e.g. P-2 holds S-4, S-7, S-8, S-12, …, S-128 and P-8 holds S-3, S-4, S-9, S-12, …, S-127

Page 14: On the role of Interactivity and Data Placement in Big Data Analytics

Frequent Tree Mining

• Our proposed approaches show 100X gains

Page 15: On the role of Interactivity and Data Placement in Big Data Analytics

WebGraph Compression

• Linear Scaleup with no loss in compression ratio

Page 16: On the role of Interactivity and Data Placement in Big Data Analytics

PRISM-HD: PRobing the Intrinsic Structure and Makeup of High-dimensional Data

Page 17: On the role of Interactivity and Data Placement in Big Data Analytics

Visualization and Interactivity are key to discovery

Page 18: On the role of Interactivity and Data Placement in Big Data Analytics

PRISM-HD

• What?
– A novel mechanism for exploring complex data

• Why?
– User is often overwhelmed with characteristics of data
– Befuddled on where to start

• How?
– Given a similarity measure of interest
– Compute similarity graph at threshold (t)
  • Key: Graphs are dimensionless
– Provide user graph visualization cues
  • User determines next threshold and repeats
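The loop described above can be sketched as follows. The sketch assumes cosine similarity as the measure of interest and uses edge and isolated-node counts as stand-ins for the visualization cues; the function names and the toy vectors are hypothetical and are not part of PRISM-HD.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity_graph(points, threshold):
    """Edges between all pairs whose similarity is at least `threshold`."""
    return [(i, j)
            for (i, u), (j, v) in combinations(points.items(), 2)
            if cosine(u, v) >= threshold]


def cues(points, edges):
    """Simple summary a user could inspect before picking the next threshold."""
    touched = {node for edge in edges for node in edge}
    return {"edges": len(edges), "isolated": len(points) - len(touched)}


# Hypothetical high-dimensional data, kept tiny for illustration.
points = {
    "a": (1, 0, 0),
    "b": (1, 1, 0),
    "c": (0, 1, 0),
    "d": (0, 1, 3),
    "e": (2, 0, 0),
}

# The user lowers the threshold step by step and compares the cues.
for t in (0.9, 0.5, 0.2):
    print(t, cues(points, similarity_graph(points, t)))
```

A high threshold leaves most nodes isolated, while a low one connects everything; the user reads these cues to decide where to look next.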

Page 19: On the role of Interactivity and Data Placement in Big Data Analytics

[Figure: similarity graphs of the same data at a high threshold, a moderate threshold, and a low threshold]

Page 20: On the role of Interactivity and Data Placement in Big Data Analytics

Benefits of Knowledge Caching

Page 21: On the role of Interactivity and Data Placement in Big Data Analytics

Benefits of Incremental Processing on Twitter

Incremental estimates on Twitter t1 = 0.95
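The slides give only the titles of these results, so the following is one plausible reading rather than a description of PRISM-HD: similarities computed for the first threshold are cached and reused when the user moves to the next threshold. The class name, the exact cosine computation, and the toy points are all assumptions made for illustration.

```python
import math
from itertools import combinations


class CachedSimilarity:
    """Hypothetical explorer that remembers every similarity it computes."""

    def __init__(self, points):
        self.points = points
        self.cache = {}
        self.computed = 0  # number of similarity computations actually run

    def similarity(self, i, j):
        key = (i, j) if i <= j else (j, i)
        if key not in self.cache:
            u, v = self.points[i], self.points[j]
            dot = sum(a * b for a, b in zip(u, v))
            norm = (math.sqrt(sum(a * a for a in u))
                    * math.sqrt(sum(b * b for b in v)))
            self.cache[key] = dot / norm if norm else 0.0
            self.computed += 1
        return self.cache[key]

    def graph(self, threshold):
        """Similarity graph at `threshold`, served from the cache when possible."""
        return [(i, j)
                for i, j in combinations(sorted(self.points), 2)
                if self.similarity(i, j) >= threshold]


points = {"a": (1, 0), "b": (1, 1), "c": (0, 1), "e": (2, 0)}
explorer = CachedSimilarity(points)

first = explorer.graph(0.95)    # first threshold: every pair is computed
after_first = explorer.computed
second = explorer.graph(0.5)    # next threshold: answered from the cache
after_second = explorer.computed
print(after_first, after_second)
```

In this toy setting the second threshold triggers no new similarity computations, which is the kind of saving a cache is meant to deliver during interactive exploration.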

Page 22: On the role of Interactivity and Data Placement in Big Data Analytics

PRISM-HD and Global Graphs in Context: Leveraging Social Media in Emergency Response

Page 23: On the role of Interactivity and Data Placement in Big Data Analytics

Concluding Remarks

• Data is everywhere

• Data is fraught with complexities
– Dimensionality, dynamics, structure, massive…

• Both data placement and data interactivity have an important role to play in big data analytics
– PRISM-HD and GlobalGraphs can help!

Page 24: On the role of Interactivity and Data Placement in Big Data Analytics

Thanks for your attention

Contact: [email protected]

Mining Simulation Data

Medical Image Analysis

Protein Interaction Network (yeast)

Acknowledgements: Various NSF, NIH, DOE and industry grants