On the role of Interactivity and Data Placement in Big Data Analytics
Srini Parthasarathy, OSU
The Data Deluge: Data Data Everywhere
Data Storage is Cheap
• $600 buys a disk drive that can store all of the world’s music
[McKinsey Global Institute Special Report, June ’11]
Data does not exist in isolation.
Data almost always exists in connection with other data, and those connections are an integral part of the value proposition.
[Figure panels: Social networks, Protein interactions, Internet, VLSI networks, Data dependencies, Neighborhood graphs]
Big Data Problem: All this data is useful only if we can scalably extract knowledge from such complex data.
THIS TALK
• THE ROLE OF DATA PLACEMENT IN BIG DATA SYSTEMS
• THE ROLE OF VISUALIZATION AND INTERACTION IN BIG DATA ANALYSIS
GLOBAL GRAPHS
• What?
– A system for deploying applications that process complex data
• Why?
– Seeks a balance between high productivity and high performance
• How?
– Built on top of PNL’s GlobalArrays
– Trees (GlobalTrees, GlobalForests)
– Relational Arrays (ArrayDB-GA)
– Graphs (GlobalGraphs)
• Data Placement is key to high performance
Importance of Data Placement
• Locality
– Placing related items close to each other so they may be processed together
• Mitigating Impact of Data Skew
– Reducing load imbalance in a parallel setting
– Reducing variance in partition samples
• Generating Stratified Samples
– Improving interactive performance
Key Ideas
• Pivotization
– Convert data with complex structure into sets
– Each element of a set captures features of local topology
• Hashing into Strata: hash related sets into similar bins
– Can employ a sketch-clustering algorithm
• Partitioning: place strata into partitions for
– Locality
– Mitigating data skew
– Samples
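The first two ideas can be illustrated with a minimal sketch, which is not the GlobalGraphs implementation. It turns a small labeled tree into a pivot set and summarizes that set with a minwise-hash sketch. The pivot definition (a parent together with a pair of its children), the function names, the BLAKE2-based hash family, and the sketch length of 4 are all assumptions chosen for illustration.

```python
import hashlib
from itertools import combinations


def pivot_set(tree):
    """Pivotization under an assumed pivot definition.

    tree maps a node label to the list of its child labels. Each pivot is a
    (parent, child_a, child_b) triple, i.e. one small piece of local topology.
    """
    pivots = set()
    for parent, children in tree.items():
        for a, b in combinations(sorted(children), 2):
            pivots.add((parent, a, b))
    return pivots


def _hash(element, seed):
    # One member of a seeded hash family (hypothetical choice: BLAKE2b, 64 bits).
    data = f"{seed}:{element}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_sketch(elements, num_hashes=4):
    """Keep the minimum hash value of the set under each seeded hash function."""
    if not elements:
        raise ValueError("cannot sketch an empty set")
    return tuple(min(_hash(e, seed) for e in elements) for seed in range(num_hashes))


def estimated_jaccard(sketch_a, sketch_b):
    """Fraction of sketch positions that agree; estimates set overlap."""
    return sum(a == b for a, b in zip(sketch_a, sketch_b)) / len(sketch_a)


# Two hypothetical trees that share most of their local structure.
tree_1 = {"A": ["B", "C", "L"], "L": ["E"]}
tree_2 = {"A": ["B", "C", "L"], "L": ["E", "F"]}

sketch_1 = minhash_sketch(pivot_set(tree_1))
sketch_2 = minhash_sketch(pivot_set(tree_2))
print(sketch_1, sketch_2, estimated_jaccard(sketch_1, sketch_2))
```

Each sketch position agrees with probability equal to the Jaccard similarity of the two pivot sets, so items with heavily overlapping pivot sets tend to receive similar sketches. That property is what allows related items to be hashed into the same stratum.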
[Figure: data placement pipeline]
DATA (Δ1 … Δ25) → PIVOT TRANSFORMATIONS → PIVOT SETS (PS-1 … PS-25) → MINWISE HASHING on PIVOT SETS → SKETCHES (SK) → SKETCHSORT or SKETCHCLUSTER → Strata (S-1 … S-128) → PARTITIONING & REPLICATION → Partitions (P-1 … P-8)
Example sketches: SK-1 = {1050, 2020, 3130, 1800}; SK-25 = {1050, 2020, 7225, 2020}
Example stratum: S-4 holds (Δ1, SK-1), (Δ5, SK-5), (Δ12, SK-12), (Δ25, SK-25)
Example partitions: P-2 holds S-4, S-7, S-8, S-12, …, S-128; P-8 holds S-3, S-4, S-9, S-12, …, S-127
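The last two stages of the figure, grouping sketches into strata and placing strata into partitions, might look roughly like the following. This is a simplified illustration under stated assumptions: lexicographic sorting stands in for SketchSort, the two placement policies and all function names are hypothetical, replication is omitted, and only the first two sketches come from the figure (the other two are made up).

```python
def sketch_sort_strata(sketches, num_strata):
    """Sort items by sketch and cut the order into at most num_strata runs.

    Lexicographic sorting only groups sketches that share leading values; it is
    a crude stand-in for a real sketch sorting or clustering step.
    """
    ordered = sorted(sketches, key=lambda item: sketches[item])
    size = -(-len(ordered) // num_strata)  # ceiling division
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]


def place_for_locality(strata, num_partitions):
    """Keep each stratum whole; give it to the currently lightest partition."""
    partitions = [[] for _ in range(num_partitions)]
    for stratum in sorted(strata, key=len, reverse=True):
        min(partitions, key=len).extend(stratum)
    return partitions


def place_for_balance(strata, num_partitions):
    """Deal items round-robin so every partition gets a slice of each stratum."""
    partitions = [[] for _ in range(num_partitions)]
    slot = 0
    for stratum in strata:
        for item in stratum:
            partitions[slot % num_partitions].append(item)
            slot += 1
    return partitions


sketches = {
    "d1": (1050, 2020, 3130, 1800),   # SK-1 from the figure
    "d25": (1050, 2020, 7225, 2020),  # SK-25 from the figure
    "d2": (4100, 900, 3130, 5000),    # hypothetical
    "d3": (4100, 900, 6000, 5000),    # hypothetical
}

strata = sketch_sort_strata(sketches, num_strata=2)
print(strata)
print(place_for_locality(strata, num_partitions=2))
print(place_for_balance(strata, num_partitions=2))
```

The two policies reflect the goals listed earlier. Keeping a stratum together favors locality, while spreading each stratum across partitions reduces skew and makes every partition resemble a stratified sample of the whole data set.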
Frequent Tree Mining
• Our proposed approach shows 100X gains
WebGraph Compression
• Linear Scaleup with no loss in compression ratio
PRISM-HD: PRobing the Intrinsic Structure and Makeup of High-dimensional Data
Visualization and Interactivity are key to discovery
PRISM-HD
• What?
– A novel mechanism for exploring complex data
• Why?
– User is often overwhelmed by the characteristics of the data
– Befuddled about where to start
• How?
– Given a similarity measure of interest
– Compute the similarity graph at threshold (t)
• Key: Graphs are dimensionless
– Provide the user with graph visualization cues
• User determines the next threshold and repeats
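The explore-by-threshold loop can be sketched as follows. This is an illustrative toy, not the PRISM-HD system: cosine similarity is an assumed measure of interest, the edge and component counts stand in for the visualization cues, and the function names and data points are hypothetical. Pairwise similarities are computed once and reused for every threshold the user tries.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def pairwise_similarities(points, similarity=cosine):
    """Compute every pairwise similarity once so later thresholds can reuse it."""
    return {
        (i, j): similarity(points[i], points[j])
        for i, j in combinations(range(len(points)), 2)
    }


def similarity_graph(sims, threshold):
    """Edges of the similarity graph: pairs at or above the threshold."""
    return [pair for pair, s in sims.items() if s >= threshold]


def component_count(num_nodes, edges):
    """Number of connected components, computed with a small union-find."""
    parent = list(range(num_nodes))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)
    return len({find(x) for x in range(num_nodes)})


# Hypothetical data: two tight pairs of points pointing in different directions.
points = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9)]
sims = pairwise_similarities(points)

# The user starts at a high threshold and lowers it after inspecting the cues.
for t in (0.999, 0.95, 0.2):
    edges = similarity_graph(sims, t)
    print(t, len(edges), component_count(len(points), edges))
```

Because the graph depends only on pairwise similarities, the same cues can be shown regardless of how many dimensions the underlying data has.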
[Figure: similarity graphs at HIGH THRESHOLD, MODERATE THRESHOLD, and LOW THRESHOLD]
Benefits of Knowledge Caching
Benefits of Incremental Processing on Twitter
Incremental estimates on Twitter t1 = 0.95
PRISM-HD and Global Graphs in Context: Leveraging Social Media in Emergency Response
Concluding Remarks
• Data is everywhere
• Data is fraught with complexities
– Dimensionality, dynamics, structure, massive…
• Both data placement and data interactivity have an important role to play in big data analytics
– PRISM-HD and GlobalGraphs can help!
Thanks for your attention
Contact: [email protected]
Mining Simulation Data
Medical Image Analysis
Protein Interaction Network (yeast)
Acknowledgements: Various NSF, NIH, DOE and industry grants