Demystifying Systems for Interactive and Real-time Analytics
DESCRIPTION
Demystifying Systems for Interactive and Real-time Analytics. The BigFrame Team. Duke University, Hong Kong Polytechnic University, and HP Labs. Analytics System Landscape. Streaming. Dataflow. MapReduce. Graph. Multi-tenant. MPP DB. Array DB. Columnar. Mixed. Text Analytics.
TRANSCRIPT
![Page 1: Demystifying Systems for Interactive and Real-time Analytics](https://reader036.vdocuments.us/reader036/viewer/2022062315/568165dd550346895dd8f857/html5/thumbnails/1.jpg)
Demystifying Systems for Interactive and Real-time Analytics
The BigFrame Team
Duke University, Hong Kong Polytechnic University, and HP Labs
![Page 2: Demystifying Systems for Interactive and Real-time Analytics](https://reader036.vdocuments.us/reader036/viewer/2022062315/568165dd550346895dd8f857/html5/thumbnails/2.jpg)
Analytics System Landscape
MPP DB
Columnar
MapReduce
Mixed
Dataflow
Streaming
Text Analytics
Array DB
Graph
Multi-tenant
![Page 3: Demystifying Systems for Interactive and Real-time Analytics](https://reader036.vdocuments.us/reader036/viewer/2022062315/568165dd550346895dd8f857/html5/thumbnails/3.jpg)
Analytics System Landscape
Gamma
Aster
Netezza
DB2 PE
Teradata
SQL Server Parallel Data Warehouse
Greenplum
![Page 4: Demystifying Systems for Interactive and Real-time Analytics](https://reader036.vdocuments.us/reader036/viewer/2022062315/568165dd550346895dd8f857/html5/thumbnails/4.jpg)
Analytics System Landscape
HP Vertica
ParAccel
Redshift
Vectorwise
![Page 5: Demystifying Systems for Interactive and Real-time Analytics](https://reader036.vdocuments.us/reader036/viewer/2022062315/568165dd550346895dd8f857/html5/thumbnails/5.jpg)
Analytics System Landscape
Hadoop
Tenzing
Hive
Mahout
HadoopDB
Pig
![Page 6: Demystifying Systems for Interactive and Real-time Analytics](https://reader036.vdocuments.us/reader036/viewer/2022062315/568165dd550346895dd8f857/html5/thumbnails/6.jpg)
Analytics System Landscape
Dremel
Drill
Stinger
Impala
Spark
Dryad
SCOPE
![Page 7: Demystifying Systems for Interactive and Real-time Analytics](https://reader036.vdocuments.us/reader036/viewer/2022062315/568165dd550346895dd8f857/html5/thumbnails/7.jpg)
Analytics System Landscape
Cassandra
HBase
Bigtable
Druid
HANA
Spanner
Megastore
Splunk
![Page 8: Demystifying Systems for Interactive and Real-time Analytics](https://reader036.vdocuments.us/reader036/viewer/2022062315/568165dd550346895dd8f857/html5/thumbnails/8.jpg)
Analytics System Landscape
Storm
GraphLab
Streambase
Cassovary
GraphX
Solr
ElasticSearch
SciDB
Cloudera Search
MadLINQ
Pregel
HAMA
![Page 9: Demystifying Systems for Interactive and Real-time Analytics](https://reader036.vdocuments.us/reader036/viewer/2022062315/568165dd550346895dd8f857/html5/thumbnails/9.jpg)
Analytics System Landscape
Mesos
YARN
Serengeti
Cloud platforms
![Page 10: Demystifying Systems for Interactive and Real-time Analytics](https://reader036.vdocuments.us/reader036/viewer/2022062315/568165dd550346895dd8f857/html5/thumbnails/10.jpg)
What does this mean for Big Data Practitioners?
![Page 11: Demystifying Systems for Interactive and Real-time Analytics](https://reader036.vdocuments.us/reader036/viewer/2022062315/568165dd550346895dd8f857/html5/thumbnails/11.jpg)
Gives them a lot of power!
From: http://animeonly.org/Digital-Wallpapers/Digital-renders/Spiderman-95061p.html
![Page 12: Demystifying Systems for Interactive and Real-time Analytics](https://reader036.vdocuments.us/reader036/viewer/2022062315/568165dd550346895dd8f857/html5/thumbnails/12.jpg)
Even the mighty may need a little help
![Page 13: Demystifying Systems for Interactive and Real-time Analytics](https://reader036.vdocuments.us/reader036/viewer/2022062315/568165dd550346895dd8f857/html5/thumbnails/13.jpg)
Challenges for Practitioners
App Developers, Data Scientists: Which system to use for the app that I am developing?
• Features (e.g., graph data)
• Performance (e.g., claims like System A is 50x faster than B)
• Resource efficiency
• Growth and scalability
• Multi-tenancy
![Page 14: Demystifying Systems for Interactive and Real-time Analytics](https://reader036.vdocuments.us/reader036/viewer/2022062315/568165dd550346895dd8f857/html5/thumbnails/14.jpg)
Challenges for Practitioners
App Developers, Data Scientists: Which system to use for the app that I am developing?
Different parts of my app have different requirements
Compose “best of breed” systems OR use a “one size fits all” system?
System Admins: Managing many systems is hard!
![Page 15: Demystifying Systems for Interactive and Real-time Analytics](https://reader036.vdocuments.us/reader036/viewer/2022062315/568165dd550346895dd8f857/html5/thumbnails/15.jpg)
Challenges for Practitioners
App Developers, Data Scientists: Which system to use for the app that I am developing?
Different parts of my app have different requirements
System Admins: Managing many systems is hard!
CIO: Total Cost of Ownership (TCO)?
![Page 16: Demystifying Systems for Interactive and Real-time Analytics](https://reader036.vdocuments.us/reader036/viewer/2022062315/568165dd550346895dd8f857/html5/thumbnails/16.jpg)
Numbers make decisions easier
![Page 17: Demystifying Systems for Interactive and Real-time Analytics](https://reader036.vdocuments.us/reader036/viewer/2022062315/568165dd550346895dd8f857/html5/thumbnails/17.jpg)
Need benchmarks
![Page 18: Demystifying Systems for Interactive and Real-time Analytics](https://reader036.vdocuments.us/reader036/viewer/2022062315/568165dd550346895dd8f857/html5/thumbnails/18.jpg)
One Approach
Categorize systems
Develop a benchmark per system category
![Page 19: Demystifying Systems for Interactive and Real-time Analytics](https://reader036.vdocuments.us/reader036/viewer/2022062315/568165dd550346895dd8f857/html5/thumbnails/19.jpg)
Useful, But …
Star Schema Benchmark
TPC-H / TPC-DS
Counting triangles
Terasort
GridMix
SWIM
HiBench
DFSIO
MapReduce Vs. Parallel DB
Hive Benchmark (in HiBench)
Berkeley Big Data Benchmark
Yahoo Cloud Serving Benchmark (YCSB)
YCSB Variants
CH-benCHmark
MulTe
Graph 500
PageRank
RDF Benchmarks
Information Extraction Benchmark
Linear Road
SS-DB
![Page 20: Demystifying Systems for Interactive and Real-time Analytics](https://reader036.vdocuments.us/reader036/viewer/2022062315/568165dd550346895dd8f857/html5/thumbnails/20.jpg)
Problem #1 May Miss the Big Picture
![Page 21: Demystifying Systems for Interactive and Real-time Analytics](https://reader036.vdocuments.us/reader036/viewer/2022062315/568165dd550346895dd8f857/html5/thumbnails/21.jpg)
Problem #1 May Miss the Big Picture
Cannot capture the complexities and end-to-end behavior of big data applications and deployments:
(i) Bottlenecks
(ii) Data conversion, transfer, & loading overheads
(iii) Storage costs & other parts of the data life-cycle
(iv) Resource management challenges
(v) Total Cost of Ownership (TCO)
![Page 22: Demystifying Systems for Interactive and Real-time Analytics](https://reader036.vdocuments.us/reader036/viewer/2022062315/568165dd550346895dd8f857/html5/thumbnails/22.jpg)
Problem #2
Give a man a fish and you will feed him for a day. (Benchmark)
Give him fishing gear and you will feed him for life. (Benchmark Generator)
-- Anonymous
![Page 23: Demystifying Systems for Interactive and Real-time Analytics](https://reader036.vdocuments.us/reader036/viewer/2022062315/568165dd550346895dd8f857/html5/thumbnails/23.jpg)
BigFrame: A Benchmark Generator for Big Data Analytics
![Page 24: Demystifying Systems for Interactive and Real-time Analytics](https://reader036.vdocuments.us/reader036/viewer/2022062315/568165dd550346895dd8f857/html5/thumbnails/24.jpg)
How a user uses BigFrame
BigFrame Interface
bigif (benchmark input format)
Benchmark Generator
bspec (benchmark specification)
Benchmark Driver for System Under Test
run the benchmark
results
System Under Test
HBase
Hive
MapReduce
![Page 25: Demystifying Systems for Interactive and Real-time Analytics](https://reader036.vdocuments.us/reader036/viewer/2022062315/568165dd550346895dd8f857/html5/thumbnails/25.jpg)
bspec: Benchmark Specification
1. Data for initial load
2. Data refresh pattern
3. Query streams
4. Evaluation metrics
Axis: Time
System Under Test
HBase
Hive
MapReduce
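The four parts of a bspec can be pictured as a small data structure. The sketch below is only an illustration: the class names, field names, row counts, and metric names are hypothetical and are not taken from BigFrame. The table names `Web_sales` and `Tweets` come from the application domain shown later in the deck.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class RefreshEvent:
    """One step of the data refresh pattern (hypothetical shape)."""
    at_seconds: float  # offset from benchmark start
    table: str         # data set that receives new records
    rows: int          # number of records that arrive


@dataclass
class QueryStream:
    """An ordered list of queries issued by one client (hypothetical shape)."""
    name: str
    queries: List[str]


@dataclass
class BSpec:
    """Illustrative container for the four parts of a benchmark specification."""
    initial_load: Dict[str, int]          # 1. data set name -> row count
    refresh_pattern: List[RefreshEvent]   # 2. when new data arrives
    query_streams: List[QueryStream]      # 3. what is run against the data
    metrics: List[str] = field(default_factory=list)  # 4. what is measured

    def timeline(self) -> List[str]:
        """Return the refresh events in time order as readable strings."""
        ordered = sorted(self.refresh_pattern, key=lambda e: e.at_seconds)
        return [f"t={e.at_seconds:g}s: +{e.rows} rows into {e.table}" for e in ordered]


# All numbers below are made up for illustration.
spec = BSpec(
    initial_load={"Web_sales": 1_000_000, "Tweets": 5_000_000},
    refresh_pattern=[
        RefreshEvent(60, "Tweets", 50_000),
        RefreshEvent(30, "Web_sales", 10_000),
    ],
    query_streams=[QueryStream("s1", ["SELECT COUNT(*) FROM Web_sales"])],
    metrics=["latency_seconds", "throughput_qps"],
)
print(spec.timeline())
```

Keeping the refresh pattern as timestamped events mirrors the time axis on the slide: a driver could replay the events in order while the query streams run.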
![Page 26: Demystifying Systems for Interactive and Real-time Analytics](https://reader036.vdocuments.us/reader036/viewer/2022062315/568165dd550346895dd8f857/html5/thumbnails/26.jpg)
What does the user (want to) specify?
BigFrame Interface
bigif (benchmark input format)
![Page 27: Demystifying Systems for Interactive and Real-time Analytics](https://reader036.vdocuments.us/reader036/viewer/2022062315/568165dd550346895dd8f857/html5/thumbnails/27.jpg)
The 3Vs
Volume
Variety
Velocity
![Page 28: Demystifying Systems for Interactive and Real-time Analytics](https://reader036.vdocuments.us/reader036/viewer/2022062315/568165dd550346895dd8f857/html5/thumbnails/28.jpg)
bigif: BigFrame’s Input Format
Data Variety: Relational, text, array, graph
Data Volume: Small, medium, large
Data Velocity: At rest, slow, fast
Query Variety: Micro, Macro
Query Volume: Query concurrency & classes
Query Velocity: Exploratory, Continuous
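The six bigif dimensions can be written down as typed values. This is a minimal sketch, not BigFrame's actual input format: the class and field names are hypothetical, and modelling Query Volume as a stream count plus a set of class labels is an assumption based on the "Query concurrency & classes" label.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet


class DataVariety(Enum):
    RELATIONAL = "relational"
    TEXT = "text"
    ARRAY = "array"
    GRAPH = "graph"


class DataVolume(Enum):
    SMALL = "small"
    MEDIUM = "medium"
    LARGE = "large"


class DataVelocity(Enum):
    AT_REST = "at rest"
    SLOW = "slow"
    FAST = "fast"


class QueryVariety(Enum):
    MICRO = "micro"
    MACRO = "macro"


class QueryVelocity(Enum):
    EXPLORATORY = "exploratory"
    CONTINUOUS = "continuous"


@dataclass(frozen=True)
class BigIf:
    """One point in the {Data, Query} X {Variety, Volume, Velocity} space."""
    data_variety: FrozenSet[DataVariety]
    data_volume: DataVolume
    data_velocity: DataVelocity
    query_variety: QueryVariety
    query_velocity: QueryVelocity
    query_concurrency: int = 1                   # Query Volume: concurrent streams (assumed)
    query_classes: FrozenSet[str] = frozenset()  # Query Volume: class labels (assumed)

    def __post_init__(self) -> None:
        if not self.data_variety:
            raise ValueError("at least one data variety is required")
        if self.query_concurrency < 1:
            raise ValueError("query_concurrency must be >= 1")


# A point resembling exploratory BI over relational data.
# AT_REST is an assumption; the deck does not state the data velocity for this case.
exploratory_bi = BigIf(
    data_variety=frozenset({DataVariety.RELATIONAL}),
    data_volume=DataVolume.LARGE,
    data_velocity=DataVelocity.AT_REST,
    query_variety=QueryVariety.MICRO,
    query_velocity=QueryVelocity.EXPLORATORY,
)
print(exploratory_bi)
```

Data Variety is a set because a single application can mix several kinds of data, while the other dimensions take one value each.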
![Page 29: Demystifying Systems for Interactive and Real-time Analytics](https://reader036.vdocuments.us/reader036/viewer/2022062315/568165dd550346895dd8f857/html5/thumbnails/29.jpg)
Benchmark Generation
bigif (benchmark input format): bigif describes points in a discrete space of {Data, Query} X {Variety, Volume, Velocity}
Benchmark Generator
bspec (benchmark specification):
1. Initial data to load
2. Data refresh pattern
3. Query streams
4. Evaluation metrics
Benchmark generation can be addressed as a search problem within a rich application domain
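One way to read the search framing is as a lookup over a catalog of candidate workloads, each tagged with the data it needs. The sketch below is a hypothetical illustration of that reading, not BigFrame's algorithm: the catalog entries, their names, and the plain filtering strategy are all assumptions.

```python
from itertools import product
from typing import Dict, List, Set

ENTITIES = ("Data", "Query")
DIMENSIONS = ("Variety", "Volume", "Velocity")

# The six axes of the bigif space: {Data, Query} X {Variety, Volume, Velocity}.
AXES = [f"{entity} {dimension}" for entity, dimension in product(ENTITIES, DIMENSIONS)]

# Hypothetical catalog of candidate workloads in an e-commerce / social media domain.
CANDIDATES: List[Dict] = [
    {"name": "sales_by_item", "needs": {"relational"}, "variety": "micro"},
    {"name": "tweet_sentiment", "needs": {"text"}, "variety": "micro"},
    {"name": "promotion_effect_with_sentiment",
     "needs": {"relational", "text"}, "variety": "macro"},
    {"name": "influencer_driven_sales",
     "needs": {"relational", "text", "graph"}, "variety": "macro"},
]


def search(data_variety: Set[str], query_variety: str) -> List[str]:
    """Return candidates whose data needs are covered by the requested point."""
    return [
        candidate["name"]
        for candidate in CANDIDATES
        if candidate["needs"] <= data_variety and candidate["variety"] == query_variety
    ]


print(AXES)
print(search({"relational"}, "micro"))
print(search({"relational", "text", "graph"}, "macro"))
```

Filtering is the simplest possible search. A real generator would also have to satisfy the volume and velocity constraints, which this sketch leaves out.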
![Page 30: Demystifying Systems for Interactive and Real-time Analytics](https://reader036.vdocuments.us/reader036/viewer/2022062315/568165dd550346895dd8f857/html5/thumbnails/30.jpg)
Application Domain Modeled Currently
E-commerce sales, promotions, recommendations
Social media sentiment & influence
Benchmark generation can be addressed as a search problem within a rich application domain
![Page 31: Demystifying Systems for Interactive and Real-time Analytics](https://reader036.vdocuments.us/reader036/viewer/2022062315/568165dd550346895dd8f857/html5/thumbnails/31.jpg)
Application Domain Modeled Currently
Item
Customer
Web_sales
Promotion
Tweets
Relationships
![Page 32: Demystifying Systems for Interactive and Real-time Analytics](https://reader036.vdocuments.us/reader036/viewer/2022062315/568165dd550346895dd8f857/html5/thumbnails/32.jpg)
Application Domain Modeled Currently
Item
Web_sales
Promotion
![Page 33: Demystifying Systems for Interactive and Real-time Analytics](https://reader036.vdocuments.us/reader036/viewer/2022062315/568165dd550346895dd8f857/html5/thumbnails/33.jpg)
Application Domain Modeled Currently
![Page 34: Demystifying Systems for Interactive and Real-time Analytics](https://reader036.vdocuments.us/reader036/viewer/2022062315/568165dd550346895dd8f857/html5/thumbnails/34.jpg)
Benchmark Generation
bigif (benchmark input format): bigif describes points in a discrete space of {Data, Query} X {Variety, Volume, Velocity}
Benchmark Generator
bspec (benchmark specification):
1. Initial data to load
2. Data refresh pattern
3. Query streams
4. Evaluation metrics
BigFrame can generate Data, Queries, and Arrival Patterns with the user-specified {Variety, Volume, Velocity} requirements from the application domain
![Page 35: Demystifying Systems for Interactive and Real-time Analytics](https://reader036.vdocuments.us/reader036/viewer/2022062315/568165dd550346895dd8f857/html5/thumbnails/35.jpg)
Use Cases of BigFrame
![Page 36: Demystifying Systems for Interactive and Real-time Analytics](https://reader036.vdocuments.us/reader036/viewer/2022062315/568165dd550346895dd8f857/html5/thumbnails/36.jpg)
Use Case I: Exploratory BI
• Large volumes of relational data
• Mostly aggregation and few joins
• Can Spark’s performance match that of an MPP DB?
Data Variety = {Relational}
Query Variety = Micro
BigFrame will generate a benchmark specification containing relational data and (SQL-ish) queries
Use Case II: Complex BI
• Large volumes of relational data
• Even larger volumes of text data
• Combined analytics
Data Variety = {Relational, Text}
Query Variety = Macro (application-focused instead of micro-benchmarking)
BigFrame will generate a benchmark specification that includes sentiment analysis tasks over tweets
![Page 38: Demystifying Systems for Interactive and Real-time Analytics](https://reader036.vdocuments.us/reader036/viewer/2022062315/568165dd550346895dd8f857/html5/thumbnails/38.jpg)
Use Case III: Dashboards
• Large volume and velocity of relational and text data
• Continuously-updated Dashboards
Data Velocity = Fast
Query Velocity = Continuous (as opposed to Exploratory)
BigFrame will generate a benchmark specification that includes data refresh as well as continuous queries whose results change upon data refresh
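A continuous query whose result changes on each data refresh can be illustrated with a running aggregate. This is a minimal sketch under assumed names: `ContinuousCount`, `replay`, and the item identifiers are hypothetical, and the batches stand in for whatever refresh pattern a bspec would describe.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


class ContinuousCount:
    """Keeps units sold per item and re-emits the result after every refresh."""

    def __init__(self) -> None:
        self._totals: Counter = Counter()

    def on_refresh(self, batch: Iterable[Tuple[str, int]]) -> Dict[str, int]:
        # Fold the newly arrived records into the running totals.
        for item, quantity in batch:
            self._totals[item] += quantity
        # Return a snapshot so callers cannot mutate internal state.
        return dict(self._totals)


def replay(batches: List[List[Tuple[str, int]]]) -> List[Dict[str, int]]:
    """Apply refresh batches in order and collect the result after each one."""
    query = ContinuousCount()
    return [query.on_refresh(batch) for batch in batches]


results = replay([
    [("item_1", 2), ("item_2", 1)],  # first refresh
    [("item_1", 3)],                 # second refresh changes the result for item_1
])
print(results)
```

A dashboard benchmark would time how quickly each refreshed result becomes available, which is where Data Velocity and Query Velocity interact.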
![Page 39: Demystifying Systems for Interactive and Real-time Analytics](https://reader036.vdocuments.us/reader036/viewer/2022062315/568165dd550346895dd8f857/html5/thumbnails/39.jpg)
Use Case IV: Does One Size Fit All?
• Growing set of applications have to process relational, text, & graph data
• Compose “best of breed” systems or use a “one size fits all” system?
Data Variety = {Relational, Text, Graph}
Query Variety = Macro
BigFrame will generate a benchmark specification that includes composite workflows with relational, text, and graph analytics
![Page 40: Demystifying Systems for Interactive and Real-time Analytics](https://reader036.vdocuments.us/reader036/viewer/2022062315/568165dd550346895dd8f857/html5/thumbnails/40.jpg)
Use Case V: Multi-tenancy and SLAs
• Big data deployments are increasingly multi-tenant and need to meet SLAs
Specified through Query Volume dimension
BigFrame can generate a benchmark specification containing a specified number of concurrent query streams with class labels for queries (e.g., Batch, Interactive, or Streaming)
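Generating a fixed number of query streams with per-query class labels might look like the sketch below. It is an assumption-laden illustration: the function name, the query identifiers, and the uniform random assignment of labels are hypothetical. Only the three class labels come from the slide.

```python
import random
from typing import List, Tuple

# Class labels named on the slide.
QUERY_CLASSES = ("Batch", "Interactive", "Streaming")


def make_query_streams(
    num_streams: int, queries_per_stream: int, seed: int = 0
) -> List[List[Tuple[str, str]]]:
    """Build `num_streams` streams of (query_id, class_label) pairs.

    Labels are drawn uniformly at random, which is an assumption made
    for illustration; a seeded generator keeps runs repeatable.
    """
    rng = random.Random(seed)
    streams = []
    for stream_id in range(num_streams):
        stream = []
        for position in range(queries_per_stream):
            query_id = f"s{stream_id}_q{position}"  # hypothetical identifier scheme
            stream.append((query_id, rng.choice(QUERY_CLASSES)))
        streams.append(stream)
    return streams


streams = make_query_streams(num_streams=3, queries_per_stream=4, seed=42)
for stream in streams:
    print(stream)
```

With labels attached to each query, a driver could report metrics per class and check them against per-class SLAs.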
![Page 41: Demystifying Systems for Interactive and Real-time Analytics](https://reader036.vdocuments.us/reader036/viewer/2022062315/568165dd550346895dd8f857/html5/thumbnails/41.jpg)
Working with the Community
• First release of BigFrame planned for August 2013
• With feedback from benchmark developers (BigBench)
• Open-source with extensibility APIs
• Benchmark Drivers for more systems
• Utilities (accessed through the Benchmark Driver to drill down into system behavior during benchmarking)
• Instantiate the BigFrame pipeline for more app domains
![Page 42: Demystifying Systems for Interactive and Real-time Analytics](https://reader036.vdocuments.us/reader036/viewer/2022062315/568165dd550346895dd8f857/html5/thumbnails/42.jpg)
Take Away
• “Benchmarks shape a field (for better or worse) …” -- David Patterson, Univ. of California, Berkeley
• Benchmarks meet different needs for different people
• End customers, application developers, system designers, system administrators, researchers, CIOs
• BigFrame helps users generate benchmarks that best meet their needs