five ways to do data analytics "the wrong way"
Post on 06-May-2015
Five Ways to Do Data Analytics
“The Wrong Way”
Title of the talk, on August 6 2014, @ Pinterest
Powered by the Wisconsin Idea: The Wisconsin Idea is the principle that the university should
improve people’s lives beyond the classroom. It spans UW–Madison’s teaching, research,
outreach and public service.
Jignesh M. Patel
jignesh@cs.wisc.edu
1
Definition: A computing or networking architecture
suggested by the marketing department for sales purposes
rather than for technical reasons. Cisco calls them
"reference designs".
http://www.urbandictionary.com
Follow the markitecture
2
http://gridgaintech.wordpress.com
Technology = In-memory file system
https://spark.apache.org
Technology = In-memory caching + language bindings
http://hortonworks.com/blog/100x-faster-hive/
The Stinger Initiative: 100X Hive
Technology = caching, vectorized query execution
http://blog.cloudera.com
Technology = pin files in memory
3
http://hortonworks.com/blog/stinger-phase-2-the-journey-to-100x-faster-hive/
Problem: Claims are too broad!
https://spark.apache.org
Problem: Claims are too broad
Venkataraman et al. EuroSys’13
Presto (not the Facebook one) vs. Spark: big wins in the R framework
4
Never fix a duct-taped solution
Embrace complexity
5
Image from: http://thewaysleueslove.blogspot.com
One has to apply duct tape to fix problems, but consider
removing it later.
Stonebraker and Cetintemel, ICDE 2005
Natural instinct is to build/deploy a specialized system for each application,
but that approach blows up the operational complexity
6
Chasseur and Patel, WebDB’13
[Figure: a Web App exchanging JSON with a Mapping Layer]
Rather than building a specialized engine for a JSON document store, a
simple language translator to SQL gives higher performance and
better data integrity.
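To make the mapping-layer idea concrete, here is a minimal sketch in Python with SQLite. It is not the design from the cited paper: the table name `doc_fields`, the key-path notation, and the helpers `flatten`, `store` and `find_ids` are hypothetical and chosen only for illustration. Each JSON document is broken into (document id, key path, value) rows, and lookups become plain SQL.

```python
import json
import sqlite3


def flatten(value, prefix=""):
    """Yield (key_path, scalar) pairs for a nested JSON value."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from flatten(child, f"{prefix}.{key}" if prefix else key)
    elif isinstance(value, list):
        for index, child in enumerate(value):
            yield from flatten(child, f"{prefix}[{index}]")
    else:
        yield prefix, value


def store(conn, doc_id, document):
    """Insert one JSON document (a string) as one row per scalar field."""
    rows = [(doc_id, path, json.dumps(scalar))
            for path, scalar in flatten(json.loads(document))]
    conn.executemany("INSERT INTO doc_fields VALUES (?, ?, ?)", rows)


def find_ids(conn, key_path, value):
    """Return ids of documents whose field at key_path equals value."""
    cursor = conn.execute(
        "SELECT doc_id FROM doc_fields "
        "WHERE key_path = ? AND value = ? ORDER BY doc_id",
        (key_path, json.dumps(value)),
    )
    return [row[0] for row in cursor]


conn = sqlite3.connect(":memory:")
# Hypothetical schema: the primary key lets the database enforce
# that a document has at most one value per key path.
conn.execute(
    "CREATE TABLE doc_fields ("
    "doc_id INTEGER, key_path TEXT, value TEXT, "
    "PRIMARY KEY (doc_id, key_path))"
)
store(conn, 1, '{"name": "ann", "address": {"city": "Madison"}, "tags": ["db", "ml"]}')
store(conn, 2, '{"name": "bob", "address": {"city": "Austin"}}')

matches = find_ids(conn, "address.city", "Madison")
```

The point of the sketch is that the translator stays small while storage, indexing and constraints are handled by the relational engine underneath.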
Chasseur and Patel, WebDB’13
Similar story for graphs and linear ML models – can easily be
supported on top of systems powered by relational algebra
The network effect! But in a bad way!
Complexity Growth = O(N^2)
[Figure: two diagrams of interconnected systems, one with nodes 1-3 and one with nodes 1-4]
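The quadratic growth can be checked with a few lines of Python. This is a sketch that assumes every pair of systems may need its own integration; the function names and system names are illustrative only.

```python
from itertools import combinations


def pairwise_links(systems):
    """List every distinct pair of systems that may need an integration."""
    return list(combinations(systems, 2))


def link_count(n):
    """Number of distinct pairs among n systems: n * (n - 1) / 2."""
    return n * (n - 1) // 2


three = pairwise_links(["sql", "graph", "json"])
four = pairwise_links(["sql", "graph", "json", "ml"])
```

Going from three systems to four doubles the number of pairs (3 to 6), and ten systems already give 45.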
7
R vs. Python debate
Complexity Growth = O(N^2) also applies to in-house tools and
programming languages
R Python
5K CRAN statistically robust packages
Linear algebra, clustering, …
ETL
8
Never realize that technology is NOT the “end,” but simply the “means to a (business) end”
Think of technology as the end
9
Netflix Challenge
Example: Building a recommendation system
10
Figure from: Ricardo: Integrating R and Hadoop by Das et al. SIGMOD’10
Key approach: Latent-factor Modeling
All Together Now: A Perspective on the Netflix Prize, by Bell, Koren and Volinsky
Winning insights
• Missing ratings are not missing at random!
• Parameters (popularity, users’ standards for rating, user tastes, …) vary over time
• Combining sets of predictors
• Efficient computation critical
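As a rough illustration of latent-factor modeling, the sketch below fits a small regularized matrix factorization with stochastic gradient descent. It is not the prize-winning method, which combined many predictors and modeled time effects; the toy ratings, the hyperparameters and the function names are assumptions made for the example.

```python
import math
import random


def train_latent_factors(ratings, n_users, n_items, k=2,
                         epochs=1000, lr=0.01, reg=0.02, seed=0):
    """Learn k-dimensional user and item factors from (user, item, rating) triples."""
    rng = random.Random(seed)
    user_f = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_users)]
    item_f = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            pred = sum(a * b for a, b in zip(user_f[u], item_f[i]))
            err = r - pred
            for f in range(k):
                pu, qi = user_f[u][f], item_f[i][f]
                # Gradient step on squared error with L2 regularization.
                user_f[u][f] += lr * (err * qi - reg * pu)
                item_f[i][f] += lr * (err * pu - reg * qi)
    return user_f, item_f


def rmse(ratings, user_f, item_f):
    """Root-mean-square error of the factor model on the given ratings."""
    total = 0.0
    for u, i, r in ratings:
        pred = sum(a * b for a, b in zip(user_f[u], item_f[i]))
        total += (r - pred) ** 2
    return math.sqrt(total / len(ratings))


# Toy data: (user index, item index, rating on a 1-5 scale).
ratings = [(0, 0, 5), (0, 1, 3), (0, 2, 1), (1, 0, 4),
           (1, 1, 3), (1, 2, 1), (2, 1, 1), (2, 2, 5)]

before = rmse(ratings, *train_latent_factors(ratings, 3, 3, epochs=0))
after = rmse(ratings, *train_latent_factors(ratings, 3, 3))
```

Only observed ratings enter the loss here, which is exactly the simplification the first insight above warns about: which ratings are missing carries information too.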
11
Pandora’s Music Recommender by Michael Howe
Pandora: Music Genome
• Content-filtering
• Classification to pick the recommendation
• Key is to “build up a neighborhood for a particular user’s preference”
Pandora.com
12
Build before you analyze the technology trend
Never use back-of-the-envelope calculations
13
Motivation for the UW Quickstep project http://quickstep.cs.wisc.edu
Hardware changes are far more non-linear than in the past
[Figure: storage hierarchy annotated with latency in cycles, with capacity and cost as the other dimensions]
CPU caches: ~1 to 10s of cycles
DRAM: ~100 cycles
NVRAM (e.g. SSDs): ~10^5 to 10^6 cycles
Magnetic hard disk drives: ~10^7 to 10^8 cycles
Energy Efficiency for Large-Scale MapReduce Workloads with Significant Interactive Analysis, Chen et al. EuroSys’12
Most interactive jobs work on “small” data sets
14
15
Patterson, CACM 2004
Latency lags bandwidth
J. Dean, Latency numbers every programmer should know, 2012
[Figure: bar chart of operation latencies. Axis: time in ns (log scale), ticks at 0, 10, 1,000, 100,000, 10,000,000 and 1,000,000,000. Operations shown:]
L1 cache reference
Branch mispredict
L2 cache reference
Mutex lock/unlock
Main memory reference
Compress 1K bytes with Zippy
Send 1K bytes over 1 Gbps network
Read 4K randomly from SSD*
Read 1 MB sequentially from memory
Round trip within same datacenter
Read 1 MB sequentially from SSD*
Disk seek
Read 1 MB sequentially from disk
Send packet CA->Netherlands->CA
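A back-of-the-envelope calculation with numbers like these can be a few lines of Python. The per-megabyte costs below are the commonly circulated 2012 approximations and are an assumption here, since the transcript does not preserve the exact values from the chart; the function name is illustrative.

```python
# Approximate cost in nanoseconds to read 1 MB sequentially
# (assumed values, commonly cited circa 2012).
NS_PER_MB_SEQUENTIAL = {
    "memory": 250_000,
    "ssd": 1_000_000,
    "disk": 20_000_000,
}


def scan_seconds(megabytes, medium):
    """Estimate the seconds needed to scan a data set sequentially."""
    return megabytes * NS_PER_MB_SEQUENTIAL[medium] / 1e9


# Estimated time to scan 1,000 MB on each medium.
estimates = {medium: scan_seconds(1000, medium) for medium in NS_PER_MB_SEQUENTIAL}
```

Under these assumptions, scanning 1,000 MB takes roughly a quarter of a second from memory, about a second from SSD, and about twenty seconds from disk, which is often enough to decide whether a design is worth building.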
Amazing way to reason about bottlenecks
Little’s Law
L = λW
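A small worked example of Little's Law in Python, where L is the average number of items in the system, λ the average arrival rate, and W the average time an item spends in the system. The workload numbers and function names are made up for illustration.

```python
def items_in_system(arrival_rate, time_in_system):
    """Little's Law: L = lambda * W."""
    return arrival_rate * time_in_system


def max_throughput(concurrency, time_in_system):
    """Little's Law rearranged: lambda = L / W."""
    return concurrency / time_in_system


# 2,000 queries per second, each taking 50 ms on average.
in_flight = items_in_system(2000, 0.05)

# 16 worker slots, each request holding a slot for 20 ms on average.
ceiling = max_throughput(16, 0.02)
```

The first case says about 100 queries are in flight at any moment; the second says 16 slots cannot sustain more than about 800 requests per second, whatever the rest of the system does.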
16
Amdahl, AFIPS 1967
Amdahl's law
DeWitt and Gray, CACM 1992
Parallel computing is hard
Speedup = Old/New
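The speedup ratio and Amdahl's law fit in a short Python sketch. The formula used is the standard statement of Amdahl's law, speedup = 1 / ((1 - p) + p / n) for a parallelizable fraction p and n workers; the function names and the sample values are illustrative assumptions.

```python
def speedup(old_time, new_time):
    """Speedup as the ratio of old elapsed time to new elapsed time."""
    return old_time / new_time


def amdahl_speedup(parallel_fraction, workers):
    """Amdahl's law: the serial part bounds the achievable speedup."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


# A job that is 90% parallelizable, run on 10 and then 1,000 workers.
on_10 = amdahl_speedup(0.9, 10)
on_1000 = amdahl_speedup(0.9, 1000)
```

With 90% of the work parallelizable, 10 workers give a speedup of about 5.3, and even 1,000 workers stay below 10, because the serial 10% never shrinks.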
Stubbornly refuse to throw away code and platform architecture.
Fall in love with your architecture
17
Data from 2013 publicly reported numbers and Alexa
[Figure: bubble chart. X-axis: Revenue/Employee ($M), 0 to 3. Y-axis: $/Active User (log scale), 1 to 64. Data labels: 19, 29, 18, 7, 9.]
YouTube
Problem: It’s hard to throw away something that you built, even if it
doesn’t fit anymore
18
Bubble volume based on daily time on the site
19
Summary: doing it right …
Markitecture
Watch for claims that are too broad
Complexity
Simple is beautiful: keep the building blocks of your architectural DNA simple
Architecture
Periodically re-evaluate your technology architecture. Also, people and processes.
Technology and Business
Technology must serve an end business goal
Back-of-the-envelope calculations
Amazingly powerful: think hard before you build!