Five Ways to Do Data Analytics “The Wrong Way”

Title of the talk, on August 6 2014, @ Pinterest. Powered by the Wisconsin Idea: The Wisconsin Idea is the principle that the university should improve people’s lives beyond the classroom. It spans UW–Madison’s teaching, research, outreach and public service. Jignesh M. Patel [email protected]


DESCRIPTION

ABSTRACT: The ongoing big data revolution has changed the way technology is used to empower new business segments like social networking and to transform old business segments like traditional retail. However, the DNA that is used to build data processing platforms is evolving quite rapidly. There is a plethora of competing tools, technologies, and “religion” for how to build state-of-the-art data analysis frameworks. In this talk, I will go over five ways to build scalable, high-performance, long-lasting data analysis frameworks in the wrong way. Surprisingly, the industry is full of examples of organizations building frameworks in this “wrong” way. Since the “right” way to build a technology framework depends on the key business drivers, it is my hope that this talk will spur a discussion on what is the “right” way for Pinterest. The talk will focus on technologies including “data plumbing” (e.g. tools in the Hadoop ecosystem) and statistical modeling methods (e.g. R and Python). In this talk, I’ll try to connect to platform builders, data scientists, and business decision makers.

BIO: Jignesh Patel is a Professor in Computer Sciences at the University of Wisconsin-Madison, where he also earned his Ph.D. He has worked in the area of databases (now fashionably called “big data”) for over two decades. He has won several best paper awards and industry research awards. He is the recipient of the Wisconsin COW teaching award and the U. Michigan College of Engineering Education Excellence Award. He has a strong interest in seeing research ideas transition to actual products. His Ph.D. thesis work was acquired by NCR/Teradata in 1997, and he also co-founded Locomatix, a startup that built a platform to power real-time data-driven mobile services. Locomatix became part of Twitter in 2013. He is an ACM Distinguished Scientist and an IEEE Senior Member. He also serves on the board of Lands’ End, and advises a number of startups.

TRANSCRIPT

Page 1: Five Ways To Do Data Analytics "The Wrong Way"

Five Ways to Do Data Analytics “The Wrong Way”

Title of the talk, on August 6 2014, @ Pinterest

Powered by the Wisconsin Idea: The Wisconsin Idea is the principle that the university should improve people’s lives beyond the classroom. It spans UW–Madison’s teaching, research, outreach and public service.

Jignesh M. Patel

[email protected]

Page 2: Five Ways To Do Data Analytics "The Wrong Way"

Definition: A computing or networking architecture suggested by the marketing department for sales purposes rather than for technical reasons. Cisco calls them "reference designs".

http://www.urbandictionary.com

Follow the markitecture

Page 3: Five Ways To Do Data Analytics "The Wrong Way"

http://gridgaintech.wordpress.com

Technology = In-memory file system

https://spark.apache.org

Technology = In-memory caching + language bindings

http://hortonworks.com/blog/100x-faster-hive/

The Stinger Initiative: 100X Hive

Technology = caching, vectorized query execution

http://blog.cloudera.com

Technology = pin files in memory

Page 4: Five Ways To Do Data Analytics "The Wrong Way"

http://hortonworks.com/blog/stinger-phase-2-the-journey-to-100x-faster-hive/

Problem: Claims are too broad!

https://spark.apache.org

Problem: Claims are too broad

Venkataraman et al. EuroSys’13

Presto (not the Facebook one) vs. Spark: big wins in the R framework

Page 5: Five Ways To Do Data Analytics "The Wrong Way"

Never fix a duct-taped solution

Embrace complexity

Page 6: Five Ways To Do Data Analytics "The Wrong Way"

Image from: http://thewaysleueslove.blogspot.com

One has to apply duct tape to fix problems, but consider removing it later.

Stonebraker and Cetintemel, ICDE 2005

Natural instinct is to build/deploy a specialized system for each application, but that approach blows up the operational complexity

Page 7: Five Ways To Do Data Analytics "The Wrong Way"

Chasseur and Patel, WebDB’13

[Figure labels: Web App, JSON, Mapping Layer]

Rather than a specialized engine for a JSON document store, a simple language translator to SQL has higher performance and better data integrity.
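To make the "mapping layer" idea concrete, here is a minimal sketch of storing JSON documents in a relational table and querying them with plain SQL. It is not the system described by Chasseur and Patel; the table name `docs`, the key-path encoding, and the helper functions `flatten`, `store` and `find_ids` are all assumptions made for illustration.

```python
import json
import sqlite3


def flatten(value, prefix=""):
    """Yield (key_path, scalar) pairs for a nested JSON value."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from flatten(child, f"{prefix}.{key}" if prefix else key)
    elif isinstance(value, list):
        for index, child in enumerate(value):
            yield from flatten(child, f"{prefix}[{index}]")
    else:
        yield prefix, value


def store(conn, doc_id, doc):
    """Insert one row per scalar field of the document (hypothetical schema)."""
    rows = [(doc_id, path, json.dumps(val)) for path, val in flatten(doc)]
    conn.executemany("INSERT INTO docs VALUES (?, ?, ?)", rows)


def find_ids(conn, path, value):
    """Translate a 'field equals value' document query into SQL."""
    cur = conn.execute(
        "SELECT DISTINCT doc_id FROM docs WHERE key = ? AND value = ? "
        "ORDER BY doc_id",
        (path, json.dumps(value)),
    )
    return [row[0] for row in cur]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (doc_id INTEGER, key TEXT, value TEXT)")
store(conn, 1, {"user": {"name": "ann", "city": "Madison"}, "pins": [3, 5]})
store(conn, 2, {"user": {"name": "bob", "city": "Madison"}})
store(conn, 3, {"user": {"name": "cy", "city": "Austin"}})

madison = find_ids(conn, "user.city", "Madison")
print(madison)
```

The point of the sketch is only that the document model can live in a thin translation layer, while transactions, indexing and integrity checks come from the relational engine underneath.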

Similar story for graphs and linear ML models – can easily be supported on top of systems powered by relational algebra

The network effect! But in a bad way!

Complexity Growth = O(N^2)

[Figure: two networks of numbered nodes, one with 3 nodes and one with 4]
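A quick way to see the O(N^2) growth is to count the pairwise connections between N systems. The sketch below assumes the worst case in which every pair of systems needs its own integration; the function names are made up for illustration.

```python
from itertools import combinations


def pairwise_links(systems):
    """List every pair of systems that may need its own integration."""
    return list(combinations(systems, 2))


def link_count(n):
    """Number of pairs among n systems: n * (n - 1) / 2, which is O(n^2)."""
    return n * (n - 1) // 2


for n in (3, 4, 10, 20):
    print(f"{n} systems -> up to {link_count(n)} integrations")
```

Going from 3 to 4 systems doubles the number of possible integrations (3 to 6), and each additional system adds one link per system already in place.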

Page 8: Five Ways To Do Data Analytics "The Wrong Way"

R vs. Python debate

Complexity Growth = O(N^2) also applies to tools and programming languages in house

[Figure labels: R, Python; 5K statistically robust CRAN packages; linear algebra, clustering, …; ETL]

Page 9: Five Ways To Do Data Analytics "The Wrong Way"

Never realize that technology is NOT the “end,” but simply the “means to a (business) end”

Think of technology as the end

Page 10: Five Ways To Do Data Analytics "The Wrong Way"

Netflix Challenge

Example: Building a recommendation system

Page 11: Five Ways To Do Data Analytics "The Wrong Way"

Figure from: Ricardo: Integrating R and Hadoop by Das et al. SIGMOD’10

Key approach: Latent-factor Modeling

All Together Now: A Perspective on the Netflix Prize, by Bell, Koren and Volinsky

Winning insights

• Missing ratings are not missing at random!

• Parameters (popularity, users’ standards for rating, user tastes, …) vary over time

• Combining sets of predictors

• Efficient computation critical
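For readers unfamiliar with latent-factor modeling, the following is a minimal sketch of plain matrix factorization trained with stochastic gradient descent. It is not the Netflix Prize winning method (which combined many predictors and modeled time effects); the toy ratings, the hyperparameters and the function names are all assumptions chosen for illustration.

```python
import random


def factorize(ratings, n_users, n_items, rank=2, lr=0.02, reg=0.02,
              epochs=500, seed=0):
    """Learn user factors P and item factors Q so that P[u] . Q[i] ~ rating."""
    rng = random.Random(seed)
    P = [[rng.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(n_users)]
    Q = [[rng.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - sum(P[u][k] * Q[i][k] for k in range(rank))
            for k in range(rank):
                pu, qi = P[u][k], Q[i][k]
                # Gradient step on squared error with L2 regularization.
                P[u][k] += lr * (err * qi - reg * pu)
                Q[i][k] += lr * (err * pu - reg * qi)
    return P, Q


def rmse(ratings, P, Q):
    """Root-mean-square error over the observed ratings only."""
    total = 0.0
    for u, i, r in ratings:
        pred = sum(pu * qi for pu, qi in zip(P[u], Q[i]))
        total += (r - pred) ** 2
    return (total / len(ratings)) ** 0.5


# Toy (user, item, rating) triples; most user/item pairs are unobserved.
ratings = [(0, 0, 5), (0, 1, 3), (0, 3, 1), (1, 0, 4), (1, 3, 1),
           (2, 0, 1), (2, 1, 1), (2, 3, 5), (3, 0, 1), (3, 2, 5), (3, 3, 4)]

P0, Q0 = factorize(ratings, 4, 4, epochs=0)   # untrained factors
P, Q = factorize(ratings, 4, 4)               # trained factors
before, after = rmse(ratings, P0, Q0), rmse(ratings, P, Q)
print(f"RMSE before training: {before:.2f}, after: {after:.2f}")
```

Note that the model only ever sees observed ratings, which is exactly where the first insight above bites: if ratings are not missing at random, the fit on observed entries says little about the unobserved ones.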

Page 12: Five Ways To Do Data Analytics "The Wrong Way"

Pandora’s Music Recommender by Michael Howe

Pandora: Music Genome

• Content-filtering

• Classification to pick the recommendation

• Key is to “build up a neighborhood for a particular user’s preference”

Pandora.com

Page 13: Five Ways To Do Data Analytics "The Wrong Way"

Build before you analyze the technology trend

Never use back-of-the-envelope calculations

Page 14: Five Ways To Do Data Analytics "The Wrong Way"

Motivation for the UW Quickstep project http://quickstep.cs.wisc.edu

Hardware changes are far more non-linear than in the past

[Figure: storage hierarchy with axes for latency (cycles) and capacity/cost. CPU caches: ~1-10s; DRAM: ~100; NVRAM (e.g. SSDs): ~10^5 – 10^6; magnetic hard disk drives: ~10^7 – 10^8]

Energy Efficiency for Large-Scale MapReduce Workloads with Significant Interactive Analysis, Chen et al. EuroSys’12

Most interactive jobs work on “small” data sets

Page 15: Five Ways To Do Data Analytics "The Wrong Way"

Patterson, CACM 2004

Latency lags bandwidth

J. Dean, Latency numbers every programmer should know, 2012

[Figure: bar chart, time in ns (log scale), for the following operations]

L1 cache reference

Branch mispredict

L2 cache reference

Mutex lock/unlock

Main memory reference

Compress 1K bytes with Zippy

Send 1K bytes over 1 Gbps network

Read 4K randomly from SSD*

Read 1 MB sequentially from memory

Round trip within same datacenter

Read 1 MB sequentially from SSD*

Disk seek

Read 1 MB sequentially from disk

Send packet CA->Netherlands->CA
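As an illustration of how such latency numbers feed a back-of-the-envelope calculation, the sketch below estimates how long it takes to scan a data set from different media. The per-operation figures are not taken from the slide; they are the approximate values commonly quoted alongside the 2012 list and should be treated as order-of-magnitude assumptions, as should the function names.

```python
# Assumed, order-of-magnitude latencies in nanoseconds (not from the slide).
NS_PER_MB_SEQUENTIAL = {
    "memory": 250_000,      # read 1 MB sequentially from memory
    "ssd": 1_000_000,       # read 1 MB sequentially from SSD
    "disk": 20_000_000,     # read 1 MB sequentially from disk
}
DISK_SEEK_NS = 10_000_000   # one disk seek


def scan_seconds(megabytes, medium):
    """Estimated time to read `megabytes` sequentially from `medium`."""
    return megabytes * NS_PER_MB_SEQUENTIAL[medium] / 1e9


def random_disk_reads_seconds(n_reads):
    """Estimated time for n random disk reads, dominated by seeks."""
    return n_reads * DISK_SEEK_NS / 1e9


for medium in ("memory", "ssd", "disk"):
    print(f"scan 1 GB from {medium}: ~{scan_seconds(1024, medium):.2f} s")
print(f"100,000 random disk reads: ~{random_disk_reads_seconds(100_000):.0f} s")
```

Under these assumptions a 1 GB scan takes roughly a quarter of a second from memory, about a second from SSD and about twenty seconds from disk, which is the kind of estimate worth making before committing to a design.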

Page 16: Five Ways To Do Data Analytics "The Wrong Way"

Amazing way to reason about bottlenecks

Little’s Law

L = λW

Amdahl, AFIPS 1967

Amdahl's law

DeWitt and Gray, CACM 1992

Parallel computing is hard

Speedup = Old/New
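The three formulas on this slide are easy to turn into a small calculator. In the sketch below, Little's Law is L = λW as given on the slide, speedup is old time over new time, and Amdahl's law is written in its standard form 1 / ((1 - p) + p / n); the function names and the example workload numbers are assumptions for illustration.

```python
def littles_law(arrival_rate, time_in_system):
    """L = lambda * W: average number of items in the system."""
    return arrival_rate * time_in_system


def speedup(old_time, new_time):
    """Speedup = Old / New elapsed time."""
    return old_time / new_time


def amdahl_speedup(parallel_fraction, workers):
    """Upper bound on speedup when only part of the work parallelizes."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


# Hypothetical service: 2,000 requests/s, each spending 50 ms in the system.
in_flight = littles_law(2000, 0.05)
print(f"requests in flight: ~{in_flight:.0f}")

# Hypothetical job: 95% of the work is parallelizable.
for n in (2, 20, 200):
    print(f"{n} workers -> at most {amdahl_speedup(0.95, n):.1f}x faster")
```

With 95% of the work parallelizable, the bound approaches 20x no matter how many workers are added, which is one reason the slide notes that parallel computing is hard.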

 

Page 17: Five Ways To Do Data Analytics "The Wrong Way"

Stubbornly refuse to throw away code and platform architecture.

Fall in love with your architecture

Page 18: Five Ways To Do Data Analytics "The Wrong Way"

Data from 2013 publicly reported numbers and Alexa

[Figure: bubble chart of $/Active User (log scale) versus Revenue/Employee ($M); labeled bubbles include Google and YouTube. Bubble volume based on daily time on the site]

Problem: It’s hard to throw away something that you built, even if it doesn’t fit anymore

Page 19: Five Ways To Do Data Analytics "The Wrong Way"

Summary: doing it right …

Markitecture: Watch for claims that are too broad

Complexity: Simple is beautiful – keep the building blocks of your architectural DNA simple

Architecture: Periodically re-evaluate your technology architecture. Also, people and processes.

Technology and Business: Technology must serve an end business goal

Back-of-the-envelope calculations: Amazingly powerful – think hard before you build!