Transcript
Page 1: Five Ways To Do Data Analytics "The Wrong Way"

Five  Ways  to  Do  Data  Analytics  

“The  Wrong  Way”  

   

Talk given at Pinterest on August 6, 2014

   

Powered  by  the  Wisconsin  Idea:  The  Wisconsin  Idea  is  the  principle  that  the  university  should  

improve  people’s  lives  beyond  the  classroom.  It  spans  UW–Madison’s  teaching,  research,  

outreach  and  public  service.    

   

Jignesh  M.  Patel    

[email protected]  

1  

Page 2: Five Ways To Do Data Analytics "The Wrong Way"

Definition:  A  computing  or  networking  architecture  

suggested  by  the  marketing  department  for  sales  purposes  

rather  than  for  technical  reasons.  Cisco  calls  them  

"reference  designs".    

http://www.urbandictionary.com  

Follow  the  markitecture  

2  

Page 3: Five Ways To Do Data Analytics "The Wrong Way"

http://gridgaintech.wordpress.com  

Technology = In-memory file system

https://spark.apache.org    

Technology = In-memory caching + language bindings

http://hortonworks.com/blog/100x-faster-hive/

The  Stinger  Initiative:  100X  Hive    

Technology  =  caching,  vectorized  query  execution  

http://blog.cloudera.com  

Technology  =  pin  files  in  memory  

3  

Page 4: Five Ways To Do Data Analytics "The Wrong Way"

http://hortonworks.com/blog/stinger-phase-2-the-journey-to-100x-faster-hive/

Problem:  Claims  are  too  broad!  

https://spark.apache.org  

Problem:  Claims  are  too  broad  

Venkataraman et al., EuroSys'13

Presto (not the Facebook system) vs. Spark: big wins in the R framework

4  

Page 5: Five Ways To Do Data Analytics "The Wrong Way"

Never fix a duct-taped solution

Embrace  complexity  

5  

Page 6: Five Ways To Do Data Analytics "The Wrong Way"

Image from: http://thewaysleueslove.blogspot.com

One  has  to  apply  duct  tape  to  fix  problems,  but  consider  

removing  it  later.  

Stonebraker  and  Cetintemel,  ICDE  2005  

Natural  instinct  is  to  build/deploy  a  specialized  system  for  each  application,  

but  that  approach  blows  up  the  operational  complexity  

6  

Page 7: Five Ways To Do Data Analytics "The Wrong Way"

Chasseur  and  Patel,  WebDB’13  

[Diagram labels: Web App, JSON, Mapping Layer]

Rather  than  a  specialized  engine  for  JSON  document  store,  a  

simple  language  translator  to  SQL  has  higher  performance  and  

better  data  integrity.  
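
As a rough illustration of the translator idea (not the mapping layer from the cited paper), the sketch below shreds JSON documents into one relational table and answers a document-style lookup with plain SQL. The table layout `doc_kv(doc_id, path, val)`, the helper names, and the choice of SQLite are all assumptions made for this example.

```python
import json
import sqlite3


def flatten(obj, prefix=""):
    """Yield (dotted_path, json_text) pairs for a nested JSON object."""
    for key, value in obj.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            yield from flatten(value, path)
        else:
            # Keep leaf values as JSON text so strings and numbers stay distinct.
            yield path, json.dumps(value)


def store(conn, doc_id, document):
    """Insert one JSON document as rows of the hypothetical doc_kv table."""
    rows = [(doc_id, path, val) for path, val in flatten(document)]
    conn.executemany("INSERT INTO doc_kv VALUES (?, ?, ?)", rows)


def find_ids(conn, path, value):
    """Translate 'documents where path == value' into a SQL query."""
    cur = conn.execute(
        "SELECT doc_id FROM doc_kv WHERE path = ? AND val = ? ORDER BY doc_id",
        (path, json.dumps(value)),
    )
    return [row[0] for row in cur]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE doc_kv (doc_id INTEGER, path TEXT, val TEXT)")
store(conn, 1, {"user": {"name": "ann", "city": "Madison"}, "pins": 3})
store(conn, 2, {"user": {"name": "bob", "city": "Austin"}, "pins": 3})
matching = find_ids(conn, "user.city", "Madison")
```

Because the documents live in an ordinary table, the database's existing transactions, indexes, and constraints apply without a separate engine.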

Chasseur  and  Patel,  WebDB’13  

Similar  story  for  graphs  and  linear  ML  models  –  can  easily  be  

supported  on  top  of  systems  powered  by  relational  algebra  

The  network  effect!  But  in  a  bad  way!  

Complexity Growth = O(N^2)

[Figure: two node diagrams, with nodes numbered 1 to 3 and 1 to 4]
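
One way to read the quadratic growth claim, assuming that every pair of systems may need its own integration path: N systems have N(N-1)/2 pairs. The function name below is hypothetical, and the pair count is cross-checked by enumeration.

```python
from itertools import combinations


def pairwise_integrations(n_systems):
    """Number of distinct system pairs: N * (N - 1) / 2."""
    return n_systems * (n_systems - 1) // 2


# Cross-check the closed form against explicit enumeration of pairs.
for n in (3, 4, 10):
    assert pairwise_integrations(n) == len(list(combinations(range(n), 2)))

counts = {n: pairwise_integrations(n) for n in (3, 4, 10)}
```

Going from 3 systems to 4 adds one system but three new pairs, which is the sense in which each extra specialized system costs more than the last.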

7  

Page 8: Five Ways To Do Data Analytics "The Wrong Way"

R vs. Python debate

Complexity Growth = O(N^2) also applies to the tools and

programming languages used in house

R      Python  

5K  CRAN  statistically  robust  packages  

Linear  algebra,  clustering,  …  

ETL  

8  

Page 9: Five Ways To Do Data Analytics "The Wrong Way"

Never  realize  that  technology  is  NOT  the  “end,”  but  simply  the  “means  to  a  (business)  end”  

Think  of  technology  as  the  end  

9  

Page 10: Five Ways To Do Data Analytics "The Wrong Way"

Netflix  Challenge  

Example:  Building  a  recommendation  system  

10  

Page 11: Five Ways To Do Data Analytics "The Wrong Way"

Figure  from:  Ricardo:  Integrating  R  and  Hadoop  by  Das  et  al.  SIGMOD’10    

Key approach: Latent-factor Modeling

All  Together  Now:  A  Perspective  on  the  Netflix  Prize,  by  Bell,  Koren  and  Volinsky  

Winning  insights  

•  Missing ratings are not missing at random!

•  Parameters (popularity, users' standards for rating, user tastes, …) vary over time

•  Combining  sets  of  predictors  

•  Efficient  computation  critical  
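
To make "latent-factor modeling" concrete, here is a toy matrix-factorization sketch trained with stochastic gradient descent on observed ratings only. It is not the prize-winning model: the ratings, hyperparameters, and function names are invented for illustration, and it ignores the time effects and predictor blending listed above.

```python
import random


def predict(p_u, q_i):
    """Predicted rating = dot product of user and item factor vectors."""
    return sum(a * b for a, b in zip(p_u, q_i))


def rmse(ratings, P, Q):
    """Root mean squared error over the observed (user, item, rating) triples."""
    sq = [(r - predict(P[u], Q[i])) ** 2 for u, i, r in ratings]
    return (sum(sq) / len(sq)) ** 0.5


def train(ratings, n_users, n_items, k=2, lr=0.05, reg=0.02, epochs=200, seed=0):
    """Fit k latent factors per user and per item with regularized SGD."""
    rng = random.Random(seed)
    P = [[rng.uniform(0.1, 0.5) for _ in range(k)] for _ in range(n_users)]
    Q = [[rng.uniform(0.1, 0.5) for _ in range(k)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - predict(P[u], Q[i])
            for f in range(k):
                p_uf, q_if = P[u][f], Q[i][f]
                P[u][f] += lr * (err * q_if - reg * p_uf)
                Q[i][f] += lr * (err * p_uf - reg * q_if)
    return P, Q


# Made-up (user, item, rating) triples; most user/item pairs are unobserved.
ratings = [(0, 0, 5), (0, 1, 4), (1, 0, 4), (1, 2, 1),
           (2, 1, 5), (2, 2, 2), (3, 0, 1), (3, 2, 5)]
P0, Q0 = train(ratings, 4, 3, epochs=0)   # untrained baseline
P, Q = train(ratings, 4, 3)
before, after = rmse(ratings, P0, Q0), rmse(ratings, P, Q)
```

The learned vectors can then score pairs that were never rated, for example `predict(P[0], Q[2])`.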

11  

Page 12: Five Ways To Do Data Analytics "The Wrong Way"

Pandora’s  Music  Recommender  by  Michael  Howe  

Pandora:  Music  Genome  

•  Content-filtering

•  Classification to pick the recommendation

•  Key is to “build up a neighborhood for a particular user’s preference”

Pandora.com  


12  

Page 13: Five Ways To Do Data Analytics "The Wrong Way"

Build  before  you  analyze  the  technology  trend  

 

Never use back-of-the-envelope calculations

13  

Page 14: Five Ways To Do Data Analytics "The Wrong Way"

Motivation  for  the  UW  Quickstep  project  http://quickstep.cs.wisc.edu      

Hardware changes are far more non-linear than in the past

[Figure: memory and storage hierarchy; axes labeled Latency (cycles), Capacity, and Cost]

CPU caches: ~1 to 10s of cycles

DRAM: ~100 cycles

NVRAM (e.g. SSDs): ~10^5 to 10^6 cycles

Magnetic hard disk drives: ~10^7 to 10^8 cycles

Energy Efficiency for Large-Scale MapReduce Workloads with Significant Interactive Analysis, Chen et al., EuroSys'12

Most  interactive  jobs  work  on  “small”  data  sets    

14  

Page 15: Five Ways To Do Data Analytics "The Wrong Way"

15  

Patterson,  CACM  2004  

Latency lags bandwidth

J. Dean, Latency numbers every programmer should know, 2012

[Bar chart; axis ticks: 0, 10, 1,000, 100,000, 10,000,000, 1,000,000,000]

L1  cache  reference  

Branch  mispredict  

L2  cache  reference  

Mutex  lock/unlock  

Main  memory  reference  

Compress  1K  bytes  with  Zippy  

Send  1K  bytes  over  1  Gbps  network  

Read  4K  randomly  from  SSD*  

Read  1  MB  sequentially  from  memory  

Round  trip  within  same  datacenter  

Read  1  MB  sequentially  from  SSD*  

Disk  seek  

Read  1  MB  sequentially  from  disk  

Send packet CA->Netherlands->CA

Time  in  ns    (log  scale)  
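
The chart values did not survive extraction, so the sketch below uses the commonly circulated approximate 2012 figures for a few of these operations. Treat them as assumed order-of-magnitude inputs rather than numbers read off the slide. It shows the kind of back-of-the-envelope estimate the list enables: scanning 1 GB sequentially, and doing a million random accesses, on different media.

```python
# Approximate latencies in nanoseconds (assumed, order-of-magnitude only).
LATENCY_NS = {
    "main memory reference": 100,
    "read 1 MB sequentially from memory": 250_000,
    "read 1 MB sequentially from SSD": 1_000_000,
    "disk seek": 10_000_000,
    "read 1 MB sequentially from disk": 20_000_000,
}


def seconds(nanoseconds):
    return nanoseconds / 1e9


def sequential_scan_seconds(megabytes, medium):
    """Estimated time to read `megabytes` sequentially from a medium."""
    return seconds(megabytes * LATENCY_NS[f"read 1 MB sequentially from {medium}"])


def random_access_seconds(n_accesses, operation):
    """Estimated time for n independent, serialized accesses."""
    return seconds(n_accesses * LATENCY_NS[operation])


scan_1gb = {m: sequential_scan_seconds(1024, m) for m in ("memory", "SSD", "disk")}
million_lookups = {
    op: random_access_seconds(1_000_000, op)
    for op in ("main memory reference", "disk seek")
}
```

Under these assumptions a 1 GB scan takes roughly a quarter of a second from memory and about 20 seconds from disk, while a million serialized disk seeks take hours compared with a fraction of a second in memory.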

Page 16: Five Ways To Do Data Analytics "The Wrong Way"

Amazing  way  to  reason  about  bottlenecks  

Little’s  Law  

L  =  λW  
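
A small worked example of L = λW, where L is the average number of items in the system, λ the average arrival rate, and W the average time each item spends in the system. The request rate, latency, and worker count below are made-up inputs, and the function names are hypothetical.

```python
def items_in_system(arrival_rate_per_s, time_in_system_s):
    """Little's Law: L = lambda * W."""
    return arrival_rate_per_s * time_in_system_s


def max_throughput_per_s(concurrency, time_in_system_s):
    """Rearranged form: lambda = L / W, the rate a fixed concurrency can sustain."""
    return concurrency / time_in_system_s


# 2,000 requests/s that each take 50 ms imply about 100 requests in flight.
in_flight = items_in_system(2000, 0.050)

# 8 workers, each busy 10 ms per query, can sustain about 800 queries/s.
capacity = max_throughput_per_s(8, 0.010)
```

If the measured arrival rate exceeds `capacity`, queues grow and W rises, which is how the law helps locate a bottleneck before building anything.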

16  

Amdahl,  AFIPS  1967  

Amdahl's  law  

DeWitt  and  Gray,  CACM  1992    

Parallel  computing  is  hard  

Speedup = Old/New
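
A quick sketch of Amdahl's law in its usual form: if a fraction p of the work can be parallelized across n workers, speedup = 1 / ((1 - p) + p / n). The 95% parallel fraction and the worker counts are assumed values chosen for illustration.

```python
def amdahl_speedup(parallel_fraction, n_workers):
    """Speedup (old time / new time) when only part of the work parallelizes."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_workers)


def speedup_limit(parallel_fraction):
    """Upper bound on speedup as the number of workers grows without limit."""
    return 1.0 / (1.0 - parallel_fraction)


speedups = {n: amdahl_speedup(0.95, n) for n in (1, 16, 1024)}
limit = speedup_limit(0.95)
```

With 95% of the work parallel, 16 workers give a speedup of only about 9, and no number of workers can exceed 20, which is one reason parallel computing is hard.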

Page 17: Five Ways To Do Data Analytics "The Wrong Way"

Stubbornly  refuse  to  throw  away  code  and  platform  architecture.  

Fall  in  love  with  your  architecture  

17  

Page 18: Five Ways To Do Data Analytics "The Wrong Way"

Data  from  2013  publicly  reported  numbers  and  Alexa  

[Bubble chart: $/Active User (log scale, ticks 1 to 64) against Revenue/Employee ($M, ticks 0 to 3); labels include Google and YouTube; values shown on the chart: 19, 29, 18, 7, 9]

Problem:  It’s  hard  to  throw  away  something  that  you  built,  even  if  it  

doesn’t  fit  anymore  

18  

Bubble  volume  based  on  daily  time  on  the  site    

Page 19: Five Ways To Do Data Analytics "The Wrong Way"

19  

Watch  for  claims  that  are  too  broad  

Markitecture  

Simple  is  beautiful  –  keep  the  building  blocks  of  your  architectural  DNA  simple  

Complexity  

Periodically re-evaluate your technology architecture. Also, people and processes.

Architecture    

Technology  must  serve  an  end  business  goal  

Technology  and  Business  

Amazingly  powerful  –  think  hard  before  you  build!  

Back-of-the-envelope calculations

doing  it  right  …  

Summary

