
Analyzing Your Data In the Cloud

Amazon-Zaponet-ISA Workshop 23.12.2013

Jonathan Rosenblatt- WIS

Sunday, December 22, 13


Fixing Ideas

My machine


RStudio Server (EC2+S3)


Scale (EC2 +EBS)


Jonathan Rosenblatt <[email protected]>

Multicore in R

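library(parallel)
# Start a local cluster; the worker count comes from option "cl.cores" (default 2)
cl <- makeCluster(getOption("cl.cores", 2))
# Source the project's helper file on every worker
clusterEvalQ(cl = cl, source('R/utility.R'))
# Apply wrapRun to each row of configurations, in parallel
fcrs <- parApply(cl = cl, configurations, 1, wrapRun)
# Shut the workers down
stopCluster(cl)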
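The slide's snippet depends on project files that are not shown. A minimal self-contained sketch of the same pattern follows; the `configurations` matrix and the `wrapRun` function here are hypothetical stand-ins, not the objects used in the talk.

```r
library(parallel)

# Hypothetical stand-in: each row is one setting (sample size n, noise level sd).
configurations <- as.matrix(expand.grid(n = c(10, 100), sd = c(1, 2)))

# Hypothetical stand-in for the talk's wrapRun: receives one row as a named
# numeric vector and returns the standard error of the mean for that setting.
wrapRun <- function(config) {
  config[["sd"]] / sqrt(config[["n"]])
}

# Two local workers; on a larger EC2 instance this number could be raised.
cl <- makeCluster(2)
fcrs <- parApply(cl = cl, configurations, 1, wrapRun)
stopCluster(cl)

# The serial equivalent, for comparison.
serial <- apply(configurations, 1, wrapRun)
print(cbind(configurations, parallel = fcrs, serial = serial))
```

For a toy function like this the cluster overhead outweighs any gain; the pattern pays off when each call to the worker function is expensive.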


Interactive Reports


Why go cloud?

• Portability of analysis (iPad?)

• Non-portability of data (“Big Data”)

• Local machine load (RAM)

• Server room fault tolerance (UPS)

• Collaboration (Shiny)

• POWER and SCALABILITY on demand


Ingredients

• Remote machine/cluster

• Remote programming environment and IDE

• File transfers


Workflow

• Get the analyst to the data

• Analyze (single machine)

• Scale up (CPUs, RAM, HD, cluster,...)

• Recover output


Get to the data


Analyze


Scale


Analyze


Get Output


Which Cloud?

• AWS (and ~13 other competitors)

• Your organization’s own cloud

• Specialized providers:

• https://www.revocloudr.com/

• http://www.cac.cornell.edu/redcloud/

• ...


Software & Remote IDE

Dedicated Batch mode

--

SAS, SPSS, ... --


RStudio Server


RStudio Server

• Remote: ACE (JavaScript IDE), nohup, multiple users, ...

• Powerful editor.

• Integrated: version control, package building, visual debugger, ...

• User friendly.


File Transfers

• RStudio Server Web Interface

• Mount/Map

• Dropbox (don’t use with version control)

• SFTP (FileZilla), wget, curl

• S3/Glacier CLIs


Scaling is an art

• Different solutions for different problems:

• Storage

• Working memory (RAM)

• Time


Bottleneck Diagnosis- Storage

• You will know...


Bottleneck Diagnosis- RAM

• Operating system level: Task manager / top

• R level:

• Wickham’s Advanced R Programming

• object.size() / gc() / tracemem() / lineprof package

• RAM > 3 * max(object.size)
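A minimal sketch of the R-level checks, using only base R. The helper `fits_in_ram` and the 4 GB figure are illustrative assumptions that encode the slide's rule of thumb; they are not part of any package.

```r
# A numeric vector of one million doubles (8 bytes each, plus a small header).
x <- numeric(1e6)
size_bytes <- as.numeric(object.size(x))
print(object.size(x), units = "MB")

# gc() runs a garbage collection and returns a matrix of memory currently in use.
mem <- gc()
print(mem)

# Hypothetical helper for the rule of thumb: RAM > 3 * size of the largest object.
fits_in_ram <- function(largest_object_bytes, ram_bytes) {
  ram_bytes > 3 * largest_object_bytes
}

# Assuming a machine with 4 GB of RAM.
fits_in_ram(size_bytes, ram_bytes = 4 * 1024^3)
```

The factor of 3 leaves room for the copies R tends to make when an object is modified or passed through several functions.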


Bottleneck Diagnosis- Time

• Operating system level: Task manager / top

• R level:

• Rprof()

• Complexity analysis.
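A minimal sketch of profiling with base R's sampling profiler. The function `slow_colmeans` and the matrix size are made up for illustration; the workload is repeated so that the profiler collects enough samples.

```r
set.seed(1)
m <- matrix(rnorm(2e6), ncol = 200)

# Deliberately naive column means: an explicit loop instead of colMeans().
slow_colmeans <- function(m) {
  out <- numeric(ncol(m))
  for (j in seq_len(ncol(m))) {
    out[j] <- mean(m[, j])
  }
  out
}

prof_file <- tempfile(fileext = ".out")
Rprof(prof_file)                      # start sampling the call stack
for (i in 1:200) res <- slow_colmeans(m)
Rprof(NULL)                           # stop profiling

# Summarize where the time went, by function.
prof <- summaryRprof(prof_file)
print(head(prof$by.self))
```

Once the profile shows which function dominates, complexity analysis answers whether more cores will help or whether the algorithm itself has to change.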


Storage

• “3 Levels of Data” -- Lukas Biewald

• Under 20,000 Rows: “Data Sets That Can Be Opened In Excel”

• Under 2,000,000 Rows: “Data Sets That Fit Into RAM on a Single Machine”

• Above 2,000,000 Rows: “A World of Pain”


Storage

• “Don’t use Hadoop - your data isn’t that big” -- Chris Stucchio

• Hundreds of MB: R, MATLAB, Python

• 10 GB: Buy more RAM.

• 100 GB / 500 GB / 1 TB: Buy more HD + PostgreSQL

• 5TB: “your life now sucks”.


MLI: An API for Distributed Machine Learning

Evan R. Sparks (a), Ameet Talwalkar (a), Virginia Smith (a), Jey Kottalam (a), Xinghao Pan (a), Joseph Gonzalez (a), Michael J. Franklin (a), Michael I. Jordan (a), Tim Kraska (b)

(a) University of California, Berkeley; (b) Brown University

{sparks, ameet, vsmith, jey, xinghao, jegonzal, franklin, jordan}@cs.berkeley.edu, [email protected]

Abstract: MLI is an Application Programming Interface designed to address the challenges of building Machine Learning algorithms in a distributed setting based on data-centric computing. Its primary goal is to simplify the development of high-performance, scalable, distributed algorithms. Our initial results show that, relative to existing systems, this interface can be used to build distributed implementations of a wide variety of common Machine Learning algorithms with minimal complexity and highly competitive performance and scalability.

I. INTRODUCTION

The recent success stories of machine learning (ML) driven applications have created an increasing demand for scalable ML solutions. Nonetheless, ML researchers often prefer to code their solutions in statistical computing languages such as MATLAB or R, as these languages allow them to code in fewer lines using syntax that resembles high-level pseudocode. MATLAB and R allow researchers to avoid low-level implementation details, leading to quickly developed prototypes that are often sufficient for small scale exploration. However, these prototypes are typically ad-hoc, non-robust, and non-scalable implementations. In contrast, industrial implementations of these solutions often require a relatively heavy amount of development effort and are difficult to change once implemented.

This disconnect between these ad-hoc scripts and the growing need for scalable ML, in particular systems that leverage the increasingly pervasive cloud computing architecture, has spurred the development of several distributed systems for ML. Initial attempts at such systems exposed a restricted set of low-level primitives for development, e.g., MapReduce [1] or graph-based [2, 3] interfaces. The resulting systems are indeed significantly faster and more scalable than MATLAB or R scripts. They also tend to be much less accessible to ML researchers, as ML algorithms do not always naturally fit into the exposed low-level primitives, and moreover, efficient use of these primitives requires a fairly advanced knowledge of the underlying distributed system.

Subsequent attempts at distributed systems have exposed high-level interfaces that compile down to low-level primitives. These systems abstract away much of the communication and parallelization complexity inherent in distributed ML implementations. Although these systems can in theory obtain excellent performance, they are quite difficult to implement in practice, as they either heavily rely on optimizers to effectively transform high-level code into efficient distributed implementations [4, 5], or utilize pattern matching techniques to identify regions that can be replaced by low-level implementations [6]. The need for fast ML algorithms has also led to the development of highly specialized systems for ML using a restricted set of algorithms [7, 8], with varying degrees of scalability.

Given the accessibility issues of low-level systems and the implementation issues of the high-level systems, ML researchers have yet to widely adopt any of the existing systems. Indeed, ML researchers, both in academic and industrial environments, often rely on system programmers to translate the prototypes of their novel, and often subtle, algorithmic insights into scalable and robust implementations. Unfortunately, there is often a ‘loss in translation’ during this process; small misinterpretation and/or minor errors are unavoidable and can significantly impact the quality of the algorithm. Furthermore, due to the randomized nature of many ML algorithms, it is not always straightforward to construct appropriate test-cases and discover these bugs.

[Figure 1 placeholder. Axes: Ease of use, Performance. Plotted platforms: Matlab, R; GraphLab, VW; MLI; Mahout; MLPACK; OptiML]

Fig. 1: Landscape of existing development platforms for ML.

In this paper, we present a novel API for ML, called MLI [footnote 1], to bridge this gap between prototypes and industry-grade ML software. We provide abstractions that simplify ML development in comparison to pure MapReduce and graph-based primitives, while nonetheless allowing developers control of the communication and parallelization patterns of their algorithms, thus obviating the need for a complex optimizer. With MLI, we aim to be in the top right corner of Figure 1, by providing a development environment that is nearly on par with the usability of MATLAB or R, while matching the scalability of and approaching the walltime of low-level distributed systems. We make the following contributions in this work:

MLI: We show how MLI-supported high-level ML abstractions naturally target common ML problems related to data loading, feature extraction, model training and testing.

Usability: We demonstrate that implementing ML algorithms written against MLI yields concise, readable code, comparable

Footnote 1: MLI is a component of MLBASE [9, 10], a system that aims to provide user-friendly distributed ML functionality for ML experts and end users.

arXiv:1310.5426v2 [cs.LG] 25 Oct 2013

Distributed ML abstractions

Sparks, Evan R., Ameet Talwalkar, Virginia Smith, Jey Kottalam, Xinghao Pan, Joseph Gonzalez, Michael J. Franklin, Michael I. Jordan, and Tim Kraska. MLI: An API for Distributed Machine Learning.

RAM

• Get more RAM! (EC2?)

• Swap files buy RAM at the cost of time.

• Specialized Software: SAS, SPSS, Revolutions, TIBCO, ...

• Tailored algorithms: Parallel, Distributed, GPUs, ...
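One simple instance of a tailored algorithm is processing a file in chunks, so that only a slice of the data is in RAM at any time. The sketch below uses base R only; the function `chunked_mean`, the column name `x` and the tiny temporary CSV are all illustrative assumptions standing in for a file too large to load.

```r
# Illustrative stand-in for a large file: a CSV with one numeric column.
csv_path <- tempfile(fileext = ".csv")
write.csv(data.frame(x = 1:1000), csv_path, row.names = FALSE)

# Hypothetical helper: mean of one column, reading chunk_rows lines at a time.
chunked_mean <- function(path, column, chunk_rows = 100) {
  con <- file(path, open = "r")
  on.exit(close(con))
  # First line holds the column names (write.csv quotes them).
  header <- gsub('"', "", strsplit(readLines(con, n = 1), ",")[[1]])
  total <- 0
  n <- 0
  repeat {
    lines <- readLines(con, n = chunk_rows)
    if (length(lines) == 0) break
    chunk <- read.csv(text = lines, header = FALSE, col.names = header)
    total <- total + sum(chunk[[column]])
    n <- n + nrow(chunk)
  }
  total / n
}

chunked_mean(csv_path, "x")
```

Only running sums are kept between chunks, so memory use depends on `chunk_rows` and not on the file size. Statistics that cannot be updated this way (medians, for example) need a different approach.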


Time

• Multicore (EC2?)

• Cluster (EC2? EMR?)

• Specialized Software: SAS, SPSS, Revolutions, TIBCO, ...

• Tailoring algorithms: Parallel, Distributed, GPUs, ...
