dataiku r users group v2

Building your own Data Science platform in the cloud GUR FlautR – Paris, November 14th 2012

Upload: cornec

Post on 27-Jan-2015

Page 1: Dataiku   r users group v2

Building your own Data Science platform in the cloud

GUR FlautR – Paris, November 14th 2012

Page 2: Dataiku   r users group v2

10/04/2023 Build Your Data Science Platform in the Cloud 2

Who Am I

• Co-founder and Data Scientist at Dataiku

• Long-time data hacker
  – Telco (Orange)
  – Retail (Catalina Marketing, all major French retailers)
  – High Tech (Apple)
  – Social Gaming (Is Cool Entertainment)
  – Data Provider (qunb)

• I love data and blending innovative technologies and methods to get the most out of a dataset.

Page 3: Dataiku   r users group v2

Agenda

• Introducing Dataiku

• Motivations & building blocks

• Setting up the Data Science stack

• Annexes (with step-by-step tutorial)

Page 4: Dataiku   r users group v2

Your data lab accelerator

Page 5: Dataiku   r users group v2

New Product ?

Product Designer

Business & Marketing

Engineers

User Voice

Product Innovation opposes conflicting views

Introducing Dataiku

Today, innovation requires bringing together different expertise and different views…

User Experience? Features? Roadmap?

Acquisition? Pricing? Loyalty?

Planning? Performance? Reliability?

Satisfaction? Perception? Engagement?

Page 6: Dataiku   r users group v2

Data !

Product Designer

Business & Marketing

Engineers

User Voice

Data Innovation: fill the gap!

Introducing Dataiku

Targeted campaigns, price optimization

A common ground to federate your product teams towards a common goal

Personalized experience

Quality assurance, workload and yield management

User feedback (A/B test), continuous improvement

Page 7: Dataiku   r users group v2

• You can’t « design » insights, you explore and discover them…

• Iterate quickly with constant feedback

• Try a lot, don’t be afraid to fail!

An exploratory and iterative approach…

Introducing Dataiku

Function

Form

Experience

Emotion

Surprise

Culture

Explore and Refine Experiment

Generate Ideas

Select & Develop

Enhance or Discard

Gather Feedback

Page 8: Dataiku   r users group v2

…which is key to your future business models

• Personalized Subscription Models

Digital Publishing

• Detailed Risk Analytics Models

Insurance

• Personalized Treatment

Healthcare

• Optimized Traffic Network

Transportation

• Bio-surveillance with sensor networks

Environment

• … to imagine !

Your Business

Introducing Dataiku

?

Page 9: Dataiku   r users group v2

• data lab (n.): a small group with all the required expertise, including business-minded people, machine learning knowledge, and the right technology

• A proven organization used by successful data-driven companies over the past few years (eBay, LinkedIn, Walmart…)

The « data lab »

Introducing Dataiku

Page 10: Dataiku   r users group v2

How does it work?

Real Lab
• Tools: to perform experiments
• Protocols: how to apply experiments
• People: scientists

Data Lab
• Software and servers: store, process, analyze
• Intelligence: models, algorithms
• People: data scientists

Introducing Dataiku

Page 11: Dataiku   r users group v2

But it’s not so easy…

Introducing Dataiku

Data Lab

Technologies
• Many recent open source technologies to choose from
• Complex integration and usage

People
• Very rare skills
• Hard to recruit or train

Governance
• Lack of integrated teams
• New mindset to adopt

Page 12: Dataiku   r users group v2

“Dataiku helps you find your path to data-driven innovation by building (or accelerating) your own lab.”

Our mission

Introducing Dataiku

Page 13: Dataiku   r users group v2

Dataiku: Your data lab accelerator

Introducing Dataiku

Dataiku Platform
• Ready-to-use platform to store, process and analyze your data
• Open source technologies
• Machine learning + statistics + distributed computing
• Scale from 10GB to 1PTB

Dataiku Innovation
• Dedicated programs to kick-start data science practice in your company
• Assess your data potential
• Bootstrap your data science practices
• Build a fully integrated data science team in your organization

Dataiku Community
• A community of data science experts that helps you grow your organization toward data science
• Unique data scientist training program
• Network of experts that can be activated “as a service”

Page 14: Dataiku   r users group v2

MOTIVATIONS & BUILDING BLOCKS

A Data Science Platform

Page 15: Dataiku   r users group v2

Motivations

• I often face situations where I need a lot of flexibility and computing resources to address my day-to-day work, while being on a budget.

• There are a lot of (new, and often open source) technologies out there to deal with data, but poor documentation sometimes makes them hard to use.

• To address this issue, I am going to detail the setup of a data science platform built with some of these technologies.
  – There are a lot of other options of course, but this one has proved to work very well.

Page 16: Dataiku   r users group v2

A new framework to process data

• Cloud computing offers a new paradigm for computation power and flexibility
  – Ideal when a lot of processing power is required temporarily (think: a lot of RAM for R…)
  – Also when building a prototype, or when you don’t have internal resources available

• Open source brings in best-of-breed technologies and analytical capabilities

• Together, they make it possible to experiment with data in a whole new way.

Page 17: Dataiku   r users group v2

The building blocks

Infrastructure

Fast data storage and querying system

Cutting-edge analytics engine

• It is flexible and cost-effective
• It lets you experiment and iterate fast
• It can be extended easily with other components, such as Hadoop (via EMR or CDH)

Page 18: Dataiku   r users group v2

Infrastructure

• Amazon Web Services is one of the leading cloud computing providers.

• It is IaaS (infrastructure as a service), which means it offers all the required components, but you’ll need to configure and assemble them yourself.

• The components we are interested in today:
  – EC2 (Elastic Compute Cloud): servers
  – EBS (Elastic Block Store): data persistence
  – S3 (Simple Storage Service): file storage

• Be warned: this type of service is good for experimenting and for temporary resource needs. The cost can grow quickly if you use it on a regular basis.

• See current price lists in the addendum.

Page 19: Dataiku   r users group v2

Data Storage and Querying

• Vertica is a very fast, column-oriented database, specialized in analytical workloads (large scans / joins / aggregations).

• It offers fast data loading, is SQL-99 compliant (“analytical” queries), and can be extended using User-Defined Functions, including in R.

• Vertica is not an open source technology, but it provides a free Community Edition
  – The paid version is massively parallel (scale-out architecture), among other things
  – The Community Edition can use up to 3 nodes

• There are a few other options in this space, open source or not:
  – InfiniDB / Infobright (MySQL-based, less practical for “analytical” work)
  – Greenplum, Aster Data
  – Netezza, Teradata, Oracle Exadata…
  – “Big Data” alternatives: Cloudera’s Impala (relying on Hive), the incubating Apache Drill (open source version of Google’s Dremel, accessible today via Google BigQuery)

Page 20: Dataiku   r users group v2

Analytical Engine

• Well, I guess you all know it…

• We’ll be using RStudio here, in its Server version
  – Access the IDE in a web browser
  – Has a lot of nice features, like Git integration, the “Shiny” project…

Page 21: Dataiku   r users group v2

SETTING UP THE DATA SCIENCE STACK

Page 22: Dataiku   r users group v2

Preamble

• This is not as easy as it sounds

• It is a bit technical, and the process that follows could probably be optimized.

• The very detailed step-by-step tutorial can be found in the addendum part of this deck, or at http://dataiku.com/blog/setting-up-a-cool-data-science-platform-for-cheap/

Page 23: Dataiku   r users group v2

Requirements

• Create an Amazon Web Services account at http://aws.amazon.com/fr/
  – Payment info is required if your organization does not have an account yet, but it’s worth it

• Register for the Vertica Community Edition at http://my.vertica.com/
  – Free, but it might take a few days before your registration is approved

• Make sure you have a terminal client available (like iTerm on Mac OS X or PuTTY on Windows)

Page 24: Dataiku   r users group v2

Schematic Steps

Launch an EC2 instance
• The “server” itself

Attach an EBS disk
• Additional and persistent storage for the server

Install Vertica Community Edition

Install and Configure RStudio

Configure ODBC connectivity to Vertica CE

H.A.V.E F.U.N

Page 25: Dataiku   r users group v2

Creating the EC2 instance

Connect to the EC2 management console

Create a key pair if not done already
• Store it in a “safe” location on your PC

Select “Launch Instance”

Select a RHEL 6 “AMI”
• The OS must be compatible with both RStudio and Vertica (I used AMI ami-41d00528)

Choose your instance type and region
• I used an “m3.xlarge” to start, but it can be resized later!

Give a name to your instance
• If you have several instances, it will be easier to find later

Select your key pair
• It will be used to connect (“ssh”) to the server later

Specify your security group
• Only TCP port 22 needs to be opened (for ssh)

Launch and wait
• Can take a few minutes

Page 26: Dataiku   r users group v2

Attach an EBS disk

Click on the “Create Volume” tab

Specify a size and region
• Same region as your instance
• Size can be up to 1 TB

Under “More…”, attach the EBS to your instance

Connect to the remote server
• ssh -i /path/to/your/keypair root@instance-public-dns

Format your EBS
• fdisk -l to list your devices
• mkfs -t ext3 /dev/your-ebs

Create a “mount point”
• mkdir -p /data

Mount the EBS on this directory
• mount /dev/your-ebs /data

Test if everything is working
• df -kh for example

Page 27: Dataiku   r users group v2

Install RStudio

Update your Yum package manager with EPEL
• To be able to yum install R

Install R
• R base is required to make RStudio work

Download RStudio Server

Install RStudio Server

Create a dedicated user

Exit and log back in using ssh port forwarding
• You’ll work transparently from your PC

Point your browser to localhost:8787

You run RStudio in the Cloud
• That’s great!

Page 28: Dataiku   r users group v2

Install Vertica

Upload or download the Vertica installer
• The installer you got from my.vertica.com

Prepare the data directory on the EBS
• Where Vertica is going to store its data

Run the installer
• Don’t forget to point the data directory to the EBS!

Log in as dbadmin and run the adminTools tool
• The Vertica main account and management tool

Create a new database

Exit adminTools

Test your new DB using the “vsql” client
• Talk to Vertica as you would with Postgres

Page 29: Dataiku   r users group v2

Configure ODBC connectivity to Vertica

Install RODBC package

Create the odbc.ini file

Create the vertica.ini file

Export VERTICAINI

Check your connectivity

• Via yum install
• ODBC driver configuration file
• The system variable
• In RStudio

Page 30: Dataiku   r users group v2

And now you can play!

Collect some weather data

Create a Vertica table

Load into Vertica

Put data into RStudio

Analyze!
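The "create a table, then load" steps could look roughly like the sketch below. Everything specific in it is an assumption made for illustration: the table name `weather`, its three columns, the CSV path, and the `dbadmin` user are not taken from the slides. The SQL is written to a scratch file and the `vsql` call is only printed, so nothing touches a real database.

```shell
# Hypothetical names throughout: table, columns, CSV path and database user.
SQL_FILE="${SQL_FILE:-$(mktemp)}"

# Write the SQL statements to a file that vsql can execute with -f.
cat > "$SQL_FILE" <<'EOF'
CREATE TABLE weather (
    station_id  VARCHAR(16),
    obs_date    DATE,
    temp_c      FLOAT
);
COPY weather FROM LOCAL '/data/weather.csv' DELIMITER ',';
SELECT COUNT(*) FROM weather;
EOF

# Printed rather than executed: run it on the server once the CSV is in place.
echo "vsql -U dbadmin -f $SQL_FILE"
```

Once the table is loaded, it can be pulled into RStudio through the ODBC connection configured earlier, for example with RODBC's `sqlQuery`.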

Page 31: Dataiku   r users group v2

Thank You

Thomas Cabrol

[email protected]
+33 (0)7 86 42 62 81
@ThomasCabrol
http://dataiku.com

Page 32: Dataiku   r users group v2

ANNEXES

Page 33: Dataiku   r users group v2

Amazon EC2 price list

Page 34: Dataiku   r users group v2

STEP-BY-STEP INSTALLATION

http://dataiku.com/setting-up-a-cool-data-science-platform-for-cheap/

Page 35: Dataiku   r users group v2

Connect to EC2 Management console

Page 36: Dataiku   r users group v2

Under “Key Pairs”, create a new key pair

Note: once created, you can reuse it at will

Page 37: Dataiku   r users group v2

Move your key pair to a safe location

Note: this is shown for Mac OS X.

Set Read/Write permissions only on the key
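On Mac OS X or Linux, one common reading of this step is to make the key readable and writable by its owner only, which ssh generally expects for private keys. A minimal sketch, assuming a hypothetical key file name (`my-keypair.pem`); a temporary directory stands in for your real key location so the snippet is safe to run as-is:

```shell
# Hypothetical names: adjust KEY_DIR and KEY_FILE to your own setup.
KEY_DIR="${KEY_DIR:-$(mktemp -d)}"      # stand-in for e.g. ~/.ssh
KEY_FILE="$KEY_DIR/my-keypair.pem"

# Stand-in for the key downloaded from the EC2 console.
[ -f "$KEY_FILE" ] || touch "$KEY_FILE"

# Owner may read and write; group and others get no access.
chmod 600 "$KEY_FILE"

# Show the resulting permissions.
ls -l "$KEY_FILE"
```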

Page 38: Dataiku   r users group v2

Click on “Launch Instance”

Page 39: Dataiku   r users group v2

Select the “Classic Wizard”

Page 40: Dataiku   r users group v2

Select your AMI

Page 41: Dataiku   r users group v2

Select your instance type

Page 42: Dataiku   r users group v2

Leave default settings

Page 43: Dataiku   r users group v2

Go through the Device Configuration window

Page 44: Dataiku   r users group v2

Assign a name to your instance

Page 45: Dataiku   r users group v2

Select your key pair

Page 46: Dataiku   r users group v2

Choose your default Security Group

Just make sure TCP port #22 is open for ssh access

Page 47: Dataiku   r users group v2

Launch the instance

Page 48: Dataiku   r users group v2

Wait for the instance to start

Page 49: Dataiku   r users group v2

When Running, click on “Volumes”

Page 50: Dataiku   r users group v2

Click on the “Create Volume” tab

Page 51: Dataiku   r users group v2

Select size and region of your EBS

EBS up to 1 TB

Same region as your instance

Page 52: Dataiku   r users group v2

Give your EBS a name

Page 53: Dataiku   r users group v2

Under “More…”, select “Attach”

Page 54: Dataiku   r users group v2

Attachment settings

Page 55: Dataiku   r users group v2

Write down your public DNS

It will be used to connect to the machine. It is reassigned each time the instance is stopped and started.

Page 56: Dataiku   r users group v2

Log in to the machine

Start your favorite Terminal application. Windows users can use PuTTY.

• ssh: secured connection to a remote host
• -i: option used to specify your key location
• root: the base account used
• @public-dns: this is why you need to remember your machine’s public DNS
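Put together, the pieces described above form a single command. A minimal sketch, assuming a hypothetical key path and a placeholder public DNS name (use the one you wrote down from the console); the command is printed rather than run so you can check it first:

```shell
# Hypothetical values: replace with your key path and your instance's public DNS.
KEY_FILE="${KEY_FILE:-$HOME/.ssh/my-keypair.pem}"
PUBLIC_DNS="${PUBLIC_DNS:-ec2-xx-xx-xx-xx.compute.amazonaws.com}"

# -i selects the private key; root@host names the account and the machine.
SSH_CMD="ssh -i $KEY_FILE root@$PUBLIC_DNS"

# Printed rather than executed, so you can review it before connecting.
echo "$SSH_CMD"
```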

Page 57: Dataiku   r users group v2

Find your EBS

The “fdisk” utility on RHEL, with the -l option, can be used to locate the physical device where your EBS is attached. You’ll find one device whose size approximately matches that of your EBS.

Page 58: Dataiku   r users group v2

Format your EBS (FIRST RUN ONLY!)

You need to format your EBS with the mkfs utility the first time you use it, and only then.

Page 59: Dataiku   r users group v2

Mount your EBS

This creates a “/data” directory first, then actually mounts the EBS to this point.

Page 60: Dataiku   r users group v2

Check that everything is okay
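The format, mount, and check steps can be sketched as one small script. The device name `/dev/xvdf` is an assumption for illustration; use whatever `fdisk -l` reports on your instance. The script is a dry run by default and only prints the commands, because `mkfs` erases the volume.

```shell
# Hypothetical device name: use the one reported by `fdisk -l` on your instance.
EBS_DEVICE="${EBS_DEVICE:-/dev/xvdf}"
MOUNT_POINT="${MOUNT_POINT:-/data}"

# Dry run by default: commands are printed, not executed.
# Set DRY_RUN=0 (as root, on the instance) to really run them.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

prepare_ebs() {
  run mkfs -t ext3 "$EBS_DEVICE"          # first use only: this erases the volume
  run mkdir -p "$MOUNT_POINT"             # create the mount point
  run mount "$EBS_DEVICE" "$MOUNT_POINT"  # attach the file system to it
  run df -kh "$MOUNT_POINT"               # check size and free space
}

prepare_ebs
```

On later boots, skip the `mkfs` line and only mount the volume again.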

Page 61: Dataiku   r users group v2

Update your YUM repo

This is required to be able to install R (base) from the Yum package manager

Page 62: Dataiku   r users group v2

Install R base

Page 63: Dataiku   r users group v2

Wait for R base installation…

Page 64: Dataiku   r users group v2

Download RStudio Server

Page 65: Dataiku   r users group v2

Install RStudio Server

Page 66: Dataiku   r users group v2

Create a dedicated User

Creates a new sudo user called “rstudio”. The “passwd” utility sets a new password for it.

Page 67: Dataiku   r users group v2

Test your connection to RStudio

Close the current connection to the server

Re-issue an ssh connection, this time with a port forwarding option. Connections to port 8787 on your local machine are tunneled to port 8787 (RStudio Server) on the remote server, so that port never has to be opened to the outside (better for security).
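With OpenSSH, this kind of tunnel is usually set up with the `-L` option. A minimal sketch, assuming the same hypothetical key path and placeholder public DNS as before; the command is printed rather than run:

```shell
# Hypothetical values: replace with your key path and your instance's public DNS.
KEY_FILE="${KEY_FILE:-$HOME/.ssh/my-keypair.pem}"
PUBLIC_DNS="${PUBLIC_DNS:-ec2-xx-xx-xx-xx.compute.amazonaws.com}"
RSTUDIO_PORT=8787

# -L local_port:host:remote_port opens local_port on your PC and tunnels it
# to host:remote_port as seen from the server (here, RStudio Server itself).
TUNNEL_CMD="ssh -i $KEY_FILE -L $RSTUDIO_PORT:localhost:$RSTUDIO_PORT root@$PUBLIC_DNS"

# Printed rather than executed, so you can review it before connecting.
echo "$TUNNEL_CMD"
echo "Then browse to http://localhost:$RSTUDIO_PORT"
```

Because only port 22 is open in the security group, RStudio Server stays unreachable from the internet except through this tunnel.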

Page 68: Dataiku   r users group v2

Install S3 tools

This step is not mandatory but is used here because the Vertica installer is stored on S3.

Page 70: Dataiku   r users group v2

Download the Vertica installer

NOTE: this is specific to my installation, you must specify your own S3 bucket if you choose this way to store your Vertica installer. Another option is to download the installer on your local machine, and upload it back to the EC2 instance using a “scp” command.
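The “scp” alternative mentioned above could look like the sketch below. The installer file name, the key path, the public DNS, and the `/tmp/` destination are all placeholders chosen for illustration; the command is printed rather than run.

```shell
# Hypothetical values: installer file name, key path, host and destination.
INSTALLER="${INSTALLER:-vertica-installer.rpm}"
KEY_FILE="${KEY_FILE:-$HOME/.ssh/my-keypair.pem}"
PUBLIC_DNS="${PUBLIC_DNS:-ec2-xx-xx-xx-xx.compute.amazonaws.com}"
REMOTE_DIR="${REMOTE_DIR:-/tmp/}"

# scp takes the same -i key option as ssh; the target is user@host:path.
SCP_CMD="scp -i $KEY_FILE $INSTALLER root@$PUBLIC_DNS:$REMOTE_DIR"

# Printed rather than executed, so you can review it first.
echo "$SCP_CMD"
```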

Page 71: Dataiku   r users group v2

Install Vertica

Page 72: Dataiku   r users group v2

Prepare the data directory

This is where Vertica is going to persist its data. Make sure it has permissions to write into it.

Page 73: Dataiku   r users group v2

Run Vertica installer

The “-d” option is very important: it tells Vertica where to store its data. Here we point it to the directory previously created on the EBS.
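As a rough sketch of this step: the RPM path and the data directory name below are assumptions for illustration, and the `-s` (hosts) and `-r` (RPM file) flags reflect my understanding of the Vertica installer options, so check them against the documentation for your version. Only `-d` comes from the slide. The commands are printed rather than run.

```shell
# Hypothetical paths: adjust to your installer file and your EBS data directory.
VERTICA_RPM="${VERTICA_RPM:-/tmp/vertica-installer.rpm}"
DATA_DIR="${DATA_DIR:-/data/vertica}"

# Installing the RPM is what provides /opt/vertica/sbin/install_vertica.
RPM_CMD="rpm -Uvh $VERTICA_RPM"

# -s: host list (single node here), -r: RPM file, -d: data directory on the EBS.
INSTALL_CMD="/opt/vertica/sbin/install_vertica -s localhost -r $VERTICA_RPM -d $DATA_DIR"

# Printed rather than executed: run them as root on the instance.
echo "$RPM_CMD"
echo "$INSTALL_CMD"
```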

Page 74: Dataiku   r users group v2

Change user and start adminTools

“dbadmin” is the account that handles Vertica management. “adminTools” is the Vertica utility used to configure and execute management tasks (most of them can also be done directly from the command line).

Page 75: Dataiku   r users group v2

Select the Configuration Menu

Page 76: Dataiku   r users group v2

Choose “Create Database”

Page 77: Dataiku   r users group v2

Enter the database name and comments

Page 78: Dataiku   r users group v2

Enter your password for the database

Page 79: Dataiku   r users group v2

Confirm your password

Page 80: Dataiku   r users group v2

Select your host (localhost only here)

Page 81: Dataiku   r users group v2

Go through the data directories

Page 82: Dataiku   r users group v2

Go through the k-safety warning message

Page 83: Dataiku   r users group v2

Confirm the database creation

Page 84: Dataiku   r users group v2

Go through the database creation confirmation message

Page 85: Dataiku   r users group v2

Go back to the Main Menu

Page 86: Dataiku   r users group v2

Exit adminTools

Page 87: Dataiku   r users group v2

Test that everything’s okay using the vsql client

Page 88: Dataiku   r users group v2

Install the RODBC package

Page 89: Dataiku   r users group v2

Create the /etc/odbc.ini file

Page 90: Dataiku   r users group v2

Create the /etc/vertica.ini file

Page 91: Dataiku   r users group v2

Export the VERTICAINI variable
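The two configuration files and the environment variable could look roughly like the sketch below. The DSN name `VerticaDSN`, the database name `mydb`, and the library paths are assumptions based on a default 64-bit Vertica and unixODBC install, so adjust them to your system. The files are written to a scratch directory rather than to /etc, so the snippet is safe to run anywhere.

```shell
# Scratch directory for illustration; the slides put both files under /etc.
CONF_DIR="${CONF_DIR:-$(mktemp -d)}"

# odbc.ini: declares the data source (DSN) that RODBC will connect to.
# DSN name, database name and driver path are hypothetical.
cat > "$CONF_DIR/odbc.ini" <<'EOF'
[VerticaDSN]
Description = Vertica Community Edition
Driver = /opt/vertica/lib64/libverticaodbc.so
Database = mydb
Servername = localhost
Port = 5433
UID = dbadmin
EOF

# vertica.ini: settings for the Vertica ODBC driver itself.
cat > "$CONF_DIR/vertica.ini" <<'EOF'
[Driver]
DriverManagerEncoding = UTF-16
ODBCInstLib = /usr/lib64/libodbcinst.so
ErrorMessagesPath = /opt/vertica/lib64
EOF

# Tell the driver where to find vertica.ini.
export VERTICAINI="$CONF_DIR/vertica.ini"
echo "VERTICAINI=$VERTICAINI"
```

To make the variable survive new sessions, the `export` line can be added to the shell profile of the user who runs RStudio.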

Page 92: Dataiku   r users group v2

Check RStudio to Vertica connectivity
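One way to run this check is a few lines of R using the RODBC package. The sketch below writes such a script to a scratch file from the shell; the DSN name `VerticaDSN`, the user, and the password are placeholders, not values from the slides. The same R lines can be pasted directly into the RStudio console instead.

```shell
# Scratch file holding a short R connectivity test.
R_SCRIPT="${R_SCRIPT:-$(mktemp)}"

cat > "$R_SCRIPT" <<'EOF'
library(RODBC)
# DSN name, user and password are placeholders: use your own values.
ch <- odbcConnect("VerticaDSN", uid = "dbadmin", pwd = "your-password")
# A trivial query: if it returns a version string, the whole chain works.
print(sqlQuery(ch, "SELECT version();"))
odbcClose(ch)
EOF

# Printed rather than executed: run it on the server where R and RODBC are installed.
echo "Rscript $R_SCRIPT"
```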