dataiku r users group v2
Building your own Data Science platform in the cloud
GUR FlautR – Paris, November 14th 2012
10/04/2023 Build Your Data Science Platform in the Cloud 2
Who Am I
• Co-founder and Data Scientist at Dataiku
• Long-time data hacker
– Telco (Orange)
– Retail (Catalina Marketing, all major French retailers)
– High Tech (Apple)
– Social Gaming (Is Cool Entertainment)
– Data Provider (qunb)
• I love data and blending innovative technologies and methods to get the most out of a dataset.
Agenda
• Introducing Dataiku
• Motivations & building blocks
• Setting up the Data Science stack
• Annexes (with step-by-step tutorial)
Your data lab accelerator
New Product ?
Product Designer
Business & Marketing
Engineers
User Voice
Product Innovation opposes conflicting views
Introducing Dataiku
Today, innovation requires bringing together different expertise and different views…
User Experience? Features? Roadmap?
Acquisition? Pricing? Loyalty?
Planning? Performance? Reliability?
Satisfaction? Perception? Engagement?
Data !
Product Designer
Business & Marketing
Engineers
User Voice
Data Innovation: fill the gap!
Targeted campaigns, price optimization
A common ground to federate your product teams towards a common goal
Personalized experience
Quality assurance, workload and yield management
User feedback (A/B test), continuous improvement
• You can’t « design » insights, you explore and discover them…
• Iterate quickly with constant feedback
• Try a lot, don’t be afraid to fail!
An exploratory and iterative approach…
Function
Form
Experience
Emotion
Surprise
Culture
Explore and Refine Experiment
Generate Ideas
Select & Develop
Enhance or Discard
Gather Feedback
…which is key to your future business models
• Personalized Subscription Models
Digital Publishing
• Detailed Risk Analytics Models
Insurance
• Personalized Treatment
Healthcare
• Optimized Traffic Network
Transportation
• Bio Surveillance with sensor networks
Environment
• … to imagine !
Your Business
?
• data lab (n. m): a small group with all the required expertise, including business-minded people, machine learning knowledge and the right technology
• A proven organization used by successful data-driven companies over the past few years (eBay, LinkedIn, Walmart…)
The « data lab »
How does it work?
Real Lab
• Tools: to perform experiments
• Protocols: how to apply experiments
• People: scientists
Data Lab
• Software and servers: store, process, analyze
• Intelligence: models, algorithms
• People: data scientists
Technologies
People
Governance
But it’s not so easy…
Data Lab
• Lots of recent open source technologies to choose from
• Complex integration and usage
• Very rare skills
• Hard to recruit or train
• Lack of integrated teams
• New mindset to adopt
Dataiku helps you find your path to Data-Driven Innovation, building (or accelerating) your own lab
Our mission
Dataiku: Your data lab accelerator
Dataiku Platform
• Ready-to-use platform to store, process and analyze your data
• Open Source Technologies
• Machine learning + statistics + distributed computing
• Scale from 10GB to 1PTB
Dataiku Innovation
• Dedicated programs to kick-start data science practice in your company
• Assess your Data potential
• Bootstrap your Data Science practices
• Build a fully integrated Data Science team in your org
Dataiku Community
• A community of data science experts that helps you grow your organization to Data Science
• Unique Data Scientist training Program
• Network of experts that can be activated “as a service”
MOTIVATIONS & BUILDING BLOCKS
A Data Science Platform
Motivations
• I often face situations where I need a lot of flexibility and computing resources to address my day-to-day work, while being on a budget.
• There are a lot of (new, and often open source) technologies out there to deal with data, but sometimes poor documentation makes them hard to use.
• To address this issue, I am going to detail the setup of a data science platform with some of these technologies.
– There are a lot of other options of course, but this one proved to work very well.
A new framework to process data
• Cloud Computing offers a new paradigm for computation power and flexibility
– Ideal when a lot of processing power is required temporarily (think, a lot of RAM for R…)
– When building a prototype or when you don’t have internal resources available
• Open Source brings in best-of-breed technologies and analytical capabilities
• Together, they make it possible to experiment with data in a whole new way.
The building blocks
Infrastructure
Fast data storage and querying system
Cutting-edge analytics engine
• it is flexible and cost effective
• it allows you to experiment and iterate fast
• it can be extended easily with other components, such as Hadoop (via EMR or CDH)
Infrastructure
• Amazon Web Services is one of the leading cloud computing providers.
• It is IaaS (infrastructure as a service), which means it offers all the required components, but you’ll need to configure and assemble them yourself.
• The components we are interested in today:
– EC2 (Elastic Compute Cloud): servers
– EBS (Elastic Block Store): data persistence
– S3 (Simple Storage Service): file storage
• Be warned, this type of service is good for experimenting and for temporary resource needs. The cost can grow quickly if you use it on a regular basis.
• See current price lists in the addendum.
Data Storage and Querying
• Vertica is a very fast, column-oriented database, specialized in analytical workloads (large scans / joins / aggregations).
• It offers fast data loading, is SQL-99 compliant (“analytical” queries), and can be extended using User-Defined Functions, including in R.
• Vertica is not an open source technology, but it provides a free Community Edition
– The paid version is massively parallel (scale-out architecture), among other things
– The Community Edition can use up to 3 nodes
• There are a few other options in this space, open source or not:
– InfiniDB / Infobright (MySQL based, less practical “analytical” wise)
– Greenplum, Aster Data
– Netezza, Teradata, Oracle Exadata…
– “Big Data” alternatives: Cloudera’s Impala (relying on Hive), the incubating Apache Drill (open source version of Google’s Dremel, accessible today via Google BigQuery)
Analytical Engine
• Well, I guess you all know it…
• We’ll be using RStudio here, in its Server version
– Access the IDE in a web browser
– Has a lot of nice features, like Git integration, the “Shiny” project…
SETTING UP THE DATA SCIENCE STACK
Preamble
• This is not as easy as it sounds
• It is a bit technical, and the process below could probably be optimized.
• The very detailed step-by-step tutorial can be found in the addendum part of this deck, or at http://dataiku.com/blog/setting-up-a-cool-data-science-platform-for-cheap/
Requirements
• Create an Amazon Web Services account at http://aws.amazon.com/fr/
– Payment info is required if your organization does not have an account yet, but it’s worth it
• Register for the Vertica Community Edition at http://my.vertica.com/
– Free, but it might take a few days before your registration is approved
• Make sure you have a terminal client available (like iTerm on Mac OS X or PuTTY on Windows)
Schematic Steps
• Launch an EC2 instance: the “server” itself
• Attach an EBS disk: additional and persistent storage for the server
• Install Vertica Community Edition
• Install and Configure RStudio
• Configure ODBC connectivity to Vertica CE
• H.A.V.E F.U.N
Creating the EC2 instance
• Connect to the EC2 management console
• Create a key pair if not done already: store it in a “safe” location on your PC
• Select “Launch Instance”
• Select a RHEL 6 “AMI”: the OS must be compatible with both RStudio and Vertica (I used AMI ami-41d00528)
• Choose your instance type and region: I used an “m3.xlarge” to start, but it can be resized later!
• Give a name to your instance: if you have several instances, it will be easier to find later
• Select your key pair: it will be used to connect (“ssh”) to the server later
• Specify your security group: only TCP port 22 needs to be open (for ssh)
• Launch and wait: it can take a few minutes
Attach an EBS disk
• Click on the “Create Volume” tab
• Specify a size and region: same region (and Availability Zone) as your instance; size can be up to 1 TB
• Under “More…”, attach the EBS to your instance
• Connect to the remote server: ssh -i /path/to/your/keypair root@instance-public-dns
• Format your EBS: fdisk -l to list your devices, then mkfs -t ext3 /dev/your-ebs
• Create a “mount point”: mkdir -p /data
• Mount the EBS on this directory: mount /dev/your-ebs /data
• Test if everything is working: df -kh for example
Install RStudio
• Update your Yum package manager with EPEL: to be able to yum install R
• Install R: R base is required to make RStudio work
• Download RStudio Server
• Install RStudio Server
• Create a dedicated user
• Exit and log back in using ssh port forwarding
• Point your browser to localhost:8787: you’ll work transparently from your PC
• You run RStudio in the Cloud: that’s great!
Install Vertica
• Upload or download the Vertica installer: the installer you got from my.vertica.com
• Prepare the data directory on the EBS: where Vertica is going to store its data
• Run the installer: don’t forget to point the data directory to the EBS!
• Log in as dbadmin and run the adminTools tool: the Vertica main account and management tool
• Create a new database
• Exit adminTools
• Test your new DB using the “vsql” client: talk to Vertica as you would with Postgres
Configure ODBC connectivity to Vertica
• Install the RODBC package: via yum install
• Create the odbc.ini file: ODBC driver configuration file
• Create the vertica.ini file
• Export VERTICAINI: the system variable
• Check your connectivity: in RStudio
And now you can play!
• Collect some weather data
• Create a Vertica table
• Load into Vertica
• Put data into RStudio
• Analyze!
ANNEXES
Amazon EC2 price list
STEP-BY-STEP INSTALLATION
http://dataiku.com/setting-up-a-cool-data-science-platform-for-cheap/
Connect to EC2 Management console
Under “Key Pairs”, create a new key pair
Note: once created, you can reuse it at will
Move your key pair to a safe location
Note: this is shown for Mac OS X.
Set Read/Write permissions only on the key
Click on “Launch Instance”
Select the “Classic Wizard”
Select your AMI
Select your instance type
Leave default settings
Go through the Device Configuration window
Assign a name to your instance
Select your key pair
Choose your default Security Group
Just make sure TCP port #22 is open for ssh access
Launch the instance
Wait for the instance to start
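The console steps above can also be scripted. This is only a sketch, and it assumes the AWS command line interface is installed and configured, which this deck does not cover (the slides use the web console). The key pair name and security group are hypothetical; the AMI and instance type are the ones quoted earlier in this deck, and that AMI may no longer be available. Nothing is executed unless DRY_RUN=0.

```shell
#!/bin/sh
# Values from the deck: AMI and instance type. Hypothetical: key pair name, group.
AMI_ID="${AMI_ID:-ami-41d00528}"
INSTANCE_TYPE="${INSTANCE_TYPE:-m3.xlarge}"
KEY_NAME="${KEY_NAME:-my-keypair}"            # hypothetical key pair name
SECURITY_GROUP="${SECURITY_GROUP:-default}"   # must allow TCP port 22 for ssh
DRY_RUN="${DRY_RUN:-1}"                       # 1 = print only, 0 = execute

# Print a command, and execute it only when DRY_RUN=0.
run() {
  echo "+ $*"
  if [ "$DRY_RUN" = "0" ]; then "$@"; fi
}

# Launch one instance with the chosen AMI, type, key pair and security group.
launch_instance() {
  run aws ec2 run-instances \
    --image-id "$AMI_ID" \
    --instance-type "$INSTANCE_TYPE" \
    --key-name "$KEY_NAME" \
    --security-groups "$SECURITY_GROUP"
}

launch_instance
```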
When Running, click on “Volumes”
Click on the “Create Volume” tab
Select size and region of your EBS
EBS up to 1 TB
Same region (and Availability Zone) as your instance
Put a name on your EBS
Under “More…”, select “Attach”
Attachment settings
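For those who prefer the command line, creating and attaching the volume might look like the sketch below. It assumes the AWS command line interface is installed and configured (the deck itself uses the web console). The size, Availability Zone, volume ID, instance ID and device name are all placeholders to replace with your own values. Nothing is executed unless DRY_RUN=0.

```shell
#!/bin/sh
# All values below are placeholders for illustration.
SIZE_GB="${SIZE_GB:-100}"
AVAILABILITY_ZONE="${AVAILABILITY_ZONE:-us-east-1a}"  # same zone as the instance
VOLUME_ID="${VOLUME_ID:-vol-xxxxxxxx}"                # returned by create-volume
INSTANCE_ID="${INSTANCE_ID:-i-xxxxxxxx}"
DEVICE="${DEVICE:-/dev/sdf}"
DRY_RUN="${DRY_RUN:-1}"                               # 1 = print only, 0 = execute

# Print a command, and execute it only when DRY_RUN=0.
run() {
  echo "+ $*"
  if [ "$DRY_RUN" = "0" ]; then "$@"; fi
}

# Create an empty volume of the requested size in the instance's zone.
create_volume() {
  run aws ec2 create-volume --size "$SIZE_GB" \
    --availability-zone "$AVAILABILITY_ZONE"
}

# Attach the volume to the instance under the given device name.
attach_volume() {
  run aws ec2 attach-volume --volume-id "$VOLUME_ID" \
    --instance-id "$INSTANCE_ID" --device "$DEVICE"
}

create_volume
attach_volume
```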
Write down your public DNS
This will be used to connect to the machine.
It is reassigned each time the instance is stopped and started.
Login to the machine
Start your favorite Terminal application. Windows users can use PuTTY.
ssh: secure connection to a remote host
-i: option used to specify your key location
root: the base account used
@public-dns: this is why you need to remember your machine’s DNS
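As a sketch, the login command can be assembled like this. The key path and public DNS below are placeholders, not real values. The script only prints the commands so they can be checked before being run.

```shell
#!/bin/sh
# Placeholders: replace with your own key path and instance public DNS.
KEY_PATH="${KEY_PATH:-$HOME/.ssh/my-keypair.pem}"
PUBLIC_DNS="${PUBLIC_DNS:-ec2-xx-xx-xx-xx.compute-1.amazonaws.com}"

# Build the ssh command: -i selects the private key, root is the login account.
ssh_login_cmd() {
  printf 'ssh -i %s root@%s\n' "$1" "$2"
}

# ssh refuses private keys readable by other users, so restrict the key first.
printf 'chmod 600 %s\n' "$KEY_PATH"
ssh_login_cmd "$KEY_PATH" "$PUBLIC_DNS"
```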
Find your EBS
The “fdisk” utility on RHEL, with the -l option, can be used to locate the physical device where your EBS is attached.
You’ll find one device whose size approximately matches that of your EBS.
Format your EBS (FIRST RUN ONLY!)
You only need to format your EBS the first time you use it, using the mkfs utility.
Mount your EBS
This creates a “/data” directory first, then actually mounts the EBS to this point.
Check that everything is okay
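The last four slides (find, format, mount, check) can be sketched as one script. The device name /dev/xvdf is an assumption for illustration; read the real one from the fdisk -l output. Formatting is skipped unless FORMAT=1, because mkfs erases the volume, and nothing is executed unless DRY_RUN=0.

```shell
#!/bin/sh
# Assumptions: device name and mount point are illustrative.
DEVICE="${DEVICE:-/dev/xvdf}"
MOUNT_POINT="${MOUNT_POINT:-/data}"
FORMAT="${FORMAT:-0}"     # 1 = format the volume (first use only)
DRY_RUN="${DRY_RUN:-1}"   # 1 = print only, 0 = execute

# Print a command, and execute it only when DRY_RUN=0.
run() {
  echo "+ $*"
  if [ "$DRY_RUN" = "0" ]; then "$@"; fi
}

prepare_ebs() {
  run fdisk -l                          # list devices to locate the EBS volume
  if [ "$FORMAT" = "1" ]; then
    run mkfs -t ext3 "$DEVICE"          # first run only: erases the volume
  fi
  run mkdir -p "$MOUNT_POINT"           # create the mount point
  run mount "$DEVICE" "$MOUNT_POINT"    # mount the volume on it
  run df -kh                            # check that the volume shows up
}

prepare_ebs
```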
Update your YUM repo
This is required to be able to install R (base) from the Yum package manager
Install R base
Wait for R base installation…
Download RStudio Server
Install RStudio Server
Create a dedicated User
Creates a new sudo user called “rstudio”.
The “passwd” utility sets a new password for it.
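The RStudio installation slides can be summarized by the sketch below. Both URLs are placeholders, not real download links: take the current EPEL release RPM and RStudio Server RPM links from their download pages. Granting sudo rights to the new user is left out. Nothing is executed unless DRY_RUN=0.

```shell
#!/bin/sh
# Placeholders: these URLs are NOT real download links.
EPEL_RPM_URL="${EPEL_RPM_URL:-https://example.com/epel-release.rpm}"
RSTUDIO_RPM_URL="${RSTUDIO_RPM_URL:-https://example.com/rstudio-server.rpm}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = print only, 0 = execute

# Print a command, and execute it only when DRY_RUN=0.
run() {
  echo "+ $*"
  if [ "$DRY_RUN" = "0" ]; then "$@"; fi
}

install_rstudio() {
  run rpm -Uvh "$EPEL_RPM_URL"                             # register the EPEL repo
  run yum install -y R                                     # R base, needed by RStudio
  run wget -O /tmp/rstudio-server.rpm "$RSTUDIO_RPM_URL"   # download the server RPM
  run yum install -y --nogpgcheck /tmp/rstudio-server.rpm  # install RStudio Server
  run useradd rstudio                                      # dedicated login account
  run passwd rstudio                                       # prompts for its password
}

install_rstudio
```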
Test your connection to RStudio
Close the current connection to the server
Re-issue an ssh connection, this time with a port forwarding option. Connections to port 8787 on your local machine are tunneled through ssh to port 8787 (RStudio Server) on the remote server, so that port does not need to be opened to the outside (better for security).
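A minimal sketch of the port forwarding command, with a placeholder key path and public DNS. In -L 8787:localhost:8787, the first 8787 is the port on your machine, and localhost:8787 is resolved on the server side.

```shell
#!/bin/sh
# Placeholders: replace with your own key path and instance public DNS.
KEY_PATH="${KEY_PATH:-$HOME/.ssh/my-keypair.pem}"
PUBLIC_DNS="${PUBLIC_DNS:-ec2-xx-xx-xx-xx.compute-1.amazonaws.com}"

# Build the ssh command with local port forwarding (-L local:target_host:target_port).
ssh_tunnel_cmd() {
  printf 'ssh -i %s -L 8787:localhost:8787 root@%s\n' "$1" "$2"
}

ssh_tunnel_cmd "$KEY_PATH" "$PUBLIC_DNS"
```

Once the tunnel is open, http://localhost:8787 in your browser reaches RStudio Server.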
Install S3 tools
This step is not mandatory but is used here because the Vertica installer is stored on S3.
Configure S3 tools
Specify your Amazon credentials: access key and secret key (which can be found under https://portal.aws.amazon.com/gp/aws/securityCredentials)
Download the Vertica installer
NOTE: this is specific to my installation. You must specify your own S3 bucket if you choose this way to store your Vertica installer. Another option is to download the installer to your local machine and upload it to the EC2 instance using an “scp” command.
Install Vertica
Prepare the data directory
This is where Vertica is going to persist its data. Make sure it has permissions to write into it.
Run Vertica installer
The “-d” option is very important: this is how you tell Vertica where to store its data. Here we point to the directory previously created on the EBS.
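The three Vertica installation slides can be sketched as follows. The RPM path and data directory are placeholders. Apart from -d, which the slide mentions, the install_vertica options shown (-s for hosts, -r for the RPM, -u for the admin user) are written from memory and should be checked against the installer's help for your version. Nothing is executed unless DRY_RUN=0.

```shell
#!/bin/sh
# Placeholders: adjust to where you stored the installer and mounted the EBS.
VERTICA_RPM="${VERTICA_RPM:-/tmp/vertica.rpm}"
DATA_DIR="${DATA_DIR:-/data/vertica}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = print only, 0 = execute

# Print a command, and execute it only when DRY_RUN=0.
run() {
  echo "+ $*"
  if [ "$DRY_RUN" = "0" ]; then "$@"; fi
}

install_vertica_ce() {
  run rpm -Uvh "$VERTICA_RPM"      # unpack Vertica under /opt/vertica
  run mkdir -p "$DATA_DIR"         # data directory on the EBS volume
  # -d points Vertica's data directory to the EBS; other flags are assumptions.
  run /opt/vertica/sbin/install_vertica -s localhost -r "$VERTICA_RPM" \
    -u dbadmin -d "$DATA_DIR"
  run chown -R dbadmin "$DATA_DIR" # make sure dbadmin can write into it
}

install_vertica_ce
```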
Change user and start adminTools
“dbadmin” is the account that handles Vertica management.
“adminTools” is the Vertica utility that can be used to configure and execute the management tasks (most of them can also be done directly from the command line).
Select the Configuration Menu
Choose “Create Database”
Enter the database name and comments
Enter your password for the database
Confirm your password
Select your host (localhost only here)
Go through the data directories
Go through the k-safety warning message
Confirm the database creation
Go through the database creation confirmation message
Go back to the Main Menu
Exit adminTools
Test that everything’s okay using the vsql client
Install the RODBC package
Create the /etc/odbc.ini file
Create the /etc/vertica.ini file
Export the VERTICAINI variable
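As an illustration of the three ODBC slides, the sketch below writes both files and exports the variable. The DSN name, database name, password and library paths are assumptions to adapt to your installation. The files are written to a scratch directory by default so the sketch can be tried safely; on the server they go in /etc.

```shell
#!/bin/sh
# CONF_DIR defaults to a scratch directory; use CONF_DIR=/etc on the server.
CONF_DIR="${CONF_DIR:-/tmp/vertica-odbc-demo}"
mkdir -p "$CONF_DIR"

# Write odbc.ini (the data source) and vertica.ini (the driver settings).
write_odbc_config() {
  dir="$1"
  cat > "$dir/odbc.ini" <<'EOF'
[VerticaDSN]
Description = Vertica Community Edition
Driver = /opt/vertica/lib64/libverticaodbc.so
Database = mydb
Servername = localhost
UID = dbadmin
PWD = change-me
Port = 5433
EOF
  cat > "$dir/vertica.ini" <<'EOF'
[Driver]
DriverManagerEncoding=UTF-16
ODBCInstLib=/usr/lib64/libodbcinst.so
ErrorMessagesPath=/opt/vertica/lib64
EOF
}

write_odbc_config "$CONF_DIR"

# Tell the Vertica driver where to find its settings file.
export VERTICAINI="$CONF_DIR/vertica.ini"
```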
Check RStudio to Vertica connectivity
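A possible connectivity check, as a sketch: the script below writes a few lines of R that use the RODBC package and runs them only when RUN_CHECK=1 and Rscript is available. The DSN name "VerticaDSN" is an assumption and must match an entry in your odbc.ini. The same R lines can be pasted directly into the RStudio console.

```shell
#!/bin/sh
# Assumption: the script path is a scratch location; the DSN name is illustrative.
CHECK_SCRIPT="${CHECK_SCRIPT:-/tmp/check_vertica.R}"

# Write a short R script that opens the DSN and runs one query.
cat > "$CHECK_SCRIPT" <<'EOF'
library(RODBC)                            # ODBC bindings for R
ch <- odbcConnect("VerticaDSN")           # open the DSN defined in odbc.ini
print(sqlQuery(ch, "SELECT version();"))  # any simple query proves the link works
odbcClose(ch)                             # release the connection
EOF

# Run it only when asked to, and only if R is installed on this machine.
if [ "${RUN_CHECK:-0}" = "1" ] && command -v Rscript >/dev/null 2>&1; then
  Rscript "$CHECK_SCRIPT"
fi
```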