Scaling Up Machine Learning: How to Benchmark GraphLab Create on Huge Datasets

Dato Confidential. GraphLab Create Benchmarks, April 21, 2016. Guy Rapaport, Data Scientist, Dato EMEA, [email protected]

Upload: turi-inc

Post on 16-Apr-2017

TRANSCRIPT

Page 1: Scaling Up Machine Learning: How to Benchmark GraphLab Create on Huge Datasets

GraphLab Create Benchmarks
April 21, 2016

Guy Rapaport, Data Scientist, Dato EMEA
[email protected]

Page 2: Scaling Up Machine Learning: How to Benchmark GraphLab Create on Huge Datasets

Dato: We Intelligent Applications

Page 3: Scaling Up Machine Learning: How to Benchmark GraphLab Create on Huge Datasets

Some of our Customers

Page 4: Scaling Up Machine Learning: How to Benchmark GraphLab Create on Huge Datasets

Business must be intelligent

Machine learning applications
• Recommenders
• Fraud detection
• Ad targeting
• Financial models
• Personalized medicine
• Churn prediction
• Smart UX (video & text)
• Personal assistants
• IoT
• Social networks
• Log analysis

Last decade: Data management
Last 5 years: Traditional analytics
Now: Intelligent apps

Page 5: Scaling Up Machine Learning: How to Benchmark GraphLab Create on Huge Datasets

Example Intelligent Applications
- images
- text
- graphs
- tabular data

Page 6: Scaling Up Machine Learning: How to Benchmark GraphLab Create on Huge Datasets

Creating a model pipeline

exploration

data

modeling

Page 7: Scaling Up Machine Learning: How to Benchmark GraphLab Create on Huge Datasets

Creating a model pipeline

Unstructured Data

Ingest, Transform, Model, Deploy

Page 8: Scaling Up Machine Learning: How to Benchmark GraphLab Create on Huge Datasets

Page 9: Scaling Up Machine Learning: How to Benchmark GraphLab Create on Huge Datasets

GraphLab Create in a Line

“A general-purpose machine learning Python library that scales on large datasets.”

• General purpose: classification, graph analytics…
• Python API on top, C++ open-source engine below.
• Scales vertically: more CPUs, RAM and faster disks.
• Large datasets: disk bound, not RAM bound.
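To make "disk bound, not RAM bound" concrete, here is a minimal sketch of out-of-core processing in plain Python. It is not GraphLab Create's engine; the function names `streaming_mean` and `make_demo_csv`, the column name `value`, and the chunk size are all hypothetical choices for illustration. The point is only that the data is read from disk in fixed-size chunks, so memory use stays flat no matter how large the file is.

```python
import csv
import tempfile
from itertools import islice


def streaming_mean(path, column, chunk_rows=1000):
    """Mean of a numeric CSV column, holding at most chunk_rows values in memory."""
    total, count = 0.0, 0
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle)
        while True:
            # Pull the next chunk of rows from disk; earlier chunks are discarded.
            chunk = [float(row[column]) for row in islice(reader, chunk_rows)]
            if not chunk:
                break
            total += sum(chunk)
            count += len(chunk)
    if count == 0:
        raise ValueError("no rows found in %s" % path)
    return total / count


def make_demo_csv(num_rows):
    """Write a small demo file with one column named 'value' (hypothetical data)."""
    handle = tempfile.NamedTemporaryFile(
        "w", suffix=".csv", delete=False, newline=""
    )
    with handle:
        writer = csv.writer(handle)
        writer.writerow(["value"])
        for i in range(num_rows):
            writer.writerow([i])
    return handle.name


if __name__ == "__main__":
    demo_path = make_demo_csv(10000)
    print(streaming_mean(demo_path, "value", chunk_rows=256))
```

With this design the running time is governed by how fast the disk can feed rows, which is why faster SSDs matter for this kind of workload.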

Page 10: Scaling Up Machine Learning: How to Benchmark GraphLab Create on Huge Datasets

What will we cover today?

1. Instantiating a machine in the Amazon EC2 cloud
  • r3.8xlarge instance
  • 32 cores, 244 GB of RAM, 2 SSDs of 320 GB each

2. Run PageRank on a large graph
  • CommonCrawl 2012 dataset: the internet as a graph
  • 3.5 billion nodes, 128 billion links

3. Run Gradient Boosted Trees on a large dataset
  • Criteo 1TB Click Logs Dataset
  • 4.3 billion rows, 39 features (13 numerical, 26 categorical)
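For readers who have not met PageRank before, the sketch below shows the textbook power-iteration form on a toy graph. It is an illustration under stated assumptions, not the GraphLab Create implementation that the benchmark runs: the function name, the parameter names (`reset_probability`, `max_iterations`, `threshold`), and the four-node edge list are all chosen here for the example, and GraphLab's own scaling of the scores may differ.

```python
def pagerank(edges, reset_probability=0.15, max_iterations=20, threshold=1e-6):
    """Textbook PageRank by power iteration over a list of (src, dst) edges."""
    nodes = sorted({node for edge in edges for node in edge})
    out_degree = {node: 0 for node in nodes}
    incoming = {node: [] for node in nodes}
    for src, dst in edges:
        out_degree[src] += 1
        incoming[dst].append(src)

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}  # start from a uniform distribution
    for _ in range(max_iterations):
        # Rank held by nodes without outgoing links is spread over all nodes.
        dangling = sum(rank[v] for v in nodes if out_degree[v] == 0)
        new_rank = {}
        for node in nodes:
            inbound = sum(rank[src] / out_degree[src] for src in incoming[node])
            new_rank[node] = (
                reset_probability / n
                + (1.0 - reset_probability) * (inbound + dangling / n)
            )
        delta = sum(abs(new_rank[v] - rank[v]) for v in nodes)
        rank = new_rank
        if delta < threshold:  # stop once the scores have stabilised
            break
    return rank


if __name__ == "__main__":
    # Hypothetical toy graph: a three-page cycle plus one page linking into it.
    toy_edges = [("a", "b"), ("b", "c"), ("c", "a"), ("d", "a")]
    for page, score in sorted(pagerank(toy_edges).items()):
        print(page, round(score, 4))
```

Each iteration touches every edge once, so on a graph with 128 billion links the cost per iteration is dominated by streaming the edge list, which is what makes this a useful stress test for a disk-backed engine.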

Page 11: Scaling Up Machine Learning: How to Benchmark GraphLab Create on Huge Datasets

What will you be able to do afterwards?

Instantiate an EC2 instance, grab our benchmark notebooks, and try it yourself!

Everything is publicly available on GitHub: https://github.com/guy4261/glc_pagerank_benchmark
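When trying the benchmarks yourself, it helps to record wall-clock time per step (loading, graph construction, training) rather than one total. The helper below is a minimal sketch using only the Python standard library; the name `timed` and the step label in the usage are hypothetical, and the benchmark notebooks may measure time differently.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(step, results):
    """Record the wall-clock duration of a block under results[step], in seconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Store the duration even if the step raised, so partial runs are still logged.
        results[step] = time.perf_counter() - start


if __name__ == "__main__":
    timings = {}
    with timed("toy_workload", timings):
        sum(i * i for i in range(1000000))  # stand-in for a real benchmark step
    for step, seconds in timings.items():
        print("%s: %.3f s" % (step, seconds))
```

Wrapping each notebook cell's main call in `with timed(...)` gives a per-step breakdown that can be compared across instance types.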

Page 12: Scaling Up Machine Learning: How to Benchmark GraphLab Create on Huge Datasets

Screen Primer

Command                                Action
sudo apt-get install -y screen         Install screen
screen -S my_session                   Start a session named my_session
PS1='\u@\h(${STY}:${WINDOW}):\w$'      Change your screen prompt (helpful)
# CTRL+A, then D                       Key combination to detach
screen -ls                             List all open screens
screen -r my_session                   Reattach to your screen
exit                                   Exit the session and terminate the screen

Page 13: Scaling Up Machine Learning: How to Benchmark GraphLab Create on Huge Datasets

Confidential - Dato internal use only. ©2015 Dato, Inc.

Questions?

“For the purpose of learning the Answer to the Ultimate Question of Life, The Universe, and Everything, the supercomputer Deep Thought was specially built. It takes Deep Thought 7½ million years to compute and check the answer, which turns out to be 42. Deep Thought points out that the answer seems meaningless because the beings who instructed it never actually knew what the Question was.”
- Douglas Adams, “The Hitchhiker’s Guide to the Galaxy”

Page 14: Scaling Up Machine Learning: How to Benchmark GraphLab Create on Huge Datasets

Our Machine Learning Specialization on Coursera

https://www.coursera.org/learn/ml-foundations

Page 15: Scaling Up Machine Learning: How to Benchmark GraphLab Create on Huge Datasets

Thanks!

Install using pip: $ pip install -U graphlab-create

Dato Launcher download: https://dato.com/download/

The benchmarks on GitHub: https://github.com/guy4261/glc_pagerank_benchmark

Coursera course: https://www.coursera.org/learn/ml-foundations

Reach out: [email protected]