Scaling Up Machine Learning: How to Benchmark GraphLab Create on Huge Datasets

Dato Confidential

GraphLab Create Benchmarks
April 21, 2016

Guy Rapaport, Data Scientist, Dato [email protected]


Dato: We Intelligent Applications


Some of our Customers


Business must be intelligent

Machine learning applications

• Recommenders
• Fraud detection
• Ad targeting
• Financial models
• Personalized medicine
• Churn prediction
• Smart UX (video & text)
• Personal assistants
• IoT
• Social networks
• Log analysis

Last decade: Data management
Last 5 years: Traditional analytics
Now: Intelligent apps


Example Intelligent Applications

• images
• text
• graphs
• tabular data


Creating a model pipeline

exploration

data

modeling


Creating a model pipeline

Unstructured Data → Ingest → Transform → Model → Deploy


GraphLab Create in a Line

“A general-purpose machine learning Python library that scales on large datasets.”

• General purpose: classification, graph analytics…
• Python API on top, C++ open-source engine below.
• Scales vertically: more CPUs, RAM and faster disks.
• Large datasets: disk bound, not RAM bound.
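
To make "disk bound, not RAM bound" concrete, here is a minimal pure-Python sketch of out-of-core processing: it streams a CSV file from disk in fixed-size chunks and keeps only running totals in memory. It illustrates the general idea only and is not how GraphLab Create's engine is implemented; the function name `streaming_mean`, the column name `clicks`, and the chunk size are hypothetical.

```python
import csv
import itertools
import os
import tempfile


def streaming_mean(path, column, chunk_rows=1000):
    """Mean of a numeric CSV column, holding at most chunk_rows rows in memory."""
    total, count = 0.0, 0
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle)
        while True:
            # Pull the next chunk from disk; earlier chunks are already discarded.
            chunk = list(itertools.islice(reader, chunk_rows))
            if not chunk:
                break
            total += sum(float(row[column]) for row in chunk)
            count += len(chunk)
    return total / count if count else float("nan")


# Tiny demo file standing in for a dataset that would not fit in RAM.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, newline="") as tmp:
    writer = csv.writer(tmp)
    writer.writerow(["clicks"])  # hypothetical column name
    writer.writerows([[i] for i in range(10)])
    demo_path = tmp.name

demo_mean = streaming_mean(demo_path, "clicks", chunk_rows=3)
os.remove(demo_path)
print(demo_mean)
```

Memory use here depends on `chunk_rows`, not on the file size, so the practical limit becomes how fast the disk can feed the loop.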


What will we cover today?

1. Instantiate a machine in the Amazon EC2 cloud
• r3.8xlarge instance
• 32 cores, 244 GB of RAM, 2 SSDs of 320 GB each

2. Run PageRank on a large graph
• CommonCrawl 2012 dataset – the internet as a graph
• 3.5 billion nodes, 128 billion links

3. Run Gradient Boosted Trees on a large dataset
• Criteo 1TB Click Logs Dataset
• 4.3 billion rows, 39 features (13 numerical, 26 categorical)
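
As a reminder of what the PageRank benchmark computes, below is a minimal pure-Python power-iteration sketch on a toy four-node graph. It is a textbook formulation, not GraphLab Create's implementation or API; the function name `pagerank`, the damping factor of 0.85, the iteration count, and the toy edge list are assumptions chosen for illustration.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Textbook PageRank by power iteration over a list of (src, dst) edges."""
    nodes = sorted({n for edge in edges for n in edge})
    out_degree = {n: 0 for n in nodes}
    for src, _ in edges:
        out_degree[src] += 1

    # Start from a uniform distribution over all nodes.
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Rank held by nodes without out-links is spread evenly over all nodes.
        dangling = sum(rank[n] for n in nodes if out_degree[n] == 0)
        base = (1.0 - damping + damping * dangling) / len(nodes)
        new_rank = {n: base for n in nodes}
        # Each node passes its damped rank in equal shares along its out-links.
        for src, dst in edges:
            new_rank[dst] += damping * rank[src] / out_degree[src]
        rank = new_rank
    return rank


# Hypothetical toy graph: "d" has no incoming links, "c" has the most.
toy_edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "a"), ("d", "c")]
ranks = pagerank(toy_edges)
for node, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(node, round(score, 3))
```

Every iteration is a single pass over the edge list, which is why a graph with 128 billion links makes this a test of sequential disk throughput as much as of CPU.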


What will you be able to do afterwards?

Instantiate an EC2 instance, grab our benchmark notebooks, and try it yourself!

Everything is publicly available on GitHub: https://github.com/guy4261/glc_pagerank_benchmark






Screen Primer

Command                                  Action
sudo apt-get install -y screen           Install screen
screen -S my_session                     Start a session named my_session
PS1='\u@\h(${STY}:${WINDOW}):\w$'        Change your screen prompt (helpful)
# CTRL+A, then D                         Key combination to detach
screen -ls                               List all open screens
screen -r my_session                     Reattach to your screen
exit                                     Exit the session and terminate the screen

Confidential – Dato internal use only. ©2015 Dato, Inc.

Questions?

“For the purpose of learning the Answer to the Ultimate Question of Life, The Universe, and Everything, the supercomputer Deep Thought was specially built. It takes Deep Thought 7½ million years to compute and check the answer, which turns out to be 42. Deep Thought points out that the answer seems meaningless because the beings who instructed it never actually knew what the Question was.”
- Douglas Adams, “The Hitchhiker’s Guide to the Galaxy”

https://en.wikipedia.org/wiki/Deep_Thought_(The_Hitchhiker's_Guide_to_the_Galaxy)

https://en.wikipedia.org/wiki/42_(number)


Our Machine Learning Specialization on Coursera

https://www.coursera.org/learn/ml-foundations


Thanks!

Install using pip: $ pip install -U graphlab-create

Dato Launcher Download: https://dato.com/download/

The benchmarks on GitHub: https://github.com/guy4261/glc_pagerank_benchmark

Coursera Course: https://www.coursera.org/learn/ml-foundations

Reach out: [email protected]


https://dato.com/learn/gallery/

