Scaling Up Machine Learning: How to Benchmark GraphLab Create on Huge Datasets
TRANSCRIPT
Dato Confidential
GraphLab Create Benchmarks
April 21, 2016
Guy Rapaport, Data Scientist, Dato [email protected]
Dato: We Intelligent Applications
Some of our Customers
Business must be intelligent
Machine learning applications
• Recommenders
• Fraud detection
• Ad targeting
• Financial models
• Personalized medicine
• Churn prediction
• Smart UX (video & text)
• Personal assistants
• IoT
• Social networks
• Log analysis
Last decade: Data management
Last 5 years: Traditional analytics
Now: Intelligent apps
Example Intelligent Applications
- images
- text
- graphs
- tabular data
Creating a model pipeline
exploration
data
modeling
Creating a model pipeline
Unstructured Data
Ingest
Transform
Model
Deploy
GraphLab Create in a Line
“A general-purpose machine learning Python library that scales on large datasets.”
• General purpose: classification, graph analytics…
• Python API on top, C++ open-source engine below.
• Scales vertically: more CPUs, RAM and faster disks.
• Large datasets: disk bound, not RAM bound.
What will we cover today?
1. Instantiate a machine in the Amazon EC2 cloud
• r3.8xlarge instance
• 32 cores, 244 GB of RAM, 2 SSDs of 320 GB each
2. Run PageRank on a large graph
• CommonCrawl 2012 dataset – the internet as a graph
• 3.5 billion nodes, 128 billion links
3. Run Gradient Boosted Trees on a large dataset
• Criteo 1TB Click Logs Dataset
• 4.3 billion rows, 39 features (13 numerical, 26 categorical)
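For readers who want a reminder of what the PageRank step computes, here is a minimal pure-Python sketch of power-iteration PageRank on a toy graph. It is not GraphLab Create's implementation or API; the function name `pagerank`, the damping factor of 0.85, the convergence tolerance, and the toy edge list are all assumptions chosen for illustration. The benchmark notebooks in the GitHub repository linked in this deck contain the actual GraphLab Create code.

```python
def pagerank(edges, damping=0.85, tol=1e-10, max_iter=100):
    """Power-iteration PageRank over a list of (src, dst) edges.

    Ranks are normalized so that they sum to 1.
    """
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(max_iter):
        # Rank held by nodes without out-links is spread evenly over all nodes.
        dangling = sum(rank[v] for v in nodes if not out_links[v])
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        # Every node passes an equal share of its damped rank to each target.
        for src in nodes:
            if out_links[src]:
                share = damping * rank[src] / len(out_links[src])
                for dst in out_links[src]:
                    new_rank[dst] += share
        delta = sum(abs(new_rank[v] - rank[v]) for v in nodes)
        rank = new_rank
        if delta < tol:
            break
    return rank


# Hypothetical toy graph: "c" is linked to by every other node.
toy_edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "a"), ("d", "c")]
ranks = pagerank(toy_edges)
for node in sorted(ranks, key=ranks.get, reverse=True):
    print(node, round(ranks[node], 4))
```

Each iteration touches every link once, which is why the cost of the real benchmark is dominated by streaming 128 billion links rather than by the arithmetic itself.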
What will you be able to do afterwards?
Instantiate an EC2 instance, grab our benchmark notebooks, and try it yourself!
Everything is publicly available on GitHub: https://github.com/guy4261/glc_pagerank_benchmark
Screen Primer
Command                                 Action
sudo apt-get install -y screen          Install screen
screen -S my_session                    Start a session named my_session
PS1='\u@\h(${STY}:${WINDOW}):\w$'       Change your screen prompt (helpful)
# CTRL+A, then D                        Key combination to detach
screen -ls                              List all open screens
screen -r my_session                    Reattach to your screen
exit                                    Exit the session and terminate the screen
Confidential – Dato internal use only. ©2015 Dato, Inc.
Questions?
“For the purpose of learning the Answer to the Ultimate Question of Life, The Universe, and Everything, the supercomputer Deep Thought was specially built. It takes Deep Thought 7½ million years to compute and check the answer, which turns out to be 42. Deep Thought points out that the answer seems meaningless because the beings who instructed it never actually knew what the Question was.”
- Douglas Adams, “The Hitchhiker’s Guide to the Galaxy”
Our Machine Learning Specialization on Coursera
https://www.coursera.org/learn/ml-foundations
Thanks!
Install using pip: $ pip install -U graphlab-create
Dato Launcher Download: https://dato.com/download/
The benchmarks on GitHub: https://github.com/guy4261/glc_pagerank_benchmark
Coursera Course: https://www.coursera.org/learn/ml-foundations
Reach out: [email protected]