gl conference2014 toolkits_alice
DESCRIPTION
GraphLab's Alice Zheng presents on using the toolkits within GraphLab Create to build data products.
TRANSCRIPT
Machine Learning Toolkits in GraphLab Create
Alice Zheng
GraphLab, Inc.
Going Beyond Data Engineering
GraphLab Create enables Data Intelligence
• Recommender systems for retailers
• Fraud detection for financial institutions
• Market segmentation and ad targeting
• Churn prediction for telecom
• Community detection and friend recommendation for social networks
© 2014 GraphLab, Inc.
The Data Pipeline
Raw Data → Features → Models → Predictions
Stage labels: Data Engineering, Data Intelligence
GraphLab Create Design Principles
• Easy to use
• Powerful
• Fast
• Composable
Example: Movie Recommender
City of God
Wild Strawberries
The Celebration
Women on the Verge of a Nervous Breakdown
What do I recommend???
Example: Movie Recommender
City of God
Wild Strawberries
The Celebration
La Dolce Vita
Women on the Verge of a Nervous Breakdown
User-Movie Interaction Matrix
Movies: Women on the Verge …, The Celebration, City of God, Wild Strawberries, La Dolce Vita
Users: Bob, Anna, David, Ethan
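A minimal sketch of how such an interaction matrix could be assembled. The watch histories below are hypothetical: the slide names the users and movies, but the actual matrix entries were only in the figure. The function name `build_interaction_matrix` is likewise an illustration, not part of GraphLab Create.

```python
# Hypothetical watch histories (assumed for illustration; not from the slides).
watched = {
    "Bob": {"City of God", "Wild Strawberries"},
    "Anna": {"The Celebration", "La Dolce Vita"},
    "David": {"City of God", "La Dolce Vita",
              "Women on the Verge of a Nervous Breakdown"},
    "Ethan": {"Wild Strawberries"},
}


def build_interaction_matrix(watched):
    """Return (users, movies, matrix); matrix[r][c] is 1 if users[r] watched movies[c]."""
    users = sorted(watched)
    movies = sorted({title for titles in watched.values() for title in titles})
    matrix = [
        [1 if title in watched[user] else 0 for title in movies]
        for user in users
    ]
    return users, movies, matrix


users, movies, matrix = build_interaction_matrix(watched)
for user, row in zip(users, matrix):
    print(f"{user:<6}", row)
```

Most entries in a real interaction matrix are empty, which is why recommenders usually store it in a sparse form rather than as the dense lists used here.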
Matrix Factorization
User-item interactions ≈ User latent factors × Item latent factors + Information about users + Information about items
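To make the factorization idea concrete, here is a minimal sketch in plain Python. It fits a rating as a global mean plus a user bias, an item bias, and the dot product of user and item latent factors, trained by stochastic gradient descent. The ratings, the function names, and all hyperparameters are assumptions for illustration. Side information about users and items is left out, and this is not GraphLab Create's implementation.

```python
import random


def factorize(ratings, num_factors=2, epochs=500, lr=0.02, reg=0.02, seed=0):
    """Fit rating(u, i) ~ mu + b_u + b_i + <p_u, q_i> with SGD; return a predict function."""
    rng = random.Random(seed)
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    mu = sum(r for _, _, r in ratings) / len(ratings)  # global mean rating
    user_bias = {u: 0.0 for u in users}
    item_bias = {i: 0.0 for i in items}
    # Small random initial latent factors for every user and item.
    P = {u: [rng.uniform(-0.1, 0.1) for _ in range(num_factors)] for u in users}
    Q = {i: [rng.uniform(-0.1, 0.1) for _ in range(num_factors)] for i in items}

    def predict(u, i):
        dot = sum(a * b for a, b in zip(P[u], Q[i]))
        return mu + user_bias[u] + item_bias[i] + dot

    data = list(ratings)
    for _ in range(epochs):
        rng.shuffle(data)
        for u, i, r in data:
            err = r - predict(u, i)
            # Gradient steps with L2 regularization on biases and factors.
            user_bias[u] += lr * (err - reg * user_bias[u])
            item_bias[i] += lr * (err - reg * item_bias[i])
            for k in range(num_factors):
                pu, qi = P[u][k], Q[i][k]
                P[u][k] += lr * (err * qi - reg * pu)
                Q[i][k] += lr * (err * pu - reg * qi)
    return predict


def rmse(predict, ratings):
    """Root-mean-square error of the predictions on the given ratings."""
    return (sum((r - predict(u, i)) ** 2 for u, i, r in ratings) / len(ratings)) ** 0.5


# Hypothetical (user, movie, rating) triples; the slides give no rating values.
ratings = [
    ("Bob", "City of God", 5), ("Bob", "Wild Strawberries", 4),
    ("Anna", "City of God", 4), ("Anna", "La Dolce Vita", 5),
    ("David", "Wild Strawberries", 2), ("David", "The Celebration", 5),
    ("Ethan", "La Dolce Vita", 3), ("Ethan", "The Celebration", 4),
]

predict = factorize(ratings)
print("training RMSE:", rmse(predict, ratings))
print("Bob on La Dolce Vita:", predict("Bob", "La Dolce Vita"))
```

The last line scores a pair that never appears in the training data, which is the point of the model: the learned factors fill in the empty cells of the interaction matrix.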
Demo
The Moral of the Story
• Data scientists need the right tools for the right job
• There is always a more clever model
• There is probably some bug in your data
• GraphLab Create
  • Versatile, composable, automated
  • Play, learn, build better models
GraphLab Create Toolkits
• Recommenders
  • Item similarity, factorization machine, matrix factorization, non-negative matrix factorization, matrix factorization for ranking
• Graph analytics
  • PageRank, triangle counting, degree distribution, graph coloring, connected components, shortest path, k-core decomposition
  • User-defined graph computation
• Nearest Neighbors
  • Brute-force and ball trees
• Topic modeling
  • LDA
• Regression/Classification
  • Linear regression, logistic regression, SVM, gradient boosted trees, neural networks/deep learning
• Clustering
  • K-Means
• Other popular ML libraries
  • Vowpal Wabbit
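As an illustration of the first recommender in the list, item similarity, here is a small sketch in plain Python. It scores each unseen movie by its Jaccard similarity to the movies a user has already watched. The viewing data and the helper names (`jaccard`, `recommend`) are hypothetical, and Jaccard is one common similarity choice assumed here, not necessarily what the toolkit uses.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(watched, user, top_k=3):
    """Rank movies the user has not seen by summed similarity to movies they have seen."""
    # Invert the data: movie -> set of users who watched it.
    viewers = {}
    for person, titles in watched.items():
        for title in titles:
            viewers.setdefault(title, set()).add(person)

    seen = watched[user]
    scores = {
        candidate: sum(jaccard(viewers[candidate], viewers[s]) for s in seen)
        for candidate in viewers
        if candidate not in seen
    }
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]


# Hypothetical viewing histories (assumed for illustration).
watched = {
    "Bob": {"City of God", "Wild Strawberries", "La Dolce Vita"},
    "Anna": {"City of God", "The Celebration", "La Dolce Vita"},
    "David": {"Wild Strawberries", "La Dolce Vita"},
    "Ethan": {"City of God", "The Celebration"},
}

recs = recommend(watched, "David")
for title, score in recs:
    print(f"{score:.2f}  {title}")
```

Two movies count as similar when largely the same people watched them, so the method needs no ratings and no model training, only co-occurrence counts.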
Come to Training Day!
• GraphLab data science training day tomorrow!
• A full day of lectures and exercises
• Data engineering, model building, deployment, all on GraphLab Create
Speed + Scale
• How much do you need?
• How much data do you really have?
Data Funnel
Raw Data (PB) → Features (GB to TB) → Models (MB)
Data Analytics Life Cycle
Extract, Transform, Load
Data Analytics Life Cycle
Extract, Transform, Load
Model Learning
Data Analytics Life Cycle
ETL
Data Analytics Life Cycle
ETL
Model Learning
Benchmarks
Run Time of Item Similarity on Netflix Dataset
• GraphLab Create (1 node): 3.6 minutes
• Mahout (5 nodes): 29 minutes
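The two run times can be turned into ratios with a few lines of arithmetic. The wall-clock comparison uses only the figures on the slide. The node-minutes comparison is a crude measure introduced here as an assumption, since the slide does not describe the hardware on either side.

```python
# Figures from the benchmark slide.
graphlab_minutes, graphlab_nodes = 3.6, 1
mahout_minutes, mahout_nodes = 29.0, 5

# How many times faster the single-node run finished.
wall_clock_speedup = mahout_minutes / graphlab_minutes

# Assumed resource measure: minutes multiplied by node count.
node_minutes_ratio = (mahout_minutes * mahout_nodes) / (graphlab_minutes * graphlab_nodes)

print(f"wall-clock speedup: {wall_clock_speedup:.1f}x")
print(f"node-minutes ratio: {node_minutes_ratio:.1f}x")
```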
Become a GLC User!
• We push the frontier of the industry
• ... and our customers guide us
• Our features are customer driven
• Tell us what you think!