ECS289: Scalable Machine Learning
Cho-Jui Hsieh, UC Davis
Sept 24, 2015
Course Information
Website: www.stat.ucdavis.edu/~chohsieh/ECS289G_scalableML.html
My office: Mathematical Sciences Building (MSB) 4232
Office hours: by appointment (email)
My email: [email protected], [email protected]
This is a 4-unit course
Course Information
Goals:
Understand the challenges in large-scale machine learning.
Understand state-of-the-art approaches for addressing these challenges.
Identify interesting open questions.
Course Structure:
Pick some important machine learning problems (classification, regression, recommender systems, . . . )
Introduce the model
Discuss the computational challenges
How do people scale to large datasets?
Prerequisites:
Basic knowledge of linear algebra (matrix multiplication, inversion, . . . )
Basic knowledge of programming (C/MATLAB) for the final project.
Grading Policy
Class participation (10%)
1 assignment and 1 presentation (30%)
Midterm exam (20%)
Final project (40%)
Final Project
Topics include:
Develop new algorithms or improve existing algorithms
Implement parallel machine learning algorithms and test them on large datasets
Apply machine learning to an application
Compare existing algorithms
. . .
Schedule:
Final project proposal presentation: 10/20
Final project presentation: 12/1, 12/3
Final project paper due: TBD
Syllabus
Supervised Learning: Classification and Regression
Optimization for Machine Learning
Matrix Completion
Semi-supervised Learning
Ranking
Neural Networks
What is Machine Learning?
Train and test data are usually assumed to be i.i.d. samples from the same distribution
Training
Linear SVM/regression: Linear hyperplane
Kernel SVM/regression: Nonlinear decision boundary
Decision tree, random forest
Nearest Neighbor
. . .
Prediction
Learn a model that best explains the observed data and also generalizes to unseen data
Scalability issues:
Time & space complexity of the training (learning) algorithm
Size of the model
Time complexity of prediction (for real-time applications)
A simple example
K-nearest neighbor classification
Model size: storing all the training samples
1 billion samples, each requiring 1 KByte of space ⇒ 1000 GB of memory
Prediction time: find the nearest training sample
1 billion samples, each distance evaluation taking 1 microsecond ⇒ 1000 secs per prediction
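The arithmetic above comes from one distance evaluation per training sample. A minimal brute-force 1-NN sketch (plain Python; the data and function name are illustrative) makes the linear prediction cost explicit:

```python
import math

def nearest_neighbor_predict(train_x, train_y, query):
    """Brute-force 1-NN: one distance evaluation per training sample,
    so prediction time grows linearly with the training set size."""
    best_dist, best_label = float("inf"), None
    for x, y in zip(train_x, train_y):
        d = math.dist(x, query)          # O(d) work per sample
        if d < best_dist:
            best_dist, best_label = d, y
    return best_label

train_x = [(0.0, 0.0), (1.0, 1.0), (4.0, 4.0)]
train_y = [-1, -1, +1]
print(nearest_neighbor_predict(train_x, train_y, (3.5, 3.9)))  # closest to (4, 4)
```

With 1 billion samples this loop runs 1 billion times per query, which is exactly where the 1000-second figure comes from.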
Topics in this course
Classification
Regression
Matrix Completion (Recommender systems)
Ranking
Semi-supervised learning
Machine Learning Problems: Classification
Image classification
Hand-written digit recognition
Spam filters
Binary Classification
Input: training samples {x_1, x_2, . . . , x_n} and labels {y_1, y_2, . . . , y_n}
x_i: a d-dimensional vector
y_i: +1 or −1
Output: A decision function f such that
f(x_i) > 0 if y_i = +1, f(x_i) < 0 if y_i = −1
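For a linear classifier, f is an affine function of the input; a minimal sketch with hypothetical (not learned) parameters w and b:

```python
def f(w, b, x):
    """Linear decision function f(x) = <w, x> + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def predict(w, b, x):
    """Classify by the sign of the decision function."""
    return 1 if f(w, b, x) > 0 else -1

w, b = [1.0, -1.0], 0.5            # illustrative parameters, not learned
print(predict(w, b, [2.0, 0.0]))   # f = 2.5 > 0  -> +1
print(predict(w, b, [0.0, 2.0]))   # f = -1.5 < 0 -> -1
```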
Feature generation for documents
Bag of words features for documents:
number of features = number of potential words ≈ 10,000
Feature generation for documents
Bag of n-gram features (n = 2):
10,000 words ⇒ 10,000^2 = 10^8 potential features
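Both feature maps can be sketched with a small counting function; with n = 1 this reduces to plain bag of words (the function name and whitespace tokenization are illustrative simplifications):

```python
from collections import Counter

def ngram_counts(text, n=2):
    """Bag-of-n-gram features: count each run of n consecutive words.
    With a 10,000-word vocabulary, bigrams give up to 10,000^2 = 10^8
    potential features, which is why these vectors are kept sparse."""
    words = text.lower().split()
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

counts = ngram_counts("the quick brown fox jumps over the lazy dog")
print(counts[("the", "quick")])  # 1
```

In practice only the n-grams that actually occur in a document are stored, so each feature vector is sparse even though the feature space is huge.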
Classification
> 1 million dimensional space, > 1 billion training points
Scalability challenges
Large number of features
Large number of samples
Data cannot fit into memory
Splice-site dataset: 10 million samples, 11 million features, > 1 TB of memory
Current solutions:
Intelligently swap between memory and disk
Online algorithms
Parallel algorithms on distributed systems
Other ideas?
Challenges: large number of categories
Multi-label (or multi-class) classification with a large number of labels
Image classification: > 10,000 labels
Recommending tags for articles: millions of labels (tags)
Challenges: large number of categories
Consider a problem with 1 million labels.
Traditional approach: reduce to binary problems.
Training: 1 million binary classification problems.
Need 694 days if each binary problem can be solved in 1 minute
Model size: 1 million models.
Need 1 TB if each model requires 1MB.
Predicting one test point: 1 million binary predictions
Need 1000 secs if each binary prediction takes 10^-3 secs.
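These back-of-envelope numbers follow directly from the stated unit costs; a quick check:

```python
labels = 1_000_000

# Training: one binary problem per label at 1 minute each.
train_days = labels * 1 / (60 * 24)     # minutes -> days
print(round(train_days))                 # ~694 days

# Model size: 1 MB per binary model.
model_tb = labels * 1 / 1_000_000        # MB -> TB
print(model_tb)                          # 1.0 TB

# Prediction: 10^-3 s per binary prediction, one per label.
predict_secs = labels * 1e-3
print(predict_secs)                      # 1000.0 s
```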
Machine Learning Problems: Regression
Line fitting
Polynomial curve fitting
Stock price prediction
(Figures from Dhillon et al)
Stock Price Prediction
We have p > 20,000 stocks
X_{i,t}: the price of stock i at time t; x_t = [X_{1,t}, . . . , X_{p,t}]
Find a function f such that
x_{t+1} ≈ f(x_t, x_{t−1}, . . . , x_{t−L})
p(L + 1) input variables, p output variables
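A minimal sketch of such a forecaster, assuming a linear (vector autoregressive) form for f; the coefficient matrix W, the toy sizes, and the data are all illustrative:

```python
def var_predict(W, history, L):
    """One-step forecast x_{t+1} = W z_t, where z_t stacks the last L+1
    price vectors [x_t; x_{t-1}; ...; x_{t-L}].  With p stocks this is
    p*(L+1) inputs and p outputs, so W has p * p*(L+1) entries."""
    z = [v for x in history[-(L + 1):][::-1] for v in x]  # newest lag first
    return [sum(w * zi for w, zi in zip(row, z)) for row in W]

p, L = 2, 1                        # toy sizes; the slide has p > 20,000
W = [[1.0, 0.0, 0.0, 0.0],         # hypothetical coefficients encoding
     [0.0, 1.0, 0.0, 0.0]]         # "tomorrow = today" (persistence)
history = [[10.0, 20.0], [11.0, 21.0]]
print(var_predict(W, history, L))  # [11.0, 21.0]
```

At real scale, W alone has on the order of p^2 (L + 1) entries, which is why structured or sparse models are needed.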
Machine Learning Problems: Recommender Systems
Netflix Problem
(Figure from Dhillon et al)
Machine Learning Problems: Recommender Systems
Collaborative Filtering
(Figure from Dhillon et al)
Machine Learning Problems: Recommender Systems
Latent Factor Model
(Figure from Dhillon et al)
Machine Learning Problems: Recommender Systems
Latent Factor Model
(Figure from Dhillon et al)
Machine Learning Problems: Recommender Systems
Latent Factor Model
(Figure from Dhillon et al)
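The latent factor idea in the figures can be sketched in a few lines: approximate the rating matrix R by a low-rank product UV^T, so user i's predicted rating of item j is an inner product of k-dimensional factors (all values below are illustrative, not fitted):

```python
def predict_rating(U, V, i, j):
    """Latent factor model: R ~ U V^T, so r_ij ~ <u_i, v_j>.
    Storage drops from m*n ratings to (m + n)*k factor entries."""
    return sum(u * v for u, v in zip(U[i], V[j]))

U = [[1.0, 0.5], [0.2, 1.0]]       # user factors (k = 2, illustrative)
V = [[2.0, 0.0], [1.0, 1.0]]       # item factors
print(predict_rating(U, V, 0, 0))  # 1*2 + 0.5*0 = 2.0
```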
Recommender Systems: challenges
Size of the matrix:
billions of users, billions of items, > 100 billion observations
Memory to store ratings: > 1200 GBytes
How to incorporate side information?
User/Item profiles
Temporal information, click sequence
Prediction time:
Recommend top-k items to a user:
Need to compute a row of a matrix: O(mk) time
m > 1,000,000,000, k > 500: need > 100 seconds
Recommend items to all users: 100 billion seconds ≈ 3170 years
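The O(mk) row computation above is just m inner products of length k; a minimal sketch with toy sizes (function and variable names are illustrative):

```python
import heapq

def top_items(u, V, topn=2):
    """Score every item for one user: m inner products of length k,
    i.e. O(m*k) work per user -- the per-user bottleneck on the slide."""
    scores = ((sum(a * b for a, b in zip(u, v)), j) for j, v in enumerate(V))
    return [j for _, j in heapq.nlargest(topn, scores)]

u = [1.0, 0.5]                                # one user's latent factors
V = [[0.0, 1.0], [2.0, 0.0], [1.0, 1.0]]      # item factors (m = 3, k = 2)
print(top_items(u, V))  # items 1 (score 2.0) and 2 (score 1.5)
```

With m > 10^9 items and k > 500 this loop is far too slow per user, motivating approximate techniques such as fast nearest-neighbor search over the item factors.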
Ranking
Ranking players by pair comparison
Given n items and a subset of pair comparisons, what's the ranking for each player?
Examples: Chess tournaments, . . .
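One naive baseline for this problem (not necessarily an approach covered in the course) is to rank players by raw win count; models such as Bradley-Terry or Elo refine this idea:

```python
from collections import defaultdict

def rank_by_wins(comparisons):
    """Rank players from pairwise outcomes by raw win count -- a naive
    baseline; Bradley-Terry or Elo style models refine this idea."""
    wins = defaultdict(int)
    for winner, loser in comparisons:
        wins[winner] += 1
        wins[loser] += 0            # ensure losers appear in the ranking too
    return sorted(wins, key=wins.get, reverse=True)

games = [("alice", "bob"), ("alice", "carol"), ("bob", "carol")]
print(rank_by_wins(games))  # ['alice', 'bob', 'carol']
```

Win counts ignore opponent strength, which is exactly what the more principled comparison models account for.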
Ranking
Ranking players by group comparison
Given n items and a subset of group comparisons, what's the ranking for each player?
Ranking
Ranking players by group comparison
Given n items and a subset of group comparisons, what's the ranking for each player?
How to form the best group?
Examples: Halo, LOL, Heroes of the storm player ranking, . . .
Ranking: challenges
Sample complexity: how many comparisons do we need?
Scalability: how to compute the ranking for huge datasets?
Side information: how to incorporate features?
Semi-supervised Learning
Given both labeled and unlabeled data
Is unlabeled data useful?
Figure from Wikipedia
Semi-supervised Learning
Two approaches:
Graph-based algorithms (label propagation)
Graph-based regularization
Scalability: need to construct an n × n graph
n: the total number of labeled and unlabeled samples
What if n > 1 million?
Extensions: can we apply a similar idea to other learning algorithms?
Matrix completion? Ranking?
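A minimal label propagation sketch, assuming a simple unweighted graph and iterative neighbor averaging (one common formulation; details vary by algorithm):

```python
def label_propagation(adj, labels, iters=50):
    """Graph-based semi-supervised learning: repeatedly set each
    unlabeled node's score to the mean of its neighbors' scores,
    keeping labeled nodes clamped.  The n x n graph must be built
    first, which is the scalability bottleneck noted above."""
    f = {v: labels.get(v, 0.0) for v in adj}
    for _ in range(iters):
        for v in adj:
            if v not in labels:                 # only unlabeled nodes move
                f[v] = sum(f[u] for u in adj[v]) / len(adj[v])
    return f

adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}    # a 4-node path graph
labels = {0: 1.0, 3: -1.0}                      # two labeled endpoints
scores = label_propagation(adj, labels)
print(round(scores[1], 2), round(scores[2], 2))
```

On this path graph the unlabeled nodes converge to +1/3 and -1/3, interpolating between the labeled endpoints, so the unlabeled structure really does inform the prediction.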
Scalability: Need
Information gathered today is often in terabytes, petabytes, and exabytes
About 2.5 exabytes (2.5 × 10^18 bytes) of data are added each day
Almost 90% of the world's data today was generated during the past two years!
Wal-Mart collects more than 2.5 petabytes of data every hour from itscustomer transactions
Twitter generates 7TB/day and Facebook 10TB/day
How to address these scalability challenges?
Algorithmic level:
Faster (optimization) algorithms
Approximation algorithms: useful when exact solutions are too expensive to compute
Parallel algorithms
Multi-core optimization algorithms: when the data fits on a single machine
Distributed algorithms: when the data cannot fit in the memory of a single machine
Modern architecture:
High-throughput computational clusters and fault tolerance
Tools and technologies to leverage computational resources, such as Hadoop and Spark
Parallel programming paradigms and software that enable easy adoption
Coming up
Read the class website
Next class: linear regression problems
Questions?