
Page 1

Big Data Infrastructure

Week 8: Data Mining (1/4)

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 United States License. See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details

CS 489/698 Big Data Infrastructure (Winter 2016)

Jimmy Lin
David R. Cheriton School of Computer Science

University of Waterloo

March 1, 2016

These slides are available at http://lintool.github.io/bigdata-2016w/

Page 2

Structure of the Course

“Core” framework features and algorithm design

[Course map: Analyzing Text, Analyzing Graphs, Analyzing Relational Data, Data Mining]

Page 3

Supervised Machine Learning

The generic problem of inducing a function from sample instances of inputs and their corresponding outputs

Classification: output draws from finite discrete labels

Regression: output is a continuous value

This is not meant to be an exhaustive treatment of machine learning!

Focus today: classification

Page 4

Source: Wikipedia (Sorting)

Classification

Page 5

Applications

Spam detection

Sentiment analysis

Content (e.g., genre) classification

Link prediction

Document ranking

Object recognition

Fraud detection

And much much more!

Page 6

[Diagram: training — training data → Machine Learning Algorithm → Model; testing/deployment — the trained Model assigns labels to new, unlabeled inputs.]

Supervised Machine Learning

Page 7

Objects are represented in terms of features:
“Dense” features: sender IP, timestamp, # of recipients, length of message, etc.
“Sparse” features: contains the term “viagra” in the message, contains “URGENT” in the subject, etc.

Who comes up with the features?

How?

Feature Representations
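To make the dense/sparse distinction concrete, here is a minimal sketch (not from the slides; the feature names and values are invented) of how a single email might be encoded both ways:

```scala
// Minimal sketch of dense vs. sparse feature representations for a spam-detection
// example; all feature names and values here are hypothetical.
object FeatureExample {
  // Dense features: every object has a value in every dimension.
  case class DenseFeatures(senderIp: String, timestamp: Long, numRecipients: Int, messageLength: Int)

  // Sparse features: store only the features that "fire", e.g., as a map from
  // feature name to value; absent features are implicitly 0.
  type SparseFeatures = Map[String, Double]

  def main(args: Array[String]): Unit = {
    val dense = DenseFeatures("192.0.2.1", 1456790400L, 37, 5123)
    val sparse: SparseFeatures = Map("body:viagra" -> 1.0, "subject:URGENT" -> 1.0)
    println(dense)
    println(sparse)
  }
}
```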

Page 8

Applications

Spam detection

Sentiment analysis

Content (e.g., genre) classification

Link prediction

Document ranking

Object recognition

Fraud detection

And much much more!

Features are highly application-specific!

Page 9

Components of a ML Solution

Data, Features, Model, Optimization

What “matters” the most?

Model choices: logistic regression, naïve Bayes, SVMs, random forests, perceptrons, neural networks, etc.
Optimization choices: gradient descent, stochastic gradient descent, L-BFGS, etc.

Page 10

No data like more data!

(Banko and Brill, ACL 2001) (Brants et al., EMNLP 2007)

s/knowledge/data/g;

Page 11

Limits of Supervised Classification?

Why is this a big data problem? Isn’t gathering labels a serious bottleneck?

Solution: crowdsourcing
Solution: bootstrapping and other semi-supervised techniques
Solution: user behavior logs (learning to rank, computational advertising, link recommendation)

The virtuous cycle of data-driven products

Page 12

Supervised Binary Classification

Restrict the output label to be binary: Yes/No, 1/0

Binary classifiers form a primitive building block for multi-class problems, e.g., one-vs-rest classifier ensembles and classifier cascades (see the sketch below).
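A minimal sketch of one-vs-rest prediction (all class names, weights, and inputs are invented): train one binary scorer per class, then predict the class whose scorer is most confident.

```scala
// One-vs-rest sketch: `scorers` maps each class label to a binary scoring function
// (higher score = more confident "yes"); predict the class with the highest score.
object OneVsRest {
  def predict(scorers: Map[String, Array[Double] => Double], x: Array[Double]): String =
    scorers.maxBy { case (_, score) => score(x) }._1

  // A hypothetical linear scorer over a small feature vector.
  def linear(w: Array[Double]): Array[Double] => Double =
    x => w.zip(x).map { case (wi, xi) => wi * xi }.sum

  def main(args: Array[String]): Unit = {
    val scorers = Map(
      "sports"   -> linear(Array(1.0, -0.5)),
      "politics" -> linear(Array(-0.3, 0.8)),
      "tech"     -> linear(Array(0.2, 0.2)))
    println(predict(scorers, Array(0.9, 0.1)))  // prints "sports" for this toy input
  }
}
```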

Page 13

The Task

Given $D = \{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$, where each $\mathbf{x}_i = [x_1, x_2, x_3, \ldots, x_d]$ is a (sparse) feature vector and $y \in \{0, 1\}$ is its label.

Induce a function $f : X \to Y$ such that the loss is minimized:

$$\frac{1}{n} \sum_{i=1}^{n} \ell(f(\mathbf{x}_i), y_i)$$

where $\ell$ is the loss function.

Typically, we consider functions of a parametric form, with model parameters $\theta$:

$$\arg\min_{\theta} \frac{1}{n} \sum_{i=1}^{n} \ell(f(\mathbf{x}_i; \theta), y_i)$$
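To make the objective concrete, here is a tiny sketch (all data and names invented) that evaluates the average loss $\frac{1}{n} \sum_{i} \ell(f(\mathbf{x}_i), y_i)$ of a fixed classifier using 0/1 loss:

```scala
// Average empirical loss (1/n) * sum_i l(f(x_i), y_i) over a toy dataset, using
// 0/1 loss and a trivial threshold classifier. All names and data are illustrative.
object EmpiricalLoss {
  def averageLoss(
      data: Seq[(Array[Double], Int)],
      f: Array[Double] => Int,
      loss: (Int, Int) => Double): Double =
    data.map { case (x, y) => loss(f(x), y) }.sum / data.size

  def main(args: Array[String]): Unit = {
    val data = Seq(
      (Array(0.9, 0.1), 1),
      (Array(0.2, 0.7), 0),
      (Array(0.6, 0.6), 0))
    val f = (x: Array[Double]) => if (x(0) >= 0.5) 1 else 0        // threshold on the first feature
    val zeroOne = (prediction: Int, label: Int) => if (prediction == label) 0.0 else 1.0
    println(averageLoss(data, f, zeroOne))                         // one of three misclassified: 0.333...
  }
}
```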

Page 14

Key insight: machine learning as an optimization problem! (Closed-form solutions are generally not possible.)

Page 15

Gradient Descent: Preliminaries

Rewrite the objective

$$\arg\min_{\theta} \frac{1}{n} \sum_{i=1}^{n} \ell(f(\mathbf{x}_i; \theta), y_i)$$

as simply

$$\arg\min_{\theta} L(\theta)$$

Compute the gradient, which “points” in the fastest-increasing “direction”:

$$\nabla L(\theta) = \left[ \frac{\partial L(\theta)}{\partial w_0}, \frac{\partial L(\theta)}{\partial w_1}, \ldots, \frac{\partial L(\theta)}{\partial w_d} \right]$$

So, at any point $a$, taking a small step against the gradient,

$$b = a - \gamma \nabla L(a)$$

gives $L(a) \geq L(b)$ (with caveats, e.g., the step size $\gamma$ must be small enough).
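A quick worked example (not from the slides): take $L(\theta) = \theta^2$, so $\nabla L(\theta) = 2\theta$. Starting from $a = 3$ with step size $\gamma = 0.1$:

$$b = a - \gamma \nabla L(a) = 3 - 0.1 \cdot 6 = 2.4, \qquad L(a) = 9 \;\geq\; L(b) = 5.76$$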

Page 16

Gradient Descent: Iterative Update

Start at an arbitrary point, then iteratively update:

$$\theta^{(t+1)} \leftarrow \theta^{(t)} - \gamma^{(t)} \nabla L(\theta^{(t)})$$

We then have:

$$L(\theta^{(0)}) \geq L(\theta^{(1)}) \geq L(\theta^{(2)}) \geq \ldots$$

Lots of details:
Figuring out the step size
Getting stuck in local minima
Convergence rate
…

Page 17

Gradient Descent

Repeat until convergence:

$$\theta^{(t+1)} \leftarrow \theta^{(t)} - \gamma^{(t)} \frac{1}{n} \sum_{i=1}^{n} \nabla \ell(f(\mathbf{x}_i; \theta^{(t)}), y_i)$$
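The update above is essentially the whole algorithm. A minimal, self-contained sketch of the loop follows; the data, names, and the choice of squared loss on a linear model are illustrative, not from the slides:

```scala
// Minimal batch gradient descent sketch. Model: linear prediction w.x;
// loss: squared error, whose per-example gradient is (w.x - y) * x.
object BatchGradientDescent {
  def dot(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => x * y }.sum

  def main(args: Array[String]): Unit = {
    val data = Seq(
      (Array(1.0, 2.0), 5.0),                // (feature vector, target) pairs
      (Array(1.0, -1.0), -1.0))
    var w = Array(0.0, 0.0)                  // theta^(0): arbitrary starting point
    val gamma = 0.1                          // fixed step size gamma^(t)
    for (_ <- 1 to 200) {                    // "repeat until convergence" (fixed count here)
      val grad = data
        .map { case (x, y) => x.map(_ * (dot(w, x) - y)) }            // per-example gradient
        .reduce((a, b) => a.zip(b).map { case (u, v) => u + v })      // sum over examples
        .map(_ / data.size)                                           // average over n
      w = w.zip(grad).map { case (wi, gi) => wi - gamma * gi }        // step against the gradient
    }
    println(w.mkString("w = [", ", ", "]"))  // approaches [1, 2] for this toy data
  }
}
```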

Page 18

Intuition behind the math…

$$\underbrace{\theta^{(t+1)}}_{\text{new weights}} \leftarrow \underbrace{\theta^{(t)}}_{\text{old weights}} - \underbrace{\gamma^{(t)} \frac{1}{n} \sum_{i=1}^{n} \nabla \ell(f(\mathbf{x}_i; \theta^{(t)}), y_i)}_{\text{update based on gradient}}$$

[Figure: a one-dimensional loss $\ell(x)$ and its derivative $\frac{d\ell}{dx}$, illustrating a step against the gradient.]

Page 19

Page 20

Gradient Descent

Source: Wikipedia (Hills)

$$\theta^{(t+1)} \leftarrow \theta^{(t)} - \gamma^{(t)} \frac{1}{n} \sum_{i=1}^{n} \nabla \ell(f(\mathbf{x}_i; \theta^{(t)}), y_i)$$

Page 21

Lots More Details…

Gradient descent is a “first-order” optimization technique:
Often, slow convergence
Conjugate techniques accelerate convergence

Newton and quasi-Newton methods:
Intuition: Taylor expansion

$$f(x + \Delta x) = f(x) + f'(x)\,\Delta x + \frac{1}{2} f''(x)\,\Delta x^2$$

Requires the Hessian (the square matrix of second-order partial derivatives): impractical to fully compute
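For completeness (the slide leaves this step implicit): minimizing the quadratic approximation above over the step $\Delta x$ gives Newton’s update, whose multivariate form replaces $1/f''(x)$ with the inverse Hessian $H^{-1}$:

$$\Delta x = -\frac{f'(x)}{f''(x)}
\qquad\Longrightarrow\qquad
\theta^{(t+1)} \leftarrow \theta^{(t)} - H^{-1}\, \nabla L(\theta^{(t)})$$

Quasi-Newton methods such as L-BFGS sidestep the impractical Hessian by building an approximation to $H^{-1}$ from a limited history of recent gradients.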

Page 22

Source: Wikipedia (Hammer)

Logistic Regression

Page 23

Logistic Regression: Preliminaries

Given $D = \{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$, with $\mathbf{x}_i = [x_1, x_2, x_3, \ldots, x_d]$ and $y \in \{0, 1\}$.

Let’s define a thresholded linear classifier $f(\mathbf{x}; \mathbf{w}) : \mathbb{R}^d \to \{0, 1\}$:

$$f(\mathbf{x}; \mathbf{w}) = \begin{cases} 1 & \text{if } \mathbf{w} \cdot \mathbf{x} \geq t \\ 0 & \text{if } \mathbf{w} \cdot \mathbf{x} < t \end{cases}$$

Interpretation: $\mathbf{w} \cdot \mathbf{x}$ is the log-odds of $y = 1$:

$$\ln\left[\frac{\Pr(y = 1 \mid \mathbf{x})}{\Pr(y = 0 \mid \mathbf{x})}\right] = \mathbf{w} \cdot \mathbf{x}
\qquad\text{equivalently}\qquad
\ln\left[\frac{\Pr(y = 1 \mid \mathbf{x})}{1 - \Pr(y = 1 \mid \mathbf{x})}\right] = \mathbf{w} \cdot \mathbf{x}$$
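The “algebra” referenced on the next slide is just exponentiating the log-odds and solving for the probability; spelling it out with $p = \Pr(y = 1 \mid \mathbf{x})$:

$$\ln\frac{p}{1-p} = \mathbf{w} \cdot \mathbf{x}
\;\;\Rightarrow\;\; \frac{p}{1-p} = e^{\mathbf{w} \cdot \mathbf{x}}
\;\;\Rightarrow\;\; p = \frac{e^{\mathbf{w} \cdot \mathbf{x}}}{1 + e^{\mathbf{w} \cdot \mathbf{x}}}$$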

Page 24

Relation to the Logistic Function

After some algebra:

$$\Pr(y = 1 \mid \mathbf{x}) = \frac{e^{\mathbf{w} \cdot \mathbf{x}}}{1 + e^{\mathbf{w} \cdot \mathbf{x}}}
\qquad
\Pr(y = 0 \mid \mathbf{x}) = \frac{1}{1 + e^{\mathbf{w} \cdot \mathbf{x}}}$$

The logistic function:

$$f(z) = \frac{e^z}{e^z + 1}$$

[Plot: logistic(z) for z from -8 to 8, rising from 0 toward 1, with value 0.5 at z = 0.]
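A small sketch of the logistic function and the resulting class probability for a linear model (the weights and features below are invented):

```scala
import scala.math.exp

// Logistic function and Pr(y = 1 | x) = logistic(w . x) for a linear model.
object LogisticFunction {
  def logistic(z: Double): Double = exp(z) / (exp(z) + 1.0)      // equivalently 1 / (1 + e^-z)

  def probPositive(w: Array[Double], x: Array[Double]): Double =
    logistic(w.zip(x).map { case (wi, xi) => wi * xi }.sum)

  def main(args: Array[String]): Unit = {
    println(logistic(0.0))                                       // 0.5
    val w = Array(2.0, -1.0)
    val x = Array(1.0, 0.5)
    println(probPositive(w, x))                                  // logistic(1.5), roughly 0.82
  }
}
```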

Page 25

Training an LR Classifier

Maximize the conditional likelihood:

$$\arg\max_{\mathbf{w}} \prod_{i=1}^{n} \Pr(y_i \mid \mathbf{x}_i, \mathbf{w})$$

Define the objective in terms of the conditional log likelihood:

$$L(\mathbf{w}) = \sum_{i=1}^{n} \ln \Pr(y_i \mid \mathbf{x}_i, \mathbf{w})$$

We know $y \in \{0, 1\}$, so:

$$\Pr(y \mid \mathbf{x}, \mathbf{w}) = \Pr(y = 1 \mid \mathbf{x}, \mathbf{w})^{y}\,\Pr(y = 0 \mid \mathbf{x}, \mathbf{w})^{(1-y)}$$

Substituting:

$$L(\mathbf{w}) = \sum_{i=1}^{n} \Big( y_i \ln \Pr(y_i = 1 \mid \mathbf{x}_i, \mathbf{w}) + (1 - y_i) \ln \Pr(y_i = 0 \mid \mathbf{x}_i, \mathbf{w}) \Big)$$
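The next slide states the derivative without intermediate steps. The key fact used is that for $p_i = \Pr(y_i = 1 \mid \mathbf{x}_i, \mathbf{w}) = \frac{e^{\mathbf{w} \cdot \mathbf{x}_i}}{1 + e^{\mathbf{w} \cdot \mathbf{x}_i}}$ we have $\frac{\partial p_i}{\partial \mathbf{w}} = p_i (1 - p_i)\,\mathbf{x}_i$, so each term contributes:

$$\frac{\partial}{\partial \mathbf{w}} \Big( y_i \ln p_i + (1 - y_i) \ln (1 - p_i) \Big)
= \Big( \frac{y_i}{p_i} - \frac{1 - y_i}{1 - p_i} \Big)\, p_i (1 - p_i)\, \mathbf{x}_i
= (y_i - p_i)\, \mathbf{x}_i$$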

Page 26

LR Classifier Update Rule

Take the derivative:

$$L(\mathbf{w}) = \sum_{i=1}^{n} \Big( y_i \ln \Pr(y_i = 1 \mid \mathbf{x}_i, \mathbf{w}) + (1 - y_i) \ln \Pr(y_i = 0 \mid \mathbf{x}_i, \mathbf{w}) \Big)$$

$$\frac{\partial}{\partial \mathbf{w}} L(\mathbf{w}) = \sum_{i=1}^{n} \mathbf{x}_i \Big( y_i - \Pr(y_i = 1 \mid \mathbf{x}_i, \mathbf{w}) \Big)$$

General form of the update rule (gradient ascent, since we are maximizing the log likelihood):

$$\mathbf{w}^{(t+1)} \leftarrow \mathbf{w}^{(t)} + \gamma^{(t)} \nabla_{\mathbf{w}} L(\mathbf{w}^{(t)}),
\qquad
\nabla L(\mathbf{w}) = \left[ \frac{\partial L(\mathbf{w})}{\partial w_0}, \frac{\partial L(\mathbf{w})}{\partial w_1}, \ldots, \frac{\partial L(\mathbf{w})}{\partial w_d} \right]$$

Final update rule, per component $i$:

$$w_i^{(t+1)} \leftarrow w_i^{(t)} + \gamma^{(t)} \sum_{j=1}^{n} x_{j,i} \Big( y_j - \Pr(y_j = 1 \mid \mathbf{x}_j, \mathbf{w}^{(t)}) \Big)$$
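A minimal sketch of the final update rule in action, as batch gradient ascent on a toy dataset; the data, step size, and iteration count are invented for illustration:

```scala
import scala.math.exp

// Sketch of the LR update: w <- w + gamma * sum_j x_j * (y_j - Pr(y_j = 1 | x_j, w)).
object LogisticRegressionTraining {
  def dot(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => x * y }.sum

  def probPositive(w: Array[Double], x: Array[Double]): Double =
    1.0 / (1.0 + exp(-dot(w, x)))

  def main(args: Array[String]): Unit = {
    val data = Seq(                               // (feature vector, label in {0, 1})
      (Array(1.0, 2.0), 1.0),
      (Array(1.0, 0.5), 1.0),
      (Array(1.0, -1.0), 0.0),
      (Array(1.0, -2.5), 0.0))
    var w = Array(0.0, 0.0)
    val gamma = 0.5
    for (_ <- 1 to 100) {
      val grad = data
        .map { case (x, y) => x.map(_ * (y - probPositive(w, x))) }   // x_j * (y_j - p_j)
        .reduce((a, b) => a.zip(b).map { case (u, v) => u + v })
      w = w.zip(grad).map { case (wi, gi) => wi + gamma * gi }        // note the +: gradient ascent
    }
    data.foreach { case (x, y) => println(f"label=$y%.0f  Pr(y=1|x)=${probPositive(w, x)}%.3f") }
  }
}
```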

Page 27

Lots more details…

Regularization
Different loss functions
…

Want more details? Take a real machine-learning course!

Page 28

MapReduce Implementation

$$\theta^{(t+1)} \leftarrow \theta^{(t)} - \gamma^{(t)} \frac{1}{n} \sum_{i=1}^{n} \nabla \ell(f(\mathbf{x}_i; \theta^{(t)}), y_i)$$

[Diagram: mappers compute partial gradients over their splits of the training data; a single reducer combines them and updates the model; iterate until convergence.]
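To show the structure rather than the Hadoop API, here is a sketch that emulates one such job per iteration with plain Scala collections: each “mapper” handles one split and emits a partial gradient sum plus a count, and the single “reducer” combines them and updates the model. All names, data, and the stand-in squared loss are illustrative:

```scala
// Emulation of the per-iteration dataflow above (no actual Hadoop job).
object MapReduceStyleGradient {
  type Vec = Array[Double]
  def add(a: Vec, b: Vec): Vec = a.zip(b).map { case (x, y) => x + y }

  // Per-example gradient of squared loss for a linear model (stand-in for nabla l).
  def grad(w: Vec, x: Vec, y: Double): Vec = {
    val err = w.zip(x).map { case (wi, xi) => wi * xi }.sum - y
    x.map(_ * err)
  }

  def main(args: Array[String]): Unit = {
    val splits: Seq[Seq[(Vec, Double)]] = Seq(       // one Seq per "mapper"
      Seq((Array(1.0, 2.0), 5.0)),
      Seq((Array(1.0, -1.0), -1.0), (Array(1.0, 0.0), 1.0)))
    var w: Vec = Array(0.0, 0.0)
    val gamma = 0.1
    for (_ <- 1 to 200) {                            // driver re-runs the "job" each iteration
      val partials = splits.map { split =>           // map phase: partial sum and count per split
        (split.map { case (x, y) => grad(w, x, y) }.reduce(add), split.size)
      }
      val (sum, n) = partials.reduce { case ((g1, n1), (g2, n2)) => (add(g1, g2), n1 + n2) }
      w = w.zip(sum.map(_ / n)).map { case (wi, gi) => wi - gamma * gi }  // reduce phase: update
    }
    println(w.mkString("w = [", ", ", "]"))
  }
}
```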

Page 29

Shortcomings

Hadoop is bad at iterative algorithms:
High job startup costs
Awkward to retain state across iterations

High sensitivity to skew: iteration speed is bounded by the slowest task

Potentially poor cluster utilization: must shuffle all data to a single reducer

Some possible tradeoffs:
Number of iterations vs. complexity of computation per iteration
E.g., L-BFGS: faster convergence, but more to compute

Page 30

Spark Implementation

val points = spark.textFile(...).map(parsePoint).persist()
var w = // random initial vector
for (i <- 1 to ITERATIONS) {
  val gradient = points.map { p =>
    p.x * (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y
  }.reduce((a, b) => a + b)
  w -= gradient
}

[Diagram: as on the previous slide, mappers compute partial gradients and a reducer updates the model.]

What’s the difference?

Page 31

Source: Wikipedia (Japanese rock garden)

Questions?