Page 1:

Machine Learning: Chenhao Tan
University of Colorado Boulder
LECTURE 6

Slides adapted from Jordan Boyd-Graber, Chris Ketelsen

Page 2:

• HW1 turned in
• HW2 released
• Office hour
• Group formation signup

Page 3:

Overview

Feature engineering

Revisiting Logistic Regression

Feed Forward Networks

Layers for Structured Data

Page 4:

Feature engineering

Outline

Feature engineering

Revisiting Logistic Regression

Feed Forward Networks

Layers for Structured Data

Page 5:

Feature engineering

Feature Engineering

Republican nominee George Bush said he felt nervous as he voted today in his adopted home state of Texas, where he ended...

(From Chris Harrison's WikiViz)

Page 6:

Feature engineering

Brainstorming

What features are useful for sentiment analysis?

Page 7:

Feature engineering

What features are useful for sentiment analysis?

• Unigram
• Bigram
• Normalizing options
• Part-of-speech tagging
• Parse-tree related features
• Negation-related features
• Additional resources

Page 8:

Feature engineering

Sarcasm detection

“Trees died for this book?” (book)

• find high-frequency words and content words
• replace content words with “CW”
• extract patterns, e.g., “does not CW much about CW”

[Tsur et al., 2010]

Page 10:

Feature engineering

More examples: Which one will be retweeted more?

[Tan et al., 2014]
https://chenhaot.com/papers/wording-for-propagation.html

Page 11:

Revisiting Logistic Regression

Outline

Feature engineering

Revisiting Logistic Regression

Feed Forward Networks

Layers for Structured Data

Page 12:

Revisiting Logistic Regression

Revisiting Logistic Regression

P(Y = 0 | x, β) = 1 / (1 + exp[β0 + Σi βi Xi])

P(Y = 1 | x, β) = exp[β0 + Σi βi Xi] / (1 + exp[β0 + Σi βi Xi])

L = −Σj log P(y(j) | X(j), β)
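In code, these quantities are one-liners. A minimal numpy sketch (the names X, y, beta0, beta are illustrative, and it favors clarity over numerical stability):

import numpy as np

def lr_probs(X, beta0, beta):
    # z(j) = beta0 + sum_i beta_i X_i, the log-odds of class 1;
    # exp(z) / (1 + exp(z)) equals 1 / (1 + exp(-z))
    z = beta0 + X @ beta
    return 1.0 / (1.0 + np.exp(-z))      # P(Y = 1 | x, beta) per row

def neg_log_likelihood(X, y, beta0, beta):
    # L = -sum_j log P(y(j) | X(j), beta), for binary labels y in {0, 1}
    p1 = lr_probs(X, beta0, beta)
    return -np.sum(y * np.log(p1) + (1 - y) * np.log(1 - p1))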

Page 13:

Revisiting Logistic Regression

Revisiting Logistic Regression

• Transformation on x (we map class labels from {0, 1} to {1, 2}):

li = βiᵀx, i = 1, 2

oi = exp(li) / Σc∈{1,2} exp(lc), i = 1, 2

• Objective function (using cross entropy −Σi pi log qi):

L(Ŷ, Y) = −Σj [ P(y(j) = 1) log P(y(j) = 1 | x(j), β) + P(y(j) = 0) log P(y(j) = 0 | x(j), β) ]
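A minimal numpy sketch of this two-logit view; beta1, beta2, and x are hypothetical values chosen only to make the example run:

import numpy as np

def softmax(l):
    e = np.exp(l - np.max(l))          # shift by the max for numerical stability
    return e / e.sum()

beta1 = np.array([0.3, -0.2])          # hypothetical class-1 weights
beta2 = np.array([1.0, 0.5])           # hypothetical class-2 weights
x = np.array([1.0, 2.0])

l = np.array([beta1 @ x, beta2 @ x])   # li = beta_i^T x, i = 1, 2
o = softmax(l)                          # oi = exp(li) / sum_c exp(lc)

# Cross entropy -sum_i pi log qi with a one-hot target picks out one term:
y = 1                                   # true class "2" (0-based index)
loss = -np.log(o[y])

Note that only the difference beta2 − beta1 matters: softmax over the two logits reproduces the sigmoid probabilities of the previous slide.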

Page 14:

Revisiting Logistic Regression

Logistic Regression as a Single-layer Neural Network

(Figure: input layer x1, x2, …, xd → linear layer l1, l2 → softmax outputs o1, o2)

Page 15:

Revisiting Logistic Regression

Logistic Regression as a Single-layer Neural Network

(Figure: input layer x1, x2, …, xd → single layer → outputs o1, o2)

Page 16:

Feed Forward Networks

Outline

Feature engineering

Revisiting Logistic Regression

Feed Forward Networks

Layers for Structured Data

Page 17:

Feed Forward Networks

Deep Neural networks

A two-layer example (one hidden layer)

(Figure: input layer x1, x2, …, xd → hidden layer → output layer o1, o2)

Page 18:

Feed Forward Networks

Deep Neural networks

More layers:

(Figure: input layer x1, x2, …, xd → Hidden 1 → Hidden 2 → Hidden 3 → output layer o1, o2)

Page 19:

Feed Forward Networks

Forward propagation algorithm

How do we make predictions based on a multi-layer neural network?
Store the biases for layer l in bl and the weight matrix in Wl.

(Figure: network with inputs x1, x2, …, xd and per-layer parameters W1, b1 → W2, b2 → W3, b3 → W4, b4 → outputs o1, o2)

Page 20:

Feed Forward Networks

Forward propagation algorithm

Suppose your network has L layers.
Make a prediction based on test point x:

1: Initialize a0 = x
2: for l = 1 to L do
3:   zl = Wl a(l−1) + bl
4:   al = g(zl)
5: end for
6: The prediction ŷ is simply aL
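A minimal numpy sketch of this loop, assuming ReLU for g and lists Ws, bs of per-layer parameters; in practice the final layer would typically apply softmax instead of g:

import numpy as np

def g(z):
    return np.maximum(0.0, z)          # ReLU as the running example

def forward(x, Ws, bs):
    # a0 = x; for l = 1..L: zl = Wl a(l-1) + bl, al = g(zl); return aL
    a = x
    for W, b in zip(Ws, bs):
        a = g(W @ a + b)
    return a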

Page 21:

Feed Forward Networks

Nonlinearity

What happens if there is no nonlinearity?

Linear combinations of linear combinations are still linear combinations.

Page 23:

Feed Forward Networks

Neural networks in a nutshell

• Training data Strain = {(x, y)}
• Network architecture (model): ŷ = fw(x)
• Loss function (objective function): L(ŷ, y)
• Learning (next lecture)

Page 24:

Feed Forward Networks

Nonlinearity Options

• Sigmoid
  f(x) = 1 / (1 + exp(−x))
• tanh
  f(x) = (exp(x) − exp(−x)) / (exp(x) + exp(−x))
• ReLU (rectified linear unit)
  f(x) = max(0, x)
• softmax
  softmax(x)i = exp(xi) / Σj exp(xj)

https://keras.io/activations/
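A minimal numpy sketch of these options (softmax shifted by its max for numerical stability):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return np.tanh(x)                  # (exp(x) - exp(-x)) / (exp(x) + exp(-x))

def relu(x):
    return np.maximum(0.0, x)

def softmax(x):
    e = np.exp(x - np.max(x))          # shift by the max for stability
    return e / e.sum()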

Page 26:

Feed Forward Networks

Loss Function Options

• ℓ2 loss: Σi (ŷi − yi)²
• ℓ1 loss: Σi |ŷi − yi|
• Cross entropy: −Σi yi log ŷi
• Hinge loss (more on this during SVM): max(0, 1 − yŷ)

https://keras.io/losses/
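A minimal numpy sketch of these losses, assuming y_hat and y are numpy arrays (one-hot or a distribution for cross entropy, labels in {−1, +1} for hinge):

import numpy as np

def l2_loss(y_hat, y):
    return np.sum((y_hat - y) ** 2)

def l1_loss(y_hat, y):
    return np.sum(np.abs(y_hat - y))

def cross_entropy(y_hat, y):
    # y is one-hot (or a distribution); y_hat is a predicted distribution
    return -np.sum(y * np.log(y_hat))

def hinge_loss(score, y):
    # y in {-1, +1}; score is the raw (unthresholded) prediction
    return np.maximum(0.0, 1.0 - y * score)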

Page 27:

Feed Forward Networks

A Perceptron Example

x = (x1, x2), y = f(x1, x2)

(Figure: perceptron with inputs x1, x2, bias b, and output o1)

We consider a simple activation function:

f(z) = { 1 if z ≥ 0; 0 if z < 0 }
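A minimal sketch of this unit, with hypothetical names w and b; the gate examples on the next slides plug concrete values into it:

import numpy as np

def step(z):
    # f(z) = 1 if z >= 0 else 0
    return (np.asarray(z) >= 0).astype(float)

def perceptron(x, w, b):
    return step(w @ x + b)             # o1 = f(w^T x + b)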

Page 29:

Feed Forward Networks

A Perceptron Example

Simple Example: Can we learn OR?

x1           0  1  0  1
x2           0  0  1  1
y = x1 ∨ x2  0  1  1  1

w = (1, 1), b = −0.5

(Figure: perceptron with inputs x1, x2, bias b, and output o1)

Page 31:

Feed Forward Networks

A Perceptron Example

Simple Example: Can we learn AND?

x1           0  1  0  1
x2           0  0  1  1
y = x1 ∧ x2  0  0  0  1

w = (1, 1), b = −1.5

(Figure: perceptron with inputs x1, x2, bias b, and output o1)

Page 33:

Feed Forward Networks

A Perceptron Example

Simple Example: Can we learn NAND?

x1              0  1  0  1
x2              0  0  1  1
y = ¬(x1 ∧ x2)  1  0  0  0

w = (−1,−1), b = 0.5

(Figure: perceptron with inputs x1, x2, bias b, and output o1)
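Using the step activation, a small sketch that checks the OR, AND, and NAND weights from these slides against their truth tables:

import numpy as np

def step(z):
    return (z >= 0).astype(float)

X = np.array([[0, 0], [1, 0], [0, 1], [1, 1]], dtype=float)

gates = {
    "OR":   (np.array([1.0, 1.0]),   -0.5),
    "AND":  (np.array([1.0, 1.0]),   -1.5),
    "NAND": (np.array([-1.0, -1.0]),  0.5),
}

for name, (w, b) in gates.items():
    print(name, step(X @ w + b))
# OR   [0. 1. 1. 1.]
# AND  [0. 0. 0. 1.]
# NAND [1. 0. 0. 0.]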

Page 35:

Feed Forward Networks

A Perceptron Example

Simple Example: Can we learn XOR?

x1          0  1  0  1
x2          0  0  1  1
x1 XOR x2   0  1  1  0

NOPE!

But why?

The single-layer perceptron is just a linear classifier, and can only learn things that are linearly separable.

How can we fix this?

Page 40:

Feed Forward Networks

A Perceptron Example

Increase the number of layers.

x1          0  1  0  1
x2          0  0  1  1
x1 XOR x2   0  1  1  0

(Figure: two-layer network with inputs x1, x2, bias units b, hidden units h1, h2, and output o1)

W1 = [1 1; −1 −1], b1 = [−0.5; 1.5]
W2 = [1 1], b2 = −1.5
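A small sketch that pushes all four inputs through this two-layer network with the step activation, recovering the XOR column (W2 is written as a 1x2 row so that z = W a + b holds):

import numpy as np

def step(z):
    return (z >= 0).astype(float)

W1 = np.array([[1.0, 1.0], [-1.0, -1.0]])   # rows: an OR unit and a NAND unit
b1 = np.array([-0.5, 1.5])
W2 = np.array([[1.0, 1.0]])                 # an AND unit over (h1, h2)
b2 = np.array([-1.5])

for x in np.array([[0, 0], [1, 0], [0, 1], [1, 1]], dtype=float):
    h = step(W1 @ x + b1)                   # h1 = OR(x1, x2), h2 = NAND(x1, x2)
    o = step(W2 @ h + b2)                   # o = AND(h1, h2) = XOR(x1, x2)
    print(x, "->", o)                       # outputs 0, 1, 1, 0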

Page 41:

Feed Forward Networks

General Expressiveness of Neural Networks

Neural networks with a single hidden layer can approximate any measurable function [Hornik et al., 1989, Cybenko, 1989].

Page 42:

Layers for Structured Data

Outline

Feature engineering

Revisiting Logistic Regression

Feed Forward Networks

Layers for Structured Data

Page 43:

Layers for Structured Data

Structured data

Spatial information

https://www.reddit.com/r/aww/comments/6ip2la/before_and_after_she_was_told_she_was_a_good_girl/

Page 44:

Layers for Structured Data

Convolutional Layers

Sharing parameters across patches

(Figure: convolution mapping an input image or input feature map to output feature maps)

https://github.com/davidstutz/latex-resources/blob/master/tikz-convolutional-layer/convolutional-layer.tex
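A minimal numpy sketch of the parameter-sharing idea: one small kernel is reused on every patch of the input (shapes are hypothetical, and no deep-learning library is assumed):

import numpy as np

def conv2d(image, kernel):
    # "Valid" 2D cross-correlation: the same kernel weights are shared
    # across every patch of the input, instead of one weight per pixel.
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)   # hypothetical 5x5 input
kernel = np.array([[1.0, 0.0], [0.0, -1.0]])       # hypothetical 2x2 kernel
print(conv2d(image, kernel).shape)                 # (4, 4) output feature map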

Page 45:

Layers for Structured Data

Structured data

Sequential information

“My words fly up, my thoughts remain below: Words without thoughts never to heaven go.”

—Hamlet

• language
• activity history

x = (x1, . . . , xT)

Page 49:

Layers for Structured Data

Recurrent Layers

Sharing parameters along a sequence

ht = f(xt, ht−1)

Long short-term memory
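A minimal numpy sketch of a vanilla recurrent cell, assuming tanh and hypothetical weight names; an LSTM replaces this cell with a gated variant:

import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    # ht = f(xt, h(t-1)): one affine map plus tanh; the same parameters
    # are shared across all positions t in the sequence.
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

T, d, k = 4, 3, 5                      # hypothetical sequence/feature/state sizes
rng = np.random.default_rng(0)
xs = rng.normal(size=(T, d))           # x = (x1, ..., xT)
W_xh = rng.normal(size=(k, d))
W_hh = rng.normal(size=(k, k))
b_h = np.zeros(k)

h = np.zeros(k)                        # initial hidden state
for x_t in xs:
    h = rnn_step(x_t, h, W_xh, W_hh, b_h)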

Page 50:

Layers for Structured Data

What is missing?

• How to find good weights?
• How to make the model work (regularization, architecture, etc.)?

Page 51:

Layers for Structured Data

References (1)

George Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals, and Systems (MCSS), 2(4):303–314, 1989.

Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359–366, 1989.

Chenhao Tan, Lillian Lee, and Bo Pang. The effect of wording on message propagation: Topic- and author-controlled natural experiments on Twitter. In Proceedings of ACL, 2014.

Oren Tsur, Dmitry Davidov, and Ari Rappoport. ICWSM - A Great Catchy Name: Semi-Supervised Recognition of Sarcastic Sentences in Online Product Reviews. In Proceedings of ICWSM, 2010.
