Introduction to Machine Learning
Marc Toussaint
July 5, 2016
This is a direct concatenation and reformatting of all lecture slides and exercises from the Machine Learning course (summer term 2016, U Stuttgart), including indexing to help prepare for exams.
Contents
1 Introduction

2 Regression
   Linear regression (2:3), Features (2:6), Regularization (2:11), Estimator variance (2:12), Ridge regularization (2:15), Ridge regression (2:15), Cross validation (2:17), Lasso regularization (2:19), Feature selection (2:19), Dual formulation of ridge regression (2:24), Dual formulation of Lasso regression (2:25)

3 Classification & Structured Output
   Discriminative function (3:1)
3.1 Logistic regression: Binary case
   Logistic regression (binary) (3:6)
3.2 Logistic regression: Multi-class case
   Logistic regression (multi-class) (3:14)
3.3 Structured Output & Structured Input
   Structured output and structured input (3:18)
3.4 Conditional Random Fields
   Conditional random fields (3:23)

4 Kernelization & Structured Input
   Kernel trick (4:1), Kernel ridge regression (4:1)

5 The Breadth of ML methods
5.1 Other loss functions
   Robust error functions (5:4), Least squares loss (5:5), Forsyth loss (5:5), Huber loss (5:5), SVM regression loss (5:6), Neg-log-likelihood loss (5:7), Hinge loss (5:7), Class Huber loss (5:7), Support Vector Machines (SVMs) (5:8), Hinge loss leads to max margin linear program (5:12)
5.2 Neural Networks
   Neural networks (5:20), Deep learning by regularizing the embedding (5:26), Autoencoders (5:28)
5.3 Unsupervised learning
   Principal Component Analysis (PCA) (5:32), Kernel PCA (5:39), Spectral clustering (5:43), Multidimensional scaling (5:47), ISOMAP (5:50), Non-negative matrix factorization (5:53), Factor analysis (5:55), Independent component analysis (5:56), Partial least squares (PLS) (5:57), Clustering (5:60), k-means clustering (5:61), Gaussian mixture model (5:65), Agglomerative Hierarchical Clustering (5:67)
5.4 Local learning
   Smoothing kernel (5:70), kNN smoothing kernel (5:71), Epanechnikov quadratic smoothing kernel (5:71), kd-trees (5:73)
5.5 Combining weak and randomized learners
   Bootstrapping (5:78), Bagging (5:78), Model averaging (5:80), Weak learners as features (5:81), Boosting (5:82), AdaBoost (5:83), Gradient boosting (5:87), Decision trees (5:91), Boosting decision trees (5:96), Random forests (5:97), Centering and whitening (5:99)

6 Bayesian Regression & Classification
   Bayesian (kernel) ridge regression (6:4), Predictive distribution (6:7), Gaussian Process (6:11), Bayesian Kernel Ridge Regression (6:11), Bayesian (kernel) logistic regression (6:17), Gaussian Process Classification (6:20), Bayesian Kernel Logistic Regression (6:20), Bayesian Neural Networks (6:22)

A Probability Basics
   Inference: general meaning (7:3)
A.1 Basic definitions
   Random variables (7:8), Probability distribution (7:9), Joint distribution (7:10), Marginal (7:10), Conditional distribution (7:10), Bayes' Theorem (7:12), Multiple RVs, conditional independence (7:13), Learning as Bayesian inference (7:19)
A.2 Probability distributions
   Bernoulli and Binomial distributions (7:21), Beta (7:22), Multinomial (7:25), Dirichlet (7:26), Conjugate priors (7:30), Dirac distribution (7:32), Gaussian (7:33), Particle approximation of a distribution (7:37), Utilities and Decision Theory (7:40), Entropy (7:41), Kullback-Leibler divergence (7:42), Monte Carlo methods (7:43), Rejection sampling (7:44), Importance sampling (7:45), Student's t, Exponential, Laplace, Chi-squared, Gamma distributions (7:47)

B Graphical Models
   Bayesian Network (8:3), Conditional independence in a Bayes Net (8:9), Inference: general meaning (8:14)
B.1 Inference in Graphical Models
   Inference in graphical models: overview (8:22), Rejection sampling (8:25), Importance sampling (8:28), Gibbs sampling (8:31), Variable elimination (8:34), Factor graph (8:37), Belief propagation (8:43), Message passing (8:43), Loopy belief propagation (8:46), Junction tree algorithm (8:48), Maximum a-posteriori (MAP) inference (8:52), Conditional random field (8:53)
B.2 Learning in Graphical Models
   Maximum likelihood parameter estimate (8:57), Bayesian parameter estimate (8:57), Bayesian prediction (8:57), Maximum a posteriori (MAP) parameter estimate (8:57), Learning with missing data (8:61), Expectation Maximization (EM) algorithm (8:63), Expected data log-likelihood (8:63), Gaussian mixture model (8:66), Mixture models (8:71), Hidden Markov model (8:72), HMM inference (8:78), Free energy formulation of EM (8:86), Maximum conditional likelihood (8:91), Conditional random field (8:93)

7 Exercises
7.1 Exercise 1
7.2 Exercise 2
7.3 Exercise 3
7.4 Exercise 4
7.5 Exercise 5
7.6 Exercise 6
7.7 Exercise 7
7.8 Exercise 8
7.9 Exercise 9
7.10 Exercise 10
7.11 Exercise 11

Index
1 Introduction
What is Machine Learning?
1) A long list of methods/algorithms for different data analysis problems
– in sciences
– in commerce

2) Frameworks to develop your own learning algorithm/method

3) Machine Learning = information theory/statistics + computer science
1:1
What is Machine Learning?
• Pedro Domingos: A Few Useful Things to Know about Machine Learning
LEARNING = REPRESENTATION + EVALUATION + OPTIMIZATION

• "Representation": Choice of model, choice of hypothesis space
• "Evaluation": Choice of objective function, optimality principle

Notes: The prior is both a choice of representation and, usually, a part of the objective function. In Bayesian settings, the choice of model often directly implies the "objective function".

• "Optimization": The algorithm to compute/approximate the best model
1:2
Pedro Domingos: A Few Useful Things to Know about Machine Learning
• It's generalization that counts
– Data alone is not enough
– Overfitting has many faces
– Intuition fails in high dimensions
– Theoretical guarantees are not what they seem
• Feature engineering is the key
• More data beats a cleverer algorithm
• Learn many models, not just one
• Simplicity does not imply accuracy
• Representable does not imply learnable
• Correlation does not imply causation
1:3
Machine Learning is everywhere
NSA, Amazon, Google, Zalando, Trading, ...
Chemistry, Biology, Physics, ...
Control, Operations Research, Scheduling, ...

• Machine Learning ∼ Information Processing (e.g. Bayesian ML)
1:4
Face recognition
keypoints
eigenfaces
(e.g., Viola & Jones)
1:5
Hand-written digit recognition (US postal data)
(e.g., Yann LeCun)
1:6
Gene annotation
(Gunnar Rätsch, Tübingen, mGene Project)
1:7
Speech recognition
(This is the basis of all commercial products)
1:8
Spam filters
1:9
Machine Learning became an important technology in science as well as commerce
1:10
Examples of ML for behavior...
1:11

Learning motor skills

(around 2000, by Schaal, Atkeson, Vijayakumar)
1:12
Learning to walk
(Russ Tedrake et al.)
1:13
Organization of this lecture
See TOC of last year’s slide collection
• Part 1: The Core: Regression & Classification
• Part 2: The Breadth of ML methods
• Part 3: Bayesian Methods
1:14
Is this a theoretical or practical course?
Neither alone.
• The goal is to teach how to design good learning algorithms

data
↓
modelling [requires theory & practise]
↓
algorithms [requires practise & theory]
↓
testing, problem identification, restart
1:15
How much math do you need?
• Let L(x) = ||y − Ax||². What is argmin_x L(x)?

• Find min_x ||y − Ax||²  s.t.  x_i ≤ 1

• Given a discriminative function f(x, y) we define

  p(y | x) = e^{f(y,x)} / Σ_{y'} e^{f(y',x)}

• Let A be the covariance matrix of a Gaussian. What does the Singular Value Decomposition A = V D V^T tell us?
1:16
How much coding do you need?
• A core subject of this lecture: learning to go from principles (math) to code
• Many exercises will implement algorithms we derived in the lecture and collectexperience on small data sets
• The choice of language is fully free. I support C++; tutors might prefer Python; Octave/Matlab or R are also good choices.
1:17
Books
Trevor Hastie, Robert Tibshirani and Jerome Friedman: The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, Second Edition, 2009.
http://www-stat.stanford.edu/~tibs/ElemStatLearn/
(recommended: read the introductory chapter)

(this course will not go to the full depth in math of Hastie et al.)
1:18
Books
Bishop, C. M.: Pattern Recognition and Machine Learning. Springer, 2006
http://research.microsoft.com/en-us/um/people/cmbishop/prml/
(some chapters are fully online)
1:19
Books & Readings
• more recently:
– David Barber: Bayesian Reasoning and Machine Learning
– Kevin Murphy: Machine learning: a Probabilistic Perspective
• See the readings at the bottom of:http://ipvs.informatik.uni-stuttgart.de/mlr/marc/teaching/index.html#readings
1:20
Organization
• Course Webpage:http://ipvs.informatik.uni-stuttgart.de/mlr/marc/teaching/16-MachineLearning/
– Slides, Exercises & Software (C++)
– Links to books and other resources
• Admin things, please first ask:
Carola Stahl, [email protected], Room 2.217
• Rules for the tutorials:
– Doing the exercises is crucial!
– Only "Votieraufgaben" (voting exercises). At the beginning of each tutorial:
  – sign into a list
  – mark which exercises you have (successfully) worked on
– Students are randomly selected to present their solutions
– You need 50% of completed exercises to be allowed to the exam
– Please check 2 weeks before the end of the term whether you can take the exam
1:21
2 Regression
[Two example plots: (left) 1D training data with a fitted regression model; (right) 2D training data with a learned decision boundary]
• Are these linear models? Linear in what?
– Input: No.
– Parameters, features: Yes!
2:1
Linear Modelling is more powerful than it might seem at first!
• Linear Regression on non-linear features → very powerful (polynomials, piece-wise, spline basis, kernels)
• Regularization (Ridge, Lasso) & cross-validation for proper generalization to test data
• Gaussian Processes and SVMs are closely related (linear in kernel features, but with different optimality criteria)
• Liquid/Echo State Machines, Extreme Learning, are examples of linear modelling on many (sort of random) non-linear features
• Basic insights in model complexity (effective degrees of freedom)
• Input relevance estimation (z-score) and feature selection (Lasso)
• Linear regression → linear classification (logistic regression: outputs are likelihood ratios)

⇒ Linear modelling is a core of ML
(We roughly follow Hastie, Tibshirani, Friedman: Elements of Statistical Learning)
2:2
Linear Regression
• Notation:
– input vector x ∈ R^d
– output value y ∈ R
– parameters β = (β_0, β_1, .., β_d)^T ∈ R^{d+1}
– linear model: f(x) = β_0 + Σ_{j=1}^d β_j x_j

• Given training data D = {(x_i, y_i)}_{i=1}^n we define the least squares cost (or "loss")

  L^ls(β) = Σ_{i=1}^n (y_i − f(x_i))²
2:3
Optimal parameters β
• Augment the input vector with a 1 in front: x = (1, x) = (1, x_1, .., x_d)^T ∈ R^{d+1}
  β = (β_0, β_1, .., β_d)^T ∈ R^{d+1}
  f(x) = β_0 + Σ_{j=1}^d β_j x_j = x^T β

• Rewrite the sum of squares as:

  L^ls(β) = Σ_{i=1}^n (y_i − x_i^T β)² = ||y − Xβ||²

  X = [ x_1^T ; ... ; x_n^T ] = [ 1  x_{1,1}  x_{1,2}  ···  x_{1,d} ; ... ; 1  x_{n,1}  x_{n,2}  ···  x_{n,d} ] ,   y = (y_1, .., y_n)^T

• Optimum:

  0_d^T = ∂L^ls(β)/∂β = −2(y − Xβ)^T X   ⟺   0_d = X^T X β − X^T y

  β^ls = (X^T X)^{-1} X^T y
2:4
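The closed-form solution above can be sketched in a few lines of NumPy. The toy data below (y = 2x + 1 plus noise) is an assumed example, not the lecture's demo code:

```python
import numpy as np

# Toy data from y = 2x + 1 plus small noise (assumed example)
rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 20)
y = 2 * x + 1 + 0.1 * rng.standard_normal(20)

# Augment each input with a leading 1 (the constant feature)
X = np.column_stack([np.ones_like(x), x])   # shape (n, d+1)

# beta^ls = (X^T X)^{-1} X^T y, computed via a linear solve (more stable than an explicit inverse)
beta_ls = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_ls)   # close to [1, 2]
```

Using `np.linalg.solve` on the normal equations avoids forming the matrix inverse explicitly.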
[Two example plots of linear fits on 1D training data ('train' and 'model' curves, gnuplot output)]
./x.exe -mode 1 -dataFeatureType 1 -modelFeatureType 1
2:5
Non-linear features
• Replace the inputs x_i ∈ R^d by some non-linear features φ(x_i) ∈ R^k

  f(x) = Σ_{j=1}^k φ_j(x) β_j = φ(x)^T β

• The optimal β is the same:

  β^ls = (X^T X)^{-1} X^T y    but now with  X = [ φ(x_1)^T ; ... ; φ(x_n)^T ] ∈ R^{n×k}

• What are "features"?
a) Features are an arbitrary set of basis functions
b) Any function linear in β can be written as f(x) = φ(x)^T β for some φ, which we denote as "features"
2:6
Example: Polynomial features
• Linear: φ(x) = (1, x_1, .., x_d) ∈ R^{1+d}

• Quadratic: φ(x) = (1, x_1, .., x_d, x_1², x_1 x_2, x_1 x_3, .., x_d²) ∈ R^{1 + d + d(d+1)/2}

• Cubic: φ(x) = (.., x_1³, x_1² x_2, x_1² x_3, .., x_d³) ∈ R^{1 + d + d(d+1)/2 + d(d+1)(d+2)/6}
[Diagram: the input x = (x_1, .., x_d) is mapped to features φ(x) = (1, x_1, .., x_d, x_1², x_1x_2, x_1x_3, .., x_d²), which are linearly combined with weights β to give f(x) = φ(x)^T β]
./x.exe -mode 1 -dataFeatureType 1 -modelFeatureType 1
2:7
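The quadratic feature map above can be written directly; `quadratic_features` is a hypothetical helper name mirroring the slide's φ:

```python
import numpy as np

def quadratic_features(x):
    """phi(x) = (1, x_1, .., x_d, x_1^2, x_1 x_2, .., x_d^2)."""
    d = len(x)
    # all products x_i x_j with i <= j: d(d+1)/2 terms
    quad = [x[i] * x[j] for i in range(d) for j in range(i, d)]
    return np.concatenate([[1.0], x, quad])

phi = quadratic_features(np.array([2.0, 3.0]))
print(phi)        # [1. 2. 3. 4. 6. 9.]
print(len(phi))   # 1 + d + d(d+1)/2 = 6 for d = 2
```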
Example: Piece-wise features (in 1D)
• Piece-wise constant: φ_j(x) = [ξ_j < x ≤ ξ_{j+1}]

• Piece-wise linear: φ_j(x) = (1, x)^T [ξ_j < x ≤ ξ_{j+1}]

• Continuous piece-wise linear: φ_j(x) = [x − ξ_j]_+   (and φ_0(x) = x)
2:8
Example: Radial Basis Functions (RBF)
• Given a set of centers {c_j}_{j=1}^k, define

  φ_j(x) = b(x, c_j) = e^{−½ ||x − c_j||²}  ∈ [0, 1]

Each φ_j(x) measures similarity with the center c_j

• Special case: use all training inputs {x_i}_{i=1}^n as centers

  φ(x) = (1, b(x, x_1), .., b(x, x_n))^T    (n+1 dimensional)

This is related to "kernel methods" and GPs, but not quite the same; we'll discuss this later.
2:9
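A sketch of the RBF feature vector with a constant feature prepended, as on the slide. The `width` parameter is my addition (the lecture code exposes a similar `-rbfWidth` option); the slide's b(x, c) corresponds to width = 1:

```python
import numpy as np

def rbf_features(x, centers, width=1.0):
    """phi(x) = (1, b(x, c_1), .., b(x, c_k)) with b(x, c) = exp(-||x - c||^2 / (2 width^2))."""
    d2 = np.sum((centers - x) ** 2, axis=1)          # squared distances to all centers
    return np.concatenate([[1.0], np.exp(-0.5 * d2 / width ** 2)])

centers = np.array([[0.0], [1.0], [2.0]])            # e.g. the training inputs themselves
phi = rbf_features(np.array([1.0]), centers)
print(phi)   # similarity is exactly 1 at the matching center
```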
Features
• Polynomial• Piece-wise• Radial basis functions (RBF)• Splines (see Hastie Ch. 5)
• Linear regression on top of rich features is extremely powerful!
2:10
The need for regularization
Noisy sine data fitted with radial basis functions:
[Plot: training data 'z.train' and the heavily overfitted RBF model 'z.model']

./x.exe -mode 1 -n 40 -modelFeatureType 4 -dataType 2 -rbfWidth .1 -sigma .5 -lambda 1e-10
• Overfitting & generalization:
The model overfits to the data, and generalizes badly

• Estimator variance:
When you repeat the experiment (keeping the underlying function fixed), the regression always returns a different model estimate
2:11
Estimator variance
• Assumptions:

– We computed parameters β = (X^T X)^{-1} X^T y
– The data was noisy with variance Var{y} = σ² I_n

Then

  Var{β} = (X^T X)^{-1} σ²

– high data noise σ → high estimator variance
– more data → less estimator variance: Var{β} ∝ 1/n

• In practise we don't know σ, but we can estimate it based on the deviation from the learnt model:

  σ̂² = 1/(n − d − 1) · Σ_{i=1}^n (y_i − f(x_i))²
2:12
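A quick sanity check of this estimator on synthetic data with known noise level (the data-generating setup below is assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, sigma = 200, 3, 0.5

# Synthetic regression data with known noise standard deviation sigma
X = np.column_stack([np.ones(n), rng.standard_normal((n, d))])
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta_true + sigma * rng.standard_normal(n)

# Fit least squares, then estimate sigma^2 from the residuals with the (n - d - 1) correction
beta = np.linalg.solve(X.T @ X, X.T @ y)
sigma2_hat = np.sum((y - X @ beta) ** 2) / (n - d - 1)
print(sigma2_hat)   # close to sigma^2 = 0.25
```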
Estimator variance
• "Overfitting" ← picking one specific data set y ∼ N(y_mean, σ² I_n)
  ↔ picking one specific β ∼ N(β_mean, (X^T X)^{-1} σ²)

• If we could reduce the variance of the estimator, we could reduce overfitting, and increase generalization.
2:13
Hastie's section on shrinkage methods is great! It describes several ideas on reducing estimator variance by reducing model complexity. We focus on regularization.
2:14
Ridge regression: L2-regularization
• We add a regularization to the cost:

  L^ridge(β) = Σ_{i=1}^n (y_i − φ(x_i)^T β)² + λ Σ_{j=2}^k β_j²

NOTE: β_1 is usually not regularized!

• Optimum:

  β^ridge = (X^T X + λI)^{-1} X^T y

(where I = I_k, or with I_{1,1} = 0 if β_1 is not regularized)
2:15
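A minimal NumPy sketch of exactly this optimum, with I_{1,1} = 0 so the bias is not regularized (the test data below are assumed for illustration):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """beta^ridge = (X^T X + lam*I)^{-1} X^T y, with I_{1,1} = 0: the bias is not regularized."""
    I = np.eye(X.shape[1])
    I[0, 0] = 0.0          # first column of X is the constant feature
    return np.linalg.solve(X.T @ X + lam * I, X.T @ y)

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.standard_normal((50, 2))])
y = X @ np.array([1.0, 2.0, -1.0]) + 0.1 * rng.standard_normal(50)
print(ridge_fit(X, y, lam=1.0))   # non-bias weights are shrunk towards zero
```

For λ → 0 this recovers the ordinary least squares solution.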
18 Introduction to Machine Learning, Marc Toussaint
• The objective is now composed of two "potentials": the loss, which depends on the data and jumps around (introduces variance), and the regularization penalty (sitting steadily at zero). Both are "pulling" at the optimal β → the regularization reduces variance.

• The estimator variance becomes smaller: Var{β} = (X^T X + λI)^{-1} σ²

• The ridge effectively reduces the complexity of the model:

  df(λ) = Σ_{j=1}^d d_j² / (d_j² + λ)

where d_j² are the eigenvalues of X^T X = V D² V^T

(details: Hastie 3.4.1)
2:16
Choosing λ: generalization error & cross validation
• λ = 0 will always have a lower training data error.
We need to estimate the generalization error on test data.

• k-fold cross-validation:

  D_1  D_2  · · ·  D_i  · · ·  D_k

1: Partition data D in k equal-sized subsets D = {D_1, .., D_k}
2: for i = 1, .., k do
3:   compute β̂_i on the training data D \ D_i, leaving out D_i
4:   compute the error ℓ_i = L^ls(β̂_i, D_i)/|D_i| on the validation data D_i
5: end for
6: report mean squared error ℓ̂ = (1/k) Σ_i ℓ_i and variance 1/(k−1) [(Σ_i ℓ_i²) − k ℓ̂²]

• Choose λ for which ℓ̂ is smallest
2:17
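The k-fold procedure above, sketched for ridge regression. The helper name `cv_mse` and the synthetic data are my own; the lecture's `./x.exe -mode 4` does the equivalent:

```python
import numpy as np

def cv_mse(X, y, lam, k=5):
    """k-fold cross-validation estimate of the mean squared error of ridge regression."""
    n = len(y)
    folds = np.array_split(np.arange(n), k)
    errs = []
    for val in folds:
        train = np.setdiff1d(np.arange(n), val)     # D \ D_i
        I = np.eye(X.shape[1]); I[0, 0] = 0.0       # bias not regularized
        beta = np.linalg.solve(X[train].T @ X[train] + lam * I, X[train].T @ y[train])
        errs.append(np.mean((y[val] - X[val] @ beta) ** 2))   # error on D_i
    return np.mean(errs)

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(60), rng.standard_normal((60, 3))])
y = X @ np.array([1.0, 2.0, 0.0, -1.0]) + 0.2 * rng.standard_normal(60)

lambdas = [1e-3, 1e-1, 10.0]
errs = {lam: cv_mse(X, y, lam) for lam in lambdas}
best = min(errs, key=errs.get)   # choose the lambda with the smallest CV error
print(best, errs[best])
```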
quadratic features on sine data:
[Plot: mean squared error (cv error and training error) as a function of lambda, for lambda from 0.001 to 100000]
./x.exe -mode 4 -n 10 -modelFeatureType 2 -dataType 2 -sigma .1
./x.exe -mode 1 -n 10 -modelFeatureType 2 -dataType 2 -sigma .1
2:18
Lasso: L1-regularization
• We add an L1 regularization to the cost:

  L^lasso(β) = Σ_{i=1}^n (y_i − φ(x_i)^T β)² + λ Σ_{j=1}^k |β_j|

NOTE: β_1 is usually not regularized!

• Has no closed form expression for the optimum

(the optimum can be found by solving a quadratic program; see appendix)
2:19
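The slides solve the Lasso via a quadratic program; as an alternative, here is a sketch of a simple proximal-gradient solver (ISTA, soft-thresholding). This method is my substitution, not the lecture's, but it minimizes the same objective and makes the sparsity effect visible:

```python
import numpy as np

def lasso_ista(X, y, lam, steps=3000):
    """Minimize ||y - X beta||^2 + lam * ||beta||_1 by proximal gradient descent (ISTA)."""
    beta = np.zeros(X.shape[1])
    step = 1.0 / (2.0 * np.linalg.norm(X, 2) ** 2)    # 1 / Lipschitz constant of the gradient
    for _ in range(steps):
        z = beta - step * 2.0 * X.T @ (X @ beta - y)  # gradient step on the squared error
        beta = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft-thresholding prox
    return beta

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 8))
beta_true = np.array([3.0, 0.0, 0.0, -2.0, 0.0, 0.0, 0.0, 0.0])
y = X @ beta_true + 0.1 * rng.standard_normal(100)
beta = lasso_ista(X, y, lam=20.0)
print(beta)   # most entries are exactly zero -> feature selection
```

The soft-thresholding step sets small coefficients exactly to zero, which is where the sparsity of the next slide comes from.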
Lasso vs. Ridge:
• Lasso → sparsity! feature selection!
2:20

  L^q(β) = Σ_{i=1}^n (y_i − φ(x_i)^T β)² + λ Σ_{j=1}^k |β_j|^q

• Subset selection: q = 0 (counting the number of β_j ≠ 0)
2:21
Summary
• Representation: choice of features
f(x) = φ(x)>β
• Objective: squared error + Ridge/Lasso regularization
  L^ridge(β) = Σ_{i=1}^n (y_i − φ(x_i)^T β)² + λ||β||²_I

• Solver: analytical (or quadratic program for Lasso)

  β^ridge = (X^T X + λI)^{-1} X^T y
2:22
Summary
• Linear models on non-linear features—extremely powerful
  features: linear / polynomial / RBF / kernel
  regularization: Ridge / Lasso
  task: regression / classification*
  (*logistic regression)
• Generalization ↔ Regularization ↔ complexity/DoF penalty
• Cross validation to estimate generalization empirically → use to choose regularization parameters
2:23
Appendix: Dual formulation of Ridge
• The standard way to write the Ridge regularization:

  L^ridge(β) = Σ_{i=1}^n (y_i − φ(x_i)^T β)² + λ Σ_{j=2}^k β_j²

• Finding β^ridge = argmin_β L^ridge(β) is equivalent to solving

  β^ridge = argmin_β Σ_{i=1}^n (y_i − φ(x_i)^T β)²
  subject to  Σ_{j=1}^k β_j² ≤ t

λ is the Lagrange multiplier for the inequality constraint
2:24
Appendix: Dual formulation of Lasso
• The standard way to write the Lasso regularization:

  L^lasso(β) = Σ_{i=1}^n (y_i − φ(x_i)^T β)² + λ Σ_{j=1}^k |β_j|

• Equivalent formulation (via KKT):

  β^lasso = argmin_β Σ_{i=1}^n (y_i − φ(x_i)^T β)²
  subject to  Σ_{j=1}^k |β_j| ≤ t

• Decreasing t is called "shrinkage": the space of allowed β shrinks. Some β will become zero → feature selection
2:25
3 Classification & Structured Output
Discriminative Function
• Represent a discrete-valued function F : R^n → Y via a discriminative function

  f : R^n × Y → R

such that

  F : x ↦ argmax_y f(x, y)

That is, a discriminative function f(x, y) maps an input x to an output

  ŷ(x) = argmax_y f(x, y)

• A discriminative function f(x, y) has high value if y is a correct answer to x, and low value if y is a false answer
• In that way a discriminative function e.g. discriminates a correct sequence/image/graph labelling from wrong ones
3:1
Example Discriminative Function
• Input: x ∈ R²; output y ∈ {1, 2, 3}
[3D plot: the class probabilities p(y=1|x), p(y=2|x), p(y=3|x) displayed over the input plane]
(here already “scaled” to the interval [0,1]... explained later)
3:2
How could we parameterize a discriminative function?
• Well, linear in features!

  f(x, y) = Σ_{j=1}^k φ_j(x, y) β_j = φ(x, y)^T β

• Example: Let x ∈ R and y ∈ {1, 2, 3}. Typical features might be

  φ(x, y) = ( 1 [y=1], x [y=1], x² [y=1], 1 [y=2], x [y=2], x² [y=2], 1 [y=3], x [y=3], x² [y=3] )^T

• Example: Let x, y ∈ {0, 1} both be discrete. Features might be

  φ(x, y) = ( 1, [x=0][y=0], [x=0][y=1], [x=1][y=0], [x=1][y=1] )^T
3:3
more intuition...
• Features "connect" input and output. Each φ_j(x, y) allows f to capture a certain dependence between x and y
• If both x and y are discrete, a feature φ_j(x, y) is typically a joint indicator function (logical function), indicating a certain "event"
• Each weight β_j mirrors how important/frequent/infrequent a certain dependence described by φ_j(x, y) is
• −f(x, y) is also called energy, and the following methods are also called energy-based modelling, esp. in neural modelling
3:4
3.1 Logistic regression: Binary case
3:5
Binary classification example
[Plot: 2D training data and the learned decision boundary]

• Input x ∈ R²
Output y ∈ {0, 1}
The example shows RBF Ridge Logistic Regression
3:6
A loss function for classification
• Data D = {(x_i, y_i)}_{i=1}^n with x_i ∈ R^d and y_i ∈ {0, 1}
• Bad idea: squared error regression (see also Hastie 4.2)

• Maximum likelihood:
We interpret the discriminative function f(x, y) as defining class probabilities

  p(y | x) = e^{f(x,y)} / Σ_{y'} e^{f(x,y')}

→ p(y | x) should be high for the correct class, and low otherwise

For each (x_i, y_i) we want to maximize the likelihood p(y_i | x_i):

  L^neg-log-likelihood(β) = −Σ_{i=1}^n log p(y_i | x_i)
3:7
Logistic regression
• In the binary case, we have "two functions" f(x, 0) and f(x, 1). W.l.o.g. we may fix f(x, 0) = 0. Therefore we choose features

  φ(x, y) = φ(x) [y = 1]

with arbitrary input features φ(x) ∈ R^k

• We have

  f(x, 1) = φ(x)^T β ,   ŷ = argmax_y f(x, y) = { 1 if φ(x)^T β > 0 ; 0 otherwise }

• and conditional class probabilities

  p(1 | x) = e^{f(x,1)} / (e^{f(x,0)} + e^{f(x,1)}) = σ(f(x, 1))

with the logistic sigmoid function σ(z) = e^z / (1 + e^z) = 1 / (e^{−z} + 1).

[Plot: the sigmoid exp(x)/(1+exp(x)), rising from 0 to 1]

• Given data D = {(x_i, y_i)}_{i=1}^n, we minimize

  L^logistic(β) = −Σ_{i=1}^n log p(y_i | x_i) + λ||β||²
                = −Σ_{i=1}^n [ y_i log p(1 | x_i) + (1 − y_i) log(1 − p(1 | x_i)) ] + λ||β||²
3:8
Optimal parameters β
• Gradient (see exercises):

  ∂L^logistic(β)/∂β^T = Σ_{i=1}^n (p_i − y_i) φ(x_i) + 2λIβ = X^T (p − y) + 2λIβ

where p_i := p(y=1 | x_i) ,   X = [ φ(x_1)^T ; ... ; φ(x_n)^T ] ∈ R^{n×k}

• ∂L^logistic(β)/∂β is non-linear in β (β also enters the calculation of p_i) → no analytic solution

• Newton algorithm: iterate

  β ← β − H^{-1} ∂L^logistic(β)/∂β^T

with Hessian  H = ∂²L^logistic(β)/∂β² = X^T W X + 2λI

W = diag(p ∘ (1 − p)), that is, diagonal with W_ii = p_i (1 − p_i)
3:9
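The Newton iteration above can be sketched directly in NumPy. For simplicity this sketch regularizes all parameters including the bias (the slides note the bias is usually excluded); the separable toy data are assumed:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_newton(X, y, lam=1.0, iters=20):
    """Newton iterations beta <- beta - H^{-1} grad for ridge-regularized logistic regression."""
    k = X.shape[1]
    beta = np.zeros(k)
    for _ in range(iters):
        p = sigmoid(X @ beta)                          # p_i = p(y=1 | x_i)
        grad = X.T @ (p - y) + 2.0 * lam * beta        # X^T (p - y) + 2 lam beta
        W = p * (1.0 - p)                              # diagonal of W
        H = X.T @ (W[:, None] * X) + 2.0 * lam * np.eye(k)
        beta = beta - np.linalg.solve(H, grad)
    return beta

rng = np.random.default_rng(0)
x = rng.standard_normal(200)
y = (x > 0).astype(float)                   # labels determined by the sign of x
X = np.column_stack([np.ones(200), x])
beta = logistic_newton(X, y)
p = sigmoid(X @ beta)
print(np.mean((p > 0.5) == (y == 1)))       # training accuracy, close to 1
```

Note `W[:, None] * X` computes W X without materializing the n×n diagonal matrix.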
polynomial (cubic) ridge logistic regression:
[Plots: training data with the decision boundary, and the discriminative function over the input plane]
./x.exe -mode 2 -modelFeatureType 3 -lambda 1e+0
3:10
RBF ridge logistic regression:
[Plots: training data with the decision boundary, and the discriminative function over the input plane]

./x.exe -mode 2 -modelFeatureType 4 -lambda 1e+0 -rbfBias 0 -rbfWidth .2
3:11
Recap
                  ridge regression                              logistic regression
REPRESENTATION    f(x) = φ(x)^T β                               f(x, y) = φ(x, y)^T β
OBJECTIVE         L^ls(β) =                                     L^nll(β) =
                  Σ_{i=1}^n (y_i − f(x_i))² + λ||β||²_I         −Σ_{i=1}^n log p(y_i | x_i) + λ||β||²_I
                                                                with p(y|x) ∝ e^{f(x,y)}
SOLVER            β^ridge = (X^T X + λI)^{-1} X^T y             binary case:
                                                                β ← β − (X^T WX + 2λI)^{-1} (X^T(p − y) + 2λIβ)
3:12
3.2 Logistic regression: Multi-class case
3:13
Logistic regression: Multi-class case
• Data D = {(x_i, y_i)}_{i=1}^n with x_i ∈ R^d and y_i ∈ {1, .., M}
• We choose f(x, y) = φ(x, y)^T β with

  φ(x, y) = ( φ(x) [y=1], φ(x) [y=2], ..., φ(x) [y=M] )^T

where φ(x) are arbitrary features. We have M (or M−1) parameters β
(optionally we may set f(x, M) = 0 and drop the last entry)

• Conditional class probabilities

  p(y | x) = e^{f(x,y)} / Σ_{y'} e^{f(x,y')}   ↔   f(x, y) − f(x, M) = log [ p(y|x) / p(y=M|x) ]

(the discriminative functions model "log-ratios")

• Given data D = {(x_i, y_i)}_{i=1}^n, we minimize

  L^logistic(β) = −Σ_{i=1}^n log p(y=y_i | x_i) + λ||β||²
3:14
Optimal parameters β
• Gradient:

  ∂L^logistic(β)/∂β_c = Σ_{i=1}^n (p_ic − y_ic) φ(x_i) + 2λIβ_c = X^T (p_c − y_c) + 2λIβ_c ,   p_ic = p(y=c | x_i)

• Hessian:

  H = ∂²L^logistic(β)/∂β_c ∂β_d = X^T W_cd X + 2 [c = d] λI

W_cd is diagonal with W_cd,ii = p_ic ([c = d] − p_id)
3:15
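The conditional class probabilities p(y|x) above are a row-wise softmax over the discriminative values. The max-subtraction below is a standard numerical-stability trick, an implementation detail not on the slide:

```python
import numpy as np

def class_probs(F):
    """p(y|x) = exp(f(x,y)) / sum_y' exp(f(x,y')), one row of F per input x."""
    F = F - F.max(axis=1, keepdims=True)   # does not change the ratios, avoids overflow
    E = np.exp(F)
    return E / E.sum(axis=1, keepdims=True)

F = np.array([[2.0, 1.0, 0.0]])            # discriminative values f(x, y) for M = 3 classes
p = class_probs(F)
print(p)          # monotone in f: the class with the highest f gets the highest probability
print(p.sum())    # each row sums to 1
```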
polynomial (quadratic) ridge 3-class logistic regression:
[Plots: training data with the p=0.5 decision boundaries, and the three class probabilities over the input plane]
./x.exe -mode 3 -modelFeatureType 3 -lambda 1e+1
3:16
3.3 Structured Output & Structured Input
3:17
Structured Output & Structured Input
• regression:   R^n → R

• structured output:
  R^n → binary class label {0, 1}
  R^n → integer class label {1, 2, .., M}
  R^n → sequence labelling y_{1:T}
  R^n → image labelling y_{1:W,1:H}
  R^n → graph labelling y_{1:N}

• structured input:
  relational database → R
  labelled graph/sequence → R
3:18
Examples for Structured Output
• Text tagging
X = sentence
Y = tagging of each word
http://sourceforge.net/projects/crftagger

• Image segmentation
X = image
Y = labelling of each pixel
http://scholar.google.com/scholar?cluster=13447702299042713582

• Depth estimation
X = single image
Y = depth map
http://make3d.cs.cornell.edu/
3:19
CRFs in image processing
3:20
CRFs in image processing
• Google "conditional random field image"
– Multiscale Conditional Random Fields for Image Labeling (CVPR 2004)
– Scale-Invariant Contour Completion Using Conditional Random Fields (ICCV 2005)
– Conditional Random Fields for Object Recognition (NIPS 2004)
– Image Modeling using Tree Structured Conditional Random Fields (IJCAI 2007)
– A Conditional Random Field Model for Video Super-resolution (ICPR 2006)
3:21
3.4 Conditional Random Fields
3:22
Conditional Random Fields (CRFs)
• CRFs are a generalization of logistic binary and multi-class classification
• The output y may be an arbitrary (usually discrete) thing (e.g., sequence/image/graph labelling)
• Hopefully we can maximize efficiently

  argmax_y f(x, y)

over the output! → f(x, y) should be structured in y so that this optimization is efficient.

• The name CRF describes that p(y|x) ∝ e^{f(x,y)} defines a probability distribution (a.k.a. random field) over the output y conditional to the input x. The word "field" usually means that this distribution is structured (a graphical model; see the later part of the lecture).
3:23
CRFs: the structure is in the features
• Assume y = (y_1, .., y_l) is a tuple of individual (local) discrete labels
• We can assume that f(x, y) is linear in features

  f(x, y) = Σ_{j=1}^k φ_j(x, y_{∂j}) β_j = φ(x, y)^T β

where each feature φ_j(x, y_{∂j}) depends only on a subset y_{∂j} of labels. φ_j(x, y_{∂j}) effectively couples the labels y_{∂j}. Then e^{f(x,y)} is a factor graph.
3:24
Example: pair-wise coupled pixel labels

[Figure: a W×H grid of pixel labels y_11 .. y_HW with pairwise couplings between neighboring labels, and local observations x_11 .. attached to each label]

• Each black box corresponds to features φ_j(y_{∂j}) which couple neighboring pixel labels y_{∂j}
• Each gray box corresponds to features φ_j(x_j, y_j) which couple a local pixel observation x_j with a pixel label y_j
3:25
CRFs: Core equations
  f(x, y) = φ(x, y)^T β

  p(y|x) = e^{f(x,y)} / Σ_{y'} e^{f(x,y')} = e^{f(x,y) − Z(x,β)}

  Z(x, β) = log Σ_{y'} e^{f(x,y')}    (log partition function)

  L(β) = −Σ_i log p(y_i|x_i) = −Σ_i [ f(x_i, y_i) − Z(x_i, β) ]

  ∇Z(x, β) = Σ_y p(y|x) ∇f(x, y)

  ∇²Z(x, β) = Σ_y p(y|x) ∇f(x, y) ∇f(x, y)^T − ∇Z ∇Z^T

• This gives the neg-log-likelihood L(β), its gradient and Hessian
3:26
Training CRFs
• Maximize the conditional likelihood
But the Hessian is typically too large (images: ∼10 000 pixels, ∼50 000 features)
If f(x, y) has a chain structure over y, the Hessian is usually banded → computation time linear in chain length

Alternative: efficient gradient methods, e.g.:
Vishwanathan et al.: Accelerated Training of Conditional Random Fields with Stochastic Gradient Methods

• Other loss variants, e.g., hinge loss as with Support Vector Machines ("structured output SVMs")

• Perceptron algorithm: minimizes hinge loss using a gradient method
3:27
4 Kernelization & Structured Input
Kernel Ridge Regression—the “Kernel Trick”
• Reconsider the solution of Ridge regression (using the Woodbury identity):

  β^ridge = (X^T X + λI_k)^{-1} X^T y = X^T (X X^T + λI_n)^{-1} y

• Recall X^T = (φ(x_1), .., φ(x_n)) ∈ R^{k×n}, then:

  f^ridge(x) = φ(x)^T β^ridge = φ(x)^T X^T (X X^T + λI)^{-1} y = κ(x)^T (K + λI)^{-1} y

K is called the kernel matrix and has elements

  K_ij = k(x_i, x_j) := φ(x_i)^T φ(x_j)

κ is the vector κ(x)^T = φ(x)^T X^T = k(x, x_{1:n})

The kernel function k(x, x') calculates the scalar product in feature space.
4:1
The Kernel Trick

• We can rewrite kernel ridge regression as:

    f^ridge(x) = κ(x)>(K + λI)^-1 y
    with K_ij = k(x_i, x_j) ,  κ_i(x) = k(x, x_i)

→ at no place do we actually need to compute the parameters β
→ at no place do we actually need to compute the features φ(x_i)
→ we only need to be able to compute k(x, x′) for any x, x′

• This rewriting is called kernel trick.
• It has great implications:
– Instead of inventing funny non-linear features, we may directly invent funny kernels
– Inventing a kernel is intuitive: k(x, x′) expresses how correlated y and y′ should be: it is a measure of similarity, it compares x and x′. Specifying how ‘comparable’ x and x′ are is often more intuitive than defining “features that might work”.

4:2
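A minimal numerical sketch of these equations (the RBF kernel, data, and regularization value here are chosen for illustration; they are not from the lecture):

```python
import numpy as np

def k_rbf(a, b, gamma=10.0):
    # squared exponential kernel k(x, x') = exp(-gamma (x - x')^2), 1-D inputs
    return np.exp(-gamma * (a - b) ** 2)

rng = np.random.default_rng(1)
X = np.linspace(0, 1, 30)                          # training inputs x_1..x_n
y = np.sin(2 * np.pi * X) + 0.1 * rng.normal(size=30)
lam = 1e-3

K = k_rbf(X[:, None], X[None, :])                  # kernel matrix K_ij = k(x_i, x_j)
alpha = np.linalg.solve(K + lam * np.eye(30), y)   # (K + lam I)^{-1} y

def f(xq):
    # f(x) = kappa(x)^T (K + lam I)^{-1} y, with kappa_i(x) = k(x, x_i)
    return k_rbf(xq, X) @ alpha

pred = f(0.25)   # should be close to sin(pi/2) = 1
```

Note that neither β nor the features φ are ever computed, only kernel evaluations.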
• Every choice of features implies a kernel.
• But, does every choice of kernel correspond to a specific choice of features?

4:3
Reproducing Kernel Hilbert Space

• Let’s define a vector space H_k, spanned by infinitely many basis elements

    {φ_x = k(·, x) : x ∈ R^d}

Vectors in this space are linear combinations of such basis elements, e.g.,

    f = ∑_i α_i φ_{x_i} ,  f(x) = ∑_i α_i k(x, x_i)

• Let’s define a scalar product in this space. Assuming k(·, ·) is positive definite, we first define the scalar product for every basis element,

    〈φ_x, φ_y〉 := k(x, y)

Then it follows

    〈φ_x, f〉 = ∑_i α_i 〈φ_x, φ_{x_i}〉 = ∑_i α_i k(x, x_i) = f(x)

• The φ_x = k(·, x) is the ‘feature’ we associate with x. Note that this is a function and infinite dimensional. Choosing α = (K + λI)^-1 y represents f^ridge(x) = ∑_{i=1}^n α_i k(x, x_i) = κ(x)>α, and shows that ridge regression has a finite-dimensional solution in the basis elements φ_{x_i}. A more general version of this insight is called representer theorem.

4:4
Representer Theorem

• For

    f* = argmin_{f∈H_k} L(f(x_1), .., f(x_n)) + Ω(||f||²_{H_k})

where L is an arbitrary loss function and Ω a monotone regularization, it holds

    f* = ∑_{i=1}^n α_i k(·, x_i)

• Proof: decompose f = f_s + f⊥, f_s ∈ span{φ_{x_i} : x_i ∈ D}

    f(x_i) = 〈f, φ_{x_i}〉 = 〈f_s + f⊥, φ_{x_i}〉 = 〈f_s, φ_{x_i}〉 = f_s(x_i)
    L(f(x_1), .., f(x_n)) = L(f_s(x_1), .., f_s(x_n))
    Ω(||f||²_{H_k}) ≥ Ω(||f_s||²_{H_k})

4:5
Example Kernels

• Kernel functions need to be positive definite: ∀z, |z| > 0 : z>Kz > 0
→ K is a positive definite matrix
• Examples:
– Polynomial: k(x, x′) = (x>x′ + c)^d

  Let’s verify for d = 2, φ(x) = (1, √2 x_1, √2 x_2, x_1², √2 x_1 x_2, x_2²)>:

    k(x, x′) = ((x_1, x_2)(x′_1, x′_2)> + 1)²
             = (x_1 x′_1 + x_2 x′_2 + 1)²
             = x_1² x′_1² + 2 x_1 x_2 x′_1 x′_2 + x_2² x′_2² + 2 x_1 x′_1 + 2 x_2 x′_2 + 1
             = φ(x)>φ(x′)

– Squared exponential (radial basis function): k(x, x′) = exp(−γ |x − x′|²)

4:6
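The d = 2 verification above can also be checked numerically (a quick sketch; the test points are made up):

```python
import numpy as np

def phi(x):
    # explicit features for the degree-2 polynomial kernel with c = 1
    x1, x2 = x
    return np.array([1.0, np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

def k(x, xp):
    return (x @ xp + 1.0) ** 2

x, xp = np.array([0.3, -1.2]), np.array([2.0, 0.5])
same = np.isclose(k(x, xp), phi(x) @ phi(xp))   # k(x, x') = phi(x)^T phi(x')
```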
Example Kernels

• Bag-of-words kernels: let φ_w(x) be the count of word w in document x; define k(x, y) = 〈φ(x), φ(y)〉
• Graph kernels (Vishwanathan et al: Graph kernels, JMLR 2010)
– Random walk graph kernels

4:7
Kernel Logistic Regression*

For logistic regression we compute β using the Newton iterates

    β ← β − (X>WX + 2λI)^-1 [X>(p − y) + 2λβ]   (1)
      = −(X>WX + 2λI)^-1 X>[(p − y) − WXβ]   (2)

Using the Woodbury identity

    (X>WX + A)^-1 X>W = A^-1 X>(X A^-1 X> + W^-1)^-1   (3)

we can rewrite this as

    β ← −(1/2λ) X>(X (1/2λ) X> + W^-1)^-1 W^-1 [(p − y) − WXβ]   (4)
      = X>(XX> + 2λW^-1)^-1 [Xβ − W^-1(p − y)]   (5)

We can now compute the discriminative function values f_X = Xβ ∈ R^n at the training points by iterating over those instead of β:

    f_X ← XX>(XX> + 2λW^-1)^-1 [Xβ − W^-1(p − y)]   (6)
        = K(K + 2λW^-1)^-1 [f_X − W^-1(p_X − y)]   (7)

Note that p_X on the RHS also depends on f_X. Given f_X we can compute the discriminative function values f_Z = Zβ ∈ R^m for a set of m query points Z using

    f_Z ← κ>(K + 2λW^-1)^-1 [f_X − W^-1(p_X − y)] ,  κ> = ZX>   (8)

4:8
5 The Breadth of ML methods
• Other loss functions
– Robust error functions (esp. Forsyth)
– Hinge loss & SVMs

• Neural Networks & Deep learning
– Basic equations
– Deep Learning & regularizing the representation, autoencoders

• Unsupervised Learning
– PCA, kernel PCA, Spectral Clustering, Multidimensional Scaling, ISOMAP
– Non-negative Matrix Factorization*, Factor Analysis*, ICA*, PLS*

• Clustering
– k-means, Gaussian Mixture model
– Agglomerative Hierarchical Clustering

• Local learners
– local & lazy learning, kNN, Smoothing Kernel, kd-trees

• Combining weak or randomized learners
– Bootstrap, bagging, and model averaging
– Boosting
– (Boosted) decision trees & stumps, random forests
5:1
5.1 Other loss functions
5:2
Loss functions

• So far we’ve talked about
– least squares L^ls(f) = ∑_{i=1}^n (y_i − f(x_i))² for regression
– neg-log-likelihood L^nll(f) = −∑_{i=1}^n log p(y_i | x_i) for classification

• There are reasons to consider other loss functions:
– Outliers are too important in least squares
– In classification, maximizing a margin is intuitive
– Intuitively, small errors should be penalized little or not at all
– Some loss functions lead to sparse sets of support vectors → conciseness of function representation, good generalization

5:3
Robust error functions

• There are (at least) two approaches to handle outliers:
– Analyze/label them explicitly and discount them in the error
– Choose cost functions that are more “robust” to outliers

(Kheng’s lecture CVPR)

r = y_i − f(x_i) is the residual (= model error)

5:4
Robust error functions

• standard least squares: L^ls(r) = r²
• Forsyth: L(r) = r² / (a² + r²)
• Huber’s robust regression (1964): L^H(r) = { r²/2 if |r| ≤ c ; c|r| − c²/2 else }
• Less used:
– Cauchy: L(r) = a²/2 log(1 + r²/a²)
– Beaton & Tukey: constant for |r| > a, cubic inside

“Outliers have small/no gradient”
By this we mean: ∂β*(D)/∂y_i is small: varying an outlier data point y_i does not have much influence on the optimal parameter β* ↔ robustness
Cf. our discussion of the estimator variance Var{β*}

5:5
SVM regression: “Inliers have no gradient”*

(Hastie, page 435)

• SVM regression (ε-insensitive): L^ε(r) = [|r| − ε]_+
(the subscript + indicates the positive part)

• When optimizing for the SVM regression loss, the optimal regression (function f*, or parameters β*) does not vary with inlier data points. It only depends on critical data points. These are called support vectors.

• In the kernel case f(x) = κ(x)>α, every α_i relates to exactly one training point (cf. RBF features). Clearly, α_i = 0 for those data points inside the margin! α is sparse; only support vectors have α_i ≠ 0.

• I skip the details for SVM regression and reiterate this for classification

5:6
Loss functions for classification

(Hastie, page 426)

• The “x-axis” denotes y_i f(x_i), assuming y_i ∈ {−1, +1}
– neg-log-likelihood (green curve above): L^nll(y, f) = −log p(y_i | x_i) = −log σ(y f(x_i))
– hinge loss: L^hinge(y, f) = [1 − y f(x)]_+ = max{0, 1 − y f(x)}
– class Huber’s robust regression:

    L^HC(y, f) = { −4 y f(x) if y f(x) ≤ −1 ; [1 − y f(x)]²_+ else }

• Neg-log-likelihood (“Binomial Deviance”) and Huber have the same asymptotes as the hinge loss (shifted by 1), but rounded in the interior

5:7
Hinge loss

• Consider the hinge loss with L2 regularization, minimizing

    L^hinge(f) = ∑_{i=1}^n [1 − y_i f(x_i)]_+ + λ||β||²

• It is easy to show that the following are equivalent

    min_β ||β||² + C ∑_{i=1}^n max{0, 1 − y_i f(x_i)}   (9)
    min_{β,ξ} ||β||² + C ∑_{i=1}^n ξ_i  s.t.  y_i f(x_i) ≥ 1 − ξ_i , ξ_i ≥ 0   (10)

(where C = 1/λ > 0 and y_i ∈ {−1, +1})

• The optimal model (optimal β) will “not depend” on data points for which ξ_i = 0. Here “not depend” is meant in the variational sense: if you would vary the data point (x_i) slightly, β* would not be changed.

5:8
(Hastie, page 418)

• Right:

    min_{β,ξ} ||β||² + C ∑_{i=1}^n ξ_i  s.t.  y_i f(x_i) ≥ 1 − ξ_i , ξ_i ≥ 0

• Left: assuming ∀i : ξ_i = 0 we just maximize the margin:

    min_β ||β||²  s.t.  y_i f(x_i) ≥ 1

• Right: We see that β* does not depend directly (variationally) on x_i for data points outside the margin!
Data points inside the margin or wrongly classified: support vectors

5:9
• A linear SVM (on a non-linear classification problem) for different regularization
Black points: exactly on the margin (ξ_i = 0, α_i > 0)
Support vectors (α_i > 0): inside of margin or wrong side of margin

5:10

• Kernel Logistic Regression (using neg-log-likelihood) vs. Kernel SVM (using hinge)

[My view: these are equally powerful. The core difference is in how they can be trained, and how “sparse” the optimal α is]

5:11
Finding the optimal discriminative function

• The constrained problem

    min_{β,ξ} ||β||² + C ∑_{i=1}^n ξ_i  s.t.  y_i(x_i>β) ≥ 1 − ξ_i , ξ_i ≥ 0

is a quadratic program (quadratic objective, linear constraints) and can be reformulated as the dual problem, with dual parameters α_i that indicate whether the constraint y_i(x_i>β) ≥ 1 − ξ_i is active. The dual problem is convex. SVM libraries use, e.g., CPLEX to solve this.
• For all inactive constraints (y_i(x_i>β) ≥ 1) the data point (x_i, y_i) does not directly influence the solution β*. Active points are support vectors.

5:12
• Let (β, ξ) be the primal variables and (α, µ) the dual; we derive the dual problem:

    min_{β,ξ} ||β||² + C ∑_{i=1}^n ξ_i  s.t.  y_i(x_i>β) ≥ 1 − ξ_i , ξ_i ≥ 0   (11)
    L(β, ξ, α, µ) = ||β||² + C ∑_{i=1}^n ξ_i − ∑_{i=1}^n α_i [y_i(x_i>β) − (1 − ξ_i)] − ∑_{i=1}^n µ_i ξ_i   (12)
    ∂_β L = 0  ⇒  2β = ∑_{i=1}^n α_i y_i x_i   (13)
    ∂_ξ L = 0  ⇒  ∀i : α_i = C − µ_i   (14)
    l(α, µ) = min_{β,ξ} L(β, ξ, α, µ) = −(1/4) ∑_{i=1}^n ∑_{i′=1}^n α_i α_i′ y_i y_i′ x_i>x_i′ + ∑_{i=1}^n α_i   (15)
    max_{α,µ} l(α, µ)  s.t.  0 ≤ α_i ≤ C   (16)

• (12): Lagrangian (with negative Lagrange terms because of ≥ instead of ≤)
• (13): the optimal β* depends only on the x_i, y_i for which α_i > 0 → support vectors
• (15): This assumes that x_i = (1, x_i) includes the constant feature 1 (so that the statistics become centered)
• (16): This is the dual problem. µ_i ≥ 0 implies α_i ≤ C
• Note: the dual problem only refers to x_i>x_i′ → kernelization

5:13
Conclusions on the hinge loss

• The hinge loss implies an important structural aspect of optimal solutions: support vectors
• A core benefit of the hinge loss/SVMs: these optimization problems are structurally different to the (dense) Newton methods for logistic regression. They can scale well to large data sets using specialized constrained optimization methods. (However, there also exist incremental, well-scaling methods for kernel ridge regression/GPs.)
(I personally think: the benefit is not in terms of the predictive performance, which is just different, not better or worse.)
• The hinge loss can equally be applied to conditional random fields (including multi-class classification), which is called structured-output SVM

5:14
Comments on comparing classifiers*

• Given two classification algorithms that optimize different loss functions, what are objective ways to compare them?

• Often one only measures the accuracy (rate of correct classification). A test (e.g. a paired t-test) can check the significance of the difference between the algorithms.
– Salzberg: On comparing classifiers: Pitfalls to avoid and a recommended approach. Data Mining and Knowledge Discovery, 1997.
– Dietterich: Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation, 1998.

• Critical view:
– Provost, Fawcett, Kohavi: The case against accuracy estimation for comparing induction algorithms. ICML, 1998.

5:15
Precision, Recall & ROC curves*

• Measures: data size n, false positives (FP), true positives (TP), data positives (DP = TP + FN), etc.:

    accuracy = (TP + TN)/n
    precision = TP/(TP + FP)
    recall (TP-rate) = TP/(TP + FN) = TP/DP
    FP-rate = FP/(FP + TN) = FP/DN

• Binary classifiers typically define a discriminative function f(x) ∈ R. (Or f(y=1, x) = f(x), f(y=0, x) = 0.) The “decision threshold” typically is zero:

    y* = argmax_y f(y, x) = [f(x) > 0]

But in some applications one might want to tune the threshold after training.
• decreasing the threshold: TPR → 1 (but also FPR → 1)

5:16
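These definitions translate directly into code; the sketch below sweeps the decision threshold over the discriminative values f(x_i) to produce the (FPR, TPR) points of a ROC curve (scores and labels are made up for illustration):

```python
import numpy as np

f_vals = np.array([2.1, 1.3, 0.4, -0.2, -1.0, -1.7])   # discriminative values, sorted
labels = np.array([1, 1, 0, 1, 0, 0])                  # true classes

def roc_points(f_vals, labels):
    pts = []
    # thresholds from +inf (predict nothing positive) down to the lowest score
    for t in np.concatenate(([np.inf], f_vals)):
        pred = f_vals >= t
        tp = np.sum(pred & (labels == 1))   # true positives
        fp = np.sum(pred & (labels == 0))   # false positives
        pts.append((fp / np.sum(labels == 0), tp / np.sum(labels == 1)))
    return pts

pts = roc_points(f_vals, labels)
# pts starts at (0, 0); lowering the threshold drives both TPR and FPR towards 1
```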
ROC curves*
• ROC curves (Receiver Operating Characteristic) plot TPR vs. FPR for all possi-ble choices of decision thresholds:
• Provost, Fawcett & Kohavi empirically demonstrate that rarely does one algorithm dominate another over the whole ROC curve → there always exist decision thresholds (cost assignments to TPR/FPR) for which the other algorithm is better.
• Their conclusion: show the ROC curve as a whole
• Some report the area under the ROC curve – which is intuitive but meaningless...

5:17
5.2 Neural Networks
5:18
Neural Networks

• Consider a regression problem with input x ∈ R^d and output y ∈ R
– Linear function: (β ∈ R^d)
    f(x) = β>x
– 1-layer Neural Network function: (W_0 ∈ R^{h_1×d})
    f(x) = β>σ(W_0 x)
– 2-layer Neural Network function:
    f(x) = β>σ(W_1 σ(W_0 x))

• Neural Networks are a special function model y = f(x, w), i.e. a special way to parameterize non-linear functions

5:19
Neural Networks: basic equations

• Consider L hidden layers, each h_l-dimensional
– let z_l = W_{l-1} x_{l-1} be the inputs to all neurons in layer l
– let x_l = σ(z_l) be the activation of all neurons in layer l
– redundantly, we denote by x_0 ≡ x the activation of the input layer, and by φ(x) ≡ x_L the activation of the last hidden layer

• Forward propagation: An L-layer NN recursively computes, for l = 1, .., L,

    ∀ l=1,..,L : z_l = W_{l-1} x_{l-1} ,  x_l = σ(z_l)

and then computes the output f ≡ z_{L+1} = W_L x_L

• Backpropagation: Given some loss ℓ(f), let δ_{L+1} ≜ ∂ℓ/∂f. We can recursively compute the loss-gradient w.r.t. the inputs of layer l:

    ∀ l=L,..,1 : δ_l ≜ dℓ/dz_l = dℓ/dz_{l+1} · ∂z_{l+1}/∂x_l · ∂x_l/∂z_l = [δ_{l+1} W_l] ∘ [x_l ∘ (1 − x_l)]>

where ∘ is an element-wise product (x_l ∘ (1 − x_l) is the derivative of the sigmoid σ). The gradient w.r.t. weights is:

    dℓ/dW_{l,ij} = dℓ/dz_{l+1,i} · ∂z_{l+1,i}/∂W_{l,ij} = δ_{l+1,i} x_{l,j}  or  dℓ/dW_l = δ_{l+1}> x_l>

5:20
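The recursions above, written out for a single hidden layer with a sigmoid activation and verified against finite differences (a sketch with made-up dimensions):

```python
import numpy as np

rng = np.random.default_rng(0)
d, h = 3, 4
W0 = rng.normal(size=(h, d))     # hidden-layer weights
beta = rng.normal(size=h)        # output weights W_L
x = rng.normal(size=d)
y = 1.5
sig = lambda z: 1.0 / (1.0 + np.exp(-z))

def loss(W0):
    x1 = sig(W0 @ x)             # z1 = W0 x0, x1 = sigma(z1)
    return (beta @ x1 - y) ** 2  # squared error on f = W_L x_L

# backward pass
x1 = sig(W0 @ x)
f = beta @ x1
delta2 = 2.0 * (f - y)                       # dl/df
delta1 = (delta2 * beta) * x1 * (1.0 - x1)   # dl/dz1 = [delta2 W_L] o [x1 o (1-x1)]
dW0 = np.outer(delta1, x)                    # dl/dW0_ij = delta1_i x0_j

# finite-difference check of one weight entry
eps, (i, j) = 1e-6, (2, 1)
E = np.zeros_like(W0); E[i, j] = eps
num = (loss(W0 + E) - loss(W0 - E)) / (2 * eps)
```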
NN regression & regularization

• In the standard regression case, y ∈ R, we typically assume a squared error loss ℓ(f) = ∑_i (f(x_i, w) − y_i)². We have

    δ_{L+1} = ∑_i 2(f(x_i, w) − y_i)>

• Regularization: Add an L2 or L1 regularization. First compute all gradients as before, then add λW_{l,ij} (for L2), or λ sign W_{l,ij} (for L1) to the gradient. Historically, this is called weight decay, as the additional gradient leads to a step decaying the weights.
• The optimal output weights are as for standard regression

    W_L* = (X>X + λI)^-1 X>y

where X is the data matrix of activations x_L ≡ φ(x)

5:21
NN classification

• Consider the multi-class case y ∈ {1, .., M}. Then we have M output neurons to represent the discriminative function

    f(x, y, w) = (W_L z_L)_y ,  W_L ∈ R^{M×h_L}

• Choosing a neg-log-likelihood objective ↔ logistic regression
• Choosing a hinge loss objective ↔ “NN + SVM”
– For given x, let y* be the correct class. The one-vs-all hinge loss is ∑_{y≠y*} [1 − (f_{y*} − f_y)]_+
– For output neuron y ≠ y* this implies a gradient δ_y = [f_{y*} < f_y + 1]
– For output neuron y* this implies a gradient δ_{y*} = −∑_{y≠y*} [f_{y*} < f_y + 1]
Only data points inside the margin induce an error (and gradient).
– This is also called Perceptron Algorithm

5:22
NN model selection

• NNs have many hyperparameters: the choice of λ as well as L and h_l
→ cross validation

5:23
Deep Learning

• Discussion papers:
– Thomas G. Dietterich et al.: Structured machine learning: the next ten years
– Bengio, Yoshua and LeCun, Yann: Scaling learning algorithms towards AI
– Rodney Douglas et al.: Future Challenges for the Science and Engineering of Learning
– Pat Langley: The changing science of machine learning
– Glorot, Bengio: Understanding the difficulty of training deep feedforward neural networks

• There are open issues w.r.t. representations (e.g. discovering/developing abstractions, hierarchies, etc.).
Deep Learning and the revival of Neural Networks was driven by this

5:24
Deep Learning

• Must read:
Jason Weston et al.: Deep Learning via Semi-Supervised Embedding (ICML 2008)
www.thespermwhale.com/jaseweston/papers/deep_embed.pdf

• Rough ideas:
– Multi-layer NNs are a “deep” function model
– Only training on least squares gives too little “pressure” on inner representations
– Mixing the least squares cost with representational costs leads to better representations & generalization

5:25
Deep Learning

• In this perspective,
1) Deep Learning considers deep function models (NNs usually)
2) Deep Learning makes explicit statements about what internal representations should capture

• In Weston et al.’s case

    L(β) = ∑_{i=1}^n (y_i − f(x_i))²  [squared error]  + λ ∑_l ∑_{jk} (||z_{lj} − z_{lk}||² − d_{jk})²  [Multidimensional Scaling regularization]

where z_l = σ(W_{l-1} z_{l-1}) are the activations in the l-th layer

This mixes the squared error with a representational cost for each layer

5:26
Deep Learning
• Results from Weston et al. (ICML, 2008)
Mnist1h dataset, deep NNs of 2, 6, 8, 10 and 15 layers; each hidden layer 50 hiddenunits
5:27
Similar idea: Stacking Autoencoders

• An autoencoder typically is a NN of the type

which is trained to reproduce the input: min ∑_i ||y(x_i) − x_i||²

The hidden layer (“bottleneck”) needs to find a good representation/compression.

• Similar to the PCA objective, but nonlinear

• Stacking autoencoders:

5:28
Deep Learning – further reading

• Weston, Ratle & Collobert: Deep Learning via Semi-Supervised Embedding, ICML 2008.
• Hinton & Salakhutdinov: Reducing the Dimensionality of Data with Neural Networks, Science 313, pp. 504-507, 2006.
• Bengio & LeCun: Scaling Learning Algorithms Towards AI. In Bottou et al. (Eds) “Large-Scale Kernel Machines”, MIT Press 2007.
• Hadsell, Chopra & LeCun: Dimensionality Reduction by Learning an Invariant Mapping, CVPR 2006.
• Glorot, Bengio: Understanding the difficulty of training deep feedforward neural networks, AISTATS’10.

... and newer papers citing those

5:29
5.3 Unsupervised learning
5:30
Unsupervised learning

• What does that mean? Generally: modelling P(x)

• Instances:
– Finding lower-dimensional spaces
– Clustering
– Density estimation
– Fitting a graphical model

• “Supervised Learning as special case”...

5:31
Principal Component Analysis (PCA)

• Assume we have data D = {x_i}_{i=1}^n, x_i ∈ R^d.

Intuitively: “We believe that there is an underlying lower-dimensional space explaining this data”.

• How can we formalize this?

5:32
PCA: minimizing projection error

• For each x_i ∈ R^d we postulate a lower-dimensional latent variable z_i ∈ R^p

    x_i ≈ V_p z_i + µ

• Optimality: Find V_p, µ and values z_i that minimize ∑_{i=1}^n ||x_i − (V_p z_i + µ)||²

5:33

Optimal V_p

    µ, z_{1:n} = argmin_{µ,z_{1:n}} ∑_{i=1}^n ||x_i − V_p z_i − µ||²
    ⇒ µ = 〈x_i〉 = (1/n) ∑_{i=1}^n x_i ,  z_i = V_p>(x_i − µ)

• Center the data x̃_i = x_i − µ. Then

    V_p = argmin_{V_p} ∑_{i=1}^n ||x̃_i − V_p V_p> x̃_i||²

• Solution via Singular Value Decomposition
– Let X ∈ R^{n×d} be the centered data matrix containing all x̃_i
– We compute a sorted Singular Value Decomposition X>X = V D V>
  D is diagonal with sorted singular values λ_1 ≥ λ_2 ≥ · · · ≥ λ_d
  V = (v_1 v_2 · · · v_d) contains the largest eigenvectors v_i as columns

    V_p := V_{1:d,1:p} = (v_1 v_2 · · · v_p)

5:34
Principal Component Analysis (PCA)

• V_p> is the matrix that projects to the largest variance directions of X>X

    z_i = V_p>(x_i − µ) ,  Z = X V_p

• In the non-centered case: compute the SVD of the variance

    A = Var{x} = 〈x x>〉 − µµ> = (1/n) X>X − µµ>

5:35
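The full PCA recipe of the last two slides in a few lines (a sketch on random correlated data):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, p = 200, 5, 2
X = rng.normal(size=(n, d)) @ rng.normal(size=(d, d))  # correlated data

mu = X.mean(axis=0)
Xc = X - mu                                # centered data matrix
lam, V = np.linalg.eigh(Xc.T @ Xc)         # eigenvalues in ascending order
order = np.argsort(lam)[::-1]              # sort descending
Vp = V[:, order[:p]]                       # p largest eigenvectors as columns
Z = Xc @ Vp                                # latent coordinates z_i = Vp^T (x_i - mu)

err = np.sum((Xc - Z @ Vp.T) ** 2)         # remaining projection error
```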
52 Introduction to Machine Learning, Marc Toussaint
Example: Digits

5:36

Example: Digits

• The “basis vectors” in V_p are also eigenvectors.
Every data point can be expressed in these eigenvectors

    x ≈ µ + V_p z = µ + z_1 v_1 + z_2 v_2 + . . .

5:37
Example: Eigenfaces
(Viola & Jones)

5:38
“Feature PCA” & Kernel PCA

• The feature trick: X = (φ(x_1)>; .. ; φ(x_n)>) ∈ R^{n×k}

• The kernel trick: rewrite all necessary equations such that they only involve scalar products φ(x)>φ(x′) = k(x, x′):

We want to compute eigenvectors of X>X = ∑_i φ(x_i) φ(x_i)>. We can rewrite this as

    X>X v_j = λ v_j
    XX> (X v_j) = λ (X v_j) ,  v_j = ∑_i α_{ji} φ(x_i)
    K α_j = λ α_j

where K = XX> with entries K_ij = φ(x_i)>φ(x_j).

→ We compute the SVD of the kernel matrix K → gives eigenvectors α_j ∈ R^n.
Projection: x ↦ z = V_p>φ(x) = ∑_i α_{1:p,i} φ(x_i)>φ(x) = A κ(x)
(with matrix A ∈ R^{p×n}, A_ji = α_{ji} and vector κ(x) ∈ R^n, κ_i(x) = k(x_i, x))

Since we cannot center the features φ(x) we actually need “the double centered kernel matrix” K = (I − (1/n)11>) K̃ (I − (1/n)11>), where K̃_ij = φ(x_i)>φ(x_j) is uncentered.

5:39
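A compact sketch of kernel PCA with the double-centered kernel matrix (the eigenvector normalization by √λ is omitted here for brevity; data and kernel width are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, c = 100, 2, 1.0
X = rng.normal(size=(n, 2))

def k(a, b):
    return np.exp(-np.sum((a - b) ** 2) / c)     # squared exponential kernel

K = np.array([[k(xi, xj) for xj in X] for xi in X])
C = np.eye(n) - np.ones((n, n)) / n
Kc = C @ K @ C                                   # double-centered kernel matrix

lam, A = np.linalg.eigh(Kc)                      # ascending eigenvalues
alphas = A[:, np.argsort(lam)[::-1][:p]]         # top-p eigenvectors alpha_j

def project(xq):
    kappa = np.array([k(xi, xq) for xi in X])    # kappa_i(x) = k(x_i, x)
    return alphas.T @ kappa                      # z = A kappa(x)

z = project(X[0])
```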
Kernel PCA

red points: data
green shading: eigenvector α_j represented as the function ∑_i α_{ji} k(x_i, x)

Kernel PCA “coordinates” allow us to discriminate clusters!

5:40
Kernel PCA

• Kernel PCA uncovers quite surprising structure:
While PCA “merely” picks high-variance dimensions, kernel PCA picks high-variance features—where features correspond to basis functions (RKHS elements) over x

• Kernel PCA may map data x_i to latent coordinates z_i where clustering is much easier

• All of the following can be represented as kernel PCA:
– Local Linear Embedding
– Metric Multidimensional Scaling
– Laplacian Eigenmaps (Spectral Clustering)
see “Dimensionality Reduction: A Short Tutorial” by Ali Ghodsi

5:41
Kernel PCA clustering

• Using a kernel function k(x, x′) = e^{−||x−x′||²/c}:

• Gaussian mixtures or k-means will easily cluster this

5:42
Spectral Clustering

Spectral Clustering is very similar to kernel PCA:
• Instead of the kernel matrix K with entries k_ij = k(x_i, x_j) we construct a weighted adjacency matrix, e.g.,

    w_ij = { 0 if x_i is not a kNN of x_j ; e^{−||x_i−x_j||²/c} otherwise }

w_ij is the weight of the edge between data points x_i and x_j.

• Instead of computing maximal eigenvectors of K, compute minimal eigenvectors of

    L = I − W̄ ,  W̄ = diag(∑_j w_ij)^-1 W

(∑_j w_ij is called the degree of node i; W̄ is the normalized weighted adjacency matrix)

5:43
• Given L = U D V>, we pick the p smallest eigenvectors V_p = V_{1:n,1:p}
• The latent coordinates for x_i are z_i = V_{i,1:p}

• Spectral Clustering provides a method to compute latent low-dimensional coordinates z_i = V_{i,1:p} for each high-dimensional input x_i ∈ R^d.

• This is then followed by a standard clustering, e.g., Gaussian Mixture or k-means

5:44
5:45

• Spectral Clustering is similar to kernel PCA:
– The kernel matrix K usually represents similarity
  The weighted adjacency matrix W represents proximity & similarity
– High eigenvectors of K are similar to low eigenvectors of L = I − W

• Original interpretation of Spectral Clustering:
– L = I − W (weighted graph Laplacian) describes a diffusion process:
  The diffusion rate W_ij is high if i and j are close and similar
– Eigenvectors of L correspond to stationary solutions

• The Graph Laplacian L: For some vector f ∈ R^n, note the following identities:

    (L f)_i = (∑_j w_ij) f_i − ∑_j w_ij f_j = ∑_j w_ij (f_i − f_j)
    f>L f = ∑_i f_i ∑_j w_ij (f_i − f_j) = ∑_ij w_ij (f_i² − f_i f_j) = ∑_ij w_ij (½ f_i² + ½ f_j² − f_i f_j) = ½ ∑_ij w_ij (f_i − f_j)²

where the second-to-last = holds if w_ij = w_ji is symmetric.

5:46
Metric Multidimensional Scaling

• Assume we have data D = {x_i}_{i=1}^n, x_i ∈ R^d.
As before we want to identify latent lower-dimensional representations z_i ∈ R^p for this data.

• A simple idea: Minimize the stress

    S_C(z_{1:n}) = ∑_{i≠j} (d_ij² − ||z_i − z_j||²)²

We want distances in the high-dimensional space to be equal to distances in the low-dimensional space.

5:47
Metric Multidimensional Scaling = (kernel) PCA

• Note the relation:

    d_ij² = ||x_i − x_j||² = ||x_i − x̄||² + ||x_j − x̄||² − 2(x_i − x̄)>(x_j − x̄)

This translates a distance into a (centered) scalar product

• If we define

    K = (I − (1/n)11>) D (I − (1/n)11>) ,  D_ij = −d_ij²/2

then K_ij = (x_i − x̄)>(x_j − x̄) is the normal covariance matrix and MDS is equivalent to kernel PCA

5:48
Non-metric Multidimensional Scaling

• We can do this for any data (also non-vectorial or not ∈ R^d) as long as we have a data set of comparative dissimilarities d_ij

    S(z_{1:n}) = ∑_{i≠j} (d_ij² − ||z_i − z_j||²)²

• Minimize S(z_{1:n}) w.r.t. z_{1:n} without any further constraints!

5:49
Example for Non-Metric MDS: ISOMAP

• Construct a kNN graph and label edges with the Euclidean distance
– Between any two x_i and x_j, compute the “geodesic” distance d_ij (shortest path along the graph)
– Then apply MDS

by Tenenbaum et al.

5:50
The zoo of dimensionality reduction methods

• PCA family:
– kernel PCA, non-neg. Matrix Factorization, Factor Analysis

• All of the following can be represented as kernel PCA:
– Local Linear Embedding
– Metric Multidimensional Scaling
– Laplacian Eigenmaps (Spectral Clustering)

They all use different notions of distance/correlation as input to kernel PCA

see “Dimensionality Reduction: A Short Tutorial” by Ali Ghodsi

5:51
PCA variants*
5:52
PCA variant: Non-negative Matrix Factorization*

• Assume we have data D = {x_i}_{i=1}^n, x_i ∈ R^d.
As for PCA (where we had x_i ≈ V_p z_i + µ) we search for a lower-dimensional space with a linear relation to x_i

• In NMF we require everything to be non-negative: the data x_i, the projection W, and the latent variables z_i.
Find W ∈ R^{p×d} (the transposed projection) and Z ∈ R^{n×p} (the latent variables z_i) such that

    X ≈ ZW

• Iterative solution: (E-step and M-step like...)

    z_ik ← z_ik [∑_{j=1}^d w_kj x_ij/(ZW)_ij] / [∑_{j=1}^d w_kj]
    w_kj ← w_kj [∑_{i=1}^N z_ik x_ij/(ZW)_ij] / [∑_{i=1}^N z_ik]

5:53
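The multiplicative updates above, vectorized (a sketch on an exactly rank-p non-negative matrix; these are the Lee & Seung-style updates the slide summarizes):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, p = 30, 8, 3
X = rng.random((n, p)) @ rng.random((p, d))   # non-negative, rank p

Z = rng.random((n, p))                        # latent variables z_i (rows)
W = rng.random((p, d))                        # transposed projection

def err(Z, W):
    return np.sum((X - Z @ W) ** 2)

e0 = err(Z, W)
for _ in range(200):
    R = X / (Z @ W)                           # ratios x_ij / (ZW)_ij
    Z *= (R @ W.T) / W.sum(axis=1)            # update z_ik
    W *= (Z.T @ R) / Z.sum(axis=0)[:, None]   # update w_kj
e1 = err(Z, W)                                # much smaller than e0
```

Note that the updates preserve non-negativity automatically, since they only multiply by non-negative factors.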
PCA variant: Non-negative Matrix Factorization*
(from Hastie 14.6)
5:54
PCA variant: Factor Analysis*

Another variant of PCA: (Bishop 12.64)
Allows for different noise in each dimension: P(x_i | z_i) = N(x_i | V_p z_i + µ, Σ) (with Σ diagonal)

5:55
Independent Component Analysis*

• Assume we have data D = {x_i}_{i=1}^n, x_i ∈ R^d.
PCA: P(x_i | z_i) = N(x_i | W z_i + µ, I) ,  P(z_i) = N(z_i | 0, I)
Factor Analysis: P(x_i | z_i) = N(x_i | W z_i + µ, Σ) ,  P(z_i) = N(z_i | 0, I)
ICA: P(x_i | z_i) = N(x_i | W z_i + µ, εI) ,  P(z_i) = ∏_{j=1}^d P(z_ij)

[Figure: graphical model, latent sources mixing into observed variables]

• In ICA
1) We have (usually) as many latent variables as observed: dim(x_i) = dim(z_i)
2) We require all latent variables to be independent
3) We allow the latent variables to be non-Gaussian

Note: without point (3) ICA would be senseless!

5:56
Partial least squares (PLS)*

• Is it really a good idea to just pick the p highest-variance components??
Why should that be a good idea?

5:57
PLS*

• Idea: The first dimension to pick should be the one most correlated with the OUTPUT, not with itself!

Input: data X ∈ R^{n×d}, y ∈ R^n
Output: predictions ŷ ∈ R^n
1: initialize the predicted output: ŷ = 〈y〉 1_n
2: initialize the remaining input dimensions: X̂ = X
3: for i = 1, .., p do
4:   i-th input ‘basis vector’: z_i = X̂ X̂> y
5:   update prediction: ŷ ← ŷ + Z_i y where Z_i = z_i z_i> / (z_i> z_i)
6:   remove “used” input dimensions: X̂ ← (I − Z_i) X̂
7: end for

(Hastie, page 81)
Line 4 identifies a new input “coordinate” via maximal correlation between the remaining input dimensions and y.
Line 5 updates the prediction to include the projection of y onto z_i.
Line 6 removes the projection of the input data X̂ along z_i. All z_i will be orthogonal.

5:58
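The algorithm written out (a sketch on synthetic data; following Hastie, the projection in line 6 is applied as (I − Z_i) X̂ so that the matrix dimensions match):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, p = 50, 6, 3
X = rng.normal(size=(n, d))
X = X - X.mean(axis=0)                         # assume centered inputs
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

yhat = np.full(n, y.mean())                    # line 1: yhat = <y> 1_n
Xr = X.copy()                                  # line 2: remaining input
zs = []
for _ in range(p):                             # line 3
    z = Xr @ (Xr.T @ y)                        # line 4: basis vector
    Zi = np.outer(z, z) / (z @ z)              # projector z z^T / (z^T z)
    yhat = yhat + Zi @ y                       # line 5: update prediction
    Xr = (np.eye(n) - Zi) @ Xr                 # line 6: remove used directions
    zs.append(z)
# successive basis vectors z_i come out mutually orthogonal
```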
PLS for classification*
• Not obvious.
• We’ll try to invent one in the exercises :-)

5:59
Clustering

• Clustering often involves two steps:
• First map the data to some embedding that emphasizes clusters
– (Feature) PCA
– Spectral Clustering
– Kernel PCA
– ISOMAP
• Then explicitly analyze clusters
– k-means clustering
– Gaussian Mixture Model
– Agglomerative Clustering

5:60
k-means Clustering

• Given data D = {x_i}_{i=1}^n, find K centers µ_k and a data assignment c: i ↦ k to minimize

    min_{c,µ} ∑_i (x_i − µ_{c(i)})²

• k-means clustering:
– Pick K data points randomly to initialize the centers µ_k
– Iterate adapting the assignments c(i) and the centers µ_k:

    ∀i : c(i) ← argmin_{c(i)} ∑_i (x_i − µ_{c(i)})² = argmin_k (x_i − µ_k)²
    ∀k : µ_k ← argmin_{µ_k} ∑_i (x_i − µ_{c(i)})² = (1/|c^-1(k)|) ∑_{i∈c^-1(k)} x_i

5:61
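The two alternating steps in code (a sketch; for a deterministic demo the centers are initialized with one data point from each blob rather than fully randomly):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (30, 2)),
               rng.normal(4, 0.5, (30, 2))])
K = 2
mu = X[[0, 30]].copy()   # init centers from data points (one per blob here)

for _ in range(20):
    # assignment step: c(i) = argmin_k (x_i - mu_k)^2
    d2 = np.sum((X[:, None, :] - mu[None, :, :]) ** 2, axis=-1)
    c = np.argmin(d2, axis=1)
    # center step: mu_k = mean of the assigned points
    for k in range(K):
        if np.any(c == k):
            mu[k] = X[c == k].mean(axis=0)

loss = np.sum((X - mu[c]) ** 2)
```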
k-means Clustering
from Hastie
5:62
k-means Clustering

• Converges to a local minimum → many restarts
• Choosing k? Plot L(k) = min_{c,µ} ∑_i (x_i − µ_{c(i)})² for different k – choose a trade-off between model complexity (large k) and data fit (small loss L(k))

5:63
k-means Clustering for Classification
from Hastie
5:64
Gaussian Mixture Model for Clustering

• GMMs can/should be introduced as a generative probabilistic model of the data:
– K different Gaussians with parameters µ_k, Σ_k
– Assignment RANDOM VARIABLE c_i ∈ {1, .., K} with P(c_i=k) = π_k
– The observed data point x_i with P(x_i | c_i=k; µ_k, Σ_k) = N(x_i | µ_k, Σ_k)

• EM-Algorithm described as a kind of soft-assignment version of k-means
– Initialize the centers µ_{1:K} randomly from the data; all covariances Σ_{1:K} to unit and all π_k uniformly.
– E-step: (probabilistic/soft assignment) Compute

    q(c_i=k) = P(c_i=k | x_i, µ_{1:K}, Σ_{1:K}) ∝ N(x_i | µ_k, Σ_k) π_k

– M-step: Update parameters (centers AND covariances)

    π_k = (1/n) ∑_i q(c_i=k)
    µ_k = (1/(n π_k)) ∑_i q(c_i=k) x_i
    Σ_k = (1/(n π_k)) ∑_i q(c_i=k) x_i x_i> − µ_k µ_k>

5:65
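The E- and M-steps above in one dimension with K = 2 (a sketch on synthetic data):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 0.5, 100), rng.normal(2, 0.5, 100)])
n, K = len(x), 2

def normal(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

mu = np.array([-1.0, 1.0])      # initial centers
var = np.ones(K)                # initial (co)variances
pi = np.full(K, 1.0 / K)        # initial mixture weights

for _ in range(50):
    # E-step: q(c_i=k) proportional to N(x_i | mu_k, var_k) pi_k
    q = np.stack([pi[k] * normal(x, mu[k], var[k]) for k in range(K)], axis=1)
    q /= q.sum(axis=1, keepdims=True)
    # M-step: update pi_k, mu_k, var_k as on the slide
    pi = q.sum(axis=0) / n
    mu = (q * x[:, None]).sum(axis=0) / (n * pi)
    var = (q * x[:, None] ** 2).sum(axis=0) / (n * pi) - mu ** 2
```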
Gaussian Mixture Model
EM iterations for Gaussian Mixture model:
from Bishop
5:66
Agglomerative Hierarchical Clustering

• agglomerative = bottom-up, divisive = top-down
• Merge the two groups with the smallest intergroup dissimilarity
• The dissimilarity of two groups G, H can be measured as
– Nearest Neighbor (or “single linkage”): d(G,H) = min_{i∈G,j∈H} d(x_i, x_j)
– Furthest Neighbor (or “complete linkage”): d(G,H) = max_{i∈G,j∈H} d(x_i, x_j)
– Group Average: d(G,H) = (1/(|G||H|)) ∑_{i∈G} ∑_{j∈H} d(x_i, x_j)

5:67
Agglomerative Hierarchical Clustering
5:68
5.4 Local learning

5:69

Local & lazy learning

• Idea of local (or “lazy”) learning:
Do not try to build one global model f(x) from the data. Instead, whenever we have a query point x*, we build a specific local model in the neighborhood of x*.

• Typical approach:
– Given a query point x*, find all kNN in the data D = {(x_i, y_i)}_{i=1}^N
– Fit a local model f_{x*} only to these kNNs, perhaps weighted
– Use the local model f_{x*} to predict x* ↦ ŷ

• Weighted local least squares:

    L^local(β, x*) = ∑_{i=1}^n K(x*, x_i)(y_i − f(x_i))² + λ||β||²

where K(x*, x) is called smoothing kernel. The optimum is:

    β̂ = (X>WX + λI)^-1 X>W y ,  W = diag(K(x*, x_{1:n}))

5:70
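A sketch of weighted local least squares at a single query point, with a Gaussian smoothing kernel and linear features (1, x) (all numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x = np.sort(rng.uniform(-3, 3, n))
y = np.sin(x) + 0.1 * rng.normal(size=n)
X = np.stack([np.ones(n), x], axis=1)          # features (1, x)

def local_fit(xstar, tau=0.5, lam=1e-6):
    w = np.exp(-(x - xstar) ** 2 / (2 * tau ** 2))   # smoothing kernel K(x*, x_i)
    W = np.diag(w)
    # beta = (X^T W X + lam I)^{-1} X^T W y
    beta = np.linalg.solve(X.T @ W @ X + lam * np.eye(2), X.T @ W @ y)
    return np.array([1.0, xstar]) @ beta

pred = local_fit(1.5)   # close to sin(1.5)
```

Note that the whole fit is redone per query point — this is the “lazy” part.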
Regression example

• kNN smoothing kernel: K(x*, x_i) = { 1 if x_i ∈ kNN(x*) ; 0 otherwise }
• Epanechnikov quadratic smoothing kernel: K_λ(x*, x) = D(|x* − x|/λ) ,  D(s) = { (3/4)(1 − s²) if s ≤ 1 ; 0 otherwise }

(Hastie, Sec 6.3)

5:71
Smoothing Kernels
from Wikipedia
5:72
kd-trees

• For local & lazy learning it is essential to efficiently retrieve the kNN

Problem: Given data X and a query x*, identify the kNNs of x* in X.

• Linear time (stepping through all of X) is far too slow.

A kd-tree pre-structures the data into a binary tree, allowing O(log n) retrieval of kNNs.

5:73

kd-trees

(There are “typos” in this figure... Exercise to find them.)

• Every node plays two roles:
– it defines a hyperplane that separates the data along one coordinate
– it hosts a data point, which lives exactly on the hyperplane (defines the division)

5:74

kd-trees

• Simplest (non-efficient) way to construct a kd-tree:
– hyperplanes divide alternatingly along the 1st, 2nd, ... coordinate
– choose a random point, use it to define the hyperplane, divide the data, iterate

• Nearest neighbor search:
– descend to a leaf node and take this as the initial nearest point
– ascend and check at each branching the possibility that a nearer point exists on the other side of the hyperplane

• Approximate Nearest Neighbor (libann on Debian..)

5:75
5.5 Combining weak and randomized learners
5:76
Combining learners
• The general idea is:– Given data D, let us learn various models f1, .., fM
– Our prediction is then some combination of these, e.g.
f(x) =
M∑m=1
αmfm(x)
• Various models could be:
Model averaging: Fully different types of models (using different (e.g. limited) feature sets; neural nets; decision trees; hyperparameters)
Bootstrap: Models of same type, trained on randomized versions of D
Boosting: Models of the same type, trained on cleverly designed modifications/reweightings of D
• How can we choose the α_m?
5:77
Bootstrap & Bagging
• Bootstrap:
– Data set D of size n
– Generate M data sets D_m by resampling D with replacement
– Each D_m is also of size n (some samples doubled or missing)
– Distribution over data sets ↔ distribution over β (compare slide 02-13)
– The ensemble f_1, .., f_M is similar to cross-validation
– Mean and variance of f_1, .., f_M can be used for model assessment
• Bagging (“bootstrap aggregation”):

f(x) = (1/M) Σ_{m=1}^M f_m(x)
5:78
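The bootstrap-and-average recipe above can be sketched for a simple ridge-regression base model; the data, λ, and M are illustrative, and the spread of the bootstrap estimates illustrates the model-assessment use mentioned above:

```python
import numpy as np

rng = np.random.default_rng(0)
n, M, lam = 40, 50, 0.1
x = rng.random(n)
y = 2.0 * x + 0.3 * rng.standard_normal(n)
X = np.stack([np.ones(n), x], axis=1)       # bias feature + input

betas = []
for m in range(M):
    idx = rng.integers(0, n, size=n)        # resample D with replacement
    Xm, ym = X[idx], y[idx]
    betas.append(np.linalg.solve(Xm.T @ Xm + lam * np.eye(2), Xm.T @ ym))
betas = np.array(betas)

beta_bag = betas.mean(axis=0)               # bagged estimate (averaged models)
beta_std = betas.std(axis=0)                # spread -> estimator-variance assessment
```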
• Bagging has a similar effect to regularization:
(Hastie, Sec 8.7)
5:79
Bayesian Model Averaging
• If f_1, .., f_M are very different models:
– Equal weighting would not be clever
– More confident models (less variance, fewer parameters, higher likelihood) → higher weight
• Bayesian Averaging:

P(y|x) = Σ_{m=1}^M P(y|x, f_m, D) P(f_m|D)

The term P(f_m|D) is the weighting α_m: it is high when the model is likely under the data (↔ the data is likely under the model & the model has “fewer parameters”).
5:80
The basis function view: Models are features!
• Compare model averaging f(x) = Σ_{m=1}^M α_m f_m(x) with regression:

f(x) = Σ_{j=1}^k φ_j(x) β_j = φ(x)^T β

• We can think of the M models f_m as features φ_j for linear regression!
– We know how to find optimal parameters α
– But beware overfitting!
5:81
Boosting
• In Bagging and Model Averaging, the models are trained on the “same data” (unbiased randomized versions of the same data)

• Boosting tries to be cleverer:
– It adapts the data for each learner
– It assigns each learner a differently weighted version of the data

• With this, boosting can
– Combine many “weak” classifiers to produce a powerful “committee”
– A weak learner only needs to be somewhat better than random
5:82
AdaBoost
(Freund & Schapire, 1997)
(classical algorithm; use Gradient Boosting instead in practice)

• Binary classification problem with data D = {(x_i, y_i)}_{i=1}^n, y_i ∈ {−1, +1}
• We know how to train discriminative functions f(x); let

G(x) = sign f(x) ∈ {−1, +1}

• We will train a sequence of classifiers G_1, .., G_M, each on differently weighted data, to yield a classifier

G(x) = sign Σ_{m=1}^M α_m G_m(x)
5:83
AdaBoost
(Hastie, Sec 10.1)
5:84
AdaBoost
Input: data D = {(x_i, y_i)}_{i=1}^n
Output: family of M classifiers G_m and weights α_m
1: initialize ∀i: w_i = 1/n
2: for m = 1, .., M do
3:   Fit classifier G_m to the training data weighted by w_i
4:   err_m = Σ_{i=1}^n w_i [y_i ≠ G_m(x_i)] / Σ_{i=1}^n w_i
5:   α_m = log[(1 − err_m)/err_m]
6:   ∀i: w_i ← w_i exp{α_m [y_i ≠ G_m(x_i)]}
7: end for
(Hastie, sec 10.1)
Weights remain unchanged for correctly classified points.
Misclassified points have their weights multiplied by (1 − err_m)/err_m > 1.
• Real AdaBoost: A variant exists that combines probabilistic classifiers σ(f(x)) ∈ [0, 1] instead of discrete G(x) ∈ {−1, +1}
5:85
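The algorithm box above can be sketched end to end with decision stumps as the weak learner G_m; the 2D data set and M = 40 rounds are illustrative choices, not from the lecture:

```python
import numpy as np

def fit_stump(X, y, w):
    """Weighted decision stump: best (dimension, threshold, sign) split."""
    best = (np.inf, 0, 0.0, 1)
    for a in range(X.shape[1]):
        for t in np.unique(X[:, a]):
            for s in (1, -1):
                pred = s * np.sign(X[:, a] - t + 1e-12)
                err = w[pred != y].sum()
                if err < best[0]:
                    best = (err, a, t, s)
    return best

def adaboost(X, y, M=40):
    n = len(y)
    w = np.full(n, 1.0 / n)
    stumps, alphas = [], []
    for m in range(M):
        err, a, t, s = fit_stump(X, y, w)
        err = max(err / w.sum(), 1e-12)          # normalized weighted error
        alpha = np.log((1 - err) / err)
        pred = s * np.sign(X[:, a] - t + 1e-12)
        w = w * np.exp(alpha * (pred != y))      # upweight misclassified points
        stumps.append((a, t, s)); alphas.append(alpha)
    def G(Xq):
        f = sum(al * s * np.sign(Xq[:, a] - t + 1e-12)
                for al, (a, t, s) in zip(alphas, stumps))
        return np.sign(f)
    return G

rng = np.random.default_rng(2)
X = rng.standard_normal((100, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)  # diagonal boundary: no single stump fits
G = adaboost(X, y)
train_acc = (G(X) == y).mean()
```

A single axis-aligned stump cannot represent the diagonal boundary; the weighted committee of stumps can.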
The basis function view
• In AdaBoost, each model G_m depends on the data weights w_m. We could write this as

f(x) = Σ_{m=1}^M α_m f_m(x, w_m)

The “features” f_m(x, w_m) now have additional parameters w_m. We’d like to optimize

min_{α, w_1,..,w_M} L(f)

w.r.t. α and all the feature parameters w_m.
• In general this is hard. But fixing the previous α_1,..,α_{m−1} and w_1,..,w_{m−1}, optimizing for the next α_m and w_m is efficient.
• AdaBoost does exactly this, choosing w_m so that the “feature” f_m will best reduce the loss (cf. PLS)
(Literally, AdaBoost uses an exponential loss or neg-log-likelihood; Hastie sec 10.4 & 10.5)
5:86
Gradient Boosting
• AdaBoost generates a series of basis functions by using different data weightings w_m depending on the so-far classification errors
• We can also generate a series of basis functions f_m by fitting them to the gradient of the so-far loss
5:87
Gradient Boosting
• Assume we want to minimize some loss function

min_f L(f) = Σ_{i=1}^n L(y_i, f(x_i))

We can solve this using gradient descent

f* = f_0 + α_1 ∂L(f_0)/∂f + α_2 ∂L(f_0 + α_1 f_1)/∂f + α_3 ∂L(f_0 + α_1 f_1 + α_2 f_2)/∂f + ···

where the successive gradient terms are approximated by basis functions f_1, f_2, f_3, ...
– Each f_m approximates the so-far loss gradient
– We use linear regression to choose the α_m (instead of line search)
• More intuitively: ∂L(f)/∂f “points into the direction of the error/residual of f”. It shows how f could be improved.
Gradient boosting uses the next learner f_m ≈ ∂L(f_so-far)/∂f to approximate how f can be improved.
Optimizing the α’s does the improvement.
5:88
Gradient Boosting
Input: function class F (e.g., of discriminative functions), data D = {(x_i, y_i)}_{i=1}^n, an arbitrary loss function L(y, ŷ)
Output: function f to minimize Σ_{i=1}^n L(y_i, f(x_i))
1: Initialize a constant f = f_0 = argmin_{f∈F} Σ_{i=1}^n L(y_i, f(x_i))
2: for m = 1 : M do
3:   For each data point i = 1 : n compute r_{im} = −∂L(y_i, f(x_i)) / ∂f(x_i) |_{f=f}
4:   Fit a regression f_m ∈ F to the targets r_{im}, minimizing squared error
5:   Find optimal coefficients (e.g., via logistic regression)
     α = argmin_α Σ_i L(y_i, Σ_{j=0}^m α_j f_j(x_i))
     (often: fix α_{0:m−1} and only optimize over α_m)
6:   Update f = Σ_{j=0}^m α_j f_j
7: end for
• If F is the set of regression/decision trees, then step 5 usually re-optimizes the terminal constants of all leaf nodes of the regression tree f_m. (Step 4 only determines the terminal regions.)
5:89
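For squared loss, the negative gradient r_{im} in step 3 is just the residual y_i − f(x_i), which makes the loop above easy to sketch with regression stumps as the class F. As a simplification (an assumption, not the lecture's choice), a fixed shrinkage ν replaces the optimization of α in step 5:

```python
import numpy as np

def fit_reg_stump(x, r):
    """Regression stump: split x at a threshold, predict the two region means."""
    best, bt = np.inf, None
    for t in np.unique(x)[:-1]:
        left, right = r[x <= t], r[x > t]
        sse = ((left - left.mean())**2).sum() + ((right - right.mean())**2).sum()
        if sse < best:
            best, bt = sse, (t, left.mean(), right.mean())
    t, cl, cr = bt
    return lambda xq: np.where(xq <= t, cl, cr)

rng = np.random.default_rng(3)
x = np.sort(rng.random(80))
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(80)

f = np.full_like(y, y.mean())       # f0 = constant argmin of squared loss
nu = 0.5                            # shrinkage in place of optimizing alpha_m
for m in range(50):
    r = y - f                       # negative gradient of squared loss = residual
    g = fit_reg_stump(x, r)         # step 4: fit the residuals
    f = f + nu * g(x)               # step 6: update the additive model

train_mse = ((y - f) ** 2).mean()
```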
Gradient boosting is the preferred method
• Hastie’s book quite “likes” gradient boosting– Can be applied to any loss function
– No matter if regression or classification
– Very good performance
– Simpler, more general, better than AdaBoost
5:90
Decision Trees
• Decision trees are particularly used in Bagging and Boosting contexts
• Decision trees are “linear in features”, but the features are the terminal regions of a tree, which are constructed depending on the data
• We’ll learn about
– Boosted decision trees & stumps– Random Forests
5:91
Decision Trees
• We describe CART (classification and regression trees)
• Decision trees are linear in features:

f(x) = Σ_{j=1}^k c_j [x ∈ R_j]

where the R_j are disjoint rectangular regions and c_j is the constant prediction in a region
• The regions are defined by a binary decision tree
5:92
Growing the decision tree
• The constants are the region averages c_j = Σ_i y_i [x_i ∈ R_j] / Σ_i [x_i ∈ R_j]

• Each split x_a > t is defined by a choice of input dimension a ∈ {1, .., d} and a threshold t
• Given a yet unsplit region R_j, we split it by choosing

min_{a,t} [ min_{c_1} Σ_{i: x_i∈R_j ∧ x_a≤t} (y_i − c_1)² + min_{c_2} Σ_{i: x_i∈R_j ∧ x_a>t} (y_i − c_2)² ]

– Finding the threshold t is really quick (slide along)
– We do this for every input dimension a
5:93
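The "slide along" trick can be made concrete: after sorting one input dimension, running prefix sums give each candidate threshold's two squared errors in O(1), so the whole scan is O(n log n). A sketch (the toy data are illustrative):

```python
import numpy as np

def best_split(x, y):
    """Scan thresholds in sorted order; prefix sums of y and y^2 give the
    left/right sums of squared errors for every split in O(1) each."""
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    n = len(ys)
    csum, csum2 = np.cumsum(ys), np.cumsum(ys ** 2)
    total, total2 = csum[-1], csum2[-1]
    best = (np.inf, None)
    for i in range(1, n):                     # split between xs[i-1] and xs[i]
        if xs[i] == xs[i - 1]:
            continue                          # identical values: no valid threshold
        sse_l = csum2[i - 1] - csum[i - 1] ** 2 / i
        sse_r = (total2 - csum2[i - 1]) - (total - csum[i - 1]) ** 2 / (n - i)
        if sse_l + sse_r < best[0]:
            best = (sse_l + sse_r, (xs[i - 1] + xs[i]) / 2)
    return best

# two clearly separated groups: the split should land between them
x = np.array([0.1, 0.2, 0.3, 0.7, 0.8, 0.9])
y = np.array([1.0, 1.1, 0.9, 3.0, 3.1, 2.9])
sse, t = best_split(x, y)
```

The identity used is Σ(y − ȳ)² = Σy² − (Σy)²/n, which is exactly what the prefix sums provide for each side of the split.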
Deciding on the depth (if not pre-fixed)
• We first grow a very large tree (e.g. until at most 5 data points live in each region)

• Then we rank all nodes using “weakest link pruning”: iteratively remove the node that least increases

Σ_{i=1}^n (y_i − f(x_i))²

• Use cross-validation to choose the eventual level of pruning

This is equivalent to choosing a regularization parameter λ for
L(T) = Σ_{i=1}^n (y_i − f(x_i))² + λ|T|
where the regularization |T| is the tree size
5:94
Example: CART on the Spam data set (details: Hastie, p 320)
Test error rate: 8.7%
5:95
Boosting trees & stumps
• A decision stump is a decision tree with fixed depth 1 (just one split)
• Gradient boosting of decision trees (of fixed depth J) and stumps is very effective

Test error rates on the Spam data set:
full decision tree: 8.7%
boosted decision stumps: 4.7%
boosted decision trees with J = 5: 4.5%
5:96
Random Forests: Bootstrapping & randomized splits
• Recall that Bagging averages models f_1, .., f_M, where each f_m was trained on a bootstrap resample D_m of the data D. This randomizes the models and avoids over-generalization

• Random Forests do Bagging, but additionally randomize the trees:
– When growing a new split, choose the input dimension a only from a random subset of m features
– m is often very small; even m = 1 or m = 3

• Random Forests are the prime example of “creating many randomized weak learners from the same data D”
5:97
Random Forests vs. gradient boosted trees
(Hastie, Fig 15.1)
5:98
Appendix: Centering & Whitening
• Some prefer to center (shift to zero mean) the data before applying methods:

x ← x − ⟨x⟩ , y ← y − ⟨y⟩

This spares augmenting the bias feature 1 to the data.

• More interesting: the loss and the best choice of λ depend on the scaling of the data. If we always scale the data to the same range, we may have better priors about the choice of λ and the interpretation of the loss:

x ← x / √Var{x} , y ← y / √Var{y}

• Whitening: Transform the data to remove all correlations and variances.
Let A = Var{x} = (1/n) X^T X − µµ^T with Cholesky decomposition A = M M^T. Then

x ← M^{-1} x , with Var{M^{-1}x} = I
5:99
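The whitening transform above is a one-liner once the Cholesky factor is available; the correlated synthetic data below are an illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(4)
A_true = np.array([[2.0, 0.0], [1.5, 0.5]])
X = rng.standard_normal((1000, 2)) @ A_true.T   # correlated 2D data

X = X - X.mean(axis=0)                          # centering: zero mean
A = X.T @ X / len(X)                            # covariance (mean already removed)
M = np.linalg.cholesky(A)                       # A = M M^T
Xw = X @ np.linalg.inv(M).T                     # x <- M^{-1} x, applied row-wise
cov_w = Xw.T @ Xw / len(Xw)                     # should be the identity
```

By construction, Var{M^{-1}x} = M^{-1} A M^{-T} = M^{-1} M M^T M^{-T} = I exactly (up to floating-point error).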
6 Bayesian Regression & Classification
Learning as Inference
• The parametric view

P(β|Data) = P(Data|β) P(β) / P(Data)

• The function space view

P(f|Data) = P(Data|f) P(f) / P(Data)
• Today:– Bayesian (Kernel) Ridge Regression ↔ Gaussian Process (GP)– Bayesian (Kernel) Logistic Regression ↔ GP classification– Bayesian Neural Networks (briefly)
6:1
• Beyond learning about specific Bayesian learning methods:

Understand the relations between

loss/error ↔ neg-log likelihood
regularization ↔ neg-log prior
cost (reg. + loss) ↔ neg-log posterior
6:2
Gaussian Process = Bayesian (Kernel) Ridge Regression6:3
Ridge regression as Bayesian inference
• We have random variables X_{1:n}, Y_{1:n}, β

• We observe data D = {(x_i, y_i)}_{i=1}^n and want to compute P(β|D)
• Let’s assume:
(graphical model: β → y_i ← x_i, for i = 1 : n)

P(X) is arbitrary
P(β) is Gaussian: β ∼ N(0, σ²/λ) ∝ e^{−(λ/2σ²)||β||²}
P(Y|X, β) is Gaussian: y = x^T β + ε , ε ∼ N(0, σ²)
6:4
Ridge regression as Bayesian inference
• Bayes’ Theorem:

P(β|D) = P(D|β) P(β) / P(D)

P(β|x_{1:n}, y_{1:n}) = ∏_{i=1}^n P(y_i|β, x_i) P(β) / Z

P(D|β) is a product of independent likelihoods for each observation (x_i, y_i)

Using the Gaussian expressions:

P(β|D) = (1/Z') ∏_{i=1}^n e^{−(1/2σ²)(y_i − x_i^T β)²} e^{−(λ/2σ²)||β||²}

−log P(β|D) = (1/2σ²) [ Σ_{i=1}^n (y_i − x_i^T β)² + λ||β||² ] + log Z'

−log P(β|D) ∝ L_ridge(β)

1st insight: The neg-log posterior −log P(β|D) is proportional to the cost function L_ridge(β)!
6:5
Ridge regression as Bayesian inference
• Let us compute P(β|D) explicitly:

P(β|D) = (1/Z') ∏_{i=1}^n e^{−(1/2σ²)(y_i − x_i^T β)²} e^{−(λ/2σ²)||β||²}
       = (1/Z') e^{−(1/2σ²) Σ_i (y_i − x_i^T β)²} e^{−(λ/2σ²)||β||²}
       = (1/Z') e^{−(1/2σ²) [(y − Xβ)^T(y − Xβ) + λ β^T β]}
       = (1/Z') e^{−(1/2) [(1/σ²) y^T y + (1/σ²) β^T(X^T X + λI)β − (2/σ²) β^T X^T y]}
       = N(β | β̂, Σ)
This is a Gaussian with covariance and mean

Σ = σ² (X^T X + λI)^{-1} , β̂ = (1/σ²) Σ X^T y = (X^T X + λI)^{-1} X^T y
• 2nd insight: The mean β̂ is exactly the classical argmin_β L_ridge(β).

• 3rd insight: The Bayesian inference approach not only gives a mean/optimal β, but also a variance Σ of that estimate!
6:6
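The closed-form posterior above is direct to compute; a sketch with an illustrative 1D data set (features φ(x) = [1, x]):

```python
import numpy as np

rng = np.random.default_rng(5)
n, sigma, lam = 30, 0.2, 1.0
x = rng.random(n)
X = np.stack([np.ones(n), x], axis=1)           # phi(x) = [1, x]
beta_true = np.array([0.5, -1.0])
y = X @ beta_true + sigma * rng.standard_normal(n)

# posterior P(beta|D) = N(beta | beta_hat, Sigma)
Sigma = sigma**2 * np.linalg.inv(X.T @ X + lam * np.eye(2))
beta_hat = Sigma @ X.T @ y / sigma**2           # = (X^T X + lam I)^{-1} X^T y

# predictive distribution of y* at x* = 0.7
phi = np.array([1.0, 0.7])
pred_mean = phi @ beta_hat
pred_var = sigma**2 + phi @ Sigma @ phi         # noise + parameter uncertainty
```

Note that `pred_var` is strictly larger than σ²: the extra term φ(x*)^T Σ φ(x*) is the "error bar" contributed by the uncertainty in β.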
Predicting with an uncertain β
• Suppose we want to make a prediction at x. We can compute the predictive distribution over a new observation y* at x*:

P(y*|x*, D) = ∫_β P(y*|x*, β) P(β|D) dβ
            = ∫_β N(y*|φ(x*)^T β, σ²) N(β|β̂, Σ) dβ
            = N(y*|φ(x*)^T β̂, σ² + φ(x*)^T Σ φ(x*))

Note: for f(x) = φ(x)^T β, we have P(f(x)|D) = N(f(x)|φ(x)^T β̂, φ(x)^T Σ φ(x)), without the σ²

• So, y* is Gaussian distributed around the mean prediction φ(x*)^T β̂:
(from Bishop, p 176)
6:7
Wrapup of Bayesian Ridge regression
• 1st insight: The neg-log posterior −log P(β|D) is, up to constants, equal to the cost function L_ridge(β)!

This is a very common relation: optimization costs correspond to neg-log probabilities; probabilities correspond to exp-neg costs.
• 2nd insight: The mean β̂ is exactly the classical argmin_β L_ridge(β).

More generally, the most likely parameter argmax_β P(β|D) is also the least-cost parameter argmin_β L(β). In the Gaussian case, the most likely β is also the mean.
• 3rd insight: The Bayesian inference approach not only gives a mean/optimal β, but also a variance Σ of that estimate!

This is a core benefit of the Bayesian view: It naturally provides a probability distribution over predictions (“error bars”), not only a single prediction.
6:8
Kernel Bayesian Ridge Regression
• As in the classical case, we can consider arbitrary features φ(x)
• .. or directly use a kernel k(x, x'):

P(f(x)|D) = N(f(x) | φ(x)^T β̂, φ(x)^T Σ φ(x))

φ(x)^T β̂ = φ(x)^T X^T (XX^T + λI)^{-1} y
          = κ(x) (K + λI)^{-1} y

φ(x)^T Σ φ(x) = σ² φ(x)^T (X^T X + λI)^{-1} φ(x)
              = (σ²/λ) φ(x)^T φ(x) − (σ²/λ) φ(x)^T X^T (XX^T + λI_k)^{-1} X φ(x)
              = (σ²/λ) k(x, x) − (σ²/λ) κ(x) (K + λI_n)^{-1} κ(x)^T

3rd line: as on slide 02:24
last lines: Woodbury identity (A + UBV)^{-1} = A^{-1} − A^{-1} U (B^{-1} + V A^{-1} U)^{-1} V A^{-1} with A = λI

• In standard conventions λ = σ², i.e. P(β) = N(β|0, 1)
– Regularization: scale the covariance function (or features)
6:9
Kernelized Bayesian Ridge Regression
is equivalent to Gaussian Processes
(see also Welling: “Kernel Ridge Regression” Lecture Notes; Rasmussen & Williams sections 2.1 & 6.2; Bishop sections 3.3.3 & 6)

• As we have the equations already, I skip further math details. (See Rasmussen & Williams)
6:10
Gaussian Processes
• The function space view

P(f|Data) = P(Data|f) P(f) / P(Data)
• Gaussian Processes define a probability distribution over functions:
– A function is an infinite dimensional thing – how could we define a Gaussian distribution over functions?
– For every finite set {x_1, .., x_M}, the function values f(x_1), .., f(x_M) are Gaussian distributed with mean and covariance

⟨f(x_i)⟩ = µ(x_i) (often zero)
⟨[f(x_i) − µ(x_i)][f(x_j) − µ(x_j)]⟩ = k(x_i, x_j)

Here, k(·, ·) is called the covariance function

• Second, Gaussian Processes define an observation probability

P(y|x, f) = N(y|f(x), σ²)
6:11
Gaussian Processes
(from Rasmussen & Williams)
6:12
GP: different covariance functions
(from Rasmussen & Williams)
• These are examples from the γ-exponential covariance function

k(x, x') = exp{−|(x − x')/l|^γ}
6:13
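Sampling from a GP prior makes the "distribution over functions" concrete: evaluate the covariance function on a grid and draw from the resulting multivariate Gaussian. A sketch (grid, length scale l, and the small diagonal jitter for numerical stability are illustrative choices):

```python
import numpy as np

def gamma_exp_kernel(xa, xb, l=0.3, gamma=2.0):
    """gamma-exponential covariance k(x, x') = exp(-|(x - x')/l|^gamma)."""
    return np.exp(-np.abs((xa[:, None] - xb[None, :]) / l) ** gamma)

rng = np.random.default_rng(6)
x = np.linspace(0, 1, 100)
K = gamma_exp_kernel(x, x)              # Gram matrix of the grid
K += 1e-6 * np.eye(len(x))              # jitter: keeps Cholesky stable
L = np.linalg.cholesky(K)
f = L @ rng.standard_normal(len(x))     # one sample f ~ N(0, K)
```

Each call produces one random function evaluated on the grid; γ = 2 gives smooth samples, smaller γ gives rougher ones.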
GP: derivative observations
(from Rasmussen & Williams)
6:14
• Bayesian Kernel Ridge Regression = Gaussian Process
• GPs have become a standard regression method
• If exact GP inference is not efficient enough, many approximations exist, e.g. sparse and pseudo-input GPs
6:15
GP classification = Bayesian (Kernel) Logistic Regression
6:16
Bayesian Logistic Regression
• f now defines a discriminative function:

P(X) is arbitrary
P(β) = N(β|0, 1/(2λ)) ∝ exp{−λ||β||²}
P(Y=1|X, β) = σ(β^T φ(x))

• Recall

L_logistic(β) = −Σ_{i=1}^n log p(y_i|x_i) + λ||β||²
• Again, the parameter posterior is
P(β|D) ∝ P(D|β) P(β) ∝ exp{−L_logistic(β)}
6:17
Bayesian Logistic Regression
• Use the Laplace approximation (2nd-order Taylor of L) at β* = argmin_β L(β):

L(β) ≈ L(β*) + β̄^T ∇ + (1/2) β̄^T H β̄ , β̄ = β − β*

At β*, the gradient ∇ = 0 and L(β*) = const. Therefore

P(β|D) ∝ exp{−(1/2) β̄^T H β̄} = N(β|β*, H^{-1})

• Then the predictive distribution of the discriminative function is also Gaussian!

P(f(x)|D) = ∫_β P(f(x)|β) P(β|D) dβ
          = ∫_β N(f(x)|φ(x)^T β, 0) N(β|β*, H^{-1}) dβ
          = N(f(x)|φ(x)^T β*, φ(x)^T H^{-1} φ(x)) =: N(f(x)|f*, s²)

• The predictive distribution over the label y ∈ {0, 1}:

P(y(x)=1|D) = ∫_{f(x)} σ(f(x)) P(f(x)|D) df
            ≈ σ((1 + s²π/8)^{-1/2} f*)

which uses a probit approximation of the convolution.
→ The variance s² pushes the predictive class probabilities towards 0.5.
6:18
Kernelized Bayesian Logistic Regression
• As with Kernel Logistic Regression, the MAP discriminative function f* can be found by iterating the Newton method ↔ iterating GP estimation on a re-weighted data set.
• The rest is as above.
6:19
Kernel Bayesian Logistic Regression
is equivalent to Gaussian Process Classification
• GP classification has become a standard classification method when the prediction needs to be a meaningful probability that takes the model uncertainty into account.
6:20
Bayesian Neural Networks
(details in Bishop)
6:21
General non-linear models
• Above we always assumed f(x) = φ(x)^T β (or kernelized)
• Bayesian Learning also works for non-linear function models f(x, β)

• Regression case:
P(X) is arbitrary
P(β) is Gaussian: β ∼ N(0, σ²/λ) ∝ e^{−(λ/2σ²)||β||²}
P(Y|X, β) is Gaussian: y = f(x, β) + ε , ε ∼ N(0, σ²)
6:22
General non-linear models
• To compute P(β|D) we first compute the most likely

β* = argmin_β L(β) = argmax_β P(β|D)

• Use the Laplace approximation around β*: a 2nd-order Taylor of f(x, β) and then of L(β) to estimate a Gaussian P(β|D)

• Neural Networks:
– The Gaussian prior P(β) = N(β|0, σ²/λ) is called weight decay
– This pushes “sigmoids to be in the linear region”.
6:23
6:23
Conclusions
• Probabilistic inference is a very powerful concept!
– Inferring about the world given data
– Learning, decision making, and reasoning can be viewed as forms of (probabilistic) inference

• We introduced Bayes’ Theorem as the fundamental form of probabilistic inference

• Marrying Bayes with (Kernel) Ridge (Logistic) regression yields
– Gaussian Processes
– Gaussian Process classification
6:24
A Probability Basics
The need for modelling
• Given a real-world problem, translating it to a well-defined learning problem is non-trivial

• The “framework” of plain regression/classification is restricted: input x, output y.

• Graphical models (probabilistic models with multiple random variables and dependencies) are a more general framework for modelling “problems”; regression & classification become a special case; Reinforcement Learning, decision making, unsupervised learning, but also language processing and image segmentation, can be represented.
7:1
Thomas Bayes (1702–1761)

“An Essay Towards Solving a Problem in the Doctrine of Chances”
• Addresses the problem of inverse probabilities: knowing the conditional probability of B given A, what is the conditional probability of A given B?
• Example: 40% of Bavarians speak dialect, but only 1% of non-Bavarians speak (Bav.) dialect. Given a random German that speaks non-dialect, is he Bavarian? (15% of Germans are Bavarian)
7:2
Inference
• “Inference” = Given some pieces of information (prior, observed variables), what is the implication (the implied information, the posterior) on a non-observed variable?
• Learning as Inference:
– given pieces of information: data, assumed model, prior over β
– non-observed variable: β
7:3
Probability Theory
• Why do we need probabilities?
– Obvious: to express inherent stochasticity of the world (data)

• But beyond this (also in a “deterministic world”):
– lack of knowledge!
– hidden (latent) variables
– expressing uncertainty
– expressing information (and lack of information)
• Probability Theory: an information calculus
7:4
Probability: Frequentist and Bayesian
• Frequentist probabilities are defined in the limit of an infinite number of trials
Example: “The probability of a particular coin landing heads up is 0.43”

• Bayesian (subjective) probabilities quantify degrees of belief
Example: “The probability of rain tomorrow is 0.3” – not possible to repeat “tomorrow”
7:5
Outline
• Basic definitions
– Random variables
– joint, conditional, marginal distribution
– Bayes’ theorem
• Examples for Bayes
• Probability distributions [skipped, only Gauss]
– Binomial; Beta
– Multinomial; Dirichlet
– Conjugate priors
– Gauss; Wishart
– Student-t, Dirac, Particles
• Monte Carlo, MCMC [skipped]
These are generic slides on probabilities I use throughout my lecture. Only parts are mandatory for the AI course.
7:6
A.1 Basic definitions
7:7
Probabilities & Random Variables
• For a random variable X with discrete domain dom(X) = Ω we write:
∀x∈Ω : 0 ≤ P(X=x) ≤ 1
Σ_{x∈Ω} P(X=x) = 1

Example: A dice can take values Ω = {1, .., 6}.
X is the random variable of a dice throw.
P(X=1) ∈ [0, 1] is the probability that X takes value 1.

• A bit more formally: a random variable is a map from a measurable space to a domain (sample space) and thereby introduces a probability measure on the domain (“assigns a probability to each possible value”)
7:8
Probabilty Distributions
• P(X=1) ∈ R denotes a specific probability
P(X) denotes the probability distribution (function over Ω)

Example: A dice can take values Ω = {1, 2, 3, 4, 5, 6}.
By P(X) we describe the full distribution over possible values {1, .., 6}. These are 6 numbers that sum to one, usually stored in a table, e.g.: [1/6, 1/6, 1/6, 1/6, 1/6, 1/6]

• In implementations we typically represent distributions over discrete random variables as tables (arrays) of numbers

• Notation for summing over a RV:
In equations we often need to sum over RVs. We then write
Σ_X P(X) · · ·
as shorthand for the explicit notation Σ_{x∈dom(X)} P(X=x) · · ·
7:9
Joint distributions
Assume we have two random variables X and Y
• Definitions:
Joint: P(X, Y)
Marginal: P(X) = Σ_Y P(X, Y)
Conditional: P(X|Y) = P(X, Y) / P(Y)

The conditional is normalized: ∀Y : Σ_X P(X|Y) = 1

• X is independent of Y iff: P(X|Y) = P(X)
(table thinking: all columns of P(X|Y) are equal)
7:10
Joint distributions
joint: P(X, Y)
marginal: P(X) = Σ_Y P(X, Y)
conditional: P(X|Y) = P(X, Y) / P(Y)

• Implications of these definitions:
Product rule: P(X, Y) = P(X|Y) P(Y) = P(Y|X) P(X)
Bayes’ Theorem: P(X|Y) = P(Y|X) P(X) / P(Y)
7:11
Bayes’ Theorem
P(X|Y) = P(Y|X) P(X) / P(Y)

posterior = likelihood · prior / normalization
7:12
Multiple RVs:
• Analogously for n random variables X_{1:n} (stored as a rank n tensor)
Joint: P(X_{1:n})
Marginal: P(X_1) = Σ_{X_{2:n}} P(X_{1:n})
Conditional: P(X_1|X_{2:n}) = P(X_{1:n}) / P(X_{2:n})

• X is conditionally independent of Y given Z iff: P(X|Y, Z) = P(X|Z)

• Product rule and Bayes’ Theorem:

P(X_{1:n}) = ∏_{i=1}^n P(X_i|X_{i+1:n})
P(X_1|X_{2:n}) = P(X_2|X_1, X_{3:n}) P(X_1|X_{3:n}) / P(X_2|X_{3:n})
P(X, Z, Y) = P(X|Y, Z) P(Y|Z) P(Z)
P(X|Y, Z) = P(Y|X, Z) P(X|Z) / P(Y|Z)
P(X, Y|Z) = P(X, Z|Y) P(Y) / P(Z)
7:13
Example 1: Bavarian dialect
• 40% of Bavarians speak dialect, but only 1% of non-Bavarians speak (Bav.) dialect.
Given a random German that speaks non-dialect, is he Bavarian? (15% of Germans are Bavarian)

P(D=1|B=1) = 0.4
P(D=1|B=0) = 0.01
P(B=1) = 0.15

It follows (with the graphical model B → D):

P(B=1|D=0) = P(D=0|B=1) P(B=1) / P(D=0) = (0.6 · 0.15) / (0.6 · 0.15 + 0.99 · 0.85) ≈ 0.097
7:14
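The computation above is three lines of arithmetic; spelling it out makes the marginalization in the denominator explicit:

```python
# Bayes' theorem for the Bavarian dialect example:
# P(B=1|D=0) = P(D=0|B=1) P(B=1) / P(D=0)
p_d1_b1, p_d1_b0, p_b1 = 0.4, 0.01, 0.15

# marginalize: P(D=0) = P(D=0|B=1)P(B=1) + P(D=0|B=0)P(B=0)
p_d0 = (1 - p_d1_b1) * p_b1 + (1 - p_d1_b0) * (1 - p_b1)
p_b1_d0 = (1 - p_d1_b1) * p_b1 / p_d0
```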
Example 2: Coin flipping
HHTHT
HHHHH

• What process produces these sequences d_1 d_2 d_3 d_4 d_5?

• We compare two hypotheses:
H = 1 : fair coin, P(d_i=H|H=1) = 1/2
H = 2 : always-heads coin, P(d_i=H|H=2) = 1

• Bayes’ theorem:
P(H|D) = P(D|H) P(H) / P(D)
7:15
Coin flipping
D = HHTHT

P(D|H=1) = 1/2⁵    P(H=1) = 999/1000
P(D|H=2) = 0       P(H=2) = 1/1000

P(H=1|D) / P(H=2|D) = [P(D|H=1) / P(D|H=2)] · [P(H=1) / P(H=2)] = (1/32)/0 · 999/1 = ∞
7:16
Coin flipping
D = HHHHH

P(D|H=1) = 1/2⁵    P(H=1) = 999/1000
P(D|H=2) = 1       P(H=2) = 1/1000

P(H=1|D) / P(H=2|D) = [P(D|H=1) / P(D|H=2)] · [P(H=1) / P(H=2)] = (1/32)/1 · 999/1 ≈ 30
7:17
Coin flipping
D = HHHHHHHHHH

P(D|H=1) = 1/2¹⁰   P(H=1) = 999/1000
P(D|H=2) = 1       P(H=2) = 1/1000
P(H=1|D) / P(H=2|D) = [P(D|H=1) / P(D|H=2)] · [P(H=1) / P(H=2)] = (1/1024)/1 · 999/1 ≈ 1
7:18
Learning as Bayesian inference
P(World|Data) = P(Data|World) P(World) / P(Data)

P(World) describes our prior over all possible worlds. Learning means to infer about the world we live in based on the data we have!

• In the context of regression, the “world” is the function f(x):

P(f|Data) = P(Data|f) P(f) / P(Data)

P(f) describes our prior over possible functions

Regression means to infer the function based on the data we have
7:19
A.2 Probability distributions
recommended reference: Bishop: Pattern Recognition and Machine Learning
7:20
Bernoulli & Binomial
• We have a binary random variable x ∈ {0, 1} (i.e. dom(x) = {0, 1})
The Bernoulli distribution is parameterized by a single scalar µ:

P(x=1|µ) = µ , P(x=0|µ) = 1 − µ
Bern(x|µ) = µ^x (1 − µ)^{1−x}

• We have a data set of random variables D = {x_1, .., x_n}, each x_i ∈ {0, 1}. If each x_i ∼ Bern(x_i|µ) we have

P(D|µ) = ∏_{i=1}^n Bern(x_i|µ) = ∏_{i=1}^n µ^{x_i} (1 − µ)^{1−x_i}

argmax_µ log P(D|µ) = argmax_µ Σ_{i=1}^n [x_i log µ + (1 − x_i) log(1 − µ)] = (1/n) Σ_{i=1}^n x_i
• The Binomial distribution is the distribution over the count m = Σ_{i=1}^n x_i:

Bin(m|n, µ) = (n choose m) µ^m (1 − µ)^{n−m} , (n choose m) = n! / ((n − m)! m!)
7:21
Beta
How to express uncertainty over a Bernoulli parameter µ?

• The Beta distribution is over the interval [0, 1], typically over the parameter µ of a Bernoulli:

Beta(µ|a, b) = (1/B(a, b)) µ^{a−1} (1 − µ)^{b−1}

with mean ⟨µ⟩ = a/(a+b) and mode µ* = (a−1)/(a+b−2) for a, b > 1

• The crucial point is:
– Assume we are in a world with a “Bernoulli source” (e.g., binary bandit), but don’t know its parameter µ
– Assume we have a prior distribution P(µ) = Beta(µ|a, b)
– Assume we collected some data D = {x_1, .., x_n}, x_i ∈ {0, 1}, with counts a_D = Σ_i x_i of [x_i=1] and b_D = Σ_i (1 − x_i) of [x_i=0]
– The posterior is

P(µ|D) = P(D|µ)/P(D) · P(µ) ∝ Bin(D|µ) Beta(µ|a, b)
       ∝ µ^{a_D} (1 − µ)^{b_D} µ^{a−1} (1 − µ)^{b−1} = µ^{a−1+a_D} (1 − µ)^{b−1+b_D}
       = Beta(µ|a + a_D, b + b_D)
7:22
Beta
The prior is Beta(µ | a, b), the posterior is Beta(µ | a+ aD, b+ bD)
• Conclusions:
– The semantics of a and b are counts of [x_i=1] and [x_i=0], respectively
– The Beta distribution is conjugate to the Bernoulli (explained later)
– With the Beta distribution we can represent beliefs (state of knowledge) about an uncertain µ ∈ [0, 1] and know how to update this belief given data
7:23
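The Beta update is literally count bookkeeping; a sketch with an illustrative prior Beta(µ|2, 2) and a small binary data set:

```python
from math import gamma

def beta_pdf(mu, a, b):
    """Beta density with normalization B(a,b) = Gamma(a)Gamma(b)/Gamma(a+b)."""
    B = gamma(a) * gamma(b) / gamma(a + b)
    return mu ** (a - 1) * (1 - mu) ** (b - 1) / B

a, b = 2, 2                                # prior Beta(mu|2, 2)
data = [1, 1, 0, 1, 1, 1]                  # observed Bernoulli draws
aD = sum(data)                             # count of [x_i = 1]
bD = len(data) - aD                        # count of [x_i = 0]
a_post, b_post = a + aD, b + bD            # posterior is Beta(mu|a+aD, b+bD)

post_mean = a_post / (a_post + b_post)             # <mu> = a/(a+b)
post_mode = (a_post - 1) / (a_post + b_post - 2)   # mode = (a-1)/(a+b-2)
```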
Beta
(figure: Beta densities for various a, b; from Bishop)
7:24
Multinomial
• We have an integer random variable x ∈ {1, .., K}
The probability of a single x can be parameterized by µ = (µ_1, .., µ_K):

P(x=k|µ) = µ_k

with the constraint Σ_{k=1}^K µ_k = 1 (probabilities need to be normalized)

• We have a data set of random variables D = {x_1, .., x_n}, each x_i ∈ {1, .., K}. If each x_i ∼ P(x_i|µ) we have

P(D|µ) = ∏_{i=1}^n µ_{x_i} = ∏_{i=1}^n ∏_{k=1}^K µ_k^{[x_i=k]} = ∏_{k=1}^K µ_k^{m_k}

where m_k = Σ_{i=1}^n [x_i=k] is the count of [x_i=k]. The ML estimator is

argmax_µ log P(D|µ) = (1/n) (m_1, .., m_K)

• The Multinomial distribution is this distribution over the counts m_k:

Mult(m_1, .., m_K|n, µ) ∝ ∏_{k=1}^K µ_k^{m_k}
7:25
Dirichlet
How to express uncertainty over a Multinomial parameter µ?

• The Dirichlet distribution is over the K-simplex, that is, over µ_1, .., µ_K ∈ [0, 1] subject to the constraint Σ_{k=1}^K µ_k = 1:

Dir(µ|α) ∝ ∏_{k=1}^K µ_k^{α_k−1}

It is parameterized by α = (α_1, .., α_K), has mean ⟨µ_i⟩ = α_i / Σ_j α_j and mode µ*_i = (α_i − 1) / (Σ_j α_j − K) for α_i > 1.

• The crucial point is:
– Assume we are in a world with a “Multinomial source” (e.g., an integer bandit), but don’t know its parameter µ
– Assume we have a prior distribution P(µ) = Dir(µ|α)
– Assume we collected some data D = {x_1, .., x_n}, x_i ∈ {1, .., K}, with counts m_k = Σ_i [x_i=k]
– The posterior is

P(µ|D) = P(D|µ)/P(D) · P(µ) ∝ Mult(D|µ) Dir(µ|α)
       ∝ ∏_{k=1}^K µ_k^{m_k} ∏_{k=1}^K µ_k^{α_k−1} = ∏_{k=1}^K µ_k^{α_k−1+m_k}
       = Dir(µ|α + m)
7:26
Dirichlet
The prior is Dir(µ |α), the posterior is Dir(µ |α+m)
• Conclusions:
– The semantics of α are the counts of [x_i=k]
– The Dirichlet distribution is conjugate to the Multinomial
– With the Dirichlet distribution we can represent beliefs (state of knowledge) about an uncertain µ of an integer random variable and know how to update this belief given data
7:27
Dirichlet
Illustrations for α = (0.1, 0.1, 0.1), α = (1, 1, 1) and α = (10, 10, 10):
(from Bishop)
7:28
Motivation for Beta & Dirichlet distributions
• Bandits:
– If we have binary [integer] bandits, the Beta [Dirichlet] distribution is a way to represent and update beliefs
– The belief space becomes discrete: The parameter α of the prior is continuous, but the posterior updates live on a discrete “grid” (adding counts to α)
– We can in principle do belief planning using this

• Reinforcement Learning:
– Assume we know that the world is a finite-state MDP, but do not know its transition probability P(s′|s, a). For each (s, a), P(s′|s, a) is a distribution over the integer s′
– Having a separate Dirichlet distribution for each (s, a) is a way to represent our belief about the world, that is, our belief about P(s′|s, a)
– We can in principle do belief planning using this → Bayesian Reinforcement Learning

• Dirichlet distributions are also used to model texts (word distributions in text), images, or mixture distributions in general
7:29
Conjugate priors
• Assume you have data D = {x_1, .., x_n} with likelihood

P(D|θ)

that depends on an uncertain parameter θ.
Assume you have a prior P(θ).

• The prior P(θ) is conjugate to the likelihood P(D|θ) iff the posterior

P(θ|D) ∝ P(D|θ) P(θ)

is in the same distribution class as the prior P(θ)
• Having a conjugate prior is very convenient, because then you know how to update the belief given data
7:30
Conjugate priors
likelihood                      conjugate prior
Binomial    Bin(D|µ)            Beta           Beta(µ|a, b)
Multinomial Mult(D|µ)           Dirichlet      Dir(µ|α)
Gauss       N(x|µ, Σ)           Gauss          N(µ|µ_0, A)
1D Gauss    N(x|µ, λ^{-1})      Gamma          Gam(λ|a, b)
nD Gauss    N(x|µ, Λ^{-1})      Wishart        Wish(Λ|W, ν)
nD Gauss    N(x|µ, Λ^{-1})      Gauss-Wishart  N(µ|µ_0, (βΛ)^{-1}) Wish(Λ|W, ν)
7:31
Distributions over continuous domain
• Let x be a continuous RV. The probability density function (pdf) p(x) ∈ [0, ∞) defines the probability

P(a ≤ x ≤ b) = ∫_a^b p(x) dx ∈ [0, 1]

The (cumulative) probability distribution F(y) = P(x ≤ y) = ∫_{−∞}^y p(x) dx ∈ [0, 1] is the cumulative integral with lim_{y→∞} F(y) = 1

(In discrete domains: probability distribution and probability mass function P(x) ∈ [0, 1] are used synonymously.)

• Two basic examples:

Gaussian: N(x|µ, Σ) = (1/|2πΣ|^{1/2}) e^{−(1/2)(x−µ)^T Σ^{-1} (x−µ)}

Dirac or δ (“point particle”): δ(x) = 0 except at x = 0, with ∫ δ(x) dx = 1;
δ(x) = ∂/∂x H(x), where H(x) = [x ≥ 0] is the Heaviside step function
7:32
Gaussian distribution
• 1-dim: N(x|µ, σ²) = (1/|2πσ²|^{1/2}) e^{−(1/2)(x−µ)²/σ²}

(figure: 1D Gaussian N(x|µ, σ²), centered at µ with width 2σ)

• n-dim Gaussian in normal form:

N(x|µ, Σ) = (1/|2πΣ|^{1/2}) exp{−(1/2)(x − µ)^T Σ^{-1} (x − µ)}
with mean µ and covariance matrix Σ. In canonical form:

N[x|a, A] = (exp{−(1/2) a^T A^{-1} a} / |2πA^{-1}|^{1/2}) exp{−(1/2) x^T A x + x^T a}   (17)

with precision matrix A = Σ^{-1} and coefficient a = Σ^{-1}µ (and mean µ = A^{-1}a).

• Gaussian identities: see http://ipvs.informatik.uni-stuttgart.de/mlr/marc/notes/gaussians.pdf
7:33
Gaussian identities
Symmetry: N(x|a, A) = N(a|x, A) = N(x − a|0, A)

Product:
N(x|a, A) N(x|b, B) = N[x|A^{-1}a + B^{-1}b, A^{-1} + B^{-1}] N(a|b, A + B)
N[x|a, A] N[x|b, B] = N[x|a + b, A + B] N(A^{-1}a|B^{-1}b, A^{-1} + B^{-1})

“Propagation”:
∫_y N(x|a + Fy, A) N(y|b, B) dy = N(x|a + Fb, A + FBF^T)

Transformation:
N(Fx + f|a, A) = (1/|F|) N(x|F^{-1}(a − f), F^{-1}AF^{-T})

Marginal & conditional:
N( (x, y) | (a, b), [[A, C], [C^T, B]] ) = N(x|a, A) · N(y|b + C^T A^{-1}(x − a), B − C^T A^{-1}C)

More Gaussian identities: see http://ipvs.informatik.uni-stuttgart.de/mlr/marc/notes/gaussians.pdf
7:34
7:34
Gaussian prior and posterior
• Assume we have data D = {x_1, .., x_n}, each x_i ∈ R^n, with likelihood

P(D|µ, Σ) = ∏_i N(x_i|µ, Σ)

argmax_µ P(D|µ, Σ) = (1/n) Σ_{i=1}^n x_i

argmax_Σ P(D|µ, Σ) = (1/n) Σ_{i=1}^n (x_i − µ)(x_i − µ)^T
• Assume we are initially uncertain about µ (but know Σ). We can express this uncertainty using again a Gaussian N[µ|a, A]. Given data we have

P(µ|D) ∝ P(D|µ, Σ) P(µ) = ∏_i N(x_i|µ, Σ) N[µ|a, A]
       = ∏_i N[µ|Σ^{-1}x_i, Σ^{-1}] N[µ|a, A] ∝ N[µ|Σ^{-1} Σ_i x_i + a, nΣ^{-1} + A]

Note: in the limit A → 0 (uninformative prior) this becomes

P(µ|D) = N(µ | (1/n) Σ_i x_i, (1/n) Σ)

which is consistent with the Maximum Likelihood estimator
7:35
Motivation for Gaussian distributions
• Gaussian Bandits• Control theory, Stochastic Optimal Control• State estimation, sensor processing, Gaussian filtering (Kalman filtering)• Machine Learning• etc
7:36
Particle Approximation of a Distribution
• We approximate a distribution p(x) over a continuous domain R^n

• A particle distribution q(x) is a weighted set S = {(x^i, w^i)}_{i=1}^N of N particles
– each particle has a “location” x^i ∈ R^n and a weight w^i ∈ R
– weights are normalized, Σ_i w^i = 1

q(x) := Σ_{i=1}^N w^i δ(x − x^i)

where δ(x − x^i) is the δ-distribution.

• Given weighted particles, we can estimate for any (smooth) f:

⟨f(x)⟩_p = ∫_x f(x) p(x) dx ≈ Σ_{i=1}^N w^i f(x^i)

See “An Introduction to MCMC for Machine Learning”: www.cs.ubc.ca/~nando/papers/mlintro.pdf
7:37
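A weighted particle set can be used directly as an estimator of ⟨f(x)⟩_p; a small sketch (the target p is a standard normal and f(x) = x², so the true expectation is Var[x] = 1 — both choices are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

# Particle distribution q(x) = sum_i w^i * delta(x - x^i)
N = 100000
x = rng.standard_normal(N)      # particle locations x^i ~ p
w = np.full(N, 1.0 / N)         # normalized weights, sum_i w^i = 1

# Estimate <f(x)>_p ≈ sum_i w^i f(x^i) for f(x) = x^2 (true value: 1)
f_estimate = np.sum(w * x**2)
```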
Particle Approximation of a Distribution

[figure: histogram of a particle representation]

7:38
Motivation for particle distributions

• Numeric representation of "difficult" distributions
– Very general and versatile
– But often needs many samples

• Distributions over games (action sequences), sample-based planning, MCTS
• State estimation, particle filters
• etc

7:39
Utilities & Decision Theory

• Given a space of events Ω (e.g., outcomes of a trial, a game, etc.) the utility is a function

U : Ω → R

• The utility represents preferences as a single scalar – which is not always obvious (cf. multi-objective optimization)

• Decision Theory means making decisions (that determine p(x)) that maximize expected utility

E_p{U} = ∫_x U(x) p(x)

• Concave utility functions imply risk aversion (and convex, risk-taking)

7:40
Entropy

• The neg-log (-log p(x)) of a distribution reflects something like "error":
– neg-log of a Gaussian ↔ squared error
– neg-log likelihood ↔ prediction error

• The (-log p(x)) is the "optimal" coding length you should assign to a symbol x. This will minimize the expected length of an encoding

H(p) = ∫_x p(x) [-log p(x)]

• The entropy H(p) = E_{p(x)}{-log p(x)} of a distribution p is a measure of uncertainty, or lack-of-information, we have about x

7:41
Kullback-Leibler divergence

• Assume you use a "wrong" distribution q(x) to decide on the coding length of symbols drawn from p(x). The expected length of an encoding is

∫_x p(x) [-log q(x)] ≥ H(p)

• The difference

D(p || q) = ∫_x p(x) log(p(x)/q(x)) ≥ 0

is called Kullback-Leibler divergence

Proof of the inequality, using the Jensen inequality:

-∫_x p(x) log(q(x)/p(x)) ≥ -log ∫_x p(x) (q(x)/p(x)) = 0

7:42
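For discrete distributions these quantities are one-liners, and the identity "expected code length under q = H(p) + D(p||q)" can be checked directly; a small sketch with arbitrary example distributions:

```python
import numpy as np

# Two example discrete distributions over the same support
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])

# Entropy H(p) = sum_x p(x) [-log p(x)]
H_p = -np.sum(p * np.log(p))

# Expected code length when coding with the "wrong" distribution q
cross = -np.sum(p * np.log(q))

# KL divergence D(p||q) = sum_x p(x) log(p(x)/q(x)) >= 0
kl = np.sum(p * np.log(p / q))
```

The decomposition cross = H_p + kl makes the slide's inequality explicit: the penalty for coding with q is exactly D(p||q).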
Monte Carlo methods

• Generally, a Monte Carlo method is a method to generate a set of (potentially weighted) samples that approximate a distribution p(x).

In the unweighted case, the samples should be i.i.d. x^i ∼ p(x).

In the general (also weighted) case, we want particles that allow us to estimate expectations of anything that depends on x, e.g. f(x):

lim_{N→∞} ⟨f(x)⟩_q = lim_{N→∞} ∑_{i=1}^N w^i f(x^i) = ∫_x f(x) p(x) dx = ⟨f(x)⟩_p

In this view, Monte Carlo methods approximate an integral.

• Motivation: p(x) itself is too complicated to express analytically or to compute ⟨f(x)⟩_p directly

• Example: What is the probability that a solitaire would come out successful? (Original story by Stan Ulam.) Instead of trying to analytically compute this, generate many random solitaires and count.

• Naming: The method was developed in the 1940s, when computers became faster. Fermi, Ulam and von Neumann initiated the idea. Von Neumann called it "Monte Carlo" as a code name.

7:43
Rejection Sampling

• How can we generate i.i.d. samples x^i ∼ p(x)?

• Assumptions:
– We can sample x ∼ q(x) from a simpler distribution q(x) (e.g., uniform), called proposal distribution
– We can numerically evaluate p(x) for a specific x (even if we don't have an analytic expression of p(x))
– There exists M such that ∀x : p(x) ≤ M q(x) (which implies q has larger or equal support than p)

• Rejection Sampling:
– Sample a candidate x ∼ q(x)
– With probability p(x)/(M q(x)) accept x and add it to S; otherwise reject
– Repeat until |S| = n

• This generates an unweighted sample set S to approximate p(x)

7:44
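A minimal sketch of this loop for an illustrative target — a Beta(2,2) density p(x) = 6x(1-x) on [0,1], whose maximum is 1.5, with a uniform proposal and M = 1.5 (all of these concrete choices are hypothetical, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(2)

def p(x):
    # Target density: Beta(2,2), p(x) = 6 x (1-x) on [0,1]
    return 6.0 * x * (1.0 - x)

M = 1.5  # p(x) <= M q(x) with q = Uniform(0,1); max of p is 1.5

def rejection_sample(n):
    samples = []
    while len(samples) < n:
        x = rng.uniform()                 # candidate x ~ q
        if rng.uniform() < p(x) / M:      # accept with probability p/(M q)
            samples.append(x)
    return np.array(samples)

S = rejection_sample(20000)  # mean should approach 0.5, variance 0.05
```

The acceptance rate is 1/M, which is why a proposal with small M (i.e. close to p) is crucial.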
Importance sampling

• Assumptions:
– We can sample x ∼ q(x) from a simpler distribution q(x) (e.g., uniform)
– We can numerically evaluate p(x) for a specific x (even if we don't have an analytic expression of p(x))

• Importance Sampling:
– Sample a candidate x ∼ q(x)
– Add the weighted sample (x, p(x)/q(x)) to S
– Repeat n times

• This generates a weighted sample set S to approximate p(x)

The weights w^i = p(x^i)/q(x^i) are called importance weights

• Crucial for efficiency: a good choice of the proposal q(x)

7:45
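A sketch of the weighted estimator, again with hypothetical choices: target p = standard normal, proposal q = Uniform(-5,5), and f(x) = x² (true expectation 1, up to negligible truncation outside [-5,5]):

```python
import numpy as np

rng = np.random.default_rng(3)

def p(x):
    # Target: standard normal density
    return np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)

# Proposal: uniform on [-5, 5], density q(x) = 1/10
n = 200000
x = rng.uniform(-5, 5, size=n)
w = p(x) / (1.0 / 10.0)   # importance weights w^i = p(x^i)/q(x^i)

# Estimate <x^2>_p with the weighted samples (true value: 1)
est = np.sum(w * x**2) / n
```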
Applications

• MCTS can be viewed as estimating a distribution over games (action sequences) conditional on winning
• Inference in graphical models (models involving many dependent random variables)
• etc

7:46
Some more continuous distributions*

Gaussian:      N(x | a, A) = |2πA|^{-1/2} exp{-½ (x-a)^T A^-1 (x-a)}
Dirac or δ:    δ(x) = ∂/∂x H(x)
Student's t (= Gaussian for ν → ∞, otherwise heavy tails):
               p(x; ν) ∝ [1 + x²/ν]^{-(ν+1)/2}
Exponential (distribution over a single event time):
               p(x; λ) = [x ≥ 0] λ e^{-λx}
Laplace ("double exponential"):
               p(x; µ, b) = (1/2b) e^{-|x-µ|/b}
Chi-squared:   p(x; k) ∝ [x ≥ 0] x^{k/2-1} e^{-x/2}
Gamma:         p(x; k, θ) ∝ [x ≥ 0] x^{k-1} e^{-x/θ}

7:47
B Graphical Models

Outline

• A. Introduction
– Motivation and definition of Bayes Nets
– Conditional independence in Bayes Nets
– Examples

• B. Inference in Graphical Models
– Sampling methods (Rejection, Importance, Gibbs)
– Variable Elimination & Factor Graphs
– Message passing, Loopy Belief Propagation

• C. Learning in Graphical Models
– Maximum likelihood learning
– Expectation Maximization, Gaussian Mixture Models, Hidden Markov Models, Free Energy formulation
– Discriminative Learning

8:1
Graphical Models

• The core difficulty in modelling is specifying
  What are the relevant variables?
  How do they depend on each other? (Or how could they depend on each other → learning)

• Graphical models are a simple, graphical notation for
  1) which random variables exist
  2) which random variables are "directly coupled"

Thereby they describe a joint probability distribution P(X_1, .., X_n) over n random variables.

• 2 basic variants:
– Bayesian Networks (aka. directed model, belief network)
– Factor Graphs (aka. undirected model, Markov Random Field)

8:2
Bayesian Networks

• A Bayesian Network is a
– directed acyclic graph (DAG)
– where each node represents a random variable X_i
– for each node we have a conditional probability distribution P(X_i | Parents(X_i))

• In the simplest case (discrete RVs), the conditional distribution is represented as a conditional probability table (CPT)

8:3
Example

drinking red wine → longevity?

8:4
Bayesian Networks

• DAG → we can sort the RVs; edges only go from lower to higher index

• The joint distribution can be factored as

P(X_{1:n}) = ∏_{i=1}^n P(X_i | Parents(X_i))

• Missing links imply conditional independence
• Ancestral simulation to sample from the joint distribution

8:5
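Ancestral simulation just samples each variable in topological order, conditional on its already-sampled parents; a minimal sketch on a hypothetical two-node net A → B with made-up CPT entries:

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical CPTs for the toy net A -> B
P_A = 0.3                        # P(A=1)
P_B_given_A = {0: 0.1, 1: 0.8}   # P(B=1 | A=a)

def ancestral_sample():
    # Sample in topological order: first A, then B given A
    a = int(rng.uniform() < P_A)
    b = int(rng.uniform() < P_B_given_A[a])
    return a, b

samples = [ancestral_sample() for _ in range(50000)]
p_b1 = sum(b for _, b in samples) / len(samples)
# Exact marginal: P(B=1) = 0.7*0.1 + 0.3*0.8 = 0.31
```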
Example

(Heckermann 1995)

[network: Battery → Gauge ← Fuel, Battery → TurnOver → Start ← Fuel]

P(B=bad) = 0.02                P(F=empty) = 0.05
P(G=empty | B=good, F=not empty) = 0.04
P(G=empty | B=good, F=empty) = 0.97
P(G=empty | B=bad, F=not empty) = 0.10
P(G=empty | B=bad, F=empty) = 0.99
P(T=no | B=good) = 0.03
P(T=no | B=bad) = 0.98
P(S=no | T=yes, F=not empty) = 0.01
P(S=no | T=yes, F=empty) = 0.92
P(S=no | T=no, F=not empty) = 1.00
P(S=no | T=no, F=empty) = 1.00

⟺ P(S,T,G,F,B) = P(B) P(F) P(G|F,B) P(T|B) P(S|T,F)

• Table sizes: LHS = 2^5 - 1 = 31,  RHS = 1 + 1 + 4 + 2 + 4 = 12

8:6
Constructing a Bayes Net

1. Choose a relevant set of variables X_i that describe the domain
2. Choose an ordering for the variables
3. While there are variables left:
   (a) Pick a variable X_i and add it to the network
   (b) Set Parents(X_i) to some minimal set of nodes already in the net
   (c) Define the CPT for X_i

8:7

• This procedure is guaranteed to produce a DAG

• Different orderings may lead to different Bayes nets representing the same joint distribution

• To ensure maximum sparsity choose a wise order ("root causes" first). Counter-example: construct the DAG for the car example using the ordering S, T, G, F, B

• A "wrong" ordering will give the same joint distribution, but will require the specification of more numbers than otherwise necessary

8:8
Bayes Nets & conditional independence

• Independence: Indep(X,Y) ⟺ P(X,Y) = P(X) P(Y)

• Conditional independence: Indep(X,Y|Z) ⟺ P(X,Y|Z) = P(X|Z) P(Y|Z)

X → Z ← Y  (head-to-head):   Indep(X,Y),   ¬Indep(X,Y|Z)
X ← Z → Y  (tail-to-tail):   ¬Indep(X,Y),   Indep(X,Y|Z)
X → Z → Y  (head-to-tail):   ¬Indep(X,Y),   Indep(X,Y|Z)

8:9

• Head-to-head: Indep(X,Y)
P(X,Y,Z) = P(X) P(Y) P(Z|X,Y)
P(X,Y) = P(X) P(Y) ∑_Z P(Z|X,Y) = P(X) P(Y)

• Tail-to-tail: Indep(X,Y|Z)
P(X,Y,Z) = P(Z) P(X|Z) P(Y|Z)
P(X,Y|Z) = P(X,Y,Z)/P(Z) = P(X|Z) P(Y|Z)

• Head-to-tail: Indep(X,Y|Z)
P(X,Y,Z) = P(X) P(Z|X) P(Y|Z)
P(X,Y|Z) = P(X,Y,Z)/P(Z) = P(X,Z) P(Y|Z)/P(Z) = P(X|Z) P(Y|Z)

8:10
General rules for determining conditional independence in a Bayes net:

• Given three groups of random variables X, Y, Z:

Indep(X,Y|Z) ⟺ every path from X to Y is "blocked by Z"

• A path is "blocked by Z" ⟺ on this path...
– ∃ a node in Z that is head-to-tail w.r.t. the path, or
– ∃ a node in Z that is tail-to-tail w.r.t. the path, or
– ∃ another node A which is head-to-head w.r.t. the path, and neither A nor any of its descendants are in Z

8:11
Example
(Heckermann 1995)
[same car-start network and CPTs as on slide 8:6]
Indep(T,F)?   Indep(B,F|S)?   Indep(B,S|T)?
8:12
What can we do with Bayes nets?

• Inference: Given some pieces of information (prior, observed variables), what is the implication (the implied information, the posterior) on a non-observed variable?

• Learning:
– Fully Bayesian Learning: Inference over parameters (e.g., β)
– Maximum likelihood training: Optimizing parameters

• Structure Learning (Learning/Inferring the graph structure itself): Decide which model (which graph structure) fits the data best; thereby uncovering conditional independencies in the data.

8:13
Inference

• Inference: Given some pieces of information (prior, observed variables), what is the implication (the implied information, the posterior) on a non-observed variable?

• In a Bayes Net: Assume there are three groups of RVs:
– Z are observed random variables
– X and Y are hidden random variables
– We want to do inference about X, not Y

Given some observed variables Z, compute the posterior marginal P(X | Z) for some hidden variable X:

P(X | Z) = P(X,Z)/P(Z) = (1/P(Z)) ∑_Y P(X,Y,Z)

where Y are all hidden random variables except for X

• Inference requires summing over (eliminating) hidden variables.

8:14
Example: Holmes & Watson

• Mr. Holmes lives in Los Angeles. One morning when Holmes leaves his house, he realizes that his grass is wet. Is it due to rain, or has he forgotten to turn off his sprinkler?
– Calculate P(R|H), P(S|H) and compare these values to the prior probabilities.
– Calculate P(R,S|H). Note: R and S are marginally independent, but conditionally dependent

• Holmes checks Watson's grass, and finds it is also wet.
– Calculate P(R|H,W), P(S|H,W)
– This effect is called explaining away

JavaBayes: run it from the html page http://www.cs.cmu.edu/˜javabayes/Home/applet.html

8:15
Example: Holmes & Watson

[network: Rain → Watson, Rain → Holmes ← Sprinkler]

P(R=yes) = 0.2                P(S=yes) = 0.1
P(W=yes | R=yes) = 1.0
P(W=yes | R=no) = 0.2
P(H=yes | R=yes, S=yes) = 1.0
P(H=yes | R=yes, S=no) = 1.0
P(H=yes | R=no, S=yes) = 0.9
P(H=yes | R=no, S=no) = 0.0

P(H,W,S,R) = P(H|S,R) P(W|R) P(S) P(R)

P(R|H) = ∑_{W,S} P(R,W,S,H) / P(H)
       = (1/P(H)) ∑_{W,S} P(H|S,R) P(W|R) P(S) P(R)
       = (1/P(H)) ∑_S P(H|S,R) P(S) P(R)

P(R=1 | H=1) = (1/P(H=1)) (1.0 · 0.2 · 0.1 + 1.0 · 0.2 · 0.9) = (1/P(H=1)) 0.2
P(R=0 | H=1) = (1/P(H=1)) (0.9 · 0.8 · 0.1 + 0.0 · 0.8 · 0.9) = (1/P(H=1)) 0.072

8:16
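This calculation can be reproduced by brute-force enumeration of the joint, using the CPTs from the slide; a minimal sketch:

```python
# CPTs from the Holmes & Watson slide (1 = yes, 0 = no)
P_R = {1: 0.2, 0: 0.8}
P_S = {1: 0.1, 0: 0.9}

def P_W(w, r):
    p1 = 1.0 if r == 1 else 0.2            # P(W=1 | R=r)
    return p1 if w == 1 else 1.0 - p1

def P_H(h, r, s):
    p1 = {(1, 1): 1.0, (1, 0): 1.0, (0, 1): 0.9, (0, 0): 0.0}[(r, s)]
    return p1 if h == 1 else 1.0 - p1

def joint(h, w, s, r):
    # P(H,W,S,R) = P(H|S,R) P(W|R) P(S) P(R)
    return P_H(h, r, s) * P_W(w, r) * P_S[s] * P_R[r]

# Posterior marginal by summing out the hidden variables W, S:
num = sum(joint(1, w, s, 1) for w in (0, 1) for s in (0, 1))        # P(R=1, H=1)
den = sum(joint(1, w, s, r) for w in (0, 1) for s in (0, 1) for r in (0, 1))
posterior = num / den   # P(R=1 | H=1) = 0.2 / 0.272 ≈ 0.735
```

Note how the posterior 0.735 is far above the prior P(R=1) = 0.2: wet grass makes rain much more likely.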
• These types of calculations can be automated→ Variable Elimination Algorithm
8:17
Example: Bavarian dialect

[network: B → D]

• Two binary random variables (RVs): B (bavarian) and D (dialect)

• Given: P(D,B) = P(D|B) P(B)
  P(D=1 | B=1) = 0.4,  P(D=1 | B=0) = 0.01,  P(B=1) = 0.15

• Notation: Grey shading usually indicates "observed"

8:18
Example: Coin flipping

[network: H → d_1, .., d_5]

• One binary RV H (hypothesis), 5 RVs for the coin tosses d_1, .., d_5

• Given: P(D,H) = ∏_i P(d_i | H) P(H)
  P(H=1) = 999/1000,  P(d_i=H | H=1) = 1/2,  P(d_i=H | H=2) = 1

8:19
Example: Ridge regression

[plate: β → y_i ← x_i, i = 1:n]

• One multi-variate RV β, 2n RVs x_{1:n}, y_{1:n} (observed data)

• Given: P(D,β) = ∏_i [P(y_i | x_i, β) P(x_i)] P(β)
  P(β) = N(β | 0, σ²/λ),  P(y_i | x_i, β) = N(y_i | x_i^T β, σ²)

• Plate notation: Plates (boxes with index ranges) mean "copy n times"

8:20
B.1 Inference in Graphical Models
8:21
Common inference methods in graphical models

• Sampling:
– Rejection sampling, importance sampling, Gibbs sampling
– More generally, Markov-Chain Monte Carlo (MCMC) methods

• Message passing:
– Exact inference on trees (includes the Junction Tree Algorithm)
– Belief propagation

• Other approximations/variational methods:
– Expectation propagation
– Specialized variational methods depending on the model

• Reductions:
– Mathematical Programming (e.g. LP relaxations of MAP)
– Compilation into Arithmetic Circuits (Darwiche et al.)

8:22
Probabilistic Inference & Optimization

• Note the relation of the Boltzmann distribution p(x) ∝ e^{-f(x)/T} with the cost function (energy) f(x) and temperature T.

8:23
Sampling

• Read Andrieu et al: An Introduction to MCMC for Machine Learning (Machine Learning, 2003)

• Here I'll discuss only three basic methods:
– Rejection sampling
– Importance sampling
– Gibbs sampling

8:24
Rejection Sampling

• We have a Bayesian Network with RVs X_{1:n}, some of which are observed: X_obs = y_obs, obs ⊂ {1:n}

• The goal is to compute marginal posteriors P(X_i | X_obs = y_obs) conditioned on the observations.

• Our strategy is to generate a set of K (joint) samples of all variables

S = {(x^k_1, x^k_2, .., x^k_n)}_{k=1}^K

Note: Each sample x^k_{1:n} is an instantiation of all RVs.

8:25

Rejection Sampling

• To generate a single sample x^k_{1:n}:

1. Sort all RVs in topological order; start with i = 1
2. Sample a value x^k_i ∼ P(X_i | x^k_{Parents(i)}) for the i-th RV, conditional on the previous samples x^k_{1:i-1}
3. If i ∈ obs, compare the sampled value x^k_i with the observation y_i. Reject and repeat from 1. if the sample is not equal to the observation.
4. Repeat with i ← i+1 from 2.

8:26
Rejection Sampling

• Since computers are fast, we can sample for large K

• The sample set S = {(x^k_1, x^k_2, .., x^k_n)}_{k=1}^K implies the necessary information:

• We can compute the marginal probabilities from these statistics:

P(X_i=x | X_obs=y_obs) ≈ count_S(x^k_i = x) / K

or pair-wise marginals:

P(X_i=x, X_j=x' | X_obs=y_obs) ≈ count_S(x^k_i = x ∧ x^k_j = x') / K

8:27
Importance sampling (with likelihood weighting)

• Rejecting whole samples may become very inefficient in large Bayes Nets!

• New strategy: We generate a weighted sample set

S = {(w^k, x^k_1, x^k_2, .., x^k_n)}_{k=1}^K

where each sample x^k_{1:n} is associated with a weight w^k

• In our case, we will choose the weights proportional to the likelihood P(X_obs = y_obs | X_{1:n} = x^k_{1:n}) of the observations conditional on the sample x^k_{1:n}

8:28

Importance sampling

• To generate a single sample (w^k, x^k_{1:n}):

1. Sort all RVs in topological order; start with i = 1 and w^k = 1
2. a) If i ∉ obs, sample a value x^k_i ∼ P(X_i | x^k_{Parents(i)}) for the i-th RV, conditional on the previous samples x^k_{1:i-1}
   b) If i ∈ obs, set the value x^k_i = y_i and update the weight according to the likelihood:
      w^k ← w^k P(X_i=y_i | x^k_{1:i-1})
3. Repeat with i ← i+1 from 2.

8:29
Importance Sampling (with likelihood weighting)

• From the weighted sample set

S = {(w^k, x^k_1, x^k_2, .., x^k_n)}_{k=1}^K

we can compute the marginal probabilities:

P(X_i=x | X_obs=y_obs) ≈ ∑_{k=1}^K w^k [x^k_i = x] / ∑_{k=1}^K w^k

and likewise pair-wise marginals, etc.

Notation: [expr] = 1 if expr is true and zero otherwise

8:30
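Likelihood weighting can be sketched on the Holmes & Watson network from slide 8:16: the latent R, S are sampled from their priors, the observed H=1 is clamped, and each sample is weighted by P(H=1 | r, s). This is an illustrative mini-implementation, not code from the lecture:

```python
import numpy as np

rng = np.random.default_rng(5)

def P_H1(r, s):
    # P(H=1 | R=r, S=s) from the Holmes & Watson CPT
    return {(1, 1): 1.0, (1, 0): 1.0, (0, 1): 0.9, (0, 0): 0.0}[(r, s)]

K = 100000
num = den = 0.0
for _ in range(K):
    r = int(rng.uniform() < 0.2)   # R ~ P(R), prior
    s = int(rng.uniform() < 0.1)   # S ~ P(S), prior
    w = P_H1(r, s)                 # weight = likelihood of the observation H=1
    den += w
    num += w * r

# Weighted marginal P(R=1 | H=1); exact value: 0.2/0.272 ≈ 0.735
estimate = num / den
```

W never needs to be sampled here since it is unobserved and has no observed descendants; in a general implementation it would simply be sampled and ignored.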
Gibbs sampling

• In Gibbs sampling we also generate a sample set S – but in this case the samples are not independent from each other anymore. The next sample "modifies" the previous one:

• First, all observed RVs are clamped to their fixed value x^k_i = y_i for any k.

• To generate the (k+1)-th sample, iterate through the latent variables i ∉ obs, updating:

x^{k+1}_i ∼ P(X_i | x^k_{1:n \ i})
         ∼ P(X_i | x^k_1, x^k_2, .., x^k_{i-1}, x^k_{i+1}, .., x^k_n)
         ∼ P(X_i | x^k_{Parents(i)}) ∏_{j: i ∈ Parents(j)} P(X_j = x^k_j | X_i, x^k_{Parents(j) \ i})

That is, each x^{k+1}_i is resampled conditional on the other (neighboring) current sample values.

8:31

Gibbs sampling

• As for rejection sampling, Gibbs sampling generates an unweighted sample set S which can directly be used to compute marginals. In practice, one often discards an initial set of samples (burn-in) to avoid starting biases.

• Gibbs sampling is a special case of MCMC sampling. Roughly, MCMC means to invent a sampling process where the next sample may stochastically depend on the previous one (Markov property), such that the final sample set is guaranteed to correspond to P(X_{1:n}).
→ An Introduction to MCMC for Machine Learning

8:32
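A sketch of a Gibbs sampler for P(R | H=1) on the same Holmes & Watson network (again an illustrative mini-implementation, not from the lecture): H=1 stays clamped, and R, S, W are resampled in turn from their full conditionals.

```python
import numpy as np

rng = np.random.default_rng(6)

def P_H1(r, s):
    return {(1, 1): 1.0, (1, 0): 1.0, (0, 1): 0.9, (0, 0): 0.0}[(r, s)]

def P_W1(r):
    return 1.0 if r == 1 else 0.2

def bern(p):
    return int(rng.uniform() < p)

r, s, w = 1, 0, 1                   # initialization consistent with H=1
burn_in, K = 1000, 200000
count_r = 0
for t in range(burn_in + K):
    # P(R | s, w, H=1) ∝ P(R) P(w|R) P(H=1|R,s)  (prior times children)
    pr1 = 0.2 * (P_W1(1) if w else 1 - P_W1(1)) * P_H1(1, s)
    pr0 = 0.8 * (P_W1(0) if w else 1 - P_W1(0)) * P_H1(0, s)
    r = bern(pr1 / (pr1 + pr0))
    # P(S | r, H=1) ∝ P(S) P(H=1|r,S)
    ps1 = 0.1 * P_H1(r, 1)
    ps0 = 0.9 * P_H1(r, 0)
    s = bern(ps1 / (ps1 + ps0))
    # P(W | r) — W is latent and has no observed children
    w = bern(P_W1(r))
    if t >= burn_in:
        count_r += r

estimate = count_r / K              # exact: 0.2/0.272 ≈ 0.735
```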
Sampling – conclusions

• Sampling algorithms are very simple, very general and very popular
– they work equally for continuous & discrete RVs
– one only needs to ensure/implement the ability to sample from conditional distributions, no further algebraic manipulations
– MCMC theory can reduce the required number of samples

• In many cases exact and more efficient approximate inference is possible by actually computing/manipulating whole distributions in the algorithms instead of only samples.

8:33
Variable Elimination
8:34
Variable Elimination example

[figure: the graph over X_1, .., X_6 (edges X_1→X_2, X_1→X_3, X_2→X_4, X_3→X_5, X_2→X_6, X_5→X_6) and the sequence of reduced graphs after eliminating X_4, X_6, X_1, X_2, X_3; boxes mark the sub-problems F_1, .., F_5, the µ's the remaining terms]

P(x_5)
= ∑_{x_1,x_2,x_3,x_4,x_6} P(x_1) P(x_2|x_1) P(x_3|x_1) P(x_4|x_2) P(x_5|x_3) P(x_6|x_2,x_5)
= ∑_{x_1,x_2,x_3,x_6} P(x_1) P(x_2|x_1) P(x_3|x_1) P(x_5|x_3) P(x_6|x_2,x_5) ∑_{x_4} P(x_4|x_2)   [= F_1(x_2,x_4)]
= ∑_{x_1,x_2,x_3,x_6} P(x_1) P(x_2|x_1) P(x_3|x_1) P(x_5|x_3) P(x_6|x_2,x_5) µ_1(x_2)
= ∑_{x_1,x_2,x_3} P(x_1) P(x_2|x_1) P(x_3|x_1) P(x_5|x_3) µ_1(x_2) ∑_{x_6} P(x_6|x_2,x_5)   [= F_2(x_2,x_5,x_6)]
= ∑_{x_1,x_2,x_3} P(x_1) P(x_2|x_1) P(x_3|x_1) P(x_5|x_3) µ_1(x_2) µ_2(x_2,x_5)
= ∑_{x_2,x_3} P(x_5|x_3) µ_1(x_2) µ_2(x_2,x_5) ∑_{x_1} P(x_1) P(x_2|x_1) P(x_3|x_1)   [= F_3(x_1,x_2,x_3)]
= ∑_{x_2,x_3} P(x_5|x_3) µ_1(x_2) µ_2(x_2,x_5) µ_3(x_2,x_3)
= ∑_{x_3} P(x_5|x_3) ∑_{x_2} µ_1(x_2) µ_2(x_2,x_5) µ_3(x_2,x_3)   [= F_4(x_3,x_5)]
= ∑_{x_3} P(x_5|x_3) µ_4(x_3,x_5)   [= F_5(x_3,x_5)]
= µ_5(x_5)

8:35
Variable Elimination example – lessons learnt

• There is a dynamic programming principle behind Variable Elimination:
– For eliminating X_{5,4,6} we use the solution of eliminating X_{4,6}
– The "sub-problems" are represented by the F terms, their solutions by the remaining µ terms
– We'll continue to discuss this 4 slides later!

• The factorization of the joint
– determines in which order Variable Elimination is efficient
– determines what the terms F(...) and µ(...) depend on

• We can automate Variable Elimination. For the automation, all that matters is the factorization of the joint.

8:36
Factor graphs

• In the previous slides we introduced the box notation to indicate terms that depend on some variables. That's exactly what factor graphs represent.

• A Factor graph is a
– bipartite graph
– where each circle node represents a random variable X_i
– each box node represents a factor f_k, which is a function f_k(X_∂k)
– the joint probability distribution is given as

P(X_{1:n}) = ∏_{k=1}^K f_k(X_∂k)

Notation: ∂k is shorthand for Neighbors(k)

8:37
Bayes Net → factor graph

• Bayesian Network:

[graph over X_1, .., X_6 as before]

P(x_{1:6}) = P(x_1) P(x_2|x_1) P(x_3|x_1) P(x_4|x_2) P(x_5|x_3) P(x_6|x_2,x_5)

• Factor Graph:

[the same variables, with one box node per factor]

P(x_{1:6}) = f_1(x_1,x_2) f_2(x_3,x_1) f_3(x_2,x_4) f_4(x_3,x_5) f_5(x_2,x_5,x_6)

→ each CPT in the Bayes Net is just a factor (we neglect the special semantics of a CPT)

8:38
Variable Elimination Algorithm

• eliminate_single_variable(F, i)
  1: Input: list F of factors, variable id i
  2: Output: list F of factors
  3: find the subset F̂ ⊆ F of factors coupled to i: F̂ = {k : i ∈ ∂k}
  4: create a new factor k̂ with neighborhood ∂k̂ = all variables in F̂ except i
  5: compute µ_k̂(X_∂k̂) = ∑_{X_i} ∏_{k∈F̂} f_k(X_∂k)
  6: remove the old factors F̂ and append the new factor µ_k̂ to F
  7: return F

• elimination_algorithm(µ, F, M)
  1: Input: list F of factors, tuple M of desired output variable ids
  2: Output: single factor µ over variables X_M
  3: define all variables present in F: V = vars(F)
  4: define the variables to be eliminated: E = V \ M
  5: for all i ∈ E: eliminate_single_variable(F, i)
  6: for all remaining factors, compute the product µ = ∏_{f∈F} f
  7: return µ

8:39
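For discrete factors stored as arrays, each elimination step is just a tensor contraction. A minimal sketch on a hypothetical chain A – B – C with binary variables, where P(A,B,C) = f1(A,B) f2(B,C) and we eliminate A, then B, to obtain P(C):

```python
import numpy as np

# Hypothetical chain factors (binary variables):
# f1(A,B) = P(A) P(B|A),  f2(B,C) = P(C|B)
f1 = np.array([[0.3, 0.7],
               [0.9, 0.1]]) * np.array([[0.6], [0.4]])
f2 = np.array([[0.2, 0.8],
               [0.5, 0.5]])

# eliminate A: mu1(B) = sum_A f1(A,B)
mu1 = f1.sum(axis=0)
# eliminate B: mu2(C) = sum_B mu1(B) f2(B,C)
mu2 = mu1 @ f2

# sanity check: brute-force summation over the full joint gives the same
joint = np.einsum('ab,bc->abc', f1, f2)
```

The point of the algorithm is that the two contractions never materialize the full joint table, which matters once the chain is long.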
Variable Elimination on trees

[figure: a tree-structured factor graph over X and Y_1, .., Y_8, with three subtrees F_1, F_2, F_3 attached to X]

The subtrees w.r.t. X can be described as
F_1(Y_{1,8}, X) = f_1(Y_8, Y_1) f_2(Y_1, X)
F_2(Y_{2,6,7}, X) = f_3(X, Y_2) f_4(Y_2, Y_6) f_5(Y_2, Y_7)
F_3(Y_{3,4,5}, X) = f_6(X, Y_3, Y_4) f_7(Y_4, Y_5)

The joint distribution is:
P(Y_{1:8}, X) = F_1(Y_{1,8}, X) F_2(Y_{2,6,7}, X) F_3(Y_{3,4,5}, X)

8:40
Variable Elimination on trees

[figure: the same tree, annotated with the messages µ_{F_1→X}, µ_{F_2→X}, µ_{F_3→X}]

We can eliminate each subtree independently. The remaining terms (messages) are:
µ_{F_1→X}(X) = ∑_{Y_{1,8}} F_1(Y_{1,8}, X)
µ_{F_2→X}(X) = ∑_{Y_{2,6,7}} F_2(Y_{2,6,7}, X)
µ_{F_3→X}(X) = ∑_{Y_{3,4,5}} F_3(Y_{3,4,5}, X)

The marginal P(X) is the product of subtree messages:
P(X) = µ_{F_1→X}(X) µ_{F_2→X}(X) µ_{F_3→X}(X)

8:41
Variable Elimination on trees – lessons learnt

• The "remaining terms" µ's are called messages. Intuitively, messages subsume information from a subtree

• Marginal = product of messages, P(X) = ∏_k µ_{F_k→X}, is very intuitive:
– Fusion of independent information from the different subtrees
– Fusing independent information ↔ multiplying probability tables

• Along a (sub-)tree, messages can be computed recursively

8:42
Message passing

• General equations (belief propagation (BP)) for recursive message computation (writing µ_{k→i}(X_i) instead of µ_{F_k→X}(X)):

µ_{k→i}(X_i) = ∑_{X_∂k\i} f_k(X_∂k) ∏_{j∈∂k\i} ∏_{k'∈∂j\k} µ_{k'→j}(X_j)

where the inner product µ_{j→k}(X_j) := ∏_{k'∈∂j\k} µ_{k'→j}(X_j) corresponds to F(subtree).

∏_{j∈∂k\i}: branching at factor k, product over adjacent variables j excluding i
∏_{k'∈∂j\k}: branching at variable j, product over adjacent factors k' excluding k

µ_{j→k}(X_j) are called "variable-to-factor messages": store them for efficiency

[figure: the tree from before, annotated with example messages]

Example messages:
µ_{2→X} = ∑_{Y_1} f_2(Y_1, X) µ_{1→Y_1}(Y_1)
µ_{6→X} = ∑_{Y_3,Y_4} f_6(Y_3, Y_4, X) µ_{7→Y_4}(Y_4)
µ_{3→X} = ∑_{Y_2} f_3(Y_2, X) µ_{4→Y_2}(Y_2) µ_{5→Y_2}(Y_2)

8:43
Message passing remarks

• Computing these messages recursively on a tree does nothing else than Variable Elimination
⇒ P(X_i) = ∏_{k∈∂i} µ_{k→i}(X_i) is the correct posterior marginal

• However, since it stores all "intermediate terms", we can compute ANY marginal P(X_i) for any i

• Message passing exemplifies how to exploit the factorization structure of the joint distribution for the algorithmic implementation

• Note: These are recursive equations. They can be resolved exactly if and only if the dependency structure (factor graph) is a tree. If the factor graph has loops, this would be a "loopy recursive equation system"...

8:44
Message passing variants

• Message passing has many important applications:
– Many models are actually trees: in particular chains, esp. Hidden Markov Models
– Message passing can also be applied on non-trees (↔ loopy graphs) → approximate inference (Loopy Belief Propagation)
– Bayesian Networks can be "squeezed" to become trees → exact inference in Bayes Nets! (Junction Tree Algorithm)

8:45
Loopy Belief Propagation

• If the graphical model is not a tree (= has loops):
– The recursive message equations cannot be resolved.
– However, we could try to just iterate them as update equations...

• Loopy BP update equations (initialize with µ_{k→i} = 1):

µ^new_{k→i}(X_i) = ∑_{X_∂k\i} f_k(X_∂k) ∏_{j∈∂k\i} ∏_{k'∈∂j\k} µ^old_{k'→j}(X_j)

8:46
Loopy BP remarks

• Problem of loops, intuitively: loops ⇒ branches of a node do not represent independent information!
– BP is multiplying (= fusing) messages from dependent sources of information

• No convergence guarantee, but if it converges, then to a state of marginal consistency

∑_{X_∂k\i} b(X_∂k) = ∑_{X_∂k'\i} b(X_∂k') = b(X_i)

and to a minimum of the Bethe approximation of the free energy (Yedidia, Freeman & Weiss, 2001)

• We shouldn't be overly disappointed:
– if BP were exact on loopy graphs we could efficiently solve NP-hard problems...
– loopy BP is a very interesting approximation to solving an NP-hard problem
– it is hence also applied in the context of combinatorial optimization (e.g., SAT problems)

• Ways to tackle the problems with BP convergence:
– Damping (Heskes, 2004: On the uniqueness of loopy belief propagation fixed points)
– CCCP (Yuille, 2002: CCCP algorithms to minimize the Bethe and Kikuchi free energies: Convergent alternatives to belief propagation)
– Tree-reweighted MP (Kolmogorov, 2006: Convergent tree-reweighted message passing for energy minimization)

8:47
Junction Tree Algorithm

• Many models have loops

Instead of applying loopy BP in the hope of getting a good approximation, it is possible to convert every model into a tree by redefinition of RVs. The Junction Tree Algorithm converts a loopy model into a tree.

• Loops are resolved by defining larger variable groups (separators) on which messages are defined

8:48
Junction Tree Example

• Example:

[figure: a loopy graph over the variables A, B, C, D]

• Join variables B and C to a single separator

[figure: the resulting chain A – (B,C) – D]

This can be viewed as a variable substitution: rename the tuple (B,C) as a single random variable

• A single random variable may be part of multiple separators – but only along a running intersection

8:49
Junction Tree Algorithm

• Standard formulation: Moralization & Triangulation
  A clique is a fully connected subset of nodes in a graph.
  1) Generate the factor graph (classically called "moralization")
  2) Translate each factor to a clique: generate the undirected graph where undirected edges connect all RVs of a factor
  3) Triangulate the undirected graph
  4) Translate each clique back to a factor; identify the separators between factors

• Formulation in terms of variable elimination:
  1) Start with a factor graph
  2) Choose an order of variable elimination
  3) Keep track of the "remaining µ terms" (slide 14): which RVs would they depend on? → this identifies the separators

8:50
Junction Tree Algorithm Example

[figure: the graph over X_1, .., X_6 from before, and the corresponding junction tree]

• If we eliminate in the order 4, 6, 5, 1, 2, 3, we get the remaining terms

(X_2), (X_2, X_5), (X_2, X_3), (X_2, X_3), (X_3)

which translates to the Junction Tree on the right

8:51
Maximum a-posteriori (MAP) inference

• Often we want to compute the most likely global assignment

X^MAP_{1:n} = argmax_{X_{1:n}} P(X_{1:n})

of all random variables. This is called MAP inference and can be solved by replacing all ∑ by max in the message passing equations – the algorithm is called Max-Product Algorithm and is a generalization of Dynamic Programming methods like Viterbi or Dijkstra.

• Application: Conditional Random Fields

f(y, x) = φ(y, x)^T β = ∑_{j=1}^k φ_j(y_∂j, x) β_j = log[ ∏_{j=1}^k e^{φ_j(y_∂j, x) β_j} ]

with prediction x ↦ y*(x) = argmax_y f(x, y)

Finding the argmax is a MAP inference problem! This is frequently needed in the inner loop of many CRF learning algorithms.

8:52
Conditional Random Fields

• The following are interchangeable: "Random Field" ↔ "Markov Random Field" ↔ Factor Graph

• Therefore, a CRF is a conditional factor graph:
– A CRF defines a mapping from input x to a factor graph over y
– Each feature φ_j(y_∂j, x) depends only on a subset ∂j of variables y_∂j
– If y_∂j are discrete, a feature φ_j(y_∂j, x) is usually an indicator feature (see lecture 03); the corresponding parameter β_j is then one entry of a factor f_k(y_∂j) that couples these variables

8:53
What we didn’t cover
• A very promising line of research is solving inference problems using mathe-matical programming. This unifies research in the areas of optimization, math-ematical programming and probabilistic inference.
Linear Programming relaxations of MAP inference and CCCP methods are great examples.
8:54
B.2 Learning in Graphical Models
8:55
Fully Bayes vs. ML learning

• Fully Bayesian learning (learning = inference)

Every model "parameter" (e.g., the β of Ridge regression) is a RV. → We have a prior over this parameter; learning = computing the posterior

• Maximum-Likelihood learning (incl. Expectation Maximization)

The conditional probabilities have parameters θ (e.g., the entries of a CPT). We find parameters θ to maximize the likelihood of the data under the model.

8:56
Likelihood, ML, fully Bayesian, MAP

We have a probabilistic model P(X_{1:n}; θ) that depends on parameters θ. (Note: the ';' is used to indicate parameters.) We have data D.

• The likelihood of the data under the model: P(D; θ)

• Maximum likelihood (ML) parameter estimate: θ^ML := argmax_θ P(D; θ)

• If we treat θ as a RV with prior P(θ) and P(X_{1:n}, θ) = P(X_{1:n}|θ) P(θ):
  Bayesian parameter estimate: P(θ|D) ∝ P(D|θ) P(θ)
  Bayesian prediction: P(prediction|D) = ∫_θ P(prediction|θ) P(θ|D)

• Maximum a posteriori (MAP) parameter estimate: θ^MAP = argmax_θ P(θ|D)

8:57
Maximum likelihood learning

– fully observed case
– with latent variables (→ Expectation Maximization)

8:58
Maximum likelihood parameters – fully observed

• An example: Given a model with three random variables X, Y_1, Y_2:

[network: X → Y_1, X → Y_2]

– We have unknown parameters

P(X=x) ≡ a_x,  P(Y_1=y | X=x) ≡ b_yx,  P(Y_2=y | X=x) ≡ c_yx

– Given a data set {(x_i, y_{1,i}, y_{2,i})}_{i=1}^n, how can we learn the parameters θ = (a, b, c)?

• Answer:

a_x = (1/n) ∑_{i=1}^n [x_i=x]
b_yx = ∑_{i=1}^n [y_{1,i}=y ∧ x_i=x] / (n a_x)
c_yx = ∑_{i=1}^n [y_{2,i}=y ∧ x_i=x] / (n a_x)

8:59
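These counting formulas can be sketched directly in NumPy; all binary variables and true parameter values below are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical true parameters of the net X -> {Y1, Y2} (binary)
a_true = 0.3                   # P(X=1)
b_true = {0: 0.2, 1: 0.7}      # P(Y1=1 | X=x)
c_true = {0: 0.5, 1: 0.9}      # P(Y2=1 | X=x)

# Generate fully observed data
n = 100000
x = (rng.uniform(size=n) < a_true).astype(int)
y1 = (rng.uniform(size=n) < np.where(x == 1, b_true[1], b_true[0])).astype(int)
y2 = (rng.uniform(size=n) < np.where(x == 1, c_true[1], c_true[0])).astype(int)

# ML estimates = relative frequencies, as on the slide
a_hat = x.mean()
b_hat = {v: y1[x == v].mean() for v in (0, 1)}
c_hat = {v: y2[x == v].mean() for v in (0, 1)}
```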
Maximum likelihood parameters – fully observed

• In what sense is this the correct answer?
→ These parameters maximize the observed data log-likelihood

log P(x_{1:n}, y_{1,1:n}, y_{2,1:n}; θ) = ∑_{i=1}^n log P(x_i, y_{1,i}, y_{2,i}; θ)

• When all RVs are fully observed, learning ML parameters is usually simple. This also holds when some random variables might be Gaussian (see the mixture of Gaussians example).

8:60
Maximum likelihood parameters with missing data

• Given a model with three random variables X, Y_1, Y_2:

[network: X → Y_1, X → Y_2; X unobserved]

– We have unknown parameters

P(X=x) ≡ a_x,  P(Y_1=y | X=x) ≡ b_yx,  P(Y_2=y | X=x) ≡ c_yx

– Given a partial data set {(y_{1,i}, y_{2,i})}_{i=1}^n, how can we learn the parameters θ = (a, b, c)?

8:61
General idea

• We should somehow fill in the missing data:
  partial data {(y_{1:2,i})}_{i=1}^n → augmented data {(x_i, y_{1:2,i})}_{i=1}^n

• If we knew the model P(X, Y_{1,2}) already, we could use it to infer an x_i for each partial datum (y_{1:2,i})
→ then use this "augmented data" to train the model

• A chicken-and-egg situation:

[diagram: "if we had the full data (x_i, y_i, z_i)" —(estimate parameters)→ "if we had a model P(X,Y,Z)" —(augment data)→ back to the full data]

8:62
Expectation Maximization (EM)

– Let X be a (set of) latent variables
– Let Y be a (set of) observed variables
– Let P(X, Y; θ) be a parameterized probabilistic model
– We have partial data D = {(y_i)}_{i=1}^n
– We start with some initial guess θ^old of parameters

Iterate:

• Expectation-step: Compute the posterior belief over the latent variables using θ^old

q_i(x) = P(x | y_i; θ^old)

• Maximization-step: Choose new parameters to maximize the expected data log-likelihood (expectation w.r.t. q)

θ^new = argmax_θ ∑_{i=1}^n ∑_x q_i(x) log P(x, y_i; θ)   [=: Q(θ, q)]

8:63
EM for our example

(model: X → Y1, X → Y2)

• E-step: Compute

q_i(x; θ^old) = P(x | y_{1:2,i}; θ^old) ∝ a^old_x b^old_{y_{1,i},x} c^old_{y_{2,i},x}

• M-step: Update using the expected counts:

a^new_x = ∑_{i=1}^n q_i(x) / n

b^new_{yx} = ∑_{i=1}^n q_i(x) [y_{1,i}=y] / (n a^new_x)

c^new_{yx} = ∑_{i=1}^n q_i(x) [y_{2,i}=y] / (n a^new_x)
8:64
Examples for the application of EM

– Mixture of Gaussians
– Hidden Markov Models
– Outlier detection in regression
8:65
Gaussian Mixture Model – as example for EM

• What might be a generative model of this data?

• There might be two "underlying" Gaussian distributions. A data point is drawn (here with about 50/50 chance) from one of the Gaussians.
8:66
Gaussian Mixture Model – as example for EM

(graphical model: c_i → x_i)

• K different Gaussians with parameters µ_k, Σ_k
• A latent random variable c_i ∈ {1, .., K} for each data point, with prior

P(c_i=k) = π_k

• The observed data point x_i:

P(x_i | c_i=k; µ_k, Σ_k) = N(x_i | µ_k, Σ_k)
8:67
Gaussian Mixture Model – as example for EM

Same situation as before:

• If we had full data D = (c_i, x_i)_{i=1}^n, we could estimate the parameters

π_k = (1/n) ∑_{i=1}^n [c_i=k]

µ_k = (1/(n π_k)) ∑_{i=1}^n [c_i=k] x_i

Σ_k = (1/(n π_k)) ∑_{i=1}^n [c_i=k] x_i x_i^T − µ_k µ_k^T

• If we knew the parameters µ_{1:K}, Σ_{1:K}, we could infer the mixture labels

P(c_i=k | x_i; µ_{1:K}, Σ_{1:K}) ∝ N(x_i | µ_k, Σ_k) P(c_i=k)
8:68
Gaussian Mixture Model – as example for EM

• E-step: Compute

q(c_i=k) = P(c_i=k | x_i; µ_{1:K}, Σ_{1:K}) ∝ N(x_i | µ_k, Σ_k) P(c_i=k)

• M-step: Update parameters

π_k = (1/n) ∑_i q(c_i=k)

µ_k = (1/(n π_k)) ∑_i q(c_i=k) x_i

Σ_k = (1/(n π_k)) ∑_i q(c_i=k) x_i x_i^T − µ_k µ_k^T
8:69
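These E-/M-step updates can be sketched in numpy. This is a minimal illustration under simplifying assumptions (a crude deterministic initialization, a fixed iteration count, and a small diagonal term for numerical stability), not a reference implementation:

```python
import numpy as np

def em_gmm(X, K, iters=50):
    """EM for a Gaussian mixture model, following the E-/M-step updates above."""
    n, d = X.shape
    # crude deterministic init: data points spread along the first coordinate
    # (in practice one would use random restarts or a k-means initialization)
    mu = X[np.argsort(X[:, 0])[np.linspace(0, n - 1, K).astype(int)]].copy()
    Sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(K)])
    pi = np.full(K, 1.0 / K)
    for _ in range(iters):
        # E-step: q(c_i=k) proportional to N(x_i | mu_k, Sigma_k) * pi_k
        q = np.zeros((n, K))
        for k in range(K):
            diff = X - mu[k]
            Sinv = np.linalg.inv(Sigma[k])
            norm = ((2 * np.pi) ** d * np.linalg.det(Sigma[k])) ** -0.5
            q[:, k] = pi[k] * norm * np.exp(-0.5 * np.sum(diff @ Sinv * diff, axis=1))
        q /= q.sum(axis=1, keepdims=True)
        # M-step: expected-count updates for pi_k, mu_k, Sigma_k
        Nk = q.sum(axis=0)
        pi = Nk / n
        mu = (q.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (q[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(d)
    return pi, mu, Sigma
```

On two well-separated clusters this recovers the component means and roughly equal mixing weights.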
Gaussian Mixture Model – as example for EM

EM iterations for the Gaussian mixture model: (figure from Bishop)
8:70
• Generally, the term mixture is used when there is a discrete latent variable (which indicates the "mixture component").
8:71
Hidden Markov Model – as example for EM

(figure: a 1D time series of 1000 data points that noisily alternates between two mean levels)

What probabilistic model could explain this data?

8:72
Hidden Markov Model – as example for EM

(figure: the same time series, with the true two-valued latent state overlaid)

There could be a latent variable, with two states, indicating the high/low mean of the data.
8:73
Hidden Markov Models

• We assume we have
– observed (discrete or continuous) variables Y_t in each time slice
– a discrete latent variable X_t in each time slice
– some observation model P(Y_t | X_t; θ)
– some transition model P(X_t | X_{t-1}; θ)

• A Hidden Markov Model (HMM) is defined as the joint distribution

P(X_{0:T}, Y_{0:T}) = P(X_0) · ∏_{t=1}^T P(X_t | X_{t-1}) · ∏_{t=0}^T P(Y_t | X_t) .

(graphical model: a chain X_0 → X_1 → ... → X_T with emissions X_t → Y_t)
8:74
EM for Hidden Markov Models

• Given data D = (y_t)_{t=0}^T. How can we learn all parameters of an HMM?

• What are the parameters of an HMM?
– The number K of discrete latent states x_t (the cardinality of dom(x_t))
– The transition probability P(x_t | x_{t-1}) (a K × K CPT P_{k'k})
– The initial state distribution P(x_0) (a K-table π_k)
– Parameters of the output distribution P(y_t | x_t=k),
e.g., the mean µ_k and covariance matrix Σ_k for each latent state k

• Learning K is very hard!
– Try and evaluate many K
– Have a distribution over K (→ infinite HMMs)
– We assume K fixed in the following

• Learning the other parameters is easier: Expectation Maximization

8:75
EM for Hidden Markov Models

The parameters of an HMM are θ = (P_{k'k}, π_k, µ_k, Σ_k).

• E-step: Compute the marginal posteriors (also pair-wise)

q(x_t) = P(x_t | y_{0:T}; θ^old)
q(x_t, x_{t+1}) = P(x_t, x_{t+1} | y_{0:T}; θ^old)

• M-step: Update the parameters

P_{k'k} ← ∑_{t=0}^{T-1} q(x_t=k, x_{t+1}=k') / ∑_{t=0}^{T-1} q(x_t=k) ,  π_k ← q(x_0=k)

µ_k ← ∑_{t=0}^T w_{kt} y_t ,  Σ_k ← ∑_{t=0}^T w_{kt} y_t y_t^T − µ_k µ_k^T

with w_{kt} = q(x_t=k) / ∑_{t=0}^T q(x_t=k)  (each row of w_{kt} is normalized)

Compare with EM for Gaussian mixtures!

8:76
E-step: Inference in an HMM – a tree!

(graphical model: the HMM chain, split around X_2 into factors F_past(X_{0:2}, Y_{0:1}), F_now(X_2, Y_2), F_future(X_{2:T}, Y_{3:T}))

• The marginal posterior P(X_t | Y_{0:T}) is the product of three messages

P(X_t | Y_{0:T}) ∝ P(X_t, Y_{0:T}) = µ_past(X_t) µ_now(X_t) µ_future(X_t) = α(X_t) ϱ(X_t) β(X_t)

• For all a < t and b > t:
– X_a is conditionally independent of X_b given X_t
– Y_a is conditionally independent of Y_b given X_t

"The future is independent of the past given the present" (Markov property)

(conditioning on Y_t does not yield any conditional independences)

8:77
Inference in HMMs

Applying the general message passing equations:

forward msg.  µ_{X_{t-1}→X_t}(x_t) =: α_t(x_t) = ∑_{x_{t-1}} P(x_t | x_{t-1}) α_{t-1}(x_{t-1}) ϱ_{t-1}(x_{t-1}) ,  α_0(x_0) = P(x_0)

backward msg.  µ_{X_{t+1}→X_t}(x_t) =: β_t(x_t) = ∑_{x_{t+1}} P(x_{t+1} | x_t) β_{t+1}(x_{t+1}) ϱ_{t+1}(x_{t+1}) ,  β_T(x_T) = 1

observation msg.  µ_{Y_t→X_t}(x_t) =: ϱ_t(x_t) = P(y_t | x_t)

posterior marginal  q(x_t) ∝ α_t(x_t) ϱ_t(x_t) β_t(x_t)

posterior marginal  q(x_t, x_{t+1}) ∝ α_t(x_t) ϱ_t(x_t) P(x_{t+1} | x_t) ϱ_{t+1}(x_{t+1}) β_{t+1}(x_{t+1})
8:78
Inference in HMMs – implementation notes

• The message passing equations can be implemented by reinterpreting them as matrix equations: Let α_t, β_t, ϱ_t be the vectors corresponding to the probability tables α_t(x_t), β_t(x_t), ϱ_t(x_t); and let P be the matrix with entries P(x_t | x_{t-1}). Then

1: α_0 = π ,  β_T = 1
2: for t = 1:T :  α_t = P (α_{t-1} · ϱ_{t-1})
3: for t = T-1:0 :  β_t = P^T (β_{t+1} · ϱ_{t+1})
4: for t = 0:T :  q_t = α_t · ϱ_t · β_t
5: for t = 0:T-1 :  Q_t = P · [(β_{t+1} · ϱ_{t+1}) (α_t · ϱ_t)^T]

where · is the element-wise product! Here, q_t is the vector with entries q(x_t), and Q_t the matrix with entries q(x_{t+1}, x_t). Note that the equation for Q_t describes Q_t(x', x) = P(x'|x) [(β_{t+1}(x') ϱ_{t+1}(x')) (α_t(x) ϱ_t(x))].
8:79
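The matrix equations for the posterior marginals translate directly into numpy. A minimal sketch (conventions as in the notes: P[k', k] = P(x_t=k' | x_{t-1}=k), rho[t] the observation message at time t; names illustrative):

```python
import numpy as np

def forward_backward(P, pi, rho):
    """HMM posterior marginals q(x_t) via the matrix equations above.
    P:   K x K transition matrix, P[k', k] = P(x_t = k' | x_{t-1} = k)
    pi:  K-vector, initial state distribution
    rho: (T+1) x K observation messages, rho[t, k] = P(y_t | x_t = k)
    """
    T1, K = rho.shape                 # T1 = T + 1 time slices
    alpha = np.zeros((T1, K))
    beta = np.ones((T1, K))           # beta_T = 1
    alpha[0] = pi
    for t in range(1, T1):            # forward messages
        alpha[t] = P @ (alpha[t - 1] * rho[t - 1])
    for t in range(T1 - 2, -1, -1):   # backward messages
        beta[t] = P.T @ (beta[t + 1] * rho[t + 1])
    q = alpha * rho * beta            # unnormalized posterior marginals
    return q / q.sum(axis=1, keepdims=True)
```

As a sanity check, with a uniform transition matrix and uniform initial distribution the posterior at each step reduces to the normalized evidence.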
(figures: the example time series shown with (1) the evidences, (2) the filtering posterior and the backward messages, and (3) the smoothed posterior, each compared against the true state)
8:80
Inference in HMMs: classical derivation

Given our knowledge of belief propagation, inference in HMMs is simple. For reference, here is a more classical derivation:

P(x_t | y_{0:T}) = P(y_{0:T} | x_t) P(x_t) / P(y_{0:T})
               = P(y_{0:t} | x_t) P(y_{t+1:T} | x_t) P(x_t) / P(y_{0:T})
               = P(y_{0:t}, x_t) P(y_{t+1:T} | x_t) / P(y_{0:T})
               = α_t(x_t) β_t(x_t) / P(y_{0:T})

α_t(x_t) := P(y_{0:t}, x_t) = P(y_t | x_t) P(y_{0:t-1}, x_t)
         = P(y_t | x_t) ∑_{x_{t-1}} P(x_t | x_{t-1}) α_{t-1}(x_{t-1})

β_t(x_t) := P(y_{t+1:T} | x_t) = ∑_{x_{t+1}} P(y_{t+1:T} | x_{t+1}) P(x_{t+1} | x_t)
         = ∑_{x_{t+1}} [β_{t+1}(x_{t+1}) P(y_{t+1} | x_{t+1})] P(x_{t+1} | x_t)

Note: α_t here is the same as α_t · ϱ_t on all other slides!

8:81
Different inference problems in HMMs

• P(x_t | y_{0:T})  marginal posterior
• P(x_t | y_{0:t})  filtering
• P(x_t | y_{0:a}), t > a  prediction
• P(x_t | y_{0:b}), t < b  smoothing
• P(y_{0:T})  likelihood calculation

• Viterbi alignment: Find the sequence x*_{0:T} that maximizes P(x_{0:T} | y_{0:T})
(This is done using max-product, instead of sum-product message passing.)

8:82
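Viterbi alignment replaces the sums in the forward recursion by maximizations and backtracks the argmax choices. A minimal log-space sketch (same matrix conventions as in the implementation notes above; names illustrative):

```python
import numpy as np

def viterbi(P, pi, rho):
    """Most likely state sequence argmax P(x_{0:T} | y_{0:T}) via max-product.
    P[k', k] = P(x_t = k' | x_{t-1} = k); log-space for numerical stability."""
    T1, K = rho.shape
    logP, logrho = np.log(P), np.log(rho)
    m = np.log(pi) + logrho[0]            # max-message at t = 0
    back = np.zeros((T1, K), dtype=int)   # backtracking pointers
    for t in range(1, T1):
        scores = logP + m[None, :]        # scores[k', k]: come from k, go to k'
        back[t] = np.argmax(scores, axis=1)
        m = np.max(scores, axis=1) + logrho[t]
    x = np.zeros(T1, dtype=int)
    x[-1] = int(np.argmax(m))
    for t in range(T1 - 1, 0, -1):        # backtrack the argmax choices
        x[t - 1] = back[t, x[t]]
    return x
```

With sticky transitions and clear evidence the aligned sequence follows the evidence while respecting the transition cost.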
HMM remarks

• The computation of forward and backward messages along the Markov chain is also called the forward-backward algorithm
• The EM algorithm to learn the HMM parameters is also called the Baum-Welch algorithm
• If the latent variable x_t is continuous, x_t ∈ R^d, instead of discrete, then such a Markov model is also called a state space model
• If the continuous transitions are linear Gaussian

P(x_{t+1} | x_t) = N(x_{t+1} | A x_t + a, Q)

then the forward and backward messages α_t and β_t are also Gaussian
→ forward filtering is then also called Kalman filtering
→ smoothing is also called Kalman smoothing
• Sometimes, computing forward and backward messages (in the discrete or continuous context) is also called Bayesian filtering/smoothing
8:83
HMM example: Learning Bach

• A machine "listens" to (reads notes of) Bach pieces over and over again
→ It's supposed to learn how to write Bach pieces itself (or at least harmonize them).

• Harmonising Chorales in the Style of J. S. Bach, Moray Allan & Chris Williams (NIPS 2004)

• use an HMM
– observed sequence Y_{0:T}: soprano melody
– latent sequence X_{0:T}: chord & harmony

8:84

HMM example: Learning Bach

• results: http://www.anc.inf.ed.ac.uk/demos/hmmbach/

• See also work by Gerhard Widmer: http://www.cp.jku.at/people/widmer/

8:85
Free Energy formulation of EM

• We introduced EM rather heuristically: an iteration to resolve the chicken-and-egg problem.
Are there convergence proofs?
Are there generalizations?
Are there more theoretical insights?

→ Free Energy formulation of Expectation Maximization

8:86
Free Energy formulation of EM

• Define a function (called "Free Energy")

F(q, θ) = log P(Y; θ) − D(q(X) || P(X|Y; θ))

where

log P(Y; θ) = log ∑_X P(X, Y; θ)   (log-likelihood of observed data Y)

D(q(X) || P(X|Y; θ)) := ∑_X q(X) log [q(X) / P(X|Y; θ)]   (Kullback-Leibler divergence)

• The Kullback-Leibler divergence is sort-of a distance measure between two distributions: D(q || p) is always non-negative, and zero only if q = p.
8:87
Free Energy formulation of EM

• We can write the free energy in two ways:

F(q, θ) = log P(Y; θ) − D(q(X) || P(X|Y; θ))    (18)
(first term: data log-likelihood; the KL term is minimized in the E-step: q ← argmin)

       = log P(Y; θ) − ∑_X q(X) log [q(X) / P(X|Y; θ)]
       = ∑_X q(X) log P(Y; θ) + ∑_X q(X) log P(X|Y; θ) + H(q)
       = ∑_X q(X) log P(X, Y; θ) + H(q)    (19)
(first term =: Q(θ, q), maximized in the M-step: θ ← argmax)

• We actually want to maximize P(Y; θ) w.r.t. θ → but can't analytically
• Instead, we maximize the lower bound F(q, θ) ≤ log P(Y; θ):
– E-step: find q that maximizes F(q, θ) for fixed θ^old using (18)
(→ q(X) approximates P(X|Y; θ), makes the lower bound tight)
– M-step: find θ that maximizes F(q, θ) for fixed q using (19)
(→ θ maximizes Q(θ, q) = ∑_X q(X) log P(X, Y; θ))

• Convergence proof: F(q, θ) increases in each step.

8:88
• The free energy formulation of EM
– clarifies convergence of EM (to local optima)
– clarifies that q need only approximate the posterior

• Why is it called free energy?
– Given a physical system in a specific state (x, y), it has energy E(x, y).
This implies a joint distribution p(x, y) = (1/Z) e^{−E(x,y)} with partition function Z.
– If the DoFs y are "constrained" and x "unconstrained", then the expected energy is ⟨E⟩ = ∑_x q(x) E(x, y).
Note that ⟨E⟩ = −Q(θ, q) is the same as the negative expected data log-likelihood.
– The unconstrained DoFs x have an entropy H(q).
– Physicists call A = ⟨E⟩ − H the (Helmholtz) free energy, which is the negative of our eq. (19): F(q, θ) = Q(θ, q) + H(q).
8:89
Conditional vs. generative models
(very briefly)

8:90
Maximum conditional likelihood

• We have two random variables X and Y, and full data D = (x_i, y_i)_{i=1}^n. Given a model P(X, Y; θ) of the joint distribution, we can train it by

θ^ML = argmax_θ ∏_i P(x_i, y_i; θ)

• Assume the actual goal is to have a classifier x ↦ y, or, the conditional distribution P(Y|X).

Q: Is it really necessary to first learn the full joint model P(X, Y) when we only need P(Y|X)?

Q: If P(X) is very complicated but P(Y|X) easy, isn't learning the joint P(X, Y) unnecessarily hard?
8:91
Maximum conditional likelihood

Instead of likelihood maximization θ^ML = argmax_θ ∏_i P(x_i, y_i; θ):

• Maximum conditional likelihood:
Given a conditional model P(Y|X; θ), train

θ* = argmax_θ ∏_i P(y_i | x_i; θ)

• The little but essential difference: We don't care to learn P(X)!

8:92
Example: Conditional Random Field

f(y, x) = φ(y, x)^T β = ∑_{j=1}^k φ_j(y_{∂j}, x) β_j

p(y|x) = e^{f(y,x) − Z(x,β)} ∝ ∏_{j=1}^k e^{φ_j(y_{∂j}, x) β_j}

• Logistic regression also maximizes the conditional log-likelihood
• Plain regression also maximizes the conditional log-likelihood! (with −log P(y_i | x_i; β) ∝ (y_i − φ(x_i)^T β)²)
8:93
7 Exercises
7.1 Exercise 1
There will be no credit points for the first exercise – we'll do them on the fly (Präsenzübung). For those of you that had lectures with me before, this is redundant – you're free to skip the tutorial.
7.1.1 Reading: Pedro Domingos
Read at least until section 5 of Pedro Domingos's A Few Useful Things to Know about Machine Learning: http://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf. Be able to explain roughly what generalization and the bias-variance tradeoff (Fig. 1) are.
7.1.2 Matrix equations
a) Let X,A be arbitrary matrices, A invertible. Solve for X :
X A + A^T = I
b) Let X,A,B be arbitrary matrices, (C − 2A>) invertible. Solve for X :
X^T C = [2A(X + B)]^T
c) Let x ∈ R^n, y ∈ R^d, A ∈ R^{d×n}. A is obviously not invertible, but let A^T A be invertible. Solve for x:

(Ax − y)^T A = 0_n^T

d) As above, additionally B ∈ R^{n×n}, B positive definite. Solve for x:

(Ax − y)^T A + x^T B = 0_n^T
7.1.3 Vector derivatives
Let x ∈ R^n, y ∈ R^d, A ∈ R^{d×n}.

a) What is ∂/∂x x ? (Of what type/dimension is this thing?)

b) What is ∂/∂x [x^T x] ?

c) Let B be symmetric (and pos. def.). What is the minimum of (Ax − y)^T (Ax − y) + x^T B x w.r.t. x?
7.1.4 Coding
Future exercises will need you to code some Machine Learning methods. You are free to choose your programming language. If you're new to numerics we recommend Matlab/Octave or Python (SciPy & scikit-learn). I'll support C++, but recommend it really only to those familiar with C++.

To get started, try to just plot the data set http://ipvs.informatik.uni-stuttgart.de/mlr/marc/teaching/data/dataQuadReg2D.txt, e.g. in Octave:

D = importdata('dataQuadReg2D.txt');
plot3(D(:,1), D(:,2), D(:,3), 'ro')

Or in Python:

import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

D = np.loadtxt('dataQuadReg2D.txt')

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.plot(D[:,0], D[:,1], D[:,2], 'ro')
plt.show()

Or you can store the grid data in a file and use gnuplot, e.g.:

splot 'dataQuadReg2D.txt' with points

For those using C++, download and test http://ipvs.informatik.uni-stuttgart.de/mlr/marc/source-code/15-MLcourse.tgz. In particular, have a look at examples/Core/array/main.cpp with many examples on how to use the array class. Report on problems with installation.
7.2 Exercise 2
7.2.1 Getting Started with Ridge Regression
In the appendix you find starting-point implementations of basic linear regression for Python, C++, and Matlab. These also include the plotting of the data and model. Have a look at them, choose a language, and understand the code in detail.

On the course webpage there are two simple data sets dataLinReg2D.txt and dataQuadReg2D.txt. Each line contains a data entry (x, y) with x ∈ R² and y ∈ R; the last entry in a line refers to y.

a) The examples demonstrate plain linear regression for dataLinReg2D.txt. Extend them to include a regularization parameter λ. Report the squared error on the full data set when trained on the full data set.
b) Do the same for dataQuadReg2D.txt while first computing quadratic features.

c) Implement cross-validation (slide 02:18) to evaluate the prediction error of the quadratic model for a third, noisy data set dataQuadReg2D_noisy.txt. Report 1) the squared error when training on all data (= training error), and 2) the mean squared error from cross-validation.

Repeat this for different Ridge regularization parameters λ. (Ideally, generate a nice bar plot of the generalization error, including deviation, for various λ.)
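Parts a) and c) can be sketched as follows (a hedged sketch, not the reference solution; for simplicity the bias column is regularized along with the other parameters, which one would normally avoid):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """beta = (X^T X + lam I)^{-1} X^T y  (bias column regularized too, for brevity)."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def cv_error(X, y, lam, k=10, seed=0):
    """Mean squared test error estimated by k-fold cross validation."""
    idx = np.random.default_rng(seed).permutation(len(y))
    errs = []
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)          # all indices not in this fold
        beta = ridge_fit(X[train], y[train], lam)
        errs.append(np.mean((X[fold] @ beta - y[fold]) ** 2))
    return float(np.mean(errs))
```

Sweeping lam over, say, np.logspace(-4, 2, 7) and plotting cv_error against the training error gives the requested comparison.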
Python (by Stefan Otte)
#!/usr/bin/env python
# encoding: utf-8
"""
This is a mini demo of how to use numpy arrays and plot data.
NOTE: the operators + - * / are element wise operation. If you want
matrix multiplication use ``dot`` or ``mdot``!
"""
import numpy as np
from numpy import dot
from numpy.linalg import inv

import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d.axes3d import Axes3D  # 3D plotting

###############################################################################
# Helper functions
def mdot(*args):
    """Multi argument dot function. http://wiki.scipy.org/Cookbook/MultiDot"""
    return reduce(np.dot, args)

def prepend_one(X):
    """prepend a one vector to X."""
    return np.column_stack([np.ones(X.shape[0]), X])

def grid2d(start, end, num=50):
    """Create an 2D array where each row is a 2D coordinate.
    np.meshgrid is pretty annoying!
    """
    dom = np.linspace(start, end, num)
    X0, X1 = np.meshgrid(dom, dom)
    return np.column_stack([X0.flatten(), X1.flatten()])

###############################################################################
# load the data
data = np.loadtxt("dataLinReg2D.txt")
print "data.shape:", data.shape
np.savetxt("tmp.txt", data)  # save data if you want to

# split into features and labels
X, y = data[:, :2], data[:, 2]
print "X.shape:", X.shape
print "y.shape:", y.shape

# 3D plotting
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')  # the projection arg is important!
ax.scatter(X[:, 0], X[:, 1], y, color="red")
ax.set_title("raw data")
plt.draw()  # show, use plt.show() for blocking

# prep for linear reg.
X = prepend_one(X)
print "X.shape:", X.shape

# Fit model/compute optimal parameters beta
beta_ = mdot(inv(dot(X.T, X)), X.T, y)
print "Optimal beta:", beta_

# prep for prediction
X_grid = prepend_one(grid2d(-3, 3, num=30))
print "X_grid.shape:", X_grid.shape

# Predict with trained model
y_grid = mdot(X_grid, beta_)
print "Y_grid.shape", y_grid.shape

# vis the result
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')  # the projection part is important
ax.scatter(X_grid[:, 1], X_grid[:, 2], y_grid)  # don't use the 1 infront
ax.scatter(X[:, 1], X[:, 2], y, color="red")  # also show the real data
ax.set_title("predicted data")
plt.show()
C++
(by Marc Toussaint)
//g++ -I../../../src -L../../../lib -fPIC -std=c++0x main.cpp -lCore

#include <Core/array.h>

//===========================================================================

void gettingStarted() {
  //load the data
  arr D = FILE("../01-linearModels/dataLinReg2D.txt");

  //plot it
  FILE("z.1") <<D;
  gnuplot("splot 'z.1' us 1:2:3 w p", true);

  //decompose in input and output
  uint n = D.d0; //number of data points
  arr Y = D.sub(0,-1,-1,-1).reshape(n); //pick last column
  arr X = catCol(ones(n,1), D.sub(0,-1,0,-2)); //prepend 1s to inputs
  cout <<"X dim = " <<X.dim() <<endl;
  cout <<"Y dim = " <<Y.dim() <<endl;

  //compute optimal beta
  arr beta = inverse(~X*X)*~X*Y;
  cout <<"optimal beta=" <<beta <<endl;

  //display the function
  arr X_grid = grid(2, -3, 3, 30);
  X_grid = catCol(ones(X_grid.d0,1), X_grid);
  cout <<"X_grid dim = " <<X_grid.dim() <<endl;

  arr Y_grid = X_grid * beta;
  cout <<"Y_grid dim = " <<Y_grid.dim() <<endl;
  FILE("z.2") <<Y_grid.reshape(31,31);
  gnuplot("splot 'z.1' us 1:2:3 w p, 'z.2' matrix us ($1/5-3):($2/5-3):3 w l", true);
  cout <<"CLICK ON THE PLOT!" <<endl;
}

//===========================================================================

int main(int argc, char *argv[]) {
  MT::initCmdLine(argc,argv);
  gettingStarted();
  return 0;
}
Matlab
(by Peter Englert)
clear;

% load the data
load('dataLinReg2D.txt');

% plot it
figure(1); clf; hold on;
plot3(dataLinReg2D(:,1), dataLinReg2D(:,2), dataLinReg2D(:,3), 'r.');

% decompose in input X and output Y
n = size(dataLinReg2D,1);
X = dataLinReg2D(:,1:2);
Y = dataLinReg2D(:,3);

% prepend 1s to inputs
X = [ones(n,1), X];

% compute optimal beta
beta = inv(X'*X)*X'*Y;

% display the function
[a b] = meshgrid(-2:.1:2, -2:.1:2);
Xgrid = [ones(length(a(:)),1), a(:), b(:)];
Ygrid = Xgrid*beta;
Ygrid = reshape(Ygrid, size(a));
h = surface(a, b, Ygrid);
view(3);
grid on;
7.3 Exercise 3
7.3.1 Max-likelihood estimator
Consider data D = (y_i)_{i=1}^n that consists only of labels y_i ∈ {1, .., M}. You want to find a probability distribution p ∈ R^M with ∑_{k=1}^M p_k = 1 that maximizes the data likelihood. The solution is really simple and intuitive: the optimal probabilities are

p_k = (1/n) ∑_{i=1}^n [y_i = k]

where [expr] equals 1 if expr = true and 0 otherwise. So the sum just counts the occurrences of y_i = k, which is then normalized. Derive this from first principles:
a) Understand (= be able to explain every step) that under the i.i.d. assumption

P(D | p) = ∏_{i=1}^n P(y_i | p) = ∏_{i=1}^n p_{y_i}

and

log P(D | p) = ∑_{i=1}^n log p_{y_i} = ∑_{i=1}^n ∑_{k=1}^M [y_i = k] log p_k = ∑_{k=1}^M n_k log p_k ,  n_k = ∑_{i=1}^n [y_i = k]

b) Provide the derivative of

log P(D | p) + λ (1 − ∑_{k=1}^M p_k)

w.r.t. p and λ.

c) Set both derivatives equal to zero to derive the optimal parameter p*.
7.3.2 Log-likelihood gradient and Hessian
Consider a binary classification problem with data D = (x_i, y_i)_{i=1}^n, x_i ∈ R^d and y_i ∈ {0, 1}. We define

f(x) = x^T β    (20)

p(x) = σ(f(x)) ,  σ(z) = 1/(1 + e^{−z})    (21)

L(β) = −∑_{i=1}^n [y_i log p(x_i) + (1 − y_i) log(1 − p(x_i))]    (22)

where β ∈ R^d is a vector. (NOTE: the p(x) we defined here is a short-hand for p(y=1|x) on slide 03:10.)

a) Compute the derivative ∂/∂β L(β). Tip: use the fact that ∂/∂z σ(z) = σ(z)(1 − σ(z)).

b) Compute the 2nd derivative ∂²/∂β² L(β).
7.4 Exercise 4
7.4.1 Discriminative Function in Logistic Regression
Logistic regression (slide 03:10) defines class probabilities as proportional to the exponential of a discriminative function:

P(y | x) = exp f(x, y) / ∑_{y'} exp f(x, y')

Prove that, in the binary classification case, you can assume f(x, 0) = 0 without loss of generality. This results in

P(y=1 | x) = exp f(x, 1) / (1 + exp f(x, 1)) = σ(f(x, 1)) .

(Hint: first assume f(x, y) = φ(x, y)^T β, and then define a new discriminative function f' as a function of the old one, such that f'(x, 0) = 0 and for which P(y|x) maintains the same expressibility.)
7.4.2 Logistic Regression
On the course webpage there is a data set data2Class.txt for a binary classification problem. Each line contains a data entry (x, y) with x ∈ R² and y ∈ {0, 1}.

a) Compute the optimal parameters β (perhaps also the mean neg-log-likelihood, −(1/n) log L(β)) of logistic regression using linear features. Plot the probability P(y=1 | x) over a 2D grid of test points. Tips:

• Recall the objective function, and its gradient and Hessian that we derived in the last exercise:

L(β) = −∑_{i=1}^n log P(y_i | x_i) + λ||β||²    (23)
     = −∑_{i=1}^n [y_i log p_i + (1 − y_i) log(1 − p_i)] + λ||β||²    (24)

∇L(β) = (∂L(β)/∂β)^T = ∑_{i=1}^n (p_i − y_i) φ(x_i) + 2λIβ = X^T(p − y) + 2λIβ    (25)

∇²L(β) = ∂²L(β)/∂β² = ∑_{i=1}^n p_i(1 − p_i) φ(x_i) φ(x_i)^T + 2λI = X^T W X + 2λI    (26)

where p(x) := P(y=1 | x) = σ(φ(x)^T β) ,  p_i := p(x_i) ,  W := diag(p ◦ (1 − p))    (27)

• Setting the gradient equal to zero can't be done analytically. Instead, optimal parameters can quickly be found by iterating Newton steps: For this, initialize β = 0 and iterate

β ← β − (∇²L(β))^{-1} ∇L(β) .    (28)

You usually need to iterate only a few times (∼10) till convergence.

• As you did for regression, plot the discriminative function f(x) = φ(x)^T β or the class probability function p(x) = σ(f(x)) over a grid.

Useful gnuplot commands:

splot [-2:3][-2:3][-3:3.5] 'model' matrix \
  us ($1/20-2):($2/20-2):3 with lines notitle

plot [-2:3][-2:3] 'data2Class.txt' \
  us 1:2:3 with points pt 2 lc variable title 'train'

b) Compute and plot the same for quadratic features.
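Putting equations (25), (26) and (28) together, the Newton loop can be sketched as follows (a minimal illustration on precomputed features X; the values of λ and the iteration count are arbitrary choices, and plotting is omitted):

```python
import numpy as np

def logistic_newton(X, y, lam=1e-2, iters=20):
    """Newton steps for regularized logistic regression, eqs. (25)-(28).
    X: n x d matrix of features phi(x_i) (rows), y: n labels in {0, 1}."""
    n, d = X.shape
    beta = np.zeros(d)                                # init beta = 0
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ beta))           # p_i = sigma(phi(x_i)^T beta)
        grad = X.T @ (p - y) + 2.0 * lam * beta       # eq. (25)
        W = p * (1.0 - p)                             # diagonal of W, eq. (27)
        hess = (X.T * W) @ X + 2.0 * lam * np.eye(d)  # eq. (26)
        beta = beta - np.linalg.solve(hess, grad)     # Newton step, eq. (28)
    return beta
```

On nearly separable 1D data with a bias feature this converges in a handful of iterations.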
7.5 Exercise 5
7.5.1 Special cases of CRFs
Slide 03:27 summarizes the core equations for CRFs.

a) Confirm the given equations for ∇_β Z(x, β) and ∇²_β Z(x, β) (i.e., derive them from the definition of Z(x, β)).

b) Binary logistic regression is clearly a special case of CRFs. Sanity check that the gradient and Hessian given on slide 03:10 can alternatively be derived from the general expressions for ∇_β L(β) and ∇²_β L(β) on slide 03:27. (The same is true for the multi-class case.)

c) Prove that ordinary ridge regression is a special case of CRFs if you choose the discriminative function f(x, y) = −y² + 2yφ(x)^T β.
7.5.2 Structured Output: mixed classification/regression
Note: The following two exercises are meant to train you in modelling structured output. The generic application case would be time series tagging or image segmentation. However, for both of these applications defining features and computing the argmax_y f(x, y) is non-trivial. So, for the sake of simplicity of the exercises I describe some made-up 'toy' problems.

Consider a data set of tuples (x, y1, y2) where

• x ∈ R is the age of a machine in some factory

• y1 ∈ {0, 1} is a binary indicator whether some piece of the machine got loose or broke

• y2 ∈ R is the performance of the machine (perhaps the rate of correct production/assembly or so).

The figure displays the data (x, y2), which looks like two curves to be fit. These two curves correspond to y1 ∈ {0, 1}.

a) Properly define a representation and objective for modelling the prediction x ↦ (y1, y2).

b) Implement the method. What is your prediction for x = 80, that is, what is the probability P(y1 | x=80) of a broken piece and, conditional on that, your prediction of performance y2 conditional on (y1, x=80), i.e., for both cases y1 ∈ {0, 1}?
7.5.3 Structured Output: multi-output regression
Consider data of tuples (x, y1, y2) where

• x is the age of a Spotify user

• y1 ∈ R quantifies how much the user likes HipHop

• y2 ∈ R quantifies how much the user likes Classic

Naively one could train separate regressions x ↦ y1 and x ↦ y2. However, it seems reasonable that somebody who likes HipHop might like Classic a bit less than average (anti-correlated).

a) Properly define a representation and objective for modelling the prediction x ↦ (y1, y2).
7.5.4 Labelling a time series*
Assume we are in a clinical setting. Several sensors measure different things of a patient, like heart beat rate, blood pressure, EEG signals. These measurements form a vector x_t ∈ R^n for every time step t = 1, .., T.

A doctor wants to annotate such time series with a binary label that indicates whether the patient was asleep. Your job is to develop an ML algorithm to automatically annotate such data. For this assume that the doctor has already annotated several time series of different patients: you are given the training data set

D = (x^i_{1:T}, y^i_{1:T})_{i=1}^n ,  x^i_t ∈ R^n, y^i_t ∈ {0, 1}

where the superscript i denotes the data instance, and the subscript t the time step (e.g., each step could correspond to 10 seconds).

Develop an ML algorithm to do this job. First specify the model formally in all details (representation & objective). Then detail the algorithm to compute optimal parameters in pseudo code, being as precise as possible in all respects. Finally, don't forget to sketch an algorithm to actually do the annotation given an input x^i_{1:T} and optimized parameters.

(No need to actually implement.)
7.6 Exercise 6
7.6.1 Kernel Ridge regression
In exercise 2 we implemented ridge regression. Modify the code to implement kernel ridge regression. Try to program it in a way that you only need to change one line to have a different kernel. Note that this computes optimal "parameters" α = (K + λI)^{-1} y such that f(x) = κ(x)^T α.

a) Using a linear kernel, does this reproduce the linear regression we looked at in exercise 2? Test this on the data. If not, how can you make it equivalent?

b) Is using the squared exponential kernel k(x, x') = exp(−γ |x − x'|²) exactly equivalent to using the radial basis function features we introduced?
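The core computation α = (K + λI)^{-1} y and prediction f(x) = κ(x)^T α can be sketched as follows (a minimal illustration, not the reference solution; the kernel is passed as an argument so that swapping it is a one-line change, as the exercise asks):

```python
import numpy as np

def sqexp_kernel(X, Z, gamma=1.0):
    """k(x, x') = exp(-gamma |x - x'|^2), computed for all pairs of rows."""
    d2 = np.sum(X**2, axis=1)[:, None] + np.sum(Z**2, axis=1)[None, :] - 2.0 * X @ Z.T
    return np.exp(-gamma * d2)

def kernel_ridge(X, y, lam, kernel=sqexp_kernel):
    """alpha = (K + lam I)^{-1} y; returns f with f(x) = kappa(x)^T alpha."""
    alpha = np.linalg.solve(kernel(X, X) + lam * np.eye(len(y)), y)
    return lambda Xq: kernel(Xq, X) @ alpha
```

With a small λ the fit interpolates smooth training data almost exactly.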
7.6.2 Positive Definite Kernels (optional/bonus)
For a non-empty set X, a kernel is a symmetric function k : X × X → R. Note that the set X can be arbitrarily structured (real vector space, graphs, images, strings and so on). A very important class of useful kernels for machine learning are positive definite kernels. A kernel is called positive definite if, for all arbitrary finite subsets {x_i}_{i=1}^n ⊆ X, the corresponding kernel matrix K with elements K_ij = k(x_i, x_j) is positive semi-definite, i.e.,

∀ α ∈ R^n :  α^T K α ≥ 0 .    (29)

For features φ : X → H (with H a suitable space), prove that

k(x_i, x_j) = ⟨φ(x_i), φ(x_j)⟩_H    (30)

is a positive definite kernel.
7.6.3 Kernel Construction (optional/bonus)
Often, one wants to construct more complicated kernels out of existing ones. Letk1, k2 : X ×X → R be two positive definite kernels. Proof that
1. k(x, x′) = k1(x, x′) + k2(x, x′)
2. k(x, x′) = c · k1(x, x′) for c ≥ 0
3. k(x, x′) = k1(x, x′) · k2(x, x′)
4. k(x, x′) = k1(f(x), f(x′)) for f : X → X
are positive definite kernels.
7.6.4 Kernel logistic regression (no implementation to do)
The "kernel trick" is generally applicable whenever the "solution" (which may be the predictive function f_ridge(x), or the discriminative function, or principal components...) can be written in a form that only uses the kernel function k(x, x'), but never features φ(x) or parameters β explicitly.

Derive a kernelization of logistic regression. That is, think about how you could perform the Newton iterations based only on the kernel function k(x, x').

Tips: Reformulate the Newton iterations

β ← β − (X^T W X + 2λI)^{-1} [X^T(p − y) + 2λIβ]    (31)

using the two Woodbury identities

(X^T W X + A)^{-1} X^T W = A^{-1} X^T (X A^{-1} X^T + W^{-1})^{-1}    (32)

(X^T W X + A)^{-1} = A^{-1} − A^{-1} X^T (X A^{-1} X^T + W^{-1})^{-1} X A^{-1}    (33)

Note that you'll need to handle the X^T(p − y) and 2λIβ terms differently.

Then think about what is actually being iterated in the kernelized case: surely we cannot iteratively update the optimal parameters, because we want to rewrite the equations to never touch β or φ(x) explicitly.
7.7 Exercise 7
7.7.1 Intro to Neural Networks
Read sections 11.1-11.4 of Hastie's book http://www-stat.stanford.edu/~tibs/ElemStatLearn/. As a success criterion for learning, check if you can explain the following to others:

Let h_0 denote the input dimensionality, and h_1,..,h_L the number of neurons in the L hidden layers. For all layers, let z_l = W_{l-1} x_{l-1} denote the inputs to the neurons, and x_l = σ(z_l) the activation of the neurons, z_l, x_l ∈ R^{h_l}. (The typical activation function σ is the logistic sigmoid σ(z) = e^z/(1+e^z) = 1/(e^{−z}+1) with σ'(z) = σ(z)(1−σ(z)); alternatives would be σ(z) = tanh(z) or (rare) σ(z) = z/(1+|z|) ∈ [−1, 1], or newer (hinge-loss-like) versions.) Let x_0 ≡ x denote the data input, or activation of the input layer. Then an L-layer NN first computes (forward propagation)

∀ l = 1,..,L :  z_l = W_{l-1} x_{l-1} ,  x_l = σ(z_l) ,  f ≡ z_{L+1} = W_L x_L ,    (34)

where w = (W_0, .., W_L), with W_l ∈ R^{h_{l+1} × h_l}, are the parameter matrices, and f ∈ R^{h_{L+1}} the function output.

Assume a loss function ℓ(f, y). For a specific y, denote by

δ_{L+1} := ∂ℓ/∂f    (35)

the partial derivative (a row vector!) of the loss w.r.t. the output function values. E.g., for squared error ℓ(f, y) = (f − y)², δ_{L+1} = 2(f − y)^T. We have (backward propagation)

∀ l = L,..,1 :  δ_l := dℓ/dz_l = (dℓ/dz_{l+1}) (∂z_{l+1}/∂x_l) (∂x_l/∂z_l) = [δ_{l+1} W_l] ◦ [x_l ◦ (1 − x_l)]^T .    (36)

Note that ◦ is an element-wise product (that arises from the element-wise application of the activation function). Further, the brackets indicate which order of computation is efficient. Given these, the gradient w.r.t. the weights is

dℓ/dW_{l,ij} = (dℓ/dz_{l+1,i}) (∂z_{l+1,i}/∂W_{l,ij})   or   dℓ/dW_l = δ^T_{l+1} x_l^T .    (37)
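Equations (34), (36) and (37) can be sketched in numpy for a single input vector (an illustration of the equations, not a solution to the programming exercise below; Ws = [W_0, .., W_L] are the parameter matrices in the conventions above):

```python
import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, Ws):
    """Forward propagation, eq. (34): returns the output f and activations x_0..x_L."""
    xs = [x]
    for W in Ws[:-1]:
        xs.append(sigma(W @ xs[-1]))
    f = Ws[-1] @ xs[-1]                      # f = z_{L+1} = W_L x_L (linear output)
    return f, xs

def backward(delta_out, xs, Ws):
    """Backward propagation, eqs. (36)-(37): returns the gradients dl/dW_l."""
    grads = [None] * len(Ws)
    delta = delta_out                        # row vector dl/df
    grads[-1] = np.outer(delta, xs[-1])
    for l in range(len(Ws) - 2, -1, -1):
        delta = (delta @ Ws[l + 1]) * (xs[l + 1] * (1.0 - xs[l + 1]))
        grads[l] = np.outer(delta, xs[l])
    return grads
```

A finite-difference check on a single weight is a quick way to validate such an implementation.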
7.7.2 Weight Initialization in Neural Networks

In Glorot, Bengio: Understanding the difficulty of training deep feedforward neural networks (AISTATS'10) the issue of weight initialization in neural networks is discussed, which is super important. Often weights are initialized uniformly from some interval [−a, a]. The question is how to choose a.

Consider a single layer of a neural network

y = σ(Wx)

where x ∈ R^n is an n-dimensional input, W ∈ R^{h×n} is an h × n matrix of weights, σ(z) = 1/(e^{−z}+1) is the sigmoid function, and y ∈ R^h is the activation vector of the h neurons of the layer. Now assume that E{x} = 0 and Var{x} = 1, and that each weight W_ij ∼ U(−a, a) is uniformly sampled from [−a, a].

What we want is that Var{y} = 1, that is, the layer's output y should have the same variance as the layer's input x. (Why, we'll discuss in the tutorial.)

How do you have to choose a so that Var{y} = 1? For simplicity you may assume that σ(z) ≈ z, that is, that there is no sigmoid (or that, for small weights, the sigmoid is approximately linear).
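Since Var{U(−a, a)} = a²/3 and, in the linear regime, Var{y_i} = ∑_j Var{W_ij} Var{x_j} = n a²/3, a natural candidate is a = √(3/n); this can be checked empirically (a quick sanity check under the exercise's assumptions, with made-up sizes):

```python
import numpy as np

# Empirical check: with a = sqrt(3/n), a linear layer y = W x preserves unit variance.
n, h, samples = 200, 100, 2000
a = np.sqrt(3.0 / n)
rng = np.random.default_rng(0)
W = rng.uniform(-a, a, size=(h, n))
x = rng.normal(0.0, 1.0, size=(n, samples))  # E{x} = 0, Var{x} = 1
y = W @ x                                    # linear approximation sigma(z) ~ z
print(np.var(y))                             # should be close to 1
```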
7.7.3 Programming Backprop
Consider the classification data set “data2Class adjusted.txt”, which is a non-linear classification problem. It has a three-dimensional input, where the first is a constant bias input.
Train a NN for this classification problem. As this is a binary classification problem we only need one output neuron y. If y > 0 we classify +1, otherwise we classify −1.
• Code a routine “forward(x, w)” that computes f(x, w), the forward propagation of the network, for a NN with a single hidden layer of size h1 = 100. Note that we have 100 × 3 and 100 × 1 parameters respectively (remember to initialize them in a smart manner, as discussed in the previous exercise).

• Code a routine “backward(δ_{L+1}, x, w)”, that performs the backpropagation steps and collects the gradients dℓ/dW_l.
• Let us use a hinge-loss function ℓ(f, y) = max(0, 1 − f y), which means that δ_{L+1} = −y [1 − y f > 0].
• Run forward and backward propagation for each x, y in the dataset, and sum up the respective gradients.

• Code a routine which optimizes the parameters using gradient descent: W_l = W_l − α dℓ/dW_l with step size α = .05. Run until convergence (should take a few hundred steps).

• Print out the loss function ℓ at each iteration, to verify that the parameter optimization is indeed decreasing the loss.
a) Visualize the prediction by plotting σ(f(x, w)) over a 2-dimensional grid.
b) What happens if we initialize the weights with 0? Why?
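A hedged end-to-end sketch of the training loop, using a synthetic XOR-style dataset as a stand-in for the data file (the layer sizes, hinge loss, and δ_{L+1} follow the exercise; dividing the summed gradients by n is an added stabilization):

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# synthetic stand-in data: XOR-like labels in {-1, +1}, plus a bias input
n = 200
X = rng.uniform(-1, 1, size=(n, 2))
y = np.where(X[:, 0] * X[:, 1] > 0, 1.0, -1.0)
X = np.hstack([np.ones((n, 1)), X])               # constant bias -> 3-dim input

h1 = 100
W0 = rng.uniform(-1, 1, size=(h1, 3)) * np.sqrt(3.0 / 3)    # 100 x 3
W1 = rng.uniform(-1, 1, size=(1, h1)) * np.sqrt(3.0 / h1)   # 100 x 1 (as 1 x 100)

alpha, losses = 0.05, []
for it in range(2000):
    Z = sigmoid(X @ W0.T)                         # n x h1 hidden activations
    f = (Z @ W1.T).ravel()                        # network outputs
    losses.append(np.maximum(0, 1 - f * y).sum()) # summed hinge loss
    delta = -y * (1 - y * f > 0)                  # dl/df per data point
    dW1 = (delta @ Z)[None, :] / n                # gradients averaged over data
    d1 = (delta[:, None] * W1) * Z * (1 - Z)      # back-propagated deltas
    dW0 = d1.T @ X / n
    W0 -= alpha * dW0
    W1 -= alpha * dW1
```

Printing `losses` over iterations gives exactly the convergence check the exercise asks for.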
7.8 Exercise 8
Principal Component Analysis (PCA) Intro
Dimensionality reduction is the process of projecting data points x ∈ R^d onto a lower-dimensional space z ∈ R^p, with p < d. This procedure serves multiple purposes, in supervised learning, unsupervised learning, data visualization, etc.
A key aspect and objective of all dimensionality reduction algorithms is that they want to preserve as much variance of the original data as possible: the goal is to learn a lower-dimensional representation while retaining the maximum information about the original data.
In standard PCA, we approximate a vector x with an affine projection of its latent representation z, as follows:

    x ≈ V z + µ ,    (38)

where V ∈ R^{d×p} is an orthonormal matrix which specifies an axis rotation, and µ ∈ R^d is a vector which specifies a translation.
Given a dataset D = {x_i}_{i=1}^n, the projection parameters µ and the projections z_i are obtained by minimizing the difference between the original data and the respective approximations:

    µ, z_{1:n} = argmin_{µ, z_{1:n}} Σ_i ||V z_i + µ − x_i||²    (39)
7.8.1 PCA optimality principles
a) Find the optimal latent representation zi, as a function of V and µ.
b) Find an optimal offset µ.
(Hint: there is a whole subspace of solutions to this problem. Verify that your solution is consistent with (i.e. contains) µ = (1/n) Σ_i x_i.)
c) Find optimal projection vectors {v_i}_{i=1}^p, which make up the projection matrix

    V = [ v_1 … v_p ]    (40)
Guide:

1. Given a projection V, any vector can be split in orthogonal components which belong to the projected subspace and its complement (which we call W): x = V Vᵀ x + W Wᵀ x.

2. For simplicity, let us work with the centered data points x_i ← x_i − µ.

3. The optimal projection V is that which minimizes the discarded components W Wᵀ x_i:

    V = argmin_V Σ_{i=1}^n ||W Wᵀ x_i||² = argmin_V Σ_{i=1}^n ||x_i − V Vᵀ x_i||²    (41)

4. Don’t try to solve by computing gradients and setting them to zero. Instead, use the fact that V Vᵀ = Σ_{i=1}^p v_i v_iᵀ, and the singular value decomposition of Σ_{i=1}^n x_i x_iᵀ = Xᵀ X = E D Eᵀ.
d) In the above, is the orthonormality of V an essential assumption?
e) Prove that you can also compute V from the SVD of X (instead of Xᵀ X).
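A small numerical check of parts c) and e) on synthetic data (not the face data): the top eigenvectors of Xᵀ X coincide, up to sign, with the right singular vectors of the centered X, so both routes give the same projector V Vᵀ.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5)) @ rng.standard_normal((5, 5))
X = X - X.mean(axis=0)                      # center the data

evals, E = np.linalg.eigh(X.T @ X)          # eigh returns ascending eigenvalues
E = E[:, ::-1]                              # reorder to descending
U, s, Vt = np.linalg.svd(X, full_matrices=False)

p = 2
V_eig, V_svd = E[:, :p], Vt[:p].T
# compare the projectors V V^T, which are invariant to sign flips of the v_i
err = np.abs(V_eig @ V_eig.T - V_svd @ V_svd.T).max()
```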
7.8.2 PCA and reconstruction on the Yale face database
On the webpage find and download the Yale face database http://ipvs.informatik.uni-stuttgart.de/mlr/marc/teaching/data/yalefaces.tgz. (Optionally use yalefaces_cropBackground.tgz, which is a slightly cleaned version of the same dataset.) The file contains gif images of 165 faces.
a) Write a routine to load all images into a big data matrix X ∈ R^{165×77760}, where each row contains a gray image.

In Octave, images can easily be read using I=imread("subject01.gif"); and imagesc(I);. You can loop over files using files=dir("."); and files(:).name. Python tips:
import matplotlib.pyplot as plt
import scipy.sparse.linalg as sla

plt.imshow(plt.imread(file_name))
u, s, vt = sla.svds(X, k=neigenvalues)
b) Compute the mean face µ = (1/n) Σ_i x_i and center the whole data matrix, X ← X − 1_n µᵀ.
c) Compute the singular value decomposition X = U D Vᵀ for the centered data matrix.

In Octave/Matlab, use [U, S, V] = svd(X, "econ"), where the "econ" ensures you don’t run out of memory.
In python, use
import scipy.sparse.linalg as sla;
u, s, vt = sla.svds(X, k=num_eigenvalues)
d) Find the p-dimensional representations Z = X V_p, where V_p ∈ R^{77760×p} contains only the first p columns of V. (Depending on which language / library you use, verify that the eigenvectors are returned in eigenvalue-descending order, otherwise you’ll have to find the correct eigenvectors manually.) Assume p = 60. The rows of Z represent each face as a p-dimensional vector, instead of a 77760-dimensional image.
e) Reconstruct all faces by computing X′ = 1_n µᵀ + Z V_pᵀ and display them; do they look ok? Report the reconstruction error Σ_{i=1}^n ||x_i − x′_i||².

Repeat for various PCA-dimensions p = 5, 10, 15, 20, ...
BONUS) How is the value of p related to the variance / information which is lost through the low-dimensional projection?
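The following sketch on synthetic data (standing in for the face matrix) runs steps b)–e) and also illustrates the bonus question: the reconstruction error for p components equals the sum of the discarded squared singular values, i.e. exactly the variance the projection does not capture.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 30)) @ rng.standard_normal((30, 30))
mu = X.mean(axis=0)
Xc = X - mu                                # centered data matrix

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
p = 10
Vp = Vt[:p].T                              # first p right singular vectors
Z = Xc @ Vp                                # p-dimensional representations
Xrec = mu + Z @ Vp.T                       # reconstruction X'
err = np.sum((X - Xrec) ** 2)              # equals sum of s[p:]**2
```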
7.9 Exercise 9
7.9.1 Graph cut objective function & spectral clustering
One of the central messages of the whole course is: To solve (learning) problems, first formulate an objective function that defines the problem, then derive algorithms to find/approximate the optimal solution. That should also hold for clustering...
k-means finds centers µ_k and assignments c : i ↦ k to minimize Σ_i (x_i − µ_{c(i)})².
An alternative class of objective functions for clustering are graph cuts. Consider n data points with similarities w_ij, forming a weighted graph. We denote by W = (w_ij) the symmetric weight matrix, and D = diag(d_1, .., d_n), with d_i = Σ_j w_ij, the degree matrix. For simplicity we consider only 2-cuts, that is, cutting the graph in two disjoint clusters, C1 ∪ C2 = {1, .., n}, C1 ∩ C2 = ∅. The normalized cut objective
is

    RatioCut(C1, C2) = ( 1/|C1| + 1/|C2| ) Σ_{i∈C1, j∈C2} w_ij
a) Let

    f_i = +√(|C2|/|C1|) for i ∈ C1 ,   f_i = −√(|C1|/|C2|) for i ∈ C2

be a kind of indicator function of the clustering. Prove that

    fᵀ (D − W) f = n RatioCut(C1, C2)
b) Further prove that Σ_i f_i = 0 and Σ_i f_i² = n.
Note (to be discussed in the tutorial in more detail): Spectral clustering addresses

    min_f  fᵀ (D − W) f   s.t.   Σ_i f_i = 0 ,  ||f||² = 1

by computing eigenvectors f of the graph Laplacian D − W with smallest eigenvalues. This is a relaxation of the above problem that minimizes over continuous functions f ∈ Rⁿ instead of discrete clusters C1, C2. The resulting eigenfunctions are “approximate indicator functions of clusters”. The algorithm uses k-means clustering in this coordinate system to explicitly decide on the clustering of data points.
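The note above can be sketched in a few lines: build W and D for two well-separated groups, take the eigenvector of D − W with the second-smallest eigenvalue (the smallest belongs to the constant vector), and read the clustering off its sign. A minimal sketch (the Gaussian similarity w_ij = exp(−||x_i − x_j||²) is an assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
# two well-separated point clouds of 10 points each
pts = np.vstack([rng.normal(0, .3, (10, 2)), rng.normal(5, .3, (10, 2))])
d2 = ((pts[:, None] - pts[None]) ** 2).sum(-1)
W = np.exp(-d2)                     # similarity matrix
D = np.diag(W.sum(axis=1))          # degree matrix
L = D - W                           # graph Laplacian

evals, evecs = np.linalg.eigh(L)    # ascending eigenvalues
f = evecs[:, 1]                     # "approximate indicator function"
labels = (f > 0).astype(int)        # sign threshold instead of k-means
```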
7.9.2 Clustering the Yale face database
On the webpage find and download the Yale face database http://ipvs.informatik.uni-stuttgart.de/mlr/marc/teaching/data/yalefaces_cropBackground.tgz. The file contains gif images of 165 faces.
We’ll cluster the faces using k-means into K = 4 clusters.
a) Compute a k-means clustering starting with random initializations of the centers. Repeat k-means clustering 10 times. For each run, report on the clustering error Σ_i (x_i − µ_{c(i)})² and pick the best clustering. Display the center faces µ_k and perhaps some samples for each cluster.
b) Repeat the above for various K and plot the clustering error over K.
c) Repeat the above on the first 20 principal components of the data. Discussion in the tutorial: Is PCA the best way to reduce dimensionality as a precursor to k-means clustering? What would be the ‘ideal’ way to reduce dimensionality as precursor to k-means clustering?
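A plain k-means sketch for a), tracking the clustering error Σ_i (x_i − µ_{c(i)})² (a generic implementation, not specific to the face data):

```python
import numpy as np

def kmeans(X, K, iters=100, rng=np.random.default_rng(0)):
    """Plain k-means: random centers, then alternate assignment and update."""
    mu = X[rng.choice(len(X), K, replace=False)]   # centers = random data points
    for _ in range(iters):
        d2 = ((X[:, None] - mu[None]) ** 2).sum(-1)   # n x K squared distances
        c = d2.argmin(axis=1)                          # assignments c(i)
        for k in range(K):
            if np.any(c == k):
                mu[k] = X[c == k].mean(axis=0)         # re-estimate centers
    c = ((X[:, None] - mu[None]) ** 2).sum(-1).argmin(axis=1)
    err = ((X - mu[c]) ** 2).sum()                     # clustering error
    return c, mu, err
```

As the exercise suggests, run it several times with different random initializations and keep the run with the lowest error.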
7.10 Exercise 10
7.10.1 Gradient Boosting for classification
Consider the following weak learner for classification: Given a data set D = {(x_i, y_i)}_{i=1}^n, y_i ∈ {−1, +1}, the weak learner picks a single i* and defines the discriminative function

    f(x) = α e^{−(x − x_{i*})² / 2σ²} ,

with fixed width σ and variable parameter α. Therefore, this weak learner is parameterized only by i* and α ∈ R, which are chosen to minimize the neg-log-likelihood

    L^{nll}(f) = −Σ_{i=1}^n log σ(y_i f(x_i)) .
a) Write down explicitly a pseudo code for gradient boosting with this weak learner. By “pseudo code” I mean explicit equations for every step that can directly be implemented in Matlab. This needs to be specific for this particular learner and loss.
b) Here is a 1D data set, where ◦ are 0-class, and × 1-class data points. “Simulate” the algorithm graphically on paper.

[Figure: a 1D data set; the annotations indicate the weak learner’s width and which point is chosen first.]
c) If we would replace the neg-log-likelihood by a hinge loss, what would be the relation to SVMs?
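A hedged sketch of what the pseudo code in a) amounts to, on synthetic 1D data: in each boosting round, pick the centre i* and weight α that greedily minimize L^{nll}(f + α φ_{i*}). Here α is found by a coarse grid search, which is a simplification of an exact line search:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-1, .3, 20), rng.normal(1, .3, 20)])
y = np.concatenate([-np.ones(20), np.ones(20)])
width = 0.5                              # the fixed width sigma of the RBF

def sig(z):
    return 1.0 / (1.0 + np.exp(-z))

def nll(f):
    return -np.sum(np.log(sig(y * f)))   # neg-log-likelihood of the ensemble

def phi(istar):
    # the weak learner's feature: an RBF centred on data point istar
    return np.exp(-(x - x[istar]) ** 2 / (2 * width ** 2))

f = np.zeros_like(x)
for _ in range(10):                      # boosting rounds
    best = None
    for istar in range(len(x)):
        for alpha in np.linspace(-3, 3, 61):   # coarse line search over alpha
            L = nll(f + alpha * phi(istar))
            if best is None or L < best[0]:
                best = (L, istar, alpha)
    _, istar, alpha = best
    f = f + alpha * phi(istar)           # add the greedily chosen weak learner
```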
7.10.2 Sum of 3 dice

You have 3 dice (potentially fake dice where each one has a different probability table over the 6 values). You’re given all three probability tables P(D1), P(D2), and P(D3). Write down a) the equations and b) an algorithm (in pseudo code) that computes the conditional probability P(S|D1) of the sum of all three dice conditioned on the value of the first die.
function PosteriorProbability(S, D1)
    p ← 0
    for D2 in {1, 2, 3, 4, 5, 6} do
        p ← p + P2(D2) P3(S − D1 − D2)
    end for
    return p
end function
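A direct Python transcription of the pseudo code (the probability tables P2, P3 are passed as dictionaries; the range check on D3 implements that P3 is zero outside the die’s support):

```python
def posterior_probability(S, D1, P2, P3):
    """P(S | D1) = sum_{D2} P2(D2) * P3(S - D1 - D2)."""
    p = 0.0
    for D2 in range(1, 7):
        D3 = S - D1 - D2
        if 1 <= D3 <= 6:            # P3 is zero outside the die's support
            p += P2[D2] * P3[D3]
    return p

# usage with fair dice: P(S=4 | D1=1) needs D2 + D3 = 3, i.e. (1,2) or (2,1)
fair = {v: 1.0 / 6 for v in range(1, 7)}
p4 = posterior_probability(4, 1, fair, fair)
```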
7.10.3 Product of Gaussians
A Gaussian distribution over x ∈ Rⁿ with mean µ and covariance matrix Σ is defined as

    N(x | µ, Σ) = (1 / |2πΣ|^{1/2}) exp{ −½ (x − µ)ᵀ Σ⁻¹ (x − µ) }
Multiplying probability distributions is a fundamental operation, and multiplying two Gaussians is needed in many models.
a) Prove

    N(x | a, A) N(x | b, B) = c N(x | (A⁻¹ + B⁻¹)⁻¹ (A⁻¹ a + B⁻¹ b), (A⁻¹ + B⁻¹)⁻¹) ,

where c collects terms independent of x.
b) Now also compute the normalization c as a function of a,A, b, B.
Note: The so-called canonical form of a Gaussian is defined as N[x | a, A] = N(x | A⁻¹ a, A⁻¹); in this convention the product reads much nicer: N[x | a, A] N[x | b, B] ∝ N[x | a + b, A + B]. You can first prove this before proving the above, if you like.
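A 1-D numerical check of the product formula (a sketch; it also confirms the known closed form c = N(a | b, A + B) for part b)):

```python
import numpy as np

def gauss(x, mu, var):
    # 1-D Gaussian density N(x | mu, var)
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

x = np.linspace(-5, 5, 1001)
a, A, b, B = 0.5, 1.0, -1.0, 2.0

lhs = gauss(x, a, A) * gauss(x, b, B)
C = 1.0 / (1.0 / A + 1.0 / B)           # (A^-1 + B^-1)^-1
c_mu = C * (a / A + b / B)              # (A^-1 + B^-1)^-1 (A^-1 a + B^-1 b)
rhs = gauss(x, c_mu, C)
ratio = lhs / rhs                       # constant in x, equal to c
```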
7.11 Exercise 11
7.11.1 Gaussian Processes
Consider a Gaussian Process prior P(f) over functions defined by the mean function µ(x) = 0, the γ-exponential covariance function

    k(x, x′) = exp{ −|(x − x′)/l|^γ }

and an observation noise σ = 0.1. We assume x ∈ R is 1-dimensional. First consider the standard squared exponential kernel with γ = 2 and l = 0.2.
a) Assume we have two data points (−0.5, 0.3) and (0.5, −0.1). Display the posterior P(f | D). For this, compute the mean posterior function f(x) and the standard deviation function σ(x) (on the 100 grid points) exactly as on slide 06:10, using λ = σ². Then plot f, f + σ and f − σ to display the posterior mean and standard deviation.

b) Now display the posterior P(y* | x*, D). This is only a tiny difference from the above (see slide 06:8). The mean is the same, but the variance of y* additionally includes the observation noise σ².
c) Repeat a) & b) for a kernel with γ = 1.
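A sketch of a) and b) with λ = σ² as on the slides (the grid range [−1, 1] is an assumption):

```python
import numpy as np

def kernel(x1, x2, gamma=2.0, l=0.2):
    # gamma-exponential covariance function k(x, x')
    return np.exp(-np.abs((x1 - x2) / l) ** gamma)

Xtr = np.array([-0.5, 0.5])
ytr = np.array([0.3, -0.1])
sigma = 0.1
lam = sigma ** 2
xs = np.linspace(-1, 1, 100)                    # 100 grid points

K = kernel(Xtr[:, None], Xtr[None, :])          # 2 x 2 kernel matrix
kappa = kernel(xs[:, None], Xtr[None, :])       # 100 x 2 cross-kernel
Kinv = np.linalg.inv(K + lam * np.eye(2))
alpha = Kinv @ ytr

mean = kappa @ alpha                            # posterior mean f(x)
var_f = kernel(xs, xs) - np.einsum('ij,jk,ik->i', kappa, Kinv, kappa)
std_f = np.sqrt(np.maximum(var_f, 0))           # posterior std of f, part a)
std_y = np.sqrt(var_f + sigma ** 2)             # predictive std of y*, part b)
```

Plotting `mean`, `mean ± std_f` (and `mean ± std_y` for b)) over `xs` gives the requested figures.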
Index

AdaBoost (5:83)
Agglomerative Hierarchical Clustering (5:67)
Autoencoders (5:28)
Bagging (5:78)
Bayes’ Theorem (7:12)
Bayesian (kernel) logistic regression (6:17)
Bayesian (kernel) ridge regression (6:4)
Bayesian Kernel Logistic Regression (6:20)
Bayesian Kernel Ridge Regression (6:11)
Bayesian Network (8:3)
Bayesian Neural Networks (6:22)
Bayesian parameter estimate (8:57)
Bayesian prediction (8:57)
Belief propagation (8:43)
Bernoulli and Binomial distributions (7:21)
Beta (7:22)
Boosting (5:82)
Boosting decision trees (5:96)
Bootstrapping (5:78)
Centering and whitening (5:99)
Class Huber loss (5:7)
Clustering (5:60)
Conditional distribution (7:10)
Conditional independence in a Bayes Net (8:9)
Conditional random field (8:53)
Conditional random field (8:93)
Conditional random fields (3:23)
Conjugate priors (7:30)
Cross validation (2:17)
Decision trees (5:91)
Deep learning by regularizing the embedding (5:26)
Dirac distribution (7:32)
Dirichlet (7:26)
Discriminative function (3:1)
Dual formulation of Lasso regression (2:25)
Dual formulation of ridge regression (2:24)
Entropy (7:41)
Epanechnikov quadratic smoothing kernel (5:71)
Estimator variance (2:12)
Expectation Maximization (EM) algorithm (8:63)
Expected data log-likelihood (8:63)
Factor analysis (5:55)
Factor graph (8:37)
Feature selection (2:19)
Features (2:6)
Forsyth loss (5:5)
Free energy formulation of EM (8:86)
Gaussian (7:33)
Gaussian mixture model (5:65)
Gaussian mixture model (8:66)
Gaussian Process (6:11)
Gaussian Process Classification (6:20)
Gibbs sampling (8:31)
Gradient boosting (5:87)
Hidden Markov model (8:72)
Hinge loss (5:7)
Hinge loss leads to max margin linear program (5:12)
HMM inference (8:78)
Huber loss (5:5)
Importance sampling (7:45)
Importance sampling (8:28)
Independent component analysis (5:56)
Inference in graphical models: overview (8:22)
Inference: general meaning (7:3)
Inference: general meaning (8:14)
ISOMAP (5:50)
Joint distribution (7:10)
Junction tree algorithm (8:48)
k-means clustering (5:61)
kd-trees (5:73)
Kernel PCA (5:39)
Kernel ridge regression (4:1)
Kernel trick (4:1)
kNN smoothing kernel (5:71)
Kullback-Leibler divergence (7:42)
Lasso regularization (2:19)
Learning as Bayesian inference (7:19)
Learning with missing data (8:61)
Least squares loss (5:5)
Linear regression (2:3)
Logistic regression (binary) (3:6)
Logistic regression (multi-class) (3:14)
Loopy belief propagation (8:46)
Marginal (7:10)
Maximum a posteriori (MAP) parameter estimate (8:57)
Maximum a-posteriori (MAP) inference (8:52)
Maximum conditional likelihood (8:91)
Maximum likelihood parameter estimate (8:57)
Message passing (8:43)
Mixture models (8:71)
Model averaging (5:80)
Monte Carlo methods (7:43)
Multidimensional scaling (5:47)
Multinomial (7:25)
Multiple RVs, conditional independence (7:13)
Neg-log-likelihood loss (5:7)
Neural networks (5:20)
Non-negative matrix factorization (5:53)
Partial least squares (PLS) (5:57)
Particle approximation of a distribution (7:37)
Predictive distribution (6:7)
Principle Component Analysis (PCA) (5:32)
Probability distribution (7:9)
Random forests (5:97)
Random variables (7:8)
Regularization (2:11)
Rejection sampling (7:44)
Rejection sampling (8:25)
Ridge regression (2:15)
Ridge regularization (2:15)
Robust error functions (5:4)
Smoothing kernel (5:70)
Spectral clustering (5:43)
Structured output and structured input (3:18)
Student’s t, Exponential, Laplace, Chi-squared, Gamma distributions (7:47)
Support Vector Machines (SVMs) (5:8)
SVM regression loss (5:6)
Utilities and Decision Theory (7:40)
Variable elimination (8:34)
Weak learners as features (5:81)