All you wanted to know about Regression…
COSC 526 Class 9
Arvind Ramanathan
Computational Science & Engineering Division
Oak Ridge National Laboratory, Oak Ridge
Ph: 865-576-7266
E-mail: [email protected]
Slides inspired by: Andrew Moore (CMU), Andrew Ng (Stanford)
Introducing your guest instructor (Feb 10-12)
• Dr. Sreenivas (Rangan) Sukumar
• Staff member at ORNL:
  – Leader in graph analytics approaches
  – UTK grad…
  – “Healthcare Guru” at ORNL

Side bar:
• The class website will shortly move to a new location at EECS; the original links will still work but will be redirected there.
• Approved for space on the EECS website!
• The Hadoop server is working (finally) and your accounts (UTK id) are also ready. More information on log-in procedures, as well as access to data, is forthcoming…
Last class: Classification with SVMs
• We had a class variable y:
  – Categorical in nature
  – {x1, x2, …, xn} could be anything
• Formulated a quadratic programming problem that would eventually allow us to classify:
  – Stochastic gradient descent (SGD)
• Alterations for big datasets: – Minimum enclosing ball (MEB)
– Shrinking the optimization problem
– Incremental and decremental SVM learning
This class: predicting a real-valued y
• Instead of a categorical class value y, we are going to see how to predict a real-valued y
• Various regression algorithms:
  – Linear regression
– Regression with varying noise
– Non-linear regression
• Adapting regression for big data
Part I: Linear Regression
Regression
As a recent home buyer (or a buyer interested in the market):

| Living area (sq. ft.) | Price ($1000s) |
| --- | --- |
| 2104 | 400 |
| 1600 | 300 |
| 2400 | 370 |
| 1416 | 200 |
| 3000 | 540 |

[Figure: scatter plot of price vs. living area]

• Can we predict the prices of other houses as a function of their living area?

Linear regression helps us with this analysis…
Linear regression
• Linear regression assumes that the expected value of the output, given some input, is linear
• Simplest way to think about this: y = wx for some unknown w

| Living area (sq. ft.) | Price ($1000s) |
| --- | --- |
| 2104 | 400 |
| 1600 | 300 |
| 2400 | 370 |
| 1416 | 200 |
| 3000 | 540 |

[Figure: the same scatter plot with a fitted line of slope w]

Given the data, how do we estimate w…
Some formalism…
• Assume that our data is formed by: yᵢ = w xᵢ + εᵢ (noise)
  – The noise signals εᵢ are independent
  – Drawn from a Normal distribution: εᵢ ~ N(0, σ²)
• So p(y | w, x) has a normal distribution with:
  – mean wx
  – variance σ²
Linear Regression (1)
• We have a bunch of data {(x1, y1), (x2, y2), …, (xn, yn)}, which is all evidence about w
• How do we infer w from the data?
• Bayes rule to our rescue:
  – Maximum likelihood estimate (MLE) of w
  – Because you can do it on a computer!
MLE of w
• For which value of w is the data most likely to have this behavior?
  – i.e., for what w is p(y1, …, yn | x1, …, xn, w) maximized?
  – i.e., for what w is ∏ᵢ p(yᵢ | xᵢ, w) maximized?
Since we know the distribution (we assumed the data came from a normal distribution), each factor is
  p(yᵢ | xᵢ, w) = (1/√(2πσ²)) · exp(−(yᵢ − wxᵢ)² / (2σ²))
MLE of w
• Now do the log-likelihood trick:
  log ∏ᵢ p(yᵢ | xᵢ, w) = −(n/2) log(2πσ²) − (1/(2σ²)) Σᵢ (yᵢ − wxᵢ)²
• Equivalently, maximizing the likelihood means minimizing:
  E(w) = Σᵢ (yᵢ − wxᵢ)²
now we are in familiar territory… (least squares)
All we have to do is …
• Take the derivative of E(w) w.r.t. w and set it to 0:
  dE/dw = −2 Σᵢ xᵢ (yᵢ − wxᵢ) = 0  ⟹  w = (Σᵢ xᵢyᵢ) / (Σᵢ xᵢ²)
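Plugging the housing table from the earlier slide into this closed form takes a few lines; the helper name `mle_slope` and the 2000 sq. ft. query are illustrative, not from the slides:

```python
# Closed-form MLE for the no-intercept model y = w x:
#   w = (sum_i x_i * y_i) / (sum_i x_i^2)

def mle_slope(xs, ys):
    """Maximum-likelihood slope for y = w x under Gaussian noise."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

# The living-area/price table (price in $1000s).
areas = [2104, 1600, 2400, 1416, 3000]
prices = [400, 300, 370, 200, 540]

w = mle_slope(areas, prices)      # roughly 0.17 ($1000s per sq. ft.)
pred = w * 2000                   # predicted price for a 2000 sq. ft. house
```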
What do we mean by this (graphically)?
• If x = sq. ft. and y = price, then wx is the average price for a house with x = 2104 sq. ft.
• If x = height and y = weight, then wx is the average weight for all people 60 in. tall.
Multi-linear Regression
• Now instead of a single x, let’s say we have x, where it comes from a d-dimensional space

| Living area (sq. ft.) | No. of rooms | Price ($1000s) |
| --- | --- | --- |
| 2104 | 2 | 400 |
| 1600 | 2 | 300 |
| 2400 | 3 | 370 |
| 1416 | 2 | 200 |
| 3000 | 4 | 540 |

How do we think of doing regression?
• Remember there are d dimensions (d = 2 here)
• Can we visualize our data in a way that is easy to “regress”?
Matrix algebra to our rescue…
• out(x) = wᵀx = w1x[1] + w2x[2] + … + wdx[d]
• How do we learn w?
• Let’s define a cost function:
  J(w) = Σₖ (yₖ − wᵀxₖ)²
MLE is very similar to the simple regression story…
• The MLE is given by: w = (XᵀX)⁻¹ Xᵀy, where X is the n × d matrix whose kth row is xₖᵀ
• XᵀX is a d × d matrix:
  – whose (i, j)th element is Σₖ xₖ[i] xₖ[j]
• Xᵀy is a d-element vector:
  – with ith element Σₖ xₖ[i] yₖ
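As a sanity check, here is a minimal sketch of w = (XᵀX)⁻¹Xᵀy for d = 2, with the 2×2 inverse written out so no linear-algebra library is needed (the function name and toy data are illustrative):

```python
def fit_two_features(X, y):
    """Normal-equation solution w = (X^T X)^-1 X^T y for exactly two features."""
    # Accumulate X^T X (a 2x2 matrix) and X^T y (a 2-vector).
    a = sum(x[0] * x[0] for x in X)
    b = sum(x[0] * x[1] for x in X)
    d = sum(x[1] * x[1] for x in X)
    s0 = sum(x[0] * yi for x, yi in zip(X, y))
    s1 = sum(x[1] * yi for x, yi in zip(X, y))
    # Explicit 2x2 inverse: [[a, b], [b, d]]^-1 = (1/det) [[d, -b], [-b, a]].
    det = a * d - b * b
    w0 = (d * s0 - b * s1) / det
    w1 = (-b * s0 + a * s1) / det
    return w0, w1

# Noise-free data generated from y = 3*x1 + 2*x2; the MLE recovers (3, 2).
X = [(1, 1), (2, 1), (1, 3), (4, 2)]
y = [3 * x1 + 2 * x2 for x1, x2 in X]
w0, w1 = fit_two_features(X, y)
```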
How to solve this on a computer?
• Let’s say I have an initial guess for w
• I need to search for a suitable w that will make J(w) smaller
• Idea: use gradient descent!

Repeat until convergence:
  For every j = 1…d:
    Calculate the gradient: ∂J/∂wⱼ = −2 Σₖ (yₖ − wᵀxₖ) xₖ[j]
    Update: wⱼ ← wⱼ − α ∂J/∂wⱼ  (α is the learning rate)
Problem(s) with gradient descent
• It will converge: for linear regression the cost has a single global minimum, so GD will converge to the solution!
• It takes a long time if the training examples are large in number:
  – Each iteration scans through the entire training dataset
  – Can do stochastic gradient descent (SGD) in a similar way to what we discussed last time…
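The batch loop described above can be sketched as follows (names, step size, and toy data are illustrative; an SGD variant would instead update w after each single example):

```python
def gradient_descent(X, y, alpha=0.05, iters=2000):
    """Batch gradient descent on J(w) = sum_k (y_k - w.x_k)^2."""
    d = len(X[0])
    w = [0.0] * d
    for _ in range(iters):
        # Residuals y_k - w.x_k for the current guess.
        res = [yk - sum(wj * xj for wj, xj in zip(w, xk)) for xk, yk in zip(X, y)]
        # Full gradient: dJ/dw_j = -2 * sum_k res_k * x_k[j].
        grad = [-2 * sum(rk * xk[j] for rk, xk in zip(res, X)) for j in range(d)]
        # Simultaneous update of every coordinate.
        w = [wj - alpha * gj for wj, gj in zip(w, grad)]
    return w

# Noise-free data from w = (2, -1); GD recovers it.
X = [(1, 0), (0, 1), (1, 1), (2, 1)]
y = [2 * x1 - x2 for x1, x2 in X]
w = gradient_descent(X, y)
```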
Pesky detail…
• We always talked about the line as if it passed through the origin
• What if this is not the case?

[Figure: price vs. living area, comparing a fit through the origin with a fit that has an intercept]
Let’s fake it… neat trick!
• Create a fake input x0 with a value of 1 (always)

| x1 | x2 | y |
| --- | --- | --- |
| 2104 | 2 | 400 |
| 1600 | 2 | 300 |
| 2400 | 3 | 370 |
| 1416 | 2 | 200 |
| 3000 | 4 | 540 |

| x0 | x1 | x2 | y |
| --- | --- | --- | --- |
| 1 | 2104 | 2 | 400 |
| 1 | 1600 | 2 | 300 |
| 1 | 2400 | 3 | 370 |
| 1 | 1416 | 2 | 200 |
| 1 | 3000 | 4 | 540 |

y = w1x1 + w2x2 becomes y = w0x0 + w1x1 + w2x2 = w0 + w1x1 + w2x2
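The trick can be checked numerically: for a single input, least squares over (x0 = 1, x1) reduces to the classic mean/covariance shortcut, and w0 comes out as the intercept (the helper name `fit_line` is illustrative):

```python
def fit_line(xs, ys):
    """Least squares for y = w0*x0 + w1*x1 with the fake input x0 = 1,
    written via the mean/covariance shortcut for the one-input case."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    w1 = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / \
         sum((x - xbar) ** 2 for x in xs)
    w0 = ybar - w1 * xbar   # coefficient of the fake input x0 = 1
    return w0, w1

xs = [0, 1, 2, 3]
ys = [5 + 2 * x for x in xs]    # intercept 5, slope 2
w0, w1 = fit_line(xs, ys)
```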
Let’s say we know something about the noise added to each data point
• E.g.: I know the variance of the noise added to each data point…
| xᵢ | σᵢ² | yᵢ |
| --- | --- | --- |
| 0.5 | 4 | 0.5 |
| 1 | 1 | 1 |
| 2 | 0.25 | 1 |
| 2 | 4 | 3 |
| 3 | 0.25 | 2 |
Now, how do we do the MLE?
MLE with varying noise
Assuming independence among the noise terms, plug in the Gaussian density and simplify; setting d(LL)/dw = 0 for the maximum gives:
  w = (Σᵢ xᵢyᵢ/σᵢ²) / (Σᵢ xᵢ²/σᵢ²)
i.e., we minimize the weighted sum of squares Σᵢ (yᵢ − wxᵢ)²/σᵢ²
Weighted Regression
• We just saw “weighted regression”
• Points that have “higher confidence” (lower noise) matter more
• The rest are down-weighted by the variance of their noise
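For the one-input case, the weighted MLE has the closed form w = (Σ xᵢyᵢ/σᵢ²) / (Σ xᵢ²/σᵢ²); a sketch using the table from the previous slide (the helper name is illustrative):

```python
def weighted_slope(xs, ys, variances):
    """MLE for y = w x when point i has known noise variance sigma_i^2:
    each point is weighted by 1 / sigma_i^2."""
    num = sum(x * y / v for x, y, v in zip(xs, ys, variances))
    den = sum(x * x / v for x, v in zip(xs, variances))
    return num / den

# The slide's table: x_i, sigma_i^2, y_i.
xs = [0.5, 1, 2, 2, 3]
vs = [4, 1, 0.25, 4, 0.25]
ys = [0.5, 1, 1, 3, 2]
w = weighted_slope(xs, ys, vs)
```

With equal variances this reduces to the ordinary least-squares slope, as expected.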
Part II: Non-linear Regression
Non-linear regression…
• Suppose y is related to a function of x in such a way that the predicted values have a non-linear relationship…
| xᵢ | yᵢ |
| --- | --- |
| 0.5 | 0.05 |
| 1 | 2.5 |
| 2 | 3 |
| 3 | 2 |
| 3 | 3 |

Assume…
Non-linear MLE
• Ugly, ugly algebra!!! What do we do?
  – Line search
– Simulated annealing
– GD and SGD
– Newton’s method
– Expectation Maximization!
Polynomial Regression…
• All this while, we were talking about linear regression
• But, it may not be the best way to describe data
• Be careful about how to fit the data…
Suppose we add an additional term…
• Quadratic regression: each component is now called a term
• Each column is called a term column
• How many terms in a quadratic regression with p inputs?
  – 1 constant term
  – p linear terms
  – (p+1)C2 quadratic terms ⟹ O(p²) terms in total

Solving our MLE:
• Similar to our linear regression: w = (XᵀX)⁻¹(Xᵀy)
• Cost will be O(p⁶), since we invert a matrix whose size is O(p²) × O(p²)
Generalizing: p inputs, Qth degree polynomial… how many terms?
• = number of unique terms of the form x1^q1 · x2^q2 · … · xp^qp with q1 + q2 + … + qp ≤ Q
• = the number of lists of non-negative integers [q0, q1, …, qp] that sum to Q (q0 absorbs the unused degree)
• = (Q+p)CQ terms!!
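The count is easy to check with `math.comb`; note that for the quadratic case the slide's breakdown 1 + p + (p+1)C2 agrees with (2+p)C2:

```python
from math import comb

def n_poly_terms(p, Q):
    """Number of monomials of total degree <= Q in p inputs: C(Q+p, Q)."""
    return comb(Q + p, Q)

# Quadratic case (Q = 2): 1 constant + p linear + C(p+1, 2) quadratic terms.
p = 5
quadratic_count = 1 + p + comb(p + 1, 2)   # should equal n_poly_terms(5, 2)
```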
Notes of caution…
• Is a polynomial of degree 2 better than one of degree 5?
• A linear fit underfits the data:
  – the data shows structure not captured by the model
• A high-degree polynomial fit overfits the data:
  – the model fits the noise as strongly as the signal…

Moral of the story:
• Selecting the model is important
• More important is the selection of the features!!
Locally Weighted Regression (LWR)
• An approach to reduce the dependency on selecting features:
  – Many datasets don’t have linear descriptors
• We have seen this before:
  – in the weighted regression model
• How do we choose the right weights?
Using the Kernel Trick once again…
• out(x) = Σᵢ wᵢ Φᵢ(x), where Φᵢ(x) is the kernel function (a bump centered at cᵢ, e.g. a Gaussian Φᵢ(x) = exp(−‖x − cᵢ‖² / (2·KW²)))

How do we estimate w?
Using the Kernel Trick once again…
• where Φᵢ(x) is the kernel function
All cᵢ are held constant. We will just initialize them at random or on a uniformly spaced grid in d dimensions…
KW (the kernel width) is also held constant. It will be some value that ensures good overlap between the basis functions…
How do we estimate w?
• Same as before…
  – Given the Q basis functions, define a matrix Z such that Zₖⱼ = Φⱼ(xₖ)
  – Here xₖ is the kth input vector…
• Now, we solve: w = (ZᵀZ)⁻¹ Zᵀy
• How do we find the cᵢ and KW?
  – Use BGD / SGD…
  – Other methods will also work
These basis functions are also referred to as radial basis functions (RBFs)
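A minimal sketch of the Z-matrix fit with two Gaussian basis functions at fixed centers and fixed KW, using an explicit 2×2 inverse (all function names and the toy data are illustrative):

```python
import math

def rbf(x, c, kw):
    """Gaussian radial basis function centered at c with kernel width kw."""
    return math.exp(-((x - c) ** 2) / (2 * kw ** 2))

def fit_two_rbfs(xs, ys, centers, kw):
    """w = (Z^T Z)^-1 Z^T y for exactly two basis functions."""
    Z = [[rbf(x, c, kw) for c in centers] for x in xs]   # Z[k][j] = Phi_j(x_k)
    a = sum(z[0] * z[0] for z in Z)
    b = sum(z[0] * z[1] for z in Z)
    d = sum(z[1] * z[1] for z in Z)
    s0 = sum(z[0] * y for z, y in zip(Z, ys))
    s1 = sum(z[1] * y for z, y in zip(Z, ys))
    det = a * d - b * b                                  # explicit 2x2 inverse
    return [(d * s0 - b * s1) / det, (-b * s0 + a * s1) / det]

def predict(x, w, centers, kw):
    return sum(wj * rbf(x, c, kw) for wj, c in zip(w, centers))

# Data generated exactly from two bumps; the fit recovers the true weights.
centers, kw = [0.0, 2.0], 1.0
true_w = [1.5, -0.5]
xs = [-1.0, 0.0, 0.5, 1.0, 2.0, 3.0]
ys = [predict(x, true_w, centers, kw) for x in xs]
w = fit_two_rbfs(xs, ys, centers, kw)
```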
What are good radial basis function choices?
• We talked about overlaps…

[Figure: three plots of radial basis functions over living area vs. no. of rooms: too little overlap? too much overlap? just about right overlap…]
Robust Regression…
• Best quadratic fit:
  – what is the problem here?
• What would we want?
  – a better fit to the varying data!
  – How can we find the better-fitting curve?
LOESS-based Robust Regression
• After the initial fit, score each data point by how well it is fitted (a good data point vs. a horrible one, i.e., an outlier)

Repeat until convergence:
  For every k = 1…m:
    Let (xₖ, yₖ) be the kth data point
    Let ŷₖ be the current estimate of yₖ
    Choose a weight wₖ that is large if the data point is fitted well and very small if it is not
  Redo the regression with the weighted data points

How do we know we have converged? Use expectation maximization (EM)
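The reweighting loop can be sketched for the one-input case; the specific weight rule 1/(1 + residual²) is an assumed choice for illustration, not the one from the slides:

```python
def robust_slope(xs, ys, iters=10):
    """Iteratively reweighted fit of y = w x: after each fit, badly fitted
    points receive small weights and the regression is redone."""
    weights = [1.0] * len(xs)
    w = 0.0
    for _ in range(iters):
        # Weighted least squares with the current weights.
        w = sum(wk * x * y for wk, x, y in zip(weights, xs, ys)) / \
            sum(wk * x * x for wk, x in zip(weights, xs))
        # Assumed weight rule: down-weight points with large residuals.
        weights = [1.0 / (1.0 + (y - w * x) ** 2) for x, y in zip(xs, ys)]
    return w

# y = 2x plus one horrible data point: the plain fit is dragged far from 2,
# while the robust fit stays near the true slope.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 100]
plain = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
robust = robust_slope(xs, ys)
```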
Multilinear Interpolation
• How to create a piecewise, linear fit to the data?
Create a set of “knot points” equally spaced along the data…
Let’s assume that the data points are generated by a noisy function that is allowed to bend only at these knot points…
We can do a linear regression for every segment identified here…
How to find the best fit?
• With some algebraic manipulations…
[Figure: piecewise-linear fit with knot points q1 … q6 and segment heights such as h2 and h3]
Can we do classification with this?
• Map y to {0, 1}: negative and positive class
• Function: the logistic/sigmoid function g(z) = 1 / (1 + e⁻ᶻ), applied as g(θᵀx)
• Note g(θᵀx) → 1 as θᵀx → ∞
• g(θᵀx) → 0 as θᵀx → −∞
How do we do MLE on this?
Interpreting h(x) = g(θᵀx) as p(y = 1 | x; θ), the likelihood is L(θ) = ∏ᵢ h(xᵢ)^yᵢ (1 − h(xᵢ))^(1−yᵢ), so the log-likelihood is ℓ(θ) = Σᵢ [yᵢ log h(xᵢ) + (1 − yᵢ) log(1 − h(xᵢ))]. Maximizing it by gradient ascent gives the update θⱼ ← θⱼ + α Σᵢ (yᵢ − h(xᵢ)) xᵢ[j].
Another approach to maximize L(θ)
• Using Newton’s approach for finding a zero of a function, applied to ∇θ ℓ(θ):
  θ ← θ − H⁻¹ ∇θ ℓ(θ)
• Hessian H: an n × n matrix (n = number of parameters) that keeps track of all second partial derivatives
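For a single parameter the Hessian is a scalar and the update is just θ ← θ − ℓ′(θ)/ℓ″(θ); a minimal sketch (function names and the non-separable toy labels are illustrative):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def newton_logistic(xs, ys, iters=25):
    """One-parameter logistic regression h(x) = g(theta * x) fit by Newton:
    l'(theta)  = sum_i (y_i - h(x_i)) x_i
    l''(theta) = -sum_i h(x_i) (1 - h(x_i)) x_i^2"""
    theta = 0.0
    for _ in range(iters):
        preds = [sigmoid(theta * x) for x in xs]
        grad = sum((y - p) * x for y, p, x in zip(ys, preds, xs))
        hess = -sum(p * (1 - p) * x * x for p, x in zip(preds, xs))
        theta -= grad / hess        # Newton step on the scalar parameter
    return theta

# Non-separable labels so the MLE is finite (separable data pushes theta to infinity).
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0, 0, 1, 0, 1, 1]
theta = newton_logistic(xs, ys)
```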
Generalizing further…
• Regression: y | x ~ Gaussian
• Classification: y | x ~ Bernoulli
• Begin by defining an exponential family of distributions:
  p(y; η) = b(y) · exp(ηᵀ T(y) − a(η))
  – η: the natural parameter
  – T(y): the sufficient statistic
  – a(η): the log partition function
Bernoulli and Gaussian as specific GLMs
Writing Bernoulli(φ) in the exponential-family form gives natural parameter η = log(φ/(1 − φ)), whose inverse is the sigmoid (logistic regression); writing the Gaussian with fixed variance gives η = μ, recovering linear regression.
Softmax Regression
• Instead of a response variable y taking values in {0, 1}, we can have y take one of k values {1, 2, …, k}
• Ex.: mail classification = {spam, personal mail, work mail, advertisement}
• This is a GLM with a multinomial response…
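The multinomial GLM turns k real-valued scores (one per class) into class probabilities via the softmax function; a minimal sketch (the scores are hypothetical):

```python
import math

def softmax(scores):
    """Turn k real-valued scores into a probability distribution over k classes."""
    m = max(scores)                        # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for {spam, personal, work, advertisement}.
probs = softmax([2.0, 1.0, 0.5, -1.0])     # probs sum to 1; spam is most likely
```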
Part III: What do we do with Big Data?
Can we make Regression Faster?
• Cost is at least O(p²m):
  – where p is the number of features (columns)
  – and m is the number of training examples
• Usually only a small subset k of the p features is relevant: k << p
• What can we do to exploit this?
  – Variance inflation factor (VIF) regression: O(pm)
VIF regression
• Evaluation step:
  – approximate the partial correlation of each candidate variable (feature xᵢ) with y using a small pre-sampled set of data [stagewise regression]
• Search step:
  – test each xᵢ sequentially using an α-investing rule

D. Lin, D.P. Foster, L.H. Ungar, VIF Regression, arXiv, 2012
Other standard approaches also work…
• MapReduce
• Gather/Apply/Scatter (GAS) [to be seen in the future]
• Spark!

What you need to know:
• Regression is one of the most commonly used ML algorithms
• It comes in many flavors and can be generalized using GLMs
• Research still needs to be carried out for big datasets