TRANSCRIPT
Advanced Introduction to Machine Learning
10715, Fall 2014
Linear Regression and Sparsity
Eric Xing, Lecture 2, September 10, 2014
Reading:
© Eric Xing @ CMU, 2014 1
Machine learning for apartment hunting
Now you've moved to Pittsburgh!! And you want to find the most reasonably priced apartment satisfying your needs:
square-ft., # of bedroom, distance to campus …
Living area (ft²)   # bedrooms   Rent ($)
230                 1            600
506                 2            1000
433                 2            1100
109                 1            500
…
150                 1            ?
270                 1.5          ?
© Eric Xing @ CMU, 2014 2
The learning problem
Features: Living area, distance to campus, #
bedroom … Denote as x=[x1, x2, … xk]
Target: Rent Denoted as y
Training set:
[Figures: scatter plots of rent vs. living area, and of rent vs. living area and location]

$X = \begin{bmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_n^T \end{bmatrix} = \begin{bmatrix} x_1^1 & x_1^2 & \cdots & x_1^k \\ x_2^1 & x_2^2 & \cdots & x_2^k \\ \vdots & & & \vdots \\ x_n^1 & x_n^2 & \cdots & x_n^k \end{bmatrix}, \qquad Y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}$
© Eric Xing @ CMU, 2014 3
Linear Regression
Assume that Y (target) is a linear function of X (features), e.g.:
$\hat{y} = \theta_0 + \theta_1 x_1 + \theta_2 x_2$
Let's assume a vacuous "feature" $x_0 = 1$ (this is the intercept term; why?), and define the feature vector to be $x = [x_0, x_1, \ldots, x_k]$.
Then we have the following general representation of the linear function:
$\hat{y} = \theta^T x$
Our goal is to pick the optimal $\theta$. How? We seek the $\theta$ that minimizes the following cost function:
$J(\theta) = \frac{1}{2}\sum_{i=1}^{n} \left(\hat{y}(x_i) - y_i\right)^2$
© Eric Xing @ CMU, 2014 4
The Least-Mean-Square (LMS) method
The Cost Function:
$J(\theta) = \frac{1}{2}\sum_{i=1}^{n} \left(x_i^T\theta - y_i\right)^2$
Consider a gradient descent algorithm:
$\theta_j^{t+1} = \theta_j^t - \alpha\,\frac{\partial}{\partial\theta_j} J(\theta^t)$
© Eric Xing @ CMU, 2014 5
The Least-Mean-Square (LMS) method
Now we have the following descent rule:
$\theta_j^{t+1} = \theta_j^t + \alpha \sum_{i=1}^{n} \left(y_i - x_i^T\theta^t\right) x_{i,j}$
For a single training point, we have:
$\theta_j^{t+1} = \theta_j^t + \alpha \left(y_i - x_i^T\theta^t\right) x_{i,j}$
This is known as the LMS update rule, or the Widrow-Hoff learning rule. This is actually a "stochastic", "coordinate" descent algorithm. It can be used as an on-line algorithm.
© Eric Xing @ CMU, 2014 6
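As a concrete illustration of the LMS rule above, here is a minimal NumPy sketch on synthetic data; the data-generating coefficients, step size, and number of passes are illustrative choices, not values from the lecture.

import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = 2 + 3*x1 - 1*x2 + noise (illustrative choice)
n, k = 200, 2
X = np.c_[np.ones(n), rng.normal(size=(n, k))]   # prepend the intercept feature x0 = 1
true_theta = np.array([2.0, 3.0, -1.0])
y = X @ true_theta + 0.1 * rng.normal(size=n)

theta = np.zeros(k + 1)
alpha = 0.01                                     # step size (assumed small constant)

# LMS / Widrow-Hoff: update theta using one training point at a time
for epoch in range(50):
    for i in rng.permutation(n):
        residual = y[i] - X[i] @ theta           # (y_i - x_i^T theta)
        theta += alpha * residual * X[i]         # theta_j += alpha * residual * x_{i,j}

print(theta)   # should end up close to [2, 3, -1]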
Geometry and Convergence of LMS
[Figure: LMS trajectory toward the optimum after N = 1, 2, 3 updates]
Claim: when the step size satisfies certain condition, and when certain other technical conditions are satisfied, LMS will converge to an “optimal region”.
$\theta^{t+1} = \theta^t + \alpha \left(y_i - x_i^T\theta^t\right) x_i$
© Eric Xing @ CMU, 2014 7
Steepest Descent and LMS
Steepest descent. Note that:
$\nabla_\theta J = \left[\frac{\partial J}{\partial\theta_1}, \ldots, \frac{\partial J}{\partial\theta_k}\right]^T = -\sum_{n=1}^{N}\left(y_n - x_n^T\theta\right) x_n$
This is a batch gradient descent algorithm:
$\theta^{t+1} = \theta^t + \alpha \sum_{n=1}^{N}\left(y_n - x_n^T\theta^t\right) x_n$
© Eric Xing @ CMU, 2014 8
The normal equations
Write the cost function in matrix form:
$J(\theta) = \frac{1}{2}\sum_{i=1}^{n} \left(x_i^T\theta - y_i\right)^2 = \frac{1}{2}\left(X\theta - y\right)^T\left(X\theta - y\right) = \frac{1}{2}\left(\theta^T X^T X\theta - \theta^T X^T y - y^T X\theta + y^T y\right)$
where $X = \begin{bmatrix} x_1^T \\ \vdots \\ x_n^T \end{bmatrix}$ and $y = \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix}$.
To minimize J(θ), take the derivative and set it to zero:
$\nabla_\theta J = \nabla_\theta\,\frac{1}{2}\,\mathrm{tr}\left(\theta^T X^T X\theta - \theta^T X^T y - y^T X\theta + y^T y\right) = X^T X\theta - X^T y = 0$
The normal equations:
$X^T X\,\theta = X^T y \quad\Rightarrow\quad \theta^* = \left(X^T X\right)^{-1} X^T y$
© Eric Xing @ CMU, 2014 9
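The normal equations can be solved directly; a minimal NumPy sketch (synthetic data assumed), using np.linalg.solve on $X^TX\,\theta = X^Ty$ and, as a numerically safer alternative, a least-squares solver:

import numpy as np

rng = np.random.default_rng(1)
n, k = 200, 2
X = np.c_[np.ones(n), rng.normal(size=(n, k))]
y = X @ np.array([2.0, 3.0, -1.0]) + 0.1 * rng.normal(size=n)

# Normal equations: (X^T X) theta = X^T y
theta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Equivalent, numerically safer route via a least-squares solver
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(theta_normal, theta_lstsq)   # both close to [2, 3, -1]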
Some matrix derivatives
For $f: \mathbb{R}^{m\times n} \to \mathbb{R}$, define:
$\nabla_A f(A) = \begin{bmatrix} \frac{\partial f}{\partial A_{11}} & \cdots & \frac{\partial f}{\partial A_{1n}} \\ \vdots & & \vdots \\ \frac{\partial f}{\partial A_{m1}} & \cdots & \frac{\partial f}{\partial A_{mn}} \end{bmatrix}$
Trace:
$\mathrm{tr}\,A = \sum_{i=1}^{n} A_{ii}$, and for a scalar a, $\mathrm{tr}\,a = a$.
Some facts of matrix derivatives (without proof):
$\mathrm{tr}\,ABC = \mathrm{tr}\,CAB = \mathrm{tr}\,BCA$
$\nabla_A \mathrm{tr}\,AB = B^T$
$\nabla_A \mathrm{tr}\,ABA^TC = CAB + C^T A B^T$
$\nabla_A |A| = |A|\left(A^{-1}\right)^T$
© Eric Xing @ CMU, 2014 10
Comments on the normal equation
In most situations of practical interest, the number of data points N is larger than the dimensionality k of the input space, and the matrix X is of full column rank. If this condition holds, then it is easy to verify that $X^TX$ is necessarily invertible.
The assumption that $X^TX$ is invertible implies that it is positive definite, and thus the critical point we have found is a minimum.
What if X has less than full column rank? Regularization (later).
© Eric Xing @ CMU, 2014 11
Direct and Iterative methods
Direct methods: we can achieve the solution in a single step by solving the normal equations.
Using Gaussian elimination or QR decomposition, we converge in a finite number of steps.
This can be infeasible when data are streaming in real time, or are very large in volume.
Iterative methods: stochastic or steepest gradient descent.
They converge in a limiting sense, but are more attractive in large practical problems. Caution is needed when deciding the learning rate.
© Eric Xing @ CMU, 2014 12
Convergence rate
Theorem: the steepest descent algorithm converges to the minimum of the cost characterized by the normal equations, provided the step size satisfies a suitable condition.
A formal analysis of LMS needs more mathematical muscle; in practice, one can use a small step size α, or gradually decrease α.
© Eric Xing @ CMU, 2014 13
A Summary: LMS update rule
Pros: on-line, low per-step cost, fast convergence and perhaps less prone to local optimum
Cons: convergence to optimum not always guaranteed
Steepest descent
Pros: easy to implement, conceptually clean, guaranteed convergence Cons: batch, often slow converging
Normal equations
Pros: a single-shot algorithm! Easiest to implement.
Cons: need to compute the pseudo-inverse $(X^TX)^{-1}$, which is expensive and can run into numerical issues (e.g., the matrix being singular), although there are ways to get around this …
LMS (on-line) update: $\theta_j^{t+1} = \theta_j^t + \alpha \left(y_n - x_n^T\theta^t\right) x_{n,j}$
Steepest (batch) descent: $\theta^{t+1} = \theta^t + \alpha \sum_{n=1}^{N} \left(y_n - x_n^T\theta^t\right) x_n$
Normal equations: $\theta^* = \left(X^T X\right)^{-1} X^T y$
© Eric Xing @ CMU, 2014 14
Geometric Interpretation of LMS The predictions on the training data are:
$\hat{y} = X\theta^* = X\left(X^TX\right)^{-1}X^T y$
Note that
$\hat{y} - y = \left(X\left(X^TX\right)^{-1}X^T - I\right) y$
and
$X^T\left(\hat{y} - y\right) = X^T\left(X\left(X^TX\right)^{-1}X^T - I\right) y = \left(X^TX\left(X^TX\right)^{-1}X^T - X^T\right) y = 0$
So $\hat{y}$ is the orthogonal projection of $y$ onto the space spanned by the columns of X.
© Eric Xing @ CMU, 2014 15
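The projection property can be checked numerically: with $\hat{y} = X(X^TX)^{-1}X^Ty$, the residual $\hat{y} - y$ is orthogonal to the columns of X. A small sketch with assumed synthetic data:

import numpy as np

rng = np.random.default_rng(2)
X = np.c_[np.ones(50), rng.normal(size=(50, 3))]
y = rng.normal(size=50)

H = X @ np.linalg.solve(X.T @ X, X.T)   # hat (projection) matrix X (X^T X)^{-1} X^T
y_hat = H @ y

# X^T (y_hat - y) should be (numerically) zero: the residual is orthogonal to col(X)
print(np.abs(X.T @ (y_hat - y)).max())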
Probabilistic Interpretation of LMS Let us assume that the target variable and the inputs are
related by the equation:
$y_i = \theta^T x_i + \varepsilon_i$
where ε is an error term capturing unmodeled effects or random noise.
Now assume that ε follows a Gaussian $N(0, \sigma^2)$; then we have:
$p(y_i \mid x_i; \theta) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{\left(y_i - \theta^T x_i\right)^2}{2\sigma^2}\right)$
By the independence assumption:
$L(\theta) = \prod_{i=1}^{n} p(y_i \mid x_i; \theta) = \left(\frac{1}{\sqrt{2\pi}\,\sigma}\right)^{n} \exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{n}\left(y_i - \theta^T x_i\right)^2\right)$
© Eric Xing @ CMU, 2014 16
Probabilistic Interpretation of LMS, cont.
Hence the log-likelihood is:
$\ell(\theta) = n\log\frac{1}{\sqrt{2\pi}\,\sigma} - \frac{1}{\sigma^2}\cdot\frac{1}{2}\sum_{i=1}^{n}\left(y_i - \theta^T x_i\right)^2$
Do you recognize the last term?
Yes, it is:
$J(\theta) = \frac{1}{2}\sum_{i=1}^{n}\left(x_i^T\theta - y_i\right)^2$
Thus, under the independence assumption, LMS is equivalent to MLE of θ!
© Eric Xing @ CMU, 2014 17
Case study: predicting gene expression
The genetic picture
[Figure: genome sequence CGTTTCACTGTACAATTT … with the causal SNPs marked]
a univariate phenotype:
i.e., the expression intensity of a gene
© Eric Xing @ CMU, 2014 18
Association Mapping as Regression

Individual      Genotype (two haplotypes per diploid individual)      Phenotype (BMI)
Individual 1    . . C . . . . . T . . C . . . . . . . T . . .         2.5
                . . C . . . . . A . . C . . . . . . . T . . .
Individual 2    . . G . . . . . A . . G . . . . . . . A . . .         4.8
                . . C . . . . . T . . C . . . . . . . T . . .
…
Individual N    . . G . . . . . T . . C . . . . . . . T . . .         4.7
                . . G . . . . . T . . G . . . . . . . T . . .

(The causal SNP vs. benign SNPs are highlighted in the original figure.)
© Eric Xing @ CMU, 2014 19
Association Mapping as Regression

Individual      Genotype (numerically coded)                          Phenotype (BMI)
Individual 1    . . 0 . . . . . 1 . . 0 . . . . . . . 0 . . .         2.5
Individual 2    . . 1 . . . . . 1 . . 1 . . . . . . . 1 . . .         4.8
…
Individual N    . . 2 . . . . . 2 . . 1 . . . . . . . 0 . . .         4.7

$y_i = \sum_{j=1}^{J} \beta_j x_{ij}$; SNPs with large $|\beta_j|$ are relevant.
© Eric Xing @ CMU, 2014 20
Experimental setup
Asthma dataset: 543 individuals, genotyped at 34 SNPs. Diploid data were transformed into 0/1 (for homozygotes) or 2 (for heterozygotes). X = 543×34 matrix; Y = phenotype variable (continuous). A single phenotype was used for regression.
Implementation details
Iterative methods: batch update and online update implemented. For both methods, the step size α is chosen to be a small fixed value (10⁻⁶); this choice is based on the data used for the experiments. Both methods are run to a maximum of 2000 epochs, or until the change in training MSE is less than 10⁻⁴.
© Eric Xing @ CMU, 2014 21
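A minimal sketch of the experimental protocol described above (fixed step size, at most 2000 epochs, stop when the change in training MSE falls below 10⁻⁴); the genotype and phenotype data are synthetic placeholders, and keeping the slide's step size of 10⁻⁶ here is just an assumption:

import numpy as np

def train(X, y, mode="batch", alpha=1e-6, max_epochs=2000, tol=1e-4):
    """Linear regression by gradient descent; mode is 'batch' or 'online'."""
    n, d = X.shape
    theta = np.zeros(d)
    prev_mse = np.inf
    for epoch in range(max_epochs):
        if mode == "batch":
            theta += alpha * X.T @ (y - X @ theta)        # one update per epoch
        else:
            for i in range(n):                            # N updates per epoch
                theta += alpha * (y[i] - X[i] @ theta) * X[i]
        mse = np.mean((y - X @ theta) ** 2)
        if abs(prev_mse - mse) < tol:                     # stopping criterion from the slide
            break
        prev_mse = mse
    return theta, mse, epoch + 1

# Synthetic stand-in for the 543 x 34 genotype matrix (0/1/2 coded) and a continuous phenotype
rng = np.random.default_rng(3)
X = rng.integers(0, 3, size=(543, 34)).astype(float)
y = X @ rng.normal(size=34) + rng.normal(size=543)

for mode in ("batch", "online"):
    theta, mse, epochs = train(X, y, mode=mode)
    print(mode, "MSE:", round(mse, 3), "epochs:", epochs)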
Convergence Curves
For the batch method, the training MSE is initially large due to uninformed initialization
In the online update, the N updates per epoch reduce the MSE to a much smaller value.
© Eric Xing @ CMU, 2014 22
The Learned Coefficients
© Eric Xing @ CMU, 2014 23
Multivariate Regression for Trait Association Analysis

[Figure: a trait value (2.1) regressed on a genotype matrix whose columns hold SNP alleles (T G, A A, C C, A T, G A, A G, T A); the coefficient vector β gives each SNP's association strength: ?]

$y = X\beta$
© Eric Xing @ CMU, 2014 24
Multivariate Regression for Trait Association Analysis

[Figure: the same trait-on-genotype regression (trait value 2.1); the estimated association strengths contain many non-zero entries]

Many non-zero associations: which SNPs are truly significant?
© Eric Xing @ CMU, 2014 25
Sparsity
One common assumption to make is sparsity.
Makes biological sense: each phenotype is likely to be associated with a small number of SNPs, rather than all the SNPs.
Makes statistical sense: Learning is now feasible in high dimensions with small sample size
© Eric Xing @ CMU, 2014 26
Sparsity: In a mathematical sense
Consider the least squares linear regression problem. Sparsity means most of the β's are zero: roughly, minimize $\sum_i \left(y_i - \beta^T x_i\right)^2$ subject to at most s of the coefficients $\beta_1, \ldots, \beta_n$ being non-zero.
But this is not convex!!! Many local optima; computationally intractable.
[Figure: y modeled as a combination of inputs $x_1, \ldots, x_n$ with weights $\beta_1, \ldots, \beta_n$]
© Eric Xing @ CMU, 2014 27
L1 Regularization (LASSO)(Tibshirani, 1996)
A convex relaxation.
Still enforces sparsity!
Constrained form: $\min_\beta \sum_i \left(y_i - \beta^T x_i\right)^2 \;\text{ s.t. }\; \sum_j |\beta_j| \le t$
Lagrangian form: $\min_\beta \sum_i \left(y_i - \beta^T x_i\right)^2 + \lambda \sum_j |\beta_j|$
© Eric Xing @ CMU, 2014 28
Theoretical Guarantees Assumptions
Dependency Condition: relevant covariates are not overly dependent.
Incoherence Condition: a large number of irrelevant covariates cannot be too correlated with relevant covariates.
Strong concentration bounds: sample quantities converge to expected values quickly.
If these assumptions are met, LASSO will asymptotically recover the correct subset of relevant covariates.
© Eric Xing @ CMU, 2014 29
Consistent Structure Recovery[Zhao and Yu 2006]
© Eric Xing @ CMU, 2014 30
Lasso for Reducing False Positives

[Figure: the trait value (2.1) regressed on the genotype matrix (SNP allele columns T G, A A, C C, A T, G A, A G, T A), with a lasso penalty applied to the association strengths β]

Objective: squared error $+\; \lambda \sum_{j=1}^{J} |\beta_j|$ (the lasso penalty for sparsity).

Many zero associations (sparse results), but what if there are multiple related traits?
© Eric Xing @ CMU, 2014 31
Ridge Regression vs Lasso

Ridge regression: $\hat\beta = \arg\min_\beta \sum_i \left(y_i - \beta^T x_i\right)^2 + \lambda \|\beta\|_2^2$. Lasso: $\hat\beta = \arg\min_\beta \sum_i \left(y_i - \beta^T x_i\right)^2 + \lambda \|\beta\|_1$. HOT!

Lasso (l1 penalty) results in sparse solutions, i.e., vectors with more zero coordinates. Good for high-dimensional problems: we don't have to store all coordinates!

[Figure: in the (β1, β2) plane, the level sets of J(β) meet the constant-l1-norm ball at a corner (a sparse solution), whereas they meet the constant-l2-norm ball away from the axes]
© Eric Xing @ CMU, 2014 32
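A minimal comparison of the two penalties, assuming scikit-learn is available; the data, penalty weights, and sparsity pattern are illustrative rather than taken from the asthma study:

import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(4)
n, d = 100, 30
X = rng.normal(size=(n, d))
beta_true = np.zeros(d)
beta_true[:3] = [2.0, -1.5, 1.0]          # only 3 truly relevant covariates
y = X @ beta_true + 0.1 * rng.normal(size=n)

ridge = Ridge(alpha=1.0).fit(X, y)        # l2 penalty: shrinks, rarely exactly zero
lasso = Lasso(alpha=0.1).fit(X, y)        # l1 penalty: many coefficients exactly zero

print("ridge non-zeros:", np.sum(np.abs(ridge.coef_) > 1e-8))
print("lasso non-zeros:", np.sum(np.abs(lasso.coef_) > 1e-8))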
Bayesian Interpretation
Treat the distribution parameters θ also as a random variable. The a posteriori distribution of θ after seeing the data is:
$p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)} = \frac{p(D \mid \theta)\, p(\theta)}{\int p(D \mid \theta)\, p(\theta)\, d\theta}$
This is Bayes' rule: posterior = likelihood × prior / marginal likelihood.
The prior p(.) encodes our prior knowledge about the domain.
© Eric Xing @ CMU, 2014 33
Regularized Least Squares and MAP

What if $(X^TX)$ is not invertible?

I) Gaussian prior: a prior belief that β is Gaussian with zero mean biases the solution toward "small" β.
MAP estimate: maximize [log likelihood + log prior], which amounts to penalized least squares with an l2 penalty, i.e., Ridge Regression.
Closed form: HW.
© Eric Xing @ CMU, 2014 34
Regularized Least Squares and MAP

What if $(X^TX)$ is not invertible?

II) Laplace prior: a prior belief that β is Laplace with zero mean biases the solution toward "small" β.
MAP estimate: maximize [log likelihood + log prior], which amounts to penalized least squares with an l1 penalty, i.e., Lasso.
Closed form: HW.
© Eric Xing @ CMU, 2014 35
Take home message
Gradient descent: on-line and batch
Normal equations
Geometric interpretation of LMS
Probabilistic interpretation of LMS, and the equivalence of LMS and MLE under a certain assumption (what?)
Sparsity:
Approach: ridge vs. lasso regression
Interpretation: regularized regression versus Bayesian regression
Algorithm: convex optimization (we did not discuss this)
LR does not mean fitting linear relations, but a linear combination of basis functions (that can be non-linear)
Weighting points by importance versus by fitness
© Eric Xing @ CMU, 2014 36
After class material …
© Eric Xing @ CMU, 2014 37
Advanced Material: Beyond basic LR LR with non-linear basis functions
Locally weighted linear regression
Regression trees and Multilinear Interpolation
We will discuss this in the next class, after we set the stage right! (if we've got time)
© Eric Xing @ CMU, 2014 38
LR with non-linear basis functions
LR does not mean we can only deal with linear relationships. We are free to design (non-linear) features under LR:
$y(x) = \theta_0 + \sum_{j=1}^{m} \theta_j\,\phi_j(x) = \theta^T \phi(x)$
where the $\phi_j(x)$ are fixed basis functions (and we define $\phi_0(x) = 1$).
Example: polynomial regression:
$\phi(x) := [1, x, x^2, x^3]$
We will be concerned with estimating (distributions over) the weights θ and choosing the model order M.
© Eric Xing @ CMU, 2014 39
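A short sketch of linear regression on the polynomial basis φ(x) = [1, x, x², x³]: the model stays linear in θ even though it is non-linear in x (the target function and noise level are assumed for illustration):

import numpy as np

rng = np.random.default_rng(5)
x = rng.uniform(-1, 1, size=100)
y = np.sin(3 * x) + 0.1 * rng.normal(size=100)   # a non-linear target (assumed)

# Design matrix of basis functions phi(x) = [1, x, x^2, x^3]
Phi = np.vander(x, N=4, increasing=True)

theta, *_ = np.linalg.lstsq(Phi, y, rcond=None)  # ordinary least squares on the features
print(theta)                                     # weights on 1, x, x^2, x^3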
Basis functions
There are many basis functions, e.g.:
Polynomial: $\phi_j(x) = x^{j-1}$
Radial basis functions: $\phi_j(x) = \exp\left(-\frac{(x - \mu_j)^2}{2 s^2}\right)$
Sigmoidal: $\phi_j(x) = \sigma\left(\frac{x - \mu_j}{s}\right)$
Splines, Fourier, Wavelets, etc.
© Eric Xing @ CMU, 2014 40
1D and 2D RBFs
[Figure: a 1D RBF basis, and the resulting fit after training]
© Eric Xing @ CMU, 2014 41
Good and Bad RBFs A good 2D RBF
Two bad 2D RBFs
© Eric Xing @ CMU, 2014 42
Overfitting and underfitting
$y = \theta_0 + \theta_1 x \qquad\qquad y = \theta_0 + \theta_1 x + \theta_2 x^2 \qquad\qquad y = \sum_{j=0}^{5} \theta_j x^j$
© Eric Xing @ CMU, 2014 43
Bias and variance We define the bias of a model to be the expected
generalization error even if we were to fit it to a very (say, infinitely) large training set.
By fitting "spurious" patterns in the training set, we might again obtain a model with large generalization error. In this case, we say the model has large variance.
© Eric Xing @ CMU, 2014 44
Locally weighted linear regression
The algorithm: Instead of minimizing
$J(\theta) = \frac{1}{2}\sum_{i=1}^{n}\left(x_i^T\theta - y_i\right)^2$
we now fit θ to minimize
$J(\theta) = \frac{1}{2}\sum_{i=1}^{n} w_i \left(x_i^T\theta - y_i\right)^2$
Where do the $w_i$'s come from?
$w_i = \exp\left(-\frac{(x_i - x)^2}{2\tau^2}\right)$
where x is the query point for which we'd like to know its corresponding y (and τ is a bandwidth parameter).
Essentially we put higher weights on (errors on) training examples that are close to the query point (than on those that are further away from the query).
© Eric Xing @ CMU, 2014 45
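A minimal locally weighted linear regression sketch: for each query point we solve a weighted least-squares problem with the Gaussian weights above; the bandwidth τ and the data are assumptions for illustration:

import numpy as np

def lwlr_predict(x_query, X, y, tau=0.3):
    """Fit theta to minimize sum_i w_i (x_i^T theta - y_i)^2 and predict at x_query."""
    # Gaussian weights: points near the query get weight close to 1
    w = np.exp(-np.sum((X - x_query) ** 2, axis=1) / (2 * tau ** 2))
    W = np.diag(w)
    theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)   # weighted normal equations
    return x_query @ theta

rng = np.random.default_rng(6)
x = np.sort(rng.uniform(0, 3, size=100))
y = np.sin(2 * x) + 0.1 * rng.normal(size=100)
X = np.c_[np.ones_like(x), x]                           # intercept + raw input

print(lwlr_predict(np.array([1.0, 1.5]), X, y))         # prediction near x = 1.5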
Parametric vs. non-parametric Locally weighted linear regression is the second example we
are running into of a non-parametric algorithm. (what is the first?)
The (unweighted) linear regression algorithm that we saw earlier is known as a parametric learning algorithm because it has a fixed, finite number of parameters (the θ), which are fit to the
data; Once we've fit the θ and stored them away, we no longer need to keep the
training data around to make future predictions. In contrast, to make predictions using locally weighted linear regression, we need
to keep the entire training set around.
The term "non-parametric" (roughly) refers to the fact that the amount of stuff we need to keep in order to represent the hypothesis grows linearly with the size of the training set.
© Eric Xing @ CMU, 2014 46
Robust Regression
The best fit from a quadratic regression
But this is probably better …
How can we do this?
© Eric Xing @ CMU, 2014 47
LOESS-based Robust Regression
Remember what we do in "locally weighted linear regression"? We "score" each point for its importance.
Now we score each point according to its "fitness".
(Courtesy of Andrew Moore)
© Eric Xing @ CMU, 2014 48
Robust regression For k = 1 to R…
Let $(x_k, y_k)$ be the kth datapoint.
Let $y_k^{est}$ be the predicted value of $y_k$.
Let $w_k$ be a weight for data point k that is large if the data point fits well and small if it fits badly; i.e., $w_k$ is a decreasing function of the squared residual $\left(y_k - y_k^{est}\right)^2$.
Then redo the regression using the weighted data points.
Repeat the whole thing until converged!
© Eric Xing @ CMU, 2014 49
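A sketch of this iterative reweighting loop: fit, score each point by how well it fits, refit with the weights, repeat. The particular weight function (a Gaussian kernel on the squared residual) and the convergence test are illustrative choices, not prescribed by the slide:

import numpy as np

def robust_fit(X, y, n_iters=20, scale=1.0):
    w = np.ones(len(y))
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        W = np.diag(w)
        theta_new = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)   # weighted least squares
        resid = y - X @ theta_new
        # Weight is large when the point fits well, small when it fits badly
        w = np.exp(-resid ** 2 / (2 * scale ** 2))
        if np.allclose(theta_new, theta, atol=1e-8):            # repeat until converged
            break
        theta = theta_new
    return theta

rng = np.random.default_rng(7)
x = rng.uniform(0, 10, size=100)
y = 1.0 + 0.5 * x + 0.2 * rng.normal(size=100)
y[::10] += 20.0                                                 # a few gross outliers
X = np.c_[np.ones_like(x), x]

print(robust_fit(X, y))   # close to [1.0, 0.5] despite the outliers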
Robust regression—probabilistic interpretation What regular regression does:
Assume yk was originally generated using the following recipe:
Computational task is to find the Maximum Likelihood estimation of θ
$y_k \sim N\left(\theta^T x_k,\; \sigma^2\right)$
© Eric Xing @ CMU, 2014 50
Robust regression—probabilistic interpretation What LOESS robust regression does:
Assume $y_k$ was originally generated using the following recipe:
with probability p: $\quad y_k \sim N\left(\theta^T x_k,\; \sigma^2\right)$
but otherwise: $\quad y_k \sim N\left(\mu,\; \sigma_{\text{huge}}^2\right)$
The computational task is to find the Maximum Likelihood estimates of θ, p, µ and $\sigma_{\text{huge}}$.
The algorithm you saw, with iterative reweighting/refitting, does this computation for us. Later you will find that it is an instance of the famous E.M. algorithm.
© Eric Xing @ CMU, 2014 51
Regression Tree Decision tree for regression
Gender   Rich?   Num. Children   # travel per yr.   Age
F        No      2               5                  38
M        No      0               2                  25
M        Yes     1               0                  72
:        :       :               :                  :

[Tree: split on Gender? Female branch: predicted age = 39; Male branch: predicted age = 36]
© Eric Xing @ CMU, 2014 52
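A tiny regression-tree sketch on the toy table, assuming scikit-learn is available; the numeric encoding of the categorical columns is only for illustration:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy table: gender (F=0, M=1), rich (No=0, Yes=1), num. children, # travel per yr.
X = np.array([[0, 0, 2, 5],
              [1, 0, 0, 2],
              [1, 1, 1, 0]])
age = np.array([38, 25, 72])

tree = DecisionTreeRegressor(max_depth=1).fit(X, age)   # a depth-1 stump: a single split
print(tree.predict([[0, 0, 1, 3]]))                     # predicted age for a new person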
A conceptual picture Assuming regular regression trees, can you sketch a graph of
the fitted function y*(x) over this diagram?
© Eric Xing @ CMU, 2014 53
How about this one? Multilinear Interpolation
We wanted to create a continuous and piecewise linear fit to the data
© Eric Xing @ CMU, 2014 54
Take home message
Gradient descent: on-line and batch
Normal equations
Geometric interpretation of LMS
Probabilistic interpretation of LMS, and the equivalence of LMS and MLE under a certain assumption (what?)
Sparsity:
Approach: ridge vs. lasso regression
Interpretation: regularized regression versus Bayesian regression
Algorithm: convex optimization (we did not discuss this)
LR does not mean fitting linear relations, but a linear combination of basis functions (that can be non-linear)
Weighting points by importance versus by fitness
© Eric Xing @ CMU, 2014 55
Appendix
© Eric Xing @ CMU, 2014 56
Parameter Learning from iid Data Goal: estimate distribution parameters from a dataset of N
independent, identically distributed (iid), fully observed, training cases
D = {x1, . . . , xN}
Maximum likelihood estimation (MLE)
1. One of the most common estimators.
2. With the iid and full-observability assumption, write L(θ) as the likelihood of the data:
$L(\theta) = P(x_1, x_2, \ldots, x_N; \theta) = P(x_1;\theta)\,P(x_2;\theta)\cdots P(x_N;\theta) = \prod_{i=1}^{N} P(x_i;\theta)$
3. Pick the setting of parameters most likely to have generated the data we saw:
$\theta^* = \arg\max_\theta L(\theta) = \arg\max_\theta \log L(\theta)$
© Eric Xing @ CMU, 2014 57
Example: Bernoulli model Data:
We observed N iid coin tossing: D={1, 0, 1, …, 0}
Representation: binary r.v.: $x_n \in \{0, 1\}$
Model: $P(x) = \theta$ for $x = 1$ and $P(x) = 1 - \theta$ for $x = 0$; more compactly, $P(x) = \theta^{x}(1-\theta)^{1-x}$
How to write the likelihood of a single observation $x_i$?
$P(x_i) = \theta^{x_i}(1-\theta)^{1-x_i}$
The likelihood of the dataset D = {x₁, …, x_N}:
$P(x_1, x_2, \ldots, x_N \mid \theta) = \prod_{i=1}^{N} P(x_i \mid \theta) = \prod_{i=1}^{N} \theta^{x_i}(1-\theta)^{1-x_i} = \theta^{\sum_i x_i}\,(1-\theta)^{N - \sum_i x_i} = \theta^{\#\text{heads}}\,(1-\theta)^{\#\text{tails}}$
© Eric Xing @ CMU, 2014 58
Maximum Likelihood Estimation
Objective function:
$\ell(\theta; D) = \log P(D \mid \theta) = \log\left(\theta^{n_h}(1-\theta)^{n_t}\right) = n_h \log\theta + (N - n_h)\log(1-\theta)$
We need to maximize this w.r.t. θ. Take the derivative w.r.t. θ and set it to zero:
$\frac{\partial \ell}{\partial \theta} = \frac{n_h}{\theta} - \frac{N - n_h}{1-\theta} = 0 \quad\Rightarrow\quad \hat\theta_{MLE} = \frac{n_h}{N}, \quad\text{or}\quad \hat\theta_{MLE} = \frac{1}{N}\sum_i x_i$
Frequency as sample mean.
Sufficient statistics: the counts $n_h = \sum_i x_i$ are sufficient statistics of the data D.
© Eric Xing @ CMU, 2014 59
Overfitting
Recall that for the Bernoulli distribution, we have
$\hat\theta_{ML}^{head} = \frac{n_{head}}{n_{head} + n_{tail}}$
What if we tossed too few times, so that we saw zero heads? We would have $\hat\theta_{ML}^{head} = 0$, and we would predict that the probability of seeing a head next is zero!!!
The rescue: "smoothing"
$\hat\theta^{head} = \frac{n_{head} + n'}{n_{head} + n_{tail} + n' + n''}$
where n' is known as the pseudo- (imaginary) count.
But can we make this more formal?
© Eric Xing @ CMU, 2014 60
Bayesian Parameter Estimation
Treat the distribution parameters θ also as a random variable. The a posteriori distribution of θ after seeing the data is:
$p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)} = \frac{p(D \mid \theta)\, p(\theta)}{\int p(D \mid \theta)\, p(\theta)\, d\theta}$
This is Bayes' rule: posterior = likelihood × prior / marginal likelihood.
The prior p(.) encodes our prior knowledge about the domain.
© Eric Xing @ CMU, 2014 61
Frequentist Parameter Estimation
Two people with different priors p(θ) will end up with different estimates p(θ | D). Frequentists dislike this "subjectivity".
Frequentists think of the parameter as a fixed, unknown constant, not a random variable.
Hence they have to come up with different "objective" estimators (ways of computing θ̂ from data), instead of using Bayes' rule. These estimators have different properties, such as being "unbiased", "minimum variance", etc.
The maximum likelihood estimator is one such estimator.
© Eric Xing @ CMU, 2014 62
Discussion
θ̂ or p(θ | D), this is the problem!
Bayesians know it
© Eric Xing @ CMU, 2014 63
Bayesian estimation for Bernoulli
Beta distribution:
$P(\theta; \alpha, \beta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\,\theta^{\alpha-1}(1-\theta)^{\beta-1} = B(\alpha, \beta)\,\theta^{\alpha-1}(1-\theta)^{\beta-1}$
When x is discrete: $\Gamma(x+1) = x\,\Gamma(x) = x!$
Posterior distribution of θ:
$P(\theta \mid x_1, \ldots, x_N) = \frac{p(x_1, \ldots, x_N \mid \theta)\, p(\theta)}{p(x_1, \ldots, x_N)} \propto \theta^{n_h}(1-\theta)^{n_t} \times \theta^{\alpha-1}(1-\theta)^{\beta-1} = \theta^{n_h+\alpha-1}(1-\theta)^{n_t+\beta-1}$
Notice the isomorphism of the posterior to the prior: such a prior is called a conjugate prior. α and β are hyperparameters (parameters of the prior) and correspond to the number of "virtual" heads/tails (pseudo counts).
© Eric Xing @ CMU, 2014 64
Bayesian estimation for Bernoulli, cont'd
Posterior distribution of θ:
$P(\theta \mid x_1, \ldots, x_N) = \frac{p(x_1, \ldots, x_N \mid \theta)\, p(\theta)}{p(x_1, \ldots, x_N)} \propto \theta^{n_h+\alpha-1}(1-\theta)^{n_t+\beta-1}$
Maximum a posteriori (MAP) estimation:
$\theta_{MAP} = \arg\max_\theta \log P(\theta \mid x_1, \ldots, x_N)$
Posterior mean estimation:
$\theta_{Bayes} = \int \theta\, p(\theta \mid D)\, d\theta = C \int \theta \cdot \theta^{n_h+\alpha-1}(1-\theta)^{n_t+\beta-1}\, d\theta = \frac{n_h + \alpha}{N + \alpha + \beta}$
Prior strength: A = α + β. A can be interpreted as the size of an imaginary data set from which we obtain the pseudo-counts.
Beta parameters can be understood as pseudo-counts.
© Eric Xing @ CMU, 2014 65
Effect of Prior Strength
Suppose we have a uniform prior (α/A = β/A = 1/2), and we observe $n = (n_h = 2,\; n_t = 8)$.
Weak prior, A = 2. Posterior prediction: $p(x = \text{head} \mid n_h = 2, n_t = 8) = \frac{2 + 1}{10 + 2} = 0.25$
Strong prior, A = 20. Posterior prediction: $p(x = \text{head} \mid n_h = 2, n_t = 8) = \frac{2 + 10}{10 + 20} = 0.40$
However, if we have enough data, it washes away the prior. E.g., $n = (n_h = 200,\; n_t = 800)$: the estimates under the weak and strong prior are $\frac{200 + 1}{1000 + 2}$ and $\frac{200 + 10}{1000 + 20}$, respectively, both of which are close to 0.2.
© Eric Xing @ CMU, 2014 66
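The numbers on this slide follow directly from the Beta-Bernoulli posterior mean (n_h + α)/(N + α + β); a minimal check:

def posterior_head_prob(n_h, n_t, prior_strength):
    """Posterior predictive P(x = head) with a uniform Beta prior of total strength A."""
    alpha = beta = prior_strength / 2            # uniform prior: alpha/A = beta/A = 1/2
    return (n_h + alpha) / (n_h + n_t + alpha + beta)

print(posterior_head_prob(2, 8, 2))       # weak prior  (A = 2):  0.25
print(posterior_head_prob(2, 8, 20))      # strong prior (A = 20): 0.40
print(posterior_head_prob(200, 800, 2))   # enough data: ~0.2
print(posterior_head_prob(200, 800, 20))  # enough data: ~0.206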
Example 2: Gaussian density Data:
We observed N iid real samples: D={-0.1, 10, 1, -5.2, …, 3}
Model:
$P(x) = \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$
Log likelihood:
$\ell(\theta; D) = \log P(D \mid \mu, \sigma^2) = -\frac{N}{2}\log\left(2\pi\sigma^2\right) - \frac{1}{2\sigma^2}\sum_{n=1}^{N}\left(x_n - \mu\right)^2$
MLE: take derivatives and set to zero:
$\frac{\partial \ell}{\partial \mu} = \frac{1}{\sigma^2}\sum_n \left(x_n - \mu\right) = 0, \qquad \frac{\partial \ell}{\partial \sigma^2} = -\frac{N}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_n \left(x_n - \mu\right)^2 = 0$
$\Rightarrow\quad \mu_{MLE} = \frac{1}{N}\sum_n x_n, \qquad \sigma^2_{MLE} = \frac{1}{N}\sum_n \left(x_n - \mu_{ML}\right)^2$
© Eric Xing @ CMU, 2014 67
MLE for a multivariate Gaussian
It can be shown that the MLE for µ and Σ is
$\mu_{MLE} = \frac{1}{N}\sum_n x_n, \qquad \Sigma_{MLE} = \frac{1}{N}\sum_n \left(x_n - \mu_{ML}\right)\left(x_n - \mu_{ML}\right)^T = \frac{1}{N} S$
where the scatter matrix is
$S = \sum_n \left(x_n - \mu_{ML}\right)\left(x_n - \mu_{ML}\right)^T = \left(\sum_n x_n x_n^T\right) - N \mu_{ML}\mu_{ML}^T$
The sufficient statistics are $\sum_n x_n$ and $\sum_n x_n x_n^T$.
Note that $X^T X = \sum_n x_n x_n^T$ may not be full rank (e.g., if N < D), in which case $\Sigma_{ML}$ is not invertible.
Here $x_n = \left[x_n^1, \ldots, x_n^K\right]^T$ and $X = \begin{bmatrix} x_1^T \\ \vdots \\ x_N^T \end{bmatrix}$.
© Eric Xing @ CMU, 2014 68
Bayesian estimation
Normal prior:
$P(\mu) = \frac{1}{(2\pi\sigma_0^2)^{1/2}} \exp\left(-\frac{(\mu - \mu_0)^2}{2\sigma_0^2}\right)$
Joint probability:
$P(x, \mu) = \frac{1}{(2\pi\sigma^2)^{N/2}} \exp\left(-\frac{1}{2\sigma^2}\sum_{n=1}^{N}\left(x_n - \mu\right)^2\right) \times \frac{1}{(2\pi\sigma_0^2)^{1/2}} \exp\left(-\frac{(\mu - \mu_0)^2}{2\sigma_0^2}\right)$
Posterior:
$P(\mu \mid x) = \frac{1}{(2\pi\tilde\sigma^2)^{1/2}} \exp\left(-\frac{(\mu - \tilde\mu)^2}{2\tilde\sigma^2}\right)$
where
$\tilde\mu = \frac{N/\sigma^2}{N/\sigma^2 + 1/\sigma_0^2}\,\bar{x} + \frac{1/\sigma_0^2}{N/\sigma^2 + 1/\sigma_0^2}\,\mu_0, \qquad \tilde\sigma^2 = \left(\frac{N}{\sigma^2} + \frac{1}{\sigma_0^2}\right)^{-1}$
and $\bar{x} = \frac{1}{N}\sum_n x_n$ is the sample mean.
© Eric Xing @ CMU, 2014 69
Bayesian estimation: unknown µ, known σ
The posterior mean is a convex combination of the prior mean and the MLE, with weights proportional to the relative noise levels:
$\mu_N = \frac{N/\sigma^2}{N/\sigma^2 + 1/\sigma_0^2}\,\bar{x} + \frac{1/\sigma_0^2}{N/\sigma^2 + 1/\sigma_0^2}\,\mu_0, \qquad \sigma_N^2 = \left(\frac{N}{\sigma^2} + \frac{1}{\sigma_0^2}\right)^{-1}$
The precision of the posterior, $1/\sigma_N^2$, is the precision of the prior, $1/\sigma_0^2$, plus one contribution of the data precision, $1/\sigma^2$, for each observed data point.
Effect of a single data point (N = 1):
$\mu_1 = \mu_0 + \frac{\sigma_0^2}{\sigma_0^2 + \sigma^2}\left(x - \mu_0\right)$
Uninformative (vague/flat) prior, $\sigma_0^2 \to \infty$: the posterior mean approaches the sample mean.
[Figure: sequentially updating the mean with µ* = 0.8 (unknown) and (σ²)* = 0.1 (known)]
© Eric Xing @ CMU, 2014 70
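A small sketch of the posterior update for the mean of a Gaussian with known variance, using the formulas above; the true mean and variance follow the slide's µ* = 0.8 and (σ²)* = 0.1, while the prior is an assumption:

import numpy as np

rng = np.random.default_rng(8)
mu_true, sigma2 = 0.8, 0.1          # unknown mean, known variance (as on the slide)
mu0, sigma2_0 = 0.0, 1.0            # prior over mu (assumed)

x = rng.normal(mu_true, np.sqrt(sigma2), size=50)

# Posterior after N observations: precision-weighted combination of prior and sample mean
N = len(x)
xbar = x.mean()
precision = N / sigma2 + 1.0 / sigma2_0
mu_N = (N / sigma2 * xbar + mu0 / sigma2_0) / precision
sigma2_N = 1.0 / precision

print(mu_N, sigma2_N)               # posterior mean close to 0.8, variance roughly sigma2/N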