TRANSCRIPT
Machine Learning
10-701, Fall 2016
Introduction to ML and
Density Estimation
Eric Xing, Lecture 1, September 7, 2016
Reading: Mitchell, Chapters 1 & 3
Class Registration
IF YOU ARE ON THE WAITING LIST: This class is now fully subscribed. You may want to consider the following options:
Take the class when it is offered again in the Spring semester;
Come to the first several lectures and see how the course develops. We will admit as many students from the waitlist as we can, once we see how many registered students drop the course during the first two weeks.
Machine Learning 10-701 Class webpage:
http://www.cs.cmu.edu/~mgormley/courses/10701-f16/
The instructors
Brynn Edmunds
Previous research: medical physics, with specific interest in radiotherapy and radiation oncology; examination of DVH parameters for prostate treatments; comparing clinicians with different training to look for treatment variability.
Currently: ML Assistant Instructor
Devendra Chaplot
Office Hour: Friday 11:00am-12:00pm
Location: GHC 5412
Interests: concept graph learning, computational models of human learning, reinforcement learning
Siddharth Goyal
Office Hour: Tuesday 4:00pm-5:00pm
Location: GHC 5th floor common area
Interests: Bayesian optimization, reinforcement learning
Hemank Lamba
Office Hours: Tuesday, 11:00am to noon
Location: TBD
Research: graph mining, data mining, anomaly detection, social good applications
Hyun Ah Song
Office Hour: Friday 1pm-2pm
Office: GHC 8003
Interests: time series analysis
Petar Stojanov
Office Hours: Wednesday, 4:30pm to 5:30pm (starting next week)
Location: TBD
Research: transfer learning, domain adaptation, multitask learning
Logistics
Textbooks:
Chris Bishop, Pattern Recognition and Machine Learning (required)
Kevin Murphy, Machine Learning: A Probabilistic Perspective
Tom Mitchell, Machine Learning
David MacKay, Information Theory, Inference, and Learning Algorithms
Mailing lists:
To contact the instructors: [email protected]
Class announcements list: [email protected]
Piazza …
Logistics
5 homework assignments: 35% of grade
Theory exercises and implementation exercises
Final project: 35% of grade
Applying machine learning to your research area: NLP, IR, vision, robotics, computational biology, …
Outcomes that offer real utility and value, e.g., search over all wine bottle labels, an iPhone app for landmark recognition
Theoretical and/or algorithmic work, e.g., a more efficient approximate inference algorithm, a new sampling scheme for a non-trivial model, …
3-member teams to be formed in the first two weeks; proposal, mid-way report, poster & demo, final report.
One midterm: 30% of grade
Theory exercises and/or analysis. Dates are already set (no "ticket already booked", "I am at a conference", etc. excuses …)
Policies …
Apoptosis + Medicine
What is Learning
Grammatical rules, manufacturing procedures, natural laws, …
Inference: what does this mean? Any similar articles? …
Learning is about seeking a predictive and/or executable understanding of natural/artificial subjects, phenomena, or activities from …
Machine Learning (ML)
A short definition
Study of algorithms and systems that
improve their performance P
at some task T
with experience E
A well-defined learning task: <P, T, E>
Elements of Modern ML
ML methodologies, system paradigms, & hardware infrastructure:
New mathematical tools
New theory and algorithms
Moore's Law
New system architectures: BSP, MapReduce, Parameter Server and SSP
Where is Machine Learning being used, or where can it be useful?
Speech recognition
Information retrieval
Computer vision
Robotic control
Planning
Games
Evolution
Pedigree
Amazing Breakthroughs …
Paradigms of Machine Learning
Supervised Learning:
Given $D = \{X_i, Y_i\}$, learn $f(\cdot): X \to Y$, s.t. $Y_j = f(X_j)$ for new $X_j \in D_{\mathrm{new}}$
Unsupervised Learning:
Given $D = \{X_i\}$, learn $f(\cdot): X \to Y$, s.t. $Y_j = f(X_j)$ for new $X_j \in D_{\mathrm{new}}$
Semi-supervised Learning
Reinforcement Learning:
Given $D = \{\text{env}, \text{actions}, \text{rewards}, \text{simulator/trace/real game}\}$, learn $\text{policy}: e \to a$ and $\text{utility}: a \to r$, s.t. the policy $(a_1, a_2, a_3, \ldots)$ applies in a new env / real game
Active Learning:
Given $D \sim G(\cdot)$, learn the $Y_j$ of selected $D_j$ and $f(\cdot)$, s.t. $f(\cdot)$ holds for new $D \sim G'(\cdot)$
Transfer Learning
Deep xxx …
Machine Learning - Theory
For the learned $F(\cdot\,; \theta)$:
Consistency (value, pattern, …), bias versus variance, sample complexity, learning rate, convergence, error bound, confidence, stability, …
PAC Learning Theory (supervised concept learning):
# examples ($m$), representational complexity ($H$), error rate ($\epsilon$), failure probability ($\delta$)
Why machine learning?
1B+ users, 30+ petabytes
645 million users, 500 million tweets per day
100+ hours of video uploaded every minute
32 million pages
Growth of Machine Learning
Machine learning is already the preferred approach to:
Speech recognition, natural language processing
Computer vision
Medical outcomes analysis
Robot control
…
This ML niche is growing (why?)
(Figure: ML apps as a growing subset of all software apps.)
Growth of Machine Learning
Machine learning is already the preferred approach to:
Speech recognition, natural language processing
Computer vision
Medical outcomes analysis
Robot control
…
This ML niche is growing:
Improved machine learning algorithms
Increased data capture, networking
Software too complex to write by hand
New sensors / IO devices
Demand for self-customization to user, environment
(Figure: ML apps as a growing subset of all software apps.)
Summary: What is Machine Learning
Machine Learning seeks to develop theories and computer systems for
representing; classifying, clustering, recognizing, organizing; reasoning under uncertainty; predicting; and reacting to
… complex, real-world data, based on the system's own experience with data, and (hopefully) under a unified model or mathematical framework that
can be formally characterized and analyzed,
can take into account human prior knowledge,
can generalize and adapt across data and domains,
can operate automatically and autonomously,
and can be interpreted and perceived by humans.
Inference, Prediction, Decision-Making under uncertainty, …
Statistical Machine Learning: Function Approximation $F(\cdot \mid \theta)$? Density Estimation
Classification (example: sickle-cell anemia)
Function Approximation
Setting:
Set of possible instances $X$
Unknown target function $f: X \to Y$
Set of function hypotheses $H = \{ h \mid h: X \to Y \}$
Given: training examples $\{\langle x_i, y_i \rangle\}$ of unknown target function $f$
Determine: hypothesis $h \in H$ that best approximates $f$
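As a minimal sketch (hypothetical data and hypothesis class, not from the slides), "determine the hypothesis $h \in H$ that best approximates $f$" can be read, for a finite $H$, as picking the hypothesis with the lowest training error:

```python
# Hypothetical finite hypothesis class H: threshold classifiers h_t(x) = 1 if x >= t else 0.
examples = [(0.2, 0), (0.5, 0), (1.1, 1), (1.7, 1), (2.3, 1)]   # <x_i, y_i> pairs
H = [lambda x, t=t: int(x >= t) for t in (0.0, 0.5, 1.0, 1.5, 2.0)]

def training_error(h, examples):
    """Fraction of training examples the hypothesis gets wrong."""
    return sum(h(x) != y for x, y in examples) / len(examples)

best_h = min(H, key=lambda h: training_error(h, examples))
print(training_error(best_h, examples))  # 0.0 for the threshold t = 1.0
```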
Density Estimation
A Density Estimator learns a mapping from a set of attributes to a probability.
Basic Probability Concepts
A sample space $S$ is the set of all possible outcomes of a conceptual or physical, repeatable experiment. ($S$ can be finite or infinite.)
E.g., $S$ may be the set of all possible outcomes of a dice roll: $S = \{1, 2, 3, 4, 5, 6\}$
E.g., $S$ may be the set of all possible nucleotides of a DNA site: $S = \{A, C, G, T\}$
E.g., $S$ may be the set of all possible time-space positions of an aircraft on a radar screen: $S = \{0, \ldots, R_{\max}\} \times \{0^\circ, \ldots, 360^\circ\}$
Random Variable
A random variable is a function that associates a unique numerical value (a token) $X(\omega)$ with every outcome $\omega \in S$ of an experiment. (The value of the r.v. will vary from trial to trial as the experiment is repeated.)
Discrete r.v.:
The outcome of a dice roll
The outcome of reading a nucleotide at site $i$: $X_i$
Binary event and indicator variable: seeing an "A" at a site $\Rightarrow X = 1$, o/w $X = 0$. This describes the true or false outcome of a random event.
Can we describe richer outcomes in the same way? (i.e., $X = 1, 2, 3, 4$ for being A, C, G, T) --- think about what would happen if we take the expectation of $X$.
Unit-base random vector: $X_i = [X_i^A, X_i^T, X_i^G, X_i^C]'$, e.g., $X_i = [0, 0, 1, 0]'$ for seeing a "G" at site $i$
Continuous r.v.:
The outcome of recording the true location of an aircraft: $X_{\mathrm{true}}$
The outcome of observing the measured location of an aircraft: $X_{\mathrm{obs}}$
Random Variable
(Figure: the mapping $X(\omega)$ from the sample space $S$ to numerical values.)
Notational convention:
Univariate
Multivariate (random vector)
Discrete Prob. Distribution
(In the discrete case), a probability distribution $P$ on $S$ (and hence on the domain of $X$) is an assignment of a non-negative real number $P(s)$ to each $s \in S$ (or each valid value of $x$) such that $\sum_{s \in S} P(s) = 1$ ($0 \le P(s) \le 1$).
Intuitively, $P(s)$ corresponds to the frequency (or the likelihood) of getting $s$ in the experiments, if repeated many times.
Call $\theta_s = P(s)$ the parameters in a discrete probability distribution.
A probability distribution on a sample space is sometimes called a probability model, in particular if several different distributions are under consideration.
Write models as $M_1, M_2$, probabilities as $P(X \mid M_1)$, $P(X \mid M_2)$.
E.g., $M_1$ may be the appropriate prob. dist. if $X$ is from a "fair dice", $M_2$ is for the "loaded dice". $M$ is usually a two-tuple of {dist. family, dist. parameters}.
Discrete Distributions
Bernoulli distribution: Ber($p$)
$P(x) = \begin{cases} 1-p & \text{for } x = 0 \\ p & \text{for } x = 1 \end{cases}$, i.e., $P(x) = p^x (1-p)^{1-x}$
Multinomial distribution: Mult($1, \theta$)
Multinomial (indicator) variable: $X = [X_1, \ldots, X_6]'$, where $X_j \in \{0, 1\}$, $\sum_{j \in [1,\ldots,6]} X_j = 1$, and $X_j = 1$ w.p. $\theta_j$, with $\theta_j \in [0, 1]$ and $\sum_{j \in [1,\ldots,6]} \theta_j = 1$.
$p(x(j)) = P(X_j = 1)$, where $j$ indexes the dice face
$p(x) = \theta_A^{x_A}\, \theta_C^{x_C}\, \theta_G^{x_G}\, \theta_T^{x_T} = \prod_k \theta_k^{x_k}$
Discrete Distributions
Multinomial distribution: Mult($n, \theta$)
Count variable: $x = [x_1, \ldots, x_K]^T$, where $\sum_j x_j = n$
$p(x) = \dfrac{n!}{x_1!\, x_2! \cdots x_K!}\; \theta_1^{x_1}\, \theta_2^{x_2} \cdots \theta_K^{x_K} = \dfrac{n!}{\prod_k x_k!}\, \prod_k \theta_k^{x_k}$
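To make the count-variable formula concrete, here is a minimal Python sketch (not from the slides, hypothetical counts) that evaluates the Mult($n, \theta$) pmf; the factorial coefficient and the product over $\theta_k^{x_k}$ mirror the expression above.

```python
import math

def multinomial_pmf(x, theta):
    """Mult(n, theta) pmf: n!/(x_1!...x_K!) * prod_k theta_k^{x_k}, with n = sum(x)."""
    n = sum(x)
    coef = math.factorial(n)
    for x_k in x:
        coef //= math.factorial(x_k)      # multinomial coefficient n!/(x_1!...x_K!)
    prob = float(coef)
    for x_k, theta_k in zip(x, theta):
        prob *= theta_k ** x_k            # prod_k theta_k^{x_k}
    return prob

# Hypothetical example: counts from 10 rolls of a fair six-sided die.
print(multinomial_pmf([2, 1, 3, 0, 2, 2], [1.0 / 6] * 6))
```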
Density Estimation
A Density Estimator learns a mapping from a set of attributes to a probability.
Often known as parameter estimation if the distribution form is specified: binomial, Gaussian, …
Important issues:
Nature of the data (iid, correlated, …)
Objective function (MLE, MAP, …)
Algorithm (simple algebra, gradient methods, EM, …)
Evaluation scheme (likelihood on test data, predictability, consistency, …)
Density Estimation Schemes
(Schematic:) Data, e.g., $M$ samples $(x_1^{(1)}, \ldots, x_n^{(1)}),\ (x_1^{(2)}, \ldots, x_n^{(2)}),\ \ldots,\ (x_1^{(M)}, \ldots, x_n^{(M)})$, are used to learn parameters.
Score for the parameters (objective): maximum likelihood, Bayesian, conditional likelihood, margin, …
Algorithm: analytical, gradient, EM, sampling, …
Parameter Learning from iid Data
Goal: estimate distribution parameters $\theta$ from a dataset of $N$ independent, identically distributed (iid), fully observed training cases
$D = \{x_1, \ldots, x_N\}$
Maximum likelihood estimation (MLE):
1. One of the most common estimators
2. With the iid and full-observability assumption, write $L(\theta)$ as the likelihood of the data:
$L(\theta) = P(x_1, x_2, \ldots, x_N; \theta) = P(x_1; \theta)\, P(x_2; \theta) \cdots P(x_N; \theta) = \prod_{i=1}^{N} P(x_i; \theta)$
3. Pick the setting of parameters most likely to have generated the data we saw:
$\theta^* = \arg\max_\theta L(\theta) = \arg\max_\theta \log L(\theta)$
Example: Bernoulli model
Data: We observed $N$ iid coin tosses: $D = \{1, 0, 1, \ldots, 0\}$
Representation: binary r.v.: $x_n \in \{0, 1\}$
Model: $P(x) = \begin{cases} 1-\theta & \text{for } x = 0 \\ \theta & \text{for } x = 1 \end{cases}$, i.e., $P(x) = \theta^x (1-\theta)^{1-x}$
How to write the likelihood of a single observation $x_i$?
$P(x_i) = \theta^{x_i} (1-\theta)^{1-x_i}$
The likelihood of dataset $D = \{x_1, \ldots, x_N\}$:
$P(x_1, x_2, \ldots, x_N \mid \theta) = \prod_{i=1}^{N} P(x_i \mid \theta) = \prod_{i=1}^{N} \theta^{x_i} (1-\theta)^{1-x_i} = \theta^{\sum_i x_i}\, (1-\theta)^{N - \sum_i x_i} = \theta^{\#\text{heads}}\, (1-\theta)^{\#\text{tails}}$
Maximum Likelihood Estimation
Objective function:
$\ell(\theta; D) = \log P(D \mid \theta) = \log\!\left(\theta^{n_h} (1-\theta)^{n_t}\right) = n_h \log \theta + (N - n_h) \log(1-\theta)$
We need to maximize this w.r.t. $\theta$.
Take the derivative w.r.t. $\theta$ and set it to zero:
$\dfrac{\partial \ell}{\partial \theta} = \dfrac{n_h}{\theta} - \dfrac{N - n_h}{1 - \theta} = 0 \;\Rightarrow\; \hat{\theta}_{MLE} = \dfrac{n_h}{N}$, or equivalently $\hat{\theta}_{MLE} = \dfrac{1}{N} \sum_i x_i$
Frequency as sample mean.
Sufficient statistics: the counts, $n_h = \sum_i x_i$ and $n_t = N - n_h$, are sufficient statistics of the data $D$.
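As a minimal sketch on hypothetical data (not part of the slides), the closed-form Bernoulli MLE above can be checked numerically against a grid search over the log-likelihood:

```python
import numpy as np

x = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 1])  # hypothetical coin tosses
n_h = x.sum()
N = len(x)
theta_mle = n_h / N                           # closed form: n_h / N

# Sanity check: l(theta) = n_h*log(theta) + (N - n_h)*log(1 - theta)
# should peak at the same place as the closed form.
grid = np.linspace(1e-3, 1 - 1e-3, 999)
loglik = n_h * np.log(grid) + (N - n_h) * np.log(1 - grid)
assert abs(grid[np.argmax(loglik)] - theta_mle) < 1e-2
print(theta_mle)  # 0.6 for the data above
```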
Overfitting
Recall that for the Bernoulli distribution, we have
$\hat{\theta}^{\,head}_{ML} = \dfrac{n_{head}}{n_{head} + n_{tail}}$
What if we tossed too few times, so that we saw zero heads? We have $\hat{\theta}^{\,head}_{ML} = 0$, and we will predict that the probability of seeing a head next is zero!!!
The rescue: "smoothing"
$\hat{\theta}^{\,head}_{ML} = \dfrac{n_{head} + n'}{n_{head} + n_{tail} + n'}$
where $n'$ is known as the pseudo- (imaginary) count.
But can we make this more formal?
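A quick sketch of the zero-count problem and the smoothing rule as written on the slide (counts are hypothetical):

```python
def theta_ml(n_head, n_tail, pseudo=0.0):
    """Bernoulli frequency estimate; 'pseudo' plays the role of the pseudo-count n'."""
    return (n_head + pseudo) / (n_head + n_tail + pseudo)

print(theta_ml(0, 3))              # 0.0  -> predicts that heads are impossible
print(theta_ml(0, 3, pseudo=1.0))  # 0.25 -> smoothing keeps the estimate away from 0
```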
Bayesian Parameter Estimation
Treat the distribution parameters $\theta$ also as a random variable.
The a posteriori distribution of $\theta$ after seeing the data is:
$p(\theta \mid D) = \dfrac{p(D \mid \theta)\, p(\theta)}{p(D)} = \dfrac{p(D \mid \theta)\, p(\theta)}{\int p(D \mid \theta)\, p(\theta)\, d\theta}$
This is Bayes' rule: posterior = likelihood × prior / marginal likelihood.
The prior $p(\cdot)$ encodes our prior knowledge about the domain.
Frequentist Parameter Estimation
Two people with different priors $p(\theta)$ will end up with different estimates $p(\theta \mid D)$. Frequentists dislike this "subjectivity".
Frequentists think of the parameter as a fixed, unknown constant, not a random variable.
Hence they have to come up with different "objective" estimators (ways of computing $\hat{\theta}$ from data), instead of using Bayes' rule.
These estimators have different properties, such as being "unbiased", "minimum variance", etc.
The maximum likelihood estimator is one such estimator.
Discussion
$\hat{\theta}$ or $p(\theta)$, this is the problem!
Bayesians know it …
Bayesian estimation for Bernoulli
Beta distribution:
$P(\theta; \alpha, \beta) = \dfrac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\,\Gamma(\beta)}\; \theta^{\alpha-1} (1-\theta)^{\beta-1} = B(\alpha, \beta)\; \theta^{\alpha-1} (1-\theta)^{\beta-1}$
When $x$ is discrete, $\Gamma(x + 1) = x\,\Gamma(x) = x!$
Posterior distribution of $\theta$:
$P(\theta \mid x_1, \ldots, x_N) = \dfrac{p(x_1, \ldots, x_N \mid \theta)\, p(\theta)}{p(x_1, \ldots, x_N)} \propto \theta^{n_h} (1-\theta)^{n_t} \times \theta^{\alpha-1} (1-\theta)^{\beta-1} = \theta^{n_h + \alpha - 1} (1-\theta)^{n_t + \beta - 1}$
Notice the isomorphism of the posterior to the prior; such a prior is called a conjugate prior.
$\alpha$ and $\beta$ are hyperparameters (parameters of the prior) and correspond to the number of "virtual" heads/tails (pseudo counts).
Bayesian estimation for Bernoulli, cont'd
Posterior distribution of $\theta$:
$P(\theta \mid x_1, \ldots, x_N) = \dfrac{p(x_1, \ldots, x_N \mid \theta)\, p(\theta)}{p(x_1, \ldots, x_N)} \propto \theta^{n_h + \alpha - 1} (1-\theta)^{n_t + \beta - 1}$
Maximum a posteriori (MAP) estimation:
$\hat{\theta}_{MAP} = \arg\max_\theta \log P(\theta \mid x_1, \ldots, x_N)$
Posterior mean estimation:
$\hat{\theta}_{Bayes} = \int \theta\, p(\theta \mid D)\, d\theta = C \int \theta \cdot \theta^{n_h + \alpha - 1} (1-\theta)^{n_t + \beta - 1}\, d\theta = \dfrac{n_h + \alpha}{N + \alpha + \beta}$
Prior strength: $A = \alpha + \beta$. $A$ can be interpreted as the size of an imaginary data set from which we obtain the pseudo-counts.
Beta parameters can be understood as pseudo-counts.
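A minimal sketch of the conjugate update above, using hypothetical counts. The posterior-mean formula $(n_h + \alpha)/(N + \alpha + \beta)$ is from the slide; the MAP value $(n_h + \alpha - 1)/(N + \alpha + \beta - 2)$ follows from maximizing the Beta posterior when both posterior parameters exceed 1.

```python
def beta_bernoulli_posterior(n_h, n_t, alpha, beta):
    """Conjugate update: Beta(alpha, beta) prior + (n_h, n_t) counts -> Beta posterior."""
    a_post, b_post = alpha + n_h, beta + n_t
    post_mean = a_post / (a_post + b_post)             # (n_h + alpha) / (N + alpha + beta)
    theta_map = (a_post - 1) / (a_post + b_post - 2)    # mode of the Beta posterior (a, b > 1)
    return a_post, b_post, post_mean, theta_map

# Hypothetical example: 2 heads, 8 tails with a Beta(1, 1) prior.
print(beta_bernoulli_posterior(2, 8, 1.0, 1.0))  # (3.0, 9.0, 0.25, 0.2)
```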
Effect of Prior Strength
Suppose we have a uniform prior ($\alpha = \beta = 1/2$), and we observe $n = (n_h = 2, n_t = 8)$.
Weak prior, $A = 2$. Posterior prediction: $p(x = h \mid n_h = 2, n_t = 8, \alpha', \beta') = \dfrac{2 + 2 \times 0.5}{10 + 2} = 0.25$
Strong prior, $A = 20$. Posterior prediction: $p(x = h \mid n_h = 2, n_t = 8, \alpha', \beta') = \dfrac{2 + 20 \times 0.5}{10 + 20} = 0.40$
However, if we have enough data, it washes away the prior. E.g., $n = (n_h = 200, n_t = 800)$. Then the estimates under the weak and strong priors are $\dfrac{200 + 1}{1000 + 2}$ and $\dfrac{200 + 10}{1000 + 20}$, respectively, both of which are close to 0.2.
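These numbers can be reproduced directly from the posterior-predictive rule used above (pseudo-counts $A \cdot \alpha$ for heads and $A \cdot \beta$ for tails); a quick sketch:

```python
def posterior_predictive_heads(n_h, n_t, A, prior_mean=0.5):
    """p(next toss = heads) with prior strength A and prior mean for heads."""
    return (n_h + A * prior_mean) / (n_h + n_t + A)

print(posterior_predictive_heads(2, 8, A=2))       # 0.25
print(posterior_predictive_heads(2, 8, A=20))      # 0.40
print(posterior_predictive_heads(200, 800, A=2))   # ~0.2006
print(posterior_predictive_heads(200, 800, A=20))  # ~0.2059
```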
Continuous Prob. Distribution
A continuous random variable $X$ can assume any value in an interval on the real line or in a region in a high-dimensional space.
A random vector $X = [x_1, x_2, \ldots, x_n]^T$ usually corresponds to real-valued measurements of some property, e.g., length, position, …
It is not possible to talk about the probability of the random variable assuming a particular value --- $P(x) = 0$.
Instead, we talk about the probability of the random variable assuming a value within a given interval or half-interval,
$P(X \in [a, b])$, $P(X < x)$ or $P(X \ge x)$,
and arbitrary Boolean combinations of such basic propositions.
Continuous Prob. Distribution
The probability of the random variable assuming a value within some given interval from $a$ to $b$ is defined to be the area under the graph of the probability density function between $a$ and $b$.
Probability mass: $P(X \in [a, b]) = \int_a^b p(x)\, dx$; note that $\int_{-\infty}^{\infty} p(x)\, dx = 1$ and $p(x) \ge 0,\ \forall x$.
Cumulative distribution function (CDF): $P(x) = P(X \le x) = \int_{-\infty}^{x} p(x')\, dx'$
Probability density function (PDF): $p(x) = \dfrac{d}{dx} P(x)$
(Figure: car flow on Liberty Bridge, cooked up!)
Continuous Distributions
Uniform probability density function:
$p(x) = \begin{cases} 1/(b-a) & \text{for } a \le x \le b \\ 0 & \text{elsewhere} \end{cases}$
Normal (Gaussian) probability density function:
$p(x) = \dfrac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(x-\mu)^2 / 2\sigma^2}$
The distribution is symmetric, and is often illustrated as a bell-shaped curve. Two parameters, $\mu$ (mean) and $\sigma$ (standard deviation), determine the location and shape of the distribution. The highest point on the normal curve is at the mean, which is also the median and mode. The mean can be any numerical value: negative, zero, or positive.
Multivariate Gaussian:
$p(X; \mu, \Sigma) = \dfrac{1}{(2\pi)^{n/2}\, |\Sigma|^{1/2}} \exp\!\left(-\tfrac{1}{2}(X - \mu)^T \Sigma^{-1} (X - \mu)\right)$
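A minimal numpy sketch (helper names and inputs are hypothetical) that evaluates both densities above; the multivariate version mirrors the normalizer $(2\pi)^{n/2}|\Sigma|^{1/2}$ and the quadratic form in the exponent.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Univariate N(mu, sigma^2) density."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

def mvn_pdf(x, mu, Sigma):
    """Multivariate N(mu, Sigma) density at an n-dimensional point x."""
    n = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)              # (x-mu)^T Sigma^{-1} (x-mu)
    norm = (2 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm

print(gaussian_pdf(0.0, mu=0.0, sigma=1.0))          # ~0.3989
print(mvn_pdf(np.zeros(2), np.zeros(2), np.eye(2)))  # ~0.1592
```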
Example 2: Gaussian density
Data: We observed $N$ iid real samples: $D = \{-0.1, 10, 1, -5.2, \ldots, 3\}$
Model: $P(x) = (2\pi\sigma^2)^{-1/2} \exp\!\left(-\dfrac{(x-\mu)^2}{2\sigma^2}\right)$
Log likelihood:
$\ell(\theta; D) = \log P(D \mid \mu, \sigma) = -\dfrac{N}{2}\log(2\pi\sigma^2) - \dfrac{1}{2\sigma^2}\sum_{n=1}^{N}(x_n - \mu)^2$
MLE: take the derivatives and set them to zero:
$\dfrac{\partial \ell}{\partial \mu} = \dfrac{1}{\sigma^2}\sum_n (x_n - \mu) = 0$
$\dfrac{\partial \ell}{\partial \sigma^2} = -\dfrac{N}{2\sigma^2} + \dfrac{1}{2\sigma^4}\sum_n (x_n - \mu)^2 = 0$
$\Rightarrow\ \hat{\mu}_{MLE} = \dfrac{1}{N}\sum_n x_n, \qquad \hat{\sigma}^2_{MLE} = \dfrac{1}{N}\sum_n (x_n - \hat{\mu}_{ML})^2$
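As a minimal sketch with the (truncated, hypothetical) sample values from the slide, the closed-form MLEs above are just the sample mean and the biased sample variance:

```python
import numpy as np

x = np.array([-0.1, 10.0, 1.0, -5.2, 3.0])   # hypothetical iid real samples
mu_mle = x.mean()                             # (1/N) sum_n x_n
sigma2_mle = ((x - mu_mle) ** 2).mean()       # (1/N) sum_n (x_n - mu_ML)^2
print(mu_mle, sigma2_mle)
```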
MLE for a multivariate Gaussian
It can be shown that the MLE for $\mu$ and $\Sigma$ is
$\hat{\mu}_{MLE} = \dfrac{1}{N}\sum_n x_n$
$\hat{\Sigma}_{MLE} = \dfrac{1}{N}\sum_n (x_n - \hat{\mu}_{ML})(x_n - \hat{\mu}_{ML})^T = \dfrac{1}{N}\, S$
where the scatter matrix is
$S = \sum_n (x_n - \hat{\mu}_{ML})(x_n - \hat{\mu}_{ML})^T = \left(\sum_n x_n x_n^T\right) - N\, \hat{\mu}_{ML}\hat{\mu}_{ML}^T$
Here $x_n = [x_{n1}, \ldots, x_{nK}]^T$ and $X = [x_1^T; x_2^T; \ldots; x_N^T]$ is the data matrix.
The sufficient statistics are $\sum_n x_n$ and $\sum_n x_n x_n^T$.
Note that $X^T X = \sum_n x_n x_n^T$ may not be full rank (e.g., if $N < D$), in which case $\Sigma_{ML}$ is not invertible.
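A minimal numpy sketch (hypothetical helper and data) of the multivariate MLE via the scatter matrix; with $N < D$ the returned covariance would be singular, as noted above.

```python
import numpy as np

def mvn_mle(X):
    """MLE for a multivariate Gaussian from an (N, D) data matrix X."""
    N = X.shape[0]
    mu = X.mean(axis=0)                  # (1/N) sum_n x_n
    centered = X - mu
    S = centered.T @ centered            # scatter matrix sum_n (x_n - mu)(x_n - mu)^T
    return mu, S / N                     # Sigma_MLE = S / N

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))            # hypothetical data, N = 100, D = 3
mu_hat, Sigma_hat = mvn_mle(X)
print(mu_hat, np.linalg.matrix_rank(Sigma_hat))
```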
Bayesian estimation
Normal prior:
$P(\mu) = (2\pi\sigma_0^2)^{-1/2} \exp\!\left(-\dfrac{(\mu - \mu_0)^2}{2\sigma_0^2}\right)$
Joint probability:
$P(x, \mu) = (2\pi\sigma^2)^{-N/2} \exp\!\left(-\dfrac{1}{2\sigma^2}\sum_{n=1}^{N}(x_n - \mu)^2\right) \times (2\pi\sigma_0^2)^{-1/2} \exp\!\left(-\dfrac{(\mu - \mu_0)^2}{2\sigma_0^2}\right)$
Posterior:
$P(\mu \mid x) = (2\pi\tilde{\sigma}^2)^{-1/2} \exp\!\left(-\dfrac{(\mu - \tilde{\mu})^2}{2\tilde{\sigma}^2}\right)$
where $\tilde{\mu} = \dfrac{N/\sigma^2}{N/\sigma^2 + 1/\sigma_0^2}\,\bar{x} + \dfrac{1/\sigma_0^2}{N/\sigma^2 + 1/\sigma_0^2}\,\mu_0$, $\quad \tilde{\sigma}^2 = \left(\dfrac{N}{\sigma^2} + \dfrac{1}{\sigma_0^2}\right)^{-1}$, and $\bar{x} = \dfrac{1}{N}\sum_n x_n$ is the sample mean.
Bayesian estimation: unknown µ, known σ
The posterior mean is a convex combination of the prior and the MLE, with weights proportional to the relative noise levels:
$\mu_N = \dfrac{N/\sigma^2}{N/\sigma^2 + 1/\sigma_0^2}\,\bar{x} + \dfrac{1/\sigma_0^2}{N/\sigma^2 + 1/\sigma_0^2}\,\mu_0, \qquad \dfrac{1}{\sigma_N^2} = \dfrac{N}{\sigma^2} + \dfrac{1}{\sigma_0^2}$
The precision of the posterior, $1/\sigma_N^2$, is the precision of the prior, $1/\sigma_0^2$, plus one contribution of data precision $1/\sigma^2$ for each observed data point.
Effect of a single data point ($N = 1$):
$\mu_1 = \mu_0 + (x - \mu_0)\,\dfrac{\sigma_0^2}{\sigma_0^2 + \sigma^2}$
Uninformative (vague/flat) prior, $\sigma_0^2 \to \infty$: $\mu_N \to \bar{x}$, the MLE.
(Figure: sequentially updating the mean; $\mu^* = 0.8$ (unknown), $(\sigma^2)^* = 0.1$ (known).)
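A minimal sketch of sequentially applying the update above; the "true" values $\mu^* = 0.8$ and $(\sigma^2)^* = 0.1$ are taken from the figure caption, everything else (prior, sample size) is hypothetical.

```python
import numpy as np

def posterior_mu(x, mu0, sigma0_sq, sigma_sq):
    """Posterior mean/variance of mu given data x, N(mu0, sigma0_sq) prior, known noise sigma_sq."""
    N = len(x)
    precision = N / sigma_sq + 1.0 / sigma0_sq         # 1/sigma_N^2
    mu_N = (x.sum() / sigma_sq + mu0 / sigma0_sq) / precision
    return mu_N, 1.0 / precision

rng = np.random.default_rng(0)
data = rng.normal(loc=0.8, scale=np.sqrt(0.1), size=20)   # mu* = 0.8, (sigma^2)* = 0.1
for N in (1, 2, 5, 20):
    print(N, posterior_mu(data[:N], mu0=0.0, sigma0_sq=1.0, sigma_sq=0.1))
```

As more points arrive, the posterior mean moves from the prior mean toward the sample mean and the posterior variance shrinks, which is exactly the convex-combination reading of the formula.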
Summary
Machine Learning is cool and useful!!
Learning scenarios: data, objective function, Frequentist and Bayesian
Density estimation: typical discrete distributions, typical continuous distributions (recitation), conjugate priors
Some suggestions …
How ML facilitates Applications (say, NLP)
… Question Answering
Machine Translation
Language Modeling, POS Tagging & Parsing
Named Entity Recognition
Sentiment Analysis
Topic Clustering
Linguistic Theory: Syntax, Semantics, …
CRF, RNN, SVM, LSA, HMM
One way …
… Question Answering
Machine Translation
Language Modeling, POS Tagging & Parsing
Named Entity Recognition
Sentiment Analysis
Topic Clustering: LDA
Maybe highway …?
Deep Learning!
… Question Answering
Machine Translation
Language Modeling, POS Tagging & Parsing
Named Entity Recognition
Sentiment Analysis
Topic Clustering
Solution = deep domain knowledge + sound methodology
Topic Clustering
Sentiment Analysis
POS Tagging
Named Entity Recognition
Parsing
Machine Translation
Question Answering
…
Topic models / latent space models
Structured input/output predictive models
Spectrum models
Deep network models
Distance metrics
Convex and non-convex optimization algorithms
Monte Carlo algorithms
Distributed ML systems
Consistency / identifiability / convergence theories
…