TRANSCRIPT
![Page 1: Crash Course on Machine Learning Part V Several slides from Derek Hoiem, Ben Taskar, and Andreas Krause](https://reader035.vdocuments.us/reader035/viewer/2022062519/5697bfd11a28abf838cab3cb/html5/thumbnails/1.jpg)
Crash Course on Machine Learning, Part V
Several slides from Derek Hoiem, Ben Taskar, and Andreas Krause
![Page 2: Crash Course on Machine Learning Part V Several slides from Derek Hoiem, Ben Taskar, and Andreas Krause](https://reader035.vdocuments.us/reader035/viewer/2022062519/5697bfd11a28abf838cab3cb/html5/thumbnails/2.jpg)
Structured Prediction
• Use local information
• Exploit correlations
(Figure: handwritten letter sequence “brace” to be labeled.)
![Page 3: Crash Course on Machine Learning Part V Several slides from Derek Hoiem, Ben Taskar, and Andreas Krause](https://reader035.vdocuments.us/reader035/viewer/2022062519/5697bfd11a28abf838cab3cb/html5/thumbnails/3.jpg)
Min-max Formulation
LP duality
![Page 4: Crash Course on Machine Learning Part V Several slides from Derek Hoiem, Ben Taskar, and Andreas Krause](https://reader035.vdocuments.us/reader035/viewer/2022062519/5697bfd11a28abf838cab3cb/html5/thumbnails/4.jpg)
Before
QP duality
Exponentially many constraints/variables
![Page 5: Crash Course on Machine Learning Part V Several slides from Derek Hoiem, Ben Taskar, and Andreas Krause](https://reader035.vdocuments.us/reader035/viewer/2022062519/5697bfd11a28abf838cab3cb/html5/thumbnails/5.jpg)
After
By QP duality
Dual inherits structure from the problem-specific inference LP. Variables correspond to a decomposition of the variables of the flat case.
![Page 6: Crash Course on Machine Learning Part V Several slides from Derek Hoiem, Ben Taskar, and Andreas Krause](https://reader035.vdocuments.us/reader035/viewer/2022062519/5697bfd11a28abf838cab3cb/html5/thumbnails/6.jpg)
The Connection
(Figure: candidate labelings of the handwriting example (“bcare”, “brore”, “broce”, “brace”) with node and edge scores.)
![Page 7: Crash Course on Machine Learning Part V Several slides from Derek Hoiem, Ben Taskar, and Andreas Krause](https://reader035.vdocuments.us/reader035/viewer/2022062519/5697bfd11a28abf838cab3cb/html5/thumbnails/7.jpg)
Duals and Kernels
• Kernel trick works in the factored dual
• Local functions (log-potentials) can use kernels
![Page 8: Crash Course on Machine Learning Part V Several slides from Derek Hoiem, Ben Taskar, and Andreas Krause](https://reader035.vdocuments.us/reader035/viewer/2022062519/5697bfd11a28abf838cab3cb/html5/thumbnails/8.jpg)
3D Mapping
Laser Range Finder
GPS
IMU
Data provided by: Michael Montemerlo & Sebastian Thrun
Labels: ground, building, tree, shrub
Training: 30 thousand points
Testing: 3 million points
![Page 9: Crash Course on Machine Learning Part V Several slides from Derek Hoiem, Ben Taskar, and Andreas Krause](https://reader035.vdocuments.us/reader035/viewer/2022062519/5697bfd11a28abf838cab3cb/html5/thumbnails/9.jpg)
![Page 10: Crash Course on Machine Learning Part V Several slides from Derek Hoiem, Ben Taskar, and Andreas Krause](https://reader035.vdocuments.us/reader035/viewer/2022062519/5697bfd11a28abf838cab3cb/html5/thumbnails/10.jpg)
![Page 11: Crash Course on Machine Learning Part V Several slides from Derek Hoiem, Ben Taskar, and Andreas Krause](https://reader035.vdocuments.us/reader035/viewer/2022062519/5697bfd11a28abf838cab3cb/html5/thumbnails/11.jpg)
![Page 12: Crash Course on Machine Learning Part V Several slides from Derek Hoiem, Ben Taskar, and Andreas Krause](https://reader035.vdocuments.us/reader035/viewer/2022062519/5697bfd11a28abf838cab3cb/html5/thumbnails/12.jpg)
![Page 13: Crash Course on Machine Learning Part V Several slides from Derek Hoiem, Ben Taskar, and Andreas Krause](https://reader035.vdocuments.us/reader035/viewer/2022062519/5697bfd11a28abf838cab3cb/html5/thumbnails/13.jpg)
Alternatives: Perceptron
• Simple iterative method
• Unstable for structured output: fewer instances, big updates
– May not converge if non-separable
– Noisy
• Voted / averaged perceptron [Freund & Schapire 99, Collins 02]
– Regularize / reduce variance by aggregating over iterations
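As a concrete reference point, here is a minimal averaged perceptron in the plain binary setting (the structured version replaces the sign prediction with an argmax over structures); the dataset and feature values below are made up for illustration.

```python
# Averaged perceptron sketch: train a plain perceptron but return the
# average of the weight vector over all iterations (variance reduction).
def averaged_perceptron(data, epochs=10):
    n = len(data[0][0])
    w = [0.0] * n
    w_sum = [0.0] * n
    count = 0
    for _ in range(epochs):
        for x, y in data:
            # mistake-driven update: only change w on a wrong prediction
            if y * sum(wi * xi for wi, xi in zip(w, x)) <= 0:
                w = [wi + y * xi for wi, xi in zip(w, x)]
            w_sum = [a + b for a, b in zip(w_sum, w)]
            count += 1
    return [a / count for a in w_sum]   # average over all iterations

# Tiny separable dataset: y = sign(x1 - x2), with a constant bias feature.
data = [((1.0, 2.0, 1.0), -1), ((2.0, 1.0, 1.0), 1),
        ((0.0, 3.0, 1.0), -1), ((3.0, 0.5, 1.0), 1)]
w = averaged_perceptron(data)
assert all(y * sum(wi * xi for wi, xi in zip(w, x)) > 0 for x, y in data)
```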
![Page 14: Crash Course on Machine Learning Part V Several slides from Derek Hoiem, Ben Taskar, and Andreas Krause](https://reader035.vdocuments.us/reader035/viewer/2022062519/5697bfd11a28abf838cab3cb/html5/thumbnails/14.jpg)
Alternatives: Constraint Generation
[Collins 02; Altun et al, 03]
• Add the most violated constraint
• Handles several more general loss functions
• Need to re-solve the QP many times
• Theorem: only a polynomial # of constraints is needed to achieve ε-error [Tsochantaridis et al, 04]
• Worst-case # of constraints larger than factored
![Page 15: Crash Course on Machine Learning Part V Several slides from Derek Hoiem, Ben Taskar, and Andreas Krause](https://reader035.vdocuments.us/reader035/viewer/2022062519/5697bfd11a28abf838cab3cb/html5/thumbnails/15.jpg)
Integration
• Feature Passing
• Margin Based
– Max-margin structure learning
• Probabilistic
– Graphical Models
![Page 16: Crash Course on Machine Learning Part V Several slides from Derek Hoiem, Ben Taskar, and Andreas Krause](https://reader035.vdocuments.us/reader035/viewer/2022062519/5697bfd11a28abf838cab3cb/html5/thumbnails/16.jpg)
Graphical Models
• Joint distribution
– Factoring using (conditional) independence among variables
• Representation
• Inference
• Learning
![Page 17: Crash Course on Machine Learning Part V Several slides from Derek Hoiem, Ben Taskar, and Andreas Krause](https://reader035.vdocuments.us/reader035/viewer/2022062519/5697bfd11a28abf838cab3cb/html5/thumbnails/17.jpg)
Big Picture
• Two problems with using full joint distribution tables as our probabilistic models:
– Unless there are only a few variables, the joint is WAY too big to represent explicitly
– Hard to learn (estimate) anything empirically about more than a few variables at a time
• Describe complex joint distributions (models) using simple, local distributions
– We describe how variables locally interact
– Local interactions chain together to give global, indirect interactions
![Page 18: Crash Course on Machine Learning Part V Several slides from Derek Hoiem, Ben Taskar, and Andreas Krause](https://reader035.vdocuments.us/reader035/viewer/2022062519/5697bfd11a28abf838cab3cb/html5/thumbnails/18.jpg)
Joint Distribution
• For n variables with domain size d:
– joint distribution table with d^n − 1 free parameters
• Size of representation if we use the chain rule
Concretely, counting the free parameters (probabilities sum to one):
(d−1) + d(d−1) + d²(d−1) + … + d^(n−1)(d−1) = ((d^n − 1)/(d − 1))·(d − 1) = d^n − 1
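The parameter count above can be checked numerically; a small sketch:

```python
# Chain rule: the k-th factor P(X_k | X_1..X_{k-1}) has d^(k-1) * (d-1)
# free parameters (one row per parent configuration, d-1 values per row).
def chain_rule_params(n, d):
    return sum(d**k * (d - 1) for k in range(n))

# The telescoping sum equals the full joint's free-parameter count d^n - 1.
for n in (1, 2, 3, 5, 10):
    for d in (2, 3, 4):
        assert chain_rule_params(n, d) == d**n - 1
```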
![Page 19: Crash Course on Machine Learning Part V Several slides from Derek Hoiem, Ben Taskar, and Andreas Krause](https://reader035.vdocuments.us/reader035/viewer/2022062519/5697bfd11a28abf838cab3cb/html5/thumbnails/19.jpg)
Conditional Independence
• Two variables X and Y are conditionally independent given Z when P(X, Y | Z) = P(X | Z) P(Y | Z)
• What about this domain?
– Traffic
– Umbrella
– Raining
![Page 20: Crash Course on Machine Learning Part V Several slides from Derek Hoiem, Ben Taskar, and Andreas Krause](https://reader035.vdocuments.us/reader035/viewer/2022062519/5697bfd11a28abf838cab3cb/html5/thumbnails/20.jpg)
Representation
Explicitly model uncertainty and dependency structure.
(Figure: the same model over variables a, b, c, d drawn three ways: Directed, Undirected, Factor graph.)
Key concept: Markov blanket
![Page 21: Crash Course on Machine Learning Part V Several slides from Derek Hoiem, Ben Taskar, and Andreas Krause](https://reader035.vdocuments.us/reader035/viewer/2022062519/5697bfd11a28abf838cab3cb/html5/thumbnails/21.jpg)
Bayes Net: Notation
• Nodes: variables
– Can be assigned (observed) or unassigned (unobserved)
• Arcs: interactions
– Indicate “direct influence” between variables
– Formally: encode conditional independence
(Figure: nodes Weather, Cavity, Toothache, Catch.)
![Page 22: Crash Course on Machine Learning Part V Several slides from Derek Hoiem, Ben Taskar, and Andreas Krause](https://reader035.vdocuments.us/reader035/viewer/2022062519/5697bfd11a28abf838cab3cb/html5/thumbnails/22.jpg)
Example: Flip Coins
• N independent coin flips
• No interactions between variables
– Absolute independence
(Nodes: X1, X2, …, Xn, with no edges.)
![Page 23: Crash Course on Machine Learning Part V Several slides from Derek Hoiem, Ben Taskar, and Andreas Krause](https://reader035.vdocuments.us/reader035/viewer/2022062519/5697bfd11a28abf838cab3cb/html5/thumbnails/23.jpg)
Example: Traffic
• Variables:
– Traffic
– Rain
• Model 1: absolute independence
• Model 2: rain causes traffic
• Which makes more sense?
(Model 2 graph: Rain → Traffic.)
![Page 24: Crash Course on Machine Learning Part V Several slides from Derek Hoiem, Ben Taskar, and Andreas Krause](https://reader035.vdocuments.us/reader035/viewer/2022062519/5697bfd11a28abf838cab3cb/html5/thumbnails/24.jpg)
Semantics
• A set of nodes, one per variable X
• A directed, acyclic graph
• A conditional distribution for each node
– A collection of distributions over X, one for each combination of parents’ values
– Conditional Probability Table (CPT)
(Figure: parents A1, A2, …, An pointing into node X.)
A Bayes net = Topology (graph) + Local Conditional Probabilities
![Page 25: Crash Course on Machine Learning Part V Several slides from Derek Hoiem, Ben Taskar, and Andreas Krause](https://reader035.vdocuments.us/reader035/viewer/2022062519/5697bfd11a28abf838cab3cb/html5/thumbnails/25.jpg)
Example: Alarm
• Variables:
– Alarm
– Burglary
– Earthquake
– Radio
– Calls John
(Graph: Earthquake → Radio; Earthquake, Burglary → Alarm; Alarm → Call.)
![Page 26: Crash Course on Machine Learning Part V Several slides from Derek Hoiem, Ben Taskar, and Andreas Krause](https://reader035.vdocuments.us/reader035/viewer/2022062519/5697bfd11a28abf838cab3cb/html5/thumbnails/26.jpg)
Example: Alarm
(Graph: Earthquake → Radio; Earthquake, Burglary → Alarm; Alarm → Call, with CPTs P(E), P(B), P(R|E), P(A|E,B), P(C|A).)
P(E,B,R,A,C) = P(E) P(B) P(R|E) P(A|B,E) P(C|A)
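A minimal sketch of this factorization in code, with made-up CPT numbers (the slide gives only the structure): since every factor is a proper conditional, the product sums to one.

```python
from itertools import product

# Hypothetical CPT numbers; 1 = True, 0 = False.
P_E = {1: 0.002, 0: 0.998}
P_B = {1: 0.001, 0: 0.999}
P_R = {e: {1: p, 0: 1 - p} for e, p in {1: 0.9, 0: 0.01}.items()}     # P(R|E)
P_A = {(e, b): {1: p, 0: 1 - p}
       for (e, b), p in {(1, 1): 0.95, (1, 0): 0.3,
                         (0, 1): 0.9,  (0, 0): 0.001}.items()}        # P(A|E,B)
P_C = {a: {1: p, 0: 1 - p} for a, p in {1: 0.9, 0: 0.05}.items()}     # P(C|A)

def joint(e, b, r, a, c):
    # the slide's factorization P(E)P(B)P(R|E)P(A|B,E)P(C|A)
    return P_E[e] * P_B[b] * P_R[e][r] * P_A[e, b][a] * P_C[a][c]

total = sum(joint(*v) for v in product([0, 1], repeat=5))
assert abs(total - 1.0) < 1e-9   # a proper joint distribution
```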
![Page 27: Crash Course on Machine Learning Part V Several slides from Derek Hoiem, Ben Taskar, and Andreas Krause](https://reader035.vdocuments.us/reader035/viewer/2022062519/5697bfd11a28abf838cab3cb/html5/thumbnails/27.jpg)
Bayes Net Size
• How big is a joint distribution over n Boolean variables? 2^n
• How big is the CPT of a node with k Boolean parents? 2^(k+1)
• How big is a BN with n nodes if nodes have up to k parents? n·2^(k+1)
• BNs:
– Compact representation
– Use local properties to define CPTs
– Answer queries more easily
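The three sizes can be written down directly:

```python
# Size comparison for Boolean variables, as on the slide.
def joint_size(n):      # full joint table over n Boolean variables
    return 2 ** n

def cpt_size(k):        # CPT of one node with k Boolean parents:
    return 2 ** (k + 1)  # 2^k parent rows times 2 values

def bn_size(n, k):      # n nodes, each with up to k parents
    return n * cpt_size(k)

assert joint_size(20) == 1_048_576
assert bn_size(20, 3) == 320   # 320 numbers instead of about a million
```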
![Page 28: Crash Course on Machine Learning Part V Several slides from Derek Hoiem, Ben Taskar, and Andreas Krause](https://reader035.vdocuments.us/reader035/viewer/2022062519/5697bfd11a28abf838cab3cb/html5/thumbnails/28.jpg)
Independence in BN
• BNs present a compact representation for joint distributions
– Take advantage of conditional independence
• Given a BN, let’s answer independence questions:
– Are two nodes independent given certain evidence?
– What can we say about X, Z? (Example: Low pressure, Rain, Traffic)
(Chain: X → Y → Z.)
![Page 29: Crash Course on Machine Learning Part V Several slides from Derek Hoiem, Ben Taskar, and Andreas Krause](https://reader035.vdocuments.us/reader035/viewer/2022062519/5697bfd11a28abf838cab3cb/html5/thumbnails/29.jpg)
Causal Chains
• Question: Is Z independent of X given Y?
(Chain: X → Y → Z)
• X: low pressure
• Y: Rain
• Z: Traffic
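A quick numerical check of the chain case, with made-up probabilities:

```python
from itertools import product

# Chain X -> Y -> Z with hypothetical numbers; 1 = True, 0 = False.
P_X = {1: 0.3, 0: 0.7}
P_Y = {x: {1: p, 0: 1 - p} for x, p in {1: 0.8, 0: 0.1}.items()}  # P(Y|X)
P_Z = {y: {1: p, 0: 1 - p} for y, p in {1: 0.9, 0: 0.2}.items()}  # P(Z|Y)

def joint(x, y, z):
    return P_X[x] * P_Y[x][y] * P_Z[y][z]

# Given Y, Z no longer depends on X: P(Z | X, Y) = P(Z | Y).
for x, y in product([0, 1], repeat=2):
    p_z_given_xy = joint(x, y, 1) / (joint(x, y, 0) + joint(x, y, 1))
    assert abs(p_z_given_xy - P_Z[y][1]) < 1e-12

# But marginally Z DOES depend on X (influence flows along the chain).
def marginal_z_given_x(x):
    num = sum(joint(x, y, 1) for y in (0, 1))
    den = sum(joint(x, y, z) for y in (0, 1) for z in (0, 1))
    return num / den

assert abs(marginal_z_given_x(1) - marginal_z_given_x(0)) > 0.1
```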
![Page 30: Crash Course on Machine Learning Part V Several slides from Derek Hoiem, Ben Taskar, and Andreas Krause](https://reader035.vdocuments.us/reader035/viewer/2022062519/5697bfd11a28abf838cab3cb/html5/thumbnails/30.jpg)
Common Cause
• Are X, Z independent?
• Are X, Z independent given Y?
• Observing Y blocks the influence between X and Z
(Graph: X ← Y → Z)
• Y: low pressure
• X: Rain
• Z: Cold
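The same kind of check for a common cause, again with made-up numbers:

```python
# Common cause: Y is a parent of both X and Z.
P_Y = {1: 0.4, 0: 0.6}
P_X = {y: {1: p, 0: 1 - p} for y, p in {1: 0.9, 0: 0.2}.items()}  # P(X|Y)
P_Z = {y: {1: p, 0: 1 - p} for y, p in {1: 0.8, 0: 0.1}.items()}  # P(Z|Y)

def joint(y, x, z):
    return P_Y[y] * P_X[y][x] * P_Z[y][z]

# Given Y, X and Z factor: P(X, Z | Y) = P(X|Y) P(Z|Y).
for y in (0, 1):
    for x in (0, 1):
        for z in (0, 1):
            p_xz_given_y = joint(y, x, z) / P_Y[y]
            assert abs(p_xz_given_y - P_X[y][x] * P_Z[y][z]) < 1e-12

# Marginally, X and Z are correlated through the common cause Y.
p_x1 = sum(joint(y, 1, z) for y in (0, 1) for z in (0, 1))
p_z1 = sum(joint(y, x, 1) for y in (0, 1) for x in (0, 1))
p_x1z1 = sum(joint(y, 1, 1) for y in (0, 1))
assert abs(p_x1z1 - p_x1 * p_z1) > 0.01
```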
![Page 31: Crash Course on Machine Learning Part V Several slides from Derek Hoiem, Ben Taskar, and Andreas Krause](https://reader035.vdocuments.us/reader035/viewer/2022062519/5697bfd11a28abf838cab3cb/html5/thumbnails/31.jpg)
Common Effect
• Are X, Z independent?
• Are X, Z independent given Y?
• Observing Y activates influence between X and Z
(Graph: X → Y ← Z)
• X: Rain
• Y: Traffic
• Z: Ball Game
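And for the common-effect ("explaining away") case:

```python
# Common effect: X and Z are independent parents of Y; numbers invented.
P_X = {1: 0.3, 0: 0.7}   # Rain
P_Z = {1: 0.2, 0: 0.8}   # Ball game
P_Y = {(x, z): {1: p, 0: 1 - p}                   # Traffic given both
       for (x, z), p in {(1, 1): 0.95, (1, 0): 0.8,
                         (0, 1): 0.7,  (0, 0): 0.1}.items()}

def joint(x, z, y):
    return P_X[x] * P_Z[z] * P_Y[x, z][y]

# Marginally X and Z are independent (by construction).
p_x1z1 = sum(joint(1, 1, y) for y in (0, 1))
assert abs(p_x1z1 - P_X[1] * P_Z[1]) < 1e-12

# Conditioning on Y = 1 couples them: the influence is "activated".
def p_given_y1(x, z):
    den = sum(joint(a, b, 1) for a in (0, 1) for b in (0, 1))
    return joint(x, z, 1) / den

px = p_given_y1(1, 0) + p_given_y1(1, 1)
pz = p_given_y1(0, 1) + p_given_y1(1, 1)
assert abs(p_given_y1(1, 1) - px * pz) > 0.001   # no longer independent
```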
![Page 32: Crash Course on Machine Learning Part V Several slides from Derek Hoiem, Ben Taskar, and Andreas Krause](https://reader035.vdocuments.us/reader035/viewer/2022062519/5697bfd11a28abf838cab3cb/html5/thumbnails/32.jpg)
Independence in BNs
• Any complex BN structure can be analyzed using these three cases
(Graph: the Alarm network: Earthquake → Radio; Earthquake, Burglary → Alarm; Alarm → Call.)
![Page 33: Crash Course on Machine Learning Part V Several slides from Derek Hoiem, Ben Taskar, and Andreas Krause](https://reader035.vdocuments.us/reader035/viewer/2022062519/5697bfd11a28abf838cab3cb/html5/thumbnails/33.jpg)
Directed acyclic graph (Bayes net)
(Graph: a → b; b → c; b → d.)
P(a,b,c,d) = P(c|b) P(d|b) P(b|a) P(a)
• Can model causality
• Parameter learning
– Decomposes: learn each term separately (ML)
• Inference
– Simple exact inference if tree-shaped (belief propagation)
![Page 34: Crash Course on Machine Learning Part V Several slides from Derek Hoiem, Ben Taskar, and Andreas Krause](https://reader035.vdocuments.us/reader035/viewer/2022062519/5697bfd11a28abf838cab3cb/html5/thumbnails/34.jpg)
Directed acyclic graph (Bayes net)
(Graph: a → b; b → c; a, b → d, which creates a loop a–b–d.)
P(a,b,c,d) = P(c|b) P(d|a,b) P(b|a) P(a)
• Can model causality
• Parameter learning
– Decomposes: learn each term separately (ML)
• Inference
– Simple exact inference if tree-shaped (belief propagation)
– Loops require approximation
• Loopy BP
• Tree-reweighted BP
• Sampling
![Page 35: Crash Course on Machine Learning Part V Several slides from Derek Hoiem, Ben Taskar, and Andreas Krause](https://reader035.vdocuments.us/reader035/viewer/2022062519/5697bfd11a28abf838cab3cb/html5/thumbnails/35.jpg)
Directed graph
• Example: Places and scenes
Place: office, kitchen, street, etc.
Objects present: Car, Person, Toaster, Microwave, Fire Hydrant
P(place, car, person, toaster, micro, hydrant) = P(place) P(car | place) P(person | place) … P(hydrant | place)
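A sketch of inference in this star-shaped model, with invented probabilities; observing objects and normalizing over places is naive-Bayes-style scoring:

```python
# Hypothetical numbers for the star-shaped place/objects model.
places = ("office", "kitchen", "street")
P_place = {"office": 0.4, "kitchen": 0.3, "street": 0.3}
P_obj = {  # P(object present | place), one independent factor per object
    "car":     {"office": 0.01, "kitchen": 0.01, "street": 0.7},
    "toaster": {"office": 0.05, "kitchen": 0.6,  "street": 0.01},
    "person":  {"office": 0.8,  "kitchen": 0.5,  "street": 0.9},
}

def posterior(observed):  # observed: dict object -> 0/1
    scores = {}
    for pl in places:
        s = P_place[pl]
        for obj, present in observed.items():
            p = P_obj[obj][pl]
            s *= p if present else 1 - p
        scores[pl] = s
    z = sum(scores.values())            # normalize over places
    return {pl: s / z for pl, s in scores.items()}

post = posterior({"toaster": 1, "car": 0})
assert max(post, key=post.get) == "kitchen"   # toaster suggests kitchen
```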
![Page 36: Crash Course on Machine Learning Part V Several slides from Derek Hoiem, Ben Taskar, and Andreas Krause](https://reader035.vdocuments.us/reader035/viewer/2022062519/5697bfd11a28abf838cab3cb/html5/thumbnails/36.jpg)
Undirected graph (Markov Networks)
• Does not model causality
• Often pairwise
• Parameter learning difficult
• Inference usually approximate
(Graph: pairwise network over x1, x2, x3, x4.)
P(x | data) = (1/Z) ∏_{i=1..4} φ(x_i; data) ∏_{(i,j)∈edges} ψ(x_i, x_j; data)
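A minimal pairwise Markov network by brute-force enumeration, with invented potentials; Z is the partition function:

```python
from itertools import product

# Pairwise Markov network over binary x1..x4 arranged in a square.
edges = [(0, 1), (1, 3), (2, 3), (0, 2)]

def phi(xi):                # hypothetical unary potential
    return 2.0 if xi == 1 else 1.0

def psi(xi, xj):            # hypothetical smoothing pairwise potential
    return 3.0 if xi == xj else 1.0

def unnormalized(x):
    p = 1.0
    for xi in x:
        p *= phi(xi)
    for i, j in edges:
        p *= psi(x[i], x[j])
    return p

Z = sum(unnormalized(x) for x in product([0, 1], repeat=4))

def prob(x):
    return unnormalized(x) / Z

assert abs(sum(prob(x) for x in product([0, 1], repeat=4)) - 1.0) < 1e-12
# Agreement is favoured: all-ones beats an alternating labelling.
assert prob((1, 1, 1, 1)) > prob((1, 0, 1, 0))
```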
![Page 37: Crash Course on Machine Learning Part V Several slides from Derek Hoiem, Ben Taskar, and Andreas Krause](https://reader035.vdocuments.us/reader035/viewer/2022062519/5697bfd11a28abf838cab3cb/html5/thumbnails/37.jpg)
Markov Networks
• Example: “label smoothing” grid
• Binary nodes
Pairwise potential (cost K when neighboring labels disagree):
     y=0  y=1
y=0   0    K
y=1   K    0
![Page 38: Crash Course on Machine Learning Part V Several slides from Derek Hoiem, Ben Taskar, and Andreas Krause](https://reader035.vdocuments.us/reader035/viewer/2022062519/5697bfd11a28abf838cab3cb/html5/thumbnails/38.jpg)
Image De-Noising
Original Image Noisy Image
![Page 39: Crash Course on Machine Learning Part V Several slides from Derek Hoiem, Ben Taskar, and Andreas Krause](https://reader035.vdocuments.us/reader035/viewer/2022062519/5697bfd11a28abf838cab3cb/html5/thumbnails/39.jpg)
Image De-Noising
![Page 40: Crash Course on Machine Learning Part V Several slides from Derek Hoiem, Ben Taskar, and Andreas Krause](https://reader035.vdocuments.us/reader035/viewer/2022062519/5697bfd11a28abf838cab3cb/html5/thumbnails/40.jpg)
Image De-Noising
Noisy Image Restored Image (ICM)
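ICM itself is simple greedy coordinate descent on the energy; below is a sketch for binary denoising under an assumed Ising-style energy (the ±1 labels and the β, η weights are illustrative, not the slides' exact setup):

```python
# ICM sketch for binary denoising, labels in {-1, +1}, assumed energy:
#   E(x) = -beta * sum_edges x_i x_j  -  eta * sum_i x_i y_i
def icm(noisy, h, w, beta=2.0, eta=1.0, sweeps=5):
    x = list(noisy)                       # init: x = observed image y
    def neighbours(i):
        r, c = divmod(i, w)
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            rr, cc = r + dr, c + dc
            if 0 <= rr < h and 0 <= cc < w:
                yield rr * w + cc
    for _ in range(sweeps):
        for i in range(h * w):
            # local field decides the energy-minimizing label for pixel i
            field = eta * noisy[i] + beta * sum(x[j] for j in neighbours(i))
            x[i] = 1 if field >= 0 else -1   # greedy coordinate update
    return x

clean = [1] * 25                          # 5x5 all-ones image
noisy = list(clean)
for i in (3, 12, 15):                     # flip a few isolated pixels
    noisy[i] = -1
restored = icm(noisy, 5, 5)
assert restored == clean                  # smoothing removes isolated flips
```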
![Page 41: Crash Course on Machine Learning Part V Several slides from Derek Hoiem, Ben Taskar, and Andreas Krause](https://reader035.vdocuments.us/reader035/viewer/2022062519/5697bfd11a28abf838cab3cb/html5/thumbnails/41.jpg)
Factor graphs
• A general representation
(Figure: a Bayes net over variables a, b, c, d and its equivalent factor graph.)
![Page 42: Crash Course on Machine Learning Part V Several slides from Derek Hoiem, Ben Taskar, and Andreas Krause](https://reader035.vdocuments.us/reader035/viewer/2022062519/5697bfd11a28abf838cab3cb/html5/thumbnails/42.jpg)
Factor graphs
• A general representation
(Figure: a Markov net over variables a, b, c, d and its equivalent factor graph.)
![Page 43: Crash Course on Machine Learning Part V Several slides from Derek Hoiem, Ben Taskar, and Andreas Krause](https://reader035.vdocuments.us/reader035/viewer/2022062519/5697bfd11a28abf838cab3cb/html5/thumbnails/43.jpg)
Factor graphs
P(a,b,c,d) = f1(a,b,c) f2(d) f3(a,d)
Write as a factor graph
![Page 44: Crash Course on Machine Learning Part V Several slides from Derek Hoiem, Ben Taskar, and Andreas Krause](https://reader035.vdocuments.us/reader035/viewer/2022062519/5697bfd11a28abf838cab3cb/html5/thumbnails/44.jpg)
Inference in Graphical Models
• Joint
• Marginal
• Max
• Exact inference is HARD
![Page 45: Crash Course on Machine Learning Part V Several slides from Derek Hoiem, Ben Taskar, and Andreas Krause](https://reader035.vdocuments.us/reader035/viewer/2022062519/5697bfd11a28abf838cab3cb/html5/thumbnails/45.jpg)
Approximate Inference
![Page 46: Crash Course on Machine Learning Part V Several slides from Derek Hoiem, Ben Taskar, and Andreas Krause](https://reader035.vdocuments.us/reader035/viewer/2022062519/5697bfd11a28abf838cab3cb/html5/thumbnails/46.jpg)
Approximation
![Page 47: Crash Course on Machine Learning Part V Several slides from Derek Hoiem, Ben Taskar, and Andreas Krause](https://reader035.vdocuments.us/reader035/viewer/2022062519/5697bfd11a28abf838cab3cb/html5/thumbnails/47.jpg)
Sampling a Multinomial Distribution
![Page 48: Crash Course on Machine Learning Part V Several slides from Derek Hoiem, Ben Taskar, and Andreas Krause](https://reader035.vdocuments.us/reader035/viewer/2022062519/5697bfd11a28abf838cab3cb/html5/thumbnails/48.jpg)
Sampling from a BN
- Compute marginals
- Compute conditionals
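Ancestral sampling (sample parents before children) supports both tasks; a sketch on the Rain → Traffic net with made-up numbers:

```python
import random

random.seed(0)
P_RAIN = 0.3
P_TRAFFIC = {True: 0.8, False: 0.2}   # P(traffic | rain)

def sample():
    rain = random.random() < P_RAIN              # sample the parent first,
    traffic = random.random() < P_TRAFFIC[rain]  # then the child given it
    return rain, traffic

samples = [sample() for _ in range(100_000)]

# Monte Carlo marginal: P(traffic) = 0.3*0.8 + 0.7*0.2 = 0.38
p_traffic = sum(t for _, t in samples) / len(samples)
assert abs(p_traffic - 0.38) < 0.01

# Monte Carlo conditional: P(rain | traffic) = 0.24 / 0.38
cond = [r for r, t in samples if t]
assert abs(sum(cond) / len(cond) - 0.24 / 0.38) < 0.02
```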
![Page 49: Crash Course on Machine Learning Part V Several slides from Derek Hoiem, Ben Taskar, and Andreas Krause](https://reader035.vdocuments.us/reader035/viewer/2022062519/5697bfd11a28abf838cab3cb/html5/thumbnails/49.jpg)
Belief Propagation
• Very general
• Approximate, except for tree-shaped graphs
– Generalized variants of BP can have better convergence for graphs with many loops or strong potentials
• Standard packages available (BNT toolbox)
• To learn more:
– Yedidia, J.S.; Freeman, W.T.; Weiss, Y., “Understanding Belief Propagation and Its Generalizations”, Technical Report, 2001: http://www.merl.com/publications/TR2001-022/
![Page 50: Crash Course on Machine Learning Part V Several slides from Derek Hoiem, Ben Taskar, and Andreas Krause](https://reader035.vdocuments.us/reader035/viewer/2022062519/5697bfd11a28abf838cab3cb/html5/thumbnails/50.jpg)
Belief Propagation
“Beliefs” are built from “messages”:
b_i(x_i) ∝ ∏_{a ∈ N(i)} m_{a→i}(x_i)
b_a(X_a) ∝ f_a(X_a) ∏_{i ∈ N(a)} ∏_{c ∈ N(i)\a} m_{c→i}(x_i)
The “belief” is the BP approximation of the marginal probability.
![Page 51: Crash Course on Machine Learning Part V Several slides from Derek Hoiem, Ben Taskar, and Andreas Krause](https://reader035.vdocuments.us/reader035/viewer/2022062519/5697bfd11a28abf838cab3cb/html5/thumbnails/51.jpg)
BP Message-update Rules
Using b_i(x_i) = ∑_{X_a \ x_i} b_a(X_a), we get:
m_{a→i}(x_i) = ∑_{X_a \ x_i} f_a(X_a) ∏_{j ∈ N(a)\i} ∏_{b ∈ N(j)\a} m_{b→j}(x_j)
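On a tree these updates are exact; a sketch on a 3-variable chain, comparing the BP belief at the middle node with the brute-force marginal (the factor values are invented):

```python
from itertools import product

# Chain x1 - f12 - x2 - f23 - x3 over binary states; hypothetical factors.
f12 = [[1.0, 0.5], [0.5, 2.0]]
f23 = [[1.5, 1.0], [0.2, 1.0]]

# Exact marginal of x2 by brute-force enumeration.
def joint(x1, x2, x3):
    return f12[x1][x2] * f23[x2][x3]

z = sum(joint(*v) for v in product([0, 1], repeat=3))
exact = [sum(joint(x1, x2, x3) for x1 in (0, 1) for x3 in (0, 1)) / z
         for x2 in (0, 1)]

# BP: the belief at x2 is the product of the messages from each side,
# m_{f->x2}(x2) = sum over the factor's other variable.
m_f12_to_x2 = [sum(f12[x1][x2] for x1 in (0, 1)) for x2 in (0, 1)]
m_f23_to_x2 = [sum(f23[x2][x3] for x3 in (0, 1)) for x2 in (0, 1)]
belief = [a * b for a, b in zip(m_f12_to_x2, m_f23_to_x2)]
s = sum(belief)
belief = [b / s for b in belief]

# On a tree (here, a chain) BP is exact.
assert all(abs(b - e) < 1e-12 for b, e in zip(belief, exact))
```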
![Page 52: Crash Course on Machine Learning Part V Several slides from Derek Hoiem, Ben Taskar, and Andreas Krause](https://reader035.vdocuments.us/reader035/viewer/2022062519/5697bfd11a28abf838cab3cb/html5/thumbnails/52.jpg)
Inference: Graph Cuts
• Associative: edge potentials penalize different labels
• Associative binary networks can be solved optimally (and quickly) using graph cuts
• Multilabel associative networks can be handled by alpha-expansion or alpha-beta swaps
• To learn more:
– http://www.cs.cornell.edu/~rdz/graphcuts.html
– Classic paper: What Energy Functions Can Be Minimized via Graph Cuts? (Kolmogorov and Zabih, ECCV '02 / PAMI '04)
![Page 53: Crash Course on Machine Learning Part V Several slides from Derek Hoiem, Ben Taskar, and Andreas Krause](https://reader035.vdocuments.us/reader035/viewer/2022062519/5697bfd11a28abf838cab3cb/html5/thumbnails/53.jpg)
Graph Cuts: Binary MRF
Unary terms (compatibility of data with label y)
Pairwise terms (compatibility of neighboring labels)
Graph cuts are used to optimize this cost function.
Summary of approach:
• Associate each possible solution with a minimum cut on a graph
• Set capacities on the graph so the cost of a cut matches the cost function
• Use augmenting paths to find the minimum cut
• This minimizes the cost function and finds the MAP solution
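For intuition, the cost function itself is easy to write down; here brute-force enumeration over a tiny 2x2 grid stands in for the min-cut computation (the unary costs and the smoothing weight are invented), which finds the same minimizer in polynomial time on large problems:

```python
from itertools import product

# Binary MRF cost: unary terms plus pairwise disagreement penalties.
unary = {0: [0.0, 2.0], 1: [1.0, 0.0], 2: [0.5, 0.5], 3: [2.0, 0.0]}
edges = [(0, 1), (0, 2), (1, 3), (2, 3)]   # 2x2 grid, 4-connected
LAMBDA = 0.6                               # smoothing strength (assumed)

def cost(y):
    c = sum(unary[i][y[i]] for i in unary)            # data terms
    c += LAMBDA * sum(y[i] != y[j] for i, j in edges)  # smoothness terms
    return c

# MAP labelling = cost minimizer (graph cuts would find this via min-cut).
best = min(product([0, 1], repeat=4), key=cost)
assert cost(best) <= cost((0, 0, 0, 0))
```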
![Page 54: Crash Course on Machine Learning Part V Several slides from Derek Hoiem, Ben Taskar, and Andreas Krause](https://reader035.vdocuments.us/reader035/viewer/2022062519/5697bfd11a28abf838cab3cb/html5/thumbnails/54.jpg)
Denoising Results
Original image, then results with pairwise costs increasing.