What have we learned about learning?
WHAT HAVE WE LEARNED ABOUT LEARNING?
Statistical learning: mathematically rigorous, general approach; requires probabilistic expression of likelihood, prior
Decision trees: learning concepts that can be expressed as logical statements; statements must be relatively compact for small trees and efficient learning
Neuron learning: optimization to minimize fitting error over weight parameters; fixed linear function class
Neural networks: can tune arbitrarily sophisticated hypothesis classes; unintuitive map from network structure => hypothesis class
SUPPORT VECTOR MACHINES
SVM INTUITION Find “best” linear classifier
Hope to generalize well
LINEAR CLASSIFIERS
Plane equation: 0 = x1θ1 + x2θ2 + … + xnθn + b
If x1θ1 + x2θ2 + … + xnθn + b > 0, positive example
If x1θ1 + x2θ2 + … + xnθn + b < 0, negative example
Separating plane
LINEAR CLASSIFIERS
Plane equation: x1θ1 + x2θ2 + … + xnθn + b = 0
C = Sign(x1θ1 + x2θ2 + … + xnθn + b)
If C = 1, positive example; if C = -1, negative example
Separating plane
(θ1,θ2)
(-bθ1, -bθ2)
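A minimal sketch of this decision rule in NumPy (the weights, offset, and test points below are made up for illustration, not taken from the slides):

```python
import numpy as np

def classify(x, theta, b):
    """Linear classifier: return +1 if theta.x + b > 0, else -1."""
    return 1 if np.dot(theta, x) + b > 0 else -1

# Example 2D plane: 0 = 1.0*x1 - 2.0*x2 + 0.5 (arbitrary weights)
theta, b = np.array([1.0, -2.0]), 0.5
print(classify(np.array([3.0, 0.0]), theta, b))   # +1: positive side of the plane
print(classify(np.array([0.0, 3.0]), theta, b))   # -1: negative side of the plane
```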
SVM: MAXIMUM MARGIN CLASSIFICATION
Find the linear classifier that maximizes the margin between positive and negative examples
(Figure label: Margin)
MARGIN
The farther away from the boundary we are, the more "confident" the classification
(Figure labels: Margin; Very confident; Not as confident)
GEOMETRIC MARGIN
The farther away from the boundary we are, the more "confident" the classification
The distance of an example to the boundary is its geometric margin
(Figure label: Margin)
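In the plane notation used above, a common way to write the geometric margin of a labeled example (xi, yi), with yi in {+1, -1}, is yi(θ·xi + b)/‖θ‖. A minimal NumPy sketch (function name and example values are illustrative, not from the slides):

```python
import numpy as np

def geometric_margin(x, y, theta, b):
    """Signed distance of example x (label y in {+1, -1}) to the plane theta.x + b = 0.
    Positive iff x is classified correctly; larger means a more 'confident' classification."""
    return y * (np.dot(theta, x) + b) / np.linalg.norm(theta)

theta, b = np.array([1.0, -2.0]), 0.5
print(geometric_margin(np.array([3.0, 0.0]), +1, theta, b))  # ~1.57: far from the boundary
print(geometric_margin(np.array([0.2, 0.4]), +1, theta, b))  # ~-0.045: barely on the wrong side
```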
KEY INSIGHTS
The optimal classification boundary is defined by just a few (d+1) points: the support vectors
Numerical tricks make the optimization fast
(Figure label: Margin)
NONSEPARABLE DATA
Cannot achieve perfect accuracy with noisy data
Regularization: tolerate some errors, with the cost of each error determined by a parameter C
• Higher C: errors penalized more heavily, lower training error (typically fewer support vectors)
• Lower C: errors tolerated more, higher training error (typically more support vectors)
SOFT GEOMETRIC MARGIN
Minimize ½ ‖θ‖² + C Σi Errori
where Errori indicates a degree of misclassification (nonzero only for misclassified examples), and C is the regularization parameter
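As a hedged illustration of the role of C, here is a sketch using scikit-learn's SVC (which wraps the libsvm library mentioned later); the synthetic dataset and the specific C values are assumptions for illustration only:

```python
import numpy as np
from sklearn.svm import SVC

# Noisy, nonseparable 2D data: a linear concept with ~10% of the labels flipped.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)
y[rng.random(200) < 0.1] *= -1

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C={C}: {clf.n_support_.sum()} support vectors, "
          f"training accuracy {clf.score(X, y):.2f}")
```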
CAN WE DO BETTER?
MOTIVATION: FEATURE MAPPINGS
Given attributes x, learn in the space of features f(x), e.g., parity, FACE(card), RED(card)
Hope the CONCEPT is easier to learn in feature space
Goal: generate many features in the hope that some are predictive, but not so many that we overfit (maximum margin helps somewhat against overfitting)
VC DIMENSION
In an N-dimensional feature space, there exists a perfect linear separator for n <= N+1 non-coplanar examples, no matter how they are labeled
(Figure: a set of + and - labeled points, with one point marked ?)
WHAT FEATURES SHOULD BE USED?
Adding linear functions of the x's doesn't help the SVM separate non-separable data. Why?
But it may help improve generalization (particularly on badly-scaled datasets). Why?
Nonlinear functions, however, may help…
EXAMPLE
(Figure: a dataset plotted on axes x1 and x2)
EXAMPLE
Choose f1 = x1², f2 = x2², f3 = 2 x1x2
(Figure: the data shown in the original (x1, x2) space and in the (f1, f2, f3) feature space)
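The figures are not reproduced here, but a small sketch of the mapping above shows the idea: data labeled by a circle in (x1, x2) is not linearly separable, yet becomes separable by a plane after mapping to (f1, f2, f3). The data generation below is an assumption for illustration:

```python
import numpy as np

def quad_features(X):
    """Map each row (x1, x2) to (f1, f2, f3) = (x1^2, x2^2, 2*x1*x2), as on the slide."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([x1**2, x2**2, 2 * x1 * x2])

# Concentric data: +1 inside the unit circle, -1 outside (not linearly separable in (x1, x2)).
rng = np.random.default_rng(1)
X = rng.uniform(-2, 2, size=(500, 2))
y = np.where(X[:, 0]**2 + X[:, 1]**2 < 1, 1, -1)

# In feature space the concept is f1 + f2 < 1, i.e. a plane: the data becomes linearly separable.
F = quad_features(X)
pred = np.where(F[:, 0] + F[:, 1] < 1, 1, -1)
print((pred == y).mean())   # 1.0
```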
POLYNOMIAL FEATURES
Original features: x1, …, xn
Quadratic features: x1², …, xn², x1x2, …, x1xn, …, xn-1xn (n² features possible)
Linear classifiers in feature space become ellipses, parabolas, and hyperbolas in the original space!
[Doesn't help to add features like 3x1² - 5x1x3. Why?]
Higher-order features are also possible
Increase the maximum power until the data is linearly separable?
SVMs implement these and other feature mappings efficiently through the "kernel trick"
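The kernel trick itself is not spelled out on these slides; as a hedged sketch, scikit-learn's SVC with a polynomial kernel does the equivalent of the quadratic mapping above without ever materializing the features (the library, the kernel parameters, and the synthetic data are assumptions, not from the slides):

```python
import numpy as np
from sklearn.svm import SVC

# Concentric-circle-style data, not linearly separable in the original space.
rng = np.random.default_rng(2)
X = rng.uniform(-2, 2, size=(500, 2))
y = np.where(X[:, 0]**2 + X[:, 1]**2 < 1, 1, -1)

# Degree-2 polynomial kernel: K(x, x') = (gamma * <x, x'> + coef0)^degree.
# The classifier effectively works in a quadratic feature space without building it.
clf = SVC(kernel="poly", degree=2, coef0=1.0, gamma=1.0, C=1.0).fit(X, y)
print(clf.score(X, y))   # close to 1.0 on this data
```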
RESULTS
Decision boundaries in feature space may be highly curved in the original space!
More complex: better fit, but more possibility of overfitting
OVERFITTING / UNDERFITTING
COMMENTS
SVMs often have very good performance, e.g., digit classification, face recognition, etc.
Still need parameter tweaking: kernel type, kernel parameters, regularization weight
Fast optimization for medium datasets (~100k examples)
Off-the-shelf libraries: libsvm, SVMlight
NONPARAMETRIC MODELING (MEMORY-BASED LEARNING)
So far, most of our learning techniques represent the target concept as a model with unknown parameters, which are fitted to the training set: Bayes nets, linear models, neural networks
Parametric learners have fixed capacity. Can we skip the modeling step?
EXAMPLE: TABLE LOOKUP
Values of the concept f(x) are given on the training set D = {(xi, f(xi)) for i = 1, …, N}
(Figure: training set D of + and - labeled points within example space X)
EXAMPLE: TABLE LOOKUP
Values of the concept f(x) are given on the training set D = {(xi, f(xi)) for i = 1, …, N}
On a new example x, a nonparametric hypothesis h might return:
• The cached value of f(x), if x is in D
• FALSE otherwise
A pretty bad learner, because you are unlikely to see the same exact situation twice!
(Figure: training set D of + and - labeled points within example space X)
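A minimal dictionary-based sketch of this table-lookup hypothesis (names and example data are illustrative):

```python
def table_lookup_hypothesis(D):
    """D is a list of (x, f(x)) pairs. Returns h(x): the cached label if x was seen
    during training, FALSE otherwise -- the (bad) learner described above."""
    table = {tuple(x): fx for x, fx in D}
    return lambda x: table.get(tuple(x), False)

h = table_lookup_hypothesis([((0, 0), True), ((1, 2), False)])
print(h((1, 2)))   # False (cached value)
print(h((3, 3)))   # False (unseen example -> default FALSE)
```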
NEAREST-NEIGHBORS MODELS
Suppose we have a distance metric d(x, x') between examples
A nearest-neighbors model classifies a point x by:
1. Find the closest point xi in the training set
2. Return the label f(xi)
(Figure: training set D of + and - points in example space X, with a new query point)
NEAREST NEIGHBORS
NN extends the classification value at each example to its Voronoi cell
Idea: the classification boundary is spatially coherent (we hope)
(Figure: Voronoi diagram in a 2D space)
NEAREST NEIGHBORS QUERY
Given dataset D = {(x1, f(x1)), …, (xN, f(xN))} and a distance metric d
Brute-Force-NN-Query(x, D, d):
1. For each example xi in D:
2.   Compute di = d(x, xi)
3. Return the label f(xi) of the example with minimum di
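A direct Python translation of this pseudocode (the training data and distance metric below are illustrative):

```python
import numpy as np

def brute_force_nn_query(x, X, labels, dist):
    """Return the label of the training example closest to x under metric dist.
    X is an (N, d) array of training inputs; labels holds their f(xi) values."""
    d_i = np.array([dist(x, xi) for xi in X])   # distance to every training example
    return labels[np.argmin(d_i)]               # label of the nearest one

euclidean = lambda a, b: np.linalg.norm(np.asarray(a) - np.asarray(b))
X = np.array([[0.0, 0.0], [1.0, 1.0], [4.0, 5.0]])
labels = np.array([+1, +1, -1])
print(brute_force_nn_query([3.5, 4.0], X, labels, euclidean))   # -1, nearest is [4, 5]
```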
DISTANCE METRICS d(x,x’) measures how “far” two examples are
from one another, and must satisfy: d(x,x) = 0 d(x,x’) ≥ 0 d(x,x’) = d(x’,x)
Common metrics Euclidean distance (if dimensions are in same units) Manhattan distance (different units)
Axes should be weighted to account for spread d(x,x’) = αh|height-height’| + αw|weight-weight’|
Some metrics also account for correlation between axes (e.g., Mahalanobis distance)
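A short sketch of these metrics; the weights and the height/weight example values are illustrative assumptions:

```python
import numpy as np

def euclidean(x, xp):
    return np.linalg.norm(x - xp)

def manhattan(x, xp):
    return np.abs(x - xp).sum()

def weighted_manhattan(x, xp, alpha):
    """Per-axis weights (e.g. alpha_height, alpha_weight) to account for spread and units."""
    return (alpha * np.abs(x - xp)).sum()

x  = np.array([1.80, 75.0])    # (height in m, weight in kg) -- illustrative values
xp = np.array([1.60, 80.0])
print(euclidean(x, xp), manhattan(x, xp),
      weighted_manhattan(x, xp, np.array([10.0, 0.1])))
```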
PROPERTIES OF NN
Let N = |D| (size of training set), d = dimensionality of the data
Without noise, performance improves as N grows
k-nearest neighbors helps handle overfitting on noisy data: consider the labels of the k nearest neighbors and take a majority vote (see the sketch below)
Curse of dimensionality: as d grows, nearest neighbors become pretty far away!
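A hedged sketch of the k-nearest-neighbors vote, extending the brute-force query above; it assumes labels in {+1, -1} and breaks an even-k tie toward -1:

```python
import numpy as np

def knn_query(x, X, labels, k, dist):
    """k-nearest-neighbors query: majority vote over the k closest training examples."""
    d_i = np.array([dist(x, xi) for xi in X])
    nearest = np.argsort(d_i)[:k]                    # indices of the k smallest distances
    return 1 if labels[nearest].sum() > 0 else -1    # labels assumed to be +1 / -1

euclidean = lambda a, b: np.linalg.norm(np.asarray(a) - np.asarray(b))
X = np.array([[0.0, 0.0], [0.2, 0.1], [4.0, 5.0], [4.1, 4.9], [3.9, 5.2]])
labels = np.array([+1, +1, -1, -1, -1])
print(knn_query([3.5, 4.0], X, labels, k=3, dist=euclidean))   # -1 by majority vote
```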
CURSE OF DIMENSIONALITY
Suppose X is a hypercube of dimension d, width 1 on all axes
Say an example is "close" to the query point if the difference on every axis is < 0.25
What fraction of X is "close" to the query point?
d = 2: 0.5^2 = 0.25
d = 3: 0.5^3 = 0.125
d = 10: 0.5^10 ≈ 0.00098
d = 20: 0.5^20 ≈ 9.5×10^-7
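A quick check of these numbers in Python:

```python
# Fraction of the unit hypercube that is within 0.25 of the query on every axis: 0.5**d
for d in (2, 3, 10, 20):
    print(d, 0.5 ** d)
# 2 0.25, 3 0.125, 10 0.0009765625, 20 9.5367431640625e-07
```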
COMPUTATIONAL PROPERTIES OF K-NN
Training time is nil
Naïve k-NN: O(N) time to make a prediction
Special data structures can make this faster: k-d trees, locality-sensitive hashing
… but they are ultimately worthwhile only when d is small, N is very large, or we are willing to approximate
See R&N
ASIDE: DIMENSIONALITY REDUCTION
Many datasets are too high-dimensional to do effective supervised learning, e.g. images, audio, surveys
Dimensionality reduction: preprocess the data to find a small number of features automatically
PRINCIPAL COMPONENT ANALYSIS
Finds a few "axes" that explain the major variations in the data
Related techniques: multidimensional scaling, factor analysis, Isomap
Useful for learning, visualization, clustering, etc.
(Figure credit: University of Washington)
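A hedged sketch using scikit-learn's PCA on synthetic data whose variation is concentrated along two hidden directions (the library choice and the data are assumptions, not from the slides):

```python
import numpy as np
from sklearn.decomposition import PCA

# 100 examples in 50 dimensions whose variation mostly lies along 2 latent directions.
rng = np.random.default_rng(3)
Z = rng.normal(size=(100, 2))                  # 2 latent factors
A = rng.normal(size=(2, 50))                   # mixing into 50 observed features
X = Z @ A + 0.01 * rng.normal(size=(100, 50))  # plus a little noise

pca = PCA(n_components=2)
X_low = pca.fit_transform(X)                   # 100 x 2 representation for learning/visualization
print(pca.explained_variance_ratio_)           # most variance captured by the first two axes
```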
NEXT TIME
In a world with a slew of machine learning techniques, feature spaces, training techniques…
How will you:
• Prove that a learner performs well?
• Compare techniques against each other?
• Pick the best technique?
R&N 18.4-5