
ML Study PAC Learning

2014.09.11 Sanghyuk Chun

Overview
• ML intro & Decision tree
• Bayesian Methods
• Regression
• Graphical Model 1
• Graphical Model 2 (EM)
• PAC learning
• Hidden Markov Models
• Learning Representations
• Neural Network
• Support Vector Machine
• Reinforcement Learning

Basic Concepts

Model and Algorithms
PAC: a theory for ML algorithms

Computational Learning Theory

• Computational learning theory is a mathematical and theoretical field concerned with the analysis of machine learning algorithms

• We seek a theory to relate:
• Probability of successful learning
• Number of training examples
• Complexity of the hypothesis space
• Accuracy to which the target function is approximated
• Manner in which training examples are presented

Prototypical Concept Learning Task

• Given:
• Instances X (the set of instances, or objects in the world)
• Target concept c (a subset of the instance space)
• Hypothesis space H (a collection of concepts over X)
• Training data D (examples from the instance space)

• Determine:
• A hypothesis h in H such that h(x) = c(x) for all x in D?
• A hypothesis h in H such that h(x) = c(x) for all x in X?

(The first condition corresponds to zero training error, the second to zero true error.)

Function Approximation: Overview

There is no free lunch! Generalization beyond the training data is impossible without further assumptions (for example, the regularization term in logistic regression)

• h is the hypothesis in hypothesis space H which is “best” on the training data D

• c is the target function (or concept): what we want to find from hypothesis space H

True error and Training error

• True error of hypothesis h with respect to c: how often h(x) ≠ c(x) over random instances drawn from the distribution D

errorD(h) = Pr_{x~D}[ h(x) ≠ c(x) ]

• Training error of hypothesis h with respect to c: how often h(x) ≠ c(x) over the m training instances

errortrain(h) = (1/m) Σᵢ 1[ h(xᵢ) ≠ c(xᵢ) ]
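A concrete toy illustration of the two definitions may help; the sketch below is hypothetical (the 1-D threshold concept, the hypothesis, and the uniform distribution D are all invented for the example):

```python
import numpy as np

# Toy setup (assumed for illustration): target c(x) = 1[x > 0],
# hypothesis h(x) = 1[x > 0.1], and D = Uniform(-1, 1).
c = lambda x: (x > 0.0).astype(int)   # target concept
h = lambda x: (x > 0.1).astype(int)   # selected hypothesis

rng = np.random.default_rng(0)

# Training error: fraction of disagreements on the m training points.
x_train = rng.uniform(-1, 1, size=50)
error_train = np.mean(h(x_train) != c(x_train))

# True error: Pr_{x~D}[h(x) != c(x)]. Here h and c disagree exactly on
# (0, 0.1], which has probability 0.1/2 = 0.05 under D; a large Monte
# Carlo sample recovers this value.
x_mc = rng.uniform(-1, 1, size=1_000_000)
error_true = np.mean(h(x_mc) != c(x_mc))

print(error_train, error_true)   # error_true ≈ 0.05
```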

True error and Training error
• To select hypothesis h, we use the “Empirical Risk Minimization” (ERM) method, which picks the hypothesis minimizing the training error (see the sketch below)

• Problem: errortrain(h) is a biased approximation of errorD(h)
• Since h is selected using the training data, i.e. errortrain(h) depends on h, errortrain(h) is a biased estimate of errorD(h)
• For the selected h, it is likely to be an underestimate!
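A minimal sketch of ERM over a small finite hypothesis space, continuing the toy threshold setup above (the grid of 201 thresholds is an invented choice of H):

```python
import numpy as np

rng = np.random.default_rng(1)
x_train = rng.uniform(-1, 1, size=30)
y_train = (x_train > 0.0).astype(int)   # labels come from the target c

thresholds = np.linspace(-1, 1, 201)    # finite H of threshold rules, |H| = 201

def training_error(t):
    # Empirical risk of hypothesis h_t(x) = 1[x > t] on the training set.
    return np.mean(((x_train > t).astype(int)) != y_train)

# ERM: select the hypothesis with the smallest training error.
best_t = min(thresholds, key=training_error)
print(best_t, training_error(best_t))
# Because best_t was tuned on this very sample, its training error
# tends to underestimate its true error.
```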

True error and Test error

• Question: Since the exact true error errorD(h) is impossible to know, is there an unbiased approximation of errorD(h)? How can we measure the ‘true’ performance of hypothesis h?

• Answer: The test error is an unbiased approximation of errorD(h)
• because the test set consists of i.i.d. samples drawn from the true distribution, independently of h
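A quick simulation of this claim, again in the toy threshold setup: for a fixed h, averaging the test error over many fresh i.i.d. test sets recovers the true error:

```python
import numpy as np

rng = np.random.default_rng(2)
c = lambda x: (x > 0.0).astype(int)   # target concept
h = lambda x: (x > 0.1).astype(int)   # fixed hypothesis (chosen before testing)

# Many independent test sets, each of 100 i.i.d. draws from D = Uniform(-1, 1).
test_errors = [np.mean(h(x) != c(x))
               for x in (rng.uniform(-1, 1, size=100) for _ in range(1000))]

print(np.mean(test_errors))   # ≈ 0.05, the true error of h
```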

Overfitting
• Hypothesis h in H overfits the training data if there is an alternative hypothesis h’ in H such that

• errortrain(h) < errortrain(h’) and

• errorD(h) > errorD(h’)


More “complex” models cause a stronger overfitting effect (Occam’s razor)

What if the training set grows to infinity? Or: how many training samples do we need?

PAC learning
• PAC learning, or Probably Approximately Correct learning, is a framework for mathematical analysis of machine learning

• Goal of PAC: With high probability (“Probably”), the selected hypothesis will have low error ("Approximately Correct")

• Assume there is no error (noise) in the data

PAC learning: finite hypothesis space

• As we saw before, the training error underestimates the true error

• In PAC learning, we seek a theory to relate:
• The number of training samples: m
• The gap between training and true errors: errorD(h) ≤ errortrain(h) + ε
• The complexity of the hypothesis space: |H|
• The confidence of the relation: at least (1-δ)

Special case: errortrain(h) = 0
• Assume errortrain(h) = 0, i.e. the target concept c is in the hypothesis space H

• errorD(h) ≤ errortrain(h) + ε then reduces to errorD(h) ≤ ε

• What is the probability that there exists a consistent hypothesis with true error > ε?
• i.e. express δ in terms of m, ε, and |H|

• Result (proof: see appendix):

Pr[ ∃ h ∈ H consistent with the m examples and errorD(h) > ε ] ≤ |H| e^(-εm)

Bounds for finite hypothesis space

• Suppose we require this probability to be at most δ (so that the confidence of the relation is 1-δ), i.e. |H| e^(-εm) ≤ δ

• How many training examples suffice?

m ≥ (1/ε) ( ln|H| + ln(1/δ) )

• If errortrain(h) = 0, then with probability at least 1-δ:

errorD(h) ≤ (1/m) ( ln|H| + ln(1/δ) )
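These bounds are easy to evaluate numerically; the helper below is a sketch (the function name and the example numbers are invented, not from the slides):

```python
import math

def sample_complexity(h_size: int, eps: float, delta: float) -> int:
    # Consistent case: m >= (1/eps) * (ln|H| + ln(1/delta)).
    return math.ceil((math.log(h_size) + math.log(1.0 / delta)) / eps)

# Example: a hypothesis space with |H| = 3^10 (e.g. conjunctions over
# 10 boolean literals), target error 0.1, confidence 95%.
print(sample_complexity(3 ** 10, eps=0.1, delta=0.05))   # 140 examples suffice
```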

Agnostic learning
• Assume errortrain(h) ≠ 0, i.e. the target function c is not in the hypothesis space H

• Again, in PAC learning, we seek a theory to relate:
• The number of training samples: m
• The gap between training and true errors: errorD(h) ≤ errortrain(h) + ε
• The complexity of the hypothesis space: |H|
• The confidence of the relation: at least (1-δ)

• The bound on δ (derived from Hoeffding bounds):

Pr[ errorD(h) > errortrain(h) + ε ] ≤ |H| e^(-2mε²)

Bounds for finite hypothesis space

• The bound on δ:

δ = |H| e^(-2mε²), i.e. ε = √( (ln|H| + ln(1/δ)) / 2m )

• We get a new answer! With probability at least 1-δ:

errorD(h) ≤ errortrain(h) + √( (ln|H| + ln(1/δ)) / 2m )
(true error) (training error) (degree of overfitting)
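A small sketch evaluating this overfitting term; the values of |H|, m, and δ are made up for illustration:

```python
import math

def overfitting_gap(h_size: int, m: int, delta: float) -> float:
    # eps = sqrt((ln|H| + ln(1/delta)) / (2m)) from the Hoeffding-based bound.
    return math.sqrt((math.log(h_size) + math.log(1.0 / delta)) / (2 * m))

# The gap shrinks with more data and grows with a richer hypothesis space:
print(overfitting_gap(3 ** 10, m=100, delta=0.05))     # ≈ 0.26
print(overfitting_gap(3 ** 10, m=10_000, delta=0.05))  # ≈ 0.026
```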

Intuition
• The bound on the number of training samples is

m ≥ (1/2ε²) ( ln|H| + ln(1/δ) )

• The bound on the true error is

errorD(h) ≤ errortrain(h) + √( (ln|H| + ln(1/δ)) / 2m )

• We can improve the performance of the algorithm by:
• Decreasing the training error errortrain(h)
• Increasing the number of training samples m
• Choosing an H which is “simple” (Occam’s Razor)

PAC learnable
• A concept class C is PAC learnable by learner L using H if, for every target c in C, every distribution D, and every ε and δ, L outputs with probability at least 1-δ a hypothesis h with errorD(h) ≤ ε, in time (and sample size) polynomial in 1/ε, 1/δ, and the size of the problem

PAC learning: infinite hypothesis space

• Bounds for a finite hypothesis space:

m ≥ (1/ε) ( ln|H| + ln(1/δ) ) (consistent case)

m ≥ (1/2ε²) ( ln|H| + ln(1/δ) ) (agnostic case)

errorD(h) ≤ errortrain(h) + √( (ln|H| + ln(1/δ)) / 2m )

• What if the hypothesis space is infinite, i.e. |H| → ∞?

VC dimension
• VC (Vapnik–Chervonenkis) dimension, VC(H), is a measure of the capacity of a classification algorithm (or hypothesis space)

• defined as the cardinality (size) of the largest set of points that the algorithm can shatter


• Shatter: H shatters a set of points if, for every possible labeling of those points, some hypothesis in H classifies all of them correctly

VC dimension example
• Linear classifiers in 2-D

[Figure: three points in general position, separated correctly under every labeling]

• 1-Nearest neighbor method?
• VC(H) = ∞: 1-NN can realize any labeling of any set of distinct training points (its training error is always zero)

• Decision tree with k boolean variables
• VC(H) = 2^k, because we can shatter 2^k examples using a tree with 2^k leaf nodes, and we cannot shatter 2^k + 1 examples

For linear classifiers in d dimensions, we can shatter d+1 points: VC(H) = d+1 (verified by brute force in the sketch below)

http://www.cs.cmu.edu/~guestrin/Class/15781/slides/learningtheory-bns-annotated.pdf
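The 2-D case can be checked computationally. The sketch below is a hypothetical verification (the perceptron-based separability test is our choice, not from the slides) that all 2^3 labelings of three points in general position are realizable by a linear classifier sign(w·x + b):

```python
import itertools
import numpy as np

points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])  # general position

def separable(X, y, epochs=10_000):
    """Perceptron test: True if some (w, b) realizes labels y in {-1, +1}."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, label in zip(X, y):
            if label * (w @ x + b) <= 0:          # point misclassified
                w, b = w + label * x, b + label   # perceptron update
                mistakes += 1
        if mistakes == 0:
            return True        # converged, so the labeling is separable
    return False               # gave up, treat as not separable

labelings = itertools.product([-1, 1], repeat=3)
print(all(separable(points, np.array(y)) for y in labelings))  # True: shattered
```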

VC(H) and |H|
• Is there any relation between VC(H) and |H|?

• Let VC(H) = k (k a finite value)

• Then H can shatter some set of k examples

• There are 2^k possible labelings of these k examples

• Since H realizes every one of these 2^k labelings, we get |H| ≥ 2^k

• Hence VC(H) = k ≤ log₂|H| (this is a very loose bound)

Bounds for infinite hypothesis space
• In PAC learning, we seek a theory to relate:
• The number of training samples: m
• The gap between training and true errors: errorD(h) ≤ errortrain(h) + ε
• The VC dimension: VC(H) (|H| and VC(H) are related)
• The confidence of the relation: at least (1-δ)

• The bound on m:

m ≥ (1/ε) ( 4 log₂(2/δ) + 8 VC(H) log₂(13/ε) )

• The bound on the true error (for VC(H) < 2m), an increasing function of VC(H):

errorD(h) ≤ errortrain(h) + √( ( VC(H) (ln(2m/VC(H)) + 1) + ln(4/δ) ) / m )

For sufficiently large training data, Occam’s razor still works here
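Assuming the sample-size bound reconstructed above (the form quoted in Mitchell's textbook, due to Blumer et al.), a sketch of evaluating it:

```python
import math

def vc_sample_complexity(vc_dim: int, eps: float, delta: float) -> int:
    # m >= (1/eps) * (4*log2(2/delta) + 8*VC(H)*log2(13/eps))
    return math.ceil((4 * math.log2(2 / delta)
                      + 8 * vc_dim * math.log2(13 / eps)) / eps)

# Example: linear classifiers in 2-D, so VC(H) = d + 1 = 3.
print(vc_sample_complexity(3, eps=0.1, delta=0.05))   # ≈ 1899 examples
```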

Structural Risk Minimization

• Question: Is there a better criterion for choosing a hypothesis than empirical risk minimization?

• Answer: choose H (and h) to minimize the bound on the expected true error (Structural Risk Minimization)

h = argmin over h in H of [ errortrain(h) + √( (ln|H| + ln(1/δ)) / 2m ) ]

• i.e. pick the hypothesis that minimizes the structural risk
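A toy sketch of SRM as model selection over nested hypothesis classes; the class sizes and ERM training errors below are invented for illustration:

```python
import math

def structural_risk(error_train: float, h_size: int, m: int, delta: float) -> float:
    # Training error plus the finite-|H| overfitting penalty from above.
    return error_train + math.sqrt(
        (math.log(h_size) + math.log(1.0 / delta)) / (2 * m))

m, delta = 1000, 0.05
candidates = [          # (|H_i|, training error of the ERM hypothesis in H_i)
    (10, 0.150),        # very simple class, underfits
    (10_000, 0.080),    # moderate class
    (10 ** 12, 0.075),  # slightly better fit but far more complex
]
best = min(candidates, key=lambda c: structural_risk(c[1], c[0], m, delta))
print(best)             # (10000, 0.08): SRM prefers the moderate class
```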

Appendix: bound for finite hypothesis space

• We will call a hypothesis h’ “bad” if err(h’) > ε; let B be the set of bad hypotheses
• If err(h) > ε for the selected consistent h, then there must exist some h’ in B such that h’ is consistent with the m training data points
• h is one example, and there could be many more
• This implies (by the union bound, and since a bad hypothesis is consistent with one random example with probability at most 1-ε):

Pr[ err(h) > ε ] ≤ Pr[ ∃ h’ ∈ B consistent with m examples ] ≤ |B| (1-ε)^m ≤ |H| (1-ε)^m ≤ |H| e^(-εm)