Post on 21-Dec-2015
Revealing inductive biases with Bayesian models
Tom Griffiths, UC Berkeley
with Mike Kalish, Brian Christian, and Steve Lewandowsky
Inductive problems
blicket toma
dax wug
blicket wug
S → X Y
X → {blicket, dax}
Y → {toma, wug}
Learning languages from utterances
Learning functions from (x,y) pairs
Learning categories from instances of their members
Generalization requires induction
Generalization: predicting the properties of an entity from observed properties of others
What makes a good inductive learner?
• Hypothesis 1: more representational power
  – more hypotheses, more complexity
  – the spirit of many accounts of learning and development
Some hypothesis spaces
Linear functions: g(x) = p1 x + p0
Quadratic functions: g(x) = p2 x^2 + p1 x + p0
8th-degree polynomials: g(x) = Σ_{j=0}^{8} p_j x^j
Minimizing squared error
[figures omitted: least-squares fits in each hypothesis space]
Measuring prediction error
[figure omitted]
What makes a good inductive learner?
• Hypothesis 1: more representational power
  – more hypotheses, more complexity
  – the spirit of many accounts of learning and development
• Hypothesis 2: good inductive biases
  – constraints on hypotheses that match the environment
Outline
The bias-variance tradeoff
Bayesian inference and inductive biases
Revealing inductive biases
Conclusions
A simple schema for induction
• Data D are n pairs (x,y), generated from a function f
• Hypothesis space of functions, y = g(x)
• Error is E = (y − g(x))^2
• Pick the function g that minimizes error on D
• Measure prediction error, averaging over x and y
Bias and variance
• A good learner makes (f(x) − g(x))^2 small
• g is chosen on the basis of the data D
• Evaluate learners by the average of (f(x) − g(x))^2 over datasets D generated from f

E_p(D)[(f(x) − g(x))^2] = (f(x) − E_p(D)[g(x)])^2 + E_p(D)[(g(x) − E_p(D)[g(x)])^2]
                                 (bias)                      (variance)
(Geman, Bienenstock, & Doursat, 1992)
Making things more intuitive…
• The next few slides were generated by:
  – choosing a true function f(x)
  – generating a number of datasets D from p(x,y), defined by uniform p(x) and p(y|x) = f(x) plus noise
  – finding the function g(x) in the hypothesis space that minimized the error on D
• Comparing average of g(x) to f(x) reveals bias
• Spread of g(x) around average is the variance
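The procedure above can be sketched directly in a short simulation. The quadratic f(x), the noise level, and the dataset counts below are assumed for illustration; only the procedure follows the text.

```python
# Bias-variance simulation sketch. The true function, noise level, and
# counts are assumed for illustration; only the procedure follows the text.
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    return 1.0 + 2.0 * x - 3.0 * x ** 2   # assumed "true" quadratic f(x)

def bias_variance(degree, n=10, n_datasets=1000, noise=0.5):
    xs = np.linspace(0, 1, 50)                 # grid for comparing g(x) to f(x)
    fits = np.empty((n_datasets, xs.size))
    for i in range(n_datasets):
        x = rng.uniform(0, 1, n)               # uniform p(x)
        y = f(x) + rng.normal(0, noise, n)     # p(y|x) = f(x) plus noise
        coeffs = np.polyfit(x, y, degree)      # least-squares g(x) in this space
        fits[i] = np.polyval(coeffs, xs)
    g_bar = fits.mean(axis=0)                  # average g(x) across datasets
    bias = np.mean((f(xs) - g_bar) ** 2)       # squared distance from f(x)
    variance = np.mean(fits.var(axis=0))       # spread of g(x) around its average
    return bias, variance

for degree in (1, 2, 8):
    b, v = bias_variance(degree)
    print(f"degree {degree}: bias = {b:.3f}, variance = {v:.3f}")
```

Running this reproduces the qualitative pattern on the following slides: high bias for the linear fits, low bias and low variance for the quadratic fits, and low bias but very high variance for the 8th-degree fits.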
Linear functions (n = 10)
[figure omitted: pink shows g(x) for each dataset, red the average g(x), black the true f(x); the gap between red and black is the bias, the spread of pink around red the variance]
Quadratic functions (n = 10)
[figure omitted: pink shows g(x) for each dataset, red the average g(x), black the true f(x)]
8th-degree polynomials (n = 10)
[figure omitted: pink shows g(x) for each dataset, red the average g(x), black the true f(x)]
Bias and variance (for our quadratic f(x), with n = 10)
• Linear functions: high bias, medium variance
• Quadratic functions: low bias, low variance
• 8th-order polynomials: low bias, super-high variance
In general…
• Larger hypothesis spaces result in higher variance, but low bias across a range of f(x)
• The bias-variance tradeoff:
  – if we want a learner that has low bias on a range of problems, we pay a price in variance
• This is mainly an issue when n is small
  – the regime of much of human learning
Quadratic functions (n = 100)
[figure omitted: pink shows g(x) for each dataset, red the average g(x), black the true f(x)]
8th-degree polynomials (n = 100)
[figure omitted: pink shows g(x) for each dataset, red the average g(x), black the true f(x)]
The moral
• General-purpose learning mechanisms do not work well with small amounts of data
  – more representational power isn’t always better
• To make good predictions from small amounts of data, you need a bias that matches the problem
  – these biases are the key to successful induction, and characterize the nature of an inductive learner
• So… how can we identify human inductive biases?
Outline
The bias-variance tradeoff
Bayesian inference and inductive biases
Revealing inductive biases
Conclusions
Bayesian inference
[portrait of Reverend Thomas Bayes omitted]
• Rational procedure for updating beliefs
• Foundation of many learning algorithms
• Lets us make the inductive biases of learners precise
Bayes’ theorem
P(h|d) = P(d|h) P(h) / Σ_{h′ ∈ H} P(d|h′) P(h′)

posterior probability = likelihood × prior probability, normalized by a sum over the space of hypotheses
h: hypothesis
d: data
Priors and biases
• Priors indicate the kind of world a learner expects to encounter, guiding their conclusions
• In our function learning example…
  – the likelihood gives probability to data that decreases with summed squared error (i.e., a Gaussian)
  – priors are uniform over all functions in hypothesis spaces of different kinds of polynomials
  – having more functions corresponds to a belief in a more complex world…
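As a toy numeric illustration of the update, Bayes' rule over a discrete hypothesis space can be written in a few lines. The three hypotheses, their prior, and the likelihood values below are invented for the example.

```python
# Toy Bayes' rule update over a discrete hypothesis space.
# The prior and likelihood numbers are invented for illustration.
import numpy as np

def posterior(prior, likelihood):
    """P(h|d) ∝ P(d|h) P(h), normalized over the hypothesis space H."""
    unnormalized = likelihood * prior
    return unnormalized / unnormalized.sum()

prior = np.array([0.5, 0.3, 0.2])        # P(h) for three hypotheses
likelihood = np.array([0.1, 0.6, 0.3])   # P(d|h) for some observed d
print(posterior(prior, likelihood))      # posterior shifts toward hypothesis 2
```

In the function-learning case, the likelihood would be the Gaussian term above (falling off with summed squared error) and the prior uniform over the chosen polynomial space.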
Outline
The bias-variance tradeoff
Bayesian inference and inductive biases
Revealing inductive biases
Conclusions
Two ways of using Bayesian models
• Specify models that make different assumptions about priors, and compare their fit to human data
(Anderson & Schooler, 1991;
Oaksford & Chater, 1994;
Griffiths & Tenenbaum, 2006)
• Design experiments explicitly intended to reveal the priors of Bayesian learners
Iterated learning (Kirby, 2001)
What are the consequences of learners learning from other learners?
Objects of iterated learning
• Knowledge communicated across generations through provision of data by learners
• Examples:– religious concepts– social norms– myths and legends– causal theories– language
Analyzing iterated learning
[diagram omitted: a chain of learners alternating PL(h|d) and PP(d|h)]
PL(h|d): probability of inferring hypothesis h from data d
PP(d|h): probability of generating data d from hypothesis h
Markov chains
x → x → x → x → x → x → x → x
Transition matrix T = P(x(t+1)|x(t))
• Variables x(t+1) are independent of history given x(t)
• Converges to a stationary distribution under easily checked conditions (e.g., ergodicity)
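As a quick numerical illustration, the stationary distribution of an ergodic chain is the left eigenvector of T with eigenvalue 1. The two-state transition matrix below is made up for the example.

```python
# Stationary distribution of an ergodic Markov chain: the left eigenvector
# of T with eigenvalue 1. The transition matrix here is a made-up example.
import numpy as np

T = np.array([[0.9, 0.1],      # T[i, j] = P(x(t+1) = j | x(t) = i)
              [0.5, 0.5]])

vals, vecs = np.linalg.eig(T.T)                  # left eigenvectors of T
pi = np.real(vecs[:, np.argmax(np.real(vals))])  # eigenvector for eigenvalue 1
pi = pi / pi.sum()                               # normalize to a probability vector
print(pi)                                        # the stationary distribution
assert np.allclose(pi @ T, pi)                   # pi is unchanged by one more step
```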
Analyzing iterated learning
d0 → h1 → d1 → h2 → d2 → h3 → … (alternating PL(h|d) and PP(d|h))
Collapsing each PP(d|h)-then-PL(h|d) step gives a Markov chain on hypotheses: h1 → h2 → h3 → …
Collapsing each PL(h|d)-then-PP(d|h) step gives a Markov chain on data: d0 → d1 → d2 → …
Iterated Bayesian learning
[diagram omitted: a chain of Bayesian learners alternating PL(h|d) and PP(d|h)]
Assume learners sample from their posterior distribution:
PL(h|d) = PP(d|h) P(h) / Σ_{h′ ∈ H} PP(d|h′) P(h′)
Stationary distributions
• The Markov chain on h converges to the prior, P(h)
• The Markov chain on d converges to the “prior predictive distribution”:
P(d) = Σ_h P(d|h) P(h)
(Griffiths & Kalish, 2005)
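This convergence result can be checked with a toy simulation. The hypothesis space, prior, and likelihood values below are invented; learners sample from their posterior, as assumed above.

```python
# Iterated Bayesian learning as a Markov chain on hypotheses.
# Toy discrete model: the prior and likelihood values are invented.
import numpy as np

rng = np.random.default_rng(1)
prior = np.array([0.6, 0.3, 0.1])      # P(h) over three hypotheses
lik = np.array([[0.7, 0.3],            # P(d|h): row = hypothesis, column = datum
                [0.4, 0.6],
                [0.1, 0.9]])

h = 0
counts = np.zeros(3)
for _ in range(200_000):
    d = rng.choice(2, p=lik[h])        # current learner generates data from h
    post = lik[:, d] * prior           # next learner's posterior P(h|d)
    post /= post.sum()
    h = rng.choice(3, p=post)          # next learner samples h from the posterior
    counts[h] += 1

print(counts / counts.sum())           # close to the prior [0.6, 0.3, 0.1]
```

The empirical distribution of hypotheses along the chain matches the prior, regardless of the starting hypothesis.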
Explaining convergence to the prior
[diagram omitted: the iterated learning chain, alternating PL(h|d) and PP(d|h)]
• Intuitively: the data act once, the prior many times
• Formally: iterated learning with Bayesian agents is a Gibbs sampler on P(d,h)
(Griffiths & Kalish, in press)
Revealing inductive biases
• If iterated learning converges to the prior, it might provide a tool for determining the inductive biases of human learners
• We can test this by reproducing iterated learning in the lab, with stimuli for which human biases are well understood
Iterated function learning
• Each learner sees a set of (x,y) pairs
• Makes predictions of y for new x values
• Predictions are data for the next learner
(Kalish, Griffiths, & Lewandowsky, in press)
Function learning experiments
[screenshot omitted: stimulus and response displays with a slider and feedback]
Examine iterated learning with different initial data
[figure omitted: chains over iterations 1–9, with rows for different initial data]
Identifying inductive biases
• Formal analysis suggests that iterated learning provides a way to determine inductive biases
• Experiments with human learners support this idea
  – when stimuli for which biases are well understood are used, those biases are revealed by iterated learning
• What do inductive biases look like in other cases?
  – continuous categories
  – causal structure
  – word learning
  – language learning
Outline
The bias-variance tradeoff
Bayesian inference and inductive biases
Revealing inductive biases
Conclusions
Conclusions
• Solving inductive problems and forming good generalizations requires good inductive biases
• Bayesian inference provides a way to make assumptions about the biases of learners explicit
• Two ways to identify human inductive biases:
  – compare Bayesian models assuming different priors
  – design tasks to extract biases from Bayesian learners
• Iterated learning provides a lens for magnifying the inductive biases of learners
  – small effects for individuals are big effects for groups
Iterated concept learning
• Each learner sees examples from a species
• Identifies the species, a set of four amoebae
• Iterated learning is run within-subjects
(Griffiths, Christian, & Kalish, in press)
Two positive examples
[figure omitted: data (d) and hypotheses (h)]
Bayesian model (Tenenbaum, 1999; Tenenbaum & Griffiths, 2001)

P(h|d) = P(d|h) P(h) / Σ_{h′ ∈ H} P(d|h′) P(h′)
d: 2 amoebae
h: set of 4 amoebae

P(d|h) = (1/|h|)^m if d ∈ h, 0 otherwise
m: # of amoebae in the set d (= 2)
|h|: # of amoebae in the set h (= 4)

P(h|d) = P(h) / Σ_{h′ : d ∈ h′} P(h′)
The posterior is the renormalized prior.
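A minimal sketch of this model follows. The hypothesis sets and prior values below are illustrative; because every |h| is equal (4 in the amoeba task), the posterior over the hypotheses containing d is just the renormalized prior.

```python
# Size-principle Bayesian model: P(d|h) = (1/|h|)^m when every example in d
# lies in h, else 0. Hypotheses and prior values here are illustrative.
def posterior(d, hypotheses, prior):
    """P(h|d) ∝ (1/|h|)^|d| P(h), over hypotheses containing all of d."""
    weights = {name: (1.0 / len(members)) ** len(d) * prior[name]
               for name, members in hypotheses.items() if d <= members}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}

hypotheses = {"A": {1, 2, 3, 4}, "B": {1, 2, 5, 6}, "C": {3, 4, 5, 6}}
prior = {"A": 0.6, "B": 0.3, "C": 0.1}

# Only A and B contain both examples; since |A| = |B| = 4, the posterior is
# the prior renormalized over {A, B}: {"A": 2/3, "B": 1/3}.
print(posterior({1, 2}, hypotheses, prior))
```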
What is the prior?
Classes of concepts (Shepard, Hovland, & Jenkins, 1961)
[figure omitted: the six classes of concepts, defined over the dimensions shape, size, and color]
Experiment design (for each subject)
• 6 iterated learning chains, one for each of Classes 1–6
• 6 independent learning “chains”, one for each of Classes 1–6
Estimating the prior
[figure omitted: data (d) and hypotheses (h)]
Estimating the prior

Class   Prior
1       0.861
2       0.087
3       0.009
4       0.002
5       0.013
6       0.028

[scatter plot omitted: Bayesian model vs. human subjects, r = 0.952]
Two positive examples (n = 20)
[plots omitted: probability over iterations, human learners vs. Bayesian model]
Two positive examples (n = 20)
[plot omitted: probability, Bayesian model vs. human learners]
Three positive examples
[figure omitted: data (d) and hypotheses (h)]
Three positive examples (n = 20)
[plots omitted: probability over iterations, human learners vs. Bayesian model]
Three positive examples (n = 20)
[plot omitted: probability, Bayesian model vs. human learners]
Serial reproduction (Bartlett, 1932)
• Participants see stimuli, then reproduce them from memory
• Reproductions of one participant are stimuli for the next
• Stimuli were interesting, rather than controlled
  – e.g., “War of the Ghosts”
Discovering the biases of models
Generic neural network:
[figure omitted]
EXAM (Delosh, Busemeyer, & McDaniel, 1997):
[figure omitted]
POLE (Kalish, Lewandowsky, & Kruschke, 2004):