LING 696B: Midterm review: parametric and non-parametric inductive inference


Page 1:

LING 696B: Midterm review: parametric and non-parametric inductive inference

Pages 2-5:

Big question: How do people generalize?

Examples related to language:
- Categorizing new stimuli
- Assigning structure to a signal
- Telling whether a form is grammatical

What is the nature of inductive inference? What role does statistics play?

Page 6:

Two paradigms of statistical learning (I)

Fisher's paradigm: inductive inference through likelihood -- p(X|θ)
- X: observed set of data
- θ: parameters of the probability density function p, or an interpretation of X
- We expect X to come from an infinite population following p(X|θ)
- Representational bias: the form of p(X|θ) constrains what kinds of things you can learn

Page 7:

Learning in Fisher's paradigm

Philosophy: find the infinite population so that the chance of seeing X is large (idea from Bayes)
- Knowing the universe by seeing individuals
- Randomness is due to the finiteness of X

Maximum likelihood: find θ so that p(X|θ) reaches its maximum

Natural consequence: the more X you see, the better you learn about p(X|θ)
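
A minimal sketch of the maximum-likelihood idea, not from the original slides: for a single one-dimensional Gaussian model the ML estimate of θ = (μ, σ) has a closed form, and the fitted likelihood describes the data better as more of X is observed. The numbers and variable names are illustrative only; numpy is assumed.

```python
# Maximum likelihood in Fisher's paradigm: choose theta so that p(X|theta) is maximal.
# Illustrative sketch for a one-dimensional Gaussian model of a single category.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=600.0, scale=50.0, size=200)   # observed data, e.g. simulated F1 values

# For a Gaussian, the ML estimates have a closed form:
mu_hat = X.mean()                # maximizes the likelihood over the mean
sigma_hat = X.std(ddof=0)        # ML uses the 1/N (biased) variance estimate

# Log-likelihood of the data under the fitted model, log p(X|theta_hat)
log_lik = -0.5 * np.sum(np.log(2 * np.pi * sigma_hat**2) + (X - mu_hat)**2 / sigma_hat**2)
print(mu_hat, sigma_hat, log_lik)
```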

Page 8:

Extending Fisher's paradigm to complex situations

Statisticians cannot specify p(X|θ) for you!
- It must come from an understanding of the structure that generates X, e.g. grammar
- Needs a supporting theory that guides the construction of p(X|θ) -- "language is special"

Extending p(X|θ) to include hidden variables: the EM algorithm

Making bigger models from smaller models: iterative learning through coordinate-wise ascent

Page 9:

Example: unsupervised learning of categories
- X: instances of pre-segmented speech sounds
- θ: mixture of a fixed number of category models
- Representational bias: discreteness; distribution of each category (bias from mixture components)
- Hidden variable: category membership
- Learning: EM algorithm (see the sketch below)
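
As a concrete illustration of this kind of EM-based category learner (not the course's actual model or data), the sketch below fits a two-component Gaussian mixture to simulated two-dimensional "formant" measurements with scikit-learn, whose GaussianMixture runs EM internally; the inferred component membership plays the role of the hidden category variable.

```python
# Unsupervised category learning with a mixture model fit by EM.
# Sketch only: the (F1, F2)-like data are simulated, not real speech measurements.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
X = np.vstack([
    rng.normal([300, 2300], [40, 120], size=(100, 2)),   # pretend category 1
    rng.normal([700, 1200], [60, 100], size=(100, 2)),   # pretend category 2
])

K = 2                                    # the number of categories is fixed in advance
gmm = GaussianMixture(n_components=K, random_state=0).fit(X)   # EM happens inside fit()
labels = gmm.predict(X)                  # hidden variable: inferred category membership
print(gmm.means_)                        # learned category centers
```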

Page 10:

Example: unsupervised learning of phonological words
- X: instances of word-level signals
- θ: mixture model + phonotactic model + word segmentation
- Representational bias: discreteness; distribution of each category (bias from mixture components); combinatorial structure of phonological words
- Learning: coordinate-wise ascent

Page 11:

From Fisher's paradigm to Bayesian learning

Bayesian: wants to learn the posterior distribution p(θ|X)

Bayes' formula: p(θ|X) ∝ p(X|θ) p(θ) = p(X, θ)
- Same as ML when p(θ) is uniform
- Still needs a theory guiding the construction of p(θ) and p(X|θ) -- more on this later
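
A small grid-based sketch of the Bayesian formula (an illustration, not part of the original slides): with a uniform prior over a grid of candidate means, the posterior mode coincides with the maximum-likelihood estimate, as the slide notes. numpy and scipy are assumed; the data are simulated.

```python
# Bayesian learning on a grid: p(theta|X) is proportional to p(X|theta) p(theta).
# With a uniform prior, the posterior mode matches the ML estimate.
import numpy as np
from scipy.stats import norm

X = np.random.default_rng(2).normal(600.0, 50.0, size=30)   # simulated data
thetas = np.linspace(400, 800, 1001)                         # candidate values of the mean

log_lik = norm.logpdf(X[:, None], loc=thetas, scale=50.0).sum(axis=0)   # log p(X|theta)
log_prior = np.zeros_like(thetas)                                       # uniform p(theta)
log_post = log_lik + log_prior
post = np.exp(log_post - log_post.max())
post /= post.sum() * (thetas[1] - thetas[0])                 # normalize to a density

print(thetas[np.argmax(post)], X.mean())                     # posterior mode ~ ML estimate
```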

Pages 12-14:

Attractions of generative modeling

Has clear semantics:
- p(X|θ) -- prediction/production/synthesis
- p(θ) -- belief/prior knowledge/initial bias
- p(θ|X) -- perception/interpretation

Can make "infinite generalizations": synthesizing from p(X, θ) can tell us something about the generalization

A very general framework
- Theory of everything?

Pages 15-20:

Challenges to generative modeling

The representational bias can be wrong
- But "all models are wrong"

Unclear how to choose from different classes of models
- E.g. the destiny of K (the number of mixture components)
- Simplicity is relative, e.g. f(x) = a*sin(bx) + c

Computing max_θ p(X|θ) can be very hard
- Bayesian computation may help

Pages 21-23:

Challenges to generative modeling

Even finding X can be hard for language
- Probability distribution over what?

Example: X for statistical syntax?
- Strings of words
- Parse trees
- Semantic interpretations
- Social interactions

Hope: staying at low levels of language will make the choice of X easier

Page 24:

Two paradigms of statistical learning (II)

Vapnik's critique of generative modeling: "Why solve a more general problem before solving a specific one?"

Example: generative approach to 2-class classification (supervised)
- Likelihood ratio test: log[p(x|A)/p(x|B)]
- A, B are parametric models (see the sketch below)
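
A minimal version of this generative classifier, under simplifying assumptions not in the slide (one-dimensional data, Gaussian class models, equal priors): fit p(x|A) and p(x|B) by ML and decide by the sign of the log likelihood ratio.

```python
# Generative 2-class classification by likelihood ratio: decide by log[p(x|A)/p(x|B)].
# Sketch with one-dimensional Gaussian class-conditional models and toy data.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
A = rng.normal(0.0, 1.0, size=100)       # training data for class A
B = rng.normal(2.0, 1.0, size=100)       # training data for class B

muA, sdA = A.mean(), A.std()             # ML estimates of each class-conditional density
muB, sdB = B.mean(), B.std()

def classify(x):
    log_ratio = norm.logpdf(x, muA, sdA) - norm.logpdf(x, muB, sdB)
    return "A" if log_ratio > 0 else "B"

print(classify(0.3), classify(1.8))
```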

Page 25:

Non-parametric approach to inductive inference

Main idea: we don't want to know the universe first and then generalize
- The universe is complicated; the representational bias is often inappropriate
- Very few data to learn from, compared to the dimensionality of the space

Instead, we want to generalize directly from old data to new data
- Rules vs. analogy?

Page 26:

Examples of non-parametric learning (I)

Nearest neighbor classification:
- Analogy-based learning by dictionary lookup
- Generalizes to K-nearest neighbors (see the sketch below)
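
A toy version of nearest-neighbor "dictionary lookup", assuming scikit-learn; the exemplars and labels are made up for illustration.

```python
# K-nearest-neighbor classification: a new item gets the majority label
# of its closest stored exemplars -- analogy by dictionary lookup.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X_train = np.array([[300, 2300], [320, 2250], [700, 1200], [680, 1250]])  # stored exemplars
y_train = np.array(["i", "i", "a", "a"])                                   # their category labels

knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
print(knn.predict([[310, 2280], [690, 1230]]))   # new stimuli classified by analogy
```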

Page 27:

Examples of non-parametric learning (II)

Radial basis networks for supervised learning: F(x) = Σ_i a_i K(x, x_i)
- K(x, x_i) is a non-linear similarity function centered at x_i, with tunable parameters
- Interpretation: "soft/smooth" dictionary lookup/analogy within a population
- Learning: find the a_i from (x_i, y_i) pairs -- a regularized regression problem: min_f Σ_i [f(x_i) - y_i]^2 + λ||f||^2 (see the sketch below)
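
A sketch of this regularized regression with Gaussian radial basis functions (one basis function per data point), solving (K + λI) a = y for the coefficients; the data, kernel width, and λ are illustrative choices, not values from the course.

```python
# RBF-network regression: F(x) = sum_i a_i K(x, x_i), with the a_i found by
# regularized least squares, i.e. solving (K + lam * I) a = y.
import numpy as np

def gauss_kernel(x, centers, width=0.5):
    """Gaussian similarity between each x and each center."""
    return np.exp(-(x[:, None] - centers[None, :])**2 / (2 * width**2))

rng = np.random.default_rng(4)
x_train = np.sort(rng.uniform(-3, 3, size=40))
y_train = np.sin(x_train) + rng.normal(0, 0.1, size=40)   # noisy samples of a smooth function

lam = 0.1                                    # regularization strength
K = gauss_kernel(x_train, x_train)           # one basis function per data point
a = np.linalg.solve(K + lam * np.eye(len(x_train)), y_train)

x_new = np.linspace(-3, 3, 7)
F_new = gauss_kernel(x_new, x_train) @ a     # local generalization from old data to new x
print(np.round(F_new, 2))
```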

Page 28:

Radial basis functions/networks
- Each data point x_i is associated with a K(x, x_i) -- a radial basis function
- Linear combinations of enough K(x, x_i) can approximate any smooth function from R^n to R: the universal approximation property
- Network interpretation (see demo)

Page 29:

How is this different from generative modeling?
- Does not assume a fixed space in which to search for the best hypothesis
- Instead, this space grows with the amount of data
- Basis of the space: K(x, x_i)
- Interpretation: local generalization from old data x_i to new data x
- F(x) = Σ_i a_i K(x, x_i) represents an ensemble generalization from {x_i} to x

Page 30:

Examples of non-parametric learning (III)

Support Vector Machines (last time): linear separation
- f(x) = sign(<w,x> + b)

Page 31:

Max margin classification
- The solution is also a direct generalization from old data, but sparse: the coefficients are mostly zero
- f(x) = sign(<w,x> + b)

Page 32:

Interpretation of support vectors
- Support vectors have a non-zero contribution to the generalization; the other coefficients are mostly zero
- They act as "prototypes" for analogical learning
- f(x) = sign(<w,x> + b)

Page 33:

Kernel generalization of SVM

The solution looks very much like RBF networks:
- RBF net: F(x) = Σ_i a_i K(x, x_i) -- many old data contribute to the generalization
- SVM: F(x) = sign(Σ_i a_i K(x, x_i) + b) -- relatively few old data contribute
- The dense/sparse solution is due to different goals (see demo, and the sketch below)
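
A small comparison point, using scikit-learn's SVC on toy two-class data (an illustration, not the course demo): the fitted kernel SVM has the same functional form as an RBF net, but only the support vectors get non-zero coefficients.

```python
# Kernel SVM: F(x) = sign(sum_i a_i K(x, x_i) + b), with most a_i exactly zero.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(5)
XA = rng.normal([-1.0, -1.0], 0.5, size=(50, 2))   # toy class A
XB = rng.normal([+1.0, +1.0], 0.5, size=(50, 2))   # toy class B
X = np.vstack([XA, XB])
y = np.array([0] * 50 + [1] * 50)

clf = SVC(kernel="rbf", C=1.0).fit(X, y)
print(len(clf.support_), "support vectors out of", len(X))   # sparse: few old data contribute
print(clf.predict([[0.0, 0.2], [-0.5, -1.5]]))               # generalization to new points
```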

Page 34:

Transductive inference with support vectors

One more wrinkle: now I'm putting two points there, but don't tell you the color

Page 35:

Transductive SVM

Not only do the old data affect the generalization, the new data affect each other too

Page 36:

A general view of non-parametric inductive inference

A function approximation problem: knowing that (x_1, y_1), ..., (x_N, y_N) are inputs and outputs of some unknown function F, how can we approximate F and generalize to new values of x?
- Linguistics: find the universe for F
- Psychology: find the best model that "behaves" like F
- In realistic terms, non-parametric methods often win

Pages 37-39:

Who's got the answer?

The parametric approach can also approximate functions
- Model the joint distribution p(x, y|θ)
- But the model is often difficult to build, e.g. for a realistic experimental task

Before reaching a conclusion, we need to know how people learn
- They may be doing both

Pages 40-42:

Where do neural nets fit?

Clearly not generative: they do not reason with probability

Somewhat different from analogy-type non-parametric methods: the network does not directly reason from old data
- Difficult to interpret the generalization

Some results are available for limiting cases
- Similar to non-parametric methods when the hidden units are infinite

Page 43:

A point that nobody gets right

Small sample dilemma: people learn from very few examples (compared to the dimension of the data), yet any statistical machinery needs many
- Parametric: the ML estimate approaches the true distribution only with an infinite sample
- Non-parametric: universal approximation requires an infinite sample
- The limit is taken in the wrong direction