LING 696B: Midterm review: parametric and non-parametric inductive inference


1

LING 696B: Midterm review: parametric and non-parametric inductive inference

2

Big question: How do people generalize?

3

Big question: How do people generalize?
Examples related to language:
- Categorizing a new stimulus
- Assigning structure to a signal
- Telling whether a form is grammatical

4

Big question: How do people generalize?
Examples related to language:
- Categorizing a new stimulus
- Assigning structure to a signal
- Telling whether a form is grammatical

What is the nature of inductive inference?

5

Big question: How do people generalize?
Examples related to language:
- Categorizing a new stimulus
- Assigning structure to a signal
- Telling whether a form is grammatical

What is the nature of inductive inference?
What role does statistics play?

6

Two paradigms of statistical learning (I)
Fisher's paradigm: inductive inference through the likelihood p(X|θ)
- X: the observed set of data
- θ: parameters of the probability density function p, or an interpretation of X
- We expect X to come from an infinite population that follows p(X|θ)
- Representational bias: the form of p(X|θ) constrains what kinds of things you can learn

7

Learning in Fisher's paradigm
Philosophy: find the infinite population such that the chance of seeing X is large (an idea from Bayes)
- Knowing the universe by seeing individuals
- Randomness is due to the finiteness of X

Maximum likelihood: find θ so that p(X|θ) reaches its maximum
Natural consequence: the more X you see, the better you learn about p(X|θ)
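
To make the maximum-likelihood idea concrete, here is a minimal Python sketch for a 1-D Gaussian model; the toy data and the choice of a Gaussian p(X|θ) are illustrative assumptions, not from the slides:

    import numpy as np

    # Toy data: assume X is an i.i.d. sample from some unknown 1-D distribution.
    rng = np.random.default_rng(0)
    X = rng.normal(loc=2.0, scale=0.5, size=200)

    # If we posit p(X|theta) = Gaussian(mu, sigma), the maximum-likelihood
    # estimates have a closed form: the sample mean and sample standard deviation.
    mu_hat = X.mean()
    sigma_hat = X.std()          # ML uses 1/N, not the unbiased 1/(N-1)

    print(f"ML estimate: mu = {mu_hat:.3f}, sigma = {sigma_hat:.3f}")
    # With more data, (mu_hat, sigma_hat) gets closer to the parameters that
    # generated X -- the "more X you see" consequence mentioned on this slide.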

8

Extending Fisher's paradigm to complex situations
Statisticians cannot specify p(X|θ) for you!
- It must come from an understanding of the structure that generates X, e.g. a grammar
- Needs a supporting theory that guides the construction of p(X|θ) -- "language is special"

Extending p(X|θ) to include hidden variables: the EM algorithm
Building bigger models from smaller models: iterative learning through coordinate-wise ascent

9

Example: unsupervised learning of categories
- X: instances of pre-segmented speech sounds
- θ: mixture of a fixed number of category models
- Representational bias: discreteness; the distribution of each category (bias from the mixture components)
- Hidden variable: category membership
- Learning: EM algorithm
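
A minimal EM sketch for this kind of category learning, assuming two 1-D Gaussian categories; the toy data, K = 2, and all variable names are illustrative:

    import numpy as np

    rng = np.random.default_rng(1)
    # Toy "speech sound" measurements: an unlabeled sample from two categories.
    X = np.concatenate([rng.normal(-2, 1, 150), rng.normal(3, 1, 150)])

    K = 2                                   # assumed, fixed number of categories
    w = np.full(K, 1.0 / K)                 # mixing weights
    mu = rng.choice(X, K)                   # initial category means
    var = np.full(K, X.var())               # initial category variances

    def normal_pdf(x, m, v):
        return np.exp(-(x - m) ** 2 / (2 * v)) / np.sqrt(2 * np.pi * v)

    for _ in range(50):
        # E-step: posterior probability of each hidden category membership.
        resp = np.stack([w[k] * normal_pdf(X, mu[k], var[k]) for k in range(K)], axis=1)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate the parameters from the soft assignments.
        Nk = resp.sum(axis=0)
        w = Nk / len(X)
        mu = (resp * X[:, None]).sum(axis=0) / Nk
        var = (resp * (X[:, None] - mu) ** 2).sum(axis=0) / Nk

    print("weights:", w.round(2), "means:", mu.round(2))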

10

Example: unsupervised learning of phonological words
- X: instances of word-level signals
- θ: mixture model + phonotactic model + word segmentation
- Representational bias: discreteness; the distribution of each category (bias from the mixture components); the combinatorial structure of phonological words
- Learning: coordinate-wise ascent
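
A generic sketch of coordinate-wise ascent on a made-up two-block objective; in the word-learning model the blocks would instead be the mixture, phonotactic, and segmentation components:

    import numpy as np

    # Toy objective with two blocks of "parameters"; each block is updated
    # while the other is held fixed, and the objective never decreases.
    def f(a, b):
        return -(a - 3) ** 2 - (b - 1) ** 2 - 0.5 * (a - 3) * (b - 1)

    a, b = 0.0, 0.0
    grid = np.linspace(-5, 5, 1001)
    for step in range(20):
        # Coordinate 1: best a given the current b (1-D grid search for simplicity).
        a = grid[np.argmax(f(grid, b))]
        # Coordinate 2: best b given the new a.
        b = grid[np.argmax(f(a, grid))]

    print(f"a = {a:.2f}, b = {b:.2f}, f = {f(a, b):.3f}")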

11

From Fisher's paradigm to Bayesian learning
Bayesian: wants to learn the posterior distribution p(θ|X)
Bayes' formula: p(θ|X) ∝ p(X|θ) p(θ) = p(X, θ)
- Same as ML when p(θ) is uniform
- Still needs a theory guiding the construction of p(θ) and p(X|θ) -- more on this later
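
A tiny grid-approximation sketch of the Bayes formula, using made-up coin-flip data; with a uniform prior the posterior mode coincides with the ML estimate, as the slide notes:

    import numpy as np

    # theta = probability of heads; X = 7 heads out of 10 flips (made-up data).
    theta = np.linspace(0.001, 0.999, 999)
    likelihood = theta ** 7 * (1 - theta) ** 3       # p(X|theta)
    prior = np.ones_like(theta)                      # uniform p(theta)

    posterior = likelihood * prior                   # p(theta|X) up to a constant
    posterior /= posterior.sum()                     # normalize on the grid

    print("posterior mean:", (theta * posterior).sum().round(3))
    print("posterior mode:", theta[np.argmax(posterior)].round(3))  # equals the ML estimate 0.7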

12

Attractions of generative modeling
Has clear semantics:
- p(X|θ) -- prediction/production/synthesis
- p(θ) -- belief/prior knowledge/initial bias
- p(θ|X) -- perception/interpretation

13

Attractions of generative modeling
Has clear semantics:
- p(X|θ) -- prediction/production/synthesis
- p(θ) -- belief/prior knowledge/initial bias
- p(θ|X) -- perception/interpretation

Can make "infinite generalizations":
- Synthesizing from p(X, θ) can tell us something about the generalization

14

Attractions of generative modeling
Has clear semantics:
- p(X|θ) -- prediction/production/synthesis
- p(θ) -- belief/prior knowledge/initial bias
- p(θ|X) -- perception/interpretation

Can make "infinite generalizations":
- Synthesizing from p(X, θ) can tell us something about the generalization

A very general framework
- Theory of everything?
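
A minimal sketch of "synthesizing" from a generative model by ancestral sampling, assuming an already-fitted two-category Gaussian mixture; all parameter values are illustrative:

    import numpy as np

    rng = np.random.default_rng(2)
    # Assumed fitted parameters theta of a two-category mixture.
    weights = np.array([0.4, 0.6])
    means = np.array([-2.0, 3.0])
    stds = np.array([1.0, 0.8])

    # Ancestral sampling from p(X, theta): first draw the hidden category,
    # then draw an observation from that category's distribution.
    categories = rng.choice(2, size=1000, p=weights)
    samples = rng.normal(means[categories], stds[categories])

    # Inspecting the synthetic data shows what the model "believes" X looks like.
    print("synthetic mean:", samples.mean().round(2), "synthetic std:", samples.std().round(2))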

15

Challenges to generative modeling
- The representational bias can be wrong

16

Challenges to generative modeling
- The representational bias can be wrong
  - But "all models are wrong"

17

Challenges to generative modeling
- The representational bias can be wrong
  - But "all models are wrong"
- Unclear how to choose from different classes of models

18

Challenges to generative modeling
- The representational bias can be wrong
  - But "all models are wrong"
- Unclear how to choose from different classes of models
  - E.g. the destiny of K (the number of mixture categories)

19

Challenges to generative modeling
- The representational bias can be wrong
  - But "all models are wrong"
- Unclear how to choose from different classes of models
  - E.g. the destiny of K (the number of mixture categories)
  - Simplicity is relative, e.g. f(x) = a*sin(bx) + c

20

Challenges to generative modeling
- The representational bias can be wrong
  - But "all models are wrong"
- Unclear how to choose from different classes of models
  - E.g. the destiny of K (the number of mixture categories)
  - Simplicity is relative, e.g. f(x) = a*sin(bx) + c
- Computing max{p(X|θ)} can be very hard
  - Bayesian computation may help
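
As one example of the "Bayesian computation" alluded to here, a minimal random-walk Metropolis sampler for a posterior over a single parameter; the model, data, and tuning constants are all assumptions:

    import numpy as np

    rng = np.random.default_rng(3)
    X = rng.normal(1.5, 1.0, 50)                    # toy data

    def log_posterior(theta):
        # Gaussian likelihood with known variance 1, plus a broad Gaussian prior.
        return -0.5 * np.sum((X - theta) ** 2) - 0.5 * (theta / 10) ** 2

    theta, samples = 0.0, []
    for _ in range(5000):
        proposal = theta + rng.normal(0, 0.3)       # random-walk proposal
        # Accept with probability min(1, p(proposal|X) / p(theta|X)).
        if np.log(rng.uniform()) < log_posterior(proposal) - log_posterior(theta):
            theta = proposal
        samples.append(theta)

    print("posterior mean estimate:", np.mean(samples[1000:]).round(2))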

21

Challenges to generative modeling
- Even finding X can be hard for language

22

Challenges to generative modeling
- Even finding X can be hard for language
  - Probability distribution over what?
- Example: statistical syntax, choices of X
  - Strings of words
  - Parse trees
  - Semantic interpretations
  - Social interactions

23

Challenges to generative modeling
- Even finding X can be hard for language
  - Probability distribution over what?
- Example: X for statistical syntax?
  - Strings of words
  - Parse trees
  - Semantic interpretations
  - Social interactions
- Hope: staying at low levels of language will make the choice of X easier

24

Two paradigms of statistical learning (II)
Vapnik's critique of generative modeling: "Why solve a more general problem before solving a specific one?"
Example: generative approach to 2-class classification (supervised)
- Likelihood ratio test: log[p(x|A) / p(x|B)]
- A, B are parametric models
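
A minimal sketch of this generative route to 2-class classification, with Gaussian class-conditional models A and B fit by maximum likelihood; the data are made up:

    import numpy as np

    rng = np.random.default_rng(4)
    # Labeled training data for two classes (toy 1-D features).
    xA = rng.normal(-1.0, 1.0, 100)
    xB = rng.normal(2.0, 1.5, 100)

    # Fit a parametric model p(x|class) for each class by maximum likelihood.
    muA, sA = xA.mean(), xA.std()
    muB, sB = xB.mean(), xB.std()

    def log_normal(x, mu, s):
        return -0.5 * ((x - mu) / s) ** 2 - np.log(s) - 0.5 * np.log(2 * np.pi)

    def classify(x):
        # Likelihood ratio test: log[p(x|A) / p(x|B)] > 0  ->  choose A.
        return "A" if log_normal(x, muA, sA) - log_normal(x, muB, sB) > 0 else "B"

    print(classify(-0.5), classify(3.0))   # expected: A B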

25

Non-parametric approach to inductive inference
Main idea: we don't want to know the universe first and then generalize
- The universe is complicated; the representational bias is often inappropriate
- Very little data to learn from, compared to the dimensionality of the space

Instead, we want to generalize directly from old data to new data
- Rules vs. analogy?

26

Examples of non-parametric learning (I)
Nearest-neighbor classification:
- Analogy-based learning by dictionary lookup
- Generalizes to K-nearest neighbors
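
A minimal K-nearest-neighbor sketch in plain NumPy; the 2-D toy data and K = 3 are illustrative:

    import numpy as np

    rng = np.random.default_rng(5)
    # Stored "dictionary" of old, labeled examples (2-D points, two classes).
    X_old = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
    y_old = np.array([0] * 50 + [1] * 50)

    def knn_predict(x_new, K=3):
        # Analogy: look up the K most similar stored examples and vote.
        dists = np.linalg.norm(X_old - x_new, axis=1)
        nearest = np.argsort(dists)[:K]
        return np.bincount(y_old[nearest]).argmax()

    print(knn_predict(np.array([0.2, -0.3])))   # expected: 0
    print(knn_predict(np.array([3.1, 2.8])))    # expected: 1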

27

Examples of non-parametric learning (II)
Radial basis networks for supervised learning: F(x) = Σ_i a_i K(x, x_i)
- K(x, x_i): a non-linear similarity function centered at x_i, with tunable parameters
- Interpretation: "soft/smooth" dictionary lookup/analogy within a population
- Learning: find the a_i from the (x_i, y_i) pairs -- a regularized regression problem:
  min Σ_i [F(x_i) - y_i]^2 + λ ||F||^2
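
A minimal sketch of fitting the coefficients a_i by regularized least squares with a Gaussian radial basis function; the kernel width, the regularizer λ, and the toy data are assumptions:

    import numpy as np

    rng = np.random.default_rng(6)
    # Old data: noisy samples of an unknown 1-D function.
    x_old = np.linspace(0, 2 * np.pi, 30)
    y_old = np.sin(x_old) + rng.normal(0, 0.1, 30)

    def K(x, centers, width=0.5):
        # Gaussian radial basis function centered at each old data point.
        return np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * width ** 2))

    lam = 0.1
    G = K(x_old, x_old)                               # Gram matrix K(x_i, x_j)
    # Regularized least squares: (G + lam*I) a = y gives the coefficients a_i.
    a = np.linalg.solve(G + lam * np.eye(len(x_old)), y_old)

    x_new = np.array([1.0, 4.0])
    F_new = K(x_new, x_old) @ a                       # F(x) = sum_i a_i K(x, x_i)
    print(F_new.round(2))                             # roughly sin(1.0), sin(4.0)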

28

Radial basis functions/networks
- Each data point x_i is associated with a K(x, x_i) -- a radial basis function
- Linear combinations of enough K(x, x_i) can approximate any smooth function from R^n to R
  - Universal approximation property
- Network interpretation (see demo)

29

How is this different from generative modeling?
- Does not assume a fixed space in which to search for the best hypothesis
- Instead, this space grows with the amount of data
  - Basis of the space: K(x, x_i)
- Interpretation: local generalization from old data x_i to new data x
- F(x) = Σ_i a_i K(x, x_i) represents an ensemble generalization from {x_i} to x

30

Examples of non-parametric learning (III)
Support Vector Machines (last time): linear separation
f(x) = sign(<w, x> + b)

31

Max margin classification
- The solution is also a direct generalization from old data, but sparse: most of the coefficients over the old data points are zero
- f(x) = sign(<w, x> + b)

32

Interpretation of support vectors
- Support vectors are the old data points with a non-zero contribution to the generalization
- "Prototypes" for analogical learning
- The remaining coefficients are mostly zero
- f(x) = sign(<w, x> + b)

33

Kernel generalization of SVM
The solution looks very much like an RBF network:
- RBF net: F(x) = Σ_i a_i K(x, x_i) -- many old data points contribute to the generalization
- SVM: F(x) = sign(Σ_i a_i K(x, x_i) + b) -- relatively few old data points contribute
- The dense/sparse difference in the solutions is due to the different goals (see demo)
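
A minimal sketch of the sparse kernel-SVM expansion, assuming scikit-learn is available; the toy data and hyperparameters are illustrative, and the number of support vectors is typically much smaller than the number of training points:

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(7)
    # Two toy classes in 2-D.
    X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(3, 1, (100, 2))])
    y = np.array([0] * 100 + [1] * 100)

    clf = SVC(kernel="rbf", C=1.0, gamma=1.0).fit(X, y)

    # Only the support vectors get non-zero coefficients a_i in
    # F(x) = sign(sum_i a_i K(x, x_i) + b); the rest of the old data drop out.
    print("training points:", len(X))
    print("support vectors:", len(clf.support_vectors_))   # typically far fewer
    print(clf.predict([[0.0, 0.5], [3.2, 2.7]]))            # expected: [0 1]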

34

Transductive inference with support vectors
One more wrinkle: now I'm putting two points there, but not telling you their color

35

Transductive SVM
Not only does the old data affect the generalization -- the new, unlabeled data affect each other too
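
scikit-learn has no transductive SVM, so as a stand-in this sketch uses its LabelSpreading estimator, a different semi-supervised algorithm that still shows unlabeled points being labeled jointly with the labeled data; the data are made up:

    import numpy as np
    from sklearn.semi_supervised import LabelSpreading

    rng = np.random.default_rng(8)
    # Two labeled clusters plus two unlabeled points (label -1 means "color withheld").
    X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(3, 0.5, (20, 2)),
                   [[0.3, 0.2], [2.8, 3.1]]])
    y = np.array([0] * 20 + [1] * 20 + [-1, -1])

    model = LabelSpreading(kernel="rbf", gamma=2.0).fit(X, y)
    # The unlabeled points get labels inferred jointly with the labeled data.
    print(model.transduction_[-2:])        # expected: [0 1]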

36

A general view of non-parametric inductive inference
A function approximation problem: knowing that (x_1, y_1), ..., (x_N, y_N) are inputs and outputs of some unknown function F, how can we approximate F and generalize to new values of x?
- Linguistics: find the universe for F
- Psychology: find the best model that "behaves" like F
- In realistic terms, non-parametric methods often win

37

Who's got the answer?
The parametric approach can also approximate functions:
- Model the joint distribution p(x, y|θ)

38

Who's got the answer?
The parametric approach can also approximate functions:
- Model the joint distribution p(x, y|θ)
- But the model is often difficult to build, e.g. for a realistic experimental task

39

Who's got the answer?
The parametric approach can also approximate functions:
- Model the joint distribution p(x, y|θ)
- But the model is often difficult to build, e.g. for a realistic experimental task

Before reaching a conclusion, we need to know how people learn
- They may be doing both

40

Where do neural nets fit?
- Clearly not generative: they do not reason with probability

41

Where do neural nets fit?
- Clearly not generative: they do not reason with probability
- Somewhat different from the analogy type of non-parametric learning: the network does not reason directly from old data
  - Difficult to interpret the generalization

42

Where do neural nets fit?
- Clearly not generative: they do not reason with probability
- Somewhat different from the analogy type of non-parametric learning: the network does not reason directly from old data
  - Difficult to interpret the generalization
- Some results are available for limiting cases
  - Similar to non-parametric methods when the number of hidden units is infinite
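
A minimal one-hidden-layer network fit by gradient descent, just to make the contrast concrete: it approximates a function without an explicit probability model and without keeping the old data points around at prediction time; the architecture, learning rate, and data are all illustrative:

    import numpy as np

    rng = np.random.default_rng(9)
    x = np.linspace(-3, 3, 200)[:, None]
    y = np.tanh(x) + rng.normal(0, 0.05, x.shape)    # toy target function

    H = 20                                           # hidden units
    W1, b1 = rng.normal(0, 1, (1, H)), np.zeros(H)
    W2, b2 = rng.normal(0, 1, (H, 1)), np.zeros(1)

    lr = 0.01
    for _ in range(2000):
        h = np.tanh(x @ W1 + b1)                     # hidden layer
        pred = h @ W2 + b2
        err = pred - y
        # Backpropagation of the squared error.
        gW2 = h.T @ err / len(x); gb2 = err.mean(0)
        dh = (err @ W2.T) * (1 - h ** 2)
        gW1 = x.T @ dh / len(x); gb1 = dh.mean(0)
        W2 -= lr * gW2; b2 -= lr * gb2; W1 -= lr * gW1; b1 -= lr * gb1

    print("mean squared error:", float(np.mean((pred - y) ** 2)))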

43

A point that nobody gets right
Small-sample dilemma: people learn from very few examples (compared to the dimensionality of the data), yet any statistical machinery needs many
- Parametric: the ML estimate approaches the true distribution with an infinite sample
- Non-parametric: universal approximation requires an infinite sample
- The limit is taken in the wrong direction
