
Learning, Uncertainty, and Information

Big Ideas

November 8, 2004

Roadmap

• Turing, Intelligence, and Learning
• Noisy-channel model
  – Uncertainty, Bayes' Rule, and Applications
• Hidden Markov Models
  – The Model
  – Decoding the best sequence
  – Training the model (EM)
• N-gram models: Modeling sequences
  – Shannon, Information Theory, and Perplexity
• Conclusion

Turing & Intelligence

• Turing (1950): "Computing Machinery and Intelligence"
  – The "Imitation Game" (aka the Turing test)
  – Functional definition of intelligence: behavior indistinguishable from a human's
• Key question raised: Learning
  – Can a system be intelligent if it only follows its built-in program?
  – Learning is necessary for intelligence
• Two ingredients:
  – 1) Programmed knowledge
  – 2) A learning mechanism
• Needed capabilities: knowledge, reasoning, learning, communication

Noisy-Channel Model

• The original message is not directly observable
  – It passes through a channel between sender and receiver, plus noise
  – Examples: telephone lines (Shannon); word sequences vs. acoustics (Jelinek); genome sequences vs. observed CATG bases; objects vs. images
• Goal: derive the most likely original input from what is observed

Bayesian Inference

• P(W|O) is difficult to compute directly
  – W: the original input; O: the observations
  – Bayes' Rule recasts it generatively, in terms of P(O|W) and a prior over sequences P(W), as shown below

W^* = \arg\max_W P(W \mid O)
    = \arg\max_W \frac{P(O \mid W)\, P(W)}{P(O)}
    = \arg\max_W P(O \mid W)\, P(W)
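To make the final argmax concrete, here is a minimal Python sketch; the candidate words and all probability values are invented for illustration and are not from the lecture:

# Noisy-channel decoding sketch: choose the W maximizing P(O|W) * P(W).
# Toy values: the prior P(W) might come from an n-gram model, the
# likelihood P(O|W) from a channel model (e.g., acoustic or typo model).
prior = {"their": 0.006, "there": 0.009, "they're": 0.002}
likelihood = {"their": 0.30, "there": 0.10, "they're": 0.05}

# P(O) is constant across candidates, so it drops out of the argmax.
best = max(prior, key=lambda w: likelihood[w] * prior[w])
print(best)  # "their": 0.30 * 0.006 = 0.0018 beats the other products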

Applications

• AI: speech recognition(!), POS tagging, sense tagging, dialogue, image understanding, information retrieval
• Non-AI:
  – Bioinformatics: gene sequencing
  – Security: intrusion detection
  – Cryptography

Hidden Markov Models

Probabilistic Reasoning over Time

• Issue: discrete models
  – Many processes change continuously
  – How do we define the observations? The states?
• Solution: discretize
  – "Time slices": make time discrete
  – Observations and states are indexed by time: O_t, Q_t
• Observations can be discrete or continuous
  – Here we focus on discrete observations for clarity

Modelling Processes over Time

• Infer the underlying state sequence from what is observed
• Issue: a new state depends on the preceding states
  – Analyzing sequences
• Problem 1: a possibly unbounded number of probability tables
  – Indexed by observation + state + time
• Solution 1: assume a stationary process
  – The rules governing the process are the same at all times
• Problem 2: a possibly unbounded number of parents
  – Markov assumption: only consider a finite history
  – Common: first- or second-order Markov models, which depend on just the last state or two (see the equation below)
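For instance, a first-order Markov assumption keeps only the immediately preceding state:

P(q_t \mid q_1, q_2, \ldots, q_{t-1}) = P(q_t \mid q_{t-1})

A second-order model would condition on q_{t-2} and q_{t-1} instead.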

Hidden Markov Models (HMMs)

• An HMM consists of:
  – 1) A set of states: Q = q_0, q_1, ..., q_k
  – 2) A set of transition probabilities: A = a_01, ..., a_nm
    • where a_ij is the probability of the transition q_i -> q_j
  – 3) Observation probabilities: B = b_i(o_t)
    • the probability of observing o_t in state i
  – 4) An initial probability distribution over states: π_i
    • the probability of starting in state i
  – 5) A set of accepting (final) states

Three Problems for HMMs

• Find the probability of an observation sequence given a model
  – Forward algorithm
• Find the most likely path through a model given an observed sequence
  – Viterbi algorithm (decoding)
• Find the most likely model parameters given an observed sequence
  – Baum-Welch (EM) algorithm

Bins and Balls Example

• Assume there are two bins filled with red and blue balls. Behind a curtain, someone selects a bin and then draws a ball from it (and replaces it). They then select either the same bin or the other one and then select another ball…

– (Example due to J. Martin)

Bins and Balls Example

[Figure: two bins drawn as states with transition arrows. Bin 1 self-loop: 0.6, Bin 1 -> Bin 2: 0.4; Bin 2 self-loop: 0.7, Bin 2 -> Bin 1: 0.3.]

Bins and Balls

• Π (initial probabilities): Bin 1: 0.9; Bin 2: 0.1

• A (transition probabilities):

          Bin 1   Bin 2
  Bin 1    0.6     0.4
  Bin 2    0.3     0.7

• B (observation probabilities):

          Bin 1   Bin 2
  Red      0.7     0.4
  Blue     0.3     0.6

Bins and Balls

• Assume the observation sequence:
  – Blue Blue Red (BBR)
• Both bins contain Red and Blue balls
  – Any state sequence could produce the observations
• However, the sequences are NOT equally likely
  – Big difference in start probabilities
  – Observation depends on state
  – State depends on prior state

Bins and Balls

Observations: Blue Blue Red

State sequence   Probability computation            Value
1 1 1            (0.9*0.3)*(0.6*0.3)*(0.6*0.7)    = 0.0204
1 1 2            (0.9*0.3)*(0.6*0.3)*(0.4*0.4)    = 0.0078
1 2 1            (0.9*0.3)*(0.4*0.6)*(0.3*0.7)    = 0.0136
1 2 2            (0.9*0.3)*(0.4*0.6)*(0.7*0.4)    = 0.0181
2 1 1            (0.1*0.6)*(0.3*0.3)*(0.6*0.7)    = 0.0023
2 1 2            (0.1*0.6)*(0.3*0.3)*(0.4*0.4)    = 0.0009
2 2 1            (0.1*0.6)*(0.7*0.6)*(0.3*0.7)    = 0.0053
2 2 2            (0.1*0.6)*(0.7*0.6)*(0.7*0.4)    = 0.0071

Each factor pairs a start or transition probability with the matching observation probability: π_q1*b_q1(Blue), then a_q1q2*b_q2(Blue), then a_q2q3*b_q3(Red).
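This table can be reproduced by brute-force enumeration; a minimal Python sketch using the Π, A, and B tables from the earlier slide:

from itertools import product

pi = {1: 0.9, 2: 0.1}                                     # initial probabilities
A = {(1, 1): 0.6, (1, 2): 0.4, (2, 1): 0.3, (2, 2): 0.7}  # transition probabilities
B = {(1, "Red"): 0.7, (1, "Blue"): 0.3,
     (2, "Red"): 0.4, (2, "Blue"): 0.6}                   # observation probabilities

obs = ["Blue", "Blue", "Red"]
total = 0.0
for states in product([1, 2], repeat=len(obs)):
    # Joint probability of this state sequence and the observations.
    p = pi[states[0]] * B[(states[0], obs[0])]
    for t in range(1, len(obs)):
        p *= A[(states[t - 1], states[t])] * B[(states[t], obs[t])]
    total += p
    print(states, round(p, 4))
print("P(observations) =", round(total, 4))  # 0.0754, the sum over all paths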

Answers and Issues

• Here, to compute the probability of the observations:
  – Just add up all of the state sequence probabilities
• To find the most likely state sequence:
  – Just pick the sequence with the highest value
• Problem: computing all paths is expensive
  – Roughly 2T·N^T operations: N^T state sequences, each requiring about 2T multiplications
  – Here N=2, T=3 gives only 8 sequences; with N=10 states and T=20 steps it is already 10^20 sequences
• Solution: dynamic programming
  – Sweep across all states at each time step
  – Summing (Problem 1) or maximizing (Problem 2)

Forward Probability

\alpha_j(t) = P(o_1, o_2, \ldots, o_t, q_t = j \mid \lambda)

\alpha_j(1) = \pi_j\, b_j(o_1), \quad 1 \le j \le N

\alpha_j(t+1) = \Big[\sum_{i=1}^{N} \alpha_i(t)\, a_{ij}\Big]\, b_j(o_{t+1})

P(O \mid \lambda) = \sum_{i=1}^{N} \alpha_i(T)

where α is the forward probability, t is the time index in the utterance, i and j are states in the HMM, a_ij is the transition probability, and b_j(o_t) is the probability of observing o_t in state j; N is the number of states and T is the final time step.
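Running this recursion on the bins-and-balls model with observations Blue Blue Red gives a worked check against the enumeration table above:

\alpha_1(1) = 0.9 \cdot 0.3 = 0.27, \qquad \alpha_2(1) = 0.1 \cdot 0.6 = 0.06
\alpha_1(2) = (0.27 \cdot 0.6 + 0.06 \cdot 0.3) \cdot 0.3 = 0.054, \qquad \alpha_2(2) = (0.27 \cdot 0.4 + 0.06 \cdot 0.7) \cdot 0.6 = 0.09
\alpha_1(3) = (0.054 \cdot 0.6 + 0.09 \cdot 0.3) \cdot 0.7 \approx 0.0416, \qquad \alpha_2(3) = (0.054 \cdot 0.4 + 0.09 \cdot 0.7) \cdot 0.4 \approx 0.0338
P(O \mid \lambda) \approx 0.0416 + 0.0338 = 0.0754

This equals the sum of the eight path probabilities in the enumeration table, as it should, but takes O(N^2 T) work instead of O(N^T).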

Pronunciation Example

• Observations: 0/1

Sequence Pronunciation Model

Acoustic Model

• 3-state phone model for [m]
  – Modeled with a Hidden Markov Model (HMM)
  – Probability of a sequence: the sum of the probabilities of its paths

[Figure: 3-state phone HMM with emitting states Onset, Mid, End and a Final state.
Transition probabilities: Onset->Onset 0.3, Onset->Mid 0.7; Mid->Mid 0.9, Mid->End 0.1; End->End 0.4, End->Final 0.6.
Observation probabilities: Onset: C1 0.5, C2 0.2, C3 0.3; Mid: C3 0.2, C4 0.7, C5 0.1; End: C4 0.1, C6 0.5, C6 0.4.]

Forward Algorithm

• Idea: a matrix where each cell forward[t, j] represents the probability of being in state j after seeing the first t observations.
• Each cell expresses the probability:
  forward[t, j] = P(o_1, o_2, ..., o_t, q_t = j | λ)
• q_t = j means "the t-th state in the state sequence is state j."
• Compute the probability by summing over the extensions of all paths leading to the current cell.
• An extension of a path from a state i at time t-1 to state j at time t is computed by multiplying together:
  i. the previous path probability from the previous cell, forward[t-1, i],
  ii. the transition probability a_ij from previous state i to current state j, and
  iii. the observation likelihood b_j(o_t) that current state j matches observation symbol o_t.

Forward Algorithm

Function Forward(observations of length T, state-graph) returns observation probability

  num-states <- num-of-states(state-graph)
  Create a path probability matrix forward[num-states+2, T+2]
  forward[0, 0] <- 1.0
  For each time step t from 0 to T do
    for each state s from 0 to num-states do
      for each transition s' from s in state-graph do
        new-score <- forward[s, t] * a[s, s'] * b_s'(o_t)
        forward[s', t+1] <- forward[s', t+1] + new-score
  Return the sum of the final column of forward[]
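A runnable Python version of the same sweep, sketched on the bins-and-balls model (it folds the pseudocode's dummy start state into the initial distribution π):

def forward(obs, states, pi, A, B):
    """Forward algorithm: returns P(obs | model) by summing over all paths."""
    # alpha[t][j] = P(o_1..o_t, q_t = j); initialize with pi and the first observation.
    alpha = [{j: pi[j] * B[(j, obs[0])] for j in states}]
    for t in range(1, len(obs)):
        alpha.append({
            j: sum(alpha[t - 1][i] * A[(i, j)] for i in states) * B[(j, obs[t])]
            for j in states
        })
    return sum(alpha[-1].values())  # total probability over all final states

states = [1, 2]
pi = {1: 0.9, 2: 0.1}
A = {(1, 1): 0.6, (1, 2): 0.4, (2, 1): 0.3, (2, 2): 0.7}
B = {(1, "Red"): 0.7, (1, "Blue"): 0.3, (2, "Red"): 0.4, (2, "Blue"): 0.6}
print(forward(["Blue", "Blue", "Red"], states, pi, A, B))  # ~0.0754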

Viterbi Code

Function Viterbi(observations of length T, state-graph) returns best-path

  num-states <- num-of-states(state-graph)
  Create a path probability matrix viterbi[num-states+2, T+2]
  viterbi[0, 0] <- 1.0
  For each time step t from 0 to T do
    for each state s from 0 to num-states do
      for each transition s' from s in state-graph do
        new-score <- viterbi[s, t] * a[s, s'] * b_s'(o_t)
        if ((viterbi[s', t+1] == 0) || (viterbi[s', t+1] < new-score))
          then viterbi[s', t+1] <- new-score
               back-pointer[s', t+1] <- s
  Backtrace from the highest-probability state in the final column of viterbi[] and return the path
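The corresponding runnable Python sketch for Viterbi decoding on the bins-and-balls model, returning the best path and its probability:

def viterbi(obs, states, pi, A, B):
    """Viterbi decoding: the most likely state sequence for the observations."""
    # best[t][j] = (probability of best path ending in state j at time t, predecessor)
    best = [{j: (pi[j] * B[(j, obs[0])], None) for j in states}]
    for t in range(1, len(obs)):
        best.append({})
        for j in states:
            # Pick the predecessor i that maximizes the extended path score.
            prob, i = max((best[t - 1][i][0] * A[(i, j)] * B[(j, obs[t])], i)
                          for i in states)
            best[t][j] = (prob, i)
    # Backtrace from the highest-probability final state.
    last = max(states, key=lambda j: best[-1][j][0])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(best[t][path[-1]][1])
    path.reverse()
    return path, best[-1][last][0]

states = [1, 2]
pi = {1: 0.9, 2: 0.1}
A = {(1, 1): 0.6, (1, 2): 0.4, (2, 1): 0.3, (2, 2): 0.7}
B = {(1, "Red"): 0.7, (1, "Blue"): 0.3, (2, "Red"): 0.4, (2, "Blue"): 0.6}
print(viterbi(["Blue", "Blue", "Red"], states, pi, A, B))
# ([1, 1, 1], 0.0204...), matching the best row of the enumeration table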

Modeling Sequences, Redux

• Discrete observation values
  – Simple, but inadequate
  – Many observations are highly variable
• Gaussian pdfs over continuous values
  – Assume normally distributed observations (density below)
• Typically sum over multiple shared Gaussians
  – "Gaussian mixture models"
  – Trained jointly with the HMM

b_j(o_t) = \frac{1}{\sqrt{(2\pi)^d\, |\Sigma_j|}} \exp\Big(-\frac{1}{2}(o_t - \mu_j)^\top \Sigma_j^{-1} (o_t - \mu_j)\Big)

where μ_j and Σ_j are the mean and covariance of the Gaussian for state j, and d is the dimension of the observation vector o_t.
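A minimal numpy sketch of this density; the mean and covariance values below are illustrative only, not from the lecture:

import numpy as np

def gaussian_obs_prob(o, mu, sigma):
    """Multivariate Gaussian density b_j(o_t) for a state with mean mu, covariance sigma."""
    d = len(mu)
    diff = o - mu
    norm = 1.0 / np.sqrt((2 * np.pi) ** d * np.linalg.det(sigma))
    return norm * np.exp(-0.5 * diff @ np.linalg.inv(sigma) @ diff)

# Toy 2-dimensional observation model for one state.
mu = np.array([0.0, 1.0])
sigma = np.array([[1.0, 0.2],
                  [0.2, 0.5]])
print(gaussian_obs_prob(np.array([0.1, 0.9]), mu, sigma))

A mixture model then sums several such densities with per-state weights: b_j(o_t) = Σ_k c_jk N(o_t; μ_jk, Σ_jk), with the weights c_jk summing to 1 for each state j.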

Learning HMMs

• Issue: Where do the probabilities come from?
• Solution: Learn them from data
  – Trains the transition (a_ij) and emission (b_j) probabilities
  – Typically the model structure is assumed to be given
• Baum-Welch, aka the forward-backward algorithm
  – Iteratively estimates expected counts of transitions taken and symbols emitted
  – Gets estimated probabilities by forward computation
  – Divides probability mass over the contributing paths (one iteration is sketched below)
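A compact Python sketch of one Baum-Welch re-estimation pass for a discrete-observation HMM, under simplifying assumptions (a single training sequence and no log/scaling tricks against numerical underflow):

import numpy as np

def baum_welch_step(obs, pi, A, B):
    """One EM iteration. obs: list of symbol indices; pi: (N,) initial
    probabilities; A: (N, N) transitions; B: (N, M) emissions."""
    N, T = len(pi), len(obs)

    # E-step: forward (alpha) and backward (beta) probabilities.
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    beta = np.zeros((T, N))
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    p_obs = alpha[-1].sum()

    # gamma[t, i]: P(in state i at t); xi[t, i, j]: P(transition i->j at t).
    gamma = alpha * beta / p_obs
    xi = np.array([alpha[t][:, None] * A * (B[:, obs[t + 1]] * beta[t + 1])[None, :]
                   for t in range(T - 1)]) / p_obs

    # M-step: re-estimate parameters from the expected counts,
    # dividing the probability mass over the contributing paths.
    new_pi = gamma[0]
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    new_B = np.zeros_like(B)
    for k in range(B.shape[1]):
        mask = np.array(obs) == k
        new_B[:, k] = gamma[mask].sum(axis=0) / gamma.sum(axis=0)
    return new_pi, new_A, new_B

# Bins-and-balls parameters; observations encoded as 0 = Red, 1 = Blue.
pi = np.array([0.9, 0.1])
A = np.array([[0.6, 0.4], [0.3, 0.7]])
B = np.array([[0.7, 0.3], [0.4, 0.6]])
print(baum_welch_step([1, 1, 0], pi, A, B))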