Stanford Intro AI Class Notes



Videos: http://www.wonderwhy-er.com/ai-class/

    Unit 1 Theory: Welcome to AI

Purposes

- Teach the basics of AI
- Excite you

    Structure

- Videos, Quizzes, Answer Videos
- Homework (assignments), Exams

    AI program = Intelligent Agent

Agent function = maps any given percept sequence to an action (an abstract mathematical description)

Agent program = a concrete implementation of the agent function

Rational agent: the one that does the right thing, i.e. for each possible percept sequence, a rational agent should select an action that is expected to maximize its performance measure, given the evidence provided by the percept sequence and whatever built-in knowledge the agent has.

Terminology

Environment types:

    - Fully vs. Partially Observable


- Deterministic (e.g. chess) vs. Stochastic (e.g. a dice game)
- Discrete vs. Continuous
- Benign vs. Adversarial

    AI as uncertainty management

    Reasons for uncertainty

- Sensor limits
- Adversaries
- Stochastic environment
- Laziness
- Ignorance

    Unit 2 Problem Solving

    Definition of a problem

- Initial state
- A function ACTIONS(s) → {a1, a2, a3, ...}
  o s is a state
  o a1, a2, a3 are the possible actions from the state s
- A function RESULT(s, a) → s'
  o s is a state
  o a is an action applied to the state
  o s' is the new state
- A function GOALTEST(s) → T|F
  o s is a state
  o GOALTEST tests whether s is the destination (final state)
- A function PATHCOST(s → s' → ... → s_goal) → n
  o n is the cost of the path from a state s through s' to the final state
  o It is mostly additive, so it is composed of many STEPCOST(s, a, s'), the sum of which is PATHCOST


    Route finding problem

    3 regions

- Explored
- Frontier
- Unexplored

    Base algorithm

1. Take a state from the Frontier (by some criterion)
2. GoalTest it; if YES, terminate here
3. Expand it (to new states that are added to the Frontier)
4. Remove it from the Frontier (move it to the Explored set)

    (Generic) Tree-Search

    Tree-Search applied to path-finder problem


Graph-Search (like Tree-Search, but remembers what has already been explored, so when the frontier is expanded it will not include already-explored states)

The key point of the base algorithm is the criterion used in step 1. It leads to a few concrete algorithms:

- Breadth-First (aka shortest-first): expand the shortest path first
- Uniform-Cost (aka cheapest-first): expand the cheapest path first
- Depth-First: expand the longest path first

    A* algorithm

It is proven that the algorithm improves if we know some extra info, e.g. the distance from the current state (which is about to be expanded) to the goal. This is the A* algorithm: expand the state on the frontier that minimises f = g + h, where g(s) is the cost of the path so far and h(s) estimates the remaining cost to the goal.

h is called the heuristic function. A* will always find the lowest-cost path only if h(s) ≤ true cost; in other words, h never over-estimates (h is said to be optimistic, or admissible).


A* works well if we can come up with a good heuristic, but that needs our intelligence. h, however, can be generated automatically by relaxing the problem's conditions, e.g. by dropping some of the rules that constrain movement.

    When it works

    Problem solving technique like above works when the problem is

- Fully observable
- Known
- Discrete
- Deterministic
- Static

    Unit 3 Probability in AI

Key things to remember

- Joint probability (see definition in the table below)
- Conditional probability (see definition in the table below)
- Total probability formula (see the table below)

Event: A, probability P(A)
Event: not A (¬A), probability P(¬A) = 1 - P(A)

This applies to conditional probability as well: 1 = P(A|B) + P(¬A|B)

But be careful when negating on the condition side:

- Wrong: 1 = P(A|B) + P(A|¬B)
- Wrong: P(A) = P(A|B) + P(A|¬B)
- Right: P(A) = P(A|B)·P(B) + P(A|¬B)·P(¬B) (total probability; see also the next line)

Total probability: P(A) = Σ_b P(A|B=b)·P(B=b), where b spans the whole probability space, i.e. Σ_b P(B=b) = 1.

In particular, if B has only 2 values (1 or 0) then P(A) = P(A|B)·P(B) + P(A|¬B)·P(¬B)

This formula applies to conditional probability as well:

P(A|M) = P(A|B,M)·P(B|M) + P(A|¬B,M)·P(¬B|M)

A or B: P(A ∨ B) = P(A) + P(B) - P(A ∧ B)

A and B (joint probability): P(A,B) = P(A ∧ B) = P(A|B)·P(B)

A given B (conditional probability): P(A|B) = P(A,B) / P(B)

Proof:
- B and A are sets in some space
- Given B, the space is now limited to be only B
- The probability of A given B is the part of A inside B, i.e. A ∧ B; in this new space of B we have P(A|B) = P(A,B) / P(B). Q.E.D.

Bayes Rule

From the formula above we have

P(A,B) = P(A|B)·P(B) = P(B|A)·P(A)

So we have

P(A|B) = P(B|A)·P(A) / P(B)

To calculate using Bayes rule we usually just calculate the numerators of P(A|B) and P(¬A|B) and then normalise (divide each by their sum), since P(A|B) + P(¬A|B) = 1.


For multiple variables in the condition:

P(A|B,C) = P(B|A,C)·P(A|C) / P(B|C)

Variables and probability distributions

    Example of variables

1. We have seen the example of the uncertain event a = "Spurs win the FA Cup in the year 2011".
   a. We can think of this event as just one state of the variable A which represents "FA Cup winners in 2011".
   b. In this case A has many states, one for each team entering the FA Cup.
   c. We write this as A = {a1, a2, ..., an} where a1 = "Spurs", a2 = "Chelsea", a3 = "West Ham", etc.
   d. Since in this case the set A is finite we say that A is a finite discrete variable.
2. As another example, suppose we are interested in the number of critical faults in our control system.
   a. The uncertain event is A = "Number of critical faults". Again it is best to think of A as a variable which can take on any of the discrete values 0, 1, 2, 3, ..., thus A = {0, 1, 2, 3, ...}.
   b. In this case we say that A is an infinite discrete variable.
   c. Let us define a1 as the event "A=0", and a2 as the event "A=1".
   d. Clearly the events a1 and a2 are mutually exclusive and so P(a1 or a2) = P(a1) + P(a2). However, we cannot say that P(a1 or a2) = 1 because a1 and a2 are not exhaustive. That is, they do not form a complete partition of A.
   e. However, if we define a3 as the event "A>1" then a1, a2, and a3 are exhaustive and mutually exclusive, and in this case P(a1) + P(a2) + P(a3) = 1.
   f. In general, if A is a variable with states a1, a2, ..., an, the probability distribution of A, written P(A), is simply the set of values {P(a1), P(a2), ..., P(an)}.

    Key thing to remember, summary from reddit

1) P(A): Concept of Probability

Like intelligence, probability is about trying to predict something about the future, but probability is a prediction with only one number.

That number is in the range from 0 to 1, where 0 means the given event will occur 0% of the time, and 1 means it will occur 100% of the time.

Example 1: P(A) = 0.3 means event A will occur 30% of the time (if you get 100 samples, you predict that 30 will be "of the A type").

Example 2: Dice. P("get a 5") = 1/6 = 0.1666...7

How to calculate that result:

- The die has a total of 6 possible events. Sample Space = {1, 2, 3, 4, 5, 6}
- Every event is as probable as the others (so-called "equiprobable"): you don't think the die is loaded, so P("1") = P("2") = ... = P("5") = P("6")
- Every event is disjoint from the others (if the result is 1, then it cannot be 2, nor 3, ... nor 6)

With those conditions, you can imagine:

P("get 1 or 2 or 3 or 4 or 5 or 6") = 1

(by definition of probability that means: 100% of the time you will get 1 or 2 or 3 ... or 6)

Since they are disjoint: P( Union( Ai ) ) = sum( P(Ai) )

http://www.eecs.qmul.ac.uk/~norman/BBNs/Bayesian_approach_to_probability.htm

    P("1") + P("2") + P("3") + P("4") + P("5") + P("6") = 1

    since equiprobable then 6*P("1") = 1

    then P("1") = 1/6

    then P("5") = P("1") = 1/6

In general, if you have a discrete set as Sample Space with those conditions:

- Equiprobable elements
- Disjoint events

Then probability can be calculated as:

P(A) = "number of A events" / "total number of events"

Example: P("get an even number on the die") = "number of even events" / "total die events" = 3/6 = 0.5

    Probability = number of favourable events / total number of events

2) P(A,B): Joint Probability

P(A,B) is the same as P("A intersection B") = P(A ∩ B)

P(A,B) = P("we get something that is A and ALSO B") = "# of events which are A and B" / total

    Example:

    L4 = "get a lower than 4 number in the dice"

    O = "get an odd number in the dice"

    P(O,L4) = "number of Odd which are lower than 4" / total

    = number of {"1","3"} / 6 = 2/6 = 1/3

3) P(A|B): Conditional Probability

P(A|B) (reads "probability of A given B"): we consider only the B events and ask what "percentage" OF THEM are also A.

Don't confuse this with the joint. Here you are talking about those events which are B and ALSO A, but relative to the B events (as if B were a new Sample Space of another experiment where you only get B samples).

P(A|B) = "# of events which are A and B" / "# of B events"

From the concept, we can derive the Conditional Probability formula: if P(A,B) and P(A|B) are different, how can we relate them?

P(A,B) = "# of events which are A and B" / total

P(A|B) = "# of events which are A and B" / "# of B events"

If we divide: P(A,B) / P(A|B) = "# of B events" / total

But hey, we know that concept! "# of B events" / total is what we call P(B)!

So:

**P(A,B) = P(A|B) * P(B)**

4) Bayes Theorem / Bayes Rule / Bayes Law

What is the relation between P(B|A) and P(A|B)?

Well, P(A,B) = P(B,A), so from the Conditional Probability formula: P(A|B) * P(B) = P(B|A) * P(A)

Dividing by P(B) we get the Bayes formula:

P(A|B) = P(B|A) * P(A) / P(B)

Translated into intuitive numbers:

    "# of A and B" / "# of B" =

  • 8/3/2019 Stanford Intro AI Class Notes

    9/56

    = "# of A and B" / "# of A" * "# of A" / total / ("# of B" / total)

    Another version is this (changing A and B):

    P(B|A) = P(A|B) * P(B) / P(A)

5) A ⊥ B: Concept of independent events

In conditional probability we talked about the "probability of A given B", but what if the "given B" doesn't matter for A? That is, what if the "probability of A given B" is the same as the "probability of A" alone? Or, seen another way, if taking only samples which have the B property has the same effect (for the calculation of P(A)) as taking samples which do not have the B property, and the same as any sample (no matter whether it is B or not B)...

That is, the B property "doesn't affect" the A property.

In this case, it is said that A and B are independent (written as A ⊥ B). It is the same as saying:

A ⊥ B ⇔ P(A|B) = P(A)

Is it commutative? Is "A and B are independent" the same as "B and A are independent"?

P(B|A) = P(A|B) * P(B) / P(A)

Given "A and B are independent":

P(B|A) = P(A|B) * P(B) / P(A) = P(A) * P(B) / P(A) = P(B), so "B and A are independent".

By symmetry, if "B and A are independent" then "A and B are independent".

And so YES, it is the same (as the language "they are independent" would suggest, since no order is implied).

From the combination of independence and the concept of conditional probability a new concept comes:

Conditional Independence

A ⊥ B | C ⇔ P(A|B,C) = P(A|C)

This is very important for Bayes networks: it is related to the concept of D-separation and it is important for solving exercises: knowing when you can apply this formula.

(If there is D-separation between A and B given C, then you are sure A and B are conditionally independent given C, and then you can use that formula.)

6) Total Probability

Imagine you have 2 disjoint subsets (that is, their intersection is an empty set: they have no elements in common). What's the number of elements of the union of both? In that case, the number is the sum.

In general, #(A ∪ B) = #(A) + #(B) - #(A ∩ B)

One simple example of disjoint sets: A and "not A". If an element (or sample) belongs to A, it cannot belong to "not A".

"not A" is written as ¬A

P(A ∪ ¬A) = P(A) + P(¬A) - P(A ∩ ¬A) = P(A) + P(¬A) = 1

Another example of disjoint sets: "A ∩ B" and "A ∩ ¬B"

P("A ∩ B" ∪ "A ∩ ¬B") = P(A ∩ B) + P(A ∩ ¬B)

And we know "A ∩ B" ∪ "A ∩ ¬B" ... it is simply A!

And we know P(A ∩ B), it is what we called the joint.

So: P(A) = P(A,B) + P(A,¬B)

And we can express that in terms of conditionals:

P(A,B) = P(A|B) * P(B)

So:

P(A) = P(A|B) * P(B) + P(A|¬B) * P(¬B)

This is called the Total Probability formula.

Which can be seen in numbers as:

"# of A" / total = ("# of A and B" / "# of B") * ("# of B" / total) + ("# of A and not B" / "# of not B") * ("# of not B" / total)


    which is in fact like saying:

    "# of A" / "total" = ("# of A and B" + "# of A and not B" ) / "total"

Typical problem #1

Consider the Bayes network on the left (C → T1, C → T2).

- C is the cancer event, with prior probability P(C) = 0.01
- T is a test for cancer
  o Probability of a positive result given C is P(+|C) = 0.9
  o Probability of a positive result given ¬C is P(+|¬C) = 0.2
- T1 and T2 are 2 attempts of the test T
- Calculate the probability of cancer if test T1 is negative and T2 is positive: P(C|-,+)?

Solution

Use Bayes rule to express P(C|-,+) and P(¬C|-,+):

- P(C|-,+) = P(-,+|C) * P(C) / P(-,+)
- P(¬C|-,+) = P(-,+|¬C) * P(¬C) / P(-,+)

P(-,+), the joint probability that T1 is negative and T2 is positive, is not easy to find. We however know that 1 = P(C|-,+) + P(¬C|-,+), so we just calculate the numerators and normalise:

- P(-,+|C) * P(C) = P(-|C) * P(+|C) * P(C) = (1-0.9) * 0.9 * 0.01 = 0.0009
- P(-,+|¬C) * P(¬C) = P(-|¬C) * P(+|¬C) * P(¬C) = (1-0.2) * 0.2 * (1-0.01) = 0.1584

Normalising, we find the answer: P(C|-,+) = 0.0009 / (0.0009 + 0.1584) = 0.0056 = 0.56%
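A quick sketch of this calculation in Python (the numbers are the ones given above):

p_c = 0.01                       # prior P(C)
p_pos_c, p_pos_nc = 0.9, 0.2     # P(+|C), P(+|not C)

# numerators of Bayes rule for the evidence "T1 negative, T2 positive"
num_c  = (1 - p_pos_c)  * p_pos_c  * p_c         # P(-|C) P(+|C) P(C)
num_nc = (1 - p_pos_nc) * p_pos_nc * (1 - p_c)   # P(-|~C) P(+|~C) P(~C)

print(num_c / (num_c + num_nc))                  # normalise: ~0.0056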

Typical problem #2

The conditions are as in typical problem #1. Find P(T1=+|T2=+), the probability that test 1 is positive given that test 2 is positive.

Solution

To solve this we need conditional independence (read the next chapter first). Steps:

- Apply the total probability formula: P(T1|T2) = P(T1|T2,C)*P(C|T2) + P(T1|T2,¬C)*P(¬C|T2)
- Because T1 and T2 are independent given C:
  o P(T1|T2,C) = P(T1|C)
  o P(T1|T2,¬C) = P(T1|¬C)
- So P(T1|T2) = P(T1|C)*P(C|T2) + P(T1|¬C)*P(¬C|T2) = P(+|C)*P(C|+) + P(+|¬C)*P(¬C|+) (simplified writing, replacing T1 and T2 with +)
- Apply Bayes rule:
  o P(C|+) = P(+|C)*P(C)/P(+)
  o P(¬C|+) = P(+|¬C)*P(¬C)/P(+)



- So P(T1|T2) = P(+|C)*P(+|C)*P(C)/P(+) + P(+|¬C)*P(+|¬C)*P(¬C)/P(+) = (0.9*0.9*0.01 + 0.2*0.2*0.99) / P(+) = 0.0486 / P(+)
- Apply the total probability formula to calculate P(+) = P(+|C)*P(C) + P(+|¬C)*P(¬C) = 0.207. We finally find that P(T1|T2) = 0.0486 / 0.207 = 0.2348

Typical problem #3: Confounding Cause

We have seen one type of Bayes network in typical problems #1 and #2: one single hidden cause causes 2 different measurements (Cause → Measure1, Cause → Measure2).

Confounding Cause is another type of Bayes network, where 2 hidden causes get confounded within a single observational variable (Cause1 → Measure ← Cause2).

Explaining Away, or the problem of Happiness when Sunny and Raise of salary: it is a typical confounding-cause Bayes network (Sunny → Happy ← Raise).


(Network parameters, recoverable from the calculations below: P(S) = 0.7, P(R) = 0.01, P(H|S,R) = 1, P(H|¬S,R) = 0.9, P(H|S,¬R) = 0.7, P(H|¬S,¬R) = 0.1.)

a) Find P(R|S)

R and S are independent if H is not given, so P(R|S) = P(R) = 0.01

b) Explaining Away question 1: find P(R|H,S)

Use Bayes rule (multiple variables in the condition): P(R|H,S) = P(H|R,S) * P(R|S) / P(H|S)

- P(H|R,S) = 1
- P(R|S) = 0.01 as calculated in (a) above
- Use total probability: P(H|S) = P(H|S,R)*P(R) + P(H|S,¬R)*P(¬R) = 1*0.01 + 0.7*0.99 = 0.703

So P(R|H,S) = 1 * 0.01 / 0.703 = 0.0142

c) Explaining Away question 2: find P(R|H)

Use Bayes rule: P(R|H) = P(H|R) * P(R) / P(H)

- Use total probability: P(H|R) = P(H|R,S)*P(S) + P(H|R,¬S)*P(¬S) = 1*0.7 + 0.9*0.3 = 0.97
- P(R) = 0.01
- Use total probability across all the cases:
  P(H) = P(H|S,R)*P(S,R) + P(H|¬S,R)*P(¬S,R) + P(H|S,¬R)*P(S,¬R) + P(H|¬S,¬R)*P(¬S,¬R)
  P(H) = 0.5245 (remember R and S are independent, so P(R,S) = P(R)*P(S), and similarly for the others)

So P(R|H) = 0.97 * 0.01 / 0.5245 = 0.0185

Note that P(R|H) = 0.0185 > P(R|H,S) = 0.0142: additionally knowing it is sunny "explains away" some of the happiness, lowering the probability of a raise.
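A sketch checking these numbers in Python (using the network parameters recovered above):

p_s, p_r = 0.7, 0.01
p_h = {(True, True): 1.0, (False, True): 0.9,    # P(H|S,R),  P(H|~S,R)
       (True, False): 0.7, (False, False): 0.1}  # P(H|S,~R), P(H|~S,~R)

def ps(s):   # P(S=s)
    return p_s if s else 1 - p_s

def pr(r):   # P(R=r)
    return p_r if r else 1 - p_r

# total probability; S and R are independent, so P(S,R) = P(S)*P(R)
p_h_total   = sum(p_h[s, r] * ps(s) * pr(r)
                  for s in (True, False) for r in (True, False))
p_h_given_s = sum(p_h[True, r] * pr(r) for r in (True, False))   # P(H|S)
p_h_given_r = sum(p_h[s, True] * ps(s) for s in (True, False))   # P(H|R)

print(p_h[True, True] * p_r / p_h_given_s)   # P(R|H,S) ~ 0.0142
print(p_h_given_r * p_r / p_h_total)         # P(R|H)   ~ 0.0185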

    Conditional Independence

By definition, B and C are independent given A if P(B|A,C) = P(B|A)

D-separation (aka reachability)

It is used to find out if 2 states are independent:

- Active triplets: variables are dependent
- Inactive triplets: variables are independent
- (shading = given, or known, state)


    Bayes Networks

Bayes networks define probability distributions over graphs of random variables.

    Simplest Bayes network with 2 variables

To specify this network we need 3 parameters: P(A), and 2 others: P(B|A) and P(B|¬A)

    Example of a Bayes network with 5 variables

The compactness of the Bayes network significantly reduces the number of parameters. For the graph above of 5 variables, it is reduced from 31 (2^5 - 1) to 10 (1+1+4+2+2; see the picture above) thanks to the formula P(A,B,C,D,E) = P(A)·P(B)·P(C|A,B)·P(D|C)·P(E|C).

The formula is written as a product of probabilities; each factor is the probability of a variable, written as a conditional probability given the variables it depends on.
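As an illustration, a Python sketch that encodes this factorisation; the CPT numbers are made up for the example:

from itertools import product

# P(A,B,C,D,E) = P(A) P(B) P(C|A,B) P(D|C) P(E|C): 1+1+4+2+2 = 10 parameters
p_a, p_b = 0.3, 0.6                                          # example priors
p_c = {(1, 1): 0.9, (1, 0): 0.7, (0, 1): 0.4, (0, 0): 0.1}   # P(C=1|A,B)
p_d = {1: 0.8, 0: 0.2}                                       # P(D=1|C)
p_e = {1: 0.5, 0: 0.3}                                       # P(E=1|C)

def bern(p, v):   # P(V=v) for a binary variable with P(V=1)=p
    return p if v else 1 - p

def joint(a, b, c, d, e):
    return (bern(p_a, a) * bern(p_b, b) * bern(p_c[a, b], c)
            * bern(p_d[c], d) * bern(p_e[c], e))

# sanity check: the joint sums to 1 over all 2^5 assignments
print(sum(joint(*v) for v in product((0, 1), repeat=5)))     # 1.0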

Unit 4 Probability Inference

(how to answer probability questions using Bayes networks)


    Enumeration

Given the conditional probability tables of the burglary-earthquake-alarm network (b = burglary, e = earthquake, a = alarm, j = John calls, m = Mary calls), we can calculate P(+b,+j,+m) by enumeration over the hidden variables e and a:

P(+b,+j,+m) = Σ_e Σ_a P(+b)·P(e)·P(a|+b,e)·P(+j|a)·P(+m|a)
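A sketch of this enumeration in Python; the CPT values below are assumptions (the usual textbook alarm-network numbers), since the original table is a picture:

p_b, p_e = 0.001, 0.002                                           # P(+b), P(+e)
p_a = {(1, 1): 0.95, (1, 0): 0.94, (0, 1): 0.29, (0, 0): 0.001}   # P(+a|b,e)
p_j = {1: 0.90, 0: 0.05}                                          # P(+j|a)
p_m = {1: 0.70, 0: 0.01}                                          # P(+m|a)

def bern(p, v):
    return p if v else 1 - p

# P(+b,+j,+m) = sum over hidden e, a of P(+b) P(e) P(a|+b,e) P(+j|a) P(+m|a)
total = sum(p_b * bern(p_e, e) * bern(p_a[1, e], a) * p_j[a] * p_m[a]
            for e in (0, 1) for a in (0, 1))
print(total)   # ~0.00059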

    Speedup Techniques for Enumeration

    1. Pull out terms

2. Maximise independence (a good idea is to order variables following the causal direction)
3. Variable elimination


    Unit 5 Machine Learning

    Supervised Learning

Feature vector X, target label Y.

SL tries to predict labels given the input vectors, i.e. to find the function f with Y = f(X) (see the picture above).

Occam's Razor: prefer the simplest hypothesis that fits the data.

Quiz: Spam filter

Here we use a Naïve Bayes filter to detect spam.


    P(SPAM|M) = P(M|SPAM)*P(SPAM)/P(M) = P(today,is,secret|SPAM)*P(SPAM)/P(M)

Using the normalised Bayes rule as in "Key thing to remember, summary from reddit" above


(remember that "today", "is", "secret" are independent given the class):

    - P(today,is,secret|SPAM)*P(SPAM) = P(today|SPAM)*P(is|SPAM)*P(secret|SPAM)*P(SPAM)= 0


- P(today,is,secret|HAM)*P(HAM) = P(today|HAM)*P(is|HAM)*P(secret|HAM)*P(HAM) = 2/15 * 1/15 * 1/15 * 5/8 = 0.000037

So P(SPAM|M) = 0 / (0 + 0.000037) = 0!

This is not good: just because of the single word "today" we can't detect the spam (OVERFITTING!). Overfitting is a common problem when maximum likelihood is used!

One solution is to use Laplace Smoothing to define the probability of words (in this case the word is "today"):

P_LS(x) = (count(x) + k) / (N + k·|x|)

- ML = maximum likelihood, LS = Laplace smoothing
- x is a variable (in this case a word)
- count(x) is the number of occurrences of this value (e.g. "today") of the variable x
- |x| is the number of all possible values that the variable x can take
- k is a smoothing parameter
- N is the total number of occurrences of x (the variable, not the value) in the sample space

So apply Laplace Smoothing with k = 1 to the quiz (assuming the dictionary is 12 words for both SPAM and HAM; 9 is the total number of words on the SPAM side, 15 on the HAM side):

- P(today,is,secret|SPAM)*P(SPAM) = (0+1)/(9+12) * (1+1)/(9+12) * (3+1)/(9+12) * 0.4 = 0.00034
- P(today,is,secret|HAM)*P(HAM) = (2+1)/(15+12) * (1+1)/(15+12) * (1+1)/(15+12) * 0.6 = 0.00037

Normalising, we get P(SPAM|M) = 0.00034 / (0.00034 + 0.00037) = 0.48
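A sketch of this smoothed classifier in Python (word counts as given in the quiz; the helper names are mine):

def laplace(count, total, vocab, k=1):
    """Laplace-smoothed estimate: (count(x)+k) / (N + k*|x|)."""
    return (count + k) / (total + k * vocab)

VOCAB = 12                                           # dictionary size
spam_counts = {"today": 0, "is": 1, "secret": 3}     # SPAM side, N = 9
ham_counts  = {"today": 2, "is": 1, "secret": 1}     # HAM side,  N = 15

def score(words, counts, total, prior):
    s = prior
    for w in words:
        s *= laplace(counts.get(w, 0), total, VOCAB)
    return s

msg = ["today", "is", "secret"]
spam = score(msg, spam_counts, 9, prior=0.4)         # ~0.00034
ham  = score(msg, ham_counts, 15, prior=0.6)         # ~0.00037
print(spam / (spam + ham))                           # ~0.48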

    Overfitting Prevention

    Types of Supervised Learning

    - Classification: values of the target are discrete (binary in the picture above)


- Regression: values of the target are continuous
- Parametric: these methods have parameters, and the number of parameters is constant, independent of the training-set size
- Non-parametric: the number of parameters can grow significantly

K-nearest neighbours

K-nearest neighbours is a non-parametric supervised learning method. It has 2 steps (a sketch follows below):

- Learning step: memorise all data
- Labelling a new example:
  o Find the K nearest neighbours
  o Return the majority class label
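A minimal k-NN sketch (the toy points and the Euclidean distance are assumptions for the example):

from collections import Counter

def knn_label(query, data, k=3):
    """Classify `query` by majority vote among the k nearest training points."""
    # data: list of ((x, y), label) pairs; squared distance is enough for ranking
    nearest = sorted(data, key=lambda p: (p[0][0] - query[0]) ** 2
                                         + (p[0][1] - query[1]) ** 2)[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

train = [((0, 0), "red"), ((1, 0), "red"), ((0, 1), "red"),
         ((5, 5), "blue"), ((6, 5), "blue"), ((5, 6), "blue")]
print(knn_label((1, 1), train))   # red
print(knn_label((5, 4), train))   # blue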

Linear regression

M data points; y is continuous. Linear Regression tries to find the (linear!) function f, as shown below: y ≈ f(x)

2 types of f:

- f(x) = w1·x + w0, where w1 and w0 are scalars


- f(x) = w·x + w0, where w is a vector

To find f we define the quadratic loss function

LOSS = Σ_j (y_j - f(x_j))²

and try to minimise the loss (M is the number of training samples). For the scalar case the minimum has the closed form

w1 = (M·Σ x_j y_j - Σ x_j · Σ y_j) / (M·Σ x_j² - (Σ x_j)²),  w0 = (Σ y_j - w1·Σ x_j) / M
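A sketch of the closed-form fit (the toy data are an assumption):

def fit_line(xs, ys):
    """Minimise sum_j (y_j - (w1*x_j + w0))^2 in closed form."""
    m = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    w1 = (m * sxy - sx * sy) / (m * sxx - sx * sx)
    w0 = (sy - w1 * sx) / m
    return w1, w0

xs, ys = [1, 2, 3, 4], [2.1, 3.9, 6.2, 7.8]   # roughly y = 2x
print(fit_line(xs, ys))                        # w1 ~ 1.94, w0 ~ 0.15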

    Unit 6 Unsupervised Learning

Unlike supervised learning, there is only an input vector, no label. The goal is to find the structure (patterns) of this input.

    Clustering algorithms

    k-means

    Algorithm

- Select k cluster centres at random
- Repeat until no move can be made:
  o Assign data points to the nearest cluster centre
  o For each cluster: move the cluster centre to the mean (average point) of its assigned data points
  o If a cluster becomes empty: restart it at random
  o This algorithm is proved to converge to a local minimum, and is not NP
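A compact sketch of the loop (the toy 2-D points are an assumption; ties are ignored for brevity):

import random

def kmeans(points, k, iters=100):
    """Assign points to the nearest centre, then move centres to cluster means."""
    centres = random.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                        # assignment step
            i = min(range(k), key=lambda i: (p[0] - centres[i][0]) ** 2
                                            + (p[1] - centres[i][1]) ** 2)
            clusters[i].append(p)
        new = [(sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
               if c else random.choice(points) for c in clusters]
        if new == centres:                      # no move: converged
            break
        centres = new
    return centres

pts = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 8), (8, 9)]
print(sorted(kmeans(pts, 2)))   # centres near (0.33, 0.33) and (8.67, 8.67)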

Problems of the k-means clustering algorithm

- Need to know k
- Local minima
- High dimensionality
- Lack of mathematical basis

    Expectation maximisation (EM)

    - A generalisation of k-means, but first we need to learn Gaussian distribution)- Gaussian distribution

    o mean averageo quadratic deviationo M number of data points

    - Gaussian learning

    - Maximum likelihood


- EM as a probabilistic generalisation of k-means
- Choose k

    Linear dimensionality reduction

    Spectral (affinity-based) clustering

    Unit 7 Representation with Logic

    Propositional logic

    Truth table


Given a space of states, a model has its own truth table.

- A sentence is valid if it is true in all models
- A sentence is satisfiable if it is true in some model but false in others
- A sentence is unsatisfiable if it is false in all models

Limitations of propositional logic

- Can handle only TRUE or FALSE; no uncertainty
- Can't talk about object properties, nor relationships between objects
- No shortcuts

To overcome the limitations of propositional logic there are first-order logic and probability. We focus on first-order logic, which overcomes the last 2 limitations.

    3 types of representations

- Atomic (e.g. problem solving)
- Factored (e.g. propositional logic)
- Structured (e.g. programming languages)

    First order logic

    Model


- Set of objects
- Set of constants
- Set of functions
- Set of relations (unary, binary, etc.)

Why is it called first-order? Because its operators work on objects only; there are no operations on the relationships between objects (that would be higher-order logic).

    Syntax

- Sentence = relation
- Operators: operate on sentences
- Terms: can be constants, variables or functions
- Quantifiers: unique and important for first-order logic
  o 2 quantifiers: "for all" (∀) and "there exists" (∃)
  o If the quantifier is omitted we assume the "for all" quantifier
  o Although all variations are allowed, normally "for all" structures have the form ∀x P(x) ⇒ Q(x), and "there exists" structures have the form ∃x P(x) ∧ Q(x)


    Unit 8 Planning

    Why Plan? (or Planning vs. Problem Solving)

Problem solving: find a solution upfront and then execute it. Although we have a solution, we are not always able to execute it, due to:

- A changing (partially observable) environment; and/or
- An unpredictable (stochastic) environment; and/or
- Multi-agency

The solution is PLANNING, i.e. before doing the next action we observe what happened after the previous action and then decide. With planning we move from the world of actual states to the world of belief states. See the example below with a vacuum cleaner (one belief state consists of one or a few world states).


    More details about plans, actions, and observations

    3 types of vacuum cleaners (VC)

- Sensorless: the VC doesn't have any sensor, so no observations
- Partially observable: the VC can see its location, and whether the location is clean, but it can't see other places
- Stochastic: the VC can attempt to move left or right, but the move may or may not succeed

A few things to note:

- Sensorless vacuum example: even though we can't observe, when we do actions we know more about the world
- Partially observable vacuum example: when we do actions *and also observe* we know even more
- Stochastic vacuum example: we may need branching (and loops): do an action, observe the result, and based on the result go a different way. This branching is not the same as branching in Problem Solving!

In general, actions may increase uncertainty, while observations always reduce it. See the diagram below:

2 types of plans

- Bounded (finite number of steps)
- Unbounded (an infinite number of steps is allowed)

Plans are usually specified in 2 ways:

- Linear (list of steps in order); or
- Tree (when we have branches in the plan; branching is usually done on an observation!)

    Specify plans mathematically


- A = set of actions, S = set of states, F = final states (goals)
- First equation, for the exact-state world: s' = Result(s, a)
- Second equation, for the belief-state world: b' = Update(Predict(b, a), o), the predict-observe-update cycle
  o Problem: some belief states can become very large
  o Solution: instead of describing a belief state as a list of exact states, we use variables

Classical Planning: a representation language to describe plans

- Propositional logic is used
- Variables, not states, are used to describe things
- To describe states:
  o Variables
  o State space
  o World state: a complete assignment of all variables
  o Belief state
- To describe actions:
  o Described by an action schema, a group of many possible actions similar to each other
  o An action schema is described by specifying PRE(CONDITION), where the action schema is possible, and EFF(ECT) of the action schema (see the example schema below)
- 2 ways to find a plan:
  o Search in state space
    - Progression (forward) search: normal problem-solving in state space
    - Regression (backward) search: backward search from the goal state
  o Search in the plan space
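A sketch of what such a schema looks like; the Fly schema is the standard textbook example, and the exact predicates are illustrative:

Action(Fly(p, x, y),
    PRECOND: At(p, x) ∧ Plane(p) ∧ Airport(x) ∧ Airport(y),
    EFFECT: ¬At(p, x) ∧ At(p, y))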


    Situation calculus

SC = first-order logic with a set of conventions for how to represent states and actions. Compared with classical planning, where propositional logic is used, SC has the advantage of the "for all" and "there exists" flexibility in its descriptions.

- 2 types of objects:
  o Actions: normally they are functions, e.g. Fly(p,x,y), the flight of plane p from x to y
  o Situations: normally they are paths (of actions in a state-space search), not states
    - Initial situation S0
    - A function S' = Result(S, a), where S is a situation and a is an action; S' is another situation
- Using a predicate to specify the set of possible actions given a situation: Poss(a, S)
  o It usually has the form SomePrecond(S) ⇒ Poss(a, S). Poss() is called the possibility axiom for the action a.
  o Example of the possibility axiom for the action Fly(p,x,y), where p is a plane, x and y are 2 locations, and s is a situation:
    Plane(p, s) ∧ Airport(x, s) ∧ Airport(y, s) ∧ At(p, x, s) ⇒ Poss(Fly(p, x, y), s)

Fluent = a predicate (i.e. a function or a relation) that can change from one situation to another. For example the predicate At(P,X,S), a plane P is at the airport X given situation S, is a fluent.

- Convention: the situation S is given as an argument of the predicate, the last argument
- A true fluent is a fluent that is true in situation s

In Classical Planning we use schemas to describe what happens when we execute each action. In Situation Calculus we use successor-state axioms to describe what holds in the situation that is the successor of executing an action (state is a synonym of situation?). One successor-state axiom per fluent!

- In general, a successor-state axiom has the form: ∀a,s Poss(a,s) ⇒ (the fluent is true in Result(s,a) ⇔ action a made it true ∨ (it was already true ∧ action a didn't undo it))
  o a is an action
  o s is a state
  o It says: if it is possible to execute a in state s, then the fluent is TRUE if action a makes it true, or action a doesn't undo it
- Example of the successor-state axiom for the fluent (predicate) "some cargo is in some plane", where c is a cargo, p is a plane, a is an action, and s is a state / situation:
  Poss(a,s) ⇒ (In(c, p, Result(s,a)) ⇔ (a = Load(c, p, x) ∨ (In(c, p, s) ∧ a ≠ Unload(c, p, x))))


A great thing about SC is that we already have solvers for first-order logic, i.e. once a problem is described in the language of SC, we can automatically come up with a solution (the path from the initial state to the goal state)!

    Unit 9 Planning Under Uncertainty

    (RL reinforcement learning)

    Planning in different environments:

                     | Deterministic                  | Stochastic
Fully Observable     | A*, Depth-First, Breadth-First | MDP (Markov Decision Process)
Partially Observable |                                | POMDP (Partially Observable MDP)

    MDP

What is a Markov process?

A Finite State Machine where the outcome of an action is not certain but probabilistic (e.g. action a1 moves the system from state S1 to state S2 with probability 50%) is a Markov process.

- States S1, ..., SN
- Actions a1, ..., aN
- State-transition matrix T(S,a,S') = P(S'|a,S) (T is called the transition function)
- Reward function R(S,a,S'), or sometimes simply R(S)
- A policy assigns an (optimal) action to each state
- We try to find a policy π(S) that maximises the discounted total reward


Problem under study: Grid World

Moving North could lead to moving North (80%), East (10%) or West (10%). Conventional planning won't work, so we need a policy π(S) → A for each state. The task is to find the optimal policy.

Value function

V(S) = E[ Σ_t γ^t · R_t ]

- E[·] is the expectation of a stochastic process
- t is the time moment
- γ is the discount factor
- Planning = calculating value functions!

A recursive algorithm (so-called Value Iteration) to determine the value function:

V(S) ← R(S) + γ · max_a Σ_S' P(S'|S,a) · V(S')

The policy is then made based on the value function; a sketch of the iteration follows below.
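A small value-iteration sketch on a toy 1-D world (the world, rewards and γ are assumptions for the example):

# Toy 1-D world: states 0..3; state 3 is terminal with reward +1; every
# step costs -0.04; actions left/right succeed with prob 0.8, stay with 0.2.
GAMMA, R_STEP = 1.0, -0.04
STATES = [0, 1, 2, 3]

def transitions(s, a):           # a is -1 (left) or +1 (right)
    s2 = min(max(s + a, 0), 3)
    return [(0.8, s2), (0.2, s)]

V = {s: 0.0 for s in STATES}
for _ in range(100):             # iterate to (near) convergence
    new = {3: 1.0}               # terminal state's value is its reward
    for s in (0, 1, 2):
        new[s] = max(sum(p * (R_STEP + GAMMA * V[s2])
                         for p, s2 in transitions(s, a))
                     for a in (-1, +1))
    V = new
print(V)   # values increase toward the goal state 3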


    Conclusion

    POMDP

Information Space = Belief Space

    Unit 10 Reinforcement Learning (RL)

    3 forms of learning

- Supervised
- Unsupervised
- Reinforcement:
  o A sequence of state, action, state, action, etc.
  o Rewards associated with the sequence
  o We try to learn what to do to maximise rewards

Agents of RL

(or: what to do if P() and/or R() are not known)

Agent               | Knows | Learns              | Uses
Utility-based agent | P     | R, then U (utility) | U
Q-learning agent    |       | Q(S,a)              | Q
Reflex agent        |       | π(S)                | π

- Passive RL agents: stick to a fixed policy
  o Example: Temporal Difference (TD) learning
- Active RL agents: change the policy as we learn


  o Example: Greedy learning: recalculate the policy after a certain number of iterations. Problem: not enough exploration; because it is greedy, once it has found some local optimum it sticks with it
  o Solution for Greedy: more exploration is needed (at some point we don't take the optimal policy, in order to explore more). BUT more exploration means more cost, so we need balancing!

Q Learning

- Many varieties, but the common point is to find Q(s,a), not the utility function U and not the transition matrix
- The policy may change in Q-learning
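A minimal Q-learning sketch using the standard update Q(s,a) ← Q(s,a) + α·(r + γ·max_a' Q(s',a') - Q(s,a)); the chain world is an assumption for the example:

import random
from collections import defaultdict

# Toy chain world: states 0..4; reaching 4 gives reward 1 and ends the episode.
ALPHA, GAMMA, EPS = 0.5, 0.9, 0.1
Q = defaultdict(float)

def step(s, a):                     # a is -1 or +1
    s2 = min(max(s + a, 0), 4)
    return s2, (1.0 if s2 == 4 else 0.0)

for _ in range(500):                # episodes
    s = 0
    while s != 4:
        # epsilon-greedy action selection (explore vs. exploit)
        a = random.choice((-1, 1)) if random.random() < EPS else \
            max((-1, 1), key=lambda a: Q[s, a])
        s2, r = step(s, a)
        # Q-learning update
        Q[s, a] += ALPHA * (r + GAMMA * max(Q[s2, -1], Q[s2, 1]) - Q[s, a])
        s = s2

print({s: max(Q[s, -1], Q[s, 1]) for s in range(4)})   # values grow toward state 4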

    Unit 11 Hidden Markov Model and Filters

    HMM

Used to analyse time series; applicable in:

- Robotics
- Medical
- Finance
- Speech
- Etc.

A Markov Chain is a simple Bayes network where each state depends only on the previous state: s1 → s2 → ... → sN; each state also emits a so-called measurement.

A Hidden Markov Model (HMM) is a Markov chain where the states s1, s2, etc. (the prior probabilities) are hidden (not observable); instead we can observe only the measurements (the posterior probabilities).

Using an HMM it is possible to do 2 things: prediction and state estimation

    - Prediction (of next state and/or next measurement)


  o Bayes rule (see the picture above) is used for prediction (usually we calculate only the numerators and then normalise)
- State estimation (computing the probability of hidden or internal states given the measurements)
  o The total probability formula is used to predict the next state
- The 2 equations above, plus the distribution of the initial state P(s0), form the math of the HMM

    Stationary Distribution

SD = the probabilities in an HMM as time approaches infinity

    Transition Probabilities

Observe a sequence of days, e.g. R-R-S-S-R-S.

Then find the maximum-likelihood transition probabilities (or use Laplace smoothing); a sketch follows below.
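A sketch of the maximum-likelihood (and Laplace-smoothed, k=1) estimates for this sequence:

from collections import Counter

days = "RRSSRS"
pairs = Counter(zip(days, days[1:]))     # observed transitions
outgoing = Counter(days[:-1])            # transitions out of each state

for s1 in "RS":
    for s2 in "RS":
        ml = pairs[s1, s2] / outgoing[s1]                  # max likelihood
        ls = (pairs[s1, s2] + 1) / (outgoing[s1] + 2)      # Laplace, 2 states
        print(f"P({s2}|{s1}): ML={ml:.2f}  LS={ls:.2f}")
# e.g. ML: P(R|R)=1/3, P(S|R)=2/3, P(S|S)=1/2, P(R|S)=1/2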

Example

HMM from observed measures (the HMM happy-grumpy problem)

Use the Bayes formula to calculate the HMM state from the observed measures.


    Particle Filter

    Example: a robot in a maze

- The robot can move freely in the maze; it has a sonar to measure the distance to objects / walls around it
- The belief space is represented by a collection of points, or particles
- Each point / particle represents a possible state
- Each point / particle is a 3-dimensional vector (x coordinate, y coordinate, and heading direction)
- Particle filters approximate a posterior by many guesses
- The density of the guesses represents the posterior probability of being in a certain location
- The more consistent a particle is with the measurement (the better the sonar measurement fits the place where the particle says the robot is), the better its chance to survive


    Particle Filter Algorithm

    Given

- S: a set of n particles with associated importance weights
- U: a control
- z: a measurement vector

Aim: to construct a new particle set S'

    Algorithm

PARTICLE_FILTER(S, U, z)
    S' = {}                      # the new particle set
    η = 0                        # an auxiliary parameter (for normalisation)
    for i = 1 ... n:             # loop through all n particles in S
        sample j ~ {w} with replacement
                                 # sample an index j according to the distribution defined
                                 # by the importance weights; this gives a particle s_j
        x' ~ P(x' | U, s_j)      # sample a possible successor state x' according to the
                                 # state-transition probability, using our control U and s_j
        w' = P(z | x')           # compute an importance weight w': the measurement
                                 # probability for the particle
        S' = S' + {⟨x', w'⟩}     # add the particle (with its weight) to S'
        η = η + w'               # η will be the sum of the weights at the end of the
                                 # loop, used for normalisation
    for i = 1 ... n:             # a loop to normalise the weights
        w'_i = w'_i / η

Better explanation (see http://www.aiqus.com/questions/18339/the-kidnapped-robot; also see the wiki on the kidnapped robot problem: https://secure.wikimedia.org/wikipedia/en/wiki/Kidnapped_robot_problem)

Suppose I kidnap your robot and put it back in your house at random.

It knows:

- it's in your house
- the layout of your house

    But it doesn't know where in your house it is.

    Observation step

1. It generates 100 locations at random to use as estimates of where it might be. (If your house is big it might need 1,000 or 10,000 locations instead of 100.)
2. Since they are random, each of these locations (x,y) is given an equal likelihood of 1%.
3. Each triad of (x,y,%) is a state, so we have 100 states.
4. The robot now takes a measurement of its surroundings and sees that it's in a 4-way junction.
5. With perfect sensing, we could eliminate states which aren't in a 4-way junction, but our sensors are a bit flaky. The North-pointing sensor is only correct 80% of the time, so there's a chance that we're not really in a 4-way junction after all. The other sensors (E, W, S) are also flaky.
6. So we adjust the probability of all states according to Bayes Rule. It's possible that two or more sensors are incorrect, but that's less likely than just one sensor being incorrect. When we're done, the states describing 4-way junctions have a higher probability (since that's most likely), and the rest have lower.
7. Perhaps 30 of our states (the ones which describe 4-way junctions) have a weight of 2% and the rest have weights smaller than 1%, and the total of all weights is 100%.

    Resample step

1. Now we move East. Just like the vacuum with slippery wheels, we could end up 1 position East, but there's also a smaller chance that we could end up NE or SE.
2. How do we update the list of states?
   a. We generate 100 new states from the existing states.
   b. We choose a state and duplicate it, then apply the movement and randomly choose the expected outcome. If we are using the robot from the gridworld lectures (80%/10%/10%) then there's an 80% chance that the new position is East of our original position, a 10% chance that we generate a new state which is NE, and 10% SE.
   c. We choose states to duplicate using a weighted average of the existing states. Since 30 of the states (the ones which describe a 4-way junction) are 2% likely and the rest are ...


... and the observation second. This is the same algorithm with a different implementation. In his model the system makes a move and presents the list of original states, the move taken, and the measurement made after the move. The essential concepts are the same.

    Unit 12 MDP Review

Unit 13 Games

Key point: games can be solved by search (depth-first, breadth-first, A*, etc.)

    Deterministic single-player games

- Set of states S (including the start state S0)
- Set of players P (in this case it contains a single player)
- A function Actions(s,p) that gives us the possible actions at state s for player p
- A transition function Result(s,a) → s' that tells us the result of action a at state s
- A terminal test Terminal(s) → TRUE or FALSE to tell us if it is the end of the game
- Terminal utilities U(s,p) that tell us, for a given state s and a given player p, the number which is the value of the game for this player

Deterministic 2-player (turn-taking) zero-sum games

- Deterministic: there is a single result of any action
- 2-player: 2 players, MAX and MIN

Minimax routine


o ▲ is a move by MAX
o ▼ is a move by MIN
o ■ is a terminal state
o The value function is defined as: the utility at a terminal state, the maximum of the children's values at a MAX node, and the minimum at a MIN node
o MAX tries to maximise the value function; the algorithm is the recursive maxValue()/minValue() pair

The complexity of the algorithm for the tree below (b = branching factor, m = depth):

Computational complexity = O(b^m); Space complexity = O(b·m)

o MIN tries to minimise the value function: similar to the above but opposite
- Zero-sum: the sum of the utilities of the 2 players is 0
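A minimax sketch on a tiny hand-made game tree (the tree values are an assumption for the example):

def minimax(node, is_max):
    """Value of a game tree: leaves are numbers, internal nodes are lists."""
    if isinstance(node, (int, float)):        # terminal state: its utility
        return node
    values = [minimax(child, not is_max) for child in node]
    return max(values) if is_max else min(values)

# MAX to move at the root; each inner list is one level deeper (MIN's turn).
tree = [[3, 12, 8], [2, 4, 6], [14, 5, 2]]
print(minimax(tree, is_max=True))   # 3: MAX picks the branch whose MIN value is best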

Reduce complexity: 3 approaches

- Reduce b, the breadth of the tree:
  o The α-β pruning technique can reduce O(b^m) to O(b^(m/2))
- Reduce m, the depth of the tree: e.g. cut off at some level and use an evaluation function


- Combination of reducing b and reducing m (alpha-beta pruning with a depth cutoff)
  o This algorithm also uses a new definition of maxValue(), as below
- Convert the tree into a graph: e.g. in chess we have opening books, ending books, midgame ...

Only in the reduce-m approach do we lose information.

    Stochastic games


- ? is a chance node, where we take the expected value
- The expected value in probability is calculated this way:
  o Assume we have N possibilities with values a1, a2, ..., aN and corresponding probabilities p1, p2, ..., pN (Σ p_i = 1)
  o Expected value = Σ a_i·p_i

Unit 14 Game Theory

2 objectives

- Agent design: given a game, find the optimal policy
- Mechanism design: design the game rules to attract players and benefit the game owner. More formally: given a utility function, and assuming the agents act rationally, find a mechanism that maximises global utilities

Key definitions (on the example of the Prisoner's dilemma)

Dominant strategy: a strategy with which a player does better than with any other strategy, regardless of the other player's strategy

- For A: testify
- For B: testify

Pareto-optimal outcome: there is no other outcome that all players would prefer

- The outcome A=-1, B=-1 is Pareto optimal

Equilibrium: an outcome where no player can benefit from switching to a different strategy, assuming the other player stays the same

- The outcome A=-5, B=-5 is an equilibrium

Two-Finger Morra (zero-sum)

2 players, Even (E) and Odd (O), each showing their fingers at the same time


Difficulty: no dominant strategy, no Pareto optimum

Solution 1: move from the matrix form to a tree, assuming that one player must go first:

- Left: MAX goes first
- Right: MIN goes first
- Utility: -3 ≤ U_E ≤ 2. No good: a very big discrepancy, because we handicap (ask to reveal) the first player too much. Solution 2 will ask for less.

Solution 2: like solution 1, but we assume the first player only needs to reveal his strategy (not his actual move):

- The probability that the first player selects his move is [p: one, (1-p): two]
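A worked sketch of the mixed-strategy solution, assuming the usual Morra payoffs to E: +2 for (one, one), -3 for (one, two) and (two, one), +4 for (two, two). If E reveals the strategy [p: one, (1-p): two], O replies with whichever move is worst for E:

- O plays one: U_E = 2p - 3(1-p) = 5p - 3
- O plays two: U_E = -3p + 4(1-p) = 4 - 7p

E picks p to equalise the two lines (otherwise O exploits the gap): 5p - 3 = 4 - 7p gives p = 7/12 and U_E = -1/12. Repeating the argument with O revealing first gives the same value, so the value of the game is -1/12 for E: far better than the -3 of solution 1.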

Unit 15 Advanced Planning

Advanced planning is like normal planning, but also takes into account the following:

- Time
- Resources
- Active perception


- Hierarchical plans

Scheduling

- A network of tasks
- S = start, F = finish
- Each task has an ES (earliest start) and an LS (latest start): ES is the earliest the task can start given its predecessors, LS the latest it can start without delaying the finish. Below are ES (left box) and LS (right box) for each state.

Extending planning

Problem of classical planning: it can't handle resources, so it may need to check many combinations. It is therefore natural to add resources to the language of classical planning; below is an example:

- New type: Resources (highlighted red above)
- 2 new attributes of Actions to deal with resources:
  o USE (highlighted green above): to use a resource; after use the resource still exists
  o CONSUME (highlighted green above): to consume a resource; after consumption the resource vanishes


    Hierarchical Planning

    Aim: close abstraction gap

- Group actions into abstract actions
- Do the planning with the bigger abstract actions
- Then do refinement to find concrete actions for each abstract action

    HTN = hierarchical task network

How do we know we have reached a solution?

- A hierarchical task network achieves the goal if, for every part, every abstract action has at least one refinement that achieves the goal

Reachable states (by an abstract action)

Approximate reachable states: lower and upper bounds on the states we can reach by an abstract action.

    Conformant vs. Sensory Planning

Conformant plan = a plan without perception. Sensory planning extends classical planning to allow active perception, to deal with partial observability.


- New type: Percept (highlighted red above), to express that we sense something

    Unit 16 Computer Vision I

    Image formation

(the way an image is captured)

    Pinhole camera

Perspective Projection formula (for one dimension, but it also applies to the other dimension): x = X·f / Z, where X is the object's coordinate, Z its distance from the pinhole, f the focal length, and x the projected coordinate on the image plane.

Vanishing points: parallel lines converge in perspective to vanishing points.

Lens: eliminates the drawback of the pinhole, which is that only one ray reaches the image. The restriction of a lens is that the image must be at a certain distance (the lens law): 1/f = 1/Z + 1/z, where Z is the distance to the object and z the distance to the image plane.


Computer vision

- Classify objects
- 3D reconstruction
- Motion analysis

Invariance is a key concept in Object Recognition: there are natural variations of an image that don't affect the nature of the object itself. We try to design recognition algorithms invariant to, say, scale, illumination, rotation, deformation, occlusion (the object is shaded by other objects) and viewpoint.

Grey-scale images

- More used than colour ones in image recognition
- As usual, a grey-scale image is represented by a matrix (e.g. 700x700) with a value from 0 to 255 in each cell (0 = black, 255 = white)

Extract features: using (kernel) masks

Linear filter: the output image is the convolution of the input image with the kernel

Gradient kernels (filters)

- Horizontal filter, e.g. the difference kernel [-1, 0, +1] along x
- Vertical filter, e.g. the same kernel along y


    Horizontal filters find the vertical edges and vice versa!

    Gradient images

To find all the edges (both horizontal and vertical) we combine the horizontal and vertical filters into a gradient image (gradient magnitude): magnitude = √(gx² + gy²). A sketch follows below.
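A small sketch of the gradient magnitude (the tiny image is an assumption; simple [-1, 0, +1] difference kernels stand in for the course's kernels):

img = [
    [0, 0, 0, 255, 255],
    [0, 0, 0, 255, 255],
    [0, 0, 0, 255, 255],
]

def grad_magnitude(img):
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = img[y][x + 1] - img[y][x - 1]   # horizontal kernel: finds vertical edges
            gy = img[y + 1][x] - img[y - 1][x]   # vertical kernel: finds horizontal edges
            out[y][x] = (gx * gx + gy * gy) ** 0.5
    return out

for row in grad_magnitude(img):
    print(row)   # large values along the vertical edge between columns 2 and 3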

The Canny edge detector (by Professor Canny) improves on gradient images significantly.

There are other masks, like Prewitt, the Gaussian kernel (to blur images), etc.

    Harris corner detector

- Corners are where a lot of horizontal edges and vertical edges exist together (top figure below)
- Sometimes we may need to rotate the image (bottom figure below). The trick is to use eigenvalues.

Modern feature detectors

- Localisable
- As ...

HOG = Histogram of Oriented Gradients

SIFT = Scale-Invariant Feature Transform


    Unit 17 Computer Vision II (3D)

    Stereo

Task: sensing range (distance) with cameras

- With one camera we can sometimes recover the 3D (i.e. the distance to the object), but not all the time
- Stereo vision with 2 cameras does it more easily, but again not all the time, e.g. in the case of the aperture effect

Stereo Rig = 2 pinhole cameras, usually with the same focal length. Below is how we solve for the depth Z (of an object at P) from the images of the 2 pinhole cameras:

- f = focal length
- Baseline B = distance between the 2 cameras
- x1 = projected image via pinhole 1
- x2 = projected image via pinhole 2
- Displacement, aka parallax = x1 - x2
- Optical axes = the axes drawn through the pinholes orthogonally to the image planes
- By similar triangles: Z = f·B / (x1 - x2)

Correspondence in stereo

We have images of 2 points P1, P2 (each point has 2 images, one from each camera).

- If we mistakenly mix up the images we may end up with phantom points P1' and P2'


- So finding the correspondence (data association) is important.

Take for example 2 cameras and an object that projects to point P in camera 1. The question is how to find its projection in camera 2:

- Not the whole image plane (2D)
- Not yet able to pinpoint it (0D)
- The right answer: along some line (1D). The line is the projection of the line connecting the real object and P!

Search along the line

- How can we find (pinpoint) the image in camera 2 along the line? 2 ways:
  o Matching a small image patch; or
  o Matching features (like edges) using linear filters (see Unit 16)
- An SSD (sum of squared differences) minimisation algorithm is usually used; a sketch follows below
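A sketch of SSD patch matching along a scanline (the 1-D signals are assumptions for the example):

def ssd(a, b):
    """Sum of squared differences between two equal-length patches."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def best_match(patch, scanline):
    """Slide `patch` along `scanline`; return the offset minimising SSD."""
    n = len(patch)
    return min(range(len(scanline) - n + 1),
               key=lambda i: ssd(patch, scanline[i:i + n]))

left_patch = [10, 80, 200, 80, 10]            # patch around a feature in image 1
right_line = [9, 11, 12, 80, 198, 82, 10, 9]  # corresponding scanline in image 2
print(best_match(left_patch, right_line))     # 2: the disparity along the line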

    Example

    We try to correspond 2 patterns below


To do it we try to minimise the cost function (see above).

Dynamic Programming is usually used to find the best alignment (similar to MVP):

- Define V(i,j) at each point in the grid to be the best value of getting there
- Start-point: top-left; end-point: bottom-right. Calculate V(i,j) for each point in the grid; in the end we get the V(i,j) of the end-point (20) and also find the path

Example of dynamic programming: aligning the column sequence B B R R R B against the row sequence B R R R R B on a grid.

    Unit 18 Computer Vision III

SFM = structure from motion

Motion here means we move a camera around, capture images of an object, and recover the object's structure (the 3D world).

SFM = a non-linear least-squares problem; minimisation is through:

- Gradient descent
- Conjugate gradient
- Gauss-Newton
- Levenberg-Marquardt (the common method!)
- Singular Value Decomposition (affine, orthographic)

    Unit 19 Robotics I

2 key tasks

- Find out where you are (aka localisation)
- Find the path to the goal state (aka planning)

2 types of state

- Kinematic state: the state of an object in space
- Dynamic state = kinematic state + velocity

Localisation: how to find your position in space given a map. For robotic cars:

- We could use GPS, but the error is ~5m
- A Particle Filter gives an error of 10cm!

Monte-Carlo localisation

(on the example of a differential-drive robot)

Deterministic case


Add noise (probability): after being given the command MOVE, the robot could be in a few possible places, each with some probability. This is the PREDICTION step of the Particle Filter.

MEASUREMENT step

Unit 20 Robotics II

Robotic Path Planning vs. normal Planning

- The robotic one is in a continuous state space
- The normal one is in a discrete state space

A* in continuous space

A* is discrete. It can find a path to the goal like in the picture below, but the path has many sharp turns, which is not suitable for robots like a self-driving car.

In continuous space A* becomes Hybrid A*.

Hybrid A* lacks completeness (it may not find a path), but it guarantees correctness (if it finds a path, the path is correct).


    Unit 21 Natural Language Processing

    2 language models

- Word-based; probabilistic; learned from data
  o Probability P(word1, word2, ...)
- Tree-based; logical; hand-coded
  o A set of sentences (= a language): {S1, S2, ...}

Probabilistic models

We talk about the probability that a sequence of words makes a sentence, P(w1, w2, ..., wn), or for short P(w_1:n).

2 important assumptions

- Markov assumption (of order k): the locality of the probabilities, i.e. P(w_i | w_1:i-1) ≈ P(w_i | w_i-k:i-1). Specifically, when k=1 we have P(w_i | w_1:i-1) ≈ P(w_i | w_i-1)
- Stationarity assumption: the probabilities are the same across the sequence, i.e. P(w_i | w_i-1) = P(w_j | w_j-1) for all i, j

We look at the data and try to find the probability of one word following another; very often we need smoothing or other techniques, otherwise the probability = 0%.

We also want to go beyond words (augmented models) by extending to non-word components.

n-gram models

- Bag of words (e.g. all Shakespeare's text)
- Build an n-gram model and sample from that model (i.e. generate random sentences that come from the probability distribution defined by that model)

Unigram model: sample words according to their frequency in the corpus (of Shakespeare's text), not taking into account any relationship between adjacent words.

Bigram model: sample from the probability of a word given the previous word. A sketch follows below.
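A small bigram-sampling sketch (the tiny corpus is an assumption for the example):

import random
from collections import defaultdict

corpus = ("the king is dead long live the king "
          "the queen is alive").split()

# count bigrams: P(next | prev) is proportional to count(prev, next)
follow = defaultdict(list)
for prev, nxt in zip(corpus, corpus[1:]):
    follow[prev].append(nxt)

def sample_sentence(start, length=6):
    words = [start]
    for _ in range(length - 1):
        nxt = follow.get(words[-1])
        if not nxt:
            break
        words.append(random.choice(nxt))   # proportional to bigram counts
    return " ".join(words)

print(sample_sentence("the"))   # e.g. "the king is dead long live"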

    Classification

    Common tasks

- Classify words into categories
- Detect the language of a text


Can be word-based or character-based.

Methods

- Naïve Bayes
- K-nearest neighbours
- Support Vector Machines (SVM)
- Logistic regression
- The gzip compression utility (Unix)

Segmentation

Given a sequence of words (characters), find where the spaces are (like in Chinese).

Probabilistic model of segmentation: the best segmentation S* is the one that maximises the joint probability of the segmentation: S* = argmax_S P(w_1:n). Approximation can be done with the Markov assumption, naïve Bayes, etc.

In the case of naïve Bayes we just try to maximise the probability of each individual word: S* ≈ argmax_S Π_i P(w_i)

- Equivalently, we can find the argmax over all possible segmentations of the string s into a first word f and the rest of the words r: S*(s) = argmax over splits s = f + r of P(f) · P(S*(r)). A sketch follows below.
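A sketch of this recursion under the naïve Bayes assumption (the unigram probabilities are made up for the example):

from functools import lru_cache

# toy unigram probabilities; unseen words get a tiny penalty
P = {"now": 0.1, "is": 0.1, "the": 0.2, "time": 0.05, "no": 0.02}

def pword(w):
    return P.get(w, 1e-12 / len(w))

@lru_cache(maxsize=None)
def segment(s):
    """Best segmentation: argmax over first word f and rest r of P(f)*P(segment(r))."""
    if not s:
        return 1.0, []
    best_p, best_words = 0.0, [s]
    for i in range(1, len(s) + 1):
        f, r = s[:i], s[i:]
        p_rest, words_rest = segment(r)
        p = pword(f) * p_rest
        if p > best_p:
            best_p, best_words = p, [f] + words_rest
    return best_p, best_words

print(segment("nowisthetime")[1])   # ['now', 'is', 'the', 'time']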

Spelling correction

Given a misspelled word w, find the correct word c.

Probabilistic model of spelling correction: find the best correction c* = argmax_c P(c|w)

Apply Bayes rule (ignoring the denominator as it is the same for all c): c* = argmax_c P(w|c)·P(c)

- P(c) comes from data counts
- P(w|c) comes from spelling-correction data

Example: "pulse" misspelled as "pluse"

- There is usually not enough data for P(pluse|pulse)
- So instead we work at the character level: define the misspelling type "ul" → "lu"

Unit 22 Natural Language Processing II

Tree model

Needs a grammar, e.g.

Grammar

S → NP VP
NP → N | D N | N N | N N N
VP → V | V NP | V NP NP
N → interest | Fed | rates | raises
V → interest | rates | raises
D → the | a

Where:

- S = sentence
- N = noun
- V = verb
- NP = noun phrase
- VP = verb phrase
- D = determiner (e.g. "a" or "the")


This type of grammar is called a Context-Free Grammar (CFG).

Problems with grammars

- Easy to omit good parses
- Easy to include bad parses by accident
- Not a problem: trees are unobservable

Solutions

- Add a probability to each tree
- Add word associations, like a Markov assumption
- Not a possible solution: making the grammar unambiguous

Probabilistic Context-Free Grammar (PCFG)

Add probabilities to the CFG grammar we know so far.

Example

Lexicons

How to define the probabilities: people are trained and paid to parse real-life texts.

Ambiguity

- I saw (a man with a telescope)
- I saw (a man) with a telescope

Lexicalised PCFG (LPCFG)

Normal PCFG

- The probability is given with regard to the category of the left-hand side
- Example: P(VP → V NP NP | lhs = VP) = 0.2; lhs = left-hand side

Lexicalised PCFG

- The probability is given for a specific word
- Example: P(VP → V NP NP | V = gave) = 0.25; "gave" is the word

How to build the grammar tree? Use search.