Stanford Intro AI Class Notes



Videos: http://www.wonderwhy-er.com/ai-class/

    Unit 1 Theory: Welcome to AI

Purposes

- Teach the basics of AI
- Excite you

    Structure

- Videos, Quizzes, Answer Videos
- Homework (assignments), Exams

    AI program = Intelligent Agent

Agent function = maps any given percept sequence to an action (an abstract mathematical description)

Agent program = a concrete implementation of the agent function

Rational agent: the one that does the right thing, i.e. for each possible percept sequence, a rational agent should select an action that is expected to maximize its performance measure, given the evidence provided by the percept sequence and whatever built-in knowledge the agent has.

Terminology

Environment types:

    - Fully vs. Partially Observable


- Deterministic (e.g. chess) vs. Stochastic (e.g. a dice game)
- Discrete vs. Continuous
- Benign vs. Adversarial

    AI as uncertainty management

    Reasons for uncertainty

- Sensor limits
- Adversaries
- Stochastic environment
- Laziness
- Ignorance

    Unit 2 Problem Solving

    Definition of a problem

- Initial state
- A function ACTIONS(s) → {a1, a2, a3, ...}
  o s is a state
  o a1, a2, a3 are the possible actions from the state s
- A function RESULT(s, a) → s'
  o s is a state
  o a is an action applied to the state
  o s' is the new state
- A function GOALTEST(s) → T|F
  o s is a state
  o GOALTEST tests whether s is the destination (final state)
- A function PATHCOST(s → s' → ... → s_goal) → n
  o n is the cost of the path from a state s through s' to the final state
  o It is mostly additive, so it is composed of many STEPCOST(s, a, s'), the sum of which is PATHCOST


    Route finding problem

    3 regions

- Explored
- Frontier
- Unexplored

    Base algorithm

1. Take a state from the Frontier (by some criterion)
2. GoalTest it; if YES, terminate here
3. Expand it (to new states that are added to the Frontier)
4. Remove it from the Frontier (move it to the Explored set)

    (Generic) Tree-Search

    Tree-Search applied to path-finder problem


Graph-Search (like Tree-Search, but remembers what has already been explored, so when the frontier is expanded it will not include already-explored states)

The key point of the base algorithm is the criterion used in step 1. It leads to a few concrete algorithms:

- Breadth-First (aka shortest-first): expand the shortest path first
- Uniform-Cost (aka cheapest-first): expand the cheapest path first
- Depth-First: expand the longest path first

    A* algorithm

It is proven that the algorithm improves if we know some extra info, e.g. the distance from the current state (which is about to be expanded) to the goal. This is the A* algorithm: expand the state on the frontier that minimises f = g + h, where g(s) is the cost of the path so far and h(s) estimates the remaining cost to the goal.

h is called the heuristic function. A* will always find the lowest-cost path only if h(s) ≤ true cost; in other words, h never over-estimates (h is said to be optimistic, or admissible).


A* works well if we can come up with a good heuristic, but that needs our intelligence. h, however, can be generated automatically by relaxing the problem's conditions, e.g. by dropping some of the rules that constrain movement.

    When it works

    Problem solving technique like above works when the problem is

- Fully observable
- Known
- Discrete
- Deterministic
- Static

    Unit 3 Probability in AI

Key things to remember

- Joint probability (see definition in the table below)
- Conditional probability (see definition in the table below)
- Total probability formula (see the table below)

Event: A, probability P(A)
Event: not A (¬A), probability P(¬A) = 1 - P(A)

This applies to conditional probability as well: 1 = P(A|B) + P(¬A|B)

But be careful when negating on the condition side:

- Wrong: 1 = P(A|B) + P(A|¬B)
- Wrong: P(A) = P(A|B) + P(A|¬B)
- Right: P(A) = P(A|B)·P(B) + P(A|¬B)·P(¬B) (total probability; see also the next line)

Total probability: P(A) = Σ_b P(A|B=b)·P(B=b), where b spans the whole probability space, i.e. Σ_b P(B=b) = 1.

In particular, if B has only 2 values (1 or 0) then P(A) = P(A|B)·P(B) + P(A|¬B)·P(¬B)

This formula applies to conditional probability as well:

P(A|M) = P(A|B,M)·P(B|M) + P(A|¬B,M)·P(¬B|M)

A or B: P(A ∨ B) = P(A) + P(B) - P(A ∧ B)

A and B (joint probability): P(A,B) = P(A ∧ B) = P(A|B)·P(B)

A given B (conditional probability): P(A|B) = P(A,B) / P(B)

Proof:
- B and A are sets in some space
- Given B, the space is now limited to be only B
- The probability of A given B is the part of A inside B, i.e. A ∧ B; in this new space of B we have P(A|B) = P(A,B) / P(B). Q.E.D.

Bayes Rule

From the formula above we have

P(A,B) = P(A|B)·P(B) = P(B|A)·P(A)

So we have

P(A|B) = P(B|A)·P(A) / P(B)

To calculate using Bayes rule we usually just calculate the numerators of P(A|B) and P(¬A|B) and then normalise (divide each by their sum), since P(A|B) + P(¬A|B) = 1.


For multiple variables in the condition:

P(A|B,C) = P(B|A,C)·P(A|C) / P(B|C)

Variables and probability distributions

    Example of variables

1. We have seen the example of the uncertain event a = "Spurs win the FA Cup in the year 2011".
   a. We can think of this event as just one state of the variable A which represents "FA Cup winners in 2011".
   b. In this case A has many states, one for each team entering the FA Cup.
   c. We write this as A = {a1, a2, ..., an} where a1 = "Spurs", a2 = "Chelsea", a3 = "West Ham", etc.
   d. Since in this case the set A is finite we say that A is a finite discrete variable.
2. As another example, suppose we are interested in the number of critical faults in our control system.
   a. The uncertain event is A = "Number of critical faults". Again it is best to think of A as a variable which can take on any of the discrete values 0, 1, 2, 3, ..., thus A = {0, 1, 2, 3, ...}.
   b. In this case we say that A is an infinite discrete variable.
   c. Let us define a1 as the event "A=0", and a2 as the event "A=1".
   d. Clearly the events a1 and a2 are mutually exclusive and so P(a1 or a2) = P(a1) + P(a2). However, we cannot say that P(a1 or a2) = 1 because a1 and a2 are not exhaustive. That is, they do not form a complete partition of A.
   e. However, if we define a3 as the event "A>1" then a1, a2, and a3 are exhaustive and mutually exclusive, and in this case P(a1) + P(a2) + P(a3) = 1.
   f. In general, if A is a variable with states a1, a2, ..., an, the probability distribution of A, written P(A), is simply the set of values {P(a1), P(a2), ..., P(an)}.

    Key thing to remember, summary from reddit

1) P(A): Concept of Probability

Like intelligence, probability is about trying to predict something about the future, but probability is a prediction with only one number.

That number is in the range from 0 to 1, where 0 means the given event will occur 0% of the time, and 1 means it will occur 100% of the time.

Example 1: P(A) = 0.3 means event A will occur 30% of the time (if you get 100 samples, you predict that 30 will be "of the A type").

Example 2: Dice. P("get a 5") = 1/6 = 0.1666...7

How to calculate that result:

- The die has a total of 6 possible events. Sample Space = {1, 2, 3, 4, 5, 6}
- Every event is as probable as the others (so-called "equiprobable"): you don't think the die is loaded, so P("1") = P("2") = ... = P("5") = P("6")
- Every event is disjoint from the others (if the result is 1, then it cannot be 2, nor 3, ... nor 6)

With those conditions, you can imagine:

P("get 1 or 2 or 3 or 4 or 5 or 6") = 1

(by definition of probability that means: 100% of the time you will get 1 or 2 or 3 ... or 6)

Since they are disjoint: P( Union( Ai ) ) = sum( P(Ai) )

http://www.eecs.qmul.ac.uk/~norman/BBNs/Bayesian_approach_to_probability.htm

    P("1") + P("2") + P("3") + P("4") + P("5") + P("6") = 1

    since equiprobable then 6*P("1") = 1

    then P("1") = 1/6

    then P("5") = P("1") = 1/6

In general, if you have a discrete set as Sample Space with those conditions:

- Equiprobable elements
- Disjoint events

Then probability can be calculated as:

P(A) = "number of A events" / "total number of events"

Example: P("get an even number on the die") = "number of even events" / "total die events" = 3/6 = 0.5

    Probability = number of favourable events / total number of events

2) P(A,B): Joint Probability

P(A,B) is the same as P("A intersection B") = P(A ∩ B)

P(A,B) = P("we get something that is A and ALSO B") = "# of events which are A and B" / total

    Example:

    L4 = "get a lower than 4 number in the dice"

    O = "get an odd number in the dice"

    P(O,L4) = "number of Odd which are lower than 4" / total

    = number of {"1","3"} / 6 = 2/6 = 1/3

3) P(A|B): Conditional Probability

P(A|B) (reads "probability of A given B"): we consider only the B events and ask what "percentage" OF THEM are also A.

Don't confuse this with the joint. Here you are talking about those events which are B and ALSO A, but relative to the B events (as if B were a new Sample Space of another experiment where you only get B samples).

P(A|B) = "# of events which are A and B" / "# of B events"

From the concept, we can derive the Conditional Probability formula: if P(A,B) and P(A|B) are different, how can we relate them?

P(A,B) = "# of events which are A and B" / total

P(A|B) = "# of events which are A and B" / "# of B events"

If we divide: P(A,B) / P(A|B) = "# of B events" / total

But hey, we know that concept! "# of B events" / total is what we call P(B)!

So:

**P(A,B) = P(A|B) * P(B)**

4) Bayes Theorem / Bayes Rule / Bayes Law

What is the relation between P(B|A) and P(A|B)?

Well, P(A,B) = P(B,A), so from the Conditional Probability formula: P(A|B) * P(B) = P(B|A) * P(A)

Dividing by P(B) we get the Bayes formula:

P(A|B) = P(B|A) * P(A) / P(B)

Translated into intuitive numbers:

    "# of A and B" / "# of B" =

  • 8/3/2019 Stanford Intro AI Class Notes

    9/56

    = "# of A and B" / "# of A" * "# of A" / total / ("# of B" / total)

    Another version is this (changing A and B):

    P(B|A) = P(A|B) * P(B) / P(A)

5) A ⊥ B: Concept of independent events

In conditional probability we talked about the "probability of A given B", but what if the "given B" doesn't matter for A? That is, what if the "probability of A given B" is the same as the "probability of A" alone? Or, seen another way, if taking only samples which have the B property has the same effect (for the calculation of P(A)) as taking samples which do not have the B property, and the same as any sample (no matter whether it is B or not B)...

That is, the B property "doesn't affect" the A property.

In this case, it is said that A and B are independent (written as A ⊥ B). It is the same as saying:

A ⊥ B ⇔ P(A|B) = P(A)

Is it commutative? Is "A and B are independent" the same as "B and A are independent"?

P(B|A) = P(A|B) * P(B) / P(A)

Given "A and B are independent":

P(B|A) = P(A|B) * P(B) / P(A) = P(A) * P(B) / P(A) = P(B), so "B and A are independent".

By symmetry, if "B and A are independent" then "A and B are independent".

And so YES, it is the same (as the language "they are independent" would suggest, since no order is implied).

From the combination of independence and the concept of conditional probability a new concept comes:

Conditional Independence

A ⊥ B | C ⇔ P(A|B,C) = P(A|C)

This is very important for Bayes networks: it is related to the concept of D-separation and it is important for solving exercises: knowing when you can apply this formula.

(If there is D-separation between A and B given C, then you are sure A and B are conditionally independent given C, and then you can use that formula.)

6) Total Probability

Imagine you have 2 disjoint subsets (that is, their intersection is an empty set: they have no elements in common). What's the number of elements of the union of both? In that case, the number is the sum.

In general, #(A ∪ B) = #(A) + #(B) - #(A ∩ B)

One simple example of disjoint sets: A and "not A". If an element (or sample) belongs to A, it cannot belong to "not A".

"not A" is written as ¬A

P(A ∪ ¬A) = P(A) + P(¬A) - P(A ∩ ¬A) = P(A) + P(¬A) = 1

Another example of disjoint sets: "A ∩ B" and "A ∩ ¬B"

P("A ∩ B" ∪ "A ∩ ¬B") = P(A ∩ B) + P(A ∩ ¬B)

And we know "A ∩ B" ∪ "A ∩ ¬B" ... it is simply A!

And we know P(A ∩ B), it is what we called the joint.

So: P(A) = P(A,B) + P(A,¬B)

And we can express that in terms of conditionals:

P(A,B) = P(A|B) * P(B)

So:

P(A) = P(A|B) * P(B) + P(A|¬B) * P(¬B)

This is called the Total Probability formula.

Which can be seen in numbers as:

"# of A" / total = ("# of A and B" / "# of B") * ("# of B" / total) + ("# of A and not B" / "# of not B") * ("# of not B" / total)


    which is in fact like saying:

    "# of A" / "total" = ("# of A and B" + "# of A and not B" ) / "total"

Typical problem #1

Consider the Bayes network on the left (C → T1, C → T2).

- C is the cancer event, with prior probability P(C) = 0.01
- T is a test for cancer
  o Probability of a positive result given C is P(+|C) = 0.9
  o Probability of a positive result given ¬C is P(+|¬C) = 0.2
- T1 and T2 are 2 attempts of the test T
- Calculate the probability of cancer if test T1 is negative and T2 is positive: P(C|-,+)?

Solution

Use Bayes rule to express P(C|-,+) and P(¬C|-,+):

- P(C|-,+) = P(-,+|C) * P(C) / P(-,+)
- P(¬C|-,+) = P(-,+|¬C) * P(¬C) / P(-,+)

P(-,+), the joint probability that T1 is negative and T2 is positive, is not easy to find. We however know that 1 = P(C|-,+) + P(¬C|-,+), so we just calculate the numerators and normalise:

- P(-,+|C) * P(C) = P(-|C) * P(+|C) * P(C) = (1-0.9) * 0.9 * 0.01 = 0.0009
- P(-,+|¬C) * P(¬C) = P(-|¬C) * P(+|¬C) * P(¬C) = (1-0.2) * 0.2 * (1-0.01) = 0.1584

Normalising, we find the answer: P(C|-,+) = 0.0009 / (0.0009 + 0.1584) = 0.0056 = 0.56%
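A quick sketch of this calculation in Python (the numbers are the ones given above):

p_c = 0.01                       # prior P(C)
p_pos_c, p_pos_nc = 0.9, 0.2     # P(+|C), P(+|not C)

# numerators of Bayes rule for the evidence "T1 negative, T2 positive"
num_c  = (1 - p_pos_c)  * p_pos_c  * p_c         # P(-|C) P(+|C) P(C)
num_nc = (1 - p_pos_nc) * p_pos_nc * (1 - p_c)   # P(-|~C) P(+|~C) P(~C)

print(num_c / (num_c + num_nc))                  # normalise: ~0.0056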

Typical problem #2

The conditions are as in typical problem #1. Find P(T1=+|T2=+), the probability that test 1 is positive given that test 2 is positive.

Solution

To solve this we need conditional independence (read the next chapter first). Steps:

- Apply the total probability formula: P(T1|T2) = P(T1|T2,C)*P(C|T2) + P(T1|T2,¬C)*P(¬C|T2)
- Because T1 and T2 are independent given C:
  o P(T1|T2,C) = P(T1|C)
  o P(T1|T2,¬C) = P(T1|¬C)
- So P(T1|T2) = P(T1|C)*P(C|T2) + P(T1|¬C)*P(¬C|T2) = P(+|C)*P(C|+) + P(+|¬C)*P(¬C|+) (simplified writing, replacing T1 and T2 with +)
- Apply Bayes rule:
  o P(C|+) = P(+|C)*P(C)/P(+)
  o P(¬C|+) = P(+|¬C)*P(¬C)/P(+)



- So P(T1|T2) = P(+|C)*P(+|C)*P(C)/P(+) + P(+|¬C)*P(+|¬C)*P(¬C)/P(+) = (0.9*0.9*0.01 + 0.2*0.2*0.99) / P(+) = 0.0486 / P(+)
- Apply the total probability formula to calculate P(+) = P(+|C)*P(C) + P(+|¬C)*P(¬C) = 0.207. We finally find that P(T1|T2) = 0.0486 / 0.207 = 0.2348

Typical problem #3: Confounding Cause

We have seen one type of Bayes network in typical problems #1 and #2: one single hidden cause causes 2 different measurements (Cause → Measure1, Cause → Measure2).

Confounding Cause is another type of Bayes network, where 2 hidden causes get confounded within a single observational variable (Cause1 → Measure ← Cause2).

Explaining Away, or the problem of Happiness when Sunny and Raise of salary: it is a typical confounding-cause Bayes network (Sunny → Happy ← Raise).


(Network parameters, recoverable from the calculations below: P(S) = 0.7, P(R) = 0.01, P(H|S,R) = 1, P(H|¬S,R) = 0.9, P(H|S,¬R) = 0.7, P(H|¬S,¬R) = 0.1.)

a) Find P(R|S)

R and S are independent if H is not given, so P(R|S) = P(R) = 0.01

b) Explaining Away question 1: find P(R|H,S)

Use Bayes rule (multiple variables in the condition): P(R|H,S) = P(H|R,S) * P(R|S) / P(H|S)

- P(H|R,S) = 1
- P(R|S) = 0.01 as calculated in (a) above
- Use total probability: P(H|S) = P(H|S,R)*P(R) + P(H|S,¬R)*P(¬R) = 1*0.01 + 0.7*0.99 = 0.703

So P(R|H,S) = 1 * 0.01 / 0.703 = 0.0142

c) Explaining Away question 2: find P(R|H)

Use Bayes rule: P(R|H) = P(H|R) * P(R) / P(H)

- Use total probability: P(H|R) = P(H|R,S)*P(S) + P(H|R,¬S)*P(¬S) = 1*0.7 + 0.9*0.3 = 0.97
- P(R) = 0.01
- Use total probability across all the cases:
  P(H) = P(H|S,R)*P(S,R) + P(H|¬S,R)*P(¬S,R) + P(H|S,¬R)*P(S,¬R) + P(H|¬S,¬R)*P(¬S,¬R)
  P(H) = 0.5245 (remember R and S are independent, so P(R,S) = P(R)*P(S), and similarly for the others)

So P(R|H) = 0.97 * 0.01 / 0.5245 = 0.0185

Note that P(R|H) = 0.0185 > P(R|H,S) = 0.0142: additionally knowing it is sunny "explains away" some of the happiness, lowering the probability of a raise.
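A sketch checking these numbers in Python (using the network parameters recovered above):

p_s, p_r = 0.7, 0.01
p_h = {(True, True): 1.0, (False, True): 0.9,    # P(H|S,R),  P(H|~S,R)
       (True, False): 0.7, (False, False): 0.1}  # P(H|S,~R), P(H|~S,~R)

def ps(s):   # P(S=s)
    return p_s if s else 1 - p_s

def pr(r):   # P(R=r)
    return p_r if r else 1 - p_r

# total probability; S and R are independent, so P(S,R) = P(S)*P(R)
p_h_total   = sum(p_h[s, r] * ps(s) * pr(r)
                  for s in (True, False) for r in (True, False))
p_h_given_s = sum(p_h[True, r] * pr(r) for r in (True, False))   # P(H|S)
p_h_given_r = sum(p_h[s, True] * ps(s) for s in (True, False))   # P(H|R)

print(p_h[True, True] * p_r / p_h_given_s)   # P(R|H,S) ~ 0.0142
print(p_h_given_r * p_r / p_h_total)         # P(R|H)   ~ 0.0185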

    Conditional Independence

By definition, B and C are independent given A if P(B|A,C) = P(B|A)

D-separation (aka reachability)

It is used to find out if 2 states are independent:

- Active triplets: variables are dependent
- Inactive triplets: variables are independent
- (shading = given, or known, state)


    Bayes Networks

Bayes networks define probability distributions over graphs of random variables.

    Simplest Bayes network with 2 variables

To specify this network we need 3 parameters: P(A), and 2 others: P(B|A) and P(B|¬A)

    Example of a Bayes network with 5 variables

The compactness of the Bayes network significantly reduces the number of parameters. For the graph above of 5 variables, it is reduced from 31 (2^5 - 1) to 10 (1+1+4+2+2; see the picture above) thanks to the formula P(A,B,C,D,E) = P(A)·P(B)·P(C|A,B)·P(D|C)·P(E|C).

The formula is written as a product of probabilities; each factor is the probability of a variable, written as a conditional probability given the variables it depends on.
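As an illustration, a Python sketch that encodes this factorisation; the CPT numbers are made up for the example:

from itertools import product

# P(A,B,C,D,E) = P(A) P(B) P(C|A,B) P(D|C) P(E|C): 1+1+4+2+2 = 10 parameters
p_a, p_b = 0.3, 0.6                                          # example priors
p_c = {(1, 1): 0.9, (1, 0): 0.7, (0, 1): 0.4, (0, 0): 0.1}   # P(C=1|A,B)
p_d = {1: 0.8, 0: 0.2}                                       # P(D=1|C)
p_e = {1: 0.5, 0: 0.3}                                       # P(E=1|C)

def bern(p, v):   # P(V=v) for a binary variable with P(V=1)=p
    return p if v else 1 - p

def joint(a, b, c, d, e):
    return (bern(p_a, a) * bern(p_b, b) * bern(p_c[a, b], c)
            * bern(p_d[c], d) * bern(p_e[c], e))

# sanity check: the joint sums to 1 over all 2^5 assignments
print(sum(joint(*v) for v in product((0, 1), repeat=5)))     # 1.0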

Unit 4 Probability Inference

(how to answer probability questions using Bayes networks)


    Enumeration

Given the conditional probability tables of the burglary-earthquake-alarm network (b = burglary, e = earthquake, a = alarm, j = John calls, m = Mary calls), we can calculate P(+b,+j,+m) by enumeration over the hidden variables e and a:

P(+b,+j,+m) = Σ_e Σ_a P(+b)·P(e)·P(a|+b,e)·P(+j|a)·P(+m|a)
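A sketch of this enumeration in Python; the CPT values below are assumptions (the usual textbook alarm-network numbers), since the original table is a picture:

p_b, p_e = 0.001, 0.002                                           # P(+b), P(+e)
p_a = {(1, 1): 0.95, (1, 0): 0.94, (0, 1): 0.29, (0, 0): 0.001}   # P(+a|b,e)
p_j = {1: 0.90, 0: 0.05}                                          # P(+j|a)
p_m = {1: 0.70, 0: 0.01}                                          # P(+m|a)

def bern(p, v):
    return p if v else 1 - p

# P(+b,+j,+m) = sum over hidden e, a of P(+b) P(e) P(a|+b,e) P(+j|a) P(+m|a)
total = sum(p_b * bern(p_e, e) * bern(p_a[1, e], a) * p_j[a] * p_m[a]
            for e in (0, 1) for a in (0, 1))
print(total)   # ~0.00059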

    Speedup Techniques for Enumeration

    1. Pull out terms

2. Maximise independence (a good idea is to order variables following the causal direction)
3. Variable elimination


    Unit 5 Machine Learning

    Supervised Learning

Feature vector X, target label Y.

SL tries to predict labels given the input vectors, i.e. to find the function f with Y = f(X) (see the picture above).

Occam's Razor: prefer the simplest hypothesis that fits the data.

Quiz: Spam filter

Here we use a Naïve Bayes filter to detect spam.


    P(SPAM|M) = P(M|SPAM)*P(SPAM)/P(M) = P(today,is,secret|SPAM)*P(SPAM)/P(M)

Using the normalised Bayes rule as in "Key thing to remember, summary from reddit" above


(remember that "today", "is", "secret" are independent given the class):

    - P(today,is,secret|SPAM)*P(SPAM) = P(today|SPAM)*P(is|SPAM)*P(secret|SPAM)*P(SPAM)= 0


- P(today,is,secret|HAM)*P(HAM) = P(today|HAM)*P(is|HAM)*P(secret|HAM)*P(HAM) = 2/15 * 1/15 * 1/15 * 5/8 = 0.000037

So P(SPAM|M) = 0 / (0 + 0.000037) = 0!

This is not good: just because of the single word "today" we can't detect the spam (OVERFITTING!). Overfitting is a common problem when maximum likelihood is used!

One solution is to use Laplace Smoothing to define the probability of words (in this case the word is "today"):

P_LS(x) = (count(x) + k) / (N + k·|x|)

- ML = maximum likelihood, LS = Laplace smoothing
- x is a variable (in this case a word)
- count(x) is the number of occurrences of this value (e.g. "today") of the variable x
- |x| is the number of all possible values that the variable x can take
- k is a smoothing parameter
- N is the total number of occurrences of x (the variable, not the value) in the sample space

So apply Laplace Smoothing with k = 1 to the quiz (assuming the dictionary is 12 words for both SPAM and HAM; 9 is the total number of words on the SPAM side, 15 on the HAM side):

- P(today,is,secret|SPAM)*P(SPAM) = (0+1)/(9+12) * (1+1)/(9+12) * (3+1)/(9+12) * 0.4 = 0.00034
- P(today,is,secret|HAM)*P(HAM) = (2+1)/(15+12) * (1+1)/(15+12) * (1+1)/(15+12) * 0.6 = 0.00037

Normalising, we get P(SPAM|M) = 0.00034 / (0.00034 + 0.00037) = 0.48
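A sketch of this smoothed classifier in Python (word counts as given in the quiz; the helper names are mine):

def laplace(count, total, vocab, k=1):
    """Laplace-smoothed estimate: (count(x)+k) / (N + k*|x|)."""
    return (count + k) / (total + k * vocab)

VOCAB = 12                                           # dictionary size
spam_counts = {"today": 0, "is": 1, "secret": 3}     # SPAM side, N = 9
ham_counts  = {"today": 2, "is": 1, "secret": 1}     # HAM side,  N = 15

def score(words, counts, total, prior):
    s = prior
    for w in words:
        s *= laplace(counts.get(w, 0), total, VOCAB)
    return s

msg = ["today", "is", "secret"]
spam = score(msg, spam_counts, 9, prior=0.4)         # ~0.00034
ham  = score(msg, ham_counts, 15, prior=0.6)         # ~0.00037
print(spam / (spam + ham))                           # ~0.48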

    Overfitting Prevention

    Types of Supervised Learning

    - Classification: values of the target are discrete (binary in the picture above)


- Regression: values of the target are continuous
- Parametric: these methods have parameters, and the number of parameters is constant, independent of the training-set size
- Non-parametric: the number of parameters can grow significantly

K-nearest neighbours

K-nearest neighbours is a non-parametric supervised learning method. It has 2 steps (a sketch follows below):

- Learning step: memorise all data
- Labelling a new example:
  o Find the K nearest neighbours
  o Return the majority class label
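A minimal k-NN sketch (the toy points and the Euclidean distance are assumptions for the example):

from collections import Counter

def knn_label(query, data, k=3):
    """Classify `query` by majority vote among the k nearest training points."""
    # data: list of ((x, y), label) pairs; squared distance is enough for ranking
    nearest = sorted(data, key=lambda p: (p[0][0] - query[0]) ** 2
                                         + (p[0][1] - query[1]) ** 2)[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

train = [((0, 0), "red"), ((1, 0), "red"), ((0, 1), "red"),
         ((5, 5), "blue"), ((6, 5), "blue"), ((5, 6), "blue")]
print(knn_label((1, 1), train))   # red
print(knn_label((5, 4), train))   # blue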

Linear regression

M data points; y is continuous. Linear Regression tries to find the (linear!) function f, as shown below: y ≈ f(x)

2 types of f:

- f(x) = w1·x + w0, where w1 and w0 are scalars


- f(x) = w·x + w0, where w is a vector

To find f we define the quadratic loss function

LOSS = Σ_j (y_j - f(x_j))²

and try to minimise the loss (M is the number of training samples). For the scalar case the minimum has the closed form

w1 = (M·Σ x_j y_j - Σ x_j · Σ y_j) / (M·Σ x_j² - (Σ x_j)²),  w0 = (Σ y_j - w1·Σ x_j) / M
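A sketch of the closed-form fit (the toy data are an assumption):

def fit_line(xs, ys):
    """Minimise sum_j (y_j - (w1*x_j + w0))^2 in closed form."""
    m = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    w1 = (m * sxy - sx * sy) / (m * sxx - sx * sx)
    w0 = (sy - w1 * sx) / m
    return w1, w0

xs, ys = [1, 2, 3, 4], [2.1, 3.9, 6.2, 7.8]   # roughly y = 2x
print(fit_line(xs, ys))                        # w1 ~ 1.94, w0 ~ 0.15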

    Unit 6 Unsupervised Learning

Unlike supervised learning, there is only an input vector, no label. The goal is to find the structure (patterns) of this input.

    Clustering algorithms

    k-means

    Algorithm

- Select k cluster centres at random
- Repeat until no move can be made:
  o Assign data points to the nearest cluster centre
  o For each cluster: move the cluster centre to the mean (average point) of its assigned data points
  o If a cluster becomes empty: restart it at random
  o This algorithm is proved to converge to a local minimum, and is not NP
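A compact sketch of the loop (the toy 2-D points are an assumption; ties are ignored for brevity):

import random

def kmeans(points, k, iters=100):
    """Assign points to the nearest centre, then move centres to cluster means."""
    centres = random.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                        # assignment step
            i = min(range(k), key=lambda i: (p[0] - centres[i][0]) ** 2
                                            + (p[1] - centres[i][1]) ** 2)
            clusters[i].append(p)
        new = [(sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
               if c else random.choice(points) for c in clusters]
        if new == centres:                      # no move: converged
            break
        centres = new
    return centres

pts = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 8), (8, 9)]
print(sorted(kmeans(pts, 2)))   # centres near (0.33, 0.33) and (8.67, 8.67)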

Problems of the k-means clustering algorithm

- Need to know k
- Local minima
- High dimensionality
- Lack of mathematical basis

    Expectation maximisation (EM)

    - A generalisation of k-means, but first we need to learn Gaussian distribution)- Gaussian distribution

    o mean averageo quadratic deviationo M number of data points

    - Gaussian learning

    - Maximum likelihood


- EM as a probabilistic generalisation of k-means
- Choose k

    Linear dimensionality reduction

    Spectral (affinity-based) clustering

    Unit 7 Representation with Logic

    Propositional logic

    Truth table


Given a space of states, a model has its own truth table.

- A sentence is valid if it is true in all models
- A sentence is satisfiable if it is true in some model but false in others
- A sentence is unsatisfiable if it is false in all models

Limitations of propositional logic

- Can handle only TRUE or FALSE; no uncertainty
- Can't talk about object properties, nor relationships between objects
- No shortcuts

To overcome the limitations of propositional logic there are first-order logic and probability. We focus on first-order logic, which overcomes the last 2 limitations.

    3 types of representations

- Atomic (e.g. problem solving)
- Factored (e.g. propositional logic)
- Structured (e.g. programming languages)

    First order logic

    Model


- Set of objects
- Set of constants
- Set of functions
- Set of relations (unary, binary, etc.)

Why is it called first-order? Because its operators work on objects only; there are no operations on the relationships between objects (that would be higher-order logic).

    Syntax

- Sentence = relation
- Operators: operate on sentences
- Terms: can be constants, variables or functions
- Quantifiers: unique and important for first-order logic
  o 2 quantifiers: "for all" (∀) and "there exists" (∃)
  o If the quantifier is omitted we assume the "for all" quantifier
  o Although all variations are allowed, normally "for all" structures have the form ∀x P(x) ⇒ Q(x), and "there exists" structures have the form ∃x P(x) ∧ Q(x)


    Unit 8 Planning

    Why Plan? (or Planning vs. Problem Solving)

Problem solving: find a solution upfront and then execute it. Although we have a solution, we are not always able to execute it, due to:

- A changing (partially observable) environment; and/or
- An unpredictable (stochastic) environment; and/or
- Multi-agency

The solution is PLANNING, i.e. before doing the next action we observe what happened after the previous action and then decide. With planning we move from the world of actual states to the world of belief states. See the example below with a vacuum cleaner (one belief state consists of one or a few world states).


    More details about plans, actions, and observations

    3 types of vacuum cleaners (VC)

- Sensorless: the VC doesn't have any sensor, so no observations
- Partially observable: the VC can see its location, and whether the location is clean, but it can't see other places
- Stochastic: the VC can attempt to move left or right, but the move may or may not succeed

A few things to note:

- Sensorless vacuum example: even though we can't observe, when we do actions we know more about the world
- Partially observable vacuum example: when we do actions *and also observe* we know even more
- Stochastic vacuum example: we may need branching (and loops): do an action, observe the result, and based on the result go a different way. This branching is not the same as branching in Problem Solving!

In general, actions may increase uncertainty, while observations always reduce it. See the diagram below:

2 types of plans

- Bounded (finite number of steps)
- Unbounded (an infinite number of steps is allowed)

Plans are usually specified in 2 ways:

- Linear (list of steps in order); or
- Tree (when we have branches in the plan; branching is usually done on an observation!)

    Specify plans mathematically


- A = set of actions, S = set of states, F = final states (goals)
- First equation, for the exact-state world: s' = Result(s, a)
- Second equation, for the belief-state world: b' = Update(Predict(b, a), o), the predict-observe-update cycle
  o Problem: some belief states can become very large
  o Solution: instead of describing a belief state as a list of exact states, we use variables

Classical Planning: a representation language to describe plans

- Propositional logic is used
- Variables, not states, are used to describe things
- To describe states:
  o Variables
  o State space
  o World state: a complete assignment of all variables
  o Belief state
- To describe actions:
  o Described by an action schema, a group of many possible actions similar to each other
  o An action schema is described by specifying PRE(CONDITION), where the action schema is possible, and EFF(ECT) of the action schema (see the example schema below)
- 2 ways to find a plan:
  o Search in state space
    - Progression (forward) search: normal problem-solving in state space
    - Regression (backward) search: backward search from the goal state
  o Search in the plan space
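A sketch of what such a schema looks like; the Fly schema is the standard textbook example, and the exact predicates are illustrative:

Action(Fly(p, x, y),
    PRECOND: At(p, x) ∧ Plane(p) ∧ Airport(x) ∧ Airport(y),
    EFFECT: ¬At(p, x) ∧ At(p, y))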


    Situation calculus

SC = first-order logic with a set of conventions for how to represent states and actions. Compared with classical planning, where propositional logic is used, SC has the advantage of the "for all" and "there exists" flexibility in its descriptions.

- 2 types of objects:
  o Actions: normally they are functions, e.g. Fly(p,x,y), the flight of plane p from x to y
  o Situations: normally they are paths (of actions in a state-space search), not states
    - Initial situation S0
    - A function S' = Result(S, a), where S is a situation and a is an action; S' is another situation
- Using a predicate to specify the set of possible actions given a situation: Poss(a, S)
  o It usually has the form SomePrecond(S) ⇒ Poss(a, S). Poss() is called the possibility axiom for the action a.
  o Example of the possibility axiom for the action Fly(p,x,y), where p is a plane, x and y are 2 locations, and s is a situation:
    Plane(p, s) ∧ Airport(x, s) ∧ Airport(y, s) ∧ At(p, x, s) ⇒ Poss(Fly(p, x, y), s)

Fluent = a predicate (i.e. a function or a relation) that can change from one situation to another. For example the predicate At(P,X,S), a plane P is at the airport X given situation S, is a fluent.

- Convention: the situation S is given as an argument of the predicate, the last argument
- A true fluent is a fluent that is true in situation s

In Classical Planning we use schemas to describe what happens when we execute each action. In Situation Calculus we use successor-state axioms to describe what holds in the situation that is the successor of executing an action (state is a synonym of situation?). One successor-state axiom per fluent!

- In general, a successor-state axiom has the form: ∀a,s Poss(a,s) ⇒ (the fluent is true in Result(s,a) ⇔ action a made it true ∨ (it was already true ∧ action a didn't undo it))
  o a is an action
  o s is a state
  o It says: if it is possible to execute a in state s, then the fluent is TRUE if action a makes it true, or action a doesn't undo it
- Example of the successor-state axiom for the fluent (predicate) "some cargo is in some plane", where c is a cargo, p is a plane, a is an action, and s is a state / situation:
  Poss(a,s) ⇒ (In(c, p, Result(s,a)) ⇔ (a = Load(c, p, x) ∨ (In(c, p, s) ∧ a ≠ Unload(c, p, x))))


A great thing about SC is that we already have solvers for first-order logic, i.e. once a problem is described in the language of SC, we can automatically come up with a solution (the path from the initial state to the goal state)!

    Unit 9 Planning Under Uncertainty

    (RL reinforcement learning)

    Planning in different environments:

                     | Deterministic                  | Stochastic
Fully Observable     | A*, Depth-First, Breadth-First | MDP (Markov Decision Process)
Partially Observable |                                | POMDP (Partially Observable MDP)

    MDP

What is a Markov process?

A Finite State Machine where the outcome of an action is not certain but probabilistic (e.g. action a1 moves the system from state S1 to state S2 with probability 50%) is a Markov process.

- States S1, ..., SN
- Actions a1, ..., aN
- State-transition matrix T(S,a,S') = P(S'|a,S) (T is called the transition function)
- Reward function R(S,a,S'), or sometimes simply R(S)
- A policy assigns an (optimal) action to each state
- We try to find a policy π(S) that maximises the discounted total reward


Problem under study: Grid World

Moving North could lead to moving North (80%), East (10%) or West (10%). Conventional planning won't work, so we need a policy π(S) → A for each state. The task is to find the optimal policy.

Value function

V(S) = E[ Σ_t γ^t · R_t ]

- E[·] is the expectation of a stochastic process
- t is the time moment
- γ is the discount factor
- Planning = calculating value functions!

A recursive algorithm (so-called Value Iteration) to determine the value function:

V(S) ← R(S) + γ · max_a Σ_S' P(S'|S,a) · V(S')

The policy is then made based on the value function; a sketch of the iteration follows below.
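A small value-iteration sketch on a toy 1-D world (the world, rewards and γ are assumptions for the example):

# Toy 1-D world: states 0..3; state 3 is terminal with reward +1; every
# step costs -0.04; actions left/right succeed with prob 0.8, stay with 0.2.
GAMMA, R_STEP = 1.0, -0.04
STATES = [0, 1, 2, 3]

def transitions(s, a):           # a is -1 (left) or +1 (right)
    s2 = min(max(s + a, 0), 3)
    return [(0.8, s2), (0.2, s)]

V = {s: 0.0 for s in STATES}
for _ in range(100):             # iterate to (near) convergence
    new = {3: 1.0}               # terminal state's value is its reward
    for s in (0, 1, 2):
        new[s] = max(sum(p * (R_STEP + GAMMA * V[s2])
                         for p, s2 in transitions(s, a))
                     for a in (-1, +1))
    V = new
print(V)   # values increase toward the goal state 3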


    Conclusion

    POMDP

Information Space = Belief Space

    Unit 10 Reinforcement Learning (RL)

    3 forms of learning

- Supervised
- Unsupervised
- Reinforcement:
  o A sequence of state, action, state, action, etc.
  o Rewards associated with the sequence
  o We try to learn what to do to maximise rewards

Agents of RL

(or: what to do if P() and/or R() are not known)

Agent               | Knows | Learns              | Uses
Utility-based agent | P     | R, then U (utility) | U
Q-learning agent    |       | Q(S,a)              | Q
Reflex agent        |       | π(S)                | π

- Passive RL agents: stick to a fixed policy
  o Example: Temporal Difference (TD) learning
- Active RL agents: change the policy as we learn


  o Example: Greedy learning: recalculate the policy after a certain number of iterations. Problem: not enough exploration; because it is greedy, once it has found some local optimum it sticks with it
  o Solution for Greedy: more exploration is needed (at some point we don't take the optimal policy, in order to explore more). BUT more exploration means more cost, so we need balancing!

Q Learning

- Many varieties, but the common point is to find Q(s,a), not the utility function U and not the transition matrix
- The policy may change in Q-learning
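A minimal Q-learning sketch using the standard update Q(s,a) ← Q(s,a) + α·(r + γ·max_a' Q(s',a') - Q(s,a)); the chain world is an assumption for the example:

import random
from collections import defaultdict

# Toy chain world: states 0..4; reaching 4 gives reward 1 and ends the episode.
ALPHA, GAMMA, EPS = 0.5, 0.9, 0.1
Q = defaultdict(float)

def step(s, a):                     # a is -1 or +1
    s2 = min(max(s + a, 0), 4)
    return s2, (1.0 if s2 == 4 else 0.0)

for _ in range(500):                # episodes
    s = 0
    while s != 4:
        # epsilon-greedy action selection (explore vs. exploit)
        a = random.choice((-1, 1)) if random.random() < EPS else \
            max((-1, 1), key=lambda a: Q[s, a])
        s2, r = step(s, a)
        # Q-learning update
        Q[s, a] += ALPHA * (r + GAMMA * max(Q[s2, -1], Q[s2, 1]) - Q[s, a])
        s = s2

print({s: max(Q[s, -1], Q[s, 1]) for s in range(4)})   # values grow toward state 4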

    Unit 11 Hidden Markov Model and Filters

    HMM

Used to analyse time series; applicable in:

- Robotics
- Medical
- Finance
- Speech
- Etc.

A Markov Chain is a simple Bayes network where each state depends only on the previous state: s1 → s2 → ... → sN; each state also emits a so-called measurement.

A Hidden Markov Model (HMM) is a Markov chain where the states s1, s2, etc. (the prior probabilities) are hidden (not observable); instead we can observe only the measurements (the posterior probabilities).

Using an HMM it is possible to do 2 things: prediction and state estimation

    - Prediction (of next state and/or next measurement)


  o Bayes rule (see the picture above) is used for prediction (usually we calculate only the numerators and then normalise)
- State estimation (computing the probability of hidden or internal states given the measurements)
  o The total probability formula is used to predict the next state
- The 2 equations above, plus the distribution of the initial state P(s0), form the math of the HMM

    Stationary Distribution

SD = the probabilities in an HMM as time approaches infinity

    Transition Probabilities

Observe a sequence of days, e.g. R-R-S-S-R-S.

Then find the maximum-likelihood transition probabilities (or use Laplace smoothing); a sketch follows below.
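A sketch of the maximum-likelihood (and Laplace-smoothed, k=1) estimates for this sequence:

from collections import Counter

days = "RRSSRS"
pairs = Counter(zip(days, days[1:]))     # observed transitions
outgoing = Counter(days[:-1])            # transitions out of each state

for s1 in "RS":
    for s2 in "RS":
        ml = pairs[s1, s2] / outgoing[s1]                  # max likelihood
        ls = (pairs[s1, s2] + 1) / (outgoing[s1] + 2)      # Laplace, 2 states
        print(f"P({s2}|{s1}): ML={ml:.2f}  LS={ls:.2f}")
# e.g. ML: P(R|R)=1/3, P(S|R)=2/3, P(S|S)=1/2, P(R|S)=1/2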

Example

HMM from observed measures (the HMM happy-grumpy problem)

Use the Bayes formula to calculate the HMM state from the observed measures.


    Particle Filter

    Example: a robot in a maze

- The robot can move freely in the maze; it has a sonar to measure the distance to objects / walls around it
- The belief space is represented by a collection of points, or particles
- Each point / particle represents a possible state
- Each point / particle is a 3-dimensional vector (x coordinate, y coordinate, and heading direction)
- Particle filters approximate a posterior by many guesses
- The density of the guesses represents the posterior probability of being in a certain location
- The more consistent a particle is with the measurement (the better the sonar measurement fits the place where the particle says the robot is), the better its chance to survive


    Particle Filter Algorithm

    Given

- S: a set of n particles with associated importance weights
- U: a control
- z: a measurement vector

Aim: to construct a new particle set S'

    Algorithm

PARTICLE_FILTER(S, U, z)
    S' = {}                      # the new particle set
    η = 0                        # an auxiliary parameter (for normalisation)
    for i = 1 ... n:             # loop through all n particles in S
        sample j ~ {w} with replacement
                                 # sample an index j according to the distribution defined
                                 # by the importance weights; this gives a particle s_j
        x' ~ P(x' | U, s_j)      # sample a possible successor state x' according to the
                                 # state-transition probability, using our control U and s_j
        w' = P(z | x')           # compute an importance weight w': the measurement
                                 # probability for the particle
        S' = S' + {⟨x', w'⟩}     # add the particle (with its weight) to S'
        η = η + w'               # η will be the sum of the weights at the end of the
                                 # loop, used for normalisation
    for i = 1 ... n:             # a loop to normalise the weights
        w'_i = w'_i / η

Better explanation (see http://www.aiqus.com/questions/18339/the-kidnapped-robot; also see the wiki on the kidnapped robot problem: https://secure.wikimedia.org/wikipedia/en/wiki/Kidnapped_robot_problem)

Suppose I kidnap your robot and put it back in your house at random.

It knows:

- it's in your house
- the layout of your house

    But it doesn't know where in your house it is.

    Observation step

1. It generates 100 locations at random to use as estimates of where it might be. (If your house is big it might need 1,000 or 10,000 locations instead of 100.)
2. Since they are random, each of these locations (x,y) is given an equal likelihood of 1%.
3. Each triad of (x,y,%) is a state, so we have 100 states.
4. The robot now takes a measurement of its surroundings and sees that it's in a 4-way junction.
5. With perfect sensing, we could eliminate states which aren't in a 4-way junction, but our sensors are a bit flaky. The North-pointing sensor is only correct 80% of the time, so there's a chance that we're not really in a 4-way junction after all. The other sensors (E, W, S) are also flaky.
6. So we adjust the probability of all states according to Bayes Rule. It's possible that two or more sensors are incorrect, but that's less likely than just one sensor being incorrect. When we're done, the states describing 4-way junctions have a higher probability (since that's most likely), and the rest have lower.
7. Perhaps 30 of our states (the ones which describe 4-way junctions) have a weight of 2% and the rest have weights smaller than 1%, and the total of all weights is 100%.

    Resample step

1. Now we move East. Just like the vacuum with slippery wheels, we could end up 1 position East, but there's also a smaller chance that we could end up NE or SE.
2. How do we update the list of states?
   a. We generate 100 new states from the existing states.
   b. We choose a state and duplicate it, then apply the movement and randomly choose the expected outcome. If we are using the robot from the gridworld lectures (80%/10%/10%) then there's an 80% chance that the new position is East of our original position, a 10% chance that we generate a new state which is NE, and 10% SE.
   c. We choose states to duplicate using a weighted average of the existing states. Since 30 of the states (the ones which describe a 4-way junction) are 2% likely and the rest are ...


... and the observation second. This is the same algorithm with a different implementation. In his model the system makes a move and presents the list of original states, the move taken, and the measurement made after the move. The essential concepts are the same.

    Unit 12 MDP Review

Unit 13 Games

Key point: games can be solved by search (depth-first, breadth-first, A*, etc.)

    Deterministic single-player games

- Set of states S (including the start state S0)
- Set of players P (in this case it contains a single player)
- A function Actions(s,p) that gives us the possible actions at state s for player p
- A transition function Result(s,a) → s' that tells us the result of action a at state s
- A terminal test Terminal(s) → TRUE or FALSE to tell us if it is the end of the game
- Terminal utilities U(s,p) that tell us, for a given state s and a given player p, the number which is the value of the game for this player

Deterministic 2-player (turn-taking) zero-sum games

- Deterministic: there is a single result of any action
- 2-player: 2 players, MAX and MIN

Minimax routine


o ▲ is a move by MAX
o ▼ is a move by MIN
o ■ is a terminal state
o The value function is defined as: the utility at a terminal state, the maximum of the children's values at a MAX node, and the minimum at a MIN node
o MAX tries to maximise the value function; the algorithm is the recursive maxValue()/minValue() pair

The complexity of the algorithm for the tree below (b = branching factor, m = depth):

Computational complexity = O(b^m); Space complexity = O(b·m)

o MIN tries to minimise the value function: similar to the above but opposite
- Zero-sum: the sum of the utilities of the 2 players is 0
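A minimax sketch on a tiny hand-made game tree (the tree values are an assumption for the example):

def minimax(node, is_max):
    """Value of a game tree: leaves are numbers, internal nodes are lists."""
    if isinstance(node, (int, float)):        # terminal state: its utility
        return node
    values = [minimax(child, not is_max) for child in node]
    return max(values) if is_max else min(values)

# MAX to move at the root; each inner list is one level deeper (MIN's turn).
tree = [[3, 12, 8], [2, 4, 6], [14, 5, 2]]
print(minimax(tree, is_max=True))   # 3: MAX picks the branch whose MIN value is best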

Reduce complexity: 3 approaches

- Reduce b, the breadth of the tree:
  o The α-β pruning technique can reduce O(b^m) to O(b^(m/2))
- Reduce m, the depth of the tree: e.g. cut off at some level and use an evaluation function


- Combination of reducing b and reducing m (alpha-beta pruning with a depth cutoff)
  o This algorithm also uses a new definition of maxValue(), as below
- Convert the tree into a graph: e.g. in chess we have opening books, ending books, midgame ...

Only in the reduce-m approach do we lose information.

    Stochastic games


- ? is a chance node, where we take the expected value
- The expected value in probability is calculated this way:
  o Assume we have N possibilities with values a1, a2, ..., aN and corresponding probabilities p1, p2, ..., pN (Σ p_i = 1)
  o Expected value = Σ a_i·p_i

Unit 14 Game Theory

2 objectives

- Agent design: given a game, find the optimal policy
- Mechanism design: design the game rules to attract players and benefit the game owner. More formally: given a utility function, and assuming the agents act rationally, find a mechanism that maximises global utilities

Key definitions (on the example of the Prisoner's dilemma)

Dominant strategy: a strategy with which a player does better than with any other strategy, regardless of the other player's strategy

- For A: testify
- For B: testify

Pareto-optimal outcome: there is no other outcome that all players would prefer

- The outcome A=-1, B=-1 is Pareto optimal

Equilibrium: an outcome where no player can benefit from switching to a different strategy, assuming the other player stays the same

- The outcome A=-5, B=-5 is an equilibrium

Two-Finger Morra (zero-sum)

2 players, Even (E) and Odd (O), each showing their fingers at the same time


Difficulty: no dominant strategy, no Pareto optimum

Solution 1: move from the matrix form to a tree, assuming that one player must go first:

- Left: MAX goes first
- Right: MIN goes first
- Utility: -3 ≤ U_E ≤ 2. No good: a very big discrepancy, because we handicap (ask to reveal) the first player too much. Solution 2 will ask for less.

Solution 2: like solution 1, but we assume the first player only needs to reveal his strategy (not his actual move):

- The probability that the first player selects his move is [p: one, (1-p): two]
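A worked sketch of the mixed-strategy solution, assuming the usual Morra payoffs to E: +2 for (one, one), -3 for (one, two) and (two, one), +4 for (two, two). If E reveals the strategy [p: one, (1-p): two], O replies with whichever move is worst for E:

- O plays one: U_E = 2p - 3(1-p) = 5p - 3
- O plays two: U_E = -3p + 4(1-p) = 4 - 7p

E picks p to equalise the two lines (otherwise O exploits the gap): 5p - 3 = 4 - 7p gives p = 7/12 and U_E = -1/12. Repeating the argument with O revealing first gives the same value, so the value of the game is -1/12 for E: far better than the -3 of solution 1.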

Unit 15 Advanced Planning

Advanced planning is like normal planning, but also takes into account the following:

- Time
- Resources
- Active perception


- Hierarchical plans

Scheduling

- A network of tasks
- S = start, F = finish
- Each task has an ES (earliest start) and an LS (latest start): ES is the earliest the task can start given its predecessors, LS the latest it can start without delaying the finish. Below are ES (left box) and LS (right box) for each state.

Extending planning

Problem of classical planning: it can't handle resources, so it may need to check many combinations. It is therefore natural to add resources to the language of classical planning; below is an example:

- New type: Resources (highlighted red above)
- 2 new attributes of Actions to deal with resources:
  o USE (highlighted green above): to use a resource; after use the resource still exists
  o CONSUME (highlighted green above): to consume a resource; after consumption the resource vanishes


    Hierarchical Planning

    Aim: close abstraction gap

- Group actions into abstract actions
- Do the planning with the bigger abstract actions
- Then do refinement to find concrete actions for each abstract action

    HTN = hierarchical task network

How do we know we have reached a solution?

- A hierarchical task network achieves the goal if, for every part, every abstract action has at least one refinement that achieves the goal

Reachable states (by an abstract action)

Approximate reachable states: lower and upper bounds on the states we can reach by an abstract action.

    Conformant vs. Sensory Planning

Conformant plan = a plan without perception. Sensory planning extends classical planning to allow active perception, to deal with partial observability.


- New type: Percept (highlighted red above), to express that we sense something

    Unit 16 Computer Vision I

    Image formation

(the way an image is captured)

    Pinhole camera

Perspective Projection formula (for one dimension, but it also applies to the other dimension): x = X·f / Z, where X is the object's coordinate, Z its distance from the pinhole, f the focal length, and x the projected coordinate on the image plane.

Vanishing points: parallel lines converge in perspective to vanishing points.

Lens: eliminates the drawback of the pinhole, which is that only one ray reaches the image. The restriction of a lens is that the image must be at a certain distance (the lens law): 1/f = 1/Z + 1/z, where Z is the distance to the object and z the distance to the image plane.


Computer vision

- Classify objects
- 3D reconstruction
- Motion analysis

Invariance is a key concept in Object Recognition: there are natural variations of an image that don't affect the nature of the object itself. We try to design recognition algorithms invariant to, say, scale, illumination, rotation, deformation, occlusion (the object is shaded by other objects) and viewpoint.

Grey-scale images

- More used than colour ones in image recognition
- As usual, a grey-scale image is represented by a matrix (e.g. 700x700) with a value from 0 to 255 in each cell (0 = black, 255 = white)

Extract features: using (kernel) masks

Linear filter: the output image is the convolution of the input image with the kernel

Gradient kernels (filters)

- Horizontal filter, e.g. the difference kernel [-1, 0, +1] along x
- Vertical filter, e.g. the same kernel along y


    Horizontal filters find the vertical edges and vice versa!

    Gradient images

To find all the edges (both horizontal and vertical) we combine the horizontal and vertical filters into a gradient image (gradient magnitude): magnitude = √(gx² + gy²). A sketch follows below.
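A small sketch of the gradient magnitude (the tiny image is an assumption; simple [-1, 0, +1] difference kernels stand in for the course's kernels):

img = [
    [0, 0, 0, 255, 255],
    [0, 0, 0, 255, 255],
    [0, 0, 0, 255, 255],
]

def grad_magnitude(img):
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = img[y][x + 1] - img[y][x - 1]   # horizontal kernel: finds vertical edges
            gy = img[y + 1][x] - img[y - 1][x]   # vertical kernel: finds horizontal edges
            out[y][x] = (gx * gx + gy * gy) ** 0.5
    return out

for row in grad_magnitude(img):
    print(row)   # large values along the vertical edge between columns 2 and 3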

The Canny edge detector (by Professor Canny) improves on gradient images significantly.

There are other masks, like Prewitt, the Gaussian kernel (to blur images), etc.

    Harris corner detector

- Corners are where a lot of horizontal edges and vertical edges exist together (top figure below)
- Sometimes we may need to rotate the image (bottom figure below). The trick is to use eigenvalues.

Modern feature detectors

- Localisable
- As ...

HOG = Histogram of Oriented Gradients

SIFT = Scale-Invariant Feature Transform


    Unit 17 Computer Vision II (3D)

    Stereo

Task: sensing range (distance) with cameras

- With one camera we can sometimes recover the 3D (i.e. the distance to the object), but not all the time
- Stereo vision with 2 cameras does it more easily, but again not all the time, e.g. in the case of the aperture effect

Stereo Rig = 2 pinhole cameras, usually with the same focal length. Below is how we solve for the depth Z (of an object at P) from the images of the 2 pinhole cameras:

- f = focal length
- Baseline B = distance between the 2 cameras
- x1 = projected image via pinhole 1
- x2 = projected image via pinhole 2
- Displacement, aka parallax = x1 - x2
- Optical axes = the axes drawn through the pinholes orthogonally to the image planes
- By similar triangles: Z = f·B / (x1 - x2)

Correspondence in stereo

We have images of 2 points P1, P2 (each point has 2 images, one from each camera).

- If we mistakenly mix up the images we may end up with phantom points P1' and P2'


- So finding the correspondence (data association) is important.

Take for example 2 cameras and an object that projects to point P in camera 1. The question is how to find its projection in camera 2:

- Not the whole image plane (2D)
- Not yet able to pinpoint it (0D)
- The right answer: along some line (1D). The line is the projection of the line connecting the real object and P!

Search along the line

- How can we find (pinpoint) the image in camera 2 along the line? 2 ways:
  o Matching a small image patch; or
  o Matching features (like edges) using linear filters (see Unit 16)
- An SSD (sum of squared differences) minimisation algorithm is usually used; a sketch follows below
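A sketch of SSD patch matching along a scanline (the 1-D signals are assumptions for the example):

def ssd(a, b):
    """Sum of squared differences between two equal-length patches."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def best_match(patch, scanline):
    """Slide `patch` along `scanline`; return the offset minimising SSD."""
    n = len(patch)
    return min(range(len(scanline) - n + 1),
               key=lambda i: ssd(patch, scanline[i:i + n]))

left_patch = [10, 80, 200, 80, 10]            # patch around a feature in image 1
right_line = [9, 11, 12, 80, 198, 82, 10, 9]  # corresponding scanline in image 2
print(best_match(left_patch, right_line))     # 2: the disparity along the line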

    Example

    We try to correspond 2 patterns below


To do it we try to minimise the cost function (see above).

Dynamic Programming is usually used to find the best alignment (similar to MVP):

- Define V(i,j) at each point in the grid to be the best value of getting there
- Start-point: top-left; end-point: bottom-right. Calculate V(i,j) for each point in the grid; in the end we get the V(i,j) of the end-point (20) and also find the path

Example of dynamic programming: aligning the column sequence B B R R R B against the row sequence B R R R R B on a grid.

    Unit 18 Computer Vision III

SFM = structure from motion

Motion here means we move a camera around, capture images of an object, and recover the object's structure (the 3D world).

SFM = a non-linear least-squares problem; minimisation is through:

- Gradient descent
- Conjugate gradient
- Gauss-Newton
- Levenberg-Marquardt (the common method!)
- Singular Value Decomposition (affine, orthographic)

    Unit 19 Robotics I

2 key tasks

- Find out where you are (aka localisation)
- Find the path to the goal state (aka planning)

2 types of state

- Kinematic state: the state of an object in space
- Dynamic state = kinematic state + velocity

Localisation: how to find your position in space given a map. For robotic cars:

- We could use GPS, but the error is ~5m
- A Particle Filter gives an error of 10cm!

Monte-Carlo localisation

(on the example of a differential-drive robot)

Deterministic case


Add noise (probability): after being given the command MOVE, the robot could be in a few possible places, each with some probability. This is the PREDICTION step of the Particle Filter.

MEASUREMENT step

Unit 20 Robotics II

Robotic Path Planning vs. normal Planning

- The robotic one is in a continuous state space
- The normal one is in a discrete state space

A* in continuous space

A* is discrete. It can find a path to the goal like in the picture below, but the path has many sharp turns, which is not suitable for robots like a self-driving car.

In continuous space A* becomes Hybrid A*.

Hybrid A* lacks completeness (it may not find a path), but it guarantees correctness (if it finds a path, the path is correct).


    Unit 21 Natural Language Processing

    2 language models

- Word-based; probabilistic; learned from data
  o Probability P(word1, word2, ...)
- Tree-based; logical; hand-coded
  o A set of sentences (= a language): {S1, S2, ...}

Probabilistic models

We talk about the probability that a sequence of words makes a sentence, P(w1, w2, ..., wn), or for short P(w_1:n).

2 important assumptions

- Markov assumption (of order k): the locality of the probabilities, i.e. P(w_i | w_1:i-1) ≈ P(w_i | w_i-k:i-1). Specifically, when k=1 we have P(w_i | w_1:i-1) ≈ P(w_i | w_i-1)
- Stationarity assumption: the probabilities are the same across the sequence, i.e. P(w_i | w_i-1) = P(w_j | w_j-1) for all i, j

We look at the data and try to find the probability of one word following another; very often we need smoothing or other techniques, otherwise the probability = 0%.

We also want to go beyond words (augmented models) by extending to non-word components.

n-gram models

- Bag of words (e.g. all Shakespeare's text)
- Build an n-gram model and sample from that model (i.e. generate random sentences that come from the probability distribution defined by that model)

Unigram model: sample words according to their frequency in the corpus (of Shakespeare's text), not taking into account any relationship between adjacent words.

Bigram model: sample from the probability of a word given the previous word. A sketch follows below.
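A small bigram-sampling sketch (the tiny corpus is an assumption for the example):

import random
from collections import defaultdict

corpus = ("the king is dead long live the king "
          "the queen is alive").split()

# count bigrams: P(next | prev) is proportional to count(prev, next)
follow = defaultdict(list)
for prev, nxt in zip(corpus, corpus[1:]):
    follow[prev].append(nxt)

def sample_sentence(start, length=6):
    words = [start]
    for _ in range(length - 1):
        nxt = follow.get(words[-1])
        if not nxt:
            break
        words.append(random.choice(nxt))   # proportional to bigram counts
    return " ".join(words)

print(sample_sentence("the"))   # e.g. "the king is dead long live"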

    Classification

    Common tasks

- Classify words into categories
- Detect the language of a text


Can be word-based or character-based.

Methods

- Naïve Bayes
- K-nearest neighbours
- Support Vector Machines (SVM)
- Logistic regression
- The gzip compression utility (Unix)

Segmentation

Given a sequence of words (characters), find where the spaces are (like in Chinese).

Probabilistic model of segmentation: the best segmentation S* is the one that maximises the joint probability of the segmentation: S* = argmax_S P(w_1:n). Approximation can be done with the Markov assumption, naïve Bayes, etc.

In the case of naïve Bayes we just try to maximise the probability of each individual word: S* ≈ argmax_S Π_i P(w_i)

- Equivalently, we can find the argmax over all possible segmentations of the string s into a first word f and the rest of the words r: S*(s) = argmax over splits s = f + r of P(f) · P(S*(r)). A sketch follows below.
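A sketch of this recursion under the naïve Bayes assumption (the unigram probabilities are made up for the example):

from functools import lru_cache

# toy unigram probabilities; unseen words get a tiny penalty
P = {"now": 0.1, "is": 0.1, "the": 0.2, "time": 0.05, "no": 0.02}

def pword(w):
    return P.get(w, 1e-12 / len(w))

@lru_cache(maxsize=None)
def segment(s):
    """Best segmentation: argmax over first word f and rest r of P(f)*P(segment(r))."""
    if not s:
        return 1.0, []
    best_p, best_words = 0.0, [s]
    for i in range(1, len(s) + 1):
        f, r = s[:i], s[i:]
        p_rest, words_rest = segment(r)
        p = pword(f) * p_rest
        if p > best_p:
            best_p, best_words = p, [f] + words_rest
    return best_p, best_words

print(segment("nowisthetime")[1])   # ['now', 'is', 'the', 'time']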

Spelling correction

Given a misspelled word w, find the correct word c.

Probabilistic model of spelling correction: find the best correction c* = argmax_c P(c|w)

Apply Bayes rule (ignoring the denominator as it is the same for all c): c* = argmax_c P(w|c)·P(c)

- P(c) comes from data counts
- P(w|c) comes from spelling-correction data

Example: "pulse" misspelled as "pluse"

- There is usually not enough data for P(pluse|pulse)
- So instead we work at the character level: define the misspelling type "ul" → "lu"

Unit 22 Natural Language Processing II

Tree model

Needs a grammar, e.g.

Grammar

S → NP VP
NP → N | D N | N N | N N N
VP → V | V NP | V NP NP
N → interest | Fed | rates | raises
V → interest | rates | raises
D → the | a

Where:

- S = sentence
- N = noun
- V = verb
- NP = noun phrase
- VP = verb phrase
- D = determiner (e.g. "a" or "the")


This type of grammar is called a Context-Free Grammar (CFG).

Problems with grammars

- Easy to omit good parses
- Easy to include bad parses by accident
- Not a problem: trees are unobservable

Solutions

- Add a probability to each tree
- Add word associations, like a Markov assumption
- Not a possible solution: making the grammar unambiguous

Probabilistic Context-Free Grammar (PCFG)

Add probabilities to the CFG grammar we know so far.

Example

Lexicons

How to define the probabilities: people are trained and paid to parse real-life texts.

Ambiguity

- I saw (a man with a telescope)
- I saw (a man) with a telescope

Lexicalised PCFG (LPCFG)

Normal PCFG

- The probability is given with regard to the category of the left-hand side
- Example: P(VP → V NP NP | lhs = VP) = 0.2; lhs = left-hand side

Lexicalised PCFG

- The probability is given for a specific word
- Example: P(VP → V NP NP | V = gave) = 0.25; "gave" is the word

How to build the grammar tree? Use search.