
Page 1: Lecture 03

Statistical Methods in Artificial Intelligence, CSE471 - Monsoon 2015: Lecture 03

Avinash Sharma, CVIT, IIIT Hyderabad

Page 2: Lecture 03

Lecture 03: Plan

• Linear Algebra Recap
• Linear Discriminant Functions (LDFs)
• The Perceptron
• Generalized LDFs
• The Two-Category Linearly Separable Case
• Learning LDF: Basic Gradient Descent
• Perceptron Criterion Function

Page 3: Lecture 03

Basic Linear Algebra Operations

• Vector
• Vector Operations
  – Scaling
  – Transpose
  – Addition
  – Subtraction
  – Dot Product

• Equation of a Plane

Page 4: Lecture 03

Vector Operations

A vector in 2D: $\boldsymbol{v} = [x_1 \; y_1]^T$; in 3D: $\boldsymbol{v} = [x_1 \; y_1 \; z_1]^T$.

Scaling (only the magnitude changes): $a\boldsymbol{v} = [a x_1 \; a y_1 \; a z_1]^T$, with
$\|\boldsymbol{v}\| = \sqrt{x_1^2 + y_1^2 + z_1^2}$ and $\|a\boldsymbol{v}\| = |a|\sqrt{x_1^2 + y_1^2 + z_1^2}$.

Transpose: the column vector $\boldsymbol{v} = [x_1 \; y_1 \; z_1]^T$ is the transpose of the row vector $[x_1 \; y_1 \; z_1]$.

Page 5: Lecture 03

Vector Operations

[Figure: vector addition and subtraction in the $x$-$y$ plane, showing $\boldsymbol{u}$, $\boldsymbol{v}$, $\boldsymbol{u}+\boldsymbol{v}$, $-\boldsymbol{v}$, and $\boldsymbol{u}-\boldsymbol{v}$.]

Page 6: Lecture 03

Vector Operations

• Dot Product (Inner Product) of two vectors is a scalar.

For $\boldsymbol{u} = [u_1 \; u_2]^T$ and $\boldsymbol{v} = [v_1 \; v_2]^T$:

$\boldsymbol{u} \cdot \boldsymbol{v} = \boldsymbol{u}^T\boldsymbol{v} = u_1 v_1 + u_2 v_2 = \|\boldsymbol{u}\|\|\boldsymbol{v}\|\cos\alpha$

so $\cos\alpha = \dfrac{\boldsymbol{u}^T\boldsymbol{v}}{\|\boldsymbol{u}\|\|\boldsymbol{v}\|}$, where $\alpha$ is the angle between $\boldsymbol{u}$ and $\boldsymbol{v}$.

• Dot product of two perpendicular vectors is 0.
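As a quick check of these identities, here is a minimal NumPy sketch (the vectors are illustrative, not from the slides) computing the dot product both ways and recovering the angle:

```python
import numpy as np

u = np.array([3.0, 0.0])
v = np.array([2.0, 2.0])

dot = float(u @ v)                                   # u1*v1 + u2*v2
cos_alpha = dot / (np.linalg.norm(u) * np.linalg.norm(v))
alpha_deg = np.degrees(np.arccos(cos_alpha))

print(dot, cos_alpha, round(alpha_deg, 1))           # 6.0 0.707... 45.0

# Perpendicular vectors have zero dot product.
print(np.array([1.0, 0.0]) @ np.array([0.0, 5.0]))   # 0.0
```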

Page 7: Lecture 03

Equation of a Plane

A plane is fixed by a point $P_0$ on it (position vector $\boldsymbol{r_0}$ from the origin $O$) and a normal vector $\boldsymbol{n}$. Any point $P$ (position vector $\boldsymbol{r}$) lies on the plane exactly when $\boldsymbol{r} - \boldsymbol{r_0}$ is perpendicular ($90°$) to $\boldsymbol{n}$:

$(\boldsymbol{r} - \boldsymbol{r_0}) \cdot \boldsymbol{n} = 0$
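A small NumPy sketch of this test (the normal, the point, and the tolerance below are illustrative): a point lies on the plane exactly when $(\boldsymbol{r}-\boldsymbol{r_0})\cdot\boldsymbol{n}$ vanishes.

```python
import numpy as np

n  = np.array([0.0, 0.0, 1.0])   # normal vector
r0 = np.array([1.0, 2.0, 5.0])   # a known point on the plane (here z = 5)

def on_plane(r, tol=1e-9):
    # (r - r0) . n = 0 up to numerical tolerance
    return abs((r - r0) @ n) < tol

print(on_plane(np.array([7.0, -3.0, 5.0])))  # True: same z as r0
print(on_plane(np.array([7.0, -3.0, 6.0])))  # False: off the plane
```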

Page 8: Lecture 03

Linear Discriminant Functions

• Assumes a 2-class classification setup

• Decision boundary is represented explicitly in terms of the components of the feature vector $\boldsymbol{X}$.

• Aim is to seek parameters of a linear discriminant function which minimize the training error.

• Why linear?
  – Simplest possible form
  – Can be generalized (see Generalized LDFs)

Page 9: Lecture 03

Linear Discriminant Functions

$g(\boldsymbol{X}) = \boldsymbol{W}^T\boldsymbol{X}$, or with a bias term, $g(\boldsymbol{X}) = \boldsymbol{W}^T\boldsymbol{X} + w_0$

$g(\boldsymbol{X}) > 0$ ($+ve$): class A;  $g(\boldsymbol{X}) < 0$ ($-ve$): class B;  $g(\boldsymbol{X}) = 0$: decision boundary.

[Figure: the decision boundary $g(\boldsymbol{X}) = 0$ with normal vector $\boldsymbol{W}$ separating class A from class B.]
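A minimal sketch of how such a discriminant is used, assuming some already-chosen weight vector $\boldsymbol{W}$ and bias $w_0$ (the values below are illustrative, not learned):

```python
import numpy as np

W  = np.array([1.0, -2.0])   # illustrative weight vector
w0 = 0.5                     # illustrative bias term

def g(X):
    # linear discriminant g(X) = W^T X + w0
    return W @ X + w0

def classify(X):
    # g(X) > 0 -> class A, g(X) < 0 -> class B, g(X) = 0 is the boundary
    return "class A" if g(X) > 0 else "class B"

print(classify(np.array([2.0, 0.0])))   # g = 2.5  -> class A
print(classify(np.array([0.0, 2.0])))   # g = -3.5 -> class B
```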

Page 10: Lecture 03

The Perceptron

$g(\boldsymbol{X}) = \boldsymbol{W}^T\boldsymbol{X} + w_0$, with $\boldsymbol{X} = [x_1 \; \cdots \; x_d]^T$ and $\boldsymbol{W} = [w_1 \; \cdots \; w_d]^T$.

[Figure: perceptron unit — inputs $x_0 = 1, x_1, \ldots, x_d$ are weighted by $w_0, w_1, \ldots, w_d$ and summed ($\sum$) to give the output $g(\boldsymbol{X})$.]

Page 11: Lecture 03

Perceptron Decision Boundary

Page 12: Lecture 03

Perceptron Summary

• The decision boundary surface (a hyperplane) divides the feature space into two regions.

• The orientation of the boundary surface is decided by the normal vector $\boldsymbol{W}$.

• The location of the boundary surface is determined by the bias term $w_0$.

• $g(\boldsymbol{X})$ is proportional to the distance of $\boldsymbol{X}$ from the boundary surface (see the sketch below).

• $g(\boldsymbol{X}) > 0$ on the positive side and $g(\boldsymbol{X}) < 0$ on the negative side.
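A short sketch of the distance property, again with illustrative $\boldsymbol{W}$ and $w_0$: the signed distance of $\boldsymbol{X}$ from the hyperplane $\boldsymbol{W}^T\boldsymbol{X} + w_0 = 0$ is $g(\boldsymbol{X})/\|\boldsymbol{W}\|$, so $g(\boldsymbol{X})$ is proportional to that distance.

```python
import numpy as np

W  = np.array([3.0, 4.0])    # normal vector: sets the orientation
w0 = -5.0                    # bias: shifts the plane away from the origin

def signed_distance(X):
    # g(X) / ||W|| gives the signed distance from the boundary surface
    return (W @ X + w0) / np.linalg.norm(W)

print(signed_distance(np.array([3.0, 4.0])))  # +4.0 (positive side)
print(signed_distance(np.array([0.0, 0.0])))  # -1.0 (negative side)
```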

Page 13: Lecture 03

Generalized LDFs

• Linear: $g(\boldsymbol{X}) = w_0 + \boldsymbol{W}^T\boldsymbol{X} = w_0 + \sum_{i=1}^{d} w_i x_i$, with $\boldsymbol{X} = [x_1 \; \cdots \; x_d]^T$ and $\boldsymbol{W} = [w_1 \; \cdots \; w_d]^T$.

• Non-linear (quadratic), as in the sketch below: $g(\boldsymbol{X}) = w_0 + \sum_{i=1}^{d} w_i x_i + \sum_{i=1}^{d}\sum_{j=1}^{d} w_{ij}\, x_i x_j$
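A minimal sketch of evaluating the quadratic form above (the weight values are illustrative):

```python
import numpy as np

w0  = 1.0
w   = np.array([0.5, -1.0])          # linear weights w_i
Wij = np.array([[2.0, 0.0],          # quadratic weights w_ij
                [0.0, 3.0]])

def g_quadratic(X):
    # g(X) = w0 + sum_i w_i x_i + sum_i sum_j w_ij x_i x_j
    return w0 + w @ X + X @ Wij @ X

X = np.array([1.0, 2.0])
print(g_quadratic(X))   # 1.0 + (0.5 - 2.0) + (2*1 + 3*4) = 13.5
```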

Page 14: Lecture 03

Generalized LDFs

• Linear: $g(\boldsymbol{X}) = w_0 + \boldsymbol{W}^T\boldsymbol{X} = \sum_{i=0}^{d} w_i x_i = \boldsymbol{a}^T\boldsymbol{Y}$, using the augmented vectors $\boldsymbol{Y} = [x_0 \; x_1 \; \cdots \; x_d]^T = [x_0 \;\; \boldsymbol{X}]^T$ with $x_0 = 1$, and $\boldsymbol{a} = [w_0 \; w_1 \; \cdots \; w_d]^T = [w_0 \;\; \boldsymbol{W}]^T$.

• Non-linear: map the data with $\boldsymbol{Y} = \varphi(\boldsymbol{X})$, so that $g(\boldsymbol{X}) = \boldsymbol{a}^T\boldsymbol{Y} = \sum_{i=1}^{\hat{d}} a_i y_i$ with $\boldsymbol{a} = [a_1 \; \cdots \; a_{\hat{d}}]^T$ (see the sketch below).
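A small sketch of both rewrites; the particular augmentation and the quadratic $\varphi$ below are illustrative choices, not the only ones:

```python
import numpy as np

def augment(X):
    # Y = [x0, x1, ..., xd] with x0 = 1, so g(X) = a^T Y absorbs the bias w0
    return np.concatenate(([1.0], X))

def phi_quadratic(X):
    # one possible phi: prepend 1, keep X, and append all products x_i * x_j
    return np.concatenate(([1.0], X, np.outer(X, X).ravel()))

W, w0 = np.array([2.0, -1.0]), 0.5
a = np.concatenate(([w0], W))          # a = [w0, W]

X = np.array([1.0, 3.0])
print(W @ X + w0)                      # -0.5
print(a @ augment(X))                  # -0.5, same value via g(X) = a^T Y
print(phi_quadratic(X))                # mapped point Y = phi(X), here 7-dimensional
```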

Page 15: Lecture 03

Generalized LDFs

Page 16: Lecture 03

Generalized LDFs Summary

• $\varphi$ can be any arbitrary mapping function that projects the original $d$-dimensional data points $\boldsymbol{X}$ to $\hat{d}$-dimensional points $\boldsymbol{Y} = \varphi(\boldsymbol{X})$.

• The hyperplane decision surface $\boldsymbol{a}^T\boldsymbol{Y} = 0$ passes through the origin of the mapped space.

• Advantage: in the mapped higher-dimensional space the data might be linearly separable.

• Disadvantage: The mapping is computationally intensive and learning the classification parameters can be non-trivial.

Page 17: Lecture 03

Two-Category Linearly Separable Case

$g(\boldsymbol{X}) = \boldsymbol{a}^T\boldsymbol{Y} = \sum_{i=1}^{\hat{d}} a_i y_i$

$g(\boldsymbol{X}) > 0$ ($+ve$): class A;  $g(\boldsymbol{X}) < 0$ ($-ve$): class B;  $g(\boldsymbol{X}) = 0$: decision boundary.

[Figure: weight vector $\boldsymbol{a}$ and the separating hyperplane in the $(y_1, y_2)$ plane.]

Page 18: Lecture 03

Two-Category Linearly Separable Case

Normalized case: if every sample from class B is replaced by its negative, then a separating weight vector $\boldsymbol{a}$ must satisfy, for every sample $\boldsymbol{Y}$,

$g(\boldsymbol{X}) = \boldsymbol{a}^T\boldsymbol{Y} = \sum_{i=1}^{\hat{d}} a_i y_i > 0$

[Figure: weight vector $\boldsymbol{a}$ and the normalized samples in the $(y_1, y_2)$ plane.]
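A toy sketch of this normalization (the data and the candidate $\boldsymbol{a}$ are illustrative): after negating the class-B samples, checking that $\boldsymbol{a}$ separates the two classes reduces to checking that every $\boldsymbol{a}^T\boldsymbol{Y}$ is positive.

```python
import numpy as np

Y_A = np.array([[1.0,  2.0,  1.0],    # augmented samples from class A
                [1.0,  1.5,  0.5]])
Y_B = np.array([[1.0, -1.0, -2.0],    # augmented samples from class B
                [1.0, -0.5, -1.5]])

Y_norm = np.vstack([Y_A, -Y_B])       # "normalized" sample set: class-B samples negated
a = np.array([0.0, 1.0, 1.0])         # a candidate weight vector

print(Y_norm @ a)                     # all entries > 0 => a separates the data
print(np.all(Y_norm @ a > 0))         # True
```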

Page 19: Lecture 03

Two-Category Linearly Separable Case

Data vector

Page 20: Lecture 03

Two-Category Linearly Separable Case

Page 21: Lecture 03

Learning LDF: Basic Gradient Descent

• Define a scalar criterion function $J(\boldsymbol{a})$ which captures the classification error for a specific boundary plane described by the parameter vector $\boldsymbol{a}$.

• Minimize $J(\boldsymbol{a})$ using gradient descent (see the sketch below):
  – Start with an arbitrary value $\boldsymbol{a}(1)$ for $\boldsymbol{a}$.
  – Iteratively refine the estimate of $\boldsymbol{a}$: $\boldsymbol{a}(k+1) = \boldsymbol{a}(k) - \eta(k)\,\nabla J(\boldsymbol{a}(k))$

• $\eta(k)$ is the positive scale factor, also known as the learning rate.
  – A too-small $\eta$ makes the convergence very slow.
  – A too-large $\eta$ can diverge due to overshooting of the corrections.
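A minimal sketch of this procedure using one standard criterion, the perceptron criterion $J(\boldsymbol{a}) = \sum (-\boldsymbol{a}^T\boldsymbol{Y})$ over the misclassified samples (listed in the lecture plan), whose gradient is $-\sum \boldsymbol{Y}$ over those samples; the data and the learning rate below are illustrative.

```python
import numpy as np

# normalized samples (class-B samples already negated), one per row
Y = np.array([[1.0,  2.0,  1.0],
              [1.0,  1.0,  2.0],
              [1.0, -1.0,  1.5],
              [1.0,  0.5, -0.5]])

a   = np.zeros(3)                     # arbitrary starting value for a
eta = 0.1                             # learning rate (positive scale factor)

for k in range(100):
    mis = Y[Y @ a <= 0]               # currently misclassified samples
    if len(mis) == 0:
        break                         # J(a) = 0: every sample is on the positive side
    # a(k+1) = a(k) - eta * grad J(a(k)) = a(k) + eta * sum of misclassified Y
    a = a + eta * mis.sum(axis=0)

print(k, a, Y @ a)                    # all margins positive once a separates the data
```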

Page 22: Lecture 03