Statistical Methods in Artificial Intelligence, CSE471 - Monsoon 2015: Lecture 03
Avinash Sharma, CVIT, IIIT Hyderabad
Lecture 03: Plan
• Linear Algebra Recap
• Linear Discriminant Functions (LDFs)
• The Perceptron
• Generalized LDFs
• The Two-Category Linearly Separable Case
• Learning LDF: Basic Gradient Descent
• Perceptron Criterion Function
Basic Linear Algebra Operations
• Vector
• Vector Operations
– Scaling
– Transpose
– Addition
– Subtraction
– Dot Product
• Equation of a Plane
Vector Operations

Figure: a vector $\mathbf{v}$ in 3D space with components $(x_1, y_1, z_1)$ along the $x$, $y$, $z$ axes.

$\mathbf{v} = [x_1 \;\; y_1 \;\; z_1]^T$

Scaling: only the magnitude changes.
$a\mathbf{v} = [a x_1 \;\; a y_1 \;\; a z_1]^T$
$\|\mathbf{v}\| = \sqrt{x_1^2 + y_1^2 + z_1^2}, \qquad \|a\mathbf{v}\| = a\sqrt{x_1^2 + y_1^2 + z_1^2}$

Transpose: $\mathbf{v}^T = [x_1 \;\; y_1 \;\; z_1]$ (the column vector written as a row).
Vector Operations
Figure: vectors $\mathbf{u}$ and $\mathbf{v}$ in the $x$-$y$ plane, illustrating the sum $\mathbf{u}+\mathbf{v}$, the difference $\mathbf{u}-\mathbf{v}$, and the negated vector $-\mathbf{v}$.
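A minimal NumPy sketch of these operations; the component values are made up for illustration:

```python
import numpy as np

# Example vectors (illustrative values only)
v = np.array([1.0, 2.0, 2.0])    # v = [x1, y1, z1]
u = np.array([3.0, -1.0, 0.5])
a = 2.0                          # positive scale factor

scaled = a * v                   # scaling: direction unchanged, magnitude scaled by a
print(np.linalg.norm(v), np.linalg.norm(scaled))   # ||v|| = 3.0, ||a v|| = 6.0

print(u + v)                     # vector addition
print(u - v)                     # vector subtraction, i.e. u + (-v)
```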
Vector Operations
• Dot Product (Inner Product) of two vectors is a scalar.
$\mathbf{u} \cdot \mathbf{v} = x_1 y_1 + x_2 y_2 = \mathbf{u}^T \mathbf{v} = \|\mathbf{u}\|\,\|\mathbf{v}\| \cos\alpha$

Figure: vectors $\mathbf{u}$ and $\mathbf{v}$ in the $x$-$y$ plane separated by the angle $\alpha$.

$\cos\alpha = \dfrac{\mathbf{u}^T \mathbf{v}}{\|\mathbf{u}\|\,\|\mathbf{v}\|}$
• The dot product of two perpendicular vectors is 0.
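A short NumPy illustration of the dot product and the angle formula; the example vectors are arbitrary:

```python
import numpy as np

u = np.array([2.0, 0.0])
v = np.array([1.0, 1.0])

dot = u @ v                                   # u . v = u^T v
cos_alpha = dot / (np.linalg.norm(u) * np.linalg.norm(v))
alpha = np.degrees(np.arccos(cos_alpha))      # angle between u and v (45 degrees here)
print(dot, alpha)

w = np.array([0.0, 3.0])
print(u @ w)                                  # 0.0: u and w are perpendicular
```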
Equation of a Plane
Figure: a plane containing a fixed point $P_0$ (position vector $\mathbf{r}_0$ from the origin $O$) and an arbitrary point $P$ (position vector $\mathbf{r}$); the normal vector $\mathbf{n}$ is at $90°$ to the in-plane vector $\mathbf{r} - \mathbf{r}_0$.

$(\mathbf{r} - \mathbf{r}_0) \cdot \mathbf{n} = 0$
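A quick numerical check of the plane equation; the point and normal below are invented for the example:

```python
import numpy as np

n = np.array([0.0, 0.0, 1.0])    # plane normal (here: the x-y plane shifted up by 2)
r0 = np.array([0.0, 0.0, 2.0])   # a known point P0 on the plane

def on_plane(r, tol=1e-9):
    """True if point r satisfies (r - r0) . n = 0."""
    return abs((r - r0) @ n) < tol

print(on_plane(np.array([5.0, -3.0, 2.0])))   # True: lies on the plane
print(on_plane(np.array([0.0, 0.0, 3.0])))    # False: off the plane
```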
Linear Discriminant Functions
• Assumes a 2-class classification setup
• The decision boundary is represented explicitly in terms of the components of the feature vector $\mathbf{X}$.
• The aim is to seek the parameters of a linear discriminant function which minimize the training error.
• Why linear?
– Simplest possible model
– Can be generalized (Generalized LDFs)
Linear Discriminant Functions
$g(\mathbf{X}) = \mathbf{W}^T \mathbf{X}$ (hyperplane through the origin)
$g(\mathbf{X}) = \mathbf{W}^T \mathbf{X} + w_0$ (general hyperplane)

$g(\mathbf{X}) \;\begin{cases} > 0 & (+ve)\ \text{class A} \\ < 0 & (-ve)\ \text{class B} \\ = 0 & \text{decision boundary} \end{cases}$

Figure: the decision boundary $g(\mathbf{X}) = 0$ separating Class A from Class B, with normal vector $\mathbf{W}$.
The Perceptron

$g(\mathbf{X}) = \mathbf{W}^T \mathbf{X} + w_0$

Figure: perceptron diagram with inputs $x_0 = 1, x_1, x_2, \ldots, x_d$, corresponding weights $w_0, w_1, w_2, \ldots, w_d$, and a summation node $\sum$ producing $g(\mathbf{X})$.

$\mathbf{X} = [x_1 \;\; \cdots \;\; x_d]^T, \qquad \mathbf{W} = [w_1 \;\; \cdots \;\; w_d]^T$
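A minimal sketch of the perceptron computation shown above, with made-up weights and input:

```python
import numpy as np

def perceptron_g(W, w0, X):
    """Linear discriminant g(X) = W^T X + w0."""
    return W @ X + w0

def classify(W, w0, X):
    """Class A if g(X) > 0, class B if g(X) < 0, on the boundary if g(X) == 0."""
    g = perceptron_g(W, w0, X)
    return "A" if g > 0 else ("B" if g < 0 else "boundary")

W = np.array([1.0, -2.0])    # illustrative weight vector
w0 = 0.5                     # illustrative bias
X = np.array([3.0, 1.0])
print(perceptron_g(W, w0, X), classify(W, w0, X))   # 1.5 A
```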
Perceptron Decision Boundary
• Decision boundary surface (hyperplane) divides feature space into two regions.
• The orientation of the boundary surface is decided by the normal vector $\mathbf{W}$.
• The location of the boundary surface is determined by the bias term $w_0$.
• $g(\mathbf{X})$ is proportional to the signed distance of $\mathbf{X}$ from the boundary surface.
• $\mathbf{W}$ points towards the positive side ($g(\mathbf{X}) > 0$); the opposite side is the negative side ($g(\mathbf{X}) < 0$).
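A small sketch of the distance property, with illustrative values: the signed distance of $\mathbf{X}$ from the hyperplane is $g(\mathbf{X}) / \|\mathbf{W}\|$.

```python
import numpy as np

W = np.array([3.0, 4.0])     # normal vector (illustrative), ||W|| = 5
w0 = -5.0                    # bias term

def signed_distance(X):
    """Signed distance of X from the hyperplane W^T X + w0 = 0."""
    return (W @ X + w0) / np.linalg.norm(W)

print(signed_distance(np.array([3.0, 4.0])))   #  4.0 -> positive side
print(signed_distance(np.array([0.0, 0.0])))   # -1.0 -> negative side
```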
Perceptron Summary
Generalized LDFs

• Linear:
$g(\mathbf{X}) = w_0 + \mathbf{W}^T \mathbf{X} = w_0 + \sum_{i=1}^{d} w_i x_i$, with $\mathbf{X} = [x_1 \;\; \cdots \;\; x_d]^T$ and $\mathbf{W} = [w_1 \;\; \cdots \;\; w_d]^T$

• Non-linear (quadratic):
$g(\mathbf{X}) = w_0 + \sum_{i=1}^{d} w_i x_i + \sum_{i=1}^{d} \sum_{j=1}^{d} w_{ij} x_i x_j$
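A quick sketch of evaluating the quadratic discriminant; the weights below are arbitrary placeholders:

```python
import numpy as np

w0 = 0.1                                  # bias
w = np.array([0.5, -1.0])                 # linear weights w_i
Wq = np.array([[0.2, 0.0],                # quadratic weights w_ij
               [0.0, 0.3]])

def g_quadratic(X):
    """g(X) = w0 + sum_i w_i x_i + sum_i sum_j w_ij x_i x_j."""
    return w0 + w @ X + X @ Wq @ X

print(g_quadratic(np.array([1.0, 2.0])))  # 0.1 + (0.5 - 2.0) + (0.2 + 1.2) = 0.0
```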
Generalized LDFs
• Linear, in augmented notation: with $x_0 = 1$,
$\mathbf{Y} = [x_0 \;\; x_1 \;\; \cdots \;\; x_d]^T = \begin{bmatrix} x_0 \\ \mathbf{X} \end{bmatrix}, \qquad \mathbf{a} = [w_0 \;\; w_1 \;\; \cdots \;\; w_d]^T = \begin{bmatrix} w_0 \\ \mathbf{W} \end{bmatrix}$
$g(\mathbf{X}) = w_0 + \mathbf{W}^T \mathbf{X} = \mathbf{a}^T \mathbf{Y} = \sum_{i=0}^{d} w_i x_i$

• Non-linear: with a mapping $\mathbf{Y} = \varphi(\mathbf{X})$,
$g(\mathbf{X}) = \mathbf{a}^T \mathbf{Y} = \sum_{i=1}^{\hat{d}} a_i y_i, \qquad \mathbf{a} = [a_1 \;\; \cdots \;\; a_{\hat{d}}]^T$
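A minimal sketch of the augmented vector and a quadratic mapping $\varphi$; the particular monomial ordering is just one possible choice:

```python
import numpy as np

def augment(X):
    """Y = [1, x_1, ..., x_d]: absorbs the bias w0 into the weight vector a."""
    return np.concatenate(([1.0], X))

def phi_quadratic(X):
    """One possible quadratic mapping: [1, x_1, ..., x_d, x_i*x_j for i <= j]."""
    d = len(X)
    cross = [X[i] * X[j] for i in range(d) for j in range(i, d)]
    return np.concatenate(([1.0], X, cross))

X = np.array([2.0, 3.0])
a_lin = np.array([0.5, 1.0, -1.0])        # a = [w0, w1, w2] (illustrative)
print(a_lin @ augment(X))                 # same as w0 + W^T X = 0.5 + 2 - 3 = -0.5
print(phi_quadratic(X))                   # [1, 2, 3, 4, 6, 9]
```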
Generalized LDFs Summary
• $\varphi(\cdot)$ can be any arbitrary mapping function that projects the original $d$-dimensional data points $\mathbf{X}$ to $\hat{d}$-dimensional points $\mathbf{Y}$, where typically $\hat{d} > d$.
• The hyperplane decision surface $\mathbf{a}^T \mathbf{Y} = 0$ passes through the origin of the mapped space.
• Advantage: in the mapped higher-dimensional space the data might be linearly separable.
• Disadvantage: The mapping is computationally intensive and learning the classification parameters can be non-trivial.
Two-Category Linearly Separable Case
$g(\mathbf{X}) = \mathbf{a}^T \mathbf{Y} = \sum_{i=1}^{\hat{d}} a_i y_i \;\begin{cases} > 0 & (+ve)\ \text{class A} \\ < 0 & (-ve)\ \text{class B} \\ = 0 & \text{decision boundary} \end{cases}$

Figure: the separating hyperplane in the $(y_1, y_2)$ space with its normal vector $\mathbf{a}$.
Two-Category Linearly Separable Case
Normalized case: samples from class B are replaced by their negatives, so a correct solution vector $\mathbf{a}$ satisfies, for every sample,
$g(\mathbf{X}) = \mathbf{a}^T \mathbf{Y} = \sum_{i=1}^{\hat{d}} a_i y_i > 0$

Figure: the normalized samples in the $(y_1, y_2)$ space with a solution vector $\mathbf{a}$.
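A small sketch of the normalization trick on toy samples, assuming a 2D augmented feature space and an illustrative candidate solution:

```python
import numpy as np

# Toy augmented samples Y (first component x0 = 1) and their class labels
Y = np.array([[1.0,  2.0], [1.0,  1.5],      # class A
              [1.0, -1.0], [1.0, -2.5]])     # class B
labels = np.array([+1, +1, -1, -1])

Y_norm = Y * labels[:, None]        # negate class-B samples ("normalization")

a = np.array([0.5, 1.0])            # a candidate solution vector (illustrative)
print(Y_norm @ a)                   # all entries > 0 => a separates the two classes
```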
Two-Category Linearly Separable Case

Figure: data vectors in the feature space.
Learning LDF: Basic Gradient Descent
• Define a scalar criterion function $J(\mathbf{a})$ which captures the classification error for a specific boundary plane described by the parameter vector $\mathbf{a}$.
• Minimize $J(\mathbf{a})$ using gradient descent (a sketch of this loop follows below):
– Start with an arbitrary value $\mathbf{a}(1)$ for $\mathbf{a}$.
– Iteratively refine the estimate of $\mathbf{a}$: $\mathbf{a}(k+1) = \mathbf{a}(k) - \eta(k)\,\nabla J(\mathbf{a}(k))$
• $\eta(k)$ is the positive scale factor, also known as the learning rate.
– Too small an $\eta$ makes the convergence very slow.
– Too large an $\eta$ can cause divergence due to overshooting of the corrections.
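A minimal sketch of the basic gradient-descent loop. As the example criterion it uses the perceptron criterion $J_p(\mathbf{a}) = \sum_{\mathbf{y}\ \text{misclassified}} (-\mathbf{a}^T \mathbf{y})$ over normalized samples (named in the lecture plan but defined later); the data, starting value, and learning rate are illustrative:

```python
import numpy as np

def grad_Jp(a, Y_norm):
    """Gradient of the perceptron criterion J_p(a): minus the sum of the
    normalized samples that are currently misclassified (a^T y <= 0)."""
    mis = Y_norm[Y_norm @ a <= 0]
    return -mis.sum(axis=0)

def gradient_descent(Y_norm, eta=0.1, max_iter=1000):
    a = np.zeros(Y_norm.shape[1])          # arbitrary starting value a(1)
    for _ in range(max_iter):
        if np.all(Y_norm @ a > 0):         # every sample on the positive side: done
            break
        a = a - eta * grad_Jp(a, Y_norm)   # a(k+1) = a(k) - eta(k) * grad J(a(k))
    return a

# Toy normalized samples (class-B samples already negated), x0 = 1 in column 0
Y_norm = np.array([[1.0, 2.0], [1.0, 1.5], [-1.0, 1.0], [-1.0, 2.5]])
a = gradient_descent(Y_norm)
print(a, Y_norm @ a)                       # all products positive once converged
```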