Center for Big Data Analytics and Discovery Informatics
Artificial Intelligence Research Laboratory
Fall 2019 Vasant G Honavar

Machine Learning
Vasant Honavar
Artificial Intelligence Research Laboratory
Informatics Graduate Program
Computer Science and Engineering Graduate Program
Bioinformatics and Genomics Graduate Program
Neuroscience Graduate Program
Data Sciences Undergraduate Program
Center for Big Data Analytics and Discovery Informatics
Huck Institutes of the Life Sciences
Institute for Cyberscience
Clinical and Translational Sciences Institute
Northeast Big Data Hub
Pennsylvania State University
[email protected]
http://faculty.ist.psu.edu/vhonavar
http://ailab.ist.psu.edu
Now, on to some real content …
Classification
• How would you write a program to distinguish a picture of you from a picture of someone else?
– Provide example pictures of you and pictures of other people, and let a classifier learn to distinguish the two.
• How would you write a program to determine whether a sentence is grammatical or not?
– Provide examples of grammatical and ungrammatical sentences, and let a classifier learn to distinguish the two.
• How would you write a program to distinguish cancerous cells from normal cells?
– Provide examples of cancerous and normal cells, and let a classifier learn to distinguish the two.
Example: To play or not to play tennis
Example dataset
Three key elements:
• Class label ("label", denoted by y)
• Features ("attributes")
• Feature values ("attribute values", denoted by x)
Feature values can be binary, nominal, or continuous.
A labeled dataset is a collection of (x, y) pairs.
Example: To play or not to play tennis?
Example dataset
Task: Predict the class of this "test" sample. This requires us to generalize from the training data.
What is a good representation for images?
Example (face recognition)
Pixel values? Edges?
Example (chair detection)
Ingredients for classification
Idea: Incorporate your knowledge of the problem into a learning system.
Sources of knowledge:
1. Feature representation
• Important for the success of machine learning
• Can be "problem specific"
• Designing a good representation will take you halfway
2. Training data: labeled examples
• High-quality labeled data can be hard to find
• Often have to make do with the available data
3. Model
• No single learning algorithm outperforms all others on every task ("no free lunch")
• Different learning algorithms come with different inductive biases
Nearest Neighbor Classifiers
• Basic idea: If it walks like a duck and quacks like a duck, then it's probably a duck.
(Figure: compute the distance from the test sample to the training samples, then choose k of the "nearest" labeled samples.)
Nearest-Neighbor Classifiers
• Requires three things:
– The set of stored training samples and their labels
– A distance metric to compute the distance between samples
– The value of K, the number of nearest neighbors to retrieve
• To classify a query sample:
– Compute its distance to the training samples
– Identify the K nearest neighbors
– Use the class labels of the K nearest neighbors to determine the class label of the query sample (e.g., by taking a majority vote)
Definition of Nearest Neighbor
(Figure: panels (a) 1-nearest neighbor, (b) 2-nearest neighbor, (c) 3-nearest neighbor)
The k-nearest neighbors of a sample x are the data points that have the k smallest distances to x.
K nearest neighbor classifier
Data samples are assumed to lie in an N-dimensional space, e.g., the Euclidean space.
An instance $X_p$ is described by a feature vector
$$X_p = [x_{1p} \; \cdots \; x_{Np}]$$
where $x_{ip}$ denotes the value of the $i$th feature of $X_p$.
The Euclidean distance between two points $X_p$ and $X_r$ in the Euclidean space is defined by
$$d(X_p, X_r) = \sqrt{\sum_{i=1}^{N} (x_{ip} - x_{ir})^2}$$
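The Euclidean distance above can be sketched in a few lines of Python (an illustrative sketch, not part of the original slides; the function name is my own):

```python
import math

def euclidean_distance(xp, xr):
    """d(Xp, Xr) = sqrt(sum_i (x_ip - x_ir)^2) for equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(xp, xr)))

# A 3-4-5 right triangle: the distance between (0, 0) and (3, 4) is 5.
print(euclidean_distance([0, 0], [3, 4]))  # → 5.0
```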
Standardization
Standardization can be important when the variables are not all measured on the same scale.
• 0-1 scaling: x → (x − min)/(max − min)
e.g., for the values 4, 3, 1, 2: 3 → (3 − min)/(max − min) = (3 − 1)/(4 − 1) = 2/3
• Z-score scaling: subtract out the mean, divide by the std. deviation
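Both scalings can be sketched directly (illustrative code, not from the slides; helper names are my own, and the z-score version uses the population standard deviation):

```python
def min_max_scale(values):
    """0-1 scaling: x -> (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def z_score_scale(values):
    """Z-score scaling: subtract the mean, divide by the std. deviation."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / std for v in values]

# The slide's example: among the values 4, 3, 1, 2 the value 3 scales to 2/3.
print(min_max_scale([4, 3, 1, 2]))
```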
K nearest neighbor Classifier
Learning phase:
For each training example $(X_i, f(X_i))$, store the example in memory.
Classification phase:
Given a query instance $X_q$, identify the k nearest neighbors $X_1 \ldots X_k$ of $X_q$.
Assign $X_q$ the label of the majority class:
$$g(X_q) = \arg\max_{\omega} \sum_{i=1}^{k} \delta(\omega, f(X_i))$$
where $\delta(a, b) = 1$ if $a = b$, and $\delta(a, b) = 0$ otherwise.
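The majority-vote rule above can be sketched as follows (an illustrative sketch, not the slides' code; the dataset and helper names are my own):

```python
import math
from collections import Counter

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def knn_classify(training, query, k):
    """training: list of (feature_vector, label) pairs.
    Returns the majority label among the k nearest neighbors of query."""
    nearest = sorted(training, key=lambda ex: euclidean(ex[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

train = [([1, 1], "yes"), ([1, 2], "yes"), ([8, 8], "no"), ([9, 8], "no")]
print(knn_classify(train, [2, 1], 3))  # → yes
```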
Distance weighted K nearest neighbor Classifier
Learning phase:
For each training example $(X_i, f(X_i))$, store the example in memory.
Classification phase:
Given a query instance $X_q$, identify the k nearest neighbors of $X_q$: $KNN(X_q) = \{X_1 \ldots X_k\}$,
and obtain a weighted vote, with each nearest neighbor getting a vote in favor of its class label that is weighted by its distance to the query:
$$w_i = \frac{1}{d(X_i, X_q)^2}$$
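A sketch of the distance-weighted vote (illustrative code, not from the slides; the dataset is my own, and I treat an exact match as winning outright to avoid division by zero, a choice the slides do not specify):

```python
import math
from collections import defaultdict

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def weighted_knn_classify(training, query, k):
    """Each of the k nearest neighbors votes for its label with weight
    w_i = 1 / d(X_i, X_q)^2; an exact match wins outright (assumption)."""
    nearest = sorted(training, key=lambda ex: euclidean(ex[0], query))[:k]
    votes = defaultdict(float)
    for x, label in nearest:
        d = euclidean(x, query)
        if d == 0:
            return label  # query coincides with a training sample
        votes[label] += 1.0 / d ** 2
    return max(votes, key=votes.get)

train = [([1, 1], "yes"), ([1, 2], "yes"), ([8, 8], "no"), ([9, 8], "no")]
print(weighted_knn_classify(train, [2, 1], 3))  # → yes
```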
Distance Measures
• Distance depends on:
– the data representation
– the distance measure chosen

An Employee DB:
ID  Gender  Age  Salary
1   F       27   19,000
2   M       51   64,000
3   M       52   100,000
4   F       33   55,000
5   M       45   45,000

Word Frequencies for Documents:
      w1  w2  w3  w4  w5  w6
Doc1  0   4   0   0   0   2
Doc2  3   1   4   3   1   2
Doc3  3   0   0   0   3   0
Doc4  0   1   0   3   0   0
Doc5  2   2   2   3   1   4

The representation has to be chosen with some care, and the distance measure should be chosen to work with the representation.
Distance measures
• d(p, q) between two points p and q is a proper distance measure if it satisfies:
1. Positive definiteness: d(p, q) ≥ 0 for all p and q, and d(p, q) = 0 only if p = q.
2. Symmetry: d(p, q) = d(q, p) for all p and q.
3. Triangle inequality: d(p, r) ≤ d(p, q) + d(q, r) for all points p, q, and r.
Cosine Distance
• If d1 and d2 are two document vectors, then
1 − cos(d1, d2) = 1 − (d1 • d2) / (||d1|| ||d2||),
where • indicates the vector dot product and ||d|| is the length of vector d.
• Example:
d1 = 3 2 0 5 0 0 0 2 0 0
d2 = 1 0 0 0 0 0 0 1 0 2
d1 • d2 = 3×1 + 2×0 + 0×0 + 5×0 + 0×0 + 0×0 + 0×0 + 2×1 + 0×0 + 0×2 = 5
||d1|| = (3×3 + 2×2 + 0×0 + 5×5 + 0×0 + 0×0 + 0×0 + 2×2 + 0×0 + 0×0)^0.5 = (42)^0.5 ≈ 6.481
||d2|| = (1×1 + 1×1 + 2×2)^0.5 = (6)^0.5 ≈ 2.449
cos(d1, d2) = 5 / (6.481 × 2.449) ≈ 0.3150
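The worked example can be reproduced with a short sketch (illustrative code, not from the slides; function names are my own):

```python
import math

def cosine_similarity(a, b):
    """cos(a, b) = (a . b) / (||a|| ||b||)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def cosine_distance(a, b):
    return 1 - cosine_similarity(a, b)

d1 = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]
d2 = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]
print(round(cosine_similarity(d1, d2), 4))  # → 0.315
```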
Distance Measures
Distances in vector spaces:
• Euclidean distance: $\sqrt{\sum_{i=1}^{d}(p_i - q_i)^2}$
• Minkowski distance, a generalization of the Euclidean distance: $\left(\sum_{i=1}^{d}|p_i - q_i|^n\right)^{1/n}$
Special cases of the Minkowski distance:
• n = 1: Manhattan distance
• n = 2: Euclidean distance
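The Minkowski family above can be sketched in one function (illustrative code, not from the slides; the function name is my own):

```python
def minkowski_distance(p, q, n):
    """(sum_i |p_i - q_i|^n)^(1/n); n=1 is Manhattan, n=2 is Euclidean."""
    return sum(abs(a - b) ** n for a, b in zip(p, q)) ** (1.0 / n)

print(minkowski_distance([0, 0], [3, 4], 1))  # → 7.0 (Manhattan)
print(minkowski_distance([0, 0], [3, 4], 2))  # → 5.0 (Euclidean)
```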
Distance measures for nominal attributes
• Nominal attributes can take 2 or more values, e.g., red, yellow, blue, green (a generalization of a binary attribute)
• Simple matching: the distance between two objects is simply the number of mismatched attributes divided by the total number of attributes
• One-hot encoding: encode each M-valued nominal attribute as an M-bit vector
Red: 1 0 0 0; Yellow: 0 1 0 0; Blue: 0 0 1 0 …
• Then use distance measures designed for vectors …
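Both ideas can be sketched briefly (illustrative code, not from the slides; helper names and the example attributes are my own):

```python
def simple_matching_distance(a, b):
    """Fraction of mismatched nominal attributes between two objects."""
    mismatches = sum(1 for x, y in zip(a, b) if x != y)
    return mismatches / len(a)

def one_hot(value, vocabulary):
    """Encode an M-valued nominal attribute as an M-bit vector."""
    return [1 if v == value else 0 for v in vocabulary]

colors = ["red", "yellow", "blue", "green"]
print(one_hot("red", colors))  # → [1, 0, 0, 0]
print(simple_matching_distance(["red", "small"], ["red", "large"]))  # → 0.5
```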
Decision Boundary induced by the 1 nearest neighbor classifier
Nearest Neighbor Classification…
• Choosing the value of k:
– If k is too small, the classifier is sensitive to noise points
– If k is too large, the neighborhood may include points from other classes
Nearest neighbor classifiers
• Nearest neighbor classifiers are conceptually simple
• Learn by simply memorizing training examples
• The computational effort of learning is low
• The storage requirements of learning are high
– need to memorize the examples in the training set
• The cost of classifying new instances can be high
– Use efficient data structures and algorithms for nearest neighbor search, e.g., k-d trees, locality-sensitive hashing
• A distance measure needs to be defined over the input space
• Performance degrades when there are many irrelevant attributes:
– Perform feature selection or dimensionality reduction
Regression or Function approximation
• Regression is like classification except the labels are real-valued
• Example applications: predicting
– Stock value
– Income
– Power consumption
K nearest neighbor Function Approximator
Learning phase:
For each training example $(X_i, f(X_i))$, store the example in memory.
Approximation phase:
Given a query instance $X_q$, identify the k nearest neighbors $X_1 \ldots X_k$ of $X_q$, and output their mean target value:
$$g(X_q) \leftarrow \frac{\sum_{i=1}^{K} f(X_i)}{K}$$
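The averaging rule above can be sketched directly (illustrative code, not from the slides; the dataset and helper names are my own):

```python
import math

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def knn_regress(training, query, k):
    """training: list of (feature_vector, target_value) pairs.
    Returns the mean target of the k nearest neighbors: g(Xq) = (1/K) sum f(Xi)."""
    nearest = sorted(training, key=lambda ex: euclidean(ex[0], query))[:k]
    return sum(y for _, y in nearest) / k

train = [([1], 1.0), ([2], 2.0), ([3], 3.0), ([10], 10.0)]
print(knn_regress(train, [2], 3))  # → 2.0
```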
Regression
• For classification the output is nominal
• In regression the output is continuous
– Function approximation
• Many models could be used
– The simplest is linear regression
– Fit the data with the best hyper-plane which "goes through" the points
(Figure: y, the dependent variable (output), plotted against x, the independent variable (input))
Simple Linear Regression
• For now, assume just one independent (input) variable x and one dependent (output) variable y
– Multiple linear regression assumes an input vector x
– Multivariate linear regression assumes an output vector y
• We will "fit" the points with a linear hyper-plane (a line in the simplest case)
• Which line should we use?
– Choose an objective function
– For simple linear regression we choose the sum squared error (SSE): $\sum_i (d_i - y_i)^2 = \sum_i e_i^2$
– Thus, find the line which minimizes the sum of the squared residuals (i.e., least squares)
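The least-squares line has a closed-form solution, which can be sketched as follows (illustrative code, not from the slides; function and variable names are my own):

```python
def fit_simple_linear_regression(xs, ys):
    """Least-squares fit of y = w0 + w1*x, minimizing SSE = sum (y_i - (w0 + w1*x_i))^2."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of x and y divided by variance of x.
    w1 = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
         sum((x - mean_x) ** 2 for x in xs)
    w0 = mean_y - w1 * mean_x
    return w0, w1

# Points lying exactly on y = 2x + 1 are recovered exactly.
w0, w1 = fit_simple_linear_regression([0, 1, 2, 3], [1, 3, 5, 7])
print(w0, w1)  # → 1.0 2.0
```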
Digression – Minimizing / Maximizing Functions
Consider $f(x)$, a function of a scalar variable $x$ with domain $D$.
• $f$ is convex over some sub-domain $D' \subseteq D$ if, for all $x_1, x_2 \in D'$, the chord joining the points $(x_1, f(x_1))$ and $(x_2, f(x_2))$ lies above the graph of $f$.
• $f$ has a local minimum at $x = X$ if there is a neighborhood $U$ around $X$ such that $\forall x \in U$, $f(x) \geq f(X)$.
• We say $\lim_{x \to a} f(x) = A$ if, for any $\epsilon > 0$, there exists $\delta > 0$ such that $|x - a| < \delta \Rightarrow |f(x) - A| < \epsilon$.
Minimizing/Maximizing Functions
• We say that $f$ is continuous at $x = a$ if $\lim_{x \to a^+} f(x) = \lim_{x \to a^-} f(x) = f(a)$.
• The derivative of the function $f$ is defined as
$$\frac{df}{dx} = \lim_{\Delta x \to 0} \frac{f(x + \Delta x) - f(x)}{\Delta x}$$
• If $X$ is a local maximum or a local minimum of $f$, then $\left.\frac{df}{dx}\right|_{x = X} = 0$.
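The limit definition of the derivative suggests a simple numerical check, sketched below (illustrative code, not from the slides; I use a central difference rather than the one-sided quotient, a common refinement):

```python
def numerical_derivative(f, x, dx=1e-6):
    """Central-difference approximation of df/dx at x: (f(x+dx) - f(x-dx)) / (2*dx)."""
    return (f(x + dx) - f(x - dx)) / (2 * dx)

f = lambda x: x ** 2 + 3 * x   # analytically, df/dx = 2x + 3
print(round(numerical_derivative(f, 2.0), 6))  # ≈ 7.0
```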
Minimizing/Maximizing Functions
$$\frac{d(u+v)}{dx} = \frac{du}{dx} + \frac{dv}{dx}$$
$$\frac{d(uv)}{dx} = u\frac{dv}{dx} + v\frac{du}{dx}$$
$$\frac{d}{dx}\left(\frac{u}{v}\right) = \frac{v\frac{du}{dx} - u\frac{dv}{dx}}{v^2}$$
Examples
$f(x) = x^2 + 3x$
$$\frac{d(u+v)}{dx} = \frac{du}{dx} + \frac{dv}{dx}$$
$$\frac{df}{dx} = 2x + 3$$
Examples
$f(x) = x(x + 3)$
$$\frac{d(uv)}{dx} = u\frac{dv}{dx} + v\frac{du}{dx}$$
$$\frac{df}{dx} = x \cdot 1 + (x + 3) \cdot 1 = 2x + 3$$
Examples
$f(x) = \dfrac{x(x+3)}{x^2}$
$$\frac{d}{dx}\left(\frac{u}{v}\right) = \frac{v\frac{du}{dx} - u\frac{dv}{dx}}{v^2}$$
$$\frac{df}{dx} = \frac{x^2(2x+3) - x(x+3) \cdot 2x}{x^4} = -\frac{3}{x^2}$$
Partial derivatives and chain rule
Let $f = f(x_1, x_2, \ldots, x_n)$.
$\dfrac{\partial f}{\partial x_i}$ is obtained by treating all $x_j$, $j \neq i$, as constant.
Chain rule: Let $z = \varphi(u_1, u_2, \ldots, u_m)$ and $u_i = f_i(x_1, \ldots, x_k)$ for $i = 1 \ldots m$. Then
$$\frac{\partial z}{\partial x_k} = \sum_{i=1}^{m} \left(\frac{\partial z}{\partial u_i}\right)\left(\frac{\partial u_i}{\partial x_k}\right)$$
Example
• $z = f(u, v) = u^2 + 2v$
• $u = f_1(x, y) = 2x + y$
• $v = f_2(x, y) = x^2 + y$
• $\dfrac{\partial z}{\partial x} = \dfrac{\partial z}{\partial u}\dfrac{\partial u}{\partial x} + \dfrac{\partial z}{\partial v}\dfrac{\partial v}{\partial x} = 2u \cdot 2 + 2 \cdot 2x = 4(2x + y) + 4x = 12x + 4y$
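The chain-rule result can be checked numerically by composing z directly in terms of x and y (illustrative code, not from the slides; function names are my own):

```python
def numerical_partial_x(g, x, y, dx=1e-6):
    """Central-difference approximation of dz/dx at (x, y), holding y constant."""
    return (g(x + dx, y) - g(x - dx, y)) / (2 * dx)

# z = u^2 + 2v with u = 2x + y and v = x^2 + y, composed directly:
z = lambda x, y: (2 * x + y) ** 2 + 2 * (x ** 2 + y)

# The chain rule gives dz/dx = 12x + 4y; at (1, 1) that is 16.
print(round(numerical_partial_x(z, 1.0, 1.0), 4))  # ≈ 16.0
```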
Taylor Series Approximation of Functions
If $f(x)$ is differentiable, i.e., its derivatives
$$\frac{df}{dx}, \quad \frac{d^2 f}{dx^2} = \frac{d}{dx}\left(\frac{df}{dx}\right), \; \ldots, \; \frac{d^n f}{dx^n}$$
exist at $X_0$ and are continuous in the neighborhood of $X_0$, then
$$f(x) = f(X_0) + \left.\frac{df}{dx}\right|_{X_0}(x - X_0) + \ldots + \frac{1}{n!}\left.\frac{d^n f}{dx^n}\right|_{X_0}(x - X_0)^n + \ldots$$
First-order approximation:
$$f(x) \approx f(X_0) + \left.\frac{df}{dx}\right|_{X_0}(x - X_0)$$
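The first-order approximation can be sketched as a function factory (illustrative code, not from the slides; the sine example and names are my own):

```python
import math

def first_order_taylor(f, df, x0):
    """Return the linearization g(x) = f(x0) + f'(x0) * (x - x0)."""
    fx0, dfx0 = f(x0), df(x0)
    return lambda x: fx0 + dfx0 * (x - x0)

# Linearize sin(x) around x0 = 0: since cos(0) = 1, sin(x) ≈ x near 0.
approx = first_order_taylor(math.sin, math.cos, 0.0)
print(approx(0.1), math.sin(0.1))
```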
Taylor Series Approximation of Multivariate Functions
Let $f = f(x_1, \ldots, x_n)$ be differentiable and continuous at $\mathbf{X}_0 = (x_{1,0}, x_{2,0}, \ldots, x_{n,0})$. Then
$$f(\mathbf{X}) \approx f(\mathbf{X}_0) + \sum_{i=1}^{n} \left.\frac{\partial f}{\partial x_i}\right|_{\mathbf{X}_0} (x_i - x_{i,0})$$
Minimizing / Maximizing Multivariate Functions
To find $\mathbf{X}^*$ that minimizes $f(\mathbf{X})$, we change our current guess $\mathbf{X}_C$ in the direction of the negative of the gradient of $f$ evaluated at $\mathbf{X}_C$:
$$\mathbf{X}_C \leftarrow \mathbf{X}_C - \eta \, \nabla f(\mathbf{X}_C), \quad \text{where } \nabla f = \left(\frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \ldots, \frac{\partial f}{\partial x_n}\right)$$
for small $\eta$ (ideally infinitesimally small) (why?).
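The update rule above can be sketched as a short loop (illustrative code, not from the slides; the quadratic objective, step size, and iteration count are my own choices):

```python
def gradient_descent(grad, x0, eta=0.1, steps=100):
    """Iterate X_C <- X_C - eta * grad f(X_C) for a fixed number of steps."""
    x = list(x0)
    for _ in range(steps):
        g = grad(x)
        x = [xi - eta * gi for xi, gi in zip(x, g)]
    return x

# Minimize f(x, y) = x^2 + y^2; the gradient is (2x, 2y), minimum at (0, 0).
grad_f = lambda x: [2 * x[0], 2 * x[1]]
print(gradient_descent(grad_f, [3.0, -4.0]))
```

With eta = 0.1 each coordinate shrinks by a factor of 0.8 per step, so the iterate converges geometrically toward the minimum; too large an eta would make the updates overshoot and diverge, which is why the slide asks for small eta.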