Machine Learning for Computer Vision: Part 2
DESCRIPTION
From the BMVA Summer School 2014. Watch the lecture here: https://www.youtube.com/watch?v=G6hf6YbPA_s
TRANSCRIPT
The bits the whirlwind tour left out ...
BMVA Summer School 2014 – extra background slides
Machine Learning
Definition:
– “A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.”
[Mitchell, 1997]
Algorithm to construct decision trees ….
Building Decision Trees – ID3

node = root of tree
Main loop:
– A = “best” decision attribute for next node
– .....

But which attribute is best to split on?
Entropy in machine learning

Entropy: a measure of impurity
– S is a sample of training examples
– p⊕ is the proportion of positive examples in S
– p⊖ is the proportion of negative examples in S

Entropy measures the impurity of S:

$\mathrm{Entropy}(S) = -p_{\oplus} \log_2 p_{\oplus} - p_{\ominus} \log_2 p_{\ominus}$
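As a quick illustration (not from the slides), a minimal Python sketch of this two-class entropy; the function name and the choice of log base 2 are mine:

```python
import math

def entropy_two_class(n_pos, n_neg):
    """Entropy of a sample S with n_pos positive and n_neg negative examples."""
    total = n_pos + n_neg
    result = 0.0
    for count in (n_pos, n_neg):
        p = count / total
        if p > 0:                     # convention: 0 * log(0) = 0
            result -= p * math.log2(p)
    return result

print(entropy_two_class(5, 5))    # 1.0 -> maximally impure (50/50 split)
print(entropy_two_class(10, 0))   # 0.0 -> pure sample
```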
Information Gain – reduction in Entropy

Gain(S, A) = expected reduction in entropy due to splitting on attribute A
– i.e. expected reduction in impurity in the data
– (improvement in consistent data sorting)
Information Gain – reduction in Entropy
– reduction in entropy in the set of examples S if split on attribute A
– S_v = subset of S for which attribute A has value v
– Gain(S, A) = original entropy minus the weighted sum of the entropies of the sub-nodes created by splitting on A:

$\mathrm{Gain}(S, A) = \mathrm{Entropy}(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\, \mathrm{Entropy}(S_v)$
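A sketch of this gain computation in Python (illustrative only; it assumes examples are (attribute-dictionary, label) pairs, a representation chosen here for convenience):

```python
import math
from collections import defaultdict

def entropy(labels):
    """Multi-class entropy of a list of class labels."""
    counts = defaultdict(int)
    for y in labels:
        counts[y] += 1
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(examples, attribute):
    """Gain(S, A): entropy of S minus the weighted entropy of the subsets S_v."""
    subsets = defaultdict(list)
    for x, y in examples:
        subsets[x[attribute]].append(y)        # group labels by the value v of attribute A
    total = len(examples)
    remainder = sum(len(s) / total * entropy(s) for s in subsets.values())
    return entropy([y for _, y in examples]) - remainder
```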
Information Gain – reduction in Entropy

Information Gain:
– “information provided about the target function given the value of some attribute A”
– How well does A sort the data into the required classes?

Generalise to c classes (not just ⊕ or ⊖):

$\mathrm{Entropy}(S) = -\sum_{i=1}^{c} p_i \log_2 p_i$
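A quick worked example (my own numbers, not from the slides): a sample with 9 positive and 5 negative examples has $\mathrm{Entropy}(S) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} \approx 0.940$; a pure sample has entropy 0, and an even 7/7 split has entropy 1.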
Building Decision Trees – Selecting the Next Attribute
– which attribute should we split on next?
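One plausible way to put the pieces together, reusing the information_gain sketch above: score every remaining attribute and split on the highest-gain one, recursing until a node is pure. The helper names and the nested-dictionary tree representation are my own, not the lecture's:

```python
def best_attribute(examples, attributes):
    """Pick the attribute with the highest information gain on these examples."""
    return max(attributes, key=lambda a: information_gain(examples, a))

def id3(examples, attributes):
    """Minimal recursive ID3 sketch: returns a class label or a nested decision dict."""
    labels = [y for _, y in examples]
    if len(set(labels)) == 1:                   # pure node: stop
        return labels[0]
    if not attributes:                          # nothing left to split on: majority vote
        return max(set(labels), key=labels.count)
    a = best_attribute(examples, attributes)
    tree = {a: {}}
    for v in set(x[a] for x, _ in examples):    # one branch per value of the chosen attribute
        subset = [(x, y) for x, y in examples if x[a] == v]
        tree[a][v] = id3(subset, [b for b in attributes if b != a])
    return tree
```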
Backpropagation Algorithm ….
Backpropagation Algorithm

Assume we have:
– input examples d = {1 ... D}
  • each is a pair {x_d, t_d} = {input vector, target vector}
– node index n = {1 … N}
– weight w_ji on the connection from node i to node j
– input x_ji is the input on the connection from node i to node j
  • corresponding weight = w_ji
– output error for node n is δ_n
  • similar to (o – t)

[Figure: network with an input layer (input x), a hidden layer and an output layer (output vector O_k); nodes indexed {1 … N}]
Backpropagation Algorithm

(1) Input example d

(2) Output layer error based on:
– the difference between output and target (t – o)
– the derivative of the sigmoid function

(3) Hidden layer error
– proportional to the node's contribution to the output error

(4) Update weights w_ji
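A minimal numpy sketch of one such update for a single-hidden-layer sigmoid network (an illustration of the four steps above, not the lecture's code; the names W_hidden, W_output and eta, and the omission of bias terms, are my own simplifications):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_update(x, t, W_hidden, W_output, eta=0.1):
    """One stochastic backpropagation update.

    x: input vector, t: target vector,
    W_hidden: (n_hidden, n_inputs) weights, W_output: (n_outputs, n_hidden) weights.
    """
    # (1) Forward pass for example (x, t)
    h = sigmoid(W_hidden @ x)                   # hidden layer activations
    o = sigmoid(W_output @ h)                   # network outputs

    # (2) Output layer error: (t - o) scaled by the sigmoid derivative o(1 - o)
    delta_k = (t - o) * o * (1.0 - o)

    # (3) Hidden layer error: each unit's share of the output error (via the weights
    #     it feeds), scaled by its own sigmoid derivative h(1 - h)
    delta_h = (W_output.T @ delta_k) * h * (1.0 - h)

    # (4) Update the weights: w_ji <- w_ji + eta * delta_j * x_ji
    W_output += eta * np.outer(delta_k, h)
    W_hidden += eta * np.outer(delta_h, x)
    return W_hidden, W_output
```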
Backpropagation

Termination criteria:
– number of iterations reached
– or error below a suitable bound

Each iteration: compute the output layer error, then the hidden layer error, and update all weights using the relevant error.
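A short training-loop sketch around the backprop_update function above, showing both termination criteria (again illustrative; max_iters and err_bound are names I have chosen):

```python
def train(examples, W_hidden, W_output, eta=0.1, max_iters=1000, err_bound=1e-3):
    """Repeat stochastic updates until the iteration limit or the error bound is hit."""
    for _ in range(max_iters):                          # criterion 1: iteration count
        total_err = 0.0
        for x, t in examples:
            W_hidden, W_output = backprop_update(x, t, W_hidden, W_output, eta)
            o = sigmoid(W_output @ sigmoid(W_hidden @ x))
            total_err += 0.5 * np.sum((t - o) ** 2)     # squared error on this example
        if total_err < err_bound:                       # criterion 2: error below bound
            break
    return W_hidden, W_output
```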
Backpropagation

Output layer error for output unit k (difference between output and target, scaled by the sigmoid derivative):

$\delta_k = o_k (1 - o_k)(t_k - o_k)$

[Network diagram: input x, hidden layer unit h, output layer unit k, output vector O_k]
Backpropagation

δ_h is expressed as a weighted sum of the output layer errors δ_k to which it contributes (i.e. w_kh > 0):

$\delta_h = o_h (1 - o_h) \sum_{k \in \mathrm{outputs}} w_{kh}\, \delta_k$
Backpropagation

Error is propagated backwards from the network output ... to the weights of the output layer ... to the weights of the hidden layer …
Hence the name: backpropagation
Backpropagation

Repeat these stages for every hidden layer in a multi-layer network (using error δ_i where x_ji > 0)
.......
Backpropagation

Error is propagated backwards from the network output ... to the weights of the output layer ... over the weights of all N hidden layers …
Hence the name: backpropagation
.......
Backpropagation

Will perform gradient descent over the weight space of {w_ji} for all connections i → j in the network

Stochastic gradient descent
– as updates are based on training one sample at a time
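For reference, the per-example weight update this corresponds to (the standard rule from Mitchell, 1997; $\eta$ is the learning rate, which the slides do not name explicitly):

$\Delta w_{ji} = \eta\, \delta_j\, x_{ji}$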
Understanding (and believing) the SVM stuff ….
Remedial Note: equations of 2D lines

Line:  $\vec{w} \cdot \vec{x} + b = 0$
where $\vec{w}$ and $\vec{x}$ are 2D vectors.
– $b$ : offset from the origin
– $\vec{w}$ : normal to the line

2D LINES REMINDER
Remedial Note: equations of 2D lines
http://www.mathopenref.com/coordpointdisttrig.html
2D LINES REMINDER
Remedial Note: equations of 2D lines

For a defined line equation (fixed $\vec{w}$ and $b$), insert a point $\vec{x}$ into the equation:

$f(\vec{x}) = \vec{w} \cdot \vec{x} + b$

– Result is +ve if the point is on one side of the line (i.e. > 0).
– Result is -ve if the point is on the other side of the line (i.e. < 0).
– Result is the distance (+ve or -ve) of the point from the line, given by $\frac{\vec{w} \cdot \vec{x} + b}{\|\vec{w}\|}$ (equal to $f(\vec{x})$ when $\|\vec{w}\| = 1$).

($\vec{w}$ is the normal to the line.)

2D LINES REMINDER
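A small Python sketch of this point-to-line test (my own function and variable names; not from the slides):

```python
import numpy as np

def signed_distance(w, b, x):
    """Signed distance of point x from the line w . x + b = 0.

    Positive on the side the normal w points towards, negative on the other.
    """
    w = np.asarray(w, dtype=float)
    x = np.asarray(x, dtype=float)
    return (np.dot(w, x) + b) / np.linalg.norm(w)

# Example: the line x + y - 1 = 0, i.e. w = [1, 1], b = -1
print(signed_distance([1, 1], -1, [1, 1]))   # ~ +0.707 : on the +ve side
print(signed_distance([1, 1], -1, [0, 0]))   # ~ -0.707 : on the -ve side
```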
Linear Separator

Instances (i.e. examples) {x_i, y_i}
– x_i = point in instance space (R^n) made up of n attributes
– y_i = class value for the classification of x_i

Want a linear separator. Can view this as a constraint satisfaction problem:

$\vec{w} \cdot \vec{x}_i + b \geq +1$ if $y_i = +1$
$\vec{w} \cdot \vec{x}_i + b \leq -1$ if $y_i = -1$

Equivalently,  $y_i (\vec{w} \cdot \vec{x}_i + b) \geq 1$

Classification of example: function f(x) = y = {+1, -1}, i.e. 2 classes
N.B. we have a vector of weight coefficients $\vec{w}$

[Figure: 2D points of the two classes (y = +1, y = -1) separated by a linear boundary]
Linear Separator

If we define the distance of the nearest point to the separating boundary as 1
→ width of margin is $\frac{2}{\|\vec{w}\|}$ (i.e. equal width on each side)

We thus want to maximize $\frac{2}{\|\vec{w}\|}$, finding the parameters $\vec{w}$ and $b$,
which is equivalent to minimizing $\frac{1}{2}\|\vec{w}\|^2$ subject to the constraints $y_i(\vec{w} \cdot \vec{x}_i + b) \geq 1$.
…............. back to main slides
So ….
Find the “hyperplane” (i.e. boundary) with:
a) maximum margin
b) minimum number of (training) examples on the wrong side of the chosen boundary
(i.e. minimal penalties due to C)
Solve via optimization (in polynomial time/complexity)
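As a concrete but purely illustrative sketch of this in practice, using scikit-learn's SVC rather than anything shown in the lecture (the toy data values are invented):

```python
import numpy as np
from sklearn.svm import SVC

# Toy 2-class data in R^2 (hypothetical values, purely for illustration)
X = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.5],         # class +1
              [-1.0, -1.0], [-1.5, -2.0], [-2.0, -1.0]])  # class -1
y = np.array([+1, +1, +1, -1, -1, -1])

# C controls the penalty for examples on the wrong side of the margin:
# large C -> few violations tolerated, small C -> wider, softer margin.
clf = SVC(kernel='linear', C=1.0)
clf.fit(X, y)

print(clf.coef_, clf.intercept_)     # the learned w and b of the hyperplane
print(clf.predict([[0.5, 0.5]]))     # classify a new point -> +1 or -1
```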
Example:
– Non-linear separation (red / blue data items on a 2D plane)
– Kernel projection to a higher dimensional space
– Find the hyperplane separator (a plane in 3D) via optimization
– Non-linear boundary in the original dimension (e.g. a circle in 2D) defined by the planar boundary (cut) in 3D
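A matching illustrative sketch of the kernel idea, again with scikit-learn (the dataset and parameter values are my own choices): two concentric rings cannot be separated by a line in 2D, but an RBF kernel lets the SVM find a boundary that looks like a circle in the original space.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric rings: not linearly separable in the original 2D space
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# The RBF kernel implicitly projects the data into a higher-dimensional space
# where a separating hyperplane exists
clf = SVC(kernel='rbf', gamma=1.0, C=1.0)
clf.fit(X, y)
print(clf.score(X, y))   # training accuracy; close to 1.0 on this toy data
```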
.... but it is all about the data!
Desirable Data Properties

Machine learning is a data-driven approach. The data is important! Ideally, training/testing data used for learning must be:
– Unbiased
  • towards any given subset of the space of examples ...
– Representative
  • of the “real-world” data to be encountered in use/deployment
– Accurate
  • inaccuracies in training/testing data produce inaccurate results
– Available
  • the more training/testing data available, the better the results
  • greater confidence in the results can be achieved
Data Training Methodologies

Simple approach: Data Splits
– split the overall data set into separate training and test sets
  • No established rule, but 80%:20%, 70%:30% or ⅔:⅓ training-to-testing splits are common
– Train on one, test on the other
– Test error = error on the test set
– Training error = error on the training set
– Weakness: susceptible to bias in the data sets or “over-fitting”
  • Also less data available for training
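A minimal sketch of such a split with scikit-learn (illustrative; the generated data and the SVC learner are stand-ins for whatever model is being evaluated):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Stand-in data for a real training set
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Hold out 20% of the data for testing (an 80%:20% training-to-testing split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

clf = SVC(kernel='linear', C=1.0).fit(X_train, y_train)
print('training error:', 1.0 - clf.score(X_train, y_train))
print('test error:    ', 1.0 - clf.score(X_test, y_test))
```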
Data Training Methodologies

More advanced (and robust): k-fold Cross-Validation
– Randomly split (all) the data into k subsets
– For i = 1 to k:
  • train using all the data not in the i-th subset
  • test the resulting learned [classifier | function …] using the i-th subset
– Report the mean error over all k tests
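And a k-fold sketch of the same idea (again illustrative; KFold comes from scikit-learn and the SVC learner is just an example):

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.svm import SVC

def k_fold_error(X, y, k=5):
    """Mean test error over k folds: train on k-1 subsets, test on the held-out one."""
    errors = []
    for train_idx, test_idx in KFold(n_splits=k, shuffle=True, random_state=0).split(X):
        clf = SVC(kernel='linear', C=1.0)
        clf.fit(X[train_idx], y[train_idx])
        errors.append(1.0 - clf.score(X[test_idx], y[test_idx]))   # error = 1 - accuracy
    return float(np.mean(errors))
```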
Key Summary Statistics #1
tp = true positive / tn = true negative
fp = false positive / fn = false negative
Often quoted or plotted when comparing ML techniques
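For reference, the usual quantities derived from these four counts (standard definitions, not taken from the slide) are sketched below:

```python
def summary_stats(tp, tn, fp, fn):
    """Common summary statistics from the four outcome counts (assumes non-zero denominators)."""
    accuracy  = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)        # fraction of positive predictions that are correct
    recall    = tp / (tp + fn)        # true positive rate / sensitivity
    fpr       = fp / (fp + tn)        # false positive rate (x-axis of a ROC curve)
    f1        = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, fpr, f1
```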
Kappa Statistic

Measure of classification of “N items into C mutually exclusive categories”

Pr(a) = probability of success of classification (= accuracy)
Pr(e) = probability of success due to chance
– e.g. 2 categories = 50% (0.5), 3 categories = 33% (0.33) ….. etc.
– Pr(e) can be replaced with Pr(b) to measure agreement between classifiers/techniques a and b

[Cohen, 1960]
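Cohen's kappa is conventionally defined (the standard formula, not copied from the slide) as

$\kappa = \frac{\Pr(a) - \Pr(e)}{1 - \Pr(e)}$

so, for example, a classifier with 80% accuracy over 2 equally likely categories gives $\kappa = (0.8 - 0.5)/(1 - 0.5) = 0.6$.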