adapted by doug downey from bryan pardo fall 2007 machine learning eecs 349 machine learning lecture...

Adapted by Doug Downey from Bryan Pardo Fall 2007 Machine Learning EECS 349

Machine Learning

Lecture 4: Greedy Local Search (Hill Climbing)


Local search algorithms

• We’ve discussed ways to select a hypothesis h that performs well on training examples, e.g.– Candidate-Elimination– Decision Trees

• Another technique that is quite general:– Start with some (perhaps random) hypothesis

h– Incrementally improve h

• Known as local search


Example: n-queens

• Put n queens on an n × n board with no two queens on the same row, column, or diagonal


Hill-climbing search

• "Like climbing Everest in thick fog with amnesia“

h = initialState

loop:h’ = highest valued Successor(h)

if Value(h) >= Value(h’)

return h

else

h = h’


Hill-climbing search

• Problem: depending on initial state, can get stuck in local maxima


Underfitting

• Overfitting: Performance on test examples is much lower than on training examples

• Underfitting: Performance on training examples is low Two leading causes:– Hypothesis space is too small/simple– Training algorithm (i.e., hypothesis

search algorithm) stuck in local maxima


Hill-climbing search: 8-queens problem

• v = number of pairs of queens that are attacking each other, either directly or indirectly

v =17


Hill-climbing search: 8-queens problem

• A local minimum with v = 1


Simulated annealing search

• Idea: escape local maxima by allowing some "bad" moves but gradually decrease their frequencyh = initialStateT = initialTemperatureloop:h’ = random Successor(h)if (V = Value(h’)-Value(h)) > 0h = h’else

h = h’ with probability eV/T

decrease T; if T==0, return h


Properties of simulated annealing

• One can prove: If T decreases slowly enough, then simulated annealing search will find a global optimum with probability approaching 1

• Widely used in VLSI layout, airline scheduling, etc


Local beam search

• Keep track of k states rather than just one

• Start with k randomly generated states

• At each iteration, all the successors of all k states are generated

• If any one is a goal state, stop; else select the k best successors from the complete list and repeat.


Gradient Descent

• Hill Climbing and Simulated Annealing are “generate and test” algorithms– Successor function generates candidates,

Value function helps select

• In some cases, we can do much better:– Define: Error(training data D, hypothesis h)

– If h is represented by parameters w1,…wn

and dError/dwi is known, we can compute the error gradient, and descend in the direction that is (locally) steepest


About distance….

• Clustering requires distance measures.

• Local methods require a measure of “locality”

• Search engines require a measure of similarity

• So….when are two things close to each other?


Euclidean Distance

• What people intuitively think of as “distance”

Dimension 1: x

Dim

ensi

on

2:

y

22 )()(),( yyxx babaBAd


Generalized Euclidean Distance

),bi(a

bbbB

aaaA

baBAd

ii

n

n

n

iii

and

},...,,{

},,...,,{ where

||),(

21

21

2/1

1

2


Weighting Dimensions

• Apparent clusters at one scaling of X are not so apparent at another scaling


Weighted Euclidean Distance

• You can, of course compensate by weighting your dimensions….

2/1

1

2||),(

n

iiii bawBAd


More Generalization: Minkowsky metric

1,0 and 1 :Distance Hamming

2 :DistanceEuclidean

1 :DistanceManhattan

||),(/1

1

ii

pn

i

piii

,ba p

p

p

bawBAd

• My three favorites are special cases of this:


What is a “metric”?

• A metric has these four qualities.

• …otherwise, call it a “measure”

( , ) 0 iff (reflexivity)

( , ) 0 (non-negative)

( , ) ( , ) (symmetry)

( , ) ( , ) ( , ) (triangle inequality)

d A B x y

d A B

d A B d B A

d A B d B C d A C


Metric, or not?

• Driving distance with 1-way streets

• Categorical Stuff : – Is distance Jazz -> Blues -> Rock no

less than distance Jazz -> Rock?


What about categorical variables?

• Consider feature vectors for genre & vocals– Genre: {Blues, Jazz, Rock, Zydeco}– Vocals: {vocals,no vocals}

s1 = {rock, vocals}s2 = {jazz, no vocals}s3 = { rock, no vocals}• Which two songs are more similar?


Binary Features + Hamming distance

s1 = {rock, yes}s2 = {jazz, no}s3 = { rock, no

vocals}

0 0 1 0 1

0 1 0 0 1

0 0 1 0 0

Blues Jazz ZydecoRock Vocals

Hamming Distance = number of bits different between binary vectors


Hamming Distance

})1,0{,i( and

},...,,{

},,...,,{ where

),(

21

21

1

ii

n

n

n

iii

ba

bbbB

aaaA

baBAd


Other approaches…

• Define your own distance: (a,b)

Beethoven

Beatles Liz Phair

Beethoven 7 0 0

Beatles 4 5 0

Liz Phair ? 1 2

Quote Frequency


Missing data

• What if, for some category, on some examples, there is no value given?

• Approaches:– Discard all examples missing the category– Fill in the blanks with the mean value– Only use a category in the distance measure

if both examples give a value


Dealing with missing data

n

iiin

ii

iii

bawn

nBAd

baw

1

1

),(),(

else 1,

defined are and both if ,0

adapted by doug downey from bryan pardo fall 2007 machine learning eecs 349 machine learning lecture...

Documents