machine learning revision notes
8/19/2019 Machine Learning Revision Notes
Machine Learning Top Algorithms Revision Material
March 1, 2016
1 Decision Trees
When to use decision trees?
• When there is a fixed set of features and each feature takes on a small set of values
• When the target function has a small, discrete set of values
• It is robust to errors in the classification of training data and can also handle some missing values. Why? Because it considers all available training examples during learning.
Important algorithms and extensions
1. ID3
(a) The hypothesis space for ID3 consists of all possible hypothesis functions, and among the multiple possible valid hypotheses, the algorithm prefers shorter trees over longer trees.
(b) Why shorter trees? Occam's razor: prefer the simplest hypothesis that fits the data. Why?
(c) It is a greedy search algorithm, i.e. the algorithm never backtracks to reconsider its previous choices. Therefore it might end up at a local optimum.
(d) The crux of the algorithm is to specify a way to choose an optimal root node.
(e) The optimal root node is selected by calculating information gain, which measures how well the node/feature separates the training examples w.r.t. the target classification.
(f) For a target variable with only binary values, entropy is defined as -
Entropy(S) = -p_⊕ log_2 p_⊕ - p_⊖ log_2 p_⊖
Therefore, for a training set with 9 ⊕ and 5 ⊖ values of the target variable, the entropy will be -
-(9/14) log_2 (9/14) - (5/14) log_2 (5/14) = 0.940
(g) The information gain of a node with attribute A is defined as -
Gain(S, A) = Entropy(S) - Σ_{v ∈ Values(A)} (|S_v| / |S|) Entropy(S_v)
where S is the set of all examples at the root and S_v is the set of examples in the child node corresponding to value v of A. The node with maximum information gain is selected as root.
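The entropy and information-gain calculations above can be sketched in Python; the function names and the toy feature dictionaries are my own illustration, not from the source:

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels, in bits."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(examples, labels, feature):
    """Gain(S, A) = Entropy(S) - sum over v of |S_v|/|S| * Entropy(S_v)."""
    total = len(labels)
    # Partition the labels by the value that 'feature' takes on each example
    partitions = {}
    for x, y in zip(examples, labels):
        partitions.setdefault(x[feature], []).append(y)
    remainder = sum((len(sub) / total) * entropy(sub)
                    for sub in partitions.values())
    return entropy(labels) - remainder

# The 9 positive / 5 negative example from the notes:
print(round(entropy(["+"] * 9 + ["-"] * 5), 3))  # 0.94
```

ID3 would evaluate information_gain for every remaining feature at a node and split on the maximizer.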
2. C4.5
(a) If the training data has significant noise, it results in over-fitting.
(b) Over-fitting can be resolved by post-pruning. One successful method is called rule post-pruning, and a variant of the algorithm is called C4.5.
(c) Infer the decision tree from the training set, growing the tree until the training data is fit as well as possible and allowing over-fitting to occur.
(d) Convert the learned tree into an equivalent set of rules by creating one rule for each path from the root node to a leaf node. Why?
(e) Prune (generalize) each rule by removing any preconditions whose removal improves its estimated accuracy. This is done recursively on each rule until the accuracy worsens.
(f) Sort the pruned rules by their estimated accuracy, and consider them in this sequence when classifying subsequent instances.
(g) A drawback is keeping a separate validation set. To apply the algorithm using the same training set, use a pessimistic estimate of accuracy.
Other Modifications
• Substitute the most frequently occurring value at that node in case of missing values
• Divide information gains by weights if the features are to be weighted
• Use thresholds to split on continuous-valued features
• Use a modified version of information gain for choosing the root node, such as the split information ratio
References
• Machine Learning - Tom Mitchell, Chapter 3
2 Logistic Regression
When to use logistic regression?
• When the target variable is discrete-valued
Algorithm
1. The hypothesis function is of the form (for binary classification) -
h_θ(x) = g(θ^T x) = 1 / (1 + e^{-θ^T x})
where g is the logistic function, which varies between 0 and 1
2. How did we choose this logistic function? Answer - GLMs
3. Let -
P(y = 1 | x; θ) = h_θ(x)
P(y = 0 | x; θ) = 1 - h_θ(x)
More generally,
P(y | x; θ) = h_θ(x)^y (1 - h_θ(x))^{1-y}
4. The likelihood, which is the joint probability over the training set, will be -
L(θ) = P(y⃗ | X; θ) = ∏_{i=1}^{m} h_θ(x^(i))^{y^(i)} (1 - h_θ(x^(i)))^{1 - y^(i)}
5. We maximize the log of this likelihood function by using the batch gradient ascent method -
θ := θ + α∇θl(θ)
6. For one training example, the derivative of the log likelihood w.r.t. θ is (simple derivation) -
∂l(θ)/∂θ_j = (y - h_θ(x)) x_j
7. Therefore the update rule becomes -
θ_j := θ_j + α (y^(i) - h_θ(x^(i))) x_j^(i)
8. Newton's method can also be used to maximize the log likelihood for fast convergence, but it involves computing the inverse of the Hessian of the log likelihood function w.r.t. θ at each iteration
9. For multi-class classification, use softmax regression, which can also be derived from GLM theory
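The batch gradient ascent update above, vectorized over all training examples, can be sketched in Python with NumPy; the toy dataset and the function name logistic_gradient_ascent are my own illustration, not from the source:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_gradient_ascent(X, y, alpha=0.1, iters=1000):
    """Batch gradient ascent on the log likelihood l(theta).
    X is the (m, n) design matrix with an intercept column of ones;
    y is the (m,) vector of 0/1 labels."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        # Vectorized gradient of l(theta): X^T (y - h_theta(X))
        theta += alpha * X.T @ (y - sigmoid(X @ theta))
    return theta

# Toy 1-D data, separable at x = 0 (intercept term x0 = 1 included)
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
theta = logistic_gradient_ascent(X, y)
preds = (sigmoid(X @ theta) >= 0.5).astype(float)  # matches y
```

Note the ascent (plus sign) rather than descent: we are maximizing the log likelihood, not minimizing a cost.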
References
• Stanford CS229 Notes 1
3 Linear Regression
When to use linear regression?
• When the target variable is continuous-valued (regression problems)
Algorithm
1. The hypothesis function is linear in the parameters -
h_θ(x) = θ^T x
with x_0 = 1
2. Intuitively, we minimize the sum of squared errors over the training data to find θ
3. The cost function becomes -
J(θ) = (1/2) Σ_{i=1}^{m} (h_θ(x^(i)) - y^(i))^2
and to solve we minimize J(θ)
4. Why did we choose the least-squares cost function? Answer - assume the training data is just Gaussian noise around an actual polynomial. The log of the probability distribution of the noise, i.e. the log likelihood, gives you the least-squares formula
5. Use gradient descent to minimize cost function
6. Normal equation solution -
θ = (X^T X)^{-1} X^T y⃗
7. Locally weighted linear regression is one modification that accounts for highly varying data without over-fitting
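A minimal sketch of the normal-equation solution with NumPy; np.linalg.solve is used instead of forming (X^T X)^{-1} explicitly, which is numerically safer but mathematically the same. The toy data is my own illustration:

```python
import numpy as np

def normal_equation(X, y):
    """Solve theta = (X^T X)^{-1} X^T y without forming the inverse."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# Noiseless data from y = 1 + 2x, with an intercept column x0 = 1
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])
theta = normal_equation(X, y)  # approximately [1.0, 2.0]
```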
References
• Stanford CS229 Notes 1 (http://cs229.stanford.edu/notes/cs229-notes1.pdf)
4 Nearest Neighbours
5 Gaussian Discriminant Analysis
When to use GDA?
• For classification problems with continuous features
• GDA makes stronger assumptions about the training data than logistic regression. When these assumptions are correct, GDA generally performs better than logistic regression.
Algorithm
1. GDA is a generative algorithm
2. Intuitive explanation for generative algorithms - First, looking at elephants, we can build a model of what elephants look like. Then, looking at dogs, we can build a separate model of what dogs look like. Finally, to classify a new animal, we can match the new animal against the elephant model, and match it against the dog model, to see whether the new animal looks more like the elephants or more like the dogs we had seen in the training set.
3. Model -
y ∼ Bernoulli(φ)
x | y = 0 ∼ N(µ_0, Σ)
x | y = 1 ∼ N(µ_1, Σ)
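The maximum-likelihood estimates for this model have closed forms: φ is the fraction of y = 1 examples, µ_0 and µ_1 are the per-class means, and Σ is the shared covariance of the class-centered data. A minimal sketch, with a toy dataset of my own:

```python
import numpy as np

def fit_gda(X, y):
    """Closed-form maximum-likelihood estimates for the GDA model:
    phi = P(y = 1), class means mu0 and mu1, shared covariance Sigma."""
    phi = y.mean()
    mu0 = X[y == 0].mean(axis=0)
    mu1 = X[y == 1].mean(axis=0)
    # Subtract each example's own class mean, then average the outer products
    centered = X - np.where(y[:, None] == 1, mu1, mu0)
    Sigma = centered.T @ centered / len(y)
    return phi, mu0, mu1, Sigma

# Two well-separated classes in 2-D
X = np.array([[0.0, 0.0], [1.0, 1.0], [4.0, 4.0], [5.0, 5.0]])
y = np.array([0, 0, 1, 1])
phi, mu0, mu1, Sigma = fit_gda(X, y)  # phi = 0.5, mu0 = [0.5, 0.5], mu1 = [4.5, 4.5]
```

Prediction then compares the class posteriors P(y | x) obtained from these Gaussians via Bayes' rule.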
6 Naive Bayes
7 Bayes Networks
8 Adaboost
9 SVM
10 K-means Clustering
11 Expectation Maximization
12 SVD
13 PCA
14 Random Forests
15 Artificial Neural Networks