

    Machine Learning Top Algorithms Revision Material

    March 1, 2016

    1 Decision Trees

    When to use decision trees?

    •  When there is a fixed set of features and each feature takes on a small set of values

    •  The target function has a small, discrete set of values

    •  It is robust to errors in the classification of training data and can also handle some missing values. Why? Because it considers all available training examples during learning.

    Important algorithms and extensions

    1. ID3

    (a) The hypothesis space for ID3 consists of all possible hypothesis functions, and among the multiple valid hypotheses the algorithm prefers shorter trees over longer trees.

    (b) Why shorter trees? Occam's razor: prefer the simplest hypothesis that fits the data. Why? Because there are far fewer short hypotheses than long ones, a short hypothesis that fits the data is unlikely to do so by coincidence.

    (c) It is a greedy search algorithm, i.e. the algorithm never backtracks to reconsider its previous choices. Therefore it might end up at a local optimum.

    (d) The crux of the algorithm is to specify a way to choose an optimal root node.

    (e) Optimal selection of the root node is decided by calculating Information Gain, which measures how well the node/feature separates the training examples w.r.t. the target classification.

    (f) For a target variable with only binary values, entropy is defined as

        $Entropy(S) = -p_{\oplus} \log_2 p_{\oplus} - p_{\ominus} \log_2 p_{\ominus}$


    Therefore, for a training set with 9 ⊕ and 5 ⊖ values of the target variable, the entropy will be

        $-\frac{9}{14} \log_2 \frac{9}{14} - \frac{5}{14} \log_2 \frac{5}{14} = 0.940$

    (g) The information gain of a node with attribute A is defined as

        $Gain(S, A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} Entropy(S_v)$

    where S is the set of all examples at the root and S_v is the set of examples in the child node corresponding to a value v of A. The node with maximum information gain is selected as the root (a short Python sketch of this computation appears at the end of this section, just before the References).

    2. C4.5

    (a) If the training data has significant noise, the learned tree over-fits.

    (b) Over-fitting can be resolved by post-pruning. One such successful method is called Rule Post Pruning, and a variant of the algorithm is called C4.5.

    (c) Infer the decision tree from the training set, growing the tree until the training data is fit as well as possible and allowing over-fitting to occur.

    (d) Convert the learned tree into an equivalent set of rules by creating one rule for each path from the root node to a leaf node. Why? Because pruning each rule independently is more flexible than pruning the tree itself: the same decision node can be pruned differently in different rules.

    (e) Prune (generalize) each rule by removing any preconditions whose removal improves its estimated accuracy. This is repeated on that rule until the accuracy worsens (see the pruning sketch after this list).

    (f) Sort the pruned rules by their estimated accuracy, and consider them in this sequence when classifying subsequent instances.

    (g) A drawback is the need to keep a separate validation set for estimating rule accuracy. To apply the algorithm using the same training set, use a pessimistic estimate of the accuracy.
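
    As a rough illustration of steps (d) through (f), below is a minimal Python sketch of rule post-pruning. It is not the exact C4.5 procedure: the rule representation (a list of (feature, value) preconditions plus a predicted label) and the held-out example set used for accuracy estimation are simplifying assumptions made for this example; a pessimistic estimate could be substituted.

```python
# Minimal sketch of rule post-pruning (illustrative; not the exact C4.5 code).
# A rule is a list of (feature, value) preconditions plus a predicted label.

def rule_accuracy(preconditions, label, examples):
    """Estimated accuracy of a rule on a set of (feature_dict, target) pairs.

    Examples satisfying every precondition are 'covered'; accuracy is the
    fraction of covered examples whose target equals the rule's label.
    """
    covered = [t for x, t in examples
               if all(x.get(f) == v for f, v in preconditions)]
    if not covered:
        return 0.0
    return sum(t == label for t in covered) / len(covered)

def prune_rule(preconditions, label, examples):
    """Greedily drop preconditions while the estimated accuracy does not drop."""
    preconditions = list(preconditions)
    improved = True
    while improved and preconditions:
        improved = False
        base = rule_accuracy(preconditions, label, examples)
        for i in range(len(preconditions)):
            candidate = preconditions[:i] + preconditions[i + 1:]
            if rule_accuracy(candidate, label, examples) >= base:
                preconditions = candidate
                improved = True
                break
    return preconditions, label

# Toy usage: removing the 'wind' precondition does not hurt accuracy, so it is pruned.
examples = [({"outlook": "sunny", "wind": "weak"}, "yes"),
            ({"outlook": "sunny", "wind": "strong"}, "yes"),
            ({"outlook": "rain", "wind": "weak"}, "no")]
print(prune_rule([("outlook", "sunny"), ("wind", "weak")], "yes", examples))
```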

    Other Modifications

    •  Substitute the most frequently occurring value at that node in case of missing feature values

    •  Divide information gains by weights if the features are to be weighted

    •  Use thresholds for continuous values

    •  Need to use a modified version of information gain for choosing the root node, such as the Split Information Ratio (gain ratio), when features take many distinct values
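
    To make the entropy and information-gain formulas from the ID3 discussion concrete, here is a minimal Python sketch using only the standard library. The feature/label representation is an assumption made for the example; the printed entropy reproduces the 0.940 value computed above.

```python
# Minimal sketch of the entropy / information-gain computation used by ID3.
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy(S) = -sum_c p_c * log2(p_c) over the class labels in S."""
    total = len(labels)
    return -sum((n / total) * log2(n / total)
                for n in Counter(labels).values())

def information_gain(examples, labels, feature):
    """Gain(S, A) = Entropy(S) - sum_v |S_v|/|S| * Entropy(S_v)."""
    total = len(labels)
    gain = entropy(labels)
    for value in set(x[feature] for x in examples):
        subset = [y for x, y in zip(examples, labels) if x[feature] == value]
        gain -= (len(subset) / total) * entropy(subset)
    return gain

# The 9 positive / 5 negative example from the text: entropy = 0.940.
print(f"{entropy(['+'] * 9 + ['-'] * 5):.3f}")

# Root selection: pick the feature with maximum information gain, e.g.
# root = max(features, key=lambda f: information_gain(examples, labels, f))
```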

    References

    •  Machine Learning - Tom Mitchell, Chapter 3


    2 Logistic Regression

    When to use logistic regression?

    •  When the target variable is discrete-valued

    Algorithm

    1. The hypothesis function is of the form (for binary classification)

        $h_\theta(x) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}$

        where g is the logistic (sigmoid) function, which varies between 0 and 1.

    2. How did we choose this logistic function? Answer: GLMs; it falls out of the Generalized Linear Model construction when y is modelled as a Bernoulli variable.

    3. Let

        $P(y = 1 \mid x; \theta) = h_\theta(x)$

        $P(y = 0 \mid x; \theta) = 1 - h_\theta(x)$

        More generally,

        $P(y \mid x; \theta) = h_\theta(x)^{y} (1 - h_\theta(x))^{1 - y}$

    4. The likelihood, which is the joint probability over the training set, is

        $L(\theta) = P(\vec{y} \mid X; \theta) = \prod_{i=1}^{m} h_\theta(x^{(i)})^{y^{(i)}} (1 - h_\theta(x^{(i)}))^{1 - y^{(i)}}$

    5. We maximize the log of this likelihood function by using the batch gradient ascent method

        $\theta := \theta + \alpha \nabla_\theta \ell(\theta)$

    6. For one training example, the derivative of the log likelihood w.r.t. θ is (simple derivation)

        $\frac{\partial \ell(\theta)}{\partial \theta_j} = (y - h_\theta(x)) x_j$

    7. Therefore the update rule becomes (see the sketch after this list)

        $\theta_j := \theta_j + \alpha (y^{(i)} - h_\theta(x^{(i)})) x_j^{(i)}$

    8. One can also use Newton's method for maximizing the log likelihood for faster convergence, but it involves computing the inverse of the Hessian of the log likelihood function w.r.t. θ at each iteration.

    9. For multi-class problems, use Softmax Regression, which can also be derived from GLM theory.
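
    Below is a minimal NumPy sketch of the batch gradient ascent update from steps 5 to 7, assuming binary 0/1 labels. The toy data, learning rate and iteration count are arbitrary illustration choices, not part of the notes.

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, alpha=0.01, iters=1000):
    """Batch gradient ascent on the log likelihood l(theta).

    X: (m, n) design matrix with a leading column of ones (x_0 = 1).
    y: (m,) vector of 0/1 labels.
    """
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        h = sigmoid(X @ theta)             # h_theta(x^(i)) for every example
        theta += alpha * (X.T @ (y - h))   # theta_j += alpha * sum_i (y^(i) - h_i) x_j^(i)
    return theta

# Toy usage: two overlapping Gaussian blobs labelled 0 and 1.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, (50, 2)), rng.normal(1.0, 1.0, (50, 2))])
X = np.hstack([np.ones((100, 1)), X])      # prepend the intercept feature x_0 = 1
y = np.concatenate([np.zeros(50), np.ones(50)])

theta = fit_logistic(X, y)
preds = (sigmoid(X @ theta) > 0.5).astype(int)
print("training accuracy:", (preds == y).mean())
```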


    References

    •  Stanford CS229 Notes 1

    3 Linear Regression

    When to use linear regression?

    •  As a first, simple choice for regression problems, i.e. when the target variable is continuous

    Algorithm

    1. The hypothesis function is of the form

        $h_\theta(x) = \theta^T x$

        with $x_0 = 1$ (intercept term).

    2. Intuitively, we minimize the sum of squared errors over the training data to find θ.

    3. The cost function becomes

        $J(\theta) = \frac{1}{2} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})^2$

        and to solve we minimize J(θ).

    4. Why did we choose the least squares cost function? Answer: assume the training data is just Gaussian noise around the actual underlying function (e.g. a polynomial). The log of the probability distribution of the noise, i.e. the log likelihood, gives you the least squares formula.

    5. Use gradient descent to minimize the cost function.

    6. Normal equation (closed-form) solution (see the sketch after this list)

        $\theta = (X^T X)^{-1} X^T \vec{y}$

    7. Locally weighted linear regression is one modification to account for highly varying data without over-fitting.
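
    Below is a minimal NumPy sketch of the normal-equation solution and, for comparison, batch gradient descent on J(θ). The synthetic data, noise level, step size and iteration count are made-up illustration choices.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 100
x = rng.uniform(-3.0, 3.0, m)
y = 2.0 + 0.5 * x + rng.normal(0.0, 0.3, m)   # true line plus Gaussian noise

X = np.column_stack([np.ones(m), x])          # x_0 = 1 (intercept term)

# Normal equation: theta = (X^T X)^{-1} X^T y (solve instead of an explicit inverse)
theta_closed = np.linalg.solve(X.T @ X, X.T @ y)

# Batch gradient descent on J(theta) = 1/2 * sum_i (h_theta(x^(i)) - y^(i))^2
theta = np.zeros(2)
alpha = 0.001
for _ in range(10000):
    theta -= alpha * (X.T @ (X @ theta - y))  # gradient of J w.r.t. theta

print(theta_closed)   # roughly [2.0, 0.5]
print(theta)          # should agree with the closed-form solution
```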

    References

    •  Stanford CS229 Notes 1 (http://cs229.stanford.edu/notes/cs229-notes1.pdf)


    4 Nearest Neighbours

    5 Gaussian Discriminant Analysis

    When to use GDA?

    •  For classification problems with continuous features

    •  GDA makes stronger assumptions about the training data than logistic regression. When these are correct, GDA generally performs better than logistic regression.

    Algorithm

    1. GDA is a generative algorithm

    2. Intuitive explanation for generative algorithms: first, looking at elephants, we can build a model of what elephants look like. Then, looking at dogs, we can build a separate model of what dogs look like. Finally, to classify a new animal, we can match the new animal against the elephant model, and match it against the dog model, to see whether the new animal looks more like the elephants or more like the dogs we had seen in the training set.

    3. Model (see the sketch below):

        $y \sim \mathrm{Bernoulli}(\phi)$

        $x \mid y = 0 \sim \mathcal{N}(\mu_0, \Sigma)$

        $x \mid y = 1 \sim \mathcal{N}(\mu_1, \Sigma)$
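
    Below is a minimal NumPy sketch of fitting this model by maximum likelihood and classifying with Bayes' rule. The closed-form parameter estimates are the standard ones from the CS229 notes; the toy data is made up for illustration.

```python
import numpy as np

def fit_gda(X, y):
    """Maximum-likelihood estimates of phi, mu_0, mu_1 and the shared Sigma."""
    phi = y.mean()                                   # P(y = 1)
    mu0 = X[y == 0].mean(axis=0)
    mu1 = X[y == 1].mean(axis=0)
    mus = np.where((y == 1)[:, None], mu1, mu0)      # class mean for each example
    diff = X - mus
    sigma = diff.T @ diff / len(y)                   # shared covariance matrix
    return phi, mu0, mu1, sigma

def predict_gda(X, phi, mu0, mu1, sigma):
    """Pick the class with the larger P(x | y) P(y) (Bayes' rule)."""
    inv = np.linalg.inv(sigma)

    def log_gaussian(mu):
        # log N(x; mu, Sigma) up to a constant that is shared by both classes
        d = X - mu
        return -0.5 * np.einsum("ij,jk,ik->i", d, inv, d)

    score1 = log_gaussian(mu1) + np.log(phi)
    score0 = log_gaussian(mu0) + np.log(1.0 - phi)
    return (score1 > score0).astype(int)

# Toy usage: two Gaussian blobs with a shared covariance.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, (50, 2)), rng.normal(1.0, 1.0, (50, 2))])
y = np.concatenate([np.zeros(50), np.ones(50)])

phi, mu0, mu1, sigma = fit_gda(X, y)
preds = predict_gda(X, phi, mu0, mu1, sigma)
print("training accuracy:", (preds == y).mean())
```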


    6 Naive Bayes

    7 Bayes Networks

    8 Adaboost

    9 SVM

    10 K-means Clustering

    11 Expectation Maximization

    12 SVD

    13 PCA

    14 Random Forests

    15 Artificial Neural Networks
