machine learning revision notes
8/19/2019 Machine Learning Revision Notes
Machine Learning Top Algorithms Revision Material
March 1, 2016
1 Decision Trees
When to use decision trees?
• When there is a fixed set of features and each feature takes on a small set of values
• When the target function has a small, discrete set of values
• It is robust to errors in the classification of training data and can also handle some missing values. Why? Because it considers all available training examples during learning.
Important algorithms and extensions
1. ID3
(a) The hypothesis space for ID3 consists of all possible hypothesis functions, and among the multiple possible valid hypotheses, the algorithm prefers shorter trees over longer trees.
(b) Why shorter trees? Occam's razor: prefer the simplest hypothesis that fits the data. Why?
(c) It is a greedy search algorithm, i.e. the algorithm never backtracks to reconsider its previous choices. Therefore it might end up at a local optimum.
(d) The crux of the algorithm is to specify a way to choose an optimal root node.
(e) The optimal root node is selected by calculating information gain, which measures how well the node/feature separates the training examples w.r.t. the target classification.
(f) For a target variable with only binary values, entropy is defined as -
Entropy(S) = -p_⊕ log_2 p_⊕ - p_⊖ log_2 p_⊖
Therefore, for a training set with 9 ⊕ and 5 ⊖ values of the target variable, the entropy will be -
-(9/14) log_2 (9/14) - (5/14) log_2 (5/14) = 0.940
(g) The information gain of a node with attribute A is defined as -
Gain(S, A) = Entropy(S) - Σ_{v ∈ Values(A)} (|S_v| / |S|) Entropy(S_v)
where S is the set of all examples at the root and S_v is the set of examples in the child node corresponding to value v of A. The node with maximum information gain is selected as root.
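The entropy and information-gain calculations above can be sketched in Python; the function names and the toy feature dictionaries are my own illustration, not from the source:

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels, in bits."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(examples, labels, feature):
    """Gain(S, A) = Entropy(S) - sum over v of |S_v|/|S| * Entropy(S_v)."""
    total = len(labels)
    # Partition the labels by the value that 'feature' takes on each example
    partitions = {}
    for x, y in zip(examples, labels):
        partitions.setdefault(x[feature], []).append(y)
    remainder = sum((len(sub) / total) * entropy(sub)
                    for sub in partitions.values())
    return entropy(labels) - remainder

# The 9 positive / 5 negative example from the notes:
print(round(entropy(["+"] * 9 + ["-"] * 5), 3))  # 0.94
```

ID3 would evaluate information_gain for every remaining feature at a node and split on the maximizer.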
2. C4.5
(a) If the training data has significant noise, it results in over-fitting.
(b) Over-fitting can be resolved by post-pruning. One successful method is called rule post-pruning, and a variant of the algorithm is called C4.5.
(c) Infer the decision tree from the training set, growing the tree until the training data is fit as well as possible and allowing over-fitting to occur.
(d) Convert the learned tree into an equivalent set of rules by creating one rule for each path from the root node to a leaf node. Why?
(e) Prune (generalize) each rule by removing any preconditions whose removal improves its estimated accuracy. This is done recursively on each rule until the accuracy worsens.
(f) Sort the pruned rules by their estimated accuracy, and consider them in this sequence when classifying subsequent instances.
(g) A drawback is keeping a separate validation set. To apply the algorithm using the same training set, use a pessimistic estimate of accuracy.
Other Modifications
• Substitute the most frequently occurring value at that node in case of missing values
• Divide information gains by weights if the features are to be weighted
• Use thresholds to split on continuous-valued features
• Use a modified version of information gain for choosing the root node, such as the split information ratio
References
• Machine Learning - Tom Mitchell, Chapter 3
2 Logistic Regression
When to use logistic regression?
• When the target variable is discrete-valued
Algorithm
1. The hypothesis function is of the form (for binary classification) -
h_θ(x) = g(θ^T x) = 1 / (1 + e^{-θ^T x})
where g is the logistic function, which varies between 0 and 1
2. How did we choose this logistic function? Answer - GLMs
3. Let -
P(y = 1 | x; θ) = h_θ(x)
P(y = 0 | x; θ) = 1 - h_θ(x)
More generally,
P(y | x; θ) = h_θ(x)^y (1 - h_θ(x))^{1-y}
4. The likelihood, which is the joint probability over the training set, will be -
L(θ) = P(y⃗ | X; θ) = ∏_{i=1}^{m} h_θ(x^(i))^{y^(i)} (1 - h_θ(x^(i)))^{1 - y^(i)}
5. We maximize the log of this likelihood function by using the batch gradient ascent method -
θ := θ + α∇θl(θ)
6. For one training example, the derivative of the log likelihood w.r.t. θ is (simple derivation) -
∂l(θ)/∂θ_j = (y - h_θ(x)) x_j
7. Therefore the update rule becomes -
θ_j := θ_j + α (y^(i) - h_θ(x^(i))) x_j^(i)
8. Newton's method can also be used to maximize the log likelihood for fast convergence, but it involves computing the inverse of the Hessian of the log likelihood function w.r.t. θ at each iteration
9. For multi-class classification, use softmax regression, which can also be derived from GLM theory
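The batch gradient ascent update above, vectorized over all training examples, can be sketched in Python with NumPy; the toy dataset and the function name logistic_gradient_ascent are my own illustration, not from the source:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_gradient_ascent(X, y, alpha=0.1, iters=1000):
    """Batch gradient ascent on the log likelihood l(theta).
    X is the (m, n) design matrix with an intercept column of ones;
    y is the (m,) vector of 0/1 labels."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        # Vectorized gradient of l(theta): X^T (y - h_theta(X))
        theta += alpha * X.T @ (y - sigmoid(X @ theta))
    return theta

# Toy 1-D data, separable at x = 0 (intercept term x0 = 1 included)
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
theta = logistic_gradient_ascent(X, y)
preds = (sigmoid(X @ theta) >= 0.5).astype(float)  # matches y
```

Note the ascent (plus sign) rather than descent: we are maximizing the log likelihood, not minimizing a cost.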
References
• Stanford CS229 Notes 1
3 Linear Regression
When to use linear regression?
• When the target variable is continuous-valued (regression problems)
Algorithm
1. The hypothesis function is linear in the parameters -
h_θ(x) = θ^T x
with x_0 = 1
2. Intuitively, we minimize the sum of squared errors over the training data to find θ
3. The cost function becomes -
J(θ) = (1/2) Σ_{i=1}^{m} (h_θ(x^(i)) - y^(i))^2
and to solve we minimize J(θ)
4. Why did we choose the least-squares cost function? Answer - assume the training data is just Gaussian noise around an actual polynomial. The log of the probability distribution of the noise, i.e. the log likelihood, gives you the least-squares formula
5. Use gradient descent to minimize cost function
6. Normal equation solution -
θ = (X^T X)^{-1} X^T y⃗
7. Locally weighted linear regression is one modification that accounts for highly varying data without over-fitting
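A minimal sketch of the normal-equation solution with NumPy; np.linalg.solve is used instead of forming (X^T X)^{-1} explicitly, which is numerically safer but mathematically the same. The toy data is my own illustration:

```python
import numpy as np

def normal_equation(X, y):
    """Solve theta = (X^T X)^{-1} X^T y without forming the inverse."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# Noiseless data from y = 1 + 2x, with an intercept column x0 = 1
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])
theta = normal_equation(X, y)  # approximately [1.0, 2.0]
```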
References
• Stanford CS229 Notes 1 (http://cs229.stanford.edu/notes/cs229-notes1.pdf)
4 Nearest Neighbours
5 Gaussian Discriminant Analysis
When to use GDA?
• For classification problems with continuous features
• GDA makes stronger assumptions about the training data than logistic regression. When these assumptions are correct, GDA generally performs better than logistic regression.
Algorithm
1. GDA is a generative algorithm
2. Intuitive explanation for generative algorithms - First, looking at elephants, we can build a model of what elephants look like. Then, looking at dogs, we can build a separate model of what dogs look like. Finally, to classify a new animal, we can match the new animal against the elephant model, and match it against the dog model, to see whether the new animal looks more like the elephants or more like the dogs we had seen in the training set.
3. Model -
y ∼ Bernoulli(φ)
x | y = 0 ∼ N(µ_0, Σ)
x | y = 1 ∼ N(µ_1, Σ)
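The maximum-likelihood estimates for this model have closed forms: φ is the fraction of y = 1 examples, µ_0 and µ_1 are the per-class means, and Σ is the shared covariance of the class-centered data. A minimal sketch, with a toy dataset of my own:

```python
import numpy as np

def fit_gda(X, y):
    """Closed-form maximum-likelihood estimates for the GDA model:
    phi = P(y = 1), class means mu0 and mu1, shared covariance Sigma."""
    phi = y.mean()
    mu0 = X[y == 0].mean(axis=0)
    mu1 = X[y == 1].mean(axis=0)
    # Subtract each example's own class mean, then average the outer products
    centered = X - np.where(y[:, None] == 1, mu1, mu0)
    Sigma = centered.T @ centered / len(y)
    return phi, mu0, mu1, Sigma

# Two well-separated classes in 2-D
X = np.array([[0.0, 0.0], [1.0, 1.0], [4.0, 4.0], [5.0, 5.0]])
y = np.array([0, 0, 1, 1])
phi, mu0, mu1, Sigma = fit_gda(X, y)  # phi = 0.5, mu0 = [0.5, 0.5], mu1 = [4.5, 4.5]
```

Prediction then compares the class posteriors P(y | x) obtained from these Gaussians via Bayes' rule.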
6 Naive Bayes
7 Bayes Networks
8 Adaboost
9 SVM
10 K-means Clustering
11 Expectation Maximization
12 SVD
13 PCA
14 Random Forests
15 Artificial Neural Networks