
582631 Introduction to Machine Learning, Autumn 2014
Exercise set 6

To be discussed in the exercise session on 4 December. Attending the session is completely voluntary.

Credit is given based on the written solutions turned in to the course assistant.

The deadline for turning in your solutions is Friday, 5 December, at 23:59.

Send your solutions to the course assistant ([email protected]) by e-mail. You should send one PDF file including your solutions to the pen-and-paper problems and the report for the programming problem, and a single compressed file including the code for the programming problem.

Pen-and-paper problems

Problem 1 (3 points) Consider our standard linear regression problem, where we have a set of N data points (xi, yi) = ((xi1, ..., xin), yi) with xi ∈ R^n and yi ∈ R for i = 1, ..., N. Let w* ∈ R^n be the least-squares weight vector, as discussed for example on page 203 of the lecture notes.

(a) Consider a variable xj that has a positive coefficient in the optimal solution, that is, w*j > 0. Does this imply that xj and y have a positive correlation in the data? Why?

(b) Consider a variable xj that is positively correlated with y. Does this imply that w*j > 0? Why?

Hint: Consider situations where xj has a strong correlation with another variable xk.
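The following minimal MATLAB sketch (not part of the assignment; the data is made up purely for illustration) shows the kind of situation the hint points at: a variable xj that is positively correlated with y can still receive a negative least-squares weight when it is strongly correlated with another variable xk.

    % Illustration only (made-up data): xj is positively correlated
    % with y, yet its least-squares weight comes out negative because
    % xj is strongly correlated with xk.
    rng(1);
    N  = 1000;
    xk = randn(N, 1);
    xj = xk + 0.1 * randn(N, 1);     % xj nearly equal to xk
    y  = 2 * xk - xj;                % true weights: wk = 2, wj = -1
    w  = [xj, xk] \ y;               % least-squares fit
    c  = corrcoef(xj, y);
    fprintf('corr(xj, y) = %.2f\n', c(1, 2));        % positive
    fprintf('wj = %.2f, wk = %.2f\n', w(1), w(2));   % wj negative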

Problem 2 (3 points) Exclusive or is a classical example used to demonstrate the limitations of linear classifiers. In the most basic setting, we have two binary inputs and a binary output. Using +1 and -1 as the binary values, the exclusive or function XOR is given by the following table:

x1   x2   XOR(x1, x2)
-1   -1   -1
-1   +1   +1
+1   -1   +1
+1   +1   -1

(a) Give a detailed proof for the fact that the exclusive or function cannot be represented as a linear classifier. That is, prove that there is no coefficient vector (w0, w1, w2) such that

XOR(x1, x2) = sign(w0 + w1x1 + w2x2)

would hold for all possible inputs (x1, x2).

(b) Show that if we include an extra input variable x1x2, we can represent XOR as a linear classifier in terms of the modified inputs. That is, find a coefficient vector (w0, w1, w2, w3) such that

XOR(x1, x2) = sign(w0 + w1x1 + w2x2 + w3x1x2)

holds for all (x1, x2).
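If you want to sanity-check an answer to part (b), the following MATLAB fragment evaluates a candidate weight vector on all four inputs. The zero weights below are placeholders, not a solution.

    % Verify a candidate (w0, w1, w2, w3) against the XOR truth table.
    % The zero weights are placeholders; substitute your own solution.
    w = [0, 0, 0, 0];
    X = [-1 -1; -1 +1; +1 -1; +1 +1];
    t = [-1; +1; +1; -1];
    p = sign(w(1) + w(2)*X(:,1) + w(3)*X(:,2) + w(4)*X(:,1).*X(:,2));
    disp(isequal(p, t));             % prints 1 if the weights work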


Problem 3 (3 points) Multilayer neural networks are another way of using linear classifiers for nonlinear problems. The most basic such classifier, for two-dimensional inputs and with two hidden units, has parameters (uij), i ∈ {1, 2}, j ∈ {0, 1, 2}, and (vk), k ∈ {0, 1, 2} (that is, a total of 9 parameters), and computes a mapping (x1, x2) ↦ y as follows:

z1 = sign(u10 + u11x1 + u12x2)
z2 = sign(u20 + u21x1 + u22x2)
y = sign(v0 + v1z1 + v2z2).

This can be visualized as follows:

[Network diagram: the inputs x1, x2 and a constant bias 1 feed the hidden units z1 and z2 through the weights uij; z1, z2 and a bias 1 feed the output y through the weights v0, v1, v2.]

The term hidden unit refers to the computation of intermediate values zi that are not themselves part of the visible input or output.

(a) Show that with a suitable choice of the weights (uij) and (vk), the neural network explained above can compute the XOR function from Problem 2.

(b) There are numerous variants of this basic neural network model. In particular, the signum function used above in computing the values zi is often replaced by some continuous function, such as the logistic sigmoid r ↦ 1/(1 + e^(-r)). However, one variant that does not make sense is simply to ignore the signum function for zi and compute

z1 = u10 + u11x1 + u12x2
z2 = u20 + u21x1 + u22x2
y = sign(v0 + v1z1 + v2z2).

Show that this model would not offer any advantage over a basic linear classifier y = sign(w0 + w1x1 + w2x2).
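To test a candidate weight assignment for part (a) numerically, the forward pass can be written directly from the three equations defining the network. The zero weights below are again placeholders, not a solution.

    % Forward pass of the two-hidden-unit network; u is 2x3 with
    % columns (bias, x1, x2), and v holds (v0, v1, v2).
    % Zero weights are placeholders; substitute your own solution.
    u = zeros(2, 3);
    v = zeros(1, 3);
    net = @(x1, x2) sign(v(1) ...
        + v(2) * sign(u(1,1) + u(1,2)*x1 + u(1,3)*x2) ...
        + v(3) * sign(u(2,1) + u(2,2)*x1 + u(2,3)*x2));
    disp(net(-1, +1));               % evaluate on one input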


Computer problem

General instructions:

- Return a brief written report (as PDF, included in the same file as your pen-and-paper solutions) and one directory (as a zip or tar.gz file) including all your code.

- Do not include any of the data files in your solution file.

- Always mention your name and student ID in the report.

- We use the report as the main basis for grading: all your results should be in the report. We also look at the code, but we won't go fishing for results in your code.

- The code needs to be submitted as a runnable file or set of files (command to run it given in the report).

- Your code should be reasonably efficient. You are not expected to really optimize the code, but your solution should not take excessive amounts of time on data sets relevant to the problem. A typical cause for a failure here would be using for-loops instead of vector operations in MATLAB. If your solution is grossly inefficient, your score may be reduced accordingly.

- In your report, the results will mostly be either in the form of a figure or program output. In both cases, add some sentences which explain what you are showing and why the results are the answer to the question.

- If we ask you to test whether an algorithm works, always give a few examples showing that the algorithm indeed works.

Problem 4 (15 points)

Download the MNIST data set. It can be found on the course web page, in a package that also includes some helpful MATLAB and R routines for handling it. Split the data into two equally sized parts, using the first half of the data set as the training set, and the second half as the test set. The data set contains 60,000 data points, so both the training and test sets will contain 30,000 points. Each x value is a 784-dimensional vector, corresponding to the grey values of a 28 × 28 pixel image. The y values are classes, corresponding to digits 0 through 9.
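One possible way to set up the split is sketched below. The file name and the variable names X (60000 × 784) and y (60000 × 1) are assumptions; the actual loader routine and names depend on the course package, so adapt accordingly.

    % Assumed file and variable names; adapt to the routines in the
    % course package.
    load('mnist.mat');                        % assumed to provide X, y
    Xtrain = X(1:30000, :);   ytrain = y(1:30000);
    Xtest  = X(30001:end, :); ytest  = y(30001:end);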

Implement the Perceptron algorithm, including the pocket modification for use on non-separable data. Use it to learn a multi-class linear classifier for the MNIST data set. Do this in two different ways, first using the one vs. all technique for applying binary classifiers to a multiclass problem, and then using the all vs. all technique.
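A minimal sketch of the pocket perceptron for one binary subproblem is given below, assuming labels in {+1, -1} and handling the bias by appending a constant feature. Note that re-evaluating the full training error after every update, as done here for clarity, is expensive; a real submission should amortize or vectorize this step.

    function w_best = pocket_perceptron(X, y, epochs)
    % Pocket perceptron sketch: X is N x d, y is N x 1 with values +/-1.
    % Keeps in the "pocket" the weights with the fewest training errors.
        X = [ones(size(X, 1), 1), X];        % append bias feature
        [N, d] = size(X);
        w = zeros(d, 1);
        w_best = w;
        e_best = sum(sign(X * w) ~= y);
        for epoch = 1:epochs
            for i = 1:N
                if sign(X(i, :) * w) ~= y(i)
                    w = w + y(i) * X(i, :)';     % perceptron update
                    e = sum(sign(X * w) ~= y);   % errors of the new w
                    if e < e_best                % update the pocket
                        e_best = e;
                        w_best = w;
                    end
                end
            end
        end
    end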

Apply the multi-class classifiers you have learned to the test set. Create the 10 × 10 confusion matrix, both for the one vs. all and for the all vs. all classifier. Are the classification results reasonably accurate? Can you see any difference between the two methods? Are there any other interesting observations you can make?
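One vectorized way to build the confusion matrix, assuming ytrue and ypred hold the test labels and predictions as values 0 through 9:

    % Entry (i, j) counts test points of true class i-1 that were
    % predicted as class j-1; ytrue and ypred are N x 1 with values 0..9.
    C = accumarray([ytrue + 1, ypred + 1], 1, [10, 10]);
    accuracy = sum(diag(C)) / sum(C(:))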
