Applied Machine Learning

Assignment 1

Professor: Aude Billard

Assistants: Guillaume de Chambrier, Nadia Figueroa, Joao Abrantes

contacts:
aude.billard@epfl.ch
guillaume.dechambrier@epfl.ch
nadia.figueroafernandez@epfl.ch
joao.abrantes@epfl.ch

Winter Semester 2015

1 Goals

The goal of this assignment is to familiarize yourself with the Principal Component Analysis technique presented during the class and to acquaint you with the importance of choosing one's dataset well to obtain the best performance of an algorithm.

1.1 Structure of the practicals

Part I of this assignment consists of ungraded exercises to familiarize yourself with the machine learning software we will use for this and future practical sessions. Part II consists of a set of graded exercises. The percentage of marks carried is indicated next to each exercise. For the graded part, you will have to submit a written report in which you answer each of the listed questions. In this and all future practical sessions, you will use the MLDemos (http://lasa.epfl.ch/teaching/lectures/ML_Msc/MLDemos_Master.zip) toolkit, which provides a collection of machine learning algorithms that you can apply to hand-made as well as real-world datasets.

Recall that practical sessions are performed by teams of 3 persons. If you do not yet have a partner, let the assistants know and they will assign you to a team.

1.1.1 PCA

During the first practice session on PCA, 3 different data sets will be analyzed step-by-step by the class as a whole with the help of the assistants. The objective is to draw your attention to the different types of situations and caveats you might encounter when performing PCA. You should also pay attention to the different techniques for visualizing high-dimensional data provided in MLDemos. This section will be ungraded. The second section of the assignment will require you to form your own data set via MLDemos and carry out the same analysis in your respective groups, and it will be graded.


1.1.2 Grading & Submission

This assignment will be graded through a report which you must hand in no later than October 16th, 18h00. All reports should be submitted online at the course webpage http://lasa.epfl.ch/teaching/lectures/ML_Msc/#submission. The submission form is located at the bottom of the page and indicates which submission is currently open. You should select your group and upload a .pdf file not more than 10 MB in size. You may upload multiple times, in which case only the latest file will be graded.

Delays will be penalized: 1 point will be subtracted for each day of delay. The first day late counts starting one hour after the deadline. This report counts for 5% of the total grade of the course. Practicals are conducted in teams of two. Unless told otherwise, we assume that the work has been shared equally by the members of the team and hence all members will be given the same grade. More information on the assignment and on the way the report should be written is given below.


2 Part I: Principal Component Analysis (ungraded)

For this first practical you will focus on choosing a suitable projection of the data through Principal Component Analysis. Such a projection aims at improving the separability of the data and at reducing the dimensionality of the dataset. Throughout the practical session, you will be working on synthetic and real data. Synthetic data can be created and analysed with ML methods by using MLDemos. It is a useful aid for visualizing how changes in the data affect the results of a learning algorithm. However, synthetic data seldom help to grasp some of the issues arising when using realistic, and hence noisy, data. You are advised to start with synthetic data during the practical sessions, but then move to using real data. The report should focus solely on results obtained when using your own real datasets.

The first, non-graded, step will require you to investigate different PCA projections for 3 pre-chosen data sets, namely iris, biotac and ads. Your task is to determine which projection is best suited for the purpose of dimensionality reduction and class separability. You will then be asked to discuss the influence those particular choices may have in improving or degrading the performance of the classification or clustering process.

2.1 Getting started

1. Download MLdemos and datasets

The software (downloadable at http://lasa.epfl.ch/teaching/lectures/ML_Msc/MLDemos_Master.zip) provides a graphical interface for visualizing the data and algorithms you will use throughout this year. The datasets for the in-class part of this practical can be downloaded from http://lasa.epfl.ch/teaching/lectures/ML_Msc/Practical1_data.rar.

If you are using an EPFL computer, it is advised to decompress the MLdemos zip file in the desktop folder to avoid folder/file path issues. For each dataset, carry out the following tasks and answer the questions.

2. Load your data set

Launch mldemos.exe and load a *.data file (drag and drop the file in MLdemos, or go to File > Import > Data (csv, text)). The data is displayed in its original space.

Figure 1: Display of the first two dimensions of the iris data set.

Figure 2: Choose the dimensions your data are displayed on (1) and the way they are displayed (2).


3. Interpret high-dimensional data

By default, the data is projected on the first two dimensions (either in the original space or in the projected space). You can change this by selecting the dimensions you want to display your data on: select the dimensions on the bottom left of the main window (Figure 2, (1)).

As shown in Figure 3, there are other ways of visualizing your data that display more than two dimensions at the same time. Choose from the list (Figure 2, (2)) between Scatterplots (plots every pairwise combination of dimensions next to each other; expect some slowdown if you have a big dataset), Parallel coordinates (each datapoint is a line passing through each dimension) and Bubble plots (display a third dimension by varying the size of each datapoint).

Figure 3: Different data displays are available. From left to right: Standard, Scatterplots, Parallel coordinates and Bubble plots methods.

4. Project your data

To project it with PCA, click on the Algorithms button (Figure 4, (1)) and go to the Projections tab (Figure 4, (2)). Select Principal Component Analysis and click on Project. Your data is now projected onto the eigenvectors of its covariance matrix. Note that you can project your data back to its original space by clicking the Revert button. A graph of the reconstruction error when projecting on each eigenvector and a table with each component's (percentage) variance are shown (Figure 5). This information can be useful to get an idea of how much information is stored in each eigen-component. The Eigen button will display the eigenvectors in a new window.

Figure 4: How to project your data: open the Algorithms window (1), select the Projections tab (2) and choose PCA (3). Click on Project (4).

Figure 5: Reconstruction error and component variance (1). Eigen button (2) to display the eigenvectors in a new window.
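Outside MLDemos, the projection and Revert steps described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function names (`pca_fit`, `project`, `reconstruct`) and the random data are hypothetical, and nothing here is taken from the MLDemos implementation.

```python
import numpy as np


def pca_fit(X):
    """Return (mean, eigenvalues, eigenvectors) of the sample covariance of X.

    X holds one sample per row. Eigenvalues are sorted in decreasing order
    and the eigenvectors are the matching columns.
    """
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    cov = np.cov(X - mean, rowvar=False)      # (d, d) sample covariance
    eigvals, eigvecs = np.linalg.eigh(cov)    # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]         # reorder to descending
    return mean, eigvals[order], eigvecs[:, order]


def project(X, mean, eigvecs, k):
    """Coordinates of X on the first k eigenvectors."""
    return (np.asarray(X, dtype=float) - mean) @ eigvecs[:, :k]


def reconstruct(Z, mean, eigvecs):
    """Map projected coordinates back to the original space."""
    k = Z.shape[1]
    return Z @ eigvecs[:, :k].T + mean


# Made-up 4-dimensional data with very different spreads per dimension.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4)) * np.array([5.0, 2.0, 0.5, 0.1])

mean, eigvals, eigvecs = pca_fit(X)
Z_full = project(X, mean, eigvecs, k=4)       # keep every component
Z_2d = project(X, mean, eigvecs, k=2)         # keep only the first two
X_back = reconstruct(Z_full, mean, eigvecs)   # exact, up to rounding
X_approx = reconstruct(Z_2d, mean, eigvecs)   # lossy

# Mean squared reconstruction error when only two components are kept.
mse_2d = np.mean(np.sum((X - X_approx) ** 2, axis=1))
print("eigenvalues:", eigvals)
print("reconstruction error with 2 components:", mse_2d)
```

Keeping all components makes the round trip lossless, while dropping components leaves an error equal (up to the factor (n-1)/n) to the sum of the discarded eigenvalues.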

2.2 In-class questions

1. Using the visualization type: correlations, how many eigenvectors would you expect to need to achieve a ≈ 99% reconstruction error (iris and biotac datasets)?

2. What qualitative difference do you see in the data projected onto the first eigenvectors as opposed to the later ones (for all datasets)?
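One common way to reason about how many eigenvectors are needed is to look at the cumulative fraction of variance carried by the leading components, since the fraction left out corresponds to the relative reconstruction error. The sketch below assumes this reading of the question; the helper names and the toy data are hypothetical and are not the iris or biotac datasets.

```python
import numpy as np


def explained_variance_ratio(X):
    """Fraction of total variance carried by each principal component."""
    X = np.asarray(X, dtype=float)
    cov = np.cov(X - X.mean(axis=0), rowvar=False)
    eigvals = np.linalg.eigvalsh(cov)[::-1]   # descending order
    eigvals = np.clip(eigvals, 0.0, None)     # guard against tiny negatives
    return eigvals / eigvals.sum()


def components_for_variance(X, threshold=0.99):
    """Smallest number of components whose cumulative ratio reaches threshold."""
    cumulative = np.cumsum(explained_variance_ratio(X))
    k = int(np.searchsorted(cumulative, threshold)) + 1
    return min(k, cumulative.size)


# Toy data: variance is 40, 0.4 and 0.004 along the three axes.
X_demo = np.array([
    [10.0, 0.0, 0.0], [-10.0, 0.0, 0.0],
    [0.0, 1.0, 0.0], [0.0, -1.0, 0.0],
    [0.0, 0.0, 0.1], [0.0, 0.0, -0.1],
])

ratios = explained_variance_ratio(X_demo)
print("variance ratios:", ratios)
print("components for 95%:", components_for_variance(X_demo, 0.95))
print("components for 99.5%:", components_for_variance(X_demo, 0.995))
```

On a real dataset the same two calls can be compared against the table of component variances that MLDemos displays.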

2.2.1 Data set description

iris

This data set was introduced by Sir Ronald Fisher as a test dataset for discriminant analysis. It is a multivariate data set consisting of three different flower types (Iris Setosa, Iris Versicolour, Iris Virginica). Each flower is represented by a 4-dimensional vector:

1 Sepal length

2 Sepal width

3 Petal length

4 Petal width

The data set in question was taken from the UCI Machine Learning Repository http://archive.ics.uci.edu/ml/datasets/Iris.

biotac

The biotac data set consists of two classes where each sample has 19 features. The data was recorded during a simple sweeping motion on a table top in which all the biotac receptor patches (see Figure 6) were in contact with the table. Two sweeping motions were performed, which correspond to the two classes. The first sweep was from left to right, whilst the second was from right to left.


Figure 6: a) Biotac finger, meant to be as close as possible to the sensing capabilities of a human finger. b) Each circle represents a sensor (numbered 1 to 19); the spatial locations of the patches map onto the skin of the finger.

3 Part II: Principal Component Analysis (graded)

Face and object images present an interesting source of data as they live in a high-dimensional space (images often have several thousand dimensions). In this part, groups should be using original images that they have gathered through the internet or other sources (personal photos, etc.). Note that in the report that you will submit, the images must be a mix of different object types, e.g. mugs, pens, faces etc.

When creating the image dataset, keep in mind that you will have to split each dataset into different classes later on in the practicals; therefore make sure to have enough samples for whichever types of objects you choose to have in your dataset. The minimal size for a dataset should be 50-60 samples (i.e. 25-30 samples per class), but you will realize that a bigger dataset can help your understanding of how the algorithm works. The system should be able to process up to a couple of thousand samples.
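To make the remark about dimensionality concrete, the sketch below shows how a set of equally sized grayscale images becomes a data matrix with one row per image and one column per pixel. The image size (48x48), the random pixel values and the function name `images_to_matrix` are assumptions chosen for illustration; how PCAFaces stores its samples internally is not described here.

```python
import numpy as np


def images_to_matrix(images):
    """Stack equally sized grayscale images into an (n_samples, n_pixels) matrix."""
    images = [np.asarray(img, dtype=float) for img in images]
    shape = images[0].shape
    if any(img.shape != shape for img in images):
        raise ValueError("all images must share the same height and width")
    return np.stack([img.ravel() for img in images])


# Placeholder data: 6 random 48x48 "images" in 2 classes (not real photos).
rng = np.random.default_rng(1)
images = [rng.random((48, 48)) for _ in range(6)]
labels = np.array([0, 0, 0, 1, 1, 1])

X = images_to_matrix(images)
n_samples, n_dims = X.shape

# After centring, n samples span at most n - 1 directions, so PCA can return
# at most min(n - 1, d) components with non-zero variance.
max_components = min(n_samples - 1, n_dims)
rank = np.linalg.matrix_rank(X - X.mean(axis=0))
print("data matrix shape:", X.shape)
print("upper bound on informative components:", max_components)
```

With far fewer samples than pixels, the number of samples rather than the image size limits how many eigenvectors carry any variance, which is one reason a bigger dataset can be more instructive.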

Figure 7: PCAFaces plugin GUI for creating image-based datasets and projecting the results into MLDemos.

Follow these steps to create your own dataset with images.

1. Launch MLDemos and select Plugins > Input / Output > PCAFaces from the menu.


2. An interface should pop up (see Figure 7). If you have a camera attached to your computer, it should open up on the left-hand side of the interface and allow you to select a region of the image that can be captured multiple times (e.g. different faces or different facial expressions). Alternatively, an image can be loaded and sub-regions of that image selected as samples.

3. Use the button marked with >> to add the selected regions to the dataset.

4. Once you have selected enough samples (all samples will be gathered in the right-hand side of the interface), you can assign labels to each sample by left/right-clicking on them. You can left/right-Shift-click a sample to change the class label of all samples below. Ctrl+clicking on a sample will remove it from the dataset. Save the dataset once you're satisfied with the results. The dataset is saved as an image which you can open and edit with any imaging software (and which you could for example include in your report).

5. In the PCAFaces window, you select the eigenvectors to project your data on in the bottom right of the window.

Figure 8: Selecting the eigenvectors in the PCAFaces window will determine onto which two eigenvectors the data is projected in the main window.

4 Report

Write a report of maximum 4 pages (single column, 10pt minimum) in PDF format. Pages beyond the fourth one will be ignored. The best way to write the report is to fill it in as you go during the practical session. Just jotting down some quick notes while you experiment will save you hours once you work on the report itself. A qualitative evaluation should contain images (e.g. screenshots) which exemplify the concepts you want to explain (e.g. an image of a good projection and an image of a bad one). Make sure to include only a subset of all the plots you may have visualized during the practical. Choose the ones that are the most representative. Make sure that there is no redundancy in the information conveyed by the graphs and thus that each graph presents a different concept. Each graph/image should be accompanied by a caption that explains the content of the image. Bad captions are captions that contain solely the figure number! An example of a good caption would typically read as follows: Figure 2: The left plot shows the e1 and e2 projection of 10 images of human faces, typical of those shown in Figure 1. In the main text, refer to all figures using their figure numbers. Bad captions and lack of clear references to pictures in the text will be penalized.


4.1 Format

In this first report, we expect solely a qualitative assessment of the performance and behavior of the system. Your report will be graded on the following aspects:

1. Description of your data set. This would include the number of classes, number of samples per class and the dimensionality of your data. You may provide illustrations depicting a typical member from each class. (20%)

2. The following discussions regarding the PCA algorithm:

(a) Discuss the effectiveness of using PCA as a preprocessing step before classification. Think in terms of the separability of the data in the projected space. (20%)

(b) Can you find one or more projections of the data that would make the classes separable? If this is the case, can you decipher which feature of the data was extracted by the projection and whether these features correspond to your expectations? If you did not manage to find a suitable pair or group of projections to separate the data, discuss why this is the case. (30%)

(c) What happens if you do not use all samples to train PCA (you can do this by right/left-clicking on the samples in the dataset window), e.g., if all objects used in PCA have similar shape/color etc.? Repeat this process 3 times by selecting different subgroups of images, and discuss how the choice of training set affects the choice of PCA features and the separability of the data. (30%)
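The effect asked about in (c) can be previewed on synthetic data before trying it on images. The sketch below fits PCA on three different training subsets, projects all samples onto the resulting first component, and reports a simple separation score. Everything in it is an assumption made for illustration: the two-dimensional Gaussian classes, the helper names, and the score itself (gap between class means divided by the average within-class spread), which is just one possible heuristic and not a criterion prescribed by the assignment.

```python
import numpy as np


def fit_pca(X_train, k):
    """Return (mean, components) with components of shape (d, k)."""
    mean = X_train.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(X_train - mean, full_matrices=False)
    return mean, vt[:k].T


def separation_on_first_component(X, labels, mean, components):
    """Gap between the two class means over the average within-class spread."""
    z = (X - mean) @ components[:, 0]
    z0, z1 = z[labels == 0], z[labels == 1]
    spread = 0.5 * (z0.std() + z1.std())
    return abs(z0.mean() - z1.mean()) / spread


# Two synthetic classes: they differ along axis 0 but vary mostly along axis 1.
rng = np.random.default_rng(2)
n = 100
class0 = rng.normal(loc=[-5.0, 0.0], scale=[0.5, 2.0], size=(n, 2))
class1 = rng.normal(loc=[5.0, 0.0], scale=[0.5, 2.0], size=(n, 2))
X = np.vstack([class0, class1])
labels = np.repeat([0, 1], n)

subsets = {
    "all samples": np.arange(2 * n),
    "class 0 only": np.flatnonzero(labels == 0),
    "class 1 only": np.flatnonzero(labels == 1),
}

scores = {}
for name, idx in subsets.items():
    mean, components = fit_pca(X[idx], k=1)   # train on the subset only
    scores[name] = separation_on_first_component(X, labels, mean, components)
    print(f"{name}: first component = {components[:, 0]}, score = {scores[name]:.2f}")
```

When PCA is trained on a single class, its first component follows the within-class variation and the classes overlap in the projection; trained on both classes, the first component follows the direction that separates them.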
