
DATA MINING AND MACHINE LEARNING
Lecture 1: Introduction to machine learning

Lecturer: Simone Scardapane

Academic Year 2016/2017

Table of contents

About the course
Materials and table of contents

What is machine learning
Top-down programming vs. machine learning
Basic concepts
Some bits of history

Organization

The course is organized in 52 hours, 2/3 theoretical and 1/3 practical. Slides and lab sessions will be self-contained, and will be provided along with a selection of further reading material at:

http://ispac.diet.uniroma1.it/scardapane/

Main reading book:

1. The Elements of Statistical Learning [Hastie, Tibshirani & Friedman], available online.

Additional books:

2. Introduction to Machine Learning [Smola & Vishwanathan, unpublished], available online.

3. Deep Learning [Goodfellow, Bengio & Courville], available in HTML form.

Lab sessions

4 lab sessions will be organized in the Python programming language. A basic knowledge of the language is required. In order to have a working scientific environment, it is recommended to install a scientific Python distribution such as Anaconda:

https://www.continuum.io/downloads

Tentative table of contents

- Introduction and basics of optimization [5 h].

- Linear models [3 h].

- Regularization and loss functions [2 h].

- Data preprocessing, model evaluation and fine-tuning [2 h].

- Neural networks and deep learning (Dr. Elisa Ricci) [5 h].

- Kernel methods [2 h].

- Ensemble learning [2 h].

- Clustering [3 h].

- Additional topics and seminars [4 h].

Table of contents

About the course
Materials and table of contents

What is machine learning
Top-down programming vs. machine learning
Basic concepts
Some bits of history

An XKCD joke

The alt-text reads: “In the 60s, Marvin Minsky assigned a couple of undergrads to spend the summer programming a computer to use a camera to identify objects in a scene. He figured they’d have the problem solved by the end of the summer. Half a century later, we’re still working on it.”

What is simple to program?

Figure 1: Taken from “Two big challenges in machine learning”, by Léon Bottou, ICML 2015.

What is simple to program? (2)

Navigating in a labyrinth (or finding whether a path exists) is simple for a programmer to implement. The problem is well-defined, and there is a clear way to represent the data structures. By contrast, humans can find this task tedious and not obvious if the labyrinth is huge.

Recognizing the mouse (or the cheese) is extremely intuitive for a human, irrespective of the size of the image, but very hard to program in a computer. This is because there are countless possible configurations of pixels giving rise to the concepts of a ‘mouse’/‘cheese’.
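The first half of this contrast can be made concrete. A minimal sketch (the maze layout here is invented for illustration): path existence in a grid maze is a few lines of breadth-first search, with an obvious data representation.

```python
from collections import deque

def path_exists(maze, start, goal):
    """Breadth-first search on a grid maze.
    maze: list of strings, '#' = wall, anything else = free cell."""
    rows, cols = len(maze), len(maze[0])
    frontier, seen = deque([start]), {start}
    while frontier:
        r, c = frontier.popleft()
        if (r, c) == goal:
            return True
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols
                    and maze[nr][nc] != '#' and (nr, nc) not in seen):
                seen.add((nr, nc))
                frontier.append((nr, nc))
    return False

maze = ["..#.",
        ".##.",
        "....",
        ".#.."]
print(path_exists(maze, (0, 0), (3, 3)))  # True
```

No comparably short program exists for the second half of the contrast: there is no simple rule mapping pixel grids to ‘mouse’.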

Some additional examples

Other situations where some ‘heuristic’ reasoning is needed for designing a program:

1. Filtering an email into a spam folder: is it about the occurrence of some words? Which words? Should we care about sentence structure?

2. For a bank, deciding whether a client will default on their loan given their history and demographic details.

3. Classifying a patient as ill or not ill given their medical records.
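To see why hand-written heuristics quickly become unsatisfying, consider a toy keyword-based spam filter (the word list and threshold here are invented for illustration):

```python
SPAM_WORDS = {"winner", "free", "viagra", "prize"}  # hypothetical word list

def looks_like_spam(email, threshold=2):
    """Flag an email if it contains at least `threshold` suspicious words."""
    words = email.lower().split()
    hits = sum(w.strip(".,!?") in SPAM_WORDS for w in words)
    return hits >= threshold

print(looks_like_spam("You are a WINNER! Claim your free prize now"))  # True
print(looks_like_spam("Free lunch at the seminar tomorrow"))           # False
```

Choosing the words and the threshold by hand is exactly the ‘heuristic’ reasoning above; a learning approach instead estimates them from labeled examples.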

An alternative approach

In all previous cases, we generally have a long history of “examples” in some database, such as photos of mice, spam emails, ill patients... As a matter of fact, it is reasonable to assume that a bank will likely make a decision based on past interactions with similar clients.

The motivating question for this course then becomes:

How can we ‘learn’ from such data?

Standard programming: a schema

[Diagram: through a programming interface, we are given both the data and the program; together they produce the output.]

What we would like to have

[Diagram: through a programming interface, we are given old data, from which a learning program is built; applied to new data, it produces the output.]

Characteristics of a learning algorithm

In principle, we would like some sort of ‘universal’ learning algorithm, which should be able to work irrespective of the data domain. From a theoretical point of view, the impossibility of having such an algorithm in the absence of any assumption is formalized in a set of ‘no-free-lunch’ theorems [1].

More practically, specific algorithms have vastly different trade-offs in terms of what type of data they can handle, their expressive power, computational cost, comprehensibility, and so on. This is why ML is an incredibly vast world with hundreds of tools at your disposal.

[1] Wolpert, D.H., 1996. The lack of a priori distinctions between learning algorithms. Neural Computation, 8(7), pp. 1341-1390.

Table of contents

About the course
Materials and table of contents

What is machine learning
Top-down programming vs. machine learning
Basic concepts
Some bits of history

A formal definition of ML

The following classical definition was provided by Tom Mitchell:

“A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.”

— Machine Learning, 1997

Categorization of problems

The problems we described before belong to the subfield of supervised learning: learning a relation from a set of (labeled) examples, which are akin to a teacher signal. This will be the main topic of this course.

If we do not have an explicit label, we have the so-called unsupervised learning, which in itself contains a large set of possible problems: dimensionality reduction, clustering, 2D visualization, etc. These are mostly concerned with hypotheses and modeling with respect to the structure of the data.
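A minimal sketch of the supervised setting, assuming an invented toy dataset: the simplest possible learner, 1-nearest-neighbour classification (k-NN appears later in the course timeline), predicts the label of the closest labeled example.

```python
def nearest_neighbour(train, x):
    """train: list of (point, label) pairs; returns the label of the
    training point closest to x (squared Euclidean distance)."""
    def dist2(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))
    _, label = min(train, key=lambda pl: dist2(pl[0], x))
    return label

# Toy labeled examples (the 'teacher signal'): two clusters in 2D.
train = [((0.0, 0.1), "mouse"), ((0.2, 0.0), "mouse"),
         ((1.0, 1.1), "cheese"), ((0.9, 1.0), "cheese")]
print(nearest_neighbour(train, (0.1, 0.2)))  # mouse
print(nearest_neighbour(train, (1.0, 0.9)))  # cheese
```

An unsupervised method would instead receive only the points, without the labels, and would have to discover the two clusters on its own.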

Categorization of problems (2)

Reinforcement learning is a more advanced subfield of ML, which considers the learning capabilities of an agent that can move in an unstructured environment and only receives a partial ‘reward’ signal at given instants (e.g., an agent learning to play tic-tac-toe).

Some problems do not perfectly fit in this standard categorization, most notably recommender and ranking systems. Recently, Yann LeCun (Director of AI Research @ Facebook) proposed to add predictive learning to this standard categorization, i.e., the capability of an agent of entirely predicting the state of the world (and its evolution) from data.

Practical synonyms of ML

In practice, all these terms can be considered akin to or highly overlapping with ML (this slide is open to many debates):

- Pattern recognition: sometimes this term refers to classification only, which is a specific problem in supervised learning.

- Data mining (more focus on exploration of data). Data mining is sometimes referred to as ‘practical ML’.

- Predictive analytics (focus on predictive modeling).

- Knowledge discovery (common in the databases literature).

- Inferential statistics (as opposed to descriptive statistics).

The main elements in a ML program

Despite the variety of algorithms and approaches to ML, most methods can be understood as a varying combination of the following three items:

- Model: how we represent our knowledge (polynomials, trees, graphs, ...).

- Evaluation (performance measure P): how we evaluate the results of our learning model. Depending on the measure we choose, we can achieve extremely different results even with the same model formulation.

- Optimization: the algorithm we use to find a model that maximizes P. Many problems in ML are NP-hard, so we need efficient heuristic procedures to handle current big data problems.
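The three items can be sketched on the smallest possible example, assuming invented toy data: the model is a line y = w*x + b, the evaluation measure is the mean squared error, and the optimization is plain gradient descent on that measure.

```python
# Toy data generated from y = 2x + 1 (invented for illustration).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

# Model: a line y = w*x + b, with parameters (w, b).
def predict(w, b, x):
    return w * x + b

# Evaluation: mean squared error (lower is better).
def mse(w, b):
    return sum((predict(w, b, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Optimization: gradient descent on the evaluation measure.
w, b, lr = 0.0, 0.0, 0.05
for _ in range(2000):
    grad_w = sum(2 * (predict(w, b, x) - y) * x for x, y in zip(xs, ys)) / len(xs)
    grad_b = sum(2 * (predict(w, b, x) - y) for x, y in zip(xs, ys)) / len(xs)
    w, b = w - lr * grad_w, b - lr * grad_b

print(round(w, 2), round(b, 2))  # close to 2.0 and 1.0
```

Swapping any one of the three items (a polynomial model, an absolute-error measure, a closed-form solver) gives a different learning method with the same overall structure.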

Table of contents

About the course
Materials and table of contents

What is machine learning
Top-down programming vs. machine learning
Basic concepts
Some bits of history

ML today

A completely arbitrary timeline


1952: First checkers program (Samuels)

1957: Perceptron (Rosenblatt)

1967: k-NN formalization

1969: Perceptrons [Book]

1979: Decision Trees

1980s: Expert systems

1984: PAC theory (Valiant)

1986: Backpropagation (?)

1990s: SVMs (Vapnik & coll.)

1998: Convolutional NN (LeCun)

2006: First ‘deep learning’ paper (Hinton)

2012: AlexNet

ML today is everywhere

Reason 1: data

The first main reason is the huge availability of data:

There is enough data in a day of tweets to possibly recreate the English language from scratch [Image source].

Reason 2: computing power

Figure 2: Evolution of computing power for deep learning applications [Image source].

Reason 3: software libraries

Today, there are many mature software libraries for learning, such as scikit-learn in Python. Some of them can be easily distributed over clusters (e.g., the MLlib module in Spark).

Additionally, there are new automatic differentiation tools making deep learning easily affordable, such as TensorFlow and Chainer. Models and algorithms built on these libraries are commonly released as open source, increasing the speed of research even further.
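The core idea these tools automate can be sketched in a few lines, assuming the simplest variant (forward-mode differentiation with dual numbers; libraries like TensorFlow actually use reverse mode on a computational graph, which scales to millions of parameters):

```python
class Dual:
    """Forward-mode automatic differentiation with dual numbers:
    carry the value and its derivative through every operation."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot
    def __add__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.val + o.val, self.dot + o.dot)
    __radd__ = __add__
    def __mul__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        # Product rule: (uv)' = u'v + uv'.
        return Dual(self.val * o.val, self.dot * o.val + self.val * o.dot)
    __rmul__ = __mul__

def derivative(f, x):
    """Exact derivative of f at x, no symbolic math and no finite differences."""
    return f(Dual(x, 1.0)).dot

# d/dx (3x^2 + 2x) at x = 4 is 6*4 + 2 = 26.
print(derivative(lambda x: 3 * x * x + 2 * x, 4.0))  # 26.0
```

The payoff for deep learning is that gradients of arbitrary user-written models come for free, so researchers only specify the model and the loss.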

Many ready-to-use services are available over the web, also known as ‘cognitive services’, launching the era of ML-as-a-service.

Crowd ML

Figure 3: Competition platforms such as Kaggle allow the user to compete in real-world (or very plausible) scenarios, and to explore strategies from other users.

An initial word of caution... Can you find ML?

Further readings

The following is a selection of reading material related to this lecture:

[1] Domingos, P., 2012. A few useful things to know about machine learning. Communications of the ACM, 55(10), pp. 78-87.

[2] Jordan, M.I. and Mitchell, T.M., 2015. Machine learning: Trends, perspectives, and prospects. Science, 349(6245), pp. 255-260.

[3] LeCun, Y., Bengio, Y. and Hinton, G., 2015. Deep learning. Nature, 521(7553), pp. 436-444.
