how to make smarter programs. a gentle introduction to machine learning by simone scardapane

28
How to make smarter programs. A gentle introduction to Machine Learning Simone Scardapane [email protected]

Upload: codemotion

Post on 05-Jul-2015

194 views

Category:

Technology


0 download

DESCRIPTION

The ability of automatically extracting relevant patterns and information from data is probably the biggest challenge that awaits in the next future. Today, mature technologies and algorithms exists for successful data mining and machine learning, but the somewhat complicated theory behind them has hindered their application for everyday programmers and small companies. In this talk we will introduce in a very gentle way the main concepts and goals of machine learning. We will then conclude by showing a realistic example, and point to dedicated material for the interested audience.

TRANSCRIPT

Page 1: How to make smarter programs. A gentle introduction to Machine Learning by Simone Scardapane

How to make smarter programs.A gentle introduction to Machine Learning

Simone Scardapane

[email protected]

Page 2: How to make smarter programs. A gentle introduction to Machine Learning by Simone Scardapane

Argomenti di oggi

Simone Scardapane

1. Il Machine Learning (ieri ed oggi)

2. Un esempio pratico: spam detection

3. Cenni su Weka

[email protected]

Page 3: How to make smarter programs. A gentle introduction to Machine Learning by Simone Scardapane

Cos’è il Machine Learning?

Simone Scardapane

«Estrazione automatica di conoscenza a partire da un insieme di dati»

[email protected]

Dati «Learning» Conoscenza

Page 4: How to make smarter programs. A gentle introduction to Machine Learning by Simone Scardapane

Chi ci lavora?

Simone Scardapane

65 anni di ricerche da parte di:

• Ingegneri

• Informatici

• Statistici

• Matematici

• Fisici

• Neuroscienziati...

[email protected]

0

500

1000

1500

2000

2500

3000

199

5

199

6

199

7

199

8

199

9

20

00

20

01

20

02

20

03

20

04

20

05

20

06

20

07

20

08

20

09

20

10

20

11

20

12

Citazioni ML

Citazioni Scopus

Page 5: How to make smarter programs. A gentle introduction to Machine Learning by Simone Scardapane

Il Machine Learning oggi

Simone Scardapane [email protected]

Page 6: How to make smarter programs. A gentle introduction to Machine Learning by Simone Scardapane

Simone Scardapane [email protected]

Perché usare il Machine Learning?

(Fonte: IDC)

Page 7: How to make smarter programs. A gentle introduction to Machine Learning by Simone Scardapane

Ma voi cosa potete farci?

Simone Scardapane

Possibili operazioni su una libreria musicale:

[email protected]

1. Classificazione del genere (o del mood)2. Raggruppamento automatico (clustering)3. Tagging4. Ricerca per similitudini (association rule)5. Predizione del prossimo ascolto6. …

Page 8: How to make smarter programs. A gentle introduction to Machine Learning by Simone Scardapane

Il processo di Learning

Simone Scardapane [email protected]

Raccolta dati

Pre-Processamento

Scelta modelloAllenamento

/ testing

Utilizzo

Page 9: How to make smarter programs. A gentle introduction to Machine Learning by Simone Scardapane

Variazioni

Simone Scardapane [email protected]

• Online Learning: l’algoritmo riceve dati in real-time e si adatta di conseguenza.

• Active Learning: durante la fase di learning, è possibile richiedere attivamente nuove informazioni.

• Collaborative/Cooperative Learning…

Page 10: How to make smarter programs. A gentle introduction to Machine Learning by Simone Scardapane

Simone Scardapane [email protected]

Un esempio pratico

• Dati: insieme S di emails taggate come spam / non spam.

• Obiettivo: metodo automatico per individuare spam.

• Problemi:1. Come rappresentare l’email?2. Che modello utilizzare?3. Come allenarlo?

Page 11: How to make smarter programs. A gentle introduction to Machine Learning by Simone Scardapane

Simone Scardapane [email protected]

Passo 1: Pre-processamento

Parola #

Viagra 2

Bambino 5

Macchina 0

Stereo 0

Cane 1

Spam?

«Bag of words»

Email

Page 12: How to make smarter programs. A gentle introduction to Machine Learning by Simone Scardapane

Simone Scardapane [email protected]

Passo 2: La scelta del modello

Decision Tree:

Viagra

Pallone Spam

SpamSpam

No Sì

>2≤2

Page 13: How to make smarter programs. A gentle introduction to Machine Learning by Simone Scardapane

Simone Scardapane [email protected]

Passo 3: Allenamento

Come costruirlo?

Viagra

Spam

No Sì

???

Page 14: How to make smarter programs. A gentle introduction to Machine Learning by Simone Scardapane

Simone Scardapane [email protected]

Allenamento /2

Consideriamo l’algoritmo C4.5:

1. Scegliamo per il nodo l’attributo a che «divide» meglio i dati.

2. Suddividiamo l’insieme lungo i nodi.3. Ci fermiamo quando i dati sono

perfettamente divisi.

(Difficoltà: gestire dati continui, mancanti…)

Page 15: How to make smarter programs. A gentle introduction to Machine Learning by Simone Scardapane

Simone Scardapane [email protected]

Allenamento /3

Page 16: How to make smarter programs. A gentle introduction to Machine Learning by Simone Scardapane

Simone Scardapane [email protected]

Overfitting

Problema principale: overfitting!

(Immagine con Copyright Tomaso Poggio)

Page 17: How to make smarter programs. A gentle introduction to Machine Learning by Simone Scardapane

Simone Scardapane [email protected]

Pruning

Possibile soluzione (per i decision trees):

• Si tiene da parte un insieme di dati.• Si eliminano i rami non necessari (pruning)

in base a quei dati (error-based pruning).

Più generalmente si usano tecniche di cross-validation.

Page 18: How to make smarter programs. A gentle introduction to Machine Learning by Simone Scardapane

Simone Scardapane [email protected]

Testare l’algoritmo

Possiamo tenere da parte un secondo insieme per testare l’accuratezza dell’algoritmo.

Dividiamo quindi i nostri dati in tre parti:

Training Validation Testing

Page 19: How to make smarter programs. A gentle introduction to Machine Learning by Simone Scardapane

Simone Scardapane [email protected]

Weka

Tool di data mining sviluppato dalla Waikato University in Java:

1. Ampio numero di funzioni.

2. Estremamente portabile.

3. Interfaccia di facile utilizzo.

Page 20: How to make smarter programs. A gentle introduction to Machine Learning by Simone Scardapane

Simone Scardapane [email protected]

SpamBase

Usiamo il dataset SpamBase dal repository UCI:http://archive.ics.uci.edu/ml/datasets/Spambase

4601 email rappresentate da 48 frequenze di parole (più qualche informazione aggiuntiva).

I dati sono salvati in formato ARFF (file di testo):

1. Header con descrizione degli attributi.2. Elenco delle email.

Page 21: How to make smarter programs. A gentle introduction to Machine Learning by Simone Scardapane

Simone Scardapane [email protected]

Interfaccia di Weka

Page 22: How to make smarter programs. A gentle introduction to Machine Learning by Simone Scardapane

Simone Scardapane [email protected]

Apertura file

Apriamo il file:

Page 23: How to make smarter programs. A gentle introduction to Machine Learning by Simone Scardapane

Simone Scardapane [email protected]

Scelta del classificatore

Scegliamo il modello:

Page 24: How to make smarter programs. A gentle introduction to Machine Learning by Simone Scardapane

Simone Scardapane [email protected]

Training

Risultati:

Page 25: How to make smarter programs. A gentle introduction to Machine Learning by Simone Scardapane

Simone Scardapane [email protected]

Albero finale

Page 26: How to make smarter programs. A gentle introduction to Machine Learning by Simone Scardapane

Simone Scardapane [email protected]

API di Weka

http://weka.wikispaces.com/Use+WEKA+in+your+Java+code

Page 27: How to make smarter programs. A gentle introduction to Machine Learning by Simone Scardapane

Simone Scardapane [email protected]

Letture consigliate

Programming Collective Intelligence, di Toby Segaran. Publisher: O'Reilly Media (2007).

Data Mining: Practical Machine Learning Tools and Techniques, di Witten, Frank et Hall. Publisher: Morgan Kaufmann (2011).

Introduction to Machine Learning, di Alpaydin. Publisher: the MIT Press (2009).

Page 28: How to make smarter programs. A gentle introduction to Machine Learning by Simone Scardapane

Simone Scardapane [email protected]

Fine!

Grazie per l’attenzione!