wapid and wobust active online machine leawning with vowpal wabbit

Wapid and wobust active online machine leawning with Vowpal Wabbit. PyCon Finland 2014, Helsinki, 2014-10-27. Antti Haapala [email protected]

Upload: antti-haapala

Post on 01-Jul-2015


Page 1: Wapid and wobust active online machine leawning with Vowpal Wabbit

Wapid and wobust active online machine leawning with Vowpal Wabbit

PyCon Finland 2014, Helsinki

2014-10-27

Antti Haapala

[email protected]

Page 2: Wapid and wobust active online machine leawning with Vowpal Wabbit

Disclaimer

● IANAS – I Am Not A Statistician

● I researched the principles on how this works for this presentation

Page 3: Wapid and wobust active online machine leawning with Vowpal Wabbit

Why did I start to do ML?

● Task

– Receive social media content from various sources

– Filter out all messages that are not in English, are press releases or outright spam.

● Easy, when you can hire a team of people for just this task...

● But people are expensive compared to computers...

– And filtering messages is tedious work

● Clearly a little machine learning could help us to separate the spam from sausage, eggs and ham.

Page 4: Wapid and wobust active online machine leawning with Vowpal Wabbit

Time to code

● Write a binary classifier

● But with what?

– How does one even do it?

Page 5: Wapid and wobust active online machine leawning with Vowpal Wabbit

Libraries: NLTK

● NLTK

– Has some pure Python classifier implementations

– These algorithms require all data in memory

– The speed is an issue here

  ● Some of them are too slow

  ● The rest are even slower

Page 6: Wapid and wobust active online machine leawning with Vowpal Wabbit

Libraries: Scikit-Learn

● Scikit-Learn

– Better than NLTK

– Though most algorithms require all data in memory

  ● And our data still does not fit

– There are some out-of-core algorithms yes, but they're not clearly documented

– Still slow - we cannot afford to reevaluate our classifiers for hours...

Page 7: Wapid and wobust active online machine leawning with Vowpal Wabbit

Possible libraries

● How about FANN, Orange, PyMC, PyML, LIBSVM, PyBrain, ffnet, MDP, Shogun toolbox, Theano, mlpy, Elefant, Bayes Blocks, Monte Python, hcluster, Plearn, Pycplex, pymorph....

Page 8: Wapid and wobust active online machine leawning with Vowpal Wabbit

????

Page 9: Wapid and wobust active online machine leawning with Vowpal Wabbit

Asking does not hurt

“Have you tried using Vowpal Wabbit?”

“Vowwhat?”

“Vowpal Wabbit”

Page 10: Wapid and wobust active online machine leawning with Vowpal Wabbit

What is Vowpal Wabbit?

A research project with the most Pythonic name ever

Page 11: Wapid and wobust active online machine leawning with Vowpal Wabbit

The name

Page 12: Wapid and wobust active online machine leawning with Vowpal Wabbit

What is Vowpal Wabbit?

• John Langford: I'd like to solve AI.

• Interviewer: How?

• John: I want to use parallel learning algorithms to create fantastic learning machines!

Page 13: Wapid and wobust active online machine leawning with Vowpal Wabbit

What is Vowpal Wabbit?

“VW is the essence of speed in machine learning, able to learn from terafeature datasets with ease.”

Page 14: Wapid and wobust active online machine leawning with Vowpal Wabbit

What is Vowpal Wabbit?

“Via parallel learning, it can exceed the throughput of any single machine network interface when doing linear learning, a first amongst learning algorithms.”

Page 15: Wapid and wobust active online machine leawning with Vowpal Wabbit

Built for speed and scalability

● “Plausibly the most scalable public linear learner, and plausibly the most scalable anywhere”

● Excels when distributed over a network, though its performance is impressive even on a single node.

Page 16: Wapid and wobust active online machine leawning with Vowpal Wabbit

Vowpal Wabbit compared to scikit-learn

The algorithms where the cheatsheet says “> 100k samples”

Page 17: Wapid and wobust active online machine leawning with Vowpal Wabbit

Scalability

● Find a good linear predictor– For 2,100,000,000,000 features...

– 17,000,000,000 examples...

– 16,000,000 parameters...

– Using 1,000 nodes...

● Finished in 70 minutes, at 500M features per second

● That was years ago, using the then stock build of VW.

$f_w(x) = \sum_i w_i x_i$

Page 18: Wapid and wobust active online machine leawning with Vowpal Wabbit

Open Source

● Vowpal Wabbit is open source, under BSD license

● Exists even in the Ubuntu universe repository

● The project was started at Yahoo Research and is currently under Microsoft Research.

– So even Windows will be supported...

Page 19: Wapid and wobust active online machine leawning with Vowpal Wabbit

Sparse Stochastic Gradient Descent

● Maps all inputs to n-dimensional space

● And divides the space by one hyperplane, minimizing the loss caused by wrong classification

– One class is on one side of the plane

– The other is on the other side of the plane

– The loss is modeled by a loss function
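A minimal sketch of the idea, assuming plain SGD on the logistic loss over sparse feature dictionaries. The function names, the learning rate and the toy examples are made up for illustration, and Vowpal Wabbit's real update rule is more elaborate than this:

```python
import math

def predict(weights, features):
    """Raw linear score: sum of w_i * x_i over the non-zero features only."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())

def sgd_update(weights, features, label, learning_rate=0.5):
    """One SGD step on the logistic loss log(1 + exp(-y * score)); label is -1 or +1."""
    score = predict(weights, features)
    # d(loss)/d(score) = -y / (1 + exp(y * score))
    gradient_scale = -label / (1.0 + math.exp(label * score))
    for name, value in features.items():
        weights[name] = weights.get(name, 0.0) - learning_rate * gradient_scale * value

# Toy data (hypothetical): -1 is spam, +1 is ham.
weights = {}
examples = [
    ({"nigerian": 1.0, "prince": 1.0, "money": 1.0}, -1),
    ({"job": 1.0, "interview": 1.0, "invite": 1.0}, +1),
]
for _ in range(20):
    for features, label in examples:
        sgd_update(weights, features, label)

print(predict(weights, {"nigerian": 1.0, "prince": 1.0}))  # negative: spam side of the plane
print(predict(weights, {"job": 1.0, "interview": 1.0}))    # positive: ham side of the plane
```

Only the weights of features present in an example are touched, which is what keeps each update cheap on sparse data.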

Page 20: Wapid and wobust active online machine leawning with Vowpal Wabbit

Stochastic Gradient Descent

Image from Scikit-Learn

Page 21: Wapid and wobust active online machine leawning with Vowpal Wabbit

Which loss function for a classifier?

● Crash course in statistics:

– “It helps if you understand the data”

– “But if you don't then try logistic regression”

– Thus go for the logistic loss function

Page 22: Wapid and wobust active online machine leawning with Vowpal Wabbit

Multiclass classifier

● Vowpal Wabbit supports various methods for multiclass classification; read the documentation to see how to use them.

Page 23: Wapid and wobust active online machine leawning with Vowpal Wabbit

Least squares regression

● The gradient descent algorithm can also be used for regression, for example using the “squared” loss function for least squares.

● A regressor predicts a real-valued output for the input, based on the given features

● A classifier gives a class for the input, and possibly the probability of the input belonging to that class

Page 24: Wapid and wobust active online machine leawning with Vowpal Wabbit

Classifier output in logistic regression

● With Vowpal Wabbit, the prediction value given by a classifier with logistic loss is in the range [-50, 50]

● You can map this to a binary probability using the logistic function

Page 25: Wapid and wobust active online machine leawning with Vowpal Wabbit

From prediction to probability

$p = \frac{1}{1 + e^{-x}}$
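As a small sketch, the mapping from a raw prediction to a probability can be written in a few lines of Python. The function name is an assumption chosen for this example:

```python
import math

def to_probability(prediction):
    """Map a raw score x to p = 1 / (1 + e^-x), the probability of the positive class."""
    return 1.0 / (1.0 + math.exp(-prediction))

print(to_probability(0.0))        # the model is undecided
print(to_probability(-0.145824))  # slightly below one half
print(to_probability(50.0))       # upper end of the prediction range
```

A prediction of 0 maps to exactly one half; the further the score is from zero, the closer the probability gets to 0 or 1.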

Page 26: Wapid and wobust active online machine leawning with Vowpal Wabbit

Common practices of machine learning

● Reduce the number of features by hand guessing which features are relevant

● Use non-linear approaches such as the kernel trick

● Map your features to integers

● Leave your computer on at night to build the model from your training data

Page 27: Wapid and wobust active online machine leawning with Vowpal Wabbit

... become don'ts

● Reduce the number of features by hand guessing which features are relevant

● Use non-linear approaches such as the kernel trick

● Map your features to integers

● Leave your workstation on at night to build the model from your training data

Page 28: Wapid and wobust active online machine leawning with Vowpal Wabbit

Reduce the number of features

● Vowpal Wabbit can efficiently handle sparse feature sets with millions of features

Page 29: Wapid and wobust active online machine leawning with Vowpal Wabbit

Use non-linear approaches

● A sparse dataset with many dimensions yields results comparable to using fewer features with kernel tricks

● One can ask Vowpal Wabbit to generate new features as the Cartesian product of existing features, using namespaces:

– That is, given features u^a, u^b, v^c, and v^d, by using the command line parameter -q uv, VW can make u^a^v^c and so forth.
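To illustrate what the Cartesian product of two namespaces looks like, here is a rough Python sketch. It builds the crossed feature names as strings purely for readability; the helper name is hypothetical, and VW itself combines feature hashes internally rather than building strings like this:

```python
from itertools import product

def quadratic_features(namespace_a, namespace_b):
    """Cross every feature of one namespace with every feature of the other."""
    return {
        f"{name_a}^{name_b}": value_a * value_b
        for (name_a, value_a), (name_b, value_b)
        in product(namespace_a.items(), namespace_b.items())
    }

# Features written as namespace^name, as on the slide.
u = {"u^a": 1.0, "u^b": 1.0}
v = {"v^c": 1.0, "v^d": 1.0}
crossed = quadratic_features(u, v)
print(sorted(crossed))
```

Two features in each namespace give four crossed features, so the feature count grows multiplicatively with namespace size.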

Page 30: Wapid and wobust active online machine leawning with Vowpal Wabbit

Map your features to integers

● Vowpal Wabbit hashes feature names to integers internally using MurmurHash3

● The downside of hashing is the possibility of collisions when there are too many features

– H(“Nigerian prince”) = H(“job interview”)

● Though it also decreases the possibility of overfitting
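The hashing trick can be sketched as follows. This is an illustration only: `zlib.crc32` stands in for MurmurHash3 because it ships with Python, and the function names are made up for this example. The 18-bit default mirrors VW's default table size (`-b 18`):

```python
import zlib

def feature_index(name, bits=18):
    """Hash a feature name into one of 2**bits weight slots."""
    # zlib.crc32 is a stand-in here; VW itself uses MurmurHash3.
    return zlib.crc32(name.encode("utf-8")) & ((1 << bits) - 1)

def hashed_features(tokens, bits=18):
    """Turn tokens into a sparse {index: count} vector; colliding tokens share a slot."""
    vector = {}
    for token in tokens:
        index = feature_index(token, bits)
        vector[index] = vector.get(index, 0.0) + 1.0
    return vector

print(hashed_features(["nigerian", "prince", "job", "interview"]))
# With only 4 bits there are 16 slots, so 100 distinct tokens must collide:
print(len(hashed_features([f"word{i}" for i in range(100)], bits=4)))
```

No dictionary from feature name to index is ever stored, so memory use is fixed by the number of bits rather than by the vocabulary size.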

Page 31: Wapid and wobust active online machine leawning with Vowpal Wabbit

Fit the model at night

● Vowpal Wabbit supports online and active learning.

● Most learning tasks are IO-bound, not CPU-bound

● That is, your feature extraction code will be the bottleneck.

Page 32: Wapid and wobust active online machine leawning with Vowpal Wabbit

Supervised Learning

● Training (diagram): Input, Label → Feature extractor → Features → Machine learning algorithm

● Prediction (diagram): Input → Feature extractor → Features → Model → Label

Page 33: Wapid and wobust active online machine leawning with Vowpal Wabbit

Offline vs Online learning

● In offline learning the model is fed all the input, after which it is finalized; the finalized model will be used for predictions

– That is, teach the classifier all kinds of unwanted messages before actual use, and use the resulting classifier for 10 years.

→ Certainly not going to work.

● In online learning, the model can be used for predictions right after the first input

– The model will gradually converge towards better classification
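The online setting can be sketched as a predict-then-learn loop: every example is first classified with the current model and only then used for training. The class below is a toy learner written for this illustration (its name, learning rate and data stream are assumptions), not Vowpal Wabbit's implementation:

```python
import math

class OnlineLogisticLearner:
    """Toy online learner over binary (present or absent) features."""

    def __init__(self, learning_rate=0.5):
        self.weights = {}
        self.learning_rate = learning_rate

    def predict(self, features):
        return sum(self.weights.get(f, 0.0) for f in features)

    def learn(self, features, label):
        # One gradient step on the logistic loss; label is -1 or +1.
        score = self.predict(features)
        step = self.learning_rate * label / (1.0 + math.exp(label * score))
        for f in features:
            self.weights[f] = self.weights.get(f, 0.0) + step

# Hypothetical message stream: -1 is spam, +1 is ham.
stream = [
    (["nigerian", "prince"], -1),
    (["job", "interview"], +1),
    (["nigerian", "money"], -1),
    (["job", "invite"], +1),
] * 5

learner = OnlineLogisticLearner()
mistakes = 0
for features, label in stream:
    guess = 1 if learner.predict(features) > 0 else -1  # predict first...
    mistakes += guess != label
    learner.learn(features, label)                      # ...then learn from the true label
print(f"{mistakes} mistakes on {len(stream)} examples")
```

Because every prediction is made before the label is seen, the running mistake count doubles as an honest estimate of how well the model is doing, with no separate test set.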

Page 34: Wapid and wobust active online machine leawning with Vowpal Wabbit

Semisupervised learning – active learning

● Asking a human to label input for the classifier is expensive

– If one asks for a label for every given example, it is almost worse than not asking at all

● The solution is active learning

Page 35: Wapid and wobust active online machine leawning with Vowpal Wabbit

Active learning

● Train only if importance >= threshold

● Training (diagram): Input, Label → Feature extractor → Features → Machine learning algorithm

● Prediction (diagram): Input → Feature extractor → Features → Model → Label, Importance
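The thresholding step could look roughly like the sketch below. The importance measure used here (1.0 when the model is undecided, close to 0.0 when it is confident) is an assumption made for this example and is not the importance value that Vowpal Wabbit computes; the scores are invented as well:

```python
import math

def importance(score):
    """Hypothetical importance: 1.0 when the model is undecided, near 0.0 when it is sure."""
    probability = 1.0 / (1.0 + math.exp(-score))
    return 1.0 - abs(2.0 * probability - 1.0)

def select_for_labeling(scored_examples, threshold=0.5):
    """Keep only the examples worth sending to a human labeler."""
    return [text for text, score in scored_examples if importance(score) >= threshold]

scored = [
    ("nigerian prince offers money", -4.0),  # confidently spam: no need to ask
    ("prince interview", -0.1),              # undecided: ask for a label
    ("invite job interview", 3.5),           # confidently ham: no need to ask
]
print(select_for_labeling(scored))
```

Only the undecided message is sent for labeling, so human effort goes where the model has the most to learn.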

Page 36: Wapid and wobust active online machine leawning with Vowpal Wabbit

How to use Vowpal Wabbit

● You can use it on the command line. To train a model using logistic regression:

% cat train.txt
-1 |t nigerian prince offers money ... |a [email protected]
1 |t invite job interview ... |a [email protected]
...
% vw -d train.txt --loss_function=logistic -f model.vw

● To test

% vw -i model.vw --loss_function=logistic -p /dev/stdout
|t nigerian prince interview
-0.145824
|t spam ham and eggs |a [email protected]
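Training files in this format are easy to generate from Python. The helper below is a sketch with a hypothetical name; it writes one `label |namespace tokens` line per example and replaces the two characters that are special in the VW text format. The sender token is a placeholder:

```python
def to_vw_line(label, namespaces):
    """Format one example as '<label> |ns tokens... |ns tokens...'."""
    parts = [str(label)]
    for namespace, tokens in namespaces.items():
        # '|' and ':' are special in the VW text format, so replace them in tokens.
        cleaned = [t.replace("|", "_").replace(":", "_") for t in tokens]
        parts.append("|" + namespace + " " + " ".join(cleaned))
    return " ".join(parts)

line = to_vw_line(
    -1,
    {"t": ["nigerian", "prince", "offers", "money"], "a": ["sender_example"]},
)
print(line)
```

Each line produced this way can be appended to a file such as train.txt and passed to vw with -d.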

Page 37: Wapid and wobust active online machine leawning with Vowpal Wabbit

How to use VW in Python

● Multiple libraries exist

– Though none of the APIs are to my liking

– So I wrote my own

from caerbannog import Rabbit

Page 38: Wapid and wobust active online machine leawning with Vowpal Wabbit

Examples in Python

Page 39: Wapid and wobust active online machine leawning with Vowpal Wabbit

Thanks

Questions?