OpenML for Algorithm Selection and Configuration

OpenML: Networked Machine Learning for Algorithm Selection and Optimisation. Joaquin Vanschoren, TU Eindhoven, 2015

Upload: joaquin-vanschoren

Post on 13-Feb-2017

Page 1: OpenML for Algorithm Selection and Configuration

OpenML

NETWORKED MACHINE LEARNING FOR ALGORITHM SELECTION AND OPTIMISATION

JOAQUIN VANSCHOREN, TU EINDHOVEN, 2015

Page 2: OpenML for Algorithm Selection and Configuration

Meta-learning:

• Learn from experience how to select + optimize learning algorithms + workflows

Requires:

• Large amounts of real datasets

• Wide range of state-of-the-art algorithms

• Huge amounts of experimentation: explore how algorithms/params behave on many kinds of data

Page 3: OpenML for Algorithm Selection and Configuration

Millions of real, open datasets are generated
• Drug activity, gene expressions, astronomical observations, text, …

Extensive toolboxes exist to analyse data
• MLR, SKLearn, RapidMiner, KNIME, WEKA, AmazonML, AzureML, …

Massive numbers of experiments are run, but most of this information is lost forever (in people’s heads)
• No learning across people, labs, fields
• Every study starts from scratch, which slows research and innovation

Page 4: OpenML for Algorithm Selection and Configuration

Let’s connect machine learning environments to a network, so that we can organize results and learn from experience

Connect minds, collaborate globally in real time

Train algorithms to automate data science

Page 5: OpenML for Algorithm Selection and Configuration

FRICTIONLESS, NETWORKED MACHINE LEARNING

Easy to use: Integrated in ML environments. Automated, reproducible sharing
Organized data: Experiments connected to data, code, people anywhere
Easy to contribute: Post single dataset, algorithm, experiment, comment
Reward structure*: Build reputation and trust (e-citations, social interaction)

OpenML

Page 6: OpenML for Algorithm Selection and Configuration

Data (ARFF) uploaded or referenced, versioned
analysed, characterized, organised online

Page 8: OpenML for Algorithm Selection and Configuration

Tasks contain data, goals, procedures. Readable by tools, automates experimentation

Train-test splits

Evaluation measure

Results organized online: realtime overview

Page 10: OpenML for Algorithm Selection and Configuration

Flows (code) from various tools, versioned.
Integrations + APIs (REST, Java, R, Python, …)

Page 12: OpenML for Algorithm Selection and Configuration

Experiments auto-uploaded, evaluated online
reproducible, linked to data, flows and authors

Page 14: OpenML for Algorithm Selection and Configuration

Explore
Reuse
Share

Page 15: OpenML for Algorithm Selection and Configuration
Page 16: OpenML for Algorithm Selection and Configuration

• Search by keywords or properties

• Filters

• Tagging

Data

Page 18: OpenML for Algorithm Selection and Configuration

• Wiki-like descriptions

• Analysis and visualisation of features

• Auto-calculation of large range of meta-features

Data

Page 19: OpenML for Algorithm Selection and Configuration

• Simple: Number/perc/log of instances, classes, features, dimensionality, NAs, …

• Statistical: Default acc, Min/max/mean distinct values, skewness, kurtosis, stdev, mutual information, …

• Information-theoretic: Entropy (class, num/nom feats), signal-noise ratio,…

• Landmarkers (AUC, ErrRate, Kappa): Decision stumps, Naive Bayes, kNN, RandomTree, J48(pruned), JRip, NBTree, lin.SVM, …

• Subsampling landmarkers (partial learning curves)

• Streaming landmarkers: Evaluations in previous window, previous best, per-algorithm significant win/loss

• Change detectors: HoeffdingAdwin, HoeffdingDDM

• Domain specific meta-features (e.g. molecular descriptors)

• Many not included yet

Meta-features (120+)
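
To make the first three groups concrete, here is a minimal sketch of how a few simple, statistical and information-theoretic meta-features could be computed. It is an illustration in plain Python, not the code OpenML runs; the function names and the dictionary keys are hypothetical, and the input is assumed to be a non-empty table with `None` marking missing values.

```python
import math
from collections import Counter

def simple_meta_features(X, y):
    """Compute a handful of dataset meta-features.

    X: non-empty list of rows (lists of numbers, None marks a missing value).
    y: list of class labels, one per row.
    """
    n_instances = len(X)
    n_features = len(X[0])
    class_counts = Counter(y)
    n_missing = sum(v is None for row in X for v in row)

    # Information-theoretic: Shannon entropy of the class distribution (bits).
    class_entropy = -sum(
        (c / n_instances) * math.log2(c / n_instances)
        for c in class_counts.values()
    )

    return {
        "n_instances": n_instances,
        "log_n_instances": math.log(n_instances),
        "n_features": n_features,
        "n_classes": len(class_counts),
        "dimensionality": n_features / n_instances,
        "perc_missing": 100.0 * n_missing / (n_instances * n_features),
        # Statistical: accuracy of always predicting the majority class.
        "default_accuracy": max(class_counts.values()) / n_instances,
        "class_entropy": class_entropy,
    }

def skewness(values):
    """Population skewness of one numeric feature (0 for symmetric data)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5 if m2 > 0 else 0.0
```

Landmarkers follow the same pattern, except that each value is the score of a cheap learner (for example a decision stump) instead of a closed-form statistic.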

Page 20: OpenML for Algorithm Selection and Configuration

• Example: Classification on click prediction dataset, using 10-fold CV and AUC

• People submit results (e.g. predictions)

• Server-side evaluation (many measures)

• All results organized online, per algorithm, parameter setting

• Online visualizations: every dot is a run plotted by score

Tasks
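
As a rough illustration of what server-side evaluation of such a task involves, the sketch below scores submitted predictions with AUC per cross-validation fold and averages the result. It is a simplified stand-in, not OpenML's evaluation engine: the function names are hypothetical, the folds are unstratified, and folds that contain a single class are simply skipped.

```python
import random

def k_fold_indices(n, k=10, seed=0):
    """Split indices 0..n-1 into k disjoint test folds (shuffled, fixed seed)."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def auc(labels, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) formulation.

    labels: 0/1 ground truth; scores: predicted confidence for class 1.
    Tied scores receive their average rank.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        return None  # AUC is undefined when only one class is present
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for t in range(i, j + 1):
            ranks[order[t]] = avg_rank
        i = j + 1
    rank_sum_pos = sum(r for r, l in zip(ranks, labels) if l == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def evaluate_submission(labels, scores, folds):
    """Score uploaded predictions per fold, then average the valid folds."""
    per_fold = [
        auc([labels[i] for i in fold], [scores[i] for i in fold])
        for fold in folds
    ]
    valid = [a for a in per_fold if a is not None]
    return sum(valid) / len(valid) if valid else None
```

Because the task fixes the folds and the measure, every submission is scored on exactly the same splits, which is what makes results from different people comparable.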

Page 21: OpenML for Algorithm Selection and Configuration

• Leaderboards visualize progress over time: who delivered breakthroughs when, who built on top of previous solutions

• Collaborative: all code and data available, learn from others

• Real-time: clear who submitted first, others can improve immediately

Page 22: OpenML for Algorithm Selection and Configuration

Classroom challenges

Page 23: OpenML for Algorithm Selection and Configuration

• All results obtained with same flow organised online

• Results linked to data sets, parameter settings -> trends/comparisons

• Visualisations (dots are models, ranked by score, colored by parameters)

Page 24: OpenML for Algorithm Selection and Configuration

• Detailed run info

• Author, data, flow, parameter settings, result files, …

• Evaluation details (e.g., results per sample)

Page 25: OpenML for Algorithm Selection and Configuration

REST API (new)
Requires OpenML account
Predictive URLs
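
"Predictive URLs" means that the address of a resource can be derived from its type and numeric id. The sketch below shows the idea. The base URL and the `api_key` query parameter are assumptions based on the current public OpenML API and may differ from the 2015 version shown in the talk; `entity_url` is a hypothetical helper, and no request is actually sent.

```python
from urllib.parse import urlencode

# Assumption: base URL of the current public OpenML REST API (JSON output).
BASE_URL = "https://www.openml.org/api/v1/json"

def entity_url(kind, entity_id, api_key=None):
    """Build the URL of one dataset ("data"), task, flow or run from its id.

    api_key is only appended when given; the key itself comes from the
    user's OpenML account.
    """
    if kind not in {"data", "task", "flow", "run"}:
        raise ValueError(f"unknown entity type: {kind!r}")
    url = f"{BASE_URL}/{kind}/{int(entity_id)}"
    if api_key:
        url += "?" + urlencode({"api_key": api_key})
    return url
```

For example, `entity_url("data", 61)` yields `https://www.openml.org/api/v1/json/data/61`.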

Page 26: OpenML for Algorithm Selection and Configuration

R API (idem for Java, Python)
Tutorial: http://www.openml.org/guide

List datasets
List tasks
List flows
List runs and results (soon)

Page 27: OpenML for Algorithm Selection and Configuration

Share

Explore

Reuse

Page 28: OpenML for Algorithm Selection and Configuration

Website
1) Search, 2) Table view, 3) CSV export*

Page 29: OpenML for Algorithm Selection and Configuration

R API (idem for Java, Python)
Tutorial: http://www.openml.org/guide

Download datasets
Download tasks
Download flows

Page 30: OpenML for Algorithm Selection and Configuration

R API

Download run

Download run with predictions

Idem for Java, Python
Tutorial: http://www.openml.org/guide

Page 31: OpenML for Algorithm Selection and Configuration

Reuse

Explore

Share

Page 32: OpenML for Algorithm Selection and Configuration

WEKA
OpenML extension in plugin manager

Page 33: OpenML for Algorithm Selection and Configuration

MOA

Page 34: OpenML for Algorithm Selection and Configuration

RapidMiner: 3 new operators

Page 36: OpenML for Algorithm Selection and Configuration

R API

Run a task (with mlr):

And upload:

Page 37: OpenML for Algorithm Selection and Configuration

Online collaboration (soon)

Studies (e-papers)
- Online counterpart of a paper, linkable
- Add data, code, experiments (new or old)
- Public or shared within circle

Circles
- Create collaborations with trusted researchers
- Share results within team prior to publication

Altmetrics
- Measure real impact of your work
- Reuse, downloads, likes of data, code, projects, …
- Online reputation (more sharing)

Page 38: OpenML for Algorithm Selection and Configuration

OpenML Community

Jan-Jun 2015

Used all over the world
400 7-day, 1700 30-day active users, growing
1000s of datasets, flows, 450000+ runs

Page 39: OpenML for Algorithm Selection and Configuration

Opportunities for automated algorithm selection + configuration
• Many datasets, flows, experiments: much larger meta-learning studies than possible before
• More meta-features, better meta-models
• Meta-data to speed up algorithm configuration, learn over time
• APIs: Algorithm selection and configuration can be built on top of OpenML, reusing and sharing data

Page 40: OpenML for Algorithm Selection and Configuration

OpenML in drug discovery

ChEMBL database

SMILES
Molecular properties (e.g. MW, LogP)
Fingerprints (e.g. FP2, FP3, FP4, MACCS)

MW       LogP    TPSA    b1  b2  b3  b4  b5  b6  b7  b8  b9  …
377.435   3.883   77.85  1   1   0   0   0   0   0   0   0
341.361   3.411   74.73  1   1   0   1   0   0   0   0   0
197.188  -2.089  103.78  1   1   0   1   0   0   0   1   0
346.813   4.705   50.70  1   0   0   1   0   0   0   0   0
…

10,000+ regression datasets

1.4M compounds, 10k proteins, 12.8M activities

Predict which compounds (drugs) will inhibit certain proteins (and hence viruses, parasites,…)

2750 targets have >10 compounds, x 4 fingerprints

Page 41: OpenML for Algorithm Selection and Configuration

OpenML in drug discovery

Predict best algorithm with meta-models

[Figure: algorithms compared: cforest, ctree, earth, fnn, gbm, glmnet, lasso, lm, nnet, pcr, plsr, rforest, ridge, rtree]

[Figure: fingerprints with their plotted values: ECFP4_1024 (0.697), FCFP4_1024 (0.427), ECFP6_1024 (0.725), FCFP6_1024 (0.627)]

[Figure: meta-model decision tree. Split nodes: msdfeat < 0.1729, mutualinfo < 0.7182, skewresp < 0.3315, nentropyfeat < 1.926, netcharge1 >= 0.9999, netcharge3 < 13.85, hydphob21 < −0.09666. Leaves: glmnet, fnn, pcr, rforest, ridge, rforest, rforest, ridge]

MW       LogP    TPSA    b1  b2  b3  b4  b5  b6  b7  b8  b9  …
377.435   3.883   77.85  1   1   0   0   0   0   0   0   0
341.361   3.411   74.73  1   1   0   1   0   0   0   0   0
197.188  -2.089  103.78  1   1   0   1   0   0   0   1   0
346.813   4.705   50.70  1   0   0   1   0   0   0   0   0
…

Metafeatures:
- simple, statistical, info-theoretic, landmarkers
- target: aliphatic index, hydrophobicity, net charge, mol. weight, sequence length, …
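
The general idea of a meta-model can be sketched as follows: each prior dataset becomes one training example whose inputs are its meta-features and whose label is the algorithm that performed best on it. The slide shows a decision tree; the sketch below swaps in a k-nearest-neighbour vote because it fits in a few lines. The function name and the record layout are hypothetical, and any values passed in are toy inputs, not results from the study.

```python
import math

def predict_best_algorithm(meta_dataset, new_meta_features, k=3):
    """k-nearest-neighbour meta-model.

    meta_dataset: list of {"meta": {name: value}, "best": algorithm_name}.
    new_meta_features: {name: value} for the dataset to advise on.
    Returns the algorithm that was best on most of the k closest datasets.
    """
    def distance(a, b):
        # Euclidean distance over the meta-features of the new dataset.
        return math.sqrt(sum((a[name] - b[name]) ** 2 for name in b))

    neighbours = sorted(
        meta_dataset, key=lambda rec: distance(rec["meta"], new_meta_features)
    )[:k]
    votes = {}
    for rec in neighbours:
        votes[rec["best"]] = votes.get(rec["best"], 0) + 1
    # Highest vote count wins; on a tie, the algorithm seen first among
    # the nearest neighbours is returned.
    return max(votes, key=votes.get)
```

In practice the meta-features would be normalised first, since a raw Euclidean distance is dominated by whichever meta-feature has the largest scale.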

Page 42: OpenML for Algorithm Selection and Configuration

I. Olier et al. MetaSel@ECMLPKDD 2015

Best algorithm overall (10f CV)

Page 43: OpenML for Algorithm Selection and Configuration

I. Olier et al. MetaSel@ECMLPKDD 2015

Random Forest meta-classifier (10f CV)

Page 44: OpenML for Algorithm Selection and Configuration

I. Olier et al. MetaSel@ECMLPKDD 2015

Random Forest stacker (10f CV)

Page 45: OpenML for Algorithm Selection and Configuration

Frugal learning
Which algorithms are fast (and low-memory) but accurate?

In Python Notebook:
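
The notebook itself is not part of this transcript. As an illustrative stand-in (not the author's code), the question "fast but accurate" can be framed as finding the Pareto front over runtime and accuracy: the algorithms that no other algorithm beats on both criteria at once. The function name and the input layout are hypothetical; memory use could be added as a third objective in the same way.

```python
def pareto_front(results):
    """Return the algorithms that no other algorithm dominates.

    results: dict name -> (runtime_seconds, accuracy).
    Algorithm B dominates A when B is at least as fast and at least as
    accurate as A, and strictly better on one of the two.
    """
    front = []
    for name, (seconds, accuracy) in results.items():
        dominated = any(
            o_sec <= seconds and o_acc >= accuracy
            and (o_sec < seconds or o_acc > accuracy)
            for other, (o_sec, o_acc) in results.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return sorted(front)
```

With run results downloaded from OpenML, `results` would hold one aggregated (runtime, accuracy) pair per flow.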

Page 48: OpenML for Algorithm Selection and Configuration

Meta-learning on streams
Stream data in OpenML: ‘best’ algorithm changes over time
Concept drift

• Use meta-learning to select the best models at each point in time

Page 49: OpenML for Algorithm Selection and Configuration

Meta-learning on streams
Stream data in OpenML: ‘best’ algorithm changes over time

Evaluation

• Base-level:

• Interleaved test-then-train (prequential evaluation)

• Accuracy on data streams using predicted method [default, best]

• Meta-level:

• Leave one stream out

• Accuracy of predicting best model [0,1]
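
Prequential evaluation uses every stream example twice: first to test the current model, then to update it. A minimal sketch, assuming a model object with hypothetical `predict` and `learn` methods and using a trivial majority-class learner as the example model:

```python
from collections import Counter

class MajorityClass:
    """Trivial incremental learner: predicts the most frequent label so far."""

    def __init__(self):
        self.counts = Counter()

    def predict(self, x):
        return self.counts.most_common(1)[0][0] if self.counts else None

    def learn(self, x, y):
        self.counts[y] += 1

def prequential_accuracy(model, stream):
    """Test on each example before training on it, and report accuracy."""
    correct = 0
    total = 0
    for x, y in stream:
        correct += model.predict(x) == y  # test first ...
        total += 1
        model.learn(x, y)                 # ... then train
    return correct / total if total else 0.0
```

Because no example is seen before it is predicted, the resulting accuracy needs no separate hold-out set.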

Page 50: OpenML for Algorithm Selection and Configuration

Meta-learning on streams

• Streaming ensembles (current work)

• Train multiple models, use meta-learning to weight their votes

• Stacking / Cascading

• BLast (Best-Last), J. van Rijn et al. ICDM 2015

• Simply choose model that performed best in previous window. Equivalent to state-of-the-art (Leveraging Bagging)
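
The Best-Last rule described in the last bullet can be sketched in a few lines. This is a simplified illustration rather than the published implementation: it assumes the predictions of all base models are already available as lists, and the function name and `window` parameter are hypothetical.

```python
def blast_predictions(model_predictions, truths, window=3):
    """Best-Last selection sketch.

    model_predictions: dict name -> list of predictions, one per stream item.
    truths: list of true labels, same length.
    For each item, emit the prediction of the model that was correct most
    often over the previous `window` items.
    """
    names = list(model_predictions)
    chosen, output = [], []
    for t in range(len(truths)):
        start = max(0, t - window)

        def recent_hits(name):
            return sum(
                model_predictions[name][i] == truths[i]
                for i in range(start, t)
            )

        best = max(names, key=recent_hits)  # first-listed model wins ties
        chosen.append(best)
        output.append(model_predictions[best][t])
    return chosen, output
```

After a concept change the selection lags by roughly one window, which is the price of using no meta-features at all.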

Page 51: OpenML for Algorithm Selection and Configuration

Algorithm selection (ranking)

J. van Rijn et al. IDA 2015

Learning curves in OpenML

• Identify k nearest prior datasets by distance between partial curves:

• Build a ranking of algorithms to run, start with overall best a_best
• Draw random algorithm a_competitor
• If a_competitor wins most, add to ranking, repeat

For new dataset: evaluate learning curves up to T (e.g. 256 instances)
Fraction of full CV
No other meta-features
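
The procedure on this slide can be sketched as below. The distance formula from the slide is not in this transcript, so the sketch assumes a plain Euclidean distance between partial curves; the function names, the data layout and the majority rule for "wins most" are likewise illustrative assumptions, not the published method.

```python
import math
import random

def curve_distance(curve_a, curve_b):
    """Euclidean distance between two partial learning curves (assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(curve_a, curve_b)))

def nearest_datasets(new_curves, prior_curves, algo_pair, k):
    """Prior datasets whose partial curves for the two algorithms in
    algo_pair are most similar to those measured on the new dataset."""
    def dist(dataset):
        return sum(
            curve_distance(new_curves[a], prior_curves[dataset][a])
            for a in algo_pair
        )
    return sorted(prior_curves, key=dist)[:k]

def rank_algorithms(new_curves, prior_curves, full_scores, k=2, seed=0):
    """Start from the overall best algorithm, then draw competitors in
    random order; a competitor that wins on most of the k nearest prior
    datasets is appended to the ranking and becomes the current best.

    new_curves:   algo -> partial curve on the new dataset
    prior_curves: dataset -> algo -> partial curve
    full_scores:  dataset -> algo -> score on the full dataset
    """
    rng = random.Random(seed)
    algos = list(new_curves)

    def mean_score(algo):
        return sum(s[algo] for s in full_scores.values()) / len(full_scores)

    best = max(algos, key=mean_score)
    ranking = [best]
    candidates = [a for a in algos if a != best]
    rng.shuffle(candidates)
    for competitor in candidates:
        near = nearest_datasets(new_curves, prior_curves, (best, competitor), k)
        wins = sum(full_scores[d][competitor] > full_scores[d][best] for d in near)
        if wins > len(near) / 2:
            ranking.append(competitor)
            best = competitor
    return ranking
```

Only the short partial curves are computed on the new dataset, so the cost stays at a fraction of a full cross-validation per algorithm.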

Page 52: OpenML for Algorithm Selection and Configuration

Algorithm selection (ranking)

J. van Rijn et al. IDA 2015

[Figure: accuracy loss (y-axis, 0 to 0.1) against time in seconds (x-axis, log scale, 1 to 65536). Curves: Best on Sample, Average Rank, PCC, PCC (A3R, r = 1)]

For 53 classifiers on 39 datasets
Multi-objective function A3R:
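
The A3R formula itself is not in this transcript. As a hedged sketch, A3R is commonly written as the ratio of two algorithms' success rates divided by the r-th root of the ratio of their runtimes; the form below is that assumption, not the formula recovered from the slide, and the parameter names are illustrative.

```python
def a3r(sr_j, sr_ref, t_j, t_ref, r=1):
    """A3R of algorithm j relative to a reference algorithm on one dataset.

    sr_*: success rates (e.g. accuracy), t_*: runtimes in seconds.
    Assumed form: (sr_j / sr_ref) / (t_j / t_ref) ** (1 / r).
    Larger r shrinks the influence of runtime; r = 1 weighs the runtime
    ratio as heavily as the accuracy ratio.
    """
    return (sr_j / sr_ref) / (t_j / t_ref) ** (1.0 / r)
```

A value above 1 means algorithm j is preferred over the reference under this trade-off.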

Page 53: OpenML for Algorithm Selection and Configuration

Towards automating machine learning

Human scientists

meta-data, models, evaluations

Automated processes

Data cleaning
Algorithm Selection
Parameter Optimization
Workflow construction
Post processing

API

Connect your tools and services to OpenML, so they may learn

Page 54: OpenML for Algorithm Selection and Configuration

When everything is connected to everything else, for better or for worse, everything matters.

Bruce Mau

Page 55: OpenML for Algorithm Selection and Configuration

Join OpenML

- Open Source, on GitHub
- Regular workshops, hackathons

Next workshop:
- Lorentz Center (Leiden), 14-18 March 2016

Page 56: OpenML for Algorithm Selection and Configuration

THANK YOU

Joaquin Vanschoren

Jan van Rijn

Bernd Bischl

Matthias Feurer

Michel Lang

Nenad Tomašev

Giuseppe Casalicchio

Luis Torgo

You?

#OpenML

Farzan Majdani

Jakob Bossek