OpenML for algorithm selection and configuration
Post on 13-Feb-2017
OpenML
Networked machine learning for algorithm selection and optimisation
Joaquin Vanschoren, TU Eindhoven, 2015
Meta-learning:
• Learn from experience how to select + optimize learning algorithms + workflows
Requires:
• Large amounts of real datasets
• Wide range of state-of-the-art algorithms
• Huge amounts of experimentation: explore how algorithms/params behave on many kinds of data
Millions of real, open datasets are generated
• Drug activity, gene expressions, astronomical observations, text,…
Extensive toolboxes exist to analyse data
• MLR, SKLearn, RapidMiner, KNIME, WEKA, AmazonML, AzureML,…
Massive amounts of experiments are run, most of this information is lost forever (in people’s heads)
• No learning across people, labs, fields
• Start from scratch, slows research and innovation
Let’s connect machine learning environments to a network, so that we can organize experiments and learn from experience
Connect minds, collaborate globally in real time
Train algorithms to automate data science
Frictionless, networked machine learning
Easy to use: Integrated in ML environments. Automated, reproducible sharing
Organized data: Experiments connected to data, code, people anywhere
Easy to contribute: Post single dataset, algorithm, experiment, comment
Reward structure*: Build reputation and trust (e-citations, social interaction)
OpenML
Data (ARFF): uploaded or referenced, versioned; analysed, characterized, organised online
Tasks contain data, goals, procedures. Readable by tools, automates experimentation
• Train-test splits
• Evaluation measure
Results organized online: realtime overview
Flows (code) from various tools, versioned. Integrations + APIs (REST, Java, R, Python,…)
Experiments auto-uploaded, evaluated online; reproducible, linked to data, flows and authors
Explore, Reuse, Share
• Search by keywords or properties
• Filters
• Tagging
Data
• Wiki-like descriptions
• Analysis and visualisation of features
• Auto-calculation of large range of meta-features
• Simple: Number/perc/log of instances, classes, features, dimensionality, NAs, …
• Statistical: Default acc, Min/max/mean distinct values, skewness, kurtosis, stdev, mutual information, …
• Information-theoretic: Entropy (class, num/nom feats), signal-noise ratio,…
• Landmarkers (AUC, ErrRate, Kappa): Decision stumps, Naive Bayes, kNN, RandomTree, J48(pruned), JRip, NBTree, lin.SVM, …
• Subsampling landmarkers (partial learning curves)
• Streaming landmarkers: Evaluations in previous window, previous best, per-algorithm significant win/loss
• Change detectors: HoeffdingAdwin, HoeffdingDDM
• Domain specific meta-features (e.g. molecular descriptors)
• Many not included yet
Meta-features (120+)
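A few of the simple, statistical, and information-theoretic meta-features listed above can be sketched in plain Python. This is only an illustration, not OpenML's server-side implementation: the function names and the toy dataset are hypothetical, and the skewness is the population (third standardized moment) variant.

```python
import math
from collections import Counter
from statistics import mean, pstdev


def class_entropy(labels):
    """Shannon entropy (in bits) of the class distribution."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def default_accuracy(labels):
    """Accuracy obtained by always predicting the majority class."""
    return max(Counter(labels).values()) / len(labels)


def skewness(values):
    """Population skewness: the third standardized moment."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return 0.0
    return sum(((v - mu) / sigma) ** 3 for v in values) / len(values)


def simple_meta_features(rows, labels):
    """rows: equal-length lists of numeric features; labels: class labels."""
    n_instances, n_features = len(rows), len(rows[0])
    columns = list(zip(*rows))
    return {
        "n_instances": n_instances,
        "n_features": n_features,
        "n_classes": len(set(labels)),
        "dimensionality": n_features / n_instances,
        "default_accuracy": default_accuracy(labels),
        "class_entropy": class_entropy(labels),
        "mean_skewness": mean(skewness(col) for col in columns),
    }


# Hypothetical toy dataset: 4 instances, 2 numeric features, 2 classes.
rows = [[1.0, 10.0], [2.0, 10.0], [3.0, 10.0], [4.0, 40.0]]
labels = ["a", "a", "b", "b"]
meta = simple_meta_features(rows, labels)
```

Landmarkers would extend the same dictionary with the cross-validated scores of cheap learners such as decision stumps.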
• Example: Classification on click prediction dataset, using 10-fold CV and AUC
• People submit results (e.g. predictions)
• Server-side evaluation (many measures)
• All results organized online, per algorithm, parameter setting
• Online visualizations: every dot is a run plotted by score
Tasks
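The task example above (10-fold CV, AUC, evaluation of submitted predictions) can be illustrated with a minimal sketch. It assumes binary labels coded 0/1 and uses the rank-based (Mann-Whitney) formulation of AUC; the helper names are hypothetical and this is not the evaluation code OpenML runs on its server.

```python
import random


def k_fold_indices(n, k=10, seed=0):
    """Shuffle indices 0..n-1 and deal them into k folds of near-equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def auc(y_true, scores):
    """Area under the ROC curve: the fraction of (positive, negative)
    pairs in which the positive gets the higher score; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Fixed splits make results comparable across submitters.
folds = k_fold_indices(25, k=10)

# Hypothetical submitted predictions (scores for the positive class).
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
task_auc = auc(y_true, scores)
```

Because the splits are fixed by the task rather than by the submitter, every run on the task is scored on exactly the same train-test partitions.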
• Leaderboards visualize progress over time: who delivered breakthroughs when, who built on top of previous solutions
• Collaborative: all code and data available, learn from others
• Real-time: clear who submitted first, others can improve immediately
Classroom challenges
• All results obtained with same flow organised online
• Results linked to data sets, parameter settings -> trends/comparisons
• Visualisations (dots are models, ranked by score, colored by parameters)
• Detailed run info
• Author, data, flow, parameter settings, result files, …
• Evaluation details (e.g., results per sample)
REST API (new)
• Requires OpenML account
• Predictive URLs
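To illustrate what predictable resource URLs look like, here is a small sketch that only builds URL strings and makes no network calls. It assumes the path layout of the present-day OpenML REST API (v1, JSON flavour), which may differ from the layout shown in the 2015 slides; the helper names are hypothetical.

```python
# Assumption: base URL and path layout of the current OpenML REST API (v1).
BASE_URL = "https://www.openml.org/api/v1/json"
ENTITY_TYPES = ("data", "task", "flow", "run")


def entity_url(entity, entity_id):
    """URL describing a single dataset, task, flow, or run by numeric id."""
    if entity not in ENTITY_TYPES:
        raise ValueError(f"unknown entity type: {entity}")
    return f"{BASE_URL}/{entity}/{int(entity_id)}"


def list_url(entity):
    """URL listing entities of one type."""
    if entity not in ENTITY_TYPES:
        raise ValueError(f"unknown entity type: {entity}")
    return f"{BASE_URL}/{entity}/list"


dataset_url = entity_url("data", 61)
task_list_url = list_url("task")
```

Requests that modify server state additionally need the API key tied to an OpenML account.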
R API (idem for Java, Python). Tutorial: http://www.openml.org/guide
• List datasets
• List tasks
• List flows
• List runs and results (soon)
Share
Explore
Reuse
Website: 1) Search, 2) Table view, 3) CSV export*
R API (idem for Java, Python)
• Download datasets
• Download tasks
• Download flows
• Download run
• Download run with predictions
WEKA: OpenML extension in plugin manager
MOA
RapidMiner: 3 new operators
R API
Run a task (with mlr):
And upload:
Studies (e-papers)
- Online counterpart of a paper, linkable
- Add data, code, experiments (new or old)
- Public or shared within circle
Circles
- Create collaborations with trusted researchers
- Share results within team prior to publication
Altmetrics
- Measure real impact of your work
- Reuse, downloads, likes of data, code, projects,…
- Online reputation (more sharing)
Online collaboration (soon)
OpenML Community
Jan-Jun 2015
Used all over the world
400 7-day, 1700 30-day active users, growing
1000s of datasets, flows, 450000+ runs
Opportunities for automated algorithm selection + configuration
• Many datasets, flows, experiments: much larger meta-learning studies than possible before
• More meta-features, better meta-models
• Meta-data to speed up algorithm configuration, learn over time
• APIs: Algorithm selection and configuration can be built on top of OpenML, reusing and sharing data
OpenML in drug discovery
ChEMBL database -> SMILES -> molecular properties (e.g. MW, LogP) and fingerprints (e.g. FP2, FP3, FP4, MACCS)

MW       LogP    TPSA    b1  b2  b3  b4  b5  b6  b7  b8  b9  …
377.435   3.883   77.85  1   1   0   0   0   0   0   0   0
341.361   3.411   74.73  1   1   0   1   0   0   0   0   0
197.188  -2.089  103.78  1   1   0   1   0   0   0   1   0
346.813   4.705   50.70  1   0   0   1   0   0   0   0   0
…
10,000+ regression datasets
1.4M compounds, 10k proteins, 12.8M activities
Predict which compounds (drugs) will inhibit certain proteins (and hence viruses, parasites,…)
2750 targets have >10 compounds, x 4 fingerprints
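[Figure: meta-model decision tree that predicts the best algorithm per dataset.
Candidate algorithms: cforest, ctree, earth, fnn, gbm, glmnet, lasso, lm, nnet, pcr, plsr, rforest, ridge, rtree.
Fingerprint columns with the values shown beneath them: ECFP4_1024 0.697, FCFP4_1024 0.427, ECFP6_1024 0.725, FCFP6_1024 0.627.
Tree splits: msdfeat < 0.1729, mutualinfo < 0.7182, skewresp < 0.3315, nentropyfeat < 1.926, netcharge1 >= 0.9999, netcharge3 < 13.85, hydphob21 < −0.09666.
Leaf predictions: glmnet, fnn, pcr, rforest, ridge.]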
Predict best algorithm with meta-models
Metafeatures:
- simple, statistical, info-theoretic, landmarkers
- target: aliphatic index, hydrophobicity, net charge, mol. weight, sequence length, …
I. Olier et al. MetaSel@ECMLPKDD 2015
Best algorithm overall (10f CV)
Random Forest meta-classifier (10f CV)
Random Forest stacker (10f CV)
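The comparison between "best algorithm overall" and a meta-classifier can be sketched as follows. Two loud assumptions: the meta-dataset is entirely hypothetical, and a 1-nearest-neighbour rule stands in for the Random Forest meta-classifier named above, purely to keep the example dependency-free. Both strategies are scored with leave-one-dataset-out evaluation.

```python
import math
from collections import Counter

# Hypothetical meta-dataset: (meta-features, best base algorithm) per dataset.
META_DATA = [
    ((0.10, 0.80), "rforest"),
    ((0.15, 0.75), "rforest"),
    ((0.20, 0.70), "rforest"),
    ((0.25, 0.65), "rforest"),
    ((0.90, 0.20), "ridge"),
    ((0.85, 0.10), "ridge"),
]


def best_overall(train, _query):
    """Baseline: always recommend the algorithm that wins most often."""
    return Counter(label for _, label in train).most_common(1)[0][0]


def nearest_neighbour(train, query):
    """Stand-in meta-model: recommend what worked on the closest dataset."""
    return min(train, key=lambda ex: math.dist(ex[0], query))[1]


def leave_one_out_accuracy(predict):
    """Fraction of datasets whose best algorithm is predicted correctly
    when that dataset is held out of the meta-training set."""
    hits = 0
    for i, (features, label) in enumerate(META_DATA):
        train = META_DATA[:i] + META_DATA[i + 1:]
        hits += predict(train, features) == label
    return hits / len(META_DATA)


baseline_acc = leave_one_out_accuracy(best_overall)
meta_acc = leave_one_out_accuracy(nearest_neighbour)
```

On this toy data the baseline can never recommend ridge, while the meta-model recovers it from the meta-features; real gains depend on how informative the meta-features are.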
Frugal learning
Which algorithms are fast (and low-memory) but accurate?
In Python Notebook:
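One way to answer the frugal-learning question from stored run results is to keep only the algorithms that are not dominated on both runtime and accuracy (a Pareto front). The sketch below is an assumption about how such an analysis could look, not the notebook from the talk, and the result numbers are hypothetical.

```python
def pareto_front(results):
    """results: dict name -> (runtime_seconds, accuracy).

    Returns the names of algorithms for which no other algorithm is at
    least as fast and at least as accurate, and strictly better in one
    of the two, sorted from fastest to slowest."""
    front = []
    for name, (t, acc) in results.items():
        dominated = any(
            t2 <= t and acc2 >= acc and (t2 < t or acc2 > acc)
            for other, (t2, acc2) in results.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return sorted(front, key=lambda n: results[n][0])


# Hypothetical run results: (runtime in seconds, accuracy).
results = {
    "NaiveBayes": (0.5, 0.80),
    "kNN": (4.0, 0.78),
    "RandomForest": (20.0, 0.90),
    "SVM": (60.0, 0.88),
}
frugal = pareto_front(results)
```

Memory use could be added as a third objective by extending the tuples and the dominance check.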
Meta-learning on streams
Stream data in OpenML: ‘best’ algorithm changes over time (concept drift)
• Use meta-learning to select the best models at each point in time
Evaluation
• Base-level:
• Interleaved train-then-test (prequential evaluation)
• Accuracy on data streams using predicted method [default, best]
• Meta-level:
• Leave one stream out
• Accuracy of predicting best model [0,1]
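Interleaved train-then-test (prequential) evaluation, the base-level procedure listed above, can be sketched with a trivial incremental learner. The `MajorityClass` model and the toy stream are hypothetical; any model exposing `predict` and `learn` would fit the same loop.

```python
from collections import Counter


class MajorityClass:
    """Toy incremental learner: predicts the most frequent label seen so far."""

    def __init__(self):
        self.counts = Counter()

    def predict(self, x):
        if not self.counts:
            return None  # nothing seen yet
        return self.counts.most_common(1)[0][0]

    def learn(self, x, y):
        self.counts[y] += 1


def prequential_accuracy(model, stream):
    """Each instance is first used to test the model, then to train it."""
    correct = total = 0
    for x, y in stream:
        correct += model.predict(x) == y  # test first
        model.learn(x, y)                 # then train
        total += 1
    return correct / total if total else 0.0


# Hypothetical stream of (features, label) pairs; features are unused here.
stream = [(None, "a"), (None, "a"), (None, "b"), (None, "a")]
accuracy = prequential_accuracy(MajorityClass(), stream)
```

Every instance is used exactly once for testing and once for training, so no separate hold-out set is needed.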
Meta-learning on streams
• Streaming ensembles (current work)
• Train multiple models, use meta-learning to weight their votes
• Stacking / Cascading
• BLast (Best-Last), J. van Rijn et al. ICDM 2015
• Simply choose model that performed best in previous window. Equivalent to state-of-the-art (Leveraging Bagging)
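The Best-Last idea described above can be sketched on pre-recorded predictions. This is a simplified illustration, not the published BLast implementation: re-selecting the active model after every instance, the window size, and breaking ties in favour of the first model are all assumptions.

```python
def blast_accuracy(predictions, labels, window=2):
    """predictions: dict model_name -> list of predictions aligned with labels.

    At each step the currently active model predicts; afterwards the model
    with the most correct predictions in the last `window` instances
    becomes active (ties go to the first model in the dict)."""
    models = list(predictions)
    active = models[0]  # arbitrary starting model
    correct = 0
    for t, y in enumerate(labels):
        correct += predictions[active][t] == y
        start = max(0, t + 1 - window)
        scores = {
            m: sum(predictions[m][i] == labels[i] for i in range(start, t + 1))
            for m in models
        }
        active = max(models, key=scores.get)
    return correct / len(labels)


# Hypothetical stream with concept drift: the label flips halfway through.
labels = [0, 0, 0, 1, 1, 1]
predictions = {
    "always_0": [0, 0, 0, 0, 0, 0],
    "always_1": [1, 1, 1, 1, 1, 1],
}
acc = blast_accuracy(predictions, labels, window=2)
```

Each fixed model alone reaches 3/6 here; switching after the drift lifts the combined accuracy to 4/6, at the cost of a short lag while the window fills.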
Algorithm selection (ranking)
J. van Rijn et al. IDA 2015
Learning curves in OpenML
• Identify k nearest prior datasets by distance between partial curves:
• Build a ranking of algorithms to run, start with overall best a_best
• Draw random algorithm a_competitor
• If a_competitor wins on most of the nearest datasets, add it to the ranking; repeat
For new dataset: evaluate learning curves up to T (e.g. 256 instances)
• Fraction of full CV
• No other meta-features
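The procedure above can be sketched as follows. Several parts are assumptions: the slide's distance formula did not survive extraction, so plain Euclidean distance between partial curves of one reference algorithm is used as a stand-in; competitors are drawn at random without replacement; and all curves and scores are hypothetical.

```python
import math
import random


def nearest_datasets(new_curve, prior_curves, k):
    """Names of the k prior datasets whose partial learning curves are
    closest to the new dataset's curve (Euclidean distance: an assumption)."""
    return sorted(
        prior_curves, key=lambda d: math.dist(new_curve, prior_curves[d])
    )[:k]


def rank_algorithms(neighbours, full_scores, overall_best, seed=0):
    """Order in which to try algorithms on the new dataset.

    full_scores[dataset][algorithm] is the full-data score on a prior
    dataset. A competitor joins the ranking if it beats the current best
    on more than half of the neighbouring datasets."""
    rng = random.Random(seed)
    candidates = sorted(a for a in full_scores[neighbours[0]] if a != overall_best)
    rng.shuffle(candidates)  # draw competitors in random order
    ranking, current = [overall_best], overall_best
    for competitor in candidates:
        wins = sum(
            full_scores[d][competitor] > full_scores[d][current]
            for d in neighbours
        )
        if wins > len(neighbours) / 2:
            ranking.append(competitor)
            current = competitor
    return ranking


# Hypothetical partial curves (accuracy at 64, 128, 256 instances).
prior_curves = {
    "d1": [0.50, 0.60, 0.70],
    "d2": [0.52, 0.61, 0.69],
    "d3": [0.90, 0.90, 0.90],
}
full_scores = {
    "d1": {"A": 0.70, "B": 0.80, "C": 0.60},
    "d2": {"A": 0.69, "B": 0.75, "C": 0.50},
    "d3": {"A": 0.95, "B": 0.60, "C": 0.97},
}
neighbours = nearest_datasets([0.51, 0.60, 0.70], prior_curves, k=2)
ranking = rank_algorithms(neighbours, full_scores, overall_best="A")
```

Only the cheap partial curves are computed on the new dataset; the expensive full evaluations are reused from prior datasets.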
[Figure: accuracy loss (0 to 0.1) versus time in seconds (1 to 65536, log scale) for Best on Sample, Average Rank, PCC, and PCC (A3R, r = 1).]
For 53 classifiers on 39 datasets
Multi-objective function: A3R
Towards automating machine learning
[Diagram: human scientists and automated processes (data cleaning, algorithm selection, parameter optimization, workflow construction, post-processing) exchange meta-data, models, and evaluations with OpenML through the API.]
Connect your tools and services to OpenML, so they may learn
"When everything is connected to everything else, for better or for worse, everything matters." (Bruce Mau)
- Open Source, on GitHub
- Regular workshops, hackathons
Join OpenML
Next workshop: Lorentz Center (Leiden), 14-18 March 2016
Thank you
Joaquin Vanschoren
Jan van Rijn
Bernd Bischl
Matthias Feurer
Michel Lang
Nenad Tomašev
Giuseppe Casalicchio
Luis Torgo
You?
#OpenML
Farzan Majdani
Jakob Bossek