making ml more useful to more people · ml.net: an open source and cross-platform machine learning...
“It has exquisite buttons … with long sleeves … works for casual as well as business settings”
f(x)
Why Machine Learning? “Programming the UnProgrammable”
Point of view: Data Science is Software Engineering with Data
Models are Software
• Built as software, just with different tools
• Deployed and updated as software
• Tested as software
• Debugged like software
Training data needs management
• Data is private and increasingly regulated
• Data is dynamic (CRUD, retention policies, …)
• Best managed as part of the data estate
• Training and deployment of models needs to respect data governance
https://dot.net/ml
ML.NET
Brought to you by (amongst others)
Zeeshan Ahmed (Microsoft), Saeed Amizadeh (Microsoft), Mikhail Bilenko (Yandex), Rogan Carr (Microsoft), Wei-Sheng Chin (Microsoft), Yael Dekel (Microsoft), Xavier Dupre (Microsoft), Vadim Eksarevskiy (Microsoft), Senja Filipi (Microsoft), Tom Finley (Microsoft), Abhishek Goswami (Microsoft), Monte Hoover (Microsoft), Scott Inglis (Microsoft), Matteo Interlandi (Microsoft), Najeeb Kazmi (Microsoft), Gleb Krivosheev (Microsoft), Pete Luferenko (Microsoft), Ivan Matantsev (Microsoft), Sergiy Matusevych (Microsoft), Shahab Moradi (Microsoft), Gani Nazirov (Microsoft), Justin Ormont (Microsoft), Gal Oshri (Microsoft), Artidoro Pagnoni (Microsoft), Jignesh Parmar (Microsoft), Prabhat Roy (Microsoft), Zeeshan Siddiqui (Microsoft), Markus Weimer (Microsoft), Shauheen Zahirazami (Microsoft), Yiwen Zhu (Microsoft), …
About .NET
• .NET has cool stuff ML people care about
• C#: Like Java, but from the future
• F#: Like Python, but with static types and multithreading
• Almost-free calls into native code
• .NET is OSS and cross platform
• Windows (surprise!), Linux, macOS
• Phones via Xamarin: Android, iOS
• Interesting HW: Xbox, IoT devices, …
• Lots of developers build important stuff in .NET
• 4M active; 450k added each month
• 15% growth MoM in https://github.com/dotnet
• Half the top-10k websites are built in .NET
.NET
ML.NET: An open source and cross-platform machine learning framework
Machine Learning made for .NET Developers
Covers many developer scenarios
Available in C#, F# and VB.NET
Open source and cross-platform
Windows, Linux, Mac
x64, x86 (some), ARM (some)
Proven and extensible
Development started ~10 years ago
Received contribution (and scrutiny) from all over Microsoft
This designed most of my slides used today ☺
ML.NET is used in many products
• Many MS products use ML.NET (internally known as TLC).
• You have likely used ML.NET today ☺
• Why is that?
• Many products are written in (ASP).NET
• Using ML.NET is just like using any other .NET API
var model = mlContext.Model.Load("mymodel.zip");
var predFunc = model.MakePredictionFunction<T_IN, T_OUT>(mlContext);
var result = predFunc.Predict(x);
Using a model is just like using code
Resource shipped with the app.
Standard software dependency
Training: Think sklearn, but with a statically typed language
ML.NET captures end-to-end Machine Learning Pipelines
Data Ingestion
Text
SQL
In Memory
…
Featurization and Transforms
Text & Image featurization
Pre-trained DNNs in ONNX, TensorFlow
Feature transforms (normalization, pruning, …)
…
Learning Algorithms
Supervised: Linear, Trees, Factorization Machines, …
Unsupervised: PCA, LDA, K-Means, …
Time Series
…
ML.NET is fast & good
• Core infrastructure: IDataView
• Carefully designed to avoid memory allocations
• Only required data is lazily materialized
• Carefully tuned defaults
• Many ML tasks are more alike than we’d like to admit ☺
GBDT Experiments done on Criteo, using default parameters
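The lazy-materialization point above can be illustrated with a small sketch. This is an analogy in plain Python, not the IDataView API: the `LazyView` class and its `with_column` and `cursor` methods are hypothetical names chosen for this example. It only shows the general idea that a derived column costs nothing until a consumer actually asks for it.

```python
class LazyView:
    """A toy, read-only table whose derived columns are computed on demand.

    Hypothetical illustration only; this is not how ML.NET is implemented.
    Derived columns may read raw fields of a row, not other derived columns.
    """

    def __init__(self, rows, derived=None):
        self._rows = rows                      # list of dict rows
        self._derived = dict(derived or {})    # column name -> function(row)

    def with_column(self, name, func):
        """Return a new view with one more derived column; nothing runs yet."""
        return LazyView(self._rows, {**self._derived, name: func})

    def cursor(self, columns):
        """Stream rows, materializing only the requested columns."""
        for row in self._rows:
            out = {}
            for name in columns:
                if name in self._derived:
                    out[name] = self._derived[name](row)
                else:
                    out[name] = row[name]
            yield out


calls = []  # records every invocation of the "expensive" featurizer


def expensive_length(row):
    calls.append(row["text"])
    return len(row["text"])


view = (
    LazyView([{"text": "long sleeves", "price": 20},
              {"text": "buttons", "price": 35}])
    .with_column("text_length", expensive_length)
    .with_column("price_cents", lambda row: row["price"] * 100)
)

# Only price_cents is requested, so expensive_length is never called.
prices = [r["price_cents"] for r in view.cursor(["price_cents"])]
```

In this sketch, building the view allocates no intermediate table, and the text featurizer runs only for consumers that request `text_length`.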
ML.NET’s journey to OSS
• Developed for almost a decade as an internal tool
• Open Sourced in May 2018 (at //build)
• MIT License, .NET Foundation
• Monthly releases ever since; 1.0rc1 this Tuesday
• Please check it out, and leave feedback
Other efforts not discussed today
• Pretzel
• Model compiler
• Especially good at the many models → one program problem
• http://www.markusweimer.com/publication/2018/10/23/pretzel/
• TorchSharp
• PyTorch – Python + .NET
• https://github.com/xamarin/TorchSharp
Distributed Machine Learning where the Data is
Resource Managers
• One cluster used by all workloads (interactive, batch, streaming, …)
• Resources are handed out as containers
• A container is a slice of a machine
• Fixed RAM, CPU, I/O, …
• Examples: Azure Batch, Apache Hadoop YARN, Apache Mesos, Google Borg
Challenges
• Fault tolerance
• Pre-emption
• Elasticity
Machine learning
• ML thrives with gang scheduling
• Iterative
• Fixed data sets
• Gangs are undesirable on shared clusters
• Utilization is paramount
• MPI: Wait …
• MapReduce: Do the work slowly on fewer machines
• Let’s do better than that
Approach I: Elastic ML
NeurIPS ’14
Elastic ML
• Our solution:
• Ramp up the workload with the allocations
• In each iteration, add machines and data
• Figure sequence: first iteration, second iteration, end state
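The ramp-up idea can be sketched with a toy example. This is a minimal illustration under stated assumptions, not the method from the NeurIPS ’14 work: the data (y = 3x), the allocation schedule, and the names `make_partitions`, `partial_gradient` and `elastic_fit` are all hypothetical. Each iteration takes one gradient-descent step over only the partitions whose containers have been granted so far, so training starts immediately and grows with the allocation instead of waiting for the full gang.

```python
def make_partitions(num_partitions, rows_per_partition):
    """Hypothetical data: y = 3 * x, cut into equally sized partitions."""
    partitions = []
    for p in range(num_partitions):
        xs = [(p * rows_per_partition + i + 1) / 10
              for i in range(rows_per_partition)]
        partitions.append([(x, 3.0 * x) for x in xs])
    return partitions


def partial_gradient(weight, partition):
    """Sum of squared-error gradients over one machine's partition."""
    return sum(2 * (weight * x - y) * x for x, y in partition)


def elastic_fit(partitions, allocation_schedule, learning_rate):
    """Take one gradient step per iteration on whatever is allocated so far.

    allocation_schedule[t] is the assumed number of containers granted by
    the resource manager at iteration t; each container loads one partition.
    """
    weight = 0.0
    history = []
    for granted in allocation_schedule:
        active = partitions[: min(granted, len(partitions))]
        rows = sum(len(p) for p in active)
        gradient = sum(partial_gradient(weight, p) for p in active) / rows
        weight -= learning_rate * gradient
        history.append((len(active), rows, weight))
    return weight, history


partitions = make_partitions(4, 5)
# Assumed schedule: one more container per iteration until all four arrive.
weight, history = elastic_fit(
    partitions, [1, 2, 3, 4, 4, 4, 4, 4, 4, 4], learning_rate=0.3
)
```

Each `history` entry records the number of machines used, the rows they held, and the weight after that step, which makes the ramp-up visible iteration by iteration.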
Is it any good?
Approach II: Coded computing
Yaoqing Yang (CMU), Matteo Interlandi, Saeed Amizadeh
NeurIPS ’18, ongoing work
Or: Coded Computing
Container 1 (original data): X[1], Y[1], …
Container 2 (original data): X[2], Y[2], …
Container 3 (original data): X[3], Y[3], …
Container 4 (coded data): X[1]+2X[2]+3X[3], Y[1]+2Y[2]+3Y[3], …
Container 5 (coded data): X[1]+4X[2]+9X[3], Y[1]+4Y[2]+9Y[3], …
Container 6 (coded data): X[1]+8X[2]+27X[3], Y[1]+8Y[2]+27Y[3], …
• Encode 3 splits into 6 splits
• Any 3 row blocks out of 6 are sufficient
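The encoding on this slide can be checked with a short sketch. It uses the coefficients shown above (1,2,3), (1,4,9) and (1,8,27); the function names `encode`, `solve` and `decode` are assumptions for illustration, and the exact-fraction solver is a simple stand-in rather than anything from the NeurIPS ’18 work. The point it demonstrates is that any 3 of the 6 containers determine the 3 original splits.

```python
from fractions import Fraction
from itertools import combinations

# Rows 0-2 keep the original splits; rows 3-5 are the coded combinations.
ENCODING = [
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [1, 2, 3],
    [1, 4, 9],
    [1, 8, 27],
]


def encode(splits):
    """Turn 3 equal-length splits into 6 coded splits, one per container."""
    return [
        [sum(c * x for c, x in zip(row, column)) for column in zip(*splits)]
        for row in ENCODING
    ]


def solve(matrix, rhs):
    """Solve matrix * x = rhs exactly with Gauss-Jordan elimination."""
    n = len(matrix)
    aug = [[Fraction(v) for v in row] + [Fraction(b)]
           for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n] for row in aug]


def decode(surviving):
    """Recover the 3 original splits from any 3 surviving containers.

    `surviving` maps a container index (0-5) to that container's coded split.
    """
    ids = sorted(surviving)[:3]
    matrix = [ENCODING[i] for i in ids]
    length = len(surviving[ids[0]])
    columns = [solve(matrix, [surviving[i][j] for i in ids])
               for j in range(length)]
    return [list(split) for split in zip(*columns)]


original = [[1, 2], [3, 4], [5, 6]]   # hypothetical X[1], X[2], X[3]
coded = encode(original)

# Every choice of 3 containers out of 6 reconstructs the original data.
recovered_ok = all(
    decode({i: coded[i] for i in ids}) == original
    for ids in combinations(range(6), 3)
)
```

With this scheme the job tolerates the loss of any three containers, at the cost of storing and processing twice the data.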
Results
• Real dataset: 100,000 samples, 3352 Features.
• Distributed computing on 20 machines.
• Randomly pick 10 machines and let them randomly fail during the computation.
Many open questions
• For software, we have source control. For data and models we have …?
• For software, we have code reviews. For data we have … ?
• For software, we have semantic versions. For data we have … ?
• For software, we have debuggers. For models, we have … ?
• For software, we have signing. For models, we have … ?
• …
Thanks for your time! Let’s stay in touch!
ML.NET is ML for .NET
https://dot.net/ml
https://github.com/dotnet/machinelearning
You can reach me at:
@MarkusWeimer
http://markusweimer.com
Of course, we are hiring