data versioning and reproducible ml with dvc and mlflow

15
Data Versioning and Reproducible ML with DVC and MLflow Petra Kaferle Devisschere Data Scientist, Adaltas

Upload: others

Post on 12-Jul-2022

3 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Data Versioning and Reproducible ML with DVC and MLflow

Data Versioning and Reproducible ML with DVC and MLflowPetra Kaferle DevisschereData Scientist, Adaltas

Page 2: Data Versioning and Reproducible ML with DVC and MLflow

IntroductionAbout me:

▪ Data Scientist

▪ 13 years of experience in Data Analytics in Life Sciences

▪ Developed solutions for automating frequently used protocols

About Adaltas:▪ Involved with Big Data since 2010

▪ 16 experts in France, 2 in Morocco

▪ Partner with Cloudera, Databricks, D2IQ

▪ Actor in Open Source community

▪ Teaching Big Data at 6 universities

Page 3: Data Versioning and Reproducible ML with DVC and MLflow

Agenda

▪ Versioning in Data Science

▪ Data Versioning

▪ Intro to DVC and MLflow

▪ Demo with DVC and MLflow

Page 4: Data Versioning and Reproducible ML with DVC and MLflow

Problem definition

▪ Dataset▪ Tabular▪ Images, video, sound, text

▪ Code▪ Functionality, Hyper-parameters▪ Data pre-processing, Modeling

▪ Results▪ Numeric (metrics)▪ Plots

▪ Environment▪ Dependencies

What do we need to track in Data Science project to be able to reproduce the results?

Page 5: Data Versioning and Reproducible ML with DVC and MLflow

Data versioning use cases

▪ Data engineering during data cleaning and dataset preparation

▪ Test data in software engineering to assure the functionality of the software

▪ Development and testing of database applications

▪ Ontology versioning for Semantic Web

▪ ...

Data beyond Data Science

Page 6: Data Versioning and Reproducible ML with DVC and MLflow

Data is in the heart of any ML project

▪ Data quality management▪ Reproducible training and re-training▪ Automation of testing or deployment▪ Use in applications and web services▪ Audit

Why is the exact version of a dataset important?

Page 7: Data Versioning and Reproducible ML with DVC and MLflow

Classical DV solutions and their downsides

▪ Git ▪ Not suitable for big files

▪ Difficulties with binary files

▪ Git-LFS▪ File size limitation (ex.: 2 GB for GitHub free account and 5 GB for GitHub Enterprise Cloud)▪ Data needs to be stored on servers of Git provider

▪ External storage▪ Checksum calculation for any change in the dataset and placing them under version control

Classical Software Engineering tools are not enough to solve ML reproducibility crisis

Page 8: Data Versioning and Reproducible ML with DVC and MLflow

Our proposed solutionCombining Data Version Control (DVC) and MLflow

&- solves Git’s limitations- easy to learn

- extensive and easy tracking- user-friendly UI

Page 9: Data Versioning and Reproducible ML with DVC and MLflow

What is DVC?https://dvc.org/

▪ Tracks datasets and ML projects▪ Works with many types of storages (cloud,

local, HDFS, HTTP…)▪ Runs on top of a Git repository▪ Supports building and running pipelines

We will focus on dataset tracking.

Page 10: Data Versioning and Reproducible ML with DVC and MLflow

What is MLflow?https://mlflow.org/

We will focus on experiment tracking.

Page 11: Data Versioning and Reproducible ML with DVC and MLflow

Let’s put them together!Best of both worlds

Page 12: Data Versioning and Reproducible ML with DVC and MLflow

Demo

Page 13: Data Versioning and Reproducible ML with DVC and MLflow

RecapLet’s look again at what we just did

Page 14: Data Versioning and Reproducible ML with DVC and MLflow

Thank you for your attention!

Page 15: Data Versioning and Reproducible ML with DVC and MLflow

Feedback

Your feedback is important to us.

Don’t forget to rateand review the sessions.