( big ) data management - data mining and machine learning - global concepts in 10 slides

( Big ) Data Management Data Mining & Machine Learning Global Concepts in 10 slides 2016 Nicolas SARRAMAGNA https://fr.linkedin.com/pub/nicolas-sarramagna/19/941/587

Upload: nicolas-sarramagna

Post on 10-Feb-2017


TRANSCRIPT

Page 1: ( Big ) Data Management - Data Mining and Machine Learning - Global concepts in 10 slides

( Big ) Data Management

Data Mining & Machine Learning

Global Concepts in 10 slides 2016

Nicolas SARRAMAGNA

https://fr.linkedin.com/pub/nicolas-sarramagna/19/941/587

Page 2: ( Big ) Data Management - Data Mining and Machine Learning - Global concepts in 10 slides

CONTENTS

Introduction

What / Why

How

References

Page 3: ( Big ) Data Management - Data Mining and Machine Learning - Global concepts in 10 slides

COMPAGNIE PLASTIC OMNIUM

CONFIDENTIAL

Data Mining / Machine Learning in Data Management

Collect

Storage

Data Mining /

Machine Learning

Data Viz

Governance

Security

Master Data

Data quality

DATA MANAGEMENT Multiple modules

BIG DATA Velocity, Volume, Variety, Veracity, Value

Page 4: ( Big ) Data Management - Data Mining and Machine Learning - Global concepts in 10 slides

Data Mining / Machine Learning – What / Why

DATA MINING - VALUE Explore and understand data to find relations, new properties, and inductions on them

Descriptive approach

MACHINE LEARNING - VALUE Build a predictive model to answer a question

Predictive approach

20-30 YEARS OLD BUT NEW CONTEXT CPU, database, RAM capacities

more data and features

Internet

Big data

Page 5: ( Big ) Data Management - Data Mining and Machine Learning - Global concepts in 10 slides

Overview - Data Mining

EXPLORE DATA Use of statistics

Needs data visualization for interpretation and insights

CLUSTERING, ASSOCIATION Use of machine learning
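
As a rough illustration of the "explore data" step, the sketch below computes a few descriptive statistics and a coarse histogram using only the Python standard library. The sample values and the helper names `describe` and `histogram` are invented for this example and are not taken from the slides.

```python
import statistics
from collections import Counter

# Hypothetical sample of one numeric feature (values invented for illustration).
values = [12.0, 15.5, 14.2, 13.8, 40.0, 14.9, 15.1, 13.3]

def describe(xs):
    """Return basic descriptive statistics for a list of numbers."""
    return {
        "min": min(xs),
        "max": max(xs),
        "mean": statistics.mean(xs),
        "median": statistics.median(xs),
        "stdev": statistics.stdev(xs),
    }

def histogram(xs, bin_width):
    """Count values per bin; each key is the lower bound of its bin."""
    counts = Counter((x // bin_width) * bin_width for x in xs)
    return dict(sorted(counts.items()))

summary = describe(values)
bins = histogram(values, bin_width=5.0)

print(summary)
print(bins)
```

In this sample the value 40.0 lands in a bin far from the others, which is the kind of observation a chart would make obvious and which the data preparation step then has to handle.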

Page 6: ( Big ) Data Management - Data Mining and Machine Learning - Global concepts in 10 slides

Overview - Machine Learning

PREDICTION Predict a category : classification

predict a number : regression

clustering, association

usage of data mining

Page 7: ( Big ) Data Management - Data Mining and Machine Learning - Global concepts in 10 slides

Data Mining / Machine Learning - How

PROCESS

Define objective, answer, success criteria -> ML Canvas

Data understanding : collect data (one or more data sources), explore (min, max, histogram, charts)

Data preparation : data quality (outliers, missing values), normalize, dimension reduction, noise, new features, labeled data, text, date, shuffle

Data modeling : baseline (random, mean), split data : train & test, select, combine, apply algorithms

Data evaluation : interpretation, evaluation (confusion matrix : recall, precision, formula), validation

Data deployment : deploy and monitor the model (integration, performance : latency, throughput), A/B testing, scalability, sustainability
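
The shuffle, train/test split, and mean baseline named in the process can be sketched as follows, using the standard library only. The dataset, the 25% test ratio, and the helper names `train_test_split` and `mean_absolute_error` are assumptions made for this illustration, not part of the original slides.

```python
import random
import statistics

def train_test_split(rows, test_ratio=0.25, seed=0):
    """Shuffle a copy of the rows, then split it into train and test lists."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]

def mean_absolute_error(y_true, y_pred):
    """Average absolute gap between true and predicted values."""
    return statistics.mean(abs(t - p) for t, p in zip(y_true, y_pred))

# Hypothetical dataset of (feature, target) pairs, invented for illustration.
rows = [(x, 2.0 * x + 1.0) for x in range(20)]
train, test = train_test_split(rows)

# Baseline: always predict the mean target observed in the training set.
baseline = statistics.mean(y for _, y in train)
baseline_mae = mean_absolute_error([y for _, y in test], [baseline] * len(test))

print(len(train), len(test), baseline_mae)
```

Any real model should beat `baseline_mae` on the test set; if it does not, the features or the data preparation probably need another look before more modeling effort is spent.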

WARNING

Need business : domain knowledge

Need data, need relevant features : at least 10 examples per feature, 100 is better

Data preparation is crucial : garbage in -> garbage out

Stay rigorous in the modeling and evaluation phases : overfitting (train, test, cross validation), models can fail

Use best practices of Web development : continuous integration, deployment, evaluation, monitoring, packaging
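
One common way to stay rigorous about overfitting is k-fold cross validation, where every sample is used for validation exactly once. The sketch below only generates the index splits; the function name `k_fold_indices` is hypothetical, and it assumes the data has already been shuffled.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

folds = list(k_fold_indices(n_samples=10, k=3))
for train_idx, val_idx in folds:
    print(len(train_idx), val_idx)
```

A model is then trained on each `train_idx` and scored on the matching `val_idx`; a large gap between training and validation scores is a typical sign of overfitting.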

IN PRACTICE, DIFFERENT LEVELS OF ABSTRACTION

Dev/lib (R, Python scikit-learn, Spark) < generic (MLaaS : BigML, AWS) < problem-specific and / or dedicated software

Prefer a data-driven approach to a model-driven one : ROI is better from new features, more input data, and trying different existing models (as-is) with automatic parameter combinations than from creating and hand-tuning models without an automatic parameter search
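
The idea of trying parameter combinations automatically, instead of hand-tuning, can be sketched with a tiny exhaustive grid search. The toy one-feature model, the validation data, and the parameter names `threshold` and `positive_above` are all invented for this illustration; libraries and MLaaS platforms provide their own equivalents.

```python
from itertools import product

# Hypothetical validation data: (feature value, label), invented for illustration.
valid = [(2.5, 0), (4.0, 0), (5.5, 1), (7.5, 1)]

def predict(x, threshold, positive_above):
    """Toy one-feature model: compare x with a threshold."""
    return int(x > threshold) if positive_above else int(x <= threshold)

def accuracy(rows, **params):
    """Share of rows whose predicted label matches the true label."""
    hits = sum(predict(x, **params) == label for x, label in rows)
    return hits / len(rows)

# Parameter grid: every combination is tried as-is, with no manual tuning.
grid = {"threshold": [2.0, 4.5, 7.0], "positive_above": [True, False]}

results = []
for combo in product(*grid.values()):
    params = dict(zip(grid.keys(), combo))
    results.append((accuracy(valid, **params), params))

best_score, best_params = max(results, key=lambda r: r[0])
print(best_score, best_params)
```

The same loop extends to comparing several models: each candidate is scored on data it was not trained on, and the selection is left to the measured score.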

Page 8: ( Big ) Data Management - Data Mining and Machine Learning - Global concepts in 10 slides

Data Mining / Machine Learning - How

EXAMPLE OF MACHINE LEARNING CANVAS ~ BUSINESS MODEL CANVAS https://github.com/louisdorard/machinelearningcanvas

Page 9: ( Big ) Data Management - Data Mining and Machine Learning - Global concepts in 10 slides

Data Mining / Machine Learning – How

DATA MODELING DEV/LIB LEVEL MODE (SEE LINKS IN LAST SLIDE)

DATA MODELING GENERIC LEVEL MODE : 1-CLICK (AND SOME OPTIONS)

Page 10: ( Big ) Data Management - Data Mining and Machine Learning - Global concepts in 10 slides

Data Mining / Machine Learning – How

SOFTWARE

Page 11: ( Big ) Data Management - Data Mining and Machine Learning - Global concepts in 10 slides

Data Mining / Machine Learning - How

EVALUATION (TRAIN, TEST) WITH CONFUSION MATRIX :

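Recall = TP / (TP + FN) -> quantity of results (share of actual positives that are found) : False Negative = 0 -> recall 100%

Precision = TP / (TP + FP) -> quality of results (share of predicted positives that are correct) : False Positive = 0 -> precision 100%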

Other metric : TP x costTP + TN x costTN + FP x costFP + FN x costFN = value of the model
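
The three metrics above can be computed directly from the four confusion-matrix counts. In the sketch below, the labels, the predictions, and the cost values are invented for illustration; real costs would come from the business context.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels (1 = positive, 0 = negative)."""
    pairs = list(zip(y_true, y_pred))
    return {
        "TP": sum(1 for t, p in pairs if t == 1 and p == 1),
        "TN": sum(1 for t, p in pairs if t == 0 and p == 0),
        "FP": sum(1 for t, p in pairs if t == 0 and p == 1),
        "FN": sum(1 for t, p in pairs if t == 1 and p == 0),
    }

def recall(c):
    """TP / (TP + FN); defined as 0.0 when there are no actual positives."""
    total = c["TP"] + c["FN"]
    return c["TP"] / total if total else 0.0

def precision(c):
    """TP / (TP + FP); defined as 0.0 when nothing is predicted positive."""
    total = c["TP"] + c["FP"]
    return c["TP"] / total if total else 0.0

def model_value(c, costs):
    """TP x costTP + TN x costTN + FP x costFP + FN x costFN."""
    return sum(c[key] * costs[key] for key in ("TP", "TN", "FP", "FN"))

# Hypothetical test labels and predictions.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
# Hypothetical gains (positive) and losses (negative) per outcome.
costs = {"TP": 10, "TN": 0, "FP": -5, "FN": -20}

counts = confusion_counts(y_true, y_pred)
print(counts, recall(counts), precision(counts), model_value(counts, costs))
```

With these invented numbers a missed positive costs four times as much as a false alarm, so the value metric penalizes low recall more heavily than low precision.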

Page 12: ( Big ) Data Management - Data Mining and Machine Learning - Global concepts in 10 slides

Data Mining / Machine Learning - How

ACTORS ON THE MARKET : LIBS, GENERIC, PROBLEM SPECIFIC

Page 13: ( Big ) Data Management - Data Mining and Machine Learning - Global concepts in 10 slides


REFERENCES

http://www.saedsayad.com

http://www.louisdorard.com/courses/

https://bigml.com/

http://scikit-learn.org/stable/tutorial/machine_learning_map/

http://oliviaklose.com/machine-learning-11-algorithms-explained/

http://www.kdnuggets.com/2016/02/gartner-2016-mq-analytics-platforms-gainers-losers.html

http://www.kdnuggets.com/2015/04/forrester-wave-big-data-predictive-analytics-gainers-losers.html

http://www.shivonzilis.com/

http://www.datasciencecentral.com/profiles/blogs/20-data-science-r-python-excel-and-machine-learning-cheat-sheets
