( big ) data management - data mining and machine learning - global concepts in 10 slides
( Big ) Data Management
Data Mining & Machine Learning
Global Concepts in 10 slides 2016
Nicolas SARRAMAGNA
https://fr.linkedin.com/pub/nicolas-sarramagna/19/941/587
CONTENTS
Introduction
What / Why
How
References
COMPAGNIE PLASTIC OMNIUM
CONFIDENTIAL
Data Mining / Machine Learning in Data Management
Collect
Storage
Data Mining /
Machine Learning
Data Viz
Governance
Security
Master Data
Data quality
DATA MANAGEMENT Multiple modules
BIG DATA Velocity, Volume, Variety, Veracity, Value
Data Mining / Machine Learning – What / Why
DATA MINING - VALUE Explore and understand data to find relations, new properties, and inferences about them
Descriptive approach
MACHINE LEARNING - VALUE Build a predictive model to answer a question
Predictive approach
20-30 YEARS OLD BUT NEW CONTEXT CPU, database, and RAM capacities
more data and features
Internet
Big data
Overview - Data Mining
SEPTEMBER 2015
EXPLORE DATA usage of statistics
needs data visualization for interpretation and insights
CLUSTERING, ASSOCIATION usage of machine learning
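As an illustration of the "explore data" step, here is a minimal sketch of descriptive statistics and a text histogram using only the Python standard library. The sample values and the helper names `summarize` and `histogram` are invented for this example, not taken from the slides.

```python
import statistics
from collections import Counter

# Hypothetical sample of one numeric feature (values invented for illustration).
values = [12.0, 15.5, 14.2, 13.8, 40.0, 14.9, 13.1, 15.0]

def summarize(xs):
    """Return basic descriptive statistics for a list of numbers."""
    return {
        "min": min(xs),
        "max": max(xs),
        "mean": statistics.mean(xs),
        "median": statistics.median(xs),
        "stdev": statistics.stdev(xs),
    }

def histogram(xs, bin_width):
    """Count values per bin; each key is the lower edge of its bin."""
    counts = Counter((x // bin_width) * bin_width for x in xs)
    return dict(sorted(counts.items()))

summary = summarize(values)
hist = histogram(values, bin_width=5.0)

# Print a crude text histogram: one '#' per value in the bin.
for lower, count in hist.items():
    print(f"{lower:5.1f} | {'#' * count}")
```

The isolated bin at 40.0 is the kind of outlier that a chart makes obvious and that a mean alone would hide.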
Overview - Machine Learning
PREDICTION predict a category : classification
predict a number : regression
clustering, association
usage of data mining
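The difference between the two kinds of prediction can be sketched with two deliberately tiny models: a 1-nearest-neighbor classifier (predicts a category) and a least-squares line (predicts a number). Both algorithms and the toy data are chosen here for illustration only; the slides do not prescribe them.

```python
def nearest_neighbor_classify(train, x):
    """Classification: return the label of the closest training point (1-NN)."""
    _, label = min(train, key=lambda pair: abs(pair[0] - x))
    return label

def fit_line(xs, ys):
    """Regression: least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical labeled data: (feature value, category).
labeled = [(1.0, "small"), (2.0, "small"), (8.0, "large"), (9.0, "large")]
predicted_label = nearest_neighbor_classify(labeled, 7.2)

# Hypothetical numeric data lying exactly on y = 2x + 1.
slope, intercept = fit_line([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0])
predicted_number = slope * 5.0 + intercept

print(predicted_label, predicted_number)
```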
Data Mining / Machine Learning - How
PROCESS Define the objective, the answer, and the success criteria -> ML Canvas
Data understanding : collect data (one or more data sources), explore (min, max, histogram, charts)
Data preparation : data quality (outliers, missing values), normalization, dimension reduction, noise, new features, labeled data, text, dates, shuffling
Data modeling : baseline (random, mean), split data into train & test, select, combine, and apply algorithms
Data evaluation : interpretation, evaluation (confusion matrix : recall, precision, formula), validation
Data deployment : deploy and monitor the model (integration, performance : latency, throughput), A/B testing, scalability, sustainability
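The modeling and evaluation phases above can be sketched as follows: shuffle, split into train and test, compute a mean baseline, fit a model, and compare both on the held-out test set. This is a minimal sketch under stated assumptions: the data are synthetic, the model is a plain least-squares line, and the helpers `train_test_split`, `fit_line`, and `mean_absolute_error` are written here for the example (they are not library calls).

```python
import random

def train_test_split(rows, test_ratio=0.25, seed=0):
    """Shuffle a copy of the rows, then split it into (train, test)."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_ratio))
    return shuffled[n_test:], shuffled[:n_test]

def fit_line(rows):
    """Least-squares fit of y = slope * x + intercept on (x, y) rows."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in rows)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Synthetic data (an assumption for this sketch): y = 3x + 2 with small noise.
rows = [(float(x), 3.0 * x + 2.0 + (0.5 if x % 2 else -0.5)) for x in range(20)]
train, test = train_test_split(rows)

# Baseline: always predict the mean of the training targets.
baseline = sum(y for _, y in train) / len(train)
slope, intercept = fit_line(train)

y_test = [y for _, y in test]
baseline_mae = mean_absolute_error(y_test, [baseline] * len(test))
model_mae = mean_absolute_error(y_test, [slope * x + intercept for x, _ in test])

print(f"baseline MAE={baseline_mae:.2f}  model MAE={model_mae:.2f}")
```

A model that does not beat the baseline on the test set is not worth deploying, which is why the baseline is computed before any algorithm is tuned.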
WARNING Need business input : domain knowledge
Need data and relevant features : at least 10 data points per feature, 100 is better
Data preparation is crucial : garbage in -> garbage out
Stay rigorous in the modeling and evaluation phases : watch for overfitting (train, test, cross-validation), models can fail
Use the best practices of Web development : continuous integration, deployment, evaluation, monitoring, packaging
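The cross-validation mentioned in the warning on overfitting can be sketched with a small k-fold splitter. The "model" here is just the mean of the training targets, an assumption made to keep the example short; `k_fold_indices` and `cross_validate_mean_model` are hypothetical helper names.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k contiguous folds."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

def cross_validate_mean_model(ys, k):
    """Return the mean absolute error of a mean predictor on each fold."""
    scores = []
    for train_idx, val_idx in k_fold_indices(len(ys), k):
        prediction = sum(ys[i] for i in train_idx) / len(train_idx)
        errors = [abs(ys[i] - prediction) for i in val_idx]
        scores.append(sum(errors) / len(errors))
    return scores

folds = list(k_fold_indices(10, 3))

# Hypothetical, already-sorted targets: note how the fold scores differ.
scores = cross_validate_mean_model([2.0, 4.0, 6.0, 8.0, 10.0, 12.0], k=3)
print(scores)
```

The uneven scores on sorted data show why the process slide lists shuffling under data preparation: contiguous folds of ordered data are not representative samples.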
IN PRACTICE, DIFFERENT LEVELS OF ABSTRACTION Dev/lib (R, Python scikit-learn, Spark) < generic (MLaaS : BigML, AWS) < problem-specific and/or dedicated software
Prefer a data-driven approach over a model-driven one : adding new features, adding more input data, and trying different models as-is with combinations of parameters gives a better ROI than creating and tuning models by hand with no automated parameter search
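The idea of trying combinations of parameters automatically can be sketched as a small grid search. The model is a one-rule threshold classifier and the validation rows are invented; both are assumptions for illustration, and a real project would typically rely on the search utilities of its library or MLaaS platform.

```python
from itertools import product

# Hypothetical validation rows: (feature_a, feature_b, label).
rows = [
    (1.0, 5.0, 0), (2.0, 4.0, 0), (3.0, 6.0, 1),
    (4.0, 1.0, 1), (5.0, 2.0, 1), (1.5, 7.0, 0),
]

def accuracy(feature_index, threshold, data):
    """Accuracy of the rule 'predict 1 when the chosen feature exceeds threshold'."""
    hits = sum(int(row[feature_index] > threshold) == row[2] for row in data)
    return hits / len(data)

# Every combination of (which feature to use, which threshold to apply).
grid = product([0, 1], [1.0, 2.5, 4.5])

# Keep the combination with the highest accuracy.
best_score, best_params = max((accuracy(f, t, rows), (f, t)) for f, t in grid)

print(best_params, best_score)
```

The parameters should be selected on validation data and the final score reported on a separate test set, otherwise the search itself overfits.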
Data Mining / Machine Learning - How
MARCH 2015
EXAMPLE OF MACHINE LEARNING CANVAS ~ BUSINESS MODEL CANVAS https://github.com/louisdorard/machinelearningcanvas
Data Mining / Machine Learning – How
DATA MODELING DEV/LIB LEVEL MODE (SEE LINKS IN LAST SLIDE)
DATA MODELING GENERIC LEVEL MODE : 1-CLICK (AND SOME OPTIONS)
Data Mining / Machine Learning – How
SOFTWARE
Data Mining / Machine Learning - How
EVALUATION (TRAIN, TEST) WITH CONFUSION MATRIX :
Recall -> quantity of results, TP / (TP + FN) : False Negative = 0 -> recall 100%
Precision -> quality of results, TP / (TP + FP) : False Positive = 0 -> precision 100%
Other metric : TP x costTP + TN x costTN + FP x costFP + FN x costFN = value of the model
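The metrics above can be computed directly from the four confusion-matrix counts. This is a minimal sketch: the labels, predictions, and per-outcome costs below are invented for illustration, and the function names are hypothetical.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, TN, FP, FN) for a binary classification result."""
    tp = tn = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, tn, fp, fn

def recall(tp, fn):
    """Share of actual positives that were found."""
    return tp / (tp + fn) if tp + fn else 0.0

def precision(tp, fp):
    """Share of predicted positives that were correct."""
    return tp / (tp + fp) if tp + fp else 0.0

def model_value(tp, tn, fp, fn, cost_tp, cost_tn, cost_fp, cost_fn):
    """TP x costTP + TN x costTN + FP x costFP + FN x costFN."""
    return tp * cost_tp + tn * cost_tn + fp * cost_fp + fn * cost_fn

# Hypothetical test-set labels and predictions.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
rec = recall(tp, fn)
prec = precision(tp, fp)

# Hypothetical business gains and losses per outcome.
value = model_value(tp, tn, fp, fn, cost_tp=10, cost_tn=0, cost_fp=-2, cost_fn=-5)

print(tp, tn, fp, fn, rec, prec, value)
```

Weighting each outcome by its business cost is what lets two models with the same precision and recall be ranked differently.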
Data Mining / Machine Learning - How
ACTORS ON THE MARKET : LIBS, GENERIC, PROBLEM SPECIFIC
REFERENCES
http://www.saedsayad.com
http://www.louisdorard.com/courses/
https://bigml.com/
http://scikit-learn.org/stable/tutorial/machine_learning_map/
http://oliviaklose.com/machine-learning-11-algorithms-explained/
http://www.kdnuggets.com/2016/02/gartner-2016-mq-analytics-platforms-gainers-losers.html
http://www.kdnuggets.com/2015/04/forrester-wave-big-data-predictive-analytics-gainers-losers.html
http://www.shivonzilis.com/
http://www.datasciencecentral.com/profiles/blogs/20-data-science-r-python-excel-and-machine-learning-cheat-sheets
Data Mining / Machine Learning - References