Managing Data Science by David Martínez Rego

Upload: big-data-spain

Post on 16-Apr-2017

Managing Data Science

Dr. David Martínez Rego

Big Data Spain 2016

Lead Data Science

• Leading Data Science is probably one of the most exciting and fun positions that someone can have nowadays.

• Create computational algorithms that can take decisions and learn from errors in any problem that can be formulated in numbers.

• There are many paths to find the treasure in your data; the role of a Lead Data Scientist is to find the shortest and safest one.

DS plan meeting!

Top weird moments

• I prefer not to give you any insight into the problem. Why do you want to know what the columns are? I prefer you treat the problem as just data.

• There exist labels. We do not have permission to access them. You inspect the results and see if they make sense.

• We can give the screenshot of the dashboard and tell the algorithm to predict if it will break.

• Your algorithm is wrong! I have been managing this for 10 years, it cannot be like that.

Other meetings!

Common language

• Because of its short public life, Machine Learning lacks a general understanding of its fundamental limitations and principles.

• A focus on practicality makes the literature and the media oblivious to these fundamentals.

• Only when we agree on a common language can all parties in a room start to understand each other.

Learning theory

• A set of fundamental results that are behind many of the common practices and algorithms we use nowadays.

• It has been heavily researched since the 80s and offers a set of mathematical guarantees and limitations for the practice of ML.

• Useful both for ML practitioners and managers as a rule of thumb to understand and manage DS.

The problem

• Domain

• Dataset

• Loss function

• Hypothesis space

• Training algorithm

• Evaluation
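The components named on this slide can be tied together in a few lines. The following is a minimal sketch, assuming a toy one-dimensional domain, a hypothesis space of threshold classifiers, 0-1 loss and empirical risk minimisation as the training algorithm. None of these choices come from the talk, and all names (`true_label`, `make_dataset`, `erm`, and so on) are illustrative.

```python
import random

# Domain: real numbers in [0, 1]; labels in {0, 1}.
# Assumed ground truth for this toy example: the label is 1 when x >= 0.6.
def true_label(x):
    return 1 if x >= 0.6 else 0

# Dataset: m labelled samples drawn uniformly from the domain.
def make_dataset(m, rng):
    xs = [rng.random() for _ in range(m)]
    return [(x, true_label(x)) for x in xs]

# Loss function: 0-1 loss averaged over a dataset.
def zero_one_loss(hypothesis, data):
    return sum(hypothesis(x) != y for x, y in data) / len(data)

# Hypothesis space: threshold classifiers h_t(x) = 1 if x >= t, else 0.
def threshold_classifier(t):
    return lambda x: 1 if x >= t else 0

HYPOTHESIS_SPACE = [threshold_classifier(i / 20) for i in range(21)]

# Training algorithm: empirical risk minimisation over the hypothesis space.
def erm(data):
    return min(HYPOTHESIS_SPACE, key=lambda h: zero_one_loss(h, data))

# Evaluation: loss on held-out samples that the training algorithm never saw.
rng = random.Random(0)
train, test = make_dataset(200, rng), make_dataset(1000, rng)
model = erm(train)
train_error = zero_one_loss(model, train)
test_error = zero_one_loss(model, test)
print(f"train error: {train_error:.3f}, test error: {test_error:.3f}")
```

Changing any single component (a different loss, a richer hypothesis space, a smaller dataset) changes what the evaluation step reports, which is why it helps to name all of them when planning a project.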

No Free Lunch

How can we prevent such failures? By using our prior knowledge about a specific learning task, to avoid the distributions that will cause us to fail when learning that task. Such prior knowledge can be expressed by restricting our hypothesis class.
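The idea in this passage can be illustrated on a tiny domain. The sketch below is not the formal theorem, only an illustration under assumed settings: six points, three of them seen during training, and a hypothetical `majority_learner`. Averaged over every possible labelling of the domain, the learner is no better than chance on the unseen points; once prior knowledge restricts the possible targets (here, to constant functions), the same learner makes no mistakes.

```python
from itertools import product

TRAIN_POINTS = [0, 1, 2]   # points whose labels the learner sees
TEST_POINTS = [3, 4, 5]    # points the learner never sees

def majority_learner(train_labels):
    """Predict the majority training label on every unseen point."""
    guess = 1 if sum(train_labels) * 2 > len(train_labels) else 0
    return [guess] * len(TEST_POINTS)

def average_test_error(learner, target_functions):
    """Mean error on unseen points, averaged over the given target functions."""
    mistakes = 0
    for labels in target_functions:
        train_labels = labels[:len(TRAIN_POINTS)]
        test_labels = labels[len(TRAIN_POINTS):]
        predictions = learner(train_labels)
        mistakes += sum(p != y for p, y in zip(predictions, test_labels))
    return mistakes / (len(target_functions) * len(TEST_POINTS))

# No prior knowledge: every labelling of the six points is a possible target.
all_targets = list(product([0, 1], repeat=6))

# Prior knowledge: the target is known to be a constant function.
constant_targets = [(0,) * 6, (1,) * 6]

error_without_prior = average_test_error(majority_learner, all_targets)
error_with_prior = average_test_error(majority_learner, constant_targets)
print(error_without_prior, error_with_prior)
```

Any other deterministic learner plugged into `average_test_error` gives the same chance-level average over `all_targets`, because the labels of the unseen points are then independent of everything the learner observed.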

No Free Lunch takeaways

• The No Free Lunch theorem is a mathematical certificate that no single learner works for every problem.

• For managers & HR
  • Foresee an investment in a variety of specialists if you plan to tackle an increasing number of data challenges.
  • Escape from promises of one killer technique that acts as a hammer for all problems.

• For Data Science teams
  • Foresee an increasing number of specific techniques which you have to keep up to date (team effort).

Generalisation bounds

• How can we be sure that a model will not fail in production?

• How can we correct when things do not go well?

• How can I know if I am being wasteful?

Generalisation bounds

• An ML practitioner is going to train a model with complexity d (VC-dimension) on m samples, and she is going to observe an error Ls.

• The expected performance when this model goes to production is bounded, with probability 1-𝜹, in terms of Ls, d and m.
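The formula shown on this slide is not part of the transcript. A standard bound of this kind, as given in the Shalev-Shwartz and Ben-David book listed in the references, has the form L_D(h) ≤ Ls(h) + sqrt(C · (d + log(1/𝜹)) / m) for some universal constant C. The sketch below evaluates that expression under the assumption C = 1, which is a placeholder rather than a value from the talk; the function names are hypothetical.

```python
import math

def generalisation_gap(d, m, delta, c=1.0):
    """Gap term sqrt(c * (d + log(1/delta)) / m) of a VC-style bound.

    c stands in for the unspecified universal constant (assumed to be 1 here).
    """
    if d < 0 or m <= 0 or not 0.0 < delta < 1.0:
        raise ValueError("need d >= 0, m > 0 and 0 < delta < 1")
    return math.sqrt(c * (d + math.log(1.0 / delta)) / m)

def production_error_bound(train_error, d, m, delta, c=1.0):
    """Upper bound on the expected error, holding with probability 1 - delta."""
    return min(1.0, train_error + generalisation_gap(d, m, delta, c))

# The bound tightens as the sample grows and loosens as complexity grows.
for m in (1_000, 10_000, 100_000):
    bound = production_error_bound(train_error=0.05, d=10, m=m, delta=0.05)
    print(f"m={m:>7}: expected error <= {bound:.3f}")
```

Because the constant is unknown, the absolute numbers are only indicative; the useful part is how the bound moves when d, m or 𝜹 change.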

Manage DS

• How can we correct when things do not go well?
  • Get a larger sample
  • Change the hypothesis class by:
    • Enlarging it
    • Reducing it
    • Completely changing it
    • Changing the parameters you consider
  • Change the feature representation of the data
  • Change the optimisation algorithm used to apply your learning rule
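One hedged way to choose among these options is to compare the training error with the validation error. The sketch below assumes the usual reading: a high training error points to a hypothesis class, feature representation or optimiser that cannot fit the data, while a low training error with a high validation error points to too little data for the chosen complexity. The function name and the single `target_error` threshold are hypothetical simplifications, not something stated in the talk.

```python
def suggest_remedies(train_error, validation_error, target_error):
    """Map an error pattern to the remedies listed on the slide."""
    if validation_error <= target_error:
        return []  # the model already meets the target, nothing to correct

    if train_error > target_error:
        # The model cannot even fit the training data.
        return [
            "enlarge the hypothesis class",
            "completely change the hypothesis class",
            "change the feature representation of the data",
            "change the optimisation algorithm",
        ]

    # The model fits the training data but does not generalise.
    return [
        "get a larger sample",
        "reduce the hypothesis class",
    ]

print(suggest_remedies(train_error=0.30, validation_error=0.32, target_error=0.10))
print(suggest_remedies(train_error=0.02, validation_error=0.25, target_error=0.10))
```

In practice the two cases are not always cleanly separated, so the output is best read as a starting point for discussion between the team and its stakeholders.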

Big Data

• Big Data has had a significant impact on the number of samples m, and also on the complexity d (VC-dimension).

• When tackling Variety by making use of unstructured data, we increase the complexity d, so we should plan for the sample size m to be adequate.

• Review the modelling that we are doing to know if we need a big database.

• Is it the case that you do not need to maintain all that data?
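The relationship between d and m can be made concrete by inverting a VC-style bound. The sketch below assumes a sample-complexity expression of the form m ≥ C · (d + log(1/𝜹)) / ε², where ε is the tolerated gap between training and production error, and again sets the unknown constant C to 1 as a placeholder. The function names are hypothetical.

```python
import math

def required_samples(d, epsilon, delta, c=1.0):
    """Samples needed so that the gap stays below epsilon with prob. 1 - delta.

    Uses m >= c * (d + log(1/delta)) / epsilon**2 with an assumed constant c.
    """
    if d < 0 or epsilon <= 0 or not 0.0 < delta < 1.0:
        raise ValueError("need d >= 0, epsilon > 0 and 0 < delta < 1")
    return math.ceil(c * (d + math.log(1.0 / delta)) / epsilon ** 2)

def is_sample_adequate(m, d, epsilon, delta, c=1.0):
    """True when the m samples on hand cover the estimated requirement."""
    return m >= required_samples(d, epsilon, delta, c)

# Doubling the complexity roughly doubles the data that has to be planned for.
for d in (10, 20, 40):
    print(f"d={d:>3}: about {required_samples(d, epsilon=0.1, delta=0.05)} samples")
```

The same calculation read in the other direction addresses the last question on the slide: if the data on hand is far above the estimate for the model actually in use, part of it may not need to be maintained.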

Half pie syndrome

• Symptoms
  • You are spending a lot of money on gathering data to fuel growth in your business.
  • Your systems look like this pie: succulent, but it seems that your business has lost its appetite.

Enough data?

Andrew Gelman (2005):

“Sample sizes are never large. If N is too small to get a sufficiently-precise estimate, you need to get more data (or make more assumptions). But once N is "large enough", you can start subdividing the data to learn more. N is never enough because if it were "enough" you'd already be on to the next problem for which you need more data.”
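A small calculation shows why subdividing uses up a large N so quickly. The sketch below assumes the quantity being estimated is a proportion and that the data is split into equally sized groups; both are illustrative assumptions, as are the numbers.

```python
import math

def standard_error(p, n):
    """Standard error of a proportion p estimated from n samples."""
    return math.sqrt(p * (1.0 - p) / n)

n_total = 1600
n_groups = 16

# Estimate for the whole population versus one of 16 equal subgroups.
whole = standard_error(0.5, n_total)
per_group = standard_error(0.5, n_total // n_groups)

print(f"whole sample: +/- {whole:.4f}")
print(f"per subgroup: +/- {per_group:.4f}")
print(f"precision loss factor: {per_group / whole:.1f}")
```

The standard error grows with the square root of the number of groups, so each extra level of subdivision calls for a corresponding multiple of the data to keep the same precision.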

Big data bounds

Alg. design #Data Engineering

Conclusions

• In order to build a better understanding between data science teams and other stakeholders, we need to make an effort to build a robust common language!

• Learning theory, originally devised as the fundamental theoretical pillar of ML, can help to build that understanding.

• These proven basic laws can help you to have a structured way to manage Data Science.

References

• Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms, 2014.

• Léon Bottou and Olivier Bousquet. The Tradeoffs of Large Scale Learning. NIPS 2008.

• Shai Shalev-Shwartz and Nathan Srebro. SVM Optimization: Inverse Dependence on Training Set Size. ICML 2008.

• Andrew Gelman. N is never large enough, 2005. http://andrewgelman.com/2005/07/31/n_is_never_larg/
