managing data science by david martínez rego

Managing Data Science

Dr. David Martinez Rego

Big Data Spain 2016

Lead Data Science

• Leading Data Science is probably one of the most exciting/fun positions that someone can have nowadays

• Create computational algorithms that can take decisions and learn from errors in any problem that can be formulated in numbers

• There are many paths to the treasure in your data; the role of a Lead Data Scientist is to find the shortest and safest one.

DS plan meeting!

Top weird moments

• I prefer not to give you any insight into the problem. Why do you want to know what the columns are? I prefer you treat the problem as just data.

• There exist labels. We do not have permission to access them. You inspect the results and see if they make sense.

• We can give the screenshot of the dashboard and tell the algorithm to predict if it will break.

• Your algorithm is wrong! I have been managing this for 10 years, it cannot be like that.

Other meetings!

Common language

• Because of its short public life, Machine Learning lacks a general understanding of its fundamental limitations/principles.

• The focus on practicality makes literature/media oblivious to these fundamentals.

• Only when we agree on a common language can all parties in a room start to understand each other.

Learning theory

• Set of fundamental results that are behind many of the common practices and algorithms we use nowadays.

• Has been heavily researched since the 80s and offers a set of mathematical guarantees/limitations in the practice of ML

• Useful both for ML practitioners and managers as a rule of thumb to understand and manage DS.

The problem

• Domain

• Dataset

• Loss function

• Hypothesis space

• Training algorithm

• Evaluation

No Free Lunch

No Free Lunch

How can we prevent such failures? By using our prior knowledge about a specific learning task, to avoid the distributions that will cause us to fail when learning that task. Such prior knowledge can be expressed by restricting our hypothesis class.
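The intuition behind the theorem can be checked on a toy problem. The sketch below is only an illustration under assumed names (`DOMAIN`, `TRAIN`, `UNSEEN`, `majority_learner` and `always_zero_learner` are hypothetical): it enumerates every possible labelling of a four-point domain and shows that, without prior knowledge, a learner's average accuracy on the unseen points is exactly chance.

```python
from itertools import product

DOMAIN = [0, 1, 2, 3]  # tiny hypothetical input domain
TRAIN = [0, 1]         # points whose labels the learner sees
UNSEEN = [2, 3]        # points the learner has to generalise to


def majority_learner(train_labels):
    """Predict the most common training label everywhere (ties go to 1)."""
    guess = 1 if 2 * sum(train_labels) >= len(train_labels) else 0
    return lambda x: guess


def always_zero_learner(train_labels):
    """Ignore the data and always predict 0."""
    return lambda x: 0


def average_unseen_accuracy(learner):
    """Average accuracy on UNSEEN over every possible labelling of DOMAIN."""
    targets = list(product([0, 1], repeat=len(DOMAIN)))
    total = 0.0
    for target in targets:
        predict = learner([target[x] for x in TRAIN])
        hits = sum(predict(x) == target[x] for x in UNSEEN)
        total += hits / len(UNSEEN)
    return total / len(targets)


if __name__ == "__main__":
    for learner in (majority_learner, always_zero_learner):
        print(learner.__name__, average_unseen_accuracy(learner))
```

Both learners average 0.5 because every labelling of the unseen points is equally represented. Restricting which labellings are possible, that is, restricting the hypothesis class with prior knowledge, is what breaks the symmetry.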

No Free Lunch takeaways

• The No Free Lunch theorem is a mathematical certificate that no single learning algorithm succeeds on every problem

• For managers & HR

• foresee an investment in a variety of specialists if you plan to tackle an increasing number of data challenges

• escape from promises of one killer technique that acts as a hammer for all problems

• For Data Science teams

• foresee an increasing number of specific techniques which you have to keep up to date (team effort)

Generalisation bounds

• How can we be sure that a model will not fail in production?

• How can we correct when things do not go well?

• How can I know if I am being wasteful?

Generalisation bounds

• An ML practitioner is going to train a model with complexity d (VC-dimension), on m samples, and she is going to observe an error Ls.

• With probability at least 1-𝜹, the expected error when this model goes to production is bounded by Ls plus a term that grows with d and shrinks with m.
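To make the roles of d, m, Ls and 𝜹 concrete, here is a minimal sketch that evaluates one commonly quoted form of the VC bound (Vapnik's): Ls + sqrt((d (ln(2m/d) + 1) + ln(4/𝜹)) / m). This particular expression is an assumption chosen for illustration and may differ in constants from the one used in the talk; the function name `vc_generalisation_bound` is hypothetical.

```python
import math


def vc_generalisation_bound(train_error, d, m, delta):
    """Upper bound on the expected error, assuming Vapnik's form of the VC bound.

    train_error: observed training error Ls
    d:           VC-dimension of the hypothesis class
    m:           number of training samples
    delta:       the bound holds with probability at least 1 - delta
    """
    if not 0 < d <= m:
        raise ValueError("expected 0 < d <= m")
    if not 0 < delta < 1:
        raise ValueError("expected 0 < delta < 1")
    gap = math.sqrt((d * (math.log(2 * m / d) + 1) + math.log(4 / delta)) / m)
    return train_error + gap


if __name__ == "__main__":
    # Same model and training error, increasing sample sizes.
    for m in (1_000, 10_000, 100_000):
        print(m, round(vc_generalisation_bound(0.10, d=10, m=m, delta=0.05), 3))
```

The bound is usually loose in absolute terms; its value for planning lies in the trend: more samples tighten it, a more complex model loosens it.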

Manage DS

• How can we correct when things do not go well?

• Get a larger sample

• Change the hypothesis class by:

  • Enlarging it

  • Reducing it

  • Completely changing it

  • Changing the parameters you consider

• Change the feature representation of the data

• Change the optimisation algorithm used to apply your learning rule
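One way to choose among these remedies is to compare training and validation error, following the usual reading of learning theory: a high training error points at the hypothesis class, features or optimiser, while a low training error with a high validation error points at sample size or excess complexity. The helper below is a rough sketch of that rule of thumb; the function name `suggest_remedies`, the `target_error` and the `gap_tolerance` default are hypothetical choices, not values from the talk.

```python
def suggest_remedies(train_error, validation_error, target_error, gap_tolerance=0.05):
    """Map observed errors to the remedies listed above (heuristic sketch)."""
    if train_error > target_error:
        # The model does not even fit the training data well enough.
        return [
            "Enlarge or completely change the hypothesis class",
            "Change the feature representation of the data",
            "Change the optimisation algorithm",
        ]
    if validation_error - train_error > gap_tolerance:
        # The model fits the training data but does not generalise.
        return [
            "Get a larger sample",
            "Reduce the hypothesis class",
        ]
    # Both errors are acceptable under the chosen thresholds.
    return []


if __name__ == "__main__":
    print(suggest_remedies(train_error=0.30, validation_error=0.32, target_error=0.10))
    print(suggest_remedies(train_error=0.02, validation_error=0.25, target_error=0.10))
```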

Big Data

• Big Data has had a significant impact on the number of samples m, and also on the complexity d (VC-dimension).

• When we tackle Variety by making use of unstructured data, we increase the complexity d, so we should plan for a sample size m that is adequate for it.

• Review the modelling that we are doing to know if we need a big database.

• Is it the case that you do not need to maintain all that data?

Half pie syndrome

• Symptoms

• You are spending a lot of money on gathering data to fuel growth in your business

• Your systems look like this pie: succulent, but it seems that your business has lost its appetite.

Enough data?

Andrew Gelman (2005):

“Sample sizes are never large. If N is too small to get a sufficiently-precise estimate, you need to get more data (or make more assumptions). But once N is "large enough", you can start subdividing the data to learn more. N is never enough because if it were "enough" you'd already be on to the next problem for which you need more data.”

Big data bounds

Alg. design #Data Engineering

Conclusions

• In order to build a better understanding between data science teams and other stakeholders, we need to make an effort to build a robust common language!

• Learning theory, originally devised as the fundamental theoretic pillar of ML, can help to build an understanding

• These proven basic laws can help you to have a structured way to manage Data Science

References

• Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms, 2014.

• Léon Bottou and Olivier Bousquet. The Tradeoffs of Large Scale Learning. NIPS 2008.

• Shai Shalev-Shwartz and Nathan Srebro. SVM Optimization: Inverse Dependence on Training Set Size. ICML 2008.

• Andrew Gelman. N is never large enough, 2005. http://andrewgelman.com/2005/07/31/n_is_never_larg/
