Managing Data Science by David Martínez Rego
![Page 1: Managing Data Science by David Martínez Rego](https://reader031.vdocuments.us/reader031/viewer/2022022414/586fa11c1a28abcc238b6a1f/html5/thumbnails/1.jpg)
![Page 2: Managing Data Science by David Martínez Rego](https://reader031.vdocuments.us/reader031/viewer/2022022414/586fa11c1a28abcc238b6a1f/html5/thumbnails/2.jpg)
Managing Data Science
Dr. David Martinez Rego
Big Data Spain 2016
![Page 3: Managing Data Science by David Martínez Rego](https://reader031.vdocuments.us/reader031/viewer/2022022414/586fa11c1a28abcc238b6a1f/html5/thumbnails/3.jpg)
Lead Data Science
• Leading Data Science is probably one of the most exciting and fun positions that someone can have nowadays.
• Create computational algorithms that can make decisions and learn from errors in any problem that can be formulated in numbers.
• There are many paths to the treasure in your data; the role of a Lead Data Scientist is to find the shortest and safest one.
![Page 4: Managing Data Science by David Martínez Rego](https://reader031.vdocuments.us/reader031/viewer/2022022414/586fa11c1a28abcc238b6a1f/html5/thumbnails/4.jpg)
DS plan meeting!
![Page 5: Managing Data Science by David Martínez Rego](https://reader031.vdocuments.us/reader031/viewer/2022022414/586fa11c1a28abcc238b6a1f/html5/thumbnails/5.jpg)
Top weird moments
• I prefer not to give you any insight into the problem. Why do you want to know what the columns are? I prefer you treat the problem as just data.
• There are labels, but we do not have permission to access them. You inspect the results and see if they make sense.
• We can give you a screenshot of the dashboard and tell the algorithm to predict whether it will break.
• Your algorithm is wrong! I have been managing this for 10 years; it cannot be like that.
![Page 6: Managing Data Science by David Martínez Rego](https://reader031.vdocuments.us/reader031/viewer/2022022414/586fa11c1a28abcc238b6a1f/html5/thumbnails/6.jpg)
Other meetings!
![Page 7: Managing Data Science by David Martínez Rego](https://reader031.vdocuments.us/reader031/viewer/2022022414/586fa11c1a28abcc238b6a1f/html5/thumbnails/7.jpg)
Common language
• Because of its short public life, Machine Learning lacks a general understanding of its fundamental limitations and principles.
• A focus on practicality makes the literature and the media oblivious to these fundamentals.
• Only when we agree on some common language can all parties in a room start to understand each other.
![Page 8: Managing Data Science by David Martínez Rego](https://reader031.vdocuments.us/reader031/viewer/2022022414/586fa11c1a28abcc238b6a1f/html5/thumbnails/8.jpg)
Learning theory
• A set of fundamental results that lie behind many of the common practices and algorithms we use nowadays.
• It has been heavily researched since the 80s and offers a set of mathematical guarantees and limitations for the practice of ML.
• Useful for both ML practitioners and managers as a rule of thumb to understand and manage DS.
![Page 9: Managing Data Science by David Martínez Rego](https://reader031.vdocuments.us/reader031/viewer/2022022414/586fa11c1a28abcc238b6a1f/html5/thumbnails/9.jpg)
The problem
• Domain
• Dataset
• Loss function
• Hypothesis space
• Training algorithm
• Evaluation
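The six ingredients named on this slide can be made concrete with a toy example. The sketch below is only an illustration, not the speaker's material: the domain (numbers in [0, 1]), the hypothetical ground-truth rule `true_label`, the grid of thresholds, and all function names are assumptions chosen to keep the example small.

```python
import random

# Domain: real numbers in [0, 1]; labels in {0, 1}.
# Hypothetical ground truth, used only to generate toy data.
def true_label(x: float) -> int:
    return 1 if x >= 0.6 else 0

# Dataset: n labelled points drawn uniformly from the domain.
def make_dataset(n: int, seed: int) -> list[tuple[float, int]]:
    rng = random.Random(seed)
    return [(x, true_label(x)) for x in (rng.random() for _ in range(n))]

# Loss function: 0-1 loss (1 for a mistake, 0 otherwise).
def zero_one_loss(prediction: int, label: int) -> int:
    return int(prediction != label)

# Hypothesis space: threshold classifiers on a grid of 101 thresholds.
THRESHOLDS = [i / 100 for i in range(101)]

def predict(threshold: float, x: float) -> int:
    return 1 if x >= threshold else 0

def average_loss(threshold: float, data: list[tuple[float, int]]) -> float:
    return sum(zero_one_loss(predict(threshold, x), y) for x, y in data) / len(data)

# Training algorithm: empirical risk minimisation over the hypothesis space.
def train(data: list[tuple[float, int]]) -> float:
    return min(THRESHOLDS, key=lambda t: average_loss(t, data))

# Evaluation: compare the error on the training sample with the error on fresh data.
train_set = make_dataset(200, seed=0)
test_set = make_dataset(1000, seed=1)
best_threshold = train(train_set)
train_error = average_loss(best_threshold, train_set)
test_error = average_loss(best_threshold, test_set)
print(f"threshold={best_threshold:.2f} train={train_error:.3f} test={test_error:.3f}")
```

Changing any one ingredient (for example a coarser threshold grid, or a smaller training set) changes what the learner can achieve, which is why all six need to be agreed on before discussing results.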
![Page 10: Managing Data Science by David Martínez Rego](https://reader031.vdocuments.us/reader031/viewer/2022022414/586fa11c1a28abcc238b6a1f/html5/thumbnails/10.jpg)
No Free Lunch
![Page 11: Managing Data Science by David Martínez Rego](https://reader031.vdocuments.us/reader031/viewer/2022022414/586fa11c1a28abcc238b6a1f/html5/thumbnails/11.jpg)
No Free Lunch
![Page 12: Managing Data Science by David Martínez Rego](https://reader031.vdocuments.us/reader031/viewer/2022022414/586fa11c1a28abcc238b6a1f/html5/thumbnails/12.jpg)
No Free Lunch
How can we prevent such failures? By using our prior knowledge about a specific learning task, to avoid the distributions that will cause us to fail when learning that task. Such prior knowledge can be expressed by restricting our hypothesis class.
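For reference, the theorem can be stated as follows. This is the formulation given in the Shalev-Shwartz and Ben-David textbook listed in the References; that the preceding slides showed this exact formulation is an assumption, since their content survives only as images.

```latex
\textbf{Theorem (No Free Lunch).} Let $A$ be any learning algorithm for binary
classification with the 0--1 loss over a domain $\mathcal{X}$, and let the
training set size $m$ be any number smaller than $|\mathcal{X}|/2$. Then there
exists a distribution $\mathcal{D}$ over $\mathcal{X} \times \{0,1\}$ such that:
\begin{enumerate}
  \item there exists a function $f : \mathcal{X} \to \{0,1\}$ with
        $L_{\mathcal{D}}(f) = 0$;
  \item with probability at least $1/7$ over the choice of
        $S \sim \mathcal{D}^m$, we have $L_{\mathcal{D}}(A(S)) \ge 1/8$.
\end{enumerate}
```

In words: for every learner there is a task that some other predictor solves perfectly but on which this learner is likely to fail, unless the hypothesis class is restricted using prior knowledge.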
![Page 13: Managing Data Science by David Martínez Rego](https://reader031.vdocuments.us/reader031/viewer/2022022414/586fa11c1a28abcc238b6a1f/html5/thumbnails/13.jpg)
![Page 14: Managing Data Science by David Martínez Rego](https://reader031.vdocuments.us/reader031/viewer/2022022414/586fa11c1a28abcc238b6a1f/html5/thumbnails/14.jpg)
No Free Lunch takeaways
• The No Free Lunch theorem is a mathematical certificate that no single learner succeeds on every task.
• For managers & HR
  • Foresee an investment in a variety of specialists if you plan to tackle an increasing number of data challenges.
  • Steer clear of promises of one killer technique that acts as a hammer for all problems.
• For Data Science teams
  • Foresee an increasing number of specific techniques which you have to keep up to date with (a team effort).
![Page 15: Managing Data Science by David Martínez Rego](https://reader031.vdocuments.us/reader031/viewer/2022022414/586fa11c1a28abcc238b6a1f/html5/thumbnails/15.jpg)
Generalisation bounds
• How can we be sure that a model will not fail in production?
• How can we correct when things do not go well?
• How can I know if I am being wasteful?
![Page 16: Managing Data Science by David Martínez Rego](https://reader031.vdocuments.us/reader031/viewer/2022022414/586fa11c1a28abcc238b6a1f/html5/thumbnails/16.jpg)
Generalisation bounds
• An ML practitioner is going to train a model with complexity d (VC-dimension) on m samples, and she is going to observe an error Ls.
• The expected error when this model goes to production is bounded, with probability 1-𝜹, by a quantity that depends on Ls, d and m.
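The formula on this slide is not recoverable from the transcript. One standard bound of this kind, which follows from the uniform-convergence results in the Shalev-Shwartz and Ben-David textbook listed in the References, is sketched below; treating it as the bound shown on the slide is an assumption. Here $L_{\mathcal{D}}(h)$ is the expected error in production, $L_S(h)$ is the observed error Ls, and $C$ is a universal constant whose value the theory does not pin down.

```latex
\[
  \Pr\!\left[\;
    \forall h \in \mathcal{H}:\quad
    L_{\mathcal{D}}(h) \;\le\; L_S(h) + \sqrt{\frac{C\,\bigl(d + \ln(1/\delta)\bigr)}{m}}
  \;\right] \;\ge\; 1 - \delta
\]
```

The gap between observed and production error shrinks as the sample size m grows and widens as the complexity d grows.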
![Page 17: Managing Data Science by David Martínez Rego](https://reader031.vdocuments.us/reader031/viewer/2022022414/586fa11c1a28abcc238b6a1f/html5/thumbnails/17.jpg)
Manage DS
• How can we correct when things do not go well?
  • Get a larger sample
  • Change the hypothesis class by:
    • Enlarging it
    • Reducing it
    • Completely changing it
    • Changing the parameters you consider
  • Change the feature representation of the data
  • Change the optimisation algorithm used to apply your learning rule
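One way to choose among these remedies is to compare the training error with the error on held-out data. The sketch below is a rough heuristic under stated assumptions, not a rule from the talk: the function name `suggest_next_step`, the target error, and the gap tolerance are all hypothetical values picked for illustration.

```python
def suggest_next_step(
    train_error: float,
    validation_error: float,
    target_error: float = 0.05,   # hypothetical acceptable error level
    gap_tolerance: float = 0.02,  # hypothetical acceptable train/validation gap
) -> str:
    """Map observed errors to one family of remedies from the list above."""
    if train_error > target_error:
        # The model cannot even fit the sample: the class is too poor,
        # the features are uninformative, or the optimiser is failing.
        return "enlarge or change the hypothesis class, the features, or the optimiser"
    if validation_error - train_error > gap_tolerance:
        # The model fits the sample but does not generalise.
        return "get a larger sample or reduce the hypothesis class"
    return "no change needed"


print(suggest_next_step(train_error=0.20, validation_error=0.22))
print(suggest_next_step(train_error=0.01, validation_error=0.15))
print(suggest_next_step(train_error=0.03, validation_error=0.04))
```

The thresholds are placeholders; in practice they would be set from the business requirement and from the noise level of the validation estimate.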
![Page 18: Managing Data Science by David Martínez Rego](https://reader031.vdocuments.us/reader031/viewer/2022022414/586fa11c1a28abcc238b6a1f/html5/thumbnails/18.jpg)
Big Data
• Big Data has had a significant impact on the number of samples m, and also on the complexity d (VC-dimension).
• When we tackle Variety by making use of unstructured data, we increase the complexity d, so we should plan for the sample size m to be adequate.
• Review the modelling that you are doing to know whether you need a big database.
• Could it be that you do not need to maintain all that data?
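The interplay between d and m can be made tangible with a back-of-the-envelope calculation. The sketch below assumes a gap of the form sqrt(C (d + ln(1/δ)) / m) between observed and production error, with the unknown constant C set to 1 purely for illustration; the function names are hypothetical, and the resulting numbers indicate scaling only, not real sample-size requirements.

```python
import math


def gap_bound(d: int, m: int, delta: float, constant: float = 1.0) -> float:
    """Assumed upper bound on (production error - observed error)."""
    return math.sqrt(constant * (d + math.log(1 / delta)) / m)


def required_samples(d: int, epsilon: float, delta: float, constant: float = 1.0) -> int:
    """Smallest m for which the assumed gap bound is at most epsilon."""
    return math.ceil(constant * (d + math.log(1 / delta)) / epsilon ** 2)


# Ten times more complexity calls for roughly ten times more data
# if the same gap is to be maintained.
for d in (10, 100, 1000):
    m = required_samples(d, epsilon=0.1, delta=0.05)
    print(f"d={d:5d} -> m >= {m:7d}, gap <= {gap_bound(d, m, 0.05):.4f}")
```

Read in the other direction, the same calculation shows when stored data exceeds what the chosen model class can benefit from.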
![Page 19: Managing Data Science by David Martínez Rego](https://reader031.vdocuments.us/reader031/viewer/2022022414/586fa11c1a28abcc238b6a1f/html5/thumbnails/19.jpg)
Half pie syndrome
• Symptoms
  • You are spending a lot of money on gathering data to fuel growth in your business.
  • Your systems look like this pie: succulent, but it seems that your business has lost its appetite.
![Page 20: Managing Data Science by David Martínez Rego](https://reader031.vdocuments.us/reader031/viewer/2022022414/586fa11c1a28abcc238b6a1f/html5/thumbnails/20.jpg)
Enough data?
Andrew Gelman (2005):
“Sample sizes are never large. If N is too small to get a sufficiently-precise estimate, you need to get more data (or make more assumptions). But once N is “large enough”, you can start subdividing the data to learn more. N is never enough because if it were “enough” you'd already be on to the next problem for which you need more data.”
![Page 21: Managing Data Science by David Martínez Rego](https://reader031.vdocuments.us/reader031/viewer/2022022414/586fa11c1a28abcc238b6a1f/html5/thumbnails/21.jpg)
Big data bounds
[Figure labels: Alg. design, #Data, Engineering]
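The three labels plausibly annotate the error decomposition from the Bottou and Bousquet paper listed in the References; this mapping is an assumption, since the slide's figure is not recoverable. In that paper the excess error of the model actually deployed splits into three terms:

```latex
\[
  \mathcal{E} \;=\;
  \underbrace{\mathcal{E}_{\mathrm{app}}}_{\text{approximation}}
  \;+\;
  \underbrace{\mathcal{E}_{\mathrm{est}}}_{\text{estimation}}
  \;+\;
  \underbrace{\mathcal{E}_{\mathrm{opt}}}_{\text{optimisation}}
\]
```

Under this reading, the approximation error is driven by the choice of hypothesis class (algorithm design), the estimation error by the number of samples (#Data), and the optimisation error by how accurately the training procedure can be run within the available compute (engineering).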
![Page 22: Managing Data Science by David Martínez Rego](https://reader031.vdocuments.us/reader031/viewer/2022022414/586fa11c1a28abcc238b6a1f/html5/thumbnails/22.jpg)
Conclusions
• To build a better understanding between data science teams and other stakeholders, we need to make an effort to build a robust common language.
• Learning theory, originally devised as the fundamental theoretical pillar of ML, can help to build that understanding.
• These proven basic laws give you a structured way to manage Data Science.
![Page 23: Managing Data Science by David Martínez Rego](https://reader031.vdocuments.us/reader031/viewer/2022022414/586fa11c1a28abcc238b6a1f/html5/thumbnails/23.jpg)
References
• Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms, 2014.
• Léon Bottou and Olivier Bousquet. The Tradeoffs of Large Scale Learning. NIPS 2008.
• Shai Shalev-Shwartz and Nathan Srebro. SVM Optimization: Inverse Dependence on Training Set Size. ICML 2008.
• Andrew Gelman. N is never large enough, 2005. http://andrewgelman.com/2005/07/31/n_is_never_larg/
![Page 24: Managing Data Science by David Martínez Rego](https://reader031.vdocuments.us/reader031/viewer/2022022414/586fa11c1a28abcc238b6a1f/html5/thumbnails/24.jpg)