Workshop Data Manager
![Page 1: Workshop Data Manager](https://reader031.vdocuments.us/reader031/viewer/2022021922/58eef13a1a28abf1298b45f1/html5/thumbnails/1.jpg)
LutzFinger.com
How to extract significant business value from big data
![Page 2: Workshop Data Manager](https://reader031.vdocuments.us/reader031/viewer/2022021922/58eef13a1a28abf1298b45f1/html5/thumbnails/2.jpg)
Lutz
![Page 3: Workshop Data Manager](https://reader031.vdocuments.us/reader031/viewer/2022021922/58eef13a1a28abf1298b45f1/html5/thumbnails/3.jpg)
Disclaimer
This presentation is solely my opinion and not necessarily the opinion of my employers Harvard, LinkedIn, or Cornell.
![Page 4: Workshop Data Manager](https://reader031.vdocuments.us/reader031/viewer/2022021922/58eef13a1a28abf1298b45f1/html5/thumbnails/4.jpg)
Agenda
• The right Ask
• Data is King
• Team-Work: Discover an Ask
• Lunch
• Decision Tree
• Team-Work: Your Model
• Pitfalls with Data
• Break
• Technology
• Team-Work: Become Data Driven?
• BI vs. Data Science
• Build A Team
![Page 5: Workshop Data Manager](https://reader031.vdocuments.us/reader031/viewer/2022021922/58eef13a1a28abf1298b45f1/html5/thumbnails/5.jpg)
Why is there such hype?
![Page 6: Workshop Data Manager](https://reader031.vdocuments.us/reader031/viewer/2022021922/58eef13a1a28abf1298b45f1/html5/thumbnails/6.jpg)
PREDICTING
![Page 7: Workshop Data Manager](https://reader031.vdocuments.us/reader031/viewer/2022021922/58eef13a1a28abf1298b45f1/html5/thumbnails/7.jpg)
The ones who predict:
image by Mike under Creative Commons
![Page 8: Workshop Data Manager](https://reader031.vdocuments.us/reader031/viewer/2022021922/58eef13a1a28abf1298b45f1/html5/thumbnails/8.jpg)
McK Study forecasted:
10 Times More Managers per Data Savvy Person
![Page 9: Workshop Data Manager](https://reader031.vdocuments.us/reader031/viewer/2022021922/58eef13a1a28abf1298b45f1/html5/thumbnails/9.jpg)
?
![Page 10: Workshop Data Manager](https://reader031.vdocuments.us/reader031/viewer/2022021922/58eef13a1a28abf1298b45f1/html5/thumbnails/10.jpg)
Actionable Insights
![Page 11: Workshop Data Manager](https://reader031.vdocuments.us/reader031/viewer/2022021922/58eef13a1a28abf1298b45f1/html5/thumbnails/11.jpg)
ASK the right Questions.
MEASURE the right data – even if it is not Big data.
Take Actions and LEARN from them.
?
![Page 12: Workshop Data Manager](https://reader031.vdocuments.us/reader031/viewer/2022021922/58eef13a1a28abf1298b45f1/html5/thumbnails/12.jpg)
![Page 13: Workshop Data Manager](https://reader031.vdocuments.us/reader031/viewer/2022021922/58eef13a1a28abf1298b45f1/html5/thumbnails/13.jpg)
Google had the right Question
is difficult to find
![Page 14: Workshop Data Manager](https://reader031.vdocuments.us/reader031/viewer/2022021922/58eef13a1a28abf1298b45f1/html5/thumbnails/14.jpg)
Fisheye Learning
![Page 15: Workshop Data Manager](https://reader031.vdocuments.us/reader031/viewer/2022021922/58eef13a1a28abf1298b45f1/html5/thumbnails/15.jpg)
Already Known Asks
by rgieseking under Creative Commons (CC BY 2.0)
Who should get an E-Shot?
Territory Planning for my Sales Force
Budget Planning of Marketing Spend
Online Product Recommendation
Real-Time Bidding for Ad Spaces
Customer Segmentation
Social Media Influencers
Call Center Routing based on Questions
Capacity Forecasting … and more
![Page 16: Workshop Data Manager](https://reader031.vdocuments.us/reader031/viewer/2022021922/58eef13a1a28abf1298b45f1/html5/thumbnails/16.jpg)
The “So-What” Test
![Page 17: Workshop Data Manager](https://reader031.vdocuments.us/reader031/viewer/2022021922/58eef13a1a28abf1298b45f1/html5/thumbnails/17.jpg)
Data by itself is USELESS
Information by itself is USELESS
Only action counts!
![Page 18: Workshop Data Manager](https://reader031.vdocuments.us/reader031/viewer/2022021922/58eef13a1a28abf1298b45f1/html5/thumbnails/18.jpg)
Data by itself is USELESS
Information by itself is USELESS
Let’s connect...
![Page 19: Workshop Data Manager](https://reader031.vdocuments.us/reader031/viewer/2022021922/58eef13a1a28abf1298b45f1/html5/thumbnails/19.jpg)
Benchmarking
Recommending
![Page 20: Workshop Data Manager](https://reader031.vdocuments.us/reader031/viewer/2022021922/58eef13a1a28abf1298b45f1/html5/thumbnails/20.jpg)
A good ‘so what’?
Data by LinkedIn
Example: Laboratory Manager?
![Page 21: Workshop Data Manager](https://reader031.vdocuments.us/reader031/viewer/2022021922/58eef13a1a28abf1298b45f1/html5/thumbnails/21.jpg)
A bad ‘so what’
300+ million members at LinkedIn
60,000 with a job title that might fit
19,000 who switched after 3 to 8 years
24 who had the same career path
![Page 22: Workshop Data Manager](https://reader031.vdocuments.us/reader031/viewer/2022021922/58eef13a1a28abf1298b45f1/html5/thumbnails/22.jpg)
Benchmarking in Health Care
![Page 24: Workshop Data Manager](https://reader031.vdocuments.us/reader031/viewer/2022021922/58eef13a1a28abf1298b45f1/html5/thumbnails/24.jpg)
Recommendations: Your Focus
People You May Know
Groups You May Like
Ads in Which You May Be Interested
Companies You May Want to Follow
Pulse
Similar Profiles
![Page 25: Workshop Data Manager](https://reader031.vdocuments.us/reader031/viewer/2022021922/58eef13a1a28abf1298b45f1/html5/thumbnails/25.jpg)
Recommendation in Health Care
![Page 26: Workshop Data Manager](https://reader031.vdocuments.us/reader031/viewer/2022021922/58eef13a1a28abf1298b45f1/html5/thumbnails/26.jpg)
Many Good Examples
Benchmark Recommendations
![Page 27: Workshop Data Manager](https://reader031.vdocuments.us/reader031/viewer/2022021922/58eef13a1a28abf1298b45f1/html5/thumbnails/27.jpg)
![Page 28: Workshop Data Manager](https://reader031.vdocuments.us/reader031/viewer/2022021922/58eef13a1a28abf1298b45f1/html5/thumbnails/28.jpg)
“Data is the new oil”- World Economic Forum
![Page 29: Workshop Data Manager](https://reader031.vdocuments.us/reader031/viewer/2022021922/58eef13a1a28abf1298b45f1/html5/thumbnails/29.jpg)
“DATA IS THE NEW OIL”
Oil Mine the oil
Use the oil
Goal
![Page 30: Workshop Data Manager](https://reader031.vdocuments.us/reader031/viewer/2022021922/58eef13a1a28abf1298b45f1/html5/thumbnails/30.jpg)
V OF “BIG DATA”
• Volume: data at scale (TB, PB, …)
• Variety: data in many forms (structured, unstructured, …)
• Velocity: speed (streaming, real time, near real time, …)
• Veracity: uncertainty (imprecise, not always up to date, …)
![Page 31: Workshop Data Manager](https://reader031.vdocuments.us/reader031/viewer/2022021922/58eef13a1a28abf1298b45f1/html5/thumbnails/31.jpg)
1st. Round Monopoly
Photo by William Warby under the Creative Commons (CC BY 2.0)
![Page 32: Workshop Data Manager](https://reader031.vdocuments.us/reader031/viewer/2022021922/58eef13a1a28abf1298b45f1/html5/thumbnails/32.jpg)
$3.2 billion
![Page 33: Workshop Data Manager](https://reader031.vdocuments.us/reader031/viewer/2022021922/58eef13a1a28abf1298b45f1/html5/thumbnails/33.jpg)
Prediction
Photo by KOMUnews under the Creative Commons (CC BY 2.0)
Boring could be the New Sexy!
![Page 34: Workshop Data Manager](https://reader031.vdocuments.us/reader031/viewer/2022021922/58eef13a1a28abf1298b45f1/html5/thumbnails/34.jpg)
Data Might (Not) Be A Barrier To Entry
![Page 35: Workshop Data Manager](https://reader031.vdocuments.us/reader031/viewer/2022021922/58eef13a1a28abf1298b45f1/html5/thumbnails/35.jpg)
Data Might (Not) Be A Barrier To Entry
![Page 36: Workshop Data Manager](https://reader031.vdocuments.us/reader031/viewer/2022021922/58eef13a1a28abf1298b45f1/html5/thumbnails/36.jpg)
Public Data is Not
![Page 37: Workshop Data Manager](https://reader031.vdocuments.us/reader031/viewer/2022021922/58eef13a1a28abf1298b45f1/html5/thumbnails/37.jpg)
DATA
Categorical:
• Ordinal: Monday, Tuesday, Wednesday
• Nominal: Man, Woman
Quantitative:
• Ratio: Kelvin, Height, Weight
• Interval: Celsius, Fahrenheit
Structure:
• Structured
• Unstructured
• Semi-structured / Meta data
Read more: “On the Theory of Scales of Measurement”, S. Stevens, 1946
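The difference between the interval and ratio scales can be made concrete with a small sketch. It assumes two made-up temperatures (10 °C and 20 °C) and a hypothetical helper `celsius_to_kelvin`. Differences carry over between the two scales, but ratios are only meaningful on the ratio scale (Kelvin).

```python
# Illustrative sketch: interval scale (Celsius) vs. ratio scale (Kelvin).
# The two temperatures are made-up example values.

def celsius_to_kelvin(celsius):
    """Convert a Celsius reading to Kelvin (ratio scale with a true zero)."""
    return celsius + 273.15

cold_c, warm_c = 10.0, 20.0

# Differences are meaningful on both scales and agree with each other.
difference_c = warm_c - cold_c
difference_k = celsius_to_kelvin(warm_c) - celsius_to_kelvin(cold_c)

# Ratios only make sense on the ratio scale.
ratio_celsius = warm_c / cold_c                                        # 2.0, but not "twice as hot"
ratio_kelvin = celsius_to_kelvin(warm_c) / celsius_to_kelvin(cold_c)   # about 1.035

print(f"difference: {difference_c:.1f} C = {difference_k:.1f} K")
print(f"ratio in Celsius: {ratio_celsius:.3f}, ratio in Kelvin: {ratio_kelvin:.3f}")
```

The Celsius ratio of 2.0 is an artifact of where the zero point sits. The Kelvin ratio of about 1.035 is the physically meaningful one.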
![Page 38: Workshop Data Manager](https://reader031.vdocuments.us/reader031/viewer/2022021922/58eef13a1a28abf1298b45f1/html5/thumbnails/38.jpg)
Data Is King
but not all data is equal.
![Page 39: Workshop Data Manager](https://reader031.vdocuments.us/reader031/viewer/2022021922/58eef13a1a28abf1298b45f1/html5/thumbnails/39.jpg)
The Tale of “Social Media” Data
Source: ‘Ask Measure Learn’ by O’Reilly Media
![Page 40: Workshop Data Manager](https://reader031.vdocuments.us/reader031/viewer/2022021922/58eef13a1a28abf1298b45f1/html5/thumbnails/40.jpg)
Structured Data Is Often Better
New York Weather in April 2013
Source: ‘Ask Measure Learn’ by O’Reilly Media
![Page 41: Workshop Data Manager](https://reader031.vdocuments.us/reader031/viewer/2022021922/58eef13a1a28abf1298b45f1/html5/thumbnails/41.jpg)
Sometimes, it’s worth it.
RE @dave_mcgregor: Publicly pledging to never fly @delta again. The worst airline ever. U have lost my patronage forever du to ur incompetence
Completely unimpressed with @continental or @united. Poor communication, goofy reservations systems and all to turn my trip into a mess.
@SouthWestAir I know you don't make the weather. But at least pretend I am not a bother when I ask if the delay will make miss my connection
![Page 42: Workshop Data Manager](https://reader031.vdocuments.us/reader031/viewer/2022021922/58eef13a1a28abf1298b45f1/html5/thumbnails/42.jpg)
But Data Is King
This will give birth to devices (i.e., the Star Trek Tricorder) that allow you, the consumer, to self-diagnose, anytime, anywhere.
![Page 43: Workshop Data Manager](https://reader031.vdocuments.us/reader031/viewer/2022021922/58eef13a1a28abf1298b45f1/html5/thumbnails/43.jpg)
![Page 44: Workshop Data Manager](https://reader031.vdocuments.us/reader031/viewer/2022021922/58eef13a1a28abf1298b45f1/html5/thumbnails/44.jpg)
About Innovation
By Alistair Croll
![Page 45: Workshop Data Manager](https://reader031.vdocuments.us/reader031/viewer/2022021922/58eef13a1a28abf1298b45f1/html5/thumbnails/45.jpg)
The media industry has changed! The retail industry has changed! The education sector is changing! The finance industry and the healthcare sector are under attack. Which industry will be next?
![Page 46: Workshop Data Manager](https://reader031.vdocuments.us/reader031/viewer/2022021922/58eef13a1a28abf1298b45f1/html5/thumbnails/46.jpg)
Team Work
Photo by Creative Sustainability under the Creative Commons (CC BY 2.0)
A. The Ask
○ Is it actionable? “So what?”
○ Is it Benchmarking or is it Recommendation?
B. The Data
○ Do only you have this data?
○ Do you have a feedback loop?
![Page 47: Workshop Data Manager](https://reader031.vdocuments.us/reader031/viewer/2022021922/58eef13a1a28abf1298b45f1/html5/thumbnails/47.jpg)
LUNCH
![Page 48: Workshop Data Manager](https://reader031.vdocuments.us/reader031/viewer/2022021922/58eef13a1a28abf1298b45f1/html5/thumbnails/48.jpg)
![Page 49: Workshop Data Manager](https://reader031.vdocuments.us/reader031/viewer/2022021922/58eef13a1a28abf1298b45f1/html5/thumbnails/49.jpg)
Are Retail Banks ‘dead’?
![Page 50: Workshop Data Manager](https://reader031.vdocuments.us/reader031/viewer/2022021922/58eef13a1a28abf1298b45f1/html5/thumbnails/50.jpg)
Decision Trees Step by Step
by Maciej Lewandowski under Creative Commons (CC BY-SA 2.0)
![Page 51: Workshop Data Manager](https://reader031.vdocuments.us/reader031/viewer/2022021922/58eef13a1a28abf1298b45f1/html5/thumbnails/51.jpg)
Split Apples & Mandarins
![Page 52: Workshop Data Manager](https://reader031.vdocuments.us/reader031/viewer/2022021922/58eef13a1a28abf1298b45f1/html5/thumbnails/52.jpg)
What Is The Target Variable?
![Page 53: Workshop Data Manager](https://reader031.vdocuments.us/reader031/viewer/2022021922/58eef13a1a28abf1298b45f1/html5/thumbnails/53.jpg)
What Are The Features To Describe The Target?
![Page 54: Workshop Data Manager](https://reader031.vdocuments.us/reader031/viewer/2022021922/58eef13a1a28abf1298b45f1/html5/thumbnails/54.jpg)
What Are The Features To Describe The Target?
• Weight: light, medium, heavy - or x grams
• Shape: round or not
• Color: green, orange, red
• Surface: flat or porous surface
• …
![Page 55: Workshop Data Manager](https://reader031.vdocuments.us/reader031/viewer/2022021922/58eef13a1a28abf1298b45f1/html5/thumbnails/55.jpg)
Which Feature Works Best?
● The variable that carries the most information about the target variable.
● The variable that splits the group into subsets that are as homogeneous as possible with respect to the target variable (pure vs. impure).
![Page 56: Workshop Data Manager](https://reader031.vdocuments.us/reader031/viewer/2022021922/58eef13a1a28abf1298b45f1/html5/thumbnails/56.jpg)
Color Red?
Color Orange?
Split on Color Red vs. Split on Color Orange
Which One Is Better?
![Page 57: Workshop Data Manager](https://reader031.vdocuments.us/reader031/viewer/2022021922/58eef13a1a28abf1298b45f1/html5/thumbnails/57.jpg)
We Need A Way To Describe Chaos
"Claude Elwood Shannon (1916-2001)" by Source. Licensed under Fair use via Wikipedia
![Page 58: Workshop Data Manager](https://reader031.vdocuments.us/reader031/viewer/2022021922/58eef13a1a28abf1298b45f1/html5/thumbnails/58.jpg)
ENTROPY
Entropy is a measure of disorder.
Entropy only tells us how impure one individual subset is.
![Page 59: Workshop Data Manager](https://reader031.vdocuments.us/reader031/viewer/2022021922/58eef13a1a28abf1298b45f1/html5/thumbnails/59.jpg)
ENTROPY & PROBABILITY
entropy = -p1 * log2(p1) - p2 * log2(p2) - …
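The formula can be sketched in a few lines of Python. This is a minimal illustration that assumes base-2 logarithms (so the result is in bits) and a hypothetical function name `entropy`. It is not tied to any particular tool.

```python
from math import log2

def entropy(probabilities):
    """Shannon entropy in bits for a list of class probabilities.

    Classes with probability 0 are skipped because they contribute nothing.
    """
    return sum(-p * log2(p) for p in probabilities if p > 0)

print(entropy([0.5, 0.5]))    # a 50/50 mix of two classes: maximally impure
print(entropy([1.0]))         # a single class only: perfectly pure
print(entropy([8/15, 7/15]))  # 8 apples and 7 mandarins out of 15
```

A 50/50 mix gives 1 bit, a pure subset gives 0, and the 8-versus-7 mix lands just below 1.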
![Page 60: Workshop Data Manager](https://reader031.vdocuments.us/reader031/viewer/2022021922/58eef13a1a28abf1298b45f1/html5/thumbnails/60.jpg)
● Highest Entropy Reduction
● Highest Information Gain
![Page 61: Workshop Data Manager](https://reader031.vdocuments.us/reader031/viewer/2022021922/58eef13a1a28abf1298b45f1/html5/thumbnails/61.jpg)
1st. Entropy Without Split
entropy = -p1 * log2(p1) - p2 * log2(p2)
Apples: 8 out of 15, p(apple) = 8/15
Mandarins: 7 out of 15, p(mandarin) = 7/15
ENTROPY (Without Split):
= -p(apple) * log2(p(apple)) - p(mandarin) * log2(p(mandarin))
= 0.996791632 ≈ 1
very impure
![Page 62: Workshop Data Manager](https://reader031.vdocuments.us/reader031/viewer/2022021922/58eef13a1a28abf1298b45f1/html5/thumbnails/62.jpg)
Color Red?
Color Orange?
entropy = -p1 * log2(p1) - p2 * log2(p2)
Split on Red:
ENTROPY (Split on Red=’no’) = -6/8 * log2(6/8) - 2/8 * log2(2/8) = 0.81
ENTROPY (Split on Red=’yes’) = -6/7 * log2(6/7) - 1/7 * log2(1/7) = 0.59
ENTROPY (After Split on Red)
= 8/15 * ENTROPY (Split on Red=’no’) + 7/15 * ENTROPY (Split on Red=’yes’)
= 0.43 + 0.28 = 0.71
INFORMATION GAIN = Entropy (Before) - Entropy (After) = 1 - 0.71 = 0.29
Split on Orange:
ENTROPY (Split on Orange=’yes’) = -6/6 * log2(6/6) = 0
ENTROPY (Split on Orange=’no’) = -8/9 * log2(8/9) - 1/9 * log2(1/9) = 0.50
ENTROPY (After Split on Orange)
= 6/15 * ENTROPY (Split on Orange=’yes’) + 9/15 * ENTROPY (Split on Orange=’no’)
= 0 + 0.30 = 0.30
INFORMATION GAIN = Entropy (Before) - Entropy (After) = 1 - 0.30 = 0.70
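The two splits can be recomputed with a short sketch. The per-branch class counts below are inferred from the fractions on the slide (8 apples and 7 mandarins in total). Which class is the majority in each Red branch is an assumption, and the function names are hypothetical.

```python
from math import log2

def entropy(counts):
    """Entropy in bits of a subset, given the count of each class in it."""
    total = sum(counts)
    return sum(-(c / total) * log2(c / total) for c in counts if c > 0)

def information_gain(parent, branches):
    """Entropy before the split minus the size-weighted entropy after it."""
    total = sum(parent)
    after = sum(sum(branch) / total * entropy(branch) for branch in branches)
    return entropy(parent) - after

parent = [8, 7]  # [apples, mandarins]

# Assumed class counts per branch, as [apples, mandarins].
gain_red = information_gain(parent, [[2, 6], [6, 1]])     # Red = no, Red = yes
gain_orange = information_gain(parent, [[0, 6], [8, 1]])  # Orange = yes, Orange = no

print(f"gain for Red:    {gain_red:.2f}")
print(f"gain for Orange: {gain_orange:.2f}")
```

Each branch is weighted by its share of the 15 fruits. The sketch uses the unrounded entropy before the split (0.9968 rather than 1), so it yields roughly 0.29 for Red and 0.69 for Orange. The second decimal can differ from figures worked out with rounded intermediate values. Either way, Orange is the better first split.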
![Page 63: Workshop Data Manager](https://reader031.vdocuments.us/reader031/viewer/2022021922/58eef13a1a28abf1298b45f1/html5/thumbnails/63.jpg)
INFORMATION GAIN (IG)
Information gain measures how much a given feature improves (decreases) entropy over the whole segmentation it creates.
How important is this feature for the prediction?
![Page 64: Workshop Data Manager](https://reader031.vdocuments.us/reader031/viewer/2022021922/58eef13a1a28abf1298b45f1/html5/thumbnails/64.jpg)
Decision Tree
Color Orange? ROOT NODE
LEAVES
![Page 65: Workshop Data Manager](https://reader031.vdocuments.us/reader031/viewer/2022021922/58eef13a1a28abf1298b45f1/html5/thumbnails/65.jpg)
Decision Tree
Color Orange?
Decision Tree Structure
![Page 66: Workshop Data Manager](https://reader031.vdocuments.us/reader031/viewer/2022021922/58eef13a1a28abf1298b45f1/html5/thumbnails/66.jpg)
Which Feature Would Be Better?
![Page 67: Workshop Data Manager](https://reader031.vdocuments.us/reader031/viewer/2022021922/58eef13a1a28abf1298b45f1/html5/thumbnails/67.jpg)
Heavy?
Always Start With Highest IG
![Page 68: Workshop Data Manager](https://reader031.vdocuments.us/reader031/viewer/2022021922/58eef13a1a28abf1298b45f1/html5/thumbnails/68.jpg)
Hyperplanes
![Page 69: Workshop Data Manager](https://reader031.vdocuments.us/reader031/viewer/2022021922/58eef13a1a28abf1298b45f1/html5/thumbnails/69.jpg)
Hyperplane (2 dimensions)
[Figure labels: Mandarins; color axis: Red, Green; weight axis: Light, Heavy]
![Page 70: Workshop Data Manager](https://reader031.vdocuments.us/reader031/viewer/2022021922/58eef13a1a28abf1298b45f1/html5/thumbnails/70.jpg)
![Page 71: Workshop Data Manager](https://reader031.vdocuments.us/reader031/viewer/2022021922/58eef13a1a28abf1298b45f1/html5/thumbnails/71.jpg)
Back To The Lending Industry
![Page 72: Workshop Data Manager](https://reader031.vdocuments.us/reader031/viewer/2022021922/58eef13a1a28abf1298b45f1/html5/thumbnails/72.jpg)
BIG ML
Competitors:
● Algorithms.io
● SnapAnalytx
● wise.io
● Predixion Software
● Google Prediction API
![Page 73: Workshop Data Manager](https://reader031.vdocuments.us/reader031/viewer/2022021922/58eef13a1a28abf1298b45f1/html5/thumbnails/73.jpg)
Real Data Set
![Page 74: Workshop Data Manager](https://reader031.vdocuments.us/reader031/viewer/2022021922/58eef13a1a28abf1298b45f1/html5/thumbnails/74.jpg)
Build Database
How Do You Deal With Categorical vs. Numeric Variables in Decision Trees?
screenshot from bigML tool
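One common textbook answer to the question on this slide is sketched below. It is an assumption, not a description of what the BigML tool does internally. A categorical variable is split on its values (as in "Color Orange?"), while a numeric variable is turned into a yes/no question by trying thresholds between neighbouring values and keeping the one with the highest information gain. The weights in grams and the function names are hypothetical.

```python
from math import log2

def entropy(labels):
    """Entropy in bits of a list of class labels."""
    total = len(labels)
    result = 0.0
    for cls in set(labels):
        p = labels.count(cls) / total
        result -= p * log2(p)
    return result

def best_threshold(values, labels):
    """Try the midpoint between each pair of neighbouring values; return (threshold, gain)."""
    base = entropy(labels)
    pairs = list(zip(values, labels))
    distinct = sorted(set(values))
    best = (None, 0.0)
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2
        left = [label for value, label in pairs if value <= threshold]
        right = [label for value, label in pairs if value > threshold]
        after = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        gain = base - after
        if gain > best[1]:
            best = (threshold, gain)
    return best

# Hypothetical weights in grams, purely for illustration.
weights = [90, 100, 110, 150, 160, 170]
fruits = ["mandarin", "mandarin", "mandarin", "apple", "apple", "apple"]

threshold, gain = best_threshold(weights, fruits)
print(f"split on weight <= {threshold} g, information gain {gain:.2f}")
```

With these made-up weights the best question is "weight <= 130 g?", which separates the two fruits perfectly.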
![Page 75: Workshop Data Manager](https://reader031.vdocuments.us/reader031/viewer/2022021922/58eef13a1a28abf1298b45f1/html5/thumbnails/75.jpg)
Configure And Build Model
Select The Objective Field - What To Train The Model On?
That is the Row ID - surely no impact.
screenshot from bigML tool
![Page 76: Workshop Data Manager](https://reader031.vdocuments.us/reader031/viewer/2022021922/58eef13a1a28abf1298b45f1/html5/thumbnails/76.jpg)
screenshot from bigML tool
![Page 77: Workshop Data Manager](https://reader031.vdocuments.us/reader031/viewer/2022021922/58eef13a1a28abf1298b45f1/html5/thumbnails/77.jpg)
screenshot from bigML tool
![Page 78: Workshop Data Manager](https://reader031.vdocuments.us/reader031/viewer/2022021922/58eef13a1a28abf1298b45f1/html5/thumbnails/78.jpg)
Highest Information Gain
screenshot from bigML tool
![Page 79: Workshop Data Manager](https://reader031.vdocuments.us/reader031/viewer/2022021922/58eef13a1a28abf1298b45f1/html5/thumbnails/79.jpg)
Using The Model
screenshot from bigML tool
![Page 80: Workshop Data Manager](https://reader031.vdocuments.us/reader031/viewer/2022021922/58eef13a1a28abf1298b45f1/html5/thumbnails/80.jpg)
Using The Model
screenshot from bigML tool
![Page 81: Workshop Data Manager](https://reader031.vdocuments.us/reader031/viewer/2022021922/58eef13a1a28abf1298b45f1/html5/thumbnails/81.jpg)
Using The Model
screenshot from bigML tool
![Page 82: Workshop Data Manager](https://reader031.vdocuments.us/reader031/viewer/2022021922/58eef13a1a28abf1298b45f1/html5/thumbnails/82.jpg)
Found 2,470 New Instances
screenshot from bigML tool
![Page 83: Workshop Data Manager](https://reader031.vdocuments.us/reader031/viewer/2022021922/58eef13a1a28abf1298b45f1/html5/thumbnails/83.jpg)
How Can I Now Improve Quality?
![Page 84: Workshop Data Manager](https://reader031.vdocuments.us/reader031/viewer/2022021922/58eef13a1a28abf1298b45f1/html5/thumbnails/84.jpg)
![Page 85: Workshop Data Manager](https://reader031.vdocuments.us/reader031/viewer/2022021922/58eef13a1a28abf1298b45f1/html5/thumbnails/85.jpg)
Pitfalls with Data
• Data & Ethics
• More Data & Overfitting
• Confidence & Cut-off
• Cause & Correlation
• …
![Page 86: Workshop Data Manager](https://reader031.vdocuments.us/reader031/viewer/2022021922/58eef13a1a28abf1298b45f1/html5/thumbnails/86.jpg)
How Did They Improve Scoring?
![Page 87: Workshop Data Manager](https://reader031.vdocuments.us/reader031/viewer/2022021922/58eef13a1a28abf1298b45f1/html5/thumbnails/87.jpg)
Social Network Info
Could social network data improve the quality of our prediction?
Who is more creditworthy?
a. Tim, whose friends are all very creditworthy
b. Tom, whose friends are not creditworthy
![Page 88: Workshop Data Manager](https://reader031.vdocuments.us/reader031/viewer/2022021922/58eef13a1a28abf1298b45f1/html5/thumbnails/88.jpg)
Ethical?
![Page 89: Workshop Data Manager](https://reader031.vdocuments.us/reader031/viewer/2022021922/58eef13a1a28abf1298b45f1/html5/thumbnails/89.jpg)
Nobel Worthy!
Muhammad Yunus
Photo by University of Salford under Creative Commons CC BY 2.0
![Page 90: Workshop Data Manager](https://reader031.vdocuments.us/reader031/viewer/2022021922/58eef13a1a28abf1298b45f1/html5/thumbnails/90.jpg)
In the EU insurers will no longer be allowed to take the gender of their customers into account for insurance premiums:
● young men's premiums will fall by up to 10%
● young women's premiums will rise by up to 30%
by: BBC News: http://www.bbc.com/news/business-12608777
Not Everything That Is Possible Is Legal
![Page 91: Workshop Data Manager](https://reader031.vdocuments.us/reader031/viewer/2022021922/58eef13a1a28abf1298b45f1/html5/thumbnails/91.jpg)
![Page 92: Workshop Data Manager](https://reader031.vdocuments.us/reader031/viewer/2022021922/58eef13a1a28abf1298b45f1/html5/thumbnails/92.jpg)
The Tale of Big Data
![Page 93: Workshop Data Manager](https://reader031.vdocuments.us/reader031/viewer/2022021922/58eef13a1a28abf1298b45f1/html5/thumbnails/93.jpg)
Overfitting
Tailoring a model to the training data at the expense of its ability to generalize to previously unseen data points. The model becomes perfect at describing noise and spurious correlations.
TRADE OFF
Complexity of a Model & Overfitting Likelihood
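The trade-off can be illustrated with a toy sketch. All data points below are made up: the true rule is "label 1 if x > 5", and two training labels are deliberately flipped to play the role of noise. A simple threshold model is compared with a hypothetical "memorizing" model (1-nearest-neighbour lookup) that reproduces the training data perfectly, noise included.

```python
# Made-up data: true rule is "1 if x > 5"; the labels at x=3 and x=8 are noise.
TRAIN = [(1, 0), (2, 0), (3, 1), (4, 0), (6, 1), (7, 1), (8, 0), (9, 1)]
# Previously unseen points, labelled with the true rule.
TEST = [(1.5, 0), (2.9, 0), (3.2, 0), (4.5, 0), (5.5, 1), (6.5, 1), (7.8, 1), (8.4, 1)]

def simple_model(x):
    """Low complexity: a single threshold."""
    return 1 if x > 5 else 0

def memorizing_model(x):
    """High complexity: return the label of the closest training point."""
    nearest_x, nearest_label = min(TRAIN, key=lambda point: abs(point[0] - x))
    return nearest_label

def accuracy(model, data):
    """Share of points whose predicted label matches the given label."""
    hits = sum(1 for x, label in data if model(x) == label)
    return hits / len(data)

for name, model in [("simple", simple_model), ("memorizing", memorizing_model)]:
    print(f"{name:10s} train={accuracy(model, TRAIN):.2f} test={accuracy(model, TEST):.2f}")
```

On this toy data the memorizing model scores 1.00 on the training set but only 0.50 on the unseen points, because it has learned the two noisy labels. The simple model scores 0.75 on the training set and 1.00 on the unseen points.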
![Page 94: Workshop Data Manager](https://reader031.vdocuments.us/reader031/viewer/2022021922/58eef13a1a28abf1298b45f1/html5/thumbnails/94.jpg)
How Trustworthy Is This Prediction?
• 45 instances
• 59% confidence
screenshot from bigML tool
![Page 95: Workshop Data Manager](https://reader031.vdocuments.us/reader031/viewer/2022021922/58eef13a1a28abf1298b45f1/html5/thumbnails/95.jpg)
The Need for Domain Knowledge
![Page 96: Workshop Data Manager](https://reader031.vdocuments.us/reader031/viewer/2022021922/58eef13a1a28abf1298b45f1/html5/thumbnails/96.jpg)
![Page 97: Workshop Data Manager](https://reader031.vdocuments.us/reader031/viewer/2022021922/58eef13a1a28abf1298b45f1/html5/thumbnails/97.jpg)
Give Credit or Not?
49% Confidence
screenshot from bigML tool
![Page 98: Workshop Data Manager](https://reader031.vdocuments.us/reader031/viewer/2022021922/58eef13a1a28abf1298b45f1/html5/thumbnails/98.jpg)
CONFUSION MATRIX
| Classifier \ Reality | Pregnant (60) | Not pregnant (940) |
|---|---|---|
| Pregnant | (A) true positive | (B) false positive |
| Not pregnant | (C) false negative | (D) true negative |
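The four cells can be turned into the usual quality measures with a small sketch. Only the column totals (60 pregnant, 940 not pregnant) come from the slide. The individual cell counts A to D below are hypothetical values chosen for illustration.

```python
# Hypothetical cell counts; only the totals 60 and 940 are taken from the slide.
true_positive = 45    # (A) classifier says pregnant, reality pregnant
false_positive = 47   # (B) classifier says pregnant, reality not pregnant
false_negative = 15   # (C) classifier says not pregnant, reality pregnant
true_negative = 893   # (D) classifier says not pregnant, reality not pregnant

total = true_positive + false_positive + false_negative + true_negative

accuracy = (true_positive + true_negative) / total
precision = true_positive / (true_positive + false_positive)  # how many "pregnant" calls are right
recall = true_positive / (true_positive + false_negative)     # how many pregnant cases are found

# Baseline: a "classifier" that always answers "not pregnant".
baseline_accuracy = (false_positive + true_negative) / total

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
print(f"baseline accuracy={baseline_accuracy:.3f}")
```

With these assumed counts the classifier finds 75% of the pregnant cases, yet its accuracy (0.938) is slightly below the baseline that never predicts "pregnant" (0.940). On unbalanced classes, accuracy alone says little, and the choice of cut-off has to weigh false positives against false negatives.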
![Page 99: Workshop Data Manager](https://reader031.vdocuments.us/reader031/viewer/2022021922/58eef13a1a28abf1298b45f1/html5/thumbnails/99.jpg)
![Page 100: Workshop Data Manager](https://reader031.vdocuments.us/reader031/viewer/2022021922/58eef13a1a28abf1298b45f1/html5/thumbnails/100.jpg)
Who Invented Big Data Infrastructure?
![Page 101: Workshop Data Manager](https://reader031.vdocuments.us/reader031/viewer/2022021922/58eef13a1a28abf1298b45f1/html5/thumbnails/101.jpg)
Who Invented Big Data Infrastructure?
![Page 102: Workshop Data Manager](https://reader031.vdocuments.us/reader031/viewer/2022021922/58eef13a1a28abf1298b45f1/html5/thumbnails/102.jpg)
Issue Of Yahoo
CENTRALIZED SYSTEMS ARE EXPENSIVE
• diminishing returns in power (overhead issue)
• exponential cost to scale
• slow to transport (ETL) the data
Scan 100 TB datasets on a 1000 node cluster:
• Remote Storage @ 10 MB/s = 165 min
• Local Storage @ 200 MB/s = 8 min
MAKE SYSTEMS FAULT TOLERANT
1000 nodes - a machine a day will break
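The scan times can be checked with back-of-the-envelope arithmetic. The sketch assumes 100 TB spread evenly over 1000 nodes (the dataset size that reproduces the quoted 165 min and 8 min), decimal units (1 TB = 1,000,000 MB), and all nodes reading in parallel. The function name is hypothetical.

```python
def scan_minutes(dataset_tb, nodes, mb_per_second):
    """Minutes for every node to read its share of the dataset in parallel."""
    mb_per_node = dataset_tb * 1_000_000 / nodes  # decimal units: 1 TB = 1,000,000 MB
    return mb_per_node / mb_per_second / 60

remote = scan_minutes(dataset_tb=100, nodes=1000, mb_per_second=10)
local = scan_minutes(dataset_tb=100, nodes=1000, mb_per_second=200)

print(f"remote storage @ 10 MB/s:  {remote:.0f} min")
print(f"local storage  @ 200 MB/s: {local:.0f} min")
```

Each node reads 100 GB. That takes about 167 minutes at 10 MB/s and about 8 minutes at 200 MB/s, which is the argument for moving the computation to where the data is stored.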
![Page 103: Workshop Data Manager](https://reader031.vdocuments.us/reader031/viewer/2022021922/58eef13a1a28abf1298b45f1/html5/thumbnails/103.jpg)
The Vision
CHEAP systems
• can run on commodity hardware
Computations are done DECENTRALLY
• ability to ‘dispatch’ a task
• parallelize work-streams
Fault TOLERANT
• a break is not an issue, no matter where and when it happens
![Page 104: Workshop Data Manager](https://reader031.vdocuments.us/reader031/viewer/2022021922/58eef13a1a28abf1298b45f1/html5/thumbnails/104.jpg)
![Page 105: Workshop Data Manager](https://reader031.vdocuments.us/reader031/viewer/2022021922/58eef13a1a28abf1298b45f1/html5/thumbnails/105.jpg)
How To Access HDFS
Hadoop Storage (HDFS / HBase / Solr)
Map Reduce
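The Map Reduce idea can be sketched in plain Python as a word count over three made-up text chunks. This is only an illustration of the pattern (map each chunk independently, group by key, reduce each group). It does not use the Hadoop API, and all function names are hypothetical.

```python
from collections import defaultdict

def map_phase(chunk):
    """Runs independently on each chunk of data: emit (word, 1) pairs."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle(mapped_outputs):
    """Group all emitted values by key, as the framework would between the phases."""
    groups = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Combine the values of each key: here, sum the counts."""
    return {key: sum(values) for key, values in groups.items()}

# Made-up chunks; on a cluster each chunk would live on, and be mapped by, a different node.
chunks = ["data is king", "data is the new oil", "only action counts"]

mapped = [map_phase(chunk) for chunk in chunks]
counts = reduce_phase(shuffle(mapped))
print(counts)
```

Because every call to `map_phase` only looks at its own chunk, the map step can be dispatched to many machines at once, and a failed chunk can simply be re-run on another machine.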
![Page 106: Workshop Data Manager](https://reader031.vdocuments.us/reader031/viewer/2022021922/58eef13a1a28abf1298b45f1/html5/thumbnails/106.jpg)
Via The Normal Languages
Hadoop Storage (HDFS / HBase / Solr)
Map Reduce
• MapReduce
• Hive: SQL Like
• Pig / Cascading: Scripting Like
• Giraph: Graph Oriented
• Mahout: ML Engine
![Page 107: Workshop Data Manager](https://reader031.vdocuments.us/reader031/viewer/2022021922/58eef13a1a28abf1298b45f1/html5/thumbnails/107.jpg)
Pro & Con
Hadoop Storage (HDFS / HBase / Solr)
Map Reduce
• MapReduce
• Hive: SQL Like
• Pig / Cascading: Scripting Like
• Giraph: Graph Oriented
• Mahout: ML Engine
Store
ETL: Extract / Transform / Load
DB / Key Value Store
Visualize
Pro: way better than traditional BI
Con: heavy tech involvement. 12-18 months for a non-tech company to implement a schema
![Page 108: Workshop Data Manager](https://reader031.vdocuments.us/reader031/viewer/2022021922/58eef13a1a28abf1298b45f1/html5/thumbnails/108.jpg)
Pro & Con
Hadoop Storage (HDFS / HBase / Solr)
Map Reduce
• MapReduce
• Hive: SQL Like
• Pig / Cascading: Scripting Like
• Giraph: Graph Oriented
• Mahout: ML Engine
DB / Key Value Store
Visualize
New Approaches:
● Spark
● Tez
● Flink
![Page 109: Workshop Data Manager](https://reader031.vdocuments.us/reader031/viewer/2022021922/58eef13a1a28abf1298b45f1/html5/thumbnails/109.jpg)
![Page 110: Workshop Data Manager](https://reader031.vdocuments.us/reader031/viewer/2022021922/58eef13a1a28abf1298b45f1/html5/thumbnails/110.jpg)
Why Is It So Hard To Become Data Driven
![Page 111: Workshop Data Manager](https://reader031.vdocuments.us/reader031/viewer/2022021922/58eef13a1a28abf1298b45f1/html5/thumbnails/111.jpg)
Ingredients of Data Products
Ask
• The question?
• The need?
• The Why?
Measure
• The Data?
• The features?
• The algorithms?
Team
• The right Skills?
• Collaboration
All of them are necessary - None of them is sufficient!
![Page 112: Workshop Data Manager](https://reader031.vdocuments.us/reader031/viewer/2022021922/58eef13a1a28abf1298b45f1/html5/thumbnails/112.jpg)
How To Ingest Ideas
Hack-Days & Incubator
Internal Process
External Competition
Close Collaboration between Business & Data Scientists
“All we do is Data” - Jeff Weiner
![Page 113: Workshop Data Manager](https://reader031.vdocuments.us/reader031/viewer/2022021922/58eef13a1a28abf1298b45f1/html5/thumbnails/113.jpg)
What Would You Need To Do To Be A Leader In Data
![Page 114: Workshop Data Manager](https://reader031.vdocuments.us/reader031/viewer/2022021922/58eef13a1a28abf1298b45f1/html5/thumbnails/114.jpg)
![Page 115: Workshop Data Manager](https://reader031.vdocuments.us/reader031/viewer/2022021922/58eef13a1a28abf1298b45f1/html5/thumbnails/115.jpg)
Old vs. New
| | Old School | Today / Big data |
|---|---|---|
| Data Amount | Gigabytes & Terabytes | Petabytes & Exabytes |
| IT Infrastructure | Centralized | Decentralized / Parallelized |
| Data Types | Structured | Structured & unstructured |
| Schema | Stable schema | Schema on the fly |
| When and How is the ASK formulated? | Set ask | Ad-hoc ask |
![Page 116: Workshop Data Manager](https://reader031.vdocuments.us/reader031/viewer/2022021922/58eef13a1a28abf1298b45f1/html5/thumbnails/116.jpg)
How to build a Data Team
![Page 117: Workshop Data Manager](https://reader031.vdocuments.us/reader031/viewer/2022021922/58eef13a1a28abf1298b45f1/html5/thumbnails/117.jpg)
Data Scientist
![Page 118: Workshop Data Manager](https://reader031.vdocuments.us/reader031/viewer/2022021922/58eef13a1a28abf1298b45f1/html5/thumbnails/118.jpg)
Data Scientist Confusion
![Page 119: Workshop Data Manager](https://reader031.vdocuments.us/reader031/viewer/2022021922/58eef13a1a28abf1298b45f1/html5/thumbnails/119.jpg)
New Ways To Automate
![Page 120: Workshop Data Manager](https://reader031.vdocuments.us/reader031/viewer/2022021922/58eef13a1a28abf1298b45f1/html5/thumbnails/120.jpg)
Data Scientist
BI Analyst
Engineer
Product Manager
Communication Skills Domain Knowledge
![Page 121: Workshop Data Manager](https://reader031.vdocuments.us/reader031/viewer/2022021922/58eef13a1a28abf1298b45f1/html5/thumbnails/121.jpg)
You Learned
image by Mike under Creative Commons
• The Ask is the most Important part - you need Domain Knowledge
• Data Science is NO Rocket Science
• Data is King & there is a Monopoly game happening
• Data Can be misleading
• Data is a Team Sport
![Page 122: Workshop Data Manager](https://reader031.vdocuments.us/reader031/viewer/2022021922/58eef13a1a28abf1298b45f1/html5/thumbnails/122.jpg)
Thank You
![Page 123: Workshop Data Manager](https://reader031.vdocuments.us/reader031/viewer/2022021922/58eef13a1a28abf1298b45f1/html5/thumbnails/123.jpg)
What to MEASURE?
• Error
• Correlation
• Cost
• Privacy
Workbook “Measure” at LutzFinger.com