fountain of youth or polluted swamp: is your data lake revitalizing your business or eroding the...
Post on 20-Mar-2017
1
ZILLOW | TRULIA | STREETEASY | HOTPADS | NAKED APARTMENTS
Vincent Yates, Director of Analytics Engineering, @VincentYates8
FOUNTAIN OF YOUTH OR POLLUTED SWAMP: IS YOUR DATA LAKE REVITALIZING YOUR BUSINESS OR ERODING THE FOUNDATION?
2
One of these is worth $42,000 more
Finished sq-ft    2,602      2,602
Lot Size          4,400      5,342
Bathrooms         3          3
Bedrooms          4          4
Year Built        2004       2005
Sale Price        861,000    819,000
3
One of these is worth $164,000 more
Finished sq-ft    1,620      1,620
Lot Size          1,620      1,620
Bathrooms         2.5        3
Bedrooms          3          3
Year Built        2007       2007
Sale Price        499,000    663,000
4
One of these is worth >$10M annually
http://www.exp-platform.com/Pages/SevenRulesofThumbforWebSiteExperimenters.aspx
5
DATA SCIENCE’S DIRTY LITTLE SECRET
6
$3.1 TRILLION
IBM Big Data Hub
7
Unknowns ≠ Seasonality
[Figure: chart with repeated "Seasonality" annotations]
8
Seriously
DATA SCIENCE IS HARD
9
Product & Communication
Programming
Statistics
10
24% of data scientists
UNSURE OF HOW MUCH OF THEIR DATA ARE INACCURATE
IBM Big Data Hub
11
Errors Propagate in Dynamic Ways
12
13
66% of data scientists
CLEANING DATA IS THE MOST TIME CONSUMING TASK
CrowdFlower 2015 Data Science Report
14
My data is pretty good.
DOES IT REALLY MATTER?
15
52.3% of data scientists
POOR DATA QUALITY IS THEIR BIGGEST HURDLE
CrowdFlower 2015 Data Science Report
16
The cost of poor data quality
15-25% OF OPERATING PROFIT
Data Quality: The Accuracy Dimension (Morgan Kaufmann)
17
Someone would have noticed and fixed it
HOW DID WE GET HERE?
18
Cracks start to show under pressure
Data Quality: The Accuracy Dimension, The Morgan Kaufmann Series in Data Management Systems
Operational
Integration
Replication
19
Complexity/Agility is the scapegoat
Transaction applications, APIs, third-party data producers
Transaction databases
Data Marts
Data Lake
25
Moral Hazard is the culprit
26
HOW DO WE GET OUT?
A few simple tricks to head in the right direction
27
PROACTIVE NOT REACTIVE
Data scientists are not great under duress
28
Get Back to Raw Data
29
Centralize Definitions
30
Model Where Possible
31
MODELING IS HARD
Build tools to make reactive work easier
32
33
34
Data Problems are as Old as Data
35
Many mistakes are required for catastrophe
• Climate caused more icebergs
  – Ignored forecasts
• Tides sent icebergs southward
  – Poor/wrong measurement
• The ship was going too fast
  – Business needs over best data
• Iceberg warnings went unheeded
  – Data was disregarded for intuition
• The binoculars were locked up
  – Tools were behind lock and key
• The steersman took a wrong turn
  – Reactive action under stress led to wrong decisions
• The iron rivets were too weak
  – Cost savings over best data
• There were too few lifeboats
  – Marketing owned the message
http://cosmiclog.nbcnews.com/_news/2012/04/01/10970732-10-causes-of-the-titanic-tragedy
36
VincentYa@zillowgroup.com
@VincentYates8
THANK YOU!