fountain of youth or polluted swamp: is your data lake revitalizing your business or eroding the...

Post on 20-Mar-2017

1

ZILLOW | TRULIA | STREETEASY | HOTPADS | NAKED APARTMENTS

Vincent Yates, Director of Analytics Engineering

@VincentYates8

FOUNTAIN OF YOUTH OR POLLUTED SWAMP: IS YOUR DATA LAKE REVITALIZING YOUR BUSINESS OR ERODING THE FOUNDATION?

2

One of these is worth $42,000 more

Finished sq-ft    2,602      2,602
Lot Size          4,400      5,342
Bathrooms         3          3
Bedrooms          4          4
Year Built        2004       2005
Sale Price        861,000    819,000

3

One of these is worth $164,000 more

Finished sq-ft    1,620      1,620
Lot Size          1,620      1,620
Bathrooms         2.5        3
Bedrooms          3          3
Year Built        2007       2007
Sale Price        499,000    663,000

4

One of these is worth >$10M annually

http://www.exp-platform.com/Pages/SevenRulesofThumbforWebSiteExperimenters.aspx

5

DATA SCIENCE’S DIRTY LITTLE SECRET

6

$3.1 TRILLION

IBM Big Data Hub

7

Unknowns ≠ Seasonality

[Figure: chart with repeated "Seasonality" labels]

8

Seriously

DATA SCIENCE IS HARD

9

Product & Communication

Programming

Statistics

10

24% of data scientists

UNSURE OF HOW MUCH OF THEIR DATA ARE INACCURATE

IBM Big Data Hub

11

Errors Propagate in Dynamic Ways

12

13

66% of data scientists

CLEANING DATA IS THE MOST TIME CONSUMING TASK

CrowdFlower 2015 Data Science Report

14

My data is pretty good.

DOES IT REALLY MATTER?

15

52.3% of data scientists

POOR DATA QUALITY IS THEIR BIGGEST HURDLE

CrowdFlower 2015 Data Science Report

16

The cost of poor data quality

15-25% OF OPERATING PROFIT

Data Quality: The Accuracy Dimension (Morgan Kaufmann)

17

Someone would have noticed and fixed it

HOW DID WE GET HERE?

18

Cracks start to show under pressure

Data Quality: The Accuracy Dimension

The Morgan Kaufmann Series in Data Management Systems

Operational, Integration, Replication

19

Complexity/Agility is the scapegoat

Transaction applications, APIs, Third-party data producers

Transaction databases

Data Marts

Data Lake

25

Moral Hazard is the culprit

26

HOW DO WE GET OUT?

A few simple tricks to head in the right direction

27

PROACTIVE NOT REACTIVE

Data scientists are not great under duress

28

Get Back to Raw Data

29

Centralize Definitions
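
A minimal sketch of what centralizing definitions could look like, assuming Python. The `Listing` record, the `active_listing` rule, and the helper names are all hypothetical and do not come from the talk. The point is only that each business definition lives in one registry, and every report looks it up there instead of re-implementing the filter.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


# Hypothetical record type; the field names are illustrative assumptions.
@dataclass(frozen=True)
class Listing:
    listing_id: int
    status: str
    price: float


# Single registry of named definitions (assumed here to be predicates).
DEFINITIONS: Dict[str, Callable[[Listing], bool]] = {}


def define(name: str):
    """Register a predicate under one shared name; reject duplicates."""
    def decorator(fn: Callable[[Listing], bool]) -> Callable[[Listing], bool]:
        if name in DEFINITIONS:
            raise ValueError(f"definition {name!r} already exists")
        DEFINITIONS[name] = fn
        return fn
    return decorator


@define("active_listing")
def _active_listing(listing: Listing) -> bool:
    # The one agreed-upon meaning of "active" (an illustrative rule only).
    return listing.status == "for_sale" and listing.price > 0


def count_matching(name: str, listings: List[Listing]) -> int:
    """Count the records that satisfy the centrally defined predicate."""
    predicate = DEFINITIONS[name]
    return sum(1 for item in listings if predicate(item))


listings = [
    Listing(1, "for_sale", 499_000.0),
    Listing(2, "sold", 663_000.0),
    Listing(3, "for_sale", 0.0),
]
active_count = count_matching("active_listing", listings)
```

Rejecting a second registration of the same name is a deliberate choice in this sketch: two teams cannot silently ship two different meanings of "active".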

30

Model Where Possible
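
One possible reading of this slide is to model what healthy data should look like and flag departures before anyone downstream is affected. The sketch below assumes Python and a hypothetical daily row count for a table; the z-score rule and the threshold of 3 are illustrative choices, not the speaker's method.

```python
from statistics import mean, stdev
from typing import Sequence


def is_anomalous(history: Sequence[float], value: float,
                 z_threshold: float = 3.0) -> bool:
    """Return True if `value` lies more than `z_threshold` sample
    standard deviations away from the mean of `history`."""
    if len(history) < 2:
        raise ValueError("need at least two historical points")
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # A perfectly flat history: any change at all is unexpected.
        return value != mu
    return abs(value - mu) / sigma > z_threshold


# Hypothetical daily row counts for one table (made-up numbers).
daily_row_counts = [1000, 1020, 980, 1010, 990, 1005, 995]

typical_day = is_anomalous(daily_row_counts, 1008)
broken_load = is_anomalous(daily_row_counts, 400)
```

A plain z-score ignores trend and seasonality, so a production check would need to account for both before raising an alert.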

31

MODELING IS HARD

Build tools to make reactive work easier

32

33

34

Data Problems are as Old as Data

35

Many mistakes are required for catastrophe

• Climate caused more icebergs
  – Ignored forecasts
• Tides sent icebergs southward
  – Poor/wrong measurement
• The ship was going too fast
  – Business needs over best data
• Iceberg warnings went unheeded
  – Data was disregarded for intuition
• The binoculars were locked up
  – Tools were behind lock and key
• The steersman took a wrong turn
  – Reactive action under stress led to wrong decisions
• The iron rivets were too weak
  – Cost savings over best data
• There were too few lifeboats
  – Marketing owned the message

http://cosmiclog.nbcnews.com/_news/2012/04/01/10970732-10-causes-of-the-titanic-tragedy

36

VincentYa@zillowgroup.com

@VincentYates8

THANK YOU!
