building a data lake - an app dev's perspective

Post on 12-Apr-2017

Building a Data Lake: An App Dev’s Perspective

GeekNight Hyderabad - March 8th 2017

Geetha Balasundaram

geethab@thoughtworks.com

© 2017 ThoughtWorks Technologies Pvt. Limited

ABOUT ME

Developer @ ThoughtWorks

Building a data lake in the enterprise ecosystem

Helping a retail business make sense of it (data-guided org)

Previously part of the web development space (enterprise rewrite)

As startled as everyone else by the data engineering space

Sharing know-how and do-how from our team’s experience

snithish@thoughtworks.com

AGENDA

What is data in the true sense…

Data Warehouse in an enterprise ecosystem...

What is a data lake...

Data lake implementation in an enterprise ecosystem…

How to make effective use of a data lake: technology+process+people

Cluster Administration tool - Cloudera Manager

Pitfalls to avoid

Question?

How did R. Ashwin perform in the last Test match?

HIGH LEVEL

PROBLEM STATEMENT

COMPLEX HISTORICAL DATA

Why?

Exploit the data and derive as many new insights as possible

Match Made

Enterprise systems produce this kind of complexity

DATA WAREHOUSE

https://martinfowler.com/articles/microservices.html

ETL

DID MICROSERVICES CAUSE THIS PROBLEM?

Decentralised Data

https://martinfowler.com/articles/microservices.html

MICROSERVICES HELPED

Break down business units

Break down complexity

Understand the nature of data

Question?

R. Ashwin performed well (6/41) in yesterday’s match!

Complex historical data can quantify how well he has performed

Can we say why he did well in this particular match? What factors contributed to his improved performance?

FACT is a FACT

… even when we don’t know how it can be used

KEY DIFFERENCE

https://martinfowler.com/bliki/DataLake.html

What is a data lake?

LAKE is...

... a large body of water in a more natural state. The contents of the lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.

https://jamesdixon.wordpress.com/2010/10/14/pentaho-hadoop-and-data-lakes/

DATA LAKE is...

... a large body of data facts in a more natural state. The contents of the lake stream in from a source to fill the lake, and various users of the lake can come to analyse, build models, or use a subset for specific use cases.

KEY DIFFERENCE

https://martinfowler.com/bliki/DataLake.html

Implementation

OUR IMPLEMENTATION - TECH STACK

DATA SOURCE

DATA INGESTION

DATA LAKE

DATA MARTS

DATA ANALYSIS

Staging / Queue

How to make effective use of a data lake:

technology+process+people

Functionality Vs Reality

I need a feature so that I can do this action…

to

I need this insight so that I can take this action…

e.g. I need functionality to order items any time before or during a promotion…

to

…I need to know in time whether I have to order items any time before or during a promotion…

so that I can improve promotion sales

People

Start Simple

There is no data lake yet…

Carve out portions of data that are easy wins yet critical to arriving at the insight stated earlier

Set up the infrastructure and pipeline

Get your hands dirty

e.g. sales is an important factor when analysing or predicting anything in the retail space

Technology

How much should I know about the data?

As a consumer of data (read ‘not a consumer of service’)

How much should I know about it?

Schema ⇔ Contracts

Nature of the data:

versioned vs latest

transactional vs reference

facts vs aggregate

frequency of change

…..

Technology
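
One way to make the "Schema ⇔ Contracts" idea concrete is a small descriptor that travels with each dataset and records both its schema and its nature. The sketch below is illustrative only: `DatasetContract`, its fields, and the sample `sales` dataset are hypothetical names, not part of the implementation described in this talk.

```python
from dataclasses import dataclass
from enum import Enum


class Versioning(Enum):
    VERSIONED = "versioned"   # every change is kept as a new record
    LATEST = "latest"         # only the current state is available


class Kind(Enum):
    TRANSACTIONAL = "transactional"
    REFERENCE = "reference"


class Grain(Enum):
    FACT = "fact"
    AGGREGATE = "aggregate"


@dataclass(frozen=True)
class DatasetContract:
    """Hypothetical descriptor of what a data consumer needs to know."""
    name: str
    schema: dict            # column name -> expected Python type
    versioning: Versioning
    kind: Kind
    grain: Grain
    change_frequency: str   # free text, e.g. "per transaction" or "daily"

    def violations(self, record: dict) -> list:
        """Return readable problems; an empty list means the record conforms."""
        problems = []
        for column, expected in self.schema.items():
            if column not in record:
                problems.append(f"missing column: {column}")
            elif not isinstance(record[column], expected):
                actual = type(record[column]).__name__
                problems.append(f"{column}: expected {expected.__name__}, got {actual}")
        return problems


# Assumed example dataset; column names are made up for illustration.
sales_contract = DatasetContract(
    name="sales",
    schema={"store_id": str, "item_id": str, "quantity": int, "sold_at": str},
    versioning=Versioning.VERSIONED,
    kind=Kind.TRANSACTIONAL,
    grain=Grain.FACT,
    change_frequency="per transaction",
)
```

Calling `sales_contract.violations(record)` on incoming records gives the consumer an early signal when a source changes its schema, which a service consumer would normally get from an API contract.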

DATA INSIGHT - Part 1

Incrementally add new data to the lake

Serve data for analysis

e.g. What data about promotions do I need to bring into the data lake?

Sales → improve promotion sales

Technology
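
As a rough illustration of "incrementally add, then serve", the sketch below lands each batch in its own date partition without touching older data and reads back a date range for analysis. The folder layout (`date=YYYY-MM-DD`), the JSON-lines format, and the function names `land_batch` and `read_range` are assumptions made for this example; the talk does not specify a storage format.

```python
import json
import tempfile
from pathlib import Path


def land_batch(lake_root: Path, dataset: str, batch_date: str, records: list) -> Path:
    """Write one batch under <lake_root>/<dataset>/date=<batch_date>/ as a new file."""
    partition = lake_root / dataset / f"date={batch_date}"
    partition.mkdir(parents=True, exist_ok=True)
    # Number the files so a second delivery for the same day is added, not overwritten.
    existing = len(list(partition.glob("part-*.jsonl")))
    target = partition / f"part-{existing:05d}.jsonl"
    with target.open("w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record) + "\n")
    return target


def read_range(lake_root: Path, dataset: str, start: str, end: str) -> list:
    """Serve all records whose partition date lies in [start, end] (ISO dates sort as strings)."""
    rows = []
    for partition in sorted((lake_root / dataset).glob("date=*")):
        day = partition.name.split("=", 1)[1]
        if start <= day <= end:
            for part in sorted(partition.glob("part-*.jsonl")):
                with part.open(encoding="utf-8") as handle:
                    rows.extend(json.loads(line) for line in handle)
    return rows


# Hypothetical sales batches landing in a temporary directory that stands in for the lake.
lake = Path(tempfile.mkdtemp())
land_batch(lake, "sales", "2017-03-06", [{"item": "A", "qty": 4}])
land_batch(lake, "sales", "2017-03-07", [{"item": "A", "qty": 6}, {"item": "B", "qty": 1}])
land_batch(lake, "sales", "2017-03-07", [{"item": "C", "qty": 2}])  # late delivery, same day
recent = read_range(lake, "sales", "2017-03-07", "2017-03-07")
```

Keeping landed files immutable fits the "fact is a fact" idea: new data is only ever added, and readers choose the slice they need.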

DATA INSIGHT - Part 2

Sales + Promotions → improve promotion sales

How does adding more data to the lake help in arriving at new insights?

history of past promotion sales = how much to order for this promotion

history of past promotion sales + ‘X’ = how much to order for this promotion

history of past promotion sales + ‘X’ + ‘Y’ …… = how much to order for this promotion

eg: seasonality has a strong correlation with sales

history of past promotion sales + ‘X’ + ‘Y’ …… + ‘A’ = how much to order for this promotion after the start

People
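
The progression from "history" to "history + ‘X’" can be sketched with a toy estimate. Everything below is an assumption made for illustration: the figures are invented, season stands in for ‘X’, and a plain average stands in for whatever model a data scientist would actually build.

```python
from statistics import mean

# Invented history: units sold during past promotions of one item.
past_promotions = [
    {"units": 100, "season": "summer"},
    {"units": 140, "season": "winter"},
    {"units": 120, "season": "summer"},
    {"units": 160, "season": "winter"},
]


def order_from_history(history: list) -> float:
    """Parameter set 1: past promotion sales only."""
    return mean([p["units"] for p in history])


def order_with_season(history: list, season: str) -> float:
    """Parameter set 2: past promotion sales + 'X', with X = season of the upcoming promotion."""
    same_season = [p["units"] for p in history if p["season"] == season]
    # Fall back to the plain average when this season has never been observed.
    return mean(same_season) if same_season else order_from_history(history)


baseline = order_from_history(past_promotions)
winter_estimate = order_with_season(past_promotions, "winter")
```

With this invented history the baseline is 130 units, while the winter-only estimate is 150. Each new parameter brought into the lake can shift the answer in the same way, which is the argument for adding data progressively.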

Think Agile

Sales + Promotions + X factor → improve promotion sales

Near perfect list of parameters

Progressive set of parameters

Sales + Promotions → is the quantity arrived at from these factors (known to the business) ordered on time?

Process

DataMarts

... as a store of bottled water – cleansed and packaged and structured for easy consumption

DataMarts

... as a store of a data subset - curated from meaningful facts, bundled into logical groups for arriving at useful insights
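
A data mart in this sense can be as small as one curated view derived from raw facts in the lake. The sketch below is a minimal example under assumed names and data: the `sales` and `promotions` records and the function `build_promotion_sales_mart` are hypothetical.

```python
from collections import defaultdict

# Hypothetical raw facts as they might sit in the lake.
sales = [
    {"item": "A", "date": "2017-03-01", "units": 5},
    {"item": "A", "date": "2017-03-02", "units": 7},
    {"item": "B", "date": "2017-03-02", "units": 3},
    {"item": "A", "date": "2017-03-09", "units": 2},  # outside any promotion window
]
promotions = [
    {"promo": "P1", "item": "A", "start": "2017-03-01", "end": "2017-03-03"},
    {"promo": "P2", "item": "B", "start": "2017-03-02", "end": "2017-03-02"},
]


def build_promotion_sales_mart(sales_facts: list, promotion_facts: list) -> dict:
    """Curate a subset: units sold per promotion, counting only sales inside its window."""
    mart = defaultdict(int)
    for promo in promotion_facts:
        mart[promo["promo"]] += 0  # keep promotions with no sales visible in the mart
        for sale in sales_facts:
            in_window = promo["start"] <= sale["date"] <= promo["end"]
            if sale["item"] == promo["item"] and in_window:
                mart[promo["promo"]] += sale["units"]
    return dict(mart)


promotion_sales_mart = build_promotion_sales_mart(sales, promotions)
```

The raw facts stay untouched in the lake; the mart is a derived, purpose-built grouping that can be rebuilt whenever the question changes.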

Easy Insight

Sales + Promotions → is the quantity arrived at from these factors (known to the business) ordered on time?

System: tells me the quantity that is supposed to be ordered for this promotion

System: tells me in real time the quantity that has been ordered

Technology
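
The two "System" statements above amount to comparing an expected quantity with a running total of what has been ordered. A minimal sketch follows, assuming a hypothetical `PromotionOrderCheck` class and invented numbers; a real system would feed `record_order` from a live order stream.

```python
from dataclasses import dataclass


@dataclass
class PromotionOrderCheck:
    """Hypothetical tracker: what should be ordered versus what has been ordered so far."""
    promo: str
    expected_units: int      # quantity derived from Sales + Promotions history
    ordered_units: int = 0   # running total, updated as order events arrive

    def record_order(self, units: int) -> None:
        """Apply one order event."""
        self.ordered_units += units

    @property
    def shortfall(self) -> int:
        """Units still to be ordered; never negative."""
        return max(self.expected_units - self.ordered_units, 0)


check = PromotionOrderCheck(promo="P1", expected_units=150)
check.record_order(90)
shortfall_after_first_order = check.shortfall
check.record_order(70)
shortfall_after_second_order = check.shortfall
```

A non-zero shortfall close to the promotion start is the signal the business can act on.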

Cluster Administration Tool

Cloudera Manager

Think DevOps

Scale | Performance | Memory | Resource Contention | Optimization | Stability

Need for an ecosystem - to monitor how well the different tools play together without chaos

Tools

QUICK RECAP

What is data in the true sense…

Data Warehouse in an enterprise ecosystem...

What is a data lake...

Data lake implementation in an enterprise ecosystem...

How to make effective use of a data lake…

Cluster Administration tool - Cloudera Manager

PITFALLS TO AVOID

Data envy - Ref: https://martinfowler.com/bliki/Datensparsamkeit.html

Tool envy

Reliable data is a luxury

Understanding the nature of data is a must

Dialogue with the data scientist

Treating the data lake like an RDBMS

Keeping the business involved

Data flow state visibility
