Building a Data Lake - An App Dev's Perspective
![Page 1: Building a Data Lake - An App Dev's Perspective](https://reader034.vdocuments.us/reader034/viewer/2022051706/58ee5ba91a28ab7d4d8b4607/html5/thumbnails/1.jpg)
Building a Data Lake
An App Dev’s Perspective
GeekNight Hyderabad - March 8th 2017
Geetha Balasundaram
© 2017 ThoughtWorks Technologies Pvt. Limited
![Page 2: Building a Data Lake - An App Dev's Perspective](https://reader034.vdocuments.us/reader034/viewer/2022051706/58ee5ba91a28ab7d4d8b4607/html5/thumbnails/2.jpg)
ABOUT ME
Developer @ ThoughtWorks
Building a data lake in the enterprise ecosystem
Helping a retail business make sense of its data (data-guided org)
Previously worked in the web development space (enterprise rewrite)
As startled as everyone else by the data engineering space
Sharing know-hows and do-hows from our team’s experience
![Page 3: Building a Data Lake - An App Dev's Perspective](https://reader034.vdocuments.us/reader034/viewer/2022051706/58ee5ba91a28ab7d4d8b4607/html5/thumbnails/3.jpg)
AGENDA
What is data in the true sense…
Data Warehouse in an enterprise ecosystem...
What is a data lake...
Data lake implementation in an enterprise ecosystem…
How to make effective use of a data lake: technology + process + people
Cluster Administration tool - Cloudera Manager
Pitfalls to avoid
![Page 4: Building a Data Lake - An App Dev's Perspective](https://reader034.vdocuments.us/reader034/viewer/2022051706/58ee5ba91a28ab7d4d8b4607/html5/thumbnails/4.jpg)
Question ???
How did R.Ashwin perform in the last Test match?
HIGH LEVEL
PROBLEM STATEMENT
![Page 5: Building a Data Lake - An App Dev's Perspective](https://reader034.vdocuments.us/reader034/viewer/2022051706/58ee5ba91a28ab7d4d8b4607/html5/thumbnails/5.jpg)
COMPLEX HISTORICAL DATA
Why?
Exploit the data and derive as many new insights as possible
Match Made
Enterprise systems produce this kind of complexity
![Page 6: Building a Data Lake - An App Dev's Perspective](https://reader034.vdocuments.us/reader034/viewer/2022051706/58ee5ba91a28ab7d4d8b4607/html5/thumbnails/6.jpg)
DATA WAREHOUSE
https://martinfowler.com/articles/microservices.html
ETL
![Page 7: Building a Data Lake - An App Dev's Perspective](https://reader034.vdocuments.us/reader034/viewer/2022051706/58ee5ba91a28ab7d4d8b4607/html5/thumbnails/7.jpg)
DID MICROSERVICES CAUSE THIS PROBLEM ?
Decentralised Data
https://martinfowler.com/articles/microservices.html
![Page 8: Building a Data Lake - An App Dev's Perspective](https://reader034.vdocuments.us/reader034/viewer/2022051706/58ee5ba91a28ab7d4d8b4607/html5/thumbnails/8.jpg)
MICROSERVICES HELPED
Break down business units
Break down complexity
Understand the nature of data
![Page 9: Building a Data Lake - An App Dev's Perspective](https://reader034.vdocuments.us/reader034/viewer/2022051706/58ee5ba91a28ab7d4d8b4607/html5/thumbnails/9.jpg)
Question ???
R.Ashwin performed well ( 6/41 ) in yesterday’s match!
Complex historical data can quantify how well he has performed
Can we say why he did well in this particular match? What factors contributed to his improved performance?
![Page 10: Building a Data Lake - An App Dev's Perspective](https://reader034.vdocuments.us/reader034/viewer/2022051706/58ee5ba91a28ab7d4d8b4607/html5/thumbnails/10.jpg)
FACT is a FACT
… even when we don’t know how it can be used
![Page 11: Building a Data Lake - An App Dev's Perspective](https://reader034.vdocuments.us/reader034/viewer/2022051706/58ee5ba91a28ab7d4d8b4607/html5/thumbnails/11.jpg)
KEY DIFFERENCE
https://martinfowler.com/bliki/DataLake.html
![Page 12: Building a Data Lake - An App Dev's Perspective](https://reader034.vdocuments.us/reader034/viewer/2022051706/58ee5ba91a28ab7d4d8b4607/html5/thumbnails/12.jpg)
What is a data lake?
![Page 13: Building a Data Lake - An App Dev's Perspective](https://reader034.vdocuments.us/reader034/viewer/2022051706/58ee5ba91a28ab7d4d8b4607/html5/thumbnails/13.jpg)
LAKE is...
... a large body of water in a more natural state.
The contents of the lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.
https://jamesdixon.wordpress.com/2010/10/14/pentaho-hadoop-and-data-lakes/
![Page 14: Building a Data Lake - An App Dev's Perspective](https://reader034.vdocuments.us/reader034/viewer/2022051706/58ee5ba91a28ab7d4d8b4607/html5/thumbnails/14.jpg)
DATA LAKE is...
... a large body of data facts in a more natural state.
The contents of the lake stream in from a source to fill the lake, and various users of the lake can come to analyse, build models, or use a subset for specific use cases.
![Page 15: Building a Data Lake - An App Dev's Perspective](https://reader034.vdocuments.us/reader034/viewer/2022051706/58ee5ba91a28ab7d4d8b4607/html5/thumbnails/15.jpg)
KEY DIFFERENCE
https://martinfowler.com/bliki/DataLake.html
![Page 16: Building a Data Lake - An App Dev's Perspective](https://reader034.vdocuments.us/reader034/viewer/2022051706/58ee5ba91a28ab7d4d8b4607/html5/thumbnails/16.jpg)
Implementation
![Page 17: Building a Data Lake - An App Dev's Perspective](https://reader034.vdocuments.us/reader034/viewer/2022051706/58ee5ba91a28ab7d4d8b4607/html5/thumbnails/17.jpg)
OUR IMPLEMENTATION - TECH STACK
DATA SOURCE
DATA INGESTION
DATA LAKE
DATA MARTS
DATA ANALYSIS
Staging / Queue
![Page 18: Building a Data Lake - An App Dev's Perspective](https://reader034.vdocuments.us/reader034/viewer/2022051706/58ee5ba91a28ab7d4d8b4607/html5/thumbnails/18.jpg)
![Page 19: Building a Data Lake - An App Dev's Perspective](https://reader034.vdocuments.us/reader034/viewer/2022051706/58ee5ba91a28ab7d4d8b4607/html5/thumbnails/19.jpg)
How to make effective use of a data lake: technology + process + people
![Page 20: Building a Data Lake - An App Dev's Perspective](https://reader034.vdocuments.us/reader034/viewer/2022051706/58ee5ba91a28ab7d4d8b4607/html5/thumbnails/20.jpg)
Functionality Vs Reality
I need a feature so that I can do this action…
to
I need this insight so that I can take this action…
e.g. I need functionality to order items any time before or during a promotion…
to
I need to know in time whether I have to order items before or during a promotion, so that I can improve promotion sales
People
![Page 21: Building a Data Lake - An App Dev's Perspective](https://reader034.vdocuments.us/reader034/viewer/2022051706/58ee5ba91a28ab7d4d8b4607/html5/thumbnails/21.jpg)
Start Simple
There is no data lake yet…
Carve out portions of data that are easy wins yet critical to arriving at the insight stated earlier.
Set up the infrastructure and pipeline
Get your hands dirty..
e.g. Sales is an important factor when analysing or predicting anything in the retail space.
Technology
![Page 22: Building a Data Lake - An App Dev's Perspective](https://reader034.vdocuments.us/reader034/viewer/2022051706/58ee5ba91a28ab7d4d8b4607/html5/thumbnails/22.jpg)
How much should I know about the data?
As a consumer of data (read: not a consumer of a service), how much should I know about it?
Schema ⇔ Contracts
Nature of the data:
versioned vs latest
transactional vs reference
facts vs aggregates
frequency of change
…
Technology
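The "schema as contract" idea and the nature-of-data questions above can be sketched in code. This is a minimal illustration, not the team's implementation: the field names in `SALES_SCHEMA`, the `DatasetProfile` class and the `contract_violations` helper are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical contract for a sales feed; the field names are assumptions.
SALES_SCHEMA = {"store_id": str, "item_id": str, "sold_on": date, "quantity": int}


@dataclass(frozen=True)
class DatasetProfile:
    """Notes a data consumer might keep about the nature of a dataset."""
    name: str
    versioned: bool        # full version history vs latest snapshot only
    kind: str              # "transactional" or "reference"
    grain: str             # "fact" or "aggregate"
    change_frequency: str  # e.g. "hourly", "daily"


def contract_violations(record: dict, schema: dict) -> list:
    """Return readable violations; an empty list means the record conforms."""
    problems = []
    for field, expected_type in schema.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(f"{field}: expected {expected_type.__name__}")
    for field in sorted(record.keys() - schema.keys()):
        problems.append(f"unexpected field: {field}")
    return problems


sales_profile = DatasetProfile(
    name="sales", versioned=False, kind="transactional",
    grain="fact", change_frequency="daily",
)
good = {"store_id": "S1", "item_id": "I9", "sold_on": date(2017, 3, 1), "quantity": 4}
bad = {"store_id": "S1", "item_id": "I9", "quantity": "4"}
print(contract_violations(good, SALES_SCHEMA))
print(contract_violations(bad, SALES_SCHEMA))
```

Keeping the profile next to the schema is one way to record what the consumer has learned about the data; a real pipeline would more likely rely on a schema registry or the file format's own schema support.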
![Page 23: Building a Data Lake - An App Dev's Perspective](https://reader034.vdocuments.us/reader034/viewer/2022051706/58ee5ba91a28ab7d4d8b4607/html5/thumbnails/23.jpg)
DATA INSIGHT - Part 1
Incrementally add new data to the lake
Serve data for analysis
e.g. What promotions-related data do I need to bring into the data lake?
Sales → improve promotion sales
Technology
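As a rough sketch of "incrementally add, then serve", the snippet below models the lake as an in-memory dictionary of date partitions. The record fields (`txn_id`, `sold_on`, `item_id`, `quantity`) and the functions `ingest` and `serve_item` are hypothetical; a real lake would sit on distributed storage rather than a Python dict.

```python
from collections import defaultdict

# The "lake": one partition per sale date, each keyed by transaction id.
lake = defaultdict(dict)


def ingest(lake, batch):
    """Append records not seen before to their date partition; return the count added."""
    added = 0
    for record in batch:
        partition = lake[record["sold_on"]]
        if record["txn_id"] not in partition:
            partition[record["txn_id"]] = record  # existing facts are never overwritten
            added += 1
    return added


def serve_item(lake, item_id):
    """Serve the subset of facts for one item, in date order, for analysis."""
    return [
        record
        for day in sorted(lake)
        for record in lake[day].values()
        if record["item_id"] == item_id
    ]


first = ingest(lake, [
    {"txn_id": "T1", "sold_on": "2017-03-01", "item_id": "I1", "quantity": 2},
    {"txn_id": "T2", "sold_on": "2017-03-01", "item_id": "I2", "quantity": 1},
])
second = ingest(lake, [
    {"txn_id": "T2", "sold_on": "2017-03-01", "item_id": "I2", "quantity": 1},  # replayed
    {"txn_id": "T3", "sold_on": "2017-03-02", "item_id": "I1", "quantity": 5},
])
print(first, second, [r["quantity"] for r in serve_item(lake, "I1")])
```

Making ingestion idempotent, so that a replayed batch adds nothing, is a design choice assumed here; it keeps reruns of a failed load from duplicating facts.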
![Page 24: Building a Data Lake - An App Dev's Perspective](https://reader034.vdocuments.us/reader034/viewer/2022051706/58ee5ba91a28ab7d4d8b4607/html5/thumbnails/24.jpg)
DATA INSIGHT - Part 2
Sales + Promotions → improve promotion sales
How does adding more data to the lake help us arrive at new insights?
history of past promotion sales = how much to order for this promotion
history of past promotion sales + ‘X’ = how much to order for this promotion
history of past promotion sales + ‘X’ + ‘Y’ …… = how much to order for this promotion
eg: seasonality has a strong correlation with sales
history of past promotion sales + ‘X’ + ‘Y’ …… + ‘A’ = how much to order for this promotion after the start
People
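To make the "history + ‘X’" progression concrete, here is a toy sketch in which the estimate is simply the average of comparable past promotions, and each extra factor narrows what counts as comparable. The figures in `PAST_PROMOTIONS` are invented for illustration, and `estimate_order` is a hypothetical helper, not a forecasting method taken from the talk.

```python
from statistics import mean

# Invented history of past promotions: units sold and the season each ran in.
PAST_PROMOTIONS = [
    {"units_sold": 100, "season": "summer"},
    {"units_sold": 120, "season": "summer"},
    {"units_sold": 60, "season": "winter"},
    {"units_sold": 80, "season": "winter"},
]


def estimate_order(history, **factors):
    """Average units sold across past promotions that match every given factor."""
    matching = [
        promo for promo in history
        if all(promo.get(name) == value for name, value in factors.items())
    ]
    if not matching:
        raise ValueError("no comparable past promotions")
    return mean(promo["units_sold"] for promo in matching)


baseline = estimate_order(PAST_PROMOTIONS)                        # history only
summer_estimate = estimate_order(PAST_PROMOTIONS, season="summer")  # history + 'X'
winter_estimate = estimate_order(PAST_PROMOTIONS, season="winter")
print(baseline, summer_estimate, winter_estimate)
```

With history alone every promotion gets the same estimate; once seasonality is in the lake, the summer and winter estimates diverge, which is the kind of new insight the slide describes.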
![Page 25: Building a Data Lake - An App Dev's Perspective](https://reader034.vdocuments.us/reader034/viewer/2022051706/58ee5ba91a28ab7d4d8b4607/html5/thumbnails/25.jpg)
Think Agile
Sales + Promotions + X factor → improve promotion sales
Near perfect list of parameters
Progressive set of parameters
Sales + Promotions → is the quantity derived from these factors (known to the business) ordered on time?
Process
![Page 26: Building a Data Lake - An App Dev's Perspective](https://reader034.vdocuments.us/reader034/viewer/2022051706/58ee5ba91a28ab7d4d8b4607/html5/thumbnails/26.jpg)
DataMarts
... as a store of bottled water – cleansed and packaged and
structured for easy consumption
![Page 27: Building a Data Lake - An App Dev's Perspective](https://reader034.vdocuments.us/reader034/viewer/2022051706/58ee5ba91a28ab7d4d8b4607/html5/thumbnails/27.jpg)
DataMarts
... as a store of data subsets, curated from meaningful facts and bundled into logical groups for arriving at useful insights
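One way to picture a data mart is as a cleansed, grouped view derived from the raw facts in the lake. The sketch below assumes hypothetical record fields (`type`, `promotion_id`, `item_id`, `quantity`) and a hypothetical `build_promotion_sales_mart` function; the cleansing rules are illustrative only.

```python
from collections import defaultdict

# Invented raw facts as they might land in the lake, including records that do not belong.
RAW_FACTS = [
    {"type": "sale", "promotion_id": "P1", "item_id": "I1", "quantity": 3},
    {"type": "sale", "promotion_id": "P1", "item_id": "I1", "quantity": 2},
    {"type": "sale", "promotion_id": None, "item_id": "I1", "quantity": 7},   # not a promotion sale
    {"type": "sale", "promotion_id": "P1", "item_id": "I2", "quantity": -1},  # malformed quantity
    {"type": "stock_check", "item_id": "I1", "quantity": 40},                 # different kind of fact
]


def build_promotion_sales_mart(facts):
    """Keep well-formed promotion sales and total the units per (promotion, item)."""
    totals = defaultdict(int)
    for fact in facts:
        if fact.get("type") != "sale" or fact.get("promotion_id") is None:
            continue  # curate: only promotion sales belong in this mart
        quantity = fact.get("quantity")
        if not isinstance(quantity, int) or quantity <= 0:
            continue  # cleanse: drop records with unusable quantities
        totals[(fact["promotion_id"], fact["item_id"])] += quantity
    return dict(totals)


promotion_sales_mart = build_promotion_sales_mart(RAW_FACTS)
print(promotion_sales_mart)
```

The raw facts stay untouched in the lake; the mart is a derived view that can be rebuilt whenever the curation rules change.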
![Page 28: Building a Data Lake - An App Dev's Perspective](https://reader034.vdocuments.us/reader034/viewer/2022051706/58ee5ba91a28ab7d4d8b4607/html5/thumbnails/28.jpg)
Easy Insight
Sales + Promotions → is the quantity derived from these factors (known to the business) ordered on time?
System: tells me what quantity is supposed to be ordered for this promotion
System: tells me in real time what quantity has been ordered
Technology
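The comparison between the two system answers can be sketched as a small check. The quantities below are invented, and `order_shortfalls` is a hypothetical helper; in practice the expected figures would come from the data mart and the ordered figures from a live feed.

```python
def order_shortfalls(expected, ordered):
    """Return items ordered below the expected quantity, mapped to the missing units."""
    shortfalls = {}
    for item_id, expected_qty in expected.items():
        ordered_qty = ordered.get(item_id, 0)  # nothing ordered yet counts as zero
        if ordered_qty < expected_qty:
            shortfalls[item_id] = expected_qty - ordered_qty
    return shortfalls


# Invented figures: what should be ordered for the promotion vs what has been ordered so far.
expected_for_promotion = {"I1": 110, "I2": 40, "I3": 25}
ordered_so_far = {"I1": 110, "I2": 30}

gaps = order_shortfalls(expected_for_promotion, ordered_so_far)
print(gaps)
```

Running a check like this while the promotion is still open is what turns the two figures into an actionable insight: the business sees which items still need ordering before it is too late.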
![Page 29: Building a Data Lake - An App Dev's Perspective](https://reader034.vdocuments.us/reader034/viewer/2022051706/58ee5ba91a28ab7d4d8b4607/html5/thumbnails/29.jpg)
Cluster Administration Tool
Cloudera Manager
![Page 30: Building a Data Lake - An App Dev's Perspective](https://reader034.vdocuments.us/reader034/viewer/2022051706/58ee5ba91a28ab7d4d8b4607/html5/thumbnails/30.jpg)
Think DevOps
Scale | Performance | Memory | Resource Contention | Optimization | Stability
Need for an ecosystem to monitor how well the different tools play together without chaos
Tools
![Page 31: Building a Data Lake - An App Dev's Perspective](https://reader034.vdocuments.us/reader034/viewer/2022051706/58ee5ba91a28ab7d4d8b4607/html5/thumbnails/31.jpg)
QUICK RECAP
What is data in the true sense…
Data Warehouse in an enterprise ecosystem...
What is a data lake...
Data lake implementation in an enterprise ecosystem...
How to make effective use of a data lake…
Cluster Administration tool - Cloudera Manager
![Page 32: Building a Data Lake - An App Dev's Perspective](https://reader034.vdocuments.us/reader034/viewer/2022051706/58ee5ba91a28ab7d4d8b4607/html5/thumbnails/32.jpg)
PITFALLS TO AVOID
Data envy - Ref: https://martinfowler.com/bliki/Datensparsamkeit.html
Tool envy
Reliable data is a luxury
Understanding the nature of data is a must
Dialogue with the data scientist
Treating the data lake like an RDBMS
Keeping the business involved
Data flow state visibility