Taming data lake - scalable metrics model
![Page 1: Taming data lake - scalable metrics model](https://reader035.vdocuments.us/reader035/viewer/2022062412/58f1e06a1a28ab71568b45fd/html5/thumbnails/1.jpg)
www.globalbigdataconference.com
Twitter: @bigdataconf
![Page 2: Taming data lake - scalable metrics model](https://reader035.vdocuments.us/reader035/viewer/2022062412/58f1e06a1a28ab71568b45fd/html5/thumbnails/2.jpg)
“Taming the Data Lake”
![Page 3: Taming data lake - scalable metrics model](https://reader035.vdocuments.us/reader035/viewer/2022062412/58f1e06a1a28ab71568b45fd/html5/thumbnails/3.jpg)
Intended for Knowledge Sharing only
Disclaimer: Participation in this summit is purely on a personal basis and does not represent Visa in any form or manner. The talk is based on learnings from work across industries and firms. Care has been taken to ensure that no proprietary or work-related information of any firm is used in any material.
RAMKUMAR RAVICHANDRAN
Director, Insights at Visa, Inc. Enables decision making at the Executive/Product/Marketing level via actionable insights derived from data.
BHARATHIRAJA CHANDRASEKHARAN
Data Warehouse Architect at Visa, Inc. Architects a data-shop in Hadoop to get a 360-degree view of the interaction. Technology interface for the Data Stakeholder Community.
![Page 4: Taming data lake - scalable metrics model](https://reader035.vdocuments.us/reader035/viewer/2022062412/58f1e06a1a28ab71568b45fd/html5/thumbnails/4.jpg)
Quick recap of what it is
Data Lakes – the concept
![Page 5: Taming data lake - scalable metrics model](https://reader035.vdocuments.us/reader035/viewer/2022062412/58f1e06a1a28ab71568b45fd/html5/thumbnails/5.jpg)
AS THEY ARE ENVISIONED TODAY…
Source: http://www.tangerine.co.th/tag/how-do-data-lake-work/
![Page 6: Taming data lake - scalable metrics model](https://reader035.vdocuments.us/reader035/viewer/2022062412/58f1e06a1a28ab71568b45fd/html5/thumbnails/6.jpg)
DOES IT RING A BELL?
*Only satire to wake you up, not indicative of anyone or anything; any similarity is purely coincidental!
![Page 7: Taming data lake - scalable metrics model](https://reader035.vdocuments.us/reader035/viewer/2022062412/58f1e06a1a28ab71568b45fd/html5/thumbnails/7.jpg)
& DOES THIS TOO?
*Only satire to wake you up, not indicative of anyone or anything; any similarity is purely coincidental!
![Page 8: Taming data lake - scalable metrics model](https://reader035.vdocuments.us/reader035/viewer/2022062412/58f1e06a1a28ab71568b45fd/html5/thumbnails/8.jpg)
SO WHAT DO WE HEAR FROM OUR USERS?
We often hear these statements in the context of data lakes…
Success criteria were engineering-specific: storage/scalability cost savings, etc.
Expensive Change Management
Complex for the end users to deal with
Analytical performance issues
Data Governance, Lineage and Management complexities
“Although the cost of Storage went down, actual cost of utilizing the data has shot up”
![Page 9: Taming data lake - scalable metrics model](https://reader035.vdocuments.us/reader035/viewer/2022062412/58f1e06a1a28ab71568b45fd/html5/thumbnails/9.jpg)
Taking a step back
![Page 10: Taming data lake - scalable metrics model](https://reader035.vdocuments.us/reader035/viewer/2022062412/58f1e06a1a28ab71568b45fd/html5/thumbnails/10.jpg)
DATA REALLY HAS GOTTEN BIG – VOLUME, VARIETY, VELOCITY & VERACITY
Each of the data sources is critical across all or multiple functions…
…and is consumed as reports, analytical deep-dive insights, forward-looking projections, etc.
DATA SOURCES • TYPES OF INSIGHTS
TRANSACTION DATA • Are overall txns going up/down; where the txns are happening, etc.
CLICK STREAM DATA (MOBILE & WEB) • How are consumers interacting with the website/app: drop-offs, clicks, time spent, etc.
SENTIMENT/SOCIAL DATA • Social media, NPS surveys and media mentions help in gauging true consumer reactions
SERVER LOGS DATA • How are consumers interacting with various functions on the front end?
LOCATION DATA • Are consumers using the product in-store or on the move?
PROMOTIONS DATA • How are consumers reacting to various marketing campaigns?
INDUSTRY DATA • Benchmarking against industry performance
![Page 11: Taming data lake - scalable metrics model](https://reader035.vdocuments.us/reader035/viewer/2022062412/58f1e06a1a28ab71568b45fd/html5/thumbnails/11.jpg)
EVERYONE NEEDS DATA…
BI • How are we doing today?
ANALYTICS • Where will we be tomorrow? What if we do this? What can we do?
A/B TESTING • Did the initiative work?
USER RESEARCH • How do customers feel about us?
STRATEGY • Where should we invest?
![Page 12: Taming data lake - scalable metrics model](https://reader035.vdocuments.us/reader035/viewer/2022062412/58f1e06a1a28ab71568b45fd/html5/thumbnails/12.jpg)
…AND DISTRIBUTED DATA SYSTEMS HAD THEIR OWN ISSUES
Inconsistent (and/or conflicting) definitions of data and numbers
Varying granularities
Multiple methodologies
Different BUs = different KPIs, or the same KPIs with different priorities
Lack of visibility/understanding outside of the BUs
“Slow & inefficient, Non-scalable, Difficulties rolling up, Trust issues, Cascading mistakes”
![Page 13: Taming data lake - scalable metrics model](https://reader035.vdocuments.us/reader035/viewer/2022062412/58f1e06a1a28ab71568b45fd/html5/thumbnails/13.jpg)
AND IT THEN JUST HAPPENED…
DATA SOURCES
TRANSACTION DATA
CLICK STREAM DATA (MOBILE & WEB)
SENTIMENT DATA
SERVER LOGS DATA
LOCATION DATA
CAMPAIGN DATA
INDUSTRY DATA
Source: http://www.adamadiouf.com/2013/03/22/bigdata-vs-enterprise-data-warehouse/
As if all prayers were answered, Hadoop arrived in a big way and, poof, all problems seemed to disappear…
![Page 14: Taming data lake - scalable metrics model](https://reader035.vdocuments.us/reader035/viewer/2022062412/58f1e06a1a28ab71568b45fd/html5/thumbnails/14.jpg)
A stroke of luck, or was it?
![Page 15: Taming data lake - scalable metrics model](https://reader035.vdocuments.us/reader035/viewer/2022062412/58f1e06a1a28ab71568b45fd/html5/thumbnails/15.jpg)
WE FOCUSED ON OUR SPOUSE BUT FORGOT THE IN-LAWS…
[Figure: Maturity phases of Analytics Practice (x-axis) vs. Value Addition (y-axis)]
• Inform: Reports on KPIs with high-level drilldowns
• Act: Deep dives via Business Analytics
• Predict: Identify causal relationships via Advanced Analytics
• Optimize: Experiments to verify which one works via A/B Testing
• Mine: Machine Learning
Percentages shown on the chart: 5%, 50%, 15%, 20%, 10%
The focus was on the 20% of data consumers (Reports), and the assumption was that the other 80% would either love it or at least figure it out…
![Page 16: Taming data lake - scalable metrics model](https://reader035.vdocuments.us/reader035/viewer/2022062412/58f1e06a1a28ab71568b45fd/html5/thumbnails/16.jpg)
HIGH DEVELOPMENT/MODIFICATION COSTS
Rigid Structure and scale of operations make dynamism difficult…
Data Modeling/Schema
ETL; Metadata
Raw Data
![Page 17: Taming data lake - scalable metrics model](https://reader035.vdocuments.us/reader035/viewer/2022062412/58f1e06a1a28ab71568b45fd/html5/thumbnails/17.jpg)
NOT ONLY IS THE AUDIENCE CHANGING…
Stakeholders • Needs
Reports, Insights & Drilldowns
Datamart Documentation
Executives • Reports, High level drilldown, Unified summary, “On the go*”
Marketing & PR • Campaign performance, Infographics, Deep dives, Testing
Sales / RM • Sales performance, Prospecting, Competitive, Infographics
Product • Product performance, Deep dive, Mining, Testing, Research
Technology / AE / Operations • Platform performance, Deep dive, Forecasting, Real time alerting
FP & A • Consolidated Initiative readouts (E2E), Deduping, Drill downs, Forecasting
![Page 18: Taming data lake - scalable metrics model](https://reader035.vdocuments.us/reader035/viewer/2022062412/58f1e06a1a28ab71568b45fd/html5/thumbnails/18.jpg)
…BUT ALSO THE NEEDS ARE EVER CHANGING
“In mail”: Recommendations with supporting graphs, tables, etc.
“Story Deck”: Full deck with the pitch and supporting arguments, numbers, graphs, charts
“On-the-go”: Mobile App, On the Cloud, Subscriptions; Reports, Dashboards, Infographics
Algorithm/Model: Ready to be deployed
How to decide? Customer needs; Turnaround Speed; One time/reuse; Deployment on Front end; Strategic Doc; Quick read/research doc
![Page 19: Taming data lake - scalable metrics model](https://reader035.vdocuments.us/reader035/viewer/2022062412/58f1e06a1a28ab71568b45fd/html5/thumbnails/19.jpg)
Getting to the point – what do we propose?
![Page 20: Taming data lake - scalable metrics model](https://reader035.vdocuments.us/reader035/viewer/2022062412/58f1e06a1a28ab71568b45fd/html5/thumbnails/20.jpg)
WE BRING TO YOU THE SCALABLE METRICS MODEL (SMM)…
Every attempt to bring the best of the most used models…
EDW • Strengths: ACID, Fast, Stable • Weaknesses: Rigid, Cost, Resourcing
Data Lake (Hadoop) • Strengths: Cost, Flexibility, Scalability • Weaknesses: Performance, Reliability
Aggregated Cubes • Strengths: Performance, Easy to understand • Weaknesses: Reporting only
Scalable Metrics Model (Pre-Aggregated Metrics + Primary-Foreign Keys)
![Page 21: Taming data lake - scalable metrics model](https://reader035.vdocuments.us/reader035/viewer/2022062412/58f1e06a1a28ab71568b45fd/html5/thumbnails/21.jpg)
TACTICAL DETAILS: WHERE DO WE START?
An illustrative example from Retail domain…
• Defined Granularity & associated Info: Determined by Core Objectives, e.g., Customer level table for Customer Engagement team
• Defined Foreign Keys & Common Dimensions: For extensibility
• Defined Metrics: KPIs as required
• Identify Value Add Metrics: recommendation, forecasting, etc.
CUSTOMER
• Primary Key: Customer id
• Foreign Keys: Sign Up Partner, Promotion Id, First Txn id
• Customer Level Info: Email, Phone Number, Geo, etc.
• Metrics:
  • Lifetime Spend, Txns
  • Behavioral Bucket
  • RFM Bucket
• Recommended Action items:
  • Next Best Product
  • CLV
  • Target Offers
  • Call Center Agent Reco
![Page 22: Taming data lake - scalable metrics model](https://reader035.vdocuments.us/reader035/viewer/2022062412/58f1e06a1a28ab71568b45fd/html5/thumbnails/22.jpg)
TACTICAL DETAILS: DATA MODEL
An illustrative example from Retail domain…
| | id | Dimensions | foreign_keys | metrics |
|---|---|---|---|---|
| What it contains | Customer_id | Name, Email, Address, etc. | signup_partner_id, promotion_id | Lifetime Spend, Txns, Behavioral Bucket, RFM Bucket; Recommended Action items: Next Best Product, CLV, Target Offers, Call Center Agent Reco |
| Sample data | 11234 | `{"name":"John", "Email":"john@email.com", "Address":"123 nowhereblvd"}` | `{"signup_partner_id":"666YYY", "promotion":"YAH123"}` | `{"Lifetime Spend":"3400", "Txns":"150", "Behavioural Bucket":"repeat user", "RFM Bucket":"", "recommended Product id":"PRD789", "CLV":"??", "Target Offer":"OFF789", "CallCenterAgentReco":"1234"}` |
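As a rough illustration of the data model on this slide, the sketch below holds one customer row as a key plus three map columns in Python. The class `CustomerRow`, the helper `metric_as_float`, and the "Churn Risk" metric are hypothetical names introduced only for this example; the sample values are the ones shown on the slide.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class CustomerRow:
    """One row of the illustrative CUSTOMER table: a key plus three map columns."""
    customer_id: int
    dimensions: Dict[str, str] = field(default_factory=dict)
    foreign_keys: Dict[str, str] = field(default_factory=dict)
    metrics: Dict[str, str] = field(default_factory=dict)


def metric_as_float(row: CustomerRow, name: str) -> Optional[float]:
    """Return a metric as a number, or None when it is missing or non-numeric."""
    try:
        return float(row.metrics[name])
    except (KeyError, ValueError):
        return None


# Sample values copied from the slide.
row = CustomerRow(
    customer_id=11234,
    dimensions={"name": "John", "Email": "john@email.com", "Address": "123 nowhereblvd"},
    foreign_keys={"signup_partner_id": "666YYY", "promotion": "YAH123"},
    metrics={
        "Lifetime Spend": "3400",
        "Txns": "150",
        "Behavioural Bucket": "repeat user",
        "CLV": "??",
    },
)

# Adding a metric is a new key in the map, not a schema change.
row.metrics["Churn Risk"] = "low"  # hypothetical metric, for illustration only
```

Because every metric lives in a map, values arrive as strings and readers have to cast them, which is one reason a flattened view or reporting layer is useful for end users.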
![Page 23: Taming data lake - scalable metrics model](https://reader035.vdocuments.us/reader035/viewer/2022062412/58f1e06a1a28ab71568b45fd/html5/thumbnails/23.jpg)
TACTICAL DETAILS: ETL FRAMEWORK
An illustrative example from Retail domain…
STEP I: QUERIES
• Write separate queries/code to get metrics on the defined granularity
• Put those queries into the framework
STEP II: FRAMEWORK RUNS
• Framework runs each of these queries and populates the respective keys
STEP III: IMPLEMENT MODULARITY
• Adding a new metric is just adding a new query/code for that metric alone
• Changing the existing logic for a metric impacts that metric alone
STEP IV: USER INTERFACE
• Create physical Impala tables for interactive querying
• Create views for abstraction and end-user access
• Exporting data to reporting tools like Tableau/QlikView brings a high level of analysis capability to this model.
![Page 24: Taming data lake - scalable metrics model](https://reader035.vdocuments.us/reader035/viewer/2022062412/58f1e06a1a28ab71568b45fd/html5/thumbnails/24.jpg)
ETL framework
• Divide and conquer
  – Write separate queries/code to get metrics on the defined granularity
  – Put those queries into the framework
• Framework runs each of these queries and populates the respective keys
• Modularity
  – Adding a new metric is just adding a new query/code for that metric alone
  – Changing the existing logic for a metric impacts that metric alone
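The divide-and-conquer idea can be sketched in Python, assuming an in-memory list of transactions stands in for the raw data and each "query" is a plain function. The names `METRICS`, `metric`, `run_framework` and the record fields are hypothetical; the deck itself lists Pig and Hive/Impala for this work, so treat this as an illustration of the pattern, not the authors' implementation.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, List

Txn = Dict[str, Any]  # hypothetical raw transaction record
MetricFn = Callable[[List[Txn]], Dict[int, Any]]

# Registry: metric name -> the one function ("query") that computes it.
METRICS: Dict[str, MetricFn] = {}


def metric(name: str) -> Callable[[MetricFn], MetricFn]:
    """Register a function as the sole owner of one metric key."""
    def register(fn: MetricFn) -> MetricFn:
        METRICS[name] = fn
        return fn
    return register


@metric("Lifetime Spend")
def lifetime_spend(txns: List[Txn]) -> Dict[int, Any]:
    totals: Dict[int, float] = defaultdict(float)
    for t in txns:
        totals[t["customer_id"]] += t["amount"]
    return dict(totals)


@metric("Txns")
def txn_count(txns: List[Txn]) -> Dict[int, Any]:
    counts: Dict[int, int] = defaultdict(int)
    for t in txns:
        counts[t["customer_id"]] += 1
    return dict(counts)


def run_framework(txns: List[Txn]) -> Dict[int, Dict[str, Any]]:
    """Run every registered metric and write each result under its own key."""
    table: Dict[int, Dict[str, Any]] = defaultdict(dict)
    for name, fn in METRICS.items():
        for customer_id, value in fn(txns).items():
            table[customer_id][name] = value
    return dict(table)


# Hypothetical sample data.
txns = [
    {"customer_id": 11234, "amount": 20.0},
    {"customer_id": 11234, "amount": 5.5},
    {"customer_id": 99, "amount": 10.0},
]
table = run_framework(txns)
```

Adding a metric means adding one decorated function, and changing a metric's logic touches only that function. The sketch recomputes everything on each run, which mirrors the full-refresh behaviour of the model.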
![Page 25: Taming data lake - scalable metrics model](https://reader035.vdocuments.us/reader035/viewer/2022062412/58f1e06a1a28ab71568b45fd/html5/thumbnails/25.jpg)
Reporting and presentation
• Map data-types are hard for the users to access
• Three options
  – Create physical Impala tables for interactive querying
  – Create views for abstraction and end-user access
• Reporting layer (like Tableau)
  – Brings a different level of accessibility and analysis capability to this model.
    • Faster (if data is cached)
    • Create report level calculations
    • Data blending
    • Using metrics as a dimension, like customer buckets on transaction size
    • Visualization
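To show what a flattening view does for end users, here is a minimal Python sketch that exposes selected map keys as ordinary columns and then uses a metric as a dimension. The function `flatten`, the row layout, and all rows other than customer 11234 are assumptions made for the example.

```python
from collections import Counter
from typing import Any, Dict, List


def flatten(rows: List[Dict[str, Any]], metric_names: List[str]) -> List[Dict[str, Any]]:
    """Expose selected map keys as ordinary columns; missing keys become None."""
    flat = []
    for r in rows:
        out = {"customer_id": r["customer_id"]}
        for name in metric_names:
            out[name] = r["metrics"].get(name)
        flat.append(out)
    return flat


rows = [
    {"customer_id": 11234, "metrics": {"Lifetime Spend": "3400", "Behavioural Bucket": "repeat user"}},
    # The rows below are hypothetical.
    {"customer_id": 11235, "metrics": {"Lifetime Spend": "120", "Behavioural Bucket": "new user"}},
    {"customer_id": 11236, "metrics": {"Lifetime Spend": "900"}},
]

view = flatten(rows, ["Lifetime Spend", "Behavioural Bucket"])

# Using a metric as a dimension: count customers per behavioural bucket.
by_bucket = Counter(v["Behavioural Bucket"] for v in view)
```

The flat rows are what a reporting tool would consume; users never need to touch the map syntax.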
![Page 26: Taming data lake - scalable metrics model](https://reader035.vdocuments.us/reader035/viewer/2022062412/58f1e06a1a28ab71568b45fd/html5/thumbnails/26.jpg)
DATA BUS EXTENSIBILITY
CUSTOMER
• Primary Key: Customer id
• Foreign Keys: Sign Up Partner, Promotion Id, First Txn id
• Customer Level Info: Email, Phone Number, Geo, etc.
• Metrics:
  • Lifetime Spend, Txns
  • Behavioral Bucket
  • RFM Bucket
• Recommended Action items:
  • Next Best Product
  • CLV
  • Target Offers
  • Call Center Agent Reco
SELLERS
• Primary Key: Seller id
• Foreign Keys: Product id, Operating Channel
• Seller Level Info: Name, Operating Region, Annual Sales
• Metrics:
  • Lifetime Sales, Txns
  • Performance Bucket
  • Special Category Flag
• Recommended Action items:
  • Next Best Product
  • Next Co-Marketing
  • RM action
TXNS
• Primary Key: Txn id
• Foreign Keys: Custid, Sellerid, Channel
• Txn Level Info: Amt, Type, Date
• Flags:
  • Buyer/Seller Type
  • Deviation Metrics
  • Fraud/Good
  • Agent Verification
  • Next Best Offer
+ CLICKSTREAM + PROMOTIONS + PARTNERS + PRODUCTS + SENTIMENT + LOGS + 3rd PARTY + ETC ETC…
Common Dimensions or Foreign Keys
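The role of the shared foreign keys can be illustrated with a small Python sketch that follows TXNS to SELLERS through the seller key. The sample rows, the lowercase field names, and the function `spend_by_seller_region` are hypothetical; only the entity and key names come from the slide.

```python
from collections import defaultdict
from typing import Any, Dict, List

# Hypothetical sample rows keyed the way the slide describes.
sellers: Dict[str, Dict[str, Any]] = {
    "S1": {"Name": "Seller One", "Operating Region": "West"},
    "S2": {"Name": "Seller Two", "Operating Region": "East"},
}
txns: List[Dict[str, Any]] = [
    {"txn_id": "T1", "Custid": 11234, "Sellerid": "S1", "Amt": 40.0},
    {"txn_id": "T2", "Custid": 11234, "Sellerid": "S2", "Amt": 10.0},
    {"txn_id": "T3", "Custid": 11234, "Sellerid": "S1", "Amt": 2.5},
    {"txn_id": "T4", "Custid": 777, "Sellerid": "S2", "Amt": 99.0},
]


def spend_by_seller_region(
    txns: List[Dict[str, Any]],
    sellers: Dict[str, Dict[str, Any]],
    cust_id: int,
) -> Dict[str, float]:
    """Follow TXNS -> SELLERS via Sellerid and total one customer's spend per region."""
    totals: Dict[str, float] = defaultdict(float)
    for t in txns:
        if t["Custid"] != cust_id:
            continue
        region = sellers[t["Sellerid"]]["Operating Region"]
        totals[region] += t["Amt"]
    return dict(totals)


result = spend_by_seller_region(txns, sellers, 11234)
```

Any new entity (clickstream, promotions, and so on) joins the same way, provided it carries one of the common keys.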
![Page 27: Taming data lake - scalable metrics model](https://reader035.vdocuments.us/reader035/viewer/2022062412/58f1e06a1a28ab71568b45fd/html5/thumbnails/27.jpg)
Business requirements
Design
Data Model
ETL Framework
Reporting Layer
Use and learn
![Page 28: Taming data lake - scalable metrics model](https://reader035.vdocuments.us/reader035/viewer/2022062412/58f1e06a1a28ab71568b45fd/html5/thumbnails/28.jpg)
THE SALIENT FEATURES
• Fit for a wide variety of Solution Sets & audiences: Optimal data model to support all three needs: Reporting, Analytics & Data Mining.
• Best of all worlds: Scalable Metrics Model is a hybrid approach.
  • ACID Strengths: performance, stability and reliability of RDBMS.
  • Non ACID Strengths: scalability, flexibility, versatility of Hadoop.
• Needs Optimized Model: The highest premium is placed on the needs of the user. It is easy to incorporate changes as they come along (view-like). The refresh cycle is easy, and changed logic is incorporated in the next run.
• Data Governance & Lineage: Operates with a modular approach: break down complex problems into smaller items and integrate them into the bigger scheme of things. This makes Data Governance and Lineage easier.
• Extensibility:
  • Caching: Easy integration with buffering technologies to optimize performance.
  • Visualization: Easier integration with visualization tools like Tableau.
  • Coding Interface: Additional drilldowns, analyses, data analysis via HIVE/SAS/R.
● MODULAR ● EXTENDABLE ● UPDATABLE ● SCALABLE
![Page 29: Taming data lake - scalable metrics model](https://reader035.vdocuments.us/reader035/viewer/2022062412/58f1e06a1a28ab71568b45fd/html5/thumbnails/29.jpg)
FOUR DIMENSIONS OF SUCCESSFUL EXECUTION
PEOPLE
• Business Analysts: Details on Business needs like Timing (Immediate/near/medium/long term), Priority (Critical/Urgent/Important/Good to have), Frequency (Regular/once-in-a-while/rare), Real-time, Delivery & Users.
• Technical Architects: Understand the raw data structure, flow mechanisms & pipelines, security/legal/storage/resourcing constraints, feasibility assessments.
PROCESS
• Matching & Gap Analysis: Is the technology available to handle all business needs (possible/not enough RoI/deferred); Contingency, resourcing & budgeting.
• Project Planning: Milestone based delivery, Deep Stakeholder involvement in development & validation, Communications Management
• Execution: Efficient schema-on-read, Aggregates, Tight Metadata, Reporting/analytics layer, Tables/Partitions/File types/Compression
TECH
• PIG: ETL
• HIVE/Impala: Schema & Table creation
• Java/Streaming:
• SAS/Python/R: Statistical Modeling
CULTURE
• Customer Needs Focused
• Need for a smart vision, sound planning and able change management
• Outcome Focused Organization (common business goal)
![Page 30: Taming data lake - scalable metrics model](https://reader035.vdocuments.us/reader035/viewer/2022062412/58f1e06a1a28ab71568b45fd/html5/thumbnails/30.jpg)
WHY DO WE THINK THE TIME IS NOW?
Evolution in the value prop of Analysts: What/where/how much -> what can happen -> what should we do?
Audience has broadened (A numbers middle man -> Front line Managers)
Luxury of time has evaporated
Nature of questions has drastically changed (Expectation of being able to connect the dots in “Data Lake” world).
Overselling potential before getting “there”
KPI of Analytics has changed from Turn-Around-Time (TAT) to Time-to-Action (TTA)
![Page 31: Taming data lake - scalable metrics model](https://reader035.vdocuments.us/reader035/viewer/2022062412/58f1e06a1a28ab71568b45fd/html5/thumbnails/31.jpg)
Putting it all together
![Page 32: Taming data lake - scalable metrics model](https://reader035.vdocuments.us/reader035/viewer/2022062412/58f1e06a1a28ab71568b45fd/html5/thumbnails/32.jpg)
SWOT ANALYSIS OF SMM
STRENGTHS
• Need sensitive model
• Cost of development, modification & refresh reduced
• Easy for Analysts/End Users to understand and play with
• Data Governance & Lineage: Break down bigger problems into smaller manageable ones
OPPORTUNITIES
• Integration with front end tools that can simplify UX.
• Tools that buffer the backend data to ensure speedy delivery.
WEAKNESSES / THREATS
• Good vision of future Analytical requirements is paramount.
• Full refresh every time it runs again.
• Maximum granularity needs to be pre-fixed.
• Learning Curve on Coding language/syntax.
• Non-normalized data model.
• Not for real-time insights delivery
• No Slowly Changing Dimensions
![Page 33: Taming data lake - scalable metrics model](https://reader035.vdocuments.us/reader035/viewer/2022062412/58f1e06a1a28ab71568b45fd/html5/thumbnails/33.jpg)
THE FIVE COMMANDMENTS
• “Know” that it caters to most frequent and not all needs.
• “Must have” an Analytics vision/needs that looks as far ahead as possible, and an Outcome Focused approach.
• “Ensure” deeper Stakeholder involvement in the development. A Test & Learn approach is a must. Be ready to modify if needed.
• “Develop” modularity in delivery.
• “Prepare” for ever-increasing dependencies from Analytics and other stakeholders.
![Page 34: Taming data lake - scalable metrics model](https://reader035.vdocuments.us/reader035/viewer/2022062412/58f1e06a1a28ab71568b45fd/html5/thumbnails/34.jpg)
Appendix
![Page 35: Taming data lake - scalable metrics model](https://reader035.vdocuments.us/reader035/viewer/2022062412/58f1e06a1a28ab71568b45fd/html5/thumbnails/35.jpg)
THANK YOU!
Would love to hear from you on any of the following forums…
https://twitter.com/decisions_2_0
http://www.slideshare.net/RamkumarRavichandran
https://www.youtube.com/channel/UCODSVC0WQws607clv0k8mQA/videos
http://www.odbms.org/2015/01/ramkumar-ravichandran-visa/
https://www.linkedin.com/pub/ramkumar-ravichandran/10/545/67a
https://www.linkedin.com/in/dataisbig
http://bigdatadw.blogspot.com/
BHARATHIRAJA CHANDRASEKHARAN
RAMKUMAR RAVICHANDRAN
![Page 36: Taming data lake - scalable metrics model](https://reader035.vdocuments.us/reader035/viewer/2022062412/58f1e06a1a28ab71568b45fd/html5/thumbnails/36.jpg)
RESEARCH/LEARNING RESOURCES
• Alternative approach by Martin Fowler: http://martinfowler.com/bliki/DataLake.html
• Teradata/Hortonworks Data Lake Whitepaper: http://hortonworks.com/wp-content/uploads/2014/05/TeradataHortonworks_Datalake_White-Paper_20140410.pdf
• EMC Data Lake: https://www.youtube.com/watch?v=o2fs02h_LEo