Deploying a Governed Data Lake
![Page 1: Deploying a Governed Data Lake](https://reader036.vdocuments.us/reader036/viewer/2022081511/5882ea3d1a28ab33258b7de1/html5/thumbnails/1.jpg)
Deploying a Governed Data Lake
Alex Gorelik, Founder and CEO, Waterline Data
![Page 2: Deploying a Governed Data Lake](https://reader036.vdocuments.us/reader036/viewer/2022081511/5882ea3d1a28ab33258b7de1/html5/thumbnails/2.jpg)
Everyone needs data to make better decisions
![Page 3: Deploying a Governed Data Lake](https://reader036.vdocuments.us/reader036/viewer/2022081511/5882ea3d1a28ab33258b7de1/html5/thumbnails/3.jpg)
A data lake
http://www.pwc.com/us/en/technology-forecast/2014/issue1/features/data-lakes.jhtml
“Size and low cost”
“Fidelity: Hadoop data lakes preserve data in its original form”
“Ease of accessibility: Accessibility is easy in the data lake”
“Late binding: Hadoop lends itself to flexible, task-oriented structuring and does not require up-front data models”
“Nearly unlimited potential for operational insight and data discovery. As data volumes, data variety, and metadata richness grow, so does the benefit.”
![Page 4: Deploying a Governed Data Lake](https://reader036.vdocuments.us/reader036/viewer/2022081511/5882ea3d1a28ab33258b7de1/html5/thumbnails/4.jpg)
Data warehouse vs. data lake
Data Warehouse
• Production system
• Well-defined usage
• Well-defined schema
• Clean, trusted data
• Heavy IT reliance
  – Less technical analysts
  – Large IT teams: DBAs, Data Architects, ETL Developers, BI Developers, DQ Developers, Data Modelers, Data Stewards
Data Lake
• Non-production system
• Future, experimental usage
• No schema (schema on read)
• Raw data, frictionless ingestion
• Self-service
  – More technical analysts
  – IT manages the cluster and ingestion, but no IT involvement when working with data
![Page 5: Deploying a Governed Data Lake](https://reader036.vdocuments.us/reader036/viewer/2022081511/5882ea3d1a28ab33258b7de1/html5/thumbnails/5.jpg)
Hadoop as the platform for a scalable data lake infrastructure
• Lots of data (Volume): cost-effective storage and scalable processing ✔ Hadoop
• Flexibility to handle all kinds of data (Variety) ✔ Hadoop
• Will be around for a long time: modularity to ensure future-proofing ✔ Hadoop
![Page 6: Deploying a Governed Data Lake](https://reader036.vdocuments.us/reader036/viewer/2022081511/5882ea3d1a28ab33258b7de1/html5/thumbnails/6.jpg)
Is Hadoop enough?
Big Data Architect: "We have Hadoop, now what?"
Hadoop (10-20 nodes)
![Page 7: Deploying a Governed Data Lake](https://reader036.vdocuments.us/reader036/viewer/2022081511/5882ea3d1a28ab33258b7de1/html5/thumbnails/7.jpg)
Big Data Architect: "How do I get the business to start using it?"
Data Scientist/Business Analyst
Hadoop (10-20 nodes)
![Page 8: Deploying a Governed Data Lake](https://reader036.vdocuments.us/reader036/viewer/2022081511/5882ea3d1a28ab33258b7de1/html5/thumbnails/8.jpg)
Big Data Architect: "How do I get the business to start using it?"
Data Scientist/Business Analyst: "How do I find and understand data easily to do big data analytics?"
Self-service
Hadoop (10-20 nodes)
![Page 9: Deploying a Governed Data Lake](https://reader036.vdocuments.us/reader036/viewer/2022081511/5882ea3d1a28ab33258b7de1/html5/thumbnails/9.jpg)
Big Data Architect
Data Scientist/Business Analyst
Risk/Data Governance Executive: "How do I ensure compliance with regulations and data policies? Sensitive data?"
No security and governance
Hadoop (10-20 nodes)
![Page 10: Deploying a Governed Data Lake](https://reader036.vdocuments.us/reader036/viewer/2022081511/5882ea3d1a28ab33258b7de1/html5/thumbnails/10.jpg)
Big Data Architect: "How do I scale?"
Data Scientist/Business Analysts
Manual process to catalog the lake can’t scale
Hadoop (100s/1000s of nodes)
![Page 11: Deploying a Governed Data Lake](https://reader036.vdocuments.us/reader036/viewer/2022081511/5882ea3d1a28ab33258b7de1/html5/thumbnails/11.jpg)
The platform for a scalable data lake infrastructure
• Lots of data (Volume): cost-effective storage and scalable processing ✔ Hadoop
• Flexibility to handle all kinds of data (Variety) ✔ Hadoop
• Will be around for a long time: modularity to ensure future-proofing ✔ Hadoop
• Self-service to help users find, understand and use the data X Hadoop
• Governance to protect sensitive data, document lineage and assess quality X Hadoop
![Page 12: Deploying a Governed Data Lake](https://reader036.vdocuments.us/reader036/viewer/2022081511/5882ea3d1a28ab33258b7de1/html5/thumbnails/12.jpg)
Waterline Data Inventory broadens Hadoop adoption through governed self-service
Big Data Architect
Data Scientist/Business Analyst
Risk/Data Governance Executive
Hadoop (100s/1000s of nodes)
Self-service
Security and governance
Massive scale
![Page 13: Deploying a Governed Data Lake](https://reader036.vdocuments.us/reader036/viewer/2022081511/5882ea3d1a28ab33258b7de1/html5/thumbnails/13.jpg)
3-phase approach to a governed data lake
Organize the lake
Inventory the lake
Open up the lake
![Page 14: Deploying a Governed Data Lake](https://reader036.vdocuments.us/reader036/viewer/2022081511/5882ea3d1a28ab33258b7de1/html5/thumbnails/14.jpg)
Organize the lake into zones
Organize the lake
![Page 15: Deploying a Governed Data Lake](https://reader036.vdocuments.us/reader036/viewer/2022081511/5882ea3d1a28ab33258b7de1/html5/thumbnails/15.jpg)
Establish access control per zone
Zones: Sensitive, Landing, Gold, Work
Access groups:
• Business Analysts, Data Scientists
• Data Scientists, Data Engineers
• Data Scientists, Data Engineers
• Data Stewards
Organize the lake
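A minimal sketch of per-zone access control, expressed as a lookup table. The slide names the zones and the role groups, but the pairing of roles to zones below is an assumption made for illustration, as are the names `ZONE_ACCESS`, `can_access` and `zones_for`. In a real cluster this policy would be enforced by the platform's own permissions rather than by application code.

```python
# Hypothetical zone-to-role mapping. The zones and roles come from the slide;
# which roles belong to which zone is an ASSUMPTION for illustration only.
ZONE_ACCESS = {
    "Gold": {"Business Analyst", "Data Scientist"},
    "Work": {"Data Scientist", "Data Engineer"},
    "Landing": {"Data Scientist", "Data Engineer"},
    "Sensitive": {"Data Steward"},
}


def can_access(role: str, zone: str) -> bool:
    """Return True if the role is allowed to work in the zone."""
    # Unknown zones deny by default.
    return role in ZONE_ACCESS.get(zone, set())


def zones_for(role: str) -> list[str]:
    """List the zones a role may access, sorted by name."""
    return sorted(zone for zone, roles in ZONE_ACCESS.items() if role in roles)
```

Keeping the policy in one table makes it easy to review which roles reach which zone before mapping it onto directory or table permissions.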
![Page 16: Deploying a Governed Data Lake](https://reader036.vdocuments.us/reader036/viewer/2022081511/5882ea3d1a28ab33258b7de1/html5/thumbnails/16.jpg)
The governed data lake
Data Scientist/Business Analyst, Data Steward, Big Data Architect
HDFS Hive
Waterline Data Inventory
Find/understand Govern Governed data layer
Governance
Inventory
Self-Service
![Page 17: Deploying a Governed Data Lake](https://reader036.vdocuments.us/reader036/viewer/2022081511/5882ea3d1a28ab33258b7de1/html5/thumbnails/17.jpg)
Metadata Curation
Self-Service Catalog/Provisioning
Big Data Architect
Find/understand
Governed data layer
Data Scientist/Business Analyst
The governed data lake
Data Steward
HDFS Hive
Waterline Data Inventory
Govern
Inventory
Inventory the lake
Profile and discover the content of files and Hive tables
![Page 18: Deploying a Governed Data Lake](https://reader036.vdocuments.us/reader036/viewer/2022081511/5882ea3d1a28ab33258b7de1/html5/thumbnails/18.jpg)
Inventory
Parse multiple content types
Create catalog automatically
Discover lineage automatically
![Page 19: Deploying a Governed Data Lake](https://reader036.vdocuments.us/reader036/viewer/2022081511/5882ea3d1a28ab33258b7de1/html5/thumbnails/19.jpg)
Self-Service Catalog/Provisioning
Big Data Architect
Find/understand
Governed data layer
Data Scientist/Business Analyst
The governed data lake
Data Steward
HDFS Hive
Waterline Data Inventory
Govern
Inventory
Govern the lake
Governance
• Inspect files and perform tag curation
• Identify sensitive data
• Assess data quality
• Discover data lineage
• Manage glossary
![Page 20: Deploying a Governed Data Lake](https://reader036.vdocuments.us/reader036/viewer/2022081511/5882ea3d1a28ab33258b7de1/html5/thumbnails/20.jpg)
Navigate Lineage of Files in Hadoop
Clickable, navigable lineage discovered using file content or imported from other tools through REST APIs
![Page 21: Deploying a Governed Data Lake](https://reader036.vdocuments.us/reader036/viewer/2022081511/5882ea3d1a28ab33258b7de1/html5/thumbnails/21.jpg)
Automated Data Profiling Helps with Quality Assessment
Infographic shows contents at a glance:
• Different types of data in the same field
• Number of missing values
Separate profiles for each data type, including number of unique values (cardinality), uniqueness (selectivity), and type-specific measures like mean and standard deviation for numbers
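The profile measures named above can be sketched for a single field as follows. This is an illustration under stated assumptions, not the product's profiler: the type inference is deliberately crude (number vs. string), selectivity is taken to be cardinality divided by the count of non-missing values of that type, and the names `infer_type` and `profile_field` are hypothetical.

```python
import statistics
from collections import Counter
from typing import Optional


def infer_type(value: Optional[str]) -> str:
    """Classify a raw value as 'missing', 'number', or 'string'."""
    if value is None or value.strip() == "":
        return "missing"
    try:
        float(value)
        return "number"
    except ValueError:
        return "string"


def profile_field(values: list) -> dict:
    """Build one profile per data type found in a single field."""
    type_counts = Counter(infer_type(v) for v in values)
    profile = {"missing": type_counts.get("missing", 0), "types": {}}
    for type_name in ("number", "string"):
        typed = [v for v in values if infer_type(v) == type_name]
        if not typed:
            continue
        # Cardinality counts distinct raw strings, so "10" and "10.0" differ.
        cardinality = len(set(typed))
        stats = {
            "count": len(typed),
            "cardinality": cardinality,
            # ASSUMPTION: selectivity = distinct values / values of this type.
            "selectivity": cardinality / len(typed),
        }
        if type_name == "number":
            numbers = [float(v) for v in typed]
            stats["mean"] = statistics.mean(numbers)
            # Sample standard deviation needs at least two values.
            stats["stdev"] = statistics.stdev(numbers) if len(numbers) > 1 else 0.0
        profile["types"][type_name] = stats
    return profile
```

A field that yields both a "number" and a "string" profile is the "different types of data in the same field" case the infographic flags.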
![Page 22: Deploying a Governed Data Lake](https://reader036.vdocuments.us/reader036/viewer/2022081511/5882ea3d1a28ab33258b7de1/html5/thumbnails/22.jpg)
Data Preview and Visualization Helps Understand the Data
Visualization helps understand the shape and distribution of data
Most frequent values for each field
![Page 23: Deploying a Governed Data Lake](https://reader036.vdocuments.us/reader036/viewer/2022081511/5882ea3d1a28ab33258b7de1/html5/thumbnails/23.jpg)
Discover Sensitive Data
Screen shot
Find all fields that may have SSN
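One simple way to flag fields that may contain SSNs is to test each field's values against a pattern and report the fields where most values match. The sketch below assumes that approach. The slides do not describe how the product detects sensitive data, and the `NNN-NN-NNNN` pattern, the 0.8 threshold, and the function names are all illustrative choices.

```python
import re

# ASSUMPTION: SSNs appear in the dashed NNN-NN-NNNN form.
SSN_PATTERN = re.compile(r"\d{3}-\d{2}-\d{4}")


def ssn_match_ratio(values) -> float:
    """Fraction of non-empty values that look like an SSN."""
    non_empty = [v.strip() for v in values if v and v.strip()]
    if not non_empty:
        return 0.0
    hits = sum(1 for v in non_empty if SSN_PATTERN.fullmatch(v))
    return hits / len(non_empty)


def fields_that_may_have_ssn(table: dict, threshold: float = 0.8) -> list:
    """Return the field names whose values mostly match the SSN pattern.

    `table` maps a field name to its list of raw string values.
    """
    return [
        name
        for name, values in table.items()
        if ssn_match_ratio(values) >= threshold
    ]
```

A match only suggests a field is sensitive, so a flagged field would still go to a data steward for review.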
![Page 24: Deploying a Governed Data Lake](https://reader036.vdocuments.us/reader036/viewer/2022081511/5882ea3d1a28ab33258b7de1/html5/thumbnails/24.jpg)
Curate Discovered Sensitive Data Fields
Curate the field and accept or reject the tag
![Page 25: Deploying a Governed Data Lake](https://reader036.vdocuments.us/reader036/viewer/2022081511/5882ea3d1a28ab33258b7de1/html5/thumbnails/25.jpg)
Manage Glossary
Import or create a business glossary
Manage tags
![Page 26: Deploying a Governed Data Lake](https://reader036.vdocuments.us/reader036/viewer/2022081511/5882ea3d1a28ab33258b7de1/html5/thumbnails/26.jpg)
View and search history
Screenshot of history tab
Another screenshot of searching history (made up)
Data Inventory keeps track of all user tagging, schema changes, and lineage changes in Audit History
![Page 27: Deploying a Governed Data Lake](https://reader036.vdocuments.us/reader036/viewer/2022081511/5882ea3d1a28ab33258b7de1/html5/thumbnails/27.jpg)
Data Steward
Govern
Big Data Architect
Governed data layer
Open up the data lake
HDFS Hive
Waterline Data Inventory
Inventory
Governance
Self-Service
Find/understand
Data Scientist/Business Analyst
Explore catalog and provision data securely
Open up the lake
![Page 28: Deploying a Governed Data Lake](https://reader036.vdocuments.us/reader036/viewer/2022081511/5882ea3d1a28ab33258b7de1/html5/thumbnails/28.jpg)
Find and Understand
Automatically propagate user-defined tags (crowdsource ontology)
Discover meaning of fields and tag automatically
Multi-faceted drill down
Automated facet creation based on metadata
Business metadata-based search
![Page 29: Deploying a Governed Data Lake](https://reader036.vdocuments.us/reader036/viewer/2022081511/5882ea3d1a28ab33258b7de1/html5/thumbnails/29.jpg)
Annotate fields, files and folders with tags
• Analysts can tag fields and files with meaningful business tags
• Type-ahead shows existing available tags that match the typed string
• Users can choose one or create a new tag
• Period in tag name automatically creates tag hierarchy (e.g., Restaurant.Name creates category “Restaurant” and tag “Name”)
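The period convention and the type-ahead lookup described above can be sketched as follows. The function names and the nested-dictionary glossary are hypothetical. Treating every period as one more hierarchy level, and matching the typed string case-insensitively anywhere in the tag, are assumptions that go beyond the single-period example on the slide.

```python
def parse_tag(tag_name: str) -> tuple[list[str], str]:
    """Split 'Restaurant.Name' into (['Restaurant'], 'Name')."""
    parts = [p.strip() for p in tag_name.split(".") if p.strip()]
    if not parts:
        raise ValueError("tag name is empty")
    # Everything before the last period is the category path.
    return parts[:-1], parts[-1]


def add_tag(glossary: dict, tag_name: str) -> None:
    """Insert a tag into a nested-dict glossary, creating categories as needed."""
    categories, tag = parse_tag(tag_name)
    node = glossary
    for category in categories:
        node = node.setdefault(category, {})
    node.setdefault(tag, {})


def suggest(existing_tags: list[str], typed: str) -> list[str]:
    """Type-ahead: existing tags that contain the typed string."""
    typed = typed.lower()
    return sorted(t for t in existing_tags if typed in t.lower())
```

With this shape, adding `Restaurant.Name` and then `Restaurant.Address` places both tags under one `Restaurant` category.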
![Page 30: Deploying a Governed Data Lake](https://reader036.vdocuments.us/reader036/viewer/2022081511/5882ea3d1a28ab33258b7de1/html5/thumbnails/30.jpg)
Based on a single field in one file tagged as Restaurant.Name, the Waterline Data Inventory discovery engine found 25 additional instances of Restaurant Name automatically.
User-assigned tags are solid blue
Automatically suggested tags are faded blue with confidence level
Delimited files don’t have field names
Waterline Data Inventory learns from analysts who manually tag fields and automatically finds and tags similar fields
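To illustrate the idea of learning from one tagged field and suggesting the same tag elsewhere with a confidence level, here is a small sketch. It compares field contents only, since delimited files have no field names to match on. The slides do not describe the discovery engine's actual algorithm. Jaccard overlap of distinct values stands in for it here, and the names `jaccard` and `suggest_tags`, the field identifiers, and the 0.5 cutoff are assumptions.

```python
def jaccard(a: set, b: set) -> float:
    """Overlap of two value sets: |a ∩ b| / |a ∪ b|."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def suggest_tags(tagged: dict, untagged: dict, min_confidence: float = 0.5) -> dict:
    """Suggest a tag for each untagged field based on value overlap.

    `tagged` maps a tag to example values from a field an analyst tagged.
    `untagged` maps a field identifier to that field's values.
    Returns {field: (tag, confidence)} for suggestions above the cutoff.
    """
    suggestions = {}
    for field, values in untagged.items():
        best_tag, best_score = None, 0.0
        for tag, examples in tagged.items():
            score = jaccard(set(values), set(examples))
            if score > best_score:
                best_tag, best_score = tag, score
        if best_tag is not None and best_score >= min_confidence:
            suggestions[field] = (best_tag, round(best_score, 2))
    return suggestions
```

Suggestions produced this way would correspond to the faded-blue tags, which an analyst or steward then accepts or rejects.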
![Page 31: Deploying a Governed Data Lake](https://reader036.vdocuments.us/reader036/viewer/2022081511/5882ea3d1a28ab33258b7de1/html5/thumbnails/31.jpg)
Create Hive tables
Screen shot of file with “Generate Hive Table” option selected- Replace Hive with Drill
Generate Hive Tables
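Generating a Hive table over a delimited file amounts to emitting a `CREATE EXTERNAL TABLE` statement that points at the file's directory. The sketch below builds such a statement as a string. The helper name `generate_hive_ddl`, the example table, columns and path are hypothetical, and the slides do not show the DDL the product actually produces.

```python
def generate_hive_ddl(table: str, columns: list, location: str, delimiter: str = ",") -> str:
    """Build a CREATE EXTERNAL TABLE statement for a delimited text file.

    `columns` is a list of (name, hive_type) pairs, e.g. ("zip", "STRING").
    NOTE: names and paths are inserted as-is; a real tool would validate them.
    """
    column_lines = ",\n".join(
        f"  `{name}` {hive_type}" for name, hive_type in columns
    )
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS `{table}` (\n"
        f"{column_lines}\n"
        ")\n"
        "ROW FORMAT DELIMITED\n"
        f"FIELDS TERMINATED BY '{delimiter}'\n"
        "STORED AS TEXTFILE\n"
        f"LOCATION '{location}'"
    )


if __name__ == "__main__":
    # Hypothetical file in a hypothetical landing-zone directory.
    print(generate_hive_ddl(
        "restaurants",
        [("name", "STRING"), ("zip", "STRING")],
        "/data/landing/restaurants",
    ))
```

An external table leaves the underlying file in place, so dropping the table later does not delete the data.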
![Page 33: Deploying a Governed Data Lake](https://reader036.vdocuments.us/reader036/viewer/2022081511/5882ea3d1a28ab33258b7de1/html5/thumbnails/33.jpg)
Company overview
• Headquartered in Mountain View, CA
• Funded in 2013 by Menlo Ventures and Sigma West
• Management Team:
  – Alex Gorelik, Founder, CEO: Founded Exeros (IBM) and Acta (SAP), IBM DE, Informatica GM. Columbia BSCS, Stanford MSCS.
  – Oliver Claude, Marketing: VP SAP, VP Informatica, IBM, Siebel. Nova Southeastern MS MIS.
  – Jason Chen, Engineering: VP Teradata, Acta, Sybase. USC PhD CS.
  – Ravi Ramachandran, Sales: CSC-Infochimps Big Data, AppLabs, Xchanging, Pegasystems. Scient (Razorfish)
WATERLINE DATA NAMED COOL VENDOR
Gartner, Cool Vendors in Information Governance and MDM, 2015
Guido DeSimoni, Roxane Edjlali, Saul Judah, Bill O'Kane, Andrew White
![Page 34: Deploying a Governed Data Lake](https://reader036.vdocuments.us/reader036/viewer/2022081511/5882ea3d1a28ab33258b7de1/html5/thumbnails/34.jpg)
Visit our exhibit in the ballroom to get more information