best practices to deploy a governed data lake
© 2015 MapR Technologies
Deploying a Governed Data Lake
Dale Kim, MapR Technologies
Oliver Claude, Waterline Data
August 4, 2015
Welcome
• Event will be recorded
• Ask your questions in the Q&A Panel in the lower right-hand corner of your screen
• Tweet us @mapr during the event
Key Points
• The data lake is becoming a “real-time” shared service to provide data to the business to support data science and big data analytics needs
• As the data lake becomes a trusted source of data to drive big data analytics, security and data governance have to be addressed
• Security and data governance policies need to be implemented in a way that still enables self-service and quick time to value vs. creating 3-6 month delays
Deliver Data Discovery Agility with a Governed “Data Layer”
Adhere to security, compliance and data governance policies
Catalog data assets at scale, with secure provisioning to the business
Find and understand best-suited and most trusted data
The danger of the data lake becoming a flea market
Botond Horvath / Shutterstock.com
• Big Data Architect: Can’t create and maintain an inventory fast enough
• Data Engineer/Data Scientist/Business Analyst: Can’t explore everything to find the best item
• CDO/Data Steward: Can’t tell what’s what and what can be trusted
Imagine shopping on Amazon.com
GOVERNANCE
Inventory
Find and Understand
Provision
Governed data lake is like Amazon.com for data in Hadoop
GOVERNANCE
Inventory
Find and Understand
Provision
The Governed Data Lake on Apache Hadoop
Sources
• Relational, SaaS, mainframe
• Documents, emails
• Log files, clickstreams
• Sensors
• Blogs, tweets, link data
Analytics
• Search
• Schema-less data exploration
• BI, reporting
• Ad-hoc integrated analytics
Operational Apps
• Recommendation
• Fraud Detection
• Logistics
MapR Data Platform
• MapR-DB
• MapR-FS
Distribution including Apache Hadoop
Data Inventory: Find, understand and govern
The Governed Data Lake
Stages: Define, Ingest, Inventory, Explore, Provision, Wrangle/Model/Visualize
• Critical data elements
• Sensitive data elements
• Security and data governance policies
• Load
• Profile
• Automatic tagging
• Discover metadata and generate tags
• Discover data lineage
• Manage tags
• Browse/search inventory
• Inspect data quality
• Tag and annotate
• Bookmark
• Copy
• Authorized view
Governed data lake as a shared service
Data Governance and Data Discovery Agility
Data protection, authentication, authorization, auditing
Can you achieve both?
Waterline Data is like Amazon.com for data in Hadoop
GOVERNANCE
Inventory
Find and Understand
Provision
The Governed Data Lake on Apache Hadoop with MapR
Sources
• Relational, SaaS, mainframe
• Documents, emails
• Log files, clickstreams
• Sensors
• Blogs, tweets, link data
Analytics
• Search
• Schema-less data exploration
• BI, reporting
• Ad-hoc integrated analytics
Operational Apps
• Recommendation
• Fraud Detection
• Logistics
MapR Data Platform
• MapR-DB
• MapR-FS
Distribution including Apache Hadoop
Data Inventory: Find, understand and govern
Separate Distinct Data Sets via MapR Volumes
Volumes dramatically simplify management:
• Replication factor
• Scheduled mirroring
• Scheduled snapshots
• Data placement control
• User access and tracking
• Administrative permissions
/projects
    /tahoe
    /yosemite
/user
    /msmith
    /bjohnson
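One way to picture how per-volume settings attach to directory subtrees is the sketch below. It is only an illustrative model, not the MapR API: the `Volume` class, its field names, the policy values and the `volume_for` helper are all hypothetical. Only the mount paths come from the slide.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Volume:
    """Hypothetical stand-in for a volume: a mount path plus its own policies."""
    mount_path: str
    replication: int          # illustrative value, not a MapR default
    snapshot_schedule: str    # illustrative label, e.g. "daily"


# Mount paths are from the slide; the policy values are made up.
VOLUMES = [
    Volume("/projects/tahoe", replication=3, snapshot_schedule="hourly"),
    Volume("/projects/yosemite", replication=3, snapshot_schedule="daily"),
    Volume("/user/msmith", replication=2, snapshot_schedule="weekly"),
    Volume("/user/bjohnson", replication=2, snapshot_schedule="weekly"),
]


def volume_for(path: str) -> Optional[Volume]:
    """Return the volume whose mount path is the longest prefix of `path`."""
    best = None
    for vol in VOLUMES:
        inside = path == vol.mount_path or path.startswith(vol.mount_path + "/")
        if inside and (best is None or len(vol.mount_path) > len(best.mount_path)):
            best = vol
    return best


vol = volume_for("/projects/tahoe/data/a.csv")
print(vol.mount_path, vol.replication, vol.snapshot_schedule)
```

The point of the model is that a policy is set once per volume and every file under the mount path inherits it, so a project or user directory can be managed as a unit.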
MapR Trust Model (Product Security)
1. Flexible Authentication
• Wire-level authentication for all services in the cluster
• NSA-level cryptographic algorithms
• Integration with LDAP, Active Directory and other third party directory services
• Kerberos or username/password authentication
2. Granular Authorization
• Access Control Expressions
• Protect files, tables, column families, columns, and management objects
• Extend to role-based access control (RBAC) with custom role functions
• Drill Views
3. Robust Auditing
• All events recorded immediately in JSON log files
• Includes data access and administrative actions
• Ad-hoc queries and custom reports on audit logs via SQL and standard BI tools
4. Ubiquitous Data Protection
• Encryption for Data in Motion
    • Within a Cluster
    • Between Clusters
    • Between Client and Cluster
• Encryption for Data at Rest
    • LUKS
    • Self-Encrypting Disk
    • Partners
MapR Comprehensive Auditing
Serving Security Analysts…
Monitoring
Incident Response
• Who touched customer records outside of business hours?
• What actions did users take in the days before leaving the company?
• What operations were performed without following change control?
• Are users accessing sensitive files from protected/secured source IPs?
• Why do my reports look different, despite sourcing from the same underlying data?
Security
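Because audit events are written as JSON, a question such as the first one above can be answered with a short script. The following is a minimal sketch, not a MapR tool: the sample records, the `/customers/` path prefix and the 08:00-18:00 UTC business-hours window are assumptions, and timestamps are assumed to be plain ISO 8601 strings.

```python
import json
from datetime import datetime

# Hypothetical audit events, one JSON object per line. Field names follow the
# style of the audit records described in this deck; the values are invented.
AUDIT_LINES = """\
{"timestamp": "2015-06-01T05:24:58.231Z", "operation": "READ", "user": "alice", "srcPath": "/customers/eu.csv"}
{"timestamp": "2015-06-01T10:02:11.000Z", "operation": "READ", "user": "bob", "srcPath": "/customers/us.csv"}
{"timestamp": "2015-06-01T23:40:00.500Z", "operation": "READ", "user": "carol", "srcPath": "/logs/web.log"}
"""

BUSINESS_START_HOUR = 8   # assumed window: 08:00 (inclusive) ...
BUSINESS_END_HOUR = 18    # ... to 18:00 (exclusive), UTC


def out_of_hours_access(lines: str, path_prefix: str) -> list:
    """Return (user, srcPath, timestamp) for accesses outside business hours."""
    hits = []
    for line in lines.splitlines():
        if not line.strip():
            continue
        event = json.loads(line)
        when = datetime.strptime(event["timestamp"], "%Y-%m-%dT%H:%M:%S.%fZ")
        in_hours = BUSINESS_START_HOUR <= when.hour < BUSINESS_END_HOUR
        if event["srcPath"].startswith(path_prefix) and not in_hours:
            hits.append((event["user"], event["srcPath"], event["timestamp"]))
    return hits


print(out_of_hours_access(AUDIT_LINES, "/customers/"))
```

In practice the same filter could be expressed in SQL over the audit logs, as the trust model slide suggests; the script form is used here only to make the logic explicit.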
MapR Comprehensive Auditing (cont.)
…And Data Scientists Too
• Which data is used most frequently? Implication: High Value; Share More Broadly
• Which data is least commonly used? Implication: Low Value; Candidate for Purge
• Which data should be used more? Implication: Underutilized; Increase Awareness
• What administrative actions are most commonly performed? Implication: Candidate for automation
Predictive Analytics
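The first two questions reduce to counting accesses per path. A minimal sketch follows, assuming the audit events have already been parsed into dictionaries with a `srcPath` field; the sample events and the `rank_by_usage` helper are hypothetical.

```python
from collections import Counter

# Hypothetical, already-parsed audit events (values invented for illustration).
EVENTS = [
    {"user": "alice", "srcPath": "/sales/orders.csv"},
    {"user": "bob", "srcPath": "/sales/orders.csv"},
    {"user": "alice", "srcPath": "/sales/orders.csv"},
    {"user": "carol", "srcPath": "/hr/reviews.csv"},
    {"user": "bob", "srcPath": "/mktg/leads.csv"},
    {"user": "bob", "srcPath": "/mktg/leads.csv"},
]


def rank_by_usage(events):
    """Return (path, access_count) pairs, most frequently accessed first."""
    counts = Counter(event["srcPath"] for event in events)
    return counts.most_common()


ranking = rank_by_usage(EVENTS)
most_used = ranking[0]    # candidate to share more broadly
least_used = ranking[-1]  # candidate for purge review
print(most_used, least_used)
```

Paths that were never accessed do not appear in an audit log at all, so finding truly unused data would also require comparing these counts against the full inventory of paths.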
MapR Audits – Key Features
Data Access
• Files
• MapR-DB Tables
Cluster Operations
• Administrative Operations
• Maprcli commands
Authentication Requests
Secure High Performance
Flexible
• Retention Period
• Maxsize
• Coalesce Interval
JSON Format
{"timestamp":"{$date=2015-06-01T05:24:58.231Z}","operation":"GETATTR","user":"root","uid":"0","ipAddress":"10.10.x.x","nfsServer":"10.10.x.x","srcPath":"/dbtest.0/","srcFid":"2147.16.2","VolumeName":"mktg_files","volumeId":"mktg_files","status":"0"}
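A record in this shape can be loaded with any JSON parser. The sketch below is an illustration under stated assumptions: it uses the record as printed on the slide, with straight quotes, and the `parse_audit_record` helper and the regular expression for the `{$date=...}` wrapper are hypothetical, based only on how the timestamp appears here.

```python
import json
import re
from datetime import datetime

# The sample record from the slide, with typographic quotes normalized.
RECORD = (
    '{"timestamp":"{$date=2015-06-01T05:24:58.231Z}","operation":"GETATTR",'
    '"user":"root","uid":"0","ipAddress":"10.10.x.x","nfsServer":"10.10.x.x",'
    '"srcPath":"/dbtest.0/","srcFid":"2147.16.2","VolumeName":"mktg_files",'
    '"volumeId":"mktg_files","status":"0"}'
)

# Assumed wrapper around the timestamp, as it is printed on the slide.
DATE_WRAPPER = re.compile(r"^\{\$date=(.+)\}$")


def parse_audit_record(text: str) -> dict:
    """Parse one audit record and turn its timestamp into a datetime."""
    event = json.loads(text)
    match = DATE_WRAPPER.match(event["timestamp"])
    # Fall back to the raw value if the wrapper is absent.
    raw = match.group(1) if match else event["timestamp"]
    event["timestamp"] = datetime.strptime(raw, "%Y-%m-%dT%H:%M:%S.%fZ")
    return event


event = parse_audit_record(RECORD)
print(event["user"], event["operation"], event["srcPath"], event["timestamp"])
```

Once the timestamp is a real `datetime`, records can be filtered by time window or grouped by user, path or volume with ordinary tooling.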
Access Control that Scales
PAM Authentication + User Impersonation
Fine-grained row and column level access control with Drill Views – no centralized security repository required
[Diagram labels: Users; Drill View 1, Drill View 2; Files, HBase, Hive]
Ownership Chaining
Combine Self Service Exploration with Data Governance

Raw File (/raw/cards.csv)
John (Owner)
Name    City        State    Credit Card #
Dave    San Jose    CA       1374-7914-3865-4817
John    Boulder     CO       1374-9735-1794-9711

Data Scientist (/views/V_Scientist)
Jane (Read), John (Owner)
Name    City        State    Credit Card #
Dave    San Jose    CA       1374-1111-1111-1111
John    Boulder     CO       1374-1111-1111-1111

Analyst (/views/V_Analyst)
Jack (Read), Jane (Owner)
Name    City        State
Dave    San Jose    CA
John    Boulder     CO

Jack queries the view V_Analyst
• Does Jack have access to V_Analyst? -> YES
• Who is the owner of V_Analyst? -> Jane
• Drill accesses V_Analyst as Jane (Impersonation hop 1)
• Does Jane have access to V_Scientist? -> YES
• Who is the owner of V_Scientist? -> John
• Drill accesses V_Scientist as John (Impersonation hop 2)
• Does John have permissions on raw file? -> YES
• Who is the owner of raw file? -> John
• Drill accesses source file as John (no impersonation here)
*Ownership chain length (# hops) is configurable
[Diagram legend: Ownership chaining, Access path]
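The access path above can be traced with a small simulation. This is a sketch of the idea as described on the slide, not Drill's implementation: the `Obj` class, the `resolve` function and the `max_hops` parameter name are hypothetical, and the default limit of 2 hops is chosen only so that the slide's example succeeds.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Obj:
    """A raw file or a view: who owns it, who may read it, what it is built on."""
    name: str
    owner: str
    readers: List[str] = field(default_factory=list)
    source: Optional["Obj"] = None  # None for a raw file

    def can_read(self, user: str) -> bool:
        return user == self.owner or user in self.readers


def resolve(user: str, target: Obj, max_hops: int = 2) -> List[str]:
    """Walk from a view down to the raw file, impersonating each view owner."""
    trace = []
    current_user, obj, hops = user, target, 0
    while obj is not None:
        if not obj.can_read(current_user):
            raise PermissionError(f"{current_user} cannot read {obj.name}")
        if obj.source is not None:
            # Simplification: every view expansion counts as one hop,
            # because the view is expanded under its owner's identity.
            hops += 1
            if hops > max_hops:
                raise PermissionError("ownership chain exceeds max_hops")
            current_user = obj.owner
            trace.append(f"{obj.name} as {current_user} (hop {hops})")
        else:
            trace.append(f"{obj.name} as {current_user} (no impersonation)")
        obj = obj.source
    return trace


# Objects and permissions as shown on the slide.
raw = Obj("/raw/cards.csv", owner="John")
v_scientist = Obj("V_Scientist", owner="John", readers=["Jane"], source=raw)
v_analyst = Obj("V_Analyst", owner="Jane", readers=["Jack"], source=v_scientist)

for step in resolve("Jack", v_analyst):
    print(step)
```

Jack never needs permissions on V_Scientist or the raw file; each layer is reached under its owner's identity, and the hop limit bounds how deep such delegation can go.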
Find, Understand and Govern Data in Hadoop
At Scale and in Real-Time
Discover and protect sensitive data, audit and authorize access to the data lake, discover data lineage, and provide data stewardship
CDO/Data Steward
Automate cataloging of data assets at scale, with secure provisioning to business users
Big Data Architect
Find and understand best-suited and most trusted data without having to explore every file manually
Data Engineer/Data Scientist/Business Analyst
Learn More
www.waterlinedata.com
• Watch the solution video
• Read analyst papers
• Download the free Waterline Data / MapR sandbox
• Request a demo
• Download and evaluate the product
www.mapr.com
• Get free On-Demand Training for Hadoop
• Download the free Waterline Data / MapR sandbox