Best Practices to Deploy a Governed Data Lake


© 2015 MapR Technologies

Deploying a Governed Data Lake

Dale Kim, MapR Technologies Oliver Claude, Waterline Data

August 4, 2015


Welcome

• Event will be recorded
• Ask your questions in the Q&A Panel in the lower right-hand corner of your screen
• Tweet us @mapr during the event


Key Points

• The data lake is becoming a “real-time” shared service to provide data to the business to support data science and big data analytics needs

• As the data lake becomes a trusted source of data to drive big data analytics, security and data governance have to be addressed

• Security and data governance policies need to be implemented in a way that still enables self-service and quick time to value, rather than creating 3-6 month delays


Deliver Data Discovery Agility with a Governed “Data Layer”

Adhere to security, compliance and data governance policies

Catalog data assets at scale, with secure provisioning to the business

Find and understand best-suited and most trusted data


The danger of the data lake becoming a flea market

Botond Horvath / Shutterstock.com

• Big Data Architect: Can’t create and maintain an inventory fast enough
• Data Engineer/Data Scientist/Business Analyst: Can’t explore everything to find the best item
• CDO/Data Steward: Can’t tell what’s what and what can be trusted


Imagine shopping on Amazon.com

GOVERNANCE

Inventory

Find and Understand

Provision


Governed data lake is like Amazon.com for data in Hadoop

GOVERNANCE

Inventory

Find and Understand

Provision


The Governed Data Lake on Apache Hadoop

[Architecture diagram]
• Sources: relational, SaaS, mainframe; documents, emails; log files, clickstreams; sensors; blogs, tweets, link data
• Analytics: search; schema-less data exploration; BI, reporting; ad-hoc integrated analytics
• Operational Apps: recommendation; fraud detection; logistics
• MapR Data Platform: MapR-DB, MapR-FS (distribution including Apache Hadoop)
• Data Inventory: find, understand and govern


The Governed Data Lake

Define -> Ingest -> Inventory -> Explore -> Provision -> Wrangle/Model/Visualize

• Critical data elements

• Sensitive data elements

• Security and data governance policies

• Load

• Profile

• Automatic tagging

• Discover metadata and generate tags

• Discover data lineage

• Manage tags

• Browse/search inventory

• Inspect data quality

• Tag and annotate

• Bookmark

• Copy

• Authorized view

Governed data lake as a shared service

Data Governance Data Discovery Agility

Data protection, authentication, authorization, auditing

Can you achieve both?


Find, understand and govern data in Hadoop


Waterline Data is like Amazon.com for data in Hadoop

GOVERNANCE

Inventory

Find and Understand

Provision


Inventory


Find and Understand


Provision

Future: Generate Drill Views


Governance


The Governed Data Lake on Apache Hadoop with MapR

[Architecture diagram]
• Sources: relational, SaaS, mainframe; documents, emails; log files, clickstreams; sensors; blogs, tweets, link data
• Analytics: search; schema-less data exploration; BI, reporting; ad-hoc integrated analytics
• Operational Apps: recommendation; fraud detection; logistics
• MapR Data Platform: MapR-DB, MapR-FS (distribution including Apache Hadoop)
• Data Inventory: find, understand and govern


Separate Distinct Data Sets via MapR Volumes

Volumes dramatically simplify management:

• Replication factor
• Scheduled mirroring
• Scheduled snapshots
• Data placement control
• User access and tracking
• Administrative permissions

/projects
    /tahoe
    /yosemite
/user
    /msmith
    /bjohnson


MapR Trust Model (Product Security)

1. Flexible Authentication
• Wire-level authentication for all services in the cluster
• NSA-level cryptographic algorithms
• Integration with LDAP, Active Directory and other third party directory services
• Kerberos or username/password authentication

2. Granular Authorization
• Access Control Expressions
• Protect files, tables, column families, columns, and management objects
• Extend to role-based access control (RBAC) with custom role functions
• Drill Views

3. Robust Auditing
• All events recorded immediately in JSON log files
• Includes data access and administrative actions
• Ad-hoc queries and custom reports on audit logs via SQL and standard BI tools

4. Ubiquitous Data Protection
• Encryption for Data in Motion
    • Within a Cluster
    • Between Clusters
    • Between Client and Cluster
• Encryption for Data at Rest
    • LUKS
    • Self-Encrypting Disk
    • Partners


MapR Comprehensive Auditing

Serving Security Analysts…

Monitoring

Incident Response

• Who touched customer records outside of business hours?

• What actions did users take in the days before leaving the company?

• What operations were performed without following change control?

• Are users accessing sensitive files from protected/secured source IPs?

• Why do my reports look different, despite sourcing from the same underlying data?

Security


MapR Comprehensive Auditing (cont.)

…And Data Scientists Too

• Which data is used most frequently? Implication: High Value; Share More Broadly

• Which data is least commonly used? Implication: Low Value; Candidate for Purge

• Which data should be used more? Implication: Underutilized; Increase Awareness

• What administrative actions are most commonly performed? Implication: Candidate for automation
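The usage questions above can be approached by counting audited accesses per path. The following is a minimal sketch, assuming audit records are available as one JSON object per line with `operation` and `srcPath` fields. The `READ` operation name, the user names and the file names are hypothetical and used only for illustration.

```python
import json
from collections import Counter

# Hypothetical audit records, one JSON object per line. Only the field names
# "operation" and "srcPath" follow the audit record format; the values are made up.
AUDIT_LINES = [
    '{"operation": "READ", "user": "alice", "srcPath": "/projects/tahoe/sales.csv"}',
    '{"operation": "READ", "user": "bob", "srcPath": "/projects/tahoe/sales.csv"}',
    '{"operation": "READ", "user": "alice", "srcPath": "/projects/yosemite/logs.json"}',
    '{"operation": "GETATTR", "user": "root", "srcPath": "/projects/tahoe/sales.csv"}',
]


def access_counts(lines, operations=("READ",)):
    """Count audited operations per path, keeping only the given operation types."""
    counts = Counter()
    for line in lines:
        record = json.loads(line)
        if record.get("operation") in operations and "srcPath" in record:
            counts[record["srcPath"]] += 1
    return counts


counts = access_counts(AUDIT_LINES)
most_used = counts.most_common(1)[0]
least_used = min(counts.items(), key=lambda item: item[1])
print("most used:", most_used)    # candidate to share more broadly
print("least used:", least_used)  # candidate for purge or more awareness
```

Metadata-only operations such as `GETATTR` are filtered out here so that they do not inflate the usage count; whether that is the right choice depends on what "used" should mean for a given data set.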

Predictive Analytics


MapR Audits – Key Features

Data Access
• Files
• MapR-DB Tables

Cluster Operations
• Administrative Operations
• maprcli commands

Authentication Requests

Secure High Performance

Flexible
• Retention Period
• Maxsize
• Coalesce Interval

JSON Format

{"timestamp":"{$date=2015-06-01T05:24:58.231Z}","operation":"GETATTR","user":"root","uid":"0","ipAddress":"10.10.x.x","nfsServer":"10.10.x.x","srcPath":"/dbtest.0/","srcFid":"2147.16.2","VolumeName":"mktg_files","volumeId":"mktg_files","status":"0"}
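Because each audit event is a JSON record, a question such as "who touched records outside of business hours?" reduces to parsing the record and checking its timestamp. The sketch below parses the sample record shown above. The helper names, the 09:00 to 17:00 window, and the choice to evaluate hours in UTC are assumptions made for illustration.

```python
import json
import re
from datetime import datetime

# The sample audit record from the slide, with plain ASCII quotes.
SAMPLE = '{"timestamp":"{$date=2015-06-01T05:24:58.231Z}","operation":"GETATTR","user":"root","uid":"0","ipAddress":"10.10.x.x","nfsServer":"10.10.x.x","srcPath":"/dbtest.0/","srcFid":"2147.16.2","VolumeName":"mktg_files","volumeId":"mktg_files","status":"0"}'

# The timestamp value is wrapped as "{$date=...}"; capture the part inside.
DATE_PATTERN = re.compile(r"\{\$date=(.+?)\}")


def parse_timestamp(raw):
    """Extract the wrapped ISO-8601 timestamp and return a naive UTC datetime."""
    match = DATE_PATTERN.fullmatch(raw)
    if match is None:
        raise ValueError(f"unexpected timestamp format: {raw!r}")
    return datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S.%fZ")


def outside_business_hours(record, start_hour=9, end_hour=17):
    """True if the event falls outside [start_hour, end_hour), evaluated in UTC."""
    hour = parse_timestamp(record["timestamp"]).hour
    return not (start_hour <= hour < end_hour)


record = json.loads(SAMPLE)
if outside_business_hours(record):
    print(record["user"], record["operation"], record["srcPath"])
```

In practice the same filter could be expressed in SQL over the audit logs, as the auditing slide suggests; the Python version only illustrates the record structure.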


Access Control that Scales

PAM Authentication + User Impersonation

Fine-grained row and column level access control with Drill Views – no centralized security repository required

[Diagram labels: Users; Drill View 1, Drill View 2; Files, HBase, Hive]


Ownership Chaining
Combine Self Service Exploration with Data Governance

Raw File (/raw/cards.csv)
Name | City | State | Credit Card #
Dave | San Jose | CA | 1374-7914-3865-4817
John | Boulder | CO | 1374-9735-1794-9711

Data Scientist (/views/V_Scientist): Jane (Read), John (Owner)
Name | City | State | Credit Card #
Dave | San Jose | CA | 1374-1111-1111-1111
John | Boulder | CO | 1374-1111-1111-1111

Analyst (/views/V_Analyst): Jack (Read), Jane (Owner)
Name | City | State
Dave | San Jose | CA
John | Boulder | CO
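The two views in this example differ only in how much of the raw file they expose: the data scientist view masks the card number, and the analyst view drops the column. On the slide these are Drill views; the following Python sketch only mimics the transformations to make them concrete. The function names are hypothetical, and the masking rule (keep the first group, replace later digits with 1) is inferred from the sample rows.

```python
RAW_ROWS = [
    {"Name": "Dave", "City": "San Jose", "State": "CA", "Credit Card #": "1374-7914-3865-4817"},
    {"Name": "John", "City": "Boulder", "State": "CO", "Credit Card #": "1374-9735-1794-9711"},
]


def mask_card(number):
    """Keep the first group of digits and replace every later digit with '1'."""
    first, *rest = number.split("-")
    return "-".join([first] + ["1" * len(group) for group in rest])


def scientist_view(rows):
    """Same columns as the raw file, with the card number masked."""
    return [{**row, "Credit Card #": mask_card(row["Credit Card #"])} for row in rows]


def analyst_view(rows):
    """Built on top of the scientist view, with the card column removed."""
    return [
        {key: value for key, value in row.items() if key != "Credit Card #"}
        for row in scientist_view(rows)
    ]


for row in scientist_view(RAW_ROWS):
    print(row)
for row in analyst_view(RAW_ROWS):
    print(row)
```

Layering `analyst_view` on `scientist_view` mirrors the chain on the slide, where V_Analyst reads from V_Scientist rather than from the raw file directly.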

Jack queries the view V_Analyst:

• Does Jack have access to V_Analyst? -> YES
• Who is the owner of V_Analyst? -> Jane
• Drill accesses V_Analyst as Jane (Impersonation hop 1)
• Does Jane have access to V_Scientist? -> YES
• Who is the owner of V_Scientist? -> John
• Drill accesses V_Scientist as John (Impersonation hop 2)
• Does John have permissions on raw file? -> YES
• Who is the owner of raw file? -> John
• Drill accesses source file as John (no impersonation here)

*Ownership chain length (# hops) is configurable
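The access path in this example can be modeled as a walk down a chain of objects, switching identity to each object's owner and counting each switch as one hop. This is a simplified sketch of the idea and not Drill's implementation; the `Resource` class, the `resolve` function and the `max_hops` parameter are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Resource:
    name: str
    owner: str
    readers: set = field(default_factory=set)
    source: Optional["Resource"] = None  # the object this view is built on


def can_read(resource, user):
    return user == resource.owner or user in resource.readers


def resolve(resource, user, max_hops=3):
    """Return (object, identity used) pairs along the ownership chain."""
    identities = []
    hops = 0
    current_user = user
    while resource is not None:
        if not can_read(resource, current_user):
            raise PermissionError(f"{current_user} cannot read {resource.name}")
        # The object is read as its owner; a change of identity is one hop.
        if resource.owner != current_user:
            hops += 1
            if hops > max_hops:
                raise PermissionError("ownership chain exceeds the hop limit")
            current_user = resource.owner
        identities.append((resource.name, current_user))
        resource = resource.source
    return identities


raw_file = Resource("/raw/cards.csv", owner="John")
v_scientist = Resource("V_Scientist", owner="John", readers={"Jane"}, source=raw_file)
v_analyst = Resource("V_Analyst", owner="Jane", readers={"Jack"}, source=v_scientist)

for name, identity in resolve(v_analyst, "Jack"):
    print(f"{name} accessed as {identity}")
```

With these objects, Jack can query V_Analyst even though he has no permission on V_Scientist or the raw file, while a direct `resolve(raw_file, "Jack")` raises `PermissionError`. Lowering `max_hops` to 1 also rejects the query, which corresponds to the configurable chain length noted above.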


Find, Understand and Govern Data in Hadoop
At Scale and in Real-Time

Discover and protect sensitive data, audit and authorize access to the data lake, discover data lineage, and provide data stewardship

CDO/Data Steward

Automate cataloging of data assets at scale, with secure provisioning to business users

Big Data Architect

Find and understand best-suited and most trusted data without having to explore every file manually

Data Engineer/Data Scientist/Business Analyst


Learn More

www.waterlinedata.com

• Watch the solution video
• Read analyst papers
• Download the free Waterline Data / MapR sandbox
• Request a demo
• Download and evaluate the product

www.mapr.com

• Get free On-Demand Training for Hadoop

• Download the free Waterline Data / MapR sandbox