Hadoop Security in Big-Data-as-a-Service Deployments - Presented at Hadoop Summit 2016

End-to-End Security and Auditing in a Big-Data-as-a-Service (BDaaS) Deployment, by Abhiraj Butala (BlueData) and Nanda Vijaydev (BlueData)

Upload: abhiraj-butala

Post on 21-Apr-2017


TRANSCRIPT

Page 1: Hadoop Security in Big-Data-as-a-Service Deployments - Presented at Hadoop Summit 2016

End-to-End Security and Auditing in a Big-Data-as-a-Service (BDaaS) Deployment

Abhiraj Butala – BlueData

Nanda Vijaydev – BlueData

Page 2

“A mechanism for the delivery of statistical analysis tools and information that helps organizations understand and use insights gained from large information sets in order to gain a competitive advantage.”

On-Demand, Self-Service, Elastic

Big Data Infrastructure, Applications, Analytics

Source: www.semantikoz.com/blog/big-data-as-a-service-definition-classification

Big-Data-as-a-Service (BDaaS)

Page 3

Multi-Tenant Big-Data-as-a-Service

[Diagram: tenants Marketing, R&D, and Manufacturing (use cases: 360 Customer View, Log Analysis, Predictive Maintenance) with clusters labeled Prod 2.2, Dev/Test 2.4, POC 2.3, Prod 2.3, and Dev/Test 2.4, plus a shared Data/Storage layer (Data Lake, Staging).]

Multiple compute services (Hadoop, BI, Spark)

There is a shared Data Lake (Shared HDFS)

Page 4

Why BDaaS? – Compute Side Of The Story

• Set of applications that interact with Hadoop keeps growing

• Various versions of the same app/distro run in parallel

• Enterprises need to scale compute up and down based on usage

• A model similar to Amazon AWS with S3 as storage and applications on EC2

Page 5

Why BDaaS? – Data Side Of The Story

• Production cluster access takes time and is generally restricted

• Staging clusters may not have all the data

• Data exists on other storage systems such as NFS (Isilon is common)

• Users also want to upload arbitrary files for analysis

Page 6

Hadoop – A Collection Of Services

Hadoop is a collection of storage and compute services such as HDFS, HBase, Hive, YARN, Solr, and Kafka

Page 7

Security In Hadoop

• Authenticate users into the Hadoop ecosystem

– Each service has its own integration with LDAP/AD for authentication

• Authorize users and limit their actions to selected services. Authorization is granted separately for each service. Examples:

– Folder “/user/customer” in HDFS grants ‘r-x’ to user ‘alice’ and ‘-wx’ to user ‘bob’

– Enable column-level access to a Hive table: “Customer.Name” and “Customer.PhoneNumber” are accessible only to some users and groups
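The HDFS example above can be sketched as a small lookup. This is a toy model written for illustration, not an HDFS API: the `PERMISSIONS` table and the `is_allowed` helper are hypothetical names, and only the path, users, and permission strings come from the slide.

```python
# Toy model of per-user permissions on one HDFS folder (illustrative only).
# The path, users, and 'r-x' / '-wx' strings are taken from the slide;
# everything else is a hypothetical helper, not part of Hadoop.
PERMISSIONS = {
    ("/user/customer", "alice"): "r-x",
    ("/user/customer", "bob"): "-wx",
}


def is_allowed(path, user, action):
    """Return True if `user` holds `action` ('r', 'w' or 'x') on `path`."""
    if action not in ("r", "w", "x"):
        raise ValueError(f"unknown action: {action!r}")
    # Users without an entry get no access at all.
    return action in PERMISSIONS.get((path, user), "---")


if __name__ == "__main__":
    for user in ("alice", "bob", "tom"):
        granted = [a for a in "rwx" if is_allowed("/user/customer", user, a)]
        print(user, granted)
```

In this model alice can read and traverse the folder but not write to it, bob can write and traverse but not read, and any other user is denied.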

Page 8

Ranger – A Pluggable Security Framework

• Ranger works with a common user DB (LDAP/AD) for authentication

• Provides a plug-in for individual Hadoop services to enable authorization

• Allows users to define policies in a central location, using Web UI or APIs

• Users can define their own plug-in for a custom service and manage them centrally via Ranger Admin

Page 9

Defining HDFS Ranger Policies

HDFS Policy List

Marketing Policy Drill Down
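The policy screens on this slide are not reproduced in the transcript. As a rough idea of what an HDFS policy looks like when defined through the API rather than the Web UI, here is a minimal sketch that only builds the JSON body. The field names follow the layout of Ranger's public policy JSON as commonly documented and should be checked against the Ranger version in use. The service name, policy name, path, and group are hypothetical.

```python
import json


def build_hdfs_policy(service, name, path, groups, access_types):
    """Build a Ranger-style HDFS policy body as a plain dict.

    Assumption: field names mirror Ranger's public policy JSON;
    verify them against your Ranger release before use.
    """
    return {
        "service": service,
        "name": name,
        "isEnabled": True,
        "isAuditEnabled": True,
        "resources": {
            "path": {"values": [path], "isRecursive": True, "isExcludes": False}
        },
        "policyItems": [
            {
                "groups": list(groups),
                "users": [],
                "accesses": [
                    {"type": t, "isAllowed": True} for t in access_types
                ],
                "delegateAdmin": False,
            }
        ],
    }


# Hypothetical values: none of these names appear in the slides.
marketing_policy = build_hdfs_policy(
    service="bdaas_hadoop",
    name="marketing-read",
    path="/data/marketing",
    groups=["marketing"],
    access_types=["read", "execute"],
)

if __name__ == "__main__":
    print(json.dumps(marketing_policy, indent=2))
```

The sketch deliberately stops at building the payload. Submitting it to Ranger Admin depends on the deployment's URL and credentials, which the slides do not give.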

Page 10

Security Considerations in BDaaS

[Diagram: the multi-tenant BDaaS layout (Marketing, R&D, and Manufacturing clusters with a shared Data/Storage layer), annotated with three callouts: 1. User Identity – Data Lake; 2. User Identity – Application Level; 3. User Identity propagation to Data Layer.]

1. User identity within a Data Lake

2. User identity in application layer

3. Prevent data duplication & maintain user integrity across layers

Page 11

1. Securing The Data Lake

[Diagram: the multi-tenant BDaaS layout (Marketing, R&D, and Manufacturing clusters with a shared Data/Storage layer), with LDAP and KDC shown next to Data/Storage. Callouts: 1. Authentication & Authorization – Data Lake; 2. User Identity – Application Level; 3. User Identity propagation to Data Layer.]

Page 12

2. Securing The App Layer

[Diagram: the multi-tenant BDaaS layout (Marketing, R&D, and Manufacturing clusters with a shared Data/Storage layer), with LDAP, KDC (shown twice), and users Alice, Bob, and Tom. Callouts: 1. Authentication & Authorization – Data Lake; 2. User Identity – Application Level; 3. User Identity propagation to Data Layer.]

App containers are integrated with LDAP

Page 13

3. Identity Propagation to Data Layer

[Diagram: the multi-tenant BDaaS layout (Marketing, R&D, and Manufacturing clusters with a shared Data/Storage layer), with LDAP, KDC (shown twice), and users Alice, Bob, and Tom. Callouts: 1. Authentication & Authorization – Data Lake; 2. User Identity – Application Level; 3. User Identity propagation to Data Layer.]

Page 14

User Identity Propagation

Two ways:

– Users connect directly to HDFS

• Simple Authentication

• Kerberos Authentication

– Users connect to HDFS via a super-user (impersonation)

Page 15

HDFS Direct Connections

[Diagram: users Alice, Bob, and Tom in the Marketing, R&D, and Manufacturing clusters (Prod 2.2, Dev/Test 2.4, POC 2.3, Prod 2.3, Dev/Test 2.4) connecting to the HDFS Data Lake, with LDAP and KDC (shown twice).]

Page 16

HDFS Direct Connections..

– hdfs-audit.log

– Ranger policies are enforced for alice and bob as they are the effective users

Page 17

HDFS Direct Connections..

• Single Hadoop Setup

– Ideal

• Multi-tenant, Multi-application Setup

– Kerberized HDFS needs kerberized compute and services

– May not want to kerberize Dev/QA setups

– Hadoop versions should be compatible all across

– Data duplication

Page 18

HDFS Super-user Connections

• Super-users perform actions on behalf of other users (Impersonation/Proxying)

• Adding a new super-user is easy

– core-site.xml
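The core-site.xml snippet from this slide is missing from the transcript. Hadoop's proxy-user mechanism is configured through `hadoop.proxyuser.<superuser>.hosts` and `hadoop.proxyuser.<superuser>.groups` properties, so a sketch of generating such entries might look like the following. The super-user name `datatap`, the host address, and the group names are assumptions made for illustration, not values from the talk.

```python
import xml.etree.ElementTree as ET


def proxyuser_properties(superuser, hosts, groups):
    """Return the core-site.xml properties that let `superuser`
    impersonate members of `groups` when connecting from `hosts`."""
    prefix = f"hadoop.proxyuser.{superuser}"
    return {
        f"{prefix}.hosts": ",".join(hosts),
        f"{prefix}.groups": ",".join(groups),
    }


def to_core_site_xml(properties):
    """Render a dict of properties as a <configuration> XML fragment."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical super-user, host, and groups.
props = proxyuser_properties("datatap", ["10.0.0.5"], ["marketing", "rnd"])

if __name__ == "__main__":
    print(to_core_site_xml(props))
```

Restricting both hosts and groups keeps the super-user from impersonating arbitrary users from arbitrary machines, which matters here because Ranger authorizes against the impersonated identity.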

Page 19

HDFS Super-user Connections..

[Diagram: users Alice, Bob, and Tom in the Marketing, R&D, and Manufacturing clusters (Prod 2.2, Dev/Test 2.4, POC 2.3, Prod 2.3, Dev/Test 2.4) reaching the HDFS Data Lake through the DataTap Caching Service via a super-user, with LDAP and KDC (shown twice).]

Page 20

HDFS Super-user Connections..

– hdfs-audit.log

– Ranger Authorization policies still enforced, as alice and bob are effective users
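The audit-log screenshots are not part of the transcript. To show what "effective user" means in practice, the sketch below extracts the effective and real users from the `ugi=` field of an audit line. The two sample lines are hand-written for illustration, modeled on the tab-separated key=value layout of hdfs-audit.log; they are not captured output, and the exact layout can differ between Hadoop versions. The `datatap` super-user name is hypothetical.

```python
import re

# "ugi=<effective> (auth:X)" optionally followed by " via <real> (auth:Y)".
UGI_RE = re.compile(
    r"ugi=(?P<effective>\S+) \(auth:(?P<auth>[\w-]+)\)"
    r"(?: via (?P<real>\S+) \(auth:(?P<real_auth>[\w-]+)\))?"
)


def parse_ugi(line):
    """Return effective/real user info from an audit line, or None."""
    match = UGI_RE.search(line)
    if match is None:
        return None
    return {
        "effective_user": match.group("effective"),
        "real_user": match.group("real"),  # None for direct connections
        "proxied": match.group("real") is not None,
    }


# Illustrative lines only (assumed layout, made-up values).
DIRECT = "allowed=true\tugi=alice (auth:SIMPLE)\tip=/10.0.0.7\tcmd=open\tsrc=/user/secret/data.csv"
PROXIED = "allowed=true\tugi=bob (auth:PROXY) via datatap (auth:SIMPLE)\tip=/10.0.0.5\tcmd=open\tsrc=/user/secret/data.csv"

if __name__ == "__main__":
    print(parse_ugi(DIRECT))
    print(parse_ugi(PROXIED))
```

In both cases the effective user is the end user (alice or bob), which is why the same Ranger policies apply whether the request arrives directly or through a super-user.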

Page 21

HDFS Super-user Connections..

Multi-tenant, Multi-application Setup

– Works for applications which don’t support Kerberos (yet)

– Dev/Test setups need not be kerberized

– DataTap service can abstract version incompatibilities

– Can help avoid data duplication

– Need tight LDAP/AD integration though!

Page 22

Ranger in Action

Hue Example

Page 23

HDFS Permissions on Data Lake

• Set HDFS file access for ‘/user/secret’ to strict mode

• Set umask to ‘077’
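A umask of ‘077’ strips every group and other permission bit from newly created files and directories. The arithmetic can be checked with a short sketch; `apply_umask` and `mode_string` are illustrative helpers, and the base modes of 777 for directories and 666 for files are the usual POSIX-style defaults, assumed here rather than taken from the slides.

```python
def apply_umask(requested_mode, umask):
    """Clear the permission bits named in `umask` from `requested_mode`."""
    return requested_mode & ~umask & 0o777


def mode_string(mode):
    """Render a 9-bit permission mode as e.g. 'rwx------'."""
    flags = "rwxrwxrwx"
    return "".join(
        flag if mode & (1 << (8 - i)) else "-" for i, flag in enumerate(flags)
    )


if __name__ == "__main__":
    # Assumed base modes: 0o777 for directories, 0o666 for files.
    print("directory:", mode_string(apply_umask(0o777, 0o077)))
    print("file:     ", mode_string(apply_umask(0o666, 0o077)))
```

With this umask only the owner keeps access at the file-permission level, so any access for other users has to come from an explicit Ranger policy.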

Page 24

HDFS Ranger Policies

Page 25

DataTap Caching Service

Page 26

Create Table via Hue

Page 27

Query table via Hue - Success

Page 28

Query table via Hue - Failure

Page 29

Ranger Audit Logs

Page 30

Key Takeaways

• BDaaS is more than Hadoop-as-a-Service

– Includes BI / ETL / Analytics + Data Science tools

• Security is an important consideration in BDaaS

• Data duplication is not an option

• Global user authentication using a centralized DB like LDAP/AD is a must

• Apache Ranger helps in enforcing global policies, provided user identities are propagated correctly

Page 31

Q & A

www.bluedata.com

Nanda Vijaydev: @nandavijaydev

Abhiraj Butala: @abhirajbutala