
Developing and Deploying Apache Hadoop Security
Owen O’Malley, Hortonworks Co-founder and Architect
[email protected], @owen_omalley

© Hortonworks Inc. 2011 July 25, 2011

Who am I

• An architect working on Hadoop full time since the beginning of the project (Jan ’06)
  − Primarily focused on MapReduce

• Tech-lead on adding security to Hadoop

• Co-founded Hortonworks this month

• Before Hadoop – Yahoo Search WebMap

• Before Yahoo – NASA, Sun

• PhD from UC Irvine

What is Hadoop?

• A framework for storing and processing big data on lots of commodity machines.
  − Up to 4,500 machines in a cluster
  − Up to 20 PB in a cluster

• Open Source Apache project

• High reliability done in software
  − Automated failover for data and computation

• Implemented in Java

• Primary data analysis platform at Yahoo!
  − 40,000+ machines running Hadoop
  − More than 1,000,000 jobs every month

Case Study: Yahoo Front Page

• Content personalized for each visitor

• Result: twice the engagement
  − +160% clicks vs. one size fits all
  − +79% clicks vs. randomly selected
  − +43% clicks vs. editor selected

• Applied to recommended links, news interests, and top searches

Problem

• Yahoo! has more yahoos than clusters.
  − Hundreds of yahoos use Hadoop each month
  − 40,000 computers in ~20 Hadoop clusters
  − Sharing requires isolation or trust

• Different users need different data.
  − Not all yahoos should have access to sensitive data, such as financial data and PII

• In Hadoop 0.20, it was easy to impersonate another user.
  − The only safe option was to segregate different data onto separate clusters


Solution

• Prevent unauthorized HDFS access
  − All HDFS clients must be authenticated
  − Including tasks running as part of MapReduce jobs
  − And jobs submitted through Oozie

• Users must also authenticate servers
  − Otherwise fraudulent servers could steal credentials

• Integrate Hadoop with Kerberos
  − Provides a well-tested, open source distributed authentication system


Requirements

• Security must be optional.
  − Not all clusters are shared between users

• Hadoop commands must not prompt for passwords.
  − Must have single sign-on
  − Otherwise trojan-horse versions are easy to write

• Must support backwards compatibility.
  − HFTP must be secure, but still allow reading from insecure clusters

Primary Communication Paths


Definitions

• Authentication – determining who the user is
  − Hadoop 0.20 completely trusted the user
  − The user passed their username and groups over the wire
  − We need authentication on both RPC and the Web UI

• Authorization – what can that user do?
  − HDFS has had owners, groups, and permissions since 0.16
  − MapReduce had nothing in 0.20

• Auditing – who did what?
  − Available since 0.20

Authentication

• Changes low-level transport

• RPC authentication using SASL
  − Kerberos (GSSAPI)
  − Token (DIGEST-MD5)
  − Simple

• Browser HTTP secured via plugin

• Tool HTTP (e.g. fsck) via SSL/Kerberos
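Where a cluster opts in, the switch from simple to Kerberos authentication is made in cluster configuration. A minimal core-site.xml sketch, assuming the property names used by the 0.20.20x security releases:

```xml
<!-- core-site.xml: select Kerberos (SASL/GSSAPI) instead of "simple" auth -->
<property>
  <name>hadoop.security.authentication</name>
  <value>kerberos</value>
</property>
<!-- also check service-level authorization on each RPC -->
<property>
  <name>hadoop.security.authorization</name>
  <value>true</value>
</property>
```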

Authorization

• HDFS
  − Command line unchanged
  − Web UI enforces authentication

• MapReduce added Access Control Lists
  − Lists of users and groups that have access
  − mapreduce.job.acl-view-job – who may view the job
  − mapreduce.job.acl-modify-job – who may kill or modify the job
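The values of these two ACL properties use Hadoop’s ACL string format: a comma-separated user list, a space, then a comma-separated group list, with "*" meaning everyone. A minimal self-contained sketch of that check (illustrative only, not Hadoop’s actual AccessControlList implementation; all names in the usage are made up):

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative checker for the "users groups" ACL string format used by
// mapreduce.job.acl-view-job and mapreduce.job.acl-modify-job.
// A sketch of the semantics only, not Hadoop's AccessControlList class.
public class JobAcl {
  // acl: comma-separated users, a space, then comma-separated groups;
  // "*" means everyone is allowed.
  public static boolean allows(String acl, String user, Set<String> groups) {
    String value = acl.trim();
    if (value.equals("*")) {
      return true;                       // wildcard: everyone
    }
    String[] parts = value.split(" ", 2);
    if (csv(parts[0]).contains(user)) {
      return true;                       // explicitly named user
    }
    if (parts.length > 1) {
      Set<String> allowed = csv(parts[1]);
      for (String g : groups) {
        if (allowed.contains(g)) {
          return true;                   // member of a named group
        }
      }
    }
    return false;
  }

  private static Set<String> csv(String list) {
    Set<String> out = new HashSet<>();
    for (String item : list.split(",")) {
      if (!item.trim().isEmpty()) {
        out.add(item.trim());
      }
    }
    return out;
  }
}
```

With an ACL of "alice,bob ops", alice and bob are allowed by name, and carol would be allowed only if she is a member of the ops group.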


Auditing

• A critical part of security is an accurate method for determining who did what.
  − Almost useless until you have strong authentication

• The HDFS audit log tracks
  − Reading or writing of files

• The MapReduce audit log tracks
  − Launching jobs or modifying job properties

Kerberos and Single Sign-on

• Kerberos allows a user to sign in once
  − Obtains a Ticket Granting Ticket (TGT)
  − kinit – get a new Kerberos ticket
  − klist – list your Kerberos tickets
  − kdestroy – destroy your Kerberos tickets
  − TGTs last for 10 hours, renewable for 7 days by default

• Once you have a TGT, Hadoop commands just work
  − hadoop fs -ls /
  − hadoop jar wordcount.jar in-dir out-dir


API Changes

• Very minimal API changes
  − Most applications work unchanged
  − UserGroupInformation changed *completely*

• MapReduce added secret credentials
  − Available from JobConf and JobContext
  − Never displayed via the Web UI

• Jobs automatically get delegation tokens for HDFS
  − Covers the primary HDFS cluster, File{In,Out}putFormat, and DistCp
  − Additional clusters can be set via mapreduce.job.hdfs-servers
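For that last point, a job that reads from HDFS clusters beyond its default one can list them so delegation tokens are fetched at submission time, before any task runs. A minimal job-configuration sketch (the NameNode host names here are made up):

```xml
<!-- job configuration: fetch delegation tokens for these extra HDFS
     clusters at job submission time -->
<property>
  <name>mapreduce.job.hdfs-servers</name>
  <value>hdfs://nn1.example.com:8020,hdfs://nn2.example.com:8020</value>
</property>
```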



MapReduce task-level security

• MapReduce tasks run as the submitting user.
  − No more accidentally killing TaskTrackers!
  − Implemented with a setuid C program

• Task output logs aren’t globally visible.

• Task work directories aren’t globally visible.

• Distributed cache is split
  − Public – shared between all users
  − Private – shared between jobs of the same user

Web UIs

• Hadoop relies on Web user interfaces served from embedded Jetty.
  − These need to be authenticated also…

• Web UI authentication is pluggable.
  − SPNEGO or static-user plugins are available
  − Companies may need or want their own systems

• All servlets enforce permissions based on the authenticated user.

Proxy-Users

• Some services access HDFS and MapReduce as other users.

• Configure the service masters (NameNode and JobTracker) with the proxy users.
  − For each proxy user, the configuration defines:
    • Who the proxy service can impersonate
    • Which hosts they can impersonate from

• New admin commands refresh the configuration
  − No need to bounce the cluster
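As a sketch, for a service that authenticates as an oozie principal (the service name and hosts here are illustrative), the master configuration would constrain impersonation roughly like this:

```xml
<!-- core-site.xml on the NameNode/JobTracker: the oozie service may
     impersonate members of the users group, but only when connecting
     from the listed hosts -->
<property>
  <name>hadoop.proxyuser.oozie.groups</name>
  <value>users</value>
</property>
<property>
  <name>hadoop.proxyuser.oozie.hosts</name>
  <value>oozie1.example.com,oozie2.example.com</value>
</property>
```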


Out of Scope

• Encryption
  − RPC transport
  − Block transport protocol
  − On disk

• File Access Control Lists
  − Still use Unix-style owner, group, and other permissions

• Non-Kerberos authentication
  − Much easier now that the framework is available

Deployment

• The security team worked hard to get security added to Hadoop on schedule.

  − The rollout was the smoothest major Hadoop release in a long time
  − In the 0.20.203.0 and upcoming 0.20.204.0 releases
  − Measured performance degradation of less than 3%

• Security Development team:

−Devaraj Das, Ravi Gummadi, Jakob Homan, Owen O’Malley, Jitendra Pandey, Boris Shkolnik, Vinod Vavilapalli, Kan Zhang

• Currently deployed on all shared clusters (alpha, science, and production) at Yahoo!


Incident after Deployment

• The only tense incident involved one cluster where 1/3 of the machines dropped out after a day.

• Had to diagnose what had gone wrong.

• The dropped machines had newer keytab files!

• An operator had regenerated the keys on 1/3 of the cluster after it was running. Servers failed when they tried to renew their tickets.


Hadoop Eco-system

• Security percolates upward…
  − You can only be as secure as the lower levels
  − Pig finished integrating with security
  − Oozie supports security
  − HBase is being updated for security
    • All backing data files are owned by the HBase user
    • Doesn’t support reading/writing files directly by the application
  − Hive is also being updated
    • Doesn’t support column-level permissions

Questions?

• Questions should be sent to:

− common/hdfs/[email protected]

• Security holes should be sent to:

[email protected]

• Thanks!