HW09: MapReduce over Tahoe, a Least-Authority Encrypted Distributed Filesystem


MapReduce over Tahoe

Hadoop World 2009, New York, Oct 1, 2009

Aaron Cordova, Associate
Booz Allen Hamilton Inc.
134 National Business Parkway
Annapolis Junction, MD 20701
[email protected]


MapReduce over Tahoe

Impact of data security requirements on large scale analysis

Introduction to Tahoe

Integrating Tahoe with Hadoop’s MapReduce

Deployment scenarios and considerations

Test results


Features of Large Scale Analysis

As data grows, it becomes harder and more expensive to move ("massive" data)

The more data sets are located together, the more valuable each becomes (the network effect)

Bring computation to the data



Data Security and Large Scale Analysis

[Diagram: departments such as CRM, Sales, and Product Testing, each with its own data]

Each department within an organization has its own data

Some data need to be shared

Other data must be protected


Data Security

[Diagram: four departments, each running its own separate stack of apps, processing, storage, and support]

Because of security constraints, departments tend to set up their own data storage and processing systems independently

This includes support staff

Highly inefficient

Analysis across datasets is impossible


“Stovepipe Effect”


Tahoe - A Least Authority File System

Release 1.5

Developed by AllMyData.com

Included in Ubuntu Karmic Koala

Open Source



Tahoe Architecture

[Diagram: a trusted client connected to untrusted storage servers over SSL]

Data originates at the client, which is trusted

Client encrypts, segments, and erasure-codes data

Segments are distributed to storage nodes over encrypted links

Storage nodes only see encrypted data, and are not trusted
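To make that pipeline concrete, here is a minimal Java sketch of the client-side steps. Tahoe itself is written in Python; the cipher mode (CTR), the class name, the fixed IV, and the k value below are illustrative assumptions, using only the standard javax.crypto API:

    import java.security.SecureRandom;
    import java.util.Arrays;
    import javax.crypto.Cipher;
    import javax.crypto.spec.IvParameterSpec;
    import javax.crypto.spec.SecretKeySpec;

    // Illustrative sketch: encrypt on the trusted client, then split the
    // ciphertext into k segments for erasure coding and distribution.
    public class ClientPipeline {
      public static void main(String[] args) throws Exception {
        byte[] plaintext = "data originates at the trusted client".getBytes("UTF-8");

        // 1. Encrypt client-side; storage servers never see this key.
        byte[] key = new byte[16];                    // AES-128
        new SecureRandom().nextBytes(key);
        Cipher aes = Cipher.getInstance("AES/CTR/NoPadding");
        aes.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key, "AES"),
                 new IvParameterSpec(new byte[16])); // demo-only fixed IV
        byte[] ciphertext = aes.doFinal(plaintext);

        // 2. Segment the ciphertext; erasure coding (next slides) then
        //    expands the k segments into n shares for the storage nodes.
        int k = 3;
        int segLen = (ciphertext.length + k - 1) / k;
        for (int i = 0; i < k; i++) {
          byte[] seg = Arrays.copyOfRange(ciphertext, i * segLen,
              Math.min((i + 1) * segLen, ciphertext.length));
          System.out.println("segment " + i + ": " + seg.length + " bytes");
        }
      }
    }

In Tahoe the shares then travel to the storage nodes over encrypted (SSL) links, so the servers see ciphertext at rest and in transit.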


Tahoe Architecture Features

AES Encryption

Segmentation

Erasure-coding

Distributed

Flexible Access Control



Erasure Coding Overview

[Diagram: a file encoded into n shares, any k of which suffice to rebuild it]

Only k of n segments are needed to recover the file

Up to n-k machines can fail, be compromised, or malicious without data loss

n and k are configurable, and can be chosen to achieve desired availability

Expansion factor of the data is n/k (with the defaults k = 3, n = 10, that is 10/3, or about 3.3)
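As a toy illustration of the principle (not Tahoe's actual codec: Tahoe uses Reed-Solomon coding, via the zfec library, so that arbitrary k and n can be chosen), here is the simplest possible erasure code in Java, a single XOR parity share with k = 2, n = 3. Any one share can be lost and rebuilt from the other two:

    import java.util.Arrays;

    // Toy [n=3, k=2] erasure code: two data shares plus one XOR parity
    // share (expansion factor 3/2 = 1.5x). Any two of the three shares
    // suffice to recover the original data.
    public class XorParityDemo {
      public static void main(String[] args) {
        byte[] s0 = "first half of the".getBytes();   // data share 0
        byte[] s1 = " file, same size!".getBytes();   // data share 1 (equal length)
        byte[] s2 = new byte[s0.length];              // parity share
        for (int i = 0; i < s0.length; i++)
          s2[i] = (byte) (s0[i] ^ s1[i]);

        // The node holding s0 fails or turns malicious; rebuild s0 from s1 and s2.
        byte[] rebuilt = new byte[s0.length];
        for (int i = 0; i < s0.length; i++)
          rebuilt[i] = (byte) (s1[i] ^ s2[i]);

        System.out.println(Arrays.equals(rebuilt, s0));  // prints: true
      }
    }

With Tahoe's defaults (k = 3, n = 10) the same idea tolerates up to seven lost or compromised servers at a storage cost of 10/3, or about 3.3x.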


Flexible Access Control

[Diagram: a File and a Dir, each carrying a ReadCap and a WriteCap]

Each file has a Read Capability and a Write Capability

These are decryption keys

Directories have capabilities too


Flexible Access Control

[Diagram: sharing a directory's ReadCap grants access to the files and subdirectories attached beneath it]

Access to a subset of files can be granted by:
– creating a directory
– attaching files
– sharing the read or write capability of the directory

Any files or directories attached are accessible

Any outside the directory are not
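A toy Java model of that rule follows. Nothing here reflects Tahoe's real capability formats, which are unguessable, self-authenticating strings; the class, names, and cap values below are hypothetical:

    import java.util.ArrayList;
    import java.util.List;

    // Toy model: a ReadCap is an unguessable token; whoever holds a
    // directory's cap can reach exactly the children attached to it.
    class Node {
      final String name;
      final String readCap;                 // in Tahoe this encodes a decryption key
      final List<Node> children = new ArrayList<>();
      Node(String name, String readCap) { this.name = name; this.readCap = readCap; }
      void attach(Node child) { children.add(child); }

      // Access is mediated purely by possession of the cap, not by identity.
      List<String> list(String presentedCap) {
        if (!readCap.equals(presentedCap))
          throw new SecurityException("bad capability");
        List<String> names = new ArrayList<>();
        for (Node c : children) names.add(c.name);
        return names;
      }
    }

    public class CapDemo {
      public static void main(String[] args) {
        Node newProducts = new Node("/New Products", "cap-123");  // hypothetical cap
        newProducts.attach(new Node("spec.doc", "cap-456"));
        // Sharing "cap-123" grants access to spec.doc, but to nothing
        // outside /New Products.
        System.out.println(newProducts.list("cap-123"));  // [spec.doc]
      }
    }

Possession of the directory's cap is the whole access check: there are no user accounts, which is what later makes granting cross-organization access as simple as sending a capability string.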


Access Control Example

[Diagram: /Sales and /Testing directories, each linking to its own files]

Each department can access their own files

Access Control Example

[Diagram: a shared /New Products directory is added, linking files from both /Sales and /Testing]

Files that need to be shared can be linked to a new directory, whose read capability is given to both departments


Hadoop Can Use The Following File Systems

HDFS

CloudStore (KFS)

Amazon S3

FTP

Read-only HTTP

Now, Tahoe!



Hadoop File System Integration HowTo

Step 1. Locate your favorite file system's API

Step 2. Subclass FileSystem, found in src/core/org/apache/hadoop/fs/FileSystem.java (a skeletal sketch follows this list)

Step 3. Add lines to core-site.xml:

    <property>
      <name>fs.lafs.impl</name>
      <value>your.class</value>
    </property>

Step 4. Test using your favorite infrastructure service provider
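As a sketch of Step 2, against the Hadoop 0.20-era FileSystem API: the package and class name match the configuration shown two slides below, but the method bodies here are placeholder stubs, not the hadoop-lafs project's actual code:

    package org.apache.hadoop.fs.lafs;

    import java.io.IOException;
    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.*;
    import org.apache.hadoop.fs.permission.FsPermission;
    import org.apache.hadoop.util.Progressable;

    // Skeleton of a FileSystem backed by a local Tahoe node. Each stub
    // would translate the call into a request against the Tahoe client.
    public class LAFS extends FileSystem {
      private URI uri;
      private Path workingDir = new Path("/");

      @Override
      public void initialize(URI uri, Configuration conf) throws IOException {
        super.initialize(uri, conf);
        this.uri = uri;
        // e.g. read the lafs.rootcap property and resolve it via the Tahoe node
      }

      @Override
      public URI getUri() { return uri; }

      @Override
      public FSDataInputStream open(Path f, int bufferSize) throws IOException {
        throw new IOException("TODO: stream the file through the Tahoe client");
      }

      @Override
      public FSDataOutputStream create(Path f, FsPermission perm, boolean overwrite,
          int bufferSize, short replication, long blockSize, Progressable progress)
          throws IOException {
        throw new IOException("TODO: upload through the Tahoe client");
      }

      @Override
      public FSDataOutputStream append(Path f, int bufferSize, Progressable progress)
          throws IOException {
        throw new IOException("append is not supported");
      }

      @Override
      public boolean rename(Path src, Path dst) throws IOException {
        throw new IOException("TODO: relink in the Tahoe directory structure");
      }

      @Override
      public boolean delete(Path f, boolean recursive) throws IOException {
        throw new IOException("TODO: unlink via the Tahoe client");
      }

      @Override
      public FileStatus[] listStatus(Path f) throws IOException {
        throw new IOException("TODO: list the Tahoe directory");
      }

      @Override
      public void setWorkingDirectory(Path dir) { workingDir = dir; }

      @Override
      public Path getWorkingDirectory() { return workingDir; }

      @Override
      public boolean mkdirs(Path f, FsPermission permission) throws IOException {
        throw new IOException("TODO: create a Tahoe directory");
      }

      @Override
      public FileStatus getFileStatus(Path f) throws IOException {
        throw new IOException("TODO: stat via the Tahoe client");
      }

      // (depending on the Hadoop version, a deprecated delete(Path) stub
      // may also be required)
    }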



Hadoop Integration: MapReduce

[Diagram: MapReduce workers, each running a Tahoe client, communicating with a pool of storage servers]

One Tahoe client is run on each machine that serves as a MapReduce Worker

On average, clients communicate with k storage servers

Jobs are limited by aggregate network bandwidth

MapReduce workers are trusted, storage nodes are not


Hadoop-Tahoe Configuration

Step 1. Start Tahoe

Step 2. Create a new directory in Tahoe, note the WriteCap

Step 3. Configure core-site.xml thus (the full XML is sketched after this list):
– fs.lafs.impl: org.apache.hadoop.fs.lafs.LAFS
– lafs.rootcap: $WRITE_CAP
– fs.default.name: lafs://localhost

Step 4. Start MapReduce, but not HDFS
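Put together as a complete core-site.xml, those properties would look like this, with $WRITE_CAP standing in for the WriteCap noted in Step 2:

    <?xml version="1.0"?>
    <configuration>
      <property>
        <name>fs.lafs.impl</name>
        <value>org.apache.hadoop.fs.lafs.LAFS</value>
      </property>
      <property>
        <name>lafs.rootcap</name>
        <value>$WRITE_CAP</value>
      </property>
      <property>
        <name>fs.default.name</name>
        <value>lafs://localhost</value>
      </property>
    </configuration>

Setting fs.default.name to lafs://localhost is what lets MapReduce run against Tahoe directly, with no HDFS daemons started.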



Deployment Scenario - Large Organization

[Diagram: Sales and Audit each run their own group of MapReduce workers/Tahoe clients against shared storage servers]

Within a datacenter, departments can run MapReduce jobs on discrete groups of compute nodes

Each MapReduce job accesses a directory containing a subset of files

Results are written back to the storage servers, encrypted


Deployment Scenario - Community

[Diagram: FBI and Homeland Security each run discrete MapReduce workers/Tahoe clients against shared storage servers]

If a community uses a shared data center, different organizations can run discrete MapReduce jobs

Perhaps most importantly, when results are deemed appropriate to share, access can be granted simply by sending a read or write capability

Since the data are all co-located already, no data needs to be moved


Deployment Scenario - Public Cloud Services

[Diagram: on-site MapReduce workers/Tahoe clients using storage servers hosted in a cloud service provider's datacenter]

Since storage nodes require no trust, they can be located at a remote location, e.g. within a cloud service provider’s datacenter

MapReduce jobs can be run this way if bandwidth to the datacenter is adequate


Deployment Scenario - Public Cloud Services

[Diagram: both the MapReduce workers/Tahoe clients and the storage servers hosted at the cloud service provider]

For some users, everything could be run remotely in a service provider’s data center

There are a few caveats and additional precautions in this scenario:


Public Cloud Deployment Considerations

Store configuration files in memory

Encrypt / disable swap

Encrypt spillover

The provider's memory and hypervisor must still be trusted

The provider's disks need not be trusted



HDFS and Linux Disk Encryption Drawbacks

At most one key per node - no support for flexible access control

Decryption done at the storage node rather than at the client - still have to trust storage nodes


Tahoe and HDFS - Comparison

Feature            HDFS               Tahoe
Confidentiality    File permissions   AES encryption
Integrity          Checksums          Merkle hash tree
Availability       Replication       Erasure coding
Expansion factor   3x                 3.3x (n/k)
Self-healing       Automatic          Automatic
Load-balancing     Automatic          Planned
Mutable files      No                 Yes


Performance

[Chart: RandomWrite and WordCount runtimes, HDFS vs. Tahoe]

Tests run on ten nodes

RandomWrite writes 1 GB per node

WordCount done over randomly generated text

Tahoe's write speed is about 10x slower

Read-intensive jobs are about the same

This is not so bad, since the most common use case for data is write-once, read-many


Code

Tahoe is available from http://allmydata.org, licensed under GPL 2 or TGPPL

Integration code is available at http://hadoop-lafs.googlecode.com, licensed under Apache 2
