MapReduce over Tahoe: A Least-Authority Encrypted Distributed Filesystem
TRANSCRIPT
Hadoop World 2009, New York, Oct 1, 2009
MapReduce over Tahoe
Aaron Cordova, Associate
Booz Allen Hamilton Inc., 134 National Business Parkway, Annapolis Junction, MD 20701
MapReduce over Tahoe
Impact of data security requirements on large scale analysis
Introduction to Tahoe
Integrating Tahoe with Hadoop’s MapReduce
Deployment scenarios, considerations
Test results
Features of Large Scale Analysis
As data grows, it becomes harder and more expensive to move ("massive" data)
The more data sets are located together, the more valuable each becomes (a network effect)
Bring computation to the data
Data Security and Large Scale Analysis
[Diagram: departments such as CRM, Sales, Product, and Testing each hold their own data]
Each department within an organization has its own data
Some data need to be shared
Others are protected
Data Security
[Diagram: four departments, each independently running its own full stack of apps, processing, storage, and support]
Because of security constraints, departments tend to set up their own data storage and processing systems independently
This includes support staff
Highly inefficient
Analysis across datasets is impossible
Tahoe - A Least Authority File System
Release 1.5
AllMyData.com
Included in Ubuntu Karmic Koala
Open Source
Tahoe Architecture
[Diagram: a trusted client connects to storage servers over SSL]
Data originates at the client, which is trusted
Client encrypts, segments, and erasure-codes data
Segments are distributed to storage nodes over encrypted links
Storage nodes only see encrypted data, and are not trusted
Tahoe Architecture Features
AES Encryption
Segmentation
Erasure-coding
Distributed
Flexible Access Control
Erasure Coding Overview
Only k of n segments are needed to recover the file
Up to n-k machines can fail, be compromised, or malicious without data loss
n and k are configurable, and can be chosen to achieve desired availability
The expansion factor of the data is n/k (with the default 3-of-10 encoding, 10/3, or about 3.3)
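The availability math behind these bullets can be sketched in a few lines. This is a toy calculation assuming independent node failures; the helper name is mine, not part of Tahoe:

```python
from math import comb

def survival_probability(n, k, p_alive):
    """P(at least k of n independently stored shares survive),
    i.e. the probability the file remains recoverable."""
    return sum(comb(n, j) * p_alive**j * (1 - p_alive)**(n - j)
               for j in range(k, n + 1))

n, k = 10, 3                  # Tahoe's default 3-of-10 encoding
expansion = n / k             # storage overhead: ~3.33x
tolerated = n - k             # up to 7 shares can fail without data loss
print(expansion, tolerated, survival_probability(n, k, 0.9))
```

Even with each storage node only 90% available, the file survives with probability better than 0.999999, which is why n and k can be tuned to hit a desired availability target at a chosen storage cost.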
Flexible Access Control
[Diagram: a File and a Dir, each with a ReadCap and a WriteCap]
Each file has a Read Capability and a Write Capability
These are decryption keys
Directories have capabilities too
Flexible Access Control
[Diagram: a directory ReadCap grants access to the files and subdirectories linked beneath it]
Access to a subset of files can be granted by:
– creating a directory
– attaching files to it
– sharing the read or write capability of the directory
Any files or directories attached are accessible
Any outside the directory are not
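The sharing pattern above can be illustrated with a toy model. This is my own sketch, not Tahoe's actual API: a capability is modeled as an unguessable token, and possessing the token is the only authority needed to reach the object:

```python
import secrets

class CapStore:
    """Toy capability store: knowing a cap string is the only way to reach an object."""
    def __init__(self):
        self._by_cap = {}

    def put_file(self, content):
        cap = "file-read:" + secrets.token_hex(16)
        self._by_cap[cap] = content
        return cap

    def put_dir(self, entries):
        """entries: name -> cap of an attached file or directory."""
        cap = "dir-read:" + secrets.token_hex(16)
        self._by_cap[cap] = dict(entries)
        return cap

    def read(self, cap):
        return self._by_cap[cap]   # raises KeyError for caps never handed out

store = CapStore()
sales_doc = store.put_file("sales figures")
shared_doc = store.put_file("new product specs")

# Link only the shared file into a new directory and hand out *that* dir's cap:
new_products = store.put_dir({"specs.txt": shared_doc})

# The recipient reaches everything attached under the shared directory...
specs_cap = store.read(new_products)["specs.txt"]
assert store.read(specs_cap) == "new product specs"
# ...but nothing outside it: without sales_doc's cap, the sales file is unreachable.
```

The design point this mirrors is that access control lives in the capability itself, not in an ACL checked by a trusted server: sharing is just sending a string.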
Access Control Example
[Diagram: /Sales and /Testing directories, each linking to its own files]
Each department can access their own files
Access Control Example
[Diagram: /Sales and /Testing directories, plus a shared /New Products directory]
Files that need to be shared can be linked to a new directory, whose read capability is given to both departments
Hadoop Can Use The Following File Systems
HDFS
CloudStore (KFS)
Amazon S3
FTP
Read-only HTTP
Now, Tahoe!
Hadoop File System Integration HowTo
Step 1. Locate your favorite file system's API
Step 2. Subclass FileSystem, found in /src/core/org/apache/hadoop/fs/FileSystem.java
Step 3. Add lines to core-site.xml:
<property>
  <name>fs.lafs.impl</name>
  <value>your.class</value>
</property>
Step 4. Test using your favorite Infrastructure Service Provider
Hadoop Integration: MapReduce
[Diagram: Hadoop MapReduce workers, each with a Tahoe client, connect to the storage servers]
One Tahoe client is run on each machine that serves as a MapReduce Worker
On average, clients communicate with k storage servers
Jobs are limited by aggregate network bandwidth
MapReduce workers are trusted, storage nodes are not
Hadoop-Tahoe Configuration
Step 1. Start Tahoe
Step 2. Create a new directory in Tahoe, note the WriteCap
Step 3. Configure core-site.xml thus:
– fs.lafs.impl: org.apache.hadoop.fs.lafs.LAFS
– lafs.rootcap: $WRITE_CAP
– fs.default.name: lafs://localhost
Step 4. Start MapReduce, but not HDFS
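Written out as a full core-site.xml, the three settings from Step 3 look like this (the property names and values are exactly those from the slide; the surrounding XML is standard Hadoop configuration boilerplate, and $WRITE_CAP stands for the WriteCap noted in Step 2):

```xml
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.lafs.impl</name>
    <value>org.apache.hadoop.fs.lafs.LAFS</value>
  </property>
  <property>
    <name>lafs.rootcap</name>
    <!-- the WriteCap recorded in Step 2 -->
    <value>$WRITE_CAP</value>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>lafs://localhost</value>
  </property>
</configuration>
```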
Deployment Scenario - Large Organization
[Diagram: Sales and Audit each run their own MapReduce workers / Tahoe clients against shared storage servers]
Within a datacenter, departments can run MapReduce jobs on discrete groups of compute nodes
Each MapReduce job accesses a directory containing a subset of files
Results are written back to the storage servers, encrypted
Deployment Scenario - Community
[Diagram: FBI and Homeland Security each run their own MapReduce workers / Tahoe clients against shared community storage servers]
If a community uses a shared data center, different organizations can run discrete MapReduce jobs
Perhaps most importantly, when results are deemed appropriate to share, access can be granted simply by sending a read or write capability
Since the data are all co-located already, no data needs to be moved
Deployment Scenario - Public Cloud Services
[Diagram: storage servers hosted by a cloud service provider; MapReduce workers / Tahoe clients run on-site]
Since storage nodes require no trust, they can be located at a remote location, e.g. within a cloud service provider’s datacenter
MapReduce jobs can be done this way if bandwidth to the datacenter is adequate
Deployment Scenario - Public Cloud Services
[Diagram: both storage servers and MapReduce workers / Tahoe clients hosted by the cloud service provider]
For some users, everything could be run remotely in a service provider’s data center
There are a few caveats and additional precautions in this scenario:
Public Cloud Deployment Considerations
Store configuration files in memory
Encrypt / disable swap
Encrypt spillover
Must trust memory / hypervisor
Trust service provider disks
HDFS and Linux Disk Encryption Drawbacks
At most one key per node - no support for flexible access control
Decryption done at the storage node rather than at the client - still have to trust storage nodes
Tahoe and HDFS - Comparison
Feature           HDFS               Tahoe
Confidentiality   File Permissions   AES Encryption
Integrity         Checksum           Merkle Hash Tree
Availability      Replication       Erasure Coding
Expansion Factor  3x                 3.3x (n/k)
Self-Healing      Automatic          Automatic
Load-Balancing    Automatic          Planned
Mutable Files     No                 Yes
Performance
[Chart: job completion times for RandomWrite and WordCount, HDFS vs. Tahoe]
Tests run on ten nodes
RandomWrite writes 1 GB per node
WordCount done over randomly generated text
Tahoe write speed is 10x slower
Read-intensive jobs are about the same
This is not so bad, since the most common data use case is write-once, read-many
Code
Tahoe available from http://allmydata.org
– Licensed under GPL 2 or TGPPL
Integration code available at http://hadoop-lafs.googlecode.com
– Licensed under Apache 2