What does it mean to virtualize the Hadoop File System? Tom Phelan Chief Architect for BlueData

Post on 14-Dec-2015

TRANSCRIPT

Page 1: What does it mean to virtualize the Hadoop File System? Tom Phelan Chief Architect for BlueData

What does it mean to virtualize the Hadoop File System?

Tom Phelan

Chief Architect for BlueData

Page 2

It is HDFS …

Page 3

Unless it is not

Page 4

Outline

There are questions to be answered …

Three “What”s:
• What is HDFS?
• What does it mean to virtualize HDFS?
• What are the different methods of virtualization?
    Instances
    Advantages and considerations

And a “When”:
• When to choose HDFS storage virtualization?

Page 5

What is HDFS?

Before we can virtualize it, we need to understand what “it” is.

Page 6

HDFS

It is a distributed file system built with NameNodes and DataNodes

http://image.slidesharecdn.com/introtohadoop-javamug-110414122200-phpapp01/95/intro-to-the-hadoop-stack-april-2011-javamug-14-728.jpg?cb=1302793500

Source: David Engfer via slideshare.net

Page 7

HDFS Implementation

hadoop-hdfs.jar
org.apache.hadoop.fs.FileSystem
org.apache.hadoop.hdfs.FileSystem
org.apache.hadoop.hdfs.DistributedFileSystem

Page 8

HDFS Implementation

It is a stack of Java code used by Hadoop applications to access data.

• Hadoop Distributed File System API/Java Class
• Distributed File System Client Protocol at TCP/IP level – “over the wire”

[Diagram labels: YARN, HDFS Implementation]

Page 9

HDFS Layers of Potential Virtualization

• Generic Java classes: Java class org.apache.hadoop.fs.FileSystem
• HDFS over-the-wire protocol: Java class org.apache.hadoop.hdfs.DFSClient
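
The two layers named on this slide can be pictured with a small model. The sketch below is not Hadoop code: it is a plain Python illustration, assuming that an application only talks to an abstract file system interface and that the URI scheme selects the concrete implementation behind it. All names (`FileSystem`, `InMemoryFS`, `REGISTRY`, `get_filesystem`, `copy`) and the `customfs` scheme are hypothetical.

```python
from abc import ABC, abstractmethod
from urllib.parse import urlparse


class FileSystem(ABC):
    """Hypothetical stand-in for the generic file system layer."""

    @abstractmethod
    def read(self, path):
        """Return the bytes stored at path."""

    @abstractmethod
    def write(self, path, data):
        """Store data (bytes) at path."""


class InMemoryFS(FileSystem):
    """Toy implementation; a real one would speak a wire protocol."""

    def __init__(self):
        self._files = {}

    def read(self, path):
        return self._files[path]

    def write(self, path, data):
        self._files[path] = bytes(data)


# Assumed mapping from URI scheme to implementation instance.
REGISTRY = {"hdfs": InMemoryFS(), "customfs": InMemoryFS()}


def get_filesystem(uri):
    """Pick the implementation from the URI scheme."""
    scheme = urlparse(uri).scheme
    if scheme not in REGISTRY:
        raise ValueError("no file system registered for scheme: " + scheme)
    return REGISTRY[scheme]


def copy(src_uri, dst_uri):
    """Application code: it never needs to know which implementation it hit."""
    data = get_filesystem(src_uri).read(urlparse(src_uri).path)
    get_filesystem(dst_uri).write(urlparse(dst_uri).path, data)
```

Because `copy` depends only on the abstract interface, either layer underneath it could be swapped out, which is the sense in which both layers are candidates for virtualization.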

Page 10

HDFS Implementation

[Diagram: one Host running the NameNode and ResourceManager; two Hosts, each running a DataNode, a NodeManager, and an App with HDFS Impl and DFSClient, and each with two Local Disks. Labels: HDFS Implementation, Wire Protocol.]

Page 11

HDFS Virtualization

The virtualization of either the HDFS Implementation or the Protocols

Page 12

Outline

There are questions to be answered …

Three “What”s:
• What is HDFS?
• What does it mean to virtualize HDFS?
• What are the different methods of virtualization?
    Instances
    Advantages and considerations

And a “When”:
• When to choose HDFS storage virtualization?

Page 13

HDFS Virtualization Methods

• Virtualize the HDFS Implementation
• Implement one of the Hadoop Compatible File System (HCFS) Protocols
    • Implement a HCFS via the over-the-wire protocol (hdfs.DFSClient)
    • Implement a HCFS via the FileSystem protocol (fs.FileSystem)

Page 14

Virtualize the HDFS Implementation

This is the only method of HDFS virtualization that requires Hadoop compute virtualization.

Simple: install a Hadoop distribution into a cluster of virtualized compute nodes and run the HDFS services in that cluster, storing data on virtual disks (vdisks/VMDKs).

Instances of this type of HDFS virtualization include:
• VMware BDE
• OpenStack Sahara
• Cloudera Director
• Hortonworks Cloudbreak

Page 15

Virtualize the HDFS Implementation

[Diagram: a NameNode/ResourceManager node and two DataNode/NodeManager nodes, each worker with an App, HDFS Impl, DFSClient, and two Local Disks. The diagram carries three HOST labels and three VM labels.]

Page 16

Virtualize the HDFS Implementation

Advantages:
• Simple
• No new Java code
• Compute/data locality

Considerations:
• Requires data ingest time
• The clusters become stateful

Page 17

HDFS Virtualization Methods

• Virtualize the HDFS Implementation
• Implement a Hadoop Compatible File System – HCFS
    • Implement a HCFS via the over-the-wire protocol (hdfs.DFSClient)
    • Implement a HCFS via the FileSystem protocol (fs.FileSystem)

Page 18

Implement a HCFS via the over-the-wire protocol

Use the unmodified hadoop-hdfs jar

fs.defaultFS hdfs://1.2.3.4:8020/path

Instance:
• EMC Isilon
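
The `fs.defaultFS` value on this slide points unmodified Hadoop clients at an endpoint that speaks the HDFS wire protocol. As a rough illustration in plain Python (not Hadoop code), the URI from the slide breaks down as follows; the helper name `describe_default_fs` is hypothetical.

```python
from urllib.parse import urlparse


def describe_default_fs(uri):
    """Split a default file system URI into the parts a client would use."""
    parts = urlparse(uri)
    return {
        "scheme": parts.scheme,  # selects the client implementation
        "host": parts.hostname,  # endpoint that answers the wire protocol
        "port": parts.port,      # TCP port of that endpoint
        "path": parts.path,      # path within the file system
    }


info = describe_default_fs("hdfs://1.2.3.4:8020/path")
print(info)
```

Only the host and port identify the storage back end, so the client cannot tell whether a NameNode or another product that implements the same protocol is answering.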

Page 19

Implement a HCFS via the over-the-wire protocol

[Diagram: one Host running the NameNode and ResourceManager; two Hosts, each running a DataNode, a NodeManager, and an App with HDFS Impl and DFSClient, and each with two Local Disks; plus a Storage Service with two Local Disks.]

Page 20

Implement a HCFS via the over-the-wire protocol

Advantages:
• Multi-protocol
• No new Java code
• Enterprise storage services

Considerations:
• Open source / proprietary
• No compute / data locality

Page 21

HDFS Virtualization Methods

• Virtualize the HDFS Implementation
• Implement a Hadoop Compatible File System – HCFS
    • Implement a HCFS via the over-the-wire protocol (hdfs.DFSClient)
    • Implement a HCFS via the FileSystem protocol (fs.FileSystem)

Page 22

Implement a HCFS via the FileSystem Java classes

Write the Java code that implements the class, build a jar file, put the jar file in the YARN services class path, and edit the core-site.xml file.

Instances:
• S3 and S3a/S3n – org.apache.hadoop.fs.FileSystem
  https://github.com/Aloisius/hadoop-s3a
• GlusterFS – org.apache.hadoop.fs.FilterFileSystem
  https://github.com/gluster/glusterfs-hadoop
• Tachyon – org.apache.hadoop.fs.FileSystem
  https://github.com/amplab/tachyon
• Apache Ignite – org.apache.hadoop.fs.AbstractFileSystem
  https://github.com/apache/ignite
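
The last step on this slide, editing core-site.xml, usually comes down to mapping a URI scheme to the new class through a `fs.<scheme>.impl` property. The sketch below generates such a file with Python's standard library. The scheme `customfs`, the class name `com.example.CustomFileSystem`, and the host name are placeholders invented for illustration, and making the new file system the default is optional.

```python
import xml.etree.ElementTree as ET


def build_core_site(properties):
    """Render a dict of settings in the core-site.xml property layout."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


core_site = build_core_site({
    # Hypothetical class packaged in the jar placed on the class path.
    "fs.customfs.impl": "com.example.CustomFileSystem",
    # Optional: make the new file system the default one.
    "fs.defaultFS": "customfs://storage-host/",
})
print(core_site)
```

With a mapping like this in place, applications can address the storage as `customfs://...` paths without any change to their own code.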

Page 23

Implement a HCFS via the FileSystem Java classes

[Diagram: one Host running the NameNode and ResourceManager; two Hosts, each running a DataNode, a NodeManager, and an App with HDFS Impl and DFSClient, and each with two Local Disks. Additional labels: CustomFS Impl (twice), Storage Service (three times).]

Page 24

Implement a HCFS via the FileSystem Java classes

[Diagram: one Host running the NameNode; two Hosts, each running a DataNode, a NodeManager, and an App with HDFS Impl and DFSClient; six Local Disk labels in total. Additional labels: CustomFS Impl (twice), Storage Service (three times), ResourceManager.]

Page 25

Implement a HCFS via the FileSystem Java classes

Advantages:
• Open source / proprietary
• Multiple file access protocols supported

Considerations:
• These are file systems
• New Java code
• Possibly no compute / data locality
• May lag latest HDFS feature set

Page 26

HDFS Virtualization

Is there another way?

Page 27

HDFS Virtualization

• Virtualize the HDFS Implementation
• Implement a Hadoop Compatible File System – HCFS
    • Implement a HCFS via the over-the-wire protocol
    • Implement a HCFS via the FileSystem Java classes
• Virtualize the Hadoop Compatible File System Protocol

Page 28

Virtualize the Hadoop Compatible File System Protocol

Instance:
• BlueData EPIC software – org.apache.hadoop.fs.FileSystem

Translate the Hadoop file system calls into native calls to the back-end file systems

Insert an intelligent caching layer
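
The translation idea on this slide can be sketched in a few lines. This is not BlueData's implementation: it is a toy Python model, assuming two hypothetical back ends with different native APIs and a path prefix (`/objects/`) that decides which one serves a call. Every class and method name here is invented for illustration.

```python
class ObjectStoreBackend:
    """Hypothetical object store: flat keys, put/get style native calls."""

    def __init__(self):
        self.objects = {}

    def put_object(self, key, body):
        self.objects[key] = bytes(body)

    def get_object(self, key):
        return self.objects[key]


class PosixLikeBackend:
    """Hypothetical POSIX-style store: absolute paths, read/write calls."""

    def __init__(self):
        self.files = {}

    def write_file(self, path, data):
        self.files[path] = bytes(data)

    def read_file(self, path):
        return self.files[path]


class TranslatingLayer:
    """Accepts file system style calls and forwards them as native calls."""

    OBJECT_PREFIX = "/objects/"  # assumed mount point for the object store

    def __init__(self, object_store, posix_store):
        self.object_store = object_store
        self.posix_store = posix_store

    def create(self, path, data):
        if path.startswith(self.OBJECT_PREFIX):
            # Strip the mount prefix to obtain the native object key.
            self.object_store.put_object(path[len(self.OBJECT_PREFIX):], data)
        else:
            self.posix_store.write_file(path, data)

    def open(self, path):
        if path.startswith(self.OBJECT_PREFIX):
            return self.object_store.get_object(path[len(self.OBJECT_PREFIX):])
        return self.posix_store.read_file(path)
```

The application sees one uniform set of calls, while each back end keeps its own naming and API. A caching layer would sit inside `TranslatingLayer`, between the incoming call and the native one.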

Page 29

Virtualize the Hadoop Compatible File System Protocol

[Diagram: one Host running the NameNode and ResourceManager; two Hosts, each running a DataNode, a NodeManager, and an App with HDFS Impl and DFSClient, and each with two Local Disks. Additional labels: DTAP Impl (twice), DTAP Service (twice), and a Host running a Storage Service with two Local Disks.]

Page 30

HDFS mem cache

[Diagram labels: Page Cache, HDFS Implementation, DFSClient, DataNode, page]

Application is cache aware

Page 31

Extend mem cache to any File System or Object storage

[Diagram labels: Page Cache, DTAP FileSystem Implementation, DTAP Service, page, HDFS, GlusterFS, Object Store]

Application is cache unaware
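
A cache that the application does not know about can be modeled as a layer with the same read and write calls as the store behind it. The sketch below is a generic illustration of read-ahead and write-back, not the DTAP implementation; the tiny page size, the class names, and the fetch counters are all assumptions made for the example.

```python
PAGE_SIZE = 4  # deliberately tiny so the behavior is easy to follow


class SlowStore:
    """Hypothetical back end that counts page reads and writes."""

    def __init__(self, data=b""):
        self.data = bytearray(data)
        self.reads = 0
        self.writes = 0

    def read_page(self, index):
        self.reads += 1
        start = index * PAGE_SIZE
        return bytes(self.data[start:start + PAGE_SIZE])

    def write_page(self, index, page):
        self.writes += 1
        start = index * PAGE_SIZE
        end = start + len(page)
        if len(self.data) < end:
            # Grow the store with zero bytes when writing past its end.
            self.data.extend(b"\0" * (end - len(self.data)))
        self.data[start:end] = page


class CachingLayer:
    """Same calls as the store, with read-ahead and write-back inside."""

    def __init__(self, store, read_ahead=1):
        self.store = store
        self.read_ahead = read_ahead
        self.pages = {}
        self.dirty = set()

    def read_page(self, index):
        if index not in self.pages:
            # Miss: fetch the requested page plus the next read_ahead pages.
            for i in range(index, index + 1 + self.read_ahead):
                if i not in self.pages:
                    self.pages[i] = self.store.read_page(i)
        return self.pages[index]

    def write_page(self, index, page):
        # Write-back: keep the page in memory and mark it dirty.
        self.pages[index] = bytes(page)
        self.dirty.add(index)

    def flush(self):
        # Push all dirty pages to the store in page order.
        for index in sorted(self.dirty):
            self.store.write_page(index, self.pages[index])
        self.dirty.clear()
```

Because `CachingLayer` exposes the same `read_page` and `write_page` calls as `SlowStore`, the caller needs no changes to benefit from it, which is what "cache unaware" means here.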

Page 32

Virtualize the Hadoop Compatible File System Protocol

Advantages:
• Not a file system
• Transparent in-memory cache (write back, read ahead)
• Supports multiple protocols
• Supports compute / data locality

Considerations:
• New Java code
• Open source / proprietary
• May lag latest HDFS feature set

Page 33

Let’s Review

Page 34

Outline

There are questions to be answered …

Three “What”s:
• What is HDFS?
• What does it mean to virtualize HDFS?
• What are the different methods of virtualization?
    Instances
    Advantages and considerations

And a “When”:
• When to choose HDFS storage virtualization?

Page 35

A Few Words about Performance

Performance measurements are an art as well as a science

• Bottlenecks in applications
• Bottlenecks in infrastructure: network, CPU, disk
• Configuration is key: block size, distro, security
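
As one small example of why configuration matters, block size determines how many blocks a file is split into, and that in turn typically influences how many tasks process it. The arithmetic below is a simplified sketch: the function name is hypothetical, and it ignores replication and any per-job split settings.

```python
MIB = 1024 * 1024


def block_count(file_size_bytes, block_size_bytes):
    """Number of blocks needed to hold a file (ceiling division)."""
    if block_size_bytes <= 0:
        raise ValueError("block size must be positive")
    return (file_size_bytes + block_size_bytes - 1) // block_size_bytes


one_gib = 1024 * MIB
print(block_count(one_gib, 128 * MIB))  # 8 blocks
print(block_count(one_gib, 64 * MIB))   # 16 blocks
```

Halving the block size doubles the block count for the same file, so two benchmark runs with different block sizes are not directly comparable.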

Page 36

Virtualize the HDFS Implementation

Source of graph: VMware Technical Paper – Virtualized Hadoop Performance with VMware vSphere 6 on High Performance Servers

Performance – VMware BDE

Page 37

Performance – Isilon

Source of graph: Stefan Radtke blog post

http://stefanradtke.blogspot.com/2015/05/comparing-hadoop-performance-on-das-and.html

Implement a HCFS via the over-the-wire protocol

Page 38

Performance – Tachyon

Source of graph: Haoyuan Li

Implement a HCFS via the FileSystem Java classes

https://spark-summit.org/2014/wp-content/uploads/2014/07/Tachyon-Further-Improve-Sparks-Performance-Haoyuan-Li.pdf

Page 39

Performance – BlueData

Source of Graph: BlueData customer proof-of-concept results

Virtualize the Hadoop Compatible File System Protocol

Page 40

Virtualized HDFS solutions provide good performance

Even with remote storage

Even in virtualized environments

Page 41

When it comes to Hadoop storage virtualization, speed is not the whole story

Other factors to consider when implementing a virtualized HDFS option:

•Use of a virtualized compute environment

•Open source / proprietary solution

•Required Hadoop File System features

•Lifespan of Hadoop cluster

Page 42

When it comes to Hadoop storage virtualization, speed is not the whole story

Other factors to consider when selecting storage:

• Data accessibility
    Hadoop File System protocol
    NFS, object store, other protocols

• Enterprise storage services
    data protection
    geographical replication
    offline backup

Page 43

Consider a Virtualized HDFS Solution

When any of the following are true:

•Hadoop and non-Hadoop applications are required to access the same data

Do not want to replicate the data

•Enterprise storage data services required

•Need to run Hadoop in a virtual compute environment

Page 44

Hadoop File System

Volume, Velocity, Variety

Virtualization

Page 45

Q & A

twitter: @tapbluedata

email: [email protected]

www.bluedata.com

Visit our booth in the Expo