cap2956-inside the hadoop machine_final_us.pdf

7/27/2019 CAP2956-Inside the Hadoop Machine_Final_US.pdf


Inside the Hadoop Machine

Jeff Buell, VMware, Inc.

Richard McDougall, VMware, Inc.

Sanjay Radia, Hortonworks

APP-CAP2956

#vmworldapps



Disclaimer

This session may contain product features that are currently under development.

This session/overview of the new technology represents no commitment from VMware to deliver these features in any generally available product.

Features are subject to change, and must not be included in contracts, purchase orders, or sales agreements of any kind.

Technical feasibility and market demand will affect final delivery.

Pricing and packaging for any new technologies or features discussed or presented have not been determined.



Broad Application of Hadoop technology

Horizontal Use Cases
- Log Processing / ClickStream Analytics
- Machine Learning / sophisticated data mining
- Web crawling / text processing
- Extract Transform Load (ETL) replacement
- Image / XML message processing
- General archiving / compliance

Vertical Use Cases
- Financial Services
- Mobile / Telecom
- Internet Retailer
- Scientific Research
- Pharmaceutical / Drug Discovery
- Social Media

Hadoop's ability to handle large unstructured data affordably and efficiently makes it a valuable tool kit for enterprises across a number of applications and fields.



How does Hadoop enable parallel processing?

Source: http://architects.dzone.com/articles/how-hadoop-mapreduce-works

A framework for distributed processing of large data sets across clusters of computers using a simple programming model.
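As a rough illustration of that programming model, the map, shuffle, and reduce phases can be sketched in plain Python on a single machine. This is not Hadoop API code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the word-count task are assumptions chosen for illustration.

```python
from collections import defaultdict


def map_phase(record):
    """Emit (word, 1) pairs for one input record (a line of text)."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


def word_count(records):
    # On a cluster each map call could run on a different node, close to
    # its input block; here the calls simply run one after another.
    mapped = [pair for record in records for pair in map_phase(record)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data big cluster", "data locality"]))
```

The point of the split is that `map_phase` needs only its own record and `reduce_phase` needs only the values for its own key, so both can be spread across many machines.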



Hadoop System Architecture

MapReduce: Programming framework for highly parallel data processing

Hadoop Distributed File System (HDFS): Distributed data storage



Hadoop Map-Reduce Framework (Runtime Layer)

Source: http://architects.dzone.com/articles/how-hadoop-mapreduce-works



Hadoop Distributed File System



Hadoop Data Locality and Replication



Hadoop Virtualization Extensions: Topology Awareness



Why Virtualize Hadoop?

Elastic Scaling
- Shrink and expand cluster on demand
- Resource Guarantee
- Independent scaling of Compute and data

Highly Available
- No more single point of failure
- One click to setup
- High availability for MR Jobs

Simple to Operate
- Rapid deployment
- Unified operations across enterprise
- Easy Clone of Cluster



Enterprise Challenges with Using Hadoop

Deployment
- Slow to provision
- Complex to keep running/tune

Single Points of Failure
- Single point of failure with Name Node and Job Tracker
- No HA for Hadoop Framework Components (Hive, HCatalog, etc.)

Low Utilization
- Dedicated clusters to run Hadoop with low CPU utilization
- No easy way to share resources between Hadoop and non-Hadoop workloads
- Noisy neighbor, lack of resource containment

Need Multi-tenant Isolation, Resource Management, etc.
- Noisy Neighbor - no performance or security isolation between different tenants/users
- Lack of configuration isolation - Can't run multiple versions on the cluster



Virtualization enables a Common Infrastructure for Big Data

Single purpose clusters for various business applications lead to cluster sprawl.

Cluster Sprawling: MPP DB, Hadoop, HBase as separate clusters

Cluster Consolidation: MPP DB, Hadoop, HBase on one Virtualization Platform

Simplify
- Single Hardware Infrastructure
- Unified operations

Optimize
- Shared Resources = higher utilization
- Elastic resources = faster on-demand access



Deploy a Hadoop Cluster in under 30 Minutes

Deploy vHelper OVF to vSphere

Select configuration template

Automate deployment

Select Compute, memory, storage and network

Done

Step 1: Deploy Serengeti virtual appliance on vSphere.

Step 2: A few simple commands to stand up Hadoop Cluster.



A Tour Through Serengeti

$ ssh serengeti@serengeti-vm

$ serengeti

serengeti>




serengeti> cluster create --name myElephant

serengeti> cluster list --name myElephant

name: myElephant, distro: cdh, status:RUNNING

NAME ROLES INSTANCE CPU MEM(MB) TYPE

---------------------------------------------------------------------------

master [hadoop_NameNode, hadoop_jobtracker] 1 2 7500 LOCAL 50

name: myElephant, distro: cdh, status:RUNNING

NAME ROLES INSTANCE CPU MEM(MB) TYPE

---------------------------------------------------------------------------

master [hive, hadoop_client, pig] 1 1 3700 LOCAL 50

NAME HOST IP

-----------------------------------------------------------------

myElephant-client0 rmc-elephant-009.eng.vmware.com 10.0.20.184




$ ssh [email protected]

$ hadoop jar hadoop-examples.jar teragen 1000000000 tera-data
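For sizing, it helps to remember that the first TeraGen argument is a row count, not a byte count, and that TeraGen writes fixed 100-byte rows. The short Python sketch below converts between the two; the helper names are illustrative only.

```python
ROW_BYTES = 100  # TeraGen writes fixed-size 100-byte rows


def teragen_bytes(rows):
    """Size in bytes of the data set generated for a given row count."""
    return rows * ROW_BYTES


def rows_for_bytes(target_bytes):
    """Row count to pass to TeraGen for a target data set size."""
    return target_bytes // ROW_BYTES


if __name__ == "__main__":
    rows = 1_000_000_000
    print(f"{rows} rows -> {teragen_bytes(rows) / 1e9:.0f} GB")
    print(f"1 TB (10^12 bytes) needs {rows_for_bytes(10**12)} rows")
```

Under that convention the command above generates about 100 GB, and a 1 TB data set would need ten times as many rows.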



Serengeti Spec File

[
  "distro": "apache",              <- Choice of Distro
  {
    "name": "master",
    "roles": [
      "hadoop_NameNode",
      "hadoop_jobtracker"
    ],
    "instanceNum": 1,
    "instanceType": "MEDIUM",
    "ha": true                     <- HA Option
  },
  {
    "name": "worker",
    "roles": [
      "hadoop_datanode", "hadoop_tasktracker"
    ],
    "instanceNum": 5,
    "instanceType": "SMALL",
    "storage": {                   <- Choice of Shared Storage or Local Disk
      "type": "LOCAL",
      "sizeGB": 10
    }
  }
]
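To make the shape of the spec concrete, the node groups from the slide can be transcribed into Python and checked before use. This is only a sketch: holding the groups in a plain list, the `REQUIRED_KEYS` set, and the helper names are assumptions for illustration, not part of Serengeti.

```python
# Node groups transcribed from the slide. The list wrapper and the helper
# functions below are hypothetical, not Serengeti's own schema or API.
node_groups = [
    {
        "name": "master",
        "roles": ["hadoop_NameNode", "hadoop_jobtracker"],
        "instanceNum": 1,
        "instanceType": "MEDIUM",
        "ha": True,
    },
    {
        "name": "worker",
        "roles": ["hadoop_datanode", "hadoop_tasktracker"],
        "instanceNum": 5,
        "instanceType": "SMALL",
        "storage": {"type": "LOCAL", "sizeGB": 10},
    },
]

REQUIRED_KEYS = {"name", "roles", "instanceNum", "instanceType"}


def validate(groups):
    """Return a list of problems; an empty list means the groups look sane."""
    problems = []
    for group in groups:
        missing = REQUIRED_KEYS - set(group)
        if missing:
            problems.append(f"{group.get('name', '?')}: missing {sorted(missing)}")
        elif group["instanceNum"] < 1:
            problems.append(f"{group['name']}: instanceNum must be >= 1")
    return problems


def total_vms(groups):
    """Number of VMs the spec asks for across all node groups."""
    return sum(group["instanceNum"] for group in groups)


def total_storage_gb(groups):
    """Requested storage summed over every instance that declares it."""
    return sum(
        group["instanceNum"] * group.get("storage", {}).get("sizeGB", 0)
        for group in groups
    )


if __name__ == "__main__":
    print(validate(node_groups), total_vms(node_groups), total_storage_gb(node_groups))
```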



Configuring Distros

{
  "name" : "cdh",
  "version" : "3u3",
  "packages" : [
    {
      "roles" : ["hadoop_NameNode", "hadoop_jobtracker",
                 "hadoop_tasktracker", "hadoop_datanode",
                 "hadoop_client"],
      "tarball" : "cdh/3u3/hadoop-0.20.2-cdh3u3.tar.gz"
    },
    {
      "roles" : ["hive"],
      "tarball" : "cdh/3u3/hive-0.7.1-cdh3u3.tar.gz"
    },
    {
      "roles" : ["pig"],
      "tarball" : "cdh/3u3/pig-0.8.1-cdh3u3.tar.gz"
    }
  ]
},
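The distro entry is essentially a mapping from roles to the tarball that provides them. A minimal Python sketch of that lookup follows; the `tarballs_for` helper is hypothetical and only illustrates how the entry can be read, not how Serengeti resolves packages internally.

```python
# Distro entry transcribed from the slide.
distro = {
    "name": "cdh",
    "version": "3u3",
    "packages": [
        {
            "roles": ["hadoop_NameNode", "hadoop_jobtracker",
                      "hadoop_tasktracker", "hadoop_datanode",
                      "hadoop_client"],
            "tarball": "cdh/3u3/hadoop-0.20.2-cdh3u3.tar.gz",
        },
        {"roles": ["hive"], "tarball": "cdh/3u3/hive-0.7.1-cdh3u3.tar.gz"},
        {"roles": ["pig"], "tarball": "cdh/3u3/pig-0.8.1-cdh3u3.tar.gz"},
    ],
}


def tarballs_for(distro_entry, roles):
    """Return the distinct tarballs needed for the given roles, in order."""
    by_role = {
        role: package["tarball"]
        for package in distro_entry["packages"]
        for role in package["roles"]
    }
    needed = []
    for role in roles:
        tarball = by_role[role]  # KeyError means the distro lacks this role
        if tarball not in needed:
            needed.append(tarball)
    return needed


if __name__ == "__main__":
    print(tarballs_for(distro, ["hive", "hadoop_client", "pig"]))
```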



Open Source of Serengeti, Spring Hadoop, Hadoop Extensions

Community Projects

Commercial Vendors

Support major distribution and multiple projects

Contribute Hadoop Virtualization Extension (HVE) to Open Source Community



Use Local Disk where it's Needed

SAN Storage
- $2 - $10/Gigabyte
- $1M gets: 0.5 Petabytes, 200,000 IOPS, 8 Gbyte/sec

NAS Filers
- $1 - $5/Gigabyte
- $1M gets: 1 Petabyte, 200,000 IOPS, 10 Gbyte/sec

Local Storage
- $0.05/Gigabyte
- $1M gets: 10 Petabytes, 400,000 IOPS, 250 Gbytes/sec



Extend Virtual Storage Architecture to Include Local Disk

Shared Storage: SAN or NAS
- Easy to provision
- Automated cluster rebalancing

Hybrid Storage
- SAN for boot images, VMs, other workloads
- Local disk for Hadoop & HDFS
- Scalable Bandwidth, Lower Cost/GB

[Diagram: hosts each running Hadoop VMs alongside other VMs, shown for both storage layouts]



Virtualized Hadoop Performance

Issues of interest

Native vs various virtual configurations

Local disks vs Fibre Channel SAN

Effect of protecting Hadoop master daemons with Fault Tolerance

Public cloud (renting) vs private cloud (buying)

24x HP DL380 G7: 2x X5687, 72 GB, 16x SAS 146 GB

Broadcom 10 GbE adapter

Qlogic 8 Gb/s HBA

Arista 7124SX 10 GbE switch

EMC VNX7500



Configuration

Software

vSphere 5.0 U1 (storage tests), 5.1 (Native/Virtual, FT)

RHEL 6.1 x86_64

Cloudera CDH3u4

Hadoop applications: TeraGen, TeraSort, TeraValidate (1 TB)

Hadoop VMs

Processors (16 logical threads), memory (72 GB), disks (12) partitioned among 1, 2, or 4 VMs per host

Separate VMs for NameNode and JobTracker for storage and FT tests

Hadoop configuration

One map and one reduce task per vCPU (= logical thread): machines are highly loaded

256 MB block size

FT tests: 8 to 256 MB block sizes to vary load on NN and JT
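The configuration above implies concrete numbers for task slots and HDFS blocks, which a few lines of Python can work out. The sketch assumes the full 24-host cluster, reads "1 TB" as 10^12 bytes, and reads block sizes as multiples of 2^20 bytes; those readings and the helper names are assumptions, not figures stated on the slide.

```python
HOSTS = 24                     # hosts in the test cluster
LOGICAL_THREADS_PER_HOST = 16  # vCPUs available per host
DATASET_BYTES = 10**12         # assumption: "1 TB" means 10^12 bytes
MIB = 2**20                    # assumption: block sizes are in units of 2^20 bytes


def task_slots(hosts=HOSTS, threads=LOGICAL_THREADS_PER_HOST):
    """One map slot and one reduce slot per vCPU (logical thread)."""
    total = hosts * threads
    return {"map": total, "reduce": total}


def block_count(block_mb, dataset_bytes=DATASET_BYTES):
    """Number of HDFS blocks in the data set, rounded up (integer ceiling)."""
    block_bytes = block_mb * MIB
    return -(-dataset_bytes // block_bytes)


if __name__ == "__main__":
    print(task_slots())
    for size_mb in (256, 64, 16, 8):
        print(f"{size_mb} MB blocks -> {block_count(size_mb)} blocks")
```

Shrinking the block size from 256 MB to 8 MB multiplies the number of blocks, and so the bookkeeping done by the NameNode and JobTracker, by roughly 32.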



Native versus Virtual Platforms, 24 hosts, 12 disks/host

[Bar chart: elapsed time in seconds (lower is better) for TeraGen, TeraSort, and TeraValidate; series: Native, 1 VM, 2 VMs, 4 VMs; y-axis from 0 to 450]



Local vs Various SAN Storage Configurations

[Bar chart: elapsed time ratio to local disks (lower is better) for TeraGen, TeraSort, and TeraValidate; series: Local disks, SAN JBOD, SAN RAID-0 with 16 KB page size, SAN RAID-0, SAN RAID-5; y-axis from 0 to 4.5]

16x HP DL380 G7, EMC VNX 7500, 96 physical disks



Performance Effect of FT for Master Daemons

NameNode and JobTracker placed in separate UP VMs

Small overhead: Enabling FT causes 2-4% slowdown for TeraSort

8 MB case places similar load on NN & JT as >200 hosts with 256 MB

[Chart: TeraSort elapsed time ratio to FT off versus HDFS block size (256, 64, 16, 8 MB); y-axis from 1 to 1.04]



Different Clouds for Different Folks

Yahoo! Hadoop 2009: Classic benchmark test, 1460 hosts

Google/MapR: SaaS on Google Compute Engine

vSphere 5.1: 24 host cluster, 2 VMs/host, 8 or 12 disks/host, CDH3u4

Vastly different cluster sizes

Compare throughput (MB sorted per second) normalized with resources

Cost: rental or estimate of running continuously for 3 years

              #cores  #disks  TeraSort, s  MB/s/core  MB/s/disk  cost
Yahoo!         11680    5840           62        1.3        2.6   ~$7
Google/MapR     5024    1256           80        2.4        9.5   $16
vSphere 5.1      192     192          442       11.2       11.2   ~$2
vSphere 5.1      192     288          359       13.8        9.2   ~$2
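The normalized columns can be recomputed from the core counts, disk counts, and TeraSort times. The Python sketch below does so under two assumptions that are not stated on the slide: each run sorted 10^12 bytes, and "MB" means 2^20 bytes. The labels distinguishing the two vSphere rows are inferred from the disk counts (24 hosts with 8 or 12 disks each).

```python
# Assumptions: 10^12 bytes sorted per run, and 1 MB = 2^20 bytes.
DATASET_MB = 10**12 / 2**20

# (label, cores, disks, TeraSort seconds) taken from the table.
RUNS = [
    ("Yahoo!", 11680, 5840, 62),
    ("Google/MapR", 5024, 1256, 80),
    ("vSphere 5.1, 8 disks/host", 192, 192, 442),
    ("vSphere 5.1, 12 disks/host", 192, 288, 359),
]


def normalized(cores, disks, seconds):
    """Return (MB/s per core, MB/s per disk), rounded to one decimal."""
    throughput = DATASET_MB / seconds  # MB sorted per second, whole cluster
    return round(throughput / cores, 1), round(throughput / disks, 1)


if __name__ == "__main__":
    for label, cores, disks, seconds in RUNS:
        per_core, per_disk = normalized(cores, disks, seconds)
        print(f"{label:28s} {per_core:5.1f} MB/s/core {per_disk:5.1f} MB/s/disk")
```

Dividing by cores or disks is what makes clusters of very different sizes comparable: the small vSphere cluster takes longer in absolute terms but gets more sorted data per second out of each core.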




VMware-Hortonworks Joint Engineering

Hortonworks goal
- Expand Hadoop ecosystem
- Provide first class support of various platforms

Hadoop should run well on VMs
- VMs offer several advantages as presented earlier
- Take advantage of vSphere for HA

First class support for VMs
- Topology plugins (HADOOP-8468)
  - 2 VMs can be on same host
  - Pick closer data
  - Schedule tasks closer
  - Don't put two replicas on same host
- MR-tmp on HDFS using block pools
  - Elastic Compute-VMs will not need local disk
- Fast communications within VMs
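The topology rules listed above can be illustrated with a small Python sketch. It is a hypothetical model, not the Hadoop Virtualization Extensions code: the `vm_to_host` mapping, both function names, and the three-level distance are assumptions, and rack awareness is left out entirely.

```python
def pick_replica_nodes(vm_to_host, writer_vm, replicas=3):
    """Pick VMs for block replicas so that no two share a physical host.

    vm_to_host maps a VM name to the physical host it runs on. The
    writer's own VM takes the first replica (closest data); the rest go
    to VMs on hosts that do not yet hold a replica.
    """
    chosen = [writer_vm]
    used_hosts = {vm_to_host[writer_vm]}
    for vm in sorted(vm_to_host):
        if len(chosen) == replicas:
            break
        host = vm_to_host[vm]
        if host not in used_hosts:
            chosen.append(vm)
            used_hosts.add(host)
    if len(chosen) < replicas:
        raise ValueError("not enough distinct physical hosts for the replicas")
    return chosen


def locality_rank(vm_to_host, reader_vm, replica_vm):
    """0 = same VM, 1 = same physical host, 2 = elsewhere (lower is closer)."""
    if reader_vm == replica_vm:
        return 0
    if vm_to_host[reader_vm] == vm_to_host[replica_vm]:
        return 1
    return 2


if __name__ == "__main__":
    hosts = {"vm1": "hostA", "vm2": "hostA", "vm3": "hostB", "vm4": "hostC"}
    print(pick_replica_nodes(hosts, "vm1"))
    print(locality_rank(hosts, "vm1", "vm2"))
```

Without the host level, two VMs on the same physical machine look like independent nodes, so a placement policy could put two replicas behind one point of failure and a scheduler would miss the cheap same-host read.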


Hadoop Full-Stack High Availability

[Diagram labels: HA Cluster for Master Daemons (three servers hosting NN and JT, failover, N+K failover); Apps Running Outside; JT into Safemode; jobs; Slave Nodes of Hadoop Cluster]



HA is in HDP 1.0

Using Total System Availability Architecture


HA in Hadoop 1 with HDP1

Full Stack High Availability
- Namenode
  - Clients pause automatically
  - JobTracker pauses automatically
- Other Hadoop master services (JT, ...) coming

Use industry proven HA framework
- VMware vSphere-HA
- Failover, fencing, ...
- Corner cases are tricky; if not addressed, corruption

Additional benefits:
- N-N & N+K failover
- Migration for maintenance


Hadoop NN/JT HA with vSphere


Namenode Failover Times

60 Nodes, 60K files, 6 million blocks, 300 TB raw storage: 1-3.5 minutes
- Failure detection and Failover: 0.5 to 2 minutes
- Namenode Startup (exit safemode): 30 sec

180 Nodes, 200K files, 18 million blocks, 900 TB raw storage: 2-4.5 minutes
- Failure detection and Failover: 0.5 to 2 minutes
- Namenode Startup (exit safemode): 110 sec

For vSphere, OS bootup is needed: 10-20 seconds, included above.

Cold Failover is good enough for small/medium clusters

Failure Detection and Automatic Failover Dominates


Summary

Advantages of Hadoop on VMs
- Cluster Management
- Cluster consolidation
- Greater Elasticity in mixed environment
- Alternative multi-tenancy to the capacity scheduler's offerings

HA for Hadoop Master Daemons
- vSphere-based HA for NN, JT, ... in Hadoop 1
- Total System Availability Architecture




Elastic Scaling and Multi-tenancy of Hadoop on vSphere

1. Hadoop in VM
- Single Tenant
- Fixed Resources

2. Separate Compute and Data
- Single Tenant
- Elastic Compute

3. Multiple Clusters
- Multiple Tenants
- Elastic Compute

[Diagram: current Hadoop with combined storage/compute in one VM; compute VMs separated from storage; compute VMs for tenants T1 and T2 sharing storage]


Separated Compute and Data

[Diagram: a virtualization host running Virtual Hadoop Nodes (Task Trackers with slots, a Datanode, VMDKs) alongside another workload]

Truly Elastic Hadoop: Scalable through virtual nodes


References

www.projectserengeti.org

www.hortonworks.com

www.cloudera.com

Fault Tolerance performance whitepaper: www.vmware.com/resources/techresources/10301

MapR/Google blog: www.mapr.com/blog/google-mapr



