cap2956-inside the hadoop machine_final_us.pdf
TRANSCRIPT
-
7/27/2019 CAP2956-Inside the Hadoop Machine_Final_US.pdf
1/41
Inside the Hadoop Machine
Jeff Buell, VMware, Inc.
Richard McDougall, VMware, Inc.
Sanjay Radia, Hortonworks
APP-CAP2956
#vmworldapps
-
Disclaimer
This session may contain product features that are
currently under development.
This session/overview of the new technology represents
no commitment from VMware to deliver these features in
any generally available product.
Features are subject to change, and must not be included in
contracts, purchase orders, or sales agreements of any kind.
Technical feasibility and market demand will affect final delivery.
Pricing and packaging for any new technologies or features
discussed or presented have not been determined.
-
Broad Application of Hadoop technology
Horizontal Use Cases
- Log Processing / ClickStream Analytics
- Machine Learning / sophisticated data mining
- Web crawling / text processing
- Extract Transform Load (ETL) replacement
- Image / XML message processing
- General archiving / compliance
Vertical Use Cases
- Financial Services
- Mobile / Telecom
- Internet Retailer
- Scientific Research
- Pharmaceutical / Drug Discovery
- Social Media
Hadoop's ability to handle large unstructured data affordably and efficiently makes it a valuable tool kit for enterprises across a number of applications and fields.
-
How does Hadoop enable parallel processing?
Source: http://architects.dzone.com/articles/how-hadoop-mapreduce-works
A framework for distributed processing of large data sets across clusters of computers using a simple programming model.
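The programming model can be illustrated with a single-process sketch. This is not Hadoop code: it only mimics the map, shuffle, and reduce phases in plain Python for a word count, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for illustration. On a real cluster each phase runs in parallel across many machines.

```python
from collections import defaultdict

def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Combine all values seen for one key."""
    return key, sum(values)

def word_count(lines):
    """Run map, shuffle, and reduce in sequence over the input lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())

if __name__ == "__main__":
    print(word_count(["big data big clusters", "data locality"]))
```

Because each call to `map_phase` depends only on its own line, and each call to `reduce_phase` only on its own key, both phases can be spread over as many workers as there are lines or keys.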
-
Hadoop System Architecture
MapReduce: Programming framework for highly parallel data processing
Hadoop Distributed File System (HDFS): Distributed data storage
-
Hadoop Map-Reduce Framework (Runtime Layer)
Source: http://architects.dzone.com/articles/how-hadoop-mapreduce-works
-
Hadoop Distributed File System
-
Hadoop Data Locality and Replication
-
Hadoop Virtualization Extensions: Topology Awareness
-
Why Virtualize Hadoop?
Elastic Scaling
- Shrink and expand cluster on demand
- Resource Guarantee
- Independent scaling of Compute and data
Highly Available
- No more single point of failure
- One click to setup
- High availability for MR Jobs
Simple to Operate
- Rapid deployment
- Unified operations across enterprise
- Easy Clone of Cluster
-
Enterprise Challenges with Using Hadoop
Deployment
- Slow to provision
- Complex to keep running/tune
Single Points of Failure
- Single point of failure with Name Node and Job Tracker
- No HA for Hadoop Framework Components (Hive, HCatalog, etc.)
Low Utilization
- Dedicated clusters to run Hadoop with low CPU utilization
- No easy way to share resources between Hadoop and non-Hadoop workloads
- Noisy neighbor, lack of resource containment
Need Multi-tenant Isolation, Resource Management, etc.
- Noisy Neighbor: no performance or security isolation between different tenants/users
- Lack of configuration isolation: can't run multiple versions on the cluster
-
Virtualization enables a Common Infrastructure for Big Data
Single purpose clusters for various business applications lead to cluster sprawl.
Cluster Sprawling: MPP DB, Hadoop, HBase
Cluster Consolidation: MPP DB, Hadoop, HBase on the Virtualization Platform
Simplify
- Single Hardware Infrastructure
- Unified operations
Optimize
- Shared Resources = higher utilization
- Elastic resources = faster on-demand access
-
Deploy a Hadoop Cluster in under 30 Minutes
Deploy vHelper OVF to vSphere
Select configuration template
Automate deployment
Select Compute, memory, storage and network
Done
Step 1: Deploy Serengeti virtual appliance on vSphere.
Step 2: A few simple commands to stand up Hadoop Cluster.
-
A Tour Through Serengeti
$ ssh serengeti@serengeti-vm
$ serengeti
serengeti>
-
A Tour Through Serengeti
serengeti> cluster create --name myElephant
serengeti> cluster list --name myElephant
name: myElephant, distro: cdh, status:RUNNING
NAME ROLES INSTANCE CPU MEM(MB) TYPE
---------------------------------------------------------------------------
master [hadoop_NameNode, hadoop_jobtracker] 1 2 7500 LOCAL 50
name: myElephant, distro: cdh, status:RUNNING
NAME ROLES INSTANCE CPU MEM(MB) TYPE
---------------------------------------------------------------------------
master [hive, hadoop_client, pig] 1 1 3700 LOCAL 50
NAME HOST IP
-----------------------------------------------------------------
myElephant-client0 rmc-elephant-009.eng.vmware.com 10.0.20.184
-
A Tour Through Serengeti
$ ssh [email protected]
$ hadoop jar hadoop-examples.jar teragen 1000000000 tera-data
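For a sense of scale: TeraGen writes fixed-size 100-byte rows, so the row count on the command line determines the size of the data set. A quick check in Python (the helper name `teragen_bytes` is illustrative):

```python
TERAGEN_ROW_BYTES = 100  # TeraGen writes fixed 100-byte rows

def teragen_bytes(rows):
    """Size in bytes of the data set produced by `teragen <rows> <dir>`."""
    return rows * TERAGEN_ROW_BYTES

if __name__ == "__main__":
    size = teragen_bytes(1_000_000_000)
    print(f"{size / 10**9:.0f} GB")
```

The command above therefore generates about 100 GB; a 1 TB data set needs ten times as many rows.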
-
Serengeti Spec File
[
  "distro": "apache",              <-- Choice of Distro
  {
    "name": "master",
    "roles": [
      "hadoop_NameNode",
      "hadoop_jobtracker"
    ],
    "instanceNum": 1,
    "instanceType": "MEDIUM",
    "ha": true                     <-- HA Option
  },
  {
    "name": "worker",
    "roles": [
      "hadoop_datanode",
      "hadoop_tasktracker"
    ],
    "instanceNum": 5,
    "instanceType": "SMALL",
    "storage": {                   <-- Choice of Shared Storage or Local Disk
      "type": "LOCAL",
      "sizeGB": 10
    }
  }
]
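The slide shows the spec as an annotated fragment rather than strict JSON. As a minimal sketch, the same information can be assembled and checked in Python before being serialized. The field names inside each group are taken from the slide; the top-level layout (a `distro` key next to a `nodeGroups` list), the `SHARED` storage value, and the helper names `node_group` and `cluster_spec` are assumptions made for illustration and may not match what Serengeti actually accepts.

```python
import json

# Assumption: the two storage choices named on the slide, spelled as constants.
VALID_STORAGE_TYPES = {"LOCAL", "SHARED"}

def node_group(name, roles, instance_num, instance_type, ha=False, storage=None):
    """Build one node-group entry using the field names shown on the slide."""
    if instance_num < 1:
        raise ValueError("instanceNum must be at least 1")
    group = {
        "name": name,
        "roles": list(roles),
        "instanceNum": instance_num,
        "instanceType": instance_type,
    }
    if ha:
        group["ha"] = True
    if storage is not None:
        if storage["type"] not in VALID_STORAGE_TYPES:
            raise ValueError(f"unknown storage type: {storage['type']}")
        group["storage"] = storage
    return group

def cluster_spec(distro, groups):
    # Assumption: the "nodeGroups" wrapper is illustrative, not confirmed by the slide.
    return {"distro": distro, "nodeGroups": groups}

spec = cluster_spec("apache", [
    node_group("master", ["hadoop_NameNode", "hadoop_jobtracker"], 1, "MEDIUM", ha=True),
    node_group("worker", ["hadoop_datanode", "hadoop_tasktracker"], 5, "SMALL",
               storage={"type": "LOCAL", "sizeGB": 10}),
])

if __name__ == "__main__":
    print(json.dumps(spec, indent=2))
```

Building the spec from code makes it easy to reject obvious mistakes, such as a zero instance count, before a cluster is requested.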
-
Configuring Distros
{
"name" : "cdh",
"version" : "3u3",
"packages" : [
{
"roles" : ["hadoop_NameNode", "hadoop_jobtracker",
"hadoop_tasktracker", "hadoop_datanode",
"hadoop_client"],
"tarball" : "cdh/3u3/hadoop-0.20.2-cdh3u3.tar.gz"
},{
"roles" : ["hive"],
"tarball" : "cdh/3u3/hive-0.7.1-cdh3u3.tar.gz"
},
{
"roles" : ["pig"],
"tarball" : "cdh/3u3/pig-0.8.1-cdh3u3.tar.gz"
}
]
},
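One way to read this manifest is as a lookup from role name to the tarball that provides it. The sketch below parses the entry shown on the slide (without the trailing comma) and inverts it; the function name `tarballs_by_role` is hypothetical and this is not Serengeti's own code.

```python
import json

DISTRO_JSON = """
{
  "name": "cdh",
  "version": "3u3",
  "packages": [
    {"roles": ["hadoop_NameNode", "hadoop_jobtracker", "hadoop_tasktracker",
               "hadoop_datanode", "hadoop_client"],
     "tarball": "cdh/3u3/hadoop-0.20.2-cdh3u3.tar.gz"},
    {"roles": ["hive"], "tarball": "cdh/3u3/hive-0.7.1-cdh3u3.tar.gz"},
    {"roles": ["pig"], "tarball": "cdh/3u3/pig-0.8.1-cdh3u3.tar.gz"}
  ]
}
"""

def tarballs_by_role(distro):
    """Map each role name to the tarball that provides it."""
    return {role: pkg["tarball"] for pkg in distro["packages"] for role in pkg["roles"]}

distro = json.loads(DISTRO_JSON)
role_map = tarballs_by_role(distro)

if __name__ == "__main__":
    for role in ("hadoop_datanode", "hive", "pig"):
        print(role, "->", role_map[role])
```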
-
Open Source of Serengeti, Spring Hadoop, Hadoop Extensions
Community Projects
Commercial Vendors
Support major distribution and multiple projects
Contribute Hadoop Virtualization Extension (HVE) to Open Source Community
-
Use Local Disk Where It's Needed
SAN Storage
- $2 - $10/Gigabyte
- $1M gets: 0.5 Petabytes, 200,000 IOPS, 8 Gbyte/sec
NAS Filers
- $1 - $5/Gigabyte
- $1M gets: 1 Petabyte, 200,000 IOPS, 10 Gbyte/sec
Local Storage
- $0.05/Gigabyte
- $1M gets: 10 Petabytes, 400,000 IOPS, 250 Gbytes/sec
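The capacity figures can be sanity-checked with simple division. The sketch below assumes decimal units (1 PB = 1,000,000 GB); the function names are illustrative. The SAN and NAS capacities match the low end of their quoted price ranges, while the 10 PB quoted for local storage implies an effective $0.10/GB rather than $0.05/GB. The slide does not say what accounts for that difference.

```python
GB_PER_PB = 1_000_000  # decimal units

def petabytes_for_budget(budget_usd, usd_per_gb):
    """Raw capacity a budget buys at a given price per gigabyte."""
    return budget_usd / usd_per_gb / GB_PER_PB

def implied_usd_per_gb(budget_usd, petabytes):
    """Effective price per gigabyte implied by a quoted capacity."""
    return budget_usd / (petabytes * GB_PER_PB)

if __name__ == "__main__":
    budget = 1_000_000
    print(petabytes_for_budget(budget, 2.00))  # SAN at the low end of its range
    print(petabytes_for_budget(budget, 1.00))  # NAS at the low end of its range
    print(implied_usd_per_gb(budget, 10))      # local disk, from the quoted 10 PB
```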
-
Extend Virtual Storage Architecture to Include Local Disk
Shared Storage: SAN or NAS
- Easy to provision
- Automated cluster rebalancing
Hybrid Storage
- SAN for boot images, VMs, other workloads
- Local disk for Hadoop & HDFS
- Scalable Bandwidth, Lower Cost/GB
[Diagram: six hosts, each running Hadoop VMs alongside other VMs]
-
Virtualized Hadoop Performance
Issues of interest
- Native vs various virtual configurations
- Local disks vs Fibre Channel SAN
- Effect of protecting Hadoop master daemons with Fault Tolerance
- Public cloud (renting) vs private cloud (buying)
24x HP DL380 G7
- 2x X5687, 72 GB
- 16x SAS 146 GB
- Broadcom 10 GbE adapter
- Qlogic 8 Gb/s HBA
Arista 7124SX 10 GbE switch
EMC VNX7500
-
Configuration
Software
- vSphere 5.0 U1 (storage tests), 5.1 (Native/Virtual, FT)
- RHEL 6.1 x86_64
- Cloudera CDH3u4
- Hadoop applications: TeraGen, TeraSort, TeraValidate (1 TB)
Hadoop VMs
- Processors (16 logical threads), memory (72 GB), disks (12) partitioned among 1, 2, or 4 VMs per host
- Separate VMs for NameNode and JobTracker for storage and FT tests
Hadoop configuration
- One map and one reduce task per vCPU (= logical thread)
- Machines are highly loaded
- 256 MB block size
- FT tests: 8-256 MB block sizes to vary load on NN and JT
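To see why block size is a useful knob for loading the master daemons, it helps to count blocks. Smaller blocks mean more blocks for the NameNode to track and, typically, more map tasks for the JobTracker to schedule. The sketch below assumes the 1 TB input is 10^12 bytes and treats it as one contiguous stream, so the counts are approximate; the helper names are illustrative.

```python
import math

MIB = 2**20
DATASET_BYTES = 10**12  # assumption: the 1 TB TeraSort input is 10^12 bytes

def hdfs_blocks(total_bytes, block_size_mb):
    """Approximate number of HDFS blocks needed to hold total_bytes."""
    return math.ceil(total_bytes / (block_size_mb * MIB))

def task_slots(hosts, logical_threads_per_host):
    """Map (or reduce) slots in the cluster at one task per logical thread."""
    return hosts * logical_threads_per_host

if __name__ == "__main__":
    for size_mb in (256, 64, 16, 8):
        print(f"{size_mb:>3} MB blocks: {hdfs_blocks(DATASET_BYTES, size_mb):>7}")
    print("map slots on 24 hosts:", task_slots(24, 16))
```

Going from 256 MB to 8 MB blocks multiplies the block count by roughly 32 while the cluster's slot count stays the same.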
-
Native versus Virtual Platforms, 24 hosts, 12 disks/host
[Bar chart: elapsed time in seconds (lower is better) for TeraGen, TeraSort, and TeraValidate; series: Native, 1 VM, 2 VMs, 4 VMs]
-
Local vs Various SAN Storage Configurations
[Bar chart: elapsed time ratio to local disks (lower is better) for TeraGen, TeraSort, and TeraValidate; series: Local disks, SAN JBOD, SAN RAID-0 with 16 KB page size, SAN RAID-0, SAN RAID-5]
16 x HP DL380 G7, EMC VNX 7500, 96 physical disks
-
Performance Effect of FT for Master Daemons
NameNode and JobTracker placed in separate UP VMs
Small overhead: Enabling FT causes 2-4% slowdown for TeraSort
8 MB case places similar load on NN & JT as >200 hosts with 256 MB
[Chart: TeraSort elapsed time ratio to FT off (axis from 1 to 1.04) versus HDFS block size in MB (256, 64, 16, 8)]
-
Different Clouds for Different Folks
Yahoo! Hadoop 2009: Classic benchmark test, 1460 hosts
Google/MapR: SaaS on Google Compute Engine
vSphere 5.1: 24 host cluster, 2 VMs/host, 8 or 12 disks/host, CDH3u4
Vastly different cluster sizes
- Compare throughput (MB sorted per second) normalized with resources
- Cost: rental or estimate of running continuously for 3 years

             #cores  #disks  TeraSort, s  MB/s/core  MB/s/disk  cost
Yahoo!        11680    5840           62        1.3        2.6   ~$7
Google/MapR    5024    1256           80        2.4        9.5   $16
vSphere 5.1     192     192          442       11.2       11.2   ~$2
vSphere 5.1     192     288          359       13.8        9.2   ~$2
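The normalized columns can be reproduced from the core counts, disk counts, and elapsed times. The sketch below assumes each run sorted 1 TB = 10^12 bytes and that "MB" means 2^20 bytes; those assumptions are not stated on the slide but they reproduce the published figures to one decimal place. The row labels for the two vSphere runs are inferred from the disk counts (192 and 288 disks over 24 hosts).

```python
MIB = 2**20
SORTED_BYTES = 10**12  # assumption: each run sorted 1 TB = 10^12 bytes

# (label, cores, disks, TeraSort elapsed seconds), copied from the table
RUNS = [
    ("Yahoo!", 11680, 5840, 62),
    ("Google/MapR", 5024, 1256, 80),
    ("vSphere 5.1 (8 disks/host)", 192, 192, 442),
    ("vSphere 5.1 (12 disks/host)", 192, 288, 359),
]

def normalized_throughput(cores, disks, seconds):
    """Return (MB/s per core, MB/s per disk) for one sort run."""
    mb_per_s = SORTED_BYTES / MIB / seconds
    return mb_per_s / cores, mb_per_s / disks

if __name__ == "__main__":
    for label, cores, disks, seconds in RUNS:
        per_core, per_disk = normalized_throughput(cores, disks, seconds)
        print(f"{label:<28} {per_core:5.1f} {per_disk:5.1f}")
```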
-
Why Virtualize Hadoop?
-
VMware-Hortonworks Joint Engineering
Hortonworks goal
- Expand Hadoop ecosystem
- Provide first class support of various platforms
Hadoop should run well on VMs
- VMs offer several advantages as presented earlier
- Take advantage of vSphere for HA
First class support for VMs
- Topology plugins (HADOOP-8468)
  - 2 VMs can be on same host
  - Pick closer data
  - Schedule tasks closer
  - Don't put two replicas on same host
- MR-tmp on HDFS using block pools
  - Elastic Compute-VMs will not need local disk
- Fast communications within VMs
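The topology points above can be sketched in a few lines. This is a simplified illustration, not the HVE implementation: it ignores racks, and the VM and host names, the `VM_TO_HOST` mapping, and both function names are hypothetical. It shows the two ideas from the list: never place two replicas on VMs that share a physical host, and prefer the closest replica when reading.

```python
def place_replicas(candidate_vms, vm_to_host, replicas=3):
    """Pick VMs for block replicas so that no two share a physical host."""
    chosen, used_hosts = [], set()
    for vm in candidate_vms:
        host = vm_to_host[vm]
        if host not in used_hosts:
            chosen.append(vm)
            used_hosts.add(host)
        if len(chosen) == replicas:
            return chosen
    raise RuntimeError("not enough distinct physical hosts for the requested replicas")

def pick_closest(reader_vm, replica_vms, vm_to_host):
    """Prefer a replica on the reader's own VM, then on the same host, then any."""
    if reader_vm in replica_vms:
        return reader_vm
    same_host = [vm for vm in replica_vms if vm_to_host[vm] == vm_to_host[reader_vm]]
    return same_host[0] if same_host else replica_vms[0]

# Hypothetical layout: vm1 and vm2 share hostA.
VM_TO_HOST = {"vm1": "hostA", "vm2": "hostA", "vm3": "hostB", "vm4": "hostC"}

if __name__ == "__main__":
    replicas = place_replicas(["vm1", "vm2", "vm3", "vm4"], VM_TO_HOST)
    print(replicas)
    print(pick_closest("vm2", replicas, VM_TO_HOST))
```

Without host awareness, vm1 and vm2 would look like independent nodes, and a single host failure could take out two replicas at once.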
-
Hadoop Full-Stack High Availability
[Diagram: an HA cluster of servers hosts the master daemons (NN, JT) with N+K failover; on NN failover the JT goes into safemode; apps running outside the cluster submit jobs to the slave nodes of the Hadoop cluster]
-
HA is in HDP 1.0
Using Total System Availability Architecture
-
HA in Hadoop 1 with HDP1
Full Stack High Availability
- Namenode
  - Clients pause automatically
  - JobTracker pauses automatically
- Other Hadoop master services (JT, ...) coming
Use industry proven HA framework: VMware vSphere-HA
- Failover, fencing, ...
- Corner cases are tricky; if not addressed, corruption
Additional benefits:
- N-N & N+K failover
- Migration for maintenance
-
Hadoop NN/JT HA with vSphere
-
Namenode Failover Times
60 Nodes, 60K files, 6 million blocks, 300 TB raw storage: 1-3.5 minutes
- Failure detection and Failover: 0.5 to 2 minutes
- Namenode Startup (exit safemode): 30 sec
180 Nodes, 200K files, 18 million blocks, 900 TB raw storage: 2-4.5 minutes
- Failure detection and Failover: 0.5 to 2 minutes
- Namenode Startup (exit safemode): 110 sec
For vSphere, an OS bootup of 10-20 seconds is needed and is included above.
Cold Failover is good enough for small/medium clusters
- Failure Detection and Automatic Failover Dominates
-
Summary
Advantages of Hadoop on VMs
- Cluster Management
- Cluster consolidation
- Greater Elasticity in mixed environment
- Alternate multi-tenancy to capacity scheduler's offerings
HA for Hadoop Master Daemons
- vSphere based HA for NN, JT, ... in Hadoop 1
- Total System Availability Architecture
-
Why Virtualize Hadoop?
-
Elastic Scaling and Multi-tenancy of Hadoop on vSphere
Current Hadoop: Combined Storage/Compute
1. Hadoop in VM
- Single Tenant
- Fixed Resources
2. Separate Compute and Data
- Single Tenant
- Elastic Compute
3. Multiple Clusters
- Multiple Tenants
- Elastic Compute
[Diagram: compute and storage VMs for each of the three configurations, with tenants T1 and T2 in the multiple-cluster case]
-
Separated Compute and Data
[Diagram: a virtualization host running several Virtual Hadoop Nodes (a Datanode, and Task Trackers with task slots) backed by VMDKs, alongside other workloads]
Truly Elastic Hadoop: Scalable through virtual nodes
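The idea of separating compute from data can be made concrete with a toy model. This sketch is purely illustrative: the class name, its fields, and the default of two task slots per compute VM are assumptions, not anything Serengeti or Hadoop defines. It shows that compute VMs can be added or removed while the data VMs, and therefore the stored blocks, stay put.

```python
from dataclasses import dataclass

@dataclass
class ElasticCluster:
    """Toy model: data VMs are fixed, compute VMs are added or removed on demand."""
    data_vms: int
    compute_vms: int
    slots_per_compute_vm: int = 2  # assumption: two task slots per compute VM

    def task_slots(self):
        """Total task slots currently offered by the compute VMs."""
        return self.compute_vms * self.slots_per_compute_vm

    def scale_compute(self, delta):
        """Add (delta > 0) or remove (delta < 0) compute VMs; storage is untouched."""
        if self.compute_vms + delta < 0:
            raise ValueError("cannot remove more compute VMs than exist")
        self.compute_vms += delta

if __name__ == "__main__":
    cluster = ElasticCluster(data_vms=4, compute_vms=4)
    cluster.scale_compute(+4)   # burst for a large job
    print(cluster.task_slots(), cluster.data_vms)
    cluster.scale_compute(-6)   # give resources back to other workloads
    print(cluster.task_slots(), cluster.data_vms)
```

With combined storage and compute, removing a node would also remove its HDFS blocks and trigger re-replication; here only the slot count changes.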
-
References
www.projectserengeti.org
www.hortonworks.com
www.cloudera.com
Fault Tolerance performance whitepaper: www.vmware.com/resources/techresources/10301
MapR/Google blog: www.mapr.com/blog/google-mapr