cap2956-inside the hadoop machine_final_us.pdf

Upload: kinankazuki104

Post on 14-Apr-2018


  • 7/27/2019 CAP2956-Inside the Hadoop Machine_Final_US.pdf

    1/41

    Inside the Hadoop

    Machine

    Jeff Buell, VMware, Inc.

    Richard McDougall, VMware, Inc.

    Sanjay Radia, Hortonworks

    APP-CAP2956

    #vmworldapps


    Disclaimer

    This session may contain product features that are currently under development.

    This session/overview of the new technology represents no commitment from VMware to deliver these features in any generally available product.

    Features are subject to change, and must not be included in contracts, purchase orders, or sales agreements of any kind.

    Technical feasibility and market demand will affect final delivery.

    Pricing and packaging for any new technologies or features discussed or presented have not been determined.


    Broad Application of Hadoop Technology

    Horizontal Use Cases

    - Log Processing / ClickStream Analytics

    - Machine Learning / sophisticated data mining

    - Web crawling / text processing

    - Extract Transform Load (ETL) replacement

    - Image / XML message processing

    - General archiving / compliance

    Vertical Use Cases

    - Financial Services

    - Mobile / Telecom

    - Internet Retailer

    - Scientific Research

    - Pharmaceutical / Drug Discovery

    - Social Media

    Hadoop's ability to handle large unstructured data affordably and efficiently makes it a valuable tool kit for enterprises across a number of applications and fields.


    How does Hadoop enable parallel processing?

    Source: http://architects.dzone.com/articles/how-hadoop-mapreduce-works

    A framework for distributed processing of large data sets across clusters of computers using a simple programming model.
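    The "simple programming model" is map, shuffle, reduce. The following is a minimal single-process sketch of that model in Python, not Hadoop's API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical and chosen only for illustration. In a real cluster the framework runs many map and reduce tasks in parallel on different machines and performs the shuffle over the network.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group emitted values by key (done by the framework between map and reduce)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values seen for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence over an in-memory list of lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data", "Big cluster"]))
```

    Because each call to `map_phase` depends only on its own line, and each call to `reduce_phase` only on its own key, both phases can be spread across machines without coordination beyond the shuffle.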


    Hadoop System Architecture

    MapReduce: Programming framework for highly parallel data processing

    Hadoop Distributed File System (HDFS): Distributed data storage


    Hadoop Map-Reduce Framework (Runtime Layer)

    Source: http://architects.dzone.com/articles/how-hadoop-mapreduce-works


    Hadoop Distributed File System


    Hadoop Data Locality and Replication


    Hadoop Virtualization Extensions: Topology Awareness


    Why Virtualize Hadoop?

    Elastic Scaling

    - Shrink and expand cluster on demand

    - Resource Guarantee

    - Independent scaling of Compute and data

    Highly Available

    - No more single point of failure

    - One click to setup

    - High availability for MR Jobs

    Simple to Operate

    - Rapid deployment

    - Unified operations across enterprise

    - Easy Clone of Cluster


    Enterprise Challenges with Using Hadoop

    Deployment

    - Slow to provision

    - Complex to keep running/tune

    Single Points of Failure

    - Single point of failure with NameNode and JobTracker

    - No HA for Hadoop framework components (Hive, HCatalog, etc.)

    Low Utilization

    - Dedicated clusters to run Hadoop with low CPU utilization

    - No easy way to share resources between Hadoop and non-Hadoop workloads

    - Noisy neighbor, lack of resource containment

    Need Multi-tenant Isolation, Resource Management, etc.

    - Noisy neighbor: no performance or security isolation between different tenants/users

    - Lack of configuration isolation: cannot run multiple versions on the cluster


    Virtualization enables a Common Infrastructure for Big Data

    Single-purpose clusters for various business applications lead to cluster sprawl.

    Virtualization Platform

    Simplify

    - Single Hardware Infrastructure

    - Unified operations

    Optimize

    - Shared Resources = higher utilization

    - Elastic resources = faster on-demand access

    [Diagram: cluster sprawl (separate MPP DB, Hadoop, and HBase clusters) versus cluster consolidation (MPP DB, Hadoop, and HBase on one virtualization platform)]


    Deploy a Hadoop Cluster in under 30 Minutes

    Deploy vHelper OVF to vSphere

    Select configuration template

    Automate deployment

    Select compute, memory, storage and network

    Done

    Step 1: Deploy Serengeti virtual appliance on vSphere.

    Step 2: A few simple commands to stand up Hadoop Cluster.


    A Tour Through Serengeti

    $ ssh serengeti@serengeti-vm

    $ serengeti

    serengeti>


    A Tour Through Serengeti

    serengeti> cluster create --name myElephant

    serengeti> cluster list --name myElephant

    name: myElephant, distro: cdh, status:RUNNING

    NAME ROLES INSTANCE CPU MEM(MB) TYPE

    ---------------------------------------------------------------------------

    master [hadoop_NameNode, hadoop_jobtracker] 1 2 7500 LOCAL 50

    name: myElephant, distro: cdh, status:RUNNING

    NAME ROLES INSTANCE CPU MEM(MB) TYPE

    ---------------------------------------------------------------------------

    master [hive, hadoop_client, pig] 1 1 3700 LOCAL 50

    NAME HOST IP

    -----------------------------------------------------------------

    myElephant-client0 rmc-elephant-009.eng.vmware.com 10.0.20.184


    A Tour Through Serengeti

    $ ssh [email protected]

    $ hadoop jar hadoop-examples.jar teragen 1000000000 tera-data
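    As a rough check on what the TeraGen command above produces: TeraGen writes fixed-size rows of 100 bytes each, so the row count on the command line determines the data volume directly. The small Python sketch below only does that arithmetic; the helper name `teragen_bytes` is hypothetical.

```python
ROW_BYTES = 100  # TeraGen writes fixed-size 100-byte rows


def teragen_bytes(rows):
    """Return the number of bytes TeraGen generates for a given row count."""
    return rows * ROW_BYTES


if __name__ == "__main__":
    rows = 1_000_000_000  # the row count used in the command above
    print(f"{rows} rows -> {teragen_bytes(rows) / 10**9:.0f} GB")
```

    By the same arithmetic, a 1 TB data set corresponds to 10,000,000,000 rows.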


    Serengeti Spec File

    [
      "distro": "apache",                <- Choice of Distro
      {
        "name": "master",
        "roles": [
          "hadoop_NameNode",
          "hadoop_jobtracker"
        ],
        "instanceNum": 1,
        "instanceType": "MEDIUM",
        "ha": true                       <- HA Option
      },
      {
        "name": "worker",
        "roles": [
          "hadoop_datanode",
          "hadoop_tasktracker"
        ],
        "instanceNum": 5,
        "instanceType": "SMALL",
        "storage": {                     <- Choice of Shared Storage or Local Disk
          "type": "LOCAL",
          "sizeGB": 10
        }
      }
    ]
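    To make the spec concrete, the sketch below transcribes the two node groups from the slide into Python dictionaries (slide callouts removed) and totals what the spec asks for. This is an illustration only, not Serengeti code: the `summarize` helper is hypothetical, and it assumes that `sizeGB` applies to each instance in a group.

```python
# Node groups transcribed from the spec on the slide (slide callouts removed).
node_groups = [
    {
        "name": "master",
        "roles": ["hadoop_NameNode", "hadoop_jobtracker"],
        "instanceNum": 1,
        "instanceType": "MEDIUM",
        "ha": True,
    },
    {
        "name": "worker",
        "roles": ["hadoop_datanode", "hadoop_tasktracker"],
        "instanceNum": 5,
        "instanceType": "SMALL",
        "storage": {"type": "LOCAL", "sizeGB": 10},
    },
]


def summarize(groups):
    """Total the VMs and storage requested by a list of node groups.

    Assumption: sizeGB is per instance, so a group's storage is
    instanceNum * sizeGB. Groups without a storage entry add nothing.
    """
    total_vms = sum(g["instanceNum"] for g in groups)
    storage_gb = sum(
        g["instanceNum"] * g["storage"]["sizeGB"] for g in groups if "storage" in g
    )
    ha_groups = [g["name"] for g in groups if g.get("ha")]
    return {"vms": total_vms, "storage_gb": storage_gb, "ha_groups": ha_groups}


if __name__ == "__main__":
    print(summarize(node_groups))
```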


    Configuring Distros

    {
      "name" : "cdh",
      "version" : "3u3",
      "packages" : [
        {
          "roles" : ["hadoop_NameNode", "hadoop_jobtracker",
                     "hadoop_tasktracker", "hadoop_datanode",
                     "hadoop_client"],
          "tarball" : "cdh/3u3/hadoop-0.20.2-cdh3u3.tar.gz"
        },
        {
          "roles" : ["hive"],
          "tarball" : "cdh/3u3/hive-0.7.1-cdh3u3.tar.gz"
        },
        {
          "roles" : ["pig"],
          "tarball" : "cdh/3u3/pig-0.8.1-cdh3u3.tar.gz"
        }
      ]
    },
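    The distro entry above maps roles to the tarball that provides them. A small Python sketch of that lookup follows. It assumes the entry is one element of a JSON list of distros (the trailing comma on the slide suggests more entries follow); the helper `tarball_for` is hypothetical and is not part of Serengeti.

```python
import json

# The "cdh" entry from the slide, wrapped in a list (assumed enclosing structure).
MANIFEST = """
[
  {
    "name": "cdh",
    "version": "3u3",
    "packages": [
      {
        "roles": ["hadoop_NameNode", "hadoop_jobtracker",
                  "hadoop_tasktracker", "hadoop_datanode", "hadoop_client"],
        "tarball": "cdh/3u3/hadoop-0.20.2-cdh3u3.tar.gz"
      },
      {"roles": ["hive"], "tarball": "cdh/3u3/hive-0.7.1-cdh3u3.tar.gz"},
      {"roles": ["pig"], "tarball": "cdh/3u3/pig-0.8.1-cdh3u3.tar.gz"}
    ]
  }
]
"""


def tarball_for(manifest, distro, role):
    """Return the tarball path that provides `role` in `distro`, or None if absent."""
    for entry in manifest:
        if entry["name"] != distro:
            continue
        for package in entry["packages"]:
            if role in package["roles"]:
                return package["tarball"]
    return None


if __name__ == "__main__":
    manifest = json.loads(MANIFEST)
    print(tarball_for(manifest, "cdh", "hive"))
```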


    Open Source of Serengeti, Spring Hadoop, Hadoop Extensions

    Community Projects

    Commercial Vendors

    Support major distributions and multiple projects

    Contribute Hadoop Virtualization Extensions (HVE) to the open source community


    Use Local Disk Where It's Needed

    Storage type     Cost per gigabyte   $1M gets         IOPS       Bandwidth
    SAN Storage      $2 - $10            0.5 petabytes    200,000    8 Gbyte/sec
    NAS Filers       $1 - $5             1 petabyte       200,000    10 Gbyte/sec
    Local Storage    $0.05               10 petabytes     400,000    250 Gbyte/sec


    Extend Virtual Storage Architecture to Include Local Disk

    Shared Storage: SAN or NAS

    Easy to provision

    Automated cluster rebalancing

    Hybrid Storage

    SAN for boot images, VMs, other workloads

    Local disk for Hadoop & HDFS

    Scalable Bandwidth, Lower Cost/GB

    [Diagram: six hosts, each running one or two Hadoop VMs alongside other VMs]


    Virtualized Hadoop Performance

    Issues of interest

    Native vs various virtual configurations

    Local disks vs Fibre Channel SAN

    Effect of protecting Hadoop master daemons with Fault Tolerance

    Public cloud (renting) vs private cloud (buying)

    24x HP DL380 G7

    2x X5687, 72 GB

    16x SAS 146 GB

    Broadcom 10 GbE adapter

    Qlogic 8 Gb/s HBA

    Arista 7124SX 10 GbE switch

    EMC VNX7500


    Configuration

    Software

    vSphere 5.0 U1 (storage tests), 5.1 (Native/Virtual, FT)

    RHEL 6.1 x86_64

    Cloudera CDH3u4

    Hadoop applications: TeraGen, TeraSort, TeraValidate (1 TB)

    Hadoop VMs

    Processors (16 logical threads), memory (72 GB), disks (12) partitioned among 1, 2, or 4 VMs per host

    Separate VMs for NameNode and JobTracker for storage and FT tests

    Hadoop configuration

    One map and one reduce task per vCPU (= logical thread)

    Machines are highly loaded

    256 MB block size

    FT tests: 8 to 256 MB block sizes to vary load on NN and JT
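    Why shrinking the block size raises the load on the NameNode and JobTracker can be seen from simple arithmetic: a fixed data set split into smaller blocks means more blocks for the NameNode to track and, with the usual one-input-split-per-block behaviour, more map tasks for the JobTracker to schedule. The Python sketch below counts blocks for the 1 TB data set at the four block sizes used in the FT tests. It assumes 1 TB = 10^12 bytes and 1 MB = 2^20 bytes, and ignores the extra partial blocks that arise when the data is spread over many files; `block_count` is a hypothetical helper.

```python
DATA_BYTES = 10**12  # assumed: 1 TB TeraSort input
MB = 2**20           # assumed: block sizes are in binary megabytes


def block_count(data_bytes, block_mb):
    """Number of HDFS blocks needed to hold data_bytes (integer ceiling division)."""
    block_bytes = block_mb * MB
    return -(-data_bytes // block_bytes)


if __name__ == "__main__":
    baseline = block_count(DATA_BYTES, 256)
    for block_mb in (256, 64, 16, 8):
        blocks = block_count(DATA_BYTES, block_mb)
        print(f"{block_mb:>3} MB blocks: {blocks:>7} blocks "
              f"({blocks / baseline:.1f}x the 256 MB case)")
```

    Going from 256 MB to 8 MB blocks multiplies the block count by about 32, which is how a 24-host cluster can be made to stress the master daemons like a much larger one.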


    Native versus Virtual Platforms, 24 hosts, 12 disks/host

    [Bar chart: elapsed time in seconds (lower is better) for TeraGen, TeraSort, and TeraValidate; series: Native, 1 VM, 2 VMs, 4 VMs]


    Local vs Various SAN Storage Configurations

    [Bar chart: elapsed time ratio to local disks (lower is better) for TeraGen, TeraSort, and TeraValidate; series: Local disks, SAN JBOD, SAN RAID-0 with 16 KB page size, SAN RAID-0, SAN RAID-5]

    16 x HP DL380 G7, EMC VNX 7500, 96 physical disks


    Performance Effect of FT for Master Daemons

    NameNode and JobTracker placed in separate UP VMs

    Small overhead: Enabling FT causes 2-4% slowdown for TeraSort

    8 MB case places similar load on NN & JT as >200 hosts with 256 MB

    [Chart: TeraSort elapsed time ratio to FT off (y-axis 1 to 1.04) versus HDFS block size in MB (256, 64, 16, 8)]


    Different Clouds for Different Folks

    Yahoo! Hadoop 2009: Classic benchmark test, 1460 hosts

    Google/MapR: SaaS on Google Compute Engine

    vSphere 5.1: 24 host cluster, 2 VMs/host, 8 or 12 disks/host, CDH3u4

    Vastly different cluster sizes

    Compare throughput (MB sorted per second) normalized with resources

    Cost: rental or estimate of running continuously for 3 years

                   #cores  #disks  TeraSort, s  MB/s/core  MB/s/disk  cost
    Yahoo!          11680    5840           62        1.3        2.6   ~$7
    Google/MapR      5024    1256           80        2.4        9.5   $16
    vSphere 5.1       192     192          442       11.2       11.2   ~$2
    vSphere 5.1       192     288          359       13.8        9.2   ~$2
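    The normalized columns in the table can be reproduced from the raw columns. The Python sketch below does so under two assumptions that are not stated on the slide: every run sorted 1 TB (10^12 bytes), and MB means 2^20 bytes. With those assumptions the computed values round to the figures in the table. The `normalized` helper is hypothetical.

```python
SORTED_MB = 10**12 / 2**20  # assumed: 1 TB sorted, expressed in binary megabytes

# (label, cores, disks, TeraSort elapsed seconds), copied from the table above
RESULTS = [
    ("Yahoo!", 11680, 5840, 62),
    ("Google/MapR", 5024, 1256, 80),
    ("vSphere 5.1, 8 disks/host", 192, 192, 442),
    ("vSphere 5.1, 12 disks/host", 192, 288, 359),
]


def normalized(cores, disks, seconds):
    """Return (MB/s per core, MB/s per disk), each rounded to one decimal place."""
    mb_per_second = SORTED_MB / seconds
    return round(mb_per_second / cores, 1), round(mb_per_second / disks, 1)


if __name__ == "__main__":
    for label, cores, disks, seconds in RESULTS:
        per_core, per_disk = normalized(cores, disks, seconds)
        print(f"{label:<28} {per_core:>5} MB/s/core {per_disk:>5} MB/s/disk")
```

    Normalizing this way is what lets a 24-host cluster be compared with a 1460-host one: the large clusters finish sooner in absolute terms but sort far less data per core per second.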


    Why Virtualize Hadoop?

    Elastic Scaling

    - Shrink and expand cluster on demand

    - Resource Guarantee

    - Independent scaling of Compute and data

    Highly Available

    - No more single point of failure

    - One click to setup

    - High availability for MR Jobs

    Simple to Operate

    - Rapid deployment

    - Unified operations across enterprise

    - Easy Clone of Cluster


    VMware-Hortonworks Joint Engineering

    Hortonworks goal

    Expand Hadoop ecosystem

    Provide first class support of various platforms

    Hadoop should run well on VMs

    VMs offer several advantages as presented earlier

    Take advantage of vSphere for HA

    First class support for VMs

    Topology plugins (HADOOP-8468)

    - 2 VMs can be on same host

    - Pick closer data

    - Schedule tasks closer

    - Don't put two replicas on same host

    MR-tmp on HDFS using block pools

    Elastic Compute-VMs will not need local disk

    Fast communications within VMs
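    The topology points above (two VMs may share a physical host, prefer closer data, never put two replicas on the same host) can be illustrated with a small Python sketch. This is not the HDFS block placement policy and not the HADOOP-8468 implementation: the `Node` type, the distance values, and both helper functions are assumptions made up for illustration. The one idea it demonstrates is adding a physical-host layer beneath the rack so that VM placement is visible to the scheduler.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str  # VM running a DataNode / TaskTracker
    host: str  # physical host the VM runs on
    rack: str  # rack containing that physical host


def distance(a, b):
    """Illustrative distance: same VM < same physical host < same rack < off rack."""
    if a == b:
        return 0
    if a.host == b.host:
        return 1
    if a.rack == b.rack:
        return 2
    return 3


def pick_closest(reader, replicas):
    """'Pick closer data': choose the replica nearest to the reading VM."""
    return min(replicas, key=lambda node: distance(reader, node))


def place_replicas(writer, candidates, count=3):
    """Choose replica targets so that no two replicas share a physical host.

    Candidates are taken in the order given; a VM is skipped if another
    replica already lives on its physical host.
    """
    chosen = [writer]
    used_hosts = {writer.host}
    for node in candidates:
        if len(chosen) == count:
            break
        if node.host not in used_hosts:
            chosen.append(node)
            used_hosts.add(node.host)
    return chosen


if __name__ == "__main__":
    vm1 = Node("vm1", "host1", "rack1")
    vm2 = Node("vm2", "host1", "rack1")  # shares a physical host with vm1
    vm3 = Node("vm3", "host2", "rack1")
    vm4 = Node("vm4", "host3", "rack2")
    print([n.name for n in place_replicas(vm1, [vm2, vm3, vm4])])
    print(pick_closest(vm1, [vm3, vm2, vm4]).name)
```

    Without the host layer, vm1 and vm2 look like two independent machines, and a placement policy could put two replicas of a block on them even though a single hardware failure would take out both.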


    Hadoop Full-Stack High Availability

    [Diagram: HA cluster for master daemons (NN and JT on servers, with failover and N+K failover); apps running outside; JT goes into safemode; NN; jobs; slave nodes of the Hadoop cluster]


    HA is in HDP 1.0

    Using Total System Availability Architecture


    HA in Hadoop 1 with HDP1

    Full Stack High Availability

    Namenode

    Clients pause automatically

    JobTracker pauses automatically

    Other Hadoop master services (JT, ...) coming

    Use industry-proven HA framework: VMware vSphere HA

    Failover, fencing, ...

    Corner cases are tricky; if not addressed, corruption

    Additional benefits:

    N-N & N+K failover

    Migration for maintenance


    Hadoop NN/JT HA with vSphere


    Namenode Failover Times

    60 Nodes, 60K files, 6 million blocks, 300 TB raw storage: 1-3.5 minutes

    - Failure detection and failover: 0.5 to 2 minutes

    - Namenode startup (exit safemode): 30 sec

    180 Nodes, 200K files, 18 million blocks, 900 TB raw storage: 2-4.5 minutes

    - Failure detection and failover: 0.5 to 2 minutes

    - Namenode startup (exit safemode): 110 sec

    For vSphere, an OS bootup is needed; its 10-20 seconds are included above.

    Cold Failover is good enough for small/medium clusters

    Failure Detection and Automatic Failover Dominates


    Summary

    Advantages of Hadoop on VMs

    Cluster Management

    Cluster consolidation

    Greater Elasticity in mixed environment

    Alternative multi-tenancy to the Capacity Scheduler's offerings

    HA for Hadoop Master Daemons

    vSphere-based HA for NN, JT, ... in Hadoop 1

    Total System Availability Architecture


    Why Virtualize Hadoop?

    Elastic Scaling

    - Shrink and expand cluster on demand

    - Resource Guarantee

    - Independent scaling of Compute and data

    Highly Available

    - No more single point of failure

    - One click to setup

    - High availability for MR Jobs

    Simple to Operate

    - Rapid deployment

    - Unified operations across enterprise

    - Easy Clone of Cluster


    Elastic Scaling and Multi-tenancy of Hadoop on vSphere

    Current Hadoop: Combined Storage/Compute

    1. Hadoop in VM

    - Single Tenant

    - Fixed Resources

    2. Separate Compute and Data

    - Single Tenant

    - Elastic Compute

    3. Multi. Clusters

    - Multiple Tenants

    - Elastic Compute

    [Diagram labels: Compute, Storage, tenants T1 and T2, VMs]


    Separated Compute and Data

    Truly Elastic Hadoop: Scalable through virtual nodes

    [Diagram: a virtualization host running virtual Hadoop nodes alongside other workloads; a Datanode virtual node backed by VMDKs, and Task Tracker virtual nodes with task slots]


    References

    www.projectserengeti.org

    www.hortonworks.com

    www.cloudera.com

    Fault Tolerance performance whitepaper: www.vmware.com/resources/techresources/10301

    MapR/Google blog: www.mapr.com/blog/google-mapr
