hw09 optimizing hadoop deployments

23
Optimizing Hadoop* Workloads Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. Other names and brands may be claimed as the property of others. All products, dates, and figures are preliminary and are subject to change without any notice. Copyright © 2009, Intel Corporation. Workloads Nurcan Coskun Intel Software & Solutions Group October 2, 2009 Acknowledgements to Jason Dai, Intel SSG, for many of the test results and optimization techniques

Upload: cloudera-inc

Post on 06-May-2015

788 views

Category:

Technology


2 download

TRANSCRIPT

Page 1: Hw09   Optimizing Hadoop Deployments

Optimizing Hadoop* Workloads

Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. Other names and brands may be claimed as the property of others. All products, dates, and figures are preliminary and are subject to change without any notice. Copyright © 2009, Intel Corporation.

Workloads

Nurcan CoskunIntel Software & Solutions Group

October 2, 2009

Acknowledgements to Jason Dai, Intel SSG, for many of the test results and optimization techniques

Page 2: Hw09   Optimizing Hadoop Deployments

Legal DisclaimersDisclaimers & Legal Notices

THE INFORMATION IS FURNISHED FOR INFORMATIONAL USE ONLY, IS SUBJECT TO CHANGE WITHOUT NOTICE, AND SHOULD NOT BE CONSTRUED AS A COMMITMENT BY INTEL CORPORATION. INTEL CORPORATION ASSUMES NO RESPONSIBILITY OR LIABILITY FOR ANY ERRORS OR INACCURACIES THAT MAY APPEAR IN THIS DOCUMENT OR ANY SOFTWARE THAT MAY BE PROVIDED IN ASSOCIATION WITH THIS DOCUMENT. THIS INFORMATION IS PROVIDED "AS IS" AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THE USE OF THIS INFORMATION INCLUDING WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, COMPLIANCE WITH A SPECIFICATION OR STANDARD, MERCHANTABILITY OR NONINFRINGEMENT.

Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, visit Intel Performance Benchmark Limitations

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL® PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. UNLESS OTHERWISE AGREED

Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. Other names and brands may be claimed as the property of others. All products, dates, and figures are preliminary and are subject to change without notice. Copyright © 2009, Intel Corporation.

2

OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. UNLESS OTHERWISE AGREED IN WRITING BY INTEL, THE INTEL PRODUCTS ARE NOT DESIGNED NOR INTENDED FOR ANY APPLICATION IN WHICH THE FAILURE OF THE INTEL PRODUCT COULD CREATE A SITUATION WHERE PERSONAL INJURY OR DEATH MAY OCCUR.

Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on theabsence or characteristics of any features or instructions marked "reserved" or "undefined." Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information. The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request. Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order. Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or by visiting Intel's Web Site http://www.intel.com/.

Page 3: Hw09   Optimizing Hadoop Deployments

Why Optimize Hadoop Deployments?

Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. Other names and brands may be claimed as the property of others. All products, dates, and figures are preliminary and are subject to change without notice. Copyright © 2009, Intel Corporation.

3

Handle More Data

At LowerCost

InLessTime

WithLessPower

Page 4: Hw09   Optimizing Hadoop Deployments

Where to Optimize?

Hardware Software

Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. Other names and brands may be claimed as the property of others. All products, dates, and figures are preliminary and are subject to change without notice. Copyright © 2009, Intel Corporation.

4

Page 5: Hw09   Optimizing Hadoop Deployments

Hadoop Servers

Masters: JobTracker, NameNode, Secondary NameNode– Deploy additional RAM and secondary power supplies– Ensure highest performance and reliability

Slaves: DataNodes, TaskTrackers– Hadoop Framework handles slave failures well

– Data blocks are replicated and distributed

Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. Other names and brands may be claimed as the property of others. All products, dates, and figures are preliminary and are subject to change without notice. Copyright © 2009, Intel Corporation.

5

– Data blocks are replicated and distributed– Workload may be bound by I/O, memory or processor resources

– The system level hardware should be adjusted on a case-by-case basis

Page 6: Hw09   Optimizing Hadoop Deployments

Server Platform

•Dual-socket servers are optimal for Hadoop deployments

•Dual-socket servers are more efficient than large-scale multi-processor platforms from a per-node, cost benefit perspective

•Dual-socket servers offset the added per-node hardware cost relative to entry-level servers through superior efficiencies in terms of load-balancing and parallelization overheads

Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. Other names and brands may be claimed as the property of others. All products, dates, and figures are preliminary and are subject to change without notice. Copyright © 2009, Intel Corporation.

6

terms of load-balancing and parallelization overheads

•Choosing hardware based on the most current platform technologies available helps to ensure the optimal intra-server throughput and efficiency

Page 7: Hw09   Optimizing Hadoop Deployments

Processor Choice Matters

Faster

Handles More Data

Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. Other names and brands may be claimed as the property of others. All products, dates, and figures are preliminary and are subject to change without notice. Copyright © 2009, Intel Corporation.

7

Handles More Data

More Energy Efficient

Page 8: Hw09   Optimizing Hadoop Deployments

Processor Choice Impacts Speed

Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. Other names and brands may be claimed as the property of others. All products, dates, and figures are preliminary and are subject to change without notice. Copyright © 2009, Intel Corporation.

8

Data Source: Intel internal measurements by using Hadoop 0.19.1 as of September 20, 2009. Hardware configurations are on slide 22. Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance.

Page 9: Hw09   Optimizing Hadoop Deployments

Processor Choice Impacts Throughput

Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. Other names and brands may be claimed as the property of others. All products, dates, and figures are preliminary and are subject to change without notice. Copyright © 2009, Intel Corporation.

9

• Throughput = # of tasks completed / minute when cluster is at 100% utilization. • Intel Xeon processor 5500 provides up to 86% more throughput than 5400 series.

Data Source: Intel internal measurements by using Hadoop 0.19.1 as of September 20, 2009. Hardware configurations are on slide 22. Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance.

Page 10: Hw09   Optimizing Hadoop Deployments

Processor Scaling

Inte l® Xeon® Processor 5500 Se rie s (Nehalem) C luste r(Lowe r Values are Be tte r)

6000

8000

10000

12000

14000

16000

18000

20000

Java

Sort Tompletion Tim

e (sec

onds)

1GB

2GB

3GB

4GB

5GB

6GB

7GB

8GB

9GB

10GB

50GB

100GB

Inte l® Xeon® Processor 5400 Se rie s (Harpe rtown) C luste r(Lowe r Values are Be tte r)

10000

15000

20000

25000

30000

Java

Sort T

ompletion Time (sec

onds

)

1GB

2GB

3GB

4GB

5GB

6GB

7GB

8GB

9GB

10GB

50GB

100GB

Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. Other names and brands may be claimed as the property of others. All products, dates, and figures are preliminary and are subject to change without notice. Copyright © 2009, Intel Corporation.

10

•Hadoop workloads scales well on Intel processors•Intel® Xeon® processor 5500 can handle larger data sizes than 5400 series.

Data Source: Intel internal measurements by using Hadoop 0.19.0 as of September 20, 2009. Hardware configurations are on slide 21. Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance.

0

2000

4000

1 2 3 4 5 6 7

Numbe r of Node s

Java

Sort Tompletion Tim

e (sec

onds)

100GB

150GB

200GB

250GB0

5000

1 2 3 4 5 6 7

Num be r of Node s

Java

Sort T

ompletion Time (sec

onds

)

150GB

200GB

Page 11: Hw09   Optimizing Hadoop Deployments

Turn on Intel® Hyper-threading Technology

Intel® Hyper-threading Technology

Increases performance for threaded applications delivering greater throughput

and responsiveness

Intel® Xeon® Processor 5500 Series (Nehalem) SMT effect in 8 node cluster(Lower Values Are Better)

100

150

200

250

JavaSort Completion Time (seconds)

SMT ON

SMT OFF

Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. Other names and brands may be claimed as the property of others. All products, dates, and figures are preliminary and are subject to change without notice. Copyright © 2009, Intel Corporation.

11

Up to 25% better performance

0

50

1GB 2GB 3GB 4GB 5GB 6GB 7GB 8GB 9GB 10GB

Data Set SizeJavaSort Completion Time (seconds)

Data Source: Intel internal measurements by using Hadoop 0.19.0 as of September 20, 2009. Hardware configurations are on slide 21. Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance.

Page 12: Hw09   Optimizing Hadoop Deployments

Memory

•Sufficient memory capacity is critical for efficient operation of servers in a Hadoop cluster, supporting high throughput by allowing large number of map/reduce tasks to be carried out simultaneously

•Typical Hadoop applications require approximately 1-2 GB of RAM per processor core, which corresponds to 8-16GB for a dual-socket server using quad-core processors

Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. Other names and brands may be claimed as the property of others. All products, dates, and figures are preliminary and are subject to change without notice. Copyright © 2009, Intel Corporation.

12

dual-socket server using quad-core processors

•Error Correcting Code (ECC) memory is highly recommended to detect and correct errors introduced during storage and transmission of data

Page 13: Hw09   Optimizing Hadoop Deployments

Selecting Server Motherboard

•Select server motherboards which are optimized for high density computing environments. – They should use high efficiency voltage regulators – They need to be optimized for airflow– They should use certified power supplies

•Optimized server motherboards will use less power, need less

Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. Other names and brands may be claimed as the property of others. All products, dates, and figures are preliminary and are subject to change without notice. Copyright © 2009, Intel Corporation.

13

•Optimized server motherboards will use less power, need less cooling, and save money

Page 14: Hw09   Optimizing Hadoop Deployments

Hard Disk and SSD

•Large number of hard drives per server (4-6)

•Hadoop orchestrates data provisioning and redundancy across individual nodes (Using RAID 0 is not needed)

•SSD’s are faster and they require very little power, SSD usage will also eliminate cooling cost created by hard disk drives

Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. Other names and brands may be claimed as the property of others. All products, dates, and figures are preliminary and are subject to change without notice. Copyright © 2009, Intel Corporation.

14

•Use SSD’s:– To store mission critical smaller data sets– To store map/reduce intermediate results– To replace HDD’s with SDD’s to reduce power consumption, increase throughput and improve performance

Page 15: Hw09   Optimizing Hadoop Deployments

Use Intel® X25-E SATA SSD’s

10 Node Inte l® Xeon® L5520 (Nehalem) C luste r(Lower Values are Be tte r)

1000

1500

2000

2500

Java

Sort Completion Tim

e (sec

onds) hdd

ssd

Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. Other names and brands may be claimed as the property of others. All products, dates, and figures are preliminary and are subject to change without notice. Copyright © 2009, Intel Corporation.

15

Data Source: Intel internal measurements by using Hadoop 0.19.0 as of September 20, 2009. Hardware configurations are on slide 23. Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance.

0

500

1GB 10GB 50GB 80GB 100GB

Da ta Se t S ize

Java

Sort Completion Tim

e (sec

onds)

Page 16: Hw09   Optimizing Hadoop Deployments

System Software

•Use a Linux* distribution based on kernel version 2.6.30 or later because of the optimizations included for energy and threading efficiency– For Example: energy consumption can be up to 60 percent (42 watts) higher at idle for each server using older versions of Linux

Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. Other names and brands may be claimed as the property of others. All products, dates, and figures are preliminary and are subject to change without notice. Copyright © 2009, Intel Corporation.

16

•Optimize Linux* file system configurations– Noatime attribute– Open file descriptor limit

•Use latest Java (for example Sun Java* 6u14) – Use 64 bit optimized JVM builds

Page 17: Hw09   Optimizing Hadoop Deployments

Hadoop Configuration Tuning

•The number of NameNode and JobTracker threads(10 -> 64)

•The number of DataNode server threads (3 -> 8)

•The number of work threads on HTTP server that runs on each TaskTracker (40-50)

•HDFS replication factor (3)

•Default HDFS block size (64MB -> 128MB)

Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. Other names and brands may be claimed as the property of others. All products, dates, and figures are preliminary and are subject to change without notice. Copyright © 2009, Intel Corporation.

17

•Maximum number of map/reduce tasks per node– (cores_per_node)/2 -> 2*(cores_per_node)

• The number of input streams (files) to be merged at once in map/reduce tasks (example: 100)

• JVM settings

• The total size of result and metadata buffers associates with a map task (100MB -> 200 MB)

Page 18: Hw09   Optimizing Hadoop Deployments

System-stack Example

Two-way Intel® Xeon® processor 5500 series

Intel® X25-E SATA SSD’s

Four to six 7200 RPM SATA drives

12-24 GB DDR3 ECC RAM

Intel® Server Board S5500WB

Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. Other names and brands may be claimed as the property of others. All products, dates, and figures are preliminary and are subject to change without notice. Copyright © 2009, Intel Corporation.

18

80 PLUS* Gold Certified power supplies

Linux* based on kernel 2.6.30 or later

Sun Java* 6u14 or later

Hadoop* (0.18.3 or 0.20.0)

Page 19: Hw09   Optimizing Hadoop Deployments

Summary

Hardware selection:

• Intel® Xeon® 5500 (“Nehalem”) improves Hadoop Workload performance

• Choosing an optimized server board such as Intel® SB5500WB (“WillowBrook”) can reduce power consumption

• Use Intel® X25-E SATA SSD’s to improve performance

Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. Other names and brands may be claimed as the property of others. All products, dates, and figures are preliminary and are subject to change without notice. Copyright © 2009, Intel Corporation.

19

• Use Intel® X25-E SATA SSD’s to improve performance

Software & configurations:

• Use latest Linux kernel

• Turn on Intel® Hyper-threading

• Optimize Hadoop Configuration

• Tuning may be different for different workload types

Page 20: Hw09   Optimizing Hadoop Deployments

References:

1. http://www.intel.com/p/en_US/products/server/processor

2. http://www.intel.com/it/pdf/server-rightsizing.pdf

3. http://www.80plus.org/

4. https://opencirrus.org/content/agenda-open-cirrus-summit-palo-alto-june-8-9-2009

Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. Other names and brands may be claimed as the property of others. All products, dates, and figures are preliminary and are subject to change without notice. Copyright © 2009, Intel Corporation.

20

Page 21: Hw09   Optimizing Hadoop Deployments

Cluster Configurations Information(Slides: “Processor Scaling” and “Turn on Intel® Hyper-threading”)

Hardware Configuration

Item Endeavor Atlantis

Node count 1-10 nodes 1-10 nodes

Platform Intel SR1600URIntel S5520UR main board1U chassis

Intel SR1560SF systemIntel S5400SF main board1U chassis

CPU/Stepping Intel® Xeon® X5560 C1 step (Nehalem EP)2.8GHz / 6.4 QPI 1333 95 W1MB L2 cache, 8M L3 cache

Intel® Xeon® X5482; C0 step (Harpertown)3.2 GHz / 12 MB L2 cache

Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. Other names and brands may be claimed as the property of others. All products, dates, and figures are preliminary and are subject to change without notice. Copyright © 2009, Intel Corporation.

21

RAM 24 GB total/node6*4GB 1333MHz Reg ECC DDR3

16 GB(FBDIMM 8x2-GB 667MHz)

Chipset Tylersburg Seaburg

BIOS Version Rev 2608 Apr 2008

Rev 22.17 Nov 2007

Interconnects Gigabit EthernetQDR InfiniBand

Gigabit EthernetDDR InfiniBand

Hard drive specs Seagate Cheetah NS400 GB SAS HDD 10kRPMModel: ST3400755SS

Using onboard Intel Entry Level Raid controller

Seagate Barracuda ES250 GB SATA HDDModel: ST3250620NS

Page 22: Hw09   Optimizing Hadoop Deployments

Cluster Configurations Information(Slides: “Processor Choice Impacts Speed” and “Processor Choice Impacts Throughput”)Intel® Xeon® X5460-based server

Processor: Dual-socket quad-core Intel® Xeon® X5460 3.16GHz

Processor Memory: 16GB (DDR2 FBDIM ECC 667MHz) RAM

Storage: 1 X 300GB 15K RPM SAS disk for system and log files, 4 X 1TB 7200RPM SATA for HDFS and intermediate results

Network: 1 Gigabit Ethernet NIC

BIOS: BIOS version S5000.86B.10.60.0091.100920081631EIST (Enhanced Intel SpeedStep Technology) disabled both hardware prefetcher and adjacent cache-line, prefetch disable

Intel® Xeon® X5570-based server

Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. Other names and brands may be claimed as the property of others. All products, dates, and figures are preliminary and are subject to change without notice. Copyright © 2009, Intel Corporation.

22

Intel® Xeon® X5570-based server

Processor: Dual-socket quad-core Intel® Xeon® X5570 2.93GHz

Processor Memory: 16GB (DDR3 ECC 1333MHz) RAM

Storage: 1 X 1TB 7200RPM SATA for system and log files, 4 X 1TB 7200RPM SATA for HDFS and intermediate results

Network: 1 Gigabit Ethernet NIC

BIOS: BIOS version 4.6.3 Both EIST (Enhanced Intel SpeedStep Technology) and Turbo mode disabled both hardware prefetcher and adjacent cache-line prefetch enabled, SMT (Simultaneous MultiThreading), enabled (Disabling hardware prefetcher and adjacent cache-line prefetch helps improve Hadoop performance on Xeon X5460 server according to our benchmarking.)

Page 23: Hw09   Optimizing Hadoop Deployments

Cluster Configurations Information(Slides: “Use Intel® X25-E SATA SSD’s”)

Slaves:

• Intel® Xeon® L5520 Processor (Nehalem) @ 2.27 GHz CPUs 5.8 GB/sec QPI, 24 GBy RAM

• Server Board: Intel® SB5500WB (Willowbrook)

• 1x 1 TB SATA HDD boot disk, holds ${HOME} dirs: /• 2x 1 TB SATA HDD scratch/experiment disks: • 2x 64 GB Intel® X25-E SATA SLC SSD scratch/experiment disks

•OS: Ubuntu* 9.04 == 2.6.28-4 kernel (to enable power saving with preserved performance)

Master:

Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. Other names and brands may be claimed as the property of others. All products, dates, and figures are preliminary and are subject to change without notice. Copyright © 2009, Intel Corporation.

23

Master:

•Intel® Xeon® Processor 2.93 GHz CPUs, 6.4 GB/sec QPI, 16 GBy RAM

•Server Board: Intel® SB5500WB (Willowbrook)

•Hard Disks:

• 1x 500 GB SATA OS boot disk (/dev/sda1), holds installed software

and ${HOME} dirs

• 2x 500 GB SATA scratch disks

• 2x64 GB Intel® X25-E SATA SLC SSDs

•OS: RedHat* Enterprise Linux 5.3 Server x64t