Hadoop Hardware Infrastructure considerations ©2013 OpalSoft Big Data


Page 1:

Hadoop Hardware

Infrastructure considerations

©2013 OpalSoft Big Data

Page 2:

Hardware Considerations

Original design idea: Hadoop is designed and developed to run on ordinary commodity hardware.

Actual production environment: large production Hadoop clusters require careful hardware infrastructure planning to keep latency minimal when storing and processing large volumes of data.

The hardware architecture should be designed around the nature of the data, the jobs that will be run, and the agreed SLAs.

Page 3:

Hardware Considerations

Various aspects of Hadoop hardware infrastructure:

• Servers (Name Node, Job Tracker, Data Node)
• Racks
• Network switch
• Storage
• Backup
• Number of copies of data

Page 4:

Name Node & Job Tracker Server

• RAM should be sized based on the following:

– Number of data nodes in the cluster
– Approximate number of blocks that will be stored in the cluster
– Number of different Hadoop processes that run on the machine
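As a rough illustration of this sizing, one could use the common rule of thumb of roughly 1 GB of Name Node heap per million blocks. The constant and the figures below are not from this deck and should be validated against your Hadoop version.

```python
# Hypothetical sizing sketch (not from this deck): estimate Name Node heap
# from cluster size, using the common ~1 GB-heap-per-million-blocks
# rule of thumb. Validate the constant against your Hadoop version.

def estimate_namenode_heap_gb(total_data_tb, block_size_mb=128):
    """Rough Name Node heap estimate in GB.

    The Name Node tracks each block once, regardless of replica count.
    """
    total_data_mb = total_data_tb * 1024 * 1024
    blocks = total_data_mb / block_size_mb
    return max(1.0, blocks / 1_000_000)  # ~1 GB heap per million blocks

# Example: 500 TB of data stored as 128 MB blocks -> ~4.1 GB of heap.
print(round(estimate_namenode_heap_gb(500), 1))
```

Note that a smaller block size inflates the block count, and therefore the Name Node heap, for the same amount of data.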

Page 5:

• I/O Adapters

– Not a critical element, as the Name Node does not participate in data transfers

• Processor

– Multi-core processors are the minimum requirement

• Standby Node Server

– Should have the same capacity as the primary Name Node

Page 6:

Data Node Servers

• RAM should be sized based on the following:

– Approximate number of blocks that will be stored in the cluster

– Number of different Hadoop processes that run on the machine

Page 7:

• I/O Adapters

– A high-throughput I/O adapter is needed

• Processor

– Multi-core, multi-processor systems are needed for parallel execution of more than one MapReduce job

• Virtualization is not recommended
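To make the multi-core point concrete, the classic slot-sizing arithmetic can be sketched as follows. The reserved-core count and map/reduce split here are hypothetical ratios for illustration, not Hadoop defaults.

```python
# Illustrative slot sizing for a Data Node; the reserved-core count and
# map/reduce split are hypothetical ratios, not Hadoop defaults.
def task_slots(cores, reserved=2, map_fraction=0.67):
    """Split usable cores into map and reduce task slots."""
    usable = max(0, cores - reserved)  # leave cores for OS + Hadoop daemons
    maps = int(usable * map_fraction)
    reduces = usable - maps
    return maps, reduces

print(task_slots(16))  # 16 cores -> (9, 5) with 2 cores reserved
```

The idea is simply that the number of concurrently running tasks is bounded by cores not consumed by the OS and the Hadoop daemons themselves.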

Page 8:

Racks

• Hadoop is rack-aware

• Configure Hadoop with each node's rack information

• Servers should be distributed across at least two racks to prevent data loss due to a rack failure
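One common way to give Hadoop this rack information is a topology script referenced by the `net.topology.script.file.name` property: Hadoop invokes the script with one or more host addresses as arguments and reads one rack path per address from stdout. The subnet-to-rack mapping below is purely hypothetical.

```python
#!/usr/bin/env python3
# Hypothetical rack-topology script. Hadoop invokes the script configured
# in net.topology.script.file.name with host addresses as arguments and
# expects one rack path per address on stdout. The mapping is illustrative.
import sys

RACK_MAP = {                      # /24 subnet prefix -> rack path
    "10.0.1": "/dc1/rack1",
    "10.0.2": "/dc1/rack2",
}
DEFAULT_RACK = "/dc1/default-rack"

def resolve(host):
    prefix = ".".join(host.split(".")[:3])  # match on the /24 subnet
    return RACK_MAP.get(prefix, DEFAULT_RACK)

if __name__ == "__main__":
    for host in sys.argv[1:]:
        print(resolve(host))
```

Hosts that resolve to the default rack are treated as if they were all in one rack, so an incomplete mapping silently weakens rack-failure protection.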

Page 9:

• Hadoop automatically performs block replication across servers in two different racks

• Servers located in the same rack have lower data-transfer latency, because all transfers occur via the rack's network switch

Page 10:

Network Switch

• It is recommended to have a separate private network for the Hadoop cluster.

• Both core and rack network switches should support high-bandwidth full-duplex data transfer.

• Higher-capacity core and rack switches will be required if the number of data copies is more than the standard 3 copies.
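The bandwidth impact of extra replicas can be sketched with simple arithmetic. Assuming default-style placement, where the first replica is written locally and every additional replica costs one full block transfer over the network, switch traffic grows roughly linearly with the replication factor. The figures below are illustrative.

```python
# Illustrative arithmetic: network bytes moved per block write as the
# replication factor grows. Assumes default-style placement where the
# first replica is written locally and each extra replica costs one
# full block transfer over the network.
def network_mb_per_block(replication, block_size_mb=128):
    transfers = max(0, replication - 1)
    return block_size_mb * transfers

for r in (3, 5):
    print(r, network_mb_per_block(r))  # 3 -> 256 MB, 5 -> 512 MB
```

Going from 3 to 5 replicas doubles the per-block network transfer, which is why switch capacity must be revisited when the replication factor is raised.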

Page 11:

Storage

• Locally attached storage provides better performance than NFS or SAN storage

• Hard disks with higher RPM provide better read/write throughput

• Several smaller-capacity hard disks should be used instead of a single large-capacity disk; this allows concurrent reads/writes and reduces disk-level bottlenecks
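The concurrency argument for many small disks can be illustrated with back-of-the-envelope spindle math. The throughput figure is hypothetical, not a vendor spec.

```python
# Illustrative spindle math (figures are hypothetical, not vendor specs):
# aggregate sequential throughput scales with spindle count when streams
# are independent, which is why several small disks beat one large one.
def aggregate_throughput_mb_s(disks, per_disk_mb_s=120):
    return disks * per_disk_mb_s

one_big = aggregate_throughput_mb_s(1)     # 1 x 12 TB disk
many_small = aggregate_throughput_mb_s(6)  # 6 x 2 TB disks, same capacity
print(one_big, many_small)  # 120 720
```

Same raw capacity, roughly six times the concurrent read/write bandwidth, because each spindle can serve an independent stream.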

Page 12:

• Name Node and Job Tracker servers should have a highly fault-tolerant RAID configuration

• The Data Node RAID configuration is not as critical, since the data is already replicated across multiple servers

• Using SSDs will improve performance drastically at the expense of higher setup cost

Page 13:

Backup

• Name Node data is the most critical information and needs the most frequent backup

• Name Node data is regularly streamed to a standby node so that Hadoop cluster operation can be restored if the primary Name Node fails

Page 14:

• Another backup server is recommended to perform periodic backups and checksum verification of the Name Node data

• Whether Data Node backup is required depends upon the criticality and availability requirements of the data

• Frequent backups are not required; a regular backup is needed only to recover from a data-center-level failure
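The checksum-verification step could be sketched as below. The helper names are hypothetical; in practice the backup server would verify the fsimage and edits files copied from the Name Node.

```python
# Sketch of the checksum-verification step; function names are hypothetical.
# A real deployment would verify the Name Node's fsimage/edits files
# copied to the backup server.
import hashlib

def sha256_of(path, chunk=1 << 20):
    """Stream a file through SHA-256 in 1 MB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()

def verify_backup(path, expected_digest):
    """True when the backup copy matches the recorded digest."""
    return sha256_of(path) == expected_digest
```

Recording the digest at backup time and re-checking it periodically catches silent corruption of the backed-up metadata before it is needed for recovery.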

Page 15:

Number of Copies of Data

The following determine the number of copies of a block:

– Criticality of the data

– Number of concurrent MapReduce jobs that will be executed on the data set

– More replicas allow more jobs to run concurrently on the same data set

– More replicas reduce job execution time, since most often the required data is available locally, or at least in the same rack

Page 16:

– Be aware that a higher number of replicas affects the write performance of the Hadoop cluster.

– Having more replicas only for the most frequently used data provides the maximum benefit, instead of applying a higher replication factor across the whole cluster.
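This last trade-off can be quantified with a small sketch. All sizes and replication factors below are hypothetical: compare raw storage when the whole cluster is replicated at a higher factor versus raising replication only for the hot subset of the data.

```python
# Illustrative storage math (all sizes hypothetical): raising replication
# only for the hot subset of data versus raising it cluster-wide.
def raw_storage_tb(hot_tb, cold_tb, hot_rep, cold_rep):
    return hot_tb * hot_rep + cold_tb * cold_rep

uniform   = raw_storage_tb(hot_tb=50, cold_tb=450, hot_rep=5, cold_rep=5)
selective = raw_storage_tb(hot_tb=50, cold_tb=450, hot_rep=5, cold_rep=3)
print(uniform, selective)  # 2500 1600
```

The hot data still gets the concurrency benefit of 5 replicas, while the cluster avoids the extra raw storage (and write traffic) of replicating everything at the higher factor.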