Hadoop Hardware Infrastructure considerations ©2013 OpalSoft Big Data


Page 1:

Hadoop Hardware

Infrastructure considerations

©2013 OpalSoft Big Data

Page 2:

Hardware Considerations

Original design idea: Hadoop is designed and developed to run on ordinary commodity hardware.

Actual production environment: large production Hadoop clusters require careful hardware infrastructure planning to keep latency minimal when storing and processing large volumes of data.

The hardware architecture should be designed around the nature of the data, the jobs that will be run, and the agreed SLAs.

Page 3:

Hardware Considerations

Various aspects of Hadoop hardware infrastructure:

• Servers (Name Node, Job Tracker, Data Node)
• Racks
• Network switch
• Storage
• Backup
• Number of copies of data

Page 4:

Name Node & Job Tracker Server

• RAM should be sized based on the following:

– Number of data nodes in the cluster
– Approximate number of blocks that will be stored in the cluster
– Number of different Hadoop processes that run on the machine
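As a rough illustration of this sizing, one could use the common rule of thumb of roughly 1 GB of Name Node heap per million blocks. The constant and the figures below are not from this deck and should be validated against your Hadoop version.

```python
# Hypothetical sizing sketch (not from this deck): estimate Name Node heap
# from cluster size, using the common ~1 GB-heap-per-million-blocks
# rule of thumb. Validate the constant against your Hadoop version.

def estimate_namenode_heap_gb(total_data_tb, block_size_mb=128):
    """Rough Name Node heap estimate in GB.

    The Name Node tracks each block once, regardless of replica count.
    """
    total_data_mb = total_data_tb * 1024 * 1024
    blocks = total_data_mb / block_size_mb
    return max(1.0, blocks / 1_000_000)  # ~1 GB heap per million blocks

# Example: 500 TB of data stored as 128 MB blocks -> ~4.1 GB of heap.
print(round(estimate_namenode_heap_gb(500), 1))
```

Note that a smaller block size inflates the block count, and therefore the Name Node heap, for the same amount of data.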

Page 5:

• I/O Adapters

– Not a critical element, as the Name Node does not participate in data transfers

• Processor

– Multi-core processors are the minimum requirement

• Standby Node Server

– Should have the same capacity as the primary Name Node

Page 6:

Data Node Servers

• RAM should be sized based on the following:

– Approximate number of blocks that will be stored in the cluster

– Number of different Hadoop processes that run on the machine

Page 7:

• I/O Adapters

– A high-throughput I/O adapter is needed

• Processor

– Multi-core, multi-processor systems are needed for parallel execution of more than one MapReduce job

• Virtualization is not recommended
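To make the multi-core point concrete, the classic slot-sizing arithmetic can be sketched as follows. The reserved-core count and map/reduce split here are hypothetical ratios for illustration, not Hadoop defaults.

```python
# Illustrative slot sizing for a Data Node; the reserved-core count and
# map/reduce split are hypothetical ratios, not Hadoop defaults.
def task_slots(cores, reserved=2, map_fraction=0.67):
    """Split usable cores into map and reduce task slots."""
    usable = max(0, cores - reserved)  # leave cores for OS + Hadoop daemons
    maps = int(usable * map_fraction)
    reduces = usable - maps
    return maps, reduces

print(task_slots(16))  # 16 cores -> (9, 5) with 2 cores reserved
```

The idea is simply that the number of concurrently running tasks is bounded by cores not consumed by the OS and the Hadoop daemons themselves.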

Page 8:

Racks

• Hadoop is rack-aware

• Configure Hadoop with each node's rack information

• Servers should be distributed across at least two racks to prevent data loss due to a rack failure
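One common way to give Hadoop this rack information is a topology script referenced by the `net.topology.script.file.name` property: Hadoop invokes the script with one or more host addresses as arguments and reads one rack path per address from stdout. The subnet-to-rack mapping below is purely hypothetical.

```python
#!/usr/bin/env python3
# Hypothetical rack-topology script. Hadoop invokes the script configured
# in net.topology.script.file.name with host addresses as arguments and
# expects one rack path per address on stdout. The mapping is illustrative.
import sys

RACK_MAP = {                      # /24 subnet prefix -> rack path
    "10.0.1": "/dc1/rack1",
    "10.0.2": "/dc1/rack2",
}
DEFAULT_RACK = "/dc1/default-rack"

def resolve(host):
    prefix = ".".join(host.split(".")[:3])  # match on the /24 subnet
    return RACK_MAP.get(prefix, DEFAULT_RACK)

if __name__ == "__main__":
    for host in sys.argv[1:]:
        print(resolve(host))
```

Hosts that resolve to the default rack are treated as if they were all in one rack, so an incomplete mapping silently weakens rack-failure protection.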

Page 9:

• Hadoop automatically performs block replication across servers in two different racks

• Servers located in the same rack have lower data-transfer latency, because all transfers occur via the rack's network switch

Page 10:

Network Switch

• It is recommended to have a separate private network for the Hadoop cluster.

• Both core and rack network switches should support high-bandwidth full-duplex data transfer.

• Higher-capacity core and rack switches will be required if the number of data copies is more than the standard 3 copies.
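The bandwidth impact of extra replicas can be sketched with simple arithmetic. Assuming default-style placement, where the first replica is written locally and every additional replica costs one full block transfer over the network, switch traffic grows roughly linearly with the replication factor. The figures below are illustrative.

```python
# Illustrative arithmetic: network bytes moved per block write as the
# replication factor grows. Assumes default-style placement where the
# first replica is written locally and each extra replica costs one
# full block transfer over the network.
def network_mb_per_block(replication, block_size_mb=128):
    transfers = max(0, replication - 1)
    return block_size_mb * transfers

for r in (3, 5):
    print(r, network_mb_per_block(r))  # 3 -> 256 MB, 5 -> 512 MB
```

Going from 3 to 5 replicas doubles the per-block network transfer, which is why switch capacity must be revisited when the replication factor is raised.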

Page 11:

Storage

• Locally attached storage provides better performance than NFS or SAN storage

• Hard disks with higher RPM provide better read/write throughput

• Several smaller-capacity hard disks should be used instead of a single large-capacity disk; this allows concurrent reads/writes and reduces disk-level bottlenecks
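The concurrency argument for many small disks can be illustrated with back-of-the-envelope spindle math. The throughput figure is hypothetical, not a vendor spec.

```python
# Illustrative spindle math (figures are hypothetical, not vendor specs):
# aggregate sequential throughput scales with spindle count when streams
# are independent, which is why several small disks beat one large one.
def aggregate_throughput_mb_s(disks, per_disk_mb_s=120):
    return disks * per_disk_mb_s

one_big = aggregate_throughput_mb_s(1)     # 1 x 12 TB disk
many_small = aggregate_throughput_mb_s(6)  # 6 x 2 TB disks, same capacity
print(one_big, many_small)  # 120 720
```

Same raw capacity, roughly six times the concurrent read/write bandwidth, because each spindle can serve an independent stream.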

Page 12:

• Name Node and Job Tracker servers should have a highly fault-tolerant RAID configuration

• The Data Node RAID configuration is not as critical, since the data is already replicated across multiple servers

• Using SSDs will improve performance drastically at the expense of higher setup cost

Page 13:

Backup

• Name Node data is the most critical information and needs the most frequent backup

• Name Node data is regularly streamed to a standby node so that Hadoop cluster operation can be restored if the primary Name Node fails

Page 14:

• Another backup server is recommended to perform periodic backups and checksum verification of the Name Node data

• Whether Data Node backup is required depends upon the criticality and availability requirements of the data

• Frequent backups are not required; a regular backup is needed only to recover from a data-center-level failure
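The checksum-verification step could be sketched as below. The helper names are hypothetical; in practice the backup server would verify the fsimage and edits files copied from the Name Node.

```python
# Sketch of the checksum-verification step; function names are hypothetical.
# A real deployment would verify the Name Node's fsimage/edits files
# copied to the backup server.
import hashlib

def sha256_of(path, chunk=1 << 20):
    """Stream a file through SHA-256 in 1 MB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()

def verify_backup(path, expected_digest):
    """True when the backup copy matches the recorded digest."""
    return sha256_of(path) == expected_digest
```

Recording the digest at backup time and re-checking it periodically catches silent corruption of the backed-up metadata before it is needed for recovery.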

Page 15:

Number of Copies of Data

The following determine the number of copies of a block:

– Criticality of the data

– Number of concurrent MapReduce jobs that will be executed on the data set

– More replicas allow more jobs to run concurrently on the same data set

– More replicas reduce job execution time, since most often the required data is available locally, or at least in the same rack

Page 16:

– Be aware that a higher number of replicas affects the write performance of the Hadoop cluster.

– Having more replicas only for the most frequently used data provides the maximum benefit, instead of applying a higher replication factor across the whole cluster.
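This last trade-off can be quantified with a small sketch. All sizes and replication factors below are hypothetical: compare raw storage when the whole cluster is replicated at a higher factor versus raising replication only for the hot subset of the data.

```python
# Illustrative storage math (all sizes hypothetical): raising replication
# only for the hot subset of data versus raising it cluster-wide.
def raw_storage_tb(hot_tb, cold_tb, hot_rep, cold_rep):
    return hot_tb * hot_rep + cold_tb * cold_rep

uniform   = raw_storage_tb(hot_tb=50, cold_tb=450, hot_rep=5, cold_rep=5)
selective = raw_storage_tb(hot_tb=50, cold_tb=450, hot_rep=5, cold_rep=3)
print(uniform, selective)  # 2500 1600
```

The hot data still gets the concurrency benefit of 5 replicas, while the cluster avoids the extra raw storage (and write traffic) of replicating everything at the higher factor.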