hdfs-hc2: analysis of data placement strategy based on computing power of nodes on heterogeneous...

Analysis of Data Placement Strategy based on Computing Power of Nodes on Heterogeneous Hadoop Clusters Sanket Reddy Chintapalli Advisor - Dr. Xiao Qin

Upload: xiao-qin

Post on 29-Jun-2015

DESCRIPTION

Hadoop and the term 'Big Data' go hand in hand. The information explosion driven by cloud and distributed computing has created the need to process and analyze massive amounts of data. This processing and analysis adds value to an organization or yields valuable information. The current Hadoop implementation assumes that the computing nodes in a cluster are homogeneous. Hadoop relies on moving computation to the nodes that hold the data rather than migrating data between nodes, which could cause significant network overhead. This strategy has potential benefits in a homogeneous environment, but it might not be suitable in a heterogeneous environment. The time taken to process data on a slower node in a heterogeneous environment might be significantly higher than the sum of the network overhead and the processing time on a faster node. Hence, it is necessary to study a data placement policy that distributes data based on the processing power of each node. The project explores this data placement policy and notes the ramifications of this strategy by running a few benchmark applications.
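
The trade-off described above can be illustrated with a minimal Python sketch. It is not taken from the project: the function name, the parameters, and the simple linear cost model (time = size / rate, with no overlap between transfer and processing) are all assumptions made for illustration.

```python
def remote_is_faster(block_mb, slow_mb_per_s, fast_mb_per_s, network_mb_per_s):
    """Return True if shipping a block to a faster node beats local processing.

    Hypothetical cost model: processing and transfer times scale linearly
    with block size, and the transfer finishes before processing starts.
    """
    # Time to process the block where it already lives (the slower node).
    local_time = block_mb / slow_mb_per_s
    # Network overhead plus processing time on the faster node.
    remote_time = block_mb / network_mb_per_s + block_mb / fast_mb_per_s
    return remote_time < local_time


if __name__ == "__main__":
    # A 128 MB block; the rates below are made-up illustrative values.
    print(remote_is_faster(128, slow_mb_per_s=10, fast_mb_per_s=40,
                           network_mb_per_s=100))
```

Under this model, data locality stops paying off whenever the slower node's processing time exceeds the combined transfer and processing time on the faster node, which is the situation the project sets out to study.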

TRANSCRIPT

Page 1: HDFS-HC2: Analysis of Data Placement Strategy based on Computing Power of Nodes on Heterogeneous Hadoop Clusters

Analysis of Data Placement Strategy based on Computing Power of Nodes on Heterogeneous Hadoop Clusters

Sanket Reddy Chintapalli
Advisor - Dr. Xiao Qin

Page 2

Presentation Overview

● Synopsis
● MapReduce Programming Model Overview
● HDFS Overview
● Motivation
● Design
● Software Description
● Hardware Description
● Results
● Conclusion

Page 3

Synopsis

● Data placement strategy
● Heterogeneous clusters
● Computing power
● Calculating computing ratio
● WordCount and Grep

Page 4

MapReduce Model

● Hadoop 1.0 and Hadoop 2.0
● Master - Slave Model
● JobTracker and TaskTracker (Hadoop 1.0)
● YARN (Hadoop 2.0)
● Resource Manager (YARN)
● Application Manager (YARN)
● Node Manager (YARN)
● MapReduce Flow

Page 5

MapReduce Model

Page 6

MapReduce Model - 1.0

Page 7

MapReduce Model - YARN - 2.0

Page 8

MapReduce Model - Flow

Page 9

HDFS

● Namenode
● Datanode
● Replication
● Federated Namenodes

Page 10

HDFS Architecture

Page 11

HDFS Federated Namenodes

Page 12

HDFS Federated Namenodes

● Scalability
● Performance
● Isolation - overload

Page 13

Motivation

Page 14

Software Description

● Hadoop 2.3.0
● Maven
● Eclipse
● Protocol Buffers

Page 15

Hardware Description

Page 16

Design

Run WordCount and Grep Applications on individual nodes

Page 17

Design

Calculate the computing power of individual nodes for a specific application
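
The slides do not spell out the formula, so the following Python sketch shows only one plausible convention, stated as an assumption: each node runs the same application on the same input, the measured run time is divided by the fastest node's run time to get a ratio, and each node's share of the data is taken to be inversely proportional to its run time. The function names and node names are hypothetical.

```python
def computing_ratios(run_times):
    """Map each node to its run time relative to the fastest node (fastest = 1.0).

    run_times: dict of node name -> measured run time in seconds for one
    application on identical input (an assumed measurement procedure).
    """
    if not run_times:
        raise ValueError("run_times must not be empty")
    fastest = min(run_times.values())
    return {node: t / fastest for node, t in run_times.items()}


def data_shares(run_times):
    """Map each node to the fraction of data it would hold.

    Assumption: a node's share is inversely proportional to its run time,
    so a node that is twice as fast receives twice as much data.
    """
    inverse = {node: 1.0 / t for node, t in run_times.items()}
    total = sum(inverse.values())
    return {node: v / total for node, v in inverse.items()}


if __name__ == "__main__":
    # Made-up run times, not measurements from the project.
    times = {"node-a": 10.0, "node-b": 20.0, "node-c": 30.0}
    print(computing_ratios(times))
    print(data_shares(times))
```

Because run times differ between applications, the ratios would be computed separately for WordCount and for Grep.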

Page 18

Design

● Evaluate Hadoop distribution by running Grep and WordCount together on all nodes
● Run the CRBalancer to balance the nodes
● Finally, re-run the applications to note the ramifications of the data placement strategy
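
The balancing step can be sketched in Python as follows. This is not the CRBalancer implementation, whose internals are not shown in the slides; it is a hypothetical illustration that assumes each node has a target share of the data (summing to 1), converts the shares to whole block counts, and then pairs over-loaded nodes with under-loaded ones. All function and node names are made up.

```python
def target_block_counts(total_blocks, shares):
    """Turn fractional shares into whole block counts that sum to total_blocks."""
    raw = {node: total_blocks * s for node, s in shares.items()}
    counts = {node: int(r) for node, r in raw.items()}
    leftover = total_blocks - sum(counts.values())
    # Give the remaining blocks to the nodes with the largest fractional parts.
    by_fraction = sorted(raw, key=lambda n: raw[n] - counts[n], reverse=True)
    for node in by_fraction[:leftover]:
        counts[node] += 1
    return counts


def plan_moves(current, target):
    """Return (source, destination, blocks) moves that turn current into target.

    Assumes both dicts cover the same set of nodes and the same total.
    """
    surplus = {n: current[n] - target[n] for n in current if current[n] > target[n]}
    deficit = {n: target[n] - current[n] for n in current if target[n] > current[n]}
    moves = []
    for src in sorted(surplus):
        for dst in sorted(deficit):
            if surplus[src] == 0:
                break
            n = min(surplus[src], deficit[dst])
            if n > 0:
                moves.append((src, dst, n))
                surplus[src] -= n
                deficit[dst] -= n
    return moves


if __name__ == "__main__":
    # Evenly spread blocks, with node-a assumed to deserve half of the data.
    current = {"node-a": 4, "node-b": 4, "node-c": 4}
    target = target_block_counts(12, {"node-a": 0.5, "node-b": 0.25, "node-c": 0.25})
    print(target)
    print(plan_moves(current, target))
```

A real balancer would also have to respect replication and rack placement, which this sketch ignores.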

Page 19

Design - Algorithm

CRBalancer Strategy

Page 20

Implementation

● CRBalancer
● CRBalancingPolicy
● CRNamenodeConnector

Page 21

Results - WordCount

Page 22

Results - Grep

Page 23

Questions?