understanding hadoop clusters and the network

Understanding Hadoop Clusters and the Network

Part 1. Introduction and Overview

BRAD HEDLUND .com

Brad Hedlundhttp://bradhedlund.comhttp://www.linkedin.com/in/bradhedlund@bradhedlund

http://bradhedlund.com/

http://www.linkedin.com/in/bradhedlund

Hadoop Server Roles

Data Node &Task Tracker






slaves

masters

Clients

Name NodeJob TrackerSecondary

Name Node

Map Reduce HDFSDistributed Data Analytics Distributed Data Storage

BRAD HEDLUND .com

Hadoop Cluster

Rack 1

DN + TT

DN + TT

DN + TT

DN + TT

Name Node

Rack 2

DN + TT

DN + TT

DN + TT

DN + TT

Job Tracker

Rack 3

DN + TT

DN + TT

DN + TT

DN + TT

Secondary NN

Rack 4

DN + TT

DN + TT

DN + TT

DN + TT

Client

Rack N

DN + TT

DN + TT

DN + TT

DN + TT

DN + TT

DN + TT

World

switch switch switch switch switch

switch switch

BRAD HEDLUND .com

Typical Workflow

• Load data into the cluster (HDFS writes)• Analyze the data (Map Reduce)• Store results in the cluster (HDFS writes)• Read the results from the cluster (HDFS reads)

How many times did our customers type the word “Fraud” into emails sent to customer service?

Sample Scenario:

File.txtHuge file containing all emails sent to customer service

BRAD HEDLUND .com

Writing files to HDFS

• Client consults Name Node• Client writes block directly to one Data Node• Data Nodes replicates block• Cycle repeats for next block

Name Node

Data Node 1 Data Node 5 Data Node 6 Data Node N

Client

I want to write Blocks A,B,C of

File.txt

OK. Write to Data Nodes

1,5,6

Blk A Blk B Blk C

File.txt

Blk A Blk B Blk C

BRAD HEDLUND .com

Hadoop Rack Awareness – Why?Name Node

File.txt=Blk A:DN1, DN5, DN6

Blk B:DN7, DN1, DN2

Blk C:DN5, DN8,DN9

metadata

Rack 1

Data Node 1

Data Node 2

Data Node 3

Data Node 5

switch

A

Rack 5

Data Node 5

Data Node 6

Data Node 7

Data Node 8

switch

Rack 9

Data Node 9

Data Node 10

Data Node 11

Data Node 12

switch

• Never loose all data if entire rack fails• Keep bulky flows in-rack when possible• Assumption that in-rack is higher bandwidth,

lower latency

A

A

BB

B

C C

C

switch

Rack 1:Data Node 1Data Node 2Data Node 3


Rack aware

BRAD HEDLUND .com

switch

switchswitch

Preparing HDFS writes

Name Node

Data Node 1 Data Node 5

Data Node 6

Client

I want to write File.txtBlock A

OK. Write to Data Nodes

1,5,6

Blk A Blk B Blk C

File.txt

Ready Data Nodes

5,6

Read

y!

Ready?

Rack 1 Rack 5

Rack 1:Data Node 1

Rack 5:Data Node 5Data Node 6

ReadyData Node 6

Rack aware

Ready! • Name Node picks two nodes in the same rack, one node in a different rack

• Data protection• Locality for M/R

Ready!

BRAD HEDLUND .com

switch

switchswitch

Pipelined Write

Name Node


Data Node 6

ClientBlk A Blk B Blk C

File.txt

Rack 1 Rack 5

Rack 1:Data Node 1


A A

A

• Data Nodes 1 & 2 pass data along as its received

• TCP 50010

Rack aware

BRAD HEDLUND .com

switch

switchswitch

Pipelined Write

Name Node


Data Node 3

ClientBlk A Blk B Blk C

File.txt

Rack 1 Rack 5

Rack 1:Data Node 1


A A

A

Block received

Success

Success

File.txtBlk A:DN1, DN2, DN3

BRAD HEDLUND .com

Multi-block Replication Pipeline

Data Node 1 Data Node X

Data Node 3

Rack 1

switch

switch

Client

switch

Blk A Blk A

Blk A

Rack 4

Data Node 2

Rack 5

switch

Data Node Y

Data Node Z

Blk B

Blk B

Blk B

Data Node WBlk C

Blk C

Blk C

Blk A Blk B Blk C

File.txt 1TB File =3TB storage3TB network traffic

BRAD HEDLUND .com

Client writes Span the HDFS ClusterClient

Rack N

Data Node

Data Node

Data Node

Data Node

Data Node

Data Node

switch

File.txt

Rack 4

Data Node

Data Node

Data Node

Data Node

Data Node

Data Node

switch

Rack 3

Data Node

Data Node

Data Node

Data Node

Data Node

Data Node

switch

Rack 2

Data Node

Data Node

Data Node

Data Node

Data Node

Data Node

switch

Rack 1

Data Node

Data Node

Data Node

Data Node

Data Node

Data Node

switch

• Block size• File Size

Factors:

More blocks = Wider spread

BRAD HEDLUND .com

Data Node writes span itself, and other racks

Rack N

Data Node

Data Node

Data Node

Data Node

Data Node

Data Node

switch

Rack 4

Data Node

Data Node

Data Node

Data Node

Data Node

Data Node

switch

Rack 3

Data Node

Data Node

Data Node

Data Node

Data Node

Data Node

switch

Rack 2

Data Node

Data Node

Data Node

Data Node

Data Node

Data Node

switch

C

C

B

B

A

A

Rack 1

Data Node

Data Node

Data Node

Data Node

Data Node

switch

CBA

Blk A Blk B Blk C

Results.txt

BRAD HEDLUND .com

Name Node

• Data Node sends Heartbeats• Every 10th heartbeat is a Block report• Name Node builds metadata from Block reports• TCP – every 3 seconds• If Name Node is down, HDFS is down

Name Node

Data Node 1 Data Node 2 Data Node 3 Data Node N

A AA CC

DN1: A,CDN2: A,CDN3: A,C

metadata

File.txt = A,C

C

File systemAwesome!

Thanks.

I’m alive!

I have blocks:

A, C

BRAD HEDLUND .com

Re-replicating missing replicas

Name Node

Data Node 1 Data Node 2 Data Node 3 Data Node 8

A AA CC

DN1: A,CDN2: B,CDN3: A, B

metadataRack1: DN1, DN2Rack5: DN3, Rack9: DN8

C

Rack AwarenessUh Oh!Missing replicas

Copy blocks A,C to Node 8

A C

• Missing Heartbeats signify lost Nodes• Name Node consults metadata, finds affected data• Name Node consults Rack Awareness script• Name Node tells a Data Node to re-replicate

BRAD HEDLUND .com

Secondary Name Node

• Not a hot standby for the Name Node• Connects to Name Node every hour*• Housekeeping, backup of Name Node metadata• Saved metadata can rebuild a failed Name Node

Name Node

metadata

File.txt = A,C

File system

Secondary Name Node

Its been an hour, give me your

metadata

BRAD HEDLUND .com

Client reading files from HDFS

Name NodeClient

Tell me the block locations of Results.txt

Blk A = 1,5,6Blk B = 8,1,2Blk C = 5,8,9

Results.txt=Blk A:DN1, DN5, DN6

Blk B:DN7, DN1, DN2

Blk C:DN5, DN8,DN9

metadata

Rack 1

Data Node 1

Data Node 2

Data Node

Data Node

switch

A

Rack 5

Data Node 5

Data Node 6

Data Node

Data Node

switch

Rack 9

Data Node 8

Data Node 9

Data Node

Data Node

switch

• Client receives Data Node list for each block• Client picks first Data Node for each block• Client reads blocks sequentially

A

A

BB

B

C C

C

BRAD HEDLUND .com

Data Node reading files from HDFS

Name Node

Block A = 1,5,6

File.txt=Blk A:DN1, DN5, DN6

Blk B:DN7, DN1, DN2

Blk C:DN5, DN8,DN9

metadata

Rack 1

Data Node 1

Data Node 2

Data Node 3

Data Node

switch

A

Rack 5

Data Node 5

Data Node 6

Data Node

Data Node

switch

Rack 9

Data Node 8

Data Node 9

Data Node

Data Node

switch

• Name Node provides rack local Nodes first• Leverage in-rack bandwidth, single hop

A

A

BB

B

C C

C

Tell me the locations of Block A of

File.txt switch


Rack 5:Data Node 5

Rack aware

BRAD HEDLUND .com

Data Processing: Map

• Map: “Run this computation on your local data”• Job Tracker delivers Java code to Nodes with local data

Map Task Map Task Map Task

A B C

Client

How many times does

“Fraud” appear in File.txt?

Count “Fraud”

in Block C

File.txt

Fraud = 3 Fraud = 0 Fraud = 11

Job TrackerName Node

Data Node 1 Data Node 5 Data Node 9

BRAD HEDLUND .com

switchswitchswitch

What if data isn’t local?

• Job Tracker tries to select Node in same rack as data• Name Node rack awareness

“I need block A”Map Task Map Task

B C

Client

How many times does

“Fraud” appear in File.txt?

Count “Fraud”

in Block C

Fraud = 0 Fraud = 11

Job TrackerName Node

Data Node 1Data Node 5 Data Node 9

“no Map tasks left”A

Data Node 2

Rack 1 Rack 5 Rack 9

BRAD HEDLUND .com

Data Processing: Reduce

• Reduce: “Run this computation across Map results”• Map Tasks deliver output data over the network • Reduce Task data output written to and read from HDFS

Fraud = 3Fraud = 0

Fraud = 11

Job Tracker

Reduce Task

Sum “Fraud”

Results.txtFraud = 14

Map Task Map Task Map Task

A B C

Client

HDFS

X Y Z

Data Node 1 Data Node 5 Data Node 9

Data Node 3

BRAD HEDLUND .com

Unbalanced Cluster

New Rack

Data Node

Data Node

Data Node

Data Node

Data Node

Data Node

switch

Rack 2

Data Node

Data Node

Data Node

Data Node

Data Node

Data Node

switch

Rack 1

Data Node

Data Node

Data Node

Data Node

Data Node

Data Node

switch NEW

switch

New Rack

Data Node

Data Node

Data Node

Data Node

Data Node

Data Node

switch NEW

File.txt

• Hadoop prefers local processing if possible• New servers underutilized for Map Reduce, HDFS*• Might see more network bandwidth, slower job times**

**I was assigned a Map Task but don’t have the block. Guess I need to get it.

*I’m bored!

BRAD HEDLUND .com

Cluster Balancing

New Rack

Data Node

Data Node

Data Node

Data Node

Data Node

Data Node

switch

Rack 2

Data Node

Data Node

Data Node

Data Node

Data Node

Data Node

switch

Rack 1

Data Node

Data Node

Data Node

Data Node

Data Node

Data Node

switch NEW

switch

New Rack

Data Node

Data Node

Data Node

Data Node

Data Node

Data Node

switch NEW

File.txt

• Balancer utility (if used) runs in the background• Does not interfere with Map Reduce or HDFS• Default speed limit 1 MB/s

brad@cloudera-1:~$hadoop balancer

BRAD HEDLUND .com

BRAD HEDLUND .com

Thanks!

Narrated at:http://bradhedlund.com/?p=3108

http://bradhedlund.com/?p=3108

understanding hadoop clusters and the network

Technology

data node xdata node

data node naaacccdata

data node writes

aadata node

mrreadydata node

node picks

baabccdata node

bcadata node