
Page 1

Hadoop Summit 2012 - Validated Network Architecture and Reference Deployment in Enterprise

Nimish Desai – [email protected]

Technical Leader, Data Center Group

Cisco Systems Inc.

Page 2

Goal 1: Provide Reference Network Architecture for Hadoop in Enterprise

Goal 2: Characterize Hadoop Application on Network

Goal 3: Network Validation Results with Hadoop Workload

Session Objectives & Takeaways

Page 3

Big Data in Enterprise

Page 4

Validated 96 Node Hadoop Cluster

Network

Three Racks each with 32 nodes

Distribution Layer – Nexus 7000 or Nexus 5000

ToR – FEX or Nexus 3000

2 FEX per Rack

Each Rack with either 32 single or dual attached hosts

Hadoop Framework

Apache 0.20.2

Linux 6.2

Slots – 10 Maps & 2 Reducers per node

Compute – UCS C200 M2

Cores: 12; Processor: 2 x Intel(R) Xeon(R) CPU X5670 @ 2.93GHz; Disk: 4 x 2TB (7.2K RPM); Network: 1G: LOM, 10G: Cisco UCS P81E
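The slot settings above correspond to per-TaskTracker limits in mapred-site.xml; a minimal sketch, assuming stock Apache Hadoop 0.20.x property names (placed inside the <configuration> element):

<property><name>mapred.tasktracker.map.tasks.maximum</name><value>10</value></property>
<property><name>mapred.tasktracker.reduce.tasks.maximum</name><value>2</value></property>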

Name Node Cisco UCS C200

Single NIC

2248TP-E

Nexus 5548 Nexus 5548

Data Nodes 1–48: Cisco UCS C200, Single NIC

…Data Nodes 49–96: Cisco UCS C200, Single NIC

Traditional DC Design Nexus 55xx/2248

2248TP-E

Name Node Cisco UCS C 200

Single NIC

Nexus 7000 Nexus 7000

Data Nodes 1–48: Cisco UCS C200, Single NIC

…Data Nodes 49–96: Cisco UCS C200, Single NIC

Nexus 3000 / Nexus 3000

Nexus 7K-N3K based Topology

Page 5

[Diagram: Data Center Infrastructure – Core Layer (LAN & SAN): Nexus 7000 10GE Core; Aggregation & Services Layer: Nexus 7000 10GE Aggregation, network services, vPC+ and FabricPath; WAN Edge Layer; Access Layer options: Nexus 5500 10GE with Nexus 2148TP-E (bare metal), CBS 31xx blade switch, Nexus 7000 End-of-Row, Nexus 5500 FCoE with Nexus 2232 Top-of-Rack, UCS FCoE, Nexus 3000 Top-of-Rack (bare metal 1G), B22 FEX for HP blade C-class; SAN Edge: MDS 9200/9100; MDS 9500 SAN Director; FC SAN A / FC SAN B. Link legend: Layer 3, Layer 2 1GE, Layer 2 10GE, 10GE DCB, 10GE FCoE/DCB, 4/8Gb FC. Server access: 1GbE with 4/8Gb FC via dual HBA (SAN A // SAN B), or 10Gb DCB/FCoE, or 10GbE with 4/8Gb FC via dual HBA (SAN A // SAN B).]

Page 6

Big Data Application Realm – Web 2.0 & Social/Community Networks

Data lives and dies in Internet-only entities

Data Domain Partially private

Homogeneous Data Life Cycle

Mostly Unstructured

Web Centric, User Driven

Unified workload – few processes & owners

Typically non-virtualized

Scaling & Integration Dynamics

Purpose Driven Apps

Thousands of nodes

Hundreds of PB and growing exponentially


Page 7

Big Data Application Realm – Enterprise
Data lives in a confined zone of the enterprise repository

Long Lived, Regulatory and Compliance Driven

Heterogeneous Data Life Cycle

Many Data Models

Diverse data – Structured and Unstructured

Diverse data sources - Subscriber based

Diverse workload from many sources/groups/process/technology

Virtualized and non-virtualized with mostly SAN/NAS base

[Diagram: enterprise application silos – Customer DB (Oracle/SAP), Social Media, ERP Module A, ERP Module B, Data Service, Sales Pipeline, Call Center, Product Catalog, Catalog Data, Video Conf, Collab/Office Apps, Records Mgmt, Doc Mgmt A, Doc Mgmt B, VoIP, Exec Reports]

Scaling & Integration Dynamics are different
Data Warehousing (structured) with diverse repositories + unstructured data

Few hundred to a thousand nodes, few PB

Integration, Policy & Security Challenges

Each App/Group/Technology is limited in data generation, consumption, and servicing confined domains

Page 8

Data Sources


Enterprise Application

Sales, Products, Process, Inventory, Finance, Payroll, Shipping, Tracking, Authorization, Customers, Profile

Machine logs, Sensor data, Call data records, Web click stream data, Satellite feeds, GPS data, Sales data, Blogs, Emails, Pictures, Video

[Diagram: Big Data landscape – Transactional / Business Intelligence / Analytics over transactional data; NoSQL DBs (HBase, Oracle NoSQL DB, Cassandra, MongoDB, CouchDB, Redis, Membase, Neo4j), Hadoop MapReduce, MapR MapReduce, GP MapReduce, row store / column store]

Page 9

Big Data Building Blocks into the Enterprise

Traditional Database – RDBMS
Storage – SAN and NAS
“Big Data” – Store and Analyze
“Big Data” – Real-Time Capture, Read and Update Operations – NoSQL

[Diagram: applications (virtualized, bare metal and cloud) and data sources – sensor data, logs, social media, click streams, mobility trends, event data – feeding Big Data over the Cisco Unified Fabric]

Page 10

Hadoop Cluster Design & Reference Network Architecture

Page 11

Characteristics that Affect Hadoop Clusters

Cluster Size

Number of Data Nodes

Data Model & Mapper/Reducer Ratio

MapReduce functions

Input Data Size

Total starting dataset

Data Locality in HDFS

Ability to process data where it is already located

Background Activity

Number of Jobs running

Type of jobs (importing, exporting)


Characteristics of Data Node

‒ I/O, CPU, Memory, etc.

Networking Characteristics

‒ Availability

‒ Buffering

‒ Data Node Speed (1G vs. 10G)

‒ Oversubscription

‒ Latency

http://www.cloudera.com/resource/hadoop-world-2011-presentation-video-hadoop-network-and-compute-architecture-considerations/

Page 12


The Data Ingest & Replication

External Connectivity

East West Traffic (Replication of data blocks)

Map Phase – raw data is analyzed and converted to name/value pairs

Workload translates into multiple batches of Map tasks

Reducers can start the Reduce phase ONLY after the entire Map set is complete

Mostly an I/O/compute function

Hadoop Components and Operations

[Diagram: MapReduce data flow – unstructured data in the Hadoop Distributed File System → Map tasks → Shuffle phase (name/value pairs grouped by key) → Reduce tasks → result/output]

Page 13


Shuffle Phase – all name/value pairs are sorted and grouped by their keys

Mappers send the data to Reducers

High Network Activity

Reduce Phase – all values associated with a key are processed for results, in three phases:

Copy – get intermediate results from each data node's local disk

Merge – to reduce the number of files

Reduce method

Output Replication Phase – Reducers replicate results to multiple nodes

Highest Network Activity

Network Activities Dependent on Workload Behavior

Hadoop Components and Operations


Page 14

MapReduce Data Model – ETL & BI Workload Benchmark

The complexity of the functions used in Map and/or Reduce has a large impact on the job completion time and network traffic.

• Data set size varies by phase, with varying impact on the network – e.g. 1TB input, 10MB shuffle, 1MB output
• Most of the processing is in the Map functions, with smaller intermediate and even smaller final data

Yahoo TeraSort – ETL Workload – Most Network Intensive
• Input, shuffle and output data sizes are the same – e.g. a 10 TB data set in all phases
• Yahoo TeraSort has more balanced Map vs. Reduce functions – linear compute and I/O

Shakespeare WordCount – BI Workload

[Timeline charts annotated: Map Start, Reducers Start, Map Finish, Job Finish]
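For reference, the ETL-style benchmark can be reproduced with the stock examples jar; a minimal sketch, assuming Apache Hadoop 0.20.2 and illustrative HDFS paths (TeraGen takes the number of 100-byte rows, so 10 billion rows = ~1 TB):

hadoop jar hadoop-0.20.2-examples.jar teragen 10000000000 /terasort/input
hadoop jar hadoop-0.20.2-examples.jar terasort /terasort/input /terasort/output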

Page 15

ETL Workload (1TB Yahoo Terasort)


Network Graph of all Traffic Received on a Single Node (80 Node Run)

[Chart annotations: Maps Start, Reducers Start, Maps Finish, Job Complete]

These symbols represent a node sending traffic to HPC064

Shortly after the Reducers start, Map tasks are finishing and data is being shuffled to Reducers. As Maps completely finish, the network is no longer used, since the Reducers have all the data they need to finish the job.

The red line is the total amount of traffic received by hpc064

Page 16

Output Data Replication Enabled
Replication of 3 enabled (1 copy stored locally, 2 stored remotely); each reduce output is now replicated instead of just stored locally.

If output replication is enabled, then at the end of the TeraSort additional copies must be stored: for a 1TB sort, 2TB will need to be replicated across the network.

ETL Workload (1TB Yahoo Terasort)

Network Activity of all Traffic Received on a Single Node (80 Node Run)
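The validated runs replicated output as the reducers wrote it; an after-the-fact equivalent that triggers the same cross-network copying is the standard setrep command (path illustrative):

hadoop fs -setrep -w 3 /terasort/output

With a replication factor of 3, each output block gains two remote copies, which is why a 1TB sort pushes roughly 2TB across the network.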

Page 17

Wordcount on 200K Copies of complete works of Shakespeare

[Chart annotations: Maps Start, Reducers Start, Maps Finish, Job Complete]

The red line is the total amount of traffic received by hpc064

These symbols represent a node sending traffic to HPC064

Due to the combination of the length of the Map phase and the reduced data set being shuffled, the network is utilized throughout the job, but by a limited amount.

BI Workload

Network Graph of all Traffic Received on a Single Node (80 Node Run)
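The BI-style run corresponds to the stock WordCount example; a minimal sketch, again assuming the 0.20.2 examples jar and illustrative paths:

hadoop jar hadoop-0.20.2-examples.jar wordcount /shakespeare/input /shakespeare/output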

Page 18

Data Locality in HDFS


Data Locality – The ability to process data where it is locally stored.

Note: During the Map phase, the JobTracker attempts to use data locality to schedule map tasks where the data is locally stored. This is not perfect and depends on which data nodes hold the data. This is a consideration when choosing the replication factor: more replicas tend to create a higher probability of data locality.

[Chart annotations: Maps Start, Reducers Start, Maps Finish, Job Complete]

Observations: Notice the initial spike in RX traffic comes before the Reducers kick in.

It represents data each map task needs that is not local.

Looking at the spike, it is mainly data from only a few nodes.

Map Tasks: Initial spike for non-local data. Sometimes a task may be scheduled on a node that does not have the data available locally.
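Because more replicas raise the probability of scheduling a map task local to its data, the replication factor is worth tuning; a minimal sketch, assuming stock property names – dfs.replication in hdfs-site.xml (inside <configuration>), plus fsck to inspect where replicas actually landed:

<property><name>dfs.replication</name><value>3</value></property>

hadoop fsck /terasort/input -files -blocks -locations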

Page 19

1 TB file with 128 MB Blocks == 7,813 Map Tasks

The job completion time is directly related to the number of reducers

Average network buffer usage falls as the number of reducers falls (see hidden slides), and vice versa.

Map to Reducer Ratio Impact on Job Completion

[Charts: job completion time (sec) vs. number of reducers – 192, 96, 48, 24, 12, 6]
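A sketch of the sweep behind these charts, assuming the same 0.20.2 examples jar (mapred.reduce.tasks is the 0.20-era parameter; paths illustrative):

for r in 192 96 48 24 12 6; do
  hadoop jar hadoop-0.20.2-examples.jar terasort -Dmapred.reduce.tasks=$r /terasort/input /terasort/output-$r
done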

Page 20

Job Completion Time with 96 Reducers


Page 21

Job Completion Time with 48 Reducers


Page 22

Job Completion Graph with 24 Reducers


Page 23

Network Characteristics

The relative impact of various network characteristics on Hadoop clusters*

* Not scaled or measured data

[Chart: relative impact (not to scale) – Availability, Buffering, Oversubscription, Data Node Speed, Latency]

Page 24

Validated Network Reference Architecture

Page 25

[Diagram: Data Center Access Connectivity – Core/Distribution: Nexus 7000 (LAN) and MDS 9000 (SAN); Unified Access Layer: Nexus 5000. Access options: Nexus 2000 for 1GE rack mount servers; Nexus 2000 for 10GE rack mount servers; Nexus 4000 10GE blade switch w/ FCoE (IBM/Dell); Nexus 2000 for 1 & 10GE blade servers w/ pass-thru; direct-attach 10GE rack mount servers; Cisco UCS compute blade & rack; Nexus 1000V.]

Page 26

Network Attributes: Architecture, Availability, Capacity, Scale & Oversubscription, Flexibility, Management & Visibility

Network Reference Architecture

[Diagram: blade-enclosure Edge/Access Layer; Nexus LAN and SAN Core: Optimized for the Data Centre]

Page 27

Scaling the Data Centre Fabric – Changing the Device Paradigm

De-Coupling of the Layer 1 and Layer 2 Topologies

Simplified Management Model, plug and play provisioning, centralized configuration

Line Card Portability (N2K supported with Multiple Parent Switches – N5K, 6100, N7K)

Unified access for any server (100M / 1GE / 10GE / FCoE): scalable Ethernet, HPC, unified fabric or virtualization deployment

Virtualized Switch


Page 28

Integration with Enterprise architecture – essential pathway for data flow

Integration

Consistency

Management

Risk-assurance

Enterprise grade features

Consistent Operational Model

NxOS, CLI, Fault Behavior and Management

Higher east-west bandwidth compared to traditional transactional networks

Over time it will have multi-user, multi-workload behavior

Needs enterprise-centric features

Security, SLA, QoS, etc.

Hadoop Network Topologies – Reference: Unified Fabric & ToR DC Design

1Gbps Attached Server

Nexus 7000/5000 with 2248TP-E

Nexus 7000 and 3048

NIC Teaming - 1Gbps Attached

Nexus 7000/5000 with 2248TP-E

Nexus 7000 and 3048

10 Gbps Attached Server

Nexus 7000/5000 with 2232PP

Nexus 7000 and 3064

NIC Teaming – 10 Gbps Attached Server

Nexus 7000/5000 with 2232PP

Nexus 7000 & 3064

Page 29

Validated Reference Network Topology

Network

Three Racks each with 32 nodes

Distribution Layer – Nexus 7000 or Nexus 5000

ToR – FEX or Nexus 3000

2 FEX per Rack

Each Rack with either 32 single or dual attached hosts

Hadoop Framework

Apache 0.20.2

Linux 6.2

Slots – 10 Maps & 2 Reducers per node

Compute – UCS C200 M2

Cores: 12; Processor: 2 x Intel(R) Xeon(R) CPU X5670 @ 2.93GHz; Disk: 4 x 2TB (7.2K RPM); Network: 1G: LOM, 10G: Cisco UCS P81E

Name Node Cisco UCS C200

Single NIC

2248TP-E

Nexus 5548 Nexus 5548

Data Nodes 1–48: Cisco UCS C200, Single NIC

…Data Nodes 49–96: Cisco UCS C200, Single NIC

Traditional DC Design Nexus 55xx/2248

2248TP-E

Name Node Cisco UCS C 200

Single NIC

Nexus 7000 Nexus 7000

Data Nodes 1–48: Cisco UCS C200, Single NIC

…Data Nodes 49–96: Cisco UCS C200, Single NIC

Nexus 3000 / Nexus 3000

Nexus 7K-N3K based Topology

Page 30

Network Attributes: Architecture, Availability, Capacity, Scale & Oversubscription, Flexibility, Management & Visibility

Network Reference Architecture Characteristics

[Diagram: blade-enclosure Edge/Access Layer; Nexus LAN and SAN Core: Optimized for the Data Centre]

Page 31

The Core High Availability Design Principles are common across all Network Systems Designs

Understand the causes of network outages

Component Failures

Network Anomalies

Understand the Engineering foundations of systems level availability

Device and Network level MTBF

Understanding Hierarchical and Modular Design

Understand the HW and SW interaction in the system

Enhanced vPC allows such topologies and is ideally suited for Big Data applications

In an Enhanced vPC (EvPC) configuration, any and all server NIC teaming configurations are supported on any port

High Availability Switching Design – Common High Availability Engineering Principles

System High Availability is a function of topology and component-level High Availability

[Diagram: NIC teaming options – Single NIC, Dual NIC Active/Standby, Dual NIC 802.3ad – across dual-node L3, dual-node L2, full-mesh and dual-node ToR topologies]

Page 32

Availability with Single Attached Server – 1G or 10G

Important to evaluate the overall availability of the system.

Network failures can span many nodes in the system causing rebalancing and decreased overall resources.

Typically multi-TB of data transfer occurs for a single ToR or FEX failure

Load sharing, ease of management and consistent SLA are important to enterprise operations

Failure Domain Impact on Job Completion

A 1 TB TeraSort typically takes ~4.20–4.30 minutes

A failure of a SINGLE NODE (either NIC or server component) results in roughly a doubling of the job completion time

Key observation is that the failure impact is dependent on type of workload being run on the cluster

Short-lived interactive vs. short-lived batch

Long jobs – ETL, normalization, joins

[Diagram: Single NIC, 32 per ToR]

Page 33

No single point of failure from the network viewpoint; no impact on job completion time

NIC bonding configured in Linux – with LACP (802.3ad) bonding mode

Effective load-sharing of traffic flows across the two NICs

Recommended to change the hashing to src-dst-ip-port (on both the network and the NIC bonding in Linux) for optimal load-sharing

Availability Single Attached vs. Dual Attached Node

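A minimal sketch of the recommended bonding setup, assuming RHEL 6-style module options; interface names and the exact NX-OS load-balance keyword vary by platform and release:

# /etc/modprobe.d/bonding.conf – LACP (802.3ad) bond with layer3+4 (src-dst-ip-port) hashing
options bond0 mode=802.3ad miimon=100 xmit_hash_policy=layer3+4

! NX-OS switch side – hash port-channel members on IP addresses and L4 ports
port-channel load-balance ethernet source-dest-port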

Page 34

Failure of various components

Failures introduced at 33%, 66% and 99% of reducer completion

A single-attached server (NIC) or rack failure has a bigger impact on job completion time than any other failure

FEX failure is a rack failure in the 1G topology

Job Completion Time with Various Failures

Availability Network Failure Results – 1TB TeraSort – ETL


Job completion time (seconds) by failure point:

Failure Point               1G Single Attached    2G Dual Attached
Peer Link (5000)            301                   258
FEX *                       1137                  259
Rack *                      1137                  1017
1 Port – Single Attached    see previous slide    see previous slide
1 Port – Dual Attached      see previous slide    see previous slide

[Diagram: three racks, each with FEX/ToR A and FEX/ToR B]

96 Nodes

2 FEX per Rack

* Variance in run time depends on the % of reducers completed

Page 35

Network Attributes: Architecture, Availability, Capacity, Scale & Oversubscription, Flexibility, Management & Visibility

Network Reference Architecture Characteristics

Page 36

Hadoop is a parallel, batch-job-oriented framework

The primary benefit of Hadoop is the reduction in job completion time for work that would otherwise take longer with traditional techniques, e.g. large ETL, log analysis, join-only Map jobs, etc.

Typically more oversubscription occurs with 10G server access than with 1G

A non-blocking network is NOT needed; however, the degree of oversubscription matters for:

Job Completion Time

Replication of Results

Oversubscription during rack or FEX failure

Static vs. actual oversubscription

How much data a single node can push is often I/O-bound, a function of the number and configuration of disks

Oversubscription Design

Uplinks   Theoretical Oversubscription (16 Servers)   Measured
8         2:1                                         next slides
4         4:1                                         next slides
2         8:1                                         next slides
1         16:1                                        next slides
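As a worked example of the theoretical column, assuming uplinks of the same speed as the server ports: 16 servers x 1 Gb/s = 16 Gb/s of host-facing bandwidth, so 8 uplinks give 16:8 = 2:1 and a single uplink gives 16:1.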

Page 37

Steady state

Result Replication with 1, 2, 4 & 8 uplinks

Rack Failure with 1, 2, 4 & 8 uplinks

Network Oversubscription

Page 38

Data Node Speed Differences – 1G vs. 10G TCPDUMP of Reducers TX

• Generally 1G is used largely due to cost/performance trade-offs, though 10GE can provide benefits depending on the workload
• Reduced spikes with 10G and smoother job completion time
• Multiple 1G or 10G links can be bonded together to not only increase bandwidth but also increase resiliency
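A sketch of this kind of per-node capture – here filtering on the shuffle service, whose 0.20-era TaskTracker HTTP default port is 50060 (interface name illustrative):

tcpdump -i eth0 -s 96 -w reducer-shuffle.pcap 'tcp port 50060'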

Page 39

[Chart: buffer cell usage vs. job completion – series: 1G Buffer Used, 10G Buffer Used, 1G Map %, 1G Reduce %, 10G Map %, 10G Reduce %]

1GE vs. 10GE Buffer Usage


Moving from 1GE to 10GE actually lowers the buffer requirement at the switching layer.

By moving to 10GE, the data node has a wider pipe to receive data, lessening the need for buffers in the network, since the total aggregate transfer rate and amount of data do not increase substantially. This is due, in part, to limits of I/O and compute capabilities.

Page 40

Network Attributes: Architecture, Availability, Capacity, Scale & Oversubscription, Flexibility, Management & Visibility

Network Reference Architecture Characteristics

[Diagram: blade-enclosure Edge/Access Layer; Nexus LAN and SAN Core: Optimized for the Data Centre]

Page 41

Multi-use Cluster Characteristics

Hadoop clusters are generally multi-use. Background use can affect any single job's completion.

Example View of 24 Hour Cluster Use

Large ETL job overlapping with medium and small ETL jobs and many small BI jobs (blue lines are ETL jobs, purple lines are BI jobs)

Importing Data into HDFS

A given cluster runs many different types of jobs, imports into HDFS, etc.

Page 42

100 Jobs, Each with a 10GB Data Set – Stable, Node & Rack Failure

• Almost all jobs are impacted by a single node failure
• With multiple jobs running concurrently, node failure impact is as significant as rack failure

Page 43

Network Attributes: Architecture, Availability, Capacity, Scale & Oversubscription, Flexibility, Management & Visibility

Network Reference Architecture Characteristics

[Diagram: blade-enclosure Edge/Access Layer; Nexus LAN and SAN Core: Optimized for the Data Centre]

Page 44

Burst Handling and Queue Depth


• Several HDFS operations and phases of MapReduce jobs are very bursty in nature

• The extent of bursts largely depends on the type of job (ETL vs. BI)

• Bursty phases can include replication of data (either importing into HDFS or output replication) and the output of the mappers during the shuffle phase.

A network that cannot handle bursts effectively will drop packets, so optimal buffering is needed in network devices to absorb bursts.

Optimal Buffering
• Given a large enough incast, TCP will collapse at some point no matter how large the buffer
• Well studied by multiple universities
• Alternate solutions (changing TCP behavior) have been proposed rather than huge-buffer switches

http://simula.stanford.edu/sedcl/files/dctcp-final.pdf

Page 45

Nexus 2248TP-E Buffer Monitoring
• Nexus 2248TP-E utilizes a 32MB shared buffer to handle larger traffic bursts
• Hadoop, NAS and AVID are examples of bursty applications
• You can control the queue limit for a specified Fabric Extender for egress (network to host) or ingress (host to network)

Extensive Drop Counters
• Drop counters for both directions (network-to-host and host-to-network) on a per-host-interface basis
• Drop counters for different reasons: out-of-buffer drop, no-credit drop, queue-limit drop (tail drop), MAC error drop, truncation drop, multicast drop

Buffer Occupancy Counter
• How much buffer is being used – one key indicator of congestion or bursty traffic


N5548-L3(config-fex)# hardware N2248TPE queue-limit 4000000 rx
N5548-L3(config-fex)# hardware N2248TPE queue-limit 4194304 tx

fex-110# show platform software qosctrl asic 0 0
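In the commands above, the queue-limit values are in bytes (4194304 = 4 MB), set per direction for the FEX; the show command reads the FEX ASIC's buffer occupancy and drop counters (output format varies by NX-OS release).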

Page 46

TeraSort FEX(2248TP-E) Buffer Analysis (10TB)

[Chart: 2248TP-E buffer cell usage vs. job completion – series: FEX #1, FEX #2, Map %, Reduce %]

The buffer utilization is highest during the shuffle and output replication phases.

Optimized buffer sizes are required to avoid packet loss that leads to slower job completion times.

Buffer Usage During Shuffle Phase

Buffer Usage During Output Replication

Page 47

TeraSort (ETL) N3K Buffer Analysis (10TB)

[Chart: buffer cell usage vs. job completion – series: N3048 #1, N3048 #2, N3064, Map %, Reduce %]

• The buffer utilization is highest during the shuffle and output replication phases.
• Optimized buffer sizes are required to avoid packet loss that leads to slower job completion times.

The aggregation switch buffer remained flat as the bursts were absorbed at the Top-of-Rack layer

Buffer Usage During Shuffle Phase

Buffer Usage During Output Replication

Page 48

Network Latency

Generally, network latency – while consistent latency is important – does not represent a significant factor for Hadoop clusters.

Note: There is a difference between network latency and application latency. Optimization in the application stack can decrease application latency, which can potentially have a significant benefit.

[Chart: completion time (sec) vs. data set size (1TB, 5TB, 10TB) for an 80-node cluster – N3K topology vs. 5K/2K topology]

Page 49

Summary

Extensive validation of Hadoop workloads

Reference Architecture – make it easy for the Enterprise

Demystify Network for Hadoop Deployment

Integration with the Enterprise via efficient choices of network topologies/devices

10G and/or dual-attached servers provide consistent job completion times & better buffer utilization

10G reduces bursts at the access layer

A single-attached node failure has a considerable impact on job completion time

Dual-attached servers are the recommended design – 1G or 10G; 10G for future-proofing

Rack failure has the biggest impact on job completion time

Does not require a non-blocking network

Degree of oversubscription does impact job completion time

Latency does not matter much for Hadoop workloads

Page 50

Big Data @ Cisco


Cisco.com Big Data: www.cisco.com/go/bigdata

Certifications and Solutions with UCS C-Series and Nexus 5500+22xx

• EMC Greenplum MR Solution
• Cloudera Hadoop Certified Technology
• Cloudera Hadoop Solution Brief
• Oracle NoSQL Validated Solution
• Oracle NoSQL Solution Brief

Multi-month network and compute analysis testing

(In conjunction with Cloudera)

• Network/Compute Considerations Whitepaper
• Presented Analysis at Hadoop World

128 Node/1PB test cluster