Big Data/Hadoop Infrastructure Considerations



DESCRIPTION

Planning big-data infrastructure, implications for

TRANSCRIPT

Page 1: Big Data/Hadoop Infrastructure Considerations

© 2009 VMware Inc. All rights reserved

Big Data in a Software-Defined Datacenter

Richard McDougall

Chief Architect, Storage and Application Services

VMware, Inc

@richardmcdougll

Page 2: Big Data/Hadoop Infrastructure Considerations

2

Trend: New Data Growing at 60% Y/Y

Source: The Information Explosion, 2009

Data categories driving growth: medical imaging, sensors; CAD/CAM, appliances, machine data, digital movies; digital photos; digital TV; audio; camera phones, RFID; satellite images, logs, scanners, Twitter.

Exabytes of information stored: 20 zettabytes by 2015, 1 yottabyte by 2030. Yes, you are part of the yotta generation…

Page 3: Big Data/Hadoop Infrastructure Considerations

3

Trend: Big Data – Driven by Real-World Benefit

Page 4: Big Data/Hadoop Infrastructure Considerations

4

Big Data Family of Frameworks

[Diagram: a compute layer running Hadoop batch analysis, HBase real-time queries, NoSQL (Cassandra, Mongo, etc.), Big SQL (Impala, Pivotal HawQ), and others (Spark, Shark, Solr, Platfora, etc.) over Distributed Resource Management, backed by a File System/Data Store spanning the hosts.]

Page 5: Big Data/Hadoop Infrastructure Considerations

5

Big (Data) problems: making management easier

Page 6: Big Data/Hadoop Infrastructure Considerations

6

Broad Application of Hadoop Technology

Horizontal use cases: log processing / click stream analytics; machine learning / sophisticated data mining; web crawling / text processing; Extract Transform Load (ETL) replacement; image / XML message processing; general archiving / compliance.

Vertical use cases: financial services; mobile / telecom; internet retail; scientific research; pharmaceutical / drug discovery; social media.

Hadoop’s ability to handle large unstructured data affordably and efficiently makes it a valuable toolkit for enterprises across a number of applications and fields.

Page 7: Big Data/Hadoop Infrastructure Considerations

7

How does Hadoop enable parallel processing?

Source: http://architects.dzone.com/articles/how-hadoop-mapreduce-works

!  A framework for distributed processing of large data sets across clusters of computers using a simple programming model.
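The simple programming model can be illustrated with a toy word count. This is plain Python mimicking the map, shuffle, and reduce phases of the framework, not Hadoop API code:

```python
from collections import defaultdict

def map_phase(records):
    # map: emit a (key, 1) pair for every word in every input record
    for line in records:
        for word in line.split():
            yield word, 1

def shuffle(pairs):
    # shuffle: group all emitted values by key, as the framework
    # does between the map and reduce phases
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # reduce: aggregate each key's list of values
    return {key: sum(values) for key, values in groups.items()}

counts = reduce_phase(shuffle(map_phase(["big data", "big cluster"])))
# {'big': 2, 'data': 1, 'cluster': 1}
```

In Hadoop the same three stages run in parallel: map tasks process splits on many hosts at once, and the shuffle moves intermediate pairs across the network to the reducers.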

Page 8: Big Data/Hadoop Infrastructure Considerations

8

Hadoop System Architecture

! MapReduce: Programming framework for highly parallel data processing

!  Hadoop Distributed File System (HDFS): Distributed data storage

! Other Distributed Storage Options: •  Alternatives to HDFS •  MapR (software), Isilon (hardware)

Page 9: Big Data/Hadoop Infrastructure Considerations

9

Job Tracker Schedules Tasks Where the Data Resides

[Diagram: a Job is submitted to the Job Tracker; an input file is divided into Split 1, Split 2, and Split 3 (64 MB each); each split is scheduled as Task 1, 2, or 3 on the Task Tracker of Host 1, 2, or 3, where the local DataNode holds the corresponding Block 1, 2, or 3 (64 MB each).]
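The placement logic above can be sketched in a few lines. The helper name and structures are hypothetical, not Hadoop internals: prefer a host that already stores a replica of the split's block, and fall back to a remote read only when no data-local slot is free.

```python
def schedule_task(replica_hosts, free_slots):
    """Pick a host for a task given the hosts holding the split's block
    replicas and a map of host -> free task slots. Hypothetical sketch."""
    # Prefer a data-local host: one that stores a replica of the block
    for host in replica_hosts:
        if free_slots.get(host, 0) > 0:
            return host, "data-local"
    # Otherwise run anywhere with a free slot and read the block remotely
    for host, slots in free_slots.items():
        if slots > 0:
            return host, "remote"
    return None, "wait"

# Split 1's block lives on host1 and host2; host1 has a free slot
host, placement = schedule_task(["host1", "host2"], {"host1": 1, "host3": 2})
# -> ("host1", "data-local")
```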

Page 10: Big Data/Hadoop Infrastructure Considerations

10

Hadoop Data Locality and Replication

Page 11: Big Data/Hadoop Infrastructure Considerations

11

Hadoop Topology Awareness

Page 12: Big Data/Hadoop Infrastructure Considerations

12

Rules of Thumb: Sizing for Hadoop

!  Disk •  Provide about 50 MB/sec of disk bandwidth per core •  If using SATA, that’s about one disk per core

!  Network •  Provide about 200 Mbits/sec of aggregate network bandwidth per core

!  Memory •  Use a memory:core ratio of about 4 GB per core

!  Example: 100-node cluster •  100 x 16 cores = 1,600 cores •  1,600 x 50 MB/sec = 80 GB/sec(!) •  1,600 x 200 Mbits = 320 Gbits of network traffic
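The rules of thumb above can be wrapped in a quick sizing calculator. This is a sketch: the function and its defaults simply encode the slide's per-core numbers.

```python
def size_cluster(nodes, cores_per_node, disk_mb_per_core=50,
                 net_mbit_per_core=200, mem_gb_per_core=4):
    """Apply the per-core rules of thumb to a whole cluster."""
    cores = nodes * cores_per_node
    return {
        "cores": cores,
        "sata_disks": cores,                               # ~1 disk per core
        "disk_gb_per_sec": cores * disk_mb_per_core / 1000,
        "net_gbits": cores * net_mbit_per_core / 1000,
        "memory_gb": cores * mem_gb_per_core,
    }

sizing = size_cluster(nodes=100, cores_per_node=16)
# 1,600 cores -> 80 GB/sec of disk bandwidth, 320 Gbits of network traffic
```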

Page 13: Big Data/Hadoop Infrastructure Considerations

13

Big Data Frameworks and Characteristics

Framework | Scale of Data | Scale of Cluster | Computable Data? | Local Disks?
File System (Gluster, Isilon, HDFS, etc.) | 10s of PB | 10s to 100s | Some | Yes, for cost
Map-Reduce (Hadoop) | 100s of PB | 10s to 1,000s | Yes | Yes, for cost, bandwidth and availability
Big SQL (HawQ, Aster Data, Impala, …) | PBs | 10s to 100s | Some | Yes, for cost and bandwidth
NoSQL (Cassandra, HBase, …) | Trillions of rows | 10s to 100s | Some | Yes, for cost and availability
In-Memory (Redis, GemFire, Membase, …) | Billions of rows | 10s to 100s | Yes | Primarily memory

Page 14: Big Data/Hadoop Infrastructure Considerations

14

Big Data Virtual Infrastructure

Page 15: Big Data/Hadoop Infrastructure Considerations

15

Virtualization enables a Common Infrastructure for Big Data

Single purpose clusters for various business applications lead to cluster sprawl.

Virtualization Platform

!  Simplify •  Single hardware infrastructure •  Unified operations

!  Optimize •  Shared resources = higher utilization •  Elastic resources = faster on-demand access

[Diagram: cluster sprawl (separate MPP DB, Hadoop, and HBase clusters) consolidated onto a single virtualization platform running Scale-out SQL, Hadoop, and HBase.]

Page 16: Big Data/Hadoop Infrastructure Considerations

16

Native versus Virtual Platforms, 24 hosts, 12 disks/host

[Chart: elapsed time in seconds (lower is better) for TeraGen, TeraSort, and TeraValidate, comparing Native against 1 VM, 2 VMs, and 4 VMs per host.]

Page 17: Big Data/Hadoop Infrastructure Considerations

17

Why Virtualize Hadoop?

Mixed Workloads
!  Shrink and expand cluster on demand
!  Resource guarantees
!  Independent scaling of compute and data

Highly Available
!  No more single point of failure
!  One click to set up
!  High availability for MR jobs

Simple to Operate
!  Rapid deployment
!  Unified operations across the enterprise
!  Easy cloning of clusters

Page 18: Big Data/Hadoop Infrastructure Considerations

18

In-house Hadoop as a Service: “Enterprise EMR” (Hadoop + Hadoop)

[Diagram: a shared HDFS data layer spanning the hosts; the compute layer runs separate Hadoop clusters for ad hoc data mining, a production recommendation engine, and production ETL of log files, all on the virtualization platform.]

Page 19: Big Data/Hadoop Infrastructure Considerations

19

Integrated Hadoop and Webapps (Hadoop + Other Workloads)

[Diagram: an HDFS data layer across the hosts; the compute layer runs a short-lived Hadoop compute cluster alongside web servers for an ecommerce site on the virtualization platform.]

Page 20: Big Data/Hadoop Infrastructure Considerations

20

Integrated Big Data Production (Hadoop + Other Big Data)

[Diagram: a shared HDFS data layer across the hosts; the compute layer runs Hadoop batch analysis, HBase real-time queries, NoSQL (Cassandra, Mongo, etc.), Big SQL (Impala), and others (Spark, Shark, Solr, Platfora, etc.) on the virtualization platform.]

Page 21: Big Data/Hadoop Infrastructure Considerations

21

Serengeti: Deploy a Hadoop Cluster in under 30 Minutes

Step 1: Deploy the Serengeti virtual appliance (vHelper OVF) to vSphere.

Step 2: A few simple commands stand up the Hadoop cluster: select a configuration template; select compute, memory, storage, and network; automate deployment. Done.

Page 22: Big Data/Hadoop Infrastructure Considerations

22

Why Virtualize Hadoop?

Mixed Workloads
!  Shrink and expand cluster on demand
!  Resource guarantees
!  Independent scaling of compute and data

Highly Available
!  No more single point of failure
!  One click to set up
!  High availability for MR jobs

Simple to Operate
!  Rapid deployment
!  Unified operations across the enterprise
!  Easy cloning of clusters

Page 23: Big Data/Hadoop Infrastructure Considerations

23

vMotion, HA, FT enables High Availability for the Hadoop Stack

[Diagram: the Hadoop stack (HDFS Hadoop Distributed File System, HBase key-value store, MapReduce job scheduling/execution, Pig data flow, Hive SQL, HCatalog, plus BI reporting and ETL tools) and the services that vMotion, HA, and FT protect: Management Server, Zookeeper (coordination), RDBMS, Namenode, Jobtracker, Hive MetaDB, and HCatalog MDB server.]

Page 24: Big Data/Hadoop Infrastructure Considerations

24

Why Virtualize Hadoop?

Mixed Workloads
!  Shrink and expand cluster on demand
!  Resource guarantees
!  Independent scaling of compute and data

Highly Available
!  No more single point of failure
!  One click to set up
!  High availability for MR jobs

Simple to Operate
!  Rapid deployment
!  Unified operations across the enterprise
!  Easy cloning of clusters

Page 25: Big Data/Hadoop Infrastructure Considerations

25

Containers with Isolation are a Tried and Tested Approach

[Diagram: a hungry workload, a reckless workload, and a nosy workload, each isolated in its own container across hosts running vSphere with DRS.]

Page 26: Big Data/Hadoop Infrastructure Considerations

26

Mixing Workloads: Three big types of Isolation are Required

!  Resource Isolation • Control the greedy noisy neighbor • Reserve resources to meet needs

!  Version Isolation •  Allow concurrent OS, App, Distro versions

!  Security Isolation •  Provide privacy between users/groups

• Runtime and data privacy required


Page 27: Big Data/Hadoop Infrastructure Considerations

27

Elasticity Enables Sharing of Resources

Page 28: Big Data/Hadoop Infrastructure Considerations

28

“Time Share”

[Diagram: during the day the hosts run other VMs; at night Hadoop VMs take over the same hosts, with HDFS and vHelper on VMware vSphere.]

While existing apps run during the day to support business operations, Hadoop batch jobs kick off at night to conduct deep analysis of data.

Page 29: Big Data/Hadoop Infrastructure Considerations

29

Big Data Impact to Storage

Page 30: Big Data/Hadoop Infrastructure Considerations

30

Traditional Storage – VMDKs on SAN/NAS

[Diagram: App and DB VMs with self-contained boot and data VMDKs (block) on VMFS over SAN or NFS (EMC, NetApp); archive handled at the VM level.]

• VMs have just a few self-contained VMDKs • Data is not shared between VMs • VMs have limited individual bandwidth needs

Page 31: Big Data/Hadoop Infrastructure Considerations

31

Rapid Growth of Big Data Capacity Necessitates New Storage

Source: The Information Explosion, 2009

Data categories driving growth: medical imaging, sensors; CAD/CAM, appliances, machine data, digital movies; digital photos; digital TV; audio; camera phones, RFID; satellite images, logs, scanners, Twitter.

Exabytes of information stored: 20 zettabytes by 2015, 1 yottabyte by 2030. Yes, you are part of the yotta generation…

Page 32: Big Data/Hadoop Infrastructure Considerations

32

Use Local Disk where it’s Needed

SAN Storage: $2 - $10/GB. $1M gets: 0.5 PB, 200,000 disk IOPS, 8 GB/sec.

NAS Filers: $1 - $5/GB. $1M gets: 1 PB, 200,000 disk IOPS, 10 GB/sec.

Local Storage: $0.05/GB. $1M gets: 10 PB, 400,000 disk IOPS, 250 GB/sec.
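The dollar figures reduce to a simple budget calculation. A sketch: note that at $0.05/GB the raw math gives 20 PB for $1M, so the slide's 10 PB figure presumably nets out replication or other overhead.

```python
def capacity_pb(budget_usd, usd_per_gb):
    """Raw capacity a budget buys at a given cost per gigabyte."""
    return budget_usd / usd_per_gb / 1_000_000  # GB -> PB

san_pb = capacity_pb(1_000_000, 2.0)     # 0.5 PB, matching the slide
nas_pb = capacity_pb(1_000_000, 1.0)     # 1 PB, matching the slide
local_pb = capacity_pb(1_000_000, 0.05)  # 20 PB raw; ~10 PB after 2x copies
```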

Page 33: Big Data/Hadoop Infrastructure Considerations

33

Storage Economics

[Chart: cost per GB ($0 to $5.50) versus petabytes deployed (0.5 to 128), comparing traditional SAN/NAS, scale-out NAS (Isilon, NetApp), VMware vSAN, and distributed object storage (HDFS, MapR, Ceph).]

Page 34: Big Data/Hadoop Infrastructure Considerations

34

Scalable Storage Architecture: Hadoop Demands

!  Each node needs approximately one disk of bandwidth per core •  A 16-core node needs ~1 GB/sec of bandwidth

!  A 1,000-node Hadoop cluster needs 1 TB/sec of aggregate bandwidth •  Local disks: about $800 of local disks per node

•  Traditional SAN: would require an estimated $50k of SAN storage per node

[Diagram: data nodes with co-located compute tasks spread across the cluster.]

Page 35: Big Data/Hadoop Infrastructure Considerations

35

Storage-Servers: Configurations and Capabilities

Typical Storage Server: 16-24 core server, 12-24 SATA 2-4 TB disks, 10 GbE adapter. Throughput: 2.5 GB/sec. Capacity: 96 TB.

High Capacity Server: 16-24 core server, 80 SATA 2-4 TB disks, 10 GbE adapter. Throughput: 2.5 GB/sec. Capacity: 320 TB.
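The throughput figures are consistent with roughly 100 MB/sec of sequential bandwidth per SATA spindle, capped by the server's controller and network path; both the per-spindle figure and the 2.5 GB/sec cap are assumptions inferred from the slide, not stated facts.

```python
def server_throughput_gb(disks, mb_per_disk=100, cap_gb=2.5):
    """Aggregate sequential disk bandwidth: ~100 MB/sec per SATA spindle
    (assumed), limited by a controller/network cap (assumed 2.5 GB/sec)."""
    return min(disks * mb_per_disk / 1000, cap_gb)

# 24 disks -> 2.4 GB/sec (spindle-limited)
# 80 disks -> 2.5 GB/sec (cap-limited, which is why the high-capacity
#             server shows the same throughput despite more disks)
```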

Page 36: Big Data/Hadoop Infrastructure Considerations

36

Scalable Storage Bandwidth is Important for Big Data

[Chart: aggregate bandwidth (GB/sec, 0 to 120) versus number of hosts (1 to 50), comparing scalable storage with a single SAN/NAS.]

Page 37: Big Data/Hadoop Infrastructure Considerations

37

Hadoop has Significant Ephemeral Data

[Diagram: a job's map tasks read DFS input data from HDFS (~12% of bandwidth) and write spills, logs (spill*.out*), and map output files (file.out*) locally; the shuffle (Map_*.out*), sort, and combine (Intermediate.out*) stages move intermediate files between map and reduce tasks, accounting for ~75% of disk bandwidth; reduce tasks write DFS output data back to HDFS (~12% of bandwidth).]

Page 38: Big Data/Hadoop Infrastructure Considerations

38

Impact of Temporary Data

!  Leverage local disks when shared storage is used •  As much as 75% of the bandwidth will be transient •  No need for reliable, replicated storage •  Just use unprotected local disk

[Diagram: HDFS shared data on shared network storage; temporary data on local disks.]
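A consequence of the split above is that only a small fraction of total disk traffic needs protected shared storage. A sketch using the slide's estimates (~75% transient, ~12% each for DFS input and output):

```python
def bandwidth_split_gb(total_gb_per_sec):
    """Divide cluster disk bandwidth per the slide's rough estimates."""
    return {
        "transient_local": total_gb_per_sec * 0.75,  # spills, shuffle, logs
        "dfs_input": total_gb_per_sec * 0.12,
        "dfs_output": total_gb_per_sec * 0.12,
    }

split = bandwidth_split_gb(80)
# An 80 GB/sec cluster needs only ~19 GB/sec from shared storage;
# the remaining ~60 GB/sec can come from unprotected local disks
```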

Page 39: Big Data/Hadoop Infrastructure Considerations

39

Extend Virtual Storage Architecture to Include Local Disk

!  Shared Storage: SAN or NAS •  Easy to provision •  Automated cluster rebalancing

!  Hybrid Storage •  SAN for boot images, VMs, other workloads •  Local disk for Hadoop & HDFS •  Scalable bandwidth, lower cost/GB

[Diagram: hosts running a mix of Hadoop VMs and other VMs.]

Page 40: Big Data/Hadoop Infrastructure Considerations

40

Scale-out Storage for Big Data: Local Disk or Scale-out-NAS

[Diagram: left, servers with local disks behind top-of-rack switches; right, big data using scale-out NAS, with temp data on the hosts and shared data on the scale-out NAS.]

Page 41: Big Data/Hadoop Infrastructure Considerations

41

Big-Data using Local Disks

[Diagram: servers with local disks behind a top-of-rack switch.]

16-24 core servers, 12-24 SATA 2-4 TB disks, 10 GbE adapter; iSCSI/NFS shared storage for vMotion, etc. A high-performance 10 GbE switch per rack.

Page 42: Big Data/Hadoop Infrastructure Considerations

42

Big Data with Scale-out-NAS

[Diagram: big data using scale-out NAS (Isilon) for shared data, with local disk or SSD in each host for transient data, behind top-of-rack switches.]

Page 43: Big Data/Hadoop Infrastructure Considerations

43

Big Data Impact to Networking

Page 44: Big Data/Hadoop Infrastructure Considerations

44

Hadoop has Specific Network Demands that must be Considered

Source: http://architects.dzone.com/articles/how-hadoop-mapreduce-works

!  A framework for distributed processing of large data sets across clusters of computers using a simple programming model.

Page 45: Big Data/Hadoop Infrastructure Considerations

45

A Typical Network Architecture

[Diagram: hosts connect to top-of-rack switches; racks aggregate through L2 switches into aggregating switches.]

Page 46: Big Data/Hadoop Infrastructure Considerations

46

A Typical Network Architecture

[Diagram: hosts connect to top-of-rack switches; racks aggregate through L2 switches into aggregating switches.]

!  Hadoop Requirements •  Hadoop can generate up to 200 Mbits/sec per core of network traffic •  Bursts of bandwidth during remote data reads and shuffle phases •  Data locality can help minimize network reads, but shuffle and reduce are global

!  Inside-Rack Network •  Today’s core counts require 10 GbE to hosts inside the rack •  High-throughput switches are required to meet aggregate bandwidth •  Switch network buffer sizing is important due to coordinated bursts

!  Rack-to-Rack Network •  Two-level switch aggregation can be a bottleneck •  Easy to saturate, especially during shuffle phases

!  Link-Speed Sizing •  Hadoop was originally designed around 4 cores and a 1 Gbit uplink •  Now, 16-24 cores per box means 10 GbE is required for the uplink
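The link-speed sizing rule can be checked mechanically. A sketch with a hypothetical helper; the 200 Mbits/sec per core figure comes from the slide.

```python
def uplink_sufficient(cores_per_host, uplink_gbits, mbit_per_core=200):
    """Compare per-host Hadoop traffic against the host's uplink speed.
    Returns (demand in Gbits/sec, whether the uplink can carry it)."""
    demand_gbits = cores_per_host * mbit_per_core / 1000
    return demand_gbits, demand_gbits <= uplink_gbits

# 4 cores on 1 GbE: 0.8 Gbits/sec of demand, fits
# 16 cores: 3.2 Gbits/sec, which overruns 1 GbE and needs a 10 GbE uplink
```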

Page 47: Big Data/Hadoop Infrastructure Considerations

47

Two Level Aggregated Network Characteristics

[Diagram: hosts connect to top-of-rack switches; racks aggregate through L2 switches into aggregating switches.]

Top of rack: overprovisioning possible; a high-end 10 GbE switch provides >500 Gbits/sec. Between racks: bandwidth can be saturated.
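The rack-to-rack saturation above is a function of the oversubscription ratio at each switch tier. A sketch; the host and uplink counts in the example are hypothetical.

```python
def oversubscription(host_ports, port_gbits, uplinks, uplink_gbits):
    """Ratio of host-facing bandwidth to upstream bandwidth at a switch."""
    return (host_ports * port_gbits) / (uplinks * uplink_gbits)

ratio = oversubscription(40, 10, 4, 40)
# A rack of 40 hosts at 10 GbE with 4x40 GbE uplinks is 2.5:1 oversubscribed,
# so an all-to-all shuffle can drive inter-rack links well past capacity
```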

Page 48: Big Data/Hadoop Infrastructure Considerations

48

Future Network Designs, Optimized for Big Data

Traditional tree-structured networks •  Cons: high oversubscription of the upper-level (aggregation) and core switches; hard to get full-bisection bandwidth

Clos network between aggregation and intermediate (core) switches •  Pros: full-bisection bandwidth, non-blocking switching fabric, path diversity (formalized in 1953) •  LA: (physical) locator address; AA: (flat, virtual) application address

Images courtesy Greenberg et al.: http://research.microsoft.com/pubs/80693/vl2-sigcomm09-final.pdf and http://www.stanford.edu/class/ee384y/Handouts/clos_networks.pdf

Page 49: Big Data/Hadoop Infrastructure Considerations

49

In Conclusion

!  The Software-Defined Datacenter enables a unified big data platform

!  Plan to build a software-defined big data datacenter •  Enable Hadoop and other big data ecosystem components

!  Values •  Consolidated hardware, reduced SKUs •  Higher utilization: leverage a shared cluster for bursty workloads •  Rapid deployment: use APIs to deploy software-defined clusters rapidly •  Flexible cluster use: share the same infrastructure across different workload types

!  References •  www.projectserengeti.org •  vmware.com/hadoop •  VMware Hadoop Performance Whitepaper •  VMware Hadoop HA/FT Paper

Page 50: Big Data/Hadoop Infrastructure Considerations

© 2009 VMware Inc. All rights reserved

Big Data in a Software-Defined Datacenter

Richard McDougall

Chief Architect, Storage and Application Services

VMware, Inc

@richardmcdougll