Hadoop AWS infrastructure cost evaluation

Hadoop Platform infrastructure cost evaluation

Uploaded by: mattlieber

Posted on 04-Dec-2014


DESCRIPTION

How do you calculate the cost of a Hadoop infrastructure on Amazon AWS, given data volume estimates and a rough use case? This presentation compares the different options available on AWS.

TRANSCRIPT

Page 1: Hadoop AWS infrastructure cost evaluation

Hadoop Platform infrastructure cost evaluation

Page 2: Hadoop AWS infrastructure cost evaluation

• High level requirements

• Cloud architecture

• Major architecture components

• Amazon AWS

• Hadoop distributions

• Capacity Planning

• Amazon AWS – EMR

• Hadoop distributions

• On-premise hardware costs

• Gotchas

Agenda

2

Page 3: Hadoop AWS infrastructure cost evaluation

• Build an Analytical & BI platform for web log analytics

• Ingest multiple data sources:

• Log data

• internal user data

• Apply complex business rules

• Manage Events, filter Crawler Driven Logs, apply Industry and Domain Specific rules

• Populate/export to a BI tool for visualization.

High Level Requirements

3

Page 4: Hadoop AWS infrastructure cost evaluation

• Today’s baseline: ~42 TB per year (~3.5 TB raw data per month), 3-year store

• SLA: Should process data every day. Currently done once a month.

• Predefined processing via Hive; no exploratory analysis

• Everything in the cloud:

• Store (HDFS), Compute (M/R), Analysis (BI tool)

Non-Functional Requirements

4

Page 5: Hadoop AWS infrastructure cost evaluation

• Seeding data in S3 (3 years’ worth of data)

• Adding monthly net-new data only.

• Speed not of primary importance

Non-Functional Requirements [2]

5

Page 6: Hadoop AWS infrastructure cost evaluation

• Cleaned-up log data per year: 42 TB (3 years = 126 TB)

• Total disk space required should consider:

• Compression (LZO, ~40% reduction) – reduces disk space required to ~25 TB *

• Replication factor of 3: ~75 TB

• 75% maximum disk utilization in Hadoop: ~100 TB

• Total disk capacity required for DN: ~100 TB/year (17.5 TB/mo)

• (* disclaimer: depends on codec and data input)

Data Estimates for Capacity planning [2]

6
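The capacity math above (compress, replicate, then head-room for disk utilization) can be sketched as a small calculator. The 0.6 compression ratio, replication factor 3, and 75% utilization cap are the slide's figures; the function name and defaults are illustrative, not a standard tool.

```python
def required_capacity_tb(raw_tb, compression_ratio=0.6, replication=3,
                         max_utilization=0.75):
    """Disk capacity needed on Hadoop data nodes for a given raw volume.

    compression_ratio: fraction remaining after LZO (~40% reduction, per slide).
    replication: HDFS replication factor.
    max_utilization: keep data-node disks at most this full.
    """
    compressed = raw_tb * compression_ratio      # ~25 TB for 42 TB raw
    replicated = compressed * replication        # ~75 TB
    return replicated / max_utilization          # ~100 TB

print(round(required_capacity_tb(42), 1))  # ≈ 100.8, matching the slide's ~100 TB/year
```

Lowering `max_utilization` to 0.70 reproduces the slightly higher numbers in the next slide's table.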

Page 7: Hadoop AWS infrastructure cost evaluation

Data Estimates for Capacity planning: reduced logs

7

Expected data volume | Log data volume (TB) | After compression (Gzip 40%) | Data replication on 3 nodes (TB) | 70% disk utilization maximum (TB)

1 month | 3.6 | 2.16 | 6.5 | 9.2

1 year | 42 | 25 | 75 | 107

3 years | 126 | 75.6 | 226 | 322

• Total disk capacity required for DN: ~10 TB/month

Page 8: Hadoop AWS infrastructure cost evaluation

Amazon AWS

Cloud Solution Architecture

8

Hadoop

HDFS

S3 BI Tool

User

Logs

Metadata Extraction

Webservers

Hive Tables

Client

1. Copy data to

S3

2. Export data to HDFS

3. Process in M/R 4. Display

in BI tool

5. Retain results into S3

Page 9: Hadoop AWS infrastructure cost evaluation

• Amazon Elastic Compute Cloud (EC2) is a web service that provides resizable compute capacity in the cloud.

• Manual set up of Hadoop on EC2

• Use EBS for storage capacity (HDFS)

• Storage on S3

Hadoop on AWS: EC2

9

Page 10: Hadoop AWS infrastructure cost evaluation

• EC2 instances options

• Choose instance type

• Choose instance type availability

• Choose instance family

• Choose where the data resides:

• S3 – high latency, but highly available

• EBS

• Permanent storage?

• Snapshots to S3?

• Apache Whirr for set up

Running Hadoop on AWS: EC2

10

Page 11: Hadoop AWS infrastructure cost evaluation

• Other choices:

• EBS-optimized instances: dedicated throughput between Amazon EC2 and Amazon EBS, with options between 500 Mbps and 1000 Mbps depending on the instance type used.

• Inter-region data transfer

• Dedicated instances: run on single-tenant hardware dedicated to a single customer.

• Spot instances: Name your price

Amazon EC2 – Instance features

11

Page 12: Hadoop AWS infrastructure cost evaluation

• Amazon EC2 instances are grouped into six families: General purpose, Memory optimized, Compute optimized, Storage optimized, Micro, and GPU.

• General-purpose instances have memory to CPU ratios suitable for most general purpose apps.

• Memory-optimized instances offer larger memory sizes for high throughput applications.

• Compute-optimized instances have proportionally more CPU resources than memory (RAM) and are well suited for compute-intensive applications.

• Storage-optimized instances are optimized for very high random I/O performance, or for very high storage density, low storage cost, and high sequential I/O performance.

• Micro instances provide a small amount of CPU with the ability to burst to higher amounts for brief periods.

• GPU instances, for graphics and general-purpose GPU computing.

Amazon Instance Families

12

Data nodes

Page 13: Hadoop AWS infrastructure cost evaluation

• On-Demand Instances – On-Demand Instances let you pay for compute capacity by the hour with no long-term commitments. This frees you from the costs and complexities of planning, purchasing, and maintaining hardware.

• Reserved Instances – Reserved Instances give you the option to make a one-time payment for each instance you want to reserve and in turn receive a discount on the hourly charge for that instance. There are three Reserved Instance types (Light, Medium, and Heavy Utilization Reserved Instances) that enable you to balance the amount you pay upfront with your effective hourly price.

• Spot Instances – Spot Instances allow customers to bid on unused Amazon EC2 capacity and run those instances for as long as their bid exceeds the current Spot Price. The Spot Price changes periodically based on supply and demand, and customers whose bids meet or exceed it gain access to the available Spot Instances. If you have flexibility in when your applications can run, Spot Instances can significantly lower your Amazon EC2 costs.

Amazon Instances types availability

13
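The Reserved Instance trade-off above (upfront payment in exchange for a lower hourly rate) comes down to amortizing the one-time fee over the reservation term. The prices below are hypothetical placeholders, not actual AWS rates:

```python
def reserved_effective_hourly(upfront, hourly, term_hours):
    """Effective $/hr of a Reserved Instance: amortize the one-time
    reservation fee over the term and add the discounted hourly rate."""
    return upfront / term_hours + hourly

# Hypothetical m1.xlarge-style prices (illustration only):
on_demand = 0.48  # $/hr
eff = reserved_effective_hourly(upfront=1000, hourly=0.16, term_hours=365 * 24)
print(f"{eff:.3f}")                        # effective $/hr for the reserved instance
print(f"savings: {1 - eff / on_demand:.0%}")
```

The comparison only favors reservation if the instance actually runs enough hours; for bursty workloads, Spot or On-Demand pricing can beat the amortized rate.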

Page 14: Hadoop AWS infrastructure cost evaluation

Amazon EC2 – Storage

14

Page 15: Hadoop AWS infrastructure cost evaluation

Amazon EC2 – Instance types

15

BI instances

Master nodes

Data nodes

Page 16: Hadoop AWS infrastructure cost evaluation

• Hadoop cluster is initiated when analytics is run

• Data is streamed from S3 to EBS Volumes

• Results from analytics stored to S3 once computed

• BI nodes permanent

Systems Architecture – EC2

16

Logs

AWS

NN SN

Hadoop

S3

DNs EN

HDFS on EBS drives

Client

Node Node

BI

Node Node

BI

Page 17: Hadoop AWS infrastructure cost evaluation

• Probably not the best choice:

• EBS volumes make the solution costly

• If instead using instance storage, choices of EC2 instances are either too small (a few GB) or too big (48 TB per instance).

• Don’t need the flexibility – just want to use Hive

Hadoop on AWS: EC2

17

Page 18: Hadoop AWS infrastructure cost evaluation

• Amazon Elastic MapReduce (EMR) is a web service that provides a hosted Hadoop framework running on EC2 and Amazon Simple Storage Service (S3).

Hadoop on AWS: EMR

18

Page 19: Hadoop AWS infrastructure cost evaluation

• Elastic Map Reduce

• For occasional jobs – Ephemeral clusters

• Ease of use, but 20% costlier

• Data stored in S3 - Highly tuned for S3 storage

• Hive and Pig available

• Only pay for S3 + instances time while jobs running

• Or: leave it always on.

Running Hadoop on AWS - EMR

19
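The "only pay for S3 + instance time while jobs run, but ~20% costlier" trade-off can be sketched as a per-run cost. The 20% surcharge is the slide's rough figure (actual EMR fees vary by instance type), and the node count and rate below are hypothetical:

```python
def emr_job_cost(nodes, hours, ec2_hourly, emr_surcharge=0.2):
    """Cost of one ephemeral EMR run: EC2 instance time plus the EMR
    premium (the slide's rough '20% costlier' figure)."""
    return nodes * hours * ec2_hourly * (1 + emr_surcharge)

# Hypothetical: 10 nodes, an 8-hour daily job, $0.50/hr instances
print(round(emr_job_cost(10, 8, 0.50), 2))  # 48.0 per run, vs 40.0 on raw EC2
```

Because the cluster is torn down between runs, this per-run figure replaces the always-on monthly EC2 bill, which is why ephemeral EMR wins for occasional jobs despite the surcharge.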

Page 20: Hadoop AWS infrastructure cost evaluation

• EC2 instances with own flavor of Hadoop

• Amazon’s Apache Hadoop is version 1.0.3. You can also choose MapR M3 or M5 (0.20.205).

• You can run Hive (0.7.1 or 0.8.1), custom JAR, streaming, Pig, or HBase.

Hadoop on AWS - EMR

20

Page 21: Hadoop AWS infrastructure cost evaluation

• Hadoop cluster created elastically

• Data is streamed from S3 to initiate Hadoop cluster dynamically

• Results from analytics stored to S3 once computed

• BI nodes permanent

Systems Architecture – EMR

21

Logs

AWS

NN SN

Hadoop

S3

DNs EMR

HDFS from S3

Client

Instance

Instance

BI

Instance

Instance

BI

Page 22: Hadoop AWS infrastructure cost evaluation

Amazon EMR– Instance types

22

BI instances

Master nodes

Data nodes

Page 23: Hadoop AWS infrastructure cost evaluation

• Calculate and add:

• S3 cost (seeded data)

• Incremental S3 cost, per month

• EC2 cost

• EMR cost

• In/out Transfer of data cost

• Amazon support cost

• Infrastructure support Engineer cost

AWS calculator – EMR calculation

23
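The line items above can be summed in a one-function sketch of the AWS-calculator exercise. All amounts below are illustrative monthly estimates, not AWS quotes:

```python
def monthly_platform_cost(s3_storage, s3_incremental, ec2, emr,
                          data_transfer, aws_support, support_engineer):
    """Sum the slide's cost line items into one monthly figure.
    Every argument is a monthly dollar estimate."""
    return sum([s3_storage, s3_incremental, ec2, emr,
                data_transfer, aws_support, support_engineer])

# Illustrative numbers only:
total = monthly_platform_cost(s3_storage=4000, s3_incremental=350, ec2=6000,
                              emr=1200, data_transfer=300, aws_support=800,
                              support_engineer=1500)
print(total)  # 14150
```

Keeping each component separate makes it easy to re-run the estimate as the dominant terms (usually EC2/EMR hours and S3 storage) change.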

Page 24: Hadoop AWS infrastructure cost evaluation

• Say for 24hrs/day, EMR cost:

AWS calculator – EMR calculation

24

Page 25: Hadoop AWS infrastructure cost evaluation

• Say for 24hrs/day, 3 year S3:

AWS calculator – EMR calculation

25

Page 26: Hadoop AWS infrastructure cost evaluation

• Say for 24hrs/day, 3 year EC2:

AWS calculator – EMR calculation

26

Page 27: Hadoop AWS infrastructure cost evaluation

Data volume (per year) | Instance types | Price/year, running 24 hours/day | Price/year, running 8 hours/day | Price/year, running 8 hours/week

1 year – storing 42 TB on S3 | 10 instances – data nodes: m1.xlarge; NN: m2.2xlarge; BI: m2.2xlarge; load balancer: t1.micro; 1-year reserved; 10 EMR instances (subject to change depending on actual load) | $14.1k/mo × 12 = $169.2k | $8.9k × 12 = $106k | $6.6k × 12 = $79.2k

3 years – storing 126 TB on S3 | – | $19.5k × 36 mos = $684k | $15.5k × 36 mos = $558k | $13.2k × 36 mos = $475k

Amazon EMR Pricing – Reduced log volume

27

Page 28: Hadoop AWS infrastructure cost evaluation

Hadoop on AWS: trade-offs

28

Feature | EC2 | EMR

Ease of use | Hard – IT Ops costs | Easy; Hadoop clusters can be of any size; can have multiple clusters

Cost | Cheaper | Costlier: pay for EC2 + EMR

Flexibility | Better: access to the full stack of the Hadoop ecosystem | On-demand Hadoop cluster: easy to use – Hadoop pre-installed, but with limited options

Portability | Easier to move to dedicated hardware | –

Speed | Faster | Lower performance: all data is streamed from S3 for each job

Maintainability | Can choose any vendor; can be updated to the latest version | Debugging tricky: cluster terminated, no logs

Page 29: Hadoop AWS infrastructure cost evaluation

• EMR with Spot instances seems to be the trend for minimal cost, if SLA timeliness is not of primary importance.

• Use Reserved instances to bring down cost drastically (~60%).

• Compression on S3?

• Need to account for a secondary NN?

• Better ability to estimate how many EMR nodes are needed, using AWS’s AMI task configuration

EC2 Pricing Gotchas

29

Page 30: Hadoop AWS infrastructure cost evaluation

• Transferring data between S3 and EMR clusters is very fast (and free), so long as your S3 bucket and Hadoop cluster are in the same Amazon region

• EMR’s S3 file system streams data directly to S3 instead of buffering to intermediate local files.

• EMR’s S3 file system adds multipart upload, which splits your writes into smaller chunks and uploads them in parallel.

• Store fewer, larger files instead of many smaller ones

• http://blog.mortardata.com/post/58920122308/s3-hadoop-performance

EMR Technical Gotchas

30
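The "store fewer, larger files" advice above can be turned into a simple merge planner: greedily batch small log files into groups near a target object size before uploading to S3. The 512 MB target and the function itself are illustrative, not part of EMR:

```python
def plan_merges(file_sizes_mb, target_mb=512):
    """Greedily group small files into merge batches of roughly target_mb,
    so each S3 object is large enough for efficient Hadoop input splits.
    (target_mb is an arbitrary illustration value.)"""
    batches, current, current_size = [], [], 0
    for size in file_sizes_mb:
        # Start a new batch once adding this file would exceed the target
        if current and current_size + size > target_mb:
            batches.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        batches.append(current)
    return batches

# 20 files of 50 MB each -> 2 batches near 500 MB instead of 20 tiny objects
print(len(plan_merges([50] * 20)))  # 2
```

Fewer, larger objects also reduce per-request S3 overhead, which compounds with the multipart-upload parallelism noted above.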

Page 31: Hadoop AWS infrastructure cost evaluation

Data volume: 126 TB

Storage for data nodes: 6 × 12 × 2 TB drives

Instances: 10 data nodes, 3 master nodes

Dell PowerEdge R720:

• Processor: E5-2640 2.50 GHz, 8 cores, 12M cache, Turbo

• Memory: 64 GB, quad-ranked RDIMM for 2 processors, low-volt

• Hard drives: 12 × 2 TB 7.2K RPM SATA 3.5in hot-plug

• Network card: Intel 82599 dual-port 10GbE mezzanine card

Price, first year: $10.6k × 10 DN + $7.3k × 3 = $128k

+ vendor support ($50k) + full-time person ($150k) = $328k

BI: 4 nodes, $43k

In house Hadoop cluster

31
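The in-house total above is hardware plus two recurring items. A sketch of that first-year sum, using the slide's $k figures as defaults (node prices and counts are the slide's, not vendor quotes):

```python
def onprem_first_year_cost(data_nodes=10, master_nodes=3,
                           dn_price=10.6, master_price=7.3,
                           vendor_support=50, fte=150):
    """First-year in-house cluster cost in $k: hardware for data and
    master nodes, plus vendor support and a full-time engineer."""
    hardware = data_nodes * dn_price + master_nodes * master_price
    return hardware + vendor_support + fte

print(round(onprem_first_year_cost()))  # ≈ 328 ($k), matching the slide
```

Note that in later years the hardware term drops out but support and staffing recur, which is the usual cloud-vs-on-premise crossover consideration.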

Page 32: Hadoop AWS infrastructure cost evaluation

32

Licensing and support costs

Page 33: Hadoop AWS infrastructure cost evaluation

• Cloudera or Hortonworks

• Enterprise 24×7 production support – phone and support portal access (support datasheet attached)

• Minimum $50k

Hadoop Distributions:

33

Page 34: Hadoop AWS infrastructure cost evaluation

Business

• Response time: 1 hour

• Access: phone, chat, and email, 24/7

• Costs: the greater of $100, or

• 10% of monthly AWS usage for the first $0–$10K

• 7% of monthly AWS usage from $10K–$80K

• 5% of monthly AWS usage from $80K–$250K

• 3% of monthly AWS usage from $250K+

• (about $800/yr)

Enterprise

• Response time: 15 minutes

• Access: phone, chat, TAM, and email, 24/7

• Costs: the greater of $15,000, or

• 10% of monthly AWS usage for the first $0–$150K

• 7% of monthly AWS usage from $150K–$500K

• 5% of monthly AWS usage from $500K–$1M

• 3% of monthly AWS usage from $1M+

http://aws.amazon.com/premiumsupport/

Amazon – Support EC2 & EMR

34
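The tiered Business-plan pricing above is a marginal-rate calculation: each percentage applies only to the slice of usage inside its band, floored at the plan minimum. The rates and breakpoints are the slide's; the code is an illustrative sketch:

```python
# (upper bound of band, rate) pairs for the Business plan, from the slide
BUSINESS_TIERS = [(10_000, 0.10), (80_000, 0.07),
                  (250_000, 0.05), (float("inf"), 0.03)]

def business_support_cost(monthly_usage, minimum=100, tiers=BUSINESS_TIERS):
    """Marginal tiered support fee: each rate applies to its usage band
    only; the result is floored at the plan minimum."""
    cost, prev = 0.0, 0.0
    for cap, rate in tiers:
        if monthly_usage > prev:
            cost += (min(monthly_usage, cap) - prev) * rate
        prev = cap
    return max(minimum, cost)

print(round(business_support_cost(5_000), 2))   # 500.0  (10% of $5k)
print(round(business_support_cost(20_000), 2))  # 1700.0 (10% of 10k + 7% of the next 10k)
```

The Enterprise plan follows the same shape with a $15,000 minimum and its own breakpoints, so the function can be reused by passing a different tier table and minimum.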

Page 35: Hadoop AWS infrastructure cost evaluation

35

Thank You