Transcript
Page 1: Challenges for running Hadoop on AWS - AdvancedAWS Meetup

Challenges of running Hadoop on AWS
June 12, 2014 @ AdvancedAWS Meetup - Citizen Space

Andrei Savu - @andreisavu

Software Engineer, Cloud Automation Team

Overview

● Introduction
● Context
● Challenges
● Questions

Andrei Savu

Software Engineer

Cloud Automation Team @ Cloudera

Previously: founder of Axemblr, Apache Whirr PMC, contributor to jclouds, Cloudsoft, Facebook etc. (see LinkedIn)

Cloud Automation Team @ Cloudera

Focused on:

● building tools to automate deployment and ongoing management of Hadoop clusters on cloud infrastructure

● improving Hadoop cloud compatibility (e.g. S3 integration, Swift, managed databases, custom network topologies, etc.)

We are hiring!

Context

● Hadoop
● Types of Deployments
● Cluster Topology
● AWS

Context: Hadoop

Hadoop is a broad, coherent stack of products for data storage and processing.

“Hadoop” is more than HDFS & MapReduce. It covers multiple storage systems, different query engines, batch and real-time processing, etc.

Usually runs on bare metal, now moving towards cloud infrastructure.

Types of Deployments

Long running:

- store data for analytics jobs with MapReduce, Impala, Spark

- online data serving with HBase

On-demand:

- analytical workloads, fetch data on-demand

- triggered by workflows (1:1)

- disconnected lifecycle (Netflix Genie)

Cluster Topology #1

Simple:

● EC2 Classic (being phased out)
● VPC: single subnet, security group with an optional VPN

Complex:

● VPC: multiple subnets & security groups
● DirectConnect
● highly available with disaster recovery
● multiple users & security

Amazon Web Services

Paradigm shift in how we work with infrastructure.

Key concept: software defined - controlled by APIs

Has most of the things we need for storage and high-performance data processing (placement groups, large instances, high storage density, SSDs, many vCPUs, etc.)

Enterprise-ready: IAM, VPC, VPN / DirectConnect, Support etc.

Challenges

● Instance Provisioning & Health
● Ensuring Idempotency
● Networking & Performance
● AMIs & Bootstrap Speed
● Data durability
● S3 integration

What makes it more difficult?

… versus a typical stateless web application in an auto-scaling group monitoring request latency or OS load averages

● statefulness (think databases)
● each cluster has multiple processes playing different roles
● topology & configuration changes require orchestration
● knowledge of service inter-dependencies is required

Instance Provisioning & Health

Questions:

● How do you define your cluster size to deal with a lack of capacity?
● How do you define health? Is that stable during setup?
● Is health a binary property? Or a threshold that needs to be continuously evaluated?

Potential answers:

● match AWS semantics: define size as a range
● make simplifying assumptions (e.g. healthy during setup)
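
Defining size as a range mirrors how EC2 RunInstances takes a MinCount and a MaxCount: the request is acceptable as long as at least the minimum can be launched. A minimal Python sketch of that idea follows. The names `SizeRange` and `evaluate_provisioning`, and the three status labels, are hypothetical and only for illustration; they are not from the talk or from any Cloudera tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SizeRange:
    """Hypothetical cluster size request: acceptable minimum, desired maximum."""
    min_count: int
    max_count: int

    def __post_init__(self):
        if not 0 < self.min_count <= self.max_count:
            raise ValueError("expected 0 < min_count <= max_count")


def evaluate_provisioning(size: SizeRange, healthy_instances: int) -> str:
    """Classify a provisioning attempt against the requested range."""
    if healthy_instances >= size.max_count:
        return "complete"
    if healthy_instances >= size.min_count:
        # Below the desired size but still within the accepted range.
        return "degraded-but-usable"
    return "failed"
```

With a range of 8 to 10 instances, a capacity shortfall that yields 9 healthy instances still produces a usable cluster instead of a failed setup.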

Ensuring Idempotency

Questions:

● How do you safely retry expensive calls?
● How do you build reliable workflows?

Potential answers:

● use a client token, as described in the AWS User Guide
● Discuss: convergence vs. single-step retries
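
The client token approach can be sketched as follows: generate one token per logical request and reuse it on every retry, so the service can recognize duplicates. EC2 RunInstances accepts a ClientToken parameter for this purpose. The code below does not call AWS. `FakeLaunchApi` and `launch_with_retries` are hypothetical stand-ins that only simulate the behavior, under the assumption that the service returns the original result when it sees a token again.

```python
import uuid


class FakeLaunchApi:
    """Hypothetical stand-in for a cloud API that honors client tokens."""

    def __init__(self, fail_first_n_responses=0):
        self._by_token = {}
        self._failures_left = fail_first_n_responses
        self.instances_created = 0

    def launch(self, client_token, count):
        # A known token returns the original result instead of launching again.
        if client_token not in self._by_token:
            self.instances_created += count
            self._by_token[client_token] = [
                f"i-{client_token[:8]}-{n}" for n in range(count)
            ]
        if self._failures_left > 0:
            # Simulate a response lost after the work was already done.
            self._failures_left -= 1
            raise TimeoutError("response lost")
        return self._by_token[client_token]


def launch_with_retries(api, count, attempts=3):
    """Retry a launch safely by reusing one client token for all attempts."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    token = uuid.uuid4().hex  # generated once, outside the retry loop
    last_error = None
    for _ in range(attempts):
        try:
            return api.launch(token, count)
        except TimeoutError as err:
            last_error = err
    raise last_error
```

If the token were generated inside the loop instead, each retry after a lost response would launch another set of instances.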

Networking & Performance

Questions:

● What’s the ideal setup that’s both usable and secure?
● How do you get consistent intra-cluster performance?

Potential answers:

● VPC with VPN or DirectConnect. Placement groups help.
● Security model: initially it was just perimeter security, now it can do a lot more (disk encryption, SSL, Kerberos)

Images & Bootstrap Speed

Questions:

● Do you allow custom AMIs or force your own choices?
● If using custom AMIs, how can you reduce bootstrap time?

Potential answers:

● Custom AMIs are common - integrated with existing infra
● Fast bootstrap by baking custom bits on top

Data durability

Questions:

● How do you place replicas? Datacenter topology?
● How are instances distributed in different failure domains?

Potential answers:

● ignore or go with large instances that map 1:1 to hosts
● would be nice to have: a way to influence host-to-instance allocation or to get datacenter topology data

S3 integration

Questions:

● How do you reconcile differences in semantics with HDFS? (strong consistency vs. eventual consistency)

● How do you get the most out of it in terms of performance?

Potential answers:

● we’ve done a fair amount of work improving S3 integration in open source (features, stability improvements, security, etc.)

● performance is network bound

Thanks! Questions?

Andrei Savu - [email protected]

Twitter: @andreisavu

Join us to take Hadoop to the clouds!
https://hire.jobvite.com/Jobvite/job.aspx?j=orafYfwy&b=nqlg3nwW
