Challenges of Running Hadoop on AWS - AdvancedAWS Meetup

Post on 19-Aug-2014


DESCRIPTION

Nowadays we have all the tools we need to spin up and tear down clusters with hundreds of nodes in minutes, and this puts more pressure on the tools we use to configure and monitor our applications. The challenge is even more interesting when we have to deal with long-running distributed data storage and processing systems like Hadoop. In this talk we will look at some of the challenges of creating and managing Hadoop clusters on AWS, discuss improvement opportunities in monitoring (e.g. detecting and dealing with instance failure, resource contention and noisy neighbors), and touch on the future: how we should go about decoupling workload dispatch from cluster lifecycle.

TRANSCRIPT


Challenges of running Hadoop on AWS

June 12, 2014 @ AdvancedAWS Meetup - Citizen Space

Andrei Savu - @andreisavu

Software Engineer, Cloud Automation Team

Overview

● Introduction
● Context
● Challenges
● Questions

Andrei Savu

Software Engineer

Cloud Automation Team @ Cloudera

Previously: founder of Axemblr, Apache Whirr PMC, contributor to jclouds, Cloudsoft, Facebook etc. (see LinkedIn)

Cloud Automation Team @ Cloudera

Focused on:

● building tools to automate deployment and ongoing management of Hadoop clusters on cloud infrastructure

● improving Hadoop cloud compatibility (e.g. S3 integration, Swift, managed databases, custom network topologies, etc.)

We are hiring!

Context

● Hadoop
● Types of Deployments
● Cluster Topology
● AWS

Context: Hadoop

Hadoop is a broad, coherent stack of products for data storage and processing.

“Hadoop” is more than HDFS & MapReduce: it supports multiple storage systems, different query engines, batch and real-time processing, etc.

Traditionally run on bare metal, now moving towards cloud infrastructure.

Types of Deployments

Long running:

- store data for analytics jobs with MapReduce, Impala, Spark

- online data serving with HBase

On-demand:

- analytical workloads, fetch data on-demand

- triggered by workflows (1:1)

- disconnected lifecycle (Netflix Genie)

Cluster Topology #1

Simple:

● EC2-Classic (being phased out)
● VPC: single subnet, security group, with an optional VPN

Complex:

● VPC: multiple subnets & security groups
● DirectConnect
● highly available with disaster recovery
● multiple users & security

Amazon Web Services

Paradigm shift in how we work with infrastructure.

Key concept: software defined - controlled by APIs

Has most of the things we need for storage and high-performance data processing (placement groups, large instances, high storage density, SSDs, many vCPUs, etc.)

Enterprise-ready: IAM, VPC, VPN / DirectConnect, Support etc.

Challenges

● Instance Provisioning & Health
● Ensuring Idempotency
● Networking & Performance
● AMIs & Bootstrap Speed
● Data Durability
● S3 Integration

What makes it more difficult?

… versus a typical stateless web application in an auto-scaling group monitoring request latency or OS load averages

● statefulness (think databases)
● each cluster has multiple processes playing different roles
● topology & configuration changes require orchestration
● knowledge of service inter-dependencies is required
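The orchestration point above can be made concrete: starting or reconfiguring a stateful cluster means ordering operations by service dependencies, not just restarting everything at once. A minimal sketch as a topological sort (the service names and the dependency graph here are illustrative, not Cloudera's actual model):

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Hypothetical Hadoop service dependency graph: each service maps to
# the services that must be up before it can start.
DEPENDENCIES = {
    "zookeeper": [],
    "hdfs-namenode": ["zookeeper"],
    "hdfs-datanode": ["hdfs-namenode"],
    "yarn-resourcemanager": ["hdfs-namenode"],
    "yarn-nodemanager": ["yarn-resourcemanager", "hdfs-datanode"],
    "hbase-master": ["zookeeper", "hdfs-datanode"],
}

def start_order(deps):
    """Return a service start order that respects inter-dependencies."""
    return list(TopologicalSorter(deps).static_order())

order = start_order(DEPENDENCIES)
# ZooKeeper must come up before the NameNode, which must precede DataNodes.
assert order.index("zookeeper") < order.index("hdfs-namenode")
assert order.index("hdfs-namenode") < order.index("hdfs-datanode")
```

A stateless web tier has no equivalent of this ordering constraint, which is why auto-scaling-group tooling transfers poorly to Hadoop.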

Instance Provisioning & Health

Questions:

● How do you define your cluster size to deal with lack of capacity?
● How do you define health? Is it stable during setup?
● Is health a binary property, or a threshold that needs to be continuously evaluated?

Potential answers:

● match AWS semantics: define size as a range
● make simplifying assumptions (e.g. healthy during setup)
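"Size as a range" mirrors the EC2 RunInstances MinCount/MaxCount semantics: ask for a target count but accept any allocation at or above a minimum, so a transient lack of capacity does not fail the whole cluster build. A sketch of that decision (the function name and return values are made up for illustration):

```python
def accept_allocation(min_size, target_size, provisioned):
    """Decide what to do with a partially provisioned cluster.

    Mirrors EC2 RunInstances MinCount/MaxCount semantics: request
    target_size nodes but accept anything >= min_size rather than
    failing outright on insufficient capacity.
    """
    if provisioned >= target_size:
        return "complete"
    if provisioned >= min_size:
        return "degraded-but-usable"
    return "fail-and-release"

assert accept_allocation(5, 10, 10) == "complete"
assert accept_allocation(5, 10, 7) == "degraded-but-usable"
assert accept_allocation(5, 10, 3) == "fail-and-release"
```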

Ensuring Idempotency

Questions:

● How do you safely retry expensive calls?
● How do you build reliable workflows?

Potential answers:

● client tokens, as described in the AWS User Guide
● discuss: convergence vs. single-step retries
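The client-token idea: EC2 RunInstances accepts a ClientToken, and retrying the same request with the same token returns the original launch instead of creating duplicate instances. A toy model of that behavior (InstanceLauncher is a made-up stand-in, not real boto3):

```python
import uuid

class InstanceLauncher:
    """Toy model of EC2 client-token idempotency: the same token always
    maps to the same launch result, so retries are safe."""

    def __init__(self):
        self._by_token = {}
        self._launched = 0

    def run_instances(self, count, client_token):
        if client_token in self._by_token:
            # Safe retry: return the original result, launch nothing new.
            return self._by_token[client_token]
        ids = [f"i-{self._launched + n:08x}" for n in range(count)]
        self._launched += count
        self._by_token[client_token] = ids
        return ids

launcher = InstanceLauncher()
token = str(uuid.uuid4())
first = launcher.run_instances(3, token)
retry = launcher.run_instances(3, token)  # e.g. after a timed-out response
assert first == retry
assert launcher._launched == 3  # no duplicate instances
```

Workflows built from such idempotent steps can simply be re-run from the top after a failure, which is the crux of the convergence vs. single-step-retry discussion.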

Networking & Performance

Questions:

● What’s the ideal setup that’s both usable and secure?
● How do you get consistent intra-cluster performance?

Potential answers:

● VPC with VPN or DirectConnect. Placement groups help.
● Security model: initially it was just perimeter security; now it can do a lot more (disk encryption, SSL, Kerberos)

Images & Bootstrap Speed

Questions:

● Do you allow custom AMIs or force your own choices?
● If using custom AMIs, how can you reduce bootstrap time?

Potential answers:

● Custom AMIs are common - integrated with existing infra
● Fast bootstrap by baking on top with custom bits

Data durability

Questions:

● How do you place replicas? Datacenter topology?
● How are instances distributed in different failure domains?

Potential answers:

● ignore, or go with large instances that map 1:1 to hosts
● would be nice to have: a way to influence host-to-instance allocation or to get datacenter topology data
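Since AWS does not expose per-host topology, availability zones are the coarsest failure domain a replica-placement policy can actually use. A sketch of zone-aware placement in the spirit of HDFS rack awareness (the function and its round-robin policy are illustrative, not Hadoop's actual block placement code):

```python
def place_replicas(instance_zones, replication=3):
    """Pick replica hosts so copies land in distinct failure domains
    where possible (availability zones here, racks on bare metal).

    instance_zones: dict of instance id -> availability zone.
    """
    by_zone = {}
    for instance, zone in sorted(instance_zones.items()):
        by_zone.setdefault(zone, []).append(instance)
    # One replica per zone per round, so zones are exhausted evenly.
    buckets = sorted(by_zone.values(), key=len, reverse=True)
    chosen, round_ = [], 0
    while len(chosen) < replication:
        progressed = False
        for bucket in buckets:
            if round_ < len(bucket) and len(chosen) < replication:
                chosen.append(bucket[round_])
                progressed = True
        if not progressed:
            break  # fewer instances than requested replicas
        round_ += 1
    return chosen

zones = {"i-1": "us-east-1a", "i-2": "us-east-1a",
         "i-3": "us-east-1b", "i-4": "us-east-1c"}
placed = place_replicas(zones, replication=3)
assert len({zones[i] for i in placed}) == 3  # one replica per zone
```

Without topology data, HDFS sees every instance as being on the same "rack", which is exactly why the large-instances-mapping-1:1-to-hosts workaround exists.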

S3 integration

Questions:

● How do you reconcile differences in semantics with HDFS? (strong consistency in HDFS vs. eventual consistency in S3)

● How do you get the most out of it in terms of performance?

Potential answers:

● we’ve done a fair amount of work improving S3 support in the open source (features, stability improvements, security, etc.)
● performance is network-bound
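The semantic gap can be illustrated with a toy model: at the time, a newly written S3 object might not appear in LIST results immediately, whereas HDFS readers always see completed writes. Both classes below are hypothetical stand-ins for illustration, not the real S3 API:

```python
import time

class EventuallyConsistentStore:
    """Toy model of eventually consistent listings: a new object only
    becomes visible to list_keys() after a propagation delay."""

    def __init__(self, delay=0.05):
        self._objects = {}  # key -> (value, visible_after)
        self._delay = delay

    def put(self, key, value):
        self._objects[key] = (value, time.monotonic() + self._delay)

    def list_keys(self):
        now = time.monotonic()
        return sorted(k for k, (_, t) in self._objects.items() if t <= now)

def wait_for_key(store, key, timeout=2.0, poll=0.01):
    """What a job committer effectively needs on such a store: poll the
    listing until the expected output key becomes visible."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if key in store.list_keys():
            return True
        time.sleep(poll)
    return False

store = EventuallyConsistentStore()
store.put("job/output/part-00000", b"rows")
assert "job/output/part-00000" not in store.list_keys()  # not visible yet
assert wait_for_key(store, "job/output/part-00000")      # visible after delay
```

On HDFS the first assert would fail: the write is visible as soon as it completes. Bridging that gap is part of the open-source S3 integration work mentioned above.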

Thanks! Questions?

Andrei Savu - asavu@cloudera.com

Twitter: @andreisavu

Join us to take Hadoop to the clouds!
https://hire.jobvite.com/Jobvite/job.aspx?j=orafYfwy&b=nqlg3nwW
