Challenges of Running Hadoop on AWS - AdvancedAWS Meetup

Post on 19-Aug-2014


DESCRIPTION

Nowadays we have all the tools we need to spin up and tear down clusters with hundreds of nodes in minutes, and this puts more pressure on the tools we use to configure and monitor our applications. The challenge is even more interesting when we have to deal with long-running distributed data storage and processing systems like Hadoop. In this talk we will look at some of the challenges of creating and managing Hadoop clusters on AWS, discuss improvement opportunities in monitoring (e.g. detecting and dealing with instance failure, resource contention and noisy neighbors), and touch on the future: how we should go about decoupling workload dispatch from cluster lifecycle.

TRANSCRIPT


Challenges of running Hadoop on AWS

June 12, 2014 @ AdvancedAWS Meetup - Citizen Space

Andrei Savu - @andreisavu

Software Engineer, Cloud Automation Team

Overview

● Introduction
● Context
● Challenges
● Questions

Andrei Savu

Software Engineer

Cloud Automation Team @ Cloudera

Previously: founder of Axemblr, Apache Whirr PMC, contributor to jclouds, Cloudsoft, Facebook etc. (see LinkedIn)

Cloud Automation Team @ Cloudera

Focused on:

● building tools to automate deployment and ongoing management of Hadoop clusters on cloud infrastructure

● improving Hadoop cloud compatibility (e.g. S3 integration, Swift, managed databases, custom network topologies, etc.)

We are hiring!

Context

● Hadoop
● Types of Deployments
● Cluster Topology
● AWS

Context: Hadoop

Hadoop is a broad, coherent stack of products for data storage and processing.

“Hadoop” is more than HDFS & MapReduce: it supports multiple storage systems, different query engines, batch and real-time processing, etc.

Traditionally run on bare metal, now moving towards cloud infrastructure.

Types of Deployments

Long running:

- store data for analytics jobs with MapReduce, Impala, Spark

- online data serving with HBase

On-demand:

- analytical workloads, fetch data on-demand

- triggered by workflows (1:1)

- disconnected lifecycle (Netflix Genie)

Cluster Topology #1

Simple:

● EC2-Classic (being phased out)
● VPC: single subnet, security group, with an optional VPN

Complex:

● VPC: multiple subnets & security groups
● DirectConnect
● highly available with disaster recovery
● multiple users & security

Amazon Web Services

Paradigm shift in how we work with infrastructure.

Key concept: software defined - controlled by APIs

Has most of the things we need for storage and high-performance data processing (placement groups, large instances, high storage density, SSDs, many vCPUs, etc.)

Enterprise-ready: IAM, VPC, VPN / DirectConnect, Support etc.

Challenges

● Instance Provisioning & Health
● Ensuring Idempotency
● Networking & Performance
● AMIs & Bootstrap Speed
● Data Durability
● S3 Integration

What makes it more difficult?

… versus a typical stateless web application in an auto-scaling group monitoring request latency or OS load averages

● statefulness (think databases)
● each cluster has multiple processes playing different roles
● topology & configuration changes require orchestration
● knowledge of service inter-dependencies is required
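The orchestration point above can be made concrete: starting or reconfiguring a stateful cluster means ordering operations by service dependencies, not just restarting everything at once. A minimal sketch as a topological sort (the service names and the dependency graph here are illustrative, not Cloudera's actual model):

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Hypothetical Hadoop service dependency graph: each service maps to
# the services that must be up before it can start.
DEPENDENCIES = {
    "zookeeper": [],
    "hdfs-namenode": ["zookeeper"],
    "hdfs-datanode": ["hdfs-namenode"],
    "yarn-resourcemanager": ["hdfs-namenode"],
    "yarn-nodemanager": ["yarn-resourcemanager", "hdfs-datanode"],
    "hbase-master": ["zookeeper", "hdfs-datanode"],
}

def start_order(deps):
    """Return a service start order that respects inter-dependencies."""
    return list(TopologicalSorter(deps).static_order())

order = start_order(DEPENDENCIES)
# ZooKeeper must come up before the NameNode, which must precede DataNodes.
assert order.index("zookeeper") < order.index("hdfs-namenode")
assert order.index("hdfs-namenode") < order.index("hdfs-datanode")
```

A stateless web tier has no equivalent of this ordering constraint, which is why auto-scaling-group tooling transfers poorly to Hadoop.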

Instance Provisioning & Health

Questions:

● How do you define your cluster size to deal with lack of capacity?
● How do you define health? Is it stable during setup?
● Is health a binary property, or a threshold that needs to be continuously evaluated?

Potential answers:

● match AWS semantics: define size as a range
● make simplifying assumptions (e.g. healthy during setup)
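"Size as a range" mirrors the EC2 RunInstances MinCount/MaxCount semantics: ask for a target count but accept any allocation at or above a minimum, so a transient lack of capacity does not fail the whole cluster build. A sketch of that decision (the function name and return values are made up for illustration):

```python
def accept_allocation(min_size, target_size, provisioned):
    """Decide what to do with a partially provisioned cluster.

    Mirrors EC2 RunInstances MinCount/MaxCount semantics: request
    target_size nodes but accept anything >= min_size rather than
    failing outright on insufficient capacity.
    """
    if provisioned >= target_size:
        return "complete"
    if provisioned >= min_size:
        return "degraded-but-usable"
    return "fail-and-release"

assert accept_allocation(5, 10, 10) == "complete"
assert accept_allocation(5, 10, 7) == "degraded-but-usable"
assert accept_allocation(5, 10, 3) == "fail-and-release"
```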

Ensuring Idempotency

Questions:

● How do you safely retry expensive calls?
● How do you build reliable workflows?

Potential answers:

● client tokens, as described in the AWS User Guide
● discuss: convergence vs. single-step retries
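The client-token idea: EC2 RunInstances accepts a ClientToken, and retrying the same request with the same token returns the original launch instead of creating duplicate instances. A toy model of that behavior (InstanceLauncher is a made-up stand-in, not real boto3):

```python
import uuid

class InstanceLauncher:
    """Toy model of EC2 client-token idempotency: the same token always
    maps to the same launch result, so retries are safe."""

    def __init__(self):
        self._by_token = {}
        self._launched = 0

    def run_instances(self, count, client_token):
        if client_token in self._by_token:
            # Safe retry: return the original result, launch nothing new.
            return self._by_token[client_token]
        ids = [f"i-{self._launched + n:08x}" for n in range(count)]
        self._launched += count
        self._by_token[client_token] = ids
        return ids

launcher = InstanceLauncher()
token = str(uuid.uuid4())
first = launcher.run_instances(3, token)
retry = launcher.run_instances(3, token)  # e.g. after a timed-out response
assert first == retry
assert launcher._launched == 3  # no duplicate instances
```

Workflows built from such idempotent steps can simply be re-run from the top after a failure, which is the crux of the convergence vs. single-step-retry discussion.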

Networking & Performance

Questions:

● What’s the ideal setup that’s both usable and secure?
● How do you get consistent intra-cluster performance?

Potential answers:

● VPC with VPN or DirectConnect. Placement groups help.
● Security model: initially it was just perimeter security; now it can do a lot more (disk encryption, SSL, Kerberos)

Images & Bootstrap Speed

Questions:

● Do you allow custom AMIs or force your own choices?
● If using custom AMIs, how can you reduce bootstrap time?

Potential answers:

● Custom AMIs are common - integrated with existing infra
● Fast bootstrap by baking on top with custom bits

Data durability

Questions:

● How do you place replicas? Datacenter topology?
● How are instances distributed in different failure domains?

Potential answers:

● ignore, or go with large instances that map 1:1 to hosts
● would be nice to have: a way to influence host-to-instance allocation or to get datacenter topology data
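Since AWS does not expose per-host topology, availability zones are the coarsest failure domain a replica-placement policy can actually use. A sketch of zone-aware placement in the spirit of HDFS rack awareness (the function and its round-robin policy are illustrative, not Hadoop's actual block placement code):

```python
def place_replicas(instance_zones, replication=3):
    """Pick replica hosts so copies land in distinct failure domains
    where possible (availability zones here, racks on bare metal).

    instance_zones: dict of instance id -> availability zone.
    """
    by_zone = {}
    for instance, zone in sorted(instance_zones.items()):
        by_zone.setdefault(zone, []).append(instance)
    # One replica per zone per round, so zones are exhausted evenly.
    buckets = sorted(by_zone.values(), key=len, reverse=True)
    chosen, round_ = [], 0
    while len(chosen) < replication:
        progressed = False
        for bucket in buckets:
            if round_ < len(bucket) and len(chosen) < replication:
                chosen.append(bucket[round_])
                progressed = True
        if not progressed:
            break  # fewer instances than requested replicas
        round_ += 1
    return chosen

zones = {"i-1": "us-east-1a", "i-2": "us-east-1a",
         "i-3": "us-east-1b", "i-4": "us-east-1c"}
placed = place_replicas(zones, replication=3)
assert len({zones[i] for i in placed}) == 3  # one replica per zone
```

Without topology data, HDFS sees every instance as being on the same "rack", which is exactly why the large-instances-mapping-1:1-to-hosts workaround exists.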

S3 integration

Questions:

● How do you reconcile differences in semantics with HDFS? (strong consistency in HDFS vs. eventual consistency in S3)

● How do you get the most out of it in terms of performance?

Potential answers:

● we’ve done a fair amount of work improving S3 support in the open source (features, stability improvements, security, etc.)
● performance is network-bound
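The semantic gap can be illustrated with a toy model: at the time, a newly written S3 object might not appear in LIST results immediately, whereas HDFS readers always see completed writes. Both classes below are hypothetical stand-ins for illustration, not the real S3 API:

```python
import time

class EventuallyConsistentStore:
    """Toy model of eventually consistent listings: a new object only
    becomes visible to list_keys() after a propagation delay."""

    def __init__(self, delay=0.05):
        self._objects = {}  # key -> (value, visible_after)
        self._delay = delay

    def put(self, key, value):
        self._objects[key] = (value, time.monotonic() + self._delay)

    def list_keys(self):
        now = time.monotonic()
        return sorted(k for k, (_, t) in self._objects.items() if t <= now)

def wait_for_key(store, key, timeout=2.0, poll=0.01):
    """What a job committer effectively needs on such a store: poll the
    listing until the expected output key becomes visible."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if key in store.list_keys():
            return True
        time.sleep(poll)
    return False

store = EventuallyConsistentStore()
store.put("job/output/part-00000", b"rows")
assert "job/output/part-00000" not in store.list_keys()  # not visible yet
assert wait_for_key(store, "job/output/part-00000")      # visible after delay
```

On HDFS the first assert would fail: the write is visible as soon as it completes. Bridging that gap is part of the open-source S3 integration work mentioned above.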

Thanks! Questions?

Andrei Savu - asavu@cloudera.com

Twitter: @andreisavu

Join us to take Hadoop to the clouds!
https://hire.jobvite.com/Jobvite/job.aspx?j=orafYfwy&b=nqlg3nwW
