Apache YARN - Hadoop Cluster Management

Apache YARN

Dmitry Tolpeko – EPAM Systems, March 2015

Why Should I Care

Performance

Becoming Standard

Before YARN

Hadoop = MapReduce

Independent clusters for:

• Spark

• Storm

• Kafka

• HBase etc.

Problems

Too many clusters

Under-utilization

Weak data locality

Administration cost

YARN

All applications

All environments - Production, Research, Development

YARN Core

Resource Manager (RM)

Node Managers (NM)

Application Masters (AM)

[Diagram: Node 1 hosts the RM; Node 2 and Node 3 each host an NM; three AMs run on the NM nodes]

Containers

The right to use a resource

[Diagram: the AM sends a Resource Request to the RM, then runs containers on the NMs]
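The request flow on this slide can be sketched as a toy model. This is a minimal illustration in Python, not the YARN client API: the class names, the memory-only resource model, and the first-fit placement are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class NodeManager:
    """Toy stand-in for an NM: records the containers it was asked to run."""
    name: str
    running: list = field(default_factory=list)

    def run_container(self, container_id: str) -> None:
        self.running.append(container_id)


class ResourceManager:
    """Toy stand-in for the RM: it only hands out the right to use memory."""

    def __init__(self, capacity_mb: dict):
        self.available = dict(capacity_mb)  # node name -> free MB
        self._next_id = 0

    def allocate(self, count: int, mb: int) -> list:
        """Grant up to `count` containers of `mb` MB each (first fit)."""
        grants = []
        for _ in range(count):
            node = next((n for n, free in self.available.items() if free >= mb), None)
            if node is None:
                break  # cluster is full; a real AM would ask again later
            self.available[node] -= mb
            self._next_id += 1
            grants.append((f"container_{self._next_id}", node))
        return grants


def application_master(rm: ResourceManager, nms: dict, count: int, mb: int) -> list:
    """Ask the RM for containers, then start each one on the NM it was granted on."""
    grants = rm.allocate(count, mb)
    for container_id, node in grants:
        nms[node].run_container(container_id)
    return grants


nms = {"nm1": NodeManager("nm1"), "nm2": NodeManager("nm2")}
rm = ResourceManager({"nm1": 2048, "nm2": 1024})
placements = application_master(rm, nms, count=4, mb=1024)
print(placements)
```

Only three of the four requested containers fit in this hypothetical cluster, so the AM receives three grants and launches exactly those.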

Capacity Scheduler

Organize jobs into queues

Use resources of other queues if they are not busy

Preemption
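The three points above can be illustrated with a small calculation. The sketch below is an assumed simplification, not the Capacity Scheduler's actual algorithm: the queue names, the integer "units" of capacity, and the fixed queue order used to hand out spare capacity are all hypothetical.

```python
def share_capacity(total: int, guaranteed: dict, demand: dict) -> dict:
    """Split `total` capacity units among queues.

    guaranteed: queue -> units reserved for that queue (sums to `total`)
    demand:     queue -> units the queue currently wants
    """
    # Step 1: every queue gets what it asks for, up to its guaranteed share.
    alloc = {q: min(demand[q], guaranteed[q]) for q in guaranteed}
    spare = total - sum(alloc.values())
    # Step 2: idle capacity is lent to queues that still want more.
    for q in guaranteed:
        extra = min(spare, demand[q] - alloc[q])
        alloc[q] += extra
        spare -= extra
    return alloc


guaranteed = {"production": 60, "research": 30, "development": 10}

# Production is mostly idle, so research borrows its unused capacity.
quiet = share_capacity(100, guaranteed, {"production": 20, "research": 70, "development": 10})

# Production becomes busy again. Recomputing the shares shows what it is owed;
# preemption is the mechanism that takes the borrowed capacity back.
busy = share_capacity(100, guaranteed, {"production": 60, "research": 70, "development": 10})

print(quiet)
print(busy)
```

In the quiet case research runs at 70 units against a guarantee of 30; in the busy case it falls back to 30, and the 40-unit difference is what preemption would reclaim.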

Hive and MapReduce

Multiple MapReduce jobs per query

Separate Application Master and containers for each job

10+ seconds overhead in a busy cluster

Tez

Single Application Master per Session

Pre-allocated containers
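A rough back-of-the-envelope comparison shows why the single Application Master matters. The 10-second figure comes from the Hive and MapReduce slide; the job and query counts are hypothetical, and the model assumes a Tez session pays the same startup overhead once, which is a simplification.

```python
STARTUP_OVERHEAD_S = 10  # "10+ seconds overhead in a busy cluster", treated as a lower bound


def mapreduce_startup_cost(jobs_per_query: int, queries: int) -> int:
    """Each MapReduce job starts its own Application Master and containers."""
    return STARTUP_OVERHEAD_S * jobs_per_query * queries


def tez_session_startup_cost(jobs_per_query: int, queries: int) -> int:
    """One Application Master per session: the overhead is paid once (assumed)."""
    return STARTUP_OVERHEAD_S


# Hypothetical workload: 5 queries, each compiled into 4 jobs.
mr_cost = mapreduce_startup_cost(jobs_per_query=4, queries=5)
tez_cost = tez_session_startup_cost(jobs_per_query=4, queries=5)
print(mr_cost, tez_cost)
```

Under these assumptions the MapReduce path spends at least 200 seconds on startup alone, against 10 seconds for the session.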

Spark

Driver in Client or Application Master

Spark Web UI – Application Master URL in Resource Manager

Easy deployment

YARN and Docker

Running Docker containers in YARN

Isolating applications

Packaging complex applications

YARN not designed for:

• Long-running services

• Short-lived interactive queries

Slider

Long-lived Applications

Run non-YARN distributed applications on YARN

Llama – Impala on YARN

Get resources from YARN

Single Application Master per queue – multiplex all Impala requests

YARN vs Mesos

Mesos

Scalable, global resource manager for the entire data center

2-level scheduler
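The 2-level idea can be sketched as follows: the Mesos master decides which resources to offer to which framework (level 1), and each framework's own scheduler decides how much of the offer to use (level 2). This is a toy model with assumed names and a CPU-only resource, not the Mesos API.

```python
from dataclasses import dataclass


@dataclass
class Framework:
    """Level 2: a framework scheduler that accepts or declines offered CPUs."""
    name: str
    cpus_per_task: int
    pending_tasks: int

    def consider_offer(self, offered_cpus: int) -> int:
        """Return how many of the offered CPUs this framework accepts."""
        tasks = min(self.pending_tasks, offered_cpus // self.cpus_per_task)
        self.pending_tasks -= tasks
        return tasks * self.cpus_per_task


def master_offer_round(free_cpus: int, frameworks: list):
    """Level 1: the master offers the currently free CPUs to each framework in turn."""
    accepted = {}
    for fw in frameworks:
        used = fw.consider_offer(free_cpus)
        free_cpus -= used  # whatever was declined is offered to the next framework
        accepted[fw.name] = used
    return accepted, free_cpus


frameworks = [
    Framework("spark", cpus_per_task=2, pending_tasks=3),
    Framework("storm", cpus_per_task=3, pending_tasks=5),
]
accepted, left_over = master_offer_round(10, frameworks)
print(accepted, left_over)
```

The contrast with the YARN flow is the direction of the decision: here the framework chooses from what it is offered, whereas a YARN Application Master states what it needs and the Resource Manager decides placement.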

Myriad

Dynamic YARN

Thank you
