TRANSCRIPT
Building the Right Platform Architecture for Hadoop
ROMMEL GARCIA HORTONWORKS
#whoami
▸ Sr. Solutions Engineer & Platform Architect @hortonworks
▸ Global Security SME Lead @hortonworks
▸ Author of “Virtualizing Hadoop: How to Install, Deploy, and Optimize Hadoop in a Virtualized Architecture”
▸ Runs Atlanta Hadoop User Group
Overview
The Elephant keeps getting bigger
THE HADOOP ECOSYSTEM
THE TWO FACES OF HADOOP
Hadoop Cluster
Master Nodes: manage cluster state, manage data & job state
Worker Nodes: manage CPU, RAM, I/O, and network resources
RUN W/ YARN
PLATFORM FOCUS
▸ Performance (SLA)
▸ Scale (Bursts)
▸ Speed (RT)
▸ Throughput (data/time, compute/time)
▸ Resiliency (HA)
▸ Graceful Degradation (Throttling/Failure Mgt.)
Network
It’s not Pyramidal
NETWORK IS EVERYTHING!
Workloads
Your workload no longer affects me
HADOOP AND THE WORLD OF WORKLOADS
Workloads (YARN): OLAP | OLTP | DATA SCIENCE | STREAMING
Data (HDFS): STRUCTURED | UNSTRUCTURED
Hadoop Cluster
Deployment
HADOOP WORKLOAD MANAGEMENT
Hadoop Cluster
YARN
PHYSICAL MEMORY/CPU
queues
containers (C1, C2, C3 … CN within each queue)
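The queue-and-container split shown here is typically wired up through YARN's Capacity Scheduler. A hypothetical `capacity-scheduler.xml` fragment (the queue names and percentages are illustrative, not from the talk):

```shell
# Illustrative capacity-scheduler.xml fragment: one YARN queue per workload,
# so each workload's containers draw from a fixed share of cluster memory/CPU.
cat >> capacity-scheduler.xml <<'XML'
<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>olap,datascience,streaming</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.olap.capacity</name>
  <value>50</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.datascience.capacity</name>
  <value>30</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.streaming.capacity</name>
  <value>20</value>
</property>
XML
```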
DEFAULT GRID SETTINGS FOR ALL WORKLOADS
▸ ext4 or XFS for Worker Nodes ENFORCED
▸ Transparent Huge Pages (THP) Compaction OFF
▸ Masters’ Swappiness = 1, Workers’ Swap turned OFF
▸ Jumbo Frames ENABLED
▸ IO Scheduling “deadline” ENABLED
▸ Limiting “processes” and “files” ENFORCED
▸ Name Service Cache ENABLED
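On a RHEL/CentOS-style host, the defaults above amount to something like the following sketch (run as root; paths, device names, and limits values are illustrative and vary by distro and hardware):

```shell
# Host-prep sketch for the grid defaults above. bond0, sdb, and the
# limits values are assumptions, not values from the talk.
echo never > /sys/kernel/mm/transparent_hugepage/defrag   # THP compaction OFF
sysctl -w vm.swappiness=1       # masters: minimal swap pressure
swapoff -a                      # workers: swap OFF entirely
ip link set dev bond0 mtu 9000  # jumbo frames (switch ports must match)
echo deadline > /sys/block/sdb/queue/scheduler            # "deadline" I/O scheduler per data disk
cat >> /etc/security/limits.conf <<'EOF'                  # raise process/file limits
hdfs  -  nofile  128000
hdfs  -  nproc   65536
EOF
systemctl enable --now nscd     # name service caching
```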
OLAP
Give back my precious SQL!
HIVE/SPARK SQL WORKLOAD BEHAVIOR
▸ Hive/Spark SQL workloads depend on YARN
▸ Near-realtime to Batch SLA
▸ Large data sets; consumes lots of memory, moderate CPU usage
▸ Hundreds to hundreds of thousands of analytical jobs daily
▸ Typically Memory bound first, then I/O
HIVE/SPARK SQL WORKLOAD PARALLELISM
▸ Hive
▸ Hive auto-parallelism ENABLED
▸ Hive on Tez ENABLED
▸ Reuse Tez Session ENABLED
▸ ORC for Hive ENFORCED
▸ Spark
▸ Repartition Spark SQL RDD ENFORCED
▸ 2GB to 4GB YARN container size
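The Hive bullets above translate to session (or hive-site) properties; a hypothetical settings file, with illustrative values:

```shell
# Sketch of Hive-on-Tez session settings matching the bullets above.
# Thread count and container size are example values.
cat > hive-tuning.hql <<'SQL'
SET hive.execution.engine=tez;           -- Hive on Tez
SET hive.exec.parallel=true;             -- auto-parallelism across independent stages
SET hive.exec.parallel.thread.number=8;
SET hive.prewarm.enabled=true;           -- keep Tez containers warm for session reuse
SET hive.default.fileformat=ORC;         -- ORC enforced as the default format
SET hive.tez.container.size=4096;        -- 2-4 GB YARN containers (value in MB)
SQL
```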
HIVE/SPARK SQL WORKLOAD DEPLOYMENT MODELS
Bare Metal
Master Node
▸ 2 x 6 CPU Cores
▸ 128 GB RAM
▸ 4 x 1TB SSD RAID 10 plus 1 Hot-spare
▸ 2 x 10Gbe NIC Bonded
Worker Node
▸ 2 x 8 CPU Cores
▸ 256 GB RAM
▸ 2 x 1TB SSD RAID 1
▸ 12 x 4TB SATA/SAS/NL-SAS
▸ 2 x 10Gbe NIC Bonded
Cloud
All Nodes are the same
▸ Typically more nodes vs bare metal
▸ Storage (AWS/Azure)
▸ Batch - S3/Blob/ADLS
▸ Interactive - EBS/Premium
▸ Near-realtime - Local/Local
▸ vm types (AWS/Azure)
▸ >=m4.4xlarge (EBS) / >= A9 (Blob/Premium)
▸ >=i2.4xlarge (Local) / >= DS14 (Premium/Local)
▸ >=d2.2xlarge (Local) / >= DS14_v2 (Premium/Local)
Virtualized On-prem
More Master Nodes than Bare Metal
▸ 2 x 4 vCPU Cores
▸ 48 GB vRAM
▸ 1TB SAN/NAS
▸ 2 x 10Gbe vNIC Bonded
Worker Node
▸ 6 vCPU Cores
▸ 32 GB vRAM
▸ 1TB SAN/NAS
▸ 2 x 10Gbe NIC Bonded
▸ Storage (data)
▸ SAN/NAS
▸ Appliance
▸ NetApp
▸ Isilon
▸ vBlock
OLTP
Come on, OLTP is fun!
HBASE/SOLR WORKLOAD BEHAVIOR
▸ HBase & Solr workloads
▸ Realtime, in sub-seconds SLA
▸ Large data set
▸ Millions to Billions of hits per day
▸ HBase
▸ HBase for GET/PUT/SCAN
▸ Typically IO bound first, then memory for HBase
▸ Solr
▸ Solr for Search
▸ Typically CPU bound first, then memory for Solr
HBASE WORKLOAD PARALLELISM
▸ Use HBase Java API/Phoenix, not ThriftServer
▸ HBase “short circuit” reads ENABLED
▸ HBase BucketCache ENABLED
▸ Manage HBase Compaction for block locality
▸ HBase Read HA ENABLED
▸ Properly design “rowkey” and ColumnFamily
▸ Properly design RPC Handlers
▸ Minimize GC pauses
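Most of these HBase tunings land in `hbase-site.xml` (short-circuit reads also need the setting on the HDFS side). A hypothetical fragment; cache and handler sizes are illustrative:

```shell
# Illustrative hbase-site.xml fragment for the tunings above.
cat >> hbase-site.xml <<'XML'
<property><name>dfs.client.read.shortcircuit</name><value>true</value></property>    <!-- short-circuit reads -->
<property><name>hbase.bucketcache.ioengine</name><value>offheap</value></property>   <!-- BucketCache off-heap -->
<property><name>hbase.bucketcache.size</name><value>16384</value></property>         <!-- example size, MB -->
<property><name>hbase.regionserver.handler.count</name><value>60</value></property>  <!-- RPC handlers -->
<property><name>hbase.region.replica.replication.enabled</name><value>true</value></property> <!-- read HA -->
XML
```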
HBASE WORKLOAD DEPLOYMENT MODEL
Bare Metal
Master Node (ZK, HBase Master)
▸ 2 x 6 CPU Cores
▸ 128 GB RAM
▸ 4 x 1TB SSD RAID 10 plus 1 Hot-spare
▸ 2 x 10Gbe NIC Bonded
Worker Node
▸ 2 x 8-12 CPU Cores
▸ 256 GB RAM
▸ 2 x 1TB SSD RAID 1
▸ 12-24 x 4TB SATA/SAS/NL-SAS/SSD
▸ 2 x 10Gbe NIC Bonded
Cloud
All Nodes are the same
▸ Typically more nodes vs bare metal
▸ Storage (AWS/Azure)
▸ Local
▸ vm types (AWS/Azure)
▸ >= d2.4xlarge / >= D14_v2
Virtualized On-prem
More Master Nodes than Bare Metal
▸ 2 x 4 vCPU Cores
▸ 48 GB vRAM
▸ 1TB SAN/NAS
▸ 2 x 10Gbe vNIC Bonded
Worker Node w/ DAS
▸ 2 x 8-12 CPU Cores
▸ 256 GB RAM
▸ 2 x 1TB SSD RAID 1
▸ 12-24 x 4TB SATA/SAS/NL-SAS/SSD
▸ 2 x 10Gbe NIC Bonded
SOLR WORKLOAD PARALLELISM
▸ # of Shards ∝ I/O operations ∝ Speed of disk
▸ # of threads ∝ Indexing Speed
▸ Properly manage RAM Buffer Size
▸ Merge Factor = 25 (indexing), 2 (search)
▸ Use large disk cache
▸ Minimize GC pauses
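The RAM buffer and merge-factor bullets map onto the `<indexConfig>` section of `solrconfig.xml`; a hypothetical fragment, shown with indexing-time values:

```shell
# Illustrative solrconfig.xml fragment; the buffer size is an example value.
cat >> solrconfig.xml <<'XML'
<indexConfig>
  <ramBufferSizeMB>512</ramBufferSizeMB>  <!-- absorb indexing bursts before flushing -->
  <mergeFactor>25</mergeFactor>           <!-- 25 during bulk indexing; drop to 2 for search-heavy use -->
</indexConfig>
XML
```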
SOLR WORKLOAD DEPLOYMENT MODEL
Bare Metal
Master Node (ZK)
▸ 2 x 4 CPU Cores
▸ 32 GB RAM
▸ 2 x 1TB SSD (OS) RAID 1
▸ 2 x 10Gbe NIC Bonded
Worker Node
▸ 2 x 10 CPU Cores
▸ 256 GB RAM
▸ 2 x 1TB SSD (OS) RAID 1
▸ 12 x 2 - 4TB SATA/SAS/NL-SAS/SSD
▸ 2 x 10Gbe NIC Bonded
Cloud
All Nodes are the same
▸ Typically more nodes vs bare metal
▸ Storage (AWS/Azure)
▸ Local
▸ vm types (AWS/Azure)
▸ >= d2.4xlarge / >= D14_v2
Virtualized On-prem
More Master Nodes than Bare Metal
▸ 2 x 4 vCPU Cores
▸ 48 GB vRAM
▸ 1TB SAN/NAS
▸ 2 x 10Gbe vNIC Bonded
Worker Node w/ DAS
▸ 2 x 10 CPU Cores
▸ 256 GB RAM
▸ 2 x 1TB SSD RAID 1
▸ 12 x 2 - 4TB SATA/SAS/NL-SAS/SSD
▸ 2 x 10Gbe NIC Bonded
Data Science
But Captain, one does not simply warp into Mordor
DATA SCIENCE WORKLOAD BEHAVIOR
▸ Spark ML workloads depend on YARN
▸ Near-realtime to Batch SLA
▸ Runs “Monte Carlo” type of processing
▸ Hundreds to hundreds of thousands of analytical jobs daily
▸ CPU bound first, then memory
SPARK ML WORKLOAD PARALLELISM
▸ YARN ENABLED
▸ Cache data
▸ Serialized/Raw for fast access/processing
▸ Off-heap, slower processing
▸ Use Checkpointing
▸ Use Broadcast Variables to minimize network traffic
▸ Minimize GC by limiting object creation, use Builders
▸ executor-memory = between 8GB and 64GB
▸ executor-cores = between 2 and 4
▸ num-executors = w/ caching, datasize * 2 as total app memory
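The executor-sizing rule above can be sketched as arithmetic; the 500 GB dataset and 16 GB executors below are assumed values, not figures from the talk:

```shell
# Sizing sketch: with caching, budget ~2x the dataset size as total executor
# memory, then divide by per-executor memory (rounding up).
DATASET_GB=500                # assumed size of the cached dataset
EXECUTOR_MEM_GB=16            # within the 8-64 GB guidance above
TOTAL_MEM_GB=$(( DATASET_GB * 2 ))
NUM_EXECUTORS=$(( (TOTAL_MEM_GB + EXECUTOR_MEM_GB - 1) / EXECUTOR_MEM_GB ))
echo "spark-submit --executor-memory ${EXECUTOR_MEM_GB}G --executor-cores 3 --num-executors ${NUM_EXECUTORS} ..."
```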
SPARK ML WORKLOAD DEPLOYMENT MODELS
Bare Metal
Master Node
▸ 2 x 6 CPU Cores
▸ 64 GB RAM
▸ 4 x 1TB SSD RAID 10 plus 1 Hot-spare
▸ 2 x 10Gbe NIC Bonded
Worker Node
▸ 2 x 12 CPU Cores
▸ 256 - 512 GB RAM
▸ 2 x 1TB SSD RAID 1
▸ 12 x 4TB SATA/SAS/NL-SAS
▸ 2 x 10Gbe NIC Bonded
Cloud
All Nodes are the same
▸ Typically more nodes vs bare metal
▸ Storage (AWS/Azure)
▸ Local
▸ vm types (AWS/Azure)
▸ >= d2.4xlarge / >= D14_v2
Virtualized On-prem
More Master Nodes than Bare Metal
▸ 2 x 4 vCPU Cores
▸ 48 GB vRAM
▸ 1TB SAN/NAS
▸ 2 x 10Gbe vNIC Bonded
Worker Node w/ DAS
▸ 2 x 8-12 CPU Cores
▸ 256 GB RAM
▸ 2 x 1TB SSD RAID 1
▸ 12-24 x 4TB SATA/SAS/NL-SAS
▸ 2 x 10Gbe NIC Bonded
Streaming
Stream me in, Scotty!
NIFI/STORM WORKLOAD BEHAVIOR
▸ NiFi & Storm Streaming workloads
▸ Always “ON” data ingest and processing
▸ Guarantees data delivery
▸ Highly distributed
▸ simple event processing -> complex event processing
NIFI WORKLOAD PARALLELISM
▸ Network bound first, then memory or cpu
▸ Go granular on NiFi processors
▸ nifi.bored.yield.duration=10
▸ RAID 10 for repositories: Content, FlowFile and Provenance
▸ nifi.queue.swap.threshold=20000
▸ No sharing of disk across repositories, use RAID 10
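These settings live in `nifi.properties`; a hypothetical fragment where each repository gets its own RAID 10 volume (the `/dataN` mount points are illustrative; note stock NiFi expects a time unit on the yield duration, e.g. `10 millis`):

```shell
# Illustrative nifi.properties fragment matching the bullets above.
cat >> nifi.properties <<'PROPS'
nifi.bored.yield.duration=10 millis
nifi.queue.swap.threshold=20000
nifi.content.repository.directory.default=/data1/content_repository
nifi.flowfile.repository.directory=/data2/flowfile_repository
nifi.provenance.repository.directory.default=/data3/provenance_repository
PROPS
```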
NIFI WORKLOAD DEPLOYMENT MODELS
Bare Metal
Master Node
▸ 2 x 4 CPU Cores
▸ 32 GB RAM
▸ 2 x 1TB SSD RAID 1
▸ 2 x 10Gbe NIC Bonded
Worker Node
▸ 2 x 8 CPU Cores
▸ 128 GB RAM
▸ 2 x 1TB SSD (OS) RAID 1
▸ 6 x 1TB SATA/SAS/NL-SAS RAID 10
▸ 2 x 10Gbe NIC Bonded
Cloud
All Nodes are the same
▸ Typically more nodes vs bare metal
▸ Storage (AWS/Azure)
▸ Local/EBS/Premium
▸ vm types (AWS/Azure)
▸ >= m4.4xlarge / >= A9
Virtualized On-prem
More Master Nodes than Bare Metal
▸ 2 x 4 vCPU Cores
▸ 16 GB vRAM
▸ 1TB SAN/NAS
▸ 2 x 10Gbe vNIC Bonded
Worker Node w/ DAS
▸ 2 x 8 CPU Cores
▸ 128 GB RAM
▸ 2 x 1TB SSD (OS) RAID 1
▸ 6 x 1TB SATA/SAS/NL-SAS RAID 10
▸ 2 x 10Gbe NIC Bonded
STORM WORKLOAD PARALLELISM
▸ CPU bound first, then memory
▸ Keep Tuple processing light, quick execution of execute()
▸ Use Google’s Guava for bolt-local caches (mem <1GB)
▸ Shared caches across bolts: HBase, Phoenix, Redis or MemcacheD
▸ # of workers ∝ memory
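On the supervisor side, workers-per-node and per-worker heap are what tie the worker count to memory; a hypothetical `storm.yaml` fragment (slot count and heap size are illustrative):

```shell
# Illustrative storm.yaml fragment: keep (workers per node x heap) within
# the node's RAM budget, and heaps small enough to limit GC pauses.
cat >> storm.yaml <<'YAML'
supervisor.slots.ports: [6700, 6701, 6702, 6703]   # 4 workers per supervisor node
worker.childopts: "-Xmx2048m"                       # 2 GB heap per worker
YAML
```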
STORM WORKLOAD DEPLOYMENT MODELS
Bare Metal
Master Node
▸ 2 x 4 CPU Cores
▸ 32 GB RAM
▸ 4 x 1TB SSD RAID 10 plus 1 Hot-spare
▸ 2 x 10Gbe NIC Bonded
Worker Node
▸ 2 x 12 CPU Cores
▸ 256 GB RAM
▸ 2 x 1TB SSD RAID 1
▸ 2 x 10Gbe NIC Bonded
Cloud
All Nodes are the same
▸ Typically more nodes vs bare metal
▸ Storage (AWS/Azure)
▸ Local/EBS/Premium
▸ vm types (AWS/Azure)
▸ >= r3.4xlarge / >= A9
Virtualized On-prem
More Master Nodes than Bare Metal
▸ 2 x 4 vCPU Cores
▸ 24 GB vRAM
▸ 1TB SAN/NAS
▸ 2 x 10Gbe vNIC Bonded
Worker Node w/ DAS
▸ 2 x 12 CPU Cores
▸ 256 GB RAM
▸ 1TB SAN/NAS
▸ 2 x 10Gbe NIC Bonded
http://hortonworks.com/products/sandbox/
?
THANK YOU!
@rommelgarcia /in/rommelgarcia
#hortonworks