Hadoop Performance at LinkedIn

Grid Operations

Hadoop Performance at LinkedIn

Allen Wittenauer

Grid Computing Architect

“I have never seen a Hadoop cluster that was legitimately CPU bound.”

-- Milind Bhandarkar

X5650 - 6 Core @ 2.67 GHz

“I have only seen one Hadoop cluster that was legitimately CPU bound.”

-- Milind Bhandarkar

Why do we have such high CPU usage?

We do a lot of Graph Theory.

Ticket to Ride

Ticket To Ride is a registered trademark of Days of Wonder

Social Graph

2nd Degree Connection

We under-commit our memory.

Our Hadoop Software Needs... The Plan...

Tasks
– 2 GB of RAM = 1 GB of JVM Heap, .5-1 GB for non-heap
– (Typically) 1 Super Active Thread

TaskTracker
– 1.5 GB of RAM = 1 GB of JVM Heap, .5 GB for non-heap
– 1-4 Super Active Threads

DataNode
– 1.5 GB of RAM = 1 GB of JVM Heap, .5 GB for non-heap
– 1-4 Super Active Threads

RAM: 3 GB + (task count * 2 GB) + OS needs
Threads: 8 + (task count) + OS needs
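The sizing rule above can be turned into a small calculator. This is only a sketch of the arithmetic on the slide: the function and constant names are made up for illustration, the thread figure uses the upper end of the 1-4 range for each daemon, and "OS needs" is left out because the slides give no number for it.

```python
# Sketch of the per-node sizing rule from the slide.
# All names here are illustrative; "OS needs" is deliberately excluded
# because the slides do not quantify it.

DAEMON_RAM_GB = 1.5 + 1.5   # TaskTracker + DataNode
TASK_RAM_GB = 2.0           # 1 GB heap + up to 1 GB non-heap per task
DAEMON_THREADS = 4 + 4      # upper bound: TaskTracker + DataNode
THREADS_PER_TASK = 1        # "(Typically) 1 Super Active Thread"


def hadoop_ram_gb(task_count):
    """RAM in GB needed by Hadoop on one node, before OS needs."""
    return DAEMON_RAM_GB + task_count * TASK_RAM_GB


def hadoop_threads(task_count):
    """Busy threads Hadoop may run on one node, before OS needs."""
    return DAEMON_THREADS + task_count * THREADS_PER_TASK


if __name__ == "__main__":
    for tasks in (6, 12):
        print(tasks, "tasks:", hadoop_ram_gb(tasks), "GB RAM,",
              hadoop_threads(tasks), "threads (plus OS needs)")
```

With 12 task slots, for instance, the rule gives 27 GB of RAM and 20 threads before the operating system's share is added.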

Our Hadoop Software Needs... The Reality

Task Counts
– Westmere (5650): 6 Cores+HT = 12 Tasks
– Sandy Bridge (2640): 6 Cores+HT = 14 Tasks

Most of our tasks leave at most .5 GB free
– = combined -> very large buffer & cache

We don’t have as many disks per node.

Typical Hadoop Node Out in the Wild

Most users don’t know their actual needs
– Vendor advice... play it safe!

Significantly more memory
– “For the future!”
– Badly written code

Significantly more disk
– “Hadoop is IO intensive!”
– “Greater task locality!”

Greater performance...but is it worth the cost...

What Happens With Fewer Disks?

Physical footprint requirements are smaller

Linux buffers & caches are more efficient
– More per disk
– Fewer to manage

Spindle count DOES matter... but the price/perf isn’t there for our workflows.
– From a few years ago & based on store.sun.com prices (so not “real”)...

Nodes/Cores   RAM/Bus   Disks   Time In Minutes   HW Cost*
3/24          16/half   8       254.98            $37827
3/24          24/full   8       244.50            $38817
3/24          16/half   4       257.38            $21456
3/24          24/full   4       246.82            $22986
6/48          16/half   4       126.98            $42912
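One way to read the price/perf point in these numbers is to multiply hardware cost by run time for each row, so that a lower product means more work per dollar. The sketch below does that with the figures from the table; the metric itself and all names are assumptions for illustration, not something the slides define.

```python
# Price/perf sketch using the benchmark rows from the table.
# "cost_time" (dollars * minutes) is an assumed metric: lower is better.
from collections import namedtuple

Config = namedtuple("Config", "nodes cores ram bus disks minutes cost_usd")

CONFIGS = [
    Config(3, 24, 16, "half", 8, 254.98, 37827),
    Config(3, 24, 24, "full", 8, 244.50, 38817),
    Config(3, 24, 16, "half", 4, 257.38, 21456),
    Config(3, 24, 24, "full", 4, 246.82, 22986),
    Config(6, 48, 16, "half", 4, 126.98, 42912),
]


def cost_time(config):
    """Hardware cost multiplied by run time; lower means better price/perf."""
    return config.cost_usd * config.minutes


def ranked(configs):
    """Configurations sorted from best to worst price/perf."""
    return sorted(configs, key=cost_time)


if __name__ == "__main__":
    for c in ranked(CONFIGS):
        print(f"{c.nodes}/{c.cores} {c.ram}/{c.bus} {c.disks} disks: "
              f"{cost_time(c):,.0f} dollar-minutes")
```

By this measure every 4-disk row comes out ahead of both 8-disk rows, which is consistent with the claim that extra spindles help run time only slightly relative to what they cost.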

LinkedIn Node Configuration

No RAID controller
– More cost for negative perf when doing

6 Drives
– Still fits in 1U w/SATA drives
– ~same perf as 8 drives

Less metal = cheaper cost

Rack Level View

If we assume we can use 40U in a rack then:
– More CPUs
– Just as many HDs
– More Network
– Potentially more RAM

We care about file system tuning.

LinkedIn Hadoop Disk/File Systems

noatime Enabled

writeback Enabled

Each Disk (except root) Partitions:
– Swap
– MapReduce Spill Space
– HDFS

Delayed Commits – Why write once when you can do ganged writes more efficiently?
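A quick way to audit settings like these across nodes is to scan the mount table for the expected options. The sketch below parses text in the `/proc/mounts` layout (device, mount point, type, options). It assumes that "writeback" refers to the ext3/ext4 `data=writeback` journaling mode, which the slides do not state, and the `/grid` prefix and device names in the sample are hypothetical.

```python
# Sketch: report data mounts that lack the expected mount options.
# Assumption: "writeback" on the slide means the ext3/ext4 mount option
# data=writeback. The "/grid" prefix and sample devices are hypothetical.

WANTED_OPTIONS = ("noatime", "data=writeback")


def missing_options(mounts_text, prefix="/grid", wanted=WANTED_OPTIONS):
    """Map each mount point under `prefix` to the wanted options it lacks."""
    report = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        mount_point, options = fields[1], fields[3].split(",")
        if not mount_point.startswith(prefix):
            continue
        lacking = [opt for opt in wanted if opt not in options]
        if lacking:
            report[mount_point] = lacking
    return report


SAMPLE = """\
/dev/sda1 / ext4 rw,relatime 0 0
/dev/sdb3 /grid/0 ext4 rw,noatime,data=writeback 0 0
/dev/sdc3 /grid/1 ext4 rw,relatime,data=ordered 0 0
"""

if __name__ == "__main__":
    print(missing_options(SAMPLE))
```

On a live node the same function could be fed the contents of `/proc/mounts` instead of the sample string.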

We care about job tuning.

LinkedIn Job Tuning Guidelines

All jobs get reviewed prior to going to production.

Task times should be between 5 and 15 minutes.

Jobs should have fewer than 10,000 tasks.

Jobs should be smart about # of files and the size of those files generated.
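The numeric guidelines above lend themselves to an automated pre-review check. The following is a minimal sketch, assuming per-task run times in minutes are available from somewhere such as job history; the function name and message wording are invented for illustration and do not describe LinkedIn's actual review tooling.

```python
# Sketch of a pre-review check against the numeric guidelines.
# Thresholds come from the slide; everything else is illustrative.

MIN_TASK_MINUTES = 5
MAX_TASK_MINUTES = 15
MAX_TASKS = 10_000


def review_job(task_minutes):
    """Return a list of guideline findings for one job (empty if none)."""
    findings = []
    if len(task_minutes) >= MAX_TASKS:
        findings.append(
            f"{len(task_minutes)} tasks; guideline is fewer than {MAX_TASKS}")
    too_short = sum(1 for m in task_minutes if m < MIN_TASK_MINUTES)
    too_long = sum(1 for m in task_minutes if m > MAX_TASK_MINUTES)
    if too_short:
        findings.append(
            f"{too_short} tasks ran under {MIN_TASK_MINUTES} minutes")
    if too_long:
        findings.append(
            f"{too_long} tasks ran over {MAX_TASK_MINUTES} minutes")
    return findings


if __name__ == "__main__":
    # A job with many very short tasks trips two guidelines.
    for finding in review_job([1.0] * 10_000):
        print(finding)
```

The file count and file size guideline is not covered here because the slides give no thresholds for it.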

... and the result?

Why is LinkedIn Running so Hot?

We do a lot of non-MapReduce work.

RAM buffers and caches allow us to offset a lot of disk IO.

We audit our jobs.

As a result, our CPUs are actually busy.
