Advanced Apache Spark Meetup: How Spark Beat Hadoop @ 100 TB Daytona GraySort Challenge


TRANSCRIPT

Page 1: Advanced Apache Spark Meetup:  How Spark Beat Hadoop @ 100 TB Daytona GraySort Challenge

How Spark Beat Hadoop @ 100 TB Sort
Advanced Apache Spark Meetup

Chris Fregly, Principal Data Solutions Engineer, IBM Spark Technology Center

Power of data. Simplicity of design. Speed of innovation. IBM | spark.tc

Page 2:

Meetup Housekeeping

Page 3:

Announcements

Deepak Srinivasan, Big Commerce

Steve Beier, IBM Spark Tech Center

Page 4:

Who am I?
Streaming Platform Engineer
Streaming Data Engineer, Netflix
Open Source Committer
Data Solutions Engineer, Apache Contributor
Principal Data Solutions Engineer, IBM Spark Technology Center

Page 5:

Last Meetup (End-to-End Data Pipeline)
Presented `Flux Capacitor`: End-to-End Data Pipeline in a Box!
Real-time, Advanced Analytics, Machine Learning, Recommendations
GitHub: github.com/fluxcapacitor
Docker: hub.docker.com/r/fluxcapacitor

Page 6:

Since Last Meetup (End-to-End Data Pipeline)
Meetup Statistics
Total Spark Experts: ~850 (+100%)
Mean RSVPs per Meetup: 268
Mean Attendance: ~60% of RSVPs
Donations: $15 (Thank you so much, but please keep your $!)
GitHub Statistics (github.com/fluxcapacitor): 18 forks, 13 clones, ~1300 views
Docker Statistics (hub.docker.com/r/fluxcapacitor): ~1600 downloads

Page 7:

Recent Events
Replay of last SF Meetup in Mtn View @BaseCRM: presented Flux Capacitor End-to-End Data Pipeline
(Scala + Big Data) By The Bay Conference: workshop and 2 talks; trained ~100 on End-to-End Data Pipeline
Galvanize Workshop: trained ~30 on End-to-End Data Pipeline

Page 8:

Upcoming USA Events
IBM Hackathon @ Galvanize (Sept 18th – Sept 21st)
Advanced Apache Spark Meetup @DataStax (Sept 21st): Spark-Cassandra Spark SQL + DataFrame Connector
Cassandra Summit Talk (Sept 22nd – Sept 24th): Real-time End-to-End Data Pipeline w/ Cassandra
Strata New York (Sept 29th – Oct 1st)

Page 9:

Upcoming European Events

Dublin Spark Meetup Talk (Oct 15th)
Barcelona Spark Meetup Talk (Oct ?)
Madrid Spark Meetup Talk (Oct ?)
Amsterdam Spark Meetup (Oct 27th)
Spark Summit Amsterdam (Oct 27th – Oct 29th)
Brussels Spark Meetup Talk (Oct 30th)

Page 10:

Spark and the Daytona GraySort Challenge
sortbenchmark.org

sortbenchmark.org/ApacheSpark2014.pdf

Page 11:

Themes of this Talk: Mechanical Sympathy
Seek Once, Scan Sequentially
CPU Cache Locality and the Memory Hierarchy are Key
Go Off-Heap Whenever Possible
Customize Data Structures for your Workload

Page 12:

What is the Daytona GraySort Challenge?
Key Metric: throughput of sorting 100 TB of 100-byte records with 10-byte keys
Total time includes launching the app and writing the output file
Daytona: the app must be general purpose
Gray: named after Jim Gray
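Everything later in the deck follows from this record format: 100-byte records, with only the first 10 bytes participating in comparisons. A minimal sketch in plain Python (not the official gensort generator):

```python
import os

RECORD_LEN = 100  # each record is exactly 100 bytes
KEY_LEN = 10      # only the first 10 bytes are the sort key

def make_record():
    """One record: 10 random key bytes + 90 payload bytes."""
    return os.urandom(KEY_LEN) + b"\x00" * (RECORD_LEN - KEY_LEN)

def graysort(records):
    """Sort records by their 10-byte key prefix only."""
    return sorted(records, key=lambda r: r[:KEY_LEN])

out = graysort([make_record() for _ in range(1000)])
assert all(out[i][:KEY_LEN] <= out[i + 1][:KEY_LEN] for i in range(len(out) - 1))
```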

Page 13:

Daytona GraySort Challenge: Input and Resources
Input
Records are 100 bytes in length
First 10 bytes are a random key
Input generator: `ordinal.com/gensort.html`
28,000 fixed-size partitions for the 100 TB sort
250,000 fixed-size partitions for the 1 PB sort
1 partition = 1 HDFS block = 1 node = no partial read I/O
Hardware and Runtime Resources
Commercially available and off-the-shelf
Unmodified, no over/under-clocking
Generates 500 TB of disk I/O, 200 TB of network I/O
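The partition counts imply per-partition sizes, which is why one partition can map to one (large) HDFS block on one node. Back-of-the-envelope arithmetic (decimal units assumed):

```python
TB = 10 ** 12

per_partition_100tb = 100 * TB / 28_000    # ~3.57 GB per partition
per_partition_1pb = 1_000 * TB / 250_000   # 4.0 GB per partition
records_total = 100 * TB // 100            # 10^12 records of 100 bytes each

print(f"100 TB: {per_partition_100tb / 10**9:.2f} GB/partition")
print(f"  1 PB: {per_partition_1pb / 10**9:.2f} GB/partition")
```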

Page 14:

Daytona GraySort Challenge: Rules
Must sort to/from OS files in secondary storage
No raw disk, since the I/O subsystem is being tested
File and device striping (RAID 0) are encouraged

Output file(s) must have correct key order

Page 15:

Daytona GraySort Challenge: Task Scheduling
Types of Data Locality (in increasing level of shittiness):
PROCESS_LOCAL, NODE_LOCAL, RACK_LOCAL, ANY
Delay Scheduling
`spark.locality.wait.node`: time to wait before falling back to the next shitty level
Set to infinite to reduce shittiness and force NODE_LOCAL
Straggling Executor JVMs naturally fade away on each run

Page 16:

Daytona GraySort Challenge: Winning Results
On-disk only, in-memory caching disabled!
100 TB sort: EC2 (i2.8xlarge), 28,000 partitions
1 PB sort: EC2 (i2.8xlarge), 250,000 partitions (!!)

Page 17:

Daytona GraySort Challenge: EC2 Configuration
206 EC2 worker nodes, 1 master node
i2.8xlarge: 32 Intel Xeon CPU E5-2670 @ 2.5 GHz, 244 GB RAM, 8 x 800 GB SSD, RAID 0 striping, ext4
NOOP I/O scheduler: FIFO with request merging, no reordering
3 Gbps mixed read/write disk I/O
Deployed within a Placement Group/VPC
Enhanced Networking: Single Root I/O Virtualization (SR-IOV), an extension of PCIe
10 Gbps, low latency, low jitter (iperf showed ~9.5 Gbps)

Page 18:

Daytona GraySort Challenge: Winning Configuration

Spark 1.2, OpenJDK 1.7_<amazon-something>_u65-b17
Disabled in-memory caching: all on-disk!
HDFS 2.4.1 short-circuit local reads, 2x replication
Writes flushed after every run (5 runs for 28,000 partitions)
Netty 4.0.23.Final with native epoll
Speculative execution disabled: `spark.speculation`=false
Force NODE_LOCAL: `spark.locality.wait.node`=Infinite
Force Netty off-heap: `spark.shuffle.io.preferDirectBufs`=true
Spilling disabled: `spark.shuffle.spill`=false
All compression disabled
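The same settings, collected as a `spark-defaults.conf`-style fragment. This is a sketch reconstructed from the slide, not the team's actual submission files; `spark.locality.wait.node` takes a duration in milliseconds, so "infinite" is approximated with a huge value:

```properties
spark.shuffle.manager              sort
spark.speculation                  false
spark.locality.wait.node           100000000   # ms; effectively infinite: force NODE_LOCAL
spark.shuffle.io.preferDirectBufs  true        # Netty off-heap buffers
spark.shuffle.spill                false       # spilling disabled
spark.shuffle.compress             false       # all compression disabled
spark.shuffle.spill.compress       false
```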

Page 19:

Daytona GraySort Challenge: Partitioning

Range Partitioning (vs. Hash Partitioning)
Takes advantage of the sequential key space
Similar keys grouped together within a partition
Ranges defined by sampling 79 values per partition
Driver sorts the samples and defines range boundaries
Sampling took ~10 seconds for 28,000 partitions
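The sampling scheme can be sketched in a few lines: the driver sorts a small sample and picks boundary keys, and each mapper routes a key with a binary search. `make_range_bounds` and `partition_for` are hypothetical names for illustration:

```python
import bisect
import random

def make_range_bounds(sample_keys, num_partitions):
    """Driver side: sort the sample, pick num_partitions - 1 boundary keys."""
    s = sorted(sample_keys)
    step = len(s) / num_partitions
    return [s[int(step * i)] for i in range(1, num_partitions)]

def partition_for(key, bounds):
    """Mapper side: binary-search the boundaries to route the key."""
    return bisect.bisect_right(bounds, key)

random.seed(0)
keys = [random.randbytes(10) for _ in range(79 * 8)]  # 79 samples x 8 partitions
bounds = make_range_bounds(keys, 8)

# The range-partitioning invariant: ordering by (partition, key) equals
# ordering by key alone, so sorted partitions can simply be concatenated.
by_partition = sorted(keys, key=lambda k: (partition_for(k, bounds), k))
assert by_partition == sorted(keys)
```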

Page 20:

Daytona GraySort Challenge: Why Bother?
Sorting relies heavily on the shuffle and the I/O subsystem
Shuffle is a major bottleneck in big data processing
A large number of partitions can exhaust OS resources
Shuffle optimization benefits all high-level libraries
Goal is to saturate the network controller on all nodes
~125 MB/s (1 Gb Ethernet), ~1.25 GB/s (10 Gb Ethernet)

Page 21:

Daytona GraySort Challenge: Per Node Results

Mappers: 3 Gbps/node disk I/O (8 x 800 GB SSD)
Reducers: 1.1 Gbps/node network I/O (10 Gbps NIC)

Page 22:

Quick Shuffle Refresher

Page 23:

Shuffle Overview

All to All, Cartesian Product Operation

[Diagram: all-to-all shuffle; "least useful example I could find"]

Page 24:

Spark Shuffle Overview

[Diagram: Spark DAG with shuffle boundaries; "most confusing example I could find"]

Stages are Defined by Shuffle Boundaries

Page 25:

Shuffle Intermediate Data: Spill to Disk
Intermediate shuffle data is stored in memory
Spill to Disk
`spark.shuffle.spill`=true
`spark.shuffle.memoryFraction`: fraction of the heap for shuffle buffers
Competes with `spark.storage.memoryFraction`
Bump this up from the default!! Will help Spark SQL, too.
Skipped Stages
Reuse intermediate shuffle data found on the reducer
DAG for that partition can be truncated

Page 26:

Shuffle Intermediate Data: Compression
`spark.shuffle.compress`: compress mapper outputs
`spark.shuffle.spill.compress`: compress reducer spills
`spark.io.compression.codec`
LZF: most workloads (new default for Spark)
Snappy: large workloads (less memory required to compress)

Page 27:

Spark Shuffle Operations
join, distinct, cogroup, coalesce, repartition, sortByKey, groupByKey, reduceByKey, aggregateByKey

Page 28:

Spark Shuffle Managers
`spark.shuffle.manager` = {
`hash`: < 10,000 reducers
Output file determined by hashing the key of the (K,V) pair
Each mapper creates an output buffer/file per reducer
Leads to M*R output buffers/files per shuffle
`sort`: >= 10,000 reducers
Default since Spark 1.2
Won the Daytona GraySort Challenge w/ 250,000 reducers!!
`tungsten-sort` (Future Meetup!)
}
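The file-count difference between the two managers is the whole story behind the OS-resource exhaustion mentioned earlier. A trivial sketch of the arithmetic:

```python
def hash_shuffle_files(num_mappers, num_reducers):
    """Hash shuffle: every mapper writes one file per reducer -> M * R files."""
    return num_mappers * num_reducers

def sort_shuffle_files(num_mappers):
    """Sort shuffle: one indexed master file per mapper -> M files."""
    return num_mappers

M = R = 28_000
print(hash_shuffle_files(M, R))   # 784,000,000 files per shuffle
print(sort_shuffle_files(M))      # 28,000 files per shuffle
```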

Page 29:

Shuffle Managers

Page 30:

Hash Shuffle Manager

M*R open files per shuffle; M = num mappers, R = num reducers
Mapper opens 1 file per partition/reducer
[Diagram: HDFS (2x repl) -> mappers -> M*R files -> reducers -> HDFS (2x repl)]

Page 31:

Sort Shuffle Manager

Hold Tight!

Page 32:

Tungsten-Sort Shuffle Manager

Future Meetup!!

Page 33:

Shuffle Performance Tuning

Hash Shuffle Manager (no longer the default)
`spark.shuffle.consolidateFiles`: consolidate mapper output files
`o.a.s.shuffle.FileShuffleBlockResolver`
Intermediate Files
Increase `spark.shuffle.file.buffer`: reduces seeks & sys calls
Increase `spark.reducer.maxSizeInFlight` if memory allows
Use a smaller number of larger workers to reduce the total number of files
SQL: BroadcastHashJoin vs. ShuffledHashJoin
`spark.sql.autoBroadcastJoinThreshold`
Use `DataFrame.explain(true)` or `EXPLAIN` to verify

Page 34:

Shuffle Configuration

Documentation: spark.apache.org/docs/latest/configuration.html#shuffle-behavior
Prefix: spark.shuffle

Page 35:

Winning Optimizations Deployed across Spark 1.1 and 1.2

Page 36:

Daytona GraySort Challenge: Winning Optimizations
CPU-Cache Locality: (Key, Pointer-to-Record) & Cache Alignment

Optimized Sort Algorithm: Elements of (K, V) Pairs

Reduce Network Overhead: Async Netty, epoll

Reduce OS Resource Utilization: Sort Shuffle

Page 37:

CPU-Cache Locality: (Key, Pointer-to-Record)
AlphaSort paper (~1995), Chris Nyberg and Jim Gray
Naïve: List of (Pointer-to-Record); requires the key to be dereferenced for each comparison
AlphaSort: List of (Key, Pointer-to-Record); the key is directly available for comparison
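The AlphaSort idea can be sketched as follows: records stay in one flat buffer, and the sort operates on small (key, pointer) entries, so comparisons never chase a pointer into the 90-byte payload. (Python won't show the cache effect, but the structure is the same.)

```python
import os

RECORD_LEN, KEY_LEN = 100, 10

def alphasort(buf):
    n = len(buf) // RECORD_LEN
    # AlphaSort layout: the key is copied next to the pointer, so the
    # comparator reads the entry itself rather than dereferencing the record.
    entries = [(buf[i * RECORD_LEN : i * RECORD_LEN + KEY_LEN], i) for i in range(n)]
    entries.sort()  # compares the inlined 10-byte keys
    # Follow each pointer exactly once to materialize the sorted output.
    return b"".join(buf[i * RECORD_LEN : (i + 1) * RECORD_LEN] for _, i in entries)

buf = b"".join(os.urandom(KEY_LEN) + bytes(90) for _ in range(200))
out = alphasort(buf)
keys = [out[i * RECORD_LEN : i * RECORD_LEN + KEY_LEN] for i in range(200)]
assert keys == sorted(keys)
```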

Page 38:

CPU-Cache Locality: Cache Alignment
Key (10 bytes) + Pointer (4 bytes*) = 14 bytes
*4 bytes when using compressed OOPs (<32 GB heap)
Not a power of 2 in size, not CPU-cache friendly
Cache Alignment Options
① Add Padding (2 bytes): Key (10 bytes) + Pad (2 bytes) + Pointer (4 bytes) = 16 bytes
② (Key-Prefix, Pointer-to-Record): Key-Prefix (4 bytes) + Pointer (4 bytes) = 8 bytes; perf affected by key distribution
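The three layouts are easy to check with `struct` (big-endian formats, so no implicit alignment; the pointer is modeled as a 4-byte unsigned int):

```python
import struct

KEY = b"0123456789"  # 10-byte key
PTR = 7              # compressed-OOP style 4-byte pointer

naive = struct.pack(">10sI", KEY, PTR)        # no padding
padded = struct.pack(">10s2xI", KEY, PTR)     # option 1: 2 pad bytes
prefixed = struct.pack(">4sI", KEY[:4], PTR)  # option 2: 4-byte key prefix

assert len(naive) == 14    # not a power of 2: cache-unfriendly
assert len(padded) == 16   # cache-aligned
assert len(prefixed) == 8  # cache-aligned and half the size
```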

Page 39:

CPU-Cache Locality: Performance Comparison

Page 40:

Optimized Sort Algorithm: Elements of (K, V) Pairs
`o.a.s.util.collection.TimSort`
Based on JDK 1.7 TimSort
Performs best on partially-sorted datasets
Optimized for elements of (K,V) pairs
Sorts impls of SortDataFormat (i.e. KVArraySortDataFormat)
`o.a.s.util.collection.AppendOnlyMap`
Open-addressing hash with quadratic probing
Array of [(key0, value0), (key1, value1)]: good memory locality
Keys never removed, values only appended
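A toy version of an append-only, open-addressing map with quadratic probing (fixed capacity, no resize, so purely illustrative):

```python
class AppendOnlyMap:
    """Open-addressing hash map: keys never removed, values merged in place."""

    def __init__(self, capacity=64):
        self.slots = [None] * capacity  # flat array of (key, value) pairs
        self.capacity = capacity

    def _probe(self, key):
        h = hash(key) % self.capacity
        for i in range(self.capacity):
            idx = (h + i * i) % self.capacity  # quadratic probing
            if self.slots[idx] is None or self.slots[idx][0] == key:
                return idx
        raise RuntimeError("map full")

    def update(self, key, value, merge):
        idx = self._probe(key)
        if self.slots[idx] is None:
            self.slots[idx] = (key, value)  # append-only insert
        else:
            self.slots[idx] = (key, merge(self.slots[idx][1], value))

    def get(self, key):
        slot = self.slots[self._probe(key)]
        return slot[1] if slot else None

m = AppendOnlyMap()
for word in ["a", "b", "a", "c", "a"]:
    m.update(word, 1, merge=lambda old, new: old + new)
assert (m.get("a"), m.get("b"), m.get("c")) == (3, 1, 1)
```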

Page 41:

Reduce Network Overhead: Async Netty, epoll
New network module based on async Netty
Replaces old java.nio, low-level, socket-based code
Zero-copy epoll uses kernel space between disk & network
Custom memory management reduces GC pauses
`spark.shuffle.blockTransferService`=netty
Spark-Netty Performance Tuning
`spark.shuffle.io.numConnectionsPerPeer`: increase to saturate hosts with multiple disks
`spark.shuffle.io.preferDirectBufs`: on- or off-heap (off-heap is the default)
Apache Spark Jira: SPARK-2468

Page 42:

Reduce OS Resource Utilization: Sort Shuffle
M open files per shuffle; M = num of mappers
`spark.shuffle.sort.bypassMergeThreshold`
Mapper: TimSort (RAM), then Merge Sort (Disk); merge-sorts partitions into 1 master file indexed by partition range offsets
Reducers seek to their range offset of the master file on the mapper, then scan sequentially
HDFS (2x repl) on input and output
SPARK-2926: Replace TimSort w/ Merge Sort (Memory)
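The master-file layout can be sketched with an in-memory buffer: the mapper writes partitions sequentially and records the start offset of each, so a reducer does one seek to its offset and then a sequential scan:

```python
import io

def write_master_file(partitions):
    """Mapper: write all partitions into one file, indexed by start offsets."""
    buf = io.BytesIO()
    offsets = []
    for records in partitions:
        offsets.append(buf.tell())
        for rec in records:
            buf.write(rec)
    offsets.append(buf.tell())  # end sentinel closes the last range
    return buf.getvalue(), offsets

def read_partition(master, offsets, i):
    """Reducer: seek once to offsets[i], scan sequentially to offsets[i+1]."""
    return master[offsets[i] : offsets[i + 1]]

parts = [[b"aa", b"ab"], [b"ba"], [b"ca", b"cb", b"cc"]]
master, index = write_master_file(parts)
assert read_partition(master, index, 0) == b"aaab"
assert read_partition(master, index, 1) == b"ba"
assert read_partition(master, index, 2) == b"cacbcc"
```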

Page 43:

Bonus!

Page 44:

External Shuffle Service: Separate JVM Process
Takes over when the Spark Executor is in GC or dies
Uses the new Netty-based network module
Required for YARN dynamic allocation; the Node Manager serves the files

Apache Spark Jira: SPARK-3796

Page 45:

Next Steps: Project Tungsten

Page 46:

Project Tungsten: CPU and Memory Optimizations
The Daytona GraySort optimizations targeted disk and network; Tungsten targets CPU and memory
Tungsten Optimizations
Custom Memory Management: eliminates JVM object and GC overhead
More cache-aware data structs and algos: `o.a.s.unsafe.map.BytesToBytesMap` vs. j.u.HashMap
Code Generation (default in 1.5): generate bytecode from the overall query plan

Page 47:

Thank you! Special thanks to Big Commerce!!

IBM Spark Tech Center is Hiring! Nice people only, please!!


Sign up for our newsletter at

To Be Continued…

Page 48:

Relevant Links

http://sortbenchmark.org/ApacheSpark2014.pdf
https://databricks.com/blog/2014/10/10/spark-petabyte-sort.html
https://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html
http://0x0fff.com/spark-architecture-shuffle/
http://www.cs.berkeley.edu/~kubitron/courses/cs262a-F13/projects/reports/project16_report.pdf

Page 49:

Power of data. Simplicity of design. Speed of innovation. IBM Spark