Advanced Apache Spark Meetup: How Spark Beat Hadoop @ 100 TB Daytona GraySort Challenge


TRANSCRIPT

Page 1: Advanced Apache Spark Meetup:  How Spark Beat Hadoop @ 100 TB Daytona GraySort Challenge

How Spark Beat Hadoop @ 100 TB Sort
Advanced Apache Spark Meetup

Chris Fregly, Principal Data Solutions Engineer, IBM Spark Technology Center

Power of data. Simplicity of design. Speed of innovation. IBM | spark.tc

Page 2:

Meetup Housekeeping

Page 3:

Announcements

Deepak Srinivasan, Big Commerce

Steve Beier, IBM Spark Tech Center

Page 4:

Who am I?
Streaming Platform Engineer
Streaming Data Engineer, Netflix
Open Source Committer
Data Solutions Engineer, Apache Contributor
Principal Data Solutions Engineer, IBM Spark Technology Center

Page 5:

Last Meetup (End-to-End Data Pipeline)
Presented `Flux Capacitor`: End-to-End Data Pipeline in a Box!
Real-time, Advanced Analytics, Machine Learning, Recommendations
GitHub: github.com/fluxcapacitor
Docker: hub.docker.com/r/fluxcapacitor

Page 6:

Since Last Meetup (End-to-End Data Pipeline)
Meetup Statistics
Total Spark Experts: ~850 (+100%)
Mean RSVPs per Meetup: 268
Mean Attendance: ~60% of RSVPs
Donations: $15 (Thank you so much, but please keep your $!)
GitHub Statistics (github.com/fluxcapacitor): 18 forks, 13 clones, ~1300 views
Docker Statistics (hub.docker.com/r/fluxcapacitor): ~1600 downloads

Page 7:

Recent Events
Replay of last SF Meetup in Mtn View @BaseCRM: presented Flux Capacitor End-to-End Data Pipeline
(Scala + Big Data) By The Bay Conference: workshop and 2 talks; trained ~100 on End-to-End Data Pipeline
Galvanize Workshop: trained ~30 on End-to-End Data Pipeline

Page 8:

Upcoming USA Events
IBM Hackathon @ Galvanize (Sept 18th – Sept 21st)
Advanced Apache Spark Meetup @DataStax (Sept 21st): Spark-Cassandra Spark SQL + DataFrame Connector
Cassandra Summit Talk (Sept 22nd – Sept 24th): Real-time End-to-End Data Pipeline w/ Cassandra
Strata New York (Sept 29th – Oct 1st)

Page 9:

Upcoming European Events

Dublin Spark Meetup Talk (Oct 15th)
Barcelona Spark Meetup Talk (Oct ?)
Madrid Spark Meetup Talk (Oct ?)
Amsterdam Spark Meetup (Oct 27th)
Spark Summit Amsterdam (Oct 27th – Oct 29th)
Brussels Spark Meetup Talk (Oct 30th)

Page 10:

Spark and the Daytona GraySort Challenge
sortbenchmark.org

sortbenchmark.org/ApacheSpark2014.pdf

Page 11:

Themes of this Talk: Mechanical Sympathy
Seek Once, Scan Sequentially
CPU Cache Locality and the Memory Hierarchy are Key
Go Off-Heap Whenever Possible
Customize Data Structures for your Workload

Page 12:

What is the Daytona GraySort Challenge?
Key Metric: throughput of sorting 100 TB of 100-byte records with 10-byte keys
Total time includes launching the app and writing the output file
Daytona: the app must be general purpose
Gray: named after Jim Gray
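Everything later in the deck follows from this record format: 100-byte records, with only the first 10 bytes participating in comparisons. A minimal sketch in plain Python (not the official gensort generator):

```python
import os

RECORD_LEN = 100  # each record is exactly 100 bytes
KEY_LEN = 10      # only the first 10 bytes are the sort key

def make_record():
    """One record: 10 random key bytes + 90 payload bytes."""
    return os.urandom(KEY_LEN) + b"\x00" * (RECORD_LEN - KEY_LEN)

def graysort(records):
    """Sort records by their 10-byte key prefix only."""
    return sorted(records, key=lambda r: r[:KEY_LEN])

out = graysort([make_record() for _ in range(1000)])
assert all(out[i][:KEY_LEN] <= out[i + 1][:KEY_LEN] for i in range(len(out) - 1))
```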

Page 13:

Daytona GraySort Challenge: Input and Resources
Input
Records are 100 bytes in length
First 10 bytes are a random key
Input generator: `ordinal.com/gensort.html`
28,000 fixed-size partitions for the 100 TB sort
250,000 fixed-size partitions for the 1 PB sort
1 partition = 1 HDFS block = 1 node = no partial read I/O
Hardware and Runtime Resources
Commercially available and off-the-shelf
Unmodified, no over/under-clocking
Generates 500 TB of disk I/O, 200 TB of network I/O
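The partition counts imply per-partition sizes, which is why one partition can map to one (large) HDFS block on one node. Back-of-the-envelope arithmetic (decimal units assumed):

```python
TB = 10 ** 12

per_partition_100tb = 100 * TB / 28_000    # ~3.57 GB per partition
per_partition_1pb = 1_000 * TB / 250_000   # 4.0 GB per partition
records_total = 100 * TB // 100            # 10^12 records of 100 bytes each

print(f"100 TB: {per_partition_100tb / 10**9:.2f} GB/partition")
print(f"  1 PB: {per_partition_1pb / 10**9:.2f} GB/partition")
```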

Page 14:

Daytona GraySort Challenge: Rules
Must sort to/from OS files in secondary storage
No raw disk, since the I/O subsystem is being tested
File and device striping (RAID 0) are encouraged

Output file(s) must have correct key order

Page 15:

Daytona GraySort Challenge: Task Scheduling
Types of Data Locality (in increasing level of shittiness):
PROCESS_LOCAL, NODE_LOCAL, RACK_LOCAL, ANY
Delay Scheduling
`spark.locality.wait.node`: time to wait before falling back to the next shitty level
Set to infinite to reduce shittiness and force NODE_LOCAL
Straggling Executor JVMs naturally fade away on each run

Page 16:

Daytona GraySort Challenge: Winning Results
On-disk only, in-memory caching disabled!
100 TB sort: EC2 (i2.8xlarge), 28,000 partitions
1 PB sort: EC2 (i2.8xlarge), 250,000 partitions (!!)

Page 17:

Daytona GraySort Challenge: EC2 Configuration
206 EC2 worker nodes, 1 master node
i2.8xlarge: 32 Intel Xeon CPU E5-2670 @ 2.5 GHz, 244 GB RAM, 8 x 800 GB SSD, RAID 0 striping, ext4
NOOP I/O scheduler: FIFO with request merging, no reordering
3 Gbps mixed read/write disk I/O
Deployed within a Placement Group/VPC
Enhanced Networking: Single Root I/O Virtualization (SR-IOV), an extension of PCIe
10 Gbps, low latency, low jitter (iperf showed ~9.5 Gbps)

Page 18:

Daytona GraySort Challenge: Winning Configuration

Spark 1.2, OpenJDK 1.7_<amazon-something>_u65-b17
Disabled in-memory caching: all on-disk!
HDFS 2.4.1 short-circuit local reads, 2x replication
Writes flushed after every run (5 runs for 28,000 partitions)
Netty 4.0.23.Final with native epoll
Speculative execution disabled: `spark.speculation`=false
Force NODE_LOCAL: `spark.locality.wait.node`=Infinite
Force Netty off-heap: `spark.shuffle.io.preferDirectBufs`=true
Spilling disabled: `spark.shuffle.spill`=false
All compression disabled
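The same settings, collected as a `spark-defaults.conf`-style fragment. This is a sketch reconstructed from the slide, not the team's actual submission files; `spark.locality.wait.node` takes a duration in milliseconds, so "infinite" is approximated with a huge value:

```properties
spark.shuffle.manager              sort
spark.speculation                  false
spark.locality.wait.node           100000000   # ms; effectively infinite: force NODE_LOCAL
spark.shuffle.io.preferDirectBufs  true        # Netty off-heap buffers
spark.shuffle.spill                false       # spilling disabled
spark.shuffle.compress             false       # all compression disabled
spark.shuffle.spill.compress       false
```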

Page 19:

Daytona GraySort Challenge: Partitioning

Range Partitioning (vs. Hash Partitioning)
Takes advantage of the sequential key space
Similar keys grouped together within a partition
Ranges defined by sampling 79 values per partition
Driver sorts the samples and defines range boundaries
Sampling took ~10 seconds for 28,000 partitions
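The sampling scheme can be sketched in a few lines: the driver sorts a small sample and picks boundary keys, and each mapper routes a key with a binary search. `make_range_bounds` and `partition_for` are hypothetical names for illustration:

```python
import bisect
import random

def make_range_bounds(sample_keys, num_partitions):
    """Driver side: sort the sample, pick num_partitions - 1 boundary keys."""
    s = sorted(sample_keys)
    step = len(s) / num_partitions
    return [s[int(step * i)] for i in range(1, num_partitions)]

def partition_for(key, bounds):
    """Mapper side: binary-search the boundaries to route the key."""
    return bisect.bisect_right(bounds, key)

random.seed(0)
keys = [random.randbytes(10) for _ in range(79 * 8)]  # 79 samples x 8 partitions
bounds = make_range_bounds(keys, 8)

# The range-partitioning invariant: ordering by (partition, key) equals
# ordering by key alone, so sorted partitions can simply be concatenated.
by_partition = sorted(keys, key=lambda k: (partition_for(k, bounds), k))
assert by_partition == sorted(keys)
```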

Page 20:

Daytona GraySort Challenge: Why Bother?
Sorting relies heavily on the shuffle and the I/O subsystem
Shuffle is a major bottleneck in big data processing
A large number of partitions can exhaust OS resources
Shuffle optimization benefits all high-level libraries
Goal is to saturate the network controller on all nodes
~125 MB/s (1 Gb Ethernet), ~1.25 GB/s (10 Gb Ethernet)

Page 21:

Daytona GraySort Challenge: Per Node Results

Mappers: 3 Gbps/node disk I/O (8 x 800 GB SSD)
Reducers: 1.1 Gbps/node network I/O (10 Gbps NIC)

Page 22:

Quick Shuffle Refresher

Page 23:

Shuffle Overview

All to All, Cartesian Product Operation

[Diagram: all-to-all shuffle; "least useful example I could find"]

Page 24:

Spark Shuffle Overview

[Diagram: Spark DAG with shuffle boundaries; "most confusing example I could find"]

Stages are Defined by Shuffle Boundaries

Page 25:

Shuffle Intermediate Data: Spill to Disk
Intermediate shuffle data is stored in memory
Spill to Disk
`spark.shuffle.spill`=true
`spark.shuffle.memoryFraction`: fraction of the heap for shuffle buffers
Competes with `spark.storage.memoryFraction`
Bump this up from the default!! Will help Spark SQL, too.
Skipped Stages
Reuse intermediate shuffle data found on the reducer
DAG for that partition can be truncated

Page 26:

Shuffle Intermediate Data: Compression
`spark.shuffle.compress`: compress mapper outputs
`spark.shuffle.spill.compress`: compress reducer spills
`spark.io.compression.codec`
LZF: most workloads (new default for Spark)
Snappy: large workloads (less memory required to compress)

Page 27:

Spark Shuffle Operations
join, distinct, cogroup, coalesce, repartition, sortByKey, groupByKey, reduceByKey, aggregateByKey

Page 28:

Spark Shuffle Managers
`spark.shuffle.manager` = {
`hash`: < 10,000 reducers
Output file determined by hashing the key of the (K,V) pair
Each mapper creates an output buffer/file per reducer
Leads to M*R output buffers/files per shuffle
`sort`: >= 10,000 reducers
Default since Spark 1.2
Won the Daytona GraySort Challenge w/ 250,000 reducers!!
`tungsten-sort` (Future Meetup!)
}
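The file-count difference between the two managers is the whole story behind the OS-resource exhaustion mentioned earlier. A trivial sketch of the arithmetic:

```python
def hash_shuffle_files(num_mappers, num_reducers):
    """Hash shuffle: every mapper writes one file per reducer -> M * R files."""
    return num_mappers * num_reducers

def sort_shuffle_files(num_mappers):
    """Sort shuffle: one indexed master file per mapper -> M files."""
    return num_mappers

M = R = 28_000
print(hash_shuffle_files(M, R))   # 784,000,000 files per shuffle
print(sort_shuffle_files(M))      # 28,000 files per shuffle
```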

Page 29:

Shuffle Managers

Page 30:

Hash Shuffle Manager

M*R open files per shuffle; M = num mappers, R = num reducers
Mapper opens 1 file per partition/reducer
[Diagram: HDFS (2x repl) -> mappers -> M*R files -> reducers -> HDFS (2x repl)]

Page 31:

Sort Shuffle Manager

Hold Tight!

Page 32:

Tungsten-Sort Shuffle Manager

Future Meetup!!

Page 33:

Shuffle Performance Tuning

Hash Shuffle Manager (no longer the default)
`spark.shuffle.consolidateFiles`: consolidate mapper output files
`o.a.s.shuffle.FileShuffleBlockResolver`
Intermediate Files
Increase `spark.shuffle.file.buffer`: reduces seeks & sys calls
Increase `spark.reducer.maxSizeInFlight` if memory allows
Use a smaller number of larger workers to reduce the total number of files
SQL: BroadcastHashJoin vs. ShuffledHashJoin
`spark.sql.autoBroadcastJoinThreshold`
Use `DataFrame.explain(true)` or `EXPLAIN` to verify

Page 34:

Shuffle Configuration

Documentation: spark.apache.org/docs/latest/configuration.html#shuffle-behavior
Prefix: spark.shuffle

Page 35:

Winning Optimizations Deployed across Spark 1.1 and 1.2

Page 36:

Daytona GraySort Challenge: Winning Optimizations
CPU-Cache Locality: (Key, Pointer-to-Record) & Cache Alignment

Optimized Sort Algorithm: Elements of (K, V) Pairs

Reduce Network Overhead: Async Netty, epoll

Reduce OS Resource Utilization: Sort Shuffle

Page 37:

CPU-Cache Locality: (Key, Pointer-to-Record)
AlphaSort paper (~1995), Chris Nyberg and Jim Gray
Naïve: List of (Pointer-to-Record); requires the key to be dereferenced for each comparison
AlphaSort: List of (Key, Pointer-to-Record); the key is directly available for comparison
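The AlphaSort idea can be sketched as follows: records stay in one flat buffer, and the sort operates on small (key, pointer) entries, so comparisons never chase a pointer into the 90-byte payload. (Python won't show the cache effect, but the structure is the same.)

```python
import os

RECORD_LEN, KEY_LEN = 100, 10

def alphasort(buf):
    n = len(buf) // RECORD_LEN
    # AlphaSort layout: the key is copied next to the pointer, so the
    # comparator reads the entry itself rather than dereferencing the record.
    entries = [(buf[i * RECORD_LEN : i * RECORD_LEN + KEY_LEN], i) for i in range(n)]
    entries.sort()  # compares the inlined 10-byte keys
    # Follow each pointer exactly once to materialize the sorted output.
    return b"".join(buf[i * RECORD_LEN : (i + 1) * RECORD_LEN] for _, i in entries)

buf = b"".join(os.urandom(KEY_LEN) + bytes(90) for _ in range(200))
out = alphasort(buf)
keys = [out[i * RECORD_LEN : i * RECORD_LEN + KEY_LEN] for i in range(200)]
assert keys == sorted(keys)
```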

Page 38:

CPU-Cache Locality: Cache Alignment
Key (10 bytes) + Pointer (4 bytes*) = 14 bytes
*4 bytes when using compressed OOPs (<32 GB heap)
Not a power of 2 in size, not CPU-cache friendly
Cache Alignment Options
① Add Padding (2 bytes): Key (10 bytes) + Pad (2 bytes) + Pointer (4 bytes) = 16 bytes
② (Key-Prefix, Pointer-to-Record): Key-Prefix (4 bytes) + Pointer (4 bytes) = 8 bytes; perf affected by key distribution
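The three layouts are easy to check with `struct` (big-endian formats, so no implicit alignment; the pointer is modeled as a 4-byte unsigned int):

```python
import struct

KEY = b"0123456789"  # 10-byte key
PTR = 7              # compressed-OOP style 4-byte pointer

naive = struct.pack(">10sI", KEY, PTR)        # no padding
padded = struct.pack(">10s2xI", KEY, PTR)     # option 1: 2 pad bytes
prefixed = struct.pack(">4sI", KEY[:4], PTR)  # option 2: 4-byte key prefix

assert len(naive) == 14    # not a power of 2: cache-unfriendly
assert len(padded) == 16   # cache-aligned
assert len(prefixed) == 8  # cache-aligned and half the size
```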

Page 39:

CPU-Cache Locality: Performance Comparison

Page 40:

Optimized Sort Algorithm: Elements of (K, V) Pairs
`o.a.s.util.collection.TimSort`
Based on JDK 1.7 TimSort
Performs best on partially-sorted datasets
Optimized for elements of (K,V) pairs
Sorts impls of SortDataFormat (i.e. KVArraySortDataFormat)
`o.a.s.util.collection.AppendOnlyMap`
Open-addressing hash with quadratic probing
Array of [(key0, value0), (key1, value1)]: good memory locality
Keys never removed, values only appended
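A toy version of an append-only, open-addressing map with quadratic probing (fixed capacity, no resize, so purely illustrative):

```python
class AppendOnlyMap:
    """Open-addressing hash map: keys never removed, values merged in place."""

    def __init__(self, capacity=64):
        self.slots = [None] * capacity  # flat array of (key, value) pairs
        self.capacity = capacity

    def _probe(self, key):
        h = hash(key) % self.capacity
        for i in range(self.capacity):
            idx = (h + i * i) % self.capacity  # quadratic probing
            if self.slots[idx] is None or self.slots[idx][0] == key:
                return idx
        raise RuntimeError("map full")

    def update(self, key, value, merge):
        idx = self._probe(key)
        if self.slots[idx] is None:
            self.slots[idx] = (key, value)  # append-only insert
        else:
            self.slots[idx] = (key, merge(self.slots[idx][1], value))

    def get(self, key):
        slot = self.slots[self._probe(key)]
        return slot[1] if slot else None

m = AppendOnlyMap()
for word in ["a", "b", "a", "c", "a"]:
    m.update(word, 1, merge=lambda old, new: old + new)
assert (m.get("a"), m.get("b"), m.get("c")) == (3, 1, 1)
```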

Page 41:

Reduce Network Overhead: Async Netty, epoll
New network module based on async Netty
Replaces old java.nio, low-level, socket-based code
Zero-copy epoll uses kernel space between disk & network
Custom memory management reduces GC pauses
`spark.shuffle.blockTransferService`=netty
Spark-Netty Performance Tuning
`spark.shuffle.io.numConnectionsPerPeer`: increase to saturate hosts with multiple disks
`spark.shuffle.io.preferDirectBufs`: on- or off-heap (off-heap is the default)
Apache Spark Jira: SPARK-2468

Page 42:

Reduce OS Resource Utilization: Sort Shuffle
M open files per shuffle; M = num of mappers
`spark.shuffle.sort.bypassMergeThreshold`
Mapper: TimSort (RAM), then Merge Sort (Disk); merge-sorts partitions into 1 master file indexed by partition range offsets
Reducers seek to their range offset of the master file on the mapper, then scan sequentially
HDFS (2x repl) on input and output
SPARK-2926: Replace TimSort w/ Merge Sort (Memory)
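The master-file layout can be sketched with an in-memory buffer: the mapper writes partitions sequentially and records the start offset of each, so a reducer does one seek to its offset and then a sequential scan:

```python
import io

def write_master_file(partitions):
    """Mapper: write all partitions into one file, indexed by start offsets."""
    buf = io.BytesIO()
    offsets = []
    for records in partitions:
        offsets.append(buf.tell())
        for rec in records:
            buf.write(rec)
    offsets.append(buf.tell())  # end sentinel closes the last range
    return buf.getvalue(), offsets

def read_partition(master, offsets, i):
    """Reducer: seek once to offsets[i], scan sequentially to offsets[i+1]."""
    return master[offsets[i] : offsets[i + 1]]

parts = [[b"aa", b"ab"], [b"ba"], [b"ca", b"cb", b"cc"]]
master, index = write_master_file(parts)
assert read_partition(master, index, 0) == b"aaab"
assert read_partition(master, index, 1) == b"ba"
assert read_partition(master, index, 2) == b"cacbcc"
```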

Page 43:

Bonus!

Page 44:

External Shuffle Service: Separate JVM Process
Takes over when the Spark Executor is in GC or dies
Uses the new Netty-based network module
Required for YARN dynamic allocation; the Node Manager serves the files

Apache Spark Jira: SPARK-3796

Page 45:

Next Steps: Project Tungsten

Page 46:

Project Tungsten: CPU and Memory Optimizations
The Daytona GraySort optimizations targeted disk and network; Tungsten targets CPU and memory
Tungsten Optimizations
Custom Memory Management: eliminates JVM object and GC overhead
More cache-aware data structs and algos: `o.a.s.unsafe.map.BytesToBytesMap` vs. j.u.HashMap
Code Generation (default in 1.5): generate bytecode from the overall query plan

Page 47:

Thank you! Special thanks to Big Commerce!!

IBM Spark Tech Center is Hiring! Nice people only, please!!


Sign up for our newsletter at

To Be Continued…

Page 48:

Relevant Links

http://sortbenchmark.org/ApacheSpark2014.pdf
https://databricks.com/blog/2014/10/10/spark-petabyte-sort.html
https://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html
http://0x0fff.com/spark-architecture-shuffle/
http://www.cs.berkeley.edu/~kubitron/courses/cs262a-F13/projects/reports/project16_report.pdf

Page 49:

Power of data. Simplicity of design. Speed of innovation. IBM Spark