august 2013 hug: compression options in hadoop - a tale of tradeoffs

28
Compression Options In Hadoop – A Tale of Tradeoffs Govind Kamat 39 th Bay Area Hadoop Users Group (HUG) Meetup Yahoo! URL’s Café Sunnyvale, CA August 21, 2013

Upload: yahoo-developer-network

Post on 26-Jan-2015

111 views

Category:

Technology


5 download

DESCRIPTION

Yahoo! is one of the most-visited web sites in the world. It runs one of the largest private cloud infrastructures, one that operates on petabytes of data every day. Being able to store and manage that data well is essential to the efficient functioning of Yahoo!`s Hadoop clusters. A key component that enables this efficient operation is data compression. With regard to compression algorithms, there is an underlying tension between compression ratio and compression performance. Consequently, Hadoop provides support for several compression algorithms, including gzip, bzip2, Snappy, LZ4 and others. This plethora of options can make it difficult for users to select appropriate codecs for their MapReduce jobs. This talk attempts to provide guidance in that regard. Performance results with Gridmix and with several corpuses of data are presented. The talk also describes enhancements we have made to the bzip2 codec that improve its performance. This will be of particular interest to the increasing number of users operating on “Big Data” who require the best possible ratios. The impact of using the Intel IPP libraries is also investigated; these have the potential to improve performance significantly. Finally, a few proposals for future enhancements to Hadoop in this area are outlined. Speaker: Govind Kamat, Member of Technical Staff, Yahoo!

TRANSCRIPT

Page 1: August 2013 HUG: Compression Options in Hadoop - A Tale of Tradeoffs

Compression Options In Hadoop – A Tale of Tradeoffs Govind Kamat

39th Bay Area Hadoop Users Group (HUG) Meetup Yahoo! URL’s Café Sunnyvale, CA August 21, 2013

Page 2: August 2013 HUG: Compression Options in Hadoop - A Tale of Tradeoffs

Introduction

2

Sumeet Singh Director of Products, Hadoop Cloud Engineering Group

701 First Avenue Sunnyvale, CA 94089 USA

Govind Kamat Technical Yahoo!, Hadoop Cloud Engineering Group

§  Member of Technical Staff in the Hadoop Services team at Yahoo!

§  Focuses on HBase and Hadoop performance §  Worked with the Performance Engineering Group on

improving the performance and scalability of several Yahoo! applications

§  Experience includes development of large-scale software systems, microprocessor architecture, instruction-set simulators, compiler technology and electronic design

701 First Avenue Sunnyvale, CA 94089 USA

§  Leads Hadoop products team at Yahoo!

§  Responsible for Product Management, Customer Engagements, Evangelism, and Program Management

§  Prior to this role, led Strategy functions for the Cloud

Platform Group at Yahoo!

Page 3: August 2013 HUG: Compression Options in Hadoop - A Tale of Tradeoffs

Agenda

3

Data Compression in Hadoop 1

Available Compression Options 2

Understanding and Working with Compression Options 3

Problems Faced at Yahoo! with Large Data Sets 4

Performance Evaluations, Native Bzip2, and IPP Libraries 5

Wrap-up and Future Work 6

Page 4: August 2013 HUG: Compression Options in Hadoop - A Tale of Tradeoffs

Compression Needs and Tradeoffs in Hadoop

4

§  Storage §  Disk I/O §  Network bandwidth

§  CPU Time

§  Hadoop jobs are data-intensive, compressing data can speed up the I/O operations §  MapReduce jobs are almost always I/O bound

§  Compressed data can save storage space and speed up data transfers across the network §  Capital allocation for hardware can go further

§  Reduced I/O and network load can bring significant performance improvements §  MapReduce jobs can finish faster overall

§  On the other hand, CPU utilization and processing time increases during compression and decompression §  Understanding the tradeoffs is important for MapReduce pipeline’s overall performance

The Compression Tradeoff

Page 5: August 2013 HUG: Compression Options in Hadoop - A Tale of Tradeoffs

Data Compression in Hadoop’s MR Pipeline

5

Input splits

Map

Source: Hadoop: The Definitive Guide, Tom White

Output

Reduce Buffer in memory

Partition and Sort

fetch

Merge on disk

Merge and sort

Other maps

Other reducers

I/P compressed

Mapper decompresses

Mapper O/P compressed

1

Map Reduce

Reduce I/P

Map O/P

Reducer I/P decompresses

Reducer O/P compressed

2 3

Sort & Shuffle

Compress Decompress

Page 6: August 2013 HUG: Compression Options in Hadoop - A Tale of Tradeoffs

Compression Options in Hadoop (1/2)

6

Format Algorithm Strategy Emphasis Comments

zlib Uses DEFLATE (LZ77 and Huffman coding)

Dictionary-based, API Compression ratio Default codec

gzip Wrapper around zlib Dictionary-based, standard compression utility

Same as zlib, codec operates on and produces standard gzip files

For data interchange on and off Hadoop

bzip2 Burrows-Wheeler transform

Transform-based, block-oriented

Higher compression ratios than zlib Common for Pig

LZO Variant of LZ77 Dictionary-based, block-oriented, API

High compression speeds

Common for intermediate compression, HBase tables

LZ4 Simplified variant of LZ77 Fast scan, API Very high compression

speeds Available in newer Hadoop distributions

Snappy LZ77 Block-oriented, API Very high compression speeds

Came out of Google, previously known as Zippy

Page 7: August 2013 HUG: Compression Options in Hadoop - A Tale of Tradeoffs

Compression Options in Hadoop (2/2)

7

Format Codec (Defined in io.compression.codecs) File Extn. Splittable Java/ Native

zlib/ DEFLATE (default) org.apache.hadoop.io.compress.DefaultCodec !.deflate! N Y/ Y

gzip org.apache.hadoop.io.compress.GzipCodec ! .gz! N Y/ Y

bzip2 org.apache.hadoop.io.compress.BZip2Codec ! .bz2! Y Y/ Y

LZO (download separately)

com.hadoop.compression.lzo.LzoCodec ! .lzo! N N/ Y

LZ4 org.apache.hadoop.io.compress.Lz4Codec ! .lz4! N N/ Y

Snappy org.apache.hadoop.io.compress.SnappyCodec ! .snappy! N N/ Y

NOTES: §  Splittability – Bzip2 is “splittable”, can be decompressed in parallel by multiple MapReduce tasks. Other

algorithms require all blocks together for decompression with a single MapReduce task.

§  LZO – Removed from Hadoop because the LZO libraries are licensed under the GNU GPL. LZO format is still supported and the codec can be downloaded separately and enabled manually.

§  Native bzip2 codec – added by Yahoo! as part of this work in Hadoop 0.23

Page 8: August 2013 HUG: Compression Options in Hadoop - A Tale of Tradeoffs

Space-Time Tradeoff of Compression Options

8

64%, 32.3

71%, 60.0

47%, 4.8 42%, 4.0 44%, 2.4

0.0

10.0

20.0

30.0

40.0

50.0

60.0

70.0

40% 45% 50% 55% 60% 65% 70% 75%

CPU

Tim

e in

Sec

. (C

ompr

ess

+ D

ecom

pres

s)

Space Savings

Bzip2

Zlib (Deflate, Gzip)

LZO Snappy LZ4

Note: A 265 MB corpus from Wikipedia was used for the performance comparisons. Space savings is defined as [1 – (Compressed/ Uncompressed)]

Codec Performance on the Wikipedia Text Corpus

High Compression Ratio

High Compression Speed

Page 9: August 2013 HUG: Compression Options in Hadoop - A Tale of Tradeoffs

Using Data Compression in Hadoop

9

Phase in MR Pipeline Config Values

Input data to Map

File extension recognized automatically for decompression

File extensions for supported formats Note: For SequenceFile, headers have the information [compression (boolean), block compression (boolean), and compression codec]

One of the supported codecs one defined in io.compression.codecs!

Intermediate (Map) Output

mapreduce.map.output.compress!false (default), true

mapreduce.map.output.compress.codec!! one defined in io.compression.codecs!

Final (Reduce) Output

mapreduce.output.fileoutputformat. compress!

false (default), true

mapreduce.output.fileoutputformat. compress.codec! one defined in io.compression.codecs!

mapreduce.output.fileoutputformat. compress.type!

Type of compression to use for SequenceFile outputs: NONE, RECORD (default), BLOCK

1

2

3

Page 10: August 2013 HUG: Compression Options in Hadoop - A Tale of Tradeoffs

§  Compress the input data, if large

§  Always use compression, particularly if spillage or slow network transfers

§  Compress for storage/ archival, better write speeds, or between MR jobs

§  Use splittable algo such as bzip2, or use zlib with SequenceFile format

§  Use faster codecs such as LZO, LZ4, or Snappy

§  Use standard utility such as gzip or bzip2 for data interchange, and faster codecs for chained jobs

When to Use Compression and Which Codec

10

Map Reduce Shuffle & Sort

Input data to Map Intermediate (Map) Output

I/P compressed

Mapper decompresses

Mapper O/P compressed

1

Reducer I/P decompresses

Reducer O/P compressed

2 3

Compress Decompress

Final Reduce Output

Page 11: August 2013 HUG: Compression Options in Hadoop - A Tale of Tradeoffs

Compression in the Hadoop Ecosystem

11

Component When to Use What to Use

Pig §  Compressing data between MR

job

§  Typical in Pig scripts that include joins or other operators that expand your data size

Enable compression and select the codec: pig.tmpfilecompression = true!pig.tmpfilecompression.codec = gzip, lzo!!

Hive §  Intermediate files produced by

Hive between multiple map-reduce jobs

§  Hive writes output to a table

Enable intermediate or output compression: hive.exec.compress.intermediate = true!hive.exec.compress.output = true!

HBase

§  Compress data at the CF level (support for LZO, gzip, Snappy, and LZ4)

List required JNI libraries: hbase.regionserver.codecs!!Enabling compression: create ’table', { NAME => 'colfam', COMPRESSION => ’LZO' }!alter ’table', { NAME => 'colfam', COMPRESSION => ’LZO' } !

Page 12: August 2013 HUG: Compression Options in Hadoop - A Tale of Tradeoffs

4.2M Jobs, Jun 10-16, 2013

Compression in Hadoop at Yahoo!

12

99.8%

0.2%

LZO 98.3%

gzip 1.1%

zlib / default 0.5%

bzip2 0.1%

Map Reduce Shuffle & Sort Input data to Map Intermediate (Map) Output

1 2 3

Final Reduce Output

39.0%

61.0%

LZO 55%

gzip 35%

bzip2 5%

zlib / default 5%

4.2M Jobs, Jun 10-16, 2013

98%

2%

zlib / default 73%

gzip 22%

bzip2 4%

LZO 1%

380M Files on Jun 16, 2013 (/data, /projects)

Includes intermediate Pig/ Hive compression

Pig Intermediate

Compressed

Page 13: August 2013 HUG: Compression Options in Hadoop - A Tale of Tradeoffs

Compression for Data Storage Efficiency §  DSE considerations at Yahoo!

§  RCFile instead of SequenceFile

§  Faster implementation of bzip2

§  Native-code bzip2 codec

§  HADOOP-84621, available in 0.23.7

§  Substituting the IPP library

13

1 Native-code bzip2 implementation done in collaboration with Jason Lowe, Hadoop Core PMC member

Page 14: August 2013 HUG: Compression Options in Hadoop - A Tale of Tradeoffs

IPP Libraries §  Integrated Performance Primitives from Intel

§  Algorithmic and architectural optimizations

§  Processor-specific variants of each function

§  Applications remain processor-neutral

§  Compression: LZ, RLE, BWT, LZO

§  High level formats include: zlib, gzip, bzip2 and LZO

14

Page 15: August 2013 HUG: Compression Options in Hadoop - A Tale of Tradeoffs

Measuring Standalone Performance §  Standard programs (gzip, bzip2) used

§  Driver program written for other cases

§  32-bit mode

§  Single-threaded

§  JVM load overhead discounted

§  Default compression level

§  Quad-core Xeon machine

15

Page 16: August 2013 HUG: Compression Options in Hadoop - A Tale of Tradeoffs

Data Corpuses Used §  Binary files

§  Generated text from randomtextwriter

§  Wikipedia corpus

§  Silesia corpus

16

Page 17: August 2013 HUG: Compression Options in Hadoop - A Tale of Tradeoffs

Compression Ratio

0

50

100

150

200

250

300

uncomp zlib bzip2 LZO Snappy LZ4

File

Siz

e (M

B)

exe rtext wiki silesia

17

Page 18: August 2013 HUG: Compression Options in Hadoop - A Tale of Tradeoffs

Compression Performance

29 23

63

44

26

0

10

20

30

40

50

60

70

80

90

zlib IPP-zlib Java-bzip2 bzip2 IPP-bzip2

CPU

Tim

e (s

ec)

exe rtext wiki silesia

18

Page 19: August 2013 HUG: Compression Options in Hadoop - A Tale of Tradeoffs

Compression Performance (Fast Algorithms)

3.2 2.9

1.7

0

0.5

1

1.5

2

2.5

3

3.5

LZO Snappy LZ4

CPU

Tim

e (s

ec)

exe rtext wiki silesia

19

Page 20: August 2013 HUG: Compression Options in Hadoop - A Tale of Tradeoffs

Decompression Performance

3 2

21

17

12

0

5

10

15

20

25

zlib IPP-zlib Java-bzip2 bzip2 IPP-bzip2

CPU

Tim

e (s

ec)

exe rtext wiki silesia

20

Page 21: August 2013 HUG: Compression Options in Hadoop - A Tale of Tradeoffs

Decompression Performance (Fast Algorithms)

1.6

1.1

0.7

0

0.2

0.4

0.6

0.8

1

1.2

1.4

1.6

1.8

2

LZO Snappy LZ4

CPU

Tim

e (s

ec)

exe rtext wiki silesia

21

Page 22: August 2013 HUG: Compression Options in Hadoop - A Tale of Tradeoffs

Compression Performance within Hadoop §  Daytona performance framework

§  GridMix v1

§  Loadgen and sort jobs

§  Input data compressed with zlib / bzip2

§  LZO used for intermediate compression

§  35 datanodes, dual-quad-core machines

22

Page 23: August 2013 HUG: Compression Options in Hadoop - A Tale of Tradeoffs

Map Performance

47 46 46

33

0

5

10

15

20

25

30

35

40

45

50

Java-bzip2 bzip2 IPP-bzip2 zlib

Map

Tim

e (s

ec)

23

Page 24: August 2013 HUG: Compression Options in Hadoop - A Tale of Tradeoffs

Reduce Performance

31

28

18

14

0

5

10

15

20

25

30

35

Java-bzip2 bzip2 IPP-bzip2 zlib

Red

uce

Tim

e (m

in)

24

Page 25: August 2013 HUG: Compression Options in Hadoop - A Tale of Tradeoffs

Job Performance

38

34

23

19

38

34

25

18

0

5

10

15

20

25

30

35

40

Java-bzip2 bzip2 IPP-bzip2 zlib

Job

Tim

e (m

in)

sort loadgen

25

Page 26: August 2013 HUG: Compression Options in Hadoop - A Tale of Tradeoffs

Future Work §  Splittability support for native-code bzip2 codec

§  Enhancing Pig to use common bzip2 codec

§  Optimizing the JNI interface and buffer copies

§  Varying the compression effort parameter

§  Performance evaluation for 64-bit mode

§  Updating the zlib codec to specify alternative libraries

§  Other codec combinations, such as zlib for transient data

§  Other compression algorithms

26

Page 27: August 2013 HUG: Compression Options in Hadoop - A Tale of Tradeoffs

Considerations in Selecting Compression Type §  Nature of the data set

§  Chained jobs

§  Data-storage efficiency requirements

§  Frequency of compression vs. decompression

§  Requirement for compatibility with a standard data format

§  Splittability requirements

§  Size of the intermediate and final data

§  Alternative implementations of compression libraries

27

Page 28: August 2013 HUG: Compression Options in Hadoop - A Tale of Tradeoffs