Supporting Content-Addressable Caching with CZIP Compression




TRANSCRIPT

Page 1: Supporting Content-Addressable Caching with CZIP Compression

Supporting Content-Addressable Caching with CZIP Compression

KyoungSoo Park, Sunghwan Ihm, Mic Bowman* and Vivek Pai

Princeton University, *Intel Research

Page 2: Supporting Content-Addressable Caching with CZIP Compression

KyoungSoo Park USENIX 2007 2

Content-Based Naming (CBN)
• Naming scheme based on its content
  • Name = one-way hash (content)
  • Hashing function: MD5, SHA-1, etc.
  • Rabin’s fingerprint for chunk detection
• Redundancy elimination
  • Network-traffic/storage systems
  • Research/commercial systems
  • Special-purpose systems
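A minimal sketch of the naming rule "Name = one-way hash (content)", using Python's standard hashlib. The function name content_name is hypothetical and chosen only for illustration; the slides do not show the tool's actual code.

```python
import hashlib

def content_name(data: bytes, algorithm: str = "sha1") -> str:
    """Return a content-based name: the hex digest of a one-way hash of the data."""
    return hashlib.new(algorithm, data).hexdigest()

# Identical content always gets the same name, wherever it is stored.
same = content_name(b"hello world") == content_name(b"hello world")
# Any change in content yields a different name.
different = content_name(b"hello world") != content_name(b"hello world!")
print(same, different)
```

Because the name depends only on the bytes, two copies of the same chunk in different files or on different hosts can be recognized as redundant without comparing the data itself.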

Page 3: Supporting Content-Addressable Caching with CZIP Compression

Where Can CBN be Applied?
• Similar file distribution
  • Linux distribution mirror
  • DVD ISO contains all CD ISOs
• Virtual machine image migration
  • Base OS takes up majority of content
  • httpd VM vs. httpd+mysqld VM
• Uncacheable Web content
  • Some dynamic content doesn’t change

Page 4: Supporting Content-Addressable Caching with CZIP Compression

Contribution of This Work
• Generic CBN tool
  • Easy to build new systems
  • Easy to upgrade existing non-CBN systems
• CZIP compression + CZIP-aware apps
  • Can be used on existing platforms
  • Provides benefit to non-CZIP apps
• Demonstrate sample systems
  • Reduces FC6 mirror memory footprint by half
  • Comparable compression speed to GZIP’s
  • 2x throughput for CZIP-aware Apache
  • 4x origin server BW reduction for CZIP-aware CDN

Page 5: Supporting Content-Addressable Caching with CZIP Compression

CZIP Compression
• Compression scheme like GZIP, BZIP2
• Export CBN information in the header

[Figure: CZIP turns an input of five chunks (A, A, C, B, B) into a CZIP header (header global fields plus chunk indexes 1 to 5) followed by the distinct chunks (A, C, B); UNCZIP reverses the process.]
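The idea of storing each distinct chunk once and describing the file through a chunk index can be sketched as follows. This is a toy under stated assumptions, not the CZIP format: it uses fixed-size 4-byte chunks, SHA-1 names, and in-memory Python structures, and the names toy_czip and toy_unczip are hypothetical.

```python
import hashlib

def toy_czip(data: bytes, chunk_size: int = 4):
    """Split data into fixed-size chunks and keep each distinct chunk only once."""
    index = []   # one content hash per input chunk, in file order
    store = {}   # content hash -> chunk bytes (distinct chunks only)
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        name = hashlib.sha1(chunk).hexdigest()
        index.append(name)
        store.setdefault(name, chunk)
    return index, store

def toy_unczip(index, store) -> bytes:
    """Rebuild the original data by following the chunk index."""
    return b"".join(store[name] for name in index)

# Five input chunks: A, A, C, B, B
data = b"AAAA" + b"AAAA" + b"CCCC" + b"BBBB" + b"BBBB"
index, store = toy_czip(data)
restored = toy_unczip(index, store)
print(len(index), len(store), restored == data)
```

The index keeps five entries while only three chunks are stored, which is where the savings on redundant content come from.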

Page 6: Supporting Content-Addressable Caching with CZIP Compression

CZIP Header
• Header = global attributes + chunk info
• Global attributes
  • One-way hash function (SHA-1/MD5)
  • Chunk data compression (GZIP/BZIP2)
  • Convergent encryption (on/off)
  • Header CRC, File Hash, etc.
• Chunk information
  • Content hash, start offset, chunk size
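The header fields listed above could be modeled roughly as below. This is only an illustrative data structure: the field names, the in-memory layout, the CRC input, and the choice to record offsets into the uncompressed input are all assumptions, since the slides do not give the on-disk format.

```python
import hashlib
import zlib
from dataclasses import dataclass, field
from typing import List

@dataclass
class ChunkInfo:
    content_hash: str   # one-way hash of the chunk
    start_offset: int   # assumed here: offset into the uncompressed input
    chunk_size: int     # chunk length in bytes

@dataclass
class CzipHeader:
    hash_function: str = "sha1"          # SHA-1 or MD5
    chunk_compression: str = "gzip"      # GZIP or BZIP2
    convergent_encryption: bool = False  # on/off
    file_hash: str = ""
    chunks: List[ChunkInfo] = field(default_factory=list)

    def crc(self) -> int:
        """CRC-32 over a text rendering of the fields (hypothetical encoding)."""
        rendered = repr((self.hash_function, self.chunk_compression,
                         self.convergent_encryption, self.file_hash,
                         [(c.content_hash, c.start_offset, c.chunk_size)
                          for c in self.chunks]))
        return zlib.crc32(rendered.encode("utf-8"))

def build_header(data: bytes, chunk_size: int) -> CzipHeader:
    """Fill in the file hash and one ChunkInfo per fixed-size chunk."""
    header = CzipHeader(file_hash=hashlib.sha1(data).hexdigest())
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        header.chunks.append(
            ChunkInfo(hashlib.sha1(chunk).hexdigest(), offset, len(chunk)))
    return header

header = build_header(b"abcdefghij", chunk_size=4)
print(len(header.chunks), header.chunks[-1].start_offset, header.chunks[-1].chunk_size)
```

Keeping the chunk hashes in the header is what lets a CZIP-aware application decide which chunks it already holds before reading any chunk data.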

Page 7: Supporting Content-Addressable Caching with CZIP Compression

Deployment Scenario
• CZIP-aware server

[Figure: the server stores file1.cz (header, Chunk A, Chunk B) and file2.cz (header, Chunk A, Chunk B, Chunk C) and keeps a CBN cache keyed by chunk hash (xyzlo5g: Chunk A, asdfghk: Chunk B, qoiertty: Chunk C). For Client A's file1.cz the server reads the header and the chunks; for Client B's file2.cz it reads the header and only Chunk C.]
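A rough sketch of the server-side behavior in this scenario: chunks are cached by content hash, so a second file that shares chunks with the first causes reads only for the chunks that are new. The class CbnCache and the serve helper are hypothetical, and the short strings stand in for real SHA-1 digests as on the slide.

```python
from typing import Callable, Dict, List, Tuple

class CbnCache:
    """In-memory chunk cache keyed by content hash."""

    def __init__(self) -> None:
        self._chunks: Dict[str, bytes] = {}
        self.disk_reads = 0

    def get(self, content_hash: str, read_from_disk: Callable[[], bytes]) -> bytes:
        # Only touch the disk when no cached chunk has this content hash.
        if content_hash not in self._chunks:
            self.disk_reads += 1
            self._chunks[content_hash] = read_from_disk()
        return self._chunks[content_hash]

def serve(cache: CbnCache, chunk_list: List[Tuple[str, bytes]]) -> bytes:
    """Assemble a response from (hash, on-disk bytes) pairs via the cache."""
    return b"".join(cache.get(h, lambda d=d: d) for h, d in chunk_list)

file1 = [("xyzlo5g", b"<chunk A>"), ("asdfghk", b"<chunk B>")]
file2 = file1 + [("qoiertty", b"<chunk C>")]

cache = CbnCache()
serve(cache, file1)
reads_after_file1 = cache.disk_reads
serve(cache, file2)
reads_after_file2 = cache.disk_reads
print(reads_after_file1, reads_after_file2)
```

Serving file2.cz after file1.cz adds a single read for Chunk C, and the shared chunks occupy memory only once.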

Page 8: Supporting Content-Addressable Caching with CZIP Compression

Deployment Scenario
• CZIP-aware client-side proxy

[Figure: Client A and Client B fetch file1.cz and file2.cz through a proxy that keeps a CBN cache keyed by chunk hash (xyzlo5g: Chunk A, asdfghk: Chunk B, qoiertty: Chunk C). For file2.cz the proxy reads only Chunk C from the server, using the request below.]

GET /file2.cz
Range: bytes=1000-1999
X-SHA-1: qoiertty

1. X-SHA-1 field helps CZIP-aware server
2. Browser cache can support CBN too!
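The proxy's decision can be sketched as: consult the chunk list from the header, skip every chunk whose hash is already in the CBN cache, and issue a byte-range request for the rest. The function names, the tuple layout, the byte offsets for Chunks A and B, and the HTTP/1.0 version string are assumptions for illustration; only the Range and X-SHA-1 lines follow the slide.

```python
from typing import Dict, List, Tuple

def chunk_request(path: str, start: int, end: int, sha1_name: str) -> str:
    """Build a byte-range request that also carries the chunk's content hash."""
    return (f"GET {path} HTTP/1.0\r\n"
            f"Range: bytes={start}-{end}\r\n"
            f"X-SHA-1: {sha1_name}\r\n"
            "\r\n")

def plan_fetch(cache: Dict[str, bytes], path: str,
               chunks: List[Tuple[str, int, int]]) -> List[str]:
    """Return requests only for chunks whose hash is not in the proxy cache."""
    return [chunk_request(path, start, end, name)
            for name, start, end in chunks
            if name not in cache]

# Chunks A and B were cached while fetching file1.cz.
proxy_cache = {"xyzlo5g": b"<chunk A>", "asdfghk": b"<chunk B>"}
file2_chunks = [("xyzlo5g", 0, 499), ("asdfghk", 500, 999), ("qoiertty", 1000, 1999)]

requests = plan_fetch(proxy_cache, "/file2.cz", file2_chunks)
print(len(requests))
print(requests[0])
```

A server that does not understand X-SHA-1 can still answer the plain range request, which is why the scheme works with existing servers.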

Page 9: Supporting Content-Addressable Caching with CZIP Compression

Compressibility
• Fedora Core 6 ISOs / All files / Wikipedia DB

[Figure: bar chart of data compression ratio (y-axis, 0 to 1) for FC6_i386_ISOs.tar (6.7 GB), FC6_All_files.tar (49.7 GB) and Wikipedia_DB.tar (7.9 GB); series: CZIP+plain, CZIP+gzip, CZIP+bzip2, GZIP, BZIP2.]

Page 10: Supporting Content-Addressable Caching with CZIP Compression

Compression speed

[Figure: bar chart of normalized time (y-axis, 0 to 1) for FC6_i386_ISOs.tar, FC6_All_files.tar and Wikipedia_DB.tar; series: BZIP2, GZIP, CZIP+bzip2, CZIP+gzip, CZIP+plain. Times shown on the slide: 3,964 secs, 29,004 secs, 3,151 secs.]

• On Pentium D 2.8GHz with 4GB memory

Page 11: Supporting Content-Addressable Caching with CZIP Compression

Virtual Machine Images
• Server consolidation/management
• Much redundancy among similar VMs
  • Xen FC4 base image (X)
  • X + httpd (Y) / Y + mysqld (Z)
• Investigating content overlap over
  • Chunk size
  • Chunking methods
    • Rabin’s fingerprint vs. fixed-sized
  • After extensive use
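The difference between the two chunking methods can be illustrated with a small experiment. The sketch below does not implement Rabin's fingerprint itself; it substitutes a simple polynomial rolling hash over a 16-byte window to pick content-defined boundaries, which is enough to show why such boundaries survive an insertion while fixed-size boundaries do not. All constants and function names are hypothetical.

```python
import random

WINDOW = 16           # bytes covered by the rolling hash
BASE = 257
MOD = (1 << 31) - 1
MASK = 0x3F           # boundary when the low 6 bits are zero (about 64-byte chunks)

def fixed_chunks(data: bytes, size: int = 64) -> list:
    """Cut at fixed offsets, regardless of content."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def content_defined_chunks(data: bytes) -> list:
    """Cut wherever the hash of the last WINDOW bytes matches a pattern."""
    chunks, start, h = [], 0, 0
    top = pow(BASE, WINDOW - 1, MOD)   # weight of the byte leaving the window
    for i, byte in enumerate(data):
        if i >= WINDOW:
            h = (h - data[i - WINDOW] * top) % MOD   # drop the oldest byte
        h = (h * BASE + byte) % MOD                   # add the newest byte
        if i >= WINDOW - 1 and (h & MASK) == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks

def shared_fraction(old: list, new: list) -> float:
    """Fraction of chunks in new that also occur in old."""
    old_set = set(old)
    return sum(1 for c in new if c in old_set) / len(new)

rng = random.Random(0)
original = bytes(rng.getrandbits(8) for _ in range(20000))
shifted = b"!" + original   # one byte inserted at the front

fixed_shared = shared_fraction(fixed_chunks(original), fixed_chunks(shifted))
cdc_shared = shared_fraction(content_defined_chunks(original),
                             content_defined_chunks(shifted))
print(f"fixed-size: {fixed_shared:.2f}, content-defined: {cdc_shared:.2f}")
```

With fixed-size chunking the single inserted byte shifts every later chunk, so almost nothing matches; with content-defined boundaries only the chunks near the insertion change.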

Page 12: Supporting Content-Addressable Caching with CZIP Compression

Chunk Size / Chunking Methods
Compare three VM images
• Base = Xen FC4 image
• Apache = Base + httpd
• Both = Apache + mysqld

[Figure: content overlap (%, 0 to 100) versus chunk size (4, 8, 16, 32, 60 KB) for Base vs. Apache and Apache vs. Both, each under Rabin’s fingerprint and fixed-sized chunking.]

Page 13: Supporting Content-Addressable Caching with CZIP Compression

Real VM Images
EC1 ~ EC5: VMs based on Xen FC-4 + standard tools
Used daily by five different engineers for three weeks

[Figure: content overlap (%, axis from 88 to 99) versus chunk size (4, 8, 16, 32, 60 KB); series: EC1 vs. EC2: Fixed, EC1 vs. EC2: Rabin, EC3 vs. EC4: Fixed, EC3 vs. EC4: Rabin.]

Page 14: Supporting Content-Addressable Caching with CZIP Compression

Dynamic Web Pages
• Observed the front page of these sites
  • Google News
  • CNN
  • Slashdot
  • Digg.com
  • Fark.com
  • New York Times
• All of them non-cacheable
  • “no-cache”, “no-store” or “private”
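A small sketch of the cacheability test implied here: a response is treated as non-cacheable when its Cache-Control header carries any of the three directives named on the slide. The function name is hypothetical, and treating "private" as non-cacheable reflects the view of a shared cache such as a proxy or CDN, which is an assumption about the setting.

```python
NON_CACHEABLE = {"no-cache", "no-store", "private"}

def is_cacheable(cache_control: str) -> bool:
    """Return False if the header value contains a directive that forbids shared caching."""
    directives = {part.strip().split("=")[0].lower()
                  for part in cache_control.split(",")}
    return not (directives & NON_CACHEABLE)

print(is_cacheable("max-age=3600"))
print(is_cacheable("private, max-age=0"))
```

Pages that fail this test are never reused by a URL-based cache, even when most of their bytes are unchanged, which is the gap that chunk-level naming targets.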

Page 15: Supporting Content-Addressable Caching with CZIP Compression

Average Content Overlap
Downloaded pages every 10 minutes for 18 days

[Figure: content overlap (%, 0 to 100) versus chunk size (1, 2, 4, 8, 16, 32 KB); series: CNN.com, Fark.com, Slashdot, NYTimes.com, Google News, Digg.com.]
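One plausible way to compute a content-overlap percentage between two snapshots of a page is sketched below. The exact metric used in the talk is not stated, so this definition (share of the current snapshot's bytes that fall in chunks already present in the previous snapshot), the fixed-size chunking, and the function names are all assumptions.

```python
import hashlib

def chunk_hashes(data: bytes, chunk_size: int):
    """Yield (content hash, length) for each fixed-size chunk."""
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        yield hashlib.sha1(chunk).hexdigest(), len(chunk)

def overlap_percent(previous: bytes, current: bytes, chunk_size: int) -> float:
    """Percentage of current's bytes in chunks that also appear in previous."""
    if not current:
        return 0.0
    seen = {name for name, _ in chunk_hashes(previous, chunk_size)}
    reused = sum(size for name, size in chunk_hashes(current, chunk_size)
                 if name in seen)
    return 100.0 * reused / len(current)

previous = b"aaaabbbbcccc"
current = b"aaaabbbbdddd"   # only the last third changed
small = overlap_percent(previous, current, chunk_size=4)
large = overlap_percent(previous, current, chunk_size=12)
print(f"{small:.1f} {large:.1f}")
```

With 4-byte chunks two of the three chunks are reused, while a single 12-byte chunk shares nothing, which mirrors the trend of overlap falling as chunk size grows.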

Page 16: Supporting Content-Addressable Caching with CZIP Compression

Potential Data Savings via CZIP

[Figure: bar chart of total transferred data (MB, 0 to 400), without CZIP versus with CZIP, for Google News, Slashdot, CNN, Digg.com, Fark.com and NY Times. Percentage labels on the chart: 37%, 57%, 90%, 24%, 61%, 39%.]

Page 17: Supporting Content-Addressable Caching with CZIP Compression

Summary So Far
• CZIP is comparable to GZIP in speed and performance
  • CZIP is far better with files with much redundancy
• Redundancy decreases as chunk size increases
  • Rabin’s fingerprint exposes a good deal of redundancy regardless of chunk sizes
  • Optimal chunk size varies over workload
  • Bigger chunk size is better for network transfer
• Dynamic content also exposes redundancy
  • CZIP can save 24-90% of BW instead of GZIP

Page 18: Supporting Content-Addressable Caching with CZIP Compression

Server Performance
• CZIP Apache Module
• Test scenario (FC mirror simulation)
  • 1.5 GB from FC6 DVD
  • 1.5 GB is split into three 0.5 GB images
  • Each file is requested in round-robin fashion
  • 100-300 clients simulated by six machines in LAN
  • Server is 2.8GHz Pentium D w/ 2GB physical memory and 2 Gbps-NICs

Page 19: Supporting Content-Addressable Caching with CZIP Compression

CZIP Apache Module

[Figure: cumulative distribution of throughput (Mbps, 0 to 300) for CZIP-aware Apache and normal Apache.]

• Worst client in CZIP-aware Apache is faster than 91% of normal Apache clients
• Median 2.07 times
• 90% 2.56 times

Page 20: Supporting Content-Addressable Caching with CZIP Compression

CBN-Aware Content Distribution
• CoBlitz large-file CDN [NSDI’06]
  • Serving 1-2 TB every day on PlanetLab
  • http://coblitz.codeen.org/URL
  • University channel – podcast/vodcast
  • Fedora Core mirror, Citeseer etc.
• Chunk is basic caching unit
  • Parallel chunk requests/responses
  • Chunk request in HTTP byte-range query

Page 21: Supporting Content-Addressable Caching with CZIP Compression

Making CoBlitz CZIP-Aware
• CoBlitz’s chunk request

GET /coblitz.codeen.org/www.cs.princeton.edu/bigfile.cz,start=1000,end=1999 HTTP/1.0
Host: coblitz.codeen.org

• CZIP-aware CoBlitz (C-CoBlitz) request

GET /czip.codeen.org/Chunk_SHA-1_Hash HTTP/1.0
Host: czip.codeen.org
X-URL: www.cs.princeton.edu/bigfile.cz
X-Range: byte=1000-1999
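The two request shapes can be generated side by side to make the difference visible. The helper names are hypothetical, and the placeholder hash string stands in for a real SHA-1 digest; the request lines and header names follow the slide.

```python
def coblitz_request(url: str, start: int, end: int) -> str:
    """Regular CoBlitz: the chunk is named by file URL plus byte range."""
    return (f"GET /coblitz.codeen.org/{url},start={start},end={end} HTTP/1.0\r\n"
            "Host: coblitz.codeen.org\r\n"
            "\r\n")

def c_coblitz_request(url: str, start: int, end: int, chunk_sha1: str) -> str:
    """C-CoBlitz: the chunk is named by its content hash; URL and range are hints."""
    return (f"GET /czip.codeen.org/{chunk_sha1} HTTP/1.0\r\n"
            "Host: czip.codeen.org\r\n"
            f"X-URL: {url}\r\n"
            f"X-Range: byte={start}-{end}\r\n"
            "\r\n")

url = "www.cs.princeton.edu/bigfile.cz"
regular = coblitz_request(url, 1000, 1999)
czip_aware = c_coblitz_request(url, 1000, 1999, "Chunk_SHA-1_Hash")
print(regular)
print(czip_aware)
```

Since the request line of the C-CoBlitz form contains only the chunk hash, identical chunks from different files map to the same cache entry, whereas the regular form yields a distinct cache key for every file and range.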

Page 22: Supporting Content-Addressable Caching with CZIP Compression

CZIP-Aware CoBlitz Testing
• Two content-overlapping files
• Simultaneously fetch from 100 PlanetLab nodes
• Origin server is at Princeton
• Testing cases
  • Regular: Download original files by regular CoBlitz
  • File-CZIP: Download CZIP’ed files by regular CoBlitz
  • CZIP-CDN: Download CZIP’ed files by C-CoBlitz

Page 23: Supporting Content-Addressable Caching with CZIP Compression

100 MB File Downloading

[Figure: bar chart comparing the three test cases.]

• Regular: 388 MB
• File-CZIP: 273 MB, 29.6%
• CZIP-CDN: 191 MB, 29.7%

Page 24: Supporting Content-Addressable Caching with CZIP Compression

50 MB File Downloading

[Figure: bar chart comparing the three test cases.]

• Regular: 183 MB
• File-CZIP: 92 MB, 49.7%
• CZIP-CDN: 24 MB, 73.9%
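The percentages on the 100 MB and 50 MB download slides appear to be reductions relative to the preceding test case (Regular, then File-CZIP, then CZIP-CDN). That reading is an assumption, checked here against the rounded MB figures from the slides; the function name is hypothetical.

```python
def reduction_percent(before_mb: float, after_mb: float) -> float:
    """Percentage saved when going from before_mb to after_mb."""
    return 100.0 * (before_mb - after_mb) / before_mb

# 50 MB file: Regular 183 MB, File-CZIP 92 MB, CZIP-CDN 24 MB
file_czip_50 = reduction_percent(183, 92)
czip_cdn_50 = reduction_percent(92, 24)

# 100 MB file: Regular 388 MB, File-CZIP 273 MB, CZIP-CDN 191 MB
file_czip_100 = reduction_percent(388, 273)
czip_cdn_100 = reduction_percent(273, 191)

print(round(file_czip_50, 1), round(czip_cdn_50, 1))
print(round(file_czip_100, 1), round(czip_cdn_100, 1))
```

Three of the four values reproduce the slide labels (49.7%, 73.9%, 29.6%); the 100 MB CZIP-CDN case comes out near 30.0% from the rounded inputs, close to the 29.7% on the slide.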

Page 25: Supporting Content-Addressable Caching with CZIP Compression

Conclusion
• CZIP is a generic compression tool providing CBN benefits
• CZIP is comparable to GZIP in compression performance
• CZIP helps greatly reduce memory footprint in serving similar files
• It is very easy to support CZIP and the benefit is transparent

Page 26: Supporting Content-Addressable Caching with CZIP Compression

Thank you!

More information can be found at http://codeen.cs.princeton.edu/czip/

CZIP code will be released soon!

Page 27: Supporting Content-Addressable Caching with CZIP Compression

200/300 Clients

[Figure: two throughput panels, one for 200 clients and one for 300 clients. Labels on the chart: 65%, 80%; median 1.95 times, 90% 2.27 times; median 1.84 times, 90% 2.11 times.]