shredder gpu-accelerated incremental storage and computation pramod bhatotia §, rodrigo rodrigues...

ShredderGPU-Accelerated Incremental

Storage and Computation

Pramod Bhatotia§, Rodrigo Rodrigues§, Akshat Verma¶

§MPI-SWS, Germany¶IBM Research-India

USENIX FAST 2012

Pramod Bhatotia 2

Handling the data deluge• Data stored in data centers is growing at a fast

pace

• Challenge: How to store and process this data ?• Key technique: Redundancy elimination

• Applications of redundancy elimination• Incremental storage: data de-duplication• Incremental computation: selective re-execution

Pramod Bhatotia 3

Redundancy elimination is expensive

For large-scale data, chunking easily becomes a bottleneck

Content-based chunking [SOSP’01]

ContentMarker

ChunkingFile Chunks Hash Yes

No

Duplicate

Hashing Matching

0 1 0 1 1 0 1 0 1 1 0 0 1 1 0 1 1 0 0 1 0 1 1 1 1 0 1 1 0 1 0 1 0

FileFingerprint

Pramod Bhatotia 4

Accelerate chunking using GPUs

GPUs have been successfully applied to compute-intensive tasks

Multicore GPU based design0

0.5

1

1.5

2

2.5

3

GBp

s

2X

20Gbps (2.5GBps)

Using GPUs for data-intensive tasks presents new challenges

?

StorageServers

Pramod Bhatotia 5

Rest of the talk

• Shredder design• Basic design• Background: GPU architecture & programming

model• Challenges and optimizations

• Evaluation• Case studies

• Computation: Incremental MapReduce• Storage: Cloud backup

Pramod Bhatotia 6

Shredder basic design

GPU (Device)CPU (Host)

Reader

Transfer

Store

Chunkingkernel

Data for chunking

Chunkeddata

Pramod Bhatotia 7

GPU architecture

CPU(Host)

GPU (Device)

Multi-processor N

Multi-processor 2Multi-processor 1

SP SP SPSP

PCI

Host memory

Device global

memory

Shared memory

Pramod Bhatotia 8

GPU programming model

CPU(Host)

GPU (Device)

Multi-processor N

Multi-processor 2Multi-processor 1

SP SP SPSP

PCI

Host memory

Device global

memory

Shared memory

Input

Output

Threads

Pramod Bhatotia 9

Scalability challenges

1. Host-device communication bottlenecks

2. Device memory conflicts

3. Host bottlenecks(See paper for details)

Pramod Bhatotia 10

Host-device communication bottleneck

Challenge # 1

GPU (Device)

Device global

memoryPCI

Synchronous data transfer and kernel execution • Cost of data transfer is comparable to kernel

execution• For large-scale data it involves many data transfers

CPU (Host)

Mainmemory Chunking

kernel

I/O

Transfer

Reader

Pramod Bhatotia 11

Asynchronous execution

TimeCopy to Buffer 1

GPU (Device)Asynchronous

copy

Buffer 1

Device global memoryMain

memoryTransfer Buffer 2

CPU (Host)

ComputeBuffer 2

ComputeBuffer 1

Copy to Buffer 2

Pros: + Overlaps communication with computation+ Generalizes to multi-buffering

Cons:- Requires page-pinning of buffers at host side

Pramod Bhatotia 12

Circular ring pinned memory buffers

GPU (Device)

CPU (Host)

Memcpy

Pinned circular Ring buffers

Pageablebuffers

Device global memory

Asynchronouscopy

Pramod Bhatotia 13

Device memory conflictsChallenge # 2

GPU (Device)

Device global

memoryPCI

CPU (Host)

Mainmemory Chunking

kernel

I/O

Transfer

Reader

Pramod Bhatotia 14

Accessing device memory

Thread-1 Thread-2 Thread-4


Multi-processor

SP-1 SP-2 SP-3 SP-4

Thread-3400-600Cycles

Fewcycles

Device shared memory

Pramod Bhatotia 15


Un-coordinated accesses to global memory lead to a large number of memory bank conflicts

Memory bank conflicts

Multi-processor

SP


SP SP SP

Pramod Bhatotia 16

Accessing memory banks

Bank 0 Bank 3Bank 1 Bank 2

048

15

26

37

MSBs LSBs

Address

Memoryaddress

Chipenable

OR

Data out

Interleaved memory

Pramod Bhatotia 17

Memory coalescingDevice global memory


Memorycoalescing

Tim

e1 2 4Thread # 3

Pramod Bhatotia 18

Processing the data

Thread-1 Thread-2 Thread-4


Thread-3

Pramod Bhatotia 19

Outline

• Shredder design• Evaluation• Case-studies

Pramod Bhatotia 20

Evaluating Shredder

• Goal: Determine how Shredder works in practice• How effective are the optimizations?• How does it compare with multicores?

• Implementation• Host driver in C++ and GPU in CUDA• GPU: NVidia Tesla C2050 cards• Host machine: Intel Xeon with12 cores

(See paper for details)

Pramod Bhatotia 21

Shredder vs. Multicores

Multicore GPU Basic GPU Async GPU Async + Coalescing

0

0.5

1

1.5

2

2.5

GBp

s 5X

Pramod Bhatotia 22

Outline

• Shredder design• Evaluation• Case studies

• Computation: Incremental MapReduce• Storage: Cloud backup (See paper for details)

Pramod Bhatotia 23

Read input

Map tasks

Reduce tasks

Write output

Incremental MapReduce

Pramod Bhatotia 24

Unstable input partitions

Read input

Map tasks

Reduce tasks

Write output

Pramod Bhatotia 25

GPU accelerated Inc-HDFS

Input file

ShredderHDFS Client

copyFromLocal

Split-1 Split-2 Split-3

Content-based chunking

Pramod Bhatotia 26

Related work

• GPU-accelerated systems• Storage: Gibraltar [ICPP’10], HashGPU [HPDC’10]• SSLShader[NSDI’11], PacketShader[SIGCOMM’10],

…

• Incremental computations• Incoop[SOCC’11], Nectar[OSDI’10],

Percolator[OSDI’10],…

Pramod Bhatotia 27

Conclusions

• GPU-accelerated framework for redundancy elimination• Exploits massively parallel GPUs in a cost-

effective manner

• Shredder design incorporates novel optimizations• More data-intensive than previous usage of GPUs

• Shredder can be seamlessly integrated with storage systems • To accelerate incremental storage and

computation

Thank You!

shredder gpu-accelerated incremental storage and computation pramod bhatotia §, rodrigo rodrigues...

Documents

device memory pramod

data pramod bhatotia

memory banks pramod

interleaved memory slide

computation pramod bhatotia

device shared memory

hdfs pramod bhatotia

multicores pramod bhatotia