TRANSCRIPT
Fine-grain Memory Deduplication for In-memory Database Systems
Heiner Litz, David Cheriton, Pete Stevenson
Stanford University
Memory Capacity Challenge
• In-memory databases – limited by memory capacity
• Technology – DRAM scaling – pin limitation – power
• Datasets grow faster than Moore's Law
Memory represents an increasing portion of system cost
Our Approach: HICAMP
Fine-grain memory deduplication:
1. Search memory for duplicates
2. Replace each copy with a reference
3. Virtually increase capacity
Reduces cost & power, increases performance
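The three steps above can be sketched in software. This is a minimal illustrative model, not the HICAMP hardware: memory is split into fixed-size lines, each distinct line is stored once, and every logical line becomes a reference to a physical copy.

```python
# Toy model of fine-grain deduplication (illustrative, not the hardware):
# scan a flat buffer in 64-byte lines, keep one physical copy of each
# distinct line, and replace duplicates with references (indices).

LINE_SIZE = 64  # bytes per line, matching HICAMP's 64-byte data lines

def deduplicate(memory: bytes):
    """Return (unique_lines, references) for a flat byte buffer."""
    unique = []   # one physical copy of each distinct line
    index = {}    # line content -> position in `unique`
    refs = []     # per logical line: reference into `unique`
    for off in range(0, len(memory), LINE_SIZE):
        line = memory[off:off + LINE_SIZE]
        if line not in index:            # step 1: search for duplicates
            index[line] = len(unique)
            unique.append(line)
        refs.append(index[line])         # step 2: replace copy with reference
    return unique, refs

buf = b"A" * 64 + b"B" * 64 + b"A" * 64 + b"A" * 64
unique, refs = deduplicate(buf)
# step 3: 4 logical lines stored as 2 physical lines -> 2x capacity gain
print(len(refs), len(unique), refs)  # 4 2 [0, 1, 0, 0]
```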
Potential Savings

Company     | Workload     | Dataset               | Memory  | Raw Dedup | +Overhead
Lightminer  | Benchmark    | TPCH                  | 256 GB  | 1.76x     | 1.54x
SAP         | SAP-Hana One | Private               | 1024 GB | 1.94x     | 1.68x
LinkedIn    | Profile Page | Professional Profiles | 48 GB   | 2.74x     | 2.28x
Quantifind  | Data Mining  | Social web data       | 64 GB   | 2.91x     | 2.40x
Arista      | Build server | Private               | 128 GB  | 3.10x     | 2.53x
UC Berkeley | Shark        | Conviva Server        | 68 GB   | 3.19x     | 2.59x
NHN         | Memcached    | Private               | 8 GB    | 3.74x     | 2.95x
Yelp        | Hadoop (EMR) | Private               | 7 GB    | 3.79x     | 2.99x

2x–3x deduplication – worth doing!
Page Sharing versus Dedup

[Bar chart: memory usage (GB, 0–32) under ESX with 4K page sharing, ESX with 2MB page sharing, and with HICAMP, for four workloads: VMmark 1 (36 virtual machines, 32 GB), VMmark 2 (8 virtual machines, 24.5 GB), SPECvirt (12 virtual machines, 30 GB), LoginVSI (30 virtual machines, 30 sessions, 32 GB)]
Architecture
• New Indirection Layer

[Diagram: the Address Translation Table maps addresses (Addr 0 – Addr 3) to PLIDs (PLID 0 – PLID 3), which in turn point to data lines in memory]
HICAMP Read

PHYSICAL MEMORY (PLID: line data, reference count):
0x72: value G, RC=2 · 0x73: value A, RC=1 · 0x74: value T, RC=3 · 0x75: value C, RC=1

TRANSLATION TABLE (address → PLID):
0x10→0x72, 0x11→0x73, 0x12→0x72, 0x13→0x74, 0x14→0x74, 0x15→0x74

1. READ request from processor – conventional addr=0x12
2. PLID lookup
3. Data access
4. "value G" returned to processor
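The read path above can be modeled in a few lines. The data structures are purely illustrative (Python dicts standing in for the hardware tables), with the exact values from the slide:

```python
# Toy model of the HICAMP read path: translate the conventional address
# to a PLID, then fetch the deduplicated line from physical memory.

# Physical memory: PLID -> (line data, reference count)
phys_mem = {0x72: ("value G", 2), 0x73: ("value A", 1),
            0x74: ("value T", 3), 0x75: ("value C", 1)}

# Translation table: conventional line address -> PLID
trans = {0x10: 0x72, 0x11: 0x73, 0x12: 0x72,
         0x13: 0x74, 0x14: 0x74, 0x15: 0x74}

def read(addr):
    plid = trans[addr]          # step 2: PLID lookup
    data, _rc = phys_mem[plid]  # step 3: data access
    return data                 # step 4: data returned to processor

print(read(0x12))  # value G
```

Note that addresses 0x12 and 0x10 translate to the same PLID (0x72): both logical lines share one physical copy of "value G", which is why its reference count is 2.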
HICAMP Write 1/2

(Same physical memory and translation table as above.)

1. WRITE request from processor – conventional addr=0x12, data="value C"
2. PLID lookup
3. Decrement reference count of previous value (value G: RC=2 → RC=1)
HICAMP Write 2/2

WRITE request from processor – conventional addr=0x12, data="value C"
4. Data lookup (finds "value C" already stored at PLID 0x75)
5. Increment reference count (value C: RC=1 → RC=2)
6. Write new PLID (PLID=0x75) into the translation table
How to perform efficient data lookup?

[Diagram: memory is organized into buckets (Bucket 0 … Bucket N); each bucket holds signatures (S0, S1, S2, S3) of its data lines, enabling a global content search]

Only 2 memory operations!
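One way to read the diagram (the exact hardware layout is not shown, so this structure is an assumption): a hash of the line content selects one bucket, and a small per-slot signature filters candidates, so a hit costs roughly two memory operations: one read of the bucket's signature line, one read of the single matching data line.

```python
# Sketch of signature-guided content lookup (assumed structure):
# hash picks a bucket, 1-byte signatures filter slots within it.

import hashlib

N_BUCKETS = 4
buckets = [{"sigs": [], "lines": []} for _ in range(N_BUCKETS)]

def signature(line):   # 1-byte signature of the line content
    return hashlib.sha256(line).digest()[0]

def bucket_of(line):   # bucket index from a second hash byte
    return hashlib.sha256(line).digest()[1] % N_BUCKETS

def lookup_or_insert(line):
    """Return (bucket, slot) of the line, inserting it if absent."""
    idx = bucket_of(line)
    b = buckets[idx]
    sig = signature(line)
    for slot, s in enumerate(b["sigs"]):           # mem op 1: signature line
        if s == sig and b["lines"][slot] == line:  # mem op 2: data line
            return (idx, slot)                     # found existing copy
    b["sigs"].append(sig)
    b["lines"].append(line)
    return (idx, len(b["lines"]) - 1)

a = lookup_or_insert(b"A" * 64)
again = lookup_or_insert(b"A" * 64)
assert a == again  # the second insert deduplicates to the same slot
```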
Reference Counting
• Multiple references per line
• Garbage collection: when to free lines?

[Diagram: each bucket (bucket0 … bucketn) holds a signature line, a reference-count line, and data lines (data line0 … data linem); narrow per-line counts (rc0 … rcn) overflow into wide reference counts]
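The narrow/wide split suggested by the diagram can be sketched as follows. The layout details are assumed: each line carries a small (here 1-byte) counter inline, and the rare counters that would overflow are spilled to a side table of full-width counters.

```python
# Sketch of narrow reference counts with wide-counter overflow
# (assumed layout, illustrative only).

NARROW_MAX = 255   # largest value a 1-byte counter can hold

narrow_rc = {}     # PLID -> small inline counter (0..255)
wide_rc = {}       # PLID -> full-width counter, overflow cases only

def inc_ref(plid):
    if plid in wide_rc:
        wide_rc[plid] += 1
    elif narrow_rc.get(plid, 0) == NARROW_MAX:
        wide_rc[plid] = NARROW_MAX + 1   # spill to the wide table
        del narrow_rc[plid]
    else:
        narrow_rc[plid] = narrow_rc.get(plid, 0) + 1

for _ in range(300):
    inc_ref(0x72)
# 300 references exceed the 1-byte range, so 0x72 now lives in wide_rc
```

The payoff is that almost all lines pay only one byte of counter overhead, while heavily shared lines still count correctly.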
Overheads
• Translation Table – 64-byte data lines – 32-bit PLIDs – overhead: 1/16th
• Signatures – 1 byte per data line (1/64th)
• Reference Counts – 1 byte per data line (1/64th)
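The arithmetic behind these fractions is simple to check: per 64-byte data line, the translation table costs a 4-byte (32-bit) PLID, plus one signature byte and one reference-count byte.

```python
# Checking the per-line metadata overhead stated above.

LINE = 64                    # bytes per data line
plid_overhead = 4 / LINE     # 32-bit PLID in the translation table: 1/16
sig_overhead = 1 / LINE      # 1 signature byte per line: 1/64
rc_overhead = 1 / LINE       # 1 reference-count byte per line: 1/64
total = plid_overhead + sig_overhead + rc_overhead
print(total)  # 0.09375, i.e. ~9.4% total metadata overhead
```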
DDC System

[Diagram: each core (core0 … coren) has private L1I/L1D and L2 caches; a shared L3 with the DDC connects through the memory controller (MC) to DDR channels (DDR0 … DDRn)]

DTB – Direct Translation Buffer
DDC – Deduplicated Cache
DDM – Deduplicated Memory Controller
Evaluation Methodology
• Cycle-accurate simulator (ZSIM)
• Functional + timing models for
– Deduplicated L3
– Cache coherency
– Replacement strategy
– Translation cache
– In-cache allocation
• Multi-workload analysis
Evaluation

[Bar chart: DDC speedup and DDM speedup (0.2x–1.4x) for fluidanimate, swaptions, bodytrack, vips, x264, ferret, dedup, freqmine, raytrace, streamcluster, memcached, postgresql, hadoop]

           | Compaction | Speedup
SPEC CPU   | 2.6x       | 1.07x
Parsec     | 1.3x       | 0.97x
Datacenter | 1.6x       | 1.01x
Compression vs. Dedup
• Both depend on redundancy/entropy
• Local vs. global
– Run-length encoding
– Huffman
– 1KB buffer for Ziv-Lempel
• Compression is compute-intensive
• Latency
– Deduplication: 2 memory operations
– Compression: variable, depends on technique and buffer size
• Both can be combined (delta encoding)
Other Uses of Indirection: SI-TM
• Snapshot-isolation-based Transactional Memory
• Extend translation layer to support versioning
• Efficient copy-on-write
• Concurrent reads & writes
• Read-only transactions always succeed
• Aborts only on write-write conflicts
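The SI-TM idea can be sketched on top of the translation layer. This is purely illustrative (the real mechanism is in hardware): a transaction snapshots the address→PLID table, readers use the snapshot, writers buffer new PLIDs, and commit aborts only when another writer has changed the same address since the snapshot.

```python
# Sketch of snapshot isolation via the translation layer (illustrative).

committed = {0x10: 0x72, 0x11: 0x73}   # address -> PLID at commit time

class Txn:
    def __init__(self):
        self.snapshot = dict(committed)  # consistent view at txn start
        self.writes = {}                 # buffered address -> new PLID

    def read(self, addr):
        # Reads see the snapshot plus the txn's own writes; they never block.
        return self.writes.get(addr, self.snapshot[addr])

    def write(self, addr, plid):
        self.writes[addr] = plid

    def commit(self):
        # Abort only on a write-write conflict: some other txn committed
        # a new PLID for an address we also wrote.
        for addr in self.writes:
            if committed[addr] != self.snapshot[addr]:
                return False
        committed.update(self.writes)
        return True

t1, t2 = Txn(), Txn()
t1.write(0x10, 0x74)
t2.write(0x10, 0x75)
assert t1.commit() is True    # first writer wins
assert t2.commit() is False   # write-write conflict -> abort
```

A read-only transaction has an empty write set, so its commit loop checks nothing and it always succeeds, matching the bullet above.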
Abort Rate Comparison

[Bar chart: abort rate normalized to 2PL (0–1.2) for genome, bayes, intruder, kmeans, labyrinth, ssca2, vacation, comparing 2PL, SONTM, and SI-TM]
Concluding Remarks
• Increasing demands on memory capacity – faster, simpler, more predictable
• Deduplication – increases capacity by 2x – amortizes the cost of the indirection layer
• A level of indirection opens the door to many other optimizations
• A compelling processor extension
Outlook II – New Memory Technologies
• Non-Volatile – Flash DIMM (Diablo), RRAM (Crossbar), PCM
– Deduplication (↑ capacity)
– Fine-grain wear leveling (↑ resilience)
– Fine-grain chipkill (↓ spare capacity)
• Hybrid Memory Cube
– Heterogeneous access latency
– Data remap to benefit from locality (↑ performance)
Enabled by the Translation Layer