
Fine-grain Memory Deduplication for In-memory Database Systems

Heiner Litz, David Cheriton, Pete Stevenson

Stanford University

Memory Capacity Challenge

•  In-memory databases
   – Limited by memory capacity
•  Technology
   – DRAM scaling
   – Pin limitation
   – Power
•  Datasets grow > Moore's Law

Memory represents an increasing portion of system cost

Our Approach: HICAMP

Fine-grain memory deduplication:
1. Search memory for duplicates
2. Replace the copy with a reference
3. Virtually increase capacity

Reduces cost & power, increases performance

Potential Savings

Company      Workload       Dataset                 Memory    Raw Dedup  +Overhead
Lightminer   Benchmark      TPCH                    256 GB    1.76x      1.54x
SAP          SAP-Hana One   Private                 1024 GB   1.94x      1.68x
LinkedIn     Profile Page   Professional Profiles   48 GB     2.74x      2.28x
Quantifind   Data Mining    Social web data         64 GB     2.91x      2.40x
Arista       Build server   Private                 128 GB    3.10x      2.53x
UC Berkeley  Shark          Conviva Server          68 GB     3.19x      2.59x
NHN          Memcached      Private                 8 GB      3.74x      2.95x
Yelp         Hadoop (EMR)   Private                 7 GB      3.79x      2.99x

2x-3x deduplication: worth doing!

Page sharing versus Dedup

[Figure: memory usage (GB, 0-32) for ESX with 4K page sharing, ESX with 2MB page sharing, and HICAMP on VMmark 1 (36 virtual machines, 32 GB), VMmark 2 (8 virtual machines, 24.5 GB), SPECvirt (12 virtual machines, 30 GB), and LoginVSI (30 virtual machines, 30 sessions, 32 GB)]

Architecture

•  New indirection layer

Address:            Addr 0 | Addr 1 | Addr 2 | Addr 3
Translation Table:  PLID 0 | PLID 1 | PLID 2 | PLID 3
Memory:             Data   | Data   | Data   | Data
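As a concrete illustration, the indirection layer can be sketched as a toy model in Python (the class and field names below are illustrative, not from the talk): identical data lines are stored once, and several addresses can map to the same PLID.

```python
# Toy model of the HICAMP indirection layer: addresses map to PLIDs
# (Physical Line IDs), and identical data lines share one PLID.

class DedupMemory:
    def __init__(self):
        self.lines = {}        # PLID -> data line
        self.plid_of = {}      # data line -> PLID (content index)
        self.table = {}        # address -> PLID (translation table)
        self.next_plid = 0x72  # arbitrary starting PLID, as in the slides

    def intern(self, data):
        """Return the PLID holding `data`, allocating a line only if new."""
        if data not in self.plid_of:
            self.plid_of[data] = self.next_plid
            self.lines[self.next_plid] = data
            self.next_plid += 1
        return self.plid_of[data]

    def write(self, addr, data):
        self.table[addr] = self.intern(data)

    def read(self, addr):
        return self.lines[self.table[addr]]

mem = DedupMemory()
mem.write(0x10, "value G")
mem.write(0x12, "value G")              # duplicate: reuses the same PLID
assert mem.table[0x10] == mem.table[0x12]
assert mem.read(0x12) == "value G"
```

In the real design the content index is a hardware structure, not a dictionary; the following slides show how reads, writes, and data lookup work with that structure.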

HICAMP Read

PHYSICAL MEMORY              TRANSLATION TABLE
0x72: value G  (RC=2)        0x10 → 0x72
0x73: value A  (RC=1)        0x11 → 0x73
0x74: value T  (RC=3)        0x12 → 0x72
0x75: value C  (RC=1)        0x13 → 0x74
                             0x14 → 0x74
                             0x15 → 0x74

1. READ request from processor (conventional addr = 0x12)
2. PLID lookup: 0x12 → 0x72
3. Data access
4. Data returned to processor: "value G"

HICAMP Write 1/2

PHYSICAL MEMORY              TRANSLATION TABLE
0x72: value G  (RC=2)        0x10 → 0x72
0x73: value A  (RC=1)        0x11 → 0x73
0x74: value T  (RC=3)        0x12 → 0x72
0x75: value C  (RC=1)        0x13 → 0x74
                             0x14 → 0x74
                             0x15 → 0x74

1. WRITE request from processor (conventional addr = 0x12, data = "value C")
2. PLID lookup: 0x12 → 0x72
3. Decrement reference count of previous value: 0x72 RC=2 → RC=1

HICAMP Write 2/2

PHYSICAL MEMORY              TRANSLATION TABLE
0x72: value G  (RC=1)        0x10 → 0x72
0x73: value A  (RC=1)        0x11 → 0x73
0x74: value T  (RC=3)        0x12 → 0x75
0x75: value C  (RC=2)        0x13 → 0x74
                             0x14 → 0x74
                             0x15 → 0x74

4. Data lookup: "value C" already present at PLID 0x75
5. Increment reference count: 0x75 RC=1 → RC=2
6. Write new PLID into translation table: 0x12 → 0x75
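The six write steps can be sketched end-to-end in Python (a toy model; the dictionaries standing in for physical memory, reference counts, the translation table, and the content index are illustrative, and the values follow the slides' example):

```python
# Deduplicated write, following the slides: decrement the old line's
# refcount, look up the new value, bump its refcount, store the new PLID.

lines = {0x72: "value G", 0x73: "value A", 0x74: "value T", 0x75: "value C"}
rc    = {0x72: 2, 0x73: 1, 0x74: 3, 0x75: 1}
table = {0x10: 0x72, 0x11: 0x73, 0x12: 0x72, 0x13: 0x74, 0x14: 0x74, 0x15: 0x74}
index = {v: p for p, v in lines.items()}   # content -> PLID

def dedup_write(addr, data):
    old = table[addr]               # 2. PLID lookup
    rc[old] -= 1                    # 3. decrement old line's refcount
    if rc[old] == 0:                # (garbage collect an unreferenced line)
        index.pop(lines.pop(old))
        del rc[old]
    plid = index.get(data)          # 4. data lookup
    if plid is None:                # no duplicate: allocate a fresh line
        plid = max(lines) + 1
        lines[plid], rc[plid] = data, 0
        index[data] = plid
    rc[plid] += 1                   # 5. increment new line's refcount
    table[addr] = plid              # 6. write new PLID

dedup_write(0x12, "value C")        # 1. WRITE addr=0x12, data="value C"
assert table[0x12] == 0x75
assert rc[0x72] == 1 and rc[0x75] == 2
```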

How to perform efficient data lookup?

Global Content Search
•  Memory is divided into buckets: Bucket 0, Bucket 1, Bucket 2, Bucket 3, …, Bucket N
•  Each bucket holds signatures (S0, S1, S2, S3) for its data lines

Only 2 Mem Ops!
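A sketch of how a bucket's signature line keeps lookup to two memory operations: one read of the signature line, then one read of a data line whose 1-byte signature matches. The hash and signature functions below are illustrative assumptions, not the hardware's:

```python
import hashlib

NUM_BUCKETS = 4
LINES_PER_BUCKET = 4

def h(data):
    return hashlib.sha256(data.encode()).digest()

def bucket_of(data):
    return h(data)[0] % NUM_BUCKETS   # content hash selects the bucket

def signature(data):
    return h(data)[1]                 # 1-byte signature per data line

# Each bucket: a signature line (one byte per slot) plus data lines.
buckets = [{"sigs": [None] * LINES_PER_BUCKET,
            "data": [None] * LINES_PER_BUCKET} for _ in range(NUM_BUCKETS)]

def lookup(data):
    """Two 'memory reads': the bucket's signature line, then only the
    data slot(s) whose signature matches."""
    b = buckets[bucket_of(data)]
    sig = signature(data)
    for slot, s in enumerate(b["sigs"]):          # read 1: signature line
        if s == sig and b["data"][slot] == data:  # read 2: data line
            return bucket_of(data), slot
    return None

def insert(data):
    b = buckets[bucket_of(data)]
    if lookup(data) is None:
        slot = b["sigs"].index(None)              # first free slot
        b["sigs"][slot], b["data"][slot] = signature(data), data

insert("value G")
insert("value G")                                 # duplicate: stored once
assert lookup("value G") is not None
```

Because the content hash fixes the bucket and the signature filters the slots, a lookup rarely touches more than one data line, which is where the "2 Mem Ops" claim comes from.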

Reference Counting

•  Multiple references per line
•  Garbage collection: when to free lines?

Bucket layout:
bucket0: signature line | ref count line | data line0 … data linem
  ⋮
bucketn: signature line | ref count line | data line0 … data linem

ref count line: rc0 | rc1 | … | rcn | wide rc | wide rc

Overheads

•  Translation Table
   – 64-byte data lines
   – 32-bit PLIDs
   – Overhead: 1/16th
•  Signatures
   – 1 byte per data line (1/64th)
•  Reference Counts
   – 1 byte per data line (1/64th)
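The overhead figures follow directly from the line geometry. A quick arithmetic check (assuming the three overheads simply add):

```python
LINE = 64                 # bytes per data line
PLID = 4                  # 32-bit PLID in the translation table

tt_overhead  = PLID / LINE      # 4/64  = 1/16 for the translation table
sig_overhead = 1 / LINE         # 1 signature byte per line = 1/64
rc_overhead  = 1 / LINE         # 1 refcount byte per line = 1/64

assert tt_overhead == 1 / 16
total = tt_overhead + sig_overhead + rc_overhead
print(f"total metadata overhead: {total:.1%}")   # about 9.4%
```

So the metadata costs under 10% of capacity, against the 2x-3x deduplication gain shown earlier.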

DDC System

[Diagram: core0 … coren, each with private L1I/L1D and L2 caches, share a deduplicated L3 (DDC) and the memory controller (MC), which drives channels DDR0 … DDRn]

•  DTB: Direct Translation Buffer
•  DDC: Deduplicated Cache
•  DDM: Deduplicated Memory Controller

Evaluation Methodology

•  Cycle-accurate simulator (ZSim)
•  Functional + timing models for
   – Deduplicated L3
   – Cache coherency
   – Replacement strategy
   – Translation cache
   – In-cache allocation
•  Multi-workload analysis

Evaluation

[Figure: DDC and DDM speedup (0.2x-1.4x) on fluidanimate, swaptions, bodytrack, vips, x264, ferret, dedup, freqmine, raytrace, streamcluster, memcached, postgresql, and hadoop]

             Compaction   Speedup
SPEC CPU     2.6x         1.07x
Parsec       1.3x         0.97x
Datacenter   1.6x         1.01x

Compression vs. Dedup

•  Both depend on redundancy/entropy
•  Local vs. global
   – Run-length encoding
   – Huffman
   – 1KB buffer for Ziv-Lempel
•  Compression is compute intensive
•  Latency
   – Deduplication: 2 Mem Ops
   – Compression: variable, depends on technique and buffer size
•  Both can be combined (delta encoding)

Other uses of Indirection: SI-TM

•  Snapshot-isolation-based Transactional Memory
•  Extend the translation layer to support versioning
•  Efficient copy-on-write
•  Concurrent reads & writes
•  Read-only transactions always succeed
•  Aborts only on write-write conflicts
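One way to picture the versioning extension (a toy sketch; the multi-version table below is an assumption about the structure, not the paper's implementation): each write installs a new versioned mapping, so a read-only transaction at an older snapshot always finds a consistent value and never aborts.

```python
# Toy sketch of snapshot isolation via a versioned translation table:
# writes append (version, value) entries; a reader at snapshot S sees
# the newest entry with version <= S, even while writers proceed.

table = {}   # addr -> list of (version, value) entries

def write(addr, value, version):
    table.setdefault(addr, []).append((version, value))

def read(addr, snapshot):
    """Return the newest value whose version is <= snapshot."""
    return max((v, d) for v, d in table[addr] if v <= snapshot)[1]

write(0x10, "value G", version=1)
write(0x10, "value C", version=3)            # concurrent writer
assert read(0x10, snapshot=2) == "value G"   # old snapshot still consistent
assert read(0x10, snapshot=3) == "value C"
```

For brevity the table maps addresses to values directly; in HICAMP it would map to PLIDs, so installing a version is just writing a new PLID (copy-on-write for free).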

Abort Rate Comparison

[Figure: abort rate, normalized to 2PL (0-1.2), for 2PL, SONTM, and SI-TM on genome, bayes, intruder, kmeans, labyrinth, ssca2, and vacation]

Concluding Remarks

•  Increasing demands on memory capacity
   – Faster, simpler, more predictable
•  Deduplication
   – Increases capacity by 2x
   – Amortizes the cost of an indirection layer
•  The level of indirection opens the door to many other optimizations
•  Compelling processor extension

Questions?

[email protected]

Outlook II - New Memory Technologies

•  Non-volatile memory
   – Flash DIMM (Diablo), RRAM (Crossbar), PCM
   – Deduplication (↑ capacity)
   – Fine-grain wear leveling (↑ resilience)
   – Fine-grain chipkill (↓ spare capacity)
•  Hybrid Memory Cube
   – Heterogeneous access latency
   – Data remap to benefit from locality (↑ performance)

Enabled by the Translation Layer

Indirection Layer II

•  Memory error remapping (chipkill)
•  Fine-grain wear leveling (NVRAM)
•  Traffic steering (exploit locality/heterogeneity)
•  Memcopy
   – Copy PLIDs instead of data
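The memcopy bullet can be sketched as follows (toy model; the dictionary structures are illustrative): copying a region copies PLIDs and bumps reference counts, and never touches the data lines themselves.

```python
# Toy sketch: with an indirection layer, memcopy copies PLIDs and bumps
# reference counts; the data lines are never read or written.

lines = {0x72: "value G", 0x74: "value T"}   # PLID -> data line
rc    = {0x72: 1, 0x74: 2}                   # PLID -> reference count
table = {0x10: 0x72, 0x11: 0x74}             # address -> PLID

def plid_memcopy(src, dst, n):
    for i in range(n):
        plid = table[src + i]
        rc[plid] += 1            # the copy is just another reference
        table[dst + i] = plid    # no data line is touched

plid_memcopy(0x10, 0x20, 2)
assert table[0x20] == 0x72 and table[0x21] == 0x74
assert rc[0x72] == 2 and rc[0x74] == 3
assert len(lines) == 2           # no data was duplicated
```

The copy is O(region size / line size) table writes instead of a full data move, and a later write to either copy simply installs a new PLID (copy-on-write).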