TRANSCRIPT
Fine-grain Memory Deduplication for In-memory Database Systems
Heiner Litz, David Cheriton, Pete Stevenson
Stanford University
Memory Capacity Challenge
• In-memory databases – limited by memory capacity
• Technology – DRAM scaling – pin limitation – power
• Datasets grow faster than Moore's Law
Memory represents an increasing portion of system cost
Our Approach: HICAMP
Fine-grain memory deduplication:
1. Search memory for duplicates
2. Replace each copy with a reference
3. Virtually increase capacity
Reduces cost & power, increases performance
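The three steps above can be sketched in software. This is a minimal illustrative model, not the HICAMP hardware: memory is split into fixed-size lines, each distinct line is stored once, and every logical line becomes a reference to a physical copy.

```python
# Toy model of fine-grain deduplication (illustrative, not the hardware):
# scan a flat buffer in 64-byte lines, keep one physical copy of each
# distinct line, and replace duplicates with references (indices).

LINE_SIZE = 64  # bytes per line, matching HICAMP's 64-byte data lines

def deduplicate(memory: bytes):
    """Return (unique_lines, references) for a flat byte buffer."""
    unique = []   # one physical copy of each distinct line
    index = {}    # line content -> position in `unique`
    refs = []     # per logical line: reference into `unique`
    for off in range(0, len(memory), LINE_SIZE):
        line = memory[off:off + LINE_SIZE]
        if line not in index:            # step 1: search for duplicates
            index[line] = len(unique)
            unique.append(line)
        refs.append(index[line])         # step 2: replace copy with reference
    return unique, refs

buf = b"A" * 64 + b"B" * 64 + b"A" * 64 + b"A" * 64
unique, refs = deduplicate(buf)
# step 3: 4 logical lines stored as 2 physical lines -> 2x capacity gain
print(len(refs), len(unique), refs)  # 4 2 [0, 1, 0, 0]
```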
Potential Savings

Company     | Workload     | Dataset               | Memory  | Raw Dedup | +Overhead
Lightminer  | Benchmark    | TPCH                  | 256 GB  | 1.76x     | 1.54x
SAP         | SAP-Hana One | Private               | 1024 GB | 1.94x     | 1.68x
LinkedIn    | Profile Page | Professional Profiles | 48 GB   | 2.74x     | 2.28x
Quantifind  | Data Mining  | Social web data       | 64 GB   | 2.91x     | 2.40x
Arista      | Build server | Private               | 128 GB  | 3.10x     | 2.53x
UC Berkeley | Shark        | Conviva Server        | 68 GB   | 3.19x     | 2.59x
NHN         | Memcached    | Private               | 8 GB    | 3.74x     | 2.95x
Yelp        | Hadoop (EMR) | Private               | 7 GB    | 3.79x     | 2.99x

2x–3x deduplication – worth doing!
Page Sharing versus Dedup

[Bar chart: memory usage (GB, 0–32) under ESX with 4K page sharing, ESX with 2MB page sharing, and with HICAMP, for four workloads: VMmark 1 (36 virtual machines, 32 GB), VMmark 2 (8 virtual machines, 24.5 GB), SPECvirt (12 virtual machines, 30 GB), LoginVSI (30 virtual machines, 30 sessions, 32 GB)]
Architecture
• New Indirection Layer

[Diagram: the Address Translation Table maps addresses (Addr 0 – Addr 3) to PLIDs (PLID 0 – PLID 3), which in turn point to data lines in memory]
HICAMP Read

PHYSICAL MEMORY (PLID: line data, reference count):
0x72: value G, RC=2 · 0x73: value A, RC=1 · 0x74: value T, RC=3 · 0x75: value C, RC=1

TRANSLATION TABLE (address → PLID):
0x10→0x72, 0x11→0x73, 0x12→0x72, 0x13→0x74, 0x14→0x74, 0x15→0x74

1. READ request from processor – conventional addr=0x12
2. PLID lookup
3. Data access
4. "value G" returned to processor
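The read path above can be modeled in a few lines. The data structures are purely illustrative (Python dicts standing in for the hardware tables), with the exact values from the slide:

```python
# Toy model of the HICAMP read path: translate the conventional address
# to a PLID, then fetch the deduplicated line from physical memory.

# Physical memory: PLID -> (line data, reference count)
phys_mem = {0x72: ("value G", 2), 0x73: ("value A", 1),
            0x74: ("value T", 3), 0x75: ("value C", 1)}

# Translation table: conventional line address -> PLID
trans = {0x10: 0x72, 0x11: 0x73, 0x12: 0x72,
         0x13: 0x74, 0x14: 0x74, 0x15: 0x74}

def read(addr):
    plid = trans[addr]          # step 2: PLID lookup
    data, _rc = phys_mem[plid]  # step 3: data access
    return data                 # step 4: data returned to processor

print(read(0x12))  # value G
```

Note that addresses 0x12 and 0x10 translate to the same PLID (0x72): both logical lines share one physical copy of "value G", which is why its reference count is 2.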
HICAMP Write 1/2

(Same physical memory and translation table as above.)

1. WRITE request from processor – conventional addr=0x12, data="value C"
2. PLID lookup
3. Decrement reference count of previous value (value G: RC=2 → RC=1)
HICAMP Write 2/2

WRITE request from processor – conventional addr=0x12, data="value C"
4. Data lookup (finds "value C" already stored at PLID 0x75)
5. Increment reference count (value C: RC=1 → RC=2)
6. Write new PLID (PLID=0x75) into the translation table
How to perform efficient data lookup?

[Diagram: memory is organized into buckets (Bucket 0 … Bucket N); each bucket holds signatures (S0, S1, S2, S3) of its data lines, enabling a global content search]

Only 2 memory operations!
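One way to read the diagram (the exact hardware layout is not shown, so this structure is an assumption): a hash of the line content selects one bucket, and a small per-slot signature filters candidates, so a hit costs roughly two memory operations: one read of the bucket's signature line, one read of the single matching data line.

```python
# Sketch of signature-guided content lookup (assumed structure):
# hash picks a bucket, 1-byte signatures filter slots within it.

import hashlib

N_BUCKETS = 4
buckets = [{"sigs": [], "lines": []} for _ in range(N_BUCKETS)]

def signature(line):   # 1-byte signature of the line content
    return hashlib.sha256(line).digest()[0]

def bucket_of(line):   # bucket index from a second hash byte
    return hashlib.sha256(line).digest()[1] % N_BUCKETS

def lookup_or_insert(line):
    """Return (bucket, slot) of the line, inserting it if absent."""
    idx = bucket_of(line)
    b = buckets[idx]
    sig = signature(line)
    for slot, s in enumerate(b["sigs"]):           # mem op 1: signature line
        if s == sig and b["lines"][slot] == line:  # mem op 2: data line
            return (idx, slot)                     # found existing copy
    b["sigs"].append(sig)
    b["lines"].append(line)
    return (idx, len(b["lines"]) - 1)

a = lookup_or_insert(b"A" * 64)
again = lookup_or_insert(b"A" * 64)
assert a == again  # the second insert deduplicates to the same slot
```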
Reference Counting
• Multiple references per line
• Garbage collection: when to free lines?

[Diagram: each bucket (bucket0 … bucketn) holds a signature line, a reference-count line, and data lines (data line0 … data linem); narrow per-line counts (rc0 … rcn) overflow into wide reference counts]
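The narrow/wide split suggested by the diagram can be sketched as follows. The layout details are assumed: each line carries a small (here 1-byte) counter inline, and the rare counters that would overflow are spilled to a side table of full-width counters.

```python
# Sketch of narrow reference counts with wide-counter overflow
# (assumed layout, illustrative only).

NARROW_MAX = 255   # largest value a 1-byte counter can hold

narrow_rc = {}     # PLID -> small inline counter (0..255)
wide_rc = {}       # PLID -> full-width counter, overflow cases only

def inc_ref(plid):
    if plid in wide_rc:
        wide_rc[plid] += 1
    elif narrow_rc.get(plid, 0) == NARROW_MAX:
        wide_rc[plid] = NARROW_MAX + 1   # spill to the wide table
        del narrow_rc[plid]
    else:
        narrow_rc[plid] = narrow_rc.get(plid, 0) + 1

for _ in range(300):
    inc_ref(0x72)
# 300 references exceed the 1-byte range, so 0x72 now lives in wide_rc
```

The payoff is that almost all lines pay only one byte of counter overhead, while heavily shared lines still count correctly.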
Overheads
• Translation Table – 64-byte data lines – 32-bit PLIDs – overhead: 1/16th
• Signatures – 1 byte per data line (1/64th)
• Reference Counts – 1 byte per data line (1/64th)
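The arithmetic behind these fractions is simple to check: per 64-byte data line, the translation table costs a 4-byte (32-bit) PLID, plus one signature byte and one reference-count byte.

```python
# Checking the per-line metadata overhead stated above.

LINE = 64                    # bytes per data line
plid_overhead = 4 / LINE     # 32-bit PLID in the translation table: 1/16
sig_overhead = 1 / LINE      # 1 signature byte per line: 1/64
rc_overhead = 1 / LINE       # 1 reference-count byte per line: 1/64
total = plid_overhead + sig_overhead + rc_overhead
print(total)  # 0.09375, i.e. ~9.4% total metadata overhead
```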
DDC System

[Diagram: each core (core0 … coren) has private L1I/L1D and L2 caches; a shared L3 with the DDC connects through the memory controller (MC) to DDR channels (DDR0 … DDRn)]

DTB – Direct Translation Buffer
DDC – Deduplicated Cache
DDM – Deduplicated Memory Controller
Evaluation Methodology
• Cycle-accurate simulator (ZSIM)
• Functional + timing models for
– Deduplicated L3
– Cache coherency
– Replacement strategy
– Translation cache
– In-cache allocation
• Multi-workload analysis
Evaluation

[Bar chart: DDC speedup and DDM speedup (0.2x–1.4x) for fluidanimate, swaptions, bodytrack, vips, x264, ferret, dedup, freqmine, raytrace, streamcluster, memcached, postgresql, hadoop]

           | Compaction | Speedup
SPEC CPU   | 2.6x       | 1.07x
Parsec     | 1.3x       | 0.97x
Datacenter | 1.6x       | 1.01x
Compression vs. Dedup
• Both depend on redundancy/entropy
• Local vs. global
– Run-length encoding
– Huffman
– 1KB buffer for Ziv-Lempel
• Compression is compute-intensive
• Latency
– Deduplication: 2 memory operations
– Compression: variable, depends on technique and buffer size
• Both can be combined (delta encoding)
Other Uses of Indirection: SI-TM
• Snapshot-isolation-based Transactional Memory
• Extend translation layer to support versioning
• Efficient copy-on-write
• Concurrent reads & writes
• Read-only transactions always succeed
• Aborts only on write-write conflicts
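The SI-TM idea can be sketched on top of the translation layer. This is purely illustrative (the real mechanism is in hardware): a transaction snapshots the address→PLID table, readers use the snapshot, writers buffer new PLIDs, and commit aborts only when another writer has changed the same address since the snapshot.

```python
# Sketch of snapshot isolation via the translation layer (illustrative).

committed = {0x10: 0x72, 0x11: 0x73}   # address -> PLID at commit time

class Txn:
    def __init__(self):
        self.snapshot = dict(committed)  # consistent view at txn start
        self.writes = {}                 # buffered address -> new PLID

    def read(self, addr):
        # Reads see the snapshot plus the txn's own writes; they never block.
        return self.writes.get(addr, self.snapshot[addr])

    def write(self, addr, plid):
        self.writes[addr] = plid

    def commit(self):
        # Abort only on a write-write conflict: some other txn committed
        # a new PLID for an address we also wrote.
        for addr in self.writes:
            if committed[addr] != self.snapshot[addr]:
                return False
        committed.update(self.writes)
        return True

t1, t2 = Txn(), Txn()
t1.write(0x10, 0x74)
t2.write(0x10, 0x75)
assert t1.commit() is True    # first writer wins
assert t2.commit() is False   # write-write conflict -> abort
```

A read-only transaction has an empty write set, so its commit loop checks nothing and it always succeeds, matching the bullet above.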
Abort Rate Comparison

[Bar chart: abort rate normalized to 2PL (0–1.2) for genome, bayes, intruder, kmeans, labyrinth, ssca2, vacation, comparing 2PL, SONTM, and SI-TM]
Concluding Remarks
• Increasing demands on memory capacity – faster, simpler, more predictable
• Deduplication – increases capacity by 2x – amortizes the cost of the indirection layer
• A level of indirection opens the door to many other optimizations
• A compelling processor extension
Outlook II – New Memory Technologies
• Non-Volatile – Flash DIMM (Diablo), RRAM (Crossbar), PCM
– Deduplication (↑ capacity)
– Fine-grain wear leveling (↑ resilience)
– Fine-grain chipkill (↓ spare capacity)
• Hybrid Memory Cube
– Heterogeneous access latency
– Data remap to benefit from locality (↑ performance)
Enabled by the Translation Layer