TRANSCRIPT
-
Proteus: A Flexible and Fast Software Supported Hardware Logging Approach for NVM
Seunghee Shin, Satish Tirukkovalluri, James Tuck, and Yan Solihin
North Carolina State University
The 2018 Non-Volatile Memories Workshop (NVMW 2018)
-
Background
• Use NVM as storage or main memory? We assume NV main memory (NVMM)
  – Keep important data in memory instead of files
  – Need to ensure failure safety

[Figure: memory technology comparison]
  – DRAM: + fast, + byte-addressable, − volatile
  – Disk / Flash: + non-volatile, − slow, − block-addressable
  – NVM: + fast, + byte-addressable, + non-volatile
-
Failure Safety through Durable Transactions
• Durable transaction: needed to ensure failure safety
  – All updates in a transaction are atomically durable
  – Atomicity can be achieved through HW or SW undo logging

[Figure: inserting node X into a linked list A→B→C→D; a system failure during the insert is recovered via undo-logging]
-
Transaction with Software Undo-Logging
• Step 1 - Create undo log and make it durable
• Step 2 - Set log-flag and make it durable, indicating transaction start
• Step 3 - Perform data updates and make them durable
• Step 4 - Unset log-flag and make it durable, indicating transaction end
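The four steps above can be sketched in C. This is a minimal simulation on plain arrays, not real NVM code: `persist()` is a placeholder for the cache-line flush and fence that real hardware needs after each step, and the data/log regions are ordinary global arrays standing in for NVMM.

```c
#include <assert.h>
#include <string.h>

#define N 4

static int data[N];      /* simulated NVMM data region */
static int undo_log[N];  /* simulated NVMM undo-log region */
static int log_flag;     /* transaction-in-flight flag */

/* Placeholder: on real hardware this would be clwb + sfence. */
static void persist(void *p, int bytes) { (void)p; (void)bytes; }

/* Durable transaction: replace data[] with new_vals[] using the
 * four-step software undo-logging protocol. */
static void tx_update(const int *new_vals) {
    memcpy(undo_log, data, sizeof data);      /* step 1: create undo log */
    persist(undo_log, sizeof undo_log);
    log_flag = 1;                             /* step 2: set log flag */
    persist(&log_flag, sizeof log_flag);
    memcpy(data, new_vals, sizeof data);      /* step 3: update data */
    persist(data, sizeof data);
    log_flag = 0;                             /* step 4: unset log flag */
    persist(&log_flag, sizeof log_flag);
}

/* Recovery after a crash: a set flag means the transaction did not
 * commit, so roll the data back from the undo log. */
static void recover(void) {
    if (log_flag) {
        memcpy(data, undo_log, sizeof data);
        log_flag = 0;
    }
}
```

Because the log is made durable before the flag, and the flag before the data updates, recovery can always tell whether the log contents are valid.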
-
Memory Persistency
• Unpredictable persist ordering
  – Persist: an operation that makes NVMM writes durable
  – The NVMM persist order is determined by LLC writebacks, not by program order
• Persistency model
  – Defines when stores become durable (i.e., placed in the persistence domain)
  – E.g., the Intel PMEM persistency model, strict persistency, epoch persistency, buffered epoch persistency, strand persistency, etc.

[Figure: private caches → shared cache (LLC) → MC → NVMM; writebacks leave the LLC in unpredictable order]
-
Intel PMEM Instruction and ADR
• Asynchronous DRAM Refresh (ADR)
  – Adds the write pending queue (WPQ) in the MC to the persistence domain
  – Flushes data in the WPQ to NVMM automatically on system failure
• CLWB
  – Writes back a dirty block from the caches to the WPQ
  – A fence is needed for ordering
[Figure: L1/L2 private caches → shared cache → MC (WPQ inside the persistence domain) → NVMM. With "st A; st B" alone, the two stores may persist in either order; the sequence "st A; clwb A; sfence; st B" guarantees A is durable before B.]
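The ADR guarantee can be illustrated with a toy model (this is not Intel's actual microarchitecture, just a sketch): stores land in a volatile cache, `clwb` moves a line into the WPQ, and a crash drains the WPQ to NVMM while discarding cache contents.

```c
#include <assert.h>

enum { LINES = 2 };  /* track two addresses: A = 0, B = 1 */

static int cache[LINES], cache_valid[LINES];  /* volatile */
static int wpq[LINES],   wpq_valid[LINES];    /* persistence domain (ADR) */
static int nvmm[LINES];                       /* durable media */

/* A store only reaches the volatile cache. */
static void st(int addr, int v) { cache[addr] = v; cache_valid[addr] = 1; }

/* clwb: write a dirty line back to the WPQ. */
static void clwb(int addr) {
    if (cache_valid[addr]) { wpq[addr] = cache[addr]; wpq_valid[addr] = 1; }
}

/* Crash: ADR drains the WPQ to NVMM; cache contents are lost. */
static void crash(void) {
    for (int i = 0; i < LINES; i++) {
        if (wpq_valid[i]) { nvmm[i] = wpq[i]; wpq_valid[i] = 0; }
        cache_valid[i] = 0;
    }
}
```

Running `st A; clwb A; sfence; st B` and then crashing leaves A durable in NVMM while B, which was never flushed, is lost with the cache, which is exactly why the logging protocols on the next slides must flush and fence at each step.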
-
Let's Revisit Software Logging

• Software logging (SL)
  – Software performs log creation, maintenance, and truncation
  – (+) Flexible (e.g., no OS support needed)
  – (−) High performance overheads (~50% slowdown)
• Hardware logging (HL)
  – Hardware creates and manages logs automatically (e.g., ATOM [HPCA'17])
  – (+) Low performance overheads
  – (−) Not flexible

[Figure: program-order timelines of log A..D and st A..D under software vs. hardware logging; software logging requires a FENCE between each log write and its data store]
  a. With hardware logging, no memory fence is required between logging and data modification
  b. New logging optimizations become possible
-
Software Supported Hardware Logging
• Software Supported Hardware Logging (SSHL)
  – Hardware provides logging instructions
  – Software performs logging operations using those instructions
  – Hardware applies optimizations

[Figure: SL is flexible but not fast; HL is fast but not flexible; SSHL is fast and flexible]
-
Proteus: SSHL Design
• Flexibility: software involvement in logging
  – Adds instructions that start logging operations in hardware
  – Two instructions are required: log-load and log-flush
• Performance optimizations
  – Parallel logging: process multiple loggings concurrently
  – Redundant logging detection and removal
• Endurance optimization (log write removal)
  – With the introduction of ADR, the WPQ is considered non-volatile
  – Key insight: logs are no longer needed once a transaction commits
  – Remove logs without flushing them to NVMM
-
Proteus: New Logging Instructions
• Operands
  – log-from address (M1): address of the original data
  – log-to address (M2): address of the log entry
  – Log data register (LR#): register holding the logged data
• Semantics
  – log-load $LR1, M1 → LR1 = Mem[M1]
  – log-flush $LR1, M2 → Mem[M2] = LR1

[Figure: log-load reads the original data at M1 into $LR1; log-flush writes $LR1 to the log entry at M2]

Code generation:

  tx_begin        i1: tx_begin
  A = ...         i2: log-load LR1, A
  B = ...         i3: log-flush LR1, (LTA)+
  tx_end          i4: st A
                  i5: log-load LR2, B
                  i6: log-flush LR2, (LTA)+
                  i7: st B
                  i8: tx_end
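The semantics of the two new instructions can be modeled in a few lines of C. This is a hypothetical software sketch, not the hardware implementation: `mem[]` stands in for NVMM, and `lta` mimics the post-incremented log-to address `(LTA)+` from the code-generation example; the log-area base index is an assumption.

```c
#include <assert.h>

#define MEMSIZE 16
static int mem[MEMSIZE];
static int lta = 8;  /* assumed: log area starts at mem[8] */

/* log-load $LR, M1  ->  LR = Mem[M1] */
static int log_load(int m1) { return mem[m1]; }

/* log-flush $LR, M2 ->  Mem[M2] = LR */
static void log_flush(int lr, int m2) { mem[m2] = lr; }

/* Instruction sequence emitted for one store inside a transaction:
 * log the old value of the location, then perform the new store. */
static void tx_store(int addr, int new_val) {
    int lr = log_load(addr);   /* log-load LR, addr        */
    log_flush(lr, lta++);      /* log-flush LR, (LTA)+     */
    mem[addr] = new_val;       /* st addr                  */
}
```

Note that, as on slide 7, no fence is required between the log-flush and the store: in Proteus the hardware tracks the log-to-store dependence itself.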
-
Proteus Hardware Design: Pipeline

[Figure: pipeline with log data registers (LDR) next to the integer/FP register file; per-core txID, log-start, log-end, and cur-log registers; cache tags extended with txID; memory controller with ADR containing the WPQ, LogQ, LPQ, LLT, dependence checks, and an arbiter in front of NVMM]
• Log data registers (LDR): keep log data while logging instructions are in the pipeline
• Per-core registers
  – txID: the transaction ID currently executing on the core
  – log-start / log-end: the start and end addresses of the log area
  – cur-log: tracks the current free log entry
• Log queue (LogQ): maintains log-to-store dependences and keeps track of in-flight loggings (parallel logging)
• Log look-up table (LLT): prevents redundant loggings within a transaction
• Log pending queue (LPQ): holds logs until the transaction ends or no free entries remain; separating logs from the WPQ keeps incoming read requests from having to check log entries
• Arbiter: prioritizes writes from the WPQ unless the LPQ is nearly out of free entries (below a threshold)
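The LLT's redundant-logging check can be sketched as follows. Within one transaction only the first log of a given address is needed, since the undo log must hold the pre-transaction value. The linear lookup and table size here are illustrative only; the evaluated design is a 64-entry, 8-way structure.

```c
#include <assert.h>

#define LLT_SIZE 64

static int llt_addr[LLT_SIZE];
static int llt_txid[LLT_SIZE];
static int llt_used[LLT_SIZE];

/* Returns 1 if the address must be logged (first touch in this
 * transaction), 0 if the logging is redundant and can be dropped. */
static int llt_should_log(int txid, int addr) {
    for (int i = 0; i < LLT_SIZE; i++)
        if (llt_used[i] && llt_txid[i] == txid && llt_addr[i] == addr)
            return 0;                  /* already logged this tx: redundant */
    for (int i = 0; i < LLT_SIZE; i++)
        if (!llt_used[i]) {            /* record the first touch */
            llt_used[i] = 1;
            llt_txid[i] = txid;
            llt_addr[i] = addr;
            return 1;
        }
    return 1;  /* table full: log conservatively (always safe) */
}
```

Dropping a redundant logging is always safe for correctness: recovery only ever needs the oldest pre-transaction value, which the first logging already captured.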
-
Proteus Hardware Design: Walkthrough

[Figure: step-by-step animation of the transaction below flowing through the LDR, LogQ, LLT, and LPQ; location 0x800 holds A before the transaction and B after it]

  tx_begin
  log-load LR1, (0x800)
  log-flush LR1, (LTA)+
  store B, (0x800)
  clwb (0x800)
  sfence
  tx_end
-
Methodology

System Configuration:
  – Processor: OOO, 3.4GHz, 4 cores
  – L1 I/D cache: 32KB, 8-way, 64B block, 4 cycles, private per core
  – L2 cache: 256KB, 8-way, 64B block, 12 cycles, private per core
  – L3 cache: 8MB, 16-way, 64B block, 42 cycles, shared by all cores
  – NVM: DDR3-like interface, 800MHz, 8GB, 1 channel, 16 banks per rank, 2KB row buffer
  – NVM timing (tCAS-tRCD-tRP-tRAS-tRC-tWR-tWTR-tRTP-tRRD-tFAW): 11-29(109)-11-28-39-12-6-6-5-24 (tRCD 29 for reads, 109 for writes)
  – Proteus: LDR 8 registers, LogQ 8 entries, LLT 64 entries (8-way), LPQ 256 entries

- The MARSSx86 + DRAMSim2 simulators are used
- NVM read latency is 50ns; write latency is 150ns
-
Evaluation (1) - Speedup
- Baseline: software logging using Intel PMEM instructions
- Proteus performs 46% better than the baseline and 10% better than ATOM

[Figure: speedup over the baseline for Queue, Btree, AvlTree, Hashmap, RB tree, and StringSwap]
-
Evaluation (2) – Number of Writes

- Baseline: no logging (not failure safe)
- ATOM incurs 350% more writes than the baseline (3.4x more than Proteus)
- Proteus issues a similar number of writes to the baseline (only 2% higher)
-
Conclusions
• Software logging is expensive but flexible
• Hardware logging is fast but inflexible
• Proteus: Software Supported Hardware Logging (SSHL)
  – Fast and flexible
  – New logging instructions allow software to manage logging
  – Performance optimizations: parallel logging, redundant logging removal
  – Endurance optimization: remove logs before flushing them to NVMM
• Results
  – Performance: 46% better than SW logging (10% better than ATOM)
  – Endurance: 2% more writes to NVMM vs. 350% with ATOM
-
Thank you