TRANSCRIPT
-
Proteus: A Flexible and Fast Software Supported Hardware Logging Approach for NVM
Seunghee Shin, Satish Tirukkovalluri, James Tuck, and Yan Solihin
North Carolina State University
The 2018 Non-Volatile Memories Workshop (NVMW 2018)
-
Background
• Use NVM as storage or main memory? We assume NV main memory (NVMM)
  – Keep important data in memory instead of files
  – Need to ensure failure safety

[Figure: memory technology comparison]
  – DRAM: + fast, + byte-addressable, − volatile
  – Disk / Flash: + non-volatile, − slow, − block-addressable
  – NVM: + fast, + byte-addressable, + non-volatile
-
Failure Safety through Durable Transactions
• Durable transaction: needed to ensure failure safety
  – All updates in a transaction are atomically durable
  – Atomicity can be achieved through HW or SW undo logging

[Figure: inserting node X into a linked list A→B→C→D; a system failure during the insert is recovered via undo-logging]
-
Transaction with Software Undo-Logging
• Step 1 - Create undo log and make it durable
• Step 2 - Set log-flag and make it durable, indicating transaction start
• Step 3 - Perform data updates and make them durable
• Step 4 - Unset log-flag and make it durable, indicating transaction end
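The four steps above can be sketched in C. This is a minimal simulation on plain arrays, not real NVM code: `persist()` is a placeholder for the cache-line flush and fence that real hardware needs after each step, and the data/log regions are ordinary global arrays standing in for NVMM.

```c
#include <assert.h>
#include <string.h>

#define N 4

static int data[N];      /* simulated NVMM data region */
static int undo_log[N];  /* simulated NVMM undo-log region */
static int log_flag;     /* transaction-in-flight flag */

/* Placeholder: on real hardware this would be clwb + sfence. */
static void persist(void *p, int bytes) { (void)p; (void)bytes; }

/* Durable transaction: replace data[] with new_vals[] using the
 * four-step software undo-logging protocol. */
static void tx_update(const int *new_vals) {
    memcpy(undo_log, data, sizeof data);      /* step 1: create undo log */
    persist(undo_log, sizeof undo_log);
    log_flag = 1;                             /* step 2: set log flag */
    persist(&log_flag, sizeof log_flag);
    memcpy(data, new_vals, sizeof data);      /* step 3: update data */
    persist(data, sizeof data);
    log_flag = 0;                             /* step 4: unset log flag */
    persist(&log_flag, sizeof log_flag);
}

/* Recovery after a crash: a set flag means the transaction did not
 * commit, so roll the data back from the undo log. */
static void recover(void) {
    if (log_flag) {
        memcpy(data, undo_log, sizeof data);
        log_flag = 0;
    }
}
```

Because the log is made durable before the flag, and the flag before the data updates, recovery can always tell whether the log contents are valid.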
-
Memory Persistency
• Unpredictable persist ordering
  – Persist: an operation that makes NVMM writes durable
  – The NVMM persist order is determined by LLC writebacks, not by program order
• Persistency model
  – Defines when stores become durable (i.e., placed in the persistence domain)
  – E.g., the Intel PMEM persistency model, strict persistency, epoch persistency, buffered epoch persistency, strand persistency, etc.

[Figure: private caches → shared cache (LLC) → MC → NVMM; writebacks leave the LLC in unpredictable order]
-
Intel PMEM Instruction and ADR
• Asynchronous DRAM Refresh (ADR)
  – Adds the write pending queue (WPQ) in the MC to the persistence domain
  – Flushes data in the WPQ to NVMM automatically on system failure
• CLWB
  – Writes back a dirty block from the caches to the WPQ
  – A fence is needed for ordering
[Figure: L1/L2 private caches → shared cache → MC (WPQ inside the persistence domain) → NVMM. With "st A; st B" alone, the two stores may persist in either order; the sequence "st A; clwb A; sfence; st B" guarantees A is durable before B.]
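The ADR guarantee can be illustrated with a toy model (this is not Intel's actual microarchitecture, just a sketch): stores land in a volatile cache, `clwb` moves a line into the WPQ, and a crash drains the WPQ to NVMM while discarding cache contents.

```c
#include <assert.h>

enum { LINES = 2 };  /* track two addresses: A = 0, B = 1 */

static int cache[LINES], cache_valid[LINES];  /* volatile */
static int wpq[LINES],   wpq_valid[LINES];    /* persistence domain (ADR) */
static int nvmm[LINES];                       /* durable media */

/* A store only reaches the volatile cache. */
static void st(int addr, int v) { cache[addr] = v; cache_valid[addr] = 1; }

/* clwb: write a dirty line back to the WPQ. */
static void clwb(int addr) {
    if (cache_valid[addr]) { wpq[addr] = cache[addr]; wpq_valid[addr] = 1; }
}

/* Crash: ADR drains the WPQ to NVMM; cache contents are lost. */
static void crash(void) {
    for (int i = 0; i < LINES; i++) {
        if (wpq_valid[i]) { nvmm[i] = wpq[i]; wpq_valid[i] = 0; }
        cache_valid[i] = 0;
    }
}
```

Running `st A; clwb A; sfence; st B` and then crashing leaves A durable in NVMM while B, which was never flushed, is lost with the cache, which is exactly why the logging protocols on the next slides must flush and fence at each step.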
-
Let's Revisit Software Logging

• Software logging (SL)
  – Software performs log creation, maintenance, and truncation
  – (+) Flexible (e.g., no OS support needed)
  – (−) High performance overheads (~50% slowdown)
• Hardware logging (HL)
  – Hardware creates and manages logs automatically (e.g., ATOM [HPCA'17])
  – (+) Low performance overheads
  – (−) Not flexible

[Figure: program-order timelines of log A..D and st A..D under software vs. hardware logging; software logging requires a FENCE between each log write and its data store]
  a. With hardware logging, no memory fence is required between logging and data modification
  b. New logging optimizations become possible
-
Software Supported Hardware Logging
• Software Supported Hardware Logging (SSHL)
  – Hardware provides logging instructions
  – Software performs logging operations using those instructions
  – Hardware applies optimizations

[Figure: SL is flexible but not fast; HL is fast but not flexible; SSHL is fast and flexible]
-
Proteus: SSHL Design
• Flexibility: software involvement in logging
  – Adds instructions that start logging operations in hardware
  – Two instructions are required: log-load and log-flush
• Performance optimizations
  – Parallel logging: process multiple loggings concurrently
  – Redundant logging detection and removal
• Endurance optimization (log write removal)
  – With the introduction of ADR, the WPQ is considered non-volatile
  – Key insight: logs are no longer needed once a transaction commits
  – Remove logs without flushing them to NVMM
-
Proteus: New Logging Instructions
• Operands
  – log-from address (M1): address of the original data
  – log-to address (M2): address of the log entry
  – Log data register (LR#): register holding the logged data
• Semantics
  – log-load $LR1, M1 → LR1 = Mem[M1]
  – log-flush $LR1, M2 → Mem[M2] = LR1

[Figure: log-load reads the original data at M1 into $LR1; log-flush writes $LR1 to the log entry at M2]

Code generation:

  tx_begin        i1: tx_begin
  A = ...         i2: log-load LR1, A
  B = ...         i3: log-flush LR1, (LTA)+
  tx_end          i4: st A
                  i5: log-load LR2, B
                  i6: log-flush LR2, (LTA)+
                  i7: st B
                  i8: tx_end
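The semantics of the two new instructions can be modeled in a few lines of C. This is a hypothetical software sketch, not the hardware implementation: `mem[]` stands in for NVMM, and `lta` mimics the post-incremented log-to address `(LTA)+` from the code-generation example; the log-area base index is an assumption.

```c
#include <assert.h>

#define MEMSIZE 16
static int mem[MEMSIZE];
static int lta = 8;  /* assumed: log area starts at mem[8] */

/* log-load $LR, M1  ->  LR = Mem[M1] */
static int log_load(int m1) { return mem[m1]; }

/* log-flush $LR, M2 ->  Mem[M2] = LR */
static void log_flush(int lr, int m2) { mem[m2] = lr; }

/* Instruction sequence emitted for one store inside a transaction:
 * log the old value of the location, then perform the new store. */
static void tx_store(int addr, int new_val) {
    int lr = log_load(addr);   /* log-load LR, addr        */
    log_flush(lr, lta++);      /* log-flush LR, (LTA)+     */
    mem[addr] = new_val;       /* st addr                  */
}
```

Note that, as on slide 7, no fence is required between the log-flush and the store: in Proteus the hardware tracks the log-to-store dependence itself.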
-
Proteus Hardware Design: Pipeline

[Figure: pipeline with log data registers (LDR) next to the integer/FP register file; per-core txID, log-start, log-end, and cur-log registers; cache tags extended with txID; memory controller with ADR containing the WPQ, LogQ, LPQ, LLT, dependence checks, and an arbiter in front of NVMM]
• Log data registers (LDR): keep log data while logging instructions are in the pipeline
• Per-core registers
  – txID: the transaction ID currently executing on the core
  – log-start / log-end: the start and end addresses of the log area
  – cur-log: tracks the current free log entry
• Log queue (LogQ): maintains log-to-store dependences and keeps track of in-flight loggings (parallel logging)
• Log look-up table (LLT): prevents redundant loggings within a transaction
• Log pending queue (LPQ): holds logs until the transaction ends or no free entries remain; separating logs from the WPQ keeps incoming read requests from having to check log entries
• Arbiter: prioritizes writes from the WPQ unless the LPQ is nearly out of free entries (below a threshold)
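The LLT's redundant-logging check can be sketched as follows. Within one transaction only the first log of a given address is needed, since the undo log must hold the pre-transaction value. The linear lookup and table size here are illustrative only; the evaluated design is a 64-entry, 8-way structure.

```c
#include <assert.h>

#define LLT_SIZE 64

static int llt_addr[LLT_SIZE];
static int llt_txid[LLT_SIZE];
static int llt_used[LLT_SIZE];

/* Returns 1 if the address must be logged (first touch in this
 * transaction), 0 if the logging is redundant and can be dropped. */
static int llt_should_log(int txid, int addr) {
    for (int i = 0; i < LLT_SIZE; i++)
        if (llt_used[i] && llt_txid[i] == txid && llt_addr[i] == addr)
            return 0;                  /* already logged this tx: redundant */
    for (int i = 0; i < LLT_SIZE; i++)
        if (!llt_used[i]) {            /* record the first touch */
            llt_used[i] = 1;
            llt_txid[i] = txid;
            llt_addr[i] = addr;
            return 1;
        }
    return 1;  /* table full: log conservatively (always safe) */
}
```

Dropping a redundant logging is always safe for correctness: recovery only ever needs the oldest pre-transaction value, which the first logging already captured.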
-
Proteus Hardware Design: Walkthrough

[Figure: step-by-step animation of the transaction below flowing through the LDR, LogQ, LLT, and LPQ; location 0x800 holds A before the transaction and B after it]

  tx_begin
  log-load LR1, (0x800)
  log-flush LR1, (LTA)+
  store B, (0x800)
  clwb (0x800)
  sfence
  tx_end
-
Methodology

System Configuration:
  – Processor: OOO, 3.4GHz, 4 cores
  – L1 I/D cache: 32KB, 8-way, 64B block, 4 cycles, private per core
  – L2 cache: 256KB, 8-way, 64B block, 12 cycles, private per core
  – L3 cache: 8MB, 16-way, 64B block, 42 cycles, shared by all cores
  – NVM: DDR3-like interface, 800MHz, 8GB, 1 channel, 16 banks per rank, 2KB row buffer
  – NVM timing (tCAS-tRCD-tRP-tRAS-tRC-tWR-tWTR-tRTP-tRRD-tFAW): 11-29(109)-11-28-39-12-6-6-5-24 (tRCD 29 for reads, 109 for writes)
  – Proteus: LDR 8 registers, LogQ 8 entries, LLT 64 entries (8-way), LPQ 256 entries

- The MARSSx86 + DRAMSim2 simulators are used
- NVM read latency is 50ns; write latency is 150ns
-
Evaluation (1) - Speedup
- Baseline: software logging using Intel PMEM instructions
- Proteus performs 46% better than the baseline and 10% better than ATOM

[Figure: speedup over the baseline for Queue, Btree, AvlTree, Hashmap, RB tree, and StringSwap]
-
Evaluation (2) – Number of Writes

- Baseline: no logging (not failure safe)
- ATOM incurs 350% more writes than the baseline (3.4x more than Proteus)
- Proteus issues a similar number of writes to the baseline (only 2% higher)
-
Conclusions
• Software logging is expensive but flexible
• Hardware logging is fast but inflexible
• Proteus: Software Supported Hardware Logging (SSHL)
  – Fast and flexible
  – New logging instructions allow software to manage logging
  – Performance optimizations: parallel logging, redundant logging removal
  – Endurance optimization: remove logs before flushing them to NVMM
• Results
  – Performance: 46% better than SW logging (10% better than ATOM)
  – Endurance: 2% more writes to NVMM vs. 350% with ATOM
-
Thank you