TRANSCRIPT
Memory Architecture and Storage Systems
Myoungsoo Jung
Computer Architecture and Memory Systems Lab.
School of Integrated Technology, Yonsei University
(IIT8015) Lecture#7: SSD Architecture and System-level Controllers
Outline
• Holistic Viewpoint (Overview)
• SSD Architecture
• Parallelism Overview
• Page Allocation Strategies
• Evaluation Studies for Parallelism
• Host Interface Overview
• Case Studies
  – Parallelism-Aware Host Interface I/O Scheduler
  – GC-Aware Host Interface I/O Scheduler
• Simulation Infrastructure (for your future research)
Holistic Viewpoint (Hardware)
[Figure (recap of Lecture 6): a host system (cores, memory controller hub / Northbridge with memory slots and high-speed PCI Express graphics slots, I/O controller hub / Southbridge with IDE, SATA, USB, and PCI slots, plus cables and ports leading off-board) connected to an SSD. SSD internals: a host interface controller, embedded processors, and four channels, each with a flash controller and NAND flash packages. Flash package internals: four dies (Die 0-3) behind a multiplexed interface. Die internals: planes (PLANE 0 .. PLANE j), each with k blocks, a data register, and a cache register, for k*j blocks per die.]
Holistic Viewpoint (Software)
[Figure: software view of the I/O path. The NVMHC (host controller) handles I/O request parsing, data movement initiation, and queuing into the device-level queue as requests arrive (spec-specific; Lecture 3). The core runs the Flash Translation Layer: memory request building, address translation, and execution sequencing (Lecture 6); memory requests have the same data size as the atomic flash I/O unit. The flash controllers handle memory request commitment and transaction handling: striping and pipelining, interleaving and sharing, and transaction decisions (today's topics). Along this path, flash memory requests move from the virtual address space to the physical address space.]
Outline
• Holistic Viewpoint (Overview)
• SSD Architecture
• Parallelism Overview
• Page Allocation Strategies
• Evaluation Studies for Parallelism
• Host Interface Overview
• Case Studies
  – Parallelism-Aware Host Interface I/O Scheduler
  – GC-Aware Host Interface I/O Scheduler
• Simulation Infrastructure (for your future research)
SSD Architecture
• At the top of the SSD internals, there is a host interface controller that parses the incoming requests
• Embedded CPU(s) are employed to run the flash firmware, such as the FTL, buffer cache, I/O scheduler, and parallelism management
SSD Architecture
• Underneath the embedded processor, multiple flash controllers exist, each connected to a memory bus, referred to as a channel
• Within a bus, there are multiple flash packages, each with its own flash interface, called a way
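To make the channel/way/die/plane hierarchy concrete, here is a minimal C++ sketch (the counts are illustrative assumptions, not any specific product):

#include <cstdint>
#include <iostream>

// Illustrative SSD geometry: channels -> ways (packages) -> dies -> planes.
// The counts below are assumptions for the example.
struct SsdGeometry {
    uint32_t channels = 8;        // memory buses driven by flash controllers
    uint32_t waysPerChannel = 8;  // flash packages sharing one channel
    uint32_t diesPerWay = 2;      // dies behind one multiplexed interface
    uint32_t planesPerDie = 2;    // planes that can share a wordline operation

    uint64_t totalPlanes() const {
        return uint64_t(channels) * waysPerChannel * diesPerWay * planesPerDie;
    }
};

int main() {
    SsdGeometry g;
    std::cout << "independent plane-level resources: " << g.totalPlanes() << "\n";
    return 0;
}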
System-Level Parallelism
• Channel striping
  – An I/O request is striped over multiple channels
• Way pipelining
  – Because the ways share the channel, an I/O request cannot be perfectly striped in parallel
  – However, NAND flash transactions consist of multiple phases, so individual NAND flash chips can still work simultaneously
[Figure: channel striping. A host interface and microprocessor drive four channels (CH A-D), each with multiple flash chips (ways); a request is striped across the channels.]
[Figure: way pipelining. Within one channel (CH A), transactions to the different flash chips (ways) overlap their phases, pipelining the shared channel bus.]
Flash-Level Parallelism
• Die interleaving
  – A striped/pipelined request can be further interleaved across the dies within a chip
[Figure: die interleaving. Four dies (DIE 0-3) behind a multiplexed interface, each with a NAND flash memory array (plane), a data register, and a cache register; a request is interleaved over the dies.]
• Plane sharing
  – Multiple planes work simultaneously using shared wordline(s)
[Figure: plane sharing. Two planes of one die operate together through their shared wordlines, each plane with its own data and cache registers and blocks.]
Parallelism Overview
• Note that different vendors use different naming rules (interleaving, striping, way, etc.)
• You need to understand the four different levels of parallelism based on the context
  – Plane sharing (= multi-plane mode operation, two-plane operation, etc.)
  – Die interleaving (= interleaved die operation, bank interleaving, etc.)
  – Way pipelining (= package interleaving)
  – Channel striping
Outline
• Holistic Viewpoint (Overview)
• SSD Architecture
• Parallelism Overview
• Page Allocation Strategies
• Evaluation Studies for Parallelism
• Host Interface Overview
• Case Studies
  – Parallelism-Aware Host Interface I/O Scheduler
  – GC-Aware Host Interface I/O Scheduler
• Simulation Infrastructure (for your future research)
Software Stack (Review) & Page Allocation
[Figure: SSD software stack over the channel/way layout (CH A/B, WAY 0/1). Host Interface Layer (HIL): responsible for communication. Flash Translation Layer (FTL): address translation between the host address space and physical addresses. Hardware Abstraction Layer (HAL): committing flash transactions to the underlying flash memory chips. Image: micron.com]

Page Allocation Strategies
Page allocation strategies are directly related to the physical data layout and access sequences, which in turn impact performance and internal parallelism.
Page Allocation Strategies (Palloc)
• Channel-first pallocs
  – Allocate internal resources in favor of the channel striping method
• Way-first pallocs
  – Are oriented toward taking advantage of way pipelining
• Die-first and plane-first pallocs
  – Allocate dies and planes in an attempt to reap the benefit of flash-level parallelism
[Figure: a NAND flash chip with two dies (DIE 0/1), each with two planes, across WAY 0/1 and CH A/B, annotated with channel striping, way pipelining, die interleaving, and plane sharing.]
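To illustrate what the four-letter orderings in the following slides mean, here is a hedged C++ sketch that decomposes a sequential page index into resource indices according to a chosen priority string such as CWDP or PWCD; the resource counts and the decomposition logic are illustrative assumptions, not any vendor's firmware.

#include <cstdio>
#include <cstdint>
#include <string>

// Illustrative page allocation: walk the priority string (e.g., "CWDP" =
// channel, way, die, plane) from left to right; the leftmost resource varies
// fastest as consecutive logical pages are allocated.
struct Target { uint32_t channel, way, die, plane; };

Target allocate(uint64_t pageIndex, const std::string& order,
                uint32_t channels, uint32_t ways, uint32_t dies, uint32_t planes) {
    Target t{0, 0, 0, 0};
    for (char level : order) {
        switch (level) {
        case 'C': t.channel = pageIndex % channels; pageIndex /= channels; break;
        case 'W': t.way     = pageIndex % ways;     pageIndex /= ways;     break;
        case 'D': t.die     = pageIndex % dies;     pageIndex /= dies;     break;
        case 'P': t.plane   = pageIndex % planes;   pageIndex /= planes;   break;
        }
    }
    return t;
}

// Example: under CWDP, consecutive pages rotate over channels first; under PWCD
// they rotate over the planes of one die first.
int main() {
    Target a = allocate(5, "CWDP", 2, 2, 2, 2);
    std::printf("page 5 under CWDP -> ch %u, way %u, die %u, plane %u\n",
                a.channel, a.way, a.die, a.plane);
    return 0;
}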
Channel-first Page Allocation
• These page allocation strategies give priority to channels first, then to ways, dies, and planes
• Some channel-first page allocation strategies introduce low flash-level locality
[Figure: channel-first orderings over the two-channel, two-way, two-die, two-plane layout: CWDP (Channel-Way-Die-Plane), CDPW, CDWP, CPDW, CPWD, CWPD; channel striping first, then way pipelining, die interleaving, and plane sharing.]
Way-first Page Allocation
• This allocation 1) assigns the way resources in a channel first, 2) stripes all the requests over the multiple ways, and 3) then interleaves the flash-level resources
• Although it allocates a system-level resource first, some way-first orderings favor flash-level resources
[Figure: way-first orderings: WDCP (Way-Die-Channel-Plane), WCPD, WDPC, WPCD, WPDC; way pipelining first, then channel striping, die interleaving, and plane sharing.]
Die-first Page Allocation
• The die-first page allocation schemes favor the exploitation of the die-interleaving method
• They can also accommodate system-level resources instead of the remaining flash-level resources, depending on the access patterns (DCWP/DWCP)
[Figure: die-first orderings: DPWC (Die-Plane-Way-Channel), DCPW, DCWP, DPCW, DWCP, DWPC; die interleaving with multi-plane operation, then way pipelining and channel striping.]
Plane-first Page Allocation
• It parallelizes data accesses with plane sharing, which can in turn improve the storage throughput
• An excellent option for realizing the benefits of both inter- and intra-request parallelism
[Figure: plane-first orderings: PWCD (Plane-Way-Channel-Die), PCDW, PCWD, PDCW, PDWC, PWDC; plane sharing first, then the remaining resources.]

Channel-first vs. Plane-first
[Figure: assumption in this example: a legacy (single-plane) operation takes 200 us for 1 page, while a plane-sharing operation takes 240 us for 2 pages. For a single two-page request, the channel-first allocation finishes in 200 us (one page per channel in parallel), whereas the plane-first allocation finishes in 240 us.]
[Figure: same assumption (legacy 200 us per page, plane sharing 240 us per 2 pages), now with two two-page requests. Under channel-first allocation, Req1 finishes at 200 us but Req2 queues behind it on the same channels and finishes at 400 us. Under plane-first allocation in favor of way pipelining, both Req1 and Req2 finish at 240 us.]
Channel-first vs. plane-first
Way-first (WPCD) vs. Plane-first (PWCD)
[Figure: WPCD and PWCD layouts over WAY 0/1 and CH A/B.]
• The system-level and flash-level parallelism exploited are the same
• Performance among the different page allocation strategies still varies based on the access pattern
Outline
• Holistic Viewpoint (Overview)
• SSD Architecture
• Parallelism Overview
• Page Allocation Strategies
• Evaluation Studies for Parallelism
• Host Interface Overview
• Case Studies
  – Parallelism-Aware Host Interface I/O Scheduler
  – GC-Aware Host Interface I/O Scheduler
• Simulation Infrastructure (for your future research)
SSD Setup
• NAND Flash Chip
  – Fine-grained NAND commands
  – Advanced commands
  – Strong address constraints
  – Intrinsic latency variation
• SSD Framework
  – 8 channels, 8 flash chips per channel (64 total)
  – Dual-die package format, 32-entry queue
  – Page-level mapping and a greedy garbage collection algorithm
Performance Comparison
[Figure: normalized IOPS and normalized latency across workloads (msnfs, usr, fin1, web, fin2, sql0-3) for CWDP, WCPD, DPCW, and PDCW.]
• Way-first and flash-level-resource-first pallocs achieve a better IOPS position than the channel-first palloc
• The channel-first palloc provides shorter latencies than the flash-level-resource-first pallocs
Parallelism Breakdown
[Figure: the fraction of parallel data access method types (%) for all 24 palloc orderings, grouped as channel-first, die-first, plane-first, and way-first. Legend: die interleaving with multi-plane write/read, plane sharing write/read, die interleaving write/read, striped legacy write/read.]
• Low flash-level parallelism is observed under the palloc schemes in favor of channels
• They render advanced flash command composition difficult at runtime (due to low flash-level locality), whereas the other groups show high parallelism
Resource Utilization
[Figure: execution time fraction (%) breakdowns for write-intensive and read-intensive workloads (idle, flash-level conflict, bus contention, bus activate, flash cell activate), and the average channel utilization (%) per workload (msnfs, usr, fin1, web, fin2, sql0-3), for all 24 palloc orderings grouped as channel-first, die-first, plane-first, and way-first.]
• Channel resources are utilized only about 43.1% on average, even with the most parallel data access methods
• About 80% of the total execution time is spent idle
Optimization Point
[Figure: design space plotted as lower latency vs. higher throughput, placing CWDP, PWCD, DPWC, WDCP, and an IDEAL point. Reaching the ideal requires flash-level parallelism, avoiding resource conflicts, and maximizing resource utilization.]
Outline
• Holistic Viewpoint (Overview)
• SSD Architecture
• Parallelism Overview
• Page Allocation Strategies
• Evaluation Studies for Parallelism
• Host Interface Overview
• Case Studies
  – Parallelism-Aware Host Interface I/O Scheduler
  – GC-Aware Host Interface I/O Scheduler
• Simulation Infrastructure (for your future research)
Host Interface Overview
• Serial AT Attachment (SATA)
  – The most popular interface for SSDs
  – 600 MB/sec (SATA 3.0)
• Non-Volatile Memory Express (NVMe)
  – PCI Express based storage management protocol
  – Around 1 GB/sec per lane, with up to 16 lanes (PCIe 3.0)

SATA
[Figure: CPU connected to a PCH chipset over DMI (20 Gbps); the AHCI SSD attaches via SATA/SAS (1.5 to 6 Gbps).]
• The host connection is to the Advanced Host Controller Interface (AHCI)
• The bus overheads introduce 1 us for each command
• Throughput is also serialized
• Designed around conventional spinning disks
• 32 entries for the device-level queue (Native Command Queuing)
Need for High-Performance Interfaces
• The storage interface is a bridge between host and storage
  – Traditional SATA and SAS have been widely employed
• Storage-internal bandwidths keep increasing
  – Thanks to increased resources and parallelism
• Traditional interfaces fail to deliver these very high bandwidths
• The trend is moving from upgrading traditional interfaces to devising new high-performance interfaces
[Figure: host system connected to a storage system (NVMs) through an interface; more resources and more parallelism yield higher internal bandwidth, making the SATA/SAS interface the performance bottleneck and motivating PCIe.]
NVM Express
[Figure: CPU and PCH chipset (DMI, 20 Gbps); the NVMe SSD attaches over PCIe (5 Gbps per lane) built into the platform.]
• The host connection is Peripheral Component Interconnect Express (PCIe)
• The PCIe bus connection still requires an SSD controller chip, but it does not need a SATA/AHCI controller
• Throughput flows in parallel along each available PCIe lane
NVMe’s Rich Queuing Mechanism
• The traditional interface provides a single I/O queue with tens of entries
  – Native Command Queuing (NCQ) with 32 entries
• NVMe strives to increase throughput by providing a scalable number of queues with a scalable number of entries
  – Up to 64K queues, each with up to 64K entries
• NVMe queues are configured in host-side memory
  – Pairs of a Submission Queue (SQ) and a Completion Queue (CQ)
  – Per-core, per-process, or per-thread
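To make the queue pairing concrete, here is a hedged C++ sketch of an SQ/CQ pair in host memory with a doorbell-style tail update; the structures and field names are illustrative and are not the real NVMe command format or driver API.

#include <array>
#include <cstdint>

// Simplified, illustrative model of an NVMe-style submission/completion queue
// pair in host memory. Not the real NVMe register or command layout.
struct Command    { uint64_t lba; uint32_t nblocks; uint8_t opcode; uint16_t cid; };
struct Completion { uint16_t cid; uint16_t status; };

template <size_t N>
struct QueuePair {
    std::array<Command, N> sq{};     // submission queue (host memory)
    std::array<Completion, N> cq{};  // completion queue (host memory)
    uint32_t sqTail = 0, cqHead = 0;

    // Host side: place a command and "ring the doorbell" by publishing the new
    // tail (a real driver would write it to an MMIO doorbell register).
    uint32_t submit(const Command& c) {
        sq[sqTail % N] = c;
        sqTail = (sqTail + 1) % N;
        return sqTail;               // value that would go to the SQ tail doorbell
    }

    // Host side: consume one completion entry if the device has produced one.
    bool reap(Completion& out, uint32_t cqTailFromDevice) {
        if (cqHead == cqTailFromDevice) return false;
        out = cq[cqHead % N];
        cqHead = (cqHead + 1) % N;   // new head is published via the CQ head doorbell
        return true;
    }
};

int main() {
    QueuePair<64> qp;                // conceptually one pair per core/thread
    qp.submit({0x1000, 8, /*illustrative write opcode*/ 0x01, /*cid*/ 1});
    return 0;
}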
High-level View of Comparison
• Scalable queues (for multiple cores)
• Lock-less queue management
[Figure: AHCI SSD vs. NVMe SSD. With AHCI, cores share a single command list/table with 32 command entries for I/O issue, completion, and interrupts. With NVMe, each core gets its own issue and completion queues, with up to 64K queues of 64K depth.]
I/O Stack Comparison
• SATA requests traverse
  – The block layer
  – The AHCI driver
  – The host bus
  – The AHCI host bus adapter (HBA)
• NVMe requests bypass these conventional modules and go directly to the PCIe root complex
[Figure: I/O stacks from the application (user mode) through the kernel-mode storage stack to hardware. AHCI connection: block layer, AHCI driver, AHCI HBA, SATA controller, NAND controller, NAND. NVMe connection: NVMe driver, PCIe root port, NVMe controller, NAND.]
Communication Protocol
[Figure: NVMe I/O write and read timelines between the host side and the SSD side. Write: DB-write, IO-Req, IO-Fetch, WR-DMA, SSD write (SSD-PROC), CPL-Submit, MSI. Read: DB-write, IO-Req, IO-Fetch, SSD read (SSD-PROC), RD-DMA, CPL-Submit, MSI.]

Protocol Comparison
• DB-Write
  – Doorbell (DB) register based communication
  – Removes all register reads, each of which consumes approximately 2000 CPU cycles
• MSI
  – MSI software-based vector interrupt
  – Ensures a specific core does not become the IOPS bottleneck
• Less Synchronization and Less Locking
  – One doorbell per queue, increasing parallelism
  – Removes the synchronization lock needed to issue commands
Outline
• Holistic Viewpoint (Overview)
• SSD Architecture
• Parallelism Overview
• Page Allocation Strategies
• Evaluation Studies for Parallelism
• Host Interface Overview
• Case Studies
  – Parallelism-Aware Host Interface I/O Scheduler
  – GC-Aware Host Interface I/O Scheduler
• Simulation Infrastructure (for your future research)
Case Study #1: Physically Addressed Queueing (PAQ): Improving Parallelism in Solid State Disks
Motivation
Background & Problems
PAQ
Evaluation Results
Conclusion
• Observation
  – SSD performance varies based on how data accesses are parallelized
  – Writes fully enjoy internal parallelism, but reads suffer from resource contention
• Problem
  – Virtually addressed queuing is insufficient to schedule incoming I/O requests
• Our solution
  – Expose the physical address space to the scheduler and avoid internal resource contention
Motivation
Background & Problem
PAQ
Evaluation Results
Conclusion
Use-cases of High-speed SSDs
ADVANTAGE
• FASTER: 100x more throughput than a 15K RPM disk
• ENERGY EFFICIENT: requires 33% less power than an HDD
DISADVANTAGE
• EXPENSIVE: a server SSD is about $30/GB, while a 10K SAS HDD is about $1/GB
• LIFETIME LIMIT: NAND flash memory cells wear out with overuse
[Images: Intel, thessdreview.com]
SSDs are therefore considered for workloads rife with reads (HPC/enterprise) or as a cache (hybrid SSD)
Reads vs. Writes in bare NAND Flash memory
• NAND flash is biased towards reading [SK hynix 32nm MLC NAND flash]
  – Latency: write 440 ~ 5000 us; read 25 us (80x faster, typical); erase 2500 us
  – Bandwidth: write 2.2 MB/sec; read 26.7 MB/sec (13x faster)
• Care is needed for writes on NAND flash
  – An erase operation is required before a write
  – New writes require garbage collections or block merges, a set of erase, read, and write operations
Two Divergent Research Directions
• Internal: research working to improve writes
  – Garbage collection scheduling
  – Flash firmware mapping algorithms
  – Write buffer management
• External: research developing mechanisms that capitalize on read performance (avoiding the high penalties of writes)
[Image: Thinkpads.com]
Read vs. Write in an SSD
[Figure: bandwidth (MB/s) and average response time (ms) vs. transfer size (512B to 128K) for SSD-A and SSD-B, random/sequential reads and writes; reads outperform writes by at least 25% and at least 56% on the two SSDs.]
Read performance depends on a rigid data layout and access sequence, whereas write sequences can be remapped and easily reap the benefit of parallelism.
Motivation
Background & Problem
PAQ
Evaluation Results
Conclusion
SSD & NAND Flash Internals
[Figure (recap): SSD internals with a host interface controller, embedded processors, and four channels of NAND flash with controllers; flash package internals with four dies behind a multiplexed interface; die internals with planes, data/cache registers, and k*j blocks.]
Software Stack of an SSD
[Figure: the host address space is split into virtual addresses (HIL side, with QBM and PHY) and physical addresses (FTL/HAL side). Image: micron.com]
• Host Interface Layer (HIL): responsible for communication; raw protocols handled by the PHY; Queue and Buffer Management (QBM) handled by the APP block
• Flash Translation Layer (FTL): address translation between the host address space and physical addresses
• Hardware Abstraction Layer (HAL): committing flash transactions to the underlying flash memory chips
• The HIL is oblivious of physical addresses
Conventional Scheduling
[Figure: a device-level queue of requests shown in virtual and physical addresses (per package and die) over four channels; conventional, virtually addressed scheduling yields 7 parallelized I/O groups and only 2 die interleavings.]
I/O request scheduling is not efficient because the HIL sits on the virtual address space.
Multiplane Mode Operation
• Plane-level parallelism can only be achieved via the multi-plane mode advanced command
• Two issues when building advanced commands at runtime
  – The conventional virtually addressed queue (VAQ) is ignorant of physical addresses
  – The FTL and flash firmware are oblivious of the upper device-level queue and the requests therein
[Figure: queued requests (tag IDs, virtual and physical addresses) mapped onto two channels, two packages, and two dies with even/odd planes; conventional scheduling composes only 2 multi-plane operations.]
What if the HIL or the virtually addressed scheduler knew the physical address space?
Motivation
Background & Problem
PAQ
Evaluation Results
Conclusion
High Level View of PAQ
[Figure: the QBM layer is moved out of the HIL (virtual address space) to beneath the FTL (physical address space). Image: micron.com]
• Moving the QBM layer out from the HIL and beneath the FTL
• The QBM migration exposes physical addresses to our scheduler, PAQ (Physically Addressed Queuing)
High Level View of PAQ
• Identify requests that will cause conflicts
• Build groups of requests that do not share conflicts, called Clumps
• Pack transactions based on the physical layout of the I/O requests
[Figure: address space view with the QBM now sitting below the FTL.]
Clump Composition
• Lower-level conflicts are the most costly!
• Building Clumps
  1. Add transactions incurring conflicts in the lowest levels first
  2. For die- and package-level conflicts, never schedule the conflicting transactions in the same clump
  3. Continue adding transactions to the clump, prioritizing low-level conflicts, until no more can be added without breaking rule 2
PAQ attempts to build clumps in a bottom-up, conflict-first fashion such that the lowest level with contention does not have conflicting transactions in the clump; a sketch of this idea follows.
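Below is a hedged C++ sketch of such a greedy grouping pass (illustrative types and logic, not the paper's implementation): it admits at most one transaction per flash package into a clump, so die- and package-level conflicts never occur inside it, while channel sharing is tolerated as the least costly conflict.

#include <vector>
#include <set>
#include <utility>

// Illustrative clump composition: greedily group queued flash transactions so
// that no two in one clump target the same package (and hence the same die).
struct Txn { int channel, package, die, plane; };

std::vector<Txn> buildClump(std::vector<Txn>& queue) {
    std::vector<Txn> clump, deferred;
    std::set<std::pair<int,int>> usedPackages;   // (channel, package) already claimed
    for (const Txn& t : queue) {
        if (usedPackages.insert({t.channel, t.package}).second) {
            clump.push_back(t);                  // no die/package conflict inside the clump
        } else {
            deferred.push_back(t);               // conflicts; goes into a later clump
        }
    }
    queue.swap(deferred);                        // leftovers wait for the next clump
    return clump;
}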
Physically Addressed Queueing
[Figure: the same queue scheduled with physical addresses over four channels now forms only 3 parallelized I/O groups and achieves 5 die interleavings.]
Plane Packing
• PAQ knows both the device-level queue and the physical addresses
• It parses the requests in the queue and issues each transaction in favor of multi-plane mode operations
[Figure: with plane packing, the same queue composes 4 multi-plane operations instead of 2.]
Physically Addressed Queuing can schedule and pack multiple transactions into one advanced command at runtime.
Motivation
Background & Problem
PAQ
Evaluation Results
Conclusion
SSD Setup
• NAND Flash Chip
  – Fine-grained NAND commands
  – Advanced commands
  – Strong address constraints
  – Intrinsic latency variation
• SSD Framework
  – 8 channels, 8 flash chips per channel (64 total)
  – Dual-die package format, 32-entry queue
  – A page-level mapping algorithm
Configurations & Traces
• Queuing strategies
  – VAQ: default queuing scheme (virtual address)
  – PAQ0: PAQ, only using plane packing
  – PAQ1: PAQ, only using clumping
  – PAQ2: PAQ, using both plane packing and clumping
• Traces
  – fin: online transaction processing
  – web: search engine
  – usr: shared directory
  – prn: print serving
  – sql: database
  – msnfs: file storage servers
Aggregate Performance - Bandwidth
• Read performance improves by about 45% (100 MB/sec) over the VAQ scheduler for web workloads (90% random reads)
• PAQ2 never hurts performance for any workload, whether read- or write-oriented
[Figure: IOPS (the number of host-level I/O requests per second) and bandwidth (KB/sec) per trace, showing the 45% improvement on web. Even on the worst-performing trace, where writes are intermixed with small reads, PAQ2 shows 1.41x better performance.]
Outline
• Holistic Viewpoint (Overview)
• SSD Architecture
• Parallelism Overview
• Page Allocation Strategies
• Evaluation Studies for Parallelism
• Host Interface Overview
• Case Studies
  – Parallelism-Aware Host Interface I/O Scheduler
  – GC-Aware Host Interface I/O Scheduler
• Simulation Infrastructure (for your future research)
Case Study #2: Host Interface Assisted Garbage Collection Scheduler
1. Taking Garbage Collection Overheads off the Critical Path in SSDs, Middleware'12
2. HIOS: A Host Interface I/O Scheduler for Solid State Disks, ISCA'14
Outline
• Motivation
• Worst-case latency analysis
• Background garbage collection (GC)
– Advanced GC and delayed GC
– Incremental GC
• Garbage collection scheduling
– Slack stealing
– GC overhead redistribution
• Observation
  – Garbage collection (GC) is the critical performance bottleneck for SSDs
    • Bandwidth under GC is 4x worse than for normal-case I/O operations
    • Latency under GC is 8x ~ 10x longer than the normal I/O access time
  – The idle I/O times present in workloads can be exploited by shifting garbage collections from busy periods to other periods
• Our solution
  – Remove on-demand GCs from the critical path and secure free blocks in advance
  – Delay on-demand GCs to the next idle periods
Motivation
• Solid State Drives!!
  – Faster than any conventional block device
  – Overwriting a page is not allowed before erasing its block, which is a set of pages
[Figure: write cliff caused by garbage collection.]
Write performance of modern SSDs is significantly degraded after garbage collection begins.
SSDs Always Faster than Disks?
• Average and worst-case latency of 7 SSDs and 4 disks
• As commonly known, SSDs show better average latency
• However, the worst-case latencies of SSDs are much higher
• I/Os exhibiting the worst-case latency may violate QoS
[Figure: average vs. worst-case latency for the 7 SSDs and the 4 disks.]
How Often Do Worst-case Latencies Occur?
• Two Samsung SSDs are written using Intel IOmeter
• Performance variation = worst-case latency - average latency
• As the SSDs are used more and more,
  – SLC: worst-case latencies appear repeatedly and frequently
  – MLC: worst-case latencies are not only observed but also get worse
• Worst-case latencies are frequent and get worse over time
[Figure: latency over time vs. average latency for a Samsung SLC 120GB SSD and a Samsung MLC 256GB SSD.]
Does Worst-case Latency Affect Throughput?
• Samsung and OCZ SSDs are written by Intel IOmeter
• Worst-case latency and throughput are measured over time
• At some point, the write cliff is observed
  – Worst-case latency gets significantly worse (by 40x)
  – At the same time, throughput severely degrades (by 64%)
• Worst-case latencies directly cause throughput degradation
[Figure: Samsung 830 and OCZ Vertex 3 latency and throughput over time.]
Latency Impact: Empirical Experimental Results
• 256GB MLC-based SSD
  – 128 * 2 DRAM buffer
  – Dual core, 8 channels, and 64 flash packages
  – Device-level latency is captured with a ULINK DriveMaster
• GC impact
  – Warm-up: 1MB random writes over the whole region
[Figure: pristine-state performance before warm-up.]
So Far & Our Approach
[Figure: latency timelines without GCs, with GCs, and with our approach, which moves GC overheads off the critical path.]
Goals
• Making garbage collection (GC) overheads invisible to users
  – With our GC strategies, applications do not experience GC overheads
• Avoiding additional GC operations
  – Our GC strategies only schedule GC operations that would be invoked soon anyway
• Compatibility with underlying FTL schemes
  – No extra NVM buffer is needed, and the FTL's main address mapping policy is unchanged
Shifting Garbage Collection
• Advanced GC strategy (AGC)
  – Removes on-demand GCs from the critical path and secures free blocks in advance
  – 2 components: Look-ahead GC and Proactive Block Compaction
• Delayed GC strategy (DGC)
  – Handles the cases where idleness does not occur frequently and AGC fails to secure free blocks
  – Delays on-demand GCs to the next idle periods
[Figure: device-level short idleness vs. long idleness imposed by the host.]

Device-level Short Idleness Utilization
• Leverages the device-level queue and pre-arrival information
• 3~17 command tags arrive in parallel (measured with a LeCroy commercial protocol analyzer)
• Look-ahead GC (a part of AGC) is executed during this short idleness
[Figure: short idle time between the previous I/O and the expected execution time of queued I/Os.]
Long Idleness Utilization
• 38% ~ 83% of instructions experience idle periods of more than 1 sec
• DGC and Proactive Block Compaction (the other part of AGC) are performed during these long idle periods
Details of AGC
• Look-ahead GC
  – Predicts on-demand GCs based on incoming host requests and mapping information
  – Look-ahead GC is executed only if the short idle period is longer than the predicted GC latency
• Proactive Block Compaction
  – Reclaims blocks that are fully occupied by contents during long idle periods
[Figure: GC latency components: valid page migration time and block cleaning time.]
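A minimal sketch of the look-ahead decision, assuming the GC latency is estimated from the valid pages to migrate plus a block erase; the constants and structure are illustrative assumptions, not the paper's values.

#include <cstdint>

// Illustrative look-ahead GC decision. Per-page migration and erase costs are
// assumed example values.
struct GcEstimate {
    uint32_t validPages;            // pages to migrate out of the victim block
    uint64_t pageMigrateUs = 300;   // read + program per valid page (assumed)
    uint64_t blockEraseUs  = 2500;  // erase latency (assumed)
    uint64_t latencyUs() const { return validPages * pageMigrateUs + blockEraseUs; }
};

// Run look-ahead GC only if the predicted short idle period can absorb it.
bool shouldRunLookAheadGc(uint64_t predictedIdleUs, const GcEstimate& gc) {
    return predictedIdleUs > gc.latencyUs();
}

int main() {
    GcEstimate gc{/*validPages=*/20};
    return shouldRunLookAheadGc(/*predictedIdleUs=*/10000, gc) ? 0 : 1;
}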
Details of DGC
• Update Block Replacement
  – GCs need not run at the same time as the writes that trigger them
  – Skip the time-consuming tasks of GC (page migration) and serve urgent I/O requests first
  – Put the on-demand page migration into the DGC list and substitute another update block
• Retroactive Block Compaction
  – Resume the page migration activity during long idle times and return the update block (used for DGC)
Incremental Garbage Collection
• During long idle periods, there is no pre-arrival information and no advantage from the device-level queue (it is empty)
• AGC and DGC therefore employ Incremental Garbage Collection
  – GC activities are split into multiple sub-collections delimited by checkpoints
  – Checkpoint: check whether further collection can be performed
[Figure: valid/invalid pages being migrated from a data block and an update block to a target block, with checkpoints (CP) between sub-collections and incoming I/O requests observed through the device-level queue.]
Experimental Setup
• 4-channel, 16-flash-chip, bus-level transaction SSD simulation
  – 6 volumes of an SSD array
• FTL implementations
  – L-FTL: log-structured block-mapping FTL
  – H-FTL: superblock-style block and page hybrid-mapping FTL
  – P-FTL: partial block cleaning FTL (16% more flash blocks)
• Garbage collection strategies
  – Baseline: block-merge type garbage collector
  – AGC: advanced GC strategy only
  – DGC: delayed GC strategy only
  – AGC+DGC: our GC schemes put together
Low Write-Intensive Workload of Microsoft File Server Storage
• Under a low write-intensive workload, AGC successfully hides the GC overheads
[Figure: latency timelines for the baseline GC vs. AGC only; no GC impact is visible with AGC.]

High Write-Intensive Workload of Microsoft File Server Storage
[Figure: baseline GC, AGC only (fails to hide all GCs), DGC only (also fails), and AGC+DGC (succeeds).]
Performance Comparison
[Figure: results for L-FTL and H-FTL, with AGC+DGC and the fraction of AGC+DGC made invisible.]
• The hybrid-mapping FTL yields shorter WCRT and lower I/O blocking than L-FTL by reducing GC overheads
• AGC+DGC performs all on-demand GCs even under write-intensive workloads
Constraints of Background GC
• Background garbage collection is feasible only if the system can secure enough idle time
• It may violate QoS or SLAs when an unexpected request arrives during the background GC operations
GC Overhead Distribution
• Four I/O requests (1~4) are present in order
  – (Example) the given deadlines of the four requests are the same
  – GC is triggered during the I/O-1 service
  – I/O-1 misses its QoS, whereas the others can satisfy it
• I/O-2 to I/O-4 have a time margin (slack) until their deadlines
  – We distribute the GC overhead of I/O-1 over the others
  – All I/O requests can then meet their deadlines even with GC executed!
[Figure: latency timeline of I/O-1 through I/O-4 showing flash execution, GC overhead, and the I/O deadline.]
What Is Needed for GC Distribution?
• (1) GC overhead estimation
  – Need to know how big the GC overhead is
• (2) Slack stealing
  – Need to know how much slack (how many I/O requests) is required
• (3) GC overhead distribution
  – Need to segment the GC and distribute the segments over other I/O requests
[Figure: the same timeline, annotated with the three steps.]
HIOS’s GC Overhead Estimation
• The SATA interface is the best place for GC distribution
  – It is aware of flash device operations, including GC
  – It is also aware of the I/O requests through their tags (the essential information)
  – Before the actual I/O request, its tag is sent to the command queue
• GC overhead is estimated based on flash status and the I/O tag
  – A GC invocation by I/O-1 can be predicted
  – The GC overhead (# of reads, writes, and erases) can be estimated
[Figure: the SATA interface's command queue holds Tag-1 through Tag-4 ahead of I/O-1 through I/O-4 reaching the flash devices.]
HIOS’s Slack Stealing
• Slack is accumulated from the following I/O requests
  – For each I/O request, slack is calculated using its tag information
  – Slack time = T_deadline - T_flash_latency
  – Slack stealing continues until the GC overhead is exhausted
• In this scenario, the slack from the following I/O-2, 3, and 4 is enough to distribute I/O-1's GC overhead
[Figure: command queue with Tag-1 through Tag-4; the slack of I/O-2 to I/O-4 absorbs I/O-1's GC overhead.]
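A minimal sketch of slack stealing under these definitions (illustrative fields and units; not HIOS's implementation): one request's estimated GC overhead is spread over the slack of the requests queued behind it.

#include <algorithm>
#include <vector>
#include <cstdint>

// Illustrative slack stealing. Times are in microseconds.
struct QueuedIo {
    uint64_t deadlineUs;          // QoS deadline for this request
    uint64_t flashLatencyUs;      // expected flash execution time without GC
    uint64_t assignedGcUs = 0;    // GC segment time this request will absorb

    uint64_t slackUs() const {
        return deadlineUs > flashLatencyUs ? deadlineUs - flashLatencyUs : 0;
    }
};

// Distribute gcOverheadUs over the following requests; returns the leftover
// overhead that could not be hidden (nonzero means a deadline is still at risk).
uint64_t stealSlack(std::vector<QueuedIo>& followers, uint64_t gcOverheadUs) {
    for (QueuedIo& io : followers) {
        if (gcOverheadUs == 0) break;
        uint64_t take = std::min(io.slackUs(), gcOverheadUs);
        io.assignedGcUs = take;   // this request executes 'take' us of GC segments
        gcOverheadUs -= take;
    }
    return gcOverheadUs;
}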
HIOS’s GC Distribution
• GC (reads, writes, erases) can be segmented into small pieces
• For each I/O request, (I/O-1)
– Write data is transferred from host to device buffer
– Assigned GC segments are executed
– Flash device commands are issued
I/O-1 time
Latency
command queueSATA I/F
Flash devices
I/O-2 I/O-3
Tag-
1
Tag-
2
Tag-
3
Tag-
4
I/O-4
I/O-1’s GC overhead
dram buffer
I/O-1 data
Tag-
1
flash cmd flash cmd
HIOS’s GC Distribution
• GC (reads, writes, erases) can be segmented into small pieces
• For each I/O request, (I/O-2)
– Write data is transferred from host to device buffer
– Assigned GC segments are executed
– Flash device commands are issued
I/O-1 time
Latency
command queueSATA I/F
Flash devices
I/O-2 I/O-3 I/O-4
dram buffer
I/O-2 data
Tag-
3
Tag-
4
Tag-
2
flash cmd flash cmd flash cmd
HIOS’s GC Distribution
• GC (reads, writes, erases) can be segmented into small pieces
• For each I/O request, (I/O-3)
– Write data is transferred from host to device buffer
– Assigned GC segments are executed
– Flash device commands are issued
I/O-1 time
Latency
command queueSATA I/F
Flash devices
I/O-2 I/O-3 I/O-4
dram buffer
I/O-3 data
Tag-
4
Tag-
3
flash cmd flash cmd flash cmd
HIOS’s GC Distribution
• GC (reads, writes, erases) can be segmented into small pieces
• For each I/O request, (I/O-4)
– Write data is transferred from host to device buffer
– Assigned GC segments are executed
– Flash device commands are issued
• “I/O-1’s GC distributed over I/O-2,3, and 4” & “all satisfy QoS”
I/O-1 time
Latency
command queueSATA I/F
Flash devices
I/O-2 I/O-3 I/O-4
dram buffer
I/O-4 data
flash cmd
Tag-
4
Simulation-based Evaluation
• Simulator
  – Models flash chips and the associated data paths
  – Implements a typical SSD software stack
• Baseline configuration
  – SSD: 8 channels, 2 flash chips per channel
  – SATA I/F: 32 command queue entries
• Compared with four different I/O schedulers
  – Noop: schedules on a FIFO basis
  – Anticipatory: considers spatial locality
  – Deadline: sorts logical addresses in ascending order
  – Flash-aware: reduces GC overhead by reducing writes
  – HIOS: GC-only or channel-only management, or both
Worst-case Latency
• Worst-case latency is normalized to the Noop scheduler
• All existing schedulers perform similarly, since they are oblivious to GC and cannot avoid the high GC overhead
• HIOS-1 (GC distribution only) significantly reduces it, by 41%
• HIOS-2 (GC + channel management) is similar to HIOS-1
  – The worst-case latency is caused by GC, not by channel conflicts
Average Latency
• Average latency is normalized to the Noop scheduler
• All existing schedulers are similar or slightly worse
• HIOS-1 does not affect average latency much
  – GC overheads are not eliminated, only redistributed
• HIOS-2 (GC + channel management) improves it by 13%
  – Resolving channel conflicts achieves better performance
Deadline Satisfaction
• A long series of writes is issued to generate multiple GCs
• The % of I/O requests missing their deadline (30 ms) is measured
• Under 40% usage, all I/O schedulers show a negligible miss rate
• As the number of writes increases, the miss rate of the other schedulers increases dramatically as they suffer from more frequent GCs
Outline
• Holistic Viewpoint (Overview)
• SSD Architecture
• Parallelism Overview
• Page Allocation Strategies
• Evaluation Studies for Parallelism
• Host Interface Overview
• Case Studies
  – Parallelism-Aware Host Interface I/O Scheduler
  – GC-Aware Host Interface I/O Scheduler
• Simulation Infrastructure (for your future research)
What we are working on now for our heterogeneous computing research (real-implementation-based research)
• Software approach: extending NVMMU for a cluster
• Hardware approach: storage-based accelerator
Our Real Implementation Approach (SW)
• GPU-based storage I/O acceleration (NVMMU between the SSD, GPU, and CPU)
• Advantage:
  – Ready to explore systems without the limits of a simulation infrastructure
  – We can measure the real execution time of the integration approach
• Disadvantage:
  – Performance varies somewhat based on the test environment
  – Incompatibilities across device and software versions
  – Debug… debug… debug… and debug…
Our Real Implementation Approach (HW)
• FPGA-based storage I/O acceleration
  – Memory/storage backend and frontend FPGA implementation
  – Multi-kernel execution model
  – Scheduling for near-data processing
• Advantage:
  – No idea yet, as everything is still ongoing, sorry
• Disadvantage:
  – INFLEXIBLE, no choice for design exploration
  – Debug… debug… debug… and debug…
  – Slow… slow… slow… and slow
  – Expensive and long procurement process for testing: there are 7 more platforms on which we failed to achieve our goals
Challenges for Simulation Model
• Traditional system simulators (gem5, GPGPU-Sim) assume all data have been loaded into RAM before execution
[Figure: CPU cores with L2/L3 caches, memory controllers, and DRAM ranks/dies; no storage model.]
[Figure: memory hierarchy latencies: LD/ST unit 1 cycle, L1 cache 4 cycles, L2 cache 12 cycles, DRAM 230 cycles, SSD 60-800 us; the SSD access is extremely slow to simulate in detail.]
SimpleSSD
SimpleSSD: Modeling Solid State Drive for Holistic System Simulation
- A high-fidelity SSD simulation framework designed for educational purposes
- Free to download from http://SimpleSSD.camelab.org
SimpleSSD Overview
[Figure: full-system view: CPU cores with L1/L2 caches, memory controllers, DRAM ranks/dies, and a multi-channel SSD (controllers with Die 0 .. Die N). Step 1: the application executes a load (ldr r1, [r0]); the core's register file, ALU, and MEM stage issue the memory access. Related gem5 sources: dyn_inst_impl.hh/cc, o3_cpu_exec.hh/cc, base_dyn_inst.hh/cc, memhelpers.hh, cpu.hh/cc, iew_impl.hh/cc, lsq_impl.hh/cc, lsq_unit_impl.hh/cc.]
[Figure: step 2: the cache request reaches the L1D cache (index/tag decode, tag array, data array, sense amps, comparator). Related gem5 sources: port_interface.hh/cc, base.hh/cc, cacheset.hh, cache_impl.hh, cache.hh/cc, mshr.hh/cc, mshr_queue.hh/cc.]
[Figure: step 3: on an L1D cache miss, the request goes to the L2 cache (same lookup structure and gem5 cache sources as above).]
[Figure: step 4: on a further miss, the MMU and page table are consulted; a page fault goes through the I/O controller to the SSD, and the data are DMAed into main memory. Related gem5 sources: dram_ctrl.hh/cc, page_table.hh/cc, multi_level_page_table.hh/cc, multi_level_page_table_impl.hh.]
[Figure: step 5: main-memory accesses go through the DRAM controller (arbitration engine; CMD, WR, and RD queues; sequencing engine; PHY interface). Related gem5 sources: dram_ctrl.hh/cc, simple_mem.hh/cc, physical.hh/cc, addr_mapper.hh/cc.]
[Figure: step 6: class-level view of the path, with class LSQ_unit, class cache (L1, L2), class mem_ctrl, class FuncPageTable, class HIL, class FTL, and class PAL (multi-channel SSD), interacting through calls such as executeLoad/Store(inst), commitLoad/Store(inst), recvTimingReq/Resp(pkt), accessAndRespond(pkt, lat), changeFlag(paddr, size, alloc), SSDoperation(), fetchQueue(), setLatency(), and PAL_setLatency().]
Inside the Host Interface Layer (HIL)
• Target: provide a universal interface for the system simulator, trace generator, and RAID controller
• Add-on mode overview:
[Figure: in add-on mode, the system simulator sends an I/O request (address, size, curTick, opType) to the HIL; the HIL drives the FTL and PAM (data movement model), obtains the finishTick for the address, and stores the SSD access latency in a latency map table; the response carries (address, size, finishTick, opType).]
[Figure: on a later I/O access (e.g., a page fault), the HIL looks up the stored latency in the latency map table and reports the SSD delay = finishTick - curTick back to the system simulator.]
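A hedged sketch of the latency-map bookkeeping described above (the class and method names are illustrative, not SimpleSSD's actual sources):

#include <cstdint>
#include <unordered_map>

// Illustrative add-on mode bookkeeping: record the finish tick computed by the
// SSD model per address, then report the delay when the simulator accesses it.
class LatencyMap {
public:
    // Called when the SSD model (FTL/PAM path) finishes simulating a request.
    void store(uint64_t address, uint64_t finishTick) { map_[address] = finishTick; }

    // Called on a later I/O access; SSD delay = finishTick - curTick.
    uint64_t lookupDelay(uint64_t address, uint64_t curTick) const {
        auto it = map_.find(address);
        if (it == map_.end() || it->second <= curTick) return 0;  // already finished
        return it->second - curTick;
    }
private:
    std::unordered_map<uint64_t, uint64_t> map_;
};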
• Standalone mode overview:
[Figure: in standalone mode, trace files or a micro-benchmark generator dispatch I/O requests (address, size, curTick, opType) into the HIL's I/O queue; the HIL issues them to the FTL and PAM, inserts them into the queue, and issues the next request at the finishTick of the returned response (address, size, finishTick, opType).]
DEMO
Gem5FS-SimpleSSD (full system mode)
Software Dependencies:
• Linux
• mercurial
• scons
• swig
• gcc
• g++
• Python 2.6 or Python 2.7
• Protobuf
Image Booting Dependencies:
• Device Tree Blob: vexpress.aarch32.ll_20131205.0-gem5.4cpu.dtb
• Linux Kernel: vmlinux.aarch32.ll_20131205.0-gem5
• File System: aarch32-ubuntu-natty-headless.img
Compile software:
scons -j7 build/ARM/gem5.opt
Execution command:
./build/ARM/gem5.opt --debug-flags=IdeDisk,HIL,FTLOut,PAM2,GLOBALCONFIG -d ./configs/example/fs.py --num-cpu=4 --dtb-filename=vexpress.aarch32.ll_20131205.0-gem5.4cpu.dtb --disk-image=aarch32-ubuntu-natty-headless.img --kernel=vmlinux.aarch32.ll_20131205.0-gem5 --script=run_BC.rcS --SSD=1 --SSDconfig=rev_ch16_SATA.cfg
Inside the Flash Translation Layer (FTL)
• Target: provide SSD services such as I/O address mapping, wear leveling, garbage collection, etc.
• Overview:
[Figure: host requests (LBAs) from the HIL enter the FTL's I/O queue; mapping() translates LBAs to PPNs through the mapping table (direct, set-assoc, or full-assoc mapping) and pushes ReadTransaction()/WriteTransaction() to the PAM.]
[Figure: the FTL also maintains a free block pool; when it drops below GC_threshold, GarbageCollection() reclaims blocks, and wear leveling uses a min-heap (MinHeapWearLeveling()). The FTL sends requests to the PAM via SendRequest() and sets latencies back toward the HIL via SetLatency().]
Configurable parameters: FTLMapN, FTLMapK, FTLGCthreshold, FTLOP (over-provisioning)
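As a hedged sketch of how a GC trigger driven by FTLGCthreshold and the free block pool could look (the victim selection and data structures are simplified assumptions, not SimpleSSD's code):

#include <algorithm>
#include <vector>
#include <cstdint>

// Simplified FTL GC trigger: when the free block pool shrinks below the
// configured threshold, reclaim victims (most invalid pages first) until the
// pool is replenished. Illustrative only.
struct Block { uint32_t id; uint32_t invalidPages; uint32_t eraseCount; };

struct SimpleFtl {
    uint32_t FTLGCthreshold = 16;          // minimum free blocks before GC kicks in
    std::vector<Block> freePool;
    std::vector<Block> usedBlocks;

    void maybeCollect() {
        while (freePool.size() < FTLGCthreshold && !usedBlocks.empty()) {
            // Greedy victim selection: the block with the most invalid pages.
            auto victim = std::max_element(usedBlocks.begin(), usedBlocks.end(),
                [](const Block& a, const Block& b) { return a.invalidPages < b.invalidPages; });
            // Valid pages would be migrated here, then the block is erased.
            victim->invalidPages = 0;
            victim->eraseCount++;           // wear leveling would track this (e.g., in a min-heap)
            freePool.push_back(*victim);
            usedBlocks.erase(victim);
        }
    }
};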
• FTL mapping: supports direct mapping, set-assoc mapping, and full-assoc mapping by configuring FTLMapN and FTLMapK
Block layout based on N and K:
[Figure: the SSD's logical blocks are divided into data groups 0 .. DGN, each with N data blocks and K log blocks.]
Set-assoc Mapping (1 < N < max, 1 < K < max):
[Figure: the logical page address is split into a data group index, a block index, and a page index. The N data blocks of a group are block-mapped (LBN -> PBN); the K log blocks are page-mapped (DGN, LPN -> PPN).]
Direct Mapping (N = 1, K = 1):
[Figure: one data block and one log block per data group; the logical page address has only a data group index and a page index (no block index). Data blocks: DGN -> PBN; log blocks: page-mapped (DGN, LPN -> PPN).]
Full-assoc Mapping (N = K = max):
[Figure: a single data group holds the maximum number of data and log blocks; the logical page address has only a block index and a page index (no data group index). Data blocks: LBN -> PBN; log blocks: page-mapped (LPN -> PPN).]
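The logical page address split described above can be sketched as follows (a hedged example; the helper name and parameters are assumptions, not SimpleSSD's code):

#include <cstdint>

// Illustrative decomposition of a logical page number into (data group, block,
// page) indices for a set-assoc layout with N data blocks per group. For direct
// mapping, N = 1 (no block index); for full-assoc mapping there is one group.
struct LogicalAddr { uint64_t dataGroup, block, page; };

LogicalAddr splitLogicalPage(uint64_t lpn, uint64_t N, uint64_t pagesPerBlock) {
    LogicalAddr a;
    a.page      = lpn % pagesPerBlock;     // page index within a block
    uint64_t b  = lpn / pagesPerBlock;     // logical block number
    a.block     = b % N;                   // block index within the data group
    a.dataGroup = b / N;                   // data group index (DGN)
    return a;
}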
Inside the Page Allocation Module (PAM)
• Target: simulate SSD internal parallelism and the resource conflicts of the I/O bus and NAND flash memory
• Overview:
[Figure: the PPN produced by the FTL mapping is decomposed into channel (CH), package (PKG), die, and page fields; conflicts are tracked per channel and die (CH0/DIE0 .. DIE N).]
• Main functions:
  – PPN disassembly
  – Conflict simulation
[Figure: PPNs from the FTL are disassembled into CH/PKG/DIE fields; DMA and memory latencies are scheduled on a timeline with configurable parameters, replacing the traditional detailed model.]
Simplified latency simulation model
[Figure: the traditional model drives OpCode/Addr/Data and CMD/Data phases on the channel plus the memory operation on the die (tADL, CMD); the simplified model reduces this to pre-dma, mem_op, and post-dma phases. Configurable parameters: DMA frequency and command & address delay (pre-dma); SLC/MLC/TLC, read/write, MSB/CSB/LSB (mem_op); page size, channel, package, die (post-dma).]
• pre-dma: DMA operation on the CHANNEL (transfer in metadata [+ write page])
• mem-op: NAND flash memory island read/write operation on the DIE
• post-dma: DMA operation on the CHANNEL (transfer out metadata [+ read page])
• pre-dma/post-dma consume the CHANNEL resource; mem-op consumes the DIE resource
• The conflict model is separated into i) CHANNEL and ii) DIE
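A hedged sketch of the three-phase timing with the two-level (channel/die) conflict model (the busy-until bookkeeping and any values are illustrative assumptions, not SimpleSSD's PAL code):

#include <algorithm>
#include <map>
#include <cstdint>

// Illustrative PAM-style timing: a request occupies its CHANNEL for pre-dma,
// its DIE for mem-op, and the CHANNEL again for post-dma. Conflicts are modeled
// with per-channel and per-die "busy until" times (microseconds).
struct PamSim {
    std::map<int, uint64_t> channelBusyUntil;
    std::map<int, uint64_t> dieBusyUntil;     // keyed by a global die id

    // Returns the finish time of a request issued at 'now'.
    uint64_t issue(uint64_t now, int channel, int die,
                   uint64_t preDmaUs, uint64_t memOpUs, uint64_t postDmaUs) {
        // pre-dma waits for the channel (DMA conflict).
        uint64_t t = std::max(now, channelBusyUntil[channel]) + preDmaUs;
        channelBusyUntil[channel] = t;
        // mem-op waits for the die (MEM conflict); the channel is free meanwhile.
        t = std::max(t, dieBusyUntil[die]) + memOpUs;
        dieBusyUntil[die] = t;
        // post-dma takes the channel again.
        t = std::max(t, channelBusyUntil[channel]) + postDmaUs;
        channelBusyUntil[channel] = t;
        return t;
    }
};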
Simplified conflict simulation model
• DMA conflicts: IO #1 targets DIE 0 on CHANNEL 0 and IO #2 targets DIE 1 on CHANNEL 0
[Figure: the two requests' pre-dma and post-dma phases conflict on Channel 0 while their mem-ops (write/read) proceed in parallel on Die 0 and Die 1: a pre-dma/post-dma conflict example.]
• MEM conflicts: IO #1 and IO #2 both target DIE 0 on CHANNEL 0
[Figure: the requests conflict on the channel for DMA and on Die 0 for the mem-op: a pre-dma, mem-op, and post-dma conflict example.]
gem5FS-SimpleSSD
• Target: integrate SimpleSSD into gem5 full system mode
• Overview: leverage the disk interface provided by gem5
[Figure: in the gem5FS model, page faults and file reads/writes go to a simple disk latency calculator; in the gem5FS-SimpleSSD model, they go to the SimpleSSD simulator instead.]
DEMO
Gem5FS-SimpleSSD
Output files: config.ini, config.json, SimpleSSD.log, stats.txt, system.terminal (full system execution log)
DEMO
SimpleSSD-standalone
Software Dependencies:
• Linux
• g++
Compile software:
make
Execution command:
./ssdsim ssd_config_file microbench_config_file > SimpleSSD.log
Output files: SimpleSSD.log (SimpleSSD runtime statistics report)
Evaluation Samples
[Figure: instructions per cycle are better on SLC; the page cache (VFS) does not work well; massive I/O makes system call overheads significant; MLC is worse than TLC.]

Evaluation Samples
[Figure: CPU utilization is not impacted when storage accesses hit the page cache, but it is severely impacted by storage accesses with no locality.]
Educational Research Tools
• OpenNVM: http://opennvm.camelab.org
• SimpleSSD: http://simplessd.camelab.org
• NANDFlashSim: http://nfs.camelab.org
• Trace Repository: http://traces.camelab.org
References
• Ozone (O3): An Out-of-Order Flash Memory Controller Architecture, TC 2011
• Exploring Parallel Data Access Methods in Emerging Non-Volatile Memory Systems, TPDS
• Unleashing the Potentials of Dynamism for Page Allocation Strategies in SSDs, SIGMETRICS
• Exploring and Exploiting the Multilevel Parallelism Inside SSDs for Improved Performance and Endurance, TC
• Design Tradeoffs for SSD Performance, ATC'09
• ParaFS: A Log-Structured File System to Exploit the Internal Parallelism of Flash Devices, ATC'16
• Sprinkler: Maximizing Resource Utilization in Many-Chip Solid State Disks, HPCA'14
• Taking Garbage Collection Overheads off the Critical Path in SSDs, Middleware'12
• An In-Depth Study of Next Generation Interface for Emerging Non-Volatile Memories, NVMSA
• HIOS: A Host Interface I/O Scheduler for Solid State Disks, ISCA'14
• NVMMU: A Non-Volatile Memory Management Unit for Heterogeneous GPU-SSD Architectures, PACT'15
• SimpleSSD: Modeling Solid State Drive for Holistic System Simulation, CAL 2017
• Exploiting Request Characteristics and Internal Parallelism to Improve SSD Performance, ICCD'15
• Performance Analysis of NVMe SSDs and their Implication on Real World Databases, SYSTOR'15