TRANSCRIPT
Memory Architecture and Storage Systems
Myoungsoo Jung
Computer Architecture and Memory Systems Lab.
School of Integrated Technology, Yonsei University
(IIT8015) Lecture#7: SSD Architecture and System-level Controllers
Outline
• Holistic Viewpoint (Overview)
• SSD Architecture
• Parallelism Overview
• Page Allocation Strategies
• Evaluation Studies for Parallelism
• Host Interface Overview
• Case Studies
  – Parallelism-Aware Host Interface I/O Scheduler
  – GC-Aware Host Interface I/O Scheduler
• Simulation Infrastructure (for your future research)
Holistic Viewpoint (Hardware)
[Figure (recap of Lecture 6): a host system (cores, memory controller hub / Northbridge with memory slots and high-speed PCI Express graphics slots, I/O controller hub / Southbridge with IDE, SATA, USB, and PCI slots, plus cables and ports leading off-board) connected to an SSD. SSD internals: a host interface controller, embedded processors, and four channels, each with a flash controller and NAND flash packages. Flash package internals: four dies (Die 0-3) behind a multiplexed interface. Die internals: planes (PLANE 0 .. PLANE j), each with k blocks, a data register, and a cache register, for k*j blocks per die.]
Holistic Viewpoint (Software)
[Figure: software view of the I/O path. The NVMHC (host controller) handles I/O request parsing, data movement initiation, and queuing into the device-level queue as requests arrive (spec-specific; Lecture 3). The core runs the Flash Translation Layer: memory request building, address translation, and execution sequencing (Lecture 6); memory requests have the same data size as the atomic flash I/O unit. The flash controllers handle memory request commitment and transaction handling: striping and pipelining, interleaving and sharing, and transaction decisions (today's topics). Along this path, flash memory requests move from the virtual address space to the physical address space.]
Outline
• Holistic Viewpoint (Overview)
• SSD Architecture
• Parallelism Overview
• Page Allocation Strategies
• Evaluation Studies for Parallelism
• Host Interface Overview
• Case Studies
  – Parallelism-Aware Host Interface I/O Scheduler
  – GC-Aware Host Interface I/O Scheduler
• Simulation Infrastructure (for your future research)
SSD Architecture
• At the top of the SSD internals, there is a host interface controller that parses the incoming requests
• Embedded CPU(s) are employed to run the flash firmware, such as the FTL, buffer cache, I/O scheduler, and parallelism management
SSD Architecture
• Underneath the embedded processor, multiple flash controllers exist, each connected to a memory bus, referred to as a channel
• Within a bus, there are multiple flash packages, each with its own flash interface, called a way
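To make the channel/way/die/plane hierarchy concrete, here is a minimal C++ sketch (the counts are illustrative assumptions, not any specific product):

#include <cstdint>
#include <iostream>

// Illustrative SSD geometry: channels -> ways (packages) -> dies -> planes.
// The counts below are assumptions for the example.
struct SsdGeometry {
    uint32_t channels = 8;        // memory buses driven by flash controllers
    uint32_t waysPerChannel = 8;  // flash packages sharing one channel
    uint32_t diesPerWay = 2;      // dies behind one multiplexed interface
    uint32_t planesPerDie = 2;    // planes that can share a wordline operation

    uint64_t totalPlanes() const {
        return uint64_t(channels) * waysPerChannel * diesPerWay * planesPerDie;
    }
};

int main() {
    SsdGeometry g;
    std::cout << "independent plane-level resources: " << g.totalPlanes() << "\n";
    return 0;
}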
System-Level Parallelism
• Channel striping
  – An I/O request is striped over multiple channels
• Way pipelining
  – Because the ways share the channel, an I/O request cannot be perfectly striped in parallel
  – However, NAND flash transactions consist of multiple phases, so individual NAND flash chips can still work simultaneously
[Figure: channel striping. A host interface and microprocessor drive four channels (CH A-D), each with multiple flash chips (ways); a request is striped across the channels.]
[Figure: way pipelining. Within one channel (CH A), transactions to the different flash chips (ways) overlap their phases, pipelining the shared channel bus.]
Flash-Level Parallelism
• Die interleaving
  – A striped/pipelined request can be further interleaved across the dies within a chip
[Figure: die interleaving. Four dies (DIE 0-3) behind a multiplexed interface, each with a NAND flash memory array (plane), a data register, and a cache register; a request is interleaved over the dies.]
• Plane sharing
  – Multiple planes work simultaneously using shared wordline(s)
[Figure: plane sharing. Two planes of one die operate together through their shared wordlines, each plane with its own data and cache registers and blocks.]
Parallelism Overview
• Note that different vendors use different naming rules (interleaving, striping, way, etc.)
• You need to understand the four different levels of parallelism based on the context
  – Plane sharing (= multi-plane mode operation, two-plane operation, etc.)
  – Die interleaving (= interleaved die operation, bank interleaving, etc.)
  – Way pipelining (= package interleaving)
  – Channel striping
Outline
• Holistic Viewpoint (Overview)
• SSD Architecture
• Parallelism Overview
• Page Allocation Strategies
• Evaluation Studies for Parallelism
• Host Interface Overview
• Case Studies
  – Parallelism-Aware Host Interface I/O Scheduler
  – GC-Aware Host Interface I/O Scheduler
• Simulation Infrastructure (for your future research)
Software Stack (Review) & Page Allocation
[Figure: SSD software stack over the channel/way layout (CH A/B, WAY 0/1). Host Interface Layer (HIL): responsible for communication. Flash Translation Layer (FTL): address translation between the host address space and physical addresses. Hardware Abstraction Layer (HAL): committing flash transactions to the underlying flash memory chips. Image: micron.com]

Page Allocation Strategies
Page allocation strategies are directly related to the physical data layout and access sequences, which in turn impact performance and internal parallelism.
Page Allocation Strategies (Palloc)
• Channel-first pallocs
  – Allocate internal resources in favor of the channel striping method
• Way-first pallocs
  – Are oriented toward taking advantage of way pipelining
• Die-first and plane-first pallocs
  – Allocate dies and planes in an attempt to reap the benefit of flash-level parallelism
[Figure: a NAND flash chip with two dies (DIE 0/1), each with two planes, across WAY 0/1 and CH A/B, annotated with channel striping, way pipelining, die interleaving, and plane sharing.]
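To illustrate what the four-letter orderings in the following slides mean, here is a hedged C++ sketch that decomposes a sequential page index into resource indices according to a chosen priority string such as CWDP or PWCD; the resource counts and the decomposition logic are illustrative assumptions, not any vendor's firmware.

#include <cstdio>
#include <cstdint>
#include <string>

// Illustrative page allocation: walk the priority string (e.g., "CWDP" =
// channel, way, die, plane) from left to right; the leftmost resource varies
// fastest as consecutive logical pages are allocated.
struct Target { uint32_t channel, way, die, plane; };

Target allocate(uint64_t pageIndex, const std::string& order,
                uint32_t channels, uint32_t ways, uint32_t dies, uint32_t planes) {
    Target t{0, 0, 0, 0};
    for (char level : order) {
        switch (level) {
        case 'C': t.channel = pageIndex % channels; pageIndex /= channels; break;
        case 'W': t.way     = pageIndex % ways;     pageIndex /= ways;     break;
        case 'D': t.die     = pageIndex % dies;     pageIndex /= dies;     break;
        case 'P': t.plane   = pageIndex % planes;   pageIndex /= planes;   break;
        }
    }
    return t;
}

// Example: under CWDP, consecutive pages rotate over channels first; under PWCD
// they rotate over the planes of one die first.
int main() {
    Target a = allocate(5, "CWDP", 2, 2, 2, 2);
    std::printf("page 5 under CWDP -> ch %u, way %u, die %u, plane %u\n",
                a.channel, a.way, a.die, a.plane);
    return 0;
}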
Channel-first Page Allocation
• These page allocation strategies give priority to channels first, then to ways, dies, and planes
• Some channel-first page allocation strategies introduce low flash-level locality
[Figure: channel-first orderings over the two-channel, two-way, two-die, two-plane layout: CWDP (Channel-Way-Die-Plane), CDPW, CDWP, CPDW, CPWD, CWPD; channel striping first, then way pipelining, die interleaving, and plane sharing.]
Way-first Page Allocation
• This allocation 1) assigns the way resources in a channel first, 2) stripes all the requests over the multiple ways, and 3) then interleaves the flash-level resources
• Although it allocates a system-level resource first, some way-first orderings favor flash-level resources
[Figure: way-first orderings: WDCP (Way-Die-Channel-Plane), WCPD, WDPC, WPCD, WPDC; way pipelining first, then channel striping, die interleaving, and plane sharing.]
Die-first Page Allocation
• The die-first page allocation schemes favor the exploitation of the die-interleaving method
• They can also accommodate system-level resources instead of the remaining flash-level resources, depending on the access patterns (DCWP/DWCP)
[Figure: die-first orderings: DPWC (Die-Plane-Way-Channel), DCPW, DCWP, DPCW, DWCP, DWPC; die interleaving with multi-plane operation, then way pipelining and channel striping.]
Plane-first Page Allocation
• It parallelizes data accesses with plane sharing, which can in turn improve the storage throughput
• An excellent option for realizing the benefits of both inter- and intra-request parallelism
[Figure: plane-first orderings: PWCD (Plane-Way-Channel-Die), PCDW, PCWD, PDCW, PDWC, PWDC; plane sharing first, then the remaining resources.]

Channel-first vs. Plane-first
[Figure: assumption in this example: a legacy (single-plane) operation takes 200 us for 1 page, while a plane-sharing operation takes 240 us for 2 pages. For a single two-page request, the channel-first allocation finishes in 200 us (one page per channel in parallel), whereas the plane-first allocation finishes in 240 us.]
[Figure: same assumption (legacy 200 us per page, plane sharing 240 us per 2 pages), now with two two-page requests. Under channel-first allocation, Req1 finishes at 200 us but Req2 queues behind it on the same channels and finishes at 400 us. Under plane-first allocation in favor of way pipelining, both Req1 and Req2 finish at 240 us.]
Channel-first vs. plane-first
Way-first (WPCD) vs. Plane-first (PWCD)
[Figure: WPCD and PWCD layouts over WAY 0/1 and CH A/B.]
• The system-level and flash-level parallelism exploited are the same
• Performance among the different page allocation strategies still varies based on the access pattern
Outline
• Holistic Viewpoint (Overview)
• SSD Architecture
• Parallelism Overview
• Page Allocation Strategies
• Evaluation Studies for Parallelism
• Host Interface Overview
• Case Studies
  – Parallelism-Aware Host Interface I/O Scheduler
  – GC-Aware Host Interface I/O Scheduler
• Simulation Infrastructure (for your future research)
SSD Setup
• NAND Flash Chip
  – Fine-grained NAND commands
  – Advanced commands
  – Strong address constraints
  – Intrinsic latency variation
• SSD Framework
  – 8 channels, 8 flash chips per channel (64 total)
  – Dual-die package format, 32-entry queue
  – Page-level mapping and a greedy garbage collection algorithm
Performance Comparison
[Figure: normalized IOPS and normalized latency across workloads (msnfs, usr, fin1, web, fin2, sql0-3) for CWDP, WCPD, DPCW, and PDCW.]
• Way-first and flash-level-resource-first pallocs achieve a better IOPS position than the channel-first palloc
• The channel-first palloc provides shorter latencies than the flash-level-resource-first pallocs
Parallelism Breakdown
[Figure: the fraction of parallel data access method types (%) for all 24 palloc orderings, grouped as channel-first, die-first, plane-first, and way-first. Legend: die interleaving with multi-plane write/read, plane sharing write/read, die interleaving write/read, striped legacy write/read.]
• Low flash-level parallelism is observed under the palloc schemes in favor of channels
• They render advanced flash command composition difficult at runtime (due to low flash-level locality), whereas the other groups show high parallelism
Resource Utilization
[Figure: execution time fraction (%) breakdowns for write-intensive and read-intensive workloads (idle, flash-level conflict, bus contention, bus activate, flash cell activate), and the average channel utilization (%) per workload (msnfs, usr, fin1, web, fin2, sql0-3), for all 24 palloc orderings grouped as channel-first, die-first, plane-first, and way-first.]
• Channel resources are utilized only about 43.1% on average, even with the most parallel data access methods
• About 80% of the total execution time is spent idle
Optimization Point
[Figure: design space plotted as lower latency vs. higher throughput, placing CWDP, PWCD, DPWC, WDCP, and an IDEAL point. Reaching the ideal requires flash-level parallelism, avoiding resource conflicts, and maximizing resource utilization.]
Outline
• Holistic Viewpoint (Overview)
• SSD Architecture
• Parallelism Overview
• Page Allocation Strategies
• Evaluation Studies for Parallelism
• Host Interface Overview
• Case Studies
  – Parallelism-Aware Host Interface I/O Scheduler
  – GC-Aware Host Interface I/O Scheduler
• Simulation Infrastructure (for your future research)
Host Interface Overview
• Serial AT Attachment (SATA)
  – The most popular interface for SSDs
  – 600 MB/sec (SATA 3.0)
• Non-Volatile Memory Express (NVMe)
  – PCI Express based storage management protocol
  – Around 1 GB/sec per lane, with up to 16 lanes (PCIe 3.0)

SATA
[Figure: CPU connected to a PCH chipset over DMI (20 Gbps); the AHCI SSD attaches via SATA/SAS (1.5 to 6 Gbps).]
• The host connection is to the Advanced Host Controller Interface (AHCI)
• The bus overheads introduce 1 us for each command
• Throughput is also serialized
• Designed around conventional spinning disks
• 32 entries for the device-level queue (Native Command Queuing)
Need for High-Performance Interfaces
• The storage interface is a bridge between host and storage
  – Traditional SATA and SAS have been widely employed
• Storage-internal bandwidths keep increasing
  – Thanks to increased resources and parallelism
• Traditional interfaces fail to deliver these very high bandwidths
• The trend is moving from upgrading traditional interfaces to devising new high-performance interfaces
[Figure: host system connected to a storage system (NVMs) through an interface; more resources and more parallelism yield higher internal bandwidth, making the SATA/SAS interface the performance bottleneck and motivating PCIe.]
NVM Express
[Figure: CPU and PCH chipset (DMI, 20 Gbps); the NVMe SSD attaches over PCIe (5 Gbps per lane) built into the platform.]
• The host connection is Peripheral Component Interconnect Express (PCIe)
• The PCIe bus connection still requires an SSD controller chip, but it does not need a SATA/AHCI controller
• Throughput flows in parallel along each available PCIe lane
NVMe’s Rich Queuing Mechanism
• The traditional interface provides a single I/O queue with tens of entries
  – Native Command Queuing (NCQ) with 32 entries
• NVMe strives to increase throughput by providing a scalable number of queues with a scalable number of entries
  – Up to 64K queues, each with up to 64K entries
• NVMe queues are configured in host-side memory
  – Pairs of a Submission Queue (SQ) and a Completion Queue (CQ)
  – Per-core, per-process, or per-thread
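To make the queue pairing concrete, here is a hedged C++ sketch of an SQ/CQ pair in host memory with a doorbell-style tail update; the structures and field names are illustrative and are not the real NVMe command format or driver API.

#include <array>
#include <cstdint>

// Simplified, illustrative model of an NVMe-style submission/completion queue
// pair in host memory. Not the real NVMe register or command layout.
struct Command    { uint64_t lba; uint32_t nblocks; uint8_t opcode; uint16_t cid; };
struct Completion { uint16_t cid; uint16_t status; };

template <size_t N>
struct QueuePair {
    std::array<Command, N> sq{};     // submission queue (host memory)
    std::array<Completion, N> cq{};  // completion queue (host memory)
    uint32_t sqTail = 0, cqHead = 0;

    // Host side: place a command and "ring the doorbell" by publishing the new
    // tail (a real driver would write it to an MMIO doorbell register).
    uint32_t submit(const Command& c) {
        sq[sqTail % N] = c;
        sqTail = (sqTail + 1) % N;
        return sqTail;               // value that would go to the SQ tail doorbell
    }

    // Host side: consume one completion entry if the device has produced one.
    bool reap(Completion& out, uint32_t cqTailFromDevice) {
        if (cqHead == cqTailFromDevice) return false;
        out = cq[cqHead % N];
        cqHead = (cqHead + 1) % N;   // new head is published via the CQ head doorbell
        return true;
    }
};

int main() {
    QueuePair<64> qp;                // conceptually one pair per core/thread
    qp.submit({0x1000, 8, /*illustrative write opcode*/ 0x01, /*cid*/ 1});
    return 0;
}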
High-level View of Comparison
• Scalable queues (for multiple cores)
• Lock-less queue management
[Figure: AHCI SSD vs. NVMe SSD. With AHCI, cores share a single command list/table with 32 command entries for I/O issue, completion, and interrupts. With NVMe, each core gets its own issue and completion queues, with up to 64K queues of 64K depth.]
I/O Stack Comparison
• SATA requests traverse
  – The block layer
  – The AHCI driver
  – The host bus
  – The AHCI host bus adapter (HBA)
• NVMe requests bypass these conventional modules and go directly to the PCIe root complex
[Figure: I/O stacks from the application (user mode) through the kernel-mode storage stack to hardware. AHCI connection: block layer, AHCI driver, AHCI HBA, SATA controller, NAND controller, NAND. NVMe connection: NVMe driver, PCIe root port, NVMe controller, NAND.]
Communication Protocol
[Figure: NVMe I/O write and read timelines between the host side and the SSD side. Write: DB-write, IO-Req, IO-Fetch, WR-DMA, SSD write (SSD-PROC), CPL-Submit, MSI. Read: DB-write, IO-Req, IO-Fetch, SSD read (SSD-PROC), RD-DMA, CPL-Submit, MSI.]

Protocol Comparison
• DB-Write
  – Doorbell (DB) register based communication
  – Removes all register reads, each of which consumes approximately 2000 CPU cycles
• MSI
  – MSI software-based vector interrupt
  – Ensures a specific core does not become the IOPS bottleneck
• Less Synchronization and Less Locking
  – One doorbell per queue, increasing parallelism
  – Removes the synchronization lock needed to issue commands
Outline
• Holistic Viewpoint (Overview)
• SSD Architecture
• Parallelism Overview
• Page Allocation Strategies
• Evaluation Studies for Parallelism
• Host Interface Overview
• Case Studies
  – Parallelism-Aware Host Interface I/O Scheduler
  – GC-Aware Host Interface I/O Scheduler
• Simulation Infrastructure (for your future research)
Case Study #1: Physically Addressed Queueing (PAQ): Improving Parallelism in Solid State Disks
Motivation
Background & Problems
PAQ
Evaluation Results
Conclusion
• Observation
  – SSD performance varies based on how data accesses are parallelized
  – Writes fully enjoy internal parallelism, but reads suffer from resource contention
• Problem
  – Virtually addressed queuing is insufficient to schedule incoming I/O requests
• Our solution
  – Expose the physical address space to the scheduler and avoid internal resource contention
Motivation
Background & Problem
PAQ
Evaluation Results
Conclusion
Use-cases of High-speed SSDs
ADVANTAGE
• FASTER: 100x more throughput than a 15K RPM disk
• ENERGY EFFICIENT: requires 33% less power than an HDD
DISADVANTAGE
• EXPENSIVE: a server SSD is about $30/GB, while a 10K SAS HDD is about $1/GB
• LIFETIME LIMIT: NAND flash memory cells wear out with overuse
[Images: Intel, thessdreview.com]
SSDs are therefore considered for workloads rife with reads (HPC/enterprise) or as a cache (hybrid SSD)
Reads vs. Writes in bare NAND Flash memory
• NAND flash is biased towards reading [SK hynix 32nm MLC NAND flash]
  – Latency: write 440 ~ 5000 us; read 25 us (80x faster, typical); erase 2500 us
  – Bandwidth: write 2.2 MB/sec; read 26.7 MB/sec (13x faster)
• Care is needed for writes on NAND flash
  – An erase operation is required before a write
  – New writes require garbage collections or block merges, a set of erase, read, and write operations
Two Divergent Research Directions
• Internal: research working to improve writes
  – Garbage collection scheduling
  – Flash firmware mapping algorithms
  – Write buffer management
• External: research developing mechanisms that capitalize on read performance (avoiding the high penalties of writes)
[Image: Thinkpads.com]
Read vs. Write in an SSD
[Figure: bandwidth (MB/s) and average response time (ms) vs. transfer size (512B to 128K) for SSD-A and SSD-B, random/sequential reads and writes; reads outperform writes by at least 25% and at least 56% on the two SSDs.]
Read performance depends on a rigid data layout and access sequence, whereas write sequences can be remapped and easily reap the benefit of parallelism.
Motivation
Background & Problem
PAQ
Evaluation Results
Conclusion
SSD & NAND Flash Internals
[Figure (recap): SSD internals with a host interface controller, embedded processors, and four channels of NAND flash with controllers; flash package internals with four dies behind a multiplexed interface; die internals with planes, data/cache registers, and k*j blocks.]
Software Stack of an SSD
[Figure: the host address space is split into virtual addresses (HIL side, with QBM and PHY) and physical addresses (FTL/HAL side). Image: micron.com]
• Host Interface Layer (HIL): responsible for communication; raw protocols handled by the PHY; Queue and Buffer Management (QBM) handled by the APP block
• Flash Translation Layer (FTL): address translation between the host address space and physical addresses
• Hardware Abstraction Layer (HAL): committing flash transactions to the underlying flash memory chips
• The HIL is oblivious of physical addresses
Conventional Scheduling
[Figure: a device-level queue of requests shown in virtual and physical addresses (per package and die) over four channels; conventional, virtually addressed scheduling yields 7 parallelized I/O groups and only 2 die interleavings.]
I/O request scheduling is not efficient because the HIL sits on the virtual address space.
Multiplane Mode Operation
• Plane-level parallelism can only be achieved via the multi-plane mode advanced command
• Two issues when building advanced commands at runtime
  – The conventional virtually addressed queue (VAQ) is ignorant of physical addresses
  – The FTL and flash firmware are oblivious of the upper device-level queue and the requests therein
[Figure: queued requests (tag IDs, virtual and physical addresses) mapped onto two channels, two packages, and two dies with even/odd planes; conventional scheduling composes only 2 multi-plane operations.]
What if the HIL or the virtually addressed scheduler knew the physical address space?
Motivation
Background & Problem
PAQ
Evaluation Results
Conclusion
High Level View of PAQ
[Figure: the QBM layer is moved out of the HIL (virtual address space) to beneath the FTL (physical address space). Image: micron.com]
• Moving the QBM layer out from the HIL and beneath the FTL
• The QBM migration exposes physical addresses to our scheduler, PAQ (Physically Addressed Queuing)
High Level View of PAQ
• Identify requests that will cause conflicts
• Build groups of requests that do not share conflicts, called Clumps
• Pack transactions based on the physical layout of the I/O requests
[Figure: address space view with the QBM now sitting below the FTL.]
Clump Composition
• Lower-level conflicts are the most costly!
• Building Clumps
  1. Add transactions incurring conflicts in the lowest levels first
  2. For die- and package-level conflicts, never schedule the conflicting transactions in the same clump
  3. Continue adding transactions to the clump, prioritizing low-level conflicts, until no more can be added without breaking rule 2
PAQ attempts to build clumps in a bottom-up, conflict-first fashion such that the lowest level with contention does not have conflicting transactions in the clump; a sketch of this idea follows.
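Below is a hedged C++ sketch of such a greedy grouping pass (illustrative types and logic, not the paper's implementation): it admits at most one transaction per flash package into a clump, so die- and package-level conflicts never occur inside it, while channel sharing is tolerated as the least costly conflict.

#include <vector>
#include <set>
#include <utility>

// Illustrative clump composition: greedily group queued flash transactions so
// that no two in one clump target the same package (and hence the same die).
struct Txn { int channel, package, die, plane; };

std::vector<Txn> buildClump(std::vector<Txn>& queue) {
    std::vector<Txn> clump, deferred;
    std::set<std::pair<int,int>> usedPackages;   // (channel, package) already claimed
    for (const Txn& t : queue) {
        if (usedPackages.insert({t.channel, t.package}).second) {
            clump.push_back(t);                  // no die/package conflict inside the clump
        } else {
            deferred.push_back(t);               // conflicts; goes into a later clump
        }
    }
    queue.swap(deferred);                        // leftovers wait for the next clump
    return clump;
}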
Physically Addressed Queueing
[Figure: the same queue scheduled with physical addresses over four channels now forms only 3 parallelized I/O groups and achieves 5 die interleavings.]
Plane Packing
• PAQ knows both the device-level queue and the physical addresses
• It parses the requests in the queue and issues each transaction in favor of multi-plane mode operations
[Figure: with plane packing, the same queue composes 4 multi-plane operations instead of 2.]
Physically Addressed Queuing can schedule and pack multiple transactions into one advanced command at runtime.
Motivation
Background & Problem
PAQ
Evaluation Results
Conclusion
SSD Setup
• NAND Flash Chip
  – Fine-grained NAND commands
  – Advanced commands
  – Strong address constraints
  – Intrinsic latency variation
• SSD Framework
  – 8 channels, 8 flash chips per channel (64 total)
  – Dual-die package format, 32-entry queue
  – A page-level mapping algorithm
Configurations & Traces
• Queuing strategies
  – VAQ: default queuing scheme (virtual address)
  – PAQ0: PAQ, only using plane packing
  – PAQ1: PAQ, only using clumping
  – PAQ2: PAQ, using both plane packing and clumping
• Traces
  – fin: online transaction processing
  – web: search engine
  – usr: shared directory
  – prn: print serving
  – sql: database
  – msnfs: file storage servers
Aggregate Performance - Bandwidth
• Read performance improves by about 45% (100 MB/sec) over the VAQ scheduler for web workloads (90% random reads)
• PAQ2 never hurts performance for any workload, whether read- or write-oriented
[Figure: IOPS (the number of host-level I/O requests per second) and bandwidth (KB/sec) per trace, showing the 45% improvement on web. Even on the worst-performing trace, where writes are intermixed with small reads, PAQ2 shows 1.41x better performance.]
Outline
• Holistic Viewpoint (Overview)
• SSD Architecture
• Parallelism Overview
• Page Allocation Strategies
• Evaluation Studies for Parallelism
• Host Interface Overview
• Case Studies
  – Parallelism-Aware Host Interface I/O Scheduler
  – GC-Aware Host Interface I/O Scheduler
• Simulation Infrastructure (for your future research)
Case Study #2: Host Interface Assisted Garbage Collection Scheduler
1. Taking Garbage Collection Overheads off the Critical Path in SSDs, Middleware'12
2. HIOS: A Host Interface I/O Scheduler for Solid State Disks, ISCA'14
Outline
• Motivation
• Worst-case latency analysis
• Background garbage collection (GC)
– Advanced GC and delayed GC
– Incremental GC
• Garbage collection scheduling
– Slack stealing
– GC overhead redistribution
• Observation
  – Garbage collection (GC) is the critical performance bottleneck for SSDs
    • Bandwidth under GC is 4x worse than for normal-case I/O operations
    • Latency under GC is 8x ~ 10x longer than the normal I/O access time
  – The idle I/O times present in workloads can be exploited by shifting garbage collections from busy periods to other periods
• Our solution
  – Remove on-demand GCs from the critical path and secure free blocks in advance
  – Delay on-demand GCs to the next idle periods
Motivation
• Solid State Drives!!
  – Faster than any conventional block device
  – Overwriting a page is not allowed before erasing its block, which is a set of pages
[Figure: write cliff caused by garbage collection.]
Write performance of modern SSDs is significantly degraded after garbage collection begins.
SSDs Always Faster than Disks?
• Average and worst-case latency of 7 SSDs and 4 disks
• As commonly known, SSDs show better average latency
• However, the worst-case latencies of SSDs are much higher
• I/Os exhibiting the worst-case latency may violate QoS
[Figure: average vs. worst-case latency for the 7 SSDs and the 4 disks.]
How Often Do Worst-case Latencies Occur?
• Two Samsung SSDs are written using Intel IOmeter
• Performance variation = worst-case latency - average latency
• As the SSDs are used more and more,
  – SLC: worst-case latencies appear repeatedly and frequently
  – MLC: worst-case latencies are not only observed but also get worse
• Worst-case latencies are frequent and get worse over time
[Figure: latency over time vs. average latency for a Samsung SLC 120GB SSD and a Samsung MLC 256GB SSD.]
Does Worst-case Latency Affect Throughput?
• Samsung and OCZ SSDs are written by Intel IOmeter
• Worst-case latency and throughput are measured over time
• At some point, the write cliff is observed
  – Worst-case latency gets significantly worse (by 40x)
  – At the same time, throughput severely degrades (by 64%)
• Worst-case latencies directly cause throughput degradation
[Figure: Samsung 830 and OCZ Vertex 3 latency and throughput over time.]
Latency Impact: Empirical Experimental Results
• 256GB MLC-based SSD
  – 128 * 2 DRAM buffer
  – Dual core, 8 channels, and 64 flash packages
  – Device-level latency is captured with a ULINK DriveMaster
• GC impact
  – Warm-up: 1MB random writes over the whole region
[Figure: pristine-state performance before warm-up.]
So Far & Our Approach
[Figure: latency timelines without GCs, with GCs, and with our approach, which moves GC overheads off the critical path.]
Goals
• Making garbage collection (GC) overheads invisible to users
  – With our GC strategies, applications do not experience GC overheads
• Avoiding additional GC operations
  – Our GC strategies only schedule GC operations that would be invoked soon anyway
• Compatibility with underlying FTL schemes
  – No extra NVM buffer is needed, and the FTL's main address mapping policy is unchanged
Shifting Garbage Collection
• Advanced GC strategy (AGC)
  – Removes on-demand GCs from the critical path and secures free blocks in advance
  – 2 components: Look-ahead GC and Proactive Block Compaction
• Delayed GC strategy (DGC)
  – Handles the cases where idleness does not occur frequently and AGC fails to secure free blocks
  – Delays on-demand GCs to the next idle periods
[Figure: device-level short idleness vs. long idleness imposed by the host.]

Device-level Short Idleness Utilization
• Leverages the device-level queue and pre-arrival information
• 3~17 command tags arrive in parallel (measured with a LeCroy commercial protocol analyzer)
• Look-ahead GC (a part of AGC) is executed during this short idleness
[Figure: short idle time between the previous I/O and the expected execution time of queued I/Os.]
Long Idleness Utilization
• 38% ~ 83% of instructions experience idle periods of more than 1 sec
• DGC and Proactive Block Compaction (the other part of AGC) are performed during these long idle periods
Details of AGC
• Look-ahead GC
  – Predicts on-demand GCs based on incoming host requests and mapping information
  – Look-ahead GC is executed only if the short idle period is longer than the predicted GC latency
• Proactive Block Compaction
  – Reclaims blocks that are fully occupied by contents during long idle periods
[Figure: GC latency components: valid page migration time and block cleaning time.]
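A minimal sketch of the look-ahead decision, assuming the GC latency is estimated from the valid pages to migrate plus a block erase; the constants and structure are illustrative assumptions, not the paper's values.

#include <cstdint>

// Illustrative look-ahead GC decision. Per-page migration and erase costs are
// assumed example values.
struct GcEstimate {
    uint32_t validPages;            // pages to migrate out of the victim block
    uint64_t pageMigrateUs = 300;   // read + program per valid page (assumed)
    uint64_t blockEraseUs  = 2500;  // erase latency (assumed)
    uint64_t latencyUs() const { return validPages * pageMigrateUs + blockEraseUs; }
};

// Run look-ahead GC only if the predicted short idle period can absorb it.
bool shouldRunLookAheadGc(uint64_t predictedIdleUs, const GcEstimate& gc) {
    return predictedIdleUs > gc.latencyUs();
}

int main() {
    GcEstimate gc{/*validPages=*/20};
    return shouldRunLookAheadGc(/*predictedIdleUs=*/10000, gc) ? 0 : 1;
}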
Details of DGC
• Update Block Replacement
  – GCs need not run at the same time as the writes that trigger them
  – Skip the time-consuming tasks of GC (page migration) and serve urgent I/O requests first
  – Put the on-demand page migration into the DGC list and substitute another update block
• Retroactive Block Compaction
  – Resume the page migration activity during long idle times and return the update block (used for DGC)
Incremental Garbage Collection
• During long idle periods, there is no pre-arrival information and no advantage from the device-level queue (it is empty)
• AGC and DGC therefore employ Incremental Garbage Collection
  – GC activities are split into multiple sub-collections delimited by checkpoints
  – Checkpoint: check whether further collection can be performed
[Figure: valid/invalid pages being migrated from a data block and an update block to a target block, with checkpoints (CP) between sub-collections and incoming I/O requests observed through the device-level queue.]
Experimental Setup
• 4-channel, 16-flash-chip, bus-level transaction SSD simulation
  – 6 volumes of an SSD array
• FTL implementations
  – L-FTL: log-structured block-mapping FTL
  – H-FTL: superblock-style block and page hybrid-mapping FTL
  – P-FTL: partial block cleaning FTL (16% more flash blocks)
• Garbage collection strategies
  – Baseline: block-merge type garbage collector
  – AGC: advanced GC strategy only
  – DGC: delayed GC strategy only
  – AGC+DGC: our GC schemes put together
Low Write-Intensive Workload of Microsoft File Server Storage
• Under a low write-intensive workload, AGC successfully hides the GC overheads
[Figure: latency timelines for the baseline GC vs. AGC only; no GC impact is visible with AGC.]

High Write-Intensive Workload of Microsoft File Server Storage
[Figure: baseline GC, AGC only (fails to hide all GCs), DGC only (also fails), and AGC+DGC (succeeds).]
Performance Comparison
[Figure: results for L-FTL and H-FTL, with AGC+DGC and the fraction of AGC+DGC made invisible.]
• The hybrid-mapping FTL yields shorter WCRT and lower I/O blocking than L-FTL by reducing GC overheads
• AGC+DGC performs all on-demand GCs even under write-intensive workloads
Constraints of Background GC
• Background garbage collection is feasible only if the system can secure enough idle time
• It may violate QoS or SLAs when an unexpected request arrives during the background GC operations
GC Overhead Distribution
• Four I/O requests (1~4) are present in order
  – (Example) the given deadlines of the four requests are the same
  – GC is triggered during the I/O-1 service
  – I/O-1 misses its QoS, whereas the others can satisfy it
• I/O-2 to I/O-4 have a time margin (slack) until their deadlines
  – We distribute the GC overhead of I/O-1 over the others
  – All I/O requests can then meet their deadlines even with GC executed!
[Figure: latency timeline of I/O-1 through I/O-4 showing flash execution, GC overhead, and the I/O deadline.]
What Is Needed for GC Distribution?
• (1) GC overhead estimation
  – Need to know how big the GC overhead is
• (2) Slack stealing
  – Need to know how much slack (how many I/O requests) is required
• (3) GC overhead distribution
  – Need to segment the GC and distribute the segments over other I/O requests
[Figure: the same timeline, annotated with the three steps.]
HIOS’s GC Overhead Estimation
• The SATA interface is the best place for GC distribution
  – It is aware of flash device operations, including GC
  – It is also aware of the I/O requests through their tags (the essential information)
  – Before the actual I/O request, its tag is sent to the command queue
• GC overhead is estimated based on flash status and the I/O tag
  – A GC invocation by I/O-1 can be predicted
  – The GC overhead (# of reads, writes, and erases) can be estimated
[Figure: the SATA interface's command queue holds Tag-1 through Tag-4 ahead of I/O-1 through I/O-4 reaching the flash devices.]
HIOS’s Slack Stealing
• Slack is accumulated from the following I/O requests
  – For each I/O request, slack is calculated using its tag information
  – Slack time = T_deadline - T_flash_latency
  – Slack stealing continues until the GC overhead is exhausted
• In this scenario, the slack from the following I/O-2, 3, and 4 is enough to distribute I/O-1's GC overhead
[Figure: command queue with Tag-1 through Tag-4; the slack of I/O-2 to I/O-4 absorbs I/O-1's GC overhead.]
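A minimal sketch of slack stealing under these definitions (illustrative fields and units; not HIOS's implementation): one request's estimated GC overhead is spread over the slack of the requests queued behind it.

#include <algorithm>
#include <vector>
#include <cstdint>

// Illustrative slack stealing. Times are in microseconds.
struct QueuedIo {
    uint64_t deadlineUs;          // QoS deadline for this request
    uint64_t flashLatencyUs;      // expected flash execution time without GC
    uint64_t assignedGcUs = 0;    // GC segment time this request will absorb

    uint64_t slackUs() const {
        return deadlineUs > flashLatencyUs ? deadlineUs - flashLatencyUs : 0;
    }
};

// Distribute gcOverheadUs over the following requests; returns the leftover
// overhead that could not be hidden (nonzero means a deadline is still at risk).
uint64_t stealSlack(std::vector<QueuedIo>& followers, uint64_t gcOverheadUs) {
    for (QueuedIo& io : followers) {
        if (gcOverheadUs == 0) break;
        uint64_t take = std::min(io.slackUs(), gcOverheadUs);
        io.assignedGcUs = take;   // this request executes 'take' us of GC segments
        gcOverheadUs -= take;
    }
    return gcOverheadUs;
}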
HIOS’s GC Distribution
• GC (reads, writes, erases) can be segmented into small pieces
• For each I/O request, (I/O-1)
– Write data is transferred from host to device buffer
– Assigned GC segments are executed
– Flash device commands are issued
I/O-1 time
Latency
command queueSATA I/F
Flash devices
I/O-2 I/O-3
Tag-
1
Tag-
2
Tag-
3
Tag-
4
I/O-4
I/O-1’s GC overhead
dram buffer
I/O-1 data
Tag-
1
flash cmd flash cmd
HIOS’s GC Distribution
• GC (reads, writes, erases) can be segmented into small pieces
• For each I/O request, (I/O-2)
– Write data is transferred from host to device buffer
– Assigned GC segments are executed
– Flash device commands are issued
I/O-1 time
Latency
command queueSATA I/F
Flash devices
I/O-2 I/O-3 I/O-4
dram buffer
I/O-2 data
Tag-
3
Tag-
4
Tag-
2
flash cmd flash cmd flash cmd
HIOS’s GC Distribution
• GC (reads, writes, erases) can be segmented into small pieces
• For each I/O request, (I/O-3)
– Write data is transferred from host to device buffer
– Assigned GC segments are executed
– Flash device commands are issued
I/O-1 time
Latency
command queueSATA I/F
Flash devices
I/O-2 I/O-3 I/O-4
dram buffer
I/O-3 data
Tag-
4
Tag-
3
flash cmd flash cmd flash cmd
HIOS’s GC Distribution
• GC (reads, writes, erases) can be segmented into small pieces
• For each I/O request, (I/O-4)
– Write data is transferred from host to device buffer
– Assigned GC segments are executed
– Flash device commands are issued
• “I/O-1’s GC distributed over I/O-2,3, and 4” & “all satisfy QoS”
I/O-1 time
Latency
command queueSATA I/F
Flash devices
I/O-2 I/O-3 I/O-4
dram buffer
I/O-4 data
flash cmd
Tag-
4
Simulation-based Evaluation
• Simulator
  – Models flash chips and the associated data paths
  – Implements a typical SSD software stack
• Baseline configuration
  – SSD: 8 channels, 2 flash chips per channel
  – SATA I/F: 32 command queue entries
• Compared with four different I/O schedulers
  – Noop: schedules on a FIFO basis
  – Anticipatory: considers spatial locality
  – Deadline: sorts logical addresses in ascending order
  – Flash-aware: reduces GC overhead by reducing writes
  – HIOS: GC-only or channel-only management, or both
Worst-case Latency
• Worst-case latency is normalized to the Noop scheduler
• All existing schedulers perform similarly, since they are oblivious to GC and cannot avoid the high GC overhead
• HIOS-1 (GC distribution only) significantly reduces it, by 41%
• HIOS-2 (GC + channel management) is similar to HIOS-1
  – The worst-case latency is caused by GC, not by channel conflicts
Average Latency
• Average latency is normalized to the Noop scheduler
• All existing schedulers are similar or slightly worse
• HIOS-1 does not affect average latency much
  – GC overheads are not eliminated, only redistributed
• HIOS-2 (GC + channel management) improves it by 13%
  – Resolving channel conflicts achieves better performance
Deadline Satisfaction
• A long series of writes is issued to generate multiple GCs
• The % of I/O requests missing their deadline (30 ms) is measured
• Under 40% usage, all I/O schedulers show a negligible miss rate
• As the number of writes increases, the miss rate of the other schedulers increases dramatically as they suffer from more frequent GCs
Outline
• Holistic Viewpoint (Overview)
• SSD Architecture
• Parallelism Overview
• Page Allocation Strategies
• Evaluation Studies for Parallelism
• Host Interface Overview
• Case Studies
  – Parallelism-Aware Host Interface I/O Scheduler
  – GC-Aware Host Interface I/O Scheduler
• Simulation Infrastructure (for your future research)
What we are working on now for our heterogeneous computing research (real-implementation-based research)
• Software approach: extending NVMMU for a cluster
• Hardware approach: storage-based accelerator
Our Real Implementation Approach (SW)
• GPU-based storage I/O acceleration (NVMMU between the SSD, GPU, and CPU)
• Advantage:
  – Ready to explore systems without the limits of a simulation infrastructure
  – We can measure the real execution time of the integration approach
• Disadvantage:
  – Performance varies somewhat based on the test environment
  – Incompatibilities across device and software versions
  – Debug… debug… debug… and debug…
Our Real Implementation Approach (HW)
• FPGA-based storage I/O acceleration
  – Memory/storage backend and frontend FPGA implementation
  – Multi-kernel execution model
  – Scheduling for near-data processing
• Advantage:
  – No idea yet, as everything is still ongoing, sorry
• Disadvantage:
  – INFLEXIBLE, no choice for design exploration
  – Debug… debug… debug… and debug…
  – Slow… slow… slow… and slow
  – Expensive and long procurement process for testing: there are 7 more platforms on which we failed to achieve our goals
Challenges for Simulation Model
• Traditional system simulators (gem5, GPGPU-Sim) assume all data have been loaded into RAM before execution
[Figure: CPU cores with L2/L3 caches, memory controllers, and DRAM ranks/dies; no storage model.]
[Figure: memory hierarchy latencies: LD/ST unit 1 cycle, L1 cache 4 cycles, L2 cache 12 cycles, DRAM 230 cycles, SSD 60-800 us; the SSD access is extremely slow to simulate in detail.]
SimpleSSD
SimpleSSD: Modeling Solid State Drive for Holistic System Simulation
- A high-fidelity SSD simulation framework designed for educational purposes
- Free to download from http://SimpleSSD.camelab.org
SimpleSSD Overview
[Figure: full-system view: CPU cores with L1/L2 caches, memory controllers, DRAM ranks/dies, and a multi-channel SSD (controllers with Die 0 .. Die N). Step 1: the application executes a load (ldr r1, [r0]); the core's register file, ALU, and MEM stage issue the memory access. Related gem5 sources: dyn_inst_impl.hh/cc, o3_cpu_exec.hh/cc, base_dyn_inst.hh/cc, memhelpers.hh, cpu.hh/cc, iew_impl.hh/cc, lsq_impl.hh/cc, lsq_unit_impl.hh/cc.]
[Figure: step 2: the cache request reaches the L1D cache (index/tag decode, tag array, data array, sense amps, comparator). Related gem5 sources: port_interface.hh/cc, base.hh/cc, cacheset.hh, cache_impl.hh, cache.hh/cc, mshr.hh/cc, mshr_queue.hh/cc.]
[Figure: step 3: on an L1D cache miss, the request goes to the L2 cache (same lookup structure and gem5 cache sources as above).]
[Figure: step 4: on a further miss, the MMU and page table are consulted; a page fault goes through the I/O controller to the SSD, and the data are DMAed into main memory. Related gem5 sources: dram_ctrl.hh/cc, page_table.hh/cc, multi_level_page_table.hh/cc, multi_level_page_table_impl.hh.]
[Figure: step 5: main-memory accesses go through the DRAM controller (arbitration engine; CMD, WR, and RD queues; sequencing engine; PHY interface). Related gem5 sources: dram_ctrl.hh/cc, simple_mem.hh/cc, physical.hh/cc, addr_mapper.hh/cc.]
[Figure: step 6: class-level view of the path, with class LSQ_unit, class cache (L1, L2), class mem_ctrl, class FuncPageTable, class HIL, class FTL, and class PAL (multi-channel SSD), interacting through calls such as executeLoad/Store(inst), commitLoad/Store(inst), recvTimingReq/Resp(pkt), accessAndRespond(pkt, lat), changeFlag(paddr, size, alloc), SSDoperation(), fetchQueue(), setLatency(), and PAL_setLatency().]
Inside the Host Interface Layer (HIL)
• Target: provide a universal interface for the system simulator, trace generator, and RAID controller
• Add-on mode overview:
[Figure: in add-on mode, the system simulator sends an I/O request (address, size, curTick, opType) to the HIL; the HIL drives the FTL and PAM (data movement model), obtains the finishTick for the address, and stores the SSD access latency in a latency map table; the response carries (address, size, finishTick, opType).]
[Figure: on a later I/O access (e.g., a page fault), the HIL looks up the stored latency in the latency map table and reports the SSD delay = finishTick - curTick back to the system simulator.]
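A hedged sketch of the latency-map bookkeeping described above (the class and method names are illustrative, not SimpleSSD's actual sources):

#include <cstdint>
#include <unordered_map>

// Illustrative add-on mode bookkeeping: record the finish tick computed by the
// SSD model per address, then report the delay when the simulator accesses it.
class LatencyMap {
public:
    // Called when the SSD model (FTL/PAM path) finishes simulating a request.
    void store(uint64_t address, uint64_t finishTick) { map_[address] = finishTick; }

    // Called on a later I/O access; SSD delay = finishTick - curTick.
    uint64_t lookupDelay(uint64_t address, uint64_t curTick) const {
        auto it = map_.find(address);
        if (it == map_.end() || it->second <= curTick) return 0;  // already finished
        return it->second - curTick;
    }
private:
    std::unordered_map<uint64_t, uint64_t> map_;
};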
• Standalone mode overview:
[Figure: in standalone mode, trace files or a micro-benchmark generator dispatch I/O requests (address, size, curTick, opType) into the HIL's I/O queue; the HIL issues them to the FTL and PAM, inserts them into the queue, and issues the next request at the finishTick of the returned response (address, size, finishTick, opType).]
DEMO
Gem5FS-SimpleSSD (full system mode)
Software Dependencies:
• Linux
• mercurial
• scons
• swig
• gcc
• g++
• Python 2.6 or Python 2.7
• Protobuf
Image Booting Dependencies:
• Device Tree Blob: vexpress.aarch32.ll_20131205.0-gem5.4cpu.dtb
• Linux Kernel: vmlinux.aarch32.ll_20131205.0-gem5
• File System: aarch32-ubuntu-natty-headless.img
Compile software:
scons -j7 build/ARM/gem5.opt
Execution command:
./build/ARM/gem5.opt --debug-flags=IdeDisk,HIL,FTLOut,PAM2,GLOBALCONFIG -d ./configs/example/fs.py --num-cpu=4 --dtb-filename=vexpress.aarch32.ll_20131205.0-gem5.4cpu.dtb --disk-image=aarch32-ubuntu-natty-headless.img --kernel=vmlinux.aarch32.ll_20131205.0-gem5 --script=run_BC.rcS --SSD=1 --SSDconfig=rev_ch16_SATA.cfg
Inside the Flash Translation Layer (FTL)
• Target: provide SSD services such as I/O address mapping, wear leveling, garbage collection, etc.
• Overview:
[Figure: host requests (LBAs) from the HIL enter the FTL's I/O queue; mapping() translates LBAs to PPNs through the mapping table (direct, set-assoc, or full-assoc mapping) and pushes ReadTransaction()/WriteTransaction() to the PAM.]
[Figure: the FTL also maintains a free block pool; when it drops below GC_threshold, GarbageCollection() reclaims blocks, and wear leveling uses a min-heap (MinHeapWearLeveling()). The FTL sends requests to the PAM via SendRequest() and sets latencies back toward the HIL via SetLatency().]
Configurable parameters: FTLMapN, FTLMapK, FTLGCthreshold, FTLOP (over-provisioning)
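As a hedged sketch of how a GC trigger driven by FTLGCthreshold and the free block pool could look (the victim selection and data structures are simplified assumptions, not SimpleSSD's code):

#include <algorithm>
#include <vector>
#include <cstdint>

// Simplified FTL GC trigger: when the free block pool shrinks below the
// configured threshold, reclaim victims (most invalid pages first) until the
// pool is replenished. Illustrative only.
struct Block { uint32_t id; uint32_t invalidPages; uint32_t eraseCount; };

struct SimpleFtl {
    uint32_t FTLGCthreshold = 16;          // minimum free blocks before GC kicks in
    std::vector<Block> freePool;
    std::vector<Block> usedBlocks;

    void maybeCollect() {
        while (freePool.size() < FTLGCthreshold && !usedBlocks.empty()) {
            // Greedy victim selection: the block with the most invalid pages.
            auto victim = std::max_element(usedBlocks.begin(), usedBlocks.end(),
                [](const Block& a, const Block& b) { return a.invalidPages < b.invalidPages; });
            // Valid pages would be migrated here, then the block is erased.
            victim->invalidPages = 0;
            victim->eraseCount++;           // wear leveling would track this (e.g., in a min-heap)
            freePool.push_back(*victim);
            usedBlocks.erase(victim);
        }
    }
};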
• FTL mapping: supports direct mapping, set-assoc mapping, and full-assoc mapping by configuring FTLMapN and FTLMapK
Block layout based on N and K:
[Figure: the SSD's logical blocks are divided into data groups 0 .. DGN, each with N data blocks and K log blocks.]
Set-assoc Mapping (1 < N < max, 1 < K < max):
[Figure: the logical page address is split into a data group index, a block index, and a page index. The N data blocks of a group are block-mapped (LBN -> PBN); the K log blocks are page-mapped (DGN, LPN -> PPN).]
Direct Mapping (N = 1, K = 1):
[Figure: one data block and one log block per data group; the logical page address has only a data group index and a page index (no block index). Data blocks: DGN -> PBN; log blocks: page-mapped (DGN, LPN -> PPN).]
Full-assoc Mapping (N = K = max):
[Figure: a single data group holds the maximum number of data and log blocks; the logical page address has only a block index and a page index (no data group index). Data blocks: LBN -> PBN; log blocks: page-mapped (LPN -> PPN).]
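The logical page address split described above can be sketched as follows (a hedged example; the helper name and parameters are assumptions, not SimpleSSD's code):

#include <cstdint>

// Illustrative decomposition of a logical page number into (data group, block,
// page) indices for a set-assoc layout with N data blocks per group. For direct
// mapping, N = 1 (no block index); for full-assoc mapping there is one group.
struct LogicalAddr { uint64_t dataGroup, block, page; };

LogicalAddr splitLogicalPage(uint64_t lpn, uint64_t N, uint64_t pagesPerBlock) {
    LogicalAddr a;
    a.page      = lpn % pagesPerBlock;     // page index within a block
    uint64_t b  = lpn / pagesPerBlock;     // logical block number
    a.block     = b % N;                   // block index within the data group
    a.dataGroup = b / N;                   // data group index (DGN)
    return a;
}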
Inside the Page Allocation Module (PAM)
• Target: simulate SSD internal parallelism and the resource conflicts of the I/O bus and NAND flash memory
• Overview:
[Figure: the PPN produced by the FTL mapping is decomposed into channel (CH), package (PKG), die, and page fields; conflicts are tracked per channel and die (CH0/DIE0 .. DIE N).]
• Main functions:
  – PPN disassembly
  – Conflict simulation
[Figure: PPNs from the FTL are disassembled into CH/PKG/DIE fields; DMA and memory latencies are scheduled on a timeline with configurable parameters, replacing the traditional detailed model.]
Simplified latency simulation model
[Figure: the traditional model drives OpCode/Addr/Data and CMD/Data phases on the channel plus the memory operation on the die (tADL, CMD); the simplified model reduces this to pre-dma, mem_op, and post-dma phases. Configurable parameters: DMA frequency and command & address delay (pre-dma); SLC/MLC/TLC, read/write, MSB/CSB/LSB (mem_op); page size, channel, package, die (post-dma).]
• pre-dma: DMA operation on the CHANNEL (transfer in metadata [+ write page])
• mem-op: NAND flash memory island read/write operation on the DIE
• post-dma: DMA operation on the CHANNEL (transfer out metadata [+ read page])
• pre-dma/post-dma consume the CHANNEL resource; mem-op consumes the DIE resource
• The conflict model is separated into i) CHANNEL and ii) DIE
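A hedged sketch of the three-phase timing with the two-level (channel/die) conflict model (the busy-until bookkeeping and any values are illustrative assumptions, not SimpleSSD's PAL code):

#include <algorithm>
#include <map>
#include <cstdint>

// Illustrative PAM-style timing: a request occupies its CHANNEL for pre-dma,
// its DIE for mem-op, and the CHANNEL again for post-dma. Conflicts are modeled
// with per-channel and per-die "busy until" times (microseconds).
struct PamSim {
    std::map<int, uint64_t> channelBusyUntil;
    std::map<int, uint64_t> dieBusyUntil;     // keyed by a global die id

    // Returns the finish time of a request issued at 'now'.
    uint64_t issue(uint64_t now, int channel, int die,
                   uint64_t preDmaUs, uint64_t memOpUs, uint64_t postDmaUs) {
        // pre-dma waits for the channel (DMA conflict).
        uint64_t t = std::max(now, channelBusyUntil[channel]) + preDmaUs;
        channelBusyUntil[channel] = t;
        // mem-op waits for the die (MEM conflict); the channel is free meanwhile.
        t = std::max(t, dieBusyUntil[die]) + memOpUs;
        dieBusyUntil[die] = t;
        // post-dma takes the channel again.
        t = std::max(t, channelBusyUntil[channel]) + postDmaUs;
        channelBusyUntil[channel] = t;
        return t;
    }
};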
Simplified conflict simulation model
• DMA conflicts: IO #1 targets DIE 0 on CHANNEL 0 and IO #2 targets DIE 1 on CHANNEL 0
[Figure: the two requests' pre-dma and post-dma phases conflict on Channel 0 while their mem-ops (write/read) proceed in parallel on Die 0 and Die 1: a pre-dma/post-dma conflict example.]
• MEM conflicts: IO #1 and IO #2 both target DIE 0 on CHANNEL 0
[Figure: the requests conflict on the channel for DMA and on Die 0 for the mem-op: a pre-dma, mem-op, and post-dma conflict example.]
gem5FS-SimpleSSD
• Target: integrate SimpleSSD into gem5 full system mode
• Overview: leverage the disk interface provided by gem5
[Figure: in the gem5FS model, page faults and file reads/writes go to a simple disk latency calculator; in the gem5FS-SimpleSSD model, they go to the SimpleSSD simulator instead.]
DEMO
Gem5FS-SimpleSSD
Output files: config.ini, config.json, SimpleSSD.log, stats.txt, system.terminal (full system execution log)
DEMO
SimpleSSD-standalone
Software Dependencies:
• Linux
• g++
Compile software:
make
Execution command:
./ssdsim ssd_config_file microbench_config_file > SimpleSSD.log
Output files: SimpleSSD.log (SimpleSSD runtime statistics report)
Evaluation Samples
[Figure: instructions per cycle are better on SLC; the page cache (VFS) does not work well; massive I/O makes system call overheads significant; MLC is worse than TLC.]

Evaluation Samples
[Figure: CPU utilization is not impacted when storage accesses hit the page cache, but it is severely impacted by storage accesses with no locality.]
Educational Research Tools
• OpenNVM: http://opennvm.camelab.org
• SimpleSSD: http://simplessd.camelab.org
• NANDFlashSim: http://nfs.camelab.org
• Trace Repository: http://traces.camelab.org
References
• Ozone (O3): An Out-of-Order Flash Memory Controller Architecture, TC 2011
• Exploring Parallel Data Access Methods in Emerging Non-Volatile Memory Systems, TPDS
• Unleashing the Potentials of Dynamism for Page Allocation Strategies in SSDs, SIGMETRICS
• Exploring and Exploiting the Multilevel Parallelism Inside SSDs for Improved Performance and Endurance, TC
• Design Tradeoffs for SSD Performance, ATC'09
• ParaFS: A Log-Structured File System to Exploit the Internal Parallelism of Flash Devices, ATC'16
• Sprinkler: Maximizing Resource Utilization in Many-Chip Solid State Disks, HPCA'14
• Taking Garbage Collection Overheads off the Critical Path in SSDs, Middleware'12
• An In-Depth Study of Next Generation Interface for Emerging Non-Volatile Memories, NVMSA
• HIOS: A Host Interface I/O Scheduler for Solid State Disks, ISCA'14
• NVMMU: A Non-Volatile Memory Management Unit for Heterogeneous GPU-SSD Architectures, PACT'15
• SimpleSSD: Modeling Solid State Drive for Holistic System Simulation, CAL 2017
• Exploiting Request Characteristics and Internal Parallelism to Improve SSD Performance, ICCD'15
• Performance Analysis of NVMe SSDs and their Implication on Real World Databases, SYSTOR'15