TRANSCRIPT
FTL Design Exploration in Reconfigurable High-Performance SSD (RHPSSD) for Server Applications
International Conference on Supercomputing
June 12, 2009
Ji-Yong Shin¹,², Zeng-Lin Xia¹, Ning-Yi Xu¹, Rui Gao¹,
Xiong-Fei Cai¹, Seungryoul Maeng², and Feng-Hsiung Hsu¹
¹Microsoft Research Asia  ²Korea Advanced Institute of Science and Technology
Introduction and Background (1/3)

Growing popularity of flash memory and SSD
- Low latency
- Low power
- Solid-state reliability

SSD widening its range of application
- Embedded devices
- Desktop and laptop PCs
- Servers and supercomputers

SSD expected to revolutionize the storage subsystem
Introduction and Background (2/3)

Flash memory
- Erase needed before write
- Unit of read/write and erase differs
  - Read/write: page (typically 2 to 4KB)
  - Erase: block (typically 64 pages)
- Latency for read, write, and erase differs
  - Read (25us) < write (250us) < erase (500us)
- Erase carried out on demand: cleaning or garbage collection
- Wear leveling necessary
  - Memory cells wear out when erased
  - Typically a block endures 100K erase operations
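As a rough illustration of these constraints (a minimal sketch, not the simulator used in the talk; all names are invented), a block must be erased before its pages can be rewritten, and each operation pays its own latency:

```python
# Toy flash block obeying the erase-before-write rule with the
# asymmetric latencies quoted above (read 25us < write 250us < erase 500us).
READ_US, WRITE_US, ERASE_US = 25, 250, 500
PAGES_PER_BLOCK = 64

class Block:
    def __init__(self):
        self.next_free = 0       # pages are written in order within a block
        self.erase_count = 0     # wear: a block endures ~100K erases
        self.elapsed_us = 0

    def read_page(self):
        self.elapsed_us += READ_US

    def write_page(self):
        """Write the next free page; erase the whole block first if full."""
        if self.next_free == PAGES_PER_BLOCK:
            self.erase()
        self.next_free += 1
        self.elapsed_us += WRITE_US

    def erase(self):
        self.next_free = 0
        self.erase_count += 1
        self.elapsed_us += ERASE_US

blk = Block()
for _ in range(65):              # the 65th write forces an erase
    blk.write_page()
print(blk.erase_count)           # 1
print(blk.elapsed_us)            # 65 * 250 + 500 = 16750
```

The single on-demand erase already costs twice a page write, which is why cleaning frequency matters so much below.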
[Figure: Flash memory]
Introduction and Background (3/3)

Flash translation layer (FTL)
- Provides abstraction of flash memory characteristics
- Maintains logical-to-physical address mapping
- Carries out cleaning operations
- Conducts wear leveling

FTL in a multiple flash chip environment
- Manages parallelism and wear level among chips
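The mapping and cleaning duties above can be sketched as a toy page-level FTL (an illustrative model, not the talk's implementation; all names are hypothetical):

```python
# Hypothetical page-level FTL: logical pages map to (plane, physical page);
# updates go out-of-place, invalidating the old copy for later cleaning.
class PageFTL:
    def __init__(self, num_planes, pages_per_plane):
        self.pages_per_plane = pages_per_plane
        self.mapping = {}                  # logical page -> (plane, page)
        self.next_free = [0] * num_planes  # append-only write point per plane
        self.invalid = set()               # stale physical pages awaiting cleaning

    def write(self, lpn, plane):
        assert self.next_free[plane] < self.pages_per_plane, "plane full: clean first"
        if lpn in self.mapping:
            self.invalid.add(self.mapping[lpn])  # old copy becomes garbage
        ppn = (plane, self.next_free[plane])
        self.next_free[plane] += 1
        self.mapping[lpn] = ppn
        return ppn

    def read(self, lpn):
        return self.mapping[lpn]

ftl = PageFTL(num_planes=4, pages_per_plane=1024)
ftl.write(lpn=7, plane=0)
ftl.write(lpn=7, plane=1)    # update: old copy on plane 0 is invalidated
print(ftl.read(7))           # (1, 0)
print(ftl.invalid)           # {(0, 0)}
```

With many chips, the FTL's choice of `plane` for each write is exactly the allocation question explored later in the talk.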
[Diagram: The host machine issues IO requests to the FTL, which dispatches flash requests to multiple flash memory modules]
Motivation (1/2)

Servers and supercomputing environments
- High-performance storage subsystem required
- Applications are usually fixed

SSD performance characteristics
- Highly dependent on FTL design and workloads

A customized SSD can boost servers and supercomputers
Motivation (2/2)

Our focus
- High-performance SSD with abundant resources
- FTL design tradeoffs using different algorithms in each functionality
- Customizing the FTL considering flash memory and workload characteristics

Based on the Reconfigurable High-Performance SSD architecture, we will explore FTL design considerations and tradeoffs and propose guidelines for customizing the FTL

Related work
- Flash memory for embedded systems or generic SSDs
- Internal hardware organizational tradeoffs of SSDs [Agrawal et al., USENIX 08]
- Configuring RAID systems considering disk and workload characteristics
Reconfigurable High-Performance SSD (RHPSSD)

RHPSSD architecture
- High performance
  - 36 independent flash channels
  - 4GB/s PCI Express host-to-SSD interface
- Flexibility from an FPGA for reconfiguring the FTL
[Diagram: A 4GB/s PCI Express link connects the host to an FPGA containing the FTL/flash controller and a controller for each flash channel, backed by random access memory; flash daughter boards hold many flash chips with independent channels, each chip comprising dies with multiple planes]
1. Maintaining high parallelism for performance
2. Wear leveling for endurance
   a. Among all blocks
   b. Among chips, dies, and planes
FTL Design Exploration and Analysis

Simulation-based method to discover:
1. Logical page to physical flash plane allocation
2. Effect of hot/cold data separation
3. Wear leveling and cleaning
   1. Cleaning analysis for different allocations
   2. Wear leveling in different clusters
Simulation Environment and Workloads (1/2)

Simulation environment
- Modified DiskSim 4.0 and the SSD plug-in of MSR SVC
- Various FTL algorithms implemented

Basic configurations
- RHPSSD architecture
- Flash chip
  - Latencies (read: 25us, write: 250us, erase: 500us)
  - Two types of chip for different SSD capacities
    - 4GB chip (2 dies with 2 planes)
    - 8GB chip (4 dies with 2 planes)
Simulation Environment and Workloads (2/2)

Traces used for simulation (each workload is characterized along four axes: Sequential, Random, Highly IO intensive, High data locality):

  Workload   Characteristics   SSD Setting
  Postmark   O O               144GB SSD
  IOzone     O O O             144GB SSD
  WebDC      O O               288GB SSD
  TPC-C      O O               288GB SSD
  SQL        O O               288GB SSD
  Exchange   O                 2 x 288GB SSD
Logical Page to Physical Plane Allocation (1/2)

Allocation is directly related to parallelism

Static allocation
- Binds a logical page address to a specific plane
- Striping methods
  - Wide striping, page striping unit: high parallelism, more cleaning
  - Narrow striping, block striping unit: low parallelism, less cleaning

Dynamic allocation
- Allocates a page request to an idle plane at runtime
- Binding logical address to
  - Chip: less degree of freedom
  - SSD: maximum degree of freedom

[Figure: Wide striping vs. narrow striping]
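The two static schemes can be sketched as simple address-to-plane functions (an illustrative sketch with assumed parameters, not the talk's implementation):

```python
# Static allocation sketch: where a logical page lands under page-striped
# (wide) vs block-striped (narrow) placement across planes.
PAGES_PER_BLOCK = 64

def plane_page_striped(lpn, num_planes):
    # Consecutive pages round-robin across planes -> high parallelism,
    # but a logical block's pages scatter, so cleaning touches more planes.
    return lpn % num_planes

def plane_block_striped(lpn, num_planes):
    # Consecutive blocks round-robin; all pages of a block stay on one
    # plane -> less parallelism, but cleaning stays local.
    return (lpn // PAGES_PER_BLOCK) % num_planes

planes = 4
seq = list(range(8))
print([plane_page_striped(p, planes) for p in seq])   # [0, 1, 2, 3, 0, 1, 2, 3]
print([plane_block_striped(p, planes) for p in seq])  # [0, 0, 0, 0, 0, 0, 0, 0]
```

A sequential run of 8 pages engages all 4 planes under page striping but only one plane under block striping, which is the parallelism-vs-cleaning tradeoff the slide describes.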
Logical Page to Physical Plane Allocation (2/2)

[Chart: Response time normalized to STATIC W-PAGE for each workload and allocation scheme]
Hot/Cold Data Separation (1/2)

Separating pages according to temperature in each plane
- Blocks with hot data are likely to become full of invalid pages
- Blocks with cold data are likely to maintain their condition

Known to reduce erase operations and valid page migration
- Also leads to smaller response times
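One simple way to realize the separation (a hypothetical sketch; the talk does not prescribe this particular classifier) is to rank logical pages by recent write counts and treat the top fraction as hot:

```python
# Classify logical pages as hot or cold by write frequency; directing hot
# writes to their own blocks concentrates invalid pages there, making those
# blocks cheap to clean.
from collections import Counter

def split_hot_cold(write_trace, hot_fraction=0.2):
    """write_trace: sequence of logical page numbers that were written.
    Returns (hot pages, cold pages); hot_fraction is an assumed threshold."""
    counts = Counter(write_trace)
    ranked = [lpn for lpn, _ in counts.most_common()]
    cutoff = max(1, int(len(ranked) * hot_fraction))
    return set(ranked[:cutoff]), set(ranked[cutoff:])

trace = [1, 1, 1, 1, 2, 3, 4, 5, 1, 2]
hot, cold = split_hot_cold(trace)
print(hot)    # {1}
print(cold)   # {2, 3, 4, 5}
```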
Hot/Cold Data Separation (2/2)

[Chart: Improvement after applying the separation (%)]
Wear Leveling and Cleaning

High performance and the wear level of the SSD are a different story

Static allocation
- Logical addresses are bound to a plane, so no page migration can take place outside the dedicated plane (only local wear leveling)
- Selecting an allocation that evenly wears out each plane is important

Dynamic allocation
- Wear leveling can be carried out in different clusters (chip, SSD)
- A cluster is the scope within which the lifetime of blocks is kept even
- The larger the cluster, the more even the wear level of the SSD as a whole
- The larger the cluster, the greater the overhead
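Cleaning cost is dominated by migrating the victim block's remaining valid pages, which a greedy victim picker makes concrete (an illustrative sketch using the latencies quoted earlier, not the talk's algorithm):

```python
# Greedy cleaning sketch: pick the block with the most invalid pages,
# migrate its valid pages (read + rewrite each), then erase it. The
# migration term is why hot/cold separation pays off: hot blocks hold
# few valid pages when cleaned.
READ_US, WRITE_US, ERASE_US = 25, 250, 500
PAGES_PER_BLOCK = 64

def clean(invalid_counts):
    """invalid_counts: invalid-page count per 64-page block.
    Returns (victim block index, cleaning cost in us)."""
    victim = max(range(len(invalid_counts)), key=lambda i: invalid_counts[i])
    valid = PAGES_PER_BLOCK - invalid_counts[victim]
    cost = valid * (READ_US + WRITE_US) + ERASE_US
    return victim, cost

print(clean([10, 60, 30]))   # (1, 4 * 275 + 500) = (1, 1600)
print(clean([64, 0, 0]))     # fully invalid victim: erase only, (0, 500)
```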
[Chart: Number of cleaning operations and erase distribution without wear leveling; number of operations normalized to W-Page]
Wear Leveling in Different Clusters

Wear leveling cluster
- Group of blocks within which the wear leveling algorithm keeps the age even
- The larger the cluster, the worse the performance becomes
- The larger the cluster, the more even the ages of blocks are
[Charts: Response time, average lifetime, and standard deviation of wear at overall, chip, die, and plane scope, for chip-level clusters (Chip P, Chip C) and SSD-level clusters (SSD P, SSD S)]
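The cluster-size effect on wear evenness can be illustrated with a toy model (hypothetical numbers and names; it assumes perfect leveling inside each cluster and measures what unevenness remains SSD-wide):

```python
# Toy wear-leveling-cluster model: leveling inside a cluster equalizes its
# blocks to the cluster mean; the SSD-wide stddev of erase counts that
# remains shrinks as the cluster grows (at the cost of more migration).
import statistics

def overall_stddev_after_leveling(erase_counts, cluster_size):
    leveled = []
    for i in range(0, len(erase_counts), cluster_size):
        cluster = erase_counts[i:i + cluster_size]
        leveled += [statistics.mean(cluster)] * len(cluster)
    return statistics.pstdev(leveled)

wear = [10, 50, 20, 40, 30, 70, 60, 80]           # hypothetical per-block erase counts
print(overall_stddev_after_leveling(wear, 2))      # small clusters: unevenness remains
print(overall_stddev_after_leveling(wear, 8))      # whole-SSD cluster: 0.0
```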
Summary

Static vs. dynamic allocation
- Static wide striping: dominant sequential IO workloads
  - Page striping unit: small response time, more cleaning
  - Block striping unit: large response time, less cleaning
  - Trade-off between response time and cleaning operations
- Dynamic: dominant random IO workloads

Hot/cold data separation
- Effective for evenly distributed IO

Wear leveling cluster
- Large cluster: large overhead, even distribution of wear level
- Small cluster: small overhead, uneven distribution of wear level
- Trade-off between response time and even wear level
Conclusion

Algorithms in each FTL functionality were studied for the high-performance SSD

Tradeoffs and simple guidelines were presented for designing a customized FTL under different workloads and SSD lifetime requirements

Please read the paper for more details
Thank you. Questions?