
TLC/MLC NAND Flash Mix-and-Match Design with Exchangeable Storage Array

Shogo Hachiya1, Koh Johguchi1, Kousuke Miyaji1,2 and Ken Takeuchi1

1 Chuo Univ., 1-13-27 Kasuga, Bunkyo-ku, Tokyo 112-8551, Japan
2 Shinshu Univ.
E-mail: [email protected]

Abstract

This paper proposes a TLC/MLC NAND flash mix-and-match design method for an exchangeable storage array. The proposed Round-Robin frozen data collection achieves 56% higher write performance and a 29% write energy reduction compared with the conventional MLC-only SSD. An SSD card exchange method is also presented to realize sustainable and flexible storage arrays.

1. Introduction

Recently, NAND flash based solid-state drives (SSDs) have been widely used in high-speed, low-power storage arrays at data centers that store big data. To save bit cost, triple-level-cell (TLC) NAND is now available for SSDs instead of multi-level-cell (MLC) NAND [1]. However, writing TLC is complicated and slow because of block unit writing [2, 3]. Alternatively, a ReRAM/MLC hybrid SSD has been proposed in [4, 5], but ReRAM is still expensive. Hence, as a low-cost storage array application, we propose a TLC/MLC hybrid NAND flash SSD storage array with wireless SSD cards [6] (Fig. 1). With the Round-Robin frozen data collecting algorithm (RR-FDCA) (Fig. 2), MLC NAND flash not only works as a simple buffer for TLC but also aggressively screens extremely cold data, which is rarely accessed by the host. To reduce TLC access, only the screened static frozen data are stored in TLC. With the proposed RR-FDCA, SSD write performance is enhanced by up to 56% compared with the conventional all-MLC SSD. Additionally, an SSD card exchange method that does not require storage system suspension is proposed. With the proposed method, system administrators can also choose the optimum TLC/MLC capacity ratio and the round number, which corresponds to the screening period, to realize the best mix-and-match design configuration of the storage array.

2. TLC NAND Flash Memories and Proposed Round-Robin Frozen-Data Collection Algorithm

Fig. 3 gives the measurement results of TLC and MLC NAND flash memories. Compared with MLC, the program disturb and data retention errors of TLC are significantly increased. Fig. 4 shows the block and page definitions of TLC NAND. The page and the block are the write unit and the erase unit of NAND flash, respectively. The page size depends on the word-line (WL) length. In TLC, one word-line corresponds to three pages. Fig. 5 illustrates the write order and VTH distribution of TLC. Because of the small VTH margin and the many reference voltages needed for read in TLC, the read latency becomes long and the number of write verify iterations increases by 2.7 times. Moreover, due to the complicated write order [2], TLC requires block unit writing. This degrades write performance because, if an overwrite occurs in TLC, all valid pages in the TLC block have to be read and written to another block (block-modify-write). Table I summarizes the specifications of MLC and TLC NAND flash memories [7]. The read/write latencies of TLC are longer than those of MLC, and the TLC write unit is a block. Therefore, to deal with TLC's characteristics, a TLC/MLC storage array and a suitable control scheme are proposed as shown in Figs. 1 and 2.

Fig. 6 explains the Round-Robin page/block selection algorithm (RR algorithm) used in this paper. The target block for write or erase is selected in cyclic order, and Blk#0 becomes the next target once Blk#127 is used up. The conventional garbage collection (GC) is shown in Fig. 7. If there are not enough blank blocks, GC is triggered and the erase target block is selected according to the RR algorithm. All valid pages in the selected block are moved to other blocks. However, this conventional RR algorithm does not include any hot/cold data screening. Hot and cold data are defined as data that are accessed frequently and rarely, respectively. If hot data are moved to TLC, system performance and reliability can be degraded due to TLC's write latency and W/E cycle limitation. Furthermore, hot data induce block-modify-writes, which degrade performance further.
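The conventional GC with Round-Robin victim selection described above can be sketched as follows. This is only a minimal illustration: the function names, the `threshold` parameter and the list-of-counters block model are assumptions made for the sketch and are not taken from the simulator used in this work.

```python
NUM_BLOCKS = 128  # Blk#0 .. Blk#127, as in Fig. 6


def next_victim(current):
    """Round-Robin selection: Blk#0 follows Blk#127."""
    return (current + 1) % NUM_BLOCKS


def conventional_gc(valid_pages, victim, blank_blocks, threshold):
    """Run one conventional GC pass if blank blocks are scarce.

    valid_pages  : list holding the number of valid pages of each block
    victim       : index of the current RR erase target
    blank_blocks : current number of blank blocks
    threshold    : hypothetical trigger level (GC runs below it)
    Returns (pages_moved, next_victim_index).
    """
    if blank_blocks >= threshold:
        return 0, victim                 # GC is not triggered
    moved = valid_pages[victim]          # Steps 1-2: read and copy valid pages
    valid_pages[victim] = 0              # Step 3: erase the victim block
    return moved, next_victim(victim)    # Step 4: select the next candidate
```

Note that every valid page of the victim is copied regardless of how often it is accessed, which is exactly the missing hot/cold screening pointed out above.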

Thus, RR-FDCA is proposed as shown in Fig. 8. When there are not enough blank MLC blocks, RR-FDCA starts and selects two victim blocks for GC. If the number of valid pages in the selected MLC blocks is less than the page number per block (PNPB) of TLC (NPNPB,TLC), RR-FDCA collects more valid pages, up to two MLC blocks' worth (2×NPNPB,MLC), so that the accumulated valid pages reach NPNPB,TLC. After the block copy, the corresponding blocks are erased, and the round number for cold data screening, NRDS, is set to 0 for the erased blocks and to 1 for the copied blocks (Steps A1-A4). When the accumulated number of valid pages is large enough, RR-FDCA proceeds to the cold data screening process. First, RR-FDCA screens cold data by skipping the copy (Steps B1-B2). During the rounding period, which corresponds to NRDS, cold data are screened because any data accessed during the period are moved to another MLC block. Then the screened cold data, i.e. the static frozen data, are moved to TLC, and the selected blocks are erased with NRDS reset to 0 (Steps C1-C5 in Fig. 8). If residual pages remain after the frozen data eviction to TLC, they are written back to other MLC blocks (Step C3). As a result, only the static frozen data are stored in TLC and the number of TLC accesses is drastically reduced, which improves system performance.

3. Results of Proposed RR-FDCA

The write performance of the proposed storage array is evaluated with a transaction-level modeling simulator [4]. Fig. 9 shows the write frequency distribution of tpcc-mysql [8], which is used as the benchmark in the simulation. Fig. 10 depicts the hot/cold data distributions. In the conventional MLC-only array, all data are stored without distinction between hot and cold. Meanwhile, the proposed storage array with RR-FDCA effectively collects cold data and moves the static frozen data to TLC. Fig. 11 shows the performance improvement as a function of NRDSMAX, the threshold of NRDS. In this simulation, the number of chips is fixed so that the evaluation is made under the same cost constraint. When NRDS = 1, the proposed RR-FDCA improves the throughput by 56% and reduces the energy by 29%. In addition, NW/E,MLC decreases by 36% compared with MLC-only NAND. Fig. 12 shows the TLC/MLC capacity dependence. The results give the optimum design point, which is a TLC capacity of 50% in the tpcc-mysql case. In the conventional case, moving cold data by GC degrades the throughput and energy. In the proposal, by contrast, MLC free space is not occupied by cold data. The slow speed and small NW/E,TLC of TLC do not affect system performance and reliability because TLC is accessed rarely. Therefore, the proposed TLC/MLC storage array achieves higher system performance than MLC-only storage despite the modest characteristics of TLC.

4. Chip Exchange Method without System Suspend

If a NAND chip reaches its W/E cycle limit, the degraded chip must be exchanged. Hence, a chip exchange method is proposed (Fig. 13). In the conventional system, the system must be suspended during the enforced data copy when a chip is replaced. Assuming 8 GB of valid pages in the degraded chip, the continuous dead time, which corresponds to the system suspend time, reaches 960 s (Fig. 14), which is unacceptably long. In the proposed method, the degraded chip continues to be used in read-only mode, and its valid pages are gradually evicted by normal overwrite or GC operations (Steps 2-3). Once all valid pages have been evicted, the degraded chip is ready for exchange (Step 4). As a result, the continuous dead time of the proposal is only the ~200 ms required for GC. With the read-only mode and the proposed method, NAND chips can be removed and replaced. Thus, the proposed exchange method makes it possible to choose the best NAND configuration to enhance performance or save cost.

5. Conclusions

Table II summarizes this work. For the tpcc-mysql trace, RR-FDCA realizes a 56% write throughput enhancement and a 29% energy saving with a 37% W/E cycle decrease. For the prxy_1 trace [9], the optimum TLC capacity percentage changes to 87.5%, and the system cost can therefore be decreased to 71% of the conventional cost (Table II). In general, the optimum organization changes according to usage or application. In addition, the W/E cycle limitation and chip price of NAND depend on the generation. Since the chip organization of the proposed storage array and NRDS can be changed by the proposed methods, system administrators can select the best storage parameters for performance or cost. Table II also shows the ratio of TLC to MLC NAND W/E cycles. The W/E cycle count of TLC NAND is acceptably lower than the requirement (NW/E,TLC/NW/E,MLC < 1/100).

References
[1] J. da Silva, presented in Samsung SSD Global Summit, 2012.
[2] S.H. Shin, Dig. Tech. Papers, Symp. VLSI Circuits 2012, pp. 132-133, 2012.
[3] I.J. Chang, IEICE ELEX, 9, 23, pp. 1775-1779, 2012.
[4] H. Fujii, Dig. Tech. Papers, Symp. VLSI Circuits 2012, pp. 132-133, 2012.
[5] C. Sun, Abst. NVMTS2012, pp. 87-88, 2012.
[6] W.J. Yun, Dig. Tech. Papers, ISSCC2012, pp. 52-53, 2012.
[7] T. Vali, presented in 1st Micro and Nano Electronics Workshop, 2010.
[8] tpcc-mysql, https://code.launchpad.net/~percona-dev/perconatools/tpcc-mysql
[9] MSR Cambridge Traces, http://iotta.snia.org/traces/388

Extended Abstracts of the 2013 International Conference on Solid State Devices and Materials, Fukuoka, 2013, H-3-3, pp. 894-895

Fig. 3 Measurement results of MLC and TLC NAND. (a) Program disturb error (a.u.) versus write/erase cycle (a.u.). (b) Data retention error (a.u.) versus time (a.u.).

Table I MLC/TLC NAND specifications [7].

Item                                   MLC NAND        TLC NAND
Read latency (Typ.)                    50 μs/page      100 μs/page
Write latency (Avg.)                   900 μs/page     2400 μs/page
Erase latency (Typ.)                   2000 μs/block   3000 μs/block
Read unit                              Page            Page
Write unit                             Page            Block
Page number per block (PNPB), NPNPB    128             192
Bit cost (Normalized)                  1               0.66
W/E cycle limitation                   10000 cycles    1000 cycles

Annotated TLC/MLC ratios: write latency ×2.7, erase latency ×1.5, W/E cycle limitation ×0.1.
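To make the cost of block unit writing concrete, the latencies of Table I can be plugged into a small calculation. The sketch below is a rough upper-bound estimate under stated assumptions: operations are strictly serial, the whole TLC block is valid, and the erase of the old block is not counted. The dictionary layout and the function name are illustrative only.

```python
# Latencies in microseconds and block sizes, taken from Table I.
MLC = {"read_us": 50, "write_us": 900, "erase_us": 2000, "pages_per_block": 128}
TLC = {"read_us": 100, "write_us": 2400, "erase_us": 3000, "pages_per_block": 192}


def block_modify_write_us(spec, valid_pages):
    """Serial time to read `valid_pages` pages and write them to another block."""
    return valid_pages * (spec["read_us"] + spec["write_us"])


# Worst case for TLC: the whole block (192 pages) has to be relocated.
tlc_worst_us = block_modify_write_us(TLC, TLC["pages_per_block"])

# For comparison, an overwrite in MLC needs only one page write.
mlc_page_write_us = MLC["write_us"]

# TLC/MLC ratios corresponding to the annotations of Table I.
write_ratio = round(TLC["write_us"] / MLC["write_us"], 1)
erase_ratio = round(TLC["erase_us"] / MLC["erase_us"], 1)
```

Under these assumptions a single overwrite that lands in TLC costs about 480 ms, against 0.9 ms for an MLC page write, which illustrates why hot data should be kept out of TLC.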

Fig. 4 Cell array and block/page organization of TLC NAND flash memory. A block consists of word-lines WL0 to WL63 between the two select gates, with bit lines and a source line. One WL corresponds to three pages in TLC NAND, giving pages #0 to #191 per block.

Fig. 1 Proposed storage array. Wireless SSD cards [6] without a NAND controller are used for storage. The cards are exchangeable and mountable for both MLC and TLC, forming an exchangeable storage system. The proposed storage array controller implements the Round-Robin frozen data collecting algorithm and the SSD card exchange method.

Fig. 7 Conventional garbage collection (GC), executed by the storage array controller on the blocks (Blk#0 to Blk#127) of an SSD card. When the blank block number becomes small, GC is executed.
Step 1. Read valid pages
Step 2. Copy to new pages
Step 3. Erase
Step 4. Select next candidate (Blk#1) for GC

Fig. 8 Proposed Round-Robin frozen data collecting algorithm (RR-FDCA).

NRDS: round number for each block, corresponding to screening time (see Fig. 6).
NRDSMAX: threshold of NRDS for data moving to TLC.

If blank blocks are not enough, RR-FDCA is triggered and selects two blocks for GC.

Many valid pages? (> NPNPB,TLC)
  No: cold data collection in MLC (Fig. 2(A))
    Step A1. Collect valid pages until 2NPNPB,MLC pages
    Step A2. Copy valid pages to MLC
    Step A3. Erase corresponding blocks
    Step A4. Set NRDS = 0 for erased blocks and NRDS = 1 for copied blocks
  Yes: enough screening period? (NRDS reaches NRDSMAX)
    No: cold data screening (Fig. 2(B))
      Step B1. Increment NRDS for skipped blocks
      Step B2. Select next candidate for GC
    Yes: frozen data moving to TLC (Fig. 2(C))
      Step C1. Read valid pages in two selected blocks
      Step C2. Copy valid pages to TLC
      Step C3. Write back the overflow pages to MLC
      Step C4. Erase the selected blocks
      Step C5. Set NRDS = 0 for erased blocks
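The decision flow of Fig. 8 can be sketched as a single function. This is a simplified illustration, not the controller implementation: the `Block` class and `rr_fdca_step` are hypothetical names, the sketch assumes that screening is complete when every selected block has NRDS at or above NRDSMAX, and Step A1 is reduced to the two selected blocks instead of collecting pages from further blocks.

```python
from dataclasses import dataclass

N_PNPB_MLC = 128  # pages per MLC block (Table I)
N_PNPB_TLC = 192  # pages per TLC block (Table I)


@dataclass
class Block:
    valid_pages: int
    n_rds: int = 0  # round number for cold data screening


def rr_fdca_step(victims, n_rds_max):
    """Handle the two selected victim blocks once.

    Returns (action, pages_to_tlc, pages_to_mlc) and updates the victims.
    """
    valid = sum(b.valid_pages for b in victims)

    if valid <= N_PNPB_TLC:
        # Steps A1-A4: not enough valid pages, compact them inside MLC.
        # The destination blocks (not modelled here) would start at N_RDS = 1.
        for b in victims:
            b.valid_pages, b.n_rds = 0, 0   # erased blocks
        return ("collect", 0, valid)

    if min(b.n_rds for b in victims) < n_rds_max:
        # Steps B1-B2: screening period not over, skip the copy.
        for b in victims:
            b.n_rds += 1
        return ("screen", 0, 0)

    # Steps C1-C5: frozen data fill one TLC block, the overflow returns to MLC.
    to_tlc = min(valid, N_PNPB_TLC)
    for b in victims:
        b.valid_pages, b.n_rds = 0, 0
    return ("evict", to_tlc, valid - to_tlc)
```

With two blocks holding 120 and 100 valid pages and NRDSMAX = 3, the sketch skips the blocks until their round number reaches 3 and then sends 192 pages to TLC and 28 pages back to MLC.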

Fig. 12 TLC/MLC capacity ratio dependence of write performance (a) and MLC/TLC NAND W/E cycles (b) for proposed storage system. Panel (a) plots write throughput (MB/s) and write energy (J/GB) against TLC capacity percentage (%). Panel (b) plots MLC NAND W/E cycles, NW/E,MLC, and NW/E,TLC/NW/E,MLC (%) against TLC capacity percentage (%). Both panels use NRDSMAX = 3, and the optimum capacity ratio is marked.

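Fig. 6 RR block selection approach. Blocks Blk#0 to Blk#127 are selected in cyclic order (MLC/TLC): the 1st, 2nd, ..., 128th selections cover one round, and the 129th and 130th selections return to Blk#0 and Blk#1 in the next round. Within a block, pages #0 to #127 are selected in order (MLC only), so that old data precede new data.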
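The selection order of Fig. 6 amounts to simple modular arithmetic, sketched below. The function name is hypothetical, and numbering the rounds from 1 is an assumption made for the illustration.

```python
NUM_BLOCKS = 128  # Blk#0 .. Blk#127 (Fig. 6)


def rr_target(n):
    """Map the n-th selection (1-based) to (block index, round number)."""
    if n < 1:
        raise ValueError("selection order is 1-based")
    return (n - 1) % NUM_BLOCKS, (n - 1) // NUM_BLOCKS + 1
```

For instance, the 128th selection is Blk#127 in the first round, and the 129th selection wraps around to Blk#0 in the second round.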

Fig. 10 Hot/cold data distribution in conventional and proposed storage systems. Panels: Conv. (MLC NAND only); Prop. at Step B2 (collected and screened frozen data in MLC NAND, with TLC NAND); Prop. after Step C4 (frozen data moved to TLC NAND).

Fig. 11 (a) Write performance and (b) average W/E cycles of MLC/TLC NAND for conventional and proposed storage arrays. NRDSMAX corresponds to screening period. Panel (a) plots write throughput (MB/s) and write energy (J/GB), and panel (b) plots MLC NAND W/E cycles, NW/E,MLC, and NW/E,TLC/NW/E,MLC (%), for the conventional MLC-only array (MLC) and for the proposal at NRDSMAX = 0 to 4. Annotations in the plots: +58%, -29%, -37%, 2.8%, 10.6%, +22%, -17%, -39%, +56%, -29%, -39%, 3.9%.

Fig. 13 Proposed NAND chip exchange method without performance degradation. The SSD cards (SSD card#1 to #3 and a spare) are connected through a wireless NAND I/F. NAND chips do not reach the W/E cycle limit simultaneously. If the W/E cycle count reaches the limitation, the degraded chip should be exchanged.
Conventional method: after the initial state (Seq. 1), the valid pages of the degraded card are moved by an enforced copy (Seq. 2, 3). The system stops since access requests from the host are denied due to the enforced copy. After the enforced copy, SSD card#1 is exchanged (Seq. 4).
Proposed method:
Seq. 1 Initial state: SSD card#1 reaches the W/E cycle limitation.
Seq. 2 Valid page moving: the degraded card is read only; in case of overwrite, valid pages are moved.
Seq. 3 RR GC: finally, all valid pages in the degraded chip are copied to other chips by RR GC.
Seq. 4 Final state: the degraded card is ready for exchange.

Table II. Summary of this work.

                                               Conv.                     This work
Benchmark                                      tpcc-mysql   prxy_1       tpcc-mysql     prxy_1
Total capacity (a.u.)                          1            1            1              1
Optimum TLC NAND capacity (%)                  N/A          N/A          50             87.5
Cost                                           1            1            0.83           0.71
NRDS                                           N/A          N/A          3              3
Write throughput (MB/s) (normalized by conv.)  5.36 (1)     7.52 (1)     8.36 (1.56)    10.18 (1.35)
Write energy (J/GB) (normalized by conv.)      61.61 (1)    (1)          43.8 (0.71)    42.5 (0.82)
MLC NAND W/E cycles (Cycle/GB)                 1678         1285         1026           1009
TLC NAND W/E cycles (Cycle/GB)                 N/A          N/A          40.3           0.82
W/E cycles ratio (TLC/MLC)                     N/A          N/A          3.92%          0.08%
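The normalized entries of Table II can be reproduced from the raw values with a few lines of arithmetic. The cost function below is an assumption and not a formula given in this work: it supposes that the array cost is the capacity-weighted average of the normalized bit costs, and that the TLC bit cost of 0.66 in Table I is 2/3 (two bits versus three bits per cell) truncated to two digits.

```python
TLC_BIT_COST = 2 / 3  # assumed exact value behind the 0.66 of Table I


def array_cost(tlc_fraction):
    """Normalized cost of an array with the given TLC capacity fraction (assumed model)."""
    return (1 - tlc_fraction) * 1.0 + tlc_fraction * TLC_BIT_COST


# Write throughput normalized by the conventional array (MB/s values from Table II).
throughput_tpcc = round(8.36 / 5.36, 2)
throughput_prxy = round(10.18 / 7.52, 2)

# TLC/MLC W/E cycle ratio in percent for tpcc-mysql (Cycle/GB values from Table II).
we_ratio_tpcc = 40.3 / 1026 * 100

# Relative reduction of MLC W/E cycles for tpcc-mysql, computed from Table II.
mlc_we_reduction = round(1 - 1026 / 1678, 2)

cost_tpcc = round(array_cost(0.50), 2)    # optimum TLC capacity of 50%
cost_prxy = round(array_cost(0.875), 2)   # optimum TLC capacity of 87.5%
```

Under this assumed cost model the two optimum capacity ratios give 0.83 and 0.71, which agrees with the Cost row of Table II.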

Fig. 5 Write order and VTH distribution of TLC [2, 3]. The lower, middle and upper pages of each word-line (WL0 to WL63) are written in the 1st, 2nd and 3rd write, respectively, and the program order numbers run from 1 to 192 within a block. The VTH distribution evolves from the erased state (E) to the states E and P1 to P7. The write order of TLC is complicated due to capacitive interference, resulting in block unit write and long write latency.

Fig. 9 Write frequency distribution of tpcc-mysql [8]. Write frequency (logarithmic scale) is plotted against logical address (×10^9), with the hot data, cold data and static frozen data regions marked.

Fig. 2 Proposed Round-Robin frozen data collecting algorithm (RR-FDCA). Write data from the host is stored in MLC. (A) Cold data is collected into MLC blocks. (B) Extremely cold data, i.e. frozen data, is screened. (C) Frozen data are moved to TLC. Hot data: frequently accessed data. Cold data: rarely accessed data. The legend distinguishes valid, invalid and empty pages.

Fig. 14 Estimated continuous dead time for conventional and proposed exchange methods. Assuming 8 GB of valid pages in the degraded chip:
Conv. = 8 GB / 50 MiB/s (read) + 8 GB / 10 MiB/s (write) = 960 s
Prop. = normal block copy (Steps 1 and 2 in Fig. 7) = ~200 ms (avg.), about 1/4800 of the conventional dead time
In the conventional method, the storage system must be suspended to move all valid pages in the degraded chip to another chip. The proposed storage array can be accessed without system suspension because of the read-only mode.
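The estimate of Fig. 14 can be checked with the short calculation below. The sketch assumes that 8 GB is counted as 8000 MB and that the read and write rates are applied to the same unit, which is what reproduces the 960 s of the figure. The function name is illustrative.

```python
def conventional_dead_time_s(valid_mb, read_mb_per_s, write_mb_per_s):
    """Suspend time of an enforced copy: read everything, then write everything."""
    return valid_mb / read_mb_per_s + valid_mb / write_mb_per_s


conv_s = conventional_dead_time_s(8000, 50, 10)  # 160 s read + 800 s write
prop_ms = 200                                    # one normal block copy (avg.)

# How many times shorter the proposed continuous dead time is.
improvement = conv_s * 1000 / prop_ms
```

The write phase dominates the conventional dead time, so the gap to the proposal is set mainly by the 10 MiB/s write rate assumed in Fig. 14.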
