
  • 1

    Laura Caulfield, Michael Xing, Zeke Tan, Robin Alexander

    Andromeda: Building the Next-Generation

    High-Density Storage Interface for Successful Adoption

    Cloud Server Infrastructure Engineering

    Azure, Microsoft

  • 3


    Design Principles For Cloud Hardware

    • Support a broad set of applications on shared hardware
      Azure (>600 services), Bing, Exchange, O365, others

    • Scale requires vendor neutrality & supply chain diversity
      Azure operates in 38 regions globally, more than any other cloud provider

    • Rapid enablement of new NAND generations
      New NAND every 18 months, hours to precondition, hundreds of workloads

    • Flexible enough for software to evolve faster than hardware
      SSDs rated for 3-5 years, heavy process for FW update, software updated daily

  • 4


    [Figure: Write Amplification Factor (WAF) vs. IO Size (MB), two panels, for eight drives: A1-960GB, B1-480GB, C1-480GB, D1-480GB, D2-480GB, E1-480GB, E1-960GB, E2-480GB. WAF spans 1.0 to 4.0 over IO sizes from 4 MB to 1,024 MB.]

    SSD Architecture: Address Map, Data Cache, NAND Flash

    Attribute         Size
    Flash Page        16 kB
    Flash Block       4 MB - 9 MB
    Map Granularity   4 kB

    4 kB writes accumulate in the data cache until there is enough data to fill a flash page.

    Garbage Collection:
    1. Copy valid data (Write Amplification)
    2. Erase block
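    The slide's point is that mixing small writes into multi-MB flash blocks forces garbage collection, and the copied data shows up as write amplification. Below is a minimal sketch of the WAF arithmetic; the counter names and example volumes are illustrative, not taken from the slides.

    /* Minimal sketch: write amplification factor (WAF) is total NAND writes
     * (host writes plus garbage-collection copies) divided by host writes.
     * Counter names and volumes are illustrative, not from the slides. */
    #include <stdio.h>

    static double waf(unsigned long long host_bytes, unsigned long long gc_copied_bytes)
    {
        if (host_bytes == 0)
            return 1.0;
        return (double)(host_bytes + gc_copied_bytes) / (double)host_bytes;
    }

    int main(void)
    {
        /* Example: for every 4 GB the host writes, GC relocates 6 GB of
         * still-valid data before erasing blocks, so WAF = 2.5. */
        printf("WAF = %.2f\n", waf(4ULL << 30, 6ULL << 30));
        return 0;
    }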

  • 5


    SSD Architecture: Address Map, Data Cache, NAND Flash

    Attribute         Per Die          Striped (effective)
    Flash Page        16 kB            1 MB
    Flash Block       4 MB - 9 MB      1 GB
    Map Granularity   4 kB             …

    [Figure: Block Size (MB) vs. Die Capacity (Gbit), log scale.]

    1 MB writes accumulate in the data cache until there is enough data to fill a striped page.
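    To make the 1 MB write and ~1 GB effective block concrete, here is a small sketch of how striping one page and one block across many dies scales the sizes. The stripe width is an assumption for illustration; real geometry comes from the drive.

    /* Sketch of how striping a write across many dies scales the effective
     * page and block sizes. Stripe width is an assumption; per-die page and
     * block sizes are taken from the table above. */
    #include <stdio.h>

    int main(void)
    {
        unsigned long long page_kb        = 16;  /* flash page per die: 16 kB    */
        unsigned long long block_mb       = 8;   /* flash block per die: ~4-9 MB */
        unsigned long long dies_in_stripe = 64;  /* assumed stripe width         */

        /* One page on every die must be filled before programming the stripe. */
        unsigned long long striped_page_kb = page_kb * dies_in_stripe;    /* 1,024 kB = 1 MB */

        /* The reclaim unit becomes one block on every die in the stripe; a
         * wider stripe or larger per-die block approaches the slide's ~1 GB. */
        unsigned long long striped_block_mb = block_mb * dies_in_stripe;  /* 512 MB */

        printf("striped page:  %llu kB\n", striped_page_kb);
        printf("striped block: %llu MB\n", striped_block_mb);
        return 0;
    }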

  • 6


    Cloud-Scale Workloads

    What is the most efficient placement of their data in an SSD's NAND flash array?

    Application in a Virtual Machine (VM)
    • Small updates
    • Unaligned peak traffic (bursty)
    → Horizontal Stripe
    • Each write receives peak performance
    • Erase blocks when the VM closes

    Azure Storage Backend (SOSP '11)
    • Lowest tier in the hierarchy ("streaming")
    • Write performance rises with stream count
    • Read QoS via a small reclaim unit
    → Vertical Stripe
    • High throughput through aggregation
    • Smallest possible effective block size

    New Application in a VM
    • Same resources as any VM guest
    • Adaptable to flash sizes
    → Hybrid Stripe
    • VM host allocates a horizontal stripe
    • VM guest partitions it further

    Allow these and other stripe dimensions simultaneously in the same SSD (see the sketch after this list).
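    A minimal sketch of how one SSD could describe horizontal, vertical, and hybrid stripes as different spans over its channel/die array. The struct fields, geometry numbers, and 8 MB block size are hypothetical, not from the slides.

    /* Sketch: describing a block stripe by its span over the flash array, so
     * horizontal, vertical, and hybrid stripes can coexist in one SSD. The
     * field names, geometry, and 8 MB block size are hypothetical. */
    #include <stdio.h>

    #define BLOCK_MB 8u                  /* assumed per-die block size */

    struct stripe {
        unsigned num_channels;           /* width across channels             */
        unsigned dies_per_channel;       /* dies used within each channel     */
        unsigned blocks_per_die;         /* physical blocks taken on each die */
    };

    /* The reclaim unit is everything that must be erased together. */
    static unsigned effective_block_mb(const struct stripe *s)
    {
        return s->num_channels * s->dies_per_channel * s->blocks_per_die * BLOCK_MB;
    }

    int main(void)
    {
        struct stripe horizontal = { 16, 4, 1 }; /* peak bandwidth; erased when the VM closes */
        struct stripe vertical   = {  1, 1, 1 }; /* smallest effective block; best read QoS   */
        struct stripe hybrid     = {  8, 2, 1 }; /* host allocates, guest partitions further  */

        printf("horizontal reclaim unit: %u MB\n", effective_block_mb(&horizontal));
        printf("vertical reclaim unit:   %u MB\n", effective_block_mb(&vertical));
        printf("hybrid reclaim unit:     %u MB\n", effective_block_mb(&hybrid));
        return 0;
    }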

  • 7


    Key Observations

    • Block abstraction is showing its age
      – Forces an expensive reserve of flash and DRAM
      – Mixes distinct streams of data, increasing WAF
      – Even HDDs are host-managed (SMR)

    • Reliability should be near the media
      – Warranty and production-scale quality are essential
      – Tuning the host to different NANDs is not practical
      – Expertise is in-house at SSD companies

    • Data placement policy should be near applications
      – Trade-off between reclaim size and throughput
      – Each application has a different place on the spectrum
      – Expertise is in-house at software companies

    These are fundamental requirements of the general-purpose cloud.

    Baidu was first to propose this (ASPLOS '14): raw bandwidth utilization increased from 40% to 95%, usable capacity from 50% to 99%.

  • 8


    What do you mean by “Open Channel”?

    [Diagram: three interface options (OC SSD 1.2, OC SSD 2.0, NVMe 1.3), each showing how Log Mgmt. and Media Mgmt. are divided between Host and Drive.]

    FTL (Flash Translation Layer): algorithms that allow flash to replace conventional HDDs. Conventional systems contain the entire FTL in the drive's controller; the target design divides log and media management between host and drive.

    Media Managers: tuned to a specific generation of media (such as Toshiba A19nm or Micron L95). Implement error correction such as ECC, RAID and read-retry. Prevent media-specific errors through scrubbing, mapping out bad blocks, etc.

    Log Managers: receive random writes and transmit one or more streams of sequential writes. Maintain the address map and perform garbage collection.
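    A sketch of that division as two small interfaces, host-side log management and drive-side media management. The operation names and signatures are hypothetical illustrations of the responsibilities listed above, not the OCSSD command set.

    /* Sketch of the FTL split described above: the host-side log manager
     * turns random writes into sequential streams and does GC; the drive-side
     * media manager owns media-specific reliability. All names are
     * hypothetical, not from any specification. */
    #include <stdint.h>
    #include <stddef.h>

    /* Drive side: tuned to one NAND generation. */
    struct media_mgr_ops {
        int (*read)(uint64_t ppa, void *buf, size_t len);        /* ECC, read-retry    */
        int (*write)(uint64_t ppa, const void *buf, size_t len);
        int (*erase_chunk)(uint64_t chunk_addr);
        int (*retire_block)(uint64_t chunk_addr);                /* bad-block handling */
    };

    /* Host side: owns data placement and the address map. */
    struct log_mgr_ops {
        int (*append)(int stream, uint64_t lba, const void *buf, size_t len);
        uint64_t (*lookup)(uint64_t lba);                        /* address map        */
        int (*garbage_collect)(void);                            /* copy valid, erase  */
    };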

  • 9


    Existing Proposals

    [Table: each proposal is rated Available, Incomplete, or Not planned against three requirements: Log Abstraction, In-Host Placement Policy, In-Drive Reliability.]

    • LightNVM / OCSSD 2.0 (FAST '17)
    • Software-Defined Flash (ASPLOS '14)
    • IO Determinism (Fall '16)
    • Multi-Streamed SSD (HotStor '14)
    • Nameless Writes (FAST '12)
    • Programmable Flash (ADMS '11)

    Next: iterate on the OCSSD 2.0 spec to create a production-ready system (manufacturable, warrantable, etc.).

  • 17


    Physical Page Addressing (PPA) Interface

    Host and drive communicate with physical addresses

    • Terminology
      – Channel: SSD channel
      – Parallel Unit (PU): NAND die
      – Chunk: multi-plane block
      – Sector: 512 B or 4 kB region of a NAND page

    • All addresses map to a fixed physical location

    • Host IO requirements:
      – Erase a chunk before writing any sectors
      – Write sectors within the chunk sequentially
      – Read any sector that is not within the "cache minimum write size"

    Address format (MSB to LSB): Channel | Parallel Unit | Chunk | Sector (a packing sketch follows below)

    "Open-Channel Solid State Drives NVMe Specification", Matias Bjørling, Oct 2016, https://aka.ms/i9my03
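    A minimal sketch of packing and unpacking an address in the channel / parallel unit / chunk / sector order above. Real drives report each field's bit width in their geometry; the widths here are assumptions for illustration.

    /* Sketch of packing a physical page address (PPA) in the
     * channel / parallel unit / chunk / sector order shown on the slide.
     * The field widths below are illustrative assumptions. */
    #include <stdint.h>
    #include <stdio.h>

    #define SECTOR_BITS 12u  /* assumed */
    #define CHUNK_BITS  16u  /* assumed */
    #define PU_BITS      8u  /* assumed */

    static uint64_t ppa_pack(uint64_t ch, uint64_t pu, uint64_t chunk, uint64_t sector)
    {
        return (ch    << (PU_BITS + CHUNK_BITS + SECTOR_BITS)) |
               (pu    << (CHUNK_BITS + SECTOR_BITS)) |
               (chunk <<  SECTOR_BITS) |
                sector;
    }

    int main(void)
    {
        /* Channel 2, parallel unit 5, chunk 100, sector 7. */
        uint64_t ppa = ppa_pack(2, 5, 100, 7);
        printf("ppa   = 0x%016llx\n", (unsigned long long)ppa);
        printf("chunk = %llu\n",
               (unsigned long long)((ppa >> SECTOR_BITS) & ((1u << CHUNK_BITS) - 1)));
        return 0;
    }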

  • 18


    Chunk Address: Logical or Physical?

    If the chunk address remains physical:

    • Host manages all wear leveling
      – Monitors erase count on every block
      – Uses the vector copy command

    • Drive monitors wear and provides backup
      – Low water mark: notifies host of uneven wear
      – High water mark: moves data and notifies host of the new address

    If the chunk address were logical:

    • Drive guarantees even wear within a die

    • Wear leveling integrated with scrubbing

    • Simple wear leveling
      – Full, block-granular map
      – Swap most/least worn blocks at a regular period

    • Start-gap style wear leveling (see the sketch below)
      – No map (equation-based)
      – Move 1 block of data at a regular period

    • Host issues maintenance "go" and "stop"

    Remapping the block number within the drive has much lower performance and space overheads than current SSD designs.

    The host requires no resource guarantees rooted in the physical location of a chunk, so there is the option to make the chunk address logical.
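    A minimal sketch of start-gap style wear leveling over chunk addresses, matching the two properties listed above: the logical-to-physical mapping is an equation rather than a table, and one chunk of data moves per maintenance period. The chunk count and names are illustrative; this follows the general start-gap technique, not any particular drive's firmware.

    /* Start-gap style wear leveling sketch: N logical chunks live in N+1
     * physical chunks. Two registers (start, gap) define an equation-based
     * mapping; each maintenance period moves exactly one chunk of data. */
    #include <stdio.h>

    #define N 8                     /* logical chunks per die (illustrative)     */

    static unsigned start_reg = 0;  /* rotation of the whole array               */
    static unsigned gap = N;        /* physical index of the spare (empty) chunk */
    static int phys[N + 1];         /* stand-in for chunk contents               */

    /* Equation-based logical-to-physical mapping: no map table required. */
    static unsigned map(unsigned la)
    {
        unsigned pa = (la + start_reg) % N;
        return (pa >= gap) ? pa + 1 : pa;
    }

    /* Called at a regular period: move one chunk of data, advancing the gap. */
    static void gap_move(void)
    {
        if (gap == 0) {
            phys[0] = phys[N];          /* wrap: complete one full rotation */
            gap = N;
            start_reg = (start_reg + 1) % N;
        } else {
            phys[gap] = phys[gap - 1];  /* copy the neighbour into the gap  */
            gap--;
        }
    }

    int main(void)
    {
        for (unsigned la = 0; la < N; la++)
            phys[map(la)] = (int)la;            /* fill: logical la holds value la */

        for (int step = 0; step < 3 * (N + 1); step++) {
            gap_move();
            for (unsigned la = 0; la < N; la++) /* mapping stays consistent */
                if (phys[map(la)] != (int)la)
                    printf("mismatch at la=%u after step %d\n", la, step);
        }
        printf("start=%u gap=%u, mapping consistent\n", start_reg, gap);
        return 0;
    }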

  • 19


    Reliability and QoS

    Tight QoS Guarantees

    • OCSSD 2.0
      – Read and write times include ECC
      – "Heroic ECC" is binary (all or nothing)

    • Future enhancement
      – Use reported times as "base" performance
      – Table to select the reliability/latency trade-off
      – Selectable per chunk? Per stripe?

    RAID

    • RAID and isolation are at odds
      – Independent work is scheduled to each die
      – Parity must be built and decoded across all dies

    • Higher-level replication
      – RAID within the drive is sometimes redundant
      – Requires information and mechanisms for the host to selectively use in-drive RAID
      – Leverage vector reads/writes

  • 20


    [Figure: program order for NAND cells holding three pages each (LSB, MSB1, MSB2); a cell's pages are programmed at staggered times.]

    NAND Cell   LSB   MSB1   MSB2
    N           10    14     18
    N + 1       13    17     21
    N + 2       16    20     24
    N + 3       19    23     27
    N + 4       22    26     30
    N + 5       25    29     33

    Cache Minimum Write Size

    • Open NAND cells susceptible to read disturb

    • Cache the last 3-5 pages written to any write point

    • Single-shot programming is not susceptible

    • Host-device contract (see the sketch at the end of this slide):
      – Cache minimum write size (CMWS) = max kB in open cells
      – Host queries for CMWS (0 kB if the drive caches enough)
      – Host does not read from the last CMWS of data
      – Drive fails reads to the CMWS region

    “Vulnerabilities in MLC NAND Flash Memory Programming.” Yu Cai et al. HPCA 2017


    Figure legend: written page in a fully-written cell; written page in a partly-written cell; next page to write; unwritten page.

    Mitigating read disturb in open memory cells: defining the right abstraction for the "open block timer".
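    A sketch of the host's side of that contract: before issuing a read, check whether it lands within the last CMWS bytes written to an open chunk, and if so serve it from the host's own buffer rather than letting the drive fail it. The structure, CMWS value, and helper names are assumptions for illustration.

    /* Host-side CMWS check sketch: reads within the last CMWS bytes behind
     * an open chunk's write pointer would hit open cells, so serve them
     * from the host's write cache instead. Names are hypothetical. */
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    struct open_chunk {
        uint64_t chunk_start;   /* first byte offset of the chunk       */
        uint64_t write_pointer; /* next byte to be written (sequential) */
    };

    /* CMWS as reported by the drive; 0 means the drive caches enough itself. */
    static uint64_t cmws_bytes = 48 * 1024;   /* e.g. three 16 kB pages (assumed) */

    static bool read_would_hit_open_cells(const struct open_chunk *c,
                                          uint64_t offset, uint64_t len)
    {
        uint64_t risky_start = (c->write_pointer > cmws_bytes)
                                   ? c->write_pointer - cmws_bytes
                                   : c->chunk_start;
        return offset + len > risky_start && offset < c->write_pointer;
    }

    int main(void)
    {
        struct open_chunk c = { .chunk_start = 0, .write_pointer = 256 * 1024 };

        /* A read well behind the write pointer is safe to send to the drive. */
        printf("read @128k: %s\n",
               read_would_hit_open_cells(&c, 128 * 1024, 4096) ? "serve from host cache"
                                                               : "send to drive");
        /* A read within the last CMWS of written data would be failed by the
         * drive, so the host must serve it from its write cache. */
        printf("read @240k: %s\n",
               read_would_hit_open_cells(&c, 240 * 1024, 4096) ? "serve from host cache"
                                                               : "send to drive");
        return 0;
    }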

  • 21


    Telemetry for Physical Addressing

    • OCSSD 2.0: “Chunk Report”

    – Enhancement: use an NVMe log page

    – Supported only in open channel drives

    – Include additional telemetry (at right)

    • Vendor-specific log page
      – Standard address
      – Vendor-specific content
      – May be temporary, used only while the open-channel interface matures

    Additional Standard Telemetry

    • Erase count per PU: the host may use this to determine which PUs to swap during cross-die wear leveling.

    • Erase count per chunk: if chunk addressing is logical, the host may use this to reduce drive activity by freeing little-worn chunks.

    • Read disturb count per sector: the host may use this to reduce drive activity by scrubbing or distributing reads. (A hypothetical layout for these counters is sketched below.)
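    A sketch of how the three proposed counters might be laid out as entries in such a vendor-specific log page. The field names and widths are hypothetical; this is not the OCSSD 2.0 chunk report format.

    /* Sketch of the three proposed counters as vendor-specific log page
     * entries. Field names and widths are hypothetical. */
    #include <stdint.h>

    struct pu_telemetry {
        uint32_t erase_count;          /* per parallel unit: input to the host's
                                          cross-die wear-leveling swap decision */
    };

    struct chunk_telemetry {
        uint32_t erase_count;          /* per chunk: lets the host free
                                          little-worn chunks first when chunk
                                          addressing is logical                */
    };

    struct sector_telemetry {
        uint32_t read_disturb_count;   /* per sector: host can scrub the data
                                          or spread hot reads                  */
    };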

  • 24


    Conclusions

    • Storage interface is changing, let’s architect it for the long term

    – Correct division of responsibilities between Host and SSD

    – Control to define heterogeneous block stripes

    – HyperScale: Hundreds or thousands of workers per TB

    • Final solution must include expertise from community

    – Currently working through the division between host and SSD

    – Contact me to discuss ([email protected])

    – Read more in our FAST 2017 paper: FlashBlox

  • 27


    References

    • Azure Storage Backend (SOSP '11): Windows Azure Storage: A Highly Available Cloud Storage Service with Strong Consistency. http://sigops.org/sosp/sosp11/current/2011-Cascais/printable/11-calder.pdf

    • FlashBlox (FAST '17): FlashBlox: Achieving Both Performance Isolation and Uniform Lifetime for Virtualized SSDs. https://www.usenix.org/conference/fast17/technical-sessions/presentation/huang

    • LightNVM (FAST '17): LightNVM: The Linux Open-Channel SSD Subsystem. https://www.usenix.org/system/files/conference/fast17/fast17-bjorling.pdf

    • Read Determinism (SDC '16): Standards for Improving SSD Performance and Endurance. http://www.snia.org/sites/default/files/SDC/2016/presentations/hyperscalers/Bill_Martin_Standards_for_Improving_SSD_Performance_and_Endurance.pdf

    • Software-Defined Flash (ASPLOS '14): SDF: Software-Defined Flash for Web-Scale Internet Storage Systems. http://www.ece.eng.wayne.edu/~sjiang/pubs/papers/ouyang14-SDF.pdf

    • Multi-Streamed SSD (HotStor '14): The Multi-streamed Solid-State Drive. https://www.usenix.org/system/files/conference/hotstorage14/hotstorage14-paper-kang.pdf

    • De-Indirection (FAST '12): De-Indirection for Flash-based SSDs with Nameless Writes. https://www.usenix.org/system/files/conference/fast12/zhang2-7-12.pdf

    • Programmable Flash (ADMS '11): Fast, Energy Efficient Scan inside Flash Memory SSDs. http://www.adms-conf.org/p36-KIM.pdf