
  • 1

    Laura Caulfield, Michael Xing, Zeke Tan, Robin Alexander

    Andromeda: Building the Next-Generation

    High-Density Storage Interface for Successful Adoption

    Cloud Server Infrastructure Engineering

    Azure, Microsoft

  • 3


    Design Principles For Cloud Hardware

    • Support a broad set of applications on shared hardware
      Azure (>600 services), Bing, Exchange, O365, others

    • Scale requires vendor neutrality & supply chain diversity
      Azure operates in 38 regions globally, more than any other cloud provider

    • Rapid enablement of new NAND generations
      New NAND every 18 months, hours to precondition, hundreds of workloads

    • Flexible enough for software to evolve faster than hardware
      SSDs rated for 3-5 years, heavy process for FW update, software updated daily

  • 4


    [Figure: Write Amplification Factor (WAF) vs. IO Size (MB), two panels, for eight drives: A1-960GB, B1-480GB, C1-480GB, D1-480GB, D2-480GB, E1-480GB, E1-960GB, E2-480GB. WAF spans 1.0 to 4.0 over IO sizes from 4 MB to 1,024 MB.]

    SSD Architecture: Address Map, Data Cache, NAND Flash

    Attribute         Size
    Flash Page        16 kB
    Flash Block       4 MB - 9 MB
    Map Granularity   4 kB

    4 kB writes accumulate in the data cache until there is enough data to fill a flash page.

    Garbage Collection:
    1. Copy valid data (Write Amplification)
    2. Erase block
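    The slide's point is that mixing small writes into multi-MB flash blocks forces garbage collection, and the copied data shows up as write amplification. Below is a minimal sketch of the WAF arithmetic; the counter names and example volumes are illustrative, not taken from the slides.

    /* Minimal sketch: write amplification factor (WAF) is total NAND writes
     * (host writes plus garbage-collection copies) divided by host writes.
     * Counter names and volumes are illustrative, not from the slides. */
    #include <stdio.h>

    static double waf(unsigned long long host_bytes, unsigned long long gc_copied_bytes)
    {
        if (host_bytes == 0)
            return 1.0;
        return (double)(host_bytes + gc_copied_bytes) / (double)host_bytes;
    }

    int main(void)
    {
        /* Example: for every 4 GB the host writes, GC relocates 6 GB of
         * still-valid data before erasing blocks, so WAF = 2.5. */
        printf("WAF = %.2f\n", waf(4ULL << 30, 6ULL << 30));
        return 0;
    }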

  • 5


    SSD Architecture: Address Map, Data Cache, NAND Flash

    Attribute         Per Die          Striped (effective)
    Flash Page        16 kB            1 MB
    Flash Block       4 MB - 9 MB      1 GB
    Map Granularity   4 kB             …

    [Figure: Block Size (MB) vs. Die Capacity (Gbit), log scale.]

    1 MB writes accumulate in the data cache until there is enough data to fill a striped page.
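    To make the 1 MB write and ~1 GB effective block concrete, here is a small sketch of how striping one page and one block across many dies scales the sizes. The stripe width is an assumption for illustration; real geometry comes from the drive.

    /* Sketch of how striping a write across many dies scales the effective
     * page and block sizes. Stripe width is an assumption; per-die page and
     * block sizes are taken from the table above. */
    #include <stdio.h>

    int main(void)
    {
        unsigned long long page_kb        = 16;  /* flash page per die: 16 kB    */
        unsigned long long block_mb       = 8;   /* flash block per die: ~4-9 MB */
        unsigned long long dies_in_stripe = 64;  /* assumed stripe width         */

        /* One page on every die must be filled before programming the stripe. */
        unsigned long long striped_page_kb = page_kb * dies_in_stripe;    /* 1,024 kB = 1 MB */

        /* The reclaim unit becomes one block on every die in the stripe; a
         * wider stripe or larger per-die block approaches the slide's ~1 GB. */
        unsigned long long striped_block_mb = block_mb * dies_in_stripe;  /* 512 MB */

        printf("striped page:  %llu kB\n", striped_page_kb);
        printf("striped block: %llu MB\n", striped_block_mb);
        return 0;
    }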

  • 6


    Cloud-Scale Workloads

    What is the most efficient placement of their data in an SSD's NAND flash array?

    Application in a Virtual Machine (VM)
    • Small updates
    • Unaligned peak traffic (bursty)
    → Horizontal Stripe
    • Each write receives peak performance
    • Erase blocks when the VM closes

    Azure Storage Backend (SOSP '11)
    • Lowest tier in the hierarchy ("streaming")
    • Write performance rises with stream count
    • Read QoS via a small reclaim unit
    → Vertical Stripe
    • High throughput through aggregation
    • Smallest possible effective block size

    New Application in a VM
    • Same resources as any VM guest
    • Adaptable to flash sizes
    → Hybrid Stripe
    • VM host allocates a horizontal stripe
    • VM guest partitions it further

    Allow these and other stripe dimensions simultaneously in the same SSD (see the sketch after this list).
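    A minimal sketch of how one SSD could describe horizontal, vertical, and hybrid stripes as different spans over its channel/die array. The struct fields, geometry numbers, and 8 MB block size are hypothetical, not from the slides.

    /* Sketch: describing a block stripe by its span over the flash array, so
     * horizontal, vertical, and hybrid stripes can coexist in one SSD. The
     * field names, geometry, and 8 MB block size are hypothetical. */
    #include <stdio.h>

    #define BLOCK_MB 8u                  /* assumed per-die block size */

    struct stripe {
        unsigned num_channels;           /* width across channels             */
        unsigned dies_per_channel;       /* dies used within each channel     */
        unsigned blocks_per_die;         /* physical blocks taken on each die */
    };

    /* The reclaim unit is everything that must be erased together. */
    static unsigned effective_block_mb(const struct stripe *s)
    {
        return s->num_channels * s->dies_per_channel * s->blocks_per_die * BLOCK_MB;
    }

    int main(void)
    {
        struct stripe horizontal = { 16, 4, 1 }; /* peak bandwidth; erased when the VM closes */
        struct stripe vertical   = {  1, 1, 1 }; /* smallest effective block; best read QoS   */
        struct stripe hybrid     = {  8, 2, 1 }; /* host allocates, guest partitions further  */

        printf("horizontal reclaim unit: %u MB\n", effective_block_mb(&horizontal));
        printf("vertical reclaim unit:   %u MB\n", effective_block_mb(&vertical));
        printf("hybrid reclaim unit:     %u MB\n", effective_block_mb(&hybrid));
        return 0;
    }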

  • 7


    Key Observations

    • Block abstraction is showing its age
      – Forces an expensive reserve of flash and DRAM
      – Mixes distinct streams of data, increasing WAF
      – Even HDDs are host-managed (SMR)

    • Reliability should be near the media
      – Warranty and production-scale quality are essential
      – Tuning the host to different NANDs is not practical
      – Expertise is in-house at SSD companies

    • Data placement policy should be near applications
      – Trade-off between reclaim size and throughput
      – Each application has a different place on the spectrum
      – Expertise is in-house at software companies

    These are fundamental requirements of the general-purpose cloud.

    Baidu was first to propose this (ASPLOS '14): raw bandwidth utilization increased from 40% to 95%, usable capacity from 50% to 99%.

  • 8


    What do you mean by “Open Channel”?

    [Diagram: three interface options (OC SSD 1.2, OC SSD 2.0, NVMe 1.3), each showing how Log Mgmt. and Media Mgmt. are divided between Host and Drive.]

    FTL (Flash Translation Layer): algorithms that allow flash to replace conventional HDDs. Conventional systems contain the entire FTL in the drive's controller; the target design divides log and media management between host and drive.

    Media Managers: tuned to a specific generation of media (such as Toshiba A19nm or Micron L95). Implement error correction such as ECC, RAID and read-retry. Prevent media-specific errors through scrubbing, mapping out bad blocks, etc.

    Log Managers: receive random writes and transmit one or more streams of sequential writes. Maintain the address map and perform garbage collection.
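    A sketch of that division as two small interfaces, host-side log management and drive-side media management. The operation names and signatures are hypothetical illustrations of the responsibilities listed above, not the OCSSD command set.

    /* Sketch of the FTL split described above: the host-side log manager
     * turns random writes into sequential streams and does GC; the drive-side
     * media manager owns media-specific reliability. All names are
     * hypothetical, not from any specification. */
    #include <stdint.h>
    #include <stddef.h>

    /* Drive side: tuned to one NAND generation. */
    struct media_mgr_ops {
        int (*read)(uint64_t ppa, void *buf, size_t len);        /* ECC, read-retry    */
        int (*write)(uint64_t ppa, const void *buf, size_t len);
        int (*erase_chunk)(uint64_t chunk_addr);
        int (*retire_block)(uint64_t chunk_addr);                /* bad-block handling */
    };

    /* Host side: owns data placement and the address map. */
    struct log_mgr_ops {
        int (*append)(int stream, uint64_t lba, const void *buf, size_t len);
        uint64_t (*lookup)(uint64_t lba);                        /* address map        */
        int (*garbage_collect)(void);                            /* copy valid, erase  */
    };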

  • 9


    Existing Proposals

    [Table: each proposal is rated Available, Incomplete, or Not planned against three requirements: Log Abstraction, In-Host Placement Policy, In-Drive Reliability.]

    • LightNVM / OCSSD 2.0 (FAST '17)
    • Software-Defined Flash (ASPLOS '14)
    • IO Determinism (Fall '16)
    • Multi-Streamed SSD (HotStor '14)
    • Nameless Writes (FAST '12)
    • Programmable Flash (ADMS '11)

    Next: iterate on the OCSSD 2.0 spec to create a production-ready system (manufacturable, warrantable, etc.).

  • 17


    Physical Page Addressing (PPA) Interface

    Host and drive communicate with physical addresses

    • Terminology
      – Channel: SSD channel
      – Parallel Unit (PU): NAND die
      – Chunk: multi-plane block
      – Sector: 512 B or 4 kB region of a NAND page

    • All addresses map to a fixed physical location

    • Host IO requirements:
      – Erase a chunk before writing any sectors
      – Write sectors within the chunk sequentially
      – Read any sector that is not within the "cache minimum write size"

    Address format (MSB to LSB): Channel | Parallel Unit | Chunk | Sector (a packing sketch follows below)

    "Open-Channel Solid State Drives NVMe Specification", Matias Bjørling, Oct 2016, https://aka.ms/i9my03
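    A minimal sketch of packing and unpacking an address in the channel / parallel unit / chunk / sector order above. Real drives report each field's bit width in their geometry; the widths here are assumptions for illustration.

    /* Sketch of packing a physical page address (PPA) in the
     * channel / parallel unit / chunk / sector order shown on the slide.
     * The field widths below are illustrative assumptions. */
    #include <stdint.h>
    #include <stdio.h>

    #define SECTOR_BITS 12u  /* assumed */
    #define CHUNK_BITS  16u  /* assumed */
    #define PU_BITS      8u  /* assumed */

    static uint64_t ppa_pack(uint64_t ch, uint64_t pu, uint64_t chunk, uint64_t sector)
    {
        return (ch    << (PU_BITS + CHUNK_BITS + SECTOR_BITS)) |
               (pu    << (CHUNK_BITS + SECTOR_BITS)) |
               (chunk <<  SECTOR_BITS) |
                sector;
    }

    int main(void)
    {
        /* Channel 2, parallel unit 5, chunk 100, sector 7. */
        uint64_t ppa = ppa_pack(2, 5, 100, 7);
        printf("ppa   = 0x%016llx\n", (unsigned long long)ppa);
        printf("chunk = %llu\n",
               (unsigned long long)((ppa >> SECTOR_BITS) & ((1u << CHUNK_BITS) - 1)));
        return 0;
    }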

  • 18


    Chunk Address: Logical or Physical?

    If the chunk address remains physical:

    • Host manages all wear leveling
      – Monitors erase count on every block
      – Uses the vector copy command

    • Drive monitors wear and provides backup
      – Low water mark: notifies host of uneven wear
      – High water mark: moves data and notifies host of the new address

    If the chunk address were logical:

    • Drive guarantees even wear within a die

    • Wear leveling integrated with scrubbing

    • Simple wear leveling
      – Full, block-granular map
      – Swap most/least worn blocks at a regular period

    • Start-gap style wear leveling (see the sketch below)
      – No map (equation-based)
      – Move 1 block of data at a regular period

    • Host issues maintenance "go" and "stop"

    Remapping the block number within the drive has much lower performance and space overheads than current SSD designs.

    The host requires no resource guarantees rooted in the physical location of a chunk, so there is the option to make the chunk address logical.
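    A minimal sketch of start-gap style wear leveling over chunk addresses, matching the two properties listed above: the logical-to-physical mapping is an equation rather than a table, and one chunk of data moves per maintenance period. The chunk count and names are illustrative; this follows the general start-gap technique, not any particular drive's firmware.

    /* Start-gap style wear leveling sketch: N logical chunks live in N+1
     * physical chunks. Two registers (start, gap) define an equation-based
     * mapping; each maintenance period moves exactly one chunk of data. */
    #include <stdio.h>

    #define N 8                     /* logical chunks per die (illustrative)     */

    static unsigned start_reg = 0;  /* rotation of the whole array               */
    static unsigned gap = N;        /* physical index of the spare (empty) chunk */
    static int phys[N + 1];         /* stand-in for chunk contents               */

    /* Equation-based logical-to-physical mapping: no map table required. */
    static unsigned map(unsigned la)
    {
        unsigned pa = (la + start_reg) % N;
        return (pa >= gap) ? pa + 1 : pa;
    }

    /* Called at a regular period: move one chunk of data, advancing the gap. */
    static void gap_move(void)
    {
        if (gap == 0) {
            phys[0] = phys[N];          /* wrap: complete one full rotation */
            gap = N;
            start_reg = (start_reg + 1) % N;
        } else {
            phys[gap] = phys[gap - 1];  /* copy the neighbour into the gap  */
            gap--;
        }
    }

    int main(void)
    {
        for (unsigned la = 0; la < N; la++)
            phys[map(la)] = (int)la;            /* fill: logical la holds value la */

        for (int step = 0; step < 3 * (N + 1); step++) {
            gap_move();
            for (unsigned la = 0; la < N; la++) /* mapping stays consistent */
                if (phys[map(la)] != (int)la)
                    printf("mismatch at la=%u after step %d\n", la, step);
        }
        printf("start=%u gap=%u, mapping consistent\n", start_reg, gap);
        return 0;
    }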

  • 19


    Reliability and QoS

    Tight QoS Guarantees

    • OCSSD 2.0
      – Read and write times include ECC
      – "Heroic ECC" is binary (all or nothing)

    • Future enhancement
      – Use reported times as "base" performance
      – Table to select the reliability/latency trade-off
      – Selectable per chunk? Per stripe?

    RAID

    • RAID and isolation are at odds
      – Independent work is scheduled to each die
      – Parity must be built and decoded across all dies

    • Higher-level replication
      – RAID within the drive is sometimes redundant
      – Requires information and mechanisms for the host to selectively use in-drive RAID
      – Leverage vector reads/writes

  • 20


    [Figure: program order for NAND cells holding three pages each (LSB, MSB1, MSB2); a cell's pages are programmed at staggered times.]

    NAND Cell   LSB   MSB1   MSB2
    N           10    14     18
    N + 1       13    17     21
    N + 2       16    20     24
    N + 3       19    23     27
    N + 4       22    26     30
    N + 5       25    29     33

    Cache Minimum Write Size

    • Open NAND cells susceptible to read disturb

    • Cache the last 3-5 pages written to any write point

    • Single-shot programming is not susceptible

    • Host-device contract (see the sketch at the end of this slide):
      – Cache minimum write size (CMWS) = max kB in open cells
      – Host queries for CMWS (0 kB if the drive caches enough)
      – Host does not read from the last CMWS of data
      – Drive fails reads to the CMWS region

    “Vulnerabilities in MLC NAND Flash Memory Programming.” Yu Cai et al. HPCA 2017


    Figure legend: written page in a fully-written cell; written page in a partly-written cell; next page to write; unwritten page.

    Mitigating read disturb in open memory cells: defining the right abstraction for the "open block timer".
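    A sketch of the host's side of that contract: before issuing a read, check whether it lands within the last CMWS bytes written to an open chunk, and if so serve it from the host's own buffer rather than letting the drive fail it. The structure, CMWS value, and helper names are assumptions for illustration.

    /* Host-side CMWS check sketch: reads within the last CMWS bytes behind
     * an open chunk's write pointer would hit open cells, so serve them
     * from the host's write cache instead. Names are hypothetical. */
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    struct open_chunk {
        uint64_t chunk_start;   /* first byte offset of the chunk       */
        uint64_t write_pointer; /* next byte to be written (sequential) */
    };

    /* CMWS as reported by the drive; 0 means the drive caches enough itself. */
    static uint64_t cmws_bytes = 48 * 1024;   /* e.g. three 16 kB pages (assumed) */

    static bool read_would_hit_open_cells(const struct open_chunk *c,
                                          uint64_t offset, uint64_t len)
    {
        uint64_t risky_start = (c->write_pointer > cmws_bytes)
                                   ? c->write_pointer - cmws_bytes
                                   : c->chunk_start;
        return offset + len > risky_start && offset < c->write_pointer;
    }

    int main(void)
    {
        struct open_chunk c = { .chunk_start = 0, .write_pointer = 256 * 1024 };

        /* A read well behind the write pointer is safe to send to the drive. */
        printf("read @128k: %s\n",
               read_would_hit_open_cells(&c, 128 * 1024, 4096) ? "serve from host cache"
                                                               : "send to drive");
        /* A read within the last CMWS of written data would be failed by the
         * drive, so the host must serve it from its write cache. */
        printf("read @240k: %s\n",
               read_would_hit_open_cells(&c, 240 * 1024, 4096) ? "serve from host cache"
                                                               : "send to drive");
        return 0;
    }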

  • 21


    Telemetry for Physical Addressing

    • OCSSD 2.0: “Chunk Report”

    – Enhancement: use an NVMe log page

    – Supported only in open channel drives

    – Include additional telemetry (at right)

    • Vendor-specific log page
      – Standard address
      – Vendor-specific content
      – May be temporary, used only while the open-channel interface matures

    Additional Standard Telemetry

    • Erase count per PU: the host may use this to determine which PUs to swap during cross-die wear leveling.

    • Erase count per chunk: if chunk addressing is logical, the host may use this to reduce drive activity by freeing little-worn chunks.

    • Read disturb count per sector: the host may use this to reduce drive activity by scrubbing or distributing reads. (A hypothetical layout for these counters is sketched below.)
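    A sketch of how the three proposed counters might be laid out as entries in such a vendor-specific log page. The field names and widths are hypothetical; this is not the OCSSD 2.0 chunk report format.

    /* Sketch of the three proposed counters as vendor-specific log page
     * entries. Field names and widths are hypothetical. */
    #include <stdint.h>

    struct pu_telemetry {
        uint32_t erase_count;          /* per parallel unit: input to the host's
                                          cross-die wear-leveling swap decision */
    };

    struct chunk_telemetry {
        uint32_t erase_count;          /* per chunk: lets the host free
                                          little-worn chunks first when chunk
                                          addressing is logical                */
    };

    struct sector_telemetry {
        uint32_t read_disturb_count;   /* per sector: host can scrub the data
                                          or spread hot reads                  */
    };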

  • 24


    Conclusions

    • Storage interface is changing, let’s architect it for the long term

    – Correct division of responsibilities between Host and SSD

    – Control to define heterogeneous block stripes

    – HyperScale: Hundreds or thousands of workers per TB

    • Final solution must include expertise from community

    – Currently working through the division between host and SSD

    – Contact me to discuss ([email protected])

    – Read more in our FAST 2017 paper: FlashBlox

  • 27


    References

    • Azure Storage Backend (SOSP '11): Windows Azure Storage: A Highly Available Cloud Storage Service with Strong Consistency. http://sigops.org/sosp/sosp11/current/2011-Cascais/printable/11-calder.pdf

    • FlashBlox (FAST '17): FlashBlox: Achieving Both Performance Isolation and Uniform Lifetime for Virtualized SSDs. https://www.usenix.org/conference/fast17/technical-sessions/presentation/huang

    • LightNVM (FAST '17): LightNVM: The Linux Open-Channel SSD Subsystem. https://www.usenix.org/system/files/conference/fast17/fast17-bjorling.pdf

    • Read Determinism (SDC '16): Standards for Improving SSD Performance and Endurance. http://www.snia.org/sites/default/files/SDC/2016/presentations/hyperscalers/Bill_Martin_Standards_for_Improving_SSD_Performance_and_Endurance.pdf

    • Software-Defined Flash (ASPLOS '14): SDF: Software-Defined Flash for Web-Scale Internet Storage Systems. http://www.ece.eng.wayne.edu/~sjiang/pubs/papers/ouyang14-SDF.pdf

    • Multi-Streamed SSD (HotStor '14): The Multi-streamed Solid-State Drive. https://www.usenix.org/system/files/conference/hotstorage14/hotstorage14-paper-kang.pdf

    • De-Indirection (FAST '12): De-Indirection for Flash-based SSDs with Nameless Writes. https://www.usenix.org/system/files/conference/fast12/zhang2-7-12.pdf

    • Programmable Flash (ADMS '11): Fast, Energy Efficient Scan inside Flash Memory SSDs. http://www.adms-conf.org/p36-KIM.pdf