Triple-A: A Non-SSD Based Autonomic All-Flash Array
for High Performance Storage Systems
Myoungsoo Jung (UT-Dallas), Wonil Choi (UT-Dallas),
John Shalf (LBNL), Mahmut Kandemir (PSU)
Executive Summary
• Challenge: SSD arrays might not be suitable for high-performance computing storage
• Our goal: propose a new high-performance storage architecture
• Observations
– High maintenance cost: caused by worn-out flash-SSD replacements
– Performance degradation: caused by shared-resource contentions
• Key Ideas
– Cost reduction: take the bare NAND flash out of the SSD box
– Contention resolution: distribute the excessive I/Os that generate bottlenecks
• Triple-A: a new architecture for HPC storage
– Consists of non-SSD bare flash memories
– Automatically detects and resolves performance bottlenecks
• Results: non-SSD all-flash arrays are expected to save 35~50% of cost and offer 53% higher throughput than a traditional SSD array
Outline
• Motivations
• Triple-A Architecture
• Triple-A Management
• Evaluations
• Conclusions
SSD Arrays
• SSD arrays are in position to (partially) replace HDD arrays
[Figure: HPC starts to employ SSDs: HDD arrays, SSD-cache on HDD arrays, SSD-buffers on compute nodes, and SSD arrays]
High-cost Maintenance of SSD Arrays
• As time goes by, worn-out SSDs must be replaced
• Each thrown-away SSD has complex internals
• The other parts are still useful; only the flash memories are worn out
[Figure: in an SSD array, a worn-out (dead) SSD is abandoned and replaced while the live SSDs keep serving]
I/O Services Suffer in SSD Arrays
• Varying data locality in an array consisting of 80 SSDs
• A hot region is a group of SSDs holding 10% of the total data
• Arrays without a hot region show reasonable latency
• As the number of hot regions increases, the performance of SSD arrays degrades
Why Is Latency Delayed? Link Contention
• A single data bus is shared by a group of SSDs
• When the target SSD is ready and the shared bus is idle, the I/O request can get service right away
• When excessive I/Os are destined to a specific group of SSDs, contention on the shared bus arises
[Diagram: SSD-1..4 and SSD-5..8 on shared buses; requests Dest-1..Dest-8 arrive while the target SSDs are READY and the buses are IDLE]
Why Is Latency Delayed? Link Contention
• When the shared bus is busy, I/O requests must stay in the buffer even though the target SSD is ready
• This stall happens because SSDs in a group share a data bus: link contention (see the sketch below)
[Diagram: one group's shared bus is BUSY, so requests to READY SSDs in that group STALL, while the other group's IDLE bus keeps servicing]
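A minimal sketch of the link-contention condition above, in Python; the class and field names are hypothetical stand-ins, not part of the Triple-A design:

# Link contention: SSDs in a group share one data bus, so a request
# stalls whenever that bus is busy, even if its target SSD is READY.
class SharedBusGroup:
    def __init__(self, ssd_ids):
        self.bus_busy = False                        # one data bus per group
        self.ssd_ready = {i: True for i in ssd_ids}

    def can_service(self, dest_ssd):
        # Service requires both a ready SSD and an idle shared bus.
        if not self.ssd_ready[dest_ssd]:
            return False     # storage-side wait (see the next slides)
        if self.bus_busy:
            return False     # link contention: the group shares the bus
        return True

group = SharedBusGroup(ssd_ids=[1, 2, 3, 4])
group.bus_busy = True        # an earlier transfer occupies the shared bus
print(group.can_service(2))  # False -> stalled by link contention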
Why Is Latency Delayed? Storage Contention
• When excessive I/Os are destined to a specific SSD, that SSD becomes continuously busy
[Diagram: requests Dest-8 pile up on SSD-8, which turns BUSY while the other SSDs remain READY]
Why Is Latency Delayed? Storage Contention
• When excessive I/Os are destined to a specific SSD
• When the target SSD is busy, the I/O request must stay in the buffer even though the link is available
• This stall happens because a specific SSD is continuously busy: storage contention (see the sketch below)
[Diagram: SSD-8 is BUSY under a stream of Dest-8 requests, so further Dest-8 requests STALL while the link stays available]
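A companion sketch for storage contention: requests to a continuously busy SSD stall even though the link is free (an illustrative model only, with hypothetical names):

from collections import deque

# SSD-8 is continuously busy, so further Dest-8 requests stall in the
# buffer even though the shared link is idle.
queue = deque([("REQ-A", 8), ("REQ-B", 8), ("REQ-C", 8), ("REQ-D", 1)])
busy_ssds = {8}

stalled, issued = [], []
for req, dest in queue:
    if dest in busy_ssds:
        stalled.append(req)    # storage contention: the target SSD is busy
    else:
        issued.append(req)     # the SSD is ready, so the I/O is serviced

print("issued:", issued, "stalled:", stalled)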
Outline
• Motivations
• Triple-A Architecture
• Triple-A Management
• Evaluations
• Conclusions
Unboxing SSD for Cost Reduction
• Worn-out flash packages should be replaced
• Much logic in an SSD, including the H/W controllers and firmware, is wasted when a worn-out SSD is replaced
• Instead of a whole SSD, let's use only the bare flash packages
[Diagram: SSD internals: the host interface controller, flash controllers, microprocessors, DRAM buffers, and firmware (35~50% of total SSD cost) are still useful and reusable; only the bare NAND flash packages are replaced]
Use of Unboxed Flash Packages: FIMM
• Multiple NAND flash packages are integrated onto a board
– Looks like a passive memory device such as a DIMM
– Referred to as a Flash Inline Memory Module (FIMM)
• Control signals and pin assignments are defined
• For convenient replacement of worn-out FIMMs
– A FIMM has a hot-swappable connector
– NV-DDR2 interface designed by ONFi
[Figure: a FIMM board populated with flash packages]
How Are FIMMs Connected? PCI-E
• PCI-E technology provides a high-performance interconnect (see the topology sketch below)
• Root complex – the starting point of I/O
• Switch – a middle-layer component
• Endpoint – where FIMMs are directly attached
• Link – a bus connecting components
[Diagram: PCI-E tree: the HPC host sits at the root complex, switches form the middle layer, and endpoints at the leaves attach the FIMMs]
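The fabric above is a tree, which a few Python dictionaries can sketch; the component names and fan-outs here are illustrative, not the paper's configuration:

# Root complex at the top, switches in the middle, endpoints at the
# leaves, and FIMMs attached behind each endpoint.
fabric = {
    "root_complex": ["switch-0", "switch-1", "switch-2"],
    "switch-0": ["endpoint-0", "endpoint-1"],
    "switch-1": ["endpoint-2", "endpoint-3"],
    "switch-2": ["endpoint-4", "endpoint-5"],
}
# Each endpoint directly attaches a set of FIMMs over its shared bus.
fimms = {ep: [f"{ep}/FIMM-{i}" for i in range(4)]
         for sw in fabric["root_complex"] for ep in fabric[sw]}

def path_to(endpoint):
    # The chain of links an I/O travels from the root complex down.
    for sw in fabric["root_complex"]:
        if endpoint in fabric[sw]:
            return ["root_complex", sw, endpoint]

print(path_to("endpoint-3"))   # ['root_complex', 'switch-1', 'endpoint-3']
print(fimms["endpoint-3"])     # the four FIMMs behind that endpoint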
Connection between FIMMs and PCI-E
• A PCI-E endpoint is where the "PCI-E fabric" and the "FIMMs" meet
– Front-end: the PCI-E protocol toward the PCI-E fabric
– Back-end: the ONFi NV-DDR2 interface toward the FIMMs
• An endpoint consists of three parts
– PCI-E device layers: handle the PCI-E interface
– Control logic: handles the FIMMs over the ONFi interface
– Upstream/downstream buffers: control traffic communication
Connection between FIMMs and PCI-E
• Communication example (see the sketch below)
– (1) A PCI-E packet arrives at the target endpoint
– (2) The PCI-E device layers disassemble the packet
– (3) The disassembled packet is enqueued into the downstream buffer
– (4) The HAL dequeues the packet and constructs a NAND flash command
• Hot-swappable connector for FIMMs
– ONFi 78-pin NV-DDR2 slot
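A minimal sketch of the endpoint's receive path following steps (1)-(4); the class layout and field names are hypothetical stand-ins for the device layers, downstream buffer, and HAL:

from collections import deque

class Endpoint:
    def __init__(self):
        self.downstream = deque()    # traffic toward the FIMMs
        self.upstream = deque()      # traffic back to the PCI-E fabric

    def receive_pcie_packet(self, packet):   # (1) packet arrives
        payload = packet["payload"]           # (2) device layers disassemble
        self.downstream.append(payload)       # (3) enqueue downstream

    def hal_issue(self):
        payload = self.downstream.popleft()   # (4) HAL dequeues...
        return {                              # ...and builds a flash command
            "op": payload["op"],              # e.g. read / program / erase
            "fimm": payload["fimm"],
            "addr": payload["addr"],
        }

ep = Endpoint()
ep.receive_pcie_packet({"payload": {"op": "read", "fimm": 2, "addr": 0x40}})
print(ep.hal_issue())   # NAND command destined for FIMM-2 over ONFi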
Triple-A Architecture
• PCI-E allows architects to build any configuration
• Endpoints are where FIMMs are directly attached
• Triple-A connects a set of FIMMs using PCI-E
• The useful parts of SSDs are aggregated on top of the PCI-E fabric
[Diagram: Triple-A architecture: multi-cores and DRAMs sit above the RCs and switches of the PCI-E fabric; each endpoint attaches four FIMMs]
Triple-A Architecture
• The flash control logic is also moved out of the SSD internals
– Address translation, garbage collection, I/O scheduling, and so on
– Autonomic I/O contention management
• Triple-A architectures interact with hosts or compute nodes
[Diagram: the same Triple-A architecture with a management module beside the multi-cores and DRAMs, interacting with hosts and CNs]
Outline
• Motivations
• Triple-A Architecture
• Triple-A Management
• Evaluations
• Conclusions
Link Contention Management
(1) Hot cluster detection – I/O stalled due to link contention
[Diagram: under a PCI-E switch, the shared data bus of one endpoint is BUSY, making its FIMMs a hot cluster]
Link Contention Management
(1) Hot cluster detection – I/Os stalled due to link contention
(2) Cold cluster securement – clusters with a free link
(3) Autonomic data migration – from the hot cluster to a cold cluster
[Diagram: data migrates from the BUSY hot cluster to an IDLE cold cluster under another endpoint]
Shadow-cloning can hide the migration overheads (see the sketch below)
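A minimal sketch of the three-step policy above; the stall threshold and data structures are hypothetical, since the slides do not give the actual detection criterion:

HOT_STALL_THRESHOLD = 0.5   # fraction of I/Os stalled on the shared link

def detect_hot_clusters(clusters):
    # (1) A cluster whose link keeps a large share of I/Os stalled is hot.
    return [c for c in clusters
            if c["stalled_ios"] / max(c["total_ios"], 1) > HOT_STALL_THRESHOLD]

def secure_cold_clusters(clusters):
    # (2) A cluster whose shared link is free can absorb extra data.
    return [c for c in clusters if c["link_idle"]]

def migrate(hot, cold):
    # (3) Move data from the hot cluster to the cold one; shadow-cloning
    # (backup slides) overlaps the source reads with normal read I/Os.
    cold["data"].append(hot["data"].pop())

clusters = [
    {"id": 0, "stalled_ios": 80, "total_ios": 100, "link_idle": False,
     "data": ["blk-a", "blk-b"]},
    {"id": 1, "stalled_ios": 2, "total_ios": 100, "link_idle": True,
     "data": []},
]
for hot in detect_hot_clusters(clusters):
    for cold in secure_cold_clusters(clusters):
        migrate(hot, cold)
print(clusters[1]["data"])   # ['blk-b'] now lives in the cold cluster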
Storage Contention Management
(1) Laggard detection – I/Os stalled due to storage contention
(2) Autonomic data-layout reshaping for stalled I/Os in the queue
[Diagram: in an endpoint queue, REQ-1 and REQ-2 are issued while the REQ-3s pile up on FIMM-3 and stall; FIMM-3 becomes a laggard]
Storage Contention Management
(1) Laggard detection – I/Os stalled due to storage contention
(2) Autonomic data-layout reshaping for stalled I/Os in the queue (see the sketch below)
– Write I/O: physical data-layout reshaping (to non-laggard neighbors)
– Read I/O: shadow copying (to non-laggard neighbors) & reshaping
[Diagram: the stalled REQ-3s are reshaped to the non-laggard neighbors FIMM-2 and FIMM-4 and become issued]
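A minimal sketch of laggard detection and reshaping; the queue-depth threshold and names are hypothetical, as the slides do not specify the detection criterion:

LAGGARD_QUEUE_DEPTH = 3   # a FIMM whose queue stays deeper is a laggard

def find_laggards(queues):
    return {f for f, q in queues.items() if len(q) > LAGGARD_QUEUE_DEPTH}

def reshape(queues, neighbors):
    laggards = find_laggards(queues)
    for fimm in laggards:
        target = next(n for n in neighbors[fimm] if n not in laggards)
        while len(queues[fimm]) > LAGGARD_QUEUE_DEPTH:
            io = queues[fimm].pop()
            if io["op"] == "write":
                queues[target].append(io)      # physical data-layout reshaping
            else:
                io["shadow_copy_to"] = target  # shadow copy, then reshape
                queues[target].append(io)

queues = {
    "FIMM-3": [{"op": "write"}] * 4 + [{"op": "read"}],   # the laggard
    "FIMM-4": [],                                          # idle neighbor
}
neighbors = {"FIMM-3": ["FIMM-4"], "FIMM-4": ["FIMM-3"]}
reshape(queues, neighbors)
print({f: len(q) for f, q in queues.items()})   # {'FIMM-3': 3, 'FIMM-4': 2}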
Outline
• Motivations
• Triple-A Architecture
• Triple-A Management
• Evaluations
• Conclusions
Experimental Setup
• Flash array network simulation model
– Captures PCI-E-specific characteristics
• Data movement delay, switching and routing latency (PLX 3734), contention cycles
– Configures diverse system parameters
– Will be made publicly available (an open-source framework is in preparation)
• Baseline all-flash array configuration
– 4 switches x 16 endpoints x 4 FIMMs (64GB) = 16TB
– 80-cluster, 320-FIMM network evaluation
• Workloads
– Enterprise workloads (cfs, fin, hm, mds, msnfs, …)
– HPC workload (an eigensolver simulated on an LBNL supercomputer)
– Micro-benchmarks (read/write, sequential/random)
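For reference, the baseline capacity arithmetic works out as: 4 switches x 16 endpoints x 4 FIMMs = 256 FIMMs, and 256 FIMMs x 64GB = 16TB.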
Latency Improvement
• Triple-A latency normalized to a non-autonomic all-flash array
• Real-world workloads: enterprise and HPC I/O traces
• On average, 5x shorter latency
• Specific workloads (cfs and web) generate no hot clusters
Throughput Improvement
• Triple-A IOPS, normalized, as the measure of system throughput
• On average, 6x higher IOPS
• Specific workloads (cfs and web) generate no hot clusters
• Triple-A boosts the storage system by resolving contentions
Queue Stall Time Decrease
• Queue stall time comes from the two resource contentions
• On average, stall time is shortened by 81%
• According to our analysis, Triple-A dramatically decreases the link-contention time
• msnfs shows a low I/O ratio on hot clusters
Network Size Sensitivity
• By increasing the number of clusters (endpoints)
• Execution time is broken down into stall times and storage latency
• Triple-A shows better performance on larger networks
– PCI-E component stall times are effectively reduced
– FIMM latency is outside Triple-A's concern
[Chart: execution time breakdown, non-autonomic array vs. Triple-A]
Related Works (1)
• Market products (SSD arrays)
– [Pure Storage] a one-large-pool storage system with 100% NAND-flash-based SSDs
– [Texas Memory Systems] 2D flash-RAID
– [Violin Memory] a flash memory array of 1000s of flash cells
• Academic studies (SSD arrays)
– [A.M. Caulfield, ISCA'13] proposed an SSD-based storage area network (QuickSAN) by integrating a network adapter into SSDs
– [A. Akel, HotStorage'11] proposed a prototype of a PCM-based storage array (Onyx)
– [A.M. Caulfield, MICRO'10] proposed a high-performance storage array architecture for emerging non-volatile memories
Related Works (2)
• Academic studies (SSD RAID)
– [M. Balakrishnan, TS'10] proposed an SSD-optimized RAID for better reliability by creating age disparities within arrays
– [S. Moon, HotStorage'13] investigated the effectiveness of SSD-based RAID and discussed its reliability potential
• Academic studies (NVM usage for HPC)
– [A.M. Caulfield, ASPLOS'09] exploited flash memory in clusters for performance and power consumption
– [A.M. Caulfield, SC'10] explored the impact of NVMs on HPC
Conclusions
• Challenge: SSD arrays might not be suitable for high-performance storage
• Our goal: propose a new high-performance storage architecture
• Observations
– High maintenance cost: caused by worn-out flash-SSD replacements
– Performance degradation: caused by shared-resource contentions
• Key Ideas
– Cost reduction: take the bare NAND flash out of the SSD box
– Contention resolution: distribute the excessive I/Os that generate bottlenecks
• Triple-A: a new architecture suitable for HPC storage
– Consists of non-SSD bare flash memories
– Automatically detects and resolves performance bottlenecks
• Results: non-SSD all-flash arrays are expected to save 35~50% of cost and offer 53% higher throughput than traditional SSD arrays
Backup
Data Migration Overhead
• A data migration comprises a series of steps (see the sketch below)
– (1) Data read from the source FIMM
– (2) Data movement to the parental switch
– (3) Data movement to the target endpoint
– (4) Data write to the target FIMM
• Naïve data migration shares all-flash array resources with normal I/O requests
– I/O latency is delayed due to resource contention
[Chart: I/O latency under naïve migration]
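A minimal sketch of the four-step naïve migration path; every step is an ordinary transfer, so each competes with foreground I/O for the same array resources (all names are hypothetical):

class Fimm:
    def __init__(self):
        self.blocks = {}
    def read(self, addr):
        return self.blocks.get(addr)
    def write(self, addr, data):
        self.blocks[addr] = data

def naive_migrate(src, dst, addr, log):
    data = src.read(addr)                    # (1) read from the source FIMM
    log.append("move to parental switch")    # (2) hop up through the switch
    log.append("move to target endpoint")    # (3) hop down to the endpoint
    dst.write(addr, data)                    # (4) write to the target FIMM

src, dst, log = Fimm(), Fimm(), []
src.write(0x10, b"victim-block")
naive_migrate(src, dst, 0x10, log)
print(dst.read(0x10), log)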
Data Migration Overhead
• The data read of a migration (the first step) seriously hurts system performance
• Shadow cloning overlaps a normal read I/O request with the data read of the migration (see the sketch below)
• Shadow cloning successfully hides the data-migration overhead and minimizes system performance degradation
[Chart: I/O latency, naïve migration vs. shadow cloning]
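A minimal sketch of shadow cloning: a single flash read serves both the host and step (1) of an in-flight migration, so the extra migration read never contends with foreground I/O (the structures here are hypothetical):

flash = {0x10: b"hot-block"}                  # source FIMM contents
pending_migrations = {0x10: "cold-cluster"}   # addr -> migration target
shadow_buffer = {}                            # cloned data awaiting transfer

def serve_read(addr):
    data = flash[addr]                        # one physical flash read
    if addr in pending_migrations:
        # Clone the freshly read data for the in-flight migration instead
        # of issuing a second, contending read later.
        shadow_buffer[addr] = (data, pending_migrations.pop(addr))
    return data                               # the host reply is not delayed

print(serve_read(0x10))       # the host gets its data immediately
print(shadow_buffer)          # the migration's source read is satisfied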
Real Workload Latency (1)
• CDF of workload latency for the non-autonomic all-flash array and Triple-A
• Triple-A significantly improves I/O request latency
• Relatively low latency improvement in msnfs
– The ratio of I/O requests heading to hot clusters is not very high
– Hot clusters are detected, but they are not that hot (less hot)
[Charts: latency CDFs for proj and msnfs]
Real Workload Latency (2)
• prxy experiences a great latency improvement with Triple-A
• websql does not get as much benefit as expected
– Despite having more and hotter clusters than prxy
– All of its hot clusters are located under the same switch
• In addition to (1) hotness, (2) the balance of I/O requests among switches determines the effectiveness of Triple-A
[Charts: latency CDFs for prxy and websql]
Network Size Sensitivity
• Triple-A successfully reduces both contention times
– By distributing the extra load of hot clusters
– Via data migration and physical data reshaping
• Link-contention time is almost completely eliminated
• Storage-contention time is steadily reduced
– It is bounded by the number of I/O requests to the target clusters
[Chart: contention times, normalized to non-autonomic all-flash arrays]
Why Is Latency Delayed? Storage Contention
• Regardless of the array's condition, individual SSDs can be busy or idle (ready to serve a new I/O)
• When the SSD an I/O is destined for is ready, the I/O can get service right away
• When the SSD an I/O is destined for is busy, the I/O must wait
[Diagram: SSD-3 is READY and services Dest-3 immediately, while SSD-8 is BUSY and Dest-8 must wait]