scaling smart - crisp
TRANSCRIPT
SCALING SMART: EAST-WEST & NORTH-SOUTH SCALING OF COMPUTATION WITH DATA
Pankaj Mehra
VP of Product Planning
Samsung Electronics
November 5, 2020
This presentation and/or accompanying oral statements by Samsung representatives (collectively, the “Presentation”) is
intended to provide information concerning the SSD and memory industry and Samsung Electronics Co., Ltd. and certain
affiliates (collectively, “Samsung”). While Samsung strives to provide information that is accurate and up-to-date, this
Presentation may nonetheless contain inaccuracies or omissions. As a consequence, Samsung does not in any way
guarantee the accuracy or completeness of the information provided in this Presentation.
This Presentation may include forward-looking statements, including, but not limited to, statements about any matter that
is not a historical fact; statements regarding Samsung’s intentions, beliefs or current expectations concerning, among
other things, market prospects, technological developments, growth, strategies, and the industry in which Samsung
operates; and statements regarding products or features that are still in development. By their nature, forward-looking
statements involve risks and uncertainties, because they relate to events and depend on circumstances that may or may
not occur in the future. Samsung cautions you that forward-looking statements are not guarantees of future performance
and that the actual developments of Samsung, the market, or industry in which Samsung operates may differ materially
from those made or suggested by the forward-looking statements in this Presentation. In addition, even if such forward-
looking statements are shown to be accurate, those developments may not be indicative of developments in future
periods.
1 Elements of Infrastructure: Bits, Cores, and Fabrics
2 Challenges of Data at Scale
3 Samsung SmartSSD® Computational Storage Device
4 Some Thesis Topics for your consideration
Data-Centric Architecture
Bits, Cores & Fabrics: the elements of infrastructure
Data Center Infrastructure in context
Operations Infrastructure
Data, Applications Services
Bits, Cores, Fabrics Security & Virtualization
Memory & Storage
Information Engines
Confidential Computing
Domain Specific Architecture
Sites, Services, APIs
Dense Virtualization
Data Tiers
Acceleration Orchestration
Cloud Scaling
Computational Storage
Data@Scale
Key Infrastructure Themes
• Data centricity
• Rapid evolution of memory
• Connectivity
• Smarts
• Persistence
Bits, Cores & Fabrics
The foundation of infrastructure
What it means for SSDs for instance
BITS
Intelligent Bits, SDS
Service & SDN
Connected Bits, RDMA
Services   Information  Cloud
mServices  Objects      Protocols
APIs       Metadata     Topologies
Software   Data         Routes
OS         Metabits     Endpoints
Cores      Bits         Fabrics
ALUs       Caches       I/O Ports
Firmware   Data Paths   Switches
control    state        flow
Universal System Concepts
Universal Hardware Concepts
Data@Scale
• In the Data Center
• Broadcast & Edge
The challenges of data at scale
Challenges of Data@Scale
• At Cloud’s Core DCs
– Bottlenecks Rooflines
– Inefficiencies Sprawl
• At Cloud’s Edge
– Latency Frustration
– Communication Costs
– Lost opportunities to capture context or detail
Challenges of Data@Scale
Bottlenecks
• Processing power and processing bandwidth
• Metadata inefficiency of object storage & retrieval
• Wire protocol termination for disaggregated flash
• Inability to deliver both performance and scale
Inefficiencies
• Wasted endurance
• Wasted memory BW
• CPU overhead of I/O
• CPU overhead of I/O virtualization
• Virtualization offload
• SMRDB (since HDD days)
• DB filtering acceleration
• Storage NW conv (since FC)
• Active Disk (since HDD days)
• OSD (since HDD days)
Why Revisit?
Because in 2020, three distinct 25-year-old ideas meet the SSD!
Many Good Ideas, Already In-Play
[Figure: hype cycle, visibility over time. Stages: Technology trigger, Peak of inflated expectations, Trough of disillusionment, Slope of enlightenment, Plateau of productivity. Plotted items: Key-value device (KV SSD), Computational Storage (SmartSSD), Disaggregated Storage (E-SSD). Legend: Now, 1~2 yrs., 3+ yrs.]
Scale-optimized storage devices: Summary of benefits
Devices compared: SmartSSD, Ethernet SSD, Key-Value SSD, Zoned Namespaces
Benefits compared:
Application Awareness
Acceleration
Reduce data-related CPU load
Improved Write Endurance
Fewer protocol terminations
Min device virtualization o/h
Fewer stack translations
Metadata Optimization
Scaling Data Bandwidth
Saving L2-to-Memory BW
Control@Scale (IODT, QoS)
Maximize #SSDs/chassis
Possible Convergence
Converged devices: KV Smart, eSmart, e-KV
KV SSD: Beyond block, CPU util, 10PB+
SmartSSD: Near-data proc, Performance, Scalability (100TB+)
E-SSD: Disagg. Block, TCO, IOPS
Workloads: OLTP, OLAP, Object, Data Lake, Media Blob, Block, Dense VMs, Serverless
Host Interface  Addressing  Accelerator
PCIe            Block       None
Ethernet        ZNS         FPGA
                Key-Value
Samsung SmartSSD®
Computational Storage Device
SmartSSD® CSD Scales to Accelerate Data-Rich Workloads
Computational Storage
• 3 & 6 GBps internal BW per device: Minimize external data movement
• FPGA: Each device has 3x~10x core equivalents for offload/acceleration
• 4TB storage, 4 GB FPGA DRAM: For Inline and Data@Rest processing
Scalable Performance
• Near Data Processing: Data format conversion, Filtering, Metadata management, DB Analytics, Video processing
• New Services: Secure content, Edge acceleration
SmartSSD U.2 Platform: FPGA, SSD Controller, V-NAND 4TB
Acceleration Concept
Partner Solutions: H.264 Video Transcoding, SparkSQL with Parquet Data, P2P Compression and Decompression
SmartSSD® CSD HW Architecture
• Peer-to-peer (P2P) communication enables unlimited concurrency
– SSD:Accelerator data transfers use internal data path
• Save precious L2:DRAM Bandwidth (Compute Nodes)
• Scale without costly x86 frontend (Storage Nodes)
• Avoid the unnecessary funneling and data movement of standalone accelerators
– FPGA DRAM is exposed to Host PCIe address space
• NVMe commands can securely stream data from SSD to FPGA peer-to-peer
[Diagram labels: CPU (Host); NVMe; SmartSSD® CSD; Soft PCIe Switch; SSD Controller; NAND; Accelerator; FPGA DRAM; P2P communication; NVMe SSD; FPGA Accel; PCIe Address Space]
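The effect of keeping SSD-to-accelerator transfers on the internal data path can be illustrated with a back-of-the-envelope model. This is only a sketch, not Samsung code: the function name `host_link_bytes`, the 1 TB scan size, and the 5% filter selectivity are assumptions chosen for illustration. It compares the bytes that cross the host PCIe link when filtering happens on the host versus on the in-device FPGA.

```python
def host_link_bytes(scan_bytes: int, selectivity: float, near_data: bool) -> int:
    """Bytes crossing the host PCIe link for one scan-and-filter pass.

    selectivity is the fraction of scanned bytes that survive the filter.
    With near-data processing the SSD-to-FPGA transfer stays inside the
    device, so only the filtered results reach the host.
    """
    if not 0.0 <= selectivity <= 1.0:
        raise ValueError("selectivity must be between 0 and 1")
    if near_data:
        return round(scan_bytes * selectivity)
    return scan_bytes  # host-side filtering: every scanned byte goes to the CPU


TB = 10**12
scan = 1 * TB   # assumed scan size (illustrative)
sel = 0.05      # assumed filter selectivity (illustrative)

conventional = host_link_bytes(scan, sel, near_data=False)
p2p = host_link_bytes(scan, sel, near_data=True)
reduction = conventional / p2p

print(f"host-side filter: {conventional / TB:.2f} TB over the host link")
print(f"near-data filter: {p2p / TB:.2f} TB over the host link")
print(f"reduction: {reduction:.0f}x")
```

The model ignores command and metadata traffic; its only point is that host-link traffic shrinks in proportion to the filter selectivity when the filter runs next to the data.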
Samsung SmartSSD® Technology Roadmap
• Samples, development tools, partners solutions available for immediate PoC
• Customer PoC Test&Dev systems/support available from Samsung and partners
v1.0 SmartSSD® U.2 CSD
2nd Generation
1H’20
U.2 ES
Partner Solution
Customer PoCs
Partner PoCs
U.2 FF: Scale Processing to 24 ~ 48 devices
4TB, PCIe Gen3x4 External, ~530K LUTs
Next Gen SmartSSD® CSD
Customer requirements: Integration, Interfaces, FF, workloads
SNIA API and NVMe Protocol for Computational Storage
Flexible SmartSSD® IP Development Options
• Deploy off-the-shelf IP and solutions from our partners
• Use familiar Xilinx tools to develop new IP or redeploy existing accelerator IP from ASICs or FPGAs
• Use custom IP development services from Samsung and Xilinx partners
• Enterprise Class SSD Controller: NVMe 1.3, CMB, AES-256
• 4TB Capacity
• 523K Total LUTs, ~330K LUTs total in dynamic region available for acceleration IP
• 4GB FPGA DDR
• External interface: PCIe Gen3x4, Internal BW: PCIe Gen3x4
SSDAccel Runtime
Application
Utilities and Libraries
Connectors & Optimizer
Developing on SmartSSD® CSD
• Frameworks supported by partners
– Spark, Kafka available
– FFmpeg coming
– Many more in development
• Supported OSes
– Linux
– FreeBSD
– Windows Server
• Ease of porting for SDAccel OpenCL developers
• Vivado-friendly for RTL developers
Developing on SmartSSD® CSD (cont.)
• Xilinx SDx 2019.2 tool chain
• Samsung SDK available
• U.2 Platform Shell
• xocc --platform /opt/Xilinx/Vitis/2019.2/platforms/xilinx_samsung_U2x4_201920_2/xilinx_samsung_U2x4_201920_2.xpfm
• Generate workspace using the above platform and compile
• It’s that SIMPLE!!!
xbutil
SmartSSD® OpenCL Programming in 5 Steps
• Secure, P2P data movement
– Data moves in/out of SSD only under control of storage stack (NVMe)
Acceleration platform: Bring your own IP
+ Accelerated storage services: Comp/decomp, Encrypt/decrypt, Erasure coding
+ Accelerated application frameworks: Video encoding, DB acceleration, Storage and Virtualization, AI and ML
IP Dev Toolchain: Runtime, Libraries, API, Drivers
Connectors to Application Frameworks
Storage Acceleration IP
SmartSSD® CSD Use Cases and Ecosystem
• Storage Services: Comp/Decomp, Encrypt/Decrypt, Metadata management, Erasure Coding
• Real-time Analytics & Biz Intelligence: DB Query (Spark, PostgreSQL, ..), Log analytics, genomics, physics
• Rich Media and ML: transcoding, live streaming, object detection
Talk “Arrow” to Parquet data on SmartSSD™ drive
Pipeline stages: parse, compress, encrypt, index, stats; decrypt, decompress, parse, scan-filter
Scales to 24x units
• 2.8x faster execution on SmartSSD™ CSD
• Lower CPU utilization
• Only processed results move to CPU
[Figure: query timeline from query start. Un-accelerated (99 seconds): Transfer Full Data, then Filter and Decompress, then SQL Process, all on the CPU. SmartSSD™ CSD (35 seconds): Filter & Decomp on the SmartSSD™ CSD, then Transfer results, then SQL Process on the CPU.]
[Figure: Queries per hour vs. # of SmartSSD™ CSD: 1 device, 10; 2 devices, 17; 4 devices, 29; 8 devices, 47; 12 devices, 58.]
Performance scales with each SmartSSD™ CSD
SmartSSD® CSD accelerates DB and Analytics
• Scale to larger data sets with fewer servers
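The headline numbers on this slide can be cross-checked with a few lines of arithmetic. The sketch below assumes the chart values read as 10, 17, 29, 47, and 58 queries per hour at 1, 2, 4, 8, and 12 devices, which is an interpretation of the extracted figure rather than data confirmed elsewhere; the `scaling_efficiency` helper is hypothetical and defined here only for illustration.

```python
# Figures as read from the slide. The pairing of chart values to device
# counts is an interpretation of the extracted figure.
UNACCELERATED_S = 99
SMARTSSD_S = 35
QUERIES_PER_HOUR = {1: 10, 2: 17, 4: 29, 8: 47, 12: 58}

speedup = UNACCELERATED_S / SMARTSSD_S


def scaling_efficiency(qph: dict) -> dict:
    """Throughput at n devices divided by n times the single-device throughput."""
    base = qph[1]
    return {n: q / (n * base) for n, q in qph.items()}


efficiency = scaling_efficiency(QUERIES_PER_HOUR)

print(f"speedup: {speedup:.1f}x")
for n, e in efficiency.items():
    print(f"{n:>2} devices: {QUERIES_PER_HOUR[n]} queries/hour, efficiency {e:.0%}")
```

Under that reading, the 99-second and 35-second timings agree with the quoted 2.8x, and throughput keeps rising with every added device, though less than linearly.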
Offload CPU &
SmartNIC
Eliminates CPU-only scaling bottlenecks
• Offloads CPU, more content per server
Indexes digital media assets
• Transforms content for consumption
• Detects and tags objects
• Speeds up object and image retrieval
Frees up SmartNIC for value-added tasks
• e.g. account fraud and usage analytics
[Figure: 1920x1080p Frames per Second: CPU only, 735; 3x SmartSSD™ CSD, 885.]
[Figure: CPU Utilization: CPU only, 99%; 3x SmartSSD™ CSD, 12%.]
Scales to 24x units
Pipeline stages: segment, encode2, extract, index, decode1, lookup, decode2, sample, encode3
SmartSSD® CSD enables Efficient Media Processing
• Process more video and images with fewer servers
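One way to read the two charts together is to ask how many frames per second each configuration delivers per unit of CPU utilization. This combined metric does not appear on the slide; it is an illustrative assumption, as are the variable and function names in the sketch below.

```python
# Chart values from the slide: frames per second and CPU utilization.
CPU_ONLY = {"fps": 735, "cpu_util": 0.99}
WITH_CSD = {"fps": 885, "cpu_util": 0.12}  # 3x SmartSSD CSD configuration


def fps_per_cpu(measurement: dict) -> float:
    """Frames per second delivered per unit of CPU utilization (illustrative metric)."""
    return measurement["fps"] / measurement["cpu_util"]


throughput_gain = WITH_CSD["fps"] / CPU_ONLY["fps"]
cpu_freed = CPU_ONLY["cpu_util"] - WITH_CSD["cpu_util"]
efficiency_ratio = fps_per_cpu(WITH_CSD) / fps_per_cpu(CPU_ONLY)

print(f"throughput gain: {throughput_gain:.2f}x")
print(f"CPU utilization freed: {cpu_freed:.0%}")
print(f"frames per unit of CPU utilization: {efficiency_ratio:.1f}x higher")
```

The raw frame-rate gain is modest; most of the benefit shows up as CPU capacity that becomes available for other work on the same server.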
External Block protocol,
with acceleration offload
Scales to 24-48x units
I/O for Virtual Machines
dedupe
compress
encrypt decrypt
decompress
Enables additional value-added stack
• Data caching after decompression
• Decompression latency reduction
• Increased array IOPS
• Offloads hypervisors
[Figure: Compression and Decompression Bandwidth [GBps]: External Accelerator, 12; SmartSSD™ CSD, 72.]
SmartSSD® CSD fuels denser storage
• Offload compression and virtualization to embedded accelerator
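The bandwidth comparison follows from simple aggregation: an embedded accelerator's bandwidth grows with the number of drives, while an external accelerator is bounded by its own link. The sketch below assumes the 72 GBps figure is the aggregate of 24 devices at 3 GBps each. The slide does not state this; it is an inference from the per-device internal bandwidth and the 24-device scaling quoted elsewhere in the deck. Both helper functions are hypothetical.

```python
import math

PER_DEVICE_GBPS = 3        # lower per-device internal bandwidth figure from the deck
EXTERNAL_ACCEL_GBPS = 12   # external accelerator, from the chart
CHART_CSD_GBPS = 72        # SmartSSD CSD, from the chart


def aggregate_gbps(devices: int, per_device_gbps: float = PER_DEVICE_GBPS) -> float:
    """Aggregate bandwidth when every device compresses its own data internally."""
    return devices * per_device_gbps


def devices_to_match(target_gbps: float, per_device_gbps: float = PER_DEVICE_GBPS) -> int:
    """Smallest device count whose aggregate bandwidth reaches target_gbps."""
    return math.ceil(target_gbps / per_device_gbps)


print(f"24 devices: {aggregate_gbps(24)} GBps aggregate (assumed configuration)")
print(f"advantage over external accelerator: {CHART_CSD_GBPS / EXTERNAL_ACCEL_GBPS:.0f}x")
print(f"devices needed to match the external accelerator: {devices_to_match(EXTERNAL_ACCEL_GBPS)}")
```

If the assumption holds, the chart's 6x advantage is what 24 drives deliver, and the aggregate would continue to grow toward the 48-device end of the quoted range.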
Metadata management
Computational Storage Use Cases Examples
• 3rd party and proprietary acceleration stacks run on Computational Storage to accelerate real-time analytics and regex searches for cybersecurity
SmartSSD
Compute Nodes
New: Accelerated Cache Nodes
Storage Nodes
Cache Up
Computational Storage Processor (CSP)
Computational Storage Drive (CSD)
Proprietary IP
On-Prem
Analytics Cache Node RegEx Appliance
RegEx Appliance, 24x SmartSSD, 48TB
• Throughput scales to large datasets and complex searches
• >10x throughput improvement compared to x86
(across DC Fabric/WAN)
log scale!
Direct2Edge 30x to 60x faster
Edge Applications
• Low Latency Video Streaming using SmartSSD® CSDs
New Scaling Ideas
• Fight latency and SPC costs of edge by coalescing server types using North-South scaling
– No/low server-to-server latency
– Bigger caches yield higher hit rates despite stream skews
• Rein in the Sprawl of Analytics clusters by using East-West Scaling
– Use 2x-10x acceleration to reduce scale-out cluster size for a particular workload
– Each server handles more data
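The East-West argument reduces to a sizing calculation: if each server processes its share of the workload faster, fewer servers are needed for the same throughput. The sketch below is a minimal illustration under two assumptions that are not in the slides: a baseline cluster of 100 servers, and speedup that applies uniformly to the whole workload. The `servers_needed` function is hypothetical.

```python
import math


def servers_needed(baseline_servers: int, speedup: float) -> int:
    """Servers required to keep baseline throughput when each server is `speedup` times faster."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return math.ceil(baseline_servers / speedup)


BASELINE = 100  # assumed un-accelerated cluster size (illustrative)

# The 2x and 10x endpoints come from the slide; 5x is an intermediate point.
sizes = {s: servers_needed(BASELINE, s) for s in (2, 5, 10)}
for s, n in sizes.items():
    print(f"{s}x acceleration: {n} servers instead of {BASELINE}")
```

In practice the reduction is smaller whenever part of the workload cannot be offloaded, so these figures are best read as an upper bound on the savings.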
Suggested Thesis Topics / Recommended Reading
• Beyond RAID and EC, how to realize computational storage with sharded/coded data
Recommended reading:
– Mert Pilanci’s paper on Polar Coded Matrix Multiplication
– Martin Abadi’s work on calculating with compressed and encoded data
• Optimizing and orchestrating across external and embedded accelerators at scale
Recommended reading:
– Zhenyuan Ruan’s work on INSIDER
– Maysam Lavasani’s Ph.D. thesis
N. S. Kim and P. Mehra, "Practical Near-Data Processing to Evolve Memory and Storage Devices into Mainstream Heterogeneous Computing Systems," 2019 56th ACM/IEEE Design Automation Conference (DAC), Las Vegas, NV, USA, 2019, pp. 1-4.