Liang Cunming
Platform Solution Architect
Data Center / Network Platforms Group
Legal Notices & Disclaimers
Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer or learn more at intel.com.
Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. Consult other sources of information to evaluate performance as you consider your purchase. For more complete information about performance and benchmark results, visit http://www.intel.com/performance.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit http://www.intel.com/performance.
Cost reduction scenarios described are intended as examples of how a given Intel-based product, in the specified circumstances and configurations, may affect future costs and provide cost savings. Circumstances will vary. Intel does not guarantee any costs or cost reduction.
No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.
Intel does not control or audit third-party benchmark data or the web sites referenced in this document. You should visit the referenced web site and confirm whether referenced data are accurate.
© 2017 Intel Corporation. Intel, the Intel logo, and Intel Xeon are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as property of others.
Tick-Tock Development Model: Sustained Microprocessor Innovation Leadership
Innovation delivers new microarchitecture with Skylake
Intel® Microarchitecture Codename Nehalem (Thurley Platform)
• Nehalem: 45nm, New Micro-architecture (Tock)
• Westmere: 32nm, New Process Technology (Tick)
Intel® Microarchitecture Codename Sandy Bridge (Romley Platform)
• Sandy Bridge: 32nm, New Micro-architecture (Tock)
• Ivy Bridge: 22nm, New Process Technology (Tick)
Intel® Microarchitecture Codename Haswell (Grantley Platform)
• Haswell: 22nm, New Micro-architecture (Tock)
• Broadwell: 14nm, New Process Technology (Tick)
Intel® Microarchitecture Codename Skylake (Purley Platform)
• Skylake: 14nm, New Micro-architecture (Tock)
• Future Product (Tick)
Skylake-SP Server CPU Overview
Intel® Hyper-Threading Technology (2 threads/core)
Intel® AVX-512
32 DP FLOPs/Cycle/Core
Non-Inclusive Cache Hierarchy:
SNC: Sub-NUMA Clustering Mode
IO Enhancements
Intel® Turbo Boost Technology
Integrated Voltage Regulator
Mesh Interconnect (SCF)
Memory Enhancements
Integrated Fabric: Intel® Omni-Path Architecture
14nm Process Technology
[Block diagram: Core/LLC tiles with System Agent, DMI, IMC, Intel® UPI, PCIe* 3.0 and Fabric blocks]
Power Management Enhancements (HWPC)
Power Management: Per Core P-State (PCPS), Uncore Frequency Scaling (UFS), Energy Efficient Turbo (EET)
New Feature
Enhanced Feature
Skylake: 6th gen Core processor
IPC increase vs. Broadwell
Skylake Core Micro-Architecture
Parameter               Sandy Bridge   Haswell   Skylake
Out of Order Window     168            192       224
In-flight Loads         64             72        72
In-flight Stores        36             42        56
Scheduler Entries       54             60        97
Integer Register File   160            168       180
FP Register File        144            168       168
Allocation Queue        28/thread      56        64/thread
Extracting more parallelism each generation, ~10% IPC improvement
Cycle Per Packet Improvements
[Chart: cycles/packet (lower is better), E5-2699v4 vs. Platinum 8180, for 1C/1T and 1C/2T]
System configuration is the same as the one used in DPDK layer 3 forwarding test covered in this presentation
Skylake-SP Scalable Coherent Fabric Overview
[Diagram: Xeon E7 v4 24-core die (24 cores, each with a 2.5MB LLC slice and CBO; two Home Agents with memory controllers and DDR channels; QPI Agent with two QPI Links and R3QPI; IIO with R2PCI, PCI-E X16, X16, X8 and X4 (ESI), CB DMA and IOAPIC; UBox; PCU) shown next to the Skylake-SP die]
Mesh Improves Scalability with Higher Bandwidth and Reduced Latencies
Loaded Memory Access Latency
Memory Load Line enables deterministic packet processing at peak levels
• Network Function Virtualization requires deterministic throughput as VMs are added
• Memory controller design and two additional memory channels yield a significant improvement in the loaded latency
(*) Source as of May 2017: Intel internal measurements of BW/latency on platform with Skylake-SP H0 28C internal sample, Core=turbo, CLM=turbo, UPI=10.4, SNC1, 6x32GB DDR4-2400/2667 per CPU, 1 DPC, and platform with E5-2699 v4, Turbo enabled, 4x32GB DDR4-2400, RHEL 7.0. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance
*Other names and brands may be claimed as the property of others.
PCIe Bandwidth
PCI Express platform performance increases up to 2x
• Mesh to I/O improvement, three MS2PCI mesh stops
• Additional Gen 3 x16 PCIe interface, three in total, resulting in up to 82 GB/s per socket
• Improvement in Data Directed I/O architecture, separation of RX and TX data
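For orientation, the raw link bandwidth behind the three Gen 3 x16 interfaces can be estimated from the published PCIe 3.0 signalling rate (8 GT/s per lane, 128b/130b encoding). The sketch below is illustrative only: the function names are made up for this example, it computes theoretical per-direction link bandwidth before packet and protocol overhead, and it does not attempt to reproduce how the per-socket figure on the slide was measured.

```python
# Theoretical PCIe Gen 3 link bandwidth, per direction.
# Assumptions (not from the slide): 8 GT/s per lane and 128b/130b
# line encoding, as defined for PCIe 3.0; protocol overhead is ignored.

GEN3_GT_PER_S = 8.0          # giga-transfers per second, per lane
ENCODING_EFFICIENCY = 128 / 130


def lane_gbytes_per_s() -> float:
    """Raw payload-capable bandwidth of one Gen 3 lane, GB/s, one direction."""
    return GEN3_GT_PER_S * ENCODING_EFFICIENCY / 8  # bits -> bytes


def link_gbytes_per_s(lanes: int) -> float:
    """Raw bandwidth of a link with the given lane count, GB/s, one direction."""
    return lanes * lane_gbytes_per_s()


if __name__ == "__main__":
    for lanes in (16, 40, 48):
        one_way = link_gbytes_per_s(lanes)
        print(f"x{lanes}: {one_way:.2f} GB/s per direction, "
              f"{2 * one_way:.2f} GB/s both directions")
```

Under these assumptions, 48 lanes (three x16 interfaces) give roughly 47 GB/s in each direction; achievable throughput is lower once transaction-layer overhead is included.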
“Results have been estimated based on internal Intel analysis and are provided for informational purposes only. Any difference in system hardware or software design or configuration may affect actual performance. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance/datacenter. Configurations: see next slide”
Translating Core, Memory and I/O Performance to Packet Processing
Data Plane Development Kit
Linux* Foundation Project
• More than 20 key open source projects build on DPDK libraries, including MoonGen*, mTCP*, Ostinato*, Lagopus*, Fast Data (FD.io), Open vSwitch*, OPNFV*, and OpenStack*
SKL-SP Optimizations
• Large MLC enables the packet processing application footprint to remain close to the core
*Other names and brands may be claimed as property of others
Packet Processing Problem Statement
            64 Byte Packet   1024 Byte Packet
10 Gb/s     51 ns            819 ns
100 Gb/s    5 ns             82 ns
From a CPU perspective:
• Last-level-cache (L3) hit ~40 cycles
• L3 miss, memory read is ~70ns (140 cycles at 2GHz)
• Added security complexity
• Harder to address at 100Gb rates
[Chart: MPPS vs. Packet Size (64, 128, 256, 512, 1024, 1518 bytes) for 10GbE Packets/Second and 100GbE Packets/Second; marked values 15, 150 and 85 MPPS; regions labeled Communication Infrastructure and Typical Data Center]
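The per-packet time budgets on this slide follow from simple arithmetic, sketched below. The helper names are invented for this example. Two assumptions are worth flagging: the time-per-packet figures count only the packet bytes themselves, while the packets-per-second estimate adds 20 bytes of Ethernet preamble and inter-frame gap per packet, which is the usual convention for line-rate figures and is not stated on the slide.

```python
# Per-packet time and cycle budget at a given link speed.
# Helper names are hypothetical; WIRE_OVERHEAD_BYTES is an assumption
# (8-byte preamble/SFD + 12-byte inter-frame gap), not from the slide.

WIRE_OVERHEAD_BYTES = 20


def packet_time_ns(packet_bytes: int, link_gbps: float) -> float:
    """Serialization time of one packet in ns (1 Gb/s == 1 bit/ns)."""
    return packet_bytes * 8 / link_gbps


def ns_to_cycles(time_ns: float, cpu_ghz: float) -> float:
    """Convert a time in ns to core cycles at the given clock."""
    return time_ns * cpu_ghz


def line_rate_mpps(packet_bytes: int, link_gbps: float) -> float:
    """Packets per second (millions) at line rate, including wire overhead."""
    bits_on_wire = (packet_bytes + WIRE_OVERHEAD_BYTES) * 8
    return link_gbps * 1000 / bits_on_wire


if __name__ == "__main__":
    for gbps in (10, 100):
        for size in (64, 1024):
            t = packet_time_ns(size, gbps)
            print(f"{gbps:>3} Gb/s, {size:>4} B: {t:7.2f} ns "
                  f"= {ns_to_cycles(t, 2.0):7.1f} cycles at 2 GHz")
    print(f"L3 miss of 70 ns = {ns_to_cycles(70, 2.0):.0f} cycles at 2 GHz")
    print(f"64 B line rate: {line_rate_mpps(64, 10):.2f} Mpps at 10GbE, "
          f"{line_rate_mpps(64, 100):.2f} Mpps at 100GbE")
```

At 100 Gb/s a 64-byte packet leaves about 10 cycles at 2 GHz, far less than the ~140 cycles of a single L3 miss, which is why per-packet memory stalls have to be hidden or amortized across many packets.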
Terabit Throughput Level with Unmodified SW
Breaking the Software Defined Network Services Barrier
1 Terabit Services on dual Intel® Xeon® Server !!! with DPDK, Fortville-25, Lewisburg
Intel® XEON® CPUs (E5 v3/v4)
a. Each socket has 40 lanes of PCIe Gen3
b. 2x 160Gbps of packet I/O per socket
Intel® XEON® CPUs (Skylake-SP)
a. Each socket has 48 lanes of PCIe Gen3
b. 2x 280Gbps of packet I/O per socket
https://www.sinog.si/wp-content/uploads/2017/05/SINOG-VPP.pdf
https://fd.io/2017/07/fdio-doubles-packet-throughput-performance-terabit-levels/
Unlocking Platform Capability by DPDK
Kernel: IGB_UIO, KNI, UIO_PCI_GENERIC, VFIO
Userspace:
• Packet classification: software libraries for hash/exact match, LPM, ACL etc.
• Accelerated SW libraries: common functions such as IP fragmentation, reassembly, reordering etc.
• Stats: libraries for collecting and reporting statistics.
• QoS: libraries for QoS scheduling and metering/policing.
• Packet Framework: libraries for creating complex pipelines in software.
• Core libraries: core functions such as memory management, software rings, timers etc.
Network Functions (Cloud, Enterprise, Telco)
DPDK Fundamentals
• Implements run-to-completion and pipeline models
• No scheduler - all devices accessed by polling
• Supports 32-bit and 64-bit OSs, with and without NUMA
• Scales from Intel® Atom® to Intel® Xeon® processors
• Number of cores and processors is not limited
• Optimal packet allocation across DRAM channels
• Use of 2M & 1G hugepages and cache aligned structures
• Uses bulk concepts - processing ‘n’ packets simultaneously
• Open source and BSD licensed
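The polling, run-to-completion and bulk-processing ideas in the list above can be illustrated with a small, language-neutral sketch. This is not DPDK code and uses none of its APIs: the queue is a plain Python deque standing in for a NIC receive ring, and `rx_burst`, `run_to_completion` and `BURST_SIZE` are names invented for this example. It only loosely mirrors the shape of the burst-oriented receive and transmit calls that DPDK exposes in C.

```python
# Illustrative poll-mode, run-to-completion loop with burst processing.
# All names here are hypothetical; a deque stands in for a hardware ring.
from collections import deque

BURST_SIZE = 32  # assumed burst size, chosen for illustration


def rx_burst(rx_queue: deque, max_pkts: int) -> list:
    """Take up to max_pkts packets without blocking (polling, no interrupts)."""
    burst = []
    while rx_queue and len(burst) < max_pkts:
        burst.append(rx_queue.popleft())
    return burst


def run_to_completion(rx_queue: deque, tx_queue: deque, process, max_polls: int) -> int:
    """Poll the queue; each burst is received, processed and sent on one core."""
    handled = 0
    for _ in range(max_polls):
        burst = rx_burst(rx_queue, BURST_SIZE)
        if not burst:
            continue  # nothing arrived: keep polling instead of sleeping
        # Process the whole burst before touching the queue again, so the
        # per-call overhead is amortized over up to BURST_SIZE packets.
        tx_queue.extend(process(pkt) for pkt in burst)
        handled += len(burst)
    return handled


if __name__ == "__main__":
    rx = deque(range(100))   # 100 fake "packets"
    tx = deque()
    count = run_to_completion(rx, tx, lambda pkt: pkt + 1, max_polls=10)
    print(count, len(tx))
```

A pipeline model would instead split `process` into stages running on different cores and connected by software rings; the burst-at-a-time structure stays the same.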
• ETHDEV: PMDs for physical and virtual Ethernet devices
• CRYPTODEV: PMDs for HW and SW crypto accelerators
• EVENTDEV: Event-driven PMDs (HW & SW)
• SECURITY: Hardware acceleration APIs
• COMPRESS: PMDs for HW and SW compression accelerators
• RAW: Generic devices w/o specific type
Bridging Various Accelerators
Seamless interface to accelerators
DPDK Framework
Generic APIs
Application is abstracted from the underlying SW and HW with DPDK
Preserve Platform and Application software investment
Optimized platform software ingredients (e.g. vSwitch) to take advantage of HW and SW ingredients
Flexible data plane with outstanding performance
IA Platform
Optional Solutions
Application
DPDK Framework
Optimized Platform Software (OS / Hypervisor)
Optimized Software on CPU ISA (e.g., AES, AVX)
Integrated / Discrete FPGA
Smart NIC
Accelerators (Intel® QAT)
Standard NIC
Application Abstracted from Platform
Community Ecosystem
A fully open source software project with a strong development community
Boosts Open Source Projects
Enriches Research & Innovation
mTCP [NSDI '14]
MoonGen [IMC '15]
NetBricks [OSDI '16]
mOS [NSDI '17]
SoftFlow [ATC '16]
StatelessNF [NSDI '17]
IX [OSDI '14]
Software RAN [CCTS '15]
NFP [SIGCOMM '17]
NFVnice [SIGCOMM '17]
OpenNetVM [HotMiddlebox '16]
VigNAT [SIGCOMM '17]
ExpressPass [SIGCOMM '17]
Decibel [NSDI '17]
APUNet [NSDI '17]
Flowtune [NSDI '17]
SwitchKV [NSDI '16]
MICA [NSDI '14]
ClickNP [SIGCOMM '16]
Trumpet [SIGCOMM '16]
PISCES [SIGCOMM '16]
ESWITCH [SIGCOMM '16]
STYX [SOCC '17]
FTMB [SIGCOMM '15]
BlindBox [SIGCOMM '15]
ScaleBricks [SIGCOMM '15]
NetCache [SOSP '17]
Future: Toward Cloud-Native Network Functions
• Primary Constructs
• DevOps/Continuous delivery/Micro services/Containers
• Unique Considerations of Network Functions
• Data plane packet processing requires an optimized architecture
• Domain specific protocol is absent
• Intergenerational transformation & compatibility
Summary
• Powerful Multi-Core Scalable Architecture Processor
• Unlock Packet Processing Capability by DPDK
• Seamless Interface to Various Accelerators
• Fantastic Ecosystem for Innovation