Liang Cunming
Platform Solution Architect
Data Center / Network Platforms Group

Legal Notices & Disclaimers

Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer or learn more at intel.com.

Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. Consult other sources of information to evaluate performance as you consider your purchase. For more complete information about performance and benchmark results, visit http://www.intel.com/performance.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit http://www.intel.com/performance.

Cost reduction scenarios described are intended as examples of how a given Intel-based product, in the specified circumstances and configurations, may affect future costs and provide cost savings. Circumstances will vary. Intel does not guarantee any costs or cost reduction.

No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.

Intel does not control or audit third-party benchmark data or the web sites referenced in this document. You should visit the referenced web site and confirm whether referenced data are accurate.

© 2017 Intel Corporation. Intel, the Intel logo, and Intel Xeon are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as property of others.

Tick-Tock Development Model: Sustained Microprocessor Innovation Leadership

Innovation delivers new microarchitecture with Skylake

Intel® Microarchitectures: Codename Nehalem, Codename Sandy Bridge, Codename Haswell, Codename Skylake

Codename         Process   Step                            Platform
Nehalem          45nm      Tock (New Micro-architecture)   Thurley
Westmere         32nm      Tick (New Process Technology)   Thurley
Sandy Bridge     32nm      Tock                            Romley
Ivy Bridge       22nm      Tick                            Romley
Haswell          22nm      Tock                            Grantley
Broadwell        14nm      Tick                            Grantley
Skylake          14nm      Tock                            Purley
Future Product             Tick

Skylake-SP Server CPU Overview

Intel® Hyper-Threading Technology (2 threads/core)

Intel® AVX-512

32 DP FLOPs/Cycle/Core

Non-Inclusive Cache Hierarchy:

SNC: Sub-NUMA Clustering Mode

IO Enhancements

Intel® Turbo Boost Technology

Integrated Voltage Regulator

Mesh Interconnect (SCF)

Memory Enhancements

Integrated Fabric: Intel® Omni-Path Architecture

14nm Process Technology

[Figure: Skylake-SP block diagram showing Core/LLC tiles, System Agent, DMI, IMC, Intel® UPI, PCIe* 3.0 and Fabric]

Power Management Enhancements (HWPC)

Power Management: Per Core P-State (PCPS), Uncore Frequency Scaling (UFS), Energy Efficient Turbo (EET)

Legend: New Feature, Enhanced Feature

Skylake: 6th gen Core processor

IPC increase vs. Broadwell
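The "32 DP FLOPs/Cycle/Core" figure listed with Intel® AVX-512 can be reproduced with simple arithmetic. The sketch below assumes two 512-bit FMA units per core, which the slide does not state; the function names and the `peak_gflops` helper are illustrative only.

```python
# Peak double-precision FLOPs per cycle per core with AVX-512.
# Assumption (not on the slide): two 512-bit FMA units per core.

VECTOR_BITS = 512        # AVX-512 register width
DOUBLE_BITS = 64         # width of one double-precision value
FMA_UNITS_PER_CORE = 2   # assumed; parts with one 512-bit FMA unit give half the result
FLOPS_PER_FMA = 2        # a fused multiply-add counts as one multiply plus one add


def dp_flops_per_cycle(fma_units: int = FMA_UNITS_PER_CORE) -> int:
    """FLOPs per cycle per core = vector lanes * 2 (FMA) * number of FMA units."""
    lanes = VECTOR_BITS // DOUBLE_BITS
    return lanes * FLOPS_PER_FMA * fma_units


def peak_gflops(cores: int, ghz: float, fma_units: int = FMA_UNITS_PER_CORE) -> float:
    """Theoretical peak for a whole socket; cores and ghz are caller-supplied."""
    return cores * ghz * dp_flops_per_cycle(fma_units)


if __name__ == "__main__":
    print(dp_flops_per_cycle())
```

This is a theoretical ceiling; sustained throughput depends on the clock frequency the core actually holds while running AVX-512 code, so the frequency is left as a parameter rather than filled in.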

Skylake Core Micro-Architecture

                        Sandy Bridge   Haswell   Skylake
Out of Order Window     168            192       224
In-flight Loads         64             72        72
In-flight Stores        36             42        56
Scheduler Entries       54             60        97
Integer Register File   160            168       180
FP Register File        144            168       168
Allocation Queue        28/thread      56        64/thread

Extracting more parallelism each generation, ~10% IPC improvement
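To put numbers on "extracting more parallelism each generation", the following sketch computes the percentage growth of each buffer between generations, using only the values from the table. The Allocation Queue row is left out because its entries mix per-thread and total counts. The helper names are illustrative.

```python
# Out-of-order resource sizes from the slide: (Sandy Bridge, Haswell, Skylake).
RESOURCES = {
    "Out of Order Window": (168, 192, 224),
    "In-flight Loads": (64, 72, 72),
    "In-flight Stores": (36, 42, 56),
    "Scheduler Entries": (54, 60, 97),
    "Integer Register File": (160, 168, 180),
    "FP Register File": (144, 168, 168),
}


def growth_pct(old: int, new: int) -> float:
    """Relative growth from old to new, in percent, rounded to one decimal."""
    return round(100.0 * (new - old) / old, 1)


def generational_growth(table):
    """Map each resource to (Sandy Bridge -> Haswell, Haswell -> Skylake) growth."""
    return {
        name: (growth_pct(snb, hsw), growth_pct(hsw, skl))
        for name, (snb, hsw, skl) in table.items()
    }


if __name__ == "__main__":
    for name, (to_hsw, to_skl) in generational_growth(RESOURCES).items():
        print(f"{name:22s} {to_hsw:6.1f}% {to_skl:6.1f}%")
```

Buffer growth is not the same thing as IPC growth: the structures grow by anywhere from 0% to over 60% per generation, while the slide quotes roughly 10% IPC improvement.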

Cycle Per Packet Improvements

[Figure: Cycles/packet (lower is better) for E5-2699v4 and Platinum 8180, at 1C/1T and 1C/2T]

System configuration is the same as the one used in DPDK layer 3 forwarding test covered in this presentation

Skylake-SP Scalable Coherent Fabric Overview

[Figure: Xeon E7 v4 24-core die (cores with 2.5MB LLC slices and CBo, Home Agents with DDR memory controllers, QPI Agent and QPI Links, R3QPI, R2PCI, IIO with PCI-E X16, X16, X8 and X4 (ESI), CB DMA, IOAPIC, UBox, PCU) shown alongside the Skylake-SP die]

Mesh Improves Scalability with Higher Bandwidth and Reduced Latencies

Loaded Memory Access Latency

Memory Load Line enables deterministic packet processing at peak levels

• Network Function Virtualization requires deterministic throughput as VMs are added

• Memory controller design and two additional memory channels yield a significant improvement in the loaded latency (*)

(*) Source as of May 2017: Intel internal measurements of BW/latency on platform with Skylake-SP H0 28C internal sample, Core=turbo, CLM=turbo, UPI=10.4, SNC1, 6x32GB DDR4-2400/2667 per CPU, 1 DPC, and platform with E5-2699 v4, Turbo enabled, 4x32GB DDR4-2400, RHEL 7.0.
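As a rough illustration of what two additional memory channels mean, the sketch below computes theoretical peak DRAM bandwidth for the two configurations named in the source note (4 channels of DDR4-2400 versus 6 channels of DDR4-2400/2667). It assumes the standard 64-bit (8-byte) DDR4 channel and treats DDR4-2667 as 2666 MT/s. It says nothing about loaded latency itself, which is what the slide measures.

```python
BYTES_PER_TRANSFER = 8  # one 64-bit DDR4 channel moves 8 bytes per transfer


def peak_bandwidth_gbs(channels: int, mega_transfers_per_s: int) -> float:
    """Theoretical peak bandwidth in GB/s (10^9 bytes/s) for one socket."""
    return channels * mega_transfers_per_s * BYTES_PER_TRANSFER / 1000.0


if __name__ == "__main__":
    print("E5-2699 v4, 4 x DDR4-2400:", peak_bandwidth_gbs(4, 2400), "GB/s")
    print("Skylake-SP, 6 x DDR4-2400:", peak_bandwidth_gbs(6, 2400), "GB/s")
    print("Skylake-SP, 6 x DDR4-2667:", peak_bandwidth_gbs(6, 2666), "GB/s")
```

With more peak bandwidth available, a given offered load sits at a lower utilization of the memory system, which is one plausible reason the loaded-latency curve improves; the measured curves on the slide remain the authoritative data.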

PCIe Bandwidth

PCI Express platform performance increases up to 2x

• Mesh to I/O improvement, three MS2PCI mesh stops

• Additional Gen 3 x16 PCIe interface, three in total, resulting in up to 82 GB/s per socket

• Improvement in Data Directed I/O architecture, separation of RX and TX data
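For orientation, the theoretical link bandwidth of three PCIe Gen 3 x16 interfaces can be worked out from the Gen 3 signalling rate (8 GT/s per lane) and its 128b/130b encoding. This is a sketch of the raw link ceiling only; packet headers and flow control reduce what a device can actually move, and the function names are illustrative.

```python
GT_PER_S = 8.0            # PCIe Gen 3 raw rate per lane, in giga-transfers per second
ENCODING = 128.0 / 130.0  # 128b/130b line encoding efficiency


def lane_gbytes_per_s() -> float:
    """Usable GB/s per lane, per direction, after line encoding."""
    return GT_PER_S * ENCODING / 8.0


def link_gbytes_per_s(lanes: int) -> float:
    """Usable GB/s per direction for a link or a group of links with this many lanes."""
    return lanes * lane_gbytes_per_s()


if __name__ == "__main__":
    print(f"x16 link:      {link_gbytes_per_s(16):.2f} GB/s per direction")
    print(f"3 x16 (48 ln): {link_gbytes_per_s(48):.2f} GB/s per direction")
    print(f"3 x16 (48 ln): {2 * link_gbytes_per_s(48):.2f} GB/s both directions")
```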

“Results have been estimated based on internal Intel analysis and are provided for informational purposes only. Any difference in system hardware or software design or configuration may affect actual performance. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance/datacenter. Configurations: see next slide”

Translating Core, Memory and I/O Performance to Packet Processing

Data Plane Development Kit

Linux* Foundation Project

• More than 20 key open source projects build on DPDK libraries, including MoonGen*, mTCP*, Ostinato*, Lagopus*, Fast Data (FD.io), Open vSwitch*, OPNFV*, and OpenStack*

SKL-SP Optimizations

• Large MLC enables the packet processing application footprint to remain close to the core

*Other names and brands may be claimed as property of others

Packet Processing Problem Statement

[Figure: packets per second (MPPS) versus packet size (64, 128, 256, 512, 1024, 1518 bytes) for 10GbE and 100GbE, with the Communication Infrastructure and Typical Data Center packet-size ranges marked; labeled values: 15, 85 and 150 MPPS]

            64 Byte Packet   1024 Byte Packet
10 Gb/s     51 ns            819 ns
100 Gb/s    5 ns             82 ns

From a CPU perspective:
• Last-level-cache (L3) hit ~40 cycles
• L3 miss, memory read is ~70ns (140 cycles at 2GHz)
• Added security complexity
• Harder to address at 100Gb rates
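The per-packet time budget and the packet rates on this slide follow from the link speed and the frame size. The sketch below reproduces them under two stated assumptions: the nanosecond figures count only the frame's own bits (64 x 8 / 10 Gb/s = 51.2 ns), while the packets-per-second figures add the standard 20 bytes of Ethernet preamble, start-of-frame delimiter and inter-frame gap. The 2 GHz clock comes from the slide; the function names are illustrative.

```python
ETHERNET_OVERHEAD_BYTES = 20  # 7 B preamble + 1 B start-of-frame delimiter + 12 B inter-frame gap


def serialization_ns(frame_bytes: int, gbps: float) -> float:
    """Time to put the frame's bits on the wire: bits / (Gbit/s) gives nanoseconds."""
    return frame_bytes * 8 / gbps


def line_rate_mpps(frame_bytes: int, gbps: float) -> float:
    """Maximum packets per second (in millions), including per-frame wire overhead."""
    return gbps * 1000.0 / ((frame_bytes + ETHERNET_OVERHEAD_BYTES) * 8)


def cycle_budget(frame_bytes: int, gbps: float, cpu_ghz: float = 2.0) -> float:
    """CPU cycles available per packet on one core at the given clock."""
    return serialization_ns(frame_bytes, gbps) * cpu_ghz


if __name__ == "__main__":
    for gbps in (10, 100):
        for size in (64, 1024):
            print(f"{gbps:3d} Gb/s, {size:4d} B: "
                  f"{serialization_ns(size, gbps):7.2f} ns, "
                  f"{line_rate_mpps(size, gbps):7.2f} Mpps, "
                  f"{cycle_budget(size, gbps):8.2f} cycles at 2 GHz")
```

At 100 Gb/s a 64-byte packet leaves a single 2 GHz core roughly 10 cycles, well below the ~40 cycles of one L3 hit quoted above, which is why such rates require batching and spreading the load across multiple cores.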

Terabit Throughput Level with Unmodified SW

Breaking the Software Defined Network Services Barrier

1 Terabit Services on a dual Intel® Xeon® Server !!! with DPDK, Fortville-25, Lewisburg

Intel® XEON® CPUs (E5 v3/v4)
a. 40 lanes of PCIe Gen3 per socket
b. 2x 160Gbps of packet I/O per socket

Intel® XEON® CPUs (Skylake-SP)
a. 48 lanes of PCIe Gen3 per socket
b. 2x 280Gbps of packet I/O per socket

https://www.sinog.si/wp-content/uploads/2017/05/SINOG-VPP.pdf
https://fd.io/2017/07/fdio-doubles-packet-throughput-performance-terabit-levels/

Unlocking Platform Capability by DPDK

Kernel: IGB_UIO, KNI, UIO_PCI_GENERIC, VFIO

Userspace:

Packet classification: Software libraries for hash/exact match, LPM, ACL etc.

Accelerated SW libraries: Common functions such as IP fragmentation, reassembly, reordering etc.

Stats: Libraries for collecting and reporting statistics.

QoS: Libraries for QoS scheduling and metering/policing

Packet Framework: Libraries for creating complex pipelines in software.

Core libraries: Core functions such as memory management, software rings, timers etc.

Network Functions (Cloud, Enterprise, Telco)

DPDK Fundamentals

• Implements run-to-completion and pipeline models

• No scheduler - all devices accessed by polling

• Supports 32-bit and 64-bit OSs, with and without NUMA

• Scales from Intel® Atom® to Intel® Xeon® processors

• Number of cores and processors is not limited

• Optimal packet allocation across DRAM channels

• Use of 2M & 1G hugepages and cache aligned structures

• Uses bulk concepts - processing ‘n’ packets simultaneously

• Open source and BSD licensed
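The polling, run-to-completion and bulk-processing points above can be pictured with a toy model. The sketch below is plain Python, not the DPDK API: `ToyRxQueue`, `rx_burst`, `run_to_completion` and the burst size of 32 are hypothetical names and values chosen for illustration. It shows one core repeatedly polling a receive queue, pulling up to a burst of packets at a time, and finishing all work on that burst before polling again.

```python
from collections import deque

BURST_SIZE = 32  # hypothetical burst size used for this illustration


class ToyRxQueue:
    """Stand-in for a NIC receive queue; not a DPDK object."""

    def __init__(self, packets):
        self._ring = deque(packets)

    def rx_burst(self, max_pkts):
        """Return up to max_pkts packets without blocking; may return an empty list."""
        n = min(max_pkts, len(self._ring))
        return [self._ring.popleft() for _ in range(n)]


def run_to_completion(rx_queue, handle, max_polls):
    """Poll the queue and fully process each burst on this core before polling again."""
    processed = []
    for _ in range(max_polls):
        burst = rx_queue.rx_burst(BURST_SIZE)
        if not burst:
            continue  # busy-poll: no interrupt and no sleep, just try again
        processed.extend(handle(pkt) for pkt in burst)
    return processed


if __name__ == "__main__":
    queue = ToyRxQueue(range(100))
    out = run_to_completion(queue, lambda pkt: pkt * 2, max_polls=8)
    print(len(out))
```

Handling packets in bursts amortizes per-call overhead across many packets, and polling avoids interrupt and scheduler latency at the cost of keeping the core busy even when the queue is empty.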

ETHDEV: PMDs for physical and virtual Ethernet devices

CRYPTODEV: PMDs for HW and SW crypto accelerators

EVENTDEV: Event-driven PMDs (HW & SW)

SECURITY: Hardware acceleration APIs

COMPRESS: PMDs for HW and SW compression accelerators

RAW: Generic devices w/o specific type

Bridging Various Accelerators

Seamless interface to accelerators

DPDK Framework

Generic APIs

Application is abstracted from the underlying SW and HW with DPDK

Preserve Platform and Application software investment

Optimized platform software ingredients (e.g. vSwitch) to take advantage of HW and SW ingredients

Flexible, high-performance data plane

IA Platform

Optional Solutions

Application

DPDK Framework

Optimized Platform Software (OS / Hypervisor)

Optimized Software on CPU ISA (e.g., AES, AVX)

Integrated / Discrete FPGA

Smart NIC

Accelerators (Intel® QAT)

Standard NIC

Application Abstracted from Platform
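The idea of an application written against generic APIs, with software or hardware backends swapped underneath, can be sketched in a few lines. Everything named here (`CompressBackend`, `SoftwareZlib`, `HypotheticalAccelerator`, `pick_backend`) is invented for illustration and is not a DPDK interface; the "accelerator" simply reuses zlib so the example stays runnable.

```python
import zlib
from abc import ABC, abstractmethod


class CompressBackend(ABC):
    """Generic interface the application codes against."""

    name = "abstract"

    @abstractmethod
    def available(self) -> bool: ...

    @abstractmethod
    def compress(self, data: bytes) -> bytes: ...


class SoftwareZlib(CompressBackend):
    """Pure-software backend that runs on the CPU and is always present."""

    name = "sw-zlib"

    def available(self) -> bool:
        return True

    def compress(self, data: bytes) -> bytes:
        return zlib.compress(data)


class HypotheticalAccelerator(CompressBackend):
    """Placeholder for an offload device; a real driver would submit work to hardware."""

    name = "hw-accel"

    def __init__(self, present: bool = False):
        self._present = present

    def available(self) -> bool:
        return self._present

    def compress(self, data: bytes) -> bytes:
        return zlib.compress(data)  # toy: same output format as the software path


def pick_backend(backends):
    """Return the first available backend, in the caller's order of preference."""
    for backend in backends:
        if backend.available():
            return backend
    raise RuntimeError("no compression backend available")


if __name__ == "__main__":
    backend = pick_backend([HypotheticalAccelerator(present=False), SoftwareZlib()])
    print(backend.name, len(backend.compress(b"packet payload " * 64)))
```

The application calls `compress` the same way whichever backend is selected, which is the sense in which the software investment is preserved when an accelerator is added or removed.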

Community Ecosystem

A fully open source software project with a strong development community

Boosts Open Source Projects

Enriches Research & Innovation

mTCP [NSDI '14]

MoonGen [IMC '15]

NetBricks [OSDI '16]

mOS [NSDI '17]

SoftFlow [ATC '16]

StatelessNF [NSDI '17]

IX [OSDI '14]

Software RAN [CCTS '15]

NFP [SIGCOMM '17]

NFVnice [SIGCOMM '17]

OpenNetVM [HotMiddlebox '16]

VigNAT [SIGCOMM '17]

ExpressPass [SIGCOMM '17]

Decibel [NSDI '17]

APUNet [NSDI '17]

Flowtune [NSDI '17]

SwitchKV [NSDI '16]

MICA [NSDI '14]

ClickNP [SIGCOMM '16]

Trumpet [SIGCOMM '16]

PISCES [SIGCOMM '16]

ESWITCH [SIGCOMM '16]

STYX [SoCC '17]

FTMB [SIGCOMM '15]

BlindBox [SIGCOMM '15]

ScaleBricks [SIGCOMM '15]

NetCache [SOSP '17]

Future: Toward Cloud-Native Network Functions

• Primary Constructs

• DevOps / Continuous delivery / Microservices / Containers

• Unique Considerations of Network Functions

• Data plane packet processing requires an optimized architecture

• Domain-specific protocols are absent

• Intergenerational transformation and compatibility

Summary

• Powerful Multi-Core Scalable Architecture Processor

• Unlock Packet Processing Capability by DPDK

• Seamless Interface to Various Accelerators

• Fantastic Ecosystem for Innovation
