June 1st, 2016 / April 17th, 2018
SNAP Framework built on Power™ CAPI technology
Alexandre Castellane - Bruno Mesnet
Power™ Systems OpenCAPI and CAPI Hardware acceleration enablement
Montpellier Cognitive Systems lab
CAPI SNAP framework,
the tool for Software C/C++ programmers
to create FPGA accelerators
SNAP Framework built on Power™ CAPI technology, 2018, IBM Corporation
CAPI SNAP… a really good story
11 days, x35
What would you do with a 35x faster application?
CAPI
CAPI2.0 – OpenCAPI3.0
SNAP framework
Use Cases
PowerAI Inference Engine
Running an application on CPU (Option 1)
[Diagram: POWER8 chip with cores, L2 caches, L3 banks, chip interconnect, memory buses, SMP interconnect and CAPI/PCIe links]
CPU based application
Offloading an application on an external device (Option 2)
[Diagram: same POWER8 chip, with labels Application and Driver]
Classic view of offloading an application to an FPGA
- 3 versions of the data (not coherent).
- 1000s of instructions in the device driver.
Offloading an application on an external device with CAPI (Option 3)
[Diagram: same POWER8 chip, with labels Core, Configure, Application and Driver]
CAPI view of offloading an application to an FPGA
- 1 coherent version of the data.
- No device driver call/instructions.
- The FPGA becomes a peer of the processor cores.
IBM POWER8 processor and its Coherent Accelerator Processor Interface
[Diagram: POWER8 chip, with labels CAPP, PCIe, PSL, AFU, OS, C code and Your code]
- Coherent data: system memory accessed directly by FPGA
- Accelerator: application in CPU + FPGA code much simpler and quicker
- Processor Interface: specific logic in Power8 and FPGA + standard PCIe Gen3
The application becomes QUICKER, SIMPLER, SAFER
Power 9
CAPI2.0 - OpenCAPI 3.0
POWER9: CAPI2.0 and OpenCAPI3.0
[Diagram labels: 48 lanes PCIe G4; up to 32 lanes CAPI 2.0 & 1.0 enabled; balance is PCIe G4]
- CAPI2.0
  - 2x PCIe Gen4 x16 lanes CAPI2.0 enabled
  - 2x faster than PCIe Gen3 x16
  per FPGA card: 1 slot PCIe Gen4 x8 or PCIe Gen3 x16 – 16GBps
- OpenCAPI3.0
  - From 0 to 4x BlueLink (2x8 lanes) at 25Gbps depending on the P9 chip (i.e. ZZ has 1 BlueLink per socket)
  - 100% Open Interface Architecture to connect to user-level accelerators, I/O devices and advanced memories
  per FPGA card: 1 brick = 8 lanes – 25GBps
[Diagram labels: 48 lanes over 25Gbps link; up to 32 lanes for OpenCAPI 3.0; all can be used for NVLink]
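The per-card figures quoted above (16GBps for a PCIe Gen4 x8 or Gen3 x16 slot, 25GBps for an 8-lane brick) are consistent with a simple raw calculation: lanes times per-lane rate, divided by 8 bits per byte. The sketch below is only that arithmetic; the function name is hypothetical, and it assumes per-lane rates of 8Gb/s (Gen3), 16Gb/s (Gen4) and 25Gb/s (BlueLink) while ignoring encoding and protocol overhead, which is why measured throughput is lower.

```c
#include <assert.h>

/* Raw one-direction link bandwidth in GB/s.
 * Assumption: lanes * per-lane rate (Gb/s) / 8 bits per byte,
 * with encoding and protocol overhead ignored. */
double raw_link_gbytes_per_s(int lanes, double gbits_per_lane)
{
    assert(lanes > 0 && gbits_per_lane > 0.0);
    return lanes * gbits_per_lane / 8.0;
}
```

With these assumptions, `raw_link_gbytes_per_s(8, 16.0)` and `raw_link_gbytes_per_s(16, 8.0)` both give 16.0, and `raw_link_gbytes_per_s(8, 25.0)` gives 25.0.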
CAPI versus OpenCAPI
[Diagram, CAPI on POWER8 – POWER9 Processor. Labels: POWER Core, OS, App, Memory (Coherent), CAPP, PCIe, FPGA, PSL, AFU, cxl, libcxl]
[Diagram, OpenCAPI on POWER9 Processor. Labels: POWER9 Core, OS, App, Memory (Coherent), NPU (w/ CAPP fcn), TL, DL, 25G phy, FPGA, DLx, TLx, AFU, ocxl, libocxl]
IBM Power9 processor and its Coherent Accelerator Processor Interface
[Diagram: POWER9 cores with CAPI2.0 and OpenCAPI3.0 links]
CAPI1.0
  PCIe Gen3 x8 @8Gb/s
  ~4GB/s measured
  ~800ns latency
CAPI2.0
  PCIe Gen4 x8 @16Gb/s
  ~12 to 14 GB/s measured
  est. <555ns total latency
OpenCAPI3.0
  BlueLink 25Gb/s 8 lanes
  ~22GB/s measured
  378ns total latency
“Total latency” test on OpenCAPI3.0:
Simple workload created to simulate communication between system and attached FPGA
1. Copy 512B from host send buffer to FPGA
2. Host waits for 128 Byte cache injection from FPGA and polls for last 8 bytes
3. Reset last 8 bytes
4. Repeat: go to 1.
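The four steps of the “total latency” test can be sketched as a host-side loop. This is a simulation under stated assumptions, not the code used for the measurement: `fake_fpga` and `fake_fpga_respond` are hypothetical stand-ins that answer synchronously in software, whereas in the real test the FPGA injects the 128-byte line asynchronously and the loop would be timed.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { SEND_BYTES = 512, RECV_BYTES = 128, FLAG_BYTES = 8 };

/* Hypothetical stand-in for the attached FPGA: it only stores what it receives. */
struct fake_fpga {
    uint8_t inbox[SEND_BYTES];
};

/* Simulated "cache injection": fill the 128-byte line, then set the last 8 bytes. */
void fake_fpga_respond(const struct fake_fpga *dev, uint8_t *recv_line)
{
    uint64_t flag = 1;
    memcpy(recv_line, dev->inbox, RECV_BYTES - FLAG_BYTES);
    memcpy(recv_line + RECV_BYTES - FLAG_BYTES, &flag, FLAG_BYTES);
}

/* Runs the ping-pong loop and returns the number of completed round trips. */
unsigned run_pingpong(unsigned iterations, uint8_t recv_line[RECV_BYTES])
{
    struct fake_fpga dev;
    uint8_t send_buf[SEND_BYTES];
    unsigned done = 0;

    assert(recv_line != NULL);
    memset(send_buf, 0xA5, sizeof send_buf);
    memset(recv_line, 0, RECV_BYTES);

    for (unsigned i = 0; i < iterations; i++) {
        uint64_t flag = 0;

        /* 1. Copy 512B from the host send buffer to the FPGA. */
        memcpy(dev.inbox, send_buf, SEND_BYTES);
        fake_fpga_respond(&dev, recv_line); /* asynchronous on real hardware */

        /* 2. Poll the last 8 bytes of the 128-byte line. */
        do {
            memcpy(&flag, recv_line + RECV_BYTES - FLAG_BYTES, FLAG_BYTES);
        } while (flag == 0);

        /* 3. Reset the last 8 bytes, then 4. repeat. */
        memset(recv_line + RECV_BYTES - FLAG_BYTES, 0, FLAG_BYTES);
        done++;
    }
    return done;
}
```

On real hardware, dividing the elapsed time of such a loop by the iteration count would give an average round-trip figure; the sketch leaves timing out because the simulated device responds instantly.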
SNAP Framework
Storage Networking Analytics Processing
                     | PCI-E FPGA                                   | CAPI FPGA             | CAPI SNAP
Target Customer      | Computer Engineers + Device Driver Developer | Computer Engineers    | Application Programmers
Development time     | 3-6 Months                                   | 2-4 Months            | <1 Month
Software Integration | PCI-E Device Driver                          | LibCXL                | Simple API
Source Code          | VHDL, Verilog, OpenCL                        | VHDL, Verilog, OpenCL | C/C++, Go
Coherency, Security  | None                                         | POWER + PSL           | POWER + PSL
* Compared to running the same C/C++ in software
The CAPI – SNAP concept
[Diagram: Action X, Action Y and Action Z on top of CAPI, FPGA, SNAP and Vivado HLS]
CAPI
  FPGA becomes a peer of the CPU
  Action directly accesses host memory
+
SNAP
  Manage server threads and actions
  Manage access to IOs (memory, network)
  Action easily accesses resources
+
FPGA
  Gives on-demand compute capabilities
  Gives direct IOs access (storage, network)
  Action directly accesses external resources
+
Vivado HLS
  Compile Action written in C/C++ code
  Optimize code to get performance
  Action code can be ported efficiently
=
Best way to offload/accelerate a C/C++ code with:
- Quick porting
- Minimum change in code
- Better performance than CPU
CAPI SNAP Components
[Diagram, host side. Labels: Application, Process A, Context, SNAP library, Job Queue, libcxl, cxl]
[Diagram, CAPI SNAP Enabled Card. Labels: CAPI PSL, PSL/AXI bridge (simplified), Job Manager, Job Queue, Control MMIO, Host DMA, AXI, Accelerated Action, Action 1 “VHDL”, Action 2 “C/C++”, Action 3 “Go”, on-card DRAM, NVMe, Network (TBD)]
Key:
- Base CAPI Components (HDK: CAPI PSL)
- SNAP: CAPI Framework
- User host code and accelerator IP
SNAP Enabled Card Details
Nallatech 250S: 4GB DDR4 – 2TB NVMe sticks
Alpha-Data ADM-PCIE-KU3: 16GB DDR3 – 2x40Gb QSFP+ Ports
Semptian NSA121B: 8GB DDR4 – 2x10Gb SFP+ Ports
And more to come…
OpenCAPI Enabled FPGA Cards
Mellanox Innova2 Accelerator Card
Alpha Data 9v3 Accelerator Card
Typical eye diagram at 25Gb/s using these cards
Let’s understand SNAP with a “hello world” example
Application on Server
Input:
  HELLO WORLD. I love this new experience with SNAP
snap_helloworld -i /tmp/t1 -o /tmp/t2 (-mode=cpu)
  “Lower case” processing, “software” action
  hello world. i love this new experience with snap
snap_helloworld -i /tmp/t1 -o /tmp/t2 -mode=fpga
  “Upper case” processing, “hardware” action
  HELLO WORLD. I LOVE THIS NEW EXPERIENCE WITH SNAP
Change C code to implement:
- A switch to execute action on CPU or on FPGA
- A way to access new resources
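The “switch to execute action on CPU or on FPGA” can be illustrated with a minimal sketch. All names here (`action_mode`, `action_lower`, `action_upper`, `run_action`) are hypothetical and are not the SNAP API; the “hardware” action is simulated on the CPU so that the example runs anywhere, whereas a real port would hand the buffers to the FPGA action instead.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Hypothetical mode flag mirroring the -mode=cpu / -mode=fpga option. */
enum action_mode { MODE_CPU, MODE_FPGA };

/* "Software" action: lower-case processing. */
void action_lower(const char *in, char *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = (char)tolower((unsigned char)in[i]);
}

/* Stand-in for the "hardware" action: upper-case processing.
 * Assumption: simulated on the CPU here; on a real system this
 * work would be done by the action running in the FPGA. */
void action_upper(const char *in, char *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = (char)toupper((unsigned char)in[i]);
}

/* The switch: same call site, different action depending on the mode.
 * `out` must have room for strlen(in) + 1 bytes. */
void run_action(enum action_mode mode, const char *in, char *out)
{
    size_t n;

    assert(in != NULL && out != NULL);
    n = strlen(in) + 1; /* include the terminating NUL */
    if (mode == MODE_FPGA)
        action_upper(in, out, n);
    else
        action_lower(in, out, n);
}
```

Keeping both actions behind one entry point means the application code stays the same whichever target is selected, which is the point the slide makes about minimal changes to the C code.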
SNAP: 3 steps flow
1. ISOLATION (x86 server), command: make
   SNAP_CONFIG=CPU
   snap_helloworld -i /tmp/t1 -o /tmp/t2
   Application + “software” action: “Lower case” processing
2. SIMULATION (x86 server), command: make sim
   SNAP_CONFIG=FPGA
   snap_helloworld -i /tmp/t1 -o /tmp/t2
   Application + “hardware” action on FPGA card emulation: “Upper case” processing
3. EXECUTION (Power8 server), command: make image
   SNAP_CONFIG=FPGA
   snap_helloworld -i /tmp/t1 -o /tmp/t2
   Application + “hardware” action: “Upper case” processing
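The flow above relies on one environment variable, SNAP_CONFIG, to choose the target, and on one make command per step. The sketch below shows one way an application could map that variable to a target; the function and type names are hypothetical, and the fallback to CPU for an unset or unknown value is an assumption, since the slides do not say how the framework itself interprets the variable.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

enum snap_target { TARGET_CPU, TARGET_FPGA };

/* Map a SNAP_CONFIG-style value to a target.
 * Assumption: anything other than "FPGA" (including unset) means CPU. */
enum snap_target target_from_config(const char *value)
{
    if (value != NULL && strcmp(value, "FPGA") == 0)
        return TARGET_FPGA;
    return TARGET_CPU;
}

/* Read the choice from the environment, as in SNAP_CONFIG=FPGA ./app */
enum snap_target target_from_env(void)
{
    return target_from_config(getenv("SNAP_CONFIG"));
}

/* Commands listed on the slide for steps 1 to 3 of the flow. */
const char *make_command_for_step(int step)
{
    static const char *const cmds[] = { "make", "make sim", "make image" };

    assert(step >= 1 && step <= 3);
    return cmds[step - 1];
}
```

Because the target is chosen at run time, the same binary can be exercised in isolation, in simulation and on the card without recompiling the application side.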
Use cases
Comparison of Ingest + Search
48GB/s 200GB/s
1TB of data to search
POWER + Accelerator / POWER8 SW only
[Diagram: POWER8 chip with cores, L2 caches, L3 banks, chip interconnect, memory buses, SMP interconnect and CAPI/PCIe links]
CAPI SNAP Paradigms
Classic SW Process
  [Diagram labels: CPU, Actions]
Memory Transform
  Example: Basic off-load
  [Diagram labels: Data, CPU, Cache, IBM PSL, Actions, FPGA]
Ingress, Egress or Bi-Directional Transform
  Example: Compression, Crypto, HFT, Database operations
  [Diagram labels: Data, CPU, Cache, IBM PSL, Actions, FPGA]
CAPI SNAP: just the best integrated together
You need to:
- Know more about accelerators?
- See a live demonstration?
- Do a benchmark?
- Get answers to your questions?
Visit the IBM Client Center in Montpellier
http://www.ibm.com/ibm/clientcenter/montpellier
Contact us [email protected]
Conclusion
This benchmark validates several points:
- Combining CAPI SNAP and Vivado HLS:
  - gives really good performance with very few changes to the initial C code
  - gives a very simple way to call the action on the FPGA from the application on the server
- Code synthesized in a mid-range FPGA is able to beat the CPU by a factor of 35 on performance!
- The complete porting + optimizing was done by 2 engineers within 11 days:
  - Porting code to HLS: 2 days
  - Integrating code into SNAP: 2 days
  - Optimizing code to beat the CPU: 7 days