

April 17th, 2018

SNAP Framework built on Power™ CAPI technology

Alexandre Castellane - Bruno Mesnet
Power™ Systems, OpenCAPI and CAPI Hardware Acceleration Enablement
Montpellier Cognitive Systems Lab

CAPI SNAP framework, the tool for Software C/C++ programmers to create FPGA accelerators


SNAP Framework built on Power™ CAPI technology, 2018, IBM Corporation

CAPI SNAP: a really good story

11 days, x35

What would you do with a 35x faster application?


- CAPI
- CAPI2.0 – OpenCAPI3.0
- SNAP framework
- Use Cases
- PowerAI Inference Engine


Running an application on CPU (Option 1)

[Diagram: processor block diagram showing cores, L2 caches, L3 banks, chip interconnect, memory bus, SMP interconnect, and CAPI/PCIe interfaces. Label: Application, CPU based application.]


Offloading an application on an external device (Option 2)

[Diagram: same processor block diagram, with labels Application and Driver.]

Classic view of offloading an application to an FPGA:

- 3 versions of the data (not coherent).
- 1000s of instructions in the device driver.


Offloading an application on an external device with CAPI (Option 3)

[Diagram: same processor block diagram, with labels Core, Configure, Application and Driver.]

CAPI view of offloading an application to an FPGA:

- 1 coherent version of the data.
- No device driver call/instructions.
- The FPGA becomes a peer of the processor cores.


IBM POWER8 processor and its Coherent Accelerator Processor Interface

[Diagram: same processor block diagram, extended with labels CAPP, PCIe, PSL, AFU, OS, C code and "Your code".]

- Coherent: system memory accessed directly by FPGA (coherent data)
- Accelerator: application in CPU + FPGA code much simpler and quicker
- Processor Interface: specific logic in Power8 and FPGA + standard PCIe Gen3

The application becomes QUICKER, SIMPLER, SAFER.


POWER9: CAPI2.0 - OpenCAPI 3.0




POWER9: CAPI2.0 and OpenCAPI3.0

PCIe side: 48 lanes PCIe G4. Up to 32 lanes CAPI 2.0 & 1.0 enabled. Balance is PCIe G4.

- CAPI2.0
  - 2x PCIe Gen4 x16 lanes CAPI2.0 enabled
  - 2x faster than PCIe Gen3 x16
  - per FPGA card: 1 slot PCIeGen4x8 or PCIeGen3x16 – 16GBps

- OpenCAPI3.0
  - From 0 to 4x BlueLink (2x8 lanes) at 25Gbps depending on the P9 chip (e.g. ZZ has 1 BlueLink per socket)
  - 100% Open Interface Architecture to connect to user-level accelerators, I/O devices and advanced memories
  - per FPGA card: 1 brick = 8 lanes – 25GBps

25Gbps side: 48 lanes over 25Gbps link. Up to 32 lanes for OpenCAPI 3.0. All can be used for NVLink.


CAPI versus OpenCAPI

[Diagram: two stacks side by side.
CAPI (POWER8 – POWER9 processor): App, OS, POWER core, Memory (Coherent), libcxl, cxl, CAPP, PCIe; on the FPGA: PSL, AFU.
OpenCAPI (POWER9 processor): App, OS, POWER9 core, Memory (Coherent), libocxl, ocxl, NPU (w/ CAPP fcn), TL, DL, 25G phy; on the FPGA: 25G phy, DLx, TLx, AFU.]


IBM POWER9 processor and its Coherent Accelerator Processor Interface

[Diagram: POWER9 cores with CAPI2.0 and OpenCAPI3.0 attachments.]

- CAPI1.0: PCIeGen3x8 @8Gb/s, ~4GB/s measured, ~800ns latency
- CAPI2.0: PCIeGen4x8 @16Gb/s, ~12 to 14 GB/s measured, est. <555ns total latency
- OpenCAPI3.0: BlueLink 25Gb/s 8 lanes, ~22GB/s measured, 378ns total latency

“Total latency” test on OpenCAPI3.0:
Simple workload created to simulate communication between system and attached FPGA

1. Copy 512B from host send buffer to FPGA
2. Host waits for 128 Byte cache injection from FPGA and polls for last 8 bytes
3. Reset last 8 bytes
4. Repeat: go to 1.


SNAP Framework

Storage, Networking, Analytics, Processing


                     | PCI-E FPGA                                   | CAPI FPGA             | CAPI SNAP
Target Customer      | Computer Engineers + Device Driver Developer | Computer Engineers    | Application Programmers
Development time     | 3-6 Months                                   | 2-4 Months            | <1 Month
Software Integration | PCI-E Device Driver                          | LibCXL                | Simple API
Source Code          | VHDL, Verilog, OpenCL                        | VHDL, Verilog, OpenCL | C/C++, Go
Coherency, Security  | None                                         | POWER + PSL           | POWER + PSL

* Compared to running the same C/C++ in software


The CAPI – SNAP concept

[Diagram: Action X, Action Y, Action Z.]

CAPI
- FPGA becomes a peer of the CPU
- Result: Action directly accesses host memory

+ SNAP
- Manage server threads and actions
- Manage access to IOs (memory, network)
- Result: Action easily accesses resources

+ FPGA
- Gives on-demand compute capabilities
- Gives direct IOs access (storage, network)
- Result: Action directly accesses external resources

+ Vivado HLS
- Compile Action written in C/C++ code
- Optimize code to get performance
- Result: Action code can be ported efficiently

= Best way to offload/accelerate a C/C++ code with:
- Quick porting
- Minimum change in code
- Better performance than CPU


CAPI SNAP Components

[Diagram labels: Process A, Context, Application, SNAP library, libcxl, cxl, CAPI, PSL, PSL/AXI bridge (simplified), Job Manager, Job Queue (x2), Control MMIO, Host DMA, AXI, on-card DRAM, NVMe, Network (TBD), Accelerated Action, CAPI SNAP Enabled Card.]

Actions shown: Action 1 “VHDL”, Action 2 “C/C++”, Action 3 “Go”

Key:
- SNAP: CAPI Framework
- HDK: CAPI PSL (Base CAPI Components)
- User host code and accelerator IP


SNAP Enabled Card Details

- Nallatech 250S: 4GB DDR4 – 2TB NVMe sticks
- Alpha-Data ADM-PCIE-KU3: 16GB DDR3 – 2x40Gb QSFP+ Ports
- Semptian NSA121B: 8GB DDR4 – 2x10Gb SFP+ Ports

And more to come...


OpenCAPI Enabled FPGA Cards

Mellanox Innova2 Accelerator Card

Alpha Data 9v3 Accelerator Card

Typical eye diagram at 25Gb/s using these cards




Let’s understand SNAP with a “hello world” example

Application on Server

Input text:
HELLO WORLD. I love this new experience with SNAP

“Lower case” processing, “software” action:
snap_helloworld -i /tmp/t1 -o /tmp/t2 (-mode=cpu)
hello world. I love this new experience with snap

“Upper case” processing, “hardware” action:
snap_helloworld -i /tmp/t1 -o /tmp/t2 -mode=fpga
HELLO WORLD. I LOVE THIS NEW EXPERIENCE WITH SNAP

Change C code to implement:

- A switch to execute action on CPU or on FPGA
- A way to access new resources






SNAP: 3 steps flow

1. ISOLATION (x86 server), command: make
   SNAP_CONFIG=CPU
   snap_helloworld -i /tmp/t1 -o /tmp/t2
   Application with the “software” action (“Lower case” processing)

2. SIMULATION (x86 server), command: make sim
   SNAP_CONFIG=FPGA
   snap_helloworld -i /tmp/t1 -o /tmp/t2
   Application with the “hardware” action (“Upper case” processing) on FPGA card emulation

3. EXECUTION (Power8 server), command: make image
   SNAP_CONFIG=FPGA
   snap_helloworld -i /tmp/t1 -o /tmp/t2
   Application with the “hardware” action (“Upper case” processing)


Use cases


Comparison of Ingest + Search

1TB of data to search

Configurations compared: POWER + Accelerator, POWER8 SW only
Rates shown: 48GB/s, 200GB/s

[Diagram: processor block diagram showing cores, L2 caches, L3 banks, chip interconnect, memory bus, SMP interconnect, and CAPI/PCIe interfaces.]


CAPI SNAP Paradigms

- Classic SW Process: CPU, Actions
- Memory Transform: CPU, Cache, IBM PSL, FPGA, Actions, Data
  Example: Basic off-load
- Ingress, Egress or Bi-Directional Transform: CPU, Cache, IBM PSL, FPGA, Actions, Data
  Example: Compression, Crypto, HFT, Database operations


CAPI SNAP : just the best integrated together


You need to:

- Know more about accelerators?
- See a live demonstration?
- Do a benchmark?
- Get answers to your questions?

Visit the IBM Client Center in Montpellier

http://www.ibm.com/ibm/clientcenter/montpellier

Contact us [email protected]

or [email protected]


Conclusion

This benchmark validates several points:

- Combining CAPI SNAP and Vivado HLS:
  - gives really good performance with very few changes needed to the initial C code program
  - gives a very simple way to call the action on the FPGA from the application on the server
- Code synthesized in a mid-range FPGA is able to beat the CPU by a factor of 35 in performance!
- The complete porting + optimizing was done by 2 engineers within 11 days:
  - Porting code to HLS: 2 days
  - Integrate code into SNAP: 2 days
  - Optimize code to beat CPU: 7 days