Cisco's Journey from Verbs to Libfabric



Page 1: Cisco's journey from Verbs to Libfabric

© 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public

Cisco’s Journey From Verbs to Libfabric

Abandon the shackles of Verbs

Embrace the freedom of Libfabric

Jeffrey M. Squyres, Cisco Systems, 23 September 2015

Page 2: Cisco's journey from Verbs to Libfabric

[Diagram: two software paths to a Cisco VIC ethX port. TCP/IP path: Application, userspace sockets API, then in the kernel the TCP stack and the general Ethernet driver (enic.ko). usNIC path: Application, userspace library (Verbs), with IB core and usnic.ko in the kernel and a send and receive fast path to the VIC.]

Page 3: Cisco's journey from Verbs to Libfabric


Verbs is a fine API. …if you make InfiniBand hardware.

Page 4: Cisco's journey from Verbs to Libfabric


…but now there’s this libfabric thing (see the libfabric.org community for details)

Page 5: Cisco's journey from Verbs to Libfabric


Keep in mind, Cisco already supports UD Verbs

Page 6: Cisco's journey from Verbs to Libfabric

Verbs: IBV_MTU_256, IBV_MTU_512, IBV_MTU_1024, IBV_MTU_2048, IBV_MTU_4096

•  Monotonic enum
•  Could not add popular Ethernet values: 1500, 9000
•  usNIC verbs provider had to lie (!) …just like iWARP providers
•  MPI had to match verbs device with IP interface to find real MTU
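The MTU problem on this slide can be sketched in a few lines of C. This is only an illustration, not the usNIC provider's code: the enum below is a local mirror of the verbs MTU enum (the SKETCH_ prefix marks it as such, and the numeric values 1 through 5 are what the libibverbs header uses as far as I know), and the helper that rounds a real MTU down to an enum value is a hypothetical stand-in for whatever a provider does when it has to answer in enum terms.

```c
#include <assert.h>

/* Local mirror of the verbs MTU enum named on the slide.
 * Assumption: values 1..5, as in the libibverbs header. */
enum sketch_ibv_mtu {
    SKETCH_IBV_MTU_256  = 1,
    SKETCH_IBV_MTU_512  = 2,
    SKETCH_IBV_MTU_1024 = 3,
    SKETCH_IBV_MTU_2048 = 4,
    SKETCH_IBV_MTU_4096 = 5
};

/* Byte count that an enum value stands for: 256 << (value - 1). */
unsigned sketch_mtu_enum_to_bytes(enum sketch_ibv_mtu m)
{
    assert(m >= SKETCH_IBV_MTU_256 && m <= SKETCH_IBV_MTU_4096);
    return 256u << ((unsigned)m - 1u);
}

/* Hypothetical helper: largest enum value whose byte count does not
 * exceed the real MTU. Anything below 256 still reports 256. */
enum sketch_ibv_mtu sketch_mtu_bytes_to_enum(unsigned real_mtu)
{
    enum sketch_ibv_mtu best = SKETCH_IBV_MTU_256;
    for (int m = SKETCH_IBV_MTU_256; m <= SKETCH_IBV_MTU_4096; m++) {
        if (sketch_mtu_enum_to_bytes((enum sketch_ibv_mtu)m) <= real_mtu)
            best = (enum sketch_ibv_mtu)m;
    }
    return best;
}
```

With these definitions, a real Ethernet MTU of 1500 can only be reported as the 1024 value and 9000 only as the 4096 value, which is the "lie" the slide refers to. A plain integer attribute holds 1500 or 9000 directly.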

Page 7: Cisco's journey from Verbs to Libfabric


•  Integer (not enum) endpoint attribute

Libfabric

Page 8: Cisco's journey from Verbs to Libfabric


•  Integer (not enum) endpoint attribute

Libfabric

DONE

Page 9: Cisco's journey from Verbs to Libfabric

Verbs

•  Mandatory GRH structure: an InfiniBand-specific header
•  GRH is 40 bytes; UDP header is 42 bytes …and a different format
•  Breaks ibv_ud_pingpong
•  usnic verbs provider used “magic” ibv_query_port() to return extensions pointers, e.g., to enable 42-byte UDP mode

[Figure: header layouts. UDP header: 42 bytes (fields shown: et, len, chk, smac, dmac, …). GRH: 40 bytes (fields shown: ver, len, next, hop, sgid, dgid).]
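The 42-versus-40 mismatch can be checked with simple arithmetic. The sketch below assumes that the slide's "UDP header" means Ethernet + IPv4 (no options) + UDP, which matches the fields in the figure (smac, dmac, et, len, chk) but is my reading rather than something the slide spells out. All names are made up for the illustration.

```c
#include <assert.h>
#include <stddef.h>

enum {
    SKETCH_ETH_HDR_LEN  = 14, /* dmac (6) + smac (6) + ethertype (2) */
    SKETCH_IPV4_HDR_LEN = 20, /* IPv4 header without options */
    SKETCH_UDP_HDR_LEN  = 8,  /* ports, length, checksum */
    SKETCH_GRH_LEN      = 40  /* size the verbs UD receive path reserves */
};

/* Total bytes in front of the payload in the UDP layout. */
size_t sketch_udp_prefix_len(void)
{
    return SKETCH_ETH_HDR_LEN + SKETCH_IPV4_HDR_LEN + SKETCH_UDP_HDR_LEN;
}

/* Bytes by which code that assumes a 40-byte GRH misplaces the payload. */
size_t sketch_grh_vs_udp_skew(void)
{
    assert(sketch_udp_prefix_len() >= SKETCH_GRH_LEN);
    return sketch_udp_prefix_len() - SKETCH_GRH_LEN;
}
```

A program that hard-codes a 40-byte skip on receive reads its payload two bytes early in the 42-byte layout, which is one way an unmodified UD ping-pong test could break.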

Page 10: Cisco's journey from Verbs to Libfabric

Libfabric

•  FI_MSG_PREFIX and ep_attr.msg_prefix_size

[Figure: receive buffer layout: UDP header fields (et, len, chk, smac, dmac, …) followed by the payload.]
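A prefixed receive boils down to skipping a provider-reported number of bytes at the front of the buffer. The sketch below is self-contained and does not call libfabric: prefix_size is a plain parameter standing in for the ep_attr.msg_prefix_size value named on the slide, and the assumption that total_len counts the prefix plus the payload belongs to this sketch, not to the libfabric documentation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* View of the payload inside a received, prefixed buffer. */
struct sketch_payload {
    const uint8_t *data;
    size_t len;
};

/* Skip prefix_size bytes at the front of buf.
 * Returns 0 on success, -1 if the buffer is shorter than the prefix. */
int sketch_skip_prefix(const uint8_t *buf, size_t total_len,
                       size_t prefix_size, struct sketch_payload *out)
{
    assert(buf != NULL && out != NULL);
    if (total_len < prefix_size)
        return -1;
    out->data = buf + prefix_size;
    out->len = total_len - prefix_size;
    return 0;
}
```

Because the prefix length is a value rather than a fixed structure, the same code handles a 42-byte prefix, a 40-byte prefix, or none at all.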

Page 11: Cisco's journey from Verbs to Libfabric

Libfabric

•  FI_MSG_PREFIX and ep_attr.msg_prefix_size

[Figure: receive buffer layout: UDP header fields (et, len, chk, smac, dmac, …) followed by the payload.]

DONE

Page 12: Cisco's journey from Verbs to Libfabric

Verbs

•  Tuple: (device, port). Usually a physical device and port; does not match virtualized VIC hardware
•  Queue pair
•  Completion queue

[Figure: machine topology diagram (physical indexes, dated Sat Mar 14 09:27:31 2015). Machine with 64GB total: two NUMA nodes of 32GB, each with one socket, a 25MB L3, and 10 cores (256KB L2, 32KB L1d, 32KB L1i per core). PCI devices: four 8086:1521 ports (eth0 to eth3), one 1000:0073 (sda), and four 1137:0043 VIC ports (eth4 to eth7, usnic_0 to usnic_3), each followed by four 1137:00cf devices. Overlay labels: ibv_device, ibv_port, three QPs, one CQ.]

Page 13: Cisco's journey from Verbs to Libfabric

Libfabric

•  Maps nicely to SR-IOV
•  Fabric → PCI physical function (PF)
•  Domain → PCI virtual function (VF)
•  Endpoint → Resources in VF

[Figure: the same machine topology diagram as on the previous slide (64GB machine, two NUMA nodes, VIC ports eth4 to eth7 / usnic_0 to usnic_3 at PCI 1137:0043, each followed by four 1137:00cf devices). Overlay labels: fi_fabric, fi_domain, fi_endpoint (resources in domain), three EPs, one CQ.]
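The fabric / domain / endpoint nesting can be pictured as plain containment. The model below is only an illustration of the mapping on this slide (PF contains VFs, a VF contains endpoint resources). It does not use the libfabric API, and every name and size limit in it is invented for the sketch.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_MAX_DOMAINS 4 /* arbitrary cap, not a hardware limit */
#define SKETCH_MAX_EPS     2 /* arbitrary cap, not a hardware limit */

struct sketch_endpoint { int in_use; };   /* resources inside a VF */

struct sketch_domain {                     /* stands for one PCI VF */
    struct sketch_endpoint eps[SKETCH_MAX_EPS];
    size_t num_eps;
};

struct sketch_fabric {                     /* stands for one PCI PF */
    struct sketch_domain domains[SKETCH_MAX_DOMAINS];
    size_t num_domains;
};

/* Claim the next VF on this PF; NULL when all are taken. */
struct sketch_domain *sketch_open_domain(struct sketch_fabric *f)
{
    assert(f != NULL);
    if (f->num_domains == SKETCH_MAX_DOMAINS)
        return NULL;
    struct sketch_domain *d = &f->domains[f->num_domains++];
    d->num_eps = 0;
    return d;
}

/* Carve an endpoint out of the VF's resources; NULL when exhausted. */
struct sketch_endpoint *sketch_open_endpoint(struct sketch_domain *d)
{
    assert(d != NULL);
    if (d->num_eps == SKETCH_MAX_EPS)
        return NULL;
    struct sketch_endpoint *ep = &d->eps[d->num_eps++];
    ep->in_use = 1;
    return ep;
}
```

The point of the model is that each level owns the one below it, so resource exhaustion is reported at the level where it happens, unlike the flat (device, port) tuple on the Verbs slide.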

Page 14: Cisco's journey from Verbs to Libfabric

Verbs

•  GID and GUID: no easy mapping back to IP interface
•  usnic verbs provider encoded MAC in GID; still cumbersome to map back to IP interface
•  Could use RDMA CM …but that would be a ton more code

mac[0] = gid->raw[8] ^ 2;
mac[1] = gid->raw[9];
mac[2] = gid->raw[10];
mac[3] = gid->raw[13];
mac[4] = gid->raw[14];
mac[5] = gid->raw[15];
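The decode on this slide can be wrapped into a small, compilable function. The struct below is a stand-in for the 16-byte verbs GID and models only its raw bytes. The slide shows the decode direction only; the encode function is an assumption added for a round-trip check, using a modified-EUI-64 style layout (fe80 prefix, ff:fe filler in raw[11] and raw[12]) that fits the byte indexes on the slide but is not confirmed by it.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for the 16-byte verbs GID; only raw bytes are modeled. */
struct sketch_gid { uint8_t raw[16]; };

/* Decode, following the byte picks shown on the slide. */
void sketch_gid_to_mac(const struct sketch_gid *gid, uint8_t mac[6])
{
    mac[0] = (uint8_t)(gid->raw[8] ^ 2); /* flip the universal/local bit */
    mac[1] = gid->raw[9];
    mac[2] = gid->raw[10];
    mac[3] = gid->raw[13];
    mac[4] = gid->raw[14];
    mac[5] = gid->raw[15];
}

/* Encode: ASSUMED inverse (modified-EUI-64 style), not from the slide. */
void sketch_mac_to_gid(const uint8_t mac[6], struct sketch_gid *gid)
{
    memset(gid, 0, sizeof(*gid));
    gid->raw[0]  = 0xfe;                  /* assumed link-local prefix */
    gid->raw[1]  = 0x80;
    gid->raw[8]  = (uint8_t)(mac[0] ^ 2);
    gid->raw[9]  = mac[1];
    gid->raw[10] = mac[2];
    gid->raw[11] = 0xff;                  /* assumed filler bytes */
    gid->raw[12] = 0xfe;
    gid->raw[13] = mac[3];
    gid->raw[14] = mac[4];
    gid->raw[15] = mac[5];
}
```

Even with the MAC recovered, the caller still has to search the system's interfaces for the one that owns that MAC, which is the cumbersome step the slide mentions.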

Page 15: Cisco's journey from Verbs to Libfabric


•  Can use IP addressing directly

Libfabric

Everything is awesome

Page 16: Cisco's journey from Verbs to Libfabric


•  Can use IP addressing directly

Libfabric

Everything is awesome DONE

Page 17: Cisco's journey from Verbs to Libfabric

Verbs

•  Generic send call: ibv_post_send(…SG list…); lots of branches
•  Wasteful allocations
•  No prefixed receive
•  Branching in completions

Page 18: Cisco's journey from Verbs to Libfabric

Libfabric

•  Multiple types of send calls: fi_send(buffer, …)
•  Variable-length prefix receive (provider-specific)
•  Fewer branches in completions

Page 19: Cisco's journey from Verbs to Libfabric

[Figure: Open MPI with usNIC: IMB PingPong Latency. X axis: buffer size (ticks at 0.1, 1, 10, 100). Y axis: time (microseconds), 1.9 to 2.4. Series: imb-pingpong-ompi-1.8-verbs.out and imb-pingpong-ompi-1.8-libfabric.out.]

Page 20: Cisco's journey from Verbs to Libfabric

[Figure: Open MPI with usNIC: IMB SendRecv Bandwidth. X axis: buffer size (tick at 1e+06). Y axis: bandwidth (megabits/second), 61000 to 69000. Series: imb-sendrecv-ompi-1.8-verbs.out and imb-sendrecv-ompi-1.8-libfabric.out.]

Page 21: Cisco's journey from Verbs to Libfabric

Verbs

•  Performance issues
•  Memory registration still a problem
•  No MPI-style tag matching
•  One-sided capabilities do not match MPI
•  Network topology is a separate API

Page 22: Cisco's journey from Verbs to Libfabric

Libfabric

•  Performance happiness
•  Many MPI-helpful features: tag matching, one-sided operations, triggered operations
•  Inherently designed to be more than just point-to-point
•  More work to be done… but promising: MMU notify, network topology

Page 23: Cisco's journey from Verbs to Libfabric

Verbs

•  Long design discussions about how to expose Ethernet / VIC concepts in the verbs API …usually with few good answers; especially problematic with new VIC features over time
•  Conclusion: possible (obviously), but not preferable

Libfabric

•  Whole API designed with multiple vendor hardware models in mind
•  Much easier to match our hardware to core Libfabric concepts
•  Conclusion: much more preferable than verbs

Page 24: Cisco's journey from Verbs to Libfabric


Ok, so let’s do libfabric!

Page 25: Cisco's journey from Verbs to Libfabric


Does it play well with MPI?

Page 26: Cisco's journey from Verbs to Libfabric

[Diagram: MPI_Send(…) sits above two kinds of Open MPI transport plugins: Byte Transport Layer (BTL) plugins and Matching Transport Layer (MTL) plugins.]

Page 27: Cisco's journey from Verbs to Libfabric

Byte Transport Layer (BTL) plugins

•  Inherently multi-device
•  Round-robin for small messages
•  Striping for large messages
•  Major protocol decisions and MPI message matching driven by an Open MPI engine
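The two multi-device policies named on this slide can be sketched as follows. This is not Open MPI's scheduler: the device count, the even split, and all names are assumptions for the illustration, and the slide does not say how Open MPI weights devices or where it draws the small/large line.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_NUM_DEVICES 2 /* hypothetical device count */

/* Round-robin state: index of the device the next small message uses. */
struct sketch_sched { size_t next_dev; };

/* Small message: the whole message goes to one device, chosen in turn. */
size_t sketch_pick_device(struct sketch_sched *s)
{
    assert(s != NULL);
    size_t dev = s->next_dev;
    s->next_dev = (s->next_dev + 1) % SKETCH_NUM_DEVICES;
    return dev;
}

/* Large message: split len bytes across all devices as evenly as
 * possible; chunk[d] receives the byte count assigned to device d. */
void sketch_stripe(size_t len, size_t chunk[SKETCH_NUM_DEVICES])
{
    size_t base = len / SKETCH_NUM_DEVICES;
    size_t extra = len % SKETCH_NUM_DEVICES;
    for (size_t d = 0; d < SKETCH_NUM_DEVICES; d++)
        chunk[d] = base + (d < extra ? 1 : 0);
}
```

Round-robin keeps per-message overhead low when messages are short, while striping lets a single long message use the bandwidth of every device at once.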

Page 28: Cisco's journey from Verbs to Libfabric

Matching Transport Layer (MTL) plugins

•  Most details hidden by network API: MXM, Portals, PSM
•  As a side effect, must handle: process loopback, server loopback (usually via shared memory)

Page 29: Cisco's journey from Verbs to Libfabric

Byte Transport Layer (BTL) plugins

•  IB / iWARP (verbs)
•  Portals
•  SCIF
•  Shared memory
•  TCP
•  uGNI
•  usNIC (verbs)

Matching Transport Layer (MTL) plugins

•  MXM
•  Portals
•  PSM
•  PSM2

Page 30: Cisco's journey from Verbs to Libfabric

Byte Transport Layer (BTL) plugins

•  IB / iWARP (verbs)
•  Portals
•  SCIF
•  Shared memory
•  TCP
•  uGNI
•  usNIC

Matching Transport Layer (MTL) plugins

•  MXM
•  Portals
•  PSM
•  PSM2
•  ofi

[Diagram label: libfabric]

Page 31: Cisco's journey from Verbs to Libfabric

libfabric

usnic BTL

•  Cisco developed
•  usNIC-specific
•  OFI point-to-point / UD
•  Tested with usNIC

ofi MTL

•  Intel developed
•  Provider neutral
•  OFI tag matching
•  Tested with PSM / PSM2

Page 32: Cisco's journey from Verbs to Libfabric

There are two main parts of the usNIC BTL:

•  Bootstrapping
•  Message passing

Page 33: Cisco's journey from Verbs to Libfabric

These two parts were previously written to the Verbs API:

•  verbs bootstrapping
•  verbs message passing

Page 34: Cisco's journey from Verbs to Libfabric

Per the previous slides, the Verbs API requires some… help… in the form of sideband bootstrapping:

•  verbs bootstrapping
•  verbs message passing
•  sideband bootstrapping
    1.  Find the corresponding ethX device
    2.  Obtain MTU
    3.  Open usNIC-specific configuration options

Page 35: Cisco's journey from Verbs to Libfabric

Now let’s convert to use the libfabric API:

•  verbs bootstrapping and sideband bootstrapping → libfabric bootstrapping
•  verbs message passing → libfabric message passing

Page 36: Cisco's journey from Verbs to Libfabric

•  verbs bootstrapping and sideband bootstrapping → libfabric bootstrapping
•  verbs message passing → libfabric message passing

Message passing: pretty much a ~1:1 swap of verbs → libfabric calls

Bootstrapping: sequence totally different / not comparable …but libfabric needs no sideband bootstrapping (got to delete several hundred lines of OMPI code – yay!)

Page 37: Cisco's journey from Verbs to Libfabric

usnic BTL

•  For a specific provider: ask fi_getinfo() for prov_name=“usnic”
•  Use usNIC extensions: netmask, link speed, IP device name, etc.
•  usNIC-specific error messages

ofi MTL

•  For any tag-matching provider
•  No extension use: 100% portable
•  Generic error messages

Page 38: Cisco's journey from Verbs to Libfabric

usnic BTL

•  For a specific provider: ask fi_getinfo() for prov_name=“usnic”
•  Use usNIC extensions: netmask, link speed, IP device name, etc.
•  usNIC-specific error messages

ofi MTL

•  For any tag-matching provider
•  No extension use: 100% portable
•  Generic error messages

Both libfabric usage models co-exist (and play well with each other) inside a single MPI implementation.

Proof positive of successful co-design of libfabric and MPI implementations.


Page 40: Cisco's journey from Verbs to Libfabric

•  Libfabric is the Way Forward for Cisco: open community, matches our hardware, performance benefits, feature benefits
•  Libfabric matches MPI: has features MPI has been asking for… for years; optimistic about its future (come join us!)

http://libfabric.org

Page 41: Cisco's journey from Verbs to Libfabric


Thank you.