ucx-python: a flexible communication library for python ... · march 21, 2018 ucx-python: a...

29
March 21, 2018 UCX-PYTHON: A FLEXIBLE COMMUNICATION LIBRARY FOR PYTHON APPLICATIONS

Upload: others

Post on 20-May-2020

17 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: UCX-PYTHON: A FLEXIBLE COMMUNICATION LIBRARY FOR PYTHON ... · march 21, 2018 ucx-python: a flexible communication library for python applications

March 21, 2018

UCX-PYTHON: A FLEXIBLE COMMUNICATION LIBRARY FOR PYTHON APPLICATIONS

Page 2: UCX-PYTHON: A FLEXIBLE COMMUNICATION LIBRARY FOR PYTHON ... · march 21, 2018 ucx-python: a flexible communication library for python applications

2

OUTLINE

Motivation and goals

Implementation choices

Features/API

Performance

Next steps

Page 3: UCX-PYTHON: A FLEXIBLE COMMUNICATION LIBRARY FOR PYTHON ... · march 21, 2018 ucx-python: a flexible communication library for python applications

3

WHY PYTHON-BASED GPU COMMUNICATION?

Python use growing

Extensive libraries

Python in Data science/HPC is growing

+ GPU usage and communication needs

Page 4: UCX-PYTHON: A FLEXIBLE COMMUNICATION LIBRARY FOR PYTHON ... · march 21, 2018 ucx-python: a flexible communication library for python applications

4

IMPACT ON DATA SCIENCE

RAPIDS uses dask-distributed for data distribution over python sockets

=> slows down all communication-bound components

Critical to enable dask with the ability to leverage IB, NVLINK

CUDA

PYTHON

APACHE ARROW

DASK

DEEP LEARNING

FRAMEWORKS

CUDNN

RAPIDS

CUMLCUDF CUGRAPH

Courtesy RAPIDS Team

Page 5: UCX-PYTHON: A FLEXIBLE COMMUNICATION LIBRARY FOR PYTHON ... · march 21, 2018 ucx-python: a flexible communication library for python applications

5

CURRENT COMMUNICATION DRAWBACKS

Existing python communication modules primarily rely on sockets

Low latency / high bandwidth critical for better system utilization of GPUs (eg: NVLINK, IB)

Frameworks that transfer GPU data between sites make copies

But CUDA-aware data movement is largely solved in HPC!

Page 6: UCX-PYTHON: A FLEXIBLE COMMUNICATION LIBRARY FOR PYTHON ... · march 21, 2018 ucx-python: a flexible communication library for python applications

6

REQUIREMENTS AND RESTRICTIONS

Dask – popular framework facilitates scaling python workloads to many nodes

Permits use of cuda-based python objects

Allows workers to be added and removed dynamically

communication backed built around coroutines (more later)

Why not use mpi4py then?

Dimension mpi4py

CUDA-Aware? No - Makes GPU<->CPU copies

Dynamic scaling? No - Imposes MPI restrictions

Coroutine support? No known support

Page 7: UCX-PYTHON: A FLEXIBLE COMMUNICATION LIBRARY FOR PYTHON ... · march 21, 2018 ucx-python: a flexible communication library for python applications

7

GOALS

Provide a flexible communication library that:

1. Supports CPU/GPU buffers over a range of message types

- raw bytes, host objects/memoryview, cupy objects, numba objects

2. Supports Dynamic connection capability

3. Supports pythonesque programming using Futures, Coroutines, etc (if needed)

4. Provides close to native performance from python world

How? – Cython, UCX

Page 8: UCX-PYTHON: A FLEXIBLE COMMUNICATION LIBRARY FOR PYTHON ... · march 21, 2018 ucx-python: a flexible communication library for python applications

8

OUTLINE

Motivation and goals

Implementation choices

Features/API

Performance

Next steps

Page 9: UCX-PYTHON: A FLEXIBLE COMMUNICATION LIBRARY FOR PYTHON ... · march 21, 2018 ucx-python: a flexible communication library for python applications

9

WHY UCX?Popular unified communication library used for MPI/PGAS implementations such as OpenMPI, MPICH, OSHMEM, etc

Exposes API for:

Client-server based connection establishment

Point-to-point, RMA, atomics capabilities

Tag matching

Callbacks on communication events

Blocking/Polling progress

Cuda-Aware Point-to-point communication

C library!

Page 10: UCX-PYTHON: A FLEXIBLE COMMUNICATION LIBRARY FOR PYTHON ... · march 21, 2018 ucx-python: a flexible communication library for python applications

10

PYTHON BINDING APPROACHES

Three main considerations:

SWIG, CFFI, Cython

Problems with SWIG, CFFI

Works well for small examples but not for C libraries

UCX definitions of structures isn’t consolidated

Tedious to populate interface file / python script by hand

Page 11: UCX-PYTHON: A FLEXIBLE COMMUNICATION LIBRARY FOR PYTHON ... · march 21, 2018 ucx-python: a flexible communication library for python applications

11

CYTHONCall C functions and structures from cython code (.pyx)

Expose classes, functions from python which can use C underneath

ucx_echo.py

...

ep=ucp.get_endpoint()

ep.send_obj(…)

ucp_py.pyx

cdef class ucp_endpoint:

def send_obj(…):

ucp_py_send_nb(…)

ucp_py_ucp_fxns.c

struct ctx *ucp_py_send_nb()

{

ucp_tag_send_nb(…)

}

Defined in UCX C libraryDefined in UCX-PY module

Page 12: UCX-PYTHON: A FLEXIBLE COMMUNICATION LIBRARY FOR PYTHON ... · march 21, 2018 ucx-python: a flexible communication library for python applications

12

UCX-PY STACK

UCX-PY (.py)

OBJECT META-DATA EXTRACTION/ UCX C-WRAPPERS (.pyx)

RESOURCE MANAGEMENT/CALLBACK HANDLING/UCX CALLS(.c)

UCX C LIBRARY

Page 13: UCX-PYTHON: A FLEXIBLE COMMUNICATION LIBRARY FOR PYTHON ... · march 21, 2018 ucx-python: a flexible communication library for python applications

13

OUTLINE

Motivation and goals

Implementation choices

Features/API

Performance

Next steps

Page 14: UCX-PYTHON: A FLEXIBLE COMMUNICATION LIBRARY FOR PYTHON ... · march 21, 2018 ucx-python: a flexible communication library for python applications

14

COROUTINESCo-operative concurrent functions

Preempted when

read/write from disk

perform communication

sleep, etc

Scheduler/event loop manages execution of all coroutines

Single thread utilization increases

def zzz(i):

print("start", i)

time.sleep(2)

print("finish", i)

def main():

zzz(1)

zzz(2)

main()

Ouput:

start 1 # t = 0

finish 1 # t = 2

start 2 # t = 2 + △finish 2 # t = 4 + △

async def zzz(i):

print("start" , i)

await asyncio.sleep(2)

print("finish“, i)

f = asyncio.create_task

async def main():

task1 = f(zzz(1))

task2 = f(zzz(2))

await task1

await task2

asyncio.run(main())

start 1 # t = 0

start 2 # t = 0 + △finish 1 # t = 2

finish 2 # t = 2 + △

Page 15: UCX-PYTHON: A FLEXIBLE COMMUNICATION LIBRARY FOR PYTHON ... · march 21, 2018 ucx-python: a flexible communication library for python applications

15

UCX-PY CONNECTION ESTABLISHMENT API

Dynamic connection establishment

.start_listener(accept_cb, port, is_coroutine) : Server creates listener

.get_endpoint (ip, port) : client connects

Multiple listeners allowed, multiple endpoints to server allowed

async def accept_cb(ep, …):

await ep.send_obj()

await ep.recv_obj()

ucp.start_listener(accept_cb, port,

is_coroutine=True)

async def talk_to_client():

ep = ucp.get_endpoint(ip, port)

await ep.recv_obj()

await ep.send_obj()

Page 16: UCX-PYTHON: A FLEXIBLE COMMUNICATION LIBRARY FOR PYTHON ... · march 21, 2018 ucx-python: a flexible communication library for python applications

16

UCX-PY CONNECTION ESTABLISHMENT

ucp.start_listener()

ucp.get_endpoint()

listening state

accept connection

invoke callback

accept_cb()

Server Client

Page 17: UCX-PYTHON: A FLEXIBLE COMMUNICATION LIBRARY FOR PYTHON ... · march 21, 2018 ucx-python: a flexible communication library for python applications

17

UCX-PY DATA MOVEMENT APISend data (on endpoint)

.send_*() : raw bytes, host objects (numpy), cuda objects (cupy, numba)

Receive data (on endpoint)

.recv_obj() : pass an object as argument where data is received

.recv_future() ‘blind’ : no input; returns received object; low performance

async def accept_cb(ep, …):

await ep.send_obj(cupy.array([42]))

async def talk_to_client():

ep = ucp.get_endpoint(ip, port)

rr = await ep.recv_future()

msg = ucp.get_obj_from_msg(rr)

Page 18: UCX-PYTHON: A FLEXIBLE COMMUNICATION LIBRARY FOR PYTHON ... · march 21, 2018 ucx-python: a flexible communication library for python applications

18

UCX-PY DATA MOVEMENT SEMANTICS

Send/Recv operations are non-blocking by default

Issue of the operation returns a future

Calling await on the future or calling future.result() blocks until completion

Caveat - Limited number of object types tested

memoryview, numpy, cupy, and numba

Page 19: UCX-PYTHON: A FLEXIBLE COMMUNICATION LIBRARY FOR PYTHON ... · march 21, 2018 ucx-python: a flexible communication library for python applications

19

UNDER THE HOOD

UCX depends on event notification to avoid the main thread from constantly polling

Layer UCX Calls

Connection management ucp_{listener/ep}_create

Issuing data movement ucp_tag_{send/recv/probe}_nb

Request progress ucp_worker_{arm/signal/progress}

UCX-PY Blocking progress

UCX event notification

mechanism

Completion queue events

from IB event channel

Read/write event from Sockets

Page 20: UCX-PYTHON: A FLEXIBLE COMMUNICATION LIBRARY FOR PYTHON ... · march 21, 2018 ucx-python: a flexible communication library for python applications

20

OUTLINE

Motivation and goals

Implementation choices

Features/API

Performance

Next steps

Page 21: UCX-PYTHON: A FLEXIBLE COMMUNICATION LIBRARY FOR PYTHON ... · march 21, 2018 ucx-python: a flexible communication library for python applications

21

EXPERIMENTAL TESTBED

Hardware includes 2 Nodes:

Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz

Tesla V100-SXM2 (CUDA 9.2.88, driver version 410.48)

ConnectX-4 Mellanox HCAs (OFED-internal-4.0-1.0.1)

Software:

UCX 1.5, Python 3.7.1

Case UCX progress mode Python functions

Latency bound polling regular

Bandwidth bound Blocking (event notification based) coroutines

Page 22: UCX-PYTHON: A FLEXIBLE COMMUNICATION LIBRARY FOR PYTHON ... · march 21, 2018 ucx-python: a flexible communication library for python applications

22

HOST MEMORY LATENCY

0

1

2

3

4

5

6

7

8

9

1 2 4 8 16 32 64 128 256 512 1K 2K 4K 8K

Late

ncy (

us)

Message Size (bytes)

Short Message Latency

native-UCX python-UCX

Latency-bound host transfers

0

100

200

300

400

500

600

700

800

16K 32K 64K 128K 256K 512K 1M 2M 4M

Late

ncy (

us)

Message Size (bytes)

Large Message Latency

native-UCX python-UCX

Page 23: UCX-PYTHON: A FLEXIBLE COMMUNICATION LIBRARY FOR PYTHON ... · march 21, 2018 ucx-python: a flexible communication library for python applications

23

DEVICE MEMORY LATENCY

0

2

4

6

8

10

12

1 2 4 8 16 32 64 128 256 512 1K 2K 4K 8K

Late

ncy (

us)

Message Size (bytes)

Short Message Latency

native-UCX python-UCX

Latency-bound device transfers

0

100

200

300

400

500

600

700

800

900

16K 32K 64K 128K 256K 512K 1M 2M 4M

Late

ncy (

us)

Message Size (bytes)

Large Message Latency

native-UCX python-UCX

Page 24: UCX-PYTHON: A FLEXIBLE COMMUNICATION LIBRARY FOR PYTHON ... · march 21, 2018 ucx-python: a flexible communication library for python applications

24

DEVICE MEMORY BANDWIDTHBandwidth-bound transfers (cupy)

6.15

7.36

8.85

9.8610.47

10.85

8.7

9.710.3

10.7 10.9 11.03

0

2

4

6

8

10

12

10MB 20MB 40MB 80MB 160MB 320MB

Bandw

idth

GB/s

Message Size

cupy native

Page 25: UCX-PYTHON: A FLEXIBLE COMMUNICATION LIBRARY FOR PYTHON ... · march 21, 2018 ucx-python: a flexible communication library for python applications

25

OUTLINE

Motivation and goals

Implementation choices

Features/API

Performance

Next steps

Page 26: UCX-PYTHON: A FLEXIBLE COMMUNICATION LIBRARY FOR PYTHON ... · march 21, 2018 ucx-python: a flexible communication library for python applications

26

NEXT STEPS

Performance validate dask-distributed over UCX-PY with dask-cuda workloads

Objects that have mixed physical backing (CPU and GPU)

Adding blocking support to NVLINK based UCT

Non-contiguous data transfers

Integration into dask-distributed underway (https://github.com/TomAugspurger/distributed/commits/ucx+data-handling)

Current implementation (https://github.com/Akshay-Venkatesh/ucx/tree/topic/py-bind)

Push to UCX project underway (https://github.com/openucx/ucx/pull/3165)

Page 27: UCX-PYTHON: A FLEXIBLE COMMUNICATION LIBRARY FOR PYTHON ... · march 21, 2018 ucx-python: a flexible communication library for python applications

27

SUMMARY

UCX-PY is a flexible communication library

Provides python developers a way to leverage high-speed interconnects like IB

Can support pythonesque way of overlap communication with other coroutines

Or can be non-overlapped like in traditional HPC

Can support data-movement of objects residing on CPU memory or on GPU memory

users needn’t explicitly copy GPU<->CPU

UCX-PY is close to native performance for major use case range

Page 28: UCX-PYTHON: A FLEXIBLE COMMUNICATION LIBRARY FOR PYTHON ... · march 21, 2018 ucx-python: a flexible communication library for python applications

28

BIG PICTURE

UCX-PY will serve as a high-performance communication module for dask

CUDA

PYTHON

APACHE ARROW

DASK

DEEP LEARNING

FRAMEWORKS

CUDNN

RAPIDS

CUMLCUDF CUGRAPH

UCX-P

Y

Page 29: UCX-PYTHON: A FLEXIBLE COMMUNICATION LIBRARY FOR PYTHON ... · march 21, 2018 ucx-python: a flexible communication library for python applications