
Page 1: DLoBD: An Emerging Paradigm of Deep Learning over Big Data Stacks on RDMA-enabled Clusters

Talk at OFA Workshop 2018

by

Xiaoyi Lu, The Ohio State University

E-mail: [email protected]
http://www.cse.ohio-state.edu/~luxi

Page 2: Why is Deep Learning so hot?

• Deep Learning is a subset of Machine Learning
  – But it is perhaps the most radical and revolutionary subset
• Deep Learning is going through a resurgence
  – Model: excellent accuracy for deep/convolutional neural networks
  – Data: public availability of versatile datasets like MNIST, CIFAR, and ImageNet
  – Capability: unprecedented computing and communication capabilities: Multi-/Many-Core, GPGPUs, Xeon Phi, InfiniBand, RoCE, etc.
• Big Data has become one of the most important elements in business analytics
  – Increasing demand for getting Big Value out of Big Data to keep revenue growing continuously

[Figure: MNIST handwritten digits fed into a Deep Neural Network]
Courtesy: http://www.zdnet.com/article/caffe2-deep-learning-wide-ambitions-flexibility-scalability-and-advocacy/

Page 3: Application Example of DL: Flickr's Magic View Photo Filtering

• Image recognition divides pictures into surprisingly accurate categories
• Magic of AI/DL: generate accurate tags for billions of pictures

Courtesy: https://thenextweb.com/opinion/2015/05/22/flickrs-new-magic-view-photo-filtering-feature-works-so-well-it-convinced-me-to-ditch-iphoto/#.tnw_RaZEaD6g

Page 4: Deep Learning over Big Data (DLoBD)

• Deep Learning over Big Data (DLoBD) is one of the most efficient analytics paradigms
• More and more deep learning tools and libraries (e.g., Caffe, TensorFlow) are starting to run over big data stacks, such as Apache Hadoop and Spark
• A typical DLoBD pipeline has four stages (see the sketch after this list): (1) prepare datasets @scale; (2) deep learning @scale; (3) non-deep-learning analytics @scale; (4) apply ML model @scale
• Benefits of the DLoBD approach
  – Easily build a powerful data analytics pipeline
    • E.g., Flickr DL/ML Pipeline, "How Deep Learning Powers Flickr", http://bit.ly/1KIDfof
  – Better data locality
  – Efficient resource sharing and cost effectiveness
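A hedged PySpark sketch of the four-stage pipeline above: `parse_sample`, `dl_train`, and `dl_infer` are hypothetical placeholders standing in for a DL library hooked into Spark (e.g., CaffeOnSpark or BigDL), not a real API.

```python
from pyspark import SparkContext

# Hypothetical stand-ins for a DL library integrated with Spark.
def parse_sample(line):            # (label, features) from one text record
    parts = line.split(",")
    return int(parts[0]), [float(x) for x in parts[1:]]

def dl_train(samples):             # placeholder "training"; not a real API
    return {"num_samples": samples.count()}

def dl_infer(model, sample):       # placeholder "inference"
    return 0

sc = SparkContext(appName="DLoBD-pipeline")

raw = sc.textFile("hdfs:///data/samples.csv")        # (1) prepare datasets @scale
samples = raw.map(parse_sample).cache()
model = dl_train(samples)                            # (2) deep learning @scale
hist = samples.map(lambda s: s[0]).countByValue()    # (3) non-DL analytics @scale
preds = samples.map(lambda s: dl_infer(model, s))    # (4) apply ML model @scale
```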

Page 5: Examples of DLoBD Stacks

• CaffeOnSpark
• SparkNet
• TensorFlowOnSpark
• TensorFrames
• DeepLearning4J
• BigDL
• MMLSpark (CNTKOnSpark)
• Many others...

Page 6: Overview of DLoBD Stacks

• Layers of DLoBD stacks
  – Deep learning application layer
  – Deep learning library layer
  – Big data analytics framework layer
  – Resource scheduler layer
  – Distributed file system layer
  – Hardware resource layer
• How much performance benefit can we achieve for end deep learning applications?

[Figure: layered DLoBD stack, annotated "Sub-optimal Performance?"]

Page 7: Increasing Usage of HPC, Big Data and Deep Learning

[Figure: converging circles of Big Data (Hadoop, Spark, HBase, Memcached, etc.), Deep Learning (Caffe, TensorFlow, BigDL, etc.), and HPC (MPI, RDMA, Lustre, etc.)]

Convergence of HPC, Big Data, and Deep Learning!

Page 8: Highly-Optimized Underlying Libraries with HPC Technologies

• BLAS libraries – the heart of math operations
  – ATLAS/OpenBLAS
  – NVIDIA cuBLAS
  – Intel Math Kernel Library (MKL)
• DNN libraries – the heart of convolutions!
  – NVIDIA cuDNN (already reached its 7th iteration – cuDNN v7)
  – Intel MKL-DNN (MKL 2017) – recent but a very promising development
• Communication libraries – the heart of model parameter updating
  – RDMA
  – GPUDirect RDMA
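Since later slides compare CPU+OpenBLAS against CPU+MKL, a quick way to confirm which BLAS backend a Python stack is actually linked against is NumPy's build-configuration dump:

```python
# Print the BLAS/LAPACK libraries NumPy was built against
# (e.g., OpenBLAS vs. Intel MKL) before interpreting CPU results.
import numpy as np
np.show_config()
```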

Page 9: Outline

• Accelerating Big Data Stacks
• Benchmarking and Characterizing DLoBD Stacks
  – CaffeOnSpark, TensorFlowOnSpark, MMLSpark, and BigDL
• Accelerating DLoBD Stacks
  – BigDL on RDMA-Spark
  – TensorFlow

Page 10: The High-Performance Big Data (HiBD) Project

• RDMA for Apache Spark
• RDMA for Apache Hadoop 2.x (RDMA-Hadoop-2.x)
  – Plugins for Apache, Hortonworks (HDP), and Cloudera (CDH) Hadoop distributions
• RDMA for Apache HBase
• RDMA for Memcached (RDMA-Memcached)
• RDMA for Apache Hadoop 1.x (RDMA-Hadoop)
• OSU HiBD-Benchmarks (OHB)
  – HDFS, Memcached, HBase, and Spark micro-benchmarks
• http://hibd.cse.ohio-state.edu
• User base: 280 organizations from 34 countries
• More than 25,750 downloads from the project site
• Available for InfiniBand and RoCE; also runs on Ethernet
• Available for x86 and OpenPOWER
• Upcoming release will have support for Singularity and Docker

Page 11: HiBD Release Timeline and Downloads

[Figure: cumulative number of downloads (0–30,000) over the release timeline, with milestones: RDMA-Hadoop 1.x 0.9.0, 0.9.8, 0.9.9; RDMA-Memcached 0.9.1 & OHB 0.7.1; RDMA-Hadoop 2.x 0.9.1, 0.9.6, 0.9.7, 1.0.0, 1.3.0; RDMA-Memcached 0.9.4, 0.9.6 & OHB 0.9.3; RDMA-Spark 0.9.1, 0.9.4, 0.9.5; RDMA-HBase 0.9.1]

Page 12: Example: RDMA-based Apache Spark

Default design choices across Apache Spark versions:
• 0.9.1: (1) hash-based shuffle; (2) NIO-based data transfer
• 1.2.0: (1) sort-based shuffle; (2) NIO-based data transfer
• 1.3.1: (1) sort-based shuffle; (2) Netty-based data transfer
• 1.4.0~2.0.0+: (1) sort-based shuffle; (2) Netty-based data transfer; new choice: Tungsten-Sort!
• 1.6.0~2.2.0+: new workloads – DLoBD: CaffeOnSpark, TensorFlowOnSpark, BigDL, etc.

Corresponding RDMA designs: RDMA-based Apache Spark (HotI'14), RDMA-based Apache Spark (BigData'16), and RDMA-based Apache Spark (HotI'17).

Page 13: Design Overview of Spark with RDMA

• Design features
  – RDMA-based shuffle plugin
  – SEDA-based architecture
  – Dynamic connection management and sharing
  – Non-blocking data transfer
  – Off-JVM-heap buffer management
  – InfiniBand/RoCE support
• Enables high-performance RDMA communication while supporting the traditional socket interface
• A JNI layer bridges Scala-based Spark with the communication library written in native code

[Figure: Apache Spark benchmarks/applications/libraries/frameworks over Spark Core; Shuffle Manager (Sort, Hash, Tungsten-Sort) and Block Transfer Service (Netty, NIO, RDMA-Plugin) with Netty/NIO/RDMA servers and clients; the Java Socket Interface runs over 1/10/40/100 GigE and IPoIB networks, while the Java Native Interface (JNI) connects to the native RDMA-based communication engine over RDMA-capable networks (IB, iWARP, RoCE, etc.)]

X. Lu, M. W. Rahman, N. Islam, D. Shankar, and D. K. Panda, Accelerating Spark with RDMA for Big Data Processing: Early Experiences, Int'l Symposium on High Performance Interconnects (HotI'14), August 2014.

X. Lu, D. Shankar, S. Gugnani, and D. K. Panda, High-Performance Design of Apache Spark with RDMA and Its Benefits on Various Workloads, IEEE BigData '16, Dec. 2016.

Page 14: RDMA for Apache Spark Distribution

• High-performance design of Spark over RDMA-enabled interconnects
  – High-performance RDMA-enhanced design with native InfiniBand and RoCE support at the verbs level for Spark
  – RDMA-based data shuffle and SEDA-based shuffle architecture
  – Non-blocking and chunk-based data transfer
  – Off-JVM-heap buffer management
  – Support for OpenPOWER
  – Easily configurable for different protocols (native InfiniBand, RoCE, and IPoIB)
• Current release: 0.9.5
  – Based on Apache Spark 2.1.0
  – Tested with
    • Mellanox InfiniBand adapters (DDR, QDR, FDR, and EDR)
    • RoCE support with Mellanox adapters
    • Various multi-core platforms (x86, POWER)
    • RAM disks, SSDs, and HDDs
  – http://hibd.cse.ohio-state.edu
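Because the RDMA design plugs in underneath Spark's block-transfer service, application code does not need to change; a shuffle-heavy PySpark job like the minimal sketch below runs as-is on stock Spark or on RDMA-Spark (only the cluster-side configuration differs).

```python
# Minimal shuffle-heavy PySpark job; the unchanged application code
# exercises the Netty/NIO path on stock Spark and the RDMA shuffle
# path on RDMA-Spark.
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("ShuffleTest")
sc = SparkContext(conf=conf)

pairs = sc.parallelize(range(1000000)).map(lambda x: (x % 1000, x))
counts = pairs.groupByKey().mapValues(lambda vs: sum(1 for _ in vs))  # shuffle
print(counts.take(5))
sc.stop()
```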

Page 15: HiBD Packages on SDSC Comet and Chameleon Cloud

• RDMA for Apache Hadoop 2.x and RDMA for Apache Spark are installed and available on SDSC Comet.
  – Examples for various modes of usage are available in:
    • RDMA for Apache Hadoop 2.x: /share/apps/examples/HADOOP
    • RDMA for Apache Spark: /share/apps/examples/SPARK/
  – Please email [email protected] (reference Comet as the machine, and SDSC as the site) if you have any further questions about usage and configuration.
• RDMA for Apache Hadoop is also available on Chameleon Cloud as an appliance
  – https://www.chameleoncloud.org/appliances/17/

M. Tatineni, X. Lu, D. J. Choi, A. Majumdar, and D. K. Panda, Experiences and Benefits of Running RDMA Hadoop and Spark on SDSC Comet, XSEDE'16, July 2016.

Page 16: Performance Evaluation on SDSC Comet – HiBench PageRank

• InfiniBand FDR, SSD, 32/64 worker nodes, 768/1536 cores (768/1536M 768/1536R)
• RDMA-based design for Spark 1.5.1
• RDMA vs. IPoIB with 768/1536 concurrent tasks, single SSD per node
  – 32 nodes/768 cores: total time reduced by 37% over IPoIB (56 Gbps)
  – 64 nodes/1536 cores: total time reduced by 43% over IPoIB (56 Gbps)

[Figure: PageRank total time (sec) for IPoIB vs. RDMA on Huge, BigData, and Gigantic data sizes (GB); left: 32 worker nodes/768 cores (37% reduction), right: 64 worker nodes/1536 cores (43% reduction)]

Page 17: Outline

• Accelerating Big Data Stacks
• Benchmarking and Characterizing DLoBD Stacks
  – CaffeOnSpark, TensorFlowOnSpark, MMLSpark, and BigDL
• Accelerating DLoBD Stacks
  – BigDL on RDMA-Spark
  – TensorFlow

Page 18: Benchmarking and Characterization Methodology

• Choose proper DL workloads, models, and datasets
  – Varied sizes to cover big and small models; small and large datasets
  – Cover different kinds of combinations
• Choose representative DLoBD stacks
  – CaffeOnSpark, TensorFlowOnSpark, and BigDL
  – Running over Spark, YARN, and HDFS
• Define characterization dimensions
  – Processor type
  – Parameter-updating approach (i.e., communication)
  – Network protocol (IPoIB, RDMA)
• Generate evaluation reports
  – Performance (end-to-end training time; time to a certain accuracy; epoch execution time)
  – Accuracy, scalability, resource utilization
  – Breakdown
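As a concrete reading of the "time to a certain accuracy" metric above, a minimal harness (a generic sketch, not the actual benchmarking code used in this work) could look like:

```python
import time

def time_to_accuracy(train_one_epoch, evaluate, target=0.70, max_epochs=100):
    """Return (epoch, seconds) at which validation accuracy first reaches
    the target, or (None, total seconds) if it never does."""
    start = time.time()
    for epoch in range(1, max_epochs + 1):
        train_one_epoch()            # caller-supplied: one epoch of training
        if evaluate() >= target:     # caller-supplied: validation accuracy
            return epoch, time.time() - start
    return None, time.time() - start
```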

Page 19: Overview of Representative DLoBD Stacks – CaffeOnSpark

• Spark Driver: job launching and job control
• Spark Executor: data feeding and task control
• Model Synchronizer: communicates across nodes with RDMA/TCP, and outputs the model file to HDFS
• Scalable and communication-intensive
  – Server-to-server direct communication (Ethernet or InfiniBand) achieves faster learning and eliminates the scalability bottleneck
  – Out-of-band communication

Page 20: Overview of Representative DLoBD Stacks – TensorFlowOnSpark

• Spark Executors act as containers used to run TensorFlow code
• Two different modes of ingesting data
  – Read data directly from HDFS using built-in TensorFlow modules
  – Feed data from Spark RDDs to the Spark executors (TensorFlow core)
• Scalable and communication-intensive
  – Parameter-server-based approach
  – Embedded inside one Spark executor and talks to other workers over gRPC, or gRPC with RDMA
  – Out-of-band communication
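A hedged sketch of launching TensorFlow on Spark executors with the public TensorFlowOnSpark API (`TFCluster`); the exact `TFCluster.run` argument list and `InputMode` handling vary across releases, so treat the details as assumptions.

```python
# Hedged TensorFlowOnSpark sketch: Spark executors host TensorFlow nodes,
# and InputMode.SPARK feeds training data from a Spark RDD (mode 2 above).
from pyspark import SparkContext
from tensorflowonspark import TFCluster

def main_fun(argv, ctx):
    # Runs on each executor; ctx.job_name / ctx.task_index identify the
    # parameter server or worker this executor is hosting.
    import tensorflow as tf
    # ... build the graph and train ...

sc = SparkContext(appName="TFoS-sketch")
train_rdd = sc.parallelize([([0.0] * 784, 0)])   # placeholder samples

cluster = TFCluster.run(sc, main_fun, None,
                        num_executors=4, num_ps=1,
                        tensorboard=False,
                        input_mode=TFCluster.InputMode.SPARK)
cluster.train(train_rdd, num_epochs=1)           # feed data from Spark RDDs
cluster.shutdown()
```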

Page 21: Overview of Representative DLoBD Stacks – CNTKOnSpark/MMLSpark

• Integrates Microsoft Cognitive Toolkit (CNTK) and OpenCV into Spark machine learning pipelines without data-transfer overhead
• Data for the CNTK core (e.g., images or text) can be read directly from HDFS by Spark Executors
• Scalable and communication-intensive
  – Embedded inside one Spark executor and talks to other workers over MPI (RDMA, TCP)
  – Out-of-band communication
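A hedged sketch using the MMLSpark Python API: class and setter names such as `CNTKModel.setModelLocation` follow the public MMLSpark project, but exact signatures differ between releases, so treat them as assumptions; the column names and model path are illustrative.

```python
# Hedged MMLSpark sketch: score an HDFS-resident image DataFrame with a
# pre-trained CNTK model inside a Spark ML pipeline. image_df is assumed
# to be a Spark DataFrame of images already loaded from HDFS.
from mmlspark import CNTKModel

model = (CNTKModel()
         .setModelLocation("hdfs:///models/cifar10.model")
         .setInputCol("images")
         .setOutputCol("scores"))

scored = model.transform(image_df)
```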

Page 22: Overview of Representative DLoBD Stacks – BigDL

• Users can write deep learning applications as Spark programs
• Users can load pre-trained Caffe or Torch models into Spark programs using BigDL
• Data is fed to the BigDL core by Spark Executors, which can load it directly from HDFS
• High performance
  – Supports Intel MKL
  – Supports both Xeon and Xeon Phi (e.g., KNL)
• Scalable and communication-intensive
  – Spark block manager as parameter server
  – Organically designed and integrated with the Spark architecture
  – In-band communication
• RDMA communication can be achieved through our RDMA-Spark package!
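A hedged sketch with the BigDL 0.x Python API (version details are an assumption): the model is defined in an ordinary Spark program and trained by BigDL's `Optimizer` over an RDD, with parameter synchronization going through the Spark block manager, which is why RDMA-Spark can accelerate it transparently.

```python
# Hedged BigDL sketch: training as a Spark program. The random MNIST-shaped
# samples are placeholders for data read from HDFS.
import numpy as np
from pyspark import SparkContext
from bigdl.util.common import init_engine, create_spark_conf, Sample
from bigdl.nn.layer import Sequential, Linear, ReLU, LogSoftMax, Reshape
from bigdl.nn.criterion import ClassNLLCriterion
from bigdl.optim.optimizer import Optimizer, SGD, MaxEpoch

sc = SparkContext(conf=create_spark_conf())   # BigDL-aware Spark conf
init_engine()                                 # initialize the BigDL engine

train_rdd = sc.parallelize(range(1024)).map(  # placeholder Sample RDD
    lambda i: Sample.from_ndarray(np.random.rand(28, 28),
                                  np.array([i % 10 + 1])))

model = (Sequential()
         .add(Reshape([28 * 28]))
         .add(Linear(28 * 28, 100)).add(ReLU())
         .add(Linear(100, 10)).add(LogSoftMax()))

optimizer = Optimizer(model=model, training_rdd=train_rdd,
                      criterion=ClassNLLCriterion(),
                      optim_method=SGD(learningrate=0.01),
                      end_trigger=MaxEpoch(18), batch_size=128)
trained_model = optimizer.optimize()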

Page 23: Selected Various Datasets and Models

|                 | MNIST                | CIFAR-10              | ImageNet              |
|-----------------|----------------------|-----------------------|-----------------------|
| Category        | Digit classification | Object classification | Object classification |
| Resolution      | 28 × 28 B&W          | 32 × 32 color         | 256 × 256 color       |
| Classes         | 10                   | 10                    | 1000                  |
| Training images | 60 K                 | 50 K                  | 1.2 M                 |
| Testing images  | 10 K                 | 10 K                  | 100 K                 |

| Model              | Layers (Conv. / Fully-connected) | Dataset   | Framework                                 |
|--------------------|----------------------------------|-----------|-------------------------------------------|
| LeNet              | 2 / 2                            | MNIST     | CaffeOnSpark, TensorFlowOnSpark           |
| SoftMax Regression | NA / NA                          | MNIST     | TensorFlowOnSpark                         |
| CIFAR-10 Quick     | 3 / 1                            | CIFAR-10  | CaffeOnSpark, TensorFlowOnSpark, MMLSpark |
| VGG-16             | 13 / 3                           | CIFAR-10  | BigDL                                     |
| AlexNet            | 5 / 3                            | ImageNet  | CaffeOnSpark                              |
| GoogLeNet          | 22 / 0                           | ImageNet  | CaffeOnSpark                              |
| ResNet-50          | 53 / 1                           | Synthetic | TensorFlow                                |

Page 24: Performance Characterization for CPU-/GPU-based Deep Learning with CaffeOnSpark

[Figure: end-to-end time (secs) vs. number of nodes (one device/node, 1–16) for CIFAR-10 Quick and for LeNet on MNIST, comparing CPU+OpenBLAS, CPU+MKL, GPU, and GPU+cuDNN]

• DL workloads can benefit from the high performance of DLoBD stacks.
• The network will become a bottleneck at some point if the sub-optimal IPoIB network protocol is used.
• GPU/GPU+cuDNN gets the best performance; GPU+cuDNN degrades at large scale (e.g., 16 nodes).
• For some models, solutions with CPU+MKL may outperform GPU-based solutions.

Page 25: Performance Characterization for IPoIB and RDMA with CaffeOnSpark and TensorFlowOnSpark (IB EDR)

• CaffeOnSpark benefits from the high performance of RDMA compared to IPoIB once communication overhead becomes significant.
• Our experiments show that the default RDMA design in TensorFlowOnSpark is not fully optimized yet. For MNIST tests, RDMA does not show obvious benefits.

[Figure: end-to-end time (secs), IPoIB vs. RDMA. Left: CaffeOnSpark on 2–16 nodes, CIFAR-10 Quick and LeNet on MNIST with GPU and GPU+cuDNN. Right: TensorFlowOnSpark with 2/4/8 GPUs, CIFAR-10 (2 GPUs/node) and MNIST (1 GPU/node). Annotated improvement labels: 14.2%, 13.3%, 51.2%, 45.6%, 33.01%, 4.9%]

Page 26: Performance Characterization with MMLSpark

[Figure: left – end-to-end time (secs) vs. number of nodes (one device/node, 1–16) for CPU+OpenBLAS, CPU+MKL, and GPU+cuDNN; right – CIFAR-10 end-to-end time (secs) for IPoIB vs. RDMA on 2–16 nodes]

• The GPU+cuDNN solution performs best: up to 55x faster than CPU+OpenBLAS, and up to 15x faster than CPU+MKL.
• OpenMPI-based communication over IPoIB and RDMA shows similar performance; the latency and bandwidth of IPoIB in this cluster are sufficient for small models.
• We could not find other benchmarks with bigger models for MMLSpark.

Page 27: Characterization on Performance and Accuracy

[Figure: accuracy (%) vs. time (secs) for IPoIB vs. RDMA – AlexNet on ImageNet with CaffeOnSpark (70% accuracy reached 22% faster with RDMA), GoogLeNet on ImageNet with CaffeOnSpark (15% faster), and VGG on CIFAR-10 with BigDL (48% faster)]

• Performance evaluation of CaffeOnSpark (training time to reach 70% accuracy)
  – RDMA reduces the overall time cost by 22% in training AlexNet on ImageNet
  – RDMA reduces the overall time cost by 15% in training GoogLeNet on ImageNet
• Performance evaluation of BigDL (training time to reach 70% accuracy)
  – RDMA reduces the overall time cost by 48% in training VGG on CIFAR-10

Page 28: Memory and Network Utilization of CaffeOnSpark

• CIFAR-10 Quick model and CIFAR-10 dataset
• GPU-based solutions use less host memory than CPU-based ones, since they mostly use GPU memory.
• The CPU+MKL solution uses host memory more efficiently and performs better than CPU+OpenBLAS.
• RDMA utilizes network resources more efficiently than IPoIB in CaffeOnSpark.
• CaffeOnSpark still does not fully utilize the high-throughput characteristics of RDMA or the available memory resources.

[Figure: left – host memory usage (GB) over time (min) for CPU+OpenBLAS, CPU+MKL, GPU, and GPU+cuDNN; right – network throughput (MB/s) over time (min) for IPoIB vs. RDMA]
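One way (an assumption; the talk does not say which tool was used) to collect comparable host-side memory and network traces is a simple psutil sampler:

```python
# Sample host memory usage and receive-side network throughput once per
# second for a minute, yielding traces like the ones plotted above.
import time
import psutil

prev = psutil.net_io_counters()
for _ in range(60):
    time.sleep(1)
    cur = psutil.net_io_counters()
    mem_gb = psutil.virtual_memory().used / 2**30
    recv_mbps = (cur.bytes_recv - prev.bytes_recv) / 2**20  # MB/s received
    print("mem=%.1f GB  net_in=%.1f MB/s" % (mem_gb, recv_mbps))
    prev = cur
```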

Page 29: Performance Overhead across Layers in DLoBD Stacks

• SoftMax Regression model over the MNIST dataset
• Up to 15.5% of the time is spent in the Apache Hadoop YARN scheduler layer
• Up to 18.1% of the execution time is spent in the Spark job execution layer
• The data size is small, so we do not count the time spent accessing the HDFS layer
• More effort is needed to reduce the overhead across the different layers of DLoBD stacks
• The overhead may be amortized in long-running deep learning jobs

[Figure: time (seconds) breakdown – actual training, YARN overhead, Spark overhead, other – for TensorFlowOnSpark vs. native TensorFlow, each with IPoIB and RDMA]

Page 30: Insights and Guidance

• RDMA can benefit DL workloads
  – Up to 2.7x speedup with RDMA compared to the IPoIB scheme for deep learning workloads
  – RDMA can scale better and utilize resources more efficiently than IPoIB over InfiniBand clusters
• GPU-based DL designs can outperform CPU-based designs, but not always
  – For LeNet on MNIST, CPU+MKL achieved better performance than GPU and GPU+cuDNN on 8/16 nodes
• There is large room for further improvement in DLoBD stacks!
• We need more benchmarks, public datasets, and analysis tools!

X. Lu, H. Shi, M. H. Javed, R. Biswas, and D. K. Panda, Characterizing Deep Learning over Big Data (DLoBD) Stacks on RDMA-capable Networks, HotI 2017.

X. Lu, H. Shi, R. Biswas, M. H. Javed, and D. K. Panda, DLoBD: A Comprehensive Study on the Emerging Paradigm of Deep Learning over Big Data Stacks, (under review).

Page 31: Outline

• Accelerating Big Data Stacks
• Benchmarking and Characterizing DLoBD Stacks
  – CaffeOnSpark, TensorFlowOnSpark, MMLSpark, and BigDL
• Accelerating DLoBD Stacks
  – BigDL on RDMA-Spark
  – TensorFlow

Page 32: Epoch-Level Evaluation with BigDL on SDSC Comet

[Figure: VGG on CIFAR-10 with BigDL – accumulative epoch time (secs) and accuracy (%) per epoch (1–18) for IPoIB vs. RDMA; RDMA reaches epoch 18 about 2.6x faster]

• Epoch-level evaluation of training the VGG model using BigDL on default Spark with IPoIB and on our RDMA-based Spark
• The RDMA version consistently takes less time than the IPoIB version to finish every epoch
  – RDMA finishes epoch 18 about 2.6x faster than IPoIB

Page 33: Scalability Evaluation with BigDL on SDSC Comet

• Using BigDL with IPoIB and RDMA Spark
• For the VGG model trained with BigDL, RDMA-based Spark scales better than default IPoIB Spark
• With 384 CPU cores, 18 epochs, and the same batch size, RDMA takes about 870 seconds while IPoIB takes 2,372 seconds
• That is a 2.7x speedup from RDMA for the epoch-level training time (worked out below)

[Figure: accumulative 18-epoch training time (secs) vs. number of CPU cores (96, 192, 384), IPoIB vs. RDMA]
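The 2.7x figure follows directly from the ratio of the two measured times:

\[
\text{speedup} = \frac{T_{\text{IPoIB}}}{T_{\text{RDMA}}} = \frac{2372\ \text{s}}{870\ \text{s}} \approx 2.7
\]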

Page 34: Outline

• Accelerating Big Data Stacks
• Benchmarking and Characterizing DLoBD Stacks
  – CaffeOnSpark, TensorFlowOnSpark, MMLSpark, and BigDL
• Accelerating DLoBD Stacks
  – BigDL on RDMA-Spark
  – TensorFlow

Page 35: Overview of gRPC with TensorFlow

Worker services communicate with each other using gRPC, or gRPC+X!

[Figure: a Client talks to a Master; worker services /job:PS/task:0 and /job:Worker/task:0, each with CPU and GPU devices, communicate through gRPC server/client endpoints]
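A minimal distributed TensorFlow 1.x sketch of the structure in this figure; the host:port strings are placeholders. The `protocol` argument selects the transport, which is where "gRPC+X" designs plug in (for example, `"grpc+verbs"` on a verbs-enabled TensorFlow build).

```python
# Minimal TF 1.x cluster matching the figure: one PS task and two worker
# tasks, each running a gRPC server. Host names are placeholders.
import tensorflow as tf

cluster = tf.train.ClusterSpec({
    "ps":     ["node0:2222"],                 # /job:PS/task:0
    "worker": ["node1:2222", "node2:2222"],   # /job:Worker/task:0 and 1
})

server = tf.train.Server(cluster,
                         job_name="worker",
                         task_index=0,
                         protocol="grpc")     # "grpc+verbs" on RDMA builds
server.join()
```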

Page 36: Performance Benefits for RDMA-gRPC with Micro-Benchmark

• gRPC-RDMA latency on SDSC-Comet-FDR
  – Up to 2.7x latency speedup over IPoIB for small messages
  – Up to 2.8x latency speedup over IPoIB for medium messages
  – Up to 2.5x latency speedup over IPoIB for large messages

[Figure: RDMA-gRPC RPC latency (us) vs. payload size – default gRPC vs. OSU RDMA gRPC for small (2 B–8 KB), medium (16 KB–512 KB), and large (1 MB–8 MB) payloads]

R. Biswas, X. Lu, and D. K. Panda, Accelerating gRPC and TensorFlow with RDMA for High-Performance Deep Learning over InfiniBand, under review.

Page 37: Performance Benefit for TensorFlow – ResNet-50

• TensorFlow ResNet-50 performance evaluation on an IB EDR cluster
  – Up to 35% performance speedup over IPoIB for 4 nodes
  – Up to 41% performance speedup over IPoIB for 8 nodes

[Figure: images/second vs. batch size (16, 32, 64) for gRPC, Verbs, MPI, and gRPC+RDMA; left: 4 nodes (35% gain), right: 8 nodes (41% gain)]

Page 38: Performance Benefit for TensorFlow – Inception3

• TensorFlow Inception3 performance evaluation on an IB EDR cluster
  – Up to 27% performance speedup over IPoIB for 4 nodes
  – Up to 36% performance speedup over IPoIB for 8 nodes

[Figure: images/second vs. batch size (16, 32, 64) for gRPC, Verbs, MPI, and gRPC+RDMA; left: 4 nodes (27% gain), right: 8 nodes (36% gain)]

Page 39: Concluding Remarks

• Discussed challenges in benchmarking, characterizing, and accelerating Deep Learning over Big Data (DLoBD) stacks
• RDMA can benefit DL workloads, as shown by our RDMA-Spark, AR-gRPC, and other RDMA designs
• Many other open issues still need to be solved
• This work will enable the Big Data and Deep Learning communities to take advantage of modern HPC technologies to carry out their analytics in a fast and scalable manner

Page 40: The 4th International Workshop on High-Performance Big Data Computing (HPBDC)

HPBDC 2018 will be held with the IEEE International Parallel and Distributed Processing Symposium (IPDPS 2018), Vancouver, British Columbia, Canada, May 2018

Workshop date: May 21st, 2018

Keynote talk: Prof. Geoffrey Fox, Twister2: A High-Performance Big Data Programming Environment

Six regular research papers and two short research papers

Panel topic: Which Framework is the Best for High-Performance Deep Learning: Big Data Framework or HPC Framework?

http://web.cse.ohio-state.edu/~luxi/hpbdc2018

HPBDC 2017 was held in conjunction with IPDPS'17: http://web.cse.ohio-state.edu/~luxi/hpbdc2017
HPBDC 2016 was held in conjunction with IPDPS'16: http://web.cse.ohio-state.edu/~luxi/hpbdc2016

Page 41: Two More Presentations

• Wednesday (04/11/18) at 11:30 am
  Building Efficient Clouds for HPC, Big Data, and Neuroscience Applications over SR-IOV-enabled InfiniBand Clusters
• Thursday (04/12/18) at 4:00 pm
  High-Performance Big Data Analytics with RDMA over NVM and NVMe-SSD

Page 42: Thank You!

[email protected]
http://www.cse.ohio-state.edu/~luxi

Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/
The High-Performance Big Data Project: http://hibd.cse.ohio-state.edu/