SINGA: Putting Deep Learning into the Hands of Multimedia Users

http://singa.apache.org/

Wei Wang, Gang Chen, Tien Tuan Anh Dinh, Jinyang Gao, Beng Chin Ooi, Kian-Lee Tan, and Sheng Wang



• Introduction
  • Multimedia data and application
• Motivations
  • Deep learning models and training, and design principles
• SINGA
  • Usability
  • Scalability
  • Implementation
• Experiment



Introduction

Multimedia Data

• Sources: Social Media, E-commerce, Health-care
• Image/video: Madbits (acquired by Twitter), Perceptio (acquired by Apple), LookFlow (acquired by Yahoo! Flickr), Deepomatic (e-commerce product search), Descartes Labs (satellite images), Clarifai (tagging)
• Text: ParallelDots, Semantria (NLP tasks >10 languages), Idibon, AlchemyAPI (acquired by IBM)
• Audio: VocalIQ (acquired by Apple)

Deep Learning has been noted for its effectiveness for multimedia applications!



Motivations: Model Categories

• Feedforward models (CNN, MLP, Auto-encoder): image/video classification (Krizhevsky, Sutskever, and Hinton, 2012; Szegedy et al., 2014; Simonyan and Zisserman, 2014a)


• Energy models (RBM, DBN, DBM): speech recognition (Dahl et al., 2012)


• Recurrent neural networks (RNN, LSTM, GRU): natural language processing (Mikolov et al., 2010; Cho et al., 2014)


Design Goal I. Usability: easy to implement various models


Motivations: Training Process

• Training process
  • Update model parameters to minimize prediction error
• Training algorithm
  • Mini-batch Stochastic Gradient Descent (SGD)
• Training time
  • (time per SGD iteration) x (number of SGD iterations)
  • Long time to train large models over large datasets, e.g., 2 weeks for training Overfeat (Sermanet et al.) as reported by Intel (https://software.intel.com/sites/default/files/managed/74/15/SPCS008.pdf).

[Figure: Back-propagation (BP) and Contrastive Divergence (CD)]
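
As an illustration of the training loop described on this slide, here is a minimal mini-batch SGD sketch. It is not SINGA code: the one-parameter model y = w * x, the squared-error loss, and the names `Sample`, `SgdStep` and `Train` are all assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical training sample for the toy model y = w * x.
struct Sample { double x; double y; };

// One mini-batch SGD step: average the per-sample gradients of
// 0.5 * (w*x - y)^2 with respect to w, then move w against the gradient.
double SgdStep(double w, const std::vector<Sample>& batch, double lr) {
  assert(!batch.empty());
  double grad = 0.0;
  for (const Sample& s : batch) grad += (w * s.x - s.y) * s.x;
  grad /= static_cast<double>(batch.size());
  return w - lr * grad;
}

// Total cost is (time per iteration) x (number of iterations):
// the loop runs num_iters iterations, each touching batch_size samples.
double Train(const std::vector<Sample>& data, std::size_t batch_size,
             std::size_t num_iters, double lr) {
  assert(!data.empty() && batch_size > 0);
  double w = 0.0;
  for (std::size_t it = 0; it < num_iters; ++it) {
    std::vector<Sample> batch;
    for (std::size_t k = 0; k < batch_size; ++k)
      batch.push_back(data[(it * batch_size + k) % data.size()]);
    w = SgdStep(w, batch, lr);
  }
  return w;
}
```

Distributed training attacks either factor of the training time: a cheaper `SgdStep` (less time per iteration) or fewer calls to it per machine.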



Motivations: Distributed Training Frameworks

• Synchronous training (Google Sandblaster, Dean et al., 2012; Baidu AllReduce, Wu et al., 2015)
  • Reduces time per iteration
  • Scalable for a single node with multiple GPUs
  • Cannot scale to a large cluster
• Asynchronous training (Google Downpour, Dean et al., 2012; Hogwild!, Recht et al., 2011)
  • Reduces number of iterations per machine
  • Scalable for a big cluster of commodity (CPU) machines
  • Not stable
• Hybrid frameworks

Design Goal II. Scalability: not just flexible, but also efficient and adaptive to run different training frameworks
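
The difference between the two update styles can be sketched on a toy problem. This is a sequential simulation under stated assumptions, not any framework's implementation: each worker sees the loss 0.5 * (w - target)^2, and the names `Gradient`, `SyncStep` and `AsyncStep` are hypothetical.

```cpp
#include <cassert>
#include <vector>

// Gradient of the toy loss 0.5 * (w - target)^2 seen by one worker.
double Gradient(double w, double target) { return w - target; }

// Synchronous step: all workers compute gradients on the same w,
// the gradients are averaged, and w is updated once.
double SyncStep(double w, const std::vector<double>& targets, double lr) {
  assert(!targets.empty());
  double sum = 0.0;
  for (double t : targets) sum += Gradient(w, t);
  return w - lr * sum / static_cast<double>(targets.size());
}

// Asynchronous step (simulated sequentially): every worker read w_stale
// earlier, and each gradient is applied to the shared w as it arrives,
// without waiting for the other workers.
double AsyncStep(double w_stale, const std::vector<double>& targets,
                 double lr) {
  double w = w_stale;
  for (double t : targets) w -= lr * Gradient(w_stale, t);
  return w;
}
```

With two workers the asynchronous variant applies two updates where the synchronous one applies a single averaged update, but those updates are computed from a stale copy of w, which is one intuition for why asynchronous training can be less stable.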



SINGA: A Distributed Deep Learning Platform


Usability: Abstraction

    class Layer {
      vector<Blob> data, grad;
      vector<Param*> param;
      ...
      void Setup(LayerProto& conf, vector<Layer*> src);
      void ComputeFeature(int flag, vector<Layer*> src);
      void ComputeGradient(int flag, vector<Layer*> src);
    };

    Driver::RegisterLayer<FooLayer>("Foo");  // register new layers

• Input layers load raw data (and labels)
• Output layers output features (and prediction results)
• Neuron layers transform features, e.g., convolution and pooling
• Loss layers measure training loss, e.g., cross-entropy loss
• Connection layers connect layers that are split by neural net partitioning
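
To make the layer abstraction concrete, here is a sketch of a user-defined neuron layer. It compiles on its own and does not use the SINGA headers: `Blob`, `LayerProto` and the base `Layer` below are simplified stand-ins (a single blob per layer, const-reference arguments), and `ConstInputLayer` and the `scale` field are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Simplified stand-ins (assumptions, not the SINGA definitions).
using Blob = std::vector<float>;
struct LayerProto { float scale = 1.0f; };

class Layer {
 public:
  virtual ~Layer() = default;
  virtual void Setup(const LayerProto& conf,
                     const std::vector<Layer*>& src) = 0;
  virtual void ComputeFeature(int flag, const std::vector<Layer*>& src) = 0;
  virtual void ComputeGradient(int flag, const std::vector<Layer*>& src) = 0;
  Blob data, grad;  // feature and its gradient
};

// Hypothetical input layer: the caller fills `data` before Setup.
class ConstInputLayer : public Layer {
 public:
  void Setup(const LayerProto&, const std::vector<Layer*>&) override {
    grad.assign(data.size(), 0.0f);
  }
  void ComputeFeature(int, const std::vector<Layer*>&) override {}
  void ComputeGradient(int, const std::vector<Layer*>&) override {}
};

// Hypothetical neuron layer: feature = scale * source feature.
class FooLayer : public Layer {
 public:
  void Setup(const LayerProto& conf,
             const std::vector<Layer*>& src) override {
    assert(!src.empty());
    scale_ = conf.scale;
    data.assign(src[0]->data.size(), 0.0f);
    grad.assign(data.size(), 0.0f);
  }
  void ComputeFeature(int, const std::vector<Layer*>& src) override {
    for (std::size_t i = 0; i < data.size(); ++i)
      data[i] = scale_ * src[0]->data[i];
  }
  // Chain rule: d(loss)/d(source) = scale * d(loss)/d(this layer's output).
  void ComputeGradient(int, const std::vector<Layer*>& src) override {
    for (std::size_t i = 0; i < grad.size(); ++i)
      src[0]->grad[i] = scale_ * grad[i];
  }

 private:
  float scale_ = 1.0f;
};
```

The point of the sketch is the division of labour: a new layer only describes its own forward and backward computation, and the rest of the system decides when to call them.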

[Figure: programming model; labels: TrainOneBatch, NeuralNet, Layer, stop]


Usability: Neural Net Representation


[Figure: a NeuralNet built from Layers, shown for feedforward models (e.g., CNN), RNN, and RBM; labels: Input, Hidden, Loss, labels]


Usability: TrainOneBatch


[Figure: TrainOneBatch with Back-propagation (BP) and Contrastive Divergence (CD), shown for feedforward models (e.g., CNN), RNN, and RBM; labels: Input, Hidden, Loss, labels]

Just need to override the TrainOneBatch function to implement other algorithms!
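
A sketch of what overriding TrainOneBatch could look like for back-propagation. The `Layer`, `NeuralNet`, `Worker` and `BPWorker` types below are simplified stand-ins written for this example, not the SINGA classes; layers only log which pass visited them so the visiting order is observable.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Minimal stand-in for a layer: it only logs which pass visited it.
struct Layer {
  std::string name;
  void ComputeFeature(std::vector<std::string>& log) {
    log.push_back("F:" + name);
  }
  void ComputeGradient(std::vector<std::string>& log) {
    log.push_back("B:" + name);
  }
};

// A net is a list of layers in topological order.
using NeuralNet = std::vector<Layer>;

class Worker {
 public:
  virtual ~Worker() = default;
  // Override this to plug in a different training algorithm.
  virtual void TrainOneBatch(NeuralNet& net,
                             std::vector<std::string>& log) = 0;
};

// Back-propagation: forward pass in order, backward pass in reverse order.
class BPWorker : public Worker {
 public:
  void TrainOneBatch(NeuralNet& net,
                     std::vector<std::string>& log) override {
    assert(!net.empty());
    for (Layer& layer : net) layer.ComputeFeature(log);
    for (auto it = net.rbegin(); it != net.rend(); ++it)
      it->ComputeGradient(log);
  }
};
```

A contrastive-divergence worker would be a second subclass that visits the same layers in a different pattern, which is why the algorithm is kept out of the layer code.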


Scalability: Partitioning for Distributed Training

NeuralNet partitioning:
1. Partition layers into different subsets.
2. Partition each single layer on the batch dimension.
3. Partition each single layer on the feature dimension.
4. Hybrid partitioning strategy of 1, 2 and 3.

[Figure: strategies 1, 2 and 3, each shown across Worker 1 and Worker 2]

Users just need to CONFIGURE the partitioning scheme and SINGA takes care of the real work (e.g., slicing and connecting layers).
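
Strategies 2 and 3 both come down to slicing one dimension of a layer across workers. The helper below is a hypothetical sketch of that slicing step only (the name `Slice` is an assumption, and it does not show how SINGA connects the resulting sub-layers).

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Split `total` items (e.g., the rows of a mini-batch, or the columns of a
// feature vector) into `parts` contiguous [begin, end) ranges whose sizes
// differ by at most one.
std::vector<std::pair<std::size_t, std::size_t>> Slice(std::size_t total,
                                                       std::size_t parts) {
  assert(parts > 0);
  std::vector<std::pair<std::size_t, std::size_t>> ranges;
  std::size_t begin = 0;
  for (std::size_t p = 0; p < parts; ++p) {
    // The first (total % parts) workers take one extra item.
    std::size_t size = total / parts + (p < total % parts ? 1 : 0);
    ranges.emplace_back(begin, begin + size);
    begin += size;
  }
  return ranges;
}
```

Calling it on the batch size gives a batch-dimension partition; calling it on the feature length gives a feature-dimension partition; a hybrid scheme would apply it to different dimensions for different layers.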


Scalability: Training Framework

Cluster Topology

[Figure: cluster topology. A Server Group (Server, Server, Server) is shown with the Parameters, and a group of workers (Worker, Worker, Worker) with the Neural Net. Legend: Worker, Server, Node, Group, Inter-node Communication]

Synchronous training cannot scale to large group size


Communication is the bottleneck!


[Figure: cluster topologies for (a) Sandblaster and (b) AllReduce (sync), (c) Downpour and (d) Distributed Hogwild (async). Legend: Worker, Server, Node, Group]

SINGA is able to configure most known frameworks.


Implementation

• Main thread: Driver::Train(), Stub::Run()
• Worker thread: while (not stop): Worker::TrainOneBatch()
• Server thread: while (not stop): Server::Update()

[Figure: thread layout, with the stub connected to Remote Nodes]

SINGA Software Stack

[Figure: SINGA software stack; labels: HDFS, DiskFile, Ubuntu, CentOS, MacOS, Docker, Mesos, Zookeeper, Worker, Stub, Server, Driver, CNN, RBM, RNN. Legend: Optional Component, SINGA Component]


Deep Learning as a Service (DLaaS)

SINGA’s RAFIKI

[Figure: Rafiki architecture; components: Third party APPs (Web app, Mobile, ...) via API, Developers (Browser) via GUI, Rafiki Server, Routing (Load balancing), User/Job/Model/Node Management, DataBase, Rafiki Agents, Timon (C++ wrapper), SINGA, File Storage System (e.g., HDFS); components are connected by http requests]

1. To improve the usability of SINGA.
2. To “level” the playing field by taking care of complex system plumbing work, its reliability, efficiency and scalability.


Comparison: Features of the Systems

Comparison with other open source projects

Feature                                                     SINGA        Caffe  CXXNET  cuda-convnet  H2O
Deep Learning Models             Feed-forward (CNN)         ✔            ✔      ✔       ✔             MLP
                                 Energy model (RBM)         ✔            x      x       x             x
                                 Recurrent networks (RNN)   ✔            ✔      x       x             x
Distributed Training Frameworks  Synchronous                ✔            ✔      ✔       ✔             ✔
                                 Asynchronous               ✔            ✔      x       x             x
                                 Hybrid                     ✔            x      x       x             x
Hardware                         CPU                        ✔            ✔      ✔       x             ✔
                                 GPU                        V0.2.0       ✔      ✔       ✔             x
Cloud Software                   HDFS                       ✔            x      x       x             ✔
                                 Resource management        ✔            x      x       x             ✔
                                 Virtualization             ✔            x      x       x             ✔
Binding                          Python (P), Matlab (M), R  ongoing (P)  P+M    P       P             P+R

MXNet on 28/09/15


Experiment --- Usability

• Used SINGA to train three known models and verify the results

[Figure: RBM and Deep Auto-Encoders]

Hinton, G. E. and Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, Vol. 313, no. 5786, pp. 504-507, 28 July 2006.


Experiment --- Usability

[Figure: Deep Multi-Modal Neural Network; labels: CNN, MLP]

W. Wang, X. Yang, B. C. Ooi, D. Zhang, Y. Zhuang: Effective Deep Learning Based Multi-Modal Retrieval. VLDB Journal - Special issue of VLDB'14 best papers, 2015.

W. Wang, B. C. Ooi, X. Yang, D. Zhang, Y. Zhuang: Effective Multi-Modal Retrieval based on Stacked Auto-Encoders. Int'l Conference on Very Large Data Bases (VLDB), 2014.


Experiment --- Usability

Mikolov Tomáš, Karafiát Martin, Burget Lukáš, Černocký Jan, Khudanpur Sanjeev: Recurrent neural network based language model, INTERSPEECH 2010, Makuhari, Chiba, JP


Experiment --- Efficiency and Scalability

Train DCNN over CIFAR10: https://code.google.com/p/cuda-convnet

Synchronous

• Single Node: 4 NUMA nodes (Intel Xeon 7540, 2.0GHz); each node has 6 cores with hyper-threading enabled; 500 GB memory
• Cluster: quad-core Intel Xeon 3.1 GHz CPU and 8GB memory per node; 1Gbps switch; 32 nodes, 4 workers per node

[Figure: synchronous training results; plot labels: Caffe, GTX 970]


Experiment --- Scalability

Train DCNN over CIFAR10: https://code.google.com/p/cuda-convnet

Asynchronous

[Figure: asynchronous training results on Single Node and Cluster; plot labels: Caffe, SINGA]


Conclusions

• Programming Model, Abstraction, and System Architecture
  • Easy to implement different models
  • Flexible and efficient to run different frameworks
• Experiments
  • Train models from different categories
  • Scalability test for different training frameworks
• SINGA
  • Usable, extensible, efficient and scalable
  • Apache SINGA v0.1.0 has been released
  • V0.2.0 (with GPU-CPU, DLaaS, more features) out next month
  • Being used for healthcare analytics, product search, …

