TRANSCRIPT
Stratos Idreos
Talks at the lab this week:
11:30am today, Yifan Wu from Berkeley: Real-Time Interactive Data Analytic Interfaces
11:30am tomorrow, K. Karanasos from Microsoft: (big data/ML)
Discussion papers as of Monday
Review submission every class
Deep learning:
Academic interest
Market share
Wealth of applications
Some input → Some output
Image recognition: image → Cat, Dog, Table
Machine translation: Γατα → Cat
Auto-complete: What happens in Vegas… → …
cat = f ( )
How do we do this mapping?
Perceptron: data → clean/transform → weighted inputs (e.g., a weight of 3.4) → activation function
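As a minimal sketch (not from the slides), a single perceptron in NumPy: the cleaned/transformed input is combined through learned weights and passed through an activation function. The concrete numbers, including the 3.4 weight, are only illustrative.

```python
import numpy as np

def perceptron(x, w, b):
    """Single perceptron: weighted sum of the (cleaned/transformed) inputs
    followed by an activation function (here a simple step)."""
    z = np.dot(w, x) + b          # weighted sum of inputs
    return 1.0 if z > 0 else 0.0  # step activation

# Hypothetical example: 3 input features, one weight set to 3.4 as on the slide
x = np.array([0.2, -1.0, 0.5])
w = np.array([3.4, 0.7, -0.1])
print(perceptron(x, w, b=0.1))
```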
Multi-layer perceptron: “can represent a wide variety of interesting functions”
Some weird neural network!
Data to features → Features to labels (e.g., + / -)
Deep neural networks: multiple layers (image → Cat)
Training phase: pass labeled data until we get to an acceptable accuracy
Inference: new data → result
How do we train these networks?
Labeled data
Forward pass to compute a prediction (e.g., “Fish”)
Backward pass to ‘slightly’ nudge the weights
Repeat until happy/convergence
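A schematic training loop, assuming a PyTorch model and data loader; the names and hyperparameters below are placeholders, not the exact setup from the talk.

```python
import torch
import torch.nn as nn

def train(model, loader, epochs=10, lr=0.01):
    """Forward pass on labeled data, backward pass to 'slightly' nudge the
    weights, repeated over the data until accuracy is acceptable."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):              # repeat until happy/convergence
        for x, y in loader:              # labeled data
            loss = loss_fn(model(x), y)  # forward pass: prediction + loss
            opt.zero_grad()
            loss.backward()              # backward pass: compute gradients
            opt.step()                   # slightly nudge the weights
    return model
```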
PERFORMANCE (TRAINING/INFERENCE)
READING/WRITING DATA + COMPUTATION
ACCURACY
Layers in state-of-the-art NNs: 7 layers (2012), 19 layers (2014), 152 layers (2015), 264 layers (2017)
Deep Residual Learning for Image Recognition [He et al., 2016]
Single model: one DNN → Cat
Ensemble model: multiple DNNs, each with its own prediction (Cat, Cat, Catastrophe, CAT), combined into one answer
Representationally richer
Wisdom of the crowd
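One simple way to combine the ensemble's predictions is a majority vote, sketched below; the predict interface and the label normalization are assumptions for illustration, and averaging softmax outputs is an equally common combiner.

```python
from collections import Counter

def ensemble_predict(models, x):
    """Each member predicts independently; the ensemble returns the majority vote."""
    votes = [m.predict(x) for m in models]   # e.g. ["Cat", "Cat", "Catastrophe", "CAT"]
    votes = [v.lower() for v in votes]       # normalize label spelling
    label, _ = Counter(votes).most_common(1)[0]
    return label
```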
finding the first principles of neural networks
Ensemble models are VERY EXPENSIVE TO TRAIN
Setup: 100 variants of VGG-16 (different structures), CIFAR-10
Training approaches: Full-data, Bagging
How expensive is it?
[Figure: Training time (hrs., 0 to 60) vs. number of neural networks (1 to 100), Full-data vs. Bagging]
Time to add one neural network:
                                      Full-Data   Bagging
CIFAR-10                              35 min.     27 min.
CIFAR-100 (complex data)              67 min.     35 min.
CIFAR-10 | ResNet (complex model)     516 min.    206 min.
Typical ensemble size: ~10 for NN ensembles vs. ~250 for ensembles of other models
Complex problems need larger ensembles
MotherNets: Rapid Deep Ensemble Learning
It is all about data movement and computation
MotherNets
Input: a set of DNN architectures
i. Capture structural similarity (construct the MotherNet)
ii. Train once (the MotherNet is trained)
iii. Transfer learned function (each ensemble network starts with the same function as the MotherNet)
iv. Train incrementally
I. Capture structural similarity
Find the largest network structure common amongst all networks.
Neural network specifications (neurons per layer):
            L1    L2    L3    L4
NN1          5    20    13     -
NN2         15     8    20     -
NN3         10    18    10    22
MotherNet    5     8    10
Smaller than any of the ensemble networks
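A small sketch of this construction, assuming each network is specified simply as a list of fully connected layer widths: the MotherNet keeps the smallest number of layers and, at each layer position, the smallest width, so it is never larger than any ensemble member. Convolutional or block-structured networks need a richer notion of "common structure" than this sketch shows.

```python
def mothernet_structure(specs):
    """Largest structure common to all networks, assuming each spec is a list of
    layer widths, e.g. NN1=[5, 20, 13], NN2=[15, 8, 20], NN3=[10, 18, 10, 22]."""
    depth = min(len(s) for s in specs)                        # fewest layers among the networks
    return [min(s[i] for s in specs) for i in range(depth)]   # smallest width per layer position

specs = [[5, 20, 13], [15, 8, 20], [10, 18, 10, 22]]
print(mothernet_structure(specs))   # -> [5, 8, 10], matching the slide's MotherNet
```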
II. Transfer learned function
Function-preserving transformations: increase the capacity (expressivity) of the networks while preserving their function (and accuracy).
Widen the network / Deepen the network
Morph the MotherNet into the ensemble networks (a widening sketch follows below).
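For intuition, here is a Net2Net-style widening of a single fully connected layer, which preserves the layer's function while adding capacity. This is a simplified illustration (biases omitted, new units copy existing ones), not necessarily the exact transformation set used by MotherNets.

```python
import numpy as np

def widen_layer(W_in, W_out, new_width, rng=np.random.default_rng(0)):
    """Function-preserving widening of one hidden layer (Net2Net-style sketch).
    W_in:  (old_width, n_inputs)  weights into the layer
    W_out: (n_outputs, old_width) weights out of the layer
    New units replicate randomly chosen existing units (so their activations are
    identical), and the outgoing weights of each replicated unit are split among
    its copies so the layer's output is unchanged. Biases of copied units would
    simply be duplicated."""
    old_width = W_in.shape[0]
    extra = new_width - old_width
    picks = rng.integers(0, old_width, size=extra)        # which units to replicate
    W_in_new = np.vstack([W_in, W_in[picks]])              # copy incoming weights
    counts = np.bincount(picks, minlength=old_width) + 1   # copies per original unit
    W_out_new = np.hstack([W_out / counts, W_out[:, picks] / counts[picks]])
    return W_in_new, W_out_new
```

Because every copy of a unit produces the same activation and the outgoing weights are divided evenly among the copies, the widened layer computes exactly the same function before any further training; the new parameters are then perturbed slightly to create diversity.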
How does it behave?
Setup: 100 variants of VGG-16 (different structures), CIFAR-10
Training approaches: Full-data, Bagging, MotherNets
[Figure: Training time (hrs., 0 to 60) vs. number of neural networks (1 to 100): Full-data, Bagging, MotherNets]
Time to add one neural network:
                                      Full-Data   Bagging    MotherNets
CIFAR-10                              35 min.     27 min.    7 min.
CIFAR-100 (complex data)              67 min.     35 min.    14 min.
CIFAR-10 | ResNet (complex model)     516 min.    206 min.   155 min.
Accuracy vs Performance?
[Figure 3: (a) Ensemble test error rate (%) for full-data, bagging, KD, and MotherNets; (b) Training time (min) for FD, Bag., KD, and MNet across VGG variants (V13, V16, V16A, V16B, V19) and the MotherNet (MN); (c) Tradeoff: training time vs. test error rate for MotherNets and full-data; (d) Varying the number of clusters: error rate and training time.]
Figure 3: MotherNets achieves comparable individual and ensemble accuracy to the full-data approach but in a fraction of the training time, striking a better time-accuracy tradeoff.
networks in a function-preserving manner, we build a separate MotherNet for each class (or a set of MotherNets if each class also consists of networks of diverse sizes). This enables MotherNets to closely capture the architecture class of every neural network in the diverse ensemble.
2.4. Training
Step 1: Training the MotherNet. First, the MotherNet for every cluster is trained from scratch using the entire data set until convergence. This allows the MotherNet to learn a good core representation of the data. Since the MotherNet has fewer parameters than any of the ensemble networks in its cluster (by construction), it is expected to take less time to converge than any of the ensemble networks.
Step 2: Hatching Ensemble Networks. Once the MotherNet is trained, the next step is to generate every ensemble network by expanding the MotherNet through a sequence of function-preserving transformations. We call this process hatching. Hatching yields the ensemble networks with possibly larger size (more parameters) than the MotherNet but representing the same function as learned by the MotherNet. The hatching process is instantaneous, as generating every ensemble network requires a single pass on the MotherNet. Every hatched ensemble network with size (number of parameters) larger than the MotherNet now has additional untrained parameters that we need to further train. To explicitly add diversity to the hatched networks, we randomly perturb the newly introduced parameters. This is a standard technique to create diversity when training ensemble networks.
Step 3: Training Ensemble Networks. The hatched ensemble networks are trained using bootstrap aggregation (bagging) (Breiman, 1996). Bagging is an effective and widely used method to create diversity and to reduce the variance of ensembles (Hansen & Salamon, 1990; Guzman-Rivera et al., 2014; Lee et al., 2015b). This is because different models in the ensemble are trained using different overlapping subsets of the data. However, recent work
has shown that training neural network ensembles through bagging results in decreased generalization accuracy, as it reduces the number of unique data items seen by individual neural networks. Since neural networks have a large number of parameters, they are affected relatively more by this reduction in unique data items than ensembles of other classifiers such as decision trees or SVMs (Domingos, 1999; Lee et al., 2015a). Hatching the networks from the MotherNet and then further training them using bagging overcomes this problem.
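Putting the three steps together, a high-level sketch of the pipeline; train, hatch, perturb, and bootstrap are hypothetical callables standing in for the operations described above, so the structure, not the exact API, is the point.

```python
def train_ensemble_with_mothernets(mothernet, target_specs, dataset,
                                   train, hatch, perturb, bootstrap):
    """MotherNets training pipeline (sketch):
    Step 1: train the MotherNet on the full data set until convergence.
    Step 2: hatch each ensemble network via function-preserving transformations
            and slightly perturb the newly introduced parameters for diversity.
    Step 3: finish training every hatched network on a bootstrap sample (bagging)."""
    train(mothernet, dataset)                    # step 1: learn a core representation
    ensemble = []
    for spec in target_specs:
        net = hatch(mothernet, spec)             # step 2: same function, more capacity
        perturb(net)                             #         randomize the new parameters
        train(net, bootstrap(dataset))           # step 3: bagging over hatched networks
        ensemble.append(net)
    return ensemble
```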
We explain in supplementary material Section B how MotherNets' training process can be parallelized and how state-of-the-art general network training optimizations can be incorporated into MotherNets.
2.5. Optimizing for Inference
While our focus is on reducing training costs, the MotherNets approach can also help during inference. To optimize an ensemble trained through MotherNets for inference cost (time and memory requirements), we share the incremental training of MotherNet parameters. To achieve this, we construct a shared-MotherNet from the hatched networks and train the shared-MotherNet incrementally instead of the hatched networks. In the shared-MotherNet, the hatched networks are combined such that they have a single copy of the MotherNet parameters, which is jointly trained instead of being trained independently by each ensemble network. This yields an ensemble with fewer parameters that is more efficient to infer from and more compact to deploy. We explain this approach to optimizing ensemble inference, and provide experimental results, in supplementary material Section C.
3. Experiments
We show how MotherNets can be used to accelerate the training of diverse ensembles of neural networks.
Training Setup. We adopt stochastic gradient descent with
Accelerating inference in MotherNets
Inference cost
Training (MotherNets step iv, incremental training): maintain common MotherNets parameters
Inference: a full pass through the shared MotherNet parameters plus partial passes through each network's own parameters
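One way to picture the shared-MotherNet at inference time, assuming the shared MotherNet parameters form a common trunk evaluated once (the full pass) while each member only runs its own additional parameters (the partial passes), with outputs averaged at the end. This layout and combiner are simplifying assumptions for illustration, not the exact construction.

```python
import torch
import torch.nn as nn

class SharedMotherNetEnsemble(nn.Module):
    """Sketch: a single copy of the MotherNet parameters (shared trunk) plus
    per-member additional layers. Inference costs one full pass through the
    trunk and only a partial pass through each member's extra parameters."""
    def __init__(self, trunk: nn.Module, heads: list[nn.Module]):
        super().__init__()
        self.trunk = trunk                    # shared MotherNet parameters
        self.heads = nn.ModuleList(heads)     # each hatched network's own parameters

    def forward(self, x):
        shared = self.trunk(x)                           # full pass, computed once
        logits = [head(shared) for head in self.heads]   # partial pass per member
        return torch.stack(logits).mean(dim=0)           # average the member outputs
```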
Accelerating inference in MotherNets: initial results
Setup: CIFAR-10, ensemble of 5 VGGNets with between 13 and 34 layers
Inference time: Standard 3.71 min. vs. Share-MN 1.95 min.
Test error rate: 7.5% vs. 7.9%
finding the first principles of neural networks
Stratos Idreos