
Stratos Idreos

Talks at the lab this week:

11:30am today, Yifan Wu from Berkeley: Real-Time Interactive Data Analytic Interfaces

11:30am tomorrow, K. Karanasos from Microsoft: (big data/ML)

Discussion papers as of Monday. Review submission every class.

Deep learning

Academic interest

Market share

Wealth of applications

Some input → some output

Image recognition: image → "Cat", "Dog", "Table"

Machine translation: "Γάτα" (Greek) → "Cat"

Auto-complete: "What happens in Vegas…"

cat = f( image )

How do we do this mapping?

Perceptron: data (clean/transform) → weighted inputs (e.g., weight 3.4) → activation function → output

Multi-layer perceptron: "can represent a wide variety of interesting functions"

Some weird neural network!

Data to features → features to labels (+ / -)
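To make the perceptron and the multi-layer perceptron concrete, here is a minimal NumPy sketch of their forward passes; the sigmoid activation and the layer sizes are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def sigmoid(z):
    # activation function: squashes the weighted sum into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def perceptron(x, w, b):
    # one unit: weighted sum of the (cleaned/transformed) inputs, then activation
    return sigmoid(np.dot(w, x) + b)

def mlp_forward(x, layers):
    # multi-layer perceptron: feed the output of one layer into the next
    for W, b in layers:
        x = sigmoid(W @ x + b)
    return x

# illustrative sizes: 4 input features -> 3 hidden units -> 1 output
rng = np.random.default_rng(0)
layers = [(rng.normal(size=(3, 4)), np.zeros(3)),
          (rng.normal(size=(1, 3)), np.zeros(1))]
print(mlp_forward(rng.normal(size=4), layers))
```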

Deep neural networks: multiple layers → Cat

Training phase: pass labeled data through the network until we reach an acceptable accuracy. Inference: new data → result.

How do we train these networks?

Forward pass over the labeled data to compute a prediction (e.g., "Fish").

Backward pass to 'slightly' nudge the weights.

Repeat until happy/convergence.
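In code, this loop looks roughly as follows; a minimal PyTorch sketch assuming a generic `model`, a `loader` of labeled batches, cross-entropy loss, and SGD (none of these specifics come from the slides).

```python
import torch
import torch.nn.functional as F

def train(model, loader, epochs=10, lr=0.01):
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(epochs):                    # repeat until happy/convergence
        for x, y in loader:                    # labeled data
            logits = model(x)                  # forward pass: compute a prediction
            loss = F.cross_entropy(logits, y)  # how wrong is it ("Fish" vs. the label)?
            optimizer.zero_grad()
            loss.backward()                    # backward pass
            optimizer.step()                   # 'slightly' nudge the weights
    return model
```

The inner loop is exactly where the next slide's costs live: reading/writing the labeled data plus the forward/backward computation.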

PERFORMANCE (TRAINING/INFERENCE) = READING/WRITING DATA + COMPUTATION

ACCURACY

Layers in state-of-the-art NNs: 7 layers (2012), 19 layers (2014), 152 layers (2015), 264 layers (2017).

Deep Residual Learning for Image Recognition [He et al., 2016]

Single Model: DNN → Cat

Ensemble Model: several DNNs → "Cat", "Cat", "Catastrophe", "CAT" (combine their votes)

Representationally richer. Wisdom of the crowd.
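A minimal sketch of combining an ensemble's votes by majority; the models below are stand-ins (real DNNs would go here), and majority voting is one common combination rule, assumed rather than taken from the slide.

```python
from collections import Counter

def ensemble_predict(models, x):
    # each trained model votes; the ensemble returns the majority label
    votes = [m(x) for m in models]
    return Counter(votes).most_common(1)[0][0]

# stand-in "models"; the votes mirror the slide's example
models = [lambda x: "Cat", lambda x: "Cat",
          lambda x: "Catastrophe", lambda x: "CAT"]
print(ensemble_predict(models, x="image"))  # -> "Cat"
```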


VERY EXPENSIVE TO TRAIN

How expensive is it?

Setup: CIFAR-10, 100 variants of VGG-16 (different structures). Training approaches: full-data and bagging.

[Chart: training time (hrs., 0-60) vs. number of neural networks (1, 21, 41, 61, 81, 100) for full-data and bagging.]

Time to add one neural network:

CIFAR-10: Full-data 35 min., Bagging 27 min.
CIFAR-100 (complex data): Full-data 67 min., Bagging 35 min.
CIFAR-10 | ResNet (complex model): Full-data 516 min., Bagging 206 min.
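For scale: at roughly 35 min. per network, training all 100 full-data networks one after another takes about 100 × 35 min ≈ 58 hours, which matches the upper end of the 0-60 hr range on the chart.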

Ensemble size: ~10 for NN ensembles vs. ~250 for ensembles of other models. Yet complex problems need larger ensembles.

MotherNets: Rapid Deep Ensemble Learning. It is all about data movement and computation.

MotherNets: given the DNN architectures of the ensemble,

i. Capture structural similarity (build a MotherNet from the architectures)
ii. Train once (the MotherNet is trained)
iii. Transfer learned function (every ensemble network starts with the same function as the MotherNet)
iv. Train incrementally

I. Capture structural similarity

Find the largest network structure common amongst all networks.

Neural network specifications (neurons per layer, L1-L4):

NN1: 10 18 10 22
NN2: 15 8 20
NN3: 5 20 13
MotherNet: 5 8 10

Smaller than any of the ensemble networks.
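For fully connected architectures this construction reduces to taking the common (shallowest) depth and the layer-wise minimum width. A minimal sketch using the (reconstructed) example widths above; the helper name is illustrative, and other layer types (e.g., convolutions) need their own rules, which are not shown here.

```python
def mothernet_spec(network_specs):
    # network_specs: per-layer widths of each ensemble network,
    # e.g. [[10, 18, 10, 22], [15, 8, 20], [5, 20, 13]]
    depth = min(len(spec) for spec in network_specs)   # common (shallowest) depth
    return [min(spec[i] for spec in network_specs)     # layer-wise minimum width
            for i in range(depth)]

print(mothernet_spec([[10, 18, 10, 22], [15, 8, 20], [5, 20, 13]]))  # -> [5, 8, 10]
```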


II. Transfer learned function

Function Preserving Transformations: increase the capacity (expressivity) of the network, by deepening or widening it, while preserving its function (and thus its accuracy).

Morph the MotherNet into the ensemble networks.
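A well-known example of such a transformation is Net2Net-style widening of a fully connected hidden layer. The sketch below illustrates that general technique; it is not necessarily the exact set of transformations MotherNets uses.

```python
import numpy as np

def widen_fc(W1, b1, W2, new_width, rng=None):
    """Function-preserving widening of the hidden layer between W1 and W2.

    W1: (n, d) incoming weights, b1: (n,) biases, W2: (k, n) outgoing weights.
    Returns parameters for a hidden layer with new_width >= n units that
    computes the same function (for any elementwise activation)."""
    rng = np.random.default_rng(0) if rng is None else rng
    n = W1.shape[0]
    assert new_width >= n
    # map every new unit onto an existing unit (identity for the first n units)
    g = np.concatenate([np.arange(n), rng.integers(0, n, new_width - n)])
    counts = np.bincount(g, minlength=n)   # how many copies each original unit has
    W1_new, b1_new = W1[g], b1[g]          # copies reuse the incoming weights
    W2_new = W2[:, g] / counts[g]          # split the outgoing weights among copies
    return W1_new, b1_new, W2_new

# quick check: the widened layer preserves the function
rng = np.random.default_rng(1)
W1, b1, W2 = rng.normal(size=(3, 5)), rng.normal(size=3), rng.normal(size=(2, 3))
x = rng.normal(size=5)
W1n, b1n, W2n = widen_fc(W1, b1, W2, new_width=6)
print(np.allclose(W2 @ np.tanh(W1 @ x + b1), W2n @ np.tanh(W1n @ x + b1n)))  # True
```

Each copy of a unit reuses the original incoming weights while the outgoing weights are split among the copies, so the widened layer computes exactly the same function for any elementwise activation.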


How does it behave?

Setup: CIFAR-10, 100 variants of VGG-16 (different structures). Training approaches: full-data, bagging, and MotherNets.

[Chart: training time (hrs., 0-60) vs. number of neural networks (1, 21, 41, 61, 81, 100) for full-data, bagging, and MotherNets.]

Time to add one neural network:

CIFAR-10: Full-data 35 min., Bagging 27 min., MotherNets 7 min.
CIFAR-100 (complex data): Full-data 67 min., Bagging 35 min., MotherNets 14 min.
CIFAR-10 | ResNet (complex model): Full-data 516 min., Bagging 206 min., MotherNets 155 min.
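For the same 100-network CIFAR-10 ensemble, that is roughly 100 × 7 min ≈ 12 hours with MotherNets (plus the one-time cost of training the MotherNet itself), compared to about 58 hours with full-data training.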

Accuracy vs Performance?

[Figure 3 panels: (a) Ensemble test error rate (%) for bagging, full-data, KD, and MotherNets; (b) Training time (min) for FD, Bag., KD, and MNet, broken down by network (MN, V13, V16, V16A, V16B, V19); (c) Tradeoff: training time (min) vs. test error rate (%) for MotherNets and full-data; (d) Varying the number of clusters (1-5): error rate (%) and training time (min).]

Figure 3: MotherNets achieves comparable individual and ensemble accuracy to the full-data approach but in a fraction of the training time – striking a better time-accuracy tradeoff.

networks in a function-preserving manner, we build a separate MotherNet for each class (or a set of MotherNets if each class also consists of networks of diverse sizes). This enables MotherNets to closely capture the architecture class of every neural network in the diverse ensemble.

2.4. Training

Step 1: Training the MotherNet. First, the MotherNet for every cluster is trained from scratch using the entire data set until convergence. This allows the MotherNet to learn a good core representation of the data. Since the MotherNet has fewer parameters than any of the ensemble networks in its cluster (by construction), it is expected to take less time to converge than any of the ensemble networks.

Step 2: Hatching Ensemble Networks. Once the MotherNet is trained, the next step is to generate every ensemble network by expanding the MotherNet through a sequence of function-preserving transformations. We call this process hatching. Hatching yields the ensemble networks with possibly larger size (more parameters) than the MotherNet but representing the same function as learned by the MotherNet. The hatching process is instantaneous as generating every ensemble network requires a single pass on the MotherNet. Every hatched ensemble network with size (number of parameters) larger than the MotherNet now has additional untrained parameters that we need to further train. To explicitly add diversity to the hatched networks, we randomly perturb the newly introduced parameters. This is a standard technique to create diversity when training ensemble networks.

Step 3: Training Ensemble Networks. The hatched ensemble networks are trained using bootstrap aggregation (bagging) (Breiman, 1996). Bagging is an effective and widely used method to create diversity and to reduce the variance of ensembles (Hansen & Salamon, 1990; Guzman-Rivera et al., 2014; Lee et al., 2015b). This is because different models in the ensemble are trained using different overlapping subsets of the data. However, recent work has shown that training neural network ensembles through bagging results in decreased generalization accuracy, as it reduces the number of unique data items seen by individual neural networks. Since neural networks have a large number of parameters, they are affected relatively more by this reduction in unique data items than ensembles of other classifiers such as decision trees or SVMs (Domingos, 1999; Lee et al., 2015a). Hatching the networks from the MotherNet and then further training them using bagging overcomes this problem.
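A minimal sketch of the bagging step, assuming the hatched networks sit in a list `hatched_models`, the labeled data lives in NumPy arrays `X` and `y`, and `train_fn` is whatever training routine is in use (all hypothetical names; only the bootstrap logic is shown).

```python
import numpy as np

def bootstrap_indices(n, rng):
    # sample n indices with replacement: each bag sees ~63% unique data items
    return rng.integers(0, n, size=n)

def train_hatched_ensemble(hatched_models, X, y, train_fn, seed=0):
    # Step 3 sketch: train every hatched network on its own bootstrap sample
    rng = np.random.default_rng(seed)
    for model in hatched_models:
        idx = bootstrap_indices(len(X), rng)
        train_fn(model, X[idx], y[idx])
    return hatched_models
```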

We explain in supplementary material Section B how MotherNets' training process can be parallelized and how state-of-the-art general network training optimizations can be incorporated into MotherNets.

2.5. Optimizing for Inference

While our focus is on reducing training costs, the MotherNets approach can also help during inference. To optimize an ensemble trained through MotherNets for inference cost (time and memory requirements), we share the incremental training of MotherNet parameters. To achieve this, we construct a shared-MotherNet from the hatched networks and train the shared-MotherNet incrementally instead of the hatched networks. In the shared-MotherNet, the hatched networks are combined such that they have a single copy of the MotherNet parameters, which is jointly trained instead of being trained independently by each ensemble network. This yields an ensemble with fewer parameters that is more efficient to infer from and more compact to deploy. We explain this approach to optimizing ensemble inference, and provide experimental results, in supplementary material Section C.
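As a rough illustration of the shared-trunk idea (not the paper's exact shared-MotherNet construction, which is in its supplementary Section C), here is a PyTorch sketch with one shared trunk, member-specific heads, and averaged predictions as an assumed combination rule; the layer sizes are illustrative.

```python
import torch
import torch.nn as nn

class SharedTrunkEnsemble(nn.Module):
    """One copy of the shared (MotherNet-like) trunk is used by all members;
    only the member-specific layers are duplicated."""

    def __init__(self, trunk: nn.Module, heads: list[nn.Module]):
        super().__init__()
        self.trunk = trunk                     # shared parameters
        self.heads = nn.ModuleList(heads)      # member-specific parameters

    def forward(self, x):
        shared = self.trunk(x)                 # computed once per input
        logits = torch.stack([h(shared) for h in self.heads])
        return logits.mean(dim=0)              # average the members' predictions

# illustrative instantiation: a small trunk and 3 small heads for 10 classes
trunk = nn.Sequential(nn.Flatten(), nn.Linear(32 * 32 * 3, 128), nn.ReLU())
heads = [nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))
         for _ in range(3)]
ensemble = SharedTrunkEnsemble(trunk, heads)
print(ensemble(torch.randn(4, 3, 32, 32)).shape)  # torch.Size([4, 10])
```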

3. Experiments

We show how MotherNets can be used to accelerate the training of diverse ensembles of neural networks.

Training Setup. We adopt stochastic gradient descent with

Accelerating inference in MotherNets

Inference cost. Training: maintain common MotherNets parameters (MN; iv, incremental training). Inference: full pass / partial passes.

Initial results (CIFAR-10, ensemble of 5 VGGNets with between 13 and 34 layers):

Inference time: Standard 3.71 min. vs. Share-MN 1.95 min.
Test error rate: Standard 7.5% vs. Share-MN 7.9%

finding the first principles of neural networks

Stratos Idreos
