
OUTRAGEOUSLY LARGE NEURAL NETWORKS: THE SPARSELY-GATED MIXTURE-OF-EXPERTS LAYER [1]

Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton and Jeff Dean

Google Brain - Jagiellonian University

Presenter: Mohammad Motamedi

• Key contributors to the performance of deep networks:

  • Model size

  • Training data size

• “When datasets are sufficiently large, increasing the capacity (number of parameters) of neural networks can give much better prediction accuracy”

• Current computational infrastructure falls short of meeting these computing demands.


• Inefficiency in the memory system

• Branching overhead on modern hardware (e.g., GPUs)

• Sophisticated linear algebra libraries

• “Even if the number of arithmetic operations is reduced by 100×, the overhead of lookups and cache misses would dominate: switching to sparse matrices might not pay off.” [2]


$y = \sum_{i=1}^{n} G(x)_i \, E_i(x), \qquad G(x) \in \mathbb{R}_{+}^{n}$

• If $G(x)_i = 0$, we need not compute $E_i(x)$.

• In each round, only a handful of the roughly 1000 expert modules are active (sketched after the gating equations below).

• Desired characteristics:

  • Sparsity

  • Load balancing

$G(x) = \mathrm{Softmax}\big(\mathrm{TopK}(H(x), k)\big)$

$\mathrm{TopK}(v, k)_i = \begin{cases} v_i & \text{if } v_i \text{ is among the top } k \text{ elements of } v \\ -\infty & \text{otherwise} \end{cases}$

$H(x)_i = (x \cdot W_g)_i + \mathrm{StandardNormal}() \cdot \ln\!\left(1 + e^{(x \cdot W_{noise})_i}\right)$
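A minimal NumPy sketch of the noisy top-k gating and the sparse expert combination defined above. This is not from the slides: the function names, layer sizes, value of k, and the linear stand-in experts are all assumptions for illustration.

```python
import numpy as np

def softmax(z):
    # numerically stable softmax; exp(-inf) entries become exact zeros
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def noisy_top_k_gating(x, W_g, W_noise, k, rng):
    # H(x)_i = (x . W_g)_i + StandardNormal() * ln(1 + exp((x . W_noise)_i))
    clean = x @ W_g
    noise_stddev = np.log1p(np.exp(x @ W_noise))        # softplus term
    h = clean + rng.standard_normal(clean.shape) * noise_stddev
    # TopK: keep the k largest entries, push the rest to -inf before the softmax
    masked = np.full_like(h, -np.inf)
    top_idx = np.argsort(h)[-k:]
    masked[top_idx] = h[top_idx]
    return softmax(masked)                               # only k nonzero gates

def moe_forward(x, gates, experts):
    # y = sum_i G(x)_i * E_i(x); experts with a zero gate are never evaluated
    return sum(gates[i] * experts[i](x) for i in np.flatnonzero(gates))

# Toy usage with linear experts standing in for arbitrary expert networks (assumed sizes).
rng = np.random.default_rng(0)
d_in, d_out, n_experts, k = 8, 4, 16, 2
W_g = rng.normal(size=(d_in, n_experts))
W_noise = rng.normal(size=(d_in, n_experts))
experts = [lambda v, W=rng.normal(size=(d_in, d_out)): v @ W for _ in range(n_experts)]

x = rng.normal(size=d_in)
gates = noisy_top_k_gating(x, W_g, W_noise, k, rng)
y = moe_forward(x, gates, experts)                       # shape: (d_out,)
```

Only the k experts with nonzero gates are evaluated, which is what keeps the layer cheap despite the very large total parameter count.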


• Large batch sizes are necessary to achieve high throughput:

  • Amortizing the overhead of data transfers

• Assume the gating network chooses $k$ out of $n$ expert networks. Each expert then receives a much smaller effective batch of roughly $\frac{k \times b}{n} \ll b$ examples, where $b$ is the original batch size.

• With model parallelism over $d$ devices, each expert receives a combined batch of roughly $\frac{k \times b \times d}{n}$ examples (see the worked example below).
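A worked example of the shrinking-batch arithmetic; the values of n, k, b, and d below are assumptions, not numbers from the slides.

```python
n, k = 1000, 4      # experts in the layer, experts selected per example
b, d = 1024, 32     # per-device batch size, number of devices

print(k * b / n)      # ~4 examples per expert on a single device
print(k * b * d / n)  # ~131 examples per expert when combined across d devices
```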


• The gating network tends to select the same few experts, and this behavior is self-reinforcing: favored experts are trained more and are selected even more often.

• An additional loss term is needed to discourage this behavior for a given batch $X$ (see the sketch below).

$L_{\text{importance}}(X) = w_{\text{importance}} \cdot CV\!\left(\sum_{x \in X} G(x)\right)^{2}$

• It is still possible to have imbalance. How? Experts can receive equal total importance while handling very different numbers of examples (a few large gate values vs. many small ones).
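A minimal sketch of the importance loss above, assuming `gates` holds the gating outputs G(x) for a batch, one row per example and one column per expert; `w_importance` is a hand-tuned hyperparameter.

```python
import numpy as np

def importance_loss(gates, w_importance):
    # Importance(X): total gate value each expert receives over the batch
    importance = gates.sum(axis=0)
    # CV = coefficient of variation = std / mean; square it and scale to get the loss
    cv = importance.std() / importance.mean()
    return w_importance * cv ** 2
```

Minimizing this term pushes the experts' total gate values toward equality, but, as the question above notes, equal importance does not guarantee an equal number of examples per expert.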


• In machine learning, perplexity is a measure of prediction error.

$\text{Perplexity} = 2^{-\sum_{x \in X} q(x)\, \log_2 p(x)}$, where $q$ is the empirical distribution of the data $X$ and $p$ is the model's predicted distribution.

• It indicates how confidently the model predicts held-out data; lower perplexity means better predictions (see the small sketch below).
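A small sketch of the perplexity formula above; the distributions q and p are made-up values for illustration only.

```python
import numpy as np

def perplexity(q, p):
    # 2 raised to the cross-entropy between the empirical distribution q and the model p
    return 2.0 ** -(q * np.log2(p)).sum()

q = np.array([0.5, 0.25, 0.25])   # empirical distribution of the data (assumed)
p = np.array([0.4, 0.4, 0.2])     # model's predicted distribution (assumed)
print(perplexity(q, p))           # ~2.97; lower is better, with the minimum at p == q
```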


                          Perplexity   #Parameters   Training Time            TFLOPS/GPU
Best public results [7]   34.7         151 million   59 hours on 32 K40 GPUs  1.09
Proposed                  28.0         4.4 billion   47 hours on 32 K40 GPUs  1.56


One Billion Word language modeling benchmark [4]


1. Shazeer, Noam, et al. "Outrageously large neural networks: The sparsely-gated mixture-of-experts layer." arXiv preprint arXiv:1701.06538 (2017).

2. Szegedy, Christian, et al. "Going deeper with convolutions." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015).

3. Jozefowicz, Rafal, Oriol Vinyals, Mike Schuster, Noam Shazeer, and Yonghui Wu. "Exploring the limits of language modeling." arXiv preprint arXiv:1602.02410 (2016).

4. Chelba, Ciprian, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. "One billion word benchmark for measuring progress in statistical language modeling." arXiv preprint arXiv:1312.3005 (2013).

