Corinna Cortes, Head of Research, Google, at MLconf NYC 2017
TRANSCRIPT
Harnessing Neural Networks
Corinna Cortes, Google Research, NY

Harnessing the Power of Neural Networks: Introduction
How do we standardize the output?
How do we speed up inference?
How do we automatically find a good network architecture?
Google’s mission is to organize the world’s information and make it universally accessible and useful.
Google Translate
Smart Reply in Inbox
10% of all responses sent on mobile
LSTM in Action
LSTMs and Extrapolation
They daydream or hallucinate :-)
Feature or bug?
DeepDream Art Auction and Symposium (A&MI)
Magenta
[LSTM diagram: repeated cell A mapping input x_t to hidden output h_t]
A.I. Duet: https://aiexperiments.withgoogle.com/ai-duet/view/
Harnessing the Power of Neural Networks: Introduction
How do we standardize the output?
How do we speed up inference?
How do we automatically find a good network architecture?
Restricting the Output: Smart Replies
http://www.kdd.org/kdd2016/papers/files/Paper_1069.pdf
● Ungrammatical and inappropriate answers
  ○ "thanks hon!"; "Yup, got it thx"; "Leave me alone!"
● Work with a fixed response set
  ○ Sanitized answers are clustered into semantically similar groups using label propagation;
  ○ The answers in the clusters are used to filter the candidate set generated by the LSTM. Diversity is ensured by using top answers from different clusters.
● Efficient search via tries
Search Tree, Trie, for Valid Responses
[Trie diagram: valid responses such as "How about Tuesday?", "How about Wednesday?", "I can do Tuesday.", "I can do Wednesday.", and "What time works for you?" share prefixes in the trie, with responses grouped into clusters]
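The trie lookup can be sketched as follows (an illustrative toy, not Google's implementation): a token-level trie over the valid response set, so that during beam search any partial hypothesis whose token sequence is not a prefix of some valid response can be pruned immediately.

```python
# Token-level trie over the valid response set. Beam search keeps a
# hypothesis only while is_valid_prefix() holds for its token sequence.
class ResponseTrie:
    def __init__(self):
        self.children = {}
        self.is_response = False  # True if a full response ends here

    def insert(self, tokens):
        node = self
        for t in tokens:
            node = node.children.setdefault(t, ResponseTrie())
        node.is_response = True

    def is_valid_prefix(self, tokens):
        node = self
        for t in tokens:
            if t not in node.children:
                return False
            node = node.children[t]
        return True

trie = ResponseTrie()
trie.insert("I can do Tuesday".split())
trie.insert("I can do Wednesday".split())
trie.insert("What time works for you ?".split())
```

With this structure, checking a partial hypothesis costs one dictionary lookup per token, independent of the size of the response set.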
Computational Complexity
● Exhaustive search: R × l, where R is the size of the response set and l is the length of the longest sentence
● Beam search: b × l
● Typical size of R: millions; typical size of b: 10-30
● A more elegant solution based on rules
  ○ Exploit rules to efficiently enlarge the response set:
    ■ "Can you do Monday?" → "Yes, I can do Monday"
    ■ "Can you do Tuesday?" → "Yes, I can do Tuesday"
    ■ ...
  ○ Generalized: "Can you do <time>?" → "Yes, I can do <time>" or "No, I can do <time + 1>"
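The rule idea above can be sketched in a few lines (illustrative only; the patterns, day list, and `next_day` helper are hypothetical stand-ins): store response *templates* with slots, and instantiate a slot such as `<time>` only when the incoming message matches the corresponding question pattern.

```python
# Toy template expansion standing in for the rule-based response set.
import re

TEMPLATES = [
    (re.compile(r"Can you do (?P<time>\w+)\?"),
     ["Yes, I can do {time}", "No, I can do {next}"]),
]

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday"]

def next_day(day):
    # Toy stand-in for "<time + 1>".
    return DAYS[(DAYS.index(day) + 1) % len(DAYS)]

def candidate_replies(message):
    out = []
    for pattern, templates in TEMPLATES:
        m = pattern.search(message)
        if m:
            time = m.group("time")
            for t in templates:
                out.append(t.format(time=time, next=next_day(time)))
    return out
```

Because templates are expanded only against the incoming message, the effective response set can be enormous without ever being materialized.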
What if the Response Set is in the Billions?
Rules for Response Set
Text Normalization for Text-to-Speech (TTS) Systems
Example: a navigation assistant
Text Normalization
Richard Sproat, Navdeep Jaitly, Google: "RNN Approaches to Text Normalization: A Challenge", https://arxiv.org/pdf/1611.00068.pdf
Break the Task into Two
● Channel model
  ○ What are the possible normalizations of a token? Maps a sequence of tokens to words.
  ○ Example: 123
    ■ one hundred twenty three, one two three, one twenty three, ...
● Language model
  ○ Which normalization is appropriate in the given context? Maps words to words.
  ○ Example: 123
    ■ In "123 King Ave.", the correct reading in American English would normally be "one twenty three".
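The two-model factorization can be sketched as follows (names and scores are illustrative; in the real system both models are RNNs): the channel model proposes verbalizations of a token, and the language model picks the one that best fits the context.

```python
# Toy channel model: possible readings of a token. A real system would
# generate these from a learned model or a covering grammar.
def channel_model(token):
    if token == "123":
        return ["one hundred twenty three", "one two three",
                "one twenty three"]
    return [token]

# Toy language model: an address context prefers the "one twenty three"
# reading. This hard-coded score stands in for an LSTM language model.
def language_model_score(words, context):
    if "Ave." in context and words == "one twenty three":
        return 0.9
    return 0.1

def normalize(token, context):
    candidates = channel_model(token)
    return max(candidates, key=lambda w: language_model_score(w, context))
```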
Combining the Models
One combined LSTM
Silly Mistakes
Add a Grammar to Constrain the Output
Rule: <number> + <measurement abbreviation> => <number> + the possible verbalizations of the measurement abbreviation.
Instantiation: 24.2 kg => twenty four point two kilogram, twenty four point two kilograms, twenty four point two kilo.
Finite-State Transducers: a finite-state transducer is a finite-state automaton that produces output as well as reading input; it supports pattern matching with regular expressions.
Thrax Grammar
MEASURE: <number> + <measurement abbreviation> -> <number> + measurement verbalizations
  Input: 5 kg -> five kilo/kilogram/kilograms
MONEY: $ <number> -> <number> dollars
The input is composed with the FSTs; the output of the FST is used to restrict the output of the LSTM.
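A minimal sketch of the grammar constraint, with plain regular expressions standing in for the Thrax/FST machinery (the unit table and number verbalizer below are toy stand-ins): a MEASURE rule maps <number> <measurement abbreviation> to its allowed verbalizations, which then filter the LSTM's candidate outputs.

```python
# Regex stand-in for a MEASURE grammar rule, used to restrict the
# candidate verbalizations an LSTM may emit.
import re

MEASURE_WORDS = {"kg": ["kilogram", "kilograms", "kilo"]}

def measure_verbalizations(text):
    m = re.fullmatch(r"(\d+(?:\.\d+)?)\s*([a-z]+)", text)
    if not m or m.group(2) not in MEASURE_WORDS:
        return []
    number, unit = m.group(1), m.group(2)
    spelled = spell_number(number)
    return [f"{spelled} {w}" for w in MEASURE_WORDS[unit]]

def spell_number(number):
    # Toy verbalizer covering just these examples; a real system would
    # use a full number grammar.
    table = {"5": "five", "24.2": "twenty four point two"}
    return table.get(number, number)

def restrict(lstm_candidates, text):
    # Keep only LSTM candidates the grammar licenses for this input.
    allowed = set(measure_verbalizations(text))
    return [c for c in lstm_candidates if c in allowed]
```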
TTS: RNN + FST
Measure and Money outputs restricted by the grammar.
Harnessing the Power of Neural Networks: Introduction
How do we standardize the output?
How do we speed up inference?
How do we automatically find a good network architecture?
Super-Multiclass Classification Problem
One class per image type (horse, car, …): M classes.
Output layer: M units; last hidden layer: N units.
Neural network inference: just computing the last layer requires M·N multiply-adds.
Asymmetric Hashing
Weight vectors to the output layer, W1 … WM, are partitioned into N/k chunks.
● Represent each chunk with a set of 256 cluster centers found with k-means.
● Save the coordinates of the centers as (ID, coordinates) pairs.
● Save each weight vector as the sequence of its closest center IDs: its hash code.
Asymmetric Hashing: Searching
● For a given activation u, divide it into its N/k chunks u_j:
  ○ Compute the distances from each chunk to the 256 centers: 256·N multiply-adds, not M·N.
  ○ Compute the distances to all hash codes by table lookups: M·N/k additions.
● The "asymmetric" in "asymmetric hashing" refers to the fact that we hash the weight vectors but not the activation vector.
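The scheme can be sketched as product quantization (an illustrative toy with 4 centers per chunk instead of 256, and a minimal k-means in place of a production implementation):

```python
import numpy as np

def tiny_kmeans(X, n_centers, iters=10, rng=None):
    # Minimal k-means standing in for a real implementation.
    rng = rng or np.random.RandomState(0)
    centers = X[rng.choice(len(X), n_centers, replace=False)].copy()
    for _ in range(iters):
        labels = ((X[:, None, :] - centers[None]) ** 2).sum(-1).argmin(1)
        for c in range(n_centers):
            if (labels == c).any():
                centers[c] = X[labels == c].mean(0)
    return centers, labels

def build_codebooks(W, k, n_centers=4):
    # W: (M, N) output-layer weights; each row is split into N/k chunks
    # of size k, and each chunk position is quantized separately.
    M, N = W.shape
    chunks = W.reshape(M, N // k, k)
    codebooks, codes = [], []
    for j in range(N // k):
        centers, labels = tiny_kmeans(chunks[:, j, :], n_centers)
        codebooks.append(centers)
        codes.append(labels)
    return codebooks, np.stack(codes, axis=1)  # codes: (M, N/k) hash codes

def asymmetric_distances(u, codebooks, codes, k):
    # u is NOT quantized (hence "asymmetric"): compute exact squared
    # distances from each chunk of u to every center (256·N multiply-adds
    # in the real setting), then score all M weight vectors using only
    # table lookups and additions (M·N/k additions).
    chunks = u.reshape(-1, k)
    tables = [((cb - uj) ** 2).sum(axis=1)
              for cb, uj in zip(codebooks, chunks)]
    return sum(t[codes[:, j]] for j, t in enumerate(tables))
```

The expensive M·N multiply-adds of the exact output layer are replaced by a small per-chunk distance table plus cheap lookups, which is where the inference speedup comes from.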
Asymmetric Hashing
Incredible saving in inference time
Sometimes also with a bit of improved accuracy
Harnessing the Power of Neural Networks: Introduction
How do we standardize the output?
How do we speed up inference?
How do we automatically find a good network architecture?
"Learning to Learn", a.k.a. "Automated Hyperparameter Tuning"
● Google: AdaNet; Neural Architecture Search with Reinforcement Learning
● MIT: Designing Neural Network Architectures Using Reinforcement Learning
● Harvard, Toronto, MIT, Intel: Scalable Bayesian Optimization Using Deep Neural Networks
Approaches: genetic algorithms, reinforcement learning, boosting algorithms
Modeling Challenges for ML
The right model choice can significantly improve performance. For deep learning it is particularly hard, as the search space is huge and:
● the optimization is difficult and non-convex;
● sufficient theory is lacking.
Questions
● Can neural network architectures be learned together with their weights?
● Can this problem be solved efficiently and in a principled way?
● Can we capture the end-to-end process?
AdaNet
● Incremental construction: at each round, the algorithm adds a subnetwork to the existing neural network;
● The algorithm leverages previously learned embeddings;
● It adaptively grows the network, balancing the trade-off between empirical error and model complexity;
● It comes with a learning bound.
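The incremental idea can be caricatured in a few lines. This is NOT the actual AdaNet algorithm — here each candidate "subnetwork" is a single random ReLU unit and the objective is a crudely complexity-penalized squared error — but it mirrors the round-by-round growth and the empirical-error/model-complexity trade-off:

```python
import numpy as np

def grow_network(X, y, rounds=5, n_candidates=8, penalty=0.01, rng=None):
    # Greedy incremental construction: each round, try several candidate
    # units and keep the one that most lowers MSE + penalty * width;
    # stop when no candidate improves the penalized objective.
    rng = rng or np.random.RandomState(0)
    feats = [np.ones(len(X))]                    # bias unit
    best_obj = np.inf
    for _ in range(rounds):
        round_obj, round_feat = best_obj, None
        for _ in range(n_candidates):
            cand = np.maximum(X @ rng.randn(X.shape[1]), 0.0)  # ReLU unit
            F = np.column_stack(feats + [cand])
            w, *_ = np.linalg.lstsq(F, y, rcond=None)
            obj = ((F @ w - y) ** 2).mean() + penalty * F.shape[1]
            if obj < round_obj:
                round_obj, round_feat = obj, cand
        if round_feat is None:
            break                                # no candidate helps
        best_obj = round_obj
        feats.append(round_feat)
    # Refit the output weights over all retained units.
    F = np.column_stack(feats)
    w, *_ = np.linalg.lstsq(F, y, rcond=None)
    return F @ w
```

The real algorithm adds structured subnetworks guided by a data-dependent learning bound rather than a fixed penalty, but the grow-then-refit loop is the same shape.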
Experimental Results, AdaNet
CIFAR-10: 60,000 images, 10 classes
SD of all numbers: 0.01

Label Pair        | AdaNet | Log. Reg. | NN
deer-truck        | 0.94   | 0.90      | 0.92
deer-horse        | 0.84   | 0.77      | 0.81
automobile-truck  | 0.85   | 0.80      | 0.81
cat-dog           | 0.69   | 0.67      | 0.66
dog-horse         | 0.84   | 0.80      | 0.81
Neural Architecture Search with RL
Error rates on CIFAR-10
Perplexity on Penn Treebank
Current accuracy of NAS on ImageNet: 78%. State of the art: 80.x%.
Harnessing the Power of Neural Networks: Introduction
How do we standardize the output?
How do we speed up inference?
How do we automatically find a good network architecture?