CS395T: Structured Models for NLP
Lecture 14: Neural Network Implementation
Greg Durrett
Administrivia
‣ Project 2 due today
‣ Project 3 out today
‣ Project zip contains sample Tensorflow code demonstrating what's in today's lecture
‣ Sentiment analysis using feedforward neural networks plus your choice of RNNs or CNNs
Recall: Feedforward NNs
[Figure: feedforward network: input f(x) with n features, d x n matrix V, nonlinearity g (tanh, relu, ...), hidden layer z with d hidden units, m x d matrix W, softmax output P(y|x)]
P(y|x) = softmax(W g(V f(x)))
Recall: Backpropagation
[Figure: the same network: f(x), V, hidden layer z, W, softmax, P(y|x)]
P(y|x) = softmax(W g(V f(x)))
‣ Gradients are computed by the chain rule: ∂L/∂W is built from err(root) and z; ∂L/∂V is built from err(z) and f(x)
This Lecture
‣ Training
‣ Implementation details
‣ Word representations

Implementation Details
Computation Graphs
‣ Computing gradients is hard!
‣ Automatic differentiation: instrument code to keep track of derivatives (see the sketch below)
    x = x * x   =>   (x, dx) = (x * x, 2 * x * dx)   [codegen]
‣ In practice: need other operations, want more control -> use an external computation graph library
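As a concrete illustration of the "(x, dx)" instrumentation above, here is a minimal forward-mode sketch in plain Python; the Dual class and mul helper are illustrative names, not part of any library:

class Dual:
    # carry (value, derivative) pairs through the computation
    def __init__(self, value, dvalue):
        self.value = value
        self.dvalue = dvalue

def mul(a, b):
    # product rule: d(ab) = a * db + b * da
    return Dual(a.value * b.value, a.value * b.dvalue + b.value * a.dvalue)

x = Dual(3.0, 1.0)        # track d/dx, so dx/dx = 1
y = mul(x, x)             # y = x * x
print(y.value, y.dvalue)  # 9.0 6.0, matching (x * x, 2 * x * dx)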
Computation Graphs
[Figure: example computation graph (source: http://tmmse.xyz/content/images/2016/02/theano-computation-graph.png)]
‣ Define computation abstractly, in terms of symbols
‣ Can compute gradients of c with respect to (x, y, z) easily
‣ Useful abstraction: supports both CPU and GPU implementations
‣ Disadvantage: higher-level specification, so hard to control memory allocation and low-level implementation details
Tensorflow
[Figure: the same computation graph as on the previous slide]
x = tf.placeholder(tf.float32, name="x")
y = tf.placeholder(tf.float32, name="y")
z = tf.placeholder(tf.float32, name="z")
a = tf.add(x, y)
b = tf.multiply(a, z)
c = tf.add(b, a)
with tf.Session() as sess:
    output = sess.run([a, c], feed_dict=dict_of_input_values)
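For concreteness, a hedged usage sketch with made-up input values (1.0, 2.0, 4.0 are illustrative):

with tf.Session() as sess:
    a_val, c_val = sess.run([a, c], feed_dict={x: 1.0, y: 2.0, z: 4.0})
    # a = x + y = 3.0;  b = a * z = 12.0;  c = b + a = 15.0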
Computation Graph: FFNN
P(y|x) = softmax(W g(V f(x)))
fx = tf.placeholder(tf.float32, feat_vec_size)
V = tf.get_variable("V", [hidden_size, feat_vec_size])
z = tf.sigmoid(tf.tensordot(V, fx, 1))
W = tf.get_variable("W", [num_classes, hidden_size])
probs = tf.nn.softmax(tf.tensordot(W, z, 1))
‣ placeholder: input to the system; variable: parameter to learn
Computation Graph: FFNN
P(y|x) = softmax(W g(V f(x)))
fx = tf.placeholder(tf.float32, feat_vec_size)
V = tf.get_variable("V", [hidden_size, feat_vec_size])
z = tf.sigmoid(tf.tensordot(V, fx, 1))
W = tf.get_variable("W", [num_classes, hidden_size])
probs = tf.nn.softmax(tf.tensordot(W, z, 1))
label = tf.placeholder(tf.float32, num_classes)   # one-hot gold label (float so it can be dotted with probs)
loss = tf.negative(tf.log(tf.tensordot(probs, label, 1)))
‣ Tensorflow can compute gradients for W and V based on loss
‣ Shortcut helper methods exist, like tf.nn.softmax_cross_entropy_with_logits
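A sketch of the shortcut mentioned above, assuming label is a one-hot float vector: compute unnormalized scores ("logits") and let the helper apply softmax, log, and the dot with the label in one numerically stable step.

logits = tf.tensordot(W, z, 1)
loss = tf.nn.softmax_cross_entropy_with_logits(labels=label, logits=logits)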
Training a Model
Define a computation graph
Define an operator that updates the parameters based on an example
For each epoch:
    For each example:
        Evaluate the training operator on the example
Decode test set
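A minimal Tensorflow 1.x sketch of this recipe; the optimizer, learning rate, and the names num_epochs, train_examples, and test_examples are illustrative assumptions:

train_op = tf.train.GradientDescentOptimizer(learning_rate=0.1).minimize(loss)
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for epoch in range(num_epochs):
        for fx_val, label_val in train_examples:
            # one update on one example
            sess.run(train_op, feed_dict={fx: fx_val, label: label_val})
    # decode the test set
    test_probs = [sess.run(probs, feed_dict={fx: fx_val}) for fx_val in test_examples]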
Batching
‣ Batching data gives speedups due to more efficient matrix operations, and leads to better learning outcomes too
‣ Need to make the computation graph process a batch at the same time
fx = tf.placeholder(tf.float32, [batch_size, feat_vec_size])
V = tf.get_variable("V", [hidden_size, feat_vec_size])
z = tf.sigmoid(tf.tensordot(fx, V, [[1], [1]]))   # batch_size x hidden_size
...
loss = [sum over losses from batch]
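One hedged way to fill in the "[sum over losses from batch]" placeholder, assuming one-hot float labels of shape [batch_size, num_classes]:

labels = tf.placeholder(tf.float32, [batch_size, num_classes])
logits = tf.tensordot(z, W, [[1], [1]])      # batch_size x num_classes
per_example_loss = tf.nn.softmax_cross_entropy_with_logits(labels=labels, logits=logits)
loss = tf.reduce_mean(per_example_loss)      # average (or sum) over the batch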
Batch Training a Model
Define a computation graph to process a batch of data
Define an operator that updates the parameters based on a batch
For each epoch:
    For each batch:
        Evaluate the training operator on the batch
Decode test set in batches
Training Tips

Training Basics
‣ Basic formula: compute gradients on a batch, use a first-order optimization method
‣ How to initialize? How to regularize? What optimizer to use?
‣ This lecture: some practical tricks; take deep learning or optimization courses to understand this further
How does initialization affect learning?
[Figure: the feedforward network from before: f(x) with n features, d x n matrix V, nonlinearity g (tanh, relu, ...), hidden layer z with d hidden units, m x d matrix W, softmax output P(y|x)]
P(y|x) = softmax(W g(V f(x)))
‣ How do we initialize V and W? What consequences does this have?
‣ Why is it important to have small activations?
‣ If cell activations are too large in absolute value, gradients are small too
‣ ReLU: larger dynamic range (all positive numbers), but can produce big values, can break down if everything is too negative
Initialization
1) Can't use zeroes for parameters to produce hidden layers: all values in that hidden layer are always 0, have gradients of 0, and never change
2) Initialize too large and cells are saturated
‣ Can do random uniform/normal initialization with appropriate scale
‣ Glorot initializer: sample from U[ -sqrt(6 / (fan-in + fan-out)), +sqrt(6 / (fan-in + fan-out)) ]
‣ Want variance of inputs and gradients for each layer to be the same
‣ Batch normalization (Ioffe and Szegedy, 2015): periodically shift + rescale each layer to have mean 0 and variance 1 over a batch (useful if net is deep)
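A sketch of the Glorot formula above applied to V (fan-in = feat_vec_size, fan-out = hidden_size); recent Tensorflow versions also ship built-in Glorot/Xavier initializers:

import math
scale = math.sqrt(6.0 / (feat_vec_size + hidden_size))
V = tf.get_variable("V", [hidden_size, feat_vec_size],
                    initializer=tf.random_uniform_initializer(-scale, scale))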
Dropout
‣ Probabilistically zero out parts of the network during training to prevent overfitting; use the whole network at test time
‣ Form of stochastic regularization
‣ Similar to the benefits of ensembling: the network needs to be robust to missing signals, so it has redundancy
Srivastava et al. (2014)
Dropout
‣ In Tensorflow: implemented as an additional layer in a network
hidden_dropped_out = tf.nn.dropout(hidden, dropout_keep_prob)
‣ Often use low dropout (keep a value with probability 0.8) at the input and moderate dropout (keep with probability 0.5) internally in feedforward networks (not in RNNs)
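A sketch of wiring this up so dropout is active only during training: feed the keep probability as a placeholder (the value 0.5 is just the moderate setting mentioned above):

dropout_keep_prob = tf.placeholder(tf.float32)
hidden_dropped_out = tf.nn.dropout(hidden, dropout_keep_prob)
# training step:   sess.run(train_op, feed_dict={..., dropout_keep_prob: 0.5})
# test / decoding: sess.run(probs,    feed_dict={..., dropout_keep_prob: 1.0})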
Optimizer
‣ Adam (Kingma and Ba, ICLR 2015) is very widely used
‣ Adaptive step size like Adagrad, incorporates momentum

Optimizer
‣ Wilson et al. (NIPS 2017): adaptive methods can actually perform badly at test time (in their plots, Adam is in pink, SGD in black)
‣ Check the dev set periodically, decrease the learning rate if not making progress
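A sketch of this advice in Tensorflow 1.x; the halving schedule is an illustrative choice, not prescribed by the slides:

learning_rate = tf.placeholder(tf.float32)
train_op = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(loss)
# keep a Python-side value, e.g. current_lr = 0.001, feed it each step, and halve it
# (current_lr /= 2.0) whenever the dev set stops improving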
Visualization with Tensorboard
‣ Visualize the computation graph and logs of the objective over time
Structured Prediction
‣ Four elements of a structured machine learning method:
‣ Model: feedforward networks, RNNs, CNNs can be defined in a uniform framework
‣ Objective: many loss functions look similar; changing the objective just changes the last layer of the neural network
‣ Inference: define the network, Tensorflow takes care of it (mostly…)
‣ Training: lots of choices for optimization/hyperparameters
Word Representations

Word Representations
‣ Continuous model expects continuous semantics from input
‣ Neural networks work very well at continuous data, but words are discrete
Word Embeddings
‣ Part-of-speech tagging with FFNNs
    … Fed raises interest rates in order to …
‣ Word embeddings for each word form the input
[Figure: f(x) = ?? built from emb(raises) for the previous word, emb(interest) for the current word, emb(rates) for the next word, plus other words, feats, etc.]
Botha et al. (2017)
Word Embeddings
‣ Want a vector space where similar words have similar embeddings
    the movie was great   ~~   the movie was good
[Figure: 2D embedding space in which good, enjoyable, and great cluster together, while bad, dog, and is lie elsewhere]

Word Representations
‣ Continuous model expects continuous semantics from input
‣ Neural networks work very well at continuous data, but words are discrete
‣ "Can tell a word by the company it keeps" (Firth, 1957)
Continuous Bag-of-Words
‣ Predict word from context:    the dog bit the man
[Figure: emb(the) + emb(dog) -> sum of size d -> multiply by W -> softmax, giving P(w | w-1, w+1); gold = bit]
‣ Parameters: d x |V| word vectors, |V| x d output parameters (W)
‣ Maximize likelihood of gold labels (no manual labeling required!)
Mikolov et al. (2013)
Skip-Gram
‣ Predict one word of context from the word:    the dog bit the man
[Figure: emb(bit) -> multiply by W -> softmax; gold = dog]
‣ Parameters: d x |V| word vectors, |V| x d output parameters (W)
‣ Another training example: bit -> the
Mikolov et al. (2013)
Skip-Gram with Negative Sampling
    the dog bit the man
‣ Problem: want to train on 1B+ words; multiplying by a |V| x d matrix for each is too expensive
‣ Solution: take (word, context) pairs and classify them as "real" or not. Create random negative examples by sampling:
    (bit, the) => +1    (bit, dog)  => +1
    (bit, a)   => -1    (bit, fish) => -1
    P(pos | w, c) = e^(w·c) / (e^(w·c) + 1)
‣ Words in similar contexts select for similar c vectors
‣ d x |V| word vectors, d x |V| context vectors (same # of params as before)
Mikolov et al. (2013)
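A small numpy sketch of the scoring function above for one (word, context) pair; w_vec and c_vec are illustrative names for the d-dimensional word and context vectors:

import numpy as np
def p_positive(w_vec, c_vec):
    score = np.dot(w_vec, c_vec)
    return np.exp(score) / (np.exp(score) + 1.0)   # sigmoid(w . c)
# training pushes this toward 1 for real pairs like (bit, dog) and toward 0 for
# sampled negatives like (bit, fish)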
Regularities in Vector Space
[Figure: man -> woman and king -> queen offsets in the embedding space]
(king - man) + woman = queen
‣ Why would this be?
‣ woman - man captures the difference in the contexts that these occur in
‣ Dominant change: more "he" with man and "she" with woman, similar to the difference between king and queen
king + (woman - man) = queen
GloVe
‣ Word co-occurrences are what matter; model them directly
‣ Weighted least-squares problem to directly predict the word co-occurrence matrix (like matrix factorization)
Pennington et al. (2014)
Using Word Embeddings
‣ Approach 1: learn embeddings as parameters from your data
‣ Approach 2: initialize using GloVe/CBOW/SGNS, keep fixed
    ‣ Faster because there is no need to update these parameters
‣ Approach 3: initialize using GloVe/CBOW/SGNS, fine-tune
    ‣ Typically works best
# Indexed sentence of length sent_len, e.g.: [12, 36, 47, 8]
input_words = tf.placeholder(tf.int32, [sent_len])
encoder = tf.get_variable("embed", [voc_size, embedding_size])
embedded_input_words = tf.nn.embedding_lookup(encoder, input_words)
# embedded_input_words: sent_len x embedding_size tensor
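A sketch of Approaches 2 and 3, assuming pretrained_matrix is a voc_size x embedding_size numpy array loaded from GloVe/CBOW/SGNS vectors:

encoder = tf.get_variable("embed", [voc_size, embedding_size],
                          initializer=tf.constant_initializer(pretrained_matrix),
                          trainable=False)   # False = keep fixed (Approach 2); True = fine-tune (Approach 3)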
Takeaways
‣ Lots to tune with neural networks
‣ Training: optimizer, initializer, regularization (dropout), …
‣ Hyperparameters: dimensionality of word embeddings, layers, …
‣ Word vectors: various choices of pre-trained vectors work well as initializers
‣ Next time: RNNs/LSTMs/GRUs