question 1a - columbia universitymcollins/.../lstm1-answers.pdf · question 3 i z(t) controls how...
TRANSCRIPT
![Page 1: Question 1a - Columbia Universitymcollins/.../lstm1-answers.pdf · Question 3 I z(t) controls how much the new hidden state h(t) copies information across from h(t 1), and how much](https://reader035.vdocuments.us/reader035/viewer/2022063018/5fddc90b28884f1c641049af/html5/thumbnails/1.jpg)
Question 1a
![Page 2: Question 1a - Columbia Universitymcollins/.../lstm1-answers.pdf · Question 3 I z(t) controls how much the new hidden state h(t) copies information across from h(t 1), and how much](https://reader035.vdocuments.us/reader035/viewer/2022063018/5fddc90b28884f1c641049af/html5/thumbnails/2.jpg)
Question 1a (continued)
The three paths from W hx to o are
W hx → h(3) → l→ q → o
W hx → h(2) → h(3) → l→ q → o
W hx → h(1) → h(2) → h(3) → l→ q → o
![Page 3: Question 1a - Columbia Universitymcollins/.../lstm1-answers.pdf · Question 3 I z(t) controls how much the new hidden state h(t) copies information across from h(t 1), and how much](https://reader035.vdocuments.us/reader035/viewer/2022063018/5fddc90b28884f1c641049af/html5/thumbnails/3.jpg)
Question 1b
∂o
∂W hx= A× ∂h(3)
∂W hx
+A×B2 × ∂h(2)
∂W hx
+A×B2 ×B1 × ∂h(1)
∂W hx
![Page 4: Question 1a - Columbia Universitymcollins/.../lstm1-answers.pdf · Question 3 I z(t) controls how much the new hidden state h(t) copies information across from h(t 1), and how much](https://reader035.vdocuments.us/reader035/viewer/2022063018/5fddc90b28884f1c641049af/html5/thumbnails/4.jpg)
Question 1c
The values for all non-leaf variables—h(1), h(2), h(3), l, q, o—varyas x1 varies. Because of this all Jacobians in the graph vary as x1varies (the Jacobians for any non-leaf variables depend on thevalue of the variable computed in the forward pass).
![Page 5: Question 1a - Columbia Universitymcollins/.../lstm1-answers.pdf · Question 3 I z(t) controls how much the new hidden state h(t) copies information across from h(t 1), and how much](https://reader035.vdocuments.us/reader035/viewer/2022063018/5fddc90b28884f1c641049af/html5/thumbnails/5.jpg)
Question 2a
Inputs: A sequence x1 . . . xn where each xj ∈ Rd. A label y ∈{1 . . . K} for position i.
Computational Graph:
I For t = 1 . . . n, h(t) = g(W hxx(t) +W hhh(t−1) + bh)
I For t = n . . . 1, η(t) = g(W bhxx(t) +W bhhη(t+1) + bbh)
I For t = 1 . . . n,h(2,t) = g(W 2hx × CONCAT(h(t), η(t)) +W 2hhh(2,t−1) + b2h)
I For t = n . . . 1,η(2,t) = g(W 2bhx×CONCAT(h(t), η(t))+W 2bhhη(2,t+1)+ b2bh)
I l = V × CONCAT(h(2,i), η(2,i)) + γ, q = LS(l), o = −qy
![Page 6: Question 1a - Columbia Universitymcollins/.../lstm1-answers.pdf · Question 3 I z(t) controls how much the new hidden state h(t) copies information across from h(t 1), and how much](https://reader035.vdocuments.us/reader035/viewer/2022063018/5fddc90b28884f1c641049af/html5/thumbnails/6.jpg)
Question 2b
Inputs: A sequence x1 . . . xn where each xj ∈ Rd. A label y ∈{1 . . . K} for position i. A sequence of tags y1 . . . yi−1.
Computational Graph:
I For t = 1 . . . n, h(t) = g(W hxx(t) +W hhh(t−1) + bh)
I For t = n . . . 1, η(t) = g(W bhxx(t) +W bhhη(t+1) + bh)
I For j = 1 . . . (i− 1), β(j) = g(W hyy(j) +W yhhβ(j−1) + by)
I l = V × CONCAT(h(i), η(i), β(i−1)) + γ, q = LS(l), o = −qy
![Page 7: Question 1a - Columbia Universitymcollins/.../lstm1-answers.pdf · Question 3 I z(t) controls how much the new hidden state h(t) copies information across from h(t 1), and how much](https://reader035.vdocuments.us/reader035/viewer/2022063018/5fddc90b28884f1c641049af/html5/thumbnails/7.jpg)
Question 3
I z(t) controls how much the new hidden state h(t) copiesinformation across from h(t−1), and how much a new updateis incorporated
I r(t) controls how much of h(t−1) is reset to zero in the updatepart of the network