
Page 1: Learning and Knowledge Transfer with Memory Networks for Machine Comprehension

Mohit Yadav, Lovekesh Vig, Gautam Shroff
TCS Research, New Delhi

Presented by Kyo Kim
April 24, 2018

Page 2: Overview

1. Motivation
2. Background
3. Proposed Method
4. Dataset and Experiment Results
5. Summary

Page 3: Motivation

Page 4: Problem

Obtaining high performance in "machine comprehension" requires an abundant human-annotated dataset.

Comprehension is measured by question-answering performance.

Real-world datasets with small amounts of data often exhibit a wider range of vocabulary and more complex grammatical structure.

Page 5: High-level Overview of Proposed Method

1. A curriculum-based training procedure.
2. Knowledge transfer to increase performance on datasets with less abundant labeled data.
3. A pre-trained memory network for the small dataset.

Page 6: Background

Page 7: End-to-end Memory Networks

1. Vectorize the problem tuple.
2. Retrieve the corresponding memory attention vector.
3. Use the retrieved memory to answer the question.

Page 8: End-to-end Memory Networks Cont.

Vectorize the problem tuple

Problem tuple: $(q, C, S, s)$

- $q$: question
- $C$: context text
- $S$: set of answer choices
- $s$: correct answer ($s \in S$)

Question and context embedding matrix: $A \in \mathbb{R}^{p \times d}$

Query vector: $\vec{q} = A\Phi(q)$, where $\Phi$ is a bag-of-words representation.

Memory vectors: $\vec{m}_i = A\Phi(c_i)$ for $i = 1, \dots, n$, where $n = |C|$ and $c_i \in C$.
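To make the vectorization step concrete, here is a minimal NumPy sketch; the toy vocabulary, the `phi` helper, and the dimensions are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy vocabulary; p is the vocabulary size, d the embedding dim.
vocab = {"mary": 0, "went": 1, "to": 2, "the": 3, "kitchen": 4, "where": 5, "is": 6}
p, d = len(vocab), 8
A = rng.normal(size=(p, d))  # question/context embedding matrix A in R^{p x d}

def phi(sentence):
    """Bag-of-words count vector over the vocabulary."""
    v = np.zeros(p)
    for word in sentence.lower().split():
        v[vocab[word]] += 1.0
    return v

q_vec = phi("where is mary") @ A                    # query vector ~q = A Phi(q)
context = ["mary went to the kitchen"]
memories = np.stack([phi(c) @ A for c in context])  # memory vectors ~m_i = A Phi(c_i)
```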


Page 9: End-to-end Memory Networks Cont.

Retrieve the corresponding memory attention vector

Attention distribution: $a_i = \mathrm{softmax}(\vec{m}_i^{\top} \vec{q})$

Second memory vectors: $\vec{r}_i = B\Phi(c_i)$, where $B$ is another embedding matrix similar to $A$.

Aggregated vector: $\vec{r}_o = \sum_{i=1}^{n} a_i \vec{r}_i$

Prediction vector: $\hat{a}_i = \mathrm{softmax}((\vec{r}_o + \vec{q})^{\top} U \Phi(s_i))$, where $U$ is the embedding matrix for the answers.
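Putting the four equations together, the following self-contained NumPy sketch runs one attention hop and scores the answer choices; the random stand-ins for $\Phi(q)$, $\Phi(c_i)$, and $\Phi(s_i)$ and the dimensions are placeholders, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)
p, d, n, k = 50, 8, 4, 3  # vocab size, embedding dim, context sentences, answer choices

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Placeholder bag-of-words vectors standing in for Phi(q), Phi(c_i), Phi(s_i).
phi_q = rng.integers(0, 2, size=p).astype(float)
phi_c = rng.integers(0, 2, size=(n, p)).astype(float)
phi_s = rng.integers(0, 2, size=(k, p)).astype(float)

A = rng.normal(size=(p, d))  # query / first memory embedding
B = rng.normal(size=(p, d))  # second memory embedding
U = rng.normal(size=(p, d))  # answer embedding

q = phi_q @ A                      # query vector ~q
m = phi_c @ A                      # memory vectors ~m_i
a = softmax(m @ q)                 # attention distribution a_i
r = phi_c @ B                      # second memory vectors ~r_i
r_o = a @ r                        # aggregated vector ~r_o = sum_i a_i ~r_i
scores = (phi_s @ U) @ (r_o + q)   # (~r_o + ~q)^T U Phi(s_i), one score per choice
a_hat = softmax(scores)            # prediction distribution over the answer set S
```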


Page 10: End-to-end Memory Networks Cont.

Answer the question

Pick the $s_i$ that corresponds to the highest $\hat{a}_i$.

Cross-entropy loss

$$L(P, D) = -\frac{1}{N_D} \sum_{n=1}^{N_D} \left[ a_n \cdot \log(\hat{a}_n(P, D)) + (1 - a_n) \cdot \log(1 - \hat{a}_n(P, D)) \right]$$

Here $P$ denotes the model parameters, $D$ the dataset, and $N_D$ the number of examples in $D$.
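A minimal sketch of this loss, under the assumption that $a_n$ is a one-hot vector over the $k$ answer choices and $\hat{a}_n$ the corresponding predicted probabilities:

```python
import numpy as np

def memnn_loss(a_true, a_pred, eps=1e-12):
    """Binary cross-entropy summed over answer choices, averaged over N_D examples.

    a_true: (N_D, k) one-hot correct answers a_n.
    a_pred: (N_D, k) predicted probabilities hat{a}_n(P, D).
    """
    a_pred = np.clip(a_pred, eps, 1.0 - eps)  # numerical stability
    ce = a_true * np.log(a_pred) + (1.0 - a_true) * np.log(1.0 - a_pred)
    return -ce.sum(axis=1).mean()

# Example: two questions, three answer choices each.
a_true = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
a_pred = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])
print(memnn_loss(a_true, a_pred))
```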


Page 11: Curriculum Learning

First proposed by Bengio et al. (2009).

Introduce samples in order of increasing "difficulty".

Reaches better local minima even under a non-convex loss.

Page 12: Pre-training and Joint-training

Pre-training

Use a model pre-trained in a similar domain to initially guide the training process.

Joint-training

Exploit the similarity between two different domains by training the model on both domains simultaneously.

Page 13: Proposed Method

Page 14: Curriculum Inspired Training (CIT)

Difficulty measurement

$$SF(q, S, C, s) = \frac{\sum_{word \in \{q \cup S \cup C\}} \log(\mathrm{Freq}(word))}{\#\{q \cup S \cup C\}}$$

Partition the dataset into a fixed number of chapters of increasing difficulty.

Each chapter consists of $\bigcup_{i=1}^{\text{current chapter}} \mathrm{partition}[i]$.

The model is trained for a fixed number of epochs per chapter, as sketched below.
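Here is a sketch of the difficulty measure and the cumulative chapter construction; the example representation (dicts with `q`, `S`, and `C` token fields) is an assumption made for illustration, and `word_freq` is assumed to hold a positive corpus count for every word.

```python
import math

def sf_score(q, S, C, word_freq):
    """Average log corpus frequency of the words in q, S, and C.
    Rare words have low frequency, so a lower score means a harder example."""
    words = set(q) | {w for s in S for w in s} | {w for c in C for w in c}
    return sum(math.log(word_freq[w]) for w in words) / len(words)

def make_chapters(examples, word_freq, num_chapters):
    """Sort examples from easy (high SF) to hard (low SF), split into equal
    partitions, and let chapter i be the union of partitions 1..i."""
    ranked = sorted(
        examples,
        key=lambda ex: sf_score(ex["q"], ex["S"], ex["C"], word_freq),
        reverse=True,  # easiest (most frequent vocabulary) first
    )
    size = math.ceil(len(ranked) / num_chapters)
    partitions = [ranked[i * size:(i + 1) * size] for i in range(num_chapters)]
    return [sum(partitions[:i + 1], []) for i in range(num_chapters)]
```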


Page 15: CIT Cont.

Loss function

$$L(P, D, en) = -\frac{1}{N_D} \sum_{n=1}^{N_D} \left[ a_n \cdot \log(\hat{a}_n(P, D)) + (1 - a_n) \cdot \log(1 - \hat{a}_n(P, D)) \right] \cdot \mathbb{1}_{en \geq c(n) \cdot epc}$$

- $en$: current epoch
- $c(n)$: chapter number that example $n$ is assigned to
- $epc$: epochs per chapter
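A sketch of how the indicator gates the per-example terms, reusing the array shapes from the cross-entropy sketch earlier (the argument names are assumptions):

```python
import numpy as np

def cit_loss(a_true, a_pred, chapter, en, epc, eps=1e-12):
    """Curriculum-masked cross-entropy: example n contributes only once the
    current epoch en reaches c(n) * epc.

    chapter: (N_D,) array of chapter assignments c(n).
    """
    a_pred = np.clip(a_pred, eps, 1.0 - eps)
    ce = a_true * np.log(a_pred) + (1.0 - a_true) * np.log(1.0 - a_pred)
    active = (en >= chapter * epc).astype(float)  # indicator 1_{en >= c(n) * epc}
    return -(ce.sum(axis=1) * active).mean()
```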

Page 16: Joint-Training

General joint loss function

$$L(P, TD, SD) = 2\lambda \cdot L(P, TD) + 2(1 - \lambda) \cdot L(P, SD) \cdot F(N_{TD}, N_{SD})$$

- $TD$: target dataset
- $SD$: source dataset
- $N_D$: number of examples in dataset $D$
- $\lambda$: tunable weight parameter
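The weighting itself reduces to one line once the two scalar losses have been computed; a sketch under that assumption:

```python
def joint_loss(loss_td, loss_sd, lam, f):
    """General joint objective: 2*lam * L(P, TD) + 2*(1 - lam) * L(P, SD) * F,
    where f = F(N_TD, N_SD) and lam is the tunable weight lambda."""
    return 2.0 * lam * loss_td + 2.0 * (1.0 - lam) * loss_sd * f
```

The special cases on the next slides are obtained purely by the choice of $\lambda$ and $F$.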


Page 17: Loss Functions

Joint-training: $\lambda = \frac{1}{2}$ and $F(N_{TD}, N_{SD}) = 1$

$$L(P, TD, SD) = L(P, TD) + L(P, SD)$$

Weighted joint-training: $\lambda \in (0, 1)$ and $F(N_{TD}, N_{SD}) = \frac{N_{TD}}{N_{SD}}$

$$L(P, TD, SD) = 2\lambda \cdot L(P, TD) + 2(1 - \lambda) \cdot L(P, SD) \cdot \frac{N_{TD}}{N_{SD}}$$

Page 18: Loss Functions Cont.

Curriculum joint-training: $\lambda = \frac{1}{2}$ and $F(N_{TD}, N_{SD}) = 1$

$$L(P, TD, SD) = L(P, TD, en) + L(P, SD, en)$$

Weighted curriculum joint-training: $\lambda \in (0, 1)$ and $F(N_{TD}, N_{SD}) = \frac{N_{TD}}{N_{SD}}$

$$L(P, TD, SD) = 2\lambda \cdot L(P, TD, en) + 2(1 - \lambda) \cdot L(P, SD, en) \cdot \frac{N_{TD}}{N_{SD}}$$
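For concreteness, a usage sketch of these variants with the hypothetical `joint_loss` helper from the Joint-Training slide; the loss values and dataset sizes are placeholders.

```python
l_td, l_sd = 0.8, 0.5        # placeholder scalar losses L(P, TD), L(P, SD)
n_td, n_sd = 2_000, 100_000  # placeholder dataset sizes N_TD, N_SD

plain    = joint_loss(l_td, l_sd, lam=0.5, f=1.0)          # joint-training
weighted = joint_loss(l_td, l_sd, lam=0.7, f=n_td / n_sd)  # weighted joint-training
# Curriculum variants are identical except that l_td and l_sd come from the
# curriculum-masked loss L(P, D, en) instead of the plain loss.
```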


Page 19: Loss Functions Cont.

Source only: $\lambda = 0$ and $c \in \mathbb{R}^{+}$

$$L(P, TD, SD) = c \cdot L(P, SD)$$

Page 20: Dataset and Experiment Results

Page 21: Dataset

Figure: Dataset used for the experiments.

Page 22: Experiment Results

Figure: The table has two major row groups. The upper rows are models that used only the target dataset; the lower rows are models that used both the target and source datasets.

Page 23: Experiment Results

Figure: Categorical performance measurement on CNN-11K. The table has two major row groups. The upper rows are models that used only the target dataset; the lower rows are models that used both the target and source datasets.

Page 24: Experiment Results

Figure: Knowledge transfer performance results.

Page 25: Experiment Results

Figure: Loss convergence comparison between models trained with and without CIT.

Page 26: Summary

- Memory networks (MemNN) are often used for QA.
- Ordering the training samples leads to better local minima.
- Joint-training helps obtain better performance on a small target dataset.
- Using a pre-trained model improves performance.

Page 27: The End