Parsing Long and Complex Natural Language Sentences Yuji Matsumoto Nara Institute of Science and Technology (NAIST) November 27, 2014 Shonan Meeting


Page 1:

Parsing Long and Complex Natural Language Sentences

Yuji Matsumoto
Nara Institute of Science and Technology (NAIST)

November 27, 2014
Shonan Meeting

Page 2:

Problems in Parsing Long Sentences

Texts in scientific or legal domains tend to have long and complex sentences
  A major hindrance to syntactic analysis of natural language sentences

Difficult syntactic structures in long sentences
  Coordinate structures
  Complex syntactic patterns (with multiple clauses)

Ordinary CFG grammars and lexicons have difficulty in handling (or representing) such phenomena

Page 3:

Issues in Parsing that lie between Lexicon and Grammar

Coordinate Structures
  Extra-grammatical phenomenon

Multiword Expressions (functional MWEs)
  Syntactically or semantically idiosyncratic expressions that should be registered in the lexicon

Complex Sentence Patterns
  Subordinate clauses
  Embedded clauses
  Other complex sentence patterns

[Figure labels: Grammatical Units, ranging from (word tokens) to (grammar rules)]

Page 4:

Problems of Coordinate Structures

Any constituents can be coordinated

Non-constituent structures (sequences of constituents) can be coordinated
  “John saw Mary yesterday and Bill today.”

Coordinate structures can be nested
  “... 6.1 months and 8.9 months in arm A and 7.2 months and 9.5 months in arm B.”

Scope ambiguity
  “old [men and women]” vs. “[old men] and women”

Page 5:

Identification of coordinate structure helps improve parsing accuracy

“Median times to progression and median survival times were 6.1 months and 8.9 months in arm A and 7.2 months and 9.5 months in arm B.”

⇓

“Median times … were 6.1 months in arm A”
“Median times … were 7.2 months in arm B”
“median survival times were 8.9 months in arm A”
“median survival times were 9.5 months in arm B”
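The decomposition above can be sketched in a few lines of Python. This is only an illustration, not the method from the talk: it assumes the coordinate structure has already been identified, and that the inner conjuncts pair up with the subjects in order (a "respectively" reading). The names `subjects`, `arms` and `expand` are hypothetical.

```python
# Hypothetical, hand-built representation of the already-identified coordinations.
subjects = ["Median times to progression", "median survival times"]
arms = [
    ("arm A", ["6.1 months", "8.9 months"]),
    ("arm B", ["7.2 months", "9.5 months"]),
]


def expand(subjects, arms):
    """Expand the nested coordination into one simple clause per (subject, arm) pair.

    Assumption: the i-th value inside each arm belongs to the i-th subject.
    """
    clauses = []
    for subject_index, subject in enumerate(subjects):
        for arm, values in arms:
            clauses.append(f"{subject} were {values[subject_index]} in {arm}")
    return clauses


for clause in expand(subjects, arms):
    print(clause)
```

Each printed clause is short and free of coordination, which is the form a conventional parser handles most reliably.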


Page 6:

Joint analysis of grammatical and alignment methods [Hara et al. 09]

1. Learning scores for alignment
  Feature-based learning [Shimbo & Hara 07]

2. Phrase structure grammar rules for coordinate structure are defined to ensure the structural constraints

3. The weights for alignment are jointly learned with the structural constraints
  Combination of the CKY parsing algorithm with perceptron learning of alignment weights

Masashi Shimbo and Kazuo Hara, "A Discriminative Learning Model for Coordinate Conjunctions," EMNLP-CoNLL, pp. 610-619, June 2007.

Kazuo Hara, Masashi Shimbo, Hideharu Okuma and Yuji Matsumoto, "Coordinate Structure Analysis with Global Structural Constraints and Alignment-Based Local Features," Proceedings of ACL-IJCNLP 2009, pp. 967-975, August 2009.

Page 7:

Coordination structure analysis

Alignment of corresponding parts

“the standard arm and the dose dense arm”

[Figure: alignment of the conjuncts “the standard arm” and “the dose dense arm”]

Page 8:

DP matching method for alignment

[Figure: DP matching grid aligning “the standard arm” with “the dose dense arm”]
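A minimal sketch of DP matching between two conjuncts follows. It uses a standard global alignment with fixed scores (match +1, mismatch -1, gap -1). Those fixed scores are an assumption made for illustration only; in the approach described in these slides the alignment scores are learned from features. The function name `align` is hypothetical.

```python
def align(left, right, match=1, mismatch=-1, gap=-1):
    """Globally align two token lists; return (score, list of aligned pairs).

    A pair (x, None) or (None, y) marks a token aligned to a gap.
    The fixed scores are illustrative assumptions, not learned weights.
    """
    n, m = len(left), len(right)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    # Fill the table: each cell is the best of substitution, deletion, insertion.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if left[i - 1] == right[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )

    # Trace back from the bottom-right cell to recover one best alignment.
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if left[i - 1] == right[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                pairs.append((left[i - 1], right[j - 1]))
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            pairs.append((left[i - 1], None))
            i -= 1
        else:
            pairs.append((None, right[j - 1]))
            j -= 1
    pairs.reverse()
    return score[n][m], pairs


total, pairs = align("the standard arm".split(), "the dose dense arm".split())
print(total)
for pair in pairs:
    print(pair)
```

With these scores the identical tokens “the” and “arm” at the edges of the two conjuncts are aligned with each other, which is the kind of parallelism the alignment is meant to capture.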

Page 9:

Our first method represents coordinate structure as a path on a triangular alignment graph

[Figure: triangular alignment graph over “Median times to progression and median survival times”, with both axes labeled by the words of the phrase and the start and end of the path marked]

Page 10:

A path representing correct structure

[Figure: the same triangular alignment graph, with the path that aligns the conjuncts “Median times to progression” and “median survival times”; start and end marked]

Page 11:

Drawback of path-based method

It cannot cope with nested coordinations, such as:

“Median times to progression and median survival times were 6.1 months and 8.9 months in arm A and 7.2 months and 9.5 months in arm B.”

[Figure: the nested coordinations in the example: “6.1 months” and “8.9 months”; “7.2 months” and “9.5 months”; “6.1 months and 8.9 months in arm A” and “7.2 months and 9.5 months in arm B”]

Page 12:

[Figure: triangular alignment graph over “6.1 months and 8.9 months in arm A and 7.2 months and 9.5 months in arm B”, with start and end marked]

Page 13:

[Figure: the same alignment graph, with the path segment aligning “6.1 months” and “8.9 months”]

Page 14:

[Figure: the same alignment graph, with the path segment aligning “7.2 months” and “9.5 months”]

Page 15:

[Figure: the same alignment graph, with the path segment aligning “6.1 months and 8.9 months in arm A” and “7.2 months and 9.5 months in arm B”]

Page 16:

[Figure: the same alignment graph, with all three path segments shown together: “6.1 months” and “8.9 months”; “7.2 months” and “9.5 months”; “6.1 months and 8.9 months in arm A” and “7.2 months and 9.5 months in arm B”]

Page 17:

There is no single path to connect all three segments

[Figure: the three path segments on the alignment graph over “6.1 months and 8.9 months in arm A and 7.2 months and 9.5 months in arm B”, with start and end marked]

Page 18:

Constituent tree structure can represent coordinate structure as a tree

[Figure: constituent tree over “Median times to progression and median survival times were 6.1 months and 8.9 months in group A and 7.2 months and 9.5 months in group B”]

Page 19:

We use grammar rules only to ensure consistent global coordinate structure

For any two coordinate structures in a sentence, one of the following must hold:

Either their scopes are completely disjoint (non-overlapping flat coordinate structures), or

one is embedded in a conjunct of the other coordinate structure (nested structures).
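The constraint above can be checked mechanically. The sketch below is an assumption-laden illustration rather than the grammar from the talk: each coordination is represented as a list of conjunct spans `(start, end)` over token positions (end exclusive), and the helper names `scope`, `embedded` and `consistent` are hypothetical.

```python
def scope(coord):
    """Return the overall (start, end) span covered by a coordination."""
    return min(s for s, _ in coord), max(e for _, e in coord)


def embedded(inner_scope, outer):
    """True if inner_scope lies entirely inside one conjunct of outer."""
    start, end = inner_scope
    return any(cs <= start and end <= ce for cs, ce in outer)


def consistent(a, b):
    """True if a and b are either disjoint or properly nested."""
    (a_start, a_end), (b_start, b_end) = scope(a), scope(b)
    if a_end <= b_start or b_end <= a_start:
        return True  # completely disjoint scopes
    return embedded(scope(a), b) or embedded(scope(b), a)


# Token positions for:
# 6.1 months and 8.9 months in arm A and 7.2 months and 9.5 months in arm B
#  0    1     2   3    4    5   6  7  8   9    10   11  12    13  14  15 16
inner_a = [(0, 2), (3, 5)]      # "6.1 months" / "8.9 months"
inner_b = [(9, 11), (12, 14)]   # "7.2 months" / "9.5 months"
outer = [(0, 8), (9, 17)]       # "... in arm A" / "... in arm B"
crossing = [(3, 8), (9, 11)]    # a hypothetical, ill-formed bracketing

print(consistent(inner_a, inner_b))   # disjoint
print(consistent(inner_a, outer))     # nested
print(consistent(inner_a, crossing))  # partially overlapping
```

The first two pairs satisfy the constraint, while the crossing bracketing overlaps `inner_a` without containing it in a single conjunct and is therefore rejected.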


Page 20:

We incorporate “sequence alignment” into the tree-based method

We would like to measure local similarity between conjuncts in each coordination by sequence alignment.

To do so, we attach an alignment graph to each COORD node in a tree.

Page 21:

Attach an alignment graph to each COORD node (in the correct tree)

[Figure: the correct tree over “Median times to progression and median survival times were 6.1 months and 8.9 months in group A and 7.2 months and 9.5 months in group B”, with node labels W, CC, N, CJT and COORD. An alignment graph is attached to each of the four COORD nodes: “Median times to progression” / “median survival times”; “6.1 months” / “8.9 months”; “7.2 months” / “9.5 months”; “6.1 months and 8.9 months in group A” / “7.2 months and 9.5 months in group B”]

Page 22:

Attach an alignment graph to each COORD node (in an incorrect tree)

[Figure: an incorrect tree over the same sentence, with node labels W, CC, N, CJT and COORD and an alignment graph attached to each COORD node]

Page 23: Parsing Long and Complex Natural Language Sentences Yuji Matsumoto Nara Institute of Science and Technology (NAIST) November 27, 2014 Shonan Meeting

Score of a tree = sum of all the scores of COORD/COORD’

nodes in the tree

25

c”and b or“a

W

N

W

N

W

N

CCCC

COORD

COORD

CJT CJTCJT

CJT

bc

ab and c

node score = 5.5

node score = 3.3

total score = 8.8
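The scoring rule can be written as a short recursive sum. The sketch below is only illustrative: the `Node` class and `tree_score` function are hypothetical, and the assignment of 5.5 and 3.3 to the outer and inner COORD node is arbitrary, since the transcript does not say which node carries which score.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    score: float = 0.0  # alignment-based score, meaningful only on COORD nodes
    children: list = field(default_factory=list)


def tree_score(node):
    """Sum the scores of all COORD / COORD' nodes in the tree rooted at node."""
    own = node.score if node.label in ("COORD", "COORD'") else 0.0
    return own + sum(tree_score(child) for child in node.children)


# "a or b and c" bracketed as a or [b and c]; score placement is an assumption.
inner = Node("COORD", 3.3, [Node("CJT"), Node("CC"), Node("CJT")])
outer = Node("COORD", 5.5, [Node("CJT"), Node("CC"), Node("CJT", children=[inner])])

print(round(tree_score(outer), 1))
```

Because the total is a plain sum over COORD nodes, the best tree can be searched for with CKY-style dynamic programming, as mentioned earlier in the talk.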

Page 24:

Experiments: Comparison with other parsers (on GENIA Corpus)

(with gold POS)

Coordination type   Number   Proposed method   Bikel-Collins
Overall             3598     61.5              52.1

(with auto-tagged POS)

Coordination type   Number   Proposed method   Charniak-Johnson
Overall             3598     57.5              52.9

Page 25:

Breakdown of the results per coordination type

Coordination type   Number   Proposed method   Bikel-Collins
NP                  2317     64.2              45.5
VP                  465      54.2              67.7
ADJP                321      80.4              66.4
S                   188      22.9              67.0
PP                  167      59.9              53.3
UCP                 60       36.7              18.3
SBAR                56       51.8              85.7
ADVP                21       85.7              90.5
Others              3        66.7              33.3

Page 26:

Breakdown of the results per coordination type

Coordination type   Number   Proposed method   Charniak-Johnson
NP                  2317     62.5              50.1
VP                  465      42.6              61.9
ADJP                321      76.3              48.6
S                   188      15.4              63.3
PP                  167      53.9              58.1
UCP                 60       38.3              26.7
SBAR                56       33.9              83.9
ADVP                21       85.7              90.5
Others              3        33.3              0.0

Page 27:

Our current annotation scheme for coordination and dependency

ChaKi: General annotation tool for POS, chunks, dependency, links in natural language sentences

Coordinate structure and dependency structure are annotated independently

Page 28:

ChaKi: Corpus annotation and management tool

Page 29:

Current Project: Joint coordination and dependency parsing

Coordinate structure analysis
  Alignment-based coordination structure analysis

Dependency analysis
  Eisner algorithm (CKY-style dynamic programming)

Extended Eisner algorithm

Need to accumulate training examples

Page 30:

Complex Sentence Patterns

Long sentences with subordinate or embedded clauses are difficult to parse

We investigated the variation of clause patterns around “SBAR”

We extracted SBAR patterns from a corpus auto-parsed by the Berkeley parser, then merged the patterns into a manageable number of groups

Page 31:

Analysis of SBAR patterns in complex sentences

Examine SBAR and its relations to its parent, sister and children nodes.

Page 32:

Extracted SBAR Pattern

(NP (NP) (SBAR (WHNP (WP who)) (S (VP))))

Page 33:

Corpus data: Hiragana Times (English-Japanese parallel text)

Page 34:

Statistics of patterns extracted from Hiragana Times (English part)

Number of sentences                 171,098
Number of complex sentences          70,134
Number of SBARs                     114,840
Number of distinct SBAR patterns     21,090

Page 35:

Top 10 SBAR patterns of high frequency

Rank   SBAR Pattern                                        Freq.
1      (NP (NP) (SBAR (WHNP (WP who)) (S (VP))))           6921
2      (NP (NP) (SBAR (S (NP) (VP))))                      4982
3      (NP (NP) (SBAR (WHNP (WDT that)) (S (VP))))         4772
4      (NP (NP) (SBAR (WHNP (WDT that)) (S (VP))))         3017
5      (VP (VBP) (SBAR (S (NP) (VP))))                     2768
6      (NP (NP) (,) (SBAR (WHNP (WDT which)) (S (VP))))    1619
7      (VP (VBD) (SBAR (IN that) (S (NP) (VP))))           1583
8      (VP (VB) (SBAR (IN that) (S (NP) (VP))))            1564
9      (VP (VBD) (SBAR (S (NP) (VP))))                     1542
10     (VP (VBZ) (SBAR (IN that) (S (NP) (VP))))           1389
(sum)  30157 / 100537 (29.9%)

Page 36:

Grouping SBAR patterns

Grouping Criteria

(1) Head function words should be the same

(2) Parent nodes of SBAR should be the same

(3) C-commanding nodes of SBAR should be the same

(4) Clause structure under SBAR should be the same

Page 37:

Examples of Distinct Patterns to be Grouped Together

Rank   SBAR Pattern                                                   Freq.
1      (NP (NP) (SBAR (WHNP (WP who)) (S (VP))))                      6921
21     (NP (NP) (,) (SBAR (WHNP (WP who)) (S (VP))))                  646
151    (NP (NP) (,) (SBAR (WHNP (WP who)) (S (ADVP) (VP))))           51
557    (NP (NP) (ADJP) (SBAR (WHNP (WP who)) (S (VP))))               13
2359   (NP (NP) (PP) (,) (SBAR (WHNP (WP who)) (S (ADVP) (VP))))      2
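As a rough illustration of how such patterns could be merged, the sketch below groups bracketed patterns by a simplified key made of the parent label and the first lowercase function word. This is a deliberate simplification that approximates only criteria (1) and (2); it is not the grouping procedure used in the talk, and the names `group_key` and `group_patterns` are hypothetical. The last pattern in the sample list is taken from the top-10 table so that a second group appears.

```python
import re
from collections import defaultdict

PATTERNS = [
    ("(NP (NP) (SBAR (WHNP (WP who)) (S (VP))))", 6921),
    ("(NP (NP) (,) (SBAR (WHNP (WP who)) (S (VP))))", 646),
    ("(NP (NP) (,) (SBAR (WHNP (WP who)) (S (ADVP) (VP))))", 51),
    ("(NP (NP) (ADJP) (SBAR (WHNP (WP who)) (S (VP))))", 13),
    ("(NP (NP) (PP) (,) (SBAR (WHNP (WP who)) (S (ADVP) (VP))))", 2),
    ("(VP (VBD) (SBAR (IN that) (S (NP) (VP))))", 1583),
]


def group_key(pattern):
    """Return (parent label, head function word) for a bracketed SBAR pattern.

    Assumption: the function word is the first lowercase token that appears
    inside a preterminal such as "(WP who)"; None if there is no such token.
    """
    parent = re.match(r"\((\S+)", pattern).group(1)
    words = re.findall(r"\(\S+ ([a-z]+)\)", pattern)
    return parent, (words[0] if words else None)


def group_patterns(patterns):
    """Sum the frequencies of all patterns that share the same group key."""
    groups = defaultdict(int)
    for pattern, freq in patterns:
        groups[group_key(pattern)] += freq
    return dict(groups)


for key, freq in group_patterns(PATTERNS).items():
    print(key, freq)
```

Under this simplified key the five “who” patterns from the table fall into one group, while the “that” complement clause under VP stays separate.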


Page 38:

Grouping SBAR patterns reduces the number of distinct patterns

[Figure: number of covered sentences (y-axis, 0 to 100,000) against number of grouped patterns (x-axis, 10 to 300)]

Number of grouped patterns   Coverage
150                          86,553/100,537 (86%)
300                          87,789/100,537 (87.3%)
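The coverage figures follow directly from the counts in the table, as the short check below shows. The helper name `coverage` is hypothetical, and the counts are copied from the slide.

```python
TOTAL = 100_537  # denominator used on the slide

# number of grouped patterns -> number of covered sentences (from the slide)
COVERED = {150: 86_553, 300: 87_789}


def coverage(n_covered, total=TOTAL):
    """Return the covered fraction as a percentage rounded to one decimal."""
    return round(100 * n_covered / total, 1)


for n_groups, n_covered in COVERED.items():
    print(f"{n_groups} grouped patterns: {coverage(n_covered)}%")
```

Doubling the number of grouped patterns from 150 to 300 adds only about 1.2 percentage points of coverage, so a few hundred groups already account for most of the counted cases.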


Page 39:

One grouped pattern may have different meanings (different translations in Japanese)

“Business has doubled every year since we began,” Stuart says.

「 私 たち が 始めてから 、取引 高 は 毎年 、 倍 に 成長 し て い ます 」 と 、 スチュウアート さんは 言う 。

(VP (VB.*) (NP) (SBAR (IN since) (S (NP) (VP))))

…since… : time-related meaning

Page 40:

One grouped pattern may have different meanings (different translations in Japanese)

また 、 日本語 の ガイド ボーカル が 付き 、 ローマ字 付き の 日本語 歌詞 本 も 付い て いる ので 、日本語 の 発音 の 練習 も できる という すぐれ もの だ 。

This is so great that you can also practice Japanese pronunciation since the guide vocal and Japanese song book in Romaji are attached.

(VP (VB.*) (NP) (SBAR (IN since) (S (NP) (VP))))

…since… : reason-related meaning

Page 41:

Relations between the nodes in SBAR patterns

(VP (VB.*) (NP) (SBAR (IN since) (S (NP) (VP))))

[Figure: tense combinations of the verb phrases in the pattern, linked to the time-related and reason-related meanings of “since”. Tense labels in the figure: Present Perfect, Past Tense, Future Tense, Present Perfect / Past, Present, Present / Future, Past, Past / Future]

Page 42:

Evaluation of grouped patterns via Statistical MT

Top 100 SBAR patterns (coverage 82%)
  13 patterns have multiple translations
  Hand-written translation templates and disambiguation rules

“Divide and Rewrite” approach to translation
  Complex sentence pattern matching
  Sub-clauses are translated by an existing SMT system
  Translated clauses are put in translation templates

Existing SMT systems
  Google Translate, Moses & GIZA++

Page 43:

Experiment

Training (17,000 sentences), Dev (500), Test (500) from Hiragana Times (English)

In the test sentences, there are 232 complex sentences, of which 185 matched (80%)

                                     all test sentences    complex sentences only
                                     Moses     Google      Moses     Google
without complex sentence patterns    15.26     24.36       12.84     15.61
with complex sentence patterns       17.49     24.73       15.97     16.43

Page 44:

Current Projects

Joint Coordination and Dependency Parsing
  Extended Eisner algorithm

MWE Lexicon
  Functional expressions: preposition, determiner, conjunction, adverb
  Phrasal verbs (flexible MWEs: MWEs with gaps)
  Training data for disambiguation

Complex sentence patterns
  Previous evaluation was done only with sentences that have one SBAR structure
  Sentence pattern acquisition and disambiguation

[Examples shown on the slide: “a JJ kind of”; “not only … (but) also”; “The JJR ... V…, the JJR…V”]