
Parsing Long and Complex Natural Language Sentences

Yuji Matsumoto
Nara Institute of Science and Technology (NAIST)

November 27, 2014
Shonan Meeting

Problems in Parsing Long Sentences

Texts in scientific or legal domains tend to contain long and complex sentences.

Such sentences are a major hindrance to syntactic analysis of natural language.

Difficult syntactic structures in long sentences:
  Coordinate structures
  Complex syntactic patterns (with multiple clauses)

Ordinary CFG grammars and lexicons have difficulty in handling (or representing) such phenomena.

Issues in Parsing that lie between Lexicon and Grammar

Coordinate Structures
  Extra-grammatical phenomenon
Grammatical Units
  Multiword Expressions (functional MWEs)
  Syntactically or semantically idiosyncratic expressions that should be registered in lexicon
Complex Sentence Patterns
  Subordinate clauses
  Embedded clauses
  Other complex sentence patterns

[Diagram labels: (word tokens), (grammar rules)]

Problems of Coordinate Structures

Any constituents can be coordinated.
Non-constituent structures (sequences of constituents) can be coordinated:
  “John saw Mary yesterday and Bill today.”
Coordinate structures can be nested:
  “... 6.1 months and 8.9 months in arm A and 7.2 months and 9.5 months in arm B.”
Scope ambiguity:
  “old [men and women]” vs “[old men] and women”

Identification of coordinate structure helps improve parsing accuracy

“Median times to progression and median survival times were 6.1 months and 8.9 months in arm A and 7.2 months and 9.5 months in arm B.”

⇓

“Median times … were 6.1 months in arm A”
“Median times … were 7.2 months in arm B”
“median survival times were 8.9 months in arm A”
“median survival times were 9.5 months in arm B”

Joint analysis of grammatical and alignment methods [Hara, et al 09]

1. Learning scores for alignment
   Feature-based learning [Shimbo & Hara 07]
2. Phrase structure grammar rules for coordinate structure are defined to ensure the structural constraints
3. The weights for alignment are jointly learned with the structural constraints
   Combination of CKY parsing algorithm with perceptron learning of alignment weights

Masashi Shimbo and Kazuo Hara, "A Discriminative Learning Model for Coordinate Conjunctions," EMNLP-CoNLL, pp.610-619, June 2007.

Kazuo Hara, Masashi Shimbo, Hideharu Okuma and Yuji Matsumoto, "Coordinate Structure Analysis with Global Structural Constraints and Alignment-Based Local Features," Proceedings of ACL-IJCNLP 2009, pp.967-975, August 2009.

Coordination structure analysis

Alignment of corresponding parts

“the standard arm and the dose dense arm”

  the standard arm
  the dose dense arm

DP matching method for alignment

[Figure: DP matching grid aligning the conjuncts “the standard arm” and “the dose dense arm”]
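As a rough illustration of DP matching between two conjuncts, the sketch below aligns two word sequences with a Needleman-Wunsch-style dynamic program. The function name `align_conjuncts` and the fixed `match`, `mismatch`, and `gap` values are assumptions made for this example only; in the method described in these slides the alignment scores are learned from features [Shimbo & Hara 07].

```python
def align_conjuncts(a, b, match=1.0, mismatch=-0.5, gap=-0.5):
    """Globally align word lists a and b; return (best score, aligned pairs)."""
    n, m = len(a), len(b)

    def sub(i, j):
        # Substitution score for aligning a[i-1] with b[j-1] (illustrative values).
        return match if a[i - 1].lower() == b[j - 1].lower() else mismatch

    # score[i][j] = best score for aligning a[:i] with b[:j].
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = score[i - 1][0] + gap
    for j in range(1, m + 1):
        score[0][j] = score[0][j - 1] + gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align the two words
                score[i - 1][j] + gap,            # skip a word of a
                score[i][j - 1] + gap,            # skip a word of b
            )

    # Trace back one best path through the grid.
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            pairs.append((a[i - 1], b[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            pairs.append((a[i - 1], None))
            i -= 1
        else:
            pairs.append((None, b[j - 1]))
            j -= 1
    pairs.reverse()
    return score[n][m], pairs


score, pairs = align_conjuncts("the standard arm".split(), "the dose dense arm".split())
print(score)
for left, right in pairs:
    print(left, "<->", right)
```

With these toy scores the identical words “the” and “arm” are aligned with each other, which is the kind of local similarity between conjuncts that the alignment is meant to capture.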

Our first method represents coordinate structure as a path on a triangular alignment graph

[Figure: triangular alignment graph over “Median times to progression and median survival times”, with a path from start to end. Caption: A path representing correct structure]

[Figure: the same alignment graph, with the path from start to end marking the conjuncts “Median times to progression” and “median survival times”]

Drawback of path-based method

It cannot cope with nested coordinations, such as:

“Median times to progression and median survival times were 6.1 months and 8.9 months in arm A and 7.2 months and 9.5 months in arm B.”

[Figures: alignment graphs over “6.1 months and 8.9 months in arm A and 7.2 months and 9.5 months in arm B”, each with a path from start to end. They highlight in turn the inner coordination “6.1 months and 8.9 months”, the inner coordination “7.2 months and 9.5 months”, and the outer coordination of “6.1 months and 8.9 months in arm A” with “7.2 months and 9.5 months in arm B”]

There is no single path to connect all three segments

Constituent tree structure can represent coordinate structure as a tree

[Figure: constituent tree over “Median times to progression and median survival times were 6.1 months and 8.9 months in group A and 7.2 months and 9.5 months in group B”]

We use grammar rules only to ensure consistent global coordinate structure

For any two coordinate structures in a sentence, one of the following must hold:

  Either their scopes are completely disjoint (non-overlapping flat coordinate structures), or
  one is embedded in a conjunct of the other coordinate structure (nested structures).
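The disjoint-or-nested condition can be checked mechanically. The sketch below is only an illustration of the constraint, not the grammar used in the talk: it assumes each coordination is given as a list of conjunct spans `(start, end)` over token positions (end exclusive), and the names `scope`, `compatible`, and `globally_consistent` are made up for this example.

```python
from itertools import combinations


def scope(coord):
    """Smallest span covering all conjuncts of one coordination."""
    return (min(start for start, _ in coord), max(end for _, end in coord))


def contains(outer, inner):
    """True if span inner lies entirely inside span outer."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


def compatible(c1, c2):
    """Two coordinations are compatible if they are disjoint or properly nested."""
    s1, s2 = scope(c1), scope(c2)
    if s1[1] <= s2[0] or s2[1] <= s1[0]:
        return True  # completely disjoint scopes
    # Otherwise one scope must fit inside a single conjunct of the other.
    return any(contains(cj, s1) for cj in c2) or any(contains(cj, s2) for cj in c1)


def globally_consistent(coords):
    return all(compatible(a, b) for a, b in combinations(coords, 2))


TOKENS = "6.1 months and 8.9 months in arm A and 7.2 months and 9.5 months in arm B".split()
inner_a = [(0, 2), (3, 5)]     # "6.1 months" / "8.9 months"
inner_b = [(9, 11), (12, 14)]  # "7.2 months" / "9.5 months"
outer = [(0, 8), (9, 17)]      # "... in arm A" / "... in arm B"
crossing = [(3, 8), (9, 11)]   # hypothetical wrong analysis, for contrast

print(globally_consistent([inner_a, inner_b, outer]))  # nested and disjoint only
print(globally_consistent([inner_a, crossing]))        # overlapping but not nested
```

The first call covers the nested example from the slides; the second adds a made-up crossing coordination that overlaps `inner_a` without being nested in it, which the constraint rules out.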

We incorporate “sequence alignment” into the tree-based method

We would like to measure local similarity between conjuncts in each coordination by sequence alignment.

To do so, we attach an alignment graph to each COORD node in a tree.

Attach an alignment graph to each COORD node (in the correct tree)

[Figure: the correct tree over the example sentence, built from W, CC, N, CJT and COORD nodes, with an alignment graph attached to each of its four COORD nodes: “Median times to progression” / “median survival times”; “6.1 months” / “8.9 months”; “7.2 months” / “9.5 months”; and “6.1 months and 8.9 months in group A” / “7.2 months and 9.5 months in group B”]

Attach an alignment graph to each COORD node (in an incorrect tree)

[Figure: an incorrect tree over the same sentence, with an alignment graph attached to each of its COORD nodes]

Score of a tree = sum of all the scores of COORD/COORD’ nodes in the tree

[Figure: tree for “a or b and c” with two COORD nodes, one coordinating “b” and “c” and one coordinating “a” and “b and c”; node score = 5.5, node score = 3.3, total score = 8.8]
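The scoring rule amounts to a sum over the coordination nodes of a tree. Below is a minimal sketch under stated assumptions: the `Node` class and `tree_score` function are hypothetical names, the prime in COORD’ is written as an ASCII apostrophe, and the assignment of 5.5 to the outer node and 3.3 to the inner node is arbitrary, since the slide only gives the two node scores and their total.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    label: str
    score: float = 0.0  # alignment-based score; only used on COORD / COORD' nodes
    children: List["Node"] = field(default_factory=list)


def tree_score(node):
    """Sum the scores of all COORD and COORD' nodes in the tree rooted at node."""
    own = node.score if node.label in ("COORD", "COORD'") else 0.0
    return own + sum(tree_score(child) for child in node.children)


# "a or b and c": the inner COORD covers "b and c", the outer COORD covers the whole phrase.
inner = Node("COORD", 3.3, [Node("CJT", children=[Node("N")]),
                            Node("CC"),
                            Node("CJT", children=[Node("N")])])
outer = Node("COORD", 5.5, [Node("CJT", children=[Node("N")]),
                            Node("CC"),
                            Node("CJT", children=[inner])])

print(round(tree_score(outer), 1))
```

Because the total is a plain sum of per-node scores, it can be accumulated bottom-up during chart parsing, which is what makes the combination with CKY-style parsing natural.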

Experiments: Comparison with other parsers (on GENIA Corpus)

(with Gold POS)
Coordination type   Number   Proposed method   Bikel-Collins
Overall             3598     61.5              52.1

(with auto-tagged POS)
Coordination type   Number   Proposed method   Charniak-Johnson
Overall             3598     57.5              52.9

Breakdown of the results per coordination of different types

Coordination type   Number   Proposed method   Bikel-Collins
NP                  2317     64.2              45.5
VP                  465      54.2              67.7
ADJP                321      80.4              66.4
S                   188      22.9              67.0
PP                  167      59.9              53.3
UCP                 60       36.7              18.3
SBAR                56       51.8              85.7
ADVP                21       85.7              90.5
Others              3        66.7              33.3

Breakdown of the results per coordination of different types

Coordination type   Number   Proposed method   Charniak-Johnson
NP                  2317     62.5              50.1
VP                  465      42.6              61.9
ADJP                321      76.3              48.6
S                   188      15.4              63.3
PP                  167      53.9              58.1
UCP                 60       38.3              26.7
SBAR                56       33.9              83.9
ADVP                21       85.7              90.5
Others              3        33.3              0.0

Our current annotation scheme for coordination and dependency

ChaKi: General annotation tool for POS, chunks, dependency, links in natural language sentences

Coordinate structure and dependency structure are annotated independently

ChaKi: Corpus annotation and management tool

Current Project: Joint coordination and dependency parsing

Coordinate structure analysis
  Alignment-based coordination structure analysis
Dependency analysis
  Eisner algorithm (CKY-style dynamic programming)

Extended Eisner algorithm

Need to accumulate training examples
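For reference, the standard first-order Eisner algorithm (not the extended version planned in this project, whose details are not given here) can be sketched as follows. The arc-score matrix `score[h][m]`, the artificial root at position 0, and the function name `eisner` are assumptions for this example. For brevity each chart cell stores its arcs directly instead of backpointers, and the root may take more than one dependent.

```python
NEG_INF = float("-inf")


def eisner(score):
    """Best projective dependency tree for arc scores score[head][modifier].

    Token 0 is an artificial root. Returns (total score, heads), where
    heads[m] is the head of token m and heads[0] is None.
    """
    n = len(score)
    # chart[s][t][d][c] = (best score, arcs) for the span s..t.
    # d = 0: head is t (arc points left); d = 1: head is s (arc points right).
    # c = 0: incomplete span (ends in a new arc); c = 1: complete span.
    chart = [[[[(NEG_INF, ()), (NEG_INF, ())] for _ in range(2)]
              for _ in range(n)] for _ in range(n)]
    for s in range(n):
        for d in range(2):
            chart[s][s][d] = [(0.0, ()), (0.0, ())]

    def join(left, right, arc_score=0.0, arc=None):
        arcs = left[1] + right[1] + ((arc,) if arc is not None else ())
        return (left[0] + right[0] + arc_score, arcs)

    def best(candidates):
        return max(candidates, key=lambda item: item[0])

    for width in range(1, n):
        for s in range(n - width):
            t = s + width
            # Two facing complete spans are joined by a new arc between s and t.
            halves = [(chart[s][r][1][1], chart[r + 1][t][0][1]) for r in range(s, t)]
            chart[s][t][0][0] = best(join(a, b, score[t][s], (t, s)) for a, b in halves)
            chart[s][t][1][0] = best(join(a, b, score[s][t], (s, t)) for a, b in halves)
            # An incomplete span is extended by a complete span on the head's far side.
            chart[s][t][0][1] = best(join(chart[s][r][0][1], chart[r][t][0][0])
                                     for r in range(s, t))
            chart[s][t][1][1] = best(join(chart[s][r][1][0], chart[r][t][1][1])
                                     for r in range(s + 1, t + 1))

    total, arcs = chart[0][n - 1][1][1]
    heads = [None] * n
    for head, modifier in arcs:
        heads[modifier] = head
    return total, heads


# Toy example: root plus two words; arcs into the root are forbidden.
toy = [[0, 1, 5],
       [NEG_INF, 0, 2],
       [NEG_INF, 4, 0]]
print(eisner(toy))
```

The chart has the same span-based shape as a CKY chart, which is why the slide calls it CKY-style dynamic programming.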

Complex Sentence Patterns

Long sentences with subordinate or embedded clauses are difficult to parse.

We investigated the variation of clause patterns around “SBAR”.

We extracted SBAR patterns from a corpus parsed automatically with the Berkeley parser, then merged the patterns into a manageable number of groups.

Analysis of SBAR patterns in complex sentences

Examine SBAR and its relations to its parent, sister, and child nodes.

Extracted SBAR Pattern: (NP (NP) (SBAR (WHNP (WP who)) (S (VP))))
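One way such a pattern could be read off a bracketed parse is sketched below. The abstraction rules are assumptions chosen so that the output has the same shape as the pattern above; the slides do not spell out the actual procedure. The rules are: sisters of SBAR keep only their labels, function words under SBAR keep their word forms, and the clause under SBAR keeps only the labels of its children. The tag set `FUNCTION_TAGS` and all function names are hypothetical.

```python
import re

FUNCTION_TAGS = {"IN", "TO", "WDT", "WP", "WP$", "WRB"}  # assumed function-word tags


def parse(text):
    """Read a Penn-Treebank-style bracketing into (label, children) tuples."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read():
        nonlocal pos
        pos += 1  # skip "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())
            else:
                children.append(tokens[pos])  # a word
                pos += 1
        pos += 1  # skip ")"
        return (label, children)

    return read()


def show(node):
    label, children = node
    return "(" + " ".join([label] + [c if isinstance(c, str) else show(c)
                                     for c in children]) + ")"


def subtrees(node):
    return [c for c in node[1] if not isinstance(c, str)]


def bare(node):
    return (node[0], [])


def abstract_sbar(sbar):
    kept = []
    for child in subtrees(sbar):
        label = child[0]
        if label in FUNCTION_TAGS:
            kept.append(child)  # keep the function word itself
        elif label.startswith("WH"):
            kept.append((label, [g if g[0] in FUNCTION_TAGS else bare(g)
                                 for g in subtrees(child)]))
        elif label == "S":
            kept.append((label, [bare(g) for g in subtrees(child)]))
        else:
            kept.append(bare(child))
    return ("SBAR", kept)


def sbar_patterns(tree):
    """Return one pattern string per SBAR node, built around its parent."""
    patterns = []

    def visit(node):
        kids = subtrees(node)
        for kid in kids:
            if kid[0] == "SBAR":
                shape = [abstract_sbar(k) if k is kid else bare(k) for k in kids]
                patterns.append(show((node[0], shape)))
        for kid in kids:
            visit(kid)

    visit(tree)
    return patterns


tree = parse("(S (NP (NP (DT the) (NN man)) (SBAR (WHNP (WP who)) "
             "(S (VP (VBD left))))) (VP (VBD smiled)))")
print(sbar_patterns(tree))
```

The example sentence “the man who left smiled” is made up for illustration; it yields the same relative-clause pattern as the one shown above.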


Corpus data: Hiragana Times (English-Japanese parallel text)

Statistics of patterns extracted from Hiragana Times (English part)

Number of sentences                 171,098
Number of complex sentences         70,134
Number of SBARs                     114,840
Number of distinct SBAR patterns    21,090

Top 10 SBAR patterns of high frequency

Rank  SBAR Pattern                                         Freq.
1     (NP (NP) (SBAR (WHNP (WP who)) (S (VP))))            6921
2     (NP (NP) (SBAR (S (NP) (VP))))                       4982
3     (NP (NP) (SBAR (WHNP (WDT that)) (S (VP))))          4772
4     (NP (NP) (SBAR (WHNP (WDT that)) (S (VP))))          3017
5     (VP (VBP) (SBAR (S (NP) (VP))))                      2768
6     (NP (NP) (,) (SBAR (WHNP (WDT which)) (S (VP))))     1619
7     (VP (VBD) (SBAR (IN that) (S (NP) (VP))))            1583
8     (VP (VB) (SBAR (IN that) (S (NP) (VP))))             1564
9     (VP (VBD) (SBAR (S (NP) (VP))))                      1542
10    (VP (VBZ) (SBAR (IN that) (S (NP) (VP))))            1389
(sum)                                                      30157 / 100537 (29.9%)
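The total in the last row can be re-derived from the ten frequencies. The short check below uses only numbers from the table; the variable names are illustrative.

```python
# Frequencies of the top 10 SBAR patterns, in rank order.
top10 = [6921, 4982, 4772, 3017, 2768, 1619, 1583, 1564, 1542, 1389]
total_instances = 100537  # denominator used on the slide

covered = sum(top10)
share = 100.0 * covered / total_instances

print(covered)          # sum of the ten frequencies
print(f"{share:.3f}%")  # share of all instances covered by the top 10
```

The sum is 30157, and the share comes to just under 30%, in line with the 29.9% shown on the slide.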

Grouping SBAR patterns

Grouping Criteria
(1) Head function words should be the same
(2) Parent nodes of SBAR should be the same
(3) C-commanding nodes of SBAR should be the same
(4) Clause structure under SBAR should be the same

Examples of Distinct Patterns to be Grouped Together

Rank SBAR Pattern Freq.

1 (NP (NP) (SBAR (WHNP (WP who)) (S (VP)))) 6921

21 (NP (NP) (,) (SBAR (WHNP (WP who)) (S (VP)))) 646

151 (NP (NP) (,) (SBAR (WHNP (WP who)) (S (ADVP) (VP)))) 51

557 (NP (NP) (ADJP) (SBAR (WHNP (WP who)) (S (VP)))) 13

2359 (NP (NP) (PP) (,) (SBAR (WHNP (WP who)) (S (ADVP) (VP)))) 2
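The five patterns above differ only in punctuation and modifier nodes. One possible normalization that puts them into a single group is sketched below; it simply drops nodes labelled `,`, `ADVP`, `ADJP`, or `PP` before comparing patterns. This `IGNORED` set is an assumption made for illustration, since the slides do not state how the grouping criteria were implemented.

```python
import re
from collections import defaultdict

IGNORED = {",", "ADVP", "ADJP", "PP"}  # assumed: punctuation and modifiers are skipped


def parse(text):
    """Read a bracketed pattern into (label, children) tuples."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read():
        nonlocal pos
        pos += 1  # skip "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1  # skip ")"
        return (label, children)

    return read()


def show(node):
    label, children = node
    return "(" + " ".join([label] + [c if isinstance(c, str) else show(c)
                                     for c in children]) + ")"


def normalize(node):
    """Drop ignored nodes everywhere in the pattern."""
    label, children = node
    kept = [c if isinstance(c, str) else normalize(c)
            for c in children
            if isinstance(c, str) or c[0] not in IGNORED]
    return (label, kept)


def group_patterns(pattern_freqs):
    """Map each normalized pattern to the total frequency of its members."""
    groups = defaultdict(int)
    for pattern, freq in pattern_freqs:
        groups[show(normalize(parse(pattern)))] += freq
    return dict(groups)


examples = [
    ("(NP (NP) (SBAR (WHNP (WP who)) (S (VP))))", 6921),
    ("(NP (NP) (,) (SBAR (WHNP (WP who)) (S (VP))))", 646),
    ("(NP (NP) (,) (SBAR (WHNP (WP who)) (S (ADVP) (VP))))", 51),
    ("(NP (NP) (ADJP) (SBAR (WHNP (WP who)) (S (VP))))", 13),
    ("(NP (NP) (PP) (,) (SBAR (WHNP (WP who)) (S (ADVP) (VP))))", 2),
]
print(group_patterns(examples))
```

Under this normalization all five patterns collapse to the rank-1 pattern, while a pattern with a different function word or clause structure stays in its own group.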


Grouping SBAR patterns reduces the number of distinct patterns

[Figure: number of covered sentences (vertical axis, 0 to 100,000) plotted against number of grouped patterns (horizontal axis, 10 to 300)]

Number of grouped patterns   Coverage
150                          86,553/100,537 (86%)
300                          87,789/100,537 (87.3%)

One grouped pattern may have different meanings (different translations in Japanese)

Pattern: (VP (VB.*) (NP) (SBAR (IN since) (S (NP) (VP))))

…since… : time-related meaning

“Business has doubled every year since we began,” Stuart says .

「 私 たち が 始めてから 、取引 高 は 毎年 、 倍 に 成長 し て い ます 」 と 、 スチュウアート さんは 言う 。

…since… : reason-related meaning

This is so great that you can also practice Japanese pronunciation since the guide vocal and Japanese song book in Romaji are attached.

また 、 日本語 の ガイド ボーカル が 付き 、 ローマ字 付き の 日本語 歌詞 本 も 付い て いる ので 、日本語 の 発音 の 練習 も できる という すぐれ もの だ 。

Relations between the nodes in SBAR patterns

(VP (VB.*) (NP) (SBAR (IN since) (S (NP) (VP))))

[Figure: tense labels on the nodes of the pattern (Present Perfect, Past, Present, Future and combinations of them) related to the time-related meaning and the reason-related meaning of “since”]

Evaluation of grouped patterns via Statistical MT

Top 100 SBAR patterns (coverage 82%)
  13 patterns have multiple translations
  Hand-written translation templates and disambiguation rules
“Divide and Rewrite” approach to translation
  Complex sentence pattern matching
  Sub-clauses are translated by existing SMT system
  Translated clauses are put in translation templates
Existing SMT systems
  Google translate, Moses & Giza++

Experiment

Training (17,000 sentences), Dev (500), Test (500) from Hiragana Times (English)

In test sentences, there are 232 complex sentences, of which 185 matched (80%)

                                     all test sentences    complex sentences only
                                     Moses     Google      Moses     Google
without complex sentence patterns    15.26     24.36       12.84     15.61
with complex sentence patterns       17.49     24.73       15.97     16.43

Current Projects

Joint Coordination and Dependency Parsing
  Extended Eisner algorithm
MWE Lexicon
  Functional expressions: preposition, determiner, conjunction, adverb
  Phrasal verbs (Flexible MWEs: MWEs with gaps)
  Training data for disambiguation
Complex sentence patterns
  Previous evaluation was done only with sentences that have one SBAR structure
  Sentence pattern acquisition and disambiguation

a JJ kind of
not only … (but) also
The JJR ... V…, the JJR…V
