Parsing Long and Complex Natural Language Sentences
Yuji Matsumoto
Nara Institute of Science and Technology (NAIST)
November 27, 2014, Shonan Meeting
Problems in Parsing Long Sentences
Documents in scientific or legal domains tend to contain long and complex sentences
  A major hindrance to syntactic analysis of natural language sentences
Difficult syntactic structures in long sentences
  Coordinate structures
  Complex syntactic patterns (with multiple clauses)
Ordinary CFG grammars and lexicons have difficulty in handling (or representing) such phenomena
Issues in Parsing that lie between Lexicon and Grammar
Coordinate Structures
  Extra-grammatical phenomenon
Grammatical Units
  Multiword Expressions (functional MWEs)
  Syntactically or semantically idiosyncratic expressions that should be registered in the lexicon
Complex Sentence Patterns
  Subordinate clauses
  Embedded clauses
  Other complex sentence patterns
(word tokens) (grammar rules)
Problems of Coordinate Structures
Any constituents can be coordinated
Non-constituent structures (sequences of constituents) can be coordinated
  “John saw Mary yesterday and Bill today.”
Coordinate structures can be nested
  “... 6.1 months and 8.9 months in arm A and 7.2 months and 9.5 months in arm B.”
Scope ambiguity
  “old [men and women]” vs “[old men] and women”
Identification of coordinate structure helps improve parsing accuracy
“Median times to progression and median survival times were 6.1 months and 8.9 months in arm A and 7.2 months and 9.5 months in arm B.”
⇓
“Median times … were 6.1 months in arm A”
“Median times … were 7.2 months in arm B”
“median survival times were 8.9 months in arm A”
“median survival times were 9.5 months in arm B”
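The expansion above can be sketched in a few lines of Python. This is only an illustration: it assumes the coordinate structure has already been identified and is supplied by hand as lists of conjuncts, and the function name expand and its parameters are hypothetical, not part of the method presented in the talk.

```python
from typing import List, Tuple


def expand(subjects: List[str],
           arms: List[Tuple[List[str], str]],
           verb: str = "were") -> List[str]:
    """Pair the i-th subject conjunct with the i-th value in every arm."""
    sentences = []
    for i, subject in enumerate(subjects):
        for values, arm in arms:
            sentences.append(f"{subject} {verb} {values[i]} in {arm}")
    return sentences


# Conjuncts of the example sentence, entered by hand (assumed input format).
subjects = ["Median times to progression", "median survival times"]
arms = [(["6.1 months", "8.9 months"], "arm A"),
        (["7.2 months", "9.5 months"], "arm B")]

simple_sentences = expand(subjects, arms)
for s in simple_sentences:
    print(s)
```

Once the three coordinations and their conjunct boundaries are known, the simple sentences follow mechanically, so the hard part is the identification step.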
Joint analysis of grammatical and alignment methods [Hara et al. 09]
1. Learning scores for alignment
   Feature-based learning [Shimbo & Hara 07]
2. Phrase structure grammar rules for coordinate structure are defined to ensure the structural constraints
3. The weights for alignment are jointly learned with the structural constraints
   Combination of CKY parsing algorithm with perceptron learning of alignment weights
Masashi Shimbo and Kazuo Hara, "A Discriminative Learning Model for Coordinate Conjunctions," EMNLP-CoNLL, pp.610-619, June 2007.
Kazuo Hara, Masashi Shimbo, Hideharu Okuma and Yuji Matsumoto, "Coordinate Structure Analysis with Global Structural Constraints and Alignment-Based Local Features," Proceedings of ACL-IJCNLP 2009, pp.967-975, August 2009.
Coordination structure analysis
Alignment of corresponding parts
“the standard arm and the dose dense arm”
  the standard arm
  the dose dense arm
DP matching method for alignment
[Figure: DP alignment matrix between “the standard arm” and “the dose dense arm”]
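A minimal sketch of DP matching between two conjuncts, written as a standard global sequence alignment over words. The scores used here (match = 2, mismatch = -1, gap = -1) are hand-set assumptions for illustration only; the method in the talk learns feature-based alignment scores instead.

```python
from typing import List, Optional, Tuple

Pair = Tuple[Optional[str], Optional[str]]


def align(a: List[str], b: List[str],
          match: int = 2, mismatch: int = -1, gap: int = -1
          ) -> Tuple[int, List[Pair]]:
    """Global alignment of two word sequences by dynamic programming."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,  # align two words
                              score[i - 1][j] + gap,      # skip a word of a
                              score[i][j - 1] + gap)      # skip a word of b
    # Trace back from the bottom-right cell to recover one best alignment.
    pairs: List[Pair] = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                pairs.append((a[i - 1], b[j - 1]))
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            pairs.append((a[i - 1], None))
            i -= 1
        else:
            pairs.append((None, b[j - 1]))
            j -= 1
    pairs.reverse()
    return score[n][m], pairs


best, pairs = align("the standard arm".split(), "the dose dense arm".split())
print(best, pairs)
```

With these assumed scores, “the” and “arm” align with their counterparts and one word of the longer conjunct is left unaligned.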
Our first method represents coordinate structure as a path on a triangular alignment graph
[Figure: triangular alignment graph for “Median times to progression and median survival times”, with start and end nodes]
A path representing correct structure
[Figure: the alignment graph with the path that pairs the conjuncts “Median times to progression” and “median survival times”, running from start to end]
Drawback of path-based method
It cannot cope with nested coordinations, such as:
“Median times to progression and median survival times were 6.1 months and 8.9 months in arm A and 7.2 months and 9.5 months in arm B.”
The example contains three coordinations:
  [6.1 months] and [8.9 months]
  [7.2 months] and [9.5 months]
  [6.1 months and 8.9 months in arm A] and [7.2 months and 9.5 months in arm B]
[Figure sequence: triangular alignment graph for “6.1 months and 8.9 months in arm A and 7.2 months and 9.5 months in arm B”, highlighting in turn the path segments for “6.1 months and 8.9 months”, for “7.2 months and 9.5 months”, and for the outer coordination “6.1 months and 8.9 months in arm A and 7.2 months and 9.5 months in arm B”]
There is no single path to connect all three segments
Constituent tree structure can represent coordinate structure as a tree
[Figure: constituent tree over “Median times to progression and median survival times were 6.1 months and 8.9 months in group A and 7.2 months and 9.5 months in group B”]
We use grammar rules only to ensure a consistent global coordinate structure
For any two coordinate structures in a sentence, one of the following must hold:
  Either their scopes are completely disjoint (non-overlapping flat coordinate structures), or
  one is embedded in a conjunct of the other coordinate structure (nested structures).
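The constraint above can be checked directly on token spans. The sketch below assumes each coordination is given as a list of conjunct spans in half-open token indices; the Coord class and the function names are hypothetical helpers, and in the talk the constraint is enforced by grammar rules during parsing rather than checked afterwards.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import List, Tuple

Span = Tuple[int, int]  # half-open token span [start, end)


@dataclass
class Coord:
    conjuncts: List[Span]

    @property
    def scope(self) -> Span:
        """Span from the start of the first conjunct to the end of the last."""
        return (min(s for s, _ in self.conjuncts),
                max(e for _, e in self.conjuncts))


def disjoint(a: Span, b: Span) -> bool:
    return a[1] <= b[0] or b[1] <= a[0]


def inside(inner: Span, outer: Span) -> bool:
    return outer[0] <= inner[0] and inner[1] <= outer[1]


def compatible(x: Coord, y: Coord) -> bool:
    """True if x and y are disjoint, or one lies within a conjunct of the other."""
    if disjoint(x.scope, y.scope):
        return True
    return (any(inside(x.scope, c) for c in y.conjuncts)
            or any(inside(y.scope, c) for c in x.conjuncts))


def consistent(coords: List[Coord]) -> bool:
    return all(compatible(x, y) for x, y in combinations(coords, 2))


# Tokens: 6.1 months and 8.9 months in arm A and 7.2 months and 9.5 months in arm B
inner_a = Coord([(0, 2), (3, 5)])      # [6.1 months] and [8.9 months]
inner_b = Coord([(9, 11), (12, 14)])   # [7.2 months] and [9.5 months]
outer = Coord([(0, 8), (9, 17)])       # [... in arm A] and [... in arm B]
crossing = Coord([(3, 8), (9, 11)])    # overlaps inner_a without nesting

print(consistent([inner_a, inner_b, outer]))  # nested and disjoint only
print(consistent([inner_a, crossing]))        # crossing scopes
```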
We incorporate “sequence alignment” into the tree-based method
We would like to measure the local similarity between the conjuncts of each coordination by sequence alignment.
To do so, we attach an alignment graph to each COORD node in a tree.
Attach an alignment graph to each COORD node (in the correct tree)
[Figure: the correct tree (nodes W, CC, N, CJT, COORD) over the example sentence, with an alignment graph attached to each COORD node, pairing “Median times to progression” with “median survival times”, “6.1 months” with “8.9 months”, “7.2 months” with “9.5 months”, and “6.1 months and 8.9 months in group A” with “7.2 months and 9.5 months in group B”]
Attach an alignment graph to each COORD node (in an incorrect tree)
[Figure: an incorrect tree over the same sentence, with an alignment graph attached to each of its COORD nodes]
Score of a tree = sum of the scores of all COORD/COORD’ nodes in the tree
[Figure: tree for “a or b and c” with two COORD nodes, the inner one coordinating “b” and “c” and the outer one coordinating “a” and “b and c”]
node score = 5.5
node score = 3.3
total score = 8.8
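The tree score can be sketched as a recursive sum over COORD nodes. The Node class below is a hypothetical stand-in for the parser's tree representation, and which of the two node scores (5.5 or 3.3) belongs to the inner or the outer COORD node is not recoverable from the slide, so the assignment here is an assumption.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)
    score: float = 0.0  # alignment score; only used for COORD / COORD' nodes


def tree_score(node: Node) -> float:
    """Sum the scores of all COORD/COORD' nodes in the subtree."""
    own = node.score if node.label.startswith("COORD") else 0.0
    return own + sum(tree_score(child) for child in node.children)


# “a or b and c” analysed as a or [b and c]
inner = Node("COORD", [Node("CJT"), Node("CC"), Node("CJT")], score=3.3)
outer = Node("COORD", [Node("CJT"), Node("CC"), Node("CJT", [inner])], score=5.5)

total = tree_score(outer)
print(round(total, 1))
```

Because the total is a plain sum over nodes, it decomposes over subtrees, which is what allows it to be combined with CKY-style dynamic programming.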
Experiments: Comparison with other parsers (on GENIA corpus)
(with gold POS)
Coordination type   Number   Proposed method   Bikel-Collins
Overall             3598     61.5              52.1
(with auto-tagged POS)
Coordination type   Number   Proposed method   Charniak-Johnson
Overall             3598     57.5              52.9
Breakdown of the results per coordination type
Coordination type   Number   Proposed method   Bikel-Collins
NP                  2317     64.2              45.5
VP                  465      54.2              67.7
ADJP                321      80.4              66.4
S                   188      22.9              67.0
PP                  167      59.9              53.3
UCP                 60       36.7              18.3
SBAR                56       51.8              85.7
ADVP                21       85.7              90.5
Others              3        66.7              33.3
Breakdown of the results per coordination type
Coordination type   Number   Proposed method   Charniak-Johnson
NP                  2317     62.5              50.1
VP                  465      42.6              61.9
ADJP                321      76.3              48.6
S                   188      15.4              63.3
PP                  167      53.9              58.1
UCP                 60       38.3              26.7
SBAR                56       33.9              83.9
ADVP                21       85.7              90.5
Others              3        33.3              0.0
Our current annotation scheme for coordination and dependency
ChaKi: General annotation tool for POS, chunks, dependency, links in natural language sentences
Coordinate structure and dependency structure are annotated independently
ChaKi: Corpus annotation and management tool
Current Project: Joint coordination and dependency parsing
Coordinate structure analysis
  Alignment-based coordination structure analysis
Dependency analysis
  Eisner algorithm (CKY-style dynamic programming)
Extended Eisner algorithm
Need to accumulate training examples
Complex Sentence Patterns
Long sentences, having subordinate clauses or embedded clauses, are difficult to parse
We investigated variation of clause patterns around “SBAR”
Extracted SBAR patterns from a corpus auto-parsed by the Berkeley parser, then merged the patterns into a manageable number of groups
Analysis of SBAR patterns in complex sentences
Examine SBAR and its relations to its parent, sister and children nodes.
Extracted SBAR Pattern (NP (NP) (SBAR (WHNP (WP who)) (S (VP))))
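A rough sketch of how such a pattern could be read off a bracketed parse. The abstraction rules used here are assumptions inferred from the pattern shown above: sister nodes are reduced to bare labels, function words directly under SBAR are kept, and the clause under SBAR is reduced to the labels of its children. The helper names are hypothetical and the actual extraction procedure may differ.

```python
from typing import List


def parse(s: str):
    """Parse a Penn-style bracketed string into (label, children) tuples."""
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def node():
        nonlocal pos
        pos += 1                      # consume "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])  # a word
                pos += 1
        pos += 1                      # consume ")"
        return (label, children)

    return node()


def full(t) -> str:
    """Render a subtree with its words (used for function words)."""
    if isinstance(t, str):
        return t
    return "(" + " ".join([t[0]] + [full(c) for c in t[1]]) + ")"


def sbar_pattern(parent, sbar) -> str:
    parts = []
    for child in parent[1]:
        if isinstance(child, str):
            continue
        if child is sbar:
            inner = []
            for c in sbar[1]:
                if isinstance(c, str):
                    continue
                if c[0] == "S":       # keep only the labels under the clause
                    labels = [f"({g[0]})" for g in c[1] if not isinstance(g, str)]
                    inner.append("(S " + " ".join(labels) + ")")
                else:                 # e.g. (WHNP (WP who)) or (IN that)
                    inner.append(full(c))
            parts.append("(SBAR " + " ".join(inner) + ")")
        else:
            parts.append(f"({child[0]})")  # sister: bare label only
    return f"({parent[0]} " + " ".join(parts) + ")"


def extract(tree) -> List[str]:
    patterns = []

    def walk(t):
        for c in t[1]:
            if isinstance(c, str):
                continue
            if c[0] == "SBAR":
                patterns.append(sbar_pattern(t, c))
            walk(c)

    walk(tree)
    return patterns


tree = parse("(S (NP (NP (DT the) (NN man)) (SBAR (WHNP (WP who)) "
             "(S (VP (VBD came))))) (VP (VBD left)))")
patterns = extract(tree)
print(patterns)
```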
Corpus data: Hiragana Times (English-Japanese parallel text)
Statistics of patterns extracted from Hiragana Times (English part)
Number of sentences 171,098
Number of complex sentences 70,134
Number of SBARs 114,840
Number of distinct SBAR patterns 21,090
Top 10 SBAR patterns of high frequency
Rank SBAR Pattern Freq.
1 (NP (NP) (SBAR (WHNP (WP who)) (S (VP)))) 6921
2 (NP (NP) (SBAR (S (NP) (VP)))) 4982
3 (NP (NP) (SBAR (WHNP (WDT that)) (S (VP)))) 4772
4 (NP (NP) (SBAR (WHNP (WDT that)) (S (VP)))) 3017
5 (VP (VBP) (SBAR (S (NP) (VP)))) 2768
6 (NP (NP) (,) (SBAR (WHNP (WDT which)) (S (VP)))) 1619
7 (VP (VBD) (SBAR (IN that) (S (NP) (VP)))) 1583
8 (VP (VB) (SBAR (IN that) (S (NP) (VP)))) 1564
9 (VP (VBD) (SBAR (S (NP) (VP)))) 1542
10 (VP (VBZ) (SBAR (IN that) (S (NP) (VP)))) 1389
(sum) 30157 / 100537 (29.9%)
Grouping SBAR patterns
Grouping Criteria
(1) Head function words should be the same
(2) Parent nodes of SBAR should be the same
(3) C-commanding nodes of SBAR should be the same
(4) Clause structure under SBAR should be the same
Examples of Distinct Patterns to be Grouped Together
Rank SBAR Pattern Freq.
1 (NP (NP) (SBAR (WHNP (WP who)) (S (VP)))) 6921
21 (NP (NP) (,) (SBAR (WHNP (WP who)) (S (VP)))) 646
151 (NP (NP) (,) (SBAR (WHNP (WP who)) (S (ADVP) (VP)))) 51
557 (NP (NP) (ADJP) (SBAR (WHNP (WP who)) (S (VP)))) 13
2359 (NP (NP) (PP) (,) (SBAR (WHNP (WP who)) (S (ADVP) (VP)))) 2
…
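A simplified sketch of how the patterns listed above could fall into one group. It keys each pattern only on the label of SBAR's parent and on the head function word, which is a loose reading of criteria (1) and (2); criteria (3) and (4) are ignored here, and the regular expressions and function names are assumptions made for illustration.

```python
import re
from collections import defaultdict
from typing import Dict, Optional, Tuple

# Matches a preterminal with its word, e.g. "(WP who)" or "(IN that)".
LEAF = re.compile(r"\(([^()\s]+) ([^()\s]+)\)")
ROOT = re.compile(r"\(([^()\s]+)")

Key = Tuple[str, Optional[str]]


def group_key(pattern: str) -> Key:
    """Return (parent label, head function word or None) for a pattern."""
    parent = ROOT.match(pattern).group(1)
    leaf = LEAF.search(pattern)
    return parent, (leaf.group(2) if leaf else None)


def group(freqs: Dict[str, int]) -> Dict[Key, int]:
    """Sum the frequencies of all patterns that share a key."""
    totals: Dict[Key, int] = defaultdict(int)
    for pattern, freq in freqs.items():
        totals[group_key(pattern)] += freq
    return dict(totals)


# The five patterns and frequencies listed above.
freqs = {
    "(NP (NP) (SBAR (WHNP (WP who)) (S (VP))))": 6921,
    "(NP (NP) (,) (SBAR (WHNP (WP who)) (S (VP))))": 646,
    "(NP (NP) (,) (SBAR (WHNP (WP who)) (S (ADVP) (VP))))": 51,
    "(NP (NP) (ADJP) (SBAR (WHNP (WP who)) (S (VP))))": 13,
    "(NP (NP) (PP) (,) (SBAR (WHNP (WP who)) (S (ADVP) (VP))))": 2,
}
groups = group(freqs)
print(groups)
```

Under this simplified key, all five patterns collapse into a single group for relative clauses introduced by “who” under an NP.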
Grouping SBAR patterns reduces the number of distinct patterns
[Figure: number of covered sentences (0 to 100,000) plotted against number of grouped patterns (10 to 300)]
Number of grouped patterns   Coverage
150                          86,553/100,537 (86%)
300                          87,789/100,537 (87.3%)
One grouped pattern may have different meaning (different translations in Japanese)
“Business has doubled every year since we began,” Stuart says .
「 私 たち が 始めてから 、取引 高 は 毎年 、 倍 に 成長 し て い ます 」 と 、 スチュウアート さんは 言う 。
(VP (VB.*) (NP) (SBAR (IN since) (S (NP) (VP))))
…since… : time-related meaning
また 、 日本語 の ガイド ボーカル が 付き 、 ローマ字 付き の 日本語 歌詞 本 も 付い て いる ので 、日本語 の 発音 の 練習 も できる という すぐれ もの だ 。
This is so great that you can also practice Japanese pronunciation since the guide vocal and Japanese song book in Romaji are attached.
(VP (VB.*) (NP) (SBAR (IN since) (S (NP) (VP))))
…since… : reason-related meaning
Relations between the nodes in SBAR patterns
(VP (VB.*) (NP) (SBAR (IN since) (S (NP) (VP))))
[Figure: combinations of tenses in the main clause and the since-clause (Present Perfect, Past, Present, Future) mapped to the time-related or the reason-related meaning]
Evaluation of grouped patterns via Statistical MT
Top 100 SBAR patterns (coverage 82%)
  13 patterns have multiple translations
  Hand-written translation templates and disambiguation rules
“Divide and Rewrite” approach to translation
  Complex sentence pattern matching
  Sub-clauses are translated by an existing SMT system
  Translated clauses are put in translation templates
Existing SMT systems
  Google Translate, Moses & Giza++
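A toy sketch of the divide-and-rewrite flow for the “since” pattern. Everything specific here is an assumption: the two templates are only modelled on the Japanese example translations shown earlier, the sentence is split with a regular expression rather than by matching a parse tree, the sense (time or reason) is passed in by the caller instead of coming from disambiguation rules, and the clause translator is a pluggable function standing in for an existing SMT system.

```python
import re
from typing import Callable

# Hypothetical templates, modelled on the two example translations.
TEMPLATES = {
    "time": "{sub}てから、{main}",
    "reason": "{sub}ので、{main}",
}

SINCE = re.compile(r"^(?P<main>.+?)\s+since\s+(?P<sub>.+?)[.]?$")


def divide_and_rewrite(sentence: str,
                       translate: Callable[[str], str],
                       sense: str) -> str:
    """Translate the clauses separately and place them in a template."""
    m = SINCE.match(sentence)
    if m is None:
        # No complex sentence pattern matched: translate the sentence as a whole.
        return translate(sentence)
    return TEMPLATES[sense].format(sub=translate(m["sub"]),
                                   main=translate(m["main"]))


def stub_translate(clause: str) -> str:
    """Placeholder for a real SMT system; it only marks the clause."""
    return f"<{clause}>"


result = divide_and_rewrite(
    "Business has doubled every year since we began.", stub_translate, "time")
print(result)
```

The point of the design is that the SMT system only ever sees short clauses, while the template fixes the clause order and the connective in the target language.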
Experiment
Training (17,000 sentences), Dev(500), Test(500) from Hiragana Times (English)
In test sentences, there are 232 complex sentences, of which 185 matched (80%)
                                    all test sentences    complex sentences only
                                    Moses    Google       Moses    Google
without complex sentence patterns   15.26    24.36        12.84    15.61
with complex sentence patterns      17.49    24.73        15.97    16.43
Current Projects
Joint Coordination and Dependency Parsing
  Extended Eisner algorithm
MWE Lexicon
  Functional expressions: preposition, determiner, conjunction, adverb
  Phrasal verbs (Flexible MWEs: MWEs with gaps)
  Training data for disambiguation
Complex sentence patterns
  Previous evaluation was done only with sentences that have one SBAR structure
  Sentence pattern acquisition and disambiguation
Example patterns:
  a JJ kind of
  not only … (but) also
  The JJR ... V…, the JJR … V