SEMANTIC EVENT EXTRACTION IN UNSTRUCTURED TEXT BASED ON PROMINENCE
AND DISCOURSE-LEVEL DEPENDENCIES
SIAW NYUK HIONG
UNIVERSITI MALAYSIA SARAWAK
THESIS SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
FACULTY OF COMPUTER SCIENCE AND INFORMATION TECHNOLOGY
UNIVERSITI MALAYSIA SARAWAK
2015
DECLARATION
I hereby declare that the work in this thesis is my own except for quotations and
summaries which have been duly acknowledged.
7 July 2015 SIAW NYUK HIONG
10011495
ACKNOWLEDGEMENTS
My heartfelt thanks go out to Prof. Dr. Narayanan Kulathuramaiyer who, as my main supervisor, guided me through a lot of exploration at the initial stage of my research. This has provoked me to always try to think outside the box. Due to the needs of my research, Assoc. Prof. Dr. Bali Ranaivo-Malançon and Assoc. Prof. Dr. Jane Labadin were also appointed as my co-supervisors. I would like to express my gratitude and appreciation to Assoc. Prof. Dr. Bali Ranaivo-Malançon, who motivated and guided me towards a direction of research that I was able to accomplish. Her professional comments have always helped me to continuously improve my work. Many thanks to Assoc. Prof. Dr. Jane Labadin, who shared her knowledge and views from the mathematical perspective. My appreciation also goes out to Prof. Dr. Zaharin Yusoff, who gave valuable comments on the earlier part of my research. I would also like to thank the three experts who took their precious time to annotate the data. My deepest appreciation also goes out to Dr. Lim Lian Tze and Puan Suhaila Saee, who helped me in using LaTeX. Next, I would like to thank the Ministry of Education Malaysia, which sponsored my full-time study at Unimas. Lastly, I would like to thank my family members, who have been very supportive of my research work.
ABSTRACT
Semantic event extraction has been applied in many natural language processing
(NLP) tasks such as summarization and text mining. However, little research has
been carried out to automate the extraction and representation of multiple events. As a
result, the semantically annotated corpora available for event extraction are limited to
PropBank, FrameNet and VerbNet. This collection can be expanded by adding
other semantically annotated event corpora to it. Many event extraction models
such as EVENT, SEM and LODE have been proposed, but this research stopped at the
collection of events. Research that goes beyond this collection of events to investigate
the interpretation and abstraction of event-based knowledge has not been explored
much. Furthermore, there is a lack of research on key event indexing to identify the
relative importance of multiple events in a complex sentence. Such an index can
augment successfully extracted event-based knowledge as a weight.
The main objective of this research is to propose a framework that can automate
the extraction of semantically relevant key events based on thematic hierarchy and
discourse-level dependencies to determine their relationships and relative importance.
This has led to the exploration and formulation of designs to: i) capture and annotate
multiple semantic events in a semantic representation format; ii) define a linguistically
injected model (Linguistic Window Model) to interpret multiple events in a complex
sentence; and iii) define new weights for graph-based text (based on the Linguistic
Window Model) for key event indexing.
This research has proposed a new method, EveSem, an NLP tool pipeline that
automates the extraction and annotation of semantic events. This tool has performed
marginally better than TIPSemB-1.0. EveSem is then extended into a Linguistic
Window Model, whose linguistic structure is found to enhance the F1-score
when compared to ACE data for event extraction. The thematic hierarchy and
discourse-level dependencies properties of the linguistic structure have been found to
greatly improve the recall over ACE data for "trigger" identification as well. Based on
the thematic hierarchy, new weights are defined to construct weighted graph-based text,
which has been shown to improve the indexing of the relative importance of key events
in complex sentences.
The results showed that the NLP tool pipeline has successfully extracted and
represented multiple events in XML tags. The small collection of XML annotated
corpus for semantic events can be added to the collection of event lexical databases.
Furthermore, this approach is domain generic and can be ported to
other languages provided the language has the necessary NLP tools available. The
Linguistic Window Model is able to extract events with an improved F1-score over the
ACE task. This model has an advantage over the bag-of-words (BOW) model for key event
indexing since it takes into consideration the context of word co-occurrence and the
semantic association between words based on the linguistic structure of the model. In
conclusion, the objectives of this research have been successfully achieved. The research
has addressed the gaps identified in this thesis by: (a) automatically generating a
collection of multiple semantic events using a generic approach through an NLP tool
pipeline, and (b) identifying the relative importance of key semantic events based on
the linguistic properties of the sentence.
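The idea of key event indexing over a weighted graph-based text can be illustrated with a minimal sketch. It assumes, purely for illustration, that each edge between two collocate words is weighted by the product of a co-occurrence frequency F and a thematic weight T, and that words are ranked with a standard PageRank power iteration. The word pairs, the F and T values, and the function name `weighted_pagerank` are all hypothetical; they are not taken from the thesis data or its weight definitions.

```python
from collections import defaultdict


def weighted_pagerank(edges, damping=0.85, iterations=100):
    """Rank nodes of an undirected weighted graph by PageRank.

    edges: dict mapping (word_a, word_b) pairs to a positive edge weight.
    Returns a dict mapping each word to its score (scores sum to 1).
    """
    # Build a symmetric neighbour table because the graph is undirected.
    neighbours = defaultdict(dict)
    for (u, v), weight in edges.items():
        neighbours[u][v] = weight
        neighbours[v][u] = weight

    nodes = sorted(neighbours)
    n = len(nodes)
    # Total edge weight attached to each node, used to normalise its outflow.
    strength = {u: sum(neighbours[u].values()) for u in nodes}

    score = {u: 1.0 / n for u in nodes}
    for _ in range(iterations):
        score = {
            u: (1 - damping) / n
            + damping
            * sum(score[v] * w / strength[v] for v, w in neighbours[u].items())
            for u in nodes
        }
    return score


# Hypothetical collocate pairs with (frequency F, thematic weight T).
pairs = {
    ("acquire", "company"): (3, 1.0),
    ("acquire", "shares"): (4, 0.5),
    ("announce", "company"): (1, 1.0),
}
# Assumed edge weight: F * T.
edges = {pair: f * t for pair, (f, t) in pairs.items()}

scores = weighted_pagerank(edges)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

In this toy graph the word with the largest total weighted association ranks first. Unlike a bag-of-words count, the score of a word depends on which words it co-occurs with and how strongly.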
SEMANTIC EVENT EXTRACTION FOR UNSTRUCTURED TEXT BASED ON PROMINENCE AND
DISCOURSE-LEVEL DEPENDENCIES
ABSTRAK
Research on semantic event extraction has been explored in the field of natural
language processing for summarization and text mining. Much of this research has yet
to examine the automation of semantic event extraction and representation for text.
Access to semantic databases for related research is limited to PropBank, FrameNet and
VerbNet data. These semantic databases can be expanded by adding other semantic data
produced by research findings. Various studies such as EVENT, SEM and LODE, which
extract semantic events based on models, stop at obtaining a collection of events.
Further research beyond this collection to study the interpretation and abstraction of
knowledge based on semantic events has not yet been widely explored. In addition, there
has been little research on deriving the relative importance of events in complex
sentences for event indexing. The resulting index can be used as a weight for the
acquired knowledge base.
The objective of this research is to propose a framework capable of extracting
semantically relevant key events based on thematic hierarchy and discourse-level
dependencies in order to identify the relationships and relative importance of semantic
events. This research has explored and produced designs for: i) a natural language
processing pipeline to extract and annotate semantic events automatically; ii) a
Linguistic Window Model to interpret multiple events in complex sentences; and iii) the
definition of new weights (based on the Linguistic Window Model) for indexing events
using graph-based text.
This research has proposed a new method, EveSem, a natural language processing
pipeline to extract and annotate semantic events automatically. EveSem is able to give
slightly better results than TIPSemB-1.0. It has been extended to create the Linguistic
Window Model. This window has a linguistic structure that was found to improve the
F1-score for event extraction compared with ACE data. The thematic hierarchy and
discourse-level dependencies properties of this linguistic structure were also found to
drastically improve recall in identifying event "triggers" compared with ACE data. Based
on the thematic hierarchy, new weights for constructing graph-based text have been
defined. They were found to improve the indexing of the relative importance of key
events in complex sentences.
The evaluation results show that the natural language processing pipeline is able
to extract and represent multiple events with XML tags. The collection of semantic
event data produced is a contribution to the growth of the collection of semantic
databases. This approach is domain generic and can be applied to other languages
provided that the language concerned has NLP tools for processing data. The Linguistic
Window Model can improve the F1-score in event extraction compared with ACE data.
This model has an advantage over the bag-of-words (BOW) model since it takes into
account the context and thematic relations between words when indexing events. In
conclusion, the objectives of this research have been successfully achieved. This
research has addressed the problems identified as research gaps by (a) automatically
generating a collection of multiple semantic events through a generic method using a
natural language processing pipeline, and (b) identifying the relative importance of key
semantic events based on the linguistic properties of the sentence.
TABLE OF CONTENTS
Page
DECLARATION
ACKNOWLEDGEMENTS iii
ABSTRACT iv
ABSTRAK vi
TABLE OF CONTENTS ix
LIST OF FIGURES xv
LIST OF TABLES xxii
LIST OF ABBREVIATIONS xxvi
CHAPTER 1 INTRODUCTION
1.1 Research Motivation and Problems 1
1.2 Research Questions 7
1.3 Research Objectives 9
1.4 Research Framework 11
1.5 Expected Contributions 14
1.6 Thesis Outline 15
1.7 Summary and Conclusion 16
CHAPTER 2 LITERATURE REVIEW
2.1 Introduction 17
2.2 What is an event? 17
2.3 Event Extraction Techniques 21
2.4 Argument Realization in Natural Language Text 25
2.5 Thematic Role Theory 30
2.5.1 Thematic Hierarchy (Jackendoff 1972) 33
2.5.2 Thematic Hierarchy (Grimshaw 1990) 35
2.5.3 Case Grammar (Fillmore 1968) 36
2.5.4 Proto-roles (Dowty 1989, 1991) 38
2.6 Semantic Role Labelling (SRL) 40
2.6.1 Datasets 41
2.6.2 Applications of Semantic Role Labelling 52
2.7 TimeML Specification Event Extraction 54
2.7.1 Related Research 59
2.7.2 Discussion 60
2.8 Semantic Event Extraction 61
2.8.1 Related Research 61
2.8.2 Discussion 64
2.9 Key Event Indexing 67
2.9.1 Why Graph-based Text? 68
2.9.2 Graph Theory 71
2.9.3 Explicit Representations of Graph 73
2.9.4 Graph Centrality Measures 75
2.9.5 Graph-based Text Representation 79
2.9.6 Syntax-based Collocation 98
2.10 Conclusion 121
CHAPTER 3 NLP SEMANTIC EVENT EXTRACTION PIPELINE:
DESIGN AND IMPLEMENTATION
3.1 Introduction 123
3.2 Semantic Event Argument Realization in Written Text 124
3.3 NLP for Dependency PAS Extraction 126
3.4 Implementation of Semantic Event Extraction Pipeline 128
3.4.1 Data 128
3.4.2 LTH Semantic Dependency Parser 128
3.4.3 Semantic Event Extraction 130
3.4.4 Enriching PropBank with VerbNet Semantic Role
Label 133
3.4.5 XML Semantic Event Annotation 138
3.4.6 EveSem Pipeline Overview 141
3.5 Summary and Conclusion 142
CHAPTER 4 LINGUISTIC WINDOW MODEL FOR KEY EVENT
INDEXING: DESIGN AND IMPLEMENTATION
4.1 Introduction 144
4.2 Linguistic Window Model 145
4.2.1 Nested Event in Dependency PAS 145
4.2.2 Linguistic Motivated Collocation 151
4.2.3 Linguistic Window of Collocate Word 152
4.2.4 Implementation 155
4.3 Key Event Indexing 160
4.3.1 Graph-based Text Definitions 160
4.3.2 Graph Centrality Measures 163
4.3.3 Implementation 170
4.4 Summary and Conclusion 178
CHAPTER 5 DATA EXPERIMENTATIONS AND ANALYSES
5.1 Introduction 180
5.2 Evaluation 1: Natural Language (NL) Semantic Event
Extraction 182
5.2.1 Results and Analysis 183
5.2.2 Discussion 183
5.3 Evaluation 2: Linguistic Window Model - Event Extraction
Based on Dependency SRL PAS 185
5.3.1 Results and Analysis 185
5.3.2 Discussion 188
5.4 Evaluation 3: Linguistic Window Model - Linguistic
Window (E-SR Window) of Collocate Word Extraction 191
5.4.1 Data 191
5.4.2 Pre-processing 191
5.4.3 Graph Centrality Measures 192
5.4.4 Discussion 212
5.5 Summary and Conclusion 216
CHAPTER 6 CONCLUSIONS AND FUTURE WORK
6.1 Introduction 219
6.2 Capturing and Annotating Multiple Semantic Events in a
Semantic Representation Format (EveSem) (P1, P2, RO1,
ECR1) 220
6.3 A Linguistically Injected Event Model (Linguistic Window
Model) (P3, RO2, ERC2) 221
6.4 New Defined Weights Based on Linguistic Window Model
for Graph-based Text Key Event Indexing (P4, RO3, ERC3) 222
6.5 Event-based Knowledge Representation 224
6.6 Future Works 226
6.6.1 Event-based Knowledge Research 226
6.6.2 Extending EveSem 226
6.6.3 Extending Linguistic Window Model 227
6.6.4 Expanding XML Annotated Semantic Event
Corpus 228
6.7 Conclusion 228
BIBLIOGRAPHY 230
APPENDICES
A Dependency Structure 263
B Expert Annotation Data 266
C List of publications 283
LIST OF FIGURES
Figure No. Page
1.1 Example of Syntactic Ambiguity with Two Interpretations 1
1.2 Example of Semantic Ambiguity with Two Interpretations 2
1.3 A comparison between Exner & Nugues Event Extraction
(Exner & Nugues 2011) with the Proposed Research 6
1.4 A simple event example 8
1.5 A multiple event example 8
1.6 Overview of Research Framework 12
2.1 Relationship between Tense and Aspect on a Timeline 19
2.2 Data Driven Parsing Framework 28
2.3 Arguments Realization in Written Text 29
2.4 Centrality of Verb in Syntax 30
2.5 Abstract Representation of Conceptual Composition 31
2.6 Simplified Features of Verb die 34
2.7 A-structure by Grimshaw (1990) 35
2.8 D-structure in Relation to Argument Structure 35
2.9 Event and Sub-event Structure 36
2.10 Grimshaw Arguments of Verbs and Nouns 37
2.11 Dependency Syntax and Role Semantics Annotation 40
2.12 Frame and Subframe Examples 44
2.13 VerbNet Class “run-51.3.2” 49
2.14 VerbNet Class “chase-51.6” 49
2.15 VerbNet Semantic Role Hierarchy 51
2.16 Mapping of FrameNet SRL to Template Slots 52
2.17 Question Answering using Semantic Frame 53
2.18 Isomorphic Frame-Semantic of an English–Swedish Sentence
Pair 54
2.19 Motivation of Semantic Event Extraction and Annotation with
NLP Pipeline 66
2.20 A Non-directed Graph with Adjacency Relation and Degree of
Node 72
2.21 Directed Graph with Weighted Edge and Degree of Node 72
2.22 Adjacency Matrix Representation of Graph 74
2.23 Adjacency List Representation of Graph 75
2.24 Feasible Direction of Research 99
2.25 Proposed Novel Research 119
3.1 Two Stages of Proposed Research 124
3.2 An Example of Thematic Hierarchy 125
3.3 Examples for Output of Text Pre-processing 126
3.4 An Example of Dependency Parsing and PAS Representation 127
3.5 An Example of SRL Based Dependency Parsing and PAS
Representation 127
3.6 LTH Parser Architecture (Bjorkelund et al. 2010) 129
3.7 Four Steps of SRL Module in LTH Parser (Bjorkelund et al.
2010) 130
3.8 CoNLL-2008 Shared Task Data Format 130
3.9 Output of scripts/preprocess.sh 131
3.10 Output of scripts/run.sh (wsj_0006_2.output) 131
3.11 SRL of the Extracted Verbs and Nouns 133
3.12 Verbal and Nominal Predicate-argument Extraction 133
3.13 Excerpt of Mapping PropBank Semantic Role to VerbNet Theta
Role 135
3.14 Algorithm 1 (Pseudocode for Mapping PropBank Verb to
VerbNet Verb) 135
3.15 Write to Text File 136
3.16 Mapping of PropBank Verb to VerbNet Verb with SemLink 136
3.17 Algorithm 2 (Pseudocode for VerbNet Semantic Predicate
Extraction) 137
3.18 Write to Textfile 137
3.19 VerbNet Semantic Predicate Extraction Process 138
3.20 Algorithm 3 (Pseudocode for XML Annotation) 139
3.21 XML Annotated Document Structure 140
3.22 XML Annotated File 140
4.1 Direct Parent-child Dependency of PAS 146
4.2 An Example of Direct Dependency PAS 147
4.3 An Example of Direct Dependency PAS 147
4.4 An Example of Direct Dependency PAS 148
4.5 Indirect Parent-child Dependency of PAS 148
4.6 An Example of Indirect Dependency PAS 149
4.7 An Example of Indirect Dependency PAS 150
4.8 Graphical Representation of the Collocate Window for
Different PAS ‘frame’ 156
4.9 Tree Representation of PAS 156
4.10 Raw Text, POS Tagging and Extracted Verb-nouns 159
4.11 Algorithm 4 (Pseudocode for Verb-noun Extraction) 160
4.12 An example of Degree Graph Centrality Measure 165
4.13 Power Iteration Method 167
4.14 An example of Eigenvector Graph Centrality Measure 167
4.15 Non-directional Weighted Collocate Word Graph Construction 171
4.16 An Undirected Weighted Graph 172
4.17 Graphical Representation of SRL Collocate Window for PAS
‘frame’ 1-4 173
4.18 Predicate Argument of each PAS ‘frame’ 175
4.19 Non-directional Thematic Hierarchy Weighted Graph 175
4.20 Adjacency Matrix of Non-Directional Event Association
Weighted Graph 178
5.1 Overall View of the Three Evaluations 181
5.2 An Example of Nominal Event Identification for EveSem and
TipSem 185
5.3 Nested Event in Linguistic Window Model: an Event
Extraction Example with Dependency SRL PAS. 186
5.4 Nested Event in Linguistic Window Model Relation Mentions
and their types. 187
5.5 ACE Event Mentions and their types 188
5.6 Degree Graph Centrality: An Example with wsj_0006_2.txt 193
5.7 Eigenvector Graph Centrality: An Example with
wsj_0006_2.txt 194
5.8 PageRank Graph Centrality Measure: An Example with
wsj_0006_2.txt 195
5.9 Human Expert Data Annotation Information 196
5.10 Graph Centrality Measures Output Information: An Example
with wsj_0006_2.txt 200
5.11 Expert Data Evaluated Against Top Five Ranking Words
for Each Graph Centrality Measure: An Example with
wsj_0006_2.txt 201
5.12 Comparing A1 and A2 Top Five Ranking Words using
PageRank(F ∗ T) Graph Centrality Measure as a Reference
Method: An Example with 5 Sample Data 202
5.13 An Example of Inter-rater Kappa Coefficient Computation for
A1 and A2 using PageRank (F ∗ T) as the Reference Method
for the five Sample Data 203
5.14 Inter-rater Standard Error Computation of Kappa Coefficient
for the five Sample Data 203
5.15 Comparing Top Five Ranking Words of PageRank (F ∗ T) as
the Reference Method with other Graph Centrality Measure:
An Example with wsj_0006_2.txt 207
5.16 An Example of Inter-method Kappa Coefficient Computation
for Eigenvector (F) vs. Eigenvector(F ∗ T) using PageRank
(F ∗ T) as the Reference Method for the Sample Data 209
5.17 Inter-method Standard Error Computation of Kappa Coefficient
for the five Sample Data 210
5.18 Linguistic Window of Collocate Word Extraction 213
6.1 Augmented Multilayer-representation Structure with Graph
of Key Event Indexed Weight for Event-based Knowledge
Representation 225
A.1 Projective Graph 265
LIST OF TABLES
Table No. Page
1.1 Correct Syntactic Interpretation with SRL 2
1.2 Correct Semantic Interpretation with SRL 2
1.3 An Example of SRL Tagged Sentence 3
2.1 Different Context of Event Definition 22
2.2 Event Extraction Techniques 26
2.3 Thematic Relation by Jackendoff (1972) 33
2.4 Semantic Roles 37
2.5 EDUCATION_TEACHING Frame 43
2.6 COMMERCE Frame 43
2.7 Argument Label 46
2.8 ARGMs Modifier Tag 46
2.9 An Example of PropBank Annotation 47
2.10 A Polysemous Verb example 47
2.11 Semantic Role Labels of VerbNet 50
2.12 TIMEX3 Annotation 57
2.13 Event-time/ Event-event Relation Recognition Researches 60
2.14 Semantic Event Extraction Research 65
2.15 Different Semantic Representation of Text 69
2.16 Centrality Measures and their Functions 76
2.17 Graph-based Text Research 88
2.18 Summary of Discussion 96
2.19 Summarization of Graph Edge and Weight Representation 98
2.20 Different Types of Ternary Combinations 102
2.21 Syntax-based Collocation Extraction Research 109
2.22 Summary of Research Discussion 118
2.23 Predicate-argument Structure of Semantic Role Labeller 120
2.24 Feasible Semantic Role Label Syntax-based Analysis for
Collocation Extraction 121
3.1 TimeBank1.2 Information 128
3.2 POS Tags from Penn TreeBank 132
3.3 NL Semantic Event Extraction Pipeline 141
4.1 An example of PAS Predicate 149
4.2 An Example of PAS Argument 150
4.3 Linguistic Window of Word Formation 153
4.4 Linguistically Injected Collocate Window Formation 154