SEMANTIC EVENT EXTRACTION IN UNSTRUCTURED TEXT BASED ON PROMINENCE AND DISCOURSE-LEVEL DEPENDENCIES SIAW NYUK HIONG UNIVERSITI MALAYSIA SARAWAK

SEMANTIC EVENT EXTRACTION IN UNSTRUCTURED TEXT BASED ON PROMINENCE

AND DISCOURSE-LEVEL DEPENDENCIES

SIAW NYUK HIONG

UNIVERSITI MALAYSIA SARAWAK

SEMANTIC EVENT EXTRACTION IN UNSTRUCTURED TEXT BASED ON PROMINENCE AND DISCOURSE-LEVEL DEPENDENCIES

SIAW NYUK HIONG

THESIS SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF

DOCTOR OF PHILOSOPHY

FACULTY OF COMPUTER SCIENCE AND INFORMATION TECHNOLOGY

UNIVERSITI MALAYSIA SARAWAK

2015

DECLARATION

I hereby declare that the work in this thesis is my own except for quotations and summaries which have been duly acknowledged.

7 July 2015 SIAW NYUK HIONG 10011495

ACKNOWLEDGEMENTS

My heartfelt thanks go out to Prof. Dr. Narayanan Kulathuramaiyer who, as my main supervisor, guided me through a great deal of exploration at the initial stage of my research. This has provoked me to always try to think outside the box. Owing to the needs of my research, Assoc. Prof. Dr. Bali Ranaivo-Malançon and Assoc. Prof. Dr. Jane Labadin were also appointed as my co-supervisors. I would like to express my gratitude and appreciation to Assoc. Prof. Dr. Bali Ranaivo-Malançon, who motivated and guided me towards a direction of research that I was able to accomplish. Her professional comments have always helped me to continuously improve my work. Many thanks to Assoc. Prof. Dr. Jane Labadin, who shared her knowledge and views from the mathematical perspective. My appreciation also goes out to Prof. Dr. Zaharin Yusoff, who gave valuable comments on the earlier part of my research. I would also like to thank the three experts who took their precious time to annotate the data. My deepest appreciation also goes out to Dr. Lim Lian Tze and Puan Suhaila Saee, who helped me in using LaTeX. Next, I would like to thank the Ministry of Education Malaysia, which sponsored my full-time study at Unimas. Lastly, I would like to thank my family members, who have been very supportive of my research work.

ABSTRACT

Semantic event extraction has been applied in many natural language processing (NLP) tasks, such as summarization and text mining. However, little research has been carried out on automating the extraction and representation of multiple events. As a result, the semantically annotated corpora available for event extraction are limited to PropBank, FrameNet and VerbNet. This collection can be expanded by adding other semantically annotated event corpora. Many event extraction models, such as EVENT, SEM and LODE, have been proposed, but this research stopped at the collection of events. Research that goes beyond the collection of events to investigate the interpretation and abstraction of event-based knowledge remains largely unexplored. Furthermore, there is a lack of research on key event indexing, which identifies the relative importance of multiple events in a complex sentence. Such an index can be attached as a weight to successfully extracted event-based knowledge.

The main objective of this research is to propose a framework that automates the extraction of semantically relevant key events, based on thematic hierarchy and discourse-level dependencies, in order to determine their relationships and relative importance. This has led to the exploration and formulation of designs to: i) capture and annotate multiple semantic events in a semantic representation format; ii) define a linguistically injected model (the Linguistic Window Model) to interpret multiple events in a complex sentence; iii) define new weights for graph-based text (based on the Linguistic Window Model) for key event indexing.

This research proposes a new method, EveSem, an NLP tool pipeline that automates the extraction and annotation of semantic events. This tool performed marginally better than TIPSemB-1.0. EveSem is then extended into the Linguistic Window Model, whose linguistic structure is found to enhance the F1-score for event extraction when compared to ACE data. The thematic hierarchy and discourse-level dependency properties of the linguistic structure are also found to greatly improve recall over ACE data for "trigger" identification. Based on the thematic hierarchy, new weights are defined to construct weighted graph-based text, which is shown to improve the indexing of the relative importance of key events in complex sentences.

The results show that the NLP tool pipeline successfully extracted and represented multiple events in XML tags. The resulting small XML-annotated corpus of semantic events can be added to the collection of event lexical databases. Furthermore, the approach is domain generic and can be ported to other languages, provided that the necessary NLP tools are available for the language. The Linguistic Window Model is able to extract events with an improved F1-score on the ACE task. The model has an advantage over the bag-of-words (BOW) model for key event indexing, since it takes into account the context of word co-occurrence and the semantic association between words based on the linguistic structure of the model. In conclusion, the objectives of this research have been achieved. The research has addressed the gaps identified in this thesis by: (a) automatically generating a collection of multiple semantic events using a generic approach through a pipeline of NLP tools, and (b) identifying the relative importance of key semantic events based on the linguistic properties of the sentence.

EXTRACTING SEMANTIC EVENTS FOR UNSTRUCTURED TEXT BASED ON PROMINENCE AND DISCOURSE-LEVEL DEPENDENCIES

ABSTRAK

Research on semantic event extraction has been explored in the field of natural language processing for summarization and text mining. Much of this research has yet to examine the automation of event extraction and representation for text. Access to semantic databases for related research is limited to the PropBank, FrameNet and VerbNet data. These semantic databases can be expanded by adding other semantic data produced from research findings. Various studies such as EVENT, SEM and LODE, which extract semantic events based on models, stop at obtaining a collection of events. Further research beyond this collection, to study the interpretation and abstraction of knowledge based on semantic events, has not yet been widely explored. In addition, there is still little research on obtaining the relative importance of events in complex sentences for event indexing. The index obtained can be used as a weight for the acquired knowledge base.

The objective of this research is to propose a framework that is able to extract semantically relevant key events based on thematic hierarchy and discourse-level dependencies, in order to identify the relationships and relative importance of semantic events. This research has explored and produced designs for: i) a natural language processing pipeline to extract and annotate semantic events automatically; ii) a Linguistic Window Model to interpret multiple events in complex sentences; iii) defining new weights (based on the Linguistic Window Model) for indexing events using graph-based text.

This research has proposed a new method, EveSem, a natural language processing pipeline for extracting and annotating semantic events automatically. EveSem gives slightly better results than TIPSemB-1.0. It has been extended to create the Linguistic Window Model. This window has a linguistic structure that is found to improve the F1-score for event extraction compared with ACE data. The thematic hierarchy and discourse-level dependencies features of this linguistic structure are also found to increase recall drastically in identifying event "triggers" compared with ACE data. Based on the thematic hierarchy, new weights for constructing graph-based text have been defined. These are found to improve the index of the relative importance of key events in complex sentences.

The evaluation results show that the natural language processing pipeline is able to extract and represent multiple events with XML tags. The collection of semantic event data produced is a contribution to the expansion of the collection of semantic databases. This approach is domain generic and can be applied to other languages, provided that the language concerned has NLP tools for data processing. The Linguistic Window Model can improve the F1-score in extracting events compared with ACE data. This model has an advantage over the bag-of-words (BOW) model, since it takes into account the context and thematic relations between words when indexing events. In conclusion, the objectives of this research have been successfully achieved. This research has addressed the problems identified as research gaps by (a) automatically generating a collection of multiple semantic events through a generic method using a natural language processing pipeline, and (b) identifying the relative importance of key semantic events based on the linguistic features of the sentence.

TABLE OF CONTENTS

DECLARATION
ACKNOWLEDGEMENTS
ABSTRACT
ABSTRAK
TABLE OF CONTENTS
LIST OF FIGURES
LIST OF TABLES
LIST OF ABBREVIATIONS

CHAPTER 1 INTRODUCTION
1.1 Research Motivation and Problems
1.2 Research Questions
1.3 Research Objectives
1.4 Research Framework
1.5 Expected Contributions
1.6 Thesis Outline
1.7 Summary and Conclusion

CHAPTER 2 LITERATURE REVIEW
2.1 Introduction
2.2 What is an event?
2.3 Event Extraction Techniques
2.4 Argument Realization in Natural Language Text
2.5 Thematic Role Theory
2.5.1 Thematic Hierarchy (Jackendoff 1972)
2.5.2 Thematic Hierarchy (Grimshaw 1990)
2.5.3 Case Grammar (Fillmore 1968)
2.5.4 Proto-roles (Dowty 1989, 1991)
2.6 Semantic Role Labelling (SRL)
2.6.1 Datasets
2.6.2 Applications of Semantic Role Labelling
2.7 TimeML Specification Event Extraction
2.7.1 Related Research
2.7.2 Discussion
2.8 Semantic Event Extraction
2.8.1 Related Research
2.8.2 Discussion
2.9 Key Event Indexing
2.9.1 Why Graph-based Text?
2.9.2 Graph Theory
2.9.3 Explicit Representations of Graph
2.9.4 Graph Centrality Measures
2.9.5 Graph-based Text Representation
2.9.6 Syntax-based Collocation
2.10 Conclusion

CHAPTER 3 NLP SEMANTIC EVENT EXTRACTION PIPELINE: DESIGN AND IMPLEMENTATION
3.1 Introduction
3.2 Semantic Event Argument Realization in Written Text
3.3 NLP for Dependency PAS Extraction
3.4 Implementation of Semantic Event Extraction Pipeline
3.4.1 Data
3.4.2 LTH Semantic Dependency Parser
3.4.3 Semantic Event Extraction
3.4.4 Enriching PropBank with VerbNet Semantic Role Label
3.4.5 XML Semantic Event Annotation
3.4.6 EveSem Pipeline Overview
3.5 Summary and Conclusion

CHAPTER 4 LINGUISTIC WINDOW MODEL FOR KEY EVENT INDEXING: DESIGN AND IMPLEMENTATION
4.1 Introduction
4.2 Linguistic Window Model
4.2.1 Nested Event in Dependency PAS
4.2.2 Linguistic Motivated Collocation
4.2.3 Linguistic Window of Collocate Word
4.2.4 Implementation
4.3 Key Event Indexing
4.3.1 Graph-based Text Definitions
4.3.2 Graph Centrality Measures
4.3.3 Implementation
4.4 Summary and Conclusion

CHAPTER 5 DATA EXPERIMENTATIONS AND ANALYSES
5.1 Introduction
5.2 Evaluation 1: Natural Language (NL) Semantic Event Extraction
5.2.1 Results and Analysis
5.2.2 Discussion
5.3 Evaluation 2: Linguistic Window Model - Event Extraction Based on Dependency SRL PAS
5.3.1 Results and Analysis
5.3.2 Discussion
5.4 Evaluation 3: Linguistic Window Model - Linguistic Window (E-SR Window) of Collocate Word Extraction
5.4.1 Data
5.4.2 Pre-processing
5.4.3 Graph Centrality Measures
5.4.4 Discussion
5.5 Summary and Conclusion

CHAPTER 6 CONCLUSIONS AND FUTURE WORK
6.1 Introduction
6.2 Capturing and Annotating Multiple Semantic Events in a Semantic Representation Format (EveSem) (P1, P2, RO1, ECR1)
6.3 A Linguistically Injected Event Model (Linguistic Window Model) (P3, RO2, ERC2)
6.4 New Defined Weights Based on Linguistic Window Model for Graph-based Text Key Event Indexing (P4, RO3, ERC3)
6.5 Event-based Knowledge Representation
6.6 Future Works
6.6.1 Event-based Knowledge Research
6.6.2 Extending EveSem
6.6.3 Extending Linguistic Window Model
6.6.4 Expanding XML Annotated Semantic Event Corpus
6.7 Conclusion

BIBLIOGRAPHY

APPENDICES
A Dependency Structure
B Expert Annotation Data
C List of publications

LIST OF FIGURES

1.1 Example of Syntactic Ambiguity with Two Interpretations
1.2 Example of Semantic Ambiguity with Two Interpretations
1.3 A comparison between Exner & Nugues Event Extraction (Exner & Nugues 2011) with the Proposed Research
1.4 A simple event example
1.5 A multiple event example
1.6 Overview of Research Framework

2.1 Relationship between Tense and Aspect on a Timeline
2.2 Data Driven Parsing Framework
2.3 Arguments Realization in Written Text
2.4 Centrality of Verb in Syntax
2.5 Abstract Representation of Conceptual Composition
2.6 Simplified Features of Verb die
2.7 A-structure by Grimshaw (1990)
2.8 D-structure in Relation to Argument Structure
2.9 Event and Sub-event Structure
2.10 Grimshaw Arguments of Verbs and Nouns
2.11 Dependency Syntax and Role Semantics Annotation
2.12 Frame and Subframe Examples
2.13 VerbNet Class “run-51.3.2”
2.14 VerbNet Class “chase-51.6”
2.15 VerbNet Semantic Role Hierarchy
2.16 Mapping of FrameNet SRL to Template Slots
2.17 Question Answering using Semantic Frame
2.18 Isomorphic Frame-Semantic of an English–Swedish Sentence Pair
2.19 Motivation of Semantic Event Extraction and Annotation with NLP Pipeline
2.20 A Non-directed Graph with Adjacency Relation and Degree of Node
2.21 Directed Graph with Weighted Edge and Degree of Node
2.22 Adjacency Matrix Representation of Graph
2.23 Adjacency List Representation of Graph
2.24 Feasible Direction of Research
2.25 Proposed Novel Research

3.1 Two Stages of Proposed Research
3.2 An Example of Thematic Hierarchy
3.3 Examples for Output of Text Pre-processing
3.4 An Example of Dependency Parsing and PAS Representation
3.5 An Example of SRL Based Dependency Parsing and PAS Representation
3.6 LTH Parser Architecture (Bjorkelund et al. 2010)
3.7 Four Steps of SRL Module in LTH Parser (Bjorkelund et al. 2010)
3.8 CoNLL-2008 Shared Task Data Format
3.9 Output of scripts/preprocess.sh
3.10 Output of scripts/run.sh (wsj_0006_2.output)
3.11 SRL of the Extracted Verbs and Nouns
3.12 Verbal and Nominal Predicate-argument Extraction
3.13 Excerpt of Mapping PropBank Semantic Role to VerbNet Theta Role
3.14 Algorithm 1 (Pseudocode for Mapping PropBank Verb to VerbNet Verb)
3.15 Write to Text File
3.16 Mapping of PropBank Verb to VerbNet Verb with SemLink
3.17 Algorithm 2 (Pseudocode for VerbNet Semantic Predicate Extraction)
3.18 Write to Textfile
3.19 VerbNet Semantic Predicate Extraction Process
3.20 Algorithm 3 (Pseudocode for XML Annotation)
3.21 XML Annotated Document Structure
3.22 XML Annotated File

4.1 Direct Parent-child Dependency of PAS
4.2 An Example of Direct Dependency PAS
4.3 An Example of Direct Dependency PAS
4.4 An Example of Direct Dependency PAS
4.5 Indirect Parent-child Dependency of PAS
4.6 An Example of Indirect Dependency PAS
4.7 An Example of Indirect Dependency PAS
4.8 Graphical Representation of the Collocate Window for Different PAS ‘frame’
4.9 Tree Representation of PAS
4.10 Raw Text, POS Tagging and Extracted Verb-nouns
4.11 Algorithm 4 (Pseudocode for Verb-noun Extraction)
4.12 An example of Degree Graph Centrality Measure
4.13 Power Iteration Method
4.14 An example of Eigenvector Graph Centrality Measure
4.15 Non-directional Weighted Collocate Word Graph Construction
4.16 An Undirected Weighted Graph
4.17 Graphical Representation of SRL Collocate Window for PAS ‘frame’ 1-4
4.18 Predicate Argument of each PAS ‘frame’
4.19 Non-directional Thematic Hierarchy Weighted Graph
4.20 Adjacency Matrix of Non-Directional Event Association Weighted Graph

5.1 Overall View of the Three Evaluations
5.2 An Example of Nominal Event Identification for EveSem and TipSem
5.3 Nested Event in Linguistic Window Model: an Event Extraction Example with Dependency SRL PAS.
5.4 Nested Event in Linguistic Window Model Relation Mentions and their types.
5.5 ACE Event Mentions and their types
5.6 Degree Graph Centrality: An Example with wsj_0006_2.txt
5.7 Eigenvector Graph Centrality: An Example with wsj_0006_2.txt
5.8 PageRank Graph Centrality Measure: An Example with wsj_0006_2.txt
5.9 Human Expert Data Annotation Information
5.10 Graph Centrality Measures Output Information: An Example with wsj_0006_2.txt
5.11 Expert Data Evaluated Against Top Five Ranking Words for Each Graph Centrality Measure: An Example with wsj_0006_2.txt
5.12 Comparing A1 and A2 Top Five Ranking Words using PageRank (F ∗ T) Graph Centrality Measure as a Reference Method: An Example with 5 Sample Data
5.13 An Example of Inter-rater Kappa Coefficient Computation for A1 and A2 using PageRank (F ∗ T) as the Reference Method for the five Sample Data
5.14 Inter-rater Standard Error Computation of Kappa Coefficient for the five Sample Data
5.15 Comparing Top Five Ranking Words of PageRank (F ∗ T) as the Reference Method with other Graph Centrality Measure: An Example with wsj_0006_2.txt
5.16 An Example of Inter-method Kappa Coefficient Computation for Eigenvector (F) vs. Eigenvector (F ∗ T) using PageRank (F ∗ T) as the Reference Method for the Sample Data
5.17 Inter-method Standard Error Computation of Kappa Coefficient for the five Sample Data
5.18 Linguistic Window of Collocate Word Extraction

6.1 Augmented Multilayer-representation Structure with Graph of Key Event Indexed Weight for Event-based Knowledge Representation

A.1 Projective Graph

LIST OF TABLES

1.1 Correct Syntactic Interpretation with SRL
1.2 Correct Semantic Interpretation with SRL
1.3 An Example of SRL Tagged Sentence

2.1 Different Context of Event Definition
2.2 Event Extraction Techniques
2.3 Thematic Relation by Jackendoff (1972)
2.4 Semantic Roles
2.5 EDUCATION TEACHING Frame
2.6 COMMERCE Frame
2.7 Argument Label
2.8 ARGMs Modifier Tag
2.9 An Example of PropBank Annotation
2.10 A Polysemous Verb example
2.11 Semantic Role Labels of VerbNet
2.12 TIMEX3 Annotation
2.13 Event-time/ Event-event Relation Recognition Researches
2.14 Semantic Event Extraction Research
2.15 Different Semantic Representation of Text
2.16 Centrality Measures and their Functions
2.17 Graph-based Text Research
2.18 Summary of Discussion
2.19 Summarization of Graph Edge and Weight Representation
2.20 Different Types of Ternary Combinations
2.21 Syntax-based Collocation Extraction Research
2.22 Summary of Research Discussion
2.23 Predicate-argument Structure of Semantic Role Labeller
2.24 Feasible Semantic Role Label Syntax-based Analysis for Collocation Extraction

3.1 TimeBank1.2 Information
3.2 POS Tags from Penn TreeBank
3.3 NL Semantic Event Extraction Pipeline

4.1 An example of PAS Predicate
4.2 An Example of PAS Argument
4.3 Linguistic Window of Word Formation
4.4 Linguistically Injected Collocate Window Formation