
DISTANTLY SUPERVISED INFORMATION EXTRACTION USING

BOOTSTRAPPED PATTERNS

A DISSERTATION

SUBMITTED TO THE DEPARTMENT OF COMPUTER SCIENCE

AND THE COMMITTEE ON GRADUATE STUDIES

OF STANFORD UNIVERSITY

IN PARTIAL FULFILLMENT OF THE REQUIREMENTS

FOR THE DEGREE OF

DOCTOR OF PHILOSOPHY

Sonal Gupta

June 2015


© 2015 by Sonal Gupta. All Rights Reserved.

Re-distributed by Stanford University under license with the author.

This work is licensed under a Creative Commons Attribution-Noncommercial 3.0 United States License: http://creativecommons.org/licenses/by-nc/3.0/us/

This dissertation is online at: http://purl.stanford.edu/nt508qx3506


I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Christopher Manning, Primary Adviser

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Jeffrey Heer

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Percy Liang

Approved for the Stanford University Committee on Graduate Studies.

Patricia J. Gumport, Vice Provost for Graduate Education

This signature page was generated electronically upon submission of this dissertation in electronic format. An original signed hard copy of the signature page is on file in University Archives.


Abstract

Information extraction (IE) involves extracting information such as entities, relations, and events from unstructured text. Although most work in IE focuses on tasks that have abundant training data by exploiting supervised machine learning techniques, in practice, most IE problems do not have any supervised training data available. Learning conditional random fields (CRFs), a state-of-the-art supervised approach, is impractical for such real world applications because: (1) they require large and expensive labeled corpora, and (2) it is difficult to interpret them and analyze errors, an often-ignored but important feature.

This dissertation focuses on information extraction for tasks that have no labeled data available, apart from some seed examples. Supervision using seed examples is usually easier to obtain than fully labeled sentences. In addition, for many tasks, the seed examples can be acquired using existing resources like Wikipedia and other human curated knowledge bases.

I present Bootstrapped Pattern Learning (BPL), an iterative pattern and entity learning method, as an effective and interpretable approach to entity extraction tasks with only seed examples as supervision. I propose two new tasks: (1) extracting key aspects from scientific articles to study the influence of sub-communities of a research community, and (2) extracting medical entities from online web forums. For the first task, I propose three new categories of key aspects and a new definition of influence based on those key aspects. This dissertation is the first work to address the second task of extracting drugs & treatments and symptoms & conditions entities from patient-authored text. Extracting these entities can aid in studying the efficacy and side effects of drugs and home remedies at a large scale. I show that BPL, using either dependency patterns or lexico-syntactic surface-word patterns, is an effective approach to both problems, outperforming existing tools and CRFs.

Similar to most bootstrapped or semi-supervised systems, BPL systems developed earlier either ignore the unlabeled data or make closed world assumptions about it, resulting in less accurate classifiers. To address this problem, I propose improvements to BPL’s pattern and entity scoring functions by evaluating the unlabeled entities using unsupervised similarity measures, such as word embeddings and contrasting domain-specific and general text. I improve the entity classifier of BPL by expanding the training sets using similarity computed by distributed representations of entities. My systems successfully leverage unlabeled data and significantly outperform the baselines by not making closed world assumptions.

Developing any learning system usually requires a developer-in-the-loop to tune the parameters. I utilize the interpretability of patterns to humans, a highly desirable attribute for industrial applications, to develop a new diagnostic tool for visualizing the output of multiple pattern-based entity learning systems. Such comparisons can help in diagnosing errors faster, resulting in a shorter and easier development cycle. I make the source code of all tools developed in this dissertation publicly available.


To my wonderful parents, Arvind and Sudha Gupta, and my partner-in-crime, Apurva.


Acknowledgements

I consider myself lucky to have had great advisors during my graduate life. Thank you Chris for your insightful short and long answers, for showing me the right path whenever I was in doubt, and for encouraging me to pursue my research whenever I felt disheartened. You gave me all the freedom to work on the research projects I was excited about. Your honest and constructive feedback helped me learn how to critically assess ideas and their implementations. Very often graduate students worry about not publishing enough – I did too. A lot. Thanks so much for the persistent advice that the number of papers does not matter; what matters is the quality of the research I do and whether I relish the process. I think it is one of the best pieces of advice I have ever gotten.

I am also thankful to my committee members, Jeff and Percy. Jeff, I really enjoyed our conversations and the brainstorming sessions. I admire your clear thinking and great ideas. The research project with you and Diana opened the door for many successful projects subsequently. I have always appreciated your encouragement and high-spiritedness. Percy, you are so smart and yet so grounded!! Thanks so much for all the great, thoughtful feedback on my research and this dissertation.

I am indebted to Ray Mooney for becoming my mentor during my master's at UT Austin. Ray, it is fair to say that this PhD would not have been possible without you. Even though I spent only two years with you, I learned skills for a lifetime. Some of my best memories are from the time we spent together sightseeing after ECML in Belgium and NAACL in Los Angeles.

The research in this dissertation has been possible because of the incredible people around me. First and foremost: Diana, the collaboration and friendship with you has been one of the highlights of my time at Stanford. Jason and Sanjay, it was a lot of fun to work with both of you. Thank you Val and Angel for labeling the inter-annotator data for studying the key aspects of scientific articles. Thank you DanJ and Chris for mapping the ACL Anthology topics to communities. Whenever someone asked me about the validity of the mapping, it felt great to point them to two universally-acknowledged experts in NLP! Thanks to Eric Xing, Kriti, and Jacob for the fun and exciting collaboration during my quarter at CMU.

The best thing about Stanford is its grad students – brilliant and yet so approachable. It has been amazing to be a part of the NLP group at Stanford. I have thoroughly enjoyed hanging out with the group during the almost-daily afternoon tea time (that is, procrastination time). I also have fond memories of the various group hikes and the NLP group retreat. People in the 2A wing – you know who you are – thanks so much for being there whenever I needed you.

Finally, I am grateful to my family and friends. Diana, Isa, Nick, Nisha, Reyes, Suyash, Tejo: thanks for being the stress busters I sorely needed over the years. My parents, Arvind and Sudha Gupta, are the reason I am here. Their dedication and love have always been selfless, and I do not think I can ever repay their sacrifices. My sister Anshu and brother Ankur have always been there for me. Thanks so much! I am very fortunate to have the best parents-in-law – Sandhya and Prakash Samudra. Their love and encouragement have been unconditional. Words are not enough to express my gratitude and love towards my husband, Apurva. We lived apart for many years so that we could pursue our own dreams; however, that distance never came between us. Apurva, you have always been my best friend and counselor, and always will be.


Contents

Abstract

Acknowledgements

1 Introduction
  1.1 Overview
    1.1.1 Distant Supervision
    1.1.2 Pattern-based learning
    1.1.3 Bootstrapped Pattern Learning
    1.1.4 Challenges with unlabeled data
  1.2 Contributions
  1.3 Dissertation Structure

2 Background
  2.1 Entity Extraction Task
    2.1.1 Evaluation
  2.2 Patterns
    2.2.1 Lexico-syntactic Surface word Patterns
    2.2.2 Dependency Patterns
  2.3 Bootstrapped Pattern Learning
  2.4 Classifiers and Entity Features
  2.5 Dataset: MedHelp Patient Authored Text

3 Related Work
  3.1 Pattern-based systems
    3.1.1 Fully supervised
    3.1.2 Distantly supervised
  3.2 Distantly supervised Non-pattern-based systems
  3.3 Distantly supervised Hybrid systems
    3.3.1 Open IE systems

4 Studying Scientific Articles and Communities
  4.1 Introduction
  4.2 Related Work: Scientific Study
  4.3 Approach
    4.3.1 Extraction
    4.3.2 Communities and their Influence
  4.4 Experiments
  4.5 Results
  4.6 Further Reading
  4.7 Conclusion

5 Information Extraction on Medical Forums
  5.1 Introduction
  5.2 Objective
  5.3 Related Work: Medical IE
  5.4 Materials and Methods
    5.4.1 Dataset
  5.5 Inducing Lexico-Syntactic Patterns
    5.5.1 Creating Patterns
    5.5.2 Learning Patterns
    5.5.3 Learning Phrases
  5.6 Evaluation Setup
    5.6.1 Test Data
    5.6.2 Metrics
    5.6.3 Baselines
  5.7 Results
    5.7.1 Analysis
    5.7.2 Case study: Anecdotal Efficacy
  5.8 Additional Experiments
  5.9 Future Work
  5.10 Conclusion

6 Leveraging Unlabeled Data
  6.1 Introduction
  6.2 Related Work
  6.3 Approach
    6.3.1 Creating Patterns
    6.3.2 Scoring Patterns
    6.3.3 Learning Entities
  6.4 Experiments
    6.4.1 Dataset
    6.4.2 Labeling Guidelines
    6.4.3 Baselines
    6.4.4 Experimental Setup
    6.4.5 Results
  6.5 Discussion and Conclusion

7 Word Embeddings Improve Entity Classifiers
  7.1 Introduction
  7.2 Related Work
  7.3 Approach
  7.4 Experimental Setup
  7.5 Results and Discussion
  7.6 Conclusion

8 Visualizing and Diagnosing BPL
  8.1 Introduction
  8.2 Learning Patterns and Entities
  8.3 Design Criteria
  8.4 Visualizing Diagnostic Information
  8.5 System Details
  8.6 Related Work
  8.7 Future Work and Conclusion

9 Conclusions

A Stop Words List


List of Tables

2.1 Examples of patterns and how they match to sentences.

2.2 A few examples of sentences from the MedHelp forum.

3.1 A few examples to give an idea about the types of IE systems developed based on the amount of supervision and the type of models.

4.1 Some examples of dependency patterns that extract information from dependency trees of sentences.

4.2 Extracted phrases for some papers.

4.3 Examples of patterns learned using the iterative extraction algorithm.

4.4 The precision, recall and F1 scores of each category for the different approaches.

4.5 The top 5 influential communities with their most influential phrases.

4.6 The next 5 influential communities with their most influential phrases.

4.7 The community in the first column has been influenced the most by the communities in the second column.

4.8 Comparison of our BPL-based approach and supervised CRF for the task.

5.1 F1 scores for labeling with Dictionaries using different types of labeling schemes.

5.2 Token-level Precision, Recall, and F1 scores of our system and the baselines on the Asthma forum for the label DT.

5.3 Token-level Precision, Recall, and F1 scores of our system and the baselines on the Asthma forum for the label SC.

5.4 Token-level Precision, Recall, and F1 scores of our system and the baselines on the ENT forum for the label DT.

5.5 Token-level Precision, Recall, and F1 scores of our system and the baselines on the ENT forum for the label SC.

5.6 Entity-level Precision, Recall, and F1 scores of our system and the baselines on the Asthma forum for the label DT.

5.7 Entity-level Precision, Recall, and F1 scores of our system and the baselines on the Asthma forum for the label SC.

5.8 Entity-level Precision, Recall, and F1 scores of our system and the baselines on the ENT forum for the label DT.

5.9 Entity-level Precision, Recall, and F1 scores of our system and the baselines on the ENT forum for the label SC.

5.10 Top 10 patterns learned for the label DT on the Asthma forum.

5.11 Top 10 patterns learned for the label SC on the Asthma forum.

5.12 Top 10 patterns learned for the label DT on the ENT forum.

5.13 Top 10 patterns learned for the label SC on the ENT forum.

5.14 Precision, Recall, and F1 scores of systems that use pattern matching when labeling data and our system on the Asthma forum.

5.15 Precision, Recall, and F1 scores of systems that use pattern matching when labeling data and our system on the ENT forum.

5.16 Effects of use of GoogleCommonList in OBA on the Asthma forum.

5.17 Effects of use of GoogleCommonList in OBA on the ENT forum.

5.18 Effects of use of GoogleCommonList in MetaMap on the Asthma forum.

5.19 Effects of use of GoogleCommonList in MetaMap on the ENT forum.

5.20 Scores when our system is run with different phrase threshold values.

5.21 Scores when our system is run with different pattern threshold values.

5.22 Scores when our system is run with different values of N and T.

5.23 Scores when our system is run with different values of K.

6.1 Area under Precision-Recall curves of the systems.

6.2 Individual feature effectiveness: Area under Precision-Recall curves when our system uses individual features during pattern scoring.

6.3 Feature ablation study: Area under Precision-Recall curves when individual features are removed from our system during pattern scoring.

6.4 Example patterns and the entities extracted by them, along with the rank at which the pattern was added to the list of learned patterns.

6.5 Top 10 (simplified) patterns learned by our system and RlogF-PUN from the ENT forum.

7.1 Area under Precision-Recall curve for all the systems.

7.2 Examples of unlabeled entities that were expanded into the training sets.


List of Figures

1.1 Seed examples can often be automatically curated using existing resources.

1.2 Percentage of commercial entity extraction systems that use rule-based, machine learning-based, or hybrid systems.

2.1 The dependency tree for ‘We work on extracting information using dependency graphs’.

2.2 A flowchart of various steps in a bootstrapped pattern-based entity learning system.

2.3 An example pattern learning system for the class ‘animals’ from the text.

4.1 The F1 scores for TECHNIQUE and DOMAIN categories after every five iterations.

4.2 The influence scores of communities in each year.

4.3 The popularity of communities in each year.

4.4 The influence scores of machine translation related communities.

4.5 Popularity of machine translation communities in each year.

5.1 Top 15 phrases extracted for the Asthma and the ENT forums.

5.2 Top 50 DT phrases extracted by our system for three different forums.

5.3 Top 50 SC phrases extracted by our system for three different forums.

5.4 Top DT and SC phrases extracted by our system, MetaMap, and MetaMap-C for the Diabetes forum.

5.5 Study of efficacy of ‘Cinnamon’ and ‘Vinegar’, two DTs extracted by our system, for treating Type II Diabetes.

6.1 An example pattern learning system for the class ‘animals’ from the text starting with the seed entity ‘dog’.

6.2 Precision vs. Recall curves of our system and the baselines for the Asthma forum.

6.3 Precision vs. Recall curves of our system and the baselines for the ENT forum.

6.4 Precision vs. Recall curves of our system and the baselines for the Acne forum.

6.5 Precision vs. Recall curves of our system and the baselines for the Diabetes forum.

7.1 An example of expanding a bootstrapped entity classifier’s training set using word vector similarity.

7.2 Precision vs. Recall curves of our system and the baselines for the Asthma forum.

7.3 Precision vs. Recall curves of our system and the baselines for the Acne forum.

7.4 Precision vs. Recall curves of our system and the baselines for the Diabetes forum.

7.5 Precision vs. Recall curves of our system and the baselines for the ENT forum.

8.1 Entity centric view of SPIED-Viz.

8.2 Pattern centric view of SPIED-Viz.

8.3 When the user clicks on the compare icon for an entity, the explanations of the entity extraction for both systems (if available) are displayed.


Chapter 1

Introduction

1.1 Overview

Information extraction involves extracting information, such as entities, relations between entities, and events from unstructured text. The most common entity types are names of people, places, organizations, and locations. Relation extraction systems predict relations between entities in text, for example, city of birth, spouse of, and employee of. Earlier systems built extraction modules using hand-written regular expressions and rules. They have been largely replaced by machine learning-based models, especially in the research community. The last decade of machine learning-based information extraction research has focused on models that require large amounts of labeled data. State-of-the-art systems based on conditional random fields (Lafferty et al., 2001) have been very successful at entity extraction tasks (Ratinov and Roth, 2009), but they only work well when given large amounts of labeled data. However, most real-world information extraction tasks do not have any fully labeled data. Labeling new data to train a reasonably accurate sequence model is not only expensive; it also has to be repeated for each new domain.

Information extraction tasks that have no readily available labeled data are the norm in practical use, not the exception. Consider a substance-abuse specialist who wants to study the substances-of-choice (SoC) of people from online health discussion forums. Here is a post from one such forum on MedHelp.org: [1]

[1] MedHelp is an online health forum. The example is altered to preserve privacy.


Brother was on huge amounts of opiates...80 vics a day plus oxys and morphine...then valium at night to sleep. Then he was on an alcohol bender for 3 weeks... Doc has put him on suboxone.

Two common ways of approaching the above problem are machine learning-based sequence models like CRFs and hand-written lexicons.

CRFs and other supervised sequence models tend to work well when trained on reasonably sized fully labeled data. However, there is currently no publicly available data labeled with SoCs. Labeling such data would be time-consuming and difficult for someone who is not a trained data annotator with access to good annotation tools.

The second approach, manually constructing a lexicon, is expensive, time-consuming, and can lead to poor recall. Most commonly used medical entity extractors, e.g., MetaMap (Aronson, 2001) and the Open Biomedical Annotator or OBA (Jonquet et al., 2009), are based on keyword matching with manually written ontologies. They have two problems: 1. they have poor recall on patient-authored text (Smith and Wicks, 2008) since most ontologies do not contain colloquial or slang phrases, and 2. dictionary-lookup based annotators do not model context.

OBA references various manually curated ontologies and currently has no SoC category. When applied using the generic pharmaceutical drugs categories to the above example, it extracts ‘opiates’, ‘morphine’, ‘suboxone’, and ‘valium’. There are two kinds of extraction errors: 1. not extracting ‘vics’, ‘oxys’, and ‘alcohol bender’, and 2. extracting ‘suboxone’ as an SoC. Suboxone can be an SoC but is also prescribed as a treatment for addiction to other, stronger opiates, as in the above example. The example underscores the need for domain-specific context modeling.

Consider another task, which I focus on in this thesis, of extracting drugs & treatments from patient-authored text (PAT). Treatments include anything consumed or applied to improve a symptom or a condition. Following is an example from MedHelp.org:

I plan to start cinnamon and holy basil known to help diabetes in many people.

PAT, such as discussion posts on online forums, is rife with home and alternative remedies, as well as morphological variations, spelling mistakes, and abbreviations of pharmaceutical drugs. It is nearly impossible to manually list all treatments mentioned in such text. No public dataset exists in this domain for these types of entities to train a machine learning classifier.

Such information extraction needs are more widespread. Consider another task of extracting dish names from Yelp.com reviews. Following is an example:

We ordered the Empanadas for an appetizer, it comes with three different sauces, ... My wife ordered the arroz con pollo and I had the lomo saltado and we split an order of mac and cheese. The presentation was awesome ..

One is likely to find ‘mac and cheese’ in existing dish lexicons but less likely to find ‘lomo saltado’. It is not practical to list the dish names of all cuisines by hand. Additionally, restaurants frequently add new dishes to their menus, quickly making any existing lexicon outdated.

These real-world information extraction scenarios are common: people need very specific information extracted from text, and they do not have any fully labeled data. Manually listing a few examples or using existing knowledge bases as examples for given entity types, however, is very easy. These examples can be used as seed sets, also known as dictionaries or gazetteers, for learning more examples of the entity types. To extract food items from text, for example, it is much easier to find a list of dish names on the Internet than to manually label sentences. The list of dish names can be used as a seed set to train a dish name extractor from unlabeled reviews. Similarly, for drugs & treatments, it is easier to use the existing ontologies as a seed set.
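As a concrete (and purely illustrative) picture of what such a seed set looks like in code, here is a minimal sketch; the entity types and seed phrases are made up and are not the dictionaries used in this dissertation:

```python
# A seed set (also called a dictionary or gazetteer) is simply a mapping from
# an entity type to a handful of known example phrases. All entries here are
# illustrative only.
seed_sets = {
    "DISH_NAME": {"mac and cheese", "arroz con pollo", "empanadas"},
    "DRUG_TREATMENT": {"albuterol", "cinnamon", "holy basil"},
}

# The seed phrases are later matched against unlabeled text to provide
# distant supervision for learning patterns and new entities.
for label, seeds in seed_sets.items():
    print(label, sorted(seeds))
```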

In this dissertation, I show that these seed sets can be effectively used to extract entities from unlabeled text, even though, in contrast to fully labeled data, the supervision provided by them is weak. I explore machine-learned patterns to extract information from text. Patterns can be thought of as instantiations of feature templates, or in their simplest form, regular expressions. I use Bootstrapped Pattern Learning (BPL) to learn patterns and entities from text automatically, starting with only a seed set of a few examples of each entity type. I propose two new tasks and show that BPL is effective for both of them. I compare our pattern-based approach with both lexicon matching and CRFs and show that our system performs significantly better. I also propose improvements to BPL systems.
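As a rough illustration of the "patterns as regular expressions" view, the sketch below uses a made-up surface context ("suffering from ...") to pull out a candidate entity phrase; it is not one of the patterns learned by the systems in this dissertation:

```python
import re

# A lexico-syntactic surface pattern in its simplest form: a regular
# expression whose capture group is the candidate entity. The context
# "suffering from ..." and the sentence are made up for illustration.
pattern = re.compile(r"suffering from ((?:\w+ ){0,2}\w+)", re.IGNORECASE)

text = "I have been suffering from chronic sinus infections for two years."
for match in pattern.finditer(text):
    print(match.group(1))  # -> chronic sinus infections
```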

Earlier pattern or rule-based systems were built with hand-written rules (Hobbs et al., 1993; Riloff, 1993). An important distinction between these systems and the systems I work on is that the patterns in my systems are learned. I use various statistical approaches to automatically score computer-generated patterns. Many other pattern-based systems also learn patterns, often using BPL. BPL was a popular research topic from the mid-1990s to the early 2000s (Hearst, 1992; Riloff, 1996; Collins and Singer, 1999); however, academic research on entity extraction has recently focused more on feature-based sequence classifiers. In my opinion, the shift happened mostly due to trends in the wider world: feature-based classifiers (e.g., naive Bayes and logistic regression) and their structured extensions (e.g., hidden Markov models and conditional random field models) became popular beginning in the late 1990s. The popularity of these classifiers is justified – they tend to work very well when trained on fully labeled data. However, it is not clear if they work better than pattern-based learning approaches, in either fully supervised or distantly supervised settings.

Below I discuss various aspects of my systems.

1.1.1 Distant Supervision

When supervision is in the form of examples, the usual first step is to label the data using the examples. The simplest way is to label all occurrences of the examples in text with the corresponding label. This simple matching of seed sets to text does not take word ambiguity into account. There are more sophisticated uses of distant supervision (Surdeanu et al., 2012) – in many scenarios, not all token instances in text that match the seed lexicon need be instances of that particular class. In my systems, I assume that the seed set entries are unambiguous in text, which is a reasonable assumption for specialized domains. Note that only a few tokens, those corresponding to the seed examples, in sentences get labeled; other tokens are unlabeled. In contrast, a more common way of acquiring supervision in semi-supervised settings is to get a few sentences fully labeled. One obvious question is, why use supervision in the form of examples instead of fully labeled sentences? There are two main reasons. First, it is easier for someone to give examples of a particular information need.


Figure 1.1: Seed examples can often be automatically curated using existing resources. (a) Wikipedia Infobox for Barack Obama. [1] (b) List of Schedule I drugs in the US from Wikipedia.

Labeling full sentences is a more cumbersome task. Druck et al. (2008) and Mann and McCallum (2008) showed that the more effective use of a human annotator with limited time is to label features rather than to label instances. They define labeled features as words that are more likely to occur with one label than with others (e.g., puck is a stronger indicator of the label hockey than of the label baseball). Acquiring seed sets is similar to labeling features – it is easier and more efficient for human annotators to list a few good candidates for each information need. Second, existing lexicons and web resources can frequently be used as seed sets. For example, the TAC-KBP slot filling task uses Wikipedia infoboxes as seed examples. Figure 1.1 shows examples of already curated information about people and entities. The figure on the left is the Wikipedia Infobox for Barack Obama, which can be used for learning relation extractors like ‘born in’ and ‘spouse’. The figure on the right lists Schedule I drugs [2] in the US, which can be used to bootstrap learning SoCs.

[1] http://en.wikipedia.org/wiki/Barack_Obama. Accessed April 2015.
[2] http://en.wikipedia.org/wiki/List_of_Schedule_I_drugs_(US). Accessed April 2015.
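Returning to the simple matching described at the start of this subsection, the sketch below labels every token span that matches a seed-set entry and leaves all other tokens unlabeled; the seed list, the sentence, and the helper function are illustrative only, and a real system would tokenize and normalize more carefully:

```python
# Distant supervision by simple dictionary matching: every token span that
# matches a seed phrase is labeled with that phrase's entity type; all other
# tokens stay unlabeled. Seed entries and the sentence are illustrative only.
seeds = {"SOC": ["opiates", "morphine", "valium", "alcohol"]}

def distant_label(tokens, seeds):
    labels = ["UNLABELED"] * len(tokens)
    for label, phrases in seeds.items():
        for phrase in phrases:
            words = phrase.lower().split()
            n = len(words)
            for i in range(len(tokens) - n + 1):
                if [t.lower() for t in tokens[i:i + n]] == words:
                    labels[i:i + n] = [label] * n
    return labels

tokens = "Brother was on huge amounts of opiates and valium at night".split()
print(list(zip(tokens, distant_label(tokens, seeds))))
```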


1.1.2 Pattern-based learning

Patterns are typically created using contexts around the known entities in a text corpus. The two types of patterns I explore are lexico-syntactic surface word patterns (Hearst, 1992) and dependency tree patterns (Yangarber et al., 2000). They have been shown to perform better than state-of-the-art feature-based machine learning methods on some specialized domains, such as in Chapters 4 and 5, and by Nallapati and Manning (2008). Additionally, pattern-based [3] systems dominate in commercial use (Chiticariu et al., 2013), mainly because patterns are effective, interpretable, and easy for non-experts to customize to cope with errors. Figure 1.2 shows the distribution of pattern or rule-based vs. machine learning-based entity extraction systems in commercial use, in a study by Chiticariu et al. (2013). [4]

[3] I use the terms patterns and rules interchangeably in this dissertation.
[4] It is not clear from their paper which category BPL, and thus my systems, fall under. My systems are pattern-based and are machine learned.

Comparison with sequence classifiers

One main difference between pattern-based learning systems and sequence classifiers like CRFs is the representation – whether a system is represented using patterns or features. Sequence classifiers learn weights on a large number of features. The features commonly include token-level properties, neighboring tokens and their tags, and distributional similarity word classes. Note that there is a continuum between patterns and features. That is, one can think of a pattern as a big conjunction of features and use it in a sequence classifier like a CRF. Conversely, many feature-based systems use feature conjunctions. Some feature-based systems use quite specific hand-engineered conjunctions. For example, the Stanford part-of-speech tagger (Toutanova and Manning, 2003) models unknown words with a conjunction feature for words that are capitalized and contain both a digit and a dash.

Figure 1.2: Percentage of commercial entity extraction systems that use rule-based, machine learning-based, or hybrid systems. The study was conducted by Chiticariu et al. (2013). Pattern or rule-based systems dominate the commercial market, especially among large vendors.

However, in practice, the distinction between patterns and features is clearer. Feature-based systems typically have a very large number of features, starting with single-element features (such as, the word to the left is ‘foo’) and then considering simple conjunctions. Generally, all instantiations of the features and the conjunctions are generated, resulting in a large number of features, most of which are not individually very useful. The emphasis is on coverage and recall of features. The advantage is that the features can share statistics better. There has been some work on learning useful feature templates (Martins et al., 2011); however, the focus there is on whether to include entire feature templates, which are usually much more general than patterns. Pattern-based systems are normally built on orders of magnitude smaller numbers of patterns. Each pattern is normally a quite specific conjunction of several things (such as tokens and their generalized versions, dependency paths, and wildcards) carefully targeted to extract an information need. The emphasis is more on the precision of the patterns.
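To make the contrast concrete, the sketch below shows the kind of single-element features a sequence classifier might generate for a token, next to one hypothetical pattern expressed as a specific conjunction of such predicates; both the feature templates and the pattern are made up for illustration:

```python
# Single-element features of the kind a feature-based classifier might
# generate exhaustively for a token in context (illustrative only).
def single_features(tokens, i):
    feats = {f"word={tokens[i].lower()}"}
    if i > 0:
        feats.add(f"prev_word={tokens[i - 1].lower()}")
    if i + 1 < len(tokens):
        feats.add(f"next_word={tokens[i + 1].lower()}")
    return feats

# A pattern, by contrast, is one quite specific conjunction of such
# predicates targeted at an information need, e.g. "the token two to the
# left is 'prescribed' and the previous token is 'me'" (made up).
def pattern_matches(tokens, i):
    return (i >= 2
            and tokens[i - 2].lower() == "prescribed"
            and tokens[i - 1].lower() == "me")

tokens = "The doctor prescribed me albuterol yesterday".split()
i = tokens.index("albuterol")
print(single_features(tokens, i))  # many small, individually weak features
print(pattern_matches(tokens, i))  # one precise conjunction -> True
```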

The difference is not only in representation; the systems also differ in the typical learning methods used. Feature-based systems normally optimize weights on every feature in a classifier using optimization methods like stochastic gradient descent and Newton’s method. Pattern-based learning also has a loss function, but such optimization methods are rarely used. The focus is on choosing whether to include or exclude patterns (or to weight them more or less) based on the change in the loss function value.
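One way to picture this style of learning is the greedy loop sketched below, which adds a candidate pattern only if doing so lowers a simple loss; the count-based loss and the toy patterns are made up for illustration and are not the scoring functions used later in this dissertation:

```python
# Greedy pattern selection by change in a simple loss, as an illustration of
# including/excluding whole patterns rather than optimizing per-feature
# weights. The loss (negatives extracted minus positives extracted) is made up.
def loss(selected, extractions, positives, negatives):
    extracted = set()
    for p in selected:
        extracted |= extractions[p]
    return len(extracted & negatives) - len(extracted & positives)

def greedy_select(extractions, positives, negatives, max_patterns=2):
    selected = []
    for _ in range(max_patterns):
        current = loss(selected, extractions, positives, negatives)
        best, best_delta = None, 0
        for p in extractions:
            if p in selected:
                continue
            delta = loss(selected + [p], extractions, positives, negatives) - current
            if delta < best_delta:  # keep only patterns that lower the loss
                best, best_delta = p, delta
        if best is None:
            break
        selected.append(best)
    return selected

extractions = {"pat_A": {"ibuprofen", "tylenol"},   # mostly positive
               "pat_B": {"tylenol", "monday"},      # mixed
               "pat_C": {"monday", "yesterday"}}    # mostly negative
print(greedy_select(extractions, {"ibuprofen", "tylenol"}, {"monday", "yesterday"}))
```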


Interpretability

Even though feature-based machine learning is very popular in the academic world, industry is slow and reluctant to adopt it, as seen in the earlier figure. One of the reasons is that developers, who are often not machine learning experts, do not trust black boxes. Patterns solve this problem because: 1. patterns are understandable to humans, 2. it is easy to find errors and fix them in a pattern-based system, and 3. patterns are generally high precision. However, most industrial pattern-based systems are developed using manually defined patterns, which requires significant human effort and expertise in the pattern language. I work on automating this task, using machine learning to learn good patterns; the system preserves the interpretability of patterns but does not require much manual effort. Note that some other forms of machine learning also have much better interpretability than methods such as feature-based classifiers or neural networks; traditionally, decision tree and decision list classifiers have been the prototypical examples of more interpretable machine learning classifiers (Letham et al., 2013). In a decision tree, the conjunction of features from the root down a path is often not so different in nature from a decision list or a pattern.

1.1.3 Bootstrapped Pattern Learning

In a bootstrapped pattern-based entity learning system, seed dictionaries and/or patterns provide distant supervision to label data. The BPL system iteratively learns new patterns and entities belonging to a specific class from unlabeled text (Riloff, 1996; Collins and Singer, 1999). I discuss individual components of BPL in detail in Chapter 2. A high-level overview is: BPL is an iterative algorithm, which learns a few good patterns and entities for each entity type in each iteration. First, patterns are created around known entities. They are scored and ranked by their ability to extract more positive entities and fewer negative entities. Top-ranked patterns are then used to extract candidate entities from text. An entity scorer is trained to score candidate entities based on the entity features and the scores of the patterns that extracted them. High-scoring candidate entities are added to the dictionaries and are used to generate more candidate patterns around them. The power of BPL comes from two properties. First, it identifies only a few good patterns and entities in each iteration; it takes a cautious approach to learning new information. Cautious approaches have been shown to be more accurate in semi-supervised settings (Abney, 2004). Second, the patterns get used in two ways: 1. they act as a filter to suggest good candidate entities to be scored by an entity scorer, and 2. they act as good features in the entity scorer; a highly scored pattern is more likely to extract good entities.
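A deliberately simplified sketch of this loop follows. The create_patterns, score_pattern, score_entity, and extract arguments are placeholders for the components described in Chapter 2 and the later chapters; only the control flow (cautious, iterative growth of the pattern and entity sets) is meant to be faithful.

```python
# A highly simplified bootstrapped pattern learning (BPL) loop. All four
# callables are placeholders for real pattern creation, scoring, and
# extraction components; the control flow is the point of this sketch.
def bpl(corpus, seed_entities, create_patterns, score_pattern, score_entity,
        extract, iterations=10, patterns_per_iter=5, entities_per_iter=10):
    known_entities = set(seed_entities)
    learned_patterns = []
    for _ in range(iterations):
        # 1. Create candidate patterns from contexts around the known entities.
        candidate_patterns = create_patterns(corpus, known_entities)
        # 2. Score and rank the patterns, and cautiously keep only a few.
        ranked = sorted(candidate_patterns,
                        key=lambda p: score_pattern(p, known_entities, corpus),
                        reverse=True)
        new_patterns = ranked[:patterns_per_iter]
        learned_patterns.extend(new_patterns)
        # 3. Use the new patterns to extract candidate entities, score them,
        #    and add only the highest-scoring ones to the dictionary.
        candidate_entities = {e for p in new_patterns for e in extract(p, corpus)}
        candidate_entities -= known_entities
        ranked_entities = sorted(candidate_entities, key=score_entity, reverse=True)
        known_entities.update(ranked_entities[:entities_per_iter])
    return learned_patterns, known_entities
```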

1.1.4 Challenges with unlabeled data

The attractive part about distantly-supervised IE – the limited supervision – is also its main challenge. It is very hard to learn effective classifiers with little labeled data. Starting with unlabeled data and a small seed set, only a few tokens or examples get labeled using the seed and learned sets of entities. Existing systems either assume unlabeled data to be negative or just ignore it. Very often, unlabeled data is subsampled to generate a negative training set for an entity or relation classifier (Angeli et al., 2014). Assuming unlabeled data to be negative can be counterproductive, since many examples subsampled as negative can actually be positive. On the other hand, by ignoring the unlabeled data, a system does not use the data to the fullest extent possible. In this thesis, I propose two improvements to BPL that exploit the unlabeled data to make the pattern and entity scoring more accurate.

1.2 Contributions

I make the following contributions in this dissertation:

• I focus on low-resource information extraction problems and show that patterns, both lexico-syntactic surface word patterns and dependency patterns, learned using BPL are effective for distantly supervised IE. I bring the academic research in IE closer to the problems in industry.

• I propose two new tasks and show that BPL is an effective approach for both of them.

1. Studying influence of sub-communities of a scientific community: I propose new types of key aspects of research papers – focus or main contribution, techniques used, and domain or problem. I also propose a new way of quantifying the influence of one research article on another. There has since been a surge in interest in the study of academic dynamics, with IARPA funding a research program called FUSE.

2. Extracting medical entities from patient-authored text: This dissertation is the first work to extract drugs & treatments and symptoms & conditions from patient-authored text. Such extractions can be used to study side-effects and the efficacy of treatments and home remedies at a large scale. My systems outperformed commonly used medical entity extractors and other machine learning-based baselines.

• I leverage the unlabeled data to improve bootstrapped pattern learning in two ways. My systems significantly outperform the existing pattern and entity scoring measures.

1. Improved pattern scoring: I propose predicting labels of unlabeled entities using unsupervised measures to improve pattern scoring in BPL. I present a new pattern scoring method that uses the predicted labels of unlabeled entities. I predict the labels using five unsupervised measures, such as distributional similarity between labeled and unlabeled entities, and edit distances of unlabeled entities from the labeled entities.

2. Improved entity scoring: I present an improved entity classifier by creating its training set in a better way. I expand the positive and negative training examples by adding the most similar unlabeled entities, computed using distributed representations of words, to the training sets (a brief sketch of this idea follows this list).

• I make the source code and some datasets publicly available. In addition, I also release a visualization and diagnostics tool to compare pattern-based learning systems, to make developing pattern-based systems more effective and efficient.

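As promised above, the sketch below illustrates the training-set expansion idea with made-up, low-dimensional "word vectors": an unlabeled entity joins the positive training set if its cosine similarity to some labeled positive entity is high enough. The vectors, entities, and threshold are all illustrative; the actual approach is described in Chapter 7.

```python
import math

# Illustrative low-dimensional "embeddings"; real systems use vectors learned
# from large unlabeled corpora with many more dimensions.
vectors = {
    "ibuprofen": [0.9, 0.1, 0.0],
    "tylenol":   [0.8, 0.2, 0.1],
    "advil":     [0.85, 0.15, 0.05],   # unlabeled, drug-like
    "monday":    [0.0, 0.1, 0.9],      # unlabeled, not drug-like
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def expand(labeled, unlabeled, threshold=0.95):
    """Add unlabeled entities whose max similarity to labeled ones is high."""
    expanded = set(labeled)
    for e in unlabeled:
        if max(cosine(vectors[e], vectors[l]) for l in labeled) >= threshold:
            expanded.add(e)
    return expanded

positives = {"ibuprofen", "tylenol"}
print(expand(positives, {"advil", "monday"}))  # 'advil' joins the positives
```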


1.3 Dissertation Structure

Chapter 2 This chapter gives details of the entity extraction tasks and the necessary background information about patterns and bootstrapped pattern learning. I give a detailed overview of the individual components of BPL in this chapter. I discuss contributions made to the system and its components by other researchers in the next chapter.

Chapter 3 I discuss work related to semi-supervised and distantly supervised IE, and

pattern learning in this chapter. I discuss related work specific to the different tasks in each

corresponding chapter.

Chapter 4 I present a new way of studying influence between sub-communities of a research community in this chapter. I define three new key aspects to extract from a research article: focus or main contribution, techniques used, and domains applied to. I then describe how to use topic models to define sub-communities. I combine article-to-community scores and the key aspects of each article to compute the influence of sub-communities on each other. I present a case study of the influence of sub-communities in the computational linguistics community, such as Speech Recognition and Machine Learning, on each other.

The content of this chapter is drawn from Gupta and Manning (2011).

Chapter 5 This chapter describes the work published in Gupta et al. (2014b). I describe a new task of extracting drugs & treatments, and symptoms & conditions from patient-authored text. I show that BPL using lexico-syntactic surface-word patterns is an effective technique for extracting the information. It performs significantly better than other approaches, including existing medical entity extraction tools like MetaMap and Open Biomedical Annotator, on extracting the entities from posts on four forums on MedHelp.org.

Chapter 6 I present an improved measure to compute pattern scores in BPL by leveraging unlabeled data. I propose predicting labels of unlabeled entities extracted by patterns using unsupervised measures and using the predicted labels in the pattern scoring function. I describe the five unsupervised measures I use to predict the labels of unlabeled entities and present experimental results on four forums from MedHelp.org. This work has been published in Gupta and Manning (2014a).


Chapter 7 I present an improved entity classifier for BPL by using distributed representations of words. I propose expanding the training sets for BPL’s entity classifier, modeled as a logistic regression, using similarity of entities computed by cosine distance between word vectors. I present experimental results and show that the expanded training sets improve the performance significantly. This work has been published in Gupta and Manning (2015).

Chapter 8 I present and publicly release a visualization and diagnostics tool to compare pattern-based learning systems. This work has been published in Gupta and Manning (2014b).

Chapter 9 I conclude this dissertation and discuss avenues for future work.

I release the code for the systems described in this dissertation at http://nlp.stanford.edu/software/patternslearning.shtml. I also release a visualization tool, described in Chapter 8, that can be downloaded at http://nlp.stanford.edu/software/patternviz.shtml.


Chapter 2

Background

This chapter contains the background information necessary for understanding the rest of the dissertation. First, I describe the entity extraction task and the evaluation measures. I then explain the patterns I use in this dissertation – dependency patterns and lexico-syntactic surface-word patterns. I also describe components of a Bootstrapped Pattern Learning (BPL) system and a few aspects of the classifier I use.

Information extraction encompasses many subtasks, such as entity extraction, relation and event extraction, and entity linking and canonicalization. Entity extraction is the first step for other extraction systems; for example, predicting relations between entities is performed jointly with or after entity extraction. In this dissertation, I focus on the entity extraction task and discuss it below.

2.1 Entity Extraction Task

Entity extraction involves labeling contiguous tokens that form an entity of the desired type in a sentence. The most common task has been to extract named entities, that is, to identify the sequences of words that are the names of things, such as PERSON NAME, PLACE, LOCATION, and ORGANIZATION, from text. Below is an example of a sequence of tokens labeled with named entity tags.

President/TITLE Barack/NAME Obama/NAME lives in D./PLACE C./PLACE


The labeled datasets for the common named entity recognition tasks are publicly available and widely used. The three common corpora that are used to train such extractors are CoNLL-03, which is from the shared task of the Conference on Computational Natural Language Learning in 2003, the MUC-7 dataset (Chinchor, 1998), and the OntoNotes corpus (Hovy et al., 2006), which contains around 10 named entity types and 7 miscellaneous entity types like TIME and DATE. Generally, the tasks are modeled using BIO-enhanced labels – separate labels for the Beginning of an entity, Inside an entity, and Outside of an entity. For example, the phrase ‘the capital Washington D. C.’ would be labeled as ‘the/O capital/O Washington/B-LOCATION D./I-LOCATION C./I-LOCATION’. Any contiguous sequence of the same entity class is labeled as an entity of that class. Learning using BIO-enhanced labels increases the number of classes, since each entity type has two classes associated with it. Because the tasks I work on have a limited amount of supervision, I do not use the BIO or other more expressive notations.
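A small, self-contained sketch of BIO encoding, assuming the entities are given as token spans (the helper and the span representation are illustrative, not part of any corpus format):

```python
# Convert entity spans (start, end, type) over a token list into BIO labels.
def to_bio(tokens, spans):
    labels = ["O"] * len(tokens)
    for start, end, etype in spans:          # end is exclusive
        labels[start] = f"B-{etype}"
        for i in range(start + 1, end):
            labels[i] = f"I-{etype}"
    return labels

tokens = ["the", "capital", "Washington", "D.", "C."]
spans = [(2, 5, "LOCATION")]
print(list(zip(tokens, to_bio(tokens, spans))))
# [('the', 'O'), ('capital', 'O'), ('Washington', 'B-LOCATION'),
#  ('D.', 'I-LOCATION'), ('C.', 'I-LOCATION')]
```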

Most work in entity extraction has focused on supervised training of the classifiers.

Accuracy for the common named entity tasks using fully supervised data has reached the 90s – Ratinov and Roth (2009) reported 90.8 F1 score on the CoNLL-03 dataset and

86.15 F1 score on the MUC-7 dataset. The most informative features when classifying

these entities come from the word itself and other word-level features like capitalization,

prefix characters, suffix characters, and part-of-speech tags. There is ambiguity, such as

whether ‘Washington’ is a NAME or a PLACE, and thus sequence models like CRFs are

very effective for such tasks. However, word-level features are still very predictive of the

labels for most entities. I do not work on CoNLL-03- or MUC-7-like entity extraction

tasks because they have been well-studied and have large fully labeled corpora available to

train supervised classifiers. I, however, use many of the features used in these systems for

other entity extraction tasks.

Other entity extraction tasks include biomedical named entity recognition (BioNER), such as recognizing proteins, DNA, drugs, and genes in text. Using statistical classifiers for

BioNER has been moderately successful – Finkel (2010) reported around 70 F1 score on

the GENIA corpus (Kim et al., 2003) that has Medline abstracts labeled with entities of

types such as PROTEIN and DNA. Generally word features of these entity types are not very

informative. There is more ambiguity in word level features, and thus context features are


more important. See Gu (2002) for a longer discussion.

In this dissertation, I focus on training entity extractors from unlabeled text and seed

sets of entities. In the majority of the work, I focus on a task that is similar to the BioNER

tasks. The task is to extract symptoms & diseases and drugs & treatments from patient-

authored text (PAT). PAT is more challenging than well written abstracts and news articles

because of the variation in entity naming and the long descriptions of entities. Similar

to the BioNER tasks, word features are not very powerful for this dataset because of the

morphological variations (like ‘vics’ for ‘Vicodin’) and spelling mistakes. I discuss the

dataset and its challenges in Section 2.5.

2.1.1 Evaluation

Entity extraction is usually evaluated by precision, recall, and F1 scores. For each entity type l, the following values are computed: true positive_l (the number of entities correctly extracted as l by the model), false positive_l (the number of entities extracted for label l by the model that belong to other labels), true negative_l (the number of entities that are not of type l and are not extracted by the model), and false negative_l (the number of entities of type l that are not extracted by the model).

Precision

Also known as ‘positive predictive value’, precision is the percentage of correct entities

extracted among all the extracted entities.

$$\text{Precision}_l = \frac{\text{true positive}_l}{\text{true positive}_l + \text{false positive}_l} \qquad (2.1)$$

Recall

Recall, also known as sensitivity, is the percentage of correct entities extracted among all

correct entities in the text.

$$\text{Recall}_l = \frac{\text{true positive}_l}{\text{true positive}_l + \text{false negative}_l} \qquad (2.2)$$


F1 Score

Usually F1 score is used as an evaluation measure to compare various systems since it

provides a single score by taking the harmonic mean of precision and recall.

$$F_1^l = \frac{2 \times \text{Precision}_l \times \text{Recall}_l}{\text{Precision}_l + \text{Recall}_l} \qquad (2.3)$$

There are two ways of combining scores of different entity types – macro-averaged

and micro-averaged. To obtain the macro-averaged scores, the precision, recall, and F1

scores are calculated for each entity type (except the background label) and are averaged to

compute the final score of a system. This averaging is commonly reported for text categorization in the information retrieval literature. Micro-averaging, which is used for the CoNLL-03

shared task evaluation, computes the precision, recall, and F1 scores for all the entities

together.
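To make the label-wise and averaged scores concrete, the following minimal Python sketch (an illustration only, not the code released with this dissertation) computes precision, recall, and F1 per label from true positive, false positive, and false negative counts, and then combines them with macro- and micro-averaging. The count dictionary is hypothetical.

```python
def prf1(tp, fp, fn):
    # Precision, recall, and F1 for a single label (Equations 2.1-2.3).
    p = tp / (tp + fp) if tp + fp > 0 else 0.0
    r = tp / (tp + fn) if tp + fn > 0 else 0.0
    f1 = 2 * p * r / (p + r) if p + r > 0 else 0.0
    return p, r, f1

def macro_micro(counts):
    # counts: {label: (tp, fp, fn)}, excluding the background label.
    per_label = {l: prf1(*c) for l, c in counts.items()}
    macro = tuple(sum(v[i] for v in per_label.values()) / len(per_label)
                  for i in range(3))
    # Micro-averaging pools the counts of all labels before computing scores.
    tp, fp, fn = (sum(c[i] for c in counts.values()) for i in range(3))
    micro = prf1(tp, fp, fn)
    return per_label, macro, micro

# Hypothetical counts for two labels.
counts = {"DRUG": (30, 10, 20), "SYMPTOM": (50, 25, 10)}
per_label, macro, micro = macro_micro(counts)
print(per_label, macro, micro)
```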

Entity-level vs. Token-level Evaluation

The correctness of an extracted entity can be judged at the entity-level or the token-level.

In an entity-level evaluation, a multi-word entity is considered as a single unit. That is,

an entity is considered correctly extracted for l if and only if all the contiguous tokens

labeled as l form an entity. I refer to the entity-level evaluation measures that do not give

partial credits as hard entity-level evaluation. Although the hard entity-level evaluation is

common for NER tasks, when data has inconsistencies or when extracting partial entities

is also important, token level scores are used, such as in Ratinov and Roth (2009). In

a token-level evaluation, each token’s correctness is judged independently of the label of

other tokens. For example, if a system extracts a five-word-long entity correctly, the hard

entity-level evaluation gives it a score of 1 and the token-level evaluation gives it a score of

5.

The hard entity-level evaluation penalizes the system twice for extracting a partial en-

tity. For example, if an entity is “salbutamol inhaler” and a system labels only “inhaler”

as a DRUG, then for the label DRUG, the entity-level number of true positives is 0, false

negatives is 1, and false positives is 1. On the other hand, the token-level number of true

positives is 1, false negative is 1, and false positive is 0. Entity-level evaluation is preferred


over token-level evaluation when extracting all words of an entity is more important than

extracting parts of entity phrases. Entity-level evaluation is commonly used for recogniz-

ing named entities, where, for example, it is not clear if partially extracting ‘Mining Corp.’,

instead of fully extracting ‘Westport Mining Corp.’, should be considered correct.
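As a minimal illustration of the two counting schemes (the helper functions below are my own sketch, not the dissertation's evaluation code), here is how the 'salbutamol inhaler' example above can be counted in Python:

```python
def count_hard_entity_level(gold_entities, predicted_entities):
    # An entity counts as a true positive only if it is matched exactly.
    tp = len(gold_entities & predicted_entities)
    fp = len(predicted_entities - gold_entities)
    fn = len(gold_entities - predicted_entities)
    return tp, fp, fn

def count_token_level(gold_tokens, predicted_tokens):
    # Each token is judged independently of the other tokens.
    tp = len(gold_tokens & predicted_tokens)
    fp = len(predicted_tokens - gold_tokens)
    fn = len(gold_tokens - predicted_tokens)
    return tp, fp, fn

gold_entities = {("salbutamol", "inhaler")}
predicted_entities = {("inhaler",)}
print(count_hard_entity_level(gold_entities, predicted_entities))  # (0, 1, 1)

gold_tokens = {"salbutamol", "inhaler"}
predicted_tokens = {"inhaler"}
print(count_token_level(gold_tokens, predicted_tokens))  # (1, 0, 1)
```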

Some evaluation measures consider a multi-word entity as a single unit but give partial credit for partially extracting an entity. For example, the MUC-7 evaluation adds

scores for partially matching an entity, downweighting the partial match scores by 50% to

prefer systems that extract more fully matched entities.

True Recall vs. Pooled Recall

True Recall is calculated when fully labeled test data is available, that is, every token of the test sentences is labeled. However, in many situations, hand labeling all tokens in the

test data is not practical because of the size of test data, such as in information retrieval

(Buckley et al., 2007) and the TAC-KBP task settings. In such cases, recall is measured

by pooling (i.e., taking a union of) all correct entities extracted by all the systems. Pooled Recall for label l is defined in the same way as Equation 2.2, except that the denominator is the size of the pooled set for the label l. See Chapter 8 of Manning et al. (2008) for more information on evaluation measures and pooled recall. I use true recall in Chapters 4 and 5 and pooled recall in Chapters 6 and 7, where the test sets were prohibitively large to hand label all tokens.

2.2 Patterns

There are many ways of defining patterns – lexico-syntactic surface word patterns (Hearst,

1992; Riloff, 1996), dependency patterns (Yangarber et al., 2000), a combination of both

(Illig et al., 2014), or cascaded rules (Hobbs et al., 1997). In this dissertation, I focus on

the first two and describe them below.


2.2.1 Lexico-syntactic Surface word Patterns

Lexico-syntactic surface word patterns1 considers context around entity tokens in a sen-

tence. The patterns are formed using a window of words before and after the labeled

tokens. There are several different ways of constructing these patterns. In this section, I

give an overview of how these patterns can be formed; more details on the restrictions and

the parameter values I use are in the individual chapters.

A few options a developer can consider when developing a surface word pattern-based

system are below. Table 2.1 shows an example of two patterns and how they match to two

sentences.

• Target Entity Restrictions: Patterns can be developed to extract any sequence of

words that match a pattern’s context. However, generally, manually providing or

automatically learning restrictions on the target entity, such as part-of-speech and

common named entity tags, can improve precision of a system. Patterns can also

specify minimum and maximum lengths of entities to be learned.

• Context Length: Surface word patterns are formed by considering the maximum and

minimum window size on either side of the entities. Contexts that consist of only

a few stop words can be discarded because they are too general; sometimes longer

contexts consisting of all stop words can still be useful (for example, ‘I am on X’ is

a good pattern for extracting a DRUG.)

• Context Generalizations: Generalizing context of a pattern helps to reduce sparsity

and thus improve learning, since the pattern matches more entities. It also improves

performance on unseen data. Lemmatization of words is a classic form of generaliza-

tion. Other ways of generalizing tokens include semantic word classes, such as from

Wordnet (Fellbaum, 1998) or Yago (Suchanek et al., 2007) hierarchy, stop words, and

part-of-speech tags. Such generalizations have been used widely in previous work.

For example, Califf and Mooney (1999) used part-of-speech tags and semantic word

class constraints on pattern elements. In addition, words that are labeled with one

of the label dictionaries (seed as well as learned) can be generalized with the label.

1 I also refer to them as surface word patterns in the dissertation.


Pattern: lemma:put FW* lemma:I FW* lemma:on FW* SW* {X | tag:NN.*}{1,2}
Sentence: dr. put me on some albuterol inhaler

Pattern: {X}{1,2} SW* FW* lemma:in FW* lemma:throat
Sentence: I have this itchiness in the throat.

Table 2.1: Examples of patterns and how they match to sentences. X means one token that will be matched, tag means the part-of-speech tag restriction of the target entity, FW* means up to 2 words from {a, an, the}, SW* means up to 2 stop words, .* means zero or more characters can match, and lemma means the lemma of the token. Lemmas, FW*, and SW* are generalizations of the patterns' context. Colors show corresponding matches between pattern elements and words in sample sentences.

As an example of how to vary the context window size and generalize, consider the labeled sentence 'I take Advair::DT and Albuterol::DT for asthma::SC', where '::' indicates the label of the word. The patterns created around the DT word 'Albuterol' will be 'DT and X', 'DT and X for SC', 'X for SC', and so on, where X is the target entity; a small code sketch of this candidate-pattern generation follows this list.

• Flexible Matching: One can create flexible patterns by ignoring certain types of

words, such as determiners and function words, while matching the pattern. One

can also allow stop words between the context and the term to be extracted. Some

systems (Agichtein and Gravano, 2000; Brin, 1999) have used vectors of context

words instead of contiguous tokens as patterns to increase flexibility in matching. I

use context as contiguous tokens or their generalized forms.
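Below is a minimal Python sketch of the candidate-pattern generation illustrated by the Advair/Albuterol example. It is not the dissertation's implementation (which also supports lemmas, stop-word wildcards, and the other generalizations described above); it only enumerates left/right context windows around a labeled target token and generalizes labeled context words to their labels.

```python
def surface_patterns(tokens, labels, target_index, max_window=2):
    """Generate candidate surface-word patterns around tokens[target_index].

    tokens: list of words; labels: a parallel list of labels ('O' = unlabeled).
    Labeled context words are generalized to their label; 'X' marks the target.
    """
    def generalize(i):
        return labels[i] if labels[i] != "O" else tokens[i]

    patterns = set()
    for left in range(max_window + 1):
        for right in range(max_window + 1):
            if left == 0 and right == 0:
                continue  # a pattern needs at least some context
            lo = max(0, target_index - left)
            hi = min(len(tokens), target_index + right + 1)
            context = ([generalize(i) for i in range(lo, target_index)]
                       + ["X"]
                       + [generalize(i) for i in range(target_index + 1, hi)])
            patterns.add(" ".join(context))
    return patterns

tokens = ["I", "take", "Advair", "and", "Albuterol", "for", "asthma"]
labels = ["O", "O", "DT", "O", "DT", "O", "SC"]  # labels from the running example
print(sorted(surface_patterns(tokens, labels, target_index=4)))
# includes 'DT and X', 'DT and X for SC', and 'X for SC', among others
```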

2.2.2 Dependency Patterns

A dependency tree of a sentence is a parse tree that gives dependencies (such as direct object and subject) between words in the sentence. It is, in my opinion, the best way to trade off semantic meaning representation and 'learnability' with the resources we currently have. Semantic representations can be more expressive, but it is hard to learn to generate such representations for new sentences, and they require more manually labeled data, which is very hard to acquire. All work in this dissertation has used the


Stanford English Dependencies (De Marneffe et al., 2006). Many researchers are currently

working on developing the Universal Dependencies2 to provide a universal collection of

categories with consistent annotations across different languages. My systems can be easily

customized to work with the Universal Dependencies.

Figure 2.1 shows the dependency tree for the sentence ‘We work on extracting informa-

tion using dependency graphs.’. Dependency patterns match dependency trees of sentences

to extract phrase sub-trees. The figure shows matching of two patterns: [using→ (direct-

object)] and [work→ (preposition on)]. The two patterns are part of seed patterns to extract

FOCUS and TECHNIQUE entities from scientific articles in Chapter 4.

A dependency tree matches a pattern [T → (d)], with a trigger word T and a dependency d, if (1) it contains T, and (2) the trigger word's node has a successor whose dependency with its parent is d. In the rest of the dissertation, I call the subtree headed by that successor the matched phrase-tree. The notion of a phrase in a dependency grammar is the subtree below the head node selected by the pattern. I extract the phrase correspond-

ing to the matched phrase-tree and label it with the pattern’s category. For example, the

dependency tree in Figure 2.1 matches the FOCUS pattern [work→ (preposition on)] and

the TECHNIQUE pattern [using→ (direct-object)]. Thus, the system labels the phrase corre-

sponding to the phrase-tree headed by ‘extracting’, which is ‘extracting information using

dependency graphs’, with the category FOCUS, and similarly labels the phrase ‘dependency

graphs’ as a TECHNIQUE.

I use Stanford CoreNLP (Manning et al., 2014) to get dependency trees of sentences

and use its Semgrex tool3 to match dependency patterns to the dependency trees.4
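The matching itself is done with Semgrex over Stanford dependencies; as a language-agnostic illustration only, the following Python sketch shows the same [T → (d)] matching logic on a dependency tree represented as (head, dependency, dependent) edges. The edge list is a hand-written approximation of Figure 2.1, not parser output, and some labels are illustrative.

```python
# Edges: (head, dependency_label, dependent), approximating Figure 2.1.
edges = [
    ("work", "nsubj", "We"),
    ("work", "prep_on", "extracting"),
    ("extracting", "dobj", "information"),
    ("extracting", "xcomp", "using"),   # illustrative label; the real parse differs
    ("using", "dobj", "graphs"),
    ("graphs", "nn", "dependency"),
]

def match(pattern, edges):
    """Return word sets of phrase sub-trees matched by (trigger_word, dependency)."""
    trigger, dep = pattern
    children = {}
    for head, label, child in edges:
        children.setdefault(head, []).append((label, child))

    def subtree(node):
        # The matched phrase is the subtree headed by the successor node
        # (surface word order is not restored in this sketch).
        words = [node]
        for _, child in children.get(node, []):
            words.extend(subtree(child))
        return words

    return [subtree(child) for label, child in children.get(trigger, []) if label == dep]

print(match(("work", "prep_on"), edges))  # FOCUS phrase sub-tree
print(match(("using", "dobj"), edges))    # TECHNIQUE phrase sub-tree
```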

The options to consider when creating dependency patterns are similar to those for surface word patterns. A few other parameters to consider are: (1) the allowed and disallowed dependencies, both when generating the dependency patterns and when extracting a phrase from a matched phrase sub-tree, and (2) flexible matching of dependency patterns by allowing a certain number or type of nodes to be skipped between the trigger node and the node that is connected by the required dependency edge.

2 http://universaldependencies.github.io/
3 http://nlp.stanford.edu/software/tregex.shtml
4 More details about the dependencies are in http://nlp.stanford.edu/software/dependencies_manual.pdf.


Figure 2.1: The dependency tree for 'We work on extracting information using dependency graphs'. The tree is generated using the collapsed dependencies defined in the Stanford CoreNLP toolkit (the word 'on' is collapsed with the edge 'preposition'). The dependency 'nn' means 'noun compound modifier'. The generated dependencies are not always correct; for example, the correct dependency between 'extracting' and 'using' should have been 'advcl'. Also shown is the matching of two patterns. More details are in Chapter 4.

2.3 Bootstrapped Pattern Learning

Bootstrapped pattern-based entity learning (BPL) generally begins with seed sets of pat-

terns and/or example dictionaries for given labels and iteratively learns new entities from

unlabeled text (Riloff, 1996; Collins and Singer, 1999). I earlier discussed the two types

of patterns our systems learned using this approach – lexico-syntactic surface word pat-

terns (Hearst, 1992) and dependency tree patterns (Yangarber et al., 2000). In each itera-

tion, BPL learns a few patterns and a few entities of each given label. Figure 2.2 shows

the flow of the system when the supervision is provided as seed entities. For ease of ex-

position, I present the approach below for learning entities for one label l. It can easily

be generalized to multiple labels. I refer to entities belonging to l as positive and entities

belonging to all other labels as negative. Patterns are scored by their ability to extract more positive entities and fewer negative entities. Top-ranked patterns are used to extract candidate

entities from text. High scoring candidate entities are added to the dictionaries and are used

to generate more candidate patterns around them.


Figure 2.2: A flowchart of various steps in a bootstrapped pattern-based entity learning system.


The bootstrapping process involves the following steps, iteratively performed until no

more patterns or entities can be learned. For ease of understanding, I use a running example of learning 'animal' entities from a small unlabeled example text (shown in Figure 2.3), starting with the seed set of entities {dog}.

Step 1: Data labeling

The unlabeled text is partially labeled using the label dictionaries, starting with the seed

dictionaries in the first iteration. In later iterations, the text is labeled using both the seed

and the learned dictionaries. A phrase matching a dictionary phrase is labeled with the

dictionary’s label. Often, phrases are soft matched, by using lemmas of words and/or by

matching phrases within a small edit distance. In the running example, both instances of 'dog' are labeled as an 'animal'.

Step 2: Pattern generation

Candidate patterns are generated from the context around the labeled entities. I discussed various parameters to consider when generating the patterns in Section 2.2. I generate all possible patterns and learn the good ones in the next step. Figure 2.3 shows two of the many possible candidate patterns and their extractions.

Step 3: Pattern learning

This is one of the two crucial steps in a BPL system. Candidate patterns generated in

the previous step are scored using a pattern scoring measure. Top ones are added to the

list of learned patterns for l. The maximum number of patterns to be learned and the


threshold to choose a pattern are given as input to the system by the developer. In a

supervised setting, the efficacy of patterns can be judged by their performance on a fully

labeled dataset (Califf and Mooney, 1999; Ciravegna, 2001). In a bootstrapped system,

where the data is not fully labeled, a pattern is usually judged by the number of positive,

negative, and unlabeled entities it extracts. Note that a true recall cannot be used because of

the lack of a fully labeled dataset. One of the most commonly used measures is RlogF by

Riloff (1996). It is a combination of reliability of a pattern and the frequency with which

it extracts positive entities. Let pos(p), neg(p), and unlab(p) be the number of positive,

negative, and unlabeled entities extracted by the pattern p, respectively. The RlogF score is

$$RlogF(p) = \frac{pos(p)}{pos(p) + neg(p) + unlab(p)} \log pos(p) \qquad (2.4)$$

The first term is a very rough estimate of the precision of a pattern – it assumes unla-

beled entities to be negative. The log pos(p) term gives higher scores to patterns that extract

more positive entities. In Figure 2.3, the pattern scorer gives scores s1 and s2 to the two candidate patterns. The pos and unlab values for both patterns are 1 and neg is 0. Assuming that the

pattern scorer is good, that is s2 > s1, the second pattern is selected and added to the list of

learned patterns.

Figure 2.3: An example pattern learning system for the class 'animals' from the text. Two of the many possible candidate patterns are shown, along with the extracted entities. Text matched with the patterns is shown in italics and the extracted entities are shown in bold.
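A minimal sketch of this pattern-scoring step is below (my own illustration, with hypothetical candidate patterns and counts rather than those of Figure 2.3); it ranks candidate patterns by the RlogF measure of Equation 2.4.

```python
import math

def rlogf(pos, neg, unlab):
    # RlogF (Equation 2.4): rough precision estimate times log of positive count.
    if pos == 0:
        return float("-inf")  # a pattern with no positive extractions is never selected
    return pos / (pos + neg + unlab) * math.log(pos)

# Hypothetical candidate patterns with (pos, neg, unlab) extraction counts.
candidates = {
    "my X barks": (4, 0, 2),
    "I have a X": (3, 2, 10),
}
ranked = sorted(candidates, key=lambda p: rlogf(*candidates[p]), reverse=True)
top_k = ranked[:1]  # the developer chooses how many patterns to accept per iteration
print(ranked, top_k)
```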


Step 4: Entity learning

Patterns that are learned for the label in the previous step are applied to the text to extract

candidate entities. An entity scorer ranks the candidate entities and adds the top entities to

l's dictionary. The maximum number of entities to be learned and the threshold to choose an entity are given as input to the system by the developer. Some systems learn every entity extracted by the learned patterns; however, that can lead to many noisy entities. In Chapter 7, I discuss various entity evaluation measures. In our systems, I represent an

entity as a vector of feature values. The features are used to score the entities, either by

taking an average of their values or by training a machine learning-based classifier.

Iterations

Steps 1-4 are repeated for a given number of iterations. Generally, precision drops and

recall increases with every iteration. The number of iterations can also be determined by a

threshold on precision; in Chapters 6 and 7, I do not consider output of the learning systems

when their precision drops below 75% during the post-hoc analysis.
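Putting Steps 1–4 together, the following Python sketch outlines the overall bootstrapping loop for a single label. The callables generate_patterns, score_pattern, and score_entity are placeholders for the components described above, not the released system's API, and pattern objects are assumed to provide an extract(corpus) method.

```python
def bootstrap(corpus, seed_entities, num_iterations, patterns_per_iter,
              entities_per_iter, generate_patterns, score_pattern, score_entity):
    """Generic bootstrapped pattern-based entity learning loop (Steps 1-4)."""
    learned_entities = set(seed_entities)
    learned_patterns = set()
    for _ in range(num_iterations):
        # Step 1: label the text with the seed and learned dictionaries
        # (a simplistic substring match; real systems match tokens/lemmas).
        labeled = [(sent, [e for e in learned_entities if e in sent]) for sent in corpus]
        # Step 2: generate candidate patterns around the labeled entities.
        candidates = generate_patterns(labeled)
        # Step 3: score the candidate patterns and keep the top-ranked ones.
        new_patterns = sorted(candidates, key=score_pattern, reverse=True)[:patterns_per_iter]
        learned_patterns.update(new_patterns)
        # Step 4: apply learned patterns, then score and keep the top entities.
        extracted = {e for p in learned_patterns for e in p.extract(corpus)}
        new_entities = sorted(extracted - learned_entities,
                              key=score_entity, reverse=True)[:entities_per_iter]
        if not new_patterns and not new_entities:
            break  # nothing more can be learned
        learned_entities.update(new_entities)
    return learned_patterns, learned_entities
```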

Parameters

Similar to any learning system, there are many parameters one can tweak to improve a system's performance. One set consists of BPL-related parameters – the thresholds

for learning a pattern (entity), number of patterns (entities) to learn in each iteration, and

the total number of iterations. The second set of parameters, as discussed in the previous

section, are related to the construction of patterns: minimum and maximum window of

context, annotations (such as, part-of-speech tags and word class tags) to consider for the

context tokens and the target entity, and, in the case of dependency patterns, the depth

of ancestors/dependents to consider when matching or constructing a pattern. Some of

these parameters, such as thresholds, can be hand tuned on a development dataset. One trick to avoid tuning the thresholds is to start with high thresholds and reduce them when no more patterns or entities are learned by the system. I follow this approach in Chapters 5–7. An added advantage is that initially the systems learn only highly confident patterns and


entities, reducing the chances of semantic drift. Semantic drift is the phenomenon in which the system learns a few false positive entities, leading to the learning of more incorrect patterns and entities over the iterations. Other parameters, such as restrictions to consider for the target

entity, can be learned by the system; patterns with and without the restrictions are generated

and are scored by the pattern scoring function.

2.4 Classifiers and Entity Features

Brown clusters

I use Brown clustering (Brown et al., 1992) to cluster words in the MedHelp dataset in

an unsupervised way. It is a greedy bottom-up hierarchical clustering approach based on n-gram class language models. Brown clusters have been used widely for other tasks, such as NER (Ratinov and Roth, 2009), parsing (Koo et al., 2008), and part-of-speech tagging (Li et al., 2012a). I used the implementation by Liang (2005). I do not use the publicly

available generic word clusters because they are not from the same domain as the datasets.

I mainly used Brown clustering instead of other clustering methods such as distributional

clustering (Clark, 2001) or word embeddings (Collobert and Weston, 2008; Mikolov et al.,

2013a) because it was fast, easy to use, and produced good clusters. Additionally, Turian

et al. (2010) reported that Brown clustering induces better representations for rare words

than the embeddings from Collobert and Weston (2008), when the latter does not receive

sufficient training updates. I use the word embeddings from a neural network model to

enhance the entity classifier in Chapter 7.

Google Ngrams

Google Ngrams5 is a resource provided by Google consisting of English phrases 1–5 words long and their observed frequency counts on the web (considering around 1 trillion word

tokens). Only phrases with frequency greater than or equal to 40 are included. It is a great

resource for building language models or for estimating usage of a phrase on the Internet. I

5 https://catalog.ldc.upenn.edu/LDC2006T13, accessed in January 2008.


use this for calculating feature values of entities, on the assumption that an entity common on the Internet (such as 'youtube') is not a useful entity to extract for a specialized domain.
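As a small illustration only (the count table, threshold-free log scaling, and function name are invented here, not taken from the dissertation's feature set), such a web-frequency feature can be computed by looking up a candidate phrase in an n-gram count table:

```python
import math

# Hypothetical phrase counts in the style of the Google Ngrams resource.
ngram_counts = {
    "youtube": 50_000_000,
    "salbutamol inhaler": 12_000,
}

def web_frequency_feature(phrase, counts, total_tokens=1_000_000_000_000):
    # Log-scaled frequency; phrases absent from the resource get 0.
    count = counts.get(phrase.lower(), 0)
    if count == 0:
        return 0.0
    return math.log(count) / math.log(total_tokens)

print(web_frequency_feature("youtube", ngram_counts))             # high: likely too generic
print(web_frequency_feature("salbutamol inhaler", ngram_counts))  # lower: more domain-specific
```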

Logistic Regression

I use logistic regression (LR) for entity classifiers since it is one of the most commonly used

classifiers and it worked better than SVMs and Random Forests in the pilot experiments. I

used the implementation of LR in Stanford CoreNLP (Manning et al., 2014) and used the

default settings (with L2 regularization). Note that our training datasets are noisy – automatically constructed seed sets often have some noise; learned patterns and entities can be incorrect; and the sampling to create a training set can lead to a wrongly labeled dataset.

There has been some work in modeling annotation noise to learn more robust classifiers,

such as Shift LR (Tibshirani and Manning, 2014) and Natarajan et al. (2013), that model

random labeling noise. The noise in bootstrapped systems is, however, more systematic: the wrong labels come from the noisy dictionaries rather than from wrong annotations by human annotators, which are presumably more random. I tried using Shift LR in our systems but it

led to poor results.
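As an illustration of this setup only (the dissertation uses the logistic regression implementation in Stanford CoreNLP, not scikit-learn, and the feature names and values below are invented), a minimal entity classifier over feature vectors might look like the following sketch.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical feature dictionaries for candidate entities (positive label = 1).
train_features = [
    {"edit_dist_to_dict": 0.1, "web_freq": 0.02, "brown_cluster_1010": 1.0},
    {"edit_dist_to_dict": 0.9, "web_freq": 0.80, "brown_cluster_0110": 1.0},
    {"edit_dist_to_dict": 0.2, "web_freq": 0.05, "brown_cluster_1010": 1.0},
]
train_labels = [1, 0, 1]

vectorizer = DictVectorizer()
X = vectorizer.fit_transform(train_features)

# L2-regularized logistic regression, mirroring the default setting described above.
clf = LogisticRegression(penalty="l2")
clf.fit(X, train_labels)

candidate = {"edit_dist_to_dict": 0.15, "web_freq": 0.03, "brown_cluster_1010": 1.0}
score = clf.predict_proba(vectorizer.transform([candidate]))[0, 1]
print(score)  # probability that the candidate belongs to the positive label
```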

2.5 Dataset: MedHelp Patient Authored Text

For experimental evaluation in Chapters 5, 6, and 7, I use MedHelp.org forum data. MedHelp is one of the largest online health discussion forums. Similar to other discussion

forums, there are forums under topics like ‘Asthma’, ‘ENT’, and ‘Pregnancy: Sept 2015

babies’. A MedHelp forum consists of thousands of threads; each thread is a sequence of

posts by users. The dataset includes some medical research material posted by users but has

no clinical text. In each thread, the initiator of the thread posts a paragraph or more about

a health concern or comment. The conversations are usually about health topics, but are

also sometimes about emotional support and advice (MacLean et al., 2015). We acquired

the dataset through a research agreement with MedHelp, who anonymized the data prior

to sharing. The data spans from 2007 to May 2011. Other work on this dataset includes

MacLean and Heer (2013), MacLean et al. (2015), and MacLean (2015).


Cold dry air is a common trigger, I'm also haven't a lot of trouble keeping the asthma under control now that is it winter (only diganosed last spring).
I had actually been feeling spasms in my throat that I thought were palpitations but it ended up not being my heart.
Now I have developed a low grade fever and blisters in my throat.
Would love some feedback as I'm anxious.
No stuffed nose, no discharge.
yes i realize that i should have used ear plugs and yes i've learned my lesson that i will use plugs from now on.
I have chronic sinusitis, scars on both ears from past infections, and "fairly severe deviated septum,".
I went to the doctor and he gave me augmittin it cleared the white patches right up.
I went to the health food store and found Wally's Ear Oil about 2 weeks ago after reading some of the posts here.
I am interested in Xanax side affect of loosing taste and smell.
It sounds like chronic non-infectious bronchitis.
I've had chest x-ray-normal.
Once I had my sinus surgeries my asthma improved dramatically.

Table 2.2: A few examples of sentences from the MedHelp forum. The sentences are labeled with symptoms & conditions (in italics) and drugs & treatments (in bold) labels.

There are several challenges with extracting information from the dataset. Patients use

various slang, colloquial forms of entities and home remedies that are not found in seed

sets. They are very descriptive about their symptoms and conditions. Some examples

of sentences from the Asthma and ENT forums labeled with symptoms & conditions (in

italics) and drugs & treatments (in bold) labels are shown in Table 2.2. More information about these two labels is in Chapter 5.

In the next chapter, I discuss previous work related to bootstrapped and pattern-based

learning.


Chapter 3

Related Work

Pattern-based / Fully supervised: SRV, SLIPPER, WHISK, RAPIER
Pattern-based / Distantly supervised: Snowball, Basilisk, Ravichandran and Hovy (2002)
Non-pattern-based / Fully supervised: sequence models like CRFs, HMMs, CMMs
Non-pattern-based / Distantly supervised: semi-supervised classifiers, TAC-KBP systems like MIML-RE
Hybrid / Fully supervised: Boella et al. (2013), Freitag and Kushmerick (2000)
Hybrid / Distantly supervised: KnowItAll, NELL, Putthividhya and Hu (2011), Surdeanu et al. (2006)

Table 3.1: A few examples to give an idea about the types of IE systems developed, based on the amount of supervision and the type of models.

The existing IE systems can be roughly categorized along two dimensions – the super-

vision required and the models used. A system can be fully supervised, semi-supervised,

or distantly supervised. There are several names for distantly supervised learning in the

existing literature: bootstrapped, lightly supervised, weakly supervised, minimally super-

vised, or semi-supervised learning. The second dimension is the model used, which can be pattern-based, non-pattern-based (most of these are feature-based sequence classifiers), or a hybrid of pattern-based and feature-based.


In this chapter, I discuss distantly supervised systems and pattern-based learning ap-

proaches to information extraction (IE). I do not review hand-written systems or fully or semi-supervised machine learning-based systems in this chapter. Conditional

random fields (CRFs), a fully supervised approach, have been very successful at entity ex-

traction tasks. For more information on CRFs and other advances in IE, see Hobbs and

Riloff (2010) and Sarawagi (2008).

3.1 Pattern-based systems

Pattern-based learning, both distantly and fully supervised, has been a topic of interest

for many years. Pattern-based approaches have been widely used for IE (Chiticariu et al.,

2013; Fader et al., 2011; Etzioni et al., 2005). Systems differ in how they create patterns,

learn patterns, and learn the entities they extract. Patterns are useful in two ways: they

are good features in an entity classifier, and they identify promising candidate entities.

Pattern learning can also be thought of as a feature selection approach, in which patterns

are instantiated feature templates. Patwardhan (2010) gives a good overview of the research

in the field.

3.1.1 Fully supervised

The pioneering work by Hearst (1992) used hand written rules to automatically generate

more rules that were manually evaluated to extract hypernym-hyponym pairs from text.

Other supervised systems like SRV (Freitag, 1998), SLIPPER (Cohen and Singer, 1999),

WHISK (Soderland, 1999), (LP)2 (Ciravegna, 2001), and RAPIER (Califf and Mooney,

1999) used a fully labeled corpus to either create or score rules. WHISK learned sur-

face word patterns with wildcards and semantic classes like digit and numbers. SLIP-

PER (Cohen and Singer, 1999) used boosting to create an ensemble of rules. Here, I de-

scribe RAPIER and (LP)2 in detail as examples of the supervised pattern learning systems.

RAPIER, a bottom-up learning system, used a relational learning algorithm that preferred

overly specific to overly general rules. A rule or pattern was defined as a combination of

the filler (target entity), pre-filler (left context), and post-filler (right context) slots. Initially,


the patterns were created as maximally specific by considering all the left and right context

tokens around the known entities in the labeled documents. RAPIER generalized pairs of

patterns and then specialized pre- and post-fillers. A pattern was evaluated by considering

the positive and negative examples it extracted and the pattern’s complexity, giving prefer-

ence to less complex patterns. Note that since the system was fully supervised, the number

of positive and negative examples was known for each pattern. (LP)2, another bottom-up pattern learning system, had steps similar to RAPIER's. It had two stages: induction of

tagging patterns and induction of correction patterns. Similar to other systems, the training

corpus was manually marked with positive examples and the rest of the corpus was con-

sidered as negative examples. Generalization was performed by relaxing constraints in the

initial patterns, which were created by considering a window of words on the left and the

right of marked tokens. The correction rules, which learned correct boundaries of entities,

were induced from the mistakes made in applying the earlier learned tagging rules on the

training corpus. All entities that matched the patterns were extracted.

IBM Research’s SystemT (Liu et al., 2010) used supervision of correctly and incor-

rectly extracted data to suggest rule refinements to developers. Freitag and Kushmerick

(2000) used extraction patterns as weak learners in a boosting framework to learn patterns

with better recall. Some systems used patterns or patterns-like features in a supervised

machine learning-based classifier. In particular, dependency paths between entities, which

can be thought of as simpler versions of dependency patterns, have been used as features in

various relation extraction systems. Bunescu and Mooney (2005) used dependency path similarity to compute kernel scores in an SVM for relation extraction.

3.1.2 Distantly supervised

There has been a lot of recent work on using distant supervision for entity and relation ex-

traction, both using classifiers and patterns. Bootstrapping or distantly supervised learning

has many variants, such as pattern- or rule-based learning, self-training, co-training, and la-

bel propagation. Yarowsky-style self-training algorithms (Yarowsky, 1995) have been shown to be successful at bootstrapping (Collins and Singer, 1999). Co-training (Blum and

Mitchell, 1998) and its bootstrapped adaptation (Collins and Singer, 1999) require disjoint


views of the features of the data. In an entity learning task, the two views are from the

context and the content of the entities. Bellare et al. (2007) learned attributes of entities

in underspecified queries using the DL-CoTrain algorithm proposed by Collins and Singer

(1999). Whitney and Sarkar (2012) proposed a modified Yarowsky algorithm that used la-

bel propagation on graphs, inspired by an algorithm proposed in Subramanya et al. (2010) that used a large labeled dataset for domain adaptation.

My dissertation is inspired by the system proposed by Riloff (1996). Riloff used a set

of seed entities to bootstrap learning of patterns for entity extraction from unlabeled text. I

describe the iterative algorithm in Chapter 2. She scored a pattern by a weighted conditional

probability measure estimated by counting the number of positive entities among all the

entities extracted by the rule. Riloff and Jones (1999) added another level of bootstrapping

by retaining only the learned entities and restarting the process after each iteration. Thelen

and Riloff (2002) proposed a system called Basilisk that extended the above bootstrapping

algorithm for multi-class learning. Their systems can be viewed as a form of the Yarowsky

algorithm, with pattern learning as an additional step.

Yangarber et al. (2002) learned surface-word patterns to extract diseases and viruses

from medical text using seed sets. Lin et al. (2003) learned names, such as diseases and

locations, using an approach similar to Yangarber et al. (2002), and tested the system on

multiple languages. Stevenson and Greenwood (2005) used Wordnet to assess semantic

similarity between patterns. Talukdar et al. (2006) used seed sets to learn trigger words

for entities and a pattern automaton. Using the learned dictionaries as gazette features, they improved supervised CRF performance on the CoNLL NER task.

Snowball (Agichtein and Gravano, 2000) and DIPRE (Brin, 1999) are two classic

pattern-learning systems. Snowball learned patterns to extract (LOCATION, ORGANIZA-

TION) tuples from text using seed sets of examples. It was inspired by the DIPRE system.

Unlike most other pattern-based systems, Snowball represented a pattern by the left, mid-

dle, and right vectors of terms around the entities. StatSnowball (Zhu et al., 2009) is an

enhanced Snowball system that used Markov logic networks to learn scores of patterns and

selected the patterns using L1 regularization.

There are many different ways of creating and representing patterns. Sudo et al. (2003),


for example, used a subtree model to represent patterns, which is based on arbitrary sub-

trees of dependency trees. Sub-trees can potentially capture more varied context than just

dependency paths or surface context.

The distant supervision can also be provided by using seed patterns instead of seed

examples. Yangarber et al. (2000) used seed dependency patterns to divide a text corpus

into relevant and irrelevant documents, and ranked the candidate patterns according to their

frequency of match in relevant vs. irrelevant documents. Pasca (2004) used seed patterns to

learn generic non-domain specific patterns like ‘X [such as | including] N [and |,|.]’ from

web pages to learn named entities (represented by ‘N’) and their categories (represented by

‘X’). In Chapter 4, I use seed patterns to extract key aspects from scientific articles. In the

rest of the chapters, I use seed entities to extract more entities.

Some papers built IE systems in the setting of traditional semi-supervised learning.

McLernon and Kushmerick (2006) acquired patterns using a small amount of labeled data,

in addition to the unlabeled data. In contrast, Hassan et al. (2006) did not use any seed

examples or patterns; they used human annotation for identifying ‘interesting entities’.

They used a HITS-like algorithm (Kleinberg, 1999) on patterns (authorities) and instances

(hubs) to learn generic relation extractors.

Patterns have also been used to extract attributes of entities. Yahya et al. (2014) used

seed patterns to extract attributes of nouns from queries. Gupta et al. (2014a) used text

patterns to learn attributes of entities and described Biperpedia, an ontology with 1.6M

(class, attribute) pairs.

In this dissertation, I do not discuss canonicalization of entities, which is, in some ways,

a harder task because it involves disambiguation along with extraction. Suchanek et al.

(2009) used pattern-based IE combined with logical constraint checking using a Max-SAT

model to extend existing ontologies. They worked on canonicalization of entities, which

is useful for extension of ontologies and other downstream tasks. Buitelaar and Magnini

(2005) gave an overview of the methods to learn ontologies from text.

Distant supervision can also come from existing human-curated resources, such as web

pages and ontologies. I used automatically generated seed sets from medical ontologies and

webpages in my experiments. Other resources include Freebase (Bollacker et al., 2008),

Wikipedia Infoboxes, and Yago (Suchanek et al., 2007). Mintz et al. (2009) used Freebase


as supervision to learn relation extractors. Xu et al. (2007) learned pattern rules for n-ary

relation extraction, starting with seed examples. Most other systems use hybrid approaches,

which are discussed in the next section. Systems developed for the TAC-KBP slot filling task (http://www.nist.gov/tac/2014/KBP/), a shared task for relation extraction, use Wikipedia Infoboxes as distant supervision

(Surdeanu et al., 2012). Jean-Louis et al. (2011) learned patterns for the TAC-KBP slot

filling task.

3.2 Distantly supervised Non-pattern-based systems

I work on IE systems that use unlabeled data and seed sets of entities. Existing non-pattern-based IE systems can be compared with our systems along two aspects. The first is the use

of unlabeled data. Many IE systems use unlabeled data to learn word embeddings or clus-

ters (such as, Brown clusters) to use them as features in a fully supervised feature-based

classifier (Ratinov and Roth, 2009; Turian et al., 2010). These systems are not distantly

supervised because, even though they make use of unlabeled data, they still need a fully

labeled dataset to learn a robust classifier. I also use word embeddings computed using unlabeled data to improve the entity and pattern scoring functions, though in a bootstrapped setting. Second, many of them, such as CRF-based systems, use sets of entities, also

called gazettes or dictionaries, as features. However, they do not expand the sets of enti-

ties. Most systems perform direct look-up against the dictionaries. Cohen and Sarawagi

(2004) worked on improving the matching of named entity segments to dictionaries to use

as a feature for a sequence model. They used segment-based conditional markov models

(CMM, also called MEMM) to incorporate similarity of named entity segments (instead of

words) with a dictionary’s entries, and model lengths of entities. I compare pattern-based

systems with sequence classifiers in Section 1.1.2.

3.3 Distantly supervised Hybrid systems

There are several ways of combining pattern-based and feature-based learning systems. First, individual components of a pattern-based learning system can use feature-based


learning methods to learn good pattern and entity ranking functions. For example, I use

logistic regression to learn an entity scorer in some of my systems. Second, dependency

paths can be used as features in a classifier, a common practice for building classifier-based

entity and relation extraction systems. Boella et al. (2013) used patterns or syntactic dependencies as features in an SVM for extracting semantic knowledge from legislative text.

Patterns can also be thought of as ‘feature templates’ used in classifiers. In my opinion,

pattern-based learning approaches learn good instantiations of the feature templates. Sur-

deanu et al. (2006) proposed a co-training-based algorithm that used text categorization

along with pattern extraction, starting with seed sets.

Roth and Klakow (2013) used patterns in their system on combining generative and

discriminative relation extraction approaches. Angeli et al. (2014) used learned dependency

patterns, along with a machine learning-based MIML-RE approach (Surdeanu et al., 2012),

to predict relations between two entities.

More recently, DeepDive (Niu et al., 2012) has shown promising results on distantly

supervised relation extraction (Angeli et al., 2014) by using fast inference in Markov logic

networks. Govindaraju et al. (2013) and Zhang et al. (2013) used DeepDive on the task of

extracting structured information like tables from text.

3.3.1 Open IE systems

Open IE, a popular task in recent years, is geared towards learning generic, domain-independent

extractors. KnowItAll’s entity extraction from the web (Downey et al., 2004; Etzioni et

al., 2005) used components such as list extractors, generic and domain specific pattern

learning, and subclass learning. They learned domain-specific patterns using a seed set.

Never-Ending Language Learning (NELL) system (Carlson et al., 2010a) learned multiple

semantic types using coupled semi-supervised training from web-scale data, which is not

feasible for all datasets and entity learning tasks.

Open-IE relation extraction systems like ReVerb (Fader et al., 2011) and OLLIE (Mausam

et al., 2012) learn domain-independent generic relation extractors for web data. However,

using them for a specific domain with a moderately sized corpus leads to poor results. I

tested learning an entity extractor for a given class using ReVerb. I labeled the binary and


unary ReVerb extractions using the class seed entities and retrained its confidence function,

with poor results. Poon and Domingos (2010) found a similar result for inducing a proba-

bilistic ontology: an open information extraction system extracted low accuracy relational

triples on a small corpus.

There has been some work to map generic Open IE extractions to learn extractors for

specific relations. Soderland et al. (2013) manually wrote rules to map Open IE extrac-

tions to TAC-KBP Slot Filling relations in under 3 hours and achieved reasonable perfor-

mance. Improving pattern-based learning systems would also improve the hybrid systems

described above.

Overall, even though pattern-based approaches have been less popular in recent years than feature-based sequence models, following broader trends in the field, they have been shown to be successful at both supervised and bootstrapped entity

learning. The hybrid systems, which have become popular in the last few years, usually

have a pattern learning component. Improving pattern learning would presumably also im-

prove the performance of hybrid systems. In Chapters 4 and 5, I apply the bootstrapped

pattern-based learning approach to two new problems and domains. The results show that

they are very effective at expanding seed sets of entities in these domains. One key com-

ponent missing from the previous systems is the utilization of unlabeled data beyond the

matching of patterns to text. For example, when scoring patterns, unlabeled entities ex-

tracted by patterns are either considered negative or are ignored. In Chapters 6 and 7, I

propose improvements to bootstrapped pattern-based learning systems that leverage unla-

beled data in a better way.


Chapter 4

Studying Scientific Articles and Communities

In this chapter, I present how to study the influence of sub-communities of a research community by extracting key aspects of the articles they publish. I examine the computational linguistics community as a case study. I use bootstrapped pattern learning to extract the

key aspects, starting with only a few seed dependency patterns as supervision. The content

of this chapter is drawn from Gupta and Manning (2011).

4.1 Introduction

The evolution of ideas and the dynamics of a research community can be studied using

the scientific articles published by the community. For instance, we may be interested in

how methods spread from one community to another, or the evolution of a topic from a

focus of research to a problem-solving tool. We might want to find the balance between

technique-driven and domain-driven research within a field. Such a rich insight of the

development and progress of scientific research requires an understanding of more than

just “topics” of discussion or citation links between articles. As an example, to determine

whether technique-driven researchers have greater or lesser impact, we need to be able to

identify styles of work. To achieve this level of detail and to be able to connect together

how methods and ideas are being pursued, it is essential to move beyond bag-of-words


topical models. This requires an understanding of sentence and argument structure, and is

therefore a form of information extraction, if of a looser form than the relation extraction

methods that have typically been studied.

To study the application domains, the techniques used to approach the domain prob-

lems, and the focus of scientific articles in a community, I propose to extract the following concepts from the articles:

FOCUS: an article’s main contribution

TECHNIQUE: a method or a tool used in an article, for example, expectation maxi-

mization and conditional random fields

DOMAIN: an article’s application domain, such as speech recognition and classifica-

tion of documents.

For example, if an article concentrates on regularization in support vector machines and

shows improvement in parsing accuracy, then its FOCUS and TECHNIQUE are regularization

and support vector machines and its DOMAIN is parsing. In contrast, an article that focuses

on lexical features to improve parsing accuracy, and uses support vector machines to train

the model has FOCUS as lexical features and parsing, the TECHNIQUE being lexical fea-

tures and support vector machines, and DOMAIN still is parsing.1 In this case, even though

TECHNIQUEs and DOMAIN of both papers are very similar, the FOCUS phrases distinguish

them from each other. Note that a DOMAIN of one article can be a TECHNIQUE of another,

and vice-versa. For example, an article that shows improvements in named entity recogni-

tion (NER) has DOMAIN as NER, and an article that uses named entities as an intermediary

tool to extract relations has NER as one of its TECHNIQUEs.

I use dependency patterns to extract the above three categories of phrases from arti-

cles, which can then be used to study the influence of communities on each other. The

phrases are extracted by matching semantic (dependency) patterns in dependency trees of

sentences. The input to the extraction system is a set of seed patterns (see Table 4.1 for examples), and the system learns more patterns using a bootstrapping approach similar to the one described in Chapter 2.

1 A community vs. a DOMAIN: a community can be as broad as computer science or statistics, whereas a DOMAIN is a specific application such as Chinese word segmentation.


As a case study, I examine the computational linguistics community and consider the

influence of its sub-fields such as parsing and machine translation. For the study, I use the

document collection from the ACL Anthology Network and the ACL Anthology Reference

corpus (Bird et al., 2008; Radev et al., 2009). To get the sub-fields of the community, I use

latent Dirichlet allocation (LDA) (Blei et al., 2003) to find topics and label them by hand.2

However, our general approach can be used to study the influence of any set of academic communities, including looking more broadly at the influence of statistics or economics across the social sciences.

2 In this chapter, I use the terms communities, sub-communities, and sub-fields interchangeably.

Using the approach, I study how communities influence each other in terms of tech-

niques that are reused, and show how some communities ‘mature’ so that the results they

produce get adopted as tools for solving other problems. For example, the products of the

part-of-speech tagging community have been adopted by many other communities. This is

evidenced by many papers that use part-of-speech tagging as an intermediary step to solve

other problems. Overall, our results show that speech recognition and probability theory

have been the most influential fields in the last two decades, since many communities now

use the techniques introduced by papers in those communities. Probability theory, unlike

speech recognition, is not a sub-field of computational linguistics, but it is an important

topic since many papers use and work on probabilistic approaches.

I also show the timeline of influence of communities. For example, the results show

that formal computational semantics and unification-based grammars had a lot of influence

in the late 1980s. The speech recognition and probability theory fields showed an upward

trend of influence in the mid-1990s, and even though it has decreased in recent years,3 they still have a lot of influence on recent papers, mainly due to techniques like expectation maximization and hidden Markov models.

3 Speech recognition has recently made a comeback with the advances using the deep learning approach.

Contributions I introduce a new categorization of key aspects of scientific articles, which is (1) FOCUS: main contribution, (2) TECHNIQUE: method or tool used, and (3)


DOMAIN: application domain. I extract them by matching dependency patterns to de-

pendency trees, and learn patterns using bootstrapping. I present a new definition of in-

fluence of a research community on another, and present a case study on the computa-

tional linguistics community, both for verifying the results of our system and showing

novel results for the dynamics and the overall influence of computational linguistics sub-

fields. I introduce a dataset of abstracts labeled with the novel categories available at

http://nlp.stanford.edu/pubs/FTDDataset_v1.txt for the research com-

munity.

4.2 Related Work: Scientific Study

While there is some connection to keyphrase selection in text summarization (Radev et al.,

2002), extracting FOCUS, TECHNIQUE and DOMAIN phrases is fundamentally a form of

information extraction, and there has been a wide variety of prior work in this area. Some

work, including the seminal work of Hearst (1992), identified IS-A relations using hand-

written patterns, while other work has learned patterns over dependency graphs (Bunescu

and Mooney, 2005). For more related work on pattern-based systems, see Chapter 3.

Topic models have been used to study the history of ideas (Hall et al., 2008) and schol-

arly impact of papers (Gerrish and Blei, 2010). However, topic models do not extract

detailed information from text as we do. Still, I use topic-to-word distributions from topic

models as a way of describing sub-fields.

Demner-Fushman and Lin (2007) used hand-written knowledge extractors to extract in-

formation, such as population and intervention, in their clinical question-answering system

to improve ranking of relevant abstracts. Our categorization of key aspects is applicable

to a broader range of communities, and we learn the patterns by bootstrapping. Li et al.

(2010) used semantic metadata to create a semantic digital library for Chemistry. They

applied machine learning techniques to identify experimental paragraphs using keyword

features. Xu et al. (2006) and Ruch et al. (2007) proposed systems, in the clinical-trials

and biomedical domain, respectively, to classify sentences of abstracts corresponding to

categories such as introduction, purpose, method, results and conclusion to improve article


FOCUS:     present → (direct object); work → (preposition on); propose → (direct object)
TECHNIQUE: using → (direct object); apply → (direct object); extend → (direct object)
DOMAIN:    system → (preposition for); task → (preposition of); framework → (preposition for)

Table 4.1: Some examples of dependency patterns that extract information from dependency trees of sentences. A pattern is of the form T → (d), where T is the trigger word and d is the dependency that the trigger word's node has with its successor.

retrieval by using either structured abstracts,4 or hand-labeled sentences. Some summariza-

tion systems also use machine learning approaches to find ‘key sentences’. The systems

built in these papers are complementary to ours since one can find relevant paragraphs or

sentences and then extract the key aspects from them. Note that a sentence can have multi-

ple phrases corresponding to our three categories, and thus classification of sentences will

not give similar results.

4.3 Approach

In this section, I explain how to extract phrases for each of the three categories (FOCUS,

TECHNIQUE and DOMAIN) and how to compute the influence of communities.

4.3.1 Extraction

From an article’s abstract and title, I use the dependency trees of sentences and a set of

semantic dependency extraction patterns to extract phrases in each of the three categories.

More details on the dependency patterns and trees are in Chapter 2. Figure 2.1 shows an

example of matching two dependency patterns to a dependency tree. I start with a few

4Structured abstracts, which are used by some journals, have multiple sections such as PURPOSE and METHOD.


handwritten patterns and learn more patterns using a bootstrapping approach. Table 4.1

shows some seed patterns.

To learn more patterns automatically, I run an iterative algorithm that extracts phrases

using semantic patterns, and then learns new patterns from the extracted phrases. Section

2.3 gives a detailed overview of a bootstrapped pattern-based learning approach. Here,

the seed supervision is provided in terms of seed patterns, instead of seed entities. More

specific details of each step are described below.

Extracting Phrases from Patterns

The pattern matching is the same as described in Section 2.2.2. To increase flexibility of

matching the patterns, when matching the dependency edge, I consider dependents and

grand-dependents up to 4 levels. I have special rules for paper titles. I label the whole title as

FOCUS if we are not able to extract a FOCUS phrase using the patterns, as authors usually

include the main contribution of the paper in the title. For titles from which we can extract a

TECHNIQUE phrase, I label the rest of the words (except for trigger words) with DOMAIN.

For example, for the title ‘Studying the history of ideas using topic models’, our system

extracts ‘topic models’ as TECHNIQUE, and then labels ‘Studying the history of ideas’ as

DOMAIN.
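As an illustration of these title heuristics, the following Python sketch puts the two rules into code. It assumes hypothetical inputs (the FOCUS phrases and the TECHNIQUE phrase/trigger pairs already extracted from the title by the patterns); it is a simplified sketch, not the actual implementation.

    def label_title(title_tokens, focus_phrases, technique_matches):
        """Apply the title heuristics described above.

        focus_phrases:     phrases extracted from the title by FOCUS patterns
        technique_matches: list of (phrase, trigger_word) pairs extracted from
                           the title by TECHNIQUE patterns
        """
        labels = {}
        if not focus_phrases:
            # Rule 1: no FOCUS phrase was extracted, so the whole title is FOCUS.
            labels['FOCUS'] = [' '.join(title_tokens)]
        if technique_matches:
            technique_words = {w for phrase, _ in technique_matches for w in phrase.split()}
            triggers = {t for _, t in technique_matches}
            # Rule 2: the remaining title words (except trigger words) are DOMAIN.
            rest = [w for w in title_tokens
                    if w not in technique_words and w not in triggers]
            labels['TECHNIQUE'] = [phrase for phrase, _ in technique_matches]
            labels['DOMAIN'] = [' '.join(rest)]
        return labels

    # For the title 'Studying the history of ideas using topic models', with
    # 'topic models' extracted as TECHNIQUE via the trigger 'using', this yields
    # DOMAIN 'Studying the history of ideas', as in the example above.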

Learning Patterns from Phrases

After extracting phrases with patterns, we want to be able to construct and learn new pat-

terns. For each sentence whose dependency tree has a subtree corresponding to one of the

extracted phrases, I construct a pattern T → (d) by considering the ancestor (parent or

grandparent) of the subtree as the trigger word T , and the dependency between the head

of the subtree and its parent as the dependency d. For each category, I weight the patterns

depending on the categories of phrases from which they are derived. The weighting method

is as follows. For a set of phrases (P ) that extract a pattern (q), the weight of the pattern

q for the category FOCUS is $\sum_{p \in P} \frac{1}{z_p}\,\mathrm{count}(p \in \mathrm{FOCUS})$, where $z_p$ is the total frequency

of the phrase p. Similarly, I get weights of the pattern for the other two categories. Note

that we do not need smoothing since the phrase-category ratios are aggregated over all the


phrases from which the pattern is constructed. After weighting all the patterns that have

not been selected in the previous iterations, I select the top k patterns in each category (k=2

in our experiments). Table 4.3 shows some patterns learned through the iterative method.
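The weighting and selection step can be sketched as follows. This is a minimal Python sketch, assuming that for every candidate pattern we already know which phrases it extracts, how often each phrase was extracted with each category, and each phrase's total frequency z_p; all names are illustrative rather than the actual code.

    from collections import defaultdict

    def score_patterns(pattern_to_phrases, category_counts, total_freq):
        """Weight each candidate pattern for each category, as described above.

        pattern_to_phrases: dict pattern -> set of phrases the pattern extracts
        category_counts:    dict (phrase, category) -> how often the phrase was
                            extracted with that category
        total_freq:         dict phrase -> total frequency z_p of the phrase
        """
        weights = defaultdict(dict)
        for q, phrases in pattern_to_phrases.items():
            for cat in ('FOCUS', 'TECHNIQUE', 'DOMAIN'):
                weights[cat][q] = sum(category_counts.get((p, cat), 0) / total_freq[p]
                                      for p in phrases)
        return weights

    def select_top_k(weights_for_category, already_selected, k=2):
        """Pick the k highest-weighted patterns not selected in earlier iterations."""
        candidates = [(w, q) for q, w in weights_for_category.items()
                      if q not in already_selected]
        return [q for _, q in sorted(candidates, reverse=True)[:k]]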

4.3.2 Communities and their Influence

I define communities as fields or sub-fields that one wishes to study. To study communities

using the articles published, one needs to know which communities each article belongs to.

The article-to-community assignment can be computed in several ways, such as by manual

assignment, using metadata, or by text categorization of papers. In our case study, I use the

topics formed by applying latent Dirichlet allocation (Blei et al., 2003) to the text of the

papers by considering each topic as one community. In recent years, topic modeling has

been widely used to get ‘concepts’ from text; it has the advantage of defining communities

and soft, probabilistic article-to-community assignment scores in an unsupervised manner.

I combine these soft assignment scores with the phrases extracted in the previous section

to score a phrase for each community and category as follows. The score of a phrase p,

which is extracted from an article a, for a community c and the category TECHNIQUE is

calculated as

$$\mathrm{techScore}(c, p, a) = \frac{1}{z_p}\,\mathrm{count}(p \in \mathrm{TECHNIQUE} \mid a)\; P(c \mid a; \theta) \qquad (4.1)$$

where the function P(c | a; θ) gives the probability of a community (i.e., a topic) for an

article a given the topic modeling parameters θ. The normalization constant for the phrase,

z_p, is the frequency of the phrase in all the abstracts. In the rest of the section, I use a_i's for

articles, c_i's for communities, and y's for years.

I define influence such that communities receive higher scores if they use techniques

earlier than other communities do or produce tools to solve other problems. For example,

since hidden Markov models, introduced by the speech recognition community, and part-

of-speech tagging tools, built by the part-of-speech tagging community, have been widely used as

techniques in other communities, these two communities should receive higher scores as com-

pared to some nascent or not-so-widely-used ones. Thus, I define influence of a community


based on the number of times its FOCUS, TECHNIQUE or DOMAIN phrases have been used

as a TECHNIQUE in other communities. To calculate the overall influence of one commu-

nity on another, we first need to calculate influence because of individual articles in the

community, which is calculated as follows. The influence of community c1 on another

community c2 because of a phrase p extracted from an article a1 is

$$\mathrm{techInfl}(c_1, c_2, p, a_1) = \mathrm{allScore}(c_1, p, a_1) \sum_{\substack{a_2 \in D \\ y_{a_2} > y_{a_1}}} \mathrm{techScore}(c_2, p, a_2)\, C(a_2, a_1) \qquad (4.2)$$

where the function allScore(c, p, a) is computed the same way as in Equation 4.1 but by

using count(p ∈ ALL | a), where ALL means the union of phrases extracted in all three

categories. The variable D is the set of all articles, and y_{a_2} is the year of publication of the

article a2. The function C(a2, a1) is a weighting function based on citations, whose value

is 1 if a2 cites a1, and λ otherwise. If λ is 0 then the system calculates influence based on

just citations, which can be noisy and incomplete. In the experiments, I used λ as 0.5, since

we want to study the influence even when an article does not explicitly cite another article.

The function allScore measures how often a phrase is used by a community.

Thus, the technique-influence score of community c1 on community c2 in a particular

year y is computed by summing up the previous equation for all phrases in P and for all

articles in D. It is computed as

$$\mathrm{techInfl}(c_1, c_2, y) = \sum_{p \in P} \; \sum_{\substack{a_1 \in D \\ y_{a_1} = y}} \mathrm{techInfl}(c_1, c_2, p, a_1) \qquad (4.3)$$

where P is the set of all phrases.

The overall influence of community c1 on community c2 is then simply calcu-

lated as


Paper Title: Studying the history of ideas using topic models
  FOCUS: studying the history of ideas using topic
  TECHNIQUE: latent dirichlet allocation; topic; topic; unsupervised topic; historical trends; that all three conferences are converging in the topics
  DOMAIN: studying the history of ideas; topic; model of the diversity of ideas, topic entropy; probabilistic

Paper Title: A Bayesian Hybrid Method For Context-Sensitive Spelling Correction.
  FOCUS: new hybrid method, based on bayesian classifiers; bayesian hybrid method for context-sensitive spelling correction
  TECHNIQUE: decision lists; bayesian; bayesian classifiers; ambiguous; part-of-speech tags; methods using decision lists; single strongest piece of evidence; spelling
  DOMAIN: context-sensitive spelling correction; for context-sensitive spelling correction; spelling

Table 4.2: Extracted phrases for some papers. The word 'model' is missing from the end of some phrases as it was removed during post-processing.

$$\mathrm{techInfl}(c_1, c_2) = \sum_{y} \mathrm{techInfl}(c_1, c_2, y) \qquad (4.4)$$

And, the overall influence of a community c1 is calculated as

$$\mathrm{techInfl}(c_1) = \sum_{c_2 \neq c_1} \mathrm{techInfl}(c_1, c_2) \qquad (4.5)$$

Next, I present a case study over the sub-fields of computational linguistics using the

influence scores described above.
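Before turning to the case study, here is a minimal Python sketch of Equations 4.1 through 4.5 under simplifying assumptions: each article is represented as a dictionary with its publication year, its citations, its topic probabilities P(c | a; θ), and precomputed per-category phrase counts, including an 'ALL' entry for the union of categories. The representation and names are illustrative, not the actual implementation.

    def tech_score(article, community, phrase, total_freq, category='TECHNIQUE'):
        """Equation 4.1: how strongly `phrase` is used with `category` in `article`,
        weighted by the article's soft membership in `community`."""
        count = article['counts'].get((phrase, category), 0)
        return count / total_freq[phrase] * article['topics'].get(community, 0.0)

    def tech_influence(c1, c2, articles, phrases, total_freq, lam=0.5):
        """Equations 4.2-4.4: influence of community c1 on community c2, summed over
        all phrases and over all (earlier article a1, later article a2) pairs."""
        total = 0.0
        for p in phrases:
            for a1 in articles:
                # allScore: same form as techScore, but counting the phrase in any
                # category ('ALL' is assumed to be precomputed as the union).
                all_score = tech_score(a1, c1, p, total_freq, category='ALL')
                if all_score == 0.0:
                    continue
                for a2 in articles:
                    if a2['year'] <= a1['year']:
                        continue
                    citation_weight = 1.0 if a1['id'] in a2['cites'] else lam
                    total += all_score * tech_score(a2, c2, p, total_freq) * citation_weight
        return total

    def overall_influence(c1, communities, articles, phrases, total_freq):
        """Equation 4.5: total influence of community c1 on all other communities."""
        return sum(tech_influence(c1, c2, articles, phrases, total_freq)
                   for c2 in communities if c2 != c1)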

4.4 Experiments

I studied the computational linguistics community from 1965 to 2009 using titles and ab-

stracts of 15,016 articles in the ACL Anthology5 dataset (Bird et al., 2008; Radev et al.,

2009), since it has full text of papers available. I use the full text of papers to build a topic

model. I found 52 pairs of abstracts that had more than 80% of words in common with

5http://www.aclweb.org/anthology


each other; I ignored the influence of the earlier-published paper on the later-published pa-

per in the pairs while calculating the influence scores, because double publishing the same

research presumably does not indicate influence.

When extracting phrases from the matched phrase trees, I ignored tokens with part-

of-speech tags as pronoun, number, determiner, punctuation or symbol, and removed all

subtrees in the matched phrase trees that had either relative-clause-modifier or clausal-

complement dependency with their parents since, even though we want full phrases, in-

cluding these sub-trees introduces extraneous phrases and clauses. I also added phrases

from the subtrees of the matched phrase trees to the set of extracted phrases.

I used 13 seed hand-written patterns for FOCUS, 7 for TECHNIQUE and 15 for DO-

MAIN. When constructing a new pattern for learning, I ignored the ancestors that were not

a noun or a verb since most trigger words are a noun or a verb (such as use, constraints). I

also ignored conjunction, relative-clause-modifier, dependent (most generic dependency),

quantifier-modifier and abbreviation dependencies6 since they either are too generic or in-

troduce extraneous phrases and clauses.

Learning new patterns did not help in improving the FOCUS category phrases when

tested over a hand-labeled test set. It achieved relatively high scores when using just the seed

patterns and the titles, and hence learning new patterns reduced the precision without any

significant improvement in recall. Thus, I learned new patterns only for the TECHNIQUE

and DOMAIN categories. I ran 50 iterations for both categories. After extracting all the

phrases, I removed common phrases that are frequently used in scientific articles, such as

‘this technique’ and ‘the presence of’, using a stop-words list, a set of 3,000 phrases created

by taking the most frequently occurring 1- to 3-grams from 100,000 random articles that have an

abstract in the ISI web of knowledge database7. I ignored phrases that were either one

character or more than 15 tokens long.

In a step towards finding canonical names, I automatically detected abbreviations and

their expanded forms from the full text of papers by searching for text between two paren-

theses, and considered the phrase before the parentheses as the expanded form (similar

6See De Marneffe et al. (2006) for details of these dependencies.

7www.isiknowledge.com


TECHNIQUE: model → (nn); rules → (nn); extracting → (direct-object); identify → (direct-object); constraints → (amod); based → (preposition on)
DOMAIN:    improve → (direct-object); used → (preposition for); evaluation → (nn); parsing → (nn); domain → (nn); applied → (preposition to)

Table 4.3: Examples of patterns learned using the iterative extraction algorithm. The dependency 'nn' is the noun compound modifier dependency.

to Schwartz and Hearst (2003)). I got a high-precision list by picking the most frequently oc-

curring pairs of abbreviations and their expanded forms, and created groups of phrases by

merging all the phrases that use the same abbreviation. I then changed all the phrases in the

extracted phrases dataset to their canonical names. I also removed ‘model’, ‘approach’,

‘method’, ‘algorithm’, ‘based’, ‘style’ words and their variants when they occurred at the

end of a phrase.
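A small sketch of this abbreviation step is given below. The regular expression and the heuristic of taking one word before the parentheses per letter of the abbreviation are simplifications in the spirit of Schwartz and Hearst (2003), not the exact rules used, and all names are illustrative.

    import re
    from collections import Counter

    ABBREV_RE = re.compile(r'\(([A-Za-z][A-Za-z.-]{1,9})\)')

    def find_abbreviation_pairs(text):
        """Yield (abbreviation, expanded form) candidates from one document, e.g.
        'latent dirichlet allocation (LDA)' -> ('lda', 'latent dirichlet allocation')."""
        for match in ABBREV_RE.finditer(text):
            abbrev = match.group(1)
            before = text[:match.start()].split()
            # Heuristic: take one word before the parentheses per letter of the abbreviation.
            expanded = ' '.join(before[-len(abbrev):])
            if expanded:
                yield abbrev.lower(), expanded.lower()

    def build_canonical_map(documents, min_count=5):
        """Keep only frequently occurring pairs (for precision) and map every
        surface form that shares an abbreviation to a single canonical name."""
        pairs = Counter(pair for doc in documents
                        for pair in find_abbreviation_pairs(doc))
        canonical = {}
        for (abbrev, expanded), count in pairs.most_common():
            if count < min_count:
                break
            canonical.setdefault(abbrev, expanded)    # most frequent expansion wins
            canonical.setdefault(expanded, expanded)
        return canonical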

To get communities in the computational linguistics literature, I considered the topics

generated using the same ACL Anthology dataset by Bethard and Jurafsky (2010) as com-

munities. They ran latent Dirichlet allocation on the full text of the papers to get 100 topics.

I, with help from two computational linguistics experts, hand labeled the topics and used

72 of them in my study; the rest of them were about common words. When calculating the

scores in Equation 4.1, I considered the value of P (c | a; θ) to be zero if it was less than

0.1.

4.5 Results

The total numbers of phrases extracted were 25,525 for FOCUS, 24,430 for TECHNIQUE,

and 33,203 for DOMAIN. The total numbers of phrases after including the phrases extracted

from subtrees of the matched phrase trees were 64,041, 38,220, and 46,771, respectively.

Examples of phrases extracted from some papers are shown in Table 4.2.


Approach                    F1      Precision  Recall

FOCUS
Baseline tf-idf NPs         35.60   24.36      66.07
Seed Patterns               55.29   44.67      72.54
Inter-Annotator Agreement   53.33   50.80      56.14

TECHNIQUE
Baseline tf-idf NPs         26.65   17.87      52.41
Seed Patterns               20.09   23.46      21.72
Iteration 50                36.86   30.46      46.68
Inter-Annotator Agreement   72.02   66.81      78.11

DOMAIN
Baseline tf-idf NPs         30.13   19.90      62.03
Seed Patterns               25.27   30.55      26.29
Iteration 50                37.29   27.60      57.50
Inter-Annotator Agreement   72.31   75.58      69.32

Table 4.4: The precision, recall and F1 scores of each category for the different approaches. Note that the inter-annotator agreement is calculated on a smaller set.

For testing, I hand-labeled 474 abstracts with the three categories to measure the pre-

cision and recall scores. For each abstract and each category, I compared the unique non-

stop-words extracted from my algorithm to the hand-labeled dataset. I calculated preci-

sion and recall measures for each abstract and then averaged them to get the results for the

dataset. To compare against a non-information-extraction based baseline, I extracted all

noun phrases (and sub-trees of the noun phrase trees) from the abstracts and labeled them

with all the three categories. In addition, I labeled the titles (and their sub-trees) with the

category FOCUS. I then scored the phrases with a tf-idf inspired measure, which was the

ratio of the frequency of the phrase in the abstract and the sum of the total frequency of the

individual words, and removed phrases that had the tf-idf measure less than 0.001 (best out

of many experiments). I call this approach ‘Baseline tf-idf NPs’.
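The baseline's scoring can be written compactly. The sketch below assumes a corpus-wide word-frequency table and shows only the tf-idf inspired measure and the 0.001 cut-off; the function names and the default count of 1 for unseen words are my own illustrative choices.

    def baseline_score(phrase, abstract_tokens, corpus_word_freq):
        """tf-idf inspired measure: frequency of the phrase in the abstract divided
        by the summed corpus frequency of its individual words."""
        words = phrase.lower().split()
        tokens = [t.lower() for t in abstract_tokens]
        tf = count_occurrences(words, tokens)
        # Unseen words default to 1 here so the denominator stays positive (an assumption).
        denominator = sum(corpus_word_freq.get(w, 1) for w in words)
        return tf / denominator

    def count_occurrences(words, tokens):
        """Number of times the word sequence appears in the token list."""
        n = len(words)
        return sum(1 for i in range(len(tokens) - n + 1) if tokens[i:i + n] == words)

    def keep_phrase(phrase, abstract_tokens, corpus_word_freq, threshold=0.001):
        """Apply the cut-off used for the 'Baseline tf-idf NPs' approach."""
        return baseline_score(phrase, abstract_tokens, corpus_word_freq) >= threshold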

Table 4.4 compares precision, recall and micro-averaged F1 scores for the three cat-

egories when we use: (1) only the seed patterns, (2) the combined set of learned and

seed patterns, and (3) the baseline. I also calculated inter-annotator agreement for 30 ab-

stracts, where each abstract was labeled by 2 annotators,8 and the precision-recall scores

8I annotated all 30 abstracts, and two other doctoral candidates in computational linguistics annotated 15 each.


Figure 4.1: The F1 scores for TECHNIQUE and DOMAIN categories after every five itera-tions. For reasons explained in the text, I do not learn new patterns for FOCUS.

were calculated by randomly choosing one annotation as gold and another as predicted for

each article. We can see that both precision and recall scores increase for TECHNIQUE

because of the learned patterns, though for DOMAIN, precision decreases but recall in-

creases. The recall scores for the baseline are higher as expected but the precision is very

low. Three possible reasons explain the mistakes made by our system: (1) authors some-

times use generic phrases to describe their system, which are not annotated with any of the

three categories in the test set but are extracted by the system (such as ‘We use a simple method . . . ’, ‘We propose a faster model . . . ’, ‘This paper presents a new approach to

. . . ’); (2) the dependency trees of some sentences are wrong; and (3) some of the patterns

learned for TECHNIQUE and DOMAIN were low-precision but high-recall, for example,

[based → (preposition on)] was learned as a TECHNIQUE pattern. The first problem of er-

roneous extraction of generic phrases could perhaps be decreased by allowing restrictions

on the target of the dependency or by disallowing certain kinds of generic positive terms

like ‘simple’, ‘new’, ‘faster’. Figure 4.1 shows the F1 scores for TECHNIQUE and DOMAIN

after every 5 iterations.



Community: Speech Recognition (score 1.35)
  Representative words: recognition, acoustic, error, speaker, rate, adaptation, recognizer, vocabulary, phone
  Most influential phrases: expectation maximization; hidden markov; language; contextually; segment; context independent phone; snn hidden markov; n gram back off language; multiple reference speakers; cepstral; phoneme; least squares; speech recognition; intra; hi gram; bu; word dependent; tree structured; statistical decision trees

Community: Probability Theory (score 1.31)
  Representative words: probability, probabilities, distribution, probabilistic, estimation, estimate, entropy, statistical, likelihood, parameters
  Most influential phrases: hidden markov; maximum entropy; language; expectation maximization; merging; expectation maximization hidden markov; natural language; variable memory markov; standard hidden markov; part of speech; inside outside; segmentation only; minimum description length principle; continuous density hidden markov; part of speech information; forward backward

Community: Bilingual Word Alignment (score 1.2)
  Representative words: alignment, alignments, aligned, pairs, align, pair, statistical, parallel, source, target, links, brown, ibm, null
  Most influential phrases: hidden markov; expectation maximization; maximum entropy; spectral clustering; statistical alignment; conditional random fields, a discriminative; statistical word alignment; string to tree; state of the art statistical machine translation system; single word; synchronous context free grammar; inversion transduction grammar; ensemble; novel reordering

Community: POS Tagging (score 1.13)
  Representative words: tag, tagging, pos, tags, tagger, part-of-speech, tagged, unknown, accuracy, part, taggers, brill, corpora, tagset
  Most influential phrases: maximum entropy; machine learning; expectation maximization hidden markov; part of speech information; decision tree; hidden markov; transformation based error driven learning; entropy; part of speech tagging; part of speech; variable memory markov; viterbi; second stage classifiers; document; wide coverage lexicon; using inductive logic programming

Community: Machine Learning Classification (score 1.12)
  Representative words: classification, classifier, examples, classifiers, kernel, class, svm, accuracy, decision, methods, labeled, vector, instances
  Most influential phrases: support vector machines; ensemble; machine learning; gaussian mixture; expectation maximization; flat; weak classifiers; statistical machine learning; lexicalized tree adjoining grammar based features; natural language processing; standard text categorization collection; pca; semisupervised learning; standard hidden markov; supervised learning

Table 4.5: The top 5 influential communities with their most influential phrases. The representative words are the top words that describe each community, obtained by the topic model, and the most influential phrases are those that have been widely used as techniques. The score of each community is computed by Equation 4.5.


Community: Statistical Parsing (score 0.92)
  Representative words: parse, treebank, trees, parses, penn, collins, parsers, charniak, accuracy, wsj, head, statistical, constituent, constituents
  Most influential phrases: propbank; expectation maximization; supervised machine learning; maximum entropy classifier; ensemble; lexicalized tree adjoining grammar based features; neural network; generative probability; incomplete constituents; part of speech tagging; treebank; penn; 50 best parses; lexical functional grammar; maximum entropy; full comlex resource

Community: Statistical Machine Translation (More-Phrase-Based) (score 0.82)
  Representative words: bleu, statistical, source, target, phrases, smt, reordering, translations, phrase-based
  Most influential phrases: maximum entropy; hidden markov; expectation maximization; language; linguistically structured; ihmm; cross language information retrieval; ter; factored language; billion word; hierarchical phrases; string to tree; state of the art statistical machine translation system; statistical alignment; ist inversion transduction grammar; bleu as a metric; statistical machine translation

Community: Parsing (score 0.81)
  Representative words: grammars, parse, chart, context-free, edge, edges, production, symbols, symbol
  Most influential phrases: natural language processing; expectation maximization; natural language; inside outside; rule; macro; various filtering strategies; tomita's parser; forward backward; phrase structure; synchronous context free grammars; cky; termination properties; extraposition grammars

Community: Chunking/Memory Based Models (score 0.8)
  Representative words: chunk, chunking, chunks, pos, accuracy, best, memory-based, daelemans, van, base
  Most influential phrases: state of the art machine learning; conditional random fields; support vector machines; machine learning; using hidden markov; maximum entropy; memory based learning; hidden markov; standard hidden markov; second stage classifiers; weak classifiers; flat; conll 2004; iob; probabilities output; high recall

Community: Discriminative Sequence Models (score 0.72)
  Representative words: label, conditional, sequence, random, labels, discriminative, inference, crf, fields, labeling
  Most influential phrases: conditional random fields; ensemble; maximum entropy; maximum entropy; conditional random fields, a discriminative; large margin; perceptron; hidden markov; generalized perceptron; pseudo negative examples; natural language processing; entropy; singer; latent variable; character level; named entity

Table 4.6: The next 5 influential communities with their most influential phrases. The representative words are the top words that describe each community, obtained by the topic model, and the most influential phrases are those that have been widely used as techniques. The score of each community is computed by Equation 4.5.


Figure 4.2: The influence scores of communities in each year.

Named Entity Recognition: Chunking/Memory Based Models; Discriminative Sequence Models; POS Tagging; Machine Learning Classification; Coherence Relations; Biomedical NER; Bilingual Word Alignment

Statistical Parsing: Probability Theory; POS Tagging; Discriminative Sequence Models; Speech Recognition; Parsing; Syntactic Theory; Clustering + Distributional Similarity; Chunking/Memory Based Models

Word Sense Disambiguation: Clustering + Distributional Similarity; Machine Learning Classification; Dictionary Lexicons; Collocations/Compounds; Syntax; Speech Recognition; Probability Theory

Table 4.7: Each community listed on the left has been influenced the most by the communities listed after it, in descending order of the scores calculated using Equation 4.4.

Influence

Tables 4.5 and 4.6 show the most influential communities overall and their respective in-

fluential phrases that have been widely adopted as techniques by other communities. The

third column is the score of the community calculated using Equation 4.5. We can see that


Figure 4.3: The popularity of communities in each year. It is measured by summing up the article-to-topic scores for the articles published in that year (see Hall et al. (2008)). The scores are smoothed with weighted scores of the 2 previous and 2 next years, and L1-normalized for each year. The scores are lower for all communities in the late 2000s since the probability mass is more evenly distributed among many communities. Contrast the relative popularity of the communities with their relative influence shown in Figure 4.2.

speech recognition is the most influential community because of the techniques like hidden

Markov models and other stochastic methods it introduced in the computational linguistics

literature. This shows that its long-term seeding influence is still present despite the lim-

ited popularity around 2000s. Probability theory also gets a high score since many papers

in the last decade have used stochastic methods. The communities part-of-speech tagging

and parsing get high scores because they adopted some techniques that are used in other

communities, and because other communities use part-of-speech tagging and parsing in the

intermediary steps for solving other problems.

Figure 4.2 shows the change in a community’s influence over time. The scores are nor-

malized such that the total score for all communities in a year sum to one. Compare the

relative scores of communities in the figure with the relative scores in Figure 4.3, which


Figure 4.4: The influence scores of machine translation related communities. The statistical machine translation community, which is a topic from the topic model, is more phrase-based.

shows sum of all article-to-topics scores for each community for articles published in a

given year, and is normalized the same way as before. There is a huge spike for the Speech

Recognition community for the years 1989–1994. Hall et al. (2008) note, “These years

correspond exactly to the DARPA Speech and Natural Language Workshop, held at differ-

ent locations from 1989–1994. That workshop contained a significant amount of speech

until its last year (1994), and then it was revived in 2001 as the Human Language Technol-

ogy workshop with a much smaller emphasis on speech processing.” See their paper Hall

et al. (2008) for more analysis. Note that this analysis uses just bag-of-words-based topic

models.

Comparing Figures 4.2 and 4.3, we can see that the influence of a community is different from

the popularity of a community in a given year. As mentioned before, we observe that al-

though the influence score for speech recognition has declined during 1997-2009, it still has

a lot of influence, though the popularity of the community in recent years is very low. Ma-

chine learning classification has been both popular and influential in recent years. Figures

4.4 and 4.5 compare the machine translation communities in the same way as we compare

other communities in Figures 4.2 and 4.3. We can see that the statistical machine translation

(more phrase-based) community’s popularity increased steeply from late 2002 to 2009;


Figure 4.5: Popularity of machine translation communities in each year. The statistical machine translation community, which is a topic from the topic model, is more phrase-based. Contrast the relative popularity scores with the relative influence scores shown in Figure 4.4.

however, its influence has increased at a slower rate. On the other hand, the influence of

bilingual word alignment (the most influential community in 2009) has increased during

the same period, mainly because of its influence on statistical machine translation. The in-

fluence of non-statistical machine translation has been decreasing recently, though slower

than its popularity. Table 4.7 shows the communities that have the most influence on a

given community (the list is in descending order of scores by Equation 4.4).

Comparison with Supervised CRF

In this section, I present an experiment performed after Gupta and Manning (2011) was

published. To compare the BPL approach with dependency patterns against

a supervised CRF, I divided the labeled examples used as the test set into two. One half was

reserved for training a CRF model and the rest was used to test both BPL and CRF. Note

that the supervision provided to BPL and CRF are very different. BPL did not have access

to the fully labeled abstracts, which the CRF used. Instead, it used the same seed patterns

as before. Since fully labeled abstracts have each token labeled, they are of much higher

quality than seed patterns. Table 4.8 shows the scores of the two systems for the labels


                        TECHNIQUE                      DOMAIN
System                  Precision  Recall  F1          Precision  Recall  F1
Supervised CRF          41.55      31.51   35.38       53.90      52.80   55.05
Bootstrapped Patterns   29.37      56.10   38.56       37.56      30.66   48.45

Table 4.8: Comparison of our BPL-based approach and supervised CRF for the task.

TECHNIQUE and DOMAIN. Note that the scores are not directly comparable to the previous

results in the chapter since the test set is now half of the previous test set. Supervised CRF

performed better than BPL for the label DOMAIN as expected. Surprisingly, BPL performed

better than Supervised CRF for TECHNIQUE, even with lower supervision.

4.6 Further Reading

There has been a surge of interest in studying academic communities and the technical ad-

vancements made in published literature in the last few years. IARPA funded a program

called Foresight and Understanding from Scientific Exposition (FUSE) to develop auto-

mated methods to study technical contributions made by scientific, technical, and patent

literature.9 Tsai et al. (2013) used a bootstrapping approach to identify and categorize

scientific concepts in research literature. They used the context of citations to cluster the

extracted mentions into concepts. Tateisi et al. (2014) released an annotation framework

for relation extraction from research literature in computer science. The Semantic Scholar

project from the Allen Institute for Artificial Intelligence10 is focused on understand-

ing scientific literature semantically. One of their papers (Valenzuela et al., 2015) studied

identifying meaningful citations using a supervised method.

9http://www.iarpa.gov/index.php/research-programs/fuse

10http://allenai.org/semantic-scholar.html


4.7 Conclusion

This chapter presented a framework for extracting detailed information from scientific ar-

ticles, such as main contributions, tools and techniques used, and domain problems ad-

dressed, by matching semantic extraction patterns in dependency trees. I start with a few

hand-written seed patterns and learn new patterns using a bootstrapping approach. I use

this rich information extracted from the articles to study the dynamics of research commu-

nities, and define a new way of measuring influence of one research community on another.

I present a case study on the computational linguistics community, where I examine the in-

fluence of its sub-fields, and observe that speech recognition and probability theory have

had the most seminal influence.

The results show that bootstrapped pattern-based learning is an effective approach for

this task. Since the task is new, there exists no fully labeled dataset of scientific articles

labeled with the three categories. Bootstrapping with a few hand written patterns provides

enough supervision to learn more patterns and entities. Henceforth, I will apply a simi-

lar approach, bootstrapped lexico-syntactic surface-word pattern-based learning, to extract

entities from a very different domain – patient authored text.


Chapter 5

Inducing Lexico-Syntactic Patterns for Information Extraction on Medical Forums

In the last chapter, I presented bootstrapped pattern-based learning as an effective approach

for extracting key aspects from scientific papers. In this chapter, I describe using the ap-

proach for extracting entities from another domain – patient authored text. Patient authored

text is usually very different from the content in scientific papers; many sentences are un-

grammatical, and the sentences have various spelling mistakes, variations in naming enti-

ties, and extensive use of slang words. I develop a system to extract drugs & treatments and

symptoms & diseases from users’ posts on online medical forums. The system outperforms

existing medical text annotators and state-of-the-art classifier-based systems. I also discuss

how the extractor can be used to study the efficacy of drugs & treatments on a large scale.

This work was published in Gupta et al. (2014b).

5.1 Introduction

In 2013, 59% of adults in the United States sought health information on the Internet (Fox

and Duggan, 2013). While these users typically have no formal medical education, they

generate large volumes of patient-authored text (PAT) in the form of medical blogs and


discussions on online health forums. Their contributions range from rare disease diagnosis

to drug and treatment efficacy.

My eventual goal is to enable open-ended mining and analysis of PAT to improve health

outcomes. In particular, PAT can be a great resource for extracting the efficacy and side-

effects of both pharmaceutical and alternative treatments. Prior work demonstrates the

knowledge value of PAT in mining adverse drug events (Leaman et al., 2010), predicting flu

trends (Carneiro and Mylonakis, 2009) (although caution is needed (Butler, 2013)), explor-

ing drug interactions (White et al., 2013), and replicating results of a double-blind medical

trial (Wicks et al., 2011). Already websites such as http://www.medify.com and

http://www.treato.com aggregate information on the efficacy and side effects of

drugs from PAT. Extraction of sentiment and side effects for drugs and treatments in PAT is

only possible on a large scale when we have tools to discover and robustly identify entities

such as symptoms, conditions, drugs, and treatments in the text. Most of the research in

extracting such information has focused on clinicians’ notes, and thus most annotation sys-

tems are tailored towards them. Unlike expert-authored text, which is composed of terms

routinely used by the medical community, PAT contains a great deal of slang and verbose,

informal descriptions of symptoms and treatments (for example, ‘feels like a brick on my

heart’ or ‘Watson 357’ for Vicodin). Previous research has shown that most terms used by

consumers are not in ontologies (Smith and Wicks, 2008).

I propose inducing lexico-syntactic patterns using seed dictionaries to identify specific

medical entity types in PAT. The patterns generalize terms from seed dictionaries to learn

new entities. I test our method over two entity types: symptoms & conditions (SC), and

drugs & treatments (DT) on two of MedHelp’s forums: Asthma and Ear, Nose & Throat

(ENT). I also report the results of applying the system to three other forums on MedHelp:

Adult Type II Diabetes, Acne, and Breast Cancer. Our system is able to extract SC and DT

phrases that are not in the seed dictionaries, such as ‘cinnamon pills’ and ‘Opuntia’ as DT

from the Diabetes forum, and ‘achiness’ and ‘lumpy’ as SC from the Breast Cancer forum.


5.2 Objective

The objective is to learn new SC and DT phrases from PAT without using hand written

rules or any hand-labeled sentences. I define SC as any symptom or condition mentioned

in text. The DT label refers to any treatment taken or intervention performed in order to

improve a symptom or condition. It includes pharmaceutical treatments and drugs, surg-

eries, interventions (like ‘getting rid of cat and carpet’ for Asthma patients), and alternative

treatments (like ‘acupuncture’ or ‘garlic’). Note that our system ignores negations (for ex-

ample, in the sentence ‘I don’t have Asthma’, ‘Asthma’ is labeled SC) since it is preferable

to extract all SC and DT mentions and handle the negations separately, if required. The

labels include all relevant generic terms (for example, ‘meds’, ‘disease’). Devices used to

improve a symptom or condition (like inhalers) are included in DT, but devices that are

used for monitoring or diagnosis are not. Some examples of sentences from the Asthma

and ENT forums labeled with SC (in italics) and DT (in bold) labels are shown below:

I don’t agree with my doctor’s diagnostic after research and I think I may have

a case of Sinus Mycetoma

I started using an herbal mixture especially meant for Candida with limited

success.

however , with the consistent green and occasional blood in nasal discharge

(but with minimal “stuffy” feeling), I wonder if perhaps a problem with chronic

sinusitis and or eustachian tubes

She gave me albuteral and symbicort (plus some hayfever meds and asked

me to use the peak flow meter.

My sinus infections were treated electrically, with high voltage million volt electricity, which solved the problem, but the treatment is not FDA approved

and generally unavailable, except under experimental treatment protocols.


5.3 Related Work: Medical IE

Medical term annotation is a long-standing research challenge. However, almost no prior

work focuses on automatically annotating PAT. Tools like TerMINE (Frantzi et al., 2000)

and ADEPT (MacLean and Heer, 2013) do not identify specific entity types. Other ex-

isting tools like MetaMap (Aronson, 2001), the OBA (Jonquet et al., 2009), and Apache

cTakes1 perform poorly mainly because they are designed for fine-grained entity extraction

on expert-authored text. They essentially perform dictionary matching on text based on

source ontologies (Aronson, 2001; Jonquet et al., 2009; Aronson and Lang, 2010). Despite

being the go-to tools for medical text annotation, previous studies (Pratt and Yetisgen-

Yildiz, 2003) comparing OBA and MetaMap to human annotator performance underscore

two sources of performance error, which we also notice in our results. The first is ontology

incompleteness, which results in low recall, and second is inclusion of contextually irrel-

evant terms (MacLean and Heer, 2013). For example, when restricted to the RxNORM

ontology and semantic type Antibiotic (T195), OBA will extract both Today and Penicillin

from the sentence “Today I filled my Penicillin rx”. Other approaches focusing on expert-

authored text show improvement in identifying food and drug allergies (Epstein et al., 2013)

and disease normalization (Kang et al., 2012) with the use of statistical methods. While

these statistically-based approaches tend to perform well, they require hand labeled data,

which is both manually intensive to collect and does not generalize across PAT sources.

The most relevant work to ours is in building the Consumer Health Vocabularies (CHVs).

CHVs are ontologies designed to bridge the gap between patient language and the UMLS

Metathesaurus. We are aware of two CHVs: the (OAC) CHV (Zeng and Tse, 2006)2 and

the MedlinePlus CHV3. To date, most work in this area focuses on identifying candidate

terms of general medical relevance, and not specific entity types. We use the OAC CHV to

construct our seed dictionaries.

There has been some work that extracts information from PAT. In a study investigating

the feasibility of mining adverse drug events from user comments on DailyStrength (www.

1http://ctakes.apache.org

2http://www.consumerhealthvocab.org

3http://www.nlm.nih.gov/medlineplus/xml.html


dailystrength.org), Leaman et al. (2010) achieve an F1-score of 73.9% against hu-

man annotators utilizing a lexicon-based approach. This approach differs from ours in that

they do not learn new lexicon terms from their data. Other approaches focusing on expert-

authored text show improvement with the use of statistical methods. For example, Epstein

et al. (2013) utilize RxNorm and several NLP techniques to achieve F1, Precision and Re-

call scores in the 90s for identifying food and drug allergies entered using non-standard

terminology in allergy and sensitivity entries in the Vanderbilt perioperative information

management system. Kang et al. (2012) were able to improve both MetaMap and Peregrine

disease normalization F1-scores significantly (by about 15%) by post-processing annotator

output using NLP rules for entity resolution.

In this chapter, I extract SC and DT terms by inducing lexico-syntactic surface-word

patterns. The general approach has been shown to be useful in learning different semantic

lexicons, as discussed in Chapter 2.

5.4 Materials and Methods

5.4.1 Dataset

I used discussion forum text from MedHelp4, one of the biggest online health community

websites. See Section 2.5 for more details of the dataset. I excluded from our dataset

sentences from one user who had posted very similar posts several thousand times. I test the

performance of our system in extracting DT and SC phrases on sentences from two forums:

the Asthma forum and the Ear, Nose and Throat (ENT) forum. The Asthma and ENT

forums consist of 39,137 and 215,123 sentences, respectively, in our dataset. In addition,

I present qualitative results of our system run on three other forums: the Adult Type II

Diabetes forum (63,355 sentences), the Acne forum (65,595 sentences), and the Breast

Cancer forum (296,861 sentences). I used the Stanford CoreNLP toolkit (Manning et al.,

2014) to tokenize text, split it into sentences, and to label the tokens with their part-of-

speech tags and lemma (that is, canonical form). I converted all text to lowercase because

PAT usually contains inconsistent capitalization.

4Data spans from 2007 to May 2011. Available from: http://www.medhelp.org.


Initial Labeling Using Dictionaries

As the first step, I ‘partially’ label data using matching phrases from our DT and SC dic-

tionaries. Our DT dictionary, comprising 38,684 phrases, was sourced from Wikipedia’s

list of drugs, surgeries and delivery devices; RxList5; MedlinePlus6; Medicinenet7 phrases

with semantic type ‘procedures’ from MedDRA8; and phrases with relevant semantic types

(Antibiotic, Clinical Drug, Laboratory Procedure, Medical Device, Steroid, and Therapeu-

tic or Preventive Procedure) from the NCI thesaurus.9

Our SC dictionary comprises 100,879 phrases, and was constructed using phrases from

MedlinePlus, Medicinenet, and from MedDRA (with semantic type ‘disorders’). We ex-

panded both dictionaries using the OAC Consumer Health Vocabulary10 by adding all syn-

onyms of the phrases previously added. Because the dictionaries are automatically con-

structed with no manual editing, they might have some incorrect phrases. However, the

results show that they perform effectively.

I label a phrase with the dictionary label when the sequence of non-stop-words (or their

lemmas) matches an entry in the dictionary. To match spelling mistakes and morpholog-

ical variations (like ‘tickly’), which are common in PAT, I do a fuzzy matching. A token

matches a word in the dictionary if the token is longer than 6 characters and the token and

the word are edit distance one away. I ignore words ‘disease’, ‘disorder’, ‘chronic’, and

‘pre-existing’ in the dictionaries when matching phrases. I remove phrases that are very

common on the Internet by compiling a list of the 2000 most common words from Google

Ngrams, called GoogleCommonList henceforth. See Section 2.4 for more information on

Google N-grams. This helps exclude words like ‘Today’ and ‘AS’, which are also names

of medicines. Tokens that are labeled as SC by the SC dictionary are not labeled DT, to

avoid labeling ‘asthma’ as DT in the phrase ‘asthma meds’, in case ‘asthma meds’ is in the

DT dictionary.
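The fuzzy token match can be sketched in a few lines of Python. The edit-distance routine below is a standard dynamic-programming implementation; the sketch only illustrates the matching rule (exact match, or a token longer than 6 characters within edit distance one), and the misspelling in the example comment is made up for illustration.

    def edit_distance(a, b):
        """Standard Levenshtein distance via dynamic programming."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                  # deletion
                               cur[j - 1] + 1,               # insertion
                               prev[j - 1] + (ca != cb)))    # substitution
            prev = cur
        return prev[-1]

    def tokens_match(token, dict_word):
        """Exact match, or fuzzy match for long tokens to handle misspellings and
        simple morphological variants common in patient-authored text."""
        if token == dict_word:
            return True
        return len(token) > 6 and edit_distance(token, dict_word) == 1

    # Example (made-up misspelling): tokens_match('asthama', 'asthma') -> True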

5www.rxlist.com, Accessed January 2013.

6http://www.nlm.nih.gov/medlineplus, Accessed January 2013.

7http://www.medicinenet.com, accessed January 2013.

8MedDRA stands for Medical Dictionary for Regulatory Activities. http://www.meddra.org, Accessed February 2013.

9http://ncit.nci.nih.gov, Accessed March 2013.

10Open Access, Collaborative Consumer Health Vocabulary Initiative. http://www.consumerhealthvocab.org, accessed February 2013.


5.5 Inducing Lexico-Syntactic Patterns

In Chapter 2, I gave a high level overview of the steps of a bootstrapped pattern learning

system. Below is a summary of the steps.

1. Label data using dictionaries

2. Create patterns using the labeled data and choose top K patterns

3. Extract phrases using the learned patterns and choose top N words

4. Add new phrases to the dictionaries

5. Repeat 1-4 T times or until converged

I experimented with different phrase and pattern weighting schemes (for example, ap-

plying log sublinear scaling in the weighting formulations below) and parameters for our

system. I selected the ones that performed best on the Asthma forum test sentences. Below,

I explain the algorithm using DT as an example label for ease of explanation.

5.5.1 Creating Patterns

I create potential patterns by looking at two to four words before and after the labeled

tokens. I discard contexts that consist of 2 or fewer stop words because they are too general

and extract many noisy entities. Contexts with 3 or more stop words are included because

the long context makes them less general, for example, ‘I am on X’ is a good pattern to

extract DTs. Words that are labeled with one of the dictionaries are generalized with the

class of the dictionary. I create flexible patterns by ignoring the words {‘a’, ‘an’, ‘the’,

‘,’, ‘.’} while matching the patterns and by allowing at most two stop words between the

context and the term to be extracted. I create two sets of the above patterns – with and

without the part-of-speech (POS) restriction of the target phrase (for example, that it only

contains nouns). Since many symptoms and drugs tend to be more than just one word, I

allow matching 1 to 2 tokens. In our experiments, matching 3 or more consecutive terms

extracted noisy phrases, mostly by patterns without the POS restriction. Table 2.1 shows

an example of two patterns and how they match to two sentences.
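A minimal sketch of generating candidate patterns from left contexts is shown below. It covers only the left-side window and the stop-word filter; the real system also builds right-side contexts, POS-restricted variants, and the flexible matching described above. The tiny stop-word list and all names are purely illustrative.

    # A tiny stop-word list for illustration only; the real system uses a full list.
    STOP_WORDS = {'i', 'am', 'on', 'the', 'a', 'an', 'and', 'of', 'to', 'in', 'my'}

    def left_context_patterns(tokens, labels, start, min_len=2, max_len=4):
        """Generate candidate left-context patterns for the labeled phrase that
        starts at token index `start`. Context words already labeled by a
        dictionary are generalized to their class (SC or DT)."""
        patterns = []
        for n in range(min_len, max_len + 1):
            if start - n < 0:
                continue
            window = [labels[i] if labels[i] in ('SC', 'DT') else tokens[i].lower()
                      for i in range(start - n, start)]
            # Contexts made up of only 2 or fewer stop words are too general.
            if len(window) <= 2 and all(w in STOP_WORDS for w in window):
                continue
            patterns.append(tuple(window))
        return patterns

    # Example: tokens = ['i', 'am', 'on', 'prednisone'], labels = ['O', 'O', 'O', 'DT']
    # left_context_patterns(tokens, labels, 3) -> [('i', 'am', 'on')]
    # The two-word context ('am', 'on') is dropped as too general, while the
    # three-stop-word context 'i am on X' is kept, matching the rationale above.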


5.5.2 Learning Patterns

I learn new patterns by weighting them using normalization measures and selecting the top

patterns. In essence, we want to trade off precision and recall of the patterns to extract the

correct phrases. The weighting scheme for a pattern i is

$$pt_i = \frac{\sum_{k=1}^{m} \sqrt{\mathrm{freq}(i, w_k)}}{\sum_{j=1}^{n} \sqrt{\mathrm{freq}(i, w_j)}} \qquad (5.1)$$

where m is the number of words with the label DT that match the pattern, n is the number of

all words that match the pattern, and freq(i, wk) is the number of times pattern i matched

the phrase wk. Sublinear scaling of the frequency prevents high frequency words from

overshadowing the contribution of low frequency words. Using the RlogF pattern scoring

function (Riloff, 1996) led to lower scores in the pilot experiments. I discard patterns that

have weight less than a threshold (=0.5 in our experiments). I also discard patterns when m

is equal to n since adding them would be of no benefit for learning new phrases. I remove

patterns that occur in the top 500 patterns for the other label. After calculating weights for

all the remaining patterns, I choose the top K (=50 in our experiments) patterns.
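The scoring and selection of patterns can be sketched as follows. This is a simplified Python sketch of Equation 5.1 and the selection rules above (the removal of patterns that also rank highly for the other label is omitted); all names are illustrative.

    import math

    def pattern_weight(matched_counts, dt_phrases):
        """Equation 5.1: fraction of a pattern's extractions that are already labeled
        DT, with square-root scaling so very frequent phrases do not dominate.

        matched_counts: dict phrase -> freq(i, phrase), how often pattern i
                        extracted that phrase
        dt_phrases:     set of phrases currently labeled DT
        """
        num = sum(math.sqrt(f) for w, f in matched_counts.items() if w in dt_phrases)
        den = sum(math.sqrt(f) for f in matched_counts.values())
        return num / den if den > 0 else 0.0

    def select_patterns(pattern_matches, dt_phrases, k=50, threshold=0.5):
        """Score all candidate patterns and keep the top k above the threshold.
        Patterns whose matches are all already labeled (m == n) are dropped
        because they cannot contribute new phrases."""
        scored = []
        for pattern, counts in pattern_matches.items():
            w = pattern_weight(counts, dt_phrases)
            if w < threshold:
                continue
            if all(phrase in dt_phrases for phrase in counts):
                continue
            scored.append((w, pattern))
        return [pattern for _, pattern in sorted(scored, reverse=True)[:k]]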

5.5.3 Learning Phrases

I apply the patterns selected by the above process to all the sentences and extract the

matched phrases. The phrase weighting scheme is a combination of TF-IDF scoring,

weight of the patterns, and relative frequency of the phrases in different dictionaries. The

latter weighting term assigns higher weight to words that are sub-phrases of phrases in the

entity’s dictionary. The weighting function for a phrase p for the label DT is

$$\mathrm{weight}(p, \mathrm{DT}) = \left( \frac{\sum_{i=1}^{t} \mathrm{num}(p, i) \times pt_i}{\log(\mathrm{freq}_p)} \right) \times \frac{1 + \mathrm{dictDTFreq}_p}{1 + \mathrm{dictSCFreq}_p} \qquad (5.2)$$

where t is the number of patterns that extract the phrase p, num(p, i) is the number of times

phrase p is extracted using pattern i, pt_i is the weight of the pattern i from the previous

equation, freq_p is the frequency of phrase p in the corpus, and dictDTFreq_p and dictSCFreq_p are the frequencies of phrase p in the n-grams of the phrases from the DT dictionary and the


SC dictionary, respectively. I discard phrases with weight less than a threshold (=0.2 in our

experiments). I also discard phrases that are matched by less than 2 patterns to improve

precision of the system – phrases extracted by multiple patterns tend to be more accurate.
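A simplified sketch of the phrase scoring in Equation 5.2 and the selection rules just described is shown below; the dictionaries of pattern weights and n-gram frequencies are assumed to be precomputed, and all names are illustrative.

    import math

    def phrase_weight(phrase, extractions, pattern_weights, corpus_freq,
                      dt_ngram_freq, sc_ngram_freq):
        """Equation 5.2: combine pattern support, an inverse corpus-frequency term,
        and the relative frequency of the phrase in DT vs. SC dictionary n-grams.

        extractions:     dict pattern -> num(p, i), times that pattern extracted p
        pattern_weights: dict pattern -> pt_i from Equation 5.1
        """
        support = sum(n * pattern_weights[i] for i, n in extractions.items())
        # corpus_freq[phrase] is assumed here to be at least 2, so the log is positive.
        tf_idf_like = support / math.log(corpus_freq[phrase])
        dict_ratio = (1 + dt_ngram_freq.get(phrase, 0)) / (1 + sc_ngram_freq.get(phrase, 0))
        return tf_idf_like * dict_ratio

    def select_phrases(phrase_weights, patterns_per_phrase, n=10, threshold=0.2,
                       min_patterns=2):
        """Keep phrases above the weight threshold that were matched by at least
        two patterns, and return up to the top N."""
        kept = [(w, p) for p, w in phrase_weights.items()
                if w >= threshold and patterns_per_phrase.get(p, 0) >= min_patterns]
        return [p for _, p in sorted(kept, reverse=True)[:n]]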

I remove the following kinds of phrases from the set of potential phrases: (1) a list of specialists and physicians downloaded from WebMD11, (2) words in the GoogleCommonList, (3) the 5000 most frequent tokens from around 1 million tweets from Twitter, to avoid learning slang words like ‘asap’, and (4) phrases that are already in any of the dictionaries. I then extract up to the top N (=10 in our experiments) words and label those phrases in the sentences. I also remove body-part phrases (198 phrases that were curated from Wikipedia and manually expanded by us) from the set of potential DT phrases.

I repeat the cycle of learning patterns and learning phrases T times (=20 in our experiments) or until no more patterns and words can be extracted.
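The overall loop can be sketched as follows; the three callables are placeholders for the dictionary labeling, pattern learning (Section 5.5.2), and phrase learning (Section 5.5.3) steps, and the SC label is handled analogously.

def bootstrap(label_corpus, learn_patterns, learn_phrases, dt_dict, iterations=20):
    """Skeleton of the bootstrapping loop described above (a sketch)."""
    for _ in range(iterations):
        label_corpus(dt_dict)              # re-label sentences with the current dictionary
        patterns = learn_patterns()        # Equation 5.1, top K patterns
        phrases = learn_phrases(patterns)  # Equation 5.2, top N phrases
        if not patterns or not phrases:    # stop early when nothing new is learned
            break
        dt_dict.update(phrases)            # expand the dictionary and iterate
    return dt_dict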

5.6 Evaluation Setup

5.6.1 Test Data

I tested our system and the baselines on two forums – Asthma and ENT. For each forum, I

randomly sampled 500 sentences, and my collaborator and I annotated 250 sentences each.

The test sentences were removed from the data used in the learning system. The labeling

guidelines for the annotators for the test sentences were to include the minimum number of

words to convey the medical information. To calculate the inter-annotator agreement, the

annotators labeled 50 sentences from the 250 sentences assigned to the other annotator; the

agreement is thus calculated on 100 sentences out of the 500 sentences. The token-level

agreement for the Asthma test sentences was 96% with Cohen’s kappa κ=0.781, and for

the ENT test sentences was 96.2% with Cohen’s kappa κ=0.801. I used the Asthma forum

as a development forum to select parameters, such as the maximum number of patterns and

phrases added in an iteration, total number of iterations, and the thresholds for learning

patterns and phrases. I discuss the effect of varying these parameters in the additional experiments section below. I used ENT as a test forum; no parameters were tuned on the ENT

11http://www.webmd.com, accessed October 2013


forum test set.
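The token-level agreement and Cohen's kappa values reported above can be computed as in the following sketch; the label lists here are hypothetical, not the actual annotations.

def token_agreement_and_kappa(labels_a, labels_b):
    """Token-level agreement and Cohen's kappa for two annotators (a sketch;
    labels_a and labels_b are per-token label lists, e.g. 'DT', 'SC', 'none')."""
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n       # observed agreement
    label_set = set(labels_a) | set(labels_b)
    p_e = sum((labels_a.count(l) / n) * (labels_b.count(l) / n)     # chance agreement
              for l in label_set)
    return p_o, (p_o - p_e) / (1 - p_e)

# e.g. token_agreement_and_kappa(['DT', 'none', 'SC', 'none'],
#                                ['DT', 'none', 'none', 'none'])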

Failed Crowdsourcing Effort

I tried using Amazon Mechanical Turk to acquire labeled test data. However, the annotations were of poor quality; thus, we did not use them. Annotators frequently labeled any

medically relevant term as DT or SC, such as ‘blood’ and ‘doctor’. I tried a secondary

verification step – turkers were asked to verify the annotations from the first step, but the

results were not satisfactory. I believe that either they did not read the instructions properly,

did not pay attention when labeling the data, or did not fully understand the task. For ex-

ample, one annotator labeled ‘doc’ and ‘WAIT’ as DTs in the sentence ‘Take care of your

son, take him to the doc regualrly , do as they say and WAIT ... until he grows out of it.’ In

retrospect, labeling the test data by ourselves was faster and, maybe, cheaper.

5.6.2 Metrics

I present both token-level and entity-level Precision, Recall, and F1 metrics to evaluate our

system and the baselines. I discuss the metrics and the difference between token-level and

entity-level metrics in Chapter 2. All the results in this chapter, unless otherwise noted,

are token-level measures because identifying partial tokens in an entity (that is, ‘inhaler’

in ‘salbutamol inhaler’) is still useful in this domain. Entity-level evaluation is commonly

used for recognizing named entities, where, for example, the distinction between ‘Washington’ and ‘Washington D.C.’ is more prominent. Note that sometimes extracting partial

phrases in our task will also lead to a wrong number of token-level true positives (for ex-

ample, extracting just ‘looking’ in ‘trouble looking straight ahead’), but I did not observe

it often in our experiments.

Note that accuracy is not a good measure for the task because most of the tokens are

labeled ‘none’, and thus labeling everything as ‘none’ achieves very high accuracy and zero

recall. I ignore about 200 very common words (like ‘i’, ‘am’), 26 very common medical

terms and their derivatives (like ‘disease’, ‘doctor’), and words that do not start with a letter

(see the Appendix for the full list) when evaluating the systems.12

12 This is the same as considering them as stop words and fixing their label to ‘none’. Since the F1 scores are not calculated for the label ‘none’, both approaches have the same effect.
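For concreteness, the token-level metrics used throughout this chapter can be sketched as follows; the input format is a hypothetical one.

def token_prf(annotated_tokens, label, stop_words=frozenset()):
    """Token-level precision, recall, and F1 for one label (a sketch).

    annotated_tokens is a list of (word, gold_label, predicted_label) triples;
    tokens whose word is in stop_words are ignored, as described above.
    """
    tp = fp = fn = 0
    for word, gold, pred in annotated_tokens:
        if word.lower() in stop_words:
            continue
        if pred == label and gold == label:
            tp += 1
        elif pred == label:
            fp += 1
        elif gold == label:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    return precision, recall, f1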


Statistical Significance Testing

I tested the statistical significance of the improvement of our system over the baselines using approximate randomization (Noreen, 1989; Yeh, 2000) implemented by SIGFv2 (Pado, 2006), which is commonly used for statistical significance testing of named entity recognition systems. It does not assume that the sample is representative; that is, it does not perform sampling with replacement. Instead, it is based on random shuffling of the predictions. I assumed each token to be an observation and randomized the observations 10,000 times.
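A minimal sketch of this test follows (the actual experiments used the SIGF implementation); metric is any scoring function over predictions and gold labels, such as token-level F1.

import random

def approximate_randomization(preds_a, preds_b, gold, metric, trials=10000, seed=0):
    """Paired approximate randomization test, a sketch in the spirit of
    Noreen (1989) and Yeh (2000). Returns a two-tailed p-value for the
    observed score difference between two systems."""
    rng = random.Random(seed)
    observed = abs(metric(preds_a, gold) - metric(preds_b, gold))
    extreme = 0
    for _ in range(trials):
        # randomly swap the two systems' outputs per observation; no resampling
        flips = [rng.random() < 0.5 for _ in gold]
        shuffled_a = [b if f else a for a, b, f in zip(preds_a, preds_b, flips)]
        shuffled_b = [a if f else b for a, b, f in zip(preds_a, preds_b, flips)]
        if abs(metric(shuffled_a, gold) - metric(shuffled_b, gold)) >= observed:
            extreme += 1
    return (extreme + 1) / (trials + 1)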

5.6.3 Baselines

I compare our system to the OBA annotator (Jonquet et al., 2009) and the MetaMap annotator (Aronson, 2001). I evaluated both baselines with their default settings. I also

compare our algorithm with the pattern learning system proposed by Xu et al. (2008). I

describe the details of these systems below.

MetaMap: I used the Java API of MetaMap 2013v2. I used the semantic types Antibiotic, Clinical Drug, Drug Delivery Device, Steroid, Therapeutic or Preventive Procedure, Vitamin, and Pharmacologic Substance for DT; and the semantic types Disease or Syndrome,

Sign or Symptom, Congenital Abnormality, Experimental Model of Disease, Injury or

Poisoning, Mental or Behavioral Dysfunction, Finding for SC. MetaMap-C refers to the

MetaMap system when its output is post-processed by removing common words.

OBA: I used the web service provided by OBA to label the sentences. I used the semantic types Pharmacologic Substance, Steroid, Vitamin, Antibiotic, Therapeutic or Preventive

Procedure, Medical Device, Substance, Clinical Drug, Drug Delivery Device, Biomedical

or Dental Material for DT; and the semantic types Sign or Symptom, Injury or Poisoning,

Disease or Syndrome, Mental or Behavioral Dysfunction, Rickettsia or Chlamydia for SC.

OBA-C refers to the OBA system when its output is post-processed by removing common

words.

Xu et al.: Xu et al. (2008) learned surface patterns for extracting diseases from Medline

paper abstracts. They ranked patterns based on overlap of words extracted by potential

patterns with a seed pattern. Potential words were ranked by the scores of the patterns



that extracted them. I compared our system with their best performing ranking measures:

BalancedRank for patterns and Best-pattern-based rank for words. Since they focus only

on extracting diseases from research paper abstracts, their seed pattern ‘patients with X’

will not perform well on our dataset. Thus, for each label, we create patterns according to

their algorithm and choose the pattern weighted highest by our system as their seed pattern.

The seed patterns were the same as the top patterns shown in Tables 5.10-5.13.

CRF: A conditional random field (CRF) is a Markov random field based classifier that

uses word features and context features, such as the words and labels of nearby words.

Even though the data is only partially labeled using dictionaries, CRFs can learn correct

labels using the context features. I experimented with many different features and settings

and report the best results. I removed sentences in which none of the words were labeled

and fixed the label of words that are labeled by dictionaries. I used distributional similarity

features, which were computed using the Brown clustering method on all sentences of the

MedHelp forums (see Section 2.4 for more details). I built the classifier using the Stanford

NER toolkit (Finkel et al., 2005). I also present results of CRFs with self-training (‘CRF-2’ and ‘CRF-20’ for 2 and 20 iterations, respectively), in which a CRF is trained on the sentences labeled by the dictionaries together with the predictions of the CRF trained in the previous iteration.
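The self-training variants can be sketched as the following loop; the training and tagging callables are placeholders, not the actual Stanford NER interface.

def self_train_crf(train, tag, dict_labeled, unlabeled, iterations=2):
    """Self-training loop behind the CRF-2 / CRF-20 baselines (a sketch).

    train(sentences) -> model and tag(model, sentences) -> labeled sentences
    stand in for the sequence tagger. dict_labeled are sentences labeled by the
    dictionaries (their labels stay fixed); unlabeled are the remaining sentences.
    """
    model = train(dict_labeled)
    for _ in range(iterations - 1):
        predicted = tag(model, unlabeled)          # label the rest with the current model
        model = train(dict_labeled + predicted)    # retrain on dictionary labels + predictions
    return model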

5.7 Results

Fuzzy matching

Table 5.1 shows F1 scores for our system across different dictionary labeling schemes.

‘Dictionary’ refers to the seed dictionary without fuzzy matching or removing common

words. Fuzzy matching (indicated by ‘-F’) and removing common words (indicated by

‘-C’) increase the F1 scores by 3-5%. The performance of the system increases by removing

common words from dictionaries and matching words fuzzily.
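The fuzzy matching (‘-F’) can be sketched roughly as below; the distance and length cutoffs are illustrative assumptions rather than the exact values used by the system.

def edit_distance(a, b):
    """Levenshtein distance via the standard dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def fuzzy_in_dictionary(token, dictionary, max_dist=1, min_len=5):
    """Dictionary look-up with fuzzy matching (a sketch): a token matches if it
    is within a small edit distance of some dictionary phrase."""
    token = token.lower()
    if token in dictionary:
        return True
    if len(token) < min_len:        # avoid spurious matches on short words
        return False
    return any(abs(len(d) - len(token)) <= max_dist and edit_distance(token, d) <= max_dist
               for d in dictionary)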


System            Asthma–DT   Asthma–SC   ENT–DT   ENT–SC
Dictionary            58.21       71.39    49.66    60.32
Dictionary-C          60.29       73.32    53.93    61.88
Dictionary-F-C        62.50       74.59    54.74    63.13

Table 5.1: F1 scores for labeling with Dictionaries using different types of labeling schemes. ‘-F’ means using fuzzy matching and ‘-C’ means pruning words that are in GoogleCommonList.

Our system vs. Other systems

Tables 5.2–5.5 show the scores for DT and SC labels on the Asthma and ENT forums.

The horizontal line separates systems that do not learn new phrases from the systems that

do. An asterisk denotes that our system is statistically significantly better than that system (two-tailed p-value < 0.05 under approximate randomization).

In most cases, our system significantly outperforms current standard tools in medical

informatics. MetaMap and OBA have lower computational time since they do not match

words fuzzily or learn new dictionary phrases, but have lower performance. All systems

extract SC terms with higher recall than DT terms because many simple SC terms (such

as ‘asthma’) occurred frequently and were present in the dictionary. The improvement in

performance of our system over the baselines is higher for DT as compared to SC, mainly

because SC terms are usually verbose and descriptive, and hence are harder to extract using

patterns. In addition, the performance is higher on Asthma than on ENT for two reasons.

First, the system was tuned on the Asthma forum. Second, the Asthma test set had many

easy to label DT and SC phrases, such as ‘asthma’ and ‘inhaler’. On the other hand, many

ENT phrases were longer and not present in seed dictionaries, such as ‘milk free diets’ and

‘smelly nasal discharge’.

One of the reasons that CRF does not perform so well, despite being very popular for ex-

tracting entities from human-labeled text data, is that the data is partially labeled using the

dictionaries. Thus, the data is noisy and lacks full supervision provided in human-labeled

data, making the word-level features not very predictive. CRF missed extracting some

common terms like ‘inhaler’ and ‘inhalers’ as DT (‘inhaler’ occurred only as a sub-phrase

in the seed dictionary), and extracted some noisy terms, such as ‘afraid’ and ‘icecream’. In


addition, CRF uses context for labeling data – we show in the Additional Experiments sec-

tion that using context in the form of patterns performs worse than dictionary matching for

labeling data. Our system, on the other hand, learned new dictionary phrases by exploiting

context but labeled data by dictionary matching. Self-training the CRF initially increased

the F1 score for DT but performed worse in the subsequent iterations. Xu et al.’s system

performed worse because of its overdependence on the seed patterns: it gave low scores to

patterns that extracted phrases that had low overlap with the phrases extracted by the seed

patterns, which resulted in lower recall.

I believe token-level evaluation is better suited than entity-level evaluation for this task.

However, for completeness, I have included entity-level evaluation results in Tables 5.6-5.9.

Scores of all systems are better when measured at the token level than at the entity level

because they get credit for extracting partial entities. The entity-level evaluation results

show a similar trend as the token-level evaluation: our system performs better than other

systems, albeit the difference is smaller for the DT label on the ENT forum.

5.7.1 Analysis

Tables 5.10–5.13 show the top 10 patterns extracted from the Asthma and the ENT forums

for the two labels. To improve readability, I have shown only the sequences of lemmas

from the patterns. X indicates the target phrase and ‘pos:’ indicates the part-of-speech

restriction. As we can see, several context tokens are generalized to their labels. Some

target entities do not have a part-of-speech restriction, especially when the context is very predictive of the label, such as ‘and be diagnose with’ for the label SC.

Figure 5.1 shows the top 15 phrases extracted from the Asthma and ENT forums by our system. Figures 5.2 and 5.3 show the phrases extracted by our system from the following

three forums: Acne, Breast Cancer, and Adult Type II Diabetes. We can broadly group the

extracted phrases into 4 categories, which are described below.

New Terms

One goal of extracting medical information from PAT is to learn new treatments patients

are using or symptoms they are experiencing. Our system extracted phrases like ‘stabbing


System            Precision   Recall       F1
OBA                   52.25    56.50   54.25*
OBA-C                 62.06    53.15   57.25*
MetaMap               68.42    57.56   62.52*
MetaMap-C             77.60    54.98   64.36*
Dictionary-F-C        89.65    47.97   62.50*

Xu et al.-25          89.57    53.87   67.28*
Xu et al.-50          85.96    54.24   66.51*
CRF                   87.09    49.81   63.38*
CRF-2                 87.74    50.18   63.84*
CRF-20                86.53    49.81   63.23*
Our system            86.88    58.67    70.04

Table 5.2: Token-level Precision, Recall, and F1 scores of our system and the baselines on the Asthma forum for the label DT.

System            Precision   Recall       F1
OBA                   78.87    60.08   68.20*
OBA-C                 83.62    58.24   68.66*
MetaMap               58.63    80.24   67.75*
MetaMap-C             70.28    75.15   72.63*
Dictionary-F-C        78.73    70.87   74.59*

Xu et al.-25          77.29    72.09   74.60*
Xu et al.-50          76.28    72.70   74.45*
CRF                   77.68    73.72   75.65*
CRF-2                 77.63    73.52   75.52*
CRF-20                76.64    73.52   75.05*
Our system            78.10    75.56    76.81

Table 5.3: Token-level Precision, Recall, and F1 scores of our system and the baselines on the Asthma forum for the label SC.

pain’, ‘flakiness’, ‘plaque buildup’, which are not in the seed dictionaries. It also extracted

alternative and preventative treatments like ‘HEPA’ for high-efficiency particulate absorp-

tion air filter, ‘cinnamon pills’, ‘vinegar pills’, ‘basil’, and ‘opuntia’. Effects of alternative

and new treatments are usually studied in small-scale clinical trials (e.g. the effects of

Opuntia plant and Cinnamon on Diabetes patients in clinical trials have been studied in


System            Precision   Recall       F1
OBA                   43.22    55.73   48.68*
OBA-C                 49.73    51.36   50.53*
MetaMap               56.39    53.00    54.64
MetaMap-C             64.08    49.72    55.99
Dictionary-F-C        82.41    40.98   54.74*

Xu et al.-25          76.50    40.98   53.37*
Xu et al.-50          62.80    41.53   49.99*
CRF                   79.38    42.07    55.71
CRF-2                 79.20    43.71    56.33
CRF-20                67.79    43.71   53.15*
Our system            82.82    44.80    58.15

Table 5.4: Token-level Precision, Recall, and F1 scores of our system and the baselines on the ENT forum for the label DT.

System            Precision   Recall       F1
OBA                   67.51    50.52   57.79*
OBA-C                 70.55    46.18   55.82*
MetaMap               57.01    64.23   60.40*
MetaMap-C             67.40    58.50   62.63*
Dictionary-F-C        74.35    54.86   63.13*

Xu et al.-25          73.48    54.86   62.82*
Xu et al.-50          73.88    57.46   64.64*
CRF                   72.06    56.42   63.29*
CRF-2                 71.39    55.90   62.70*
CRF-20                70.61    55.90   62.40*
Our system            71.65    61.45    66.16

Table 5.5: Token-level Precision, Recall, and F1 scores of our system and the baselines on the ENT forum for the label SC.


System            Precision   Recall      F1
OBA                   46.64    58.12   51.75
OBA-C                 54.80    56.15   55.47
MetaMap               54.50    56.65   55.55
MetaMap-C             61.87    55.17   58.33
Dictionary-F-C        70.28    47.48   56.89

Xu et al.-25          73.33    54.18   62.32
Xu et al.-50          70.44    55.17   61.87
CRF                   68.49    49.26   57.30
CRF-2                 69.65    49.75   58.04
CRF-20                69.17    49.75   57.87
Our system            73.00    58.62   65.02

Table 5.6: Entity-level Precision, Recall, and F1 scores of our system and the baselines on the Asthma forum for the label DT.

System            Precision   Recall      F1
OBA                   70.60    57.25   63.23
OBA-C                 73.12    55.69   63.23
MetaMap               52.90    75.38   62.17
MetaMap-C             61.71    70.98   66.02
Dictionary-F-C        71.54    68.39   69.93

Xu et al.-25          70.15    69.43   69.79
Xu et al.-50          69.21    70.46   69.83
CRF                   71.05    71.24   71.15
CRF-2                 70.43    70.98   70.70
CRF-20                69.36    70.98   70.16
Our System            71.28    73.31   72.28

Table 5.7: Entity-level Precision, Recall, and F1 scores of our system and the baselines on the Asthma forum for the label SC.


System            Precision   Recall      F1
OBA                   29.79    49.57   37.22
OBA-C                 34.16    46.21   39.28
MetaMap               40.97    49.57   44.86
MetaMap-C             46.28    47.05   46.66
Dictionary-F-C        66.23    42.85   52.04

Xu et al.-25          60.71    42.85   50.23
Xu et al.-50          47.66    42.85   45.12
CRF                   62.65    43.69   51.48
CRF-2                 60.91    44.53   51.45
CRF-20                51.45    44.53   47.74
Our system            63.52    45.37   52.94

Table 5.8: Entity-level Precision, Recall, and F1 scores of our system and the baselines on the ENT forum for the label DT.

System            Precision   Recall      F1
OBA                   56.57    44.79      50
OBA-C                 56.78    40.72   47.43
MetaMap               48.51    59.04   53.26
MetaMap-C             56.10    54.07   55.06
Dictionary-F-C        65.57       50   56.73

Xu et al.-25          64.63    50.45   56.67
Xu et al.-50          64.78    52.03   57.71
CRF                   63.27    50.67   56.28
CRF-2                 62.01    50.22   55.49
CRF-20                61.38       50   55.11
Our System            62.53    55.88   59.02

Table 5.9: Entity-level Precision, Recall, and F1 scores of our system and the baselines on the ENT forum for the label SC.


i be put on (X | pos: noun)
i have be on (X | pos:noun)
use DT and (X | pos:noun)
put he on (X | pos:noun)
prescribe DT and (X | pos:noun)
mg of X
he put I on X
to give he (X | pos:noun)
i have be use X
and put I on X

Table 5.10: Top 10 patterns learned for the label DT on the Asthma forum.

(X | pos: noun) SC etc.
reduce SC (X | pos:noun)
first SC (X | pos:noun)
have history of (X | pos:noun)
develop SC (X | pos:noun)
really bad SC (X | pos:noun)
not cause SC (X | pos:noun)
symptom be (X | pos:noun)
(X | pos:noun) SC feel
and be diagnose with X

Table 5.11: Top 10 patterns learned for the label SC on the Asthma forum.


have endoscopic (X | pos: noun)
include DT (X | pos: noun)
and put I on (X | pos: noun)
(X | pos: noun) 500 mg
2 round of (X | pos: noun)
and be put on (X | pos: noun)
have put I on (X | pos: noun)
(X | pos: adj) DT and use
ent put I on X
(X | pos: noun) and nasal rinse

Table 5.12: Top 10 patterns learned for the label DT on the ENT forum.

persistent SC (X | pos: noun)
have have problem with (X | pos: noun)
diagnose I with SC (X | pos: noun)
morning with SC (X | pos: noun)
(X | pos: noun) SC cause SC
have be treat for (X | pos: noun)
year SC (X | pos: noun)
(X | pos: noun) SC even though
(X | pos: noun) SC like SC
daughter have SC X

Table 5.13: Top 10 patterns learned for the label SC on the ENT forum.


Asthma - DT: inhaler, inhalers, steroid inhaler, albuterol inhaler, b5, preventive inhaler, ventolin inhaler, advar, seritide, steroid inhalers, symbicort trubohaler, agumentin, pantoloc, inahler, puffs

Asthma - SC: flare, flare-up, rad, congestion, mucus, tightness, sinuses, exces mucus, cataracts-along, athsma, vcd, sensation, mites, nasal, ashtma

ENT - DT: otic, z-pack, z-pac, predneson, tylenol sinus, amoxillin, saline nasal, eardrops, regimen, inhaler, peroxide, rinse, amoxcilyn, rinses, anti-nausea, saline, mucodyn, flixonase, vertin, amocicillan

ENT - SC: dysfunction, sinus, sinuses, lymph, gland, tonsilitus, sinues, sensation, congestion, pharynx, tightness, mucus, tonsil, onset, ethmoid sinus

Figure 5.1: Top 15 phrases extracted for the Asthma and the ENT forums.

Frati-Munari et al. (1998) and Khan et al. (2003)). In contrast, our system enables discov-

ering and extracting new DT and SC phrases in PAT and studying their effects reported by

patients at a larger scale in online forums.

Abbreviations

Patterns leverage context to extract abbreviations from PAT, despite the fact that unlike in

well-formed text, abbreviations in PAT tend to lack identifying structure like capitalization

and periods. Some examples of abbreviations our system extracted are: ‘neb’ for nebulizer,

‘labas’ for long-acting beta agonists, and ‘lada’ for latent autoimmune diabetes of adults.

Sub-phrases

Patients frequently do not use full names of diseases and drugs in PAT. For example, it

would not be unusual for patients to refer to ‘vitamin b12’ simply as ‘b12’. These partial

phrases did not get labeled by dictionaries because dictionaries contain long precise phrases

and we label a phrase only when it fully matches a dictionary phrase. When we ran trial

experiments that labeled phrases even when they partially matched a dictionary phrase, it

resulted in low precision. Our pattern learning system learns relevant sub-phrases of the


dictionary phrases without sacrificing much precision. For example, the system is able to

learn that ‘large’ is not a relevant word by itself even though it occurs frequently in the SC

dictionary, but ‘deficiency’ is. More examples include ‘inhaler’, ‘b5’, ‘puffer’ as DT.

Spelling Mistakes

Spelling mistakes are very common in PAT, especially for DT mentions. Context sensitive

patterns allow us to extract a wider range of spelling mistakes than would be possible with

typical edit distance metrics (e.g., Dictionary-F-C matches phrases fuzzily). For example,

the system extracts ‘neurothapy’ for neuropathy, ‘ibubofrin’ for Ibuprofen, and ‘metforim’

for Metformin.

The results show that bootstrapping using patterns gives effective in-domain dictionary

expansion for SC and DT phrases. As we can see from the top extracted phrases for the

three MedHelp forums, our system uncovers novel terms for SCs and DTs, some of which

refer to lesser-known home remedies (such as ‘basil’, ‘cinnamon’ for Diabetes) and com-

ponents of daily care and management. The system extracts some incorrect phrases, which

can be discarded by manual supervision. Such discoveries are valuable on two fronts:

first, they may comprise a useful candidate set for future research into alternative treatments; second, they can be used to suggest candidate terms for various dictionaries and

ontologies. There are two reasons for the overall lower recall and precision on this dataset

than for extracting some other types of medical entities on clinical text. First, DT and SC

definitions are broad, encompassing any symptom, condition, treatment, or intervention.

Second, PAT contains slang and verbose descriptions that are usually not present in dictio-

naries. One limitation of our system is that it does not identify long descriptive phrases,

such as ‘olive leaf nasal extract nasal spray’ and ‘trouble looking straight ahead’. More

research is needed to robustly identify those to increase the recall of the system. In addition,

incorrect phrases in the dictionaries, which were curated automatically, reduced the pre-

cision of our system. Further research in automatically removing incorrect entries in the

dictionaries will help to improve the precision.

To compare the efficacy of our system for extracting relevant phrases apart from spelling

mistakes with MetaMap, I cluster all the strings from both the systems that are 1 edit

distance away and normalize them to their most frequent spelling variation. I compare the


NEW TERMS
Acne: diane35, retinoid, dianette, retinoids, topical retinoids, femodene, ginette, cilest, dalacin-t, dalacin, piriton, freederm, byebyeblemish, non-hormonal anti-androgen, sudocrem, byebye blemish, dermatologists, dian-35, canneston, microdermabrasions, isotrexin, noxema, proactiv, derm, cleansers, concealer, proactive, creme, microdermabrasion, moisturizer, minocylin
Diabetes: ambulance, basil, bedtime, c-peptide, cinnamon, diaformin, glycomet, glycomet-gp-1, hydrochlorothiazide-quinapril, hydrochlorothiazide-reserpine, lipoic, minidiab, neurotonin, opuntia, rebuilder, sometime, tritace
Breast Cancer: hormonal fluctuations, rads, ayurveda, ameridex, tram flap, bilateral mastectomy, flaps, incision, thinly, taxanes, bisphosphonates, bisphosphonate, mammosite, rad, imagery, stimulation, relicore, bezielle, wle (wide local excision), lymph spread-wle, moisturising, lymphnode, lympe, her2 neu, hormone-suppressing

SUB-PHRASES
Acne: topical, depo, contraceptives, contraceptive, aloe vera, topicals, salicylic, d3, peroxide, androgens-male, cleanser
Diabetes: asprin, bolus, carb, carbohydrate, carbohydrates, ovulation, regimen
Breast Cancer: hormonal, topical, antagonists, excision, vit, sentinel, cmf, primrose, augmentation, depo, flap

ABBREV.
Diabetes: a1c, a1c cutoff, a1cs, endo (endocrinology), ob gyn, ogtt (oral glucose tolerance test), xr (extended release)
Breast Cancer: recon (reconstruction), neu

SPELLING MISTAKES
Acne: oxytetrcaycline, contracpetive, anitbiotics, oxytetracylcine, oxytracycline, lymecylcine, sprionolactone, benzol peroxide, depot-shot, tetracylcines, shampo, dorxy, steriod, moisturising, perscription
Diabetes: actoplusmet, awhile, basil-known, birth-control, blood-cholesterol, condrotin, darvetcet, diabix, excercise, fairley, htis, inslin, klonopin-i, metforim, metform, metformun100mg, metmorfin, omigut40, pils, sutant
Breast Cancer: homonal, steriod, horonal, releif, ibubofrin, tamoxofin, tomoxphen, reloxifen, tamoxafin, tomoxifin, steriods, tamixofin

Figure 5.2: Top 50 DT phrases extracted by our system for three different forums. Erroneous phrases (as determined by us) are shown in gray. Full forms of some abbreviations are in italics. Note that Abbreviations are also New Terms but are categorized separately because of their frequency in PAT.


NEW TERMS
Acne: squeeze blackheads, squeeze, breakouts, teenager, itchiness, coldsores, blemishes, blemish, breakout, chin, break-outs, re-appearing, outbreaks, poke, puss, flares, bum, outbreak, coldsore, acneic, armpit, teenagers
Diabetes: borderline diabetic, c-peptide, calories, checkup, educator, harden, rarer, sugary, thorough, type2
Breast Cancer: armpit, grandmother, aunt, cancer-grandmother, cancer-having, survivor, aunts, morphologies, diagnosing

SUB-PHRASES
Acne: lesions, bumps, irritation, glands, bump, forehead, lumps, scalp, cheeks, follicles, dryness, gland, flare-up, pilaris rubra, puberty, cystic, follicular, inflamed, follicle, pcos, soreness, groin, occurrence, discoloration, relapse, oily
Diabetes: abdomen, blockage, bowel, calfs, circulatory, cirrhosis, disruptions, dryness, fibro, flour, fluctuations, foggy, lesion, lumps, masturbation, menopause, onset, pcos, precursor, sensations, spike, spikes, thighs, urine
Breast Cancer: lesions, lump, soreness, lumps, phyllodes, situ, ducts, lesion, sensations, needle, menopause, manifestations, variant, mutation, manifestation, onset, duct, lymph, gland, benign, irritation, abnormality, glands, mutations, asymmetry, occurrence, leaking, parenchymal, bump, unilateral, thighs, menstrual, subtypes, ductal, colon, bumps

ABBREV.
Diabetes: a-fib (atrial fibrillation), carbs (carbohydrates), cardio, ha1c, hep (hepatitis), hgba1c, hypo, oj (orange juice), t2 (type 2)
Breast Cancer: hx (history)*, ibc (inflammatory breast cancer)

SPELLING MISTAKES
Acne: becuase, forhead
Diabetes: allegeries, energyless, jsut, neurothapy, tyoe, vomiting-more, weezyness
Breast Cancer: caner, posibility, tratment

Figure 5.3: Top 50 SC phrases extracted by our system for three different forums. Erroneous phrases (as determined by us) are shown in gray. Full forms of some abbreviations are in italics. Note that Abbreviations are also New Terms but are categorized separately because of their frequency in PAT.


(a) Top DT phrases.

(b) Top SC phrases.

Figure 5.4: Top DT and SC phrases extracted by our system, MetaMap, and MetaMap-C for the Diabetes forum. Numbers in parentheses indicate the number of times the phrase was extracted by the system. Erroneous phrases (as determined by us) are shown in gray.


top most frequent phrases extracted from the Diabetes forum in Figure 5.4. We can see that

our system extracts more relevant phrases. The reason we do not extract ‘insulin’ is that it exists (incorrectly) in the automatically curated SC dictionary and we do not label DT

phrases that are in the SC dictionary. For our system, I concatenated all consecutive words

with the same label as one phrase, in contrast with MetaMap, which many times extracted

consecutive words as different phrases (leading to the difference in the frequency of some

phrases). For example, our system extracted ‘diabetes drug dependency’, but MetaMap

extracted it as ‘diabetes’ and ‘drug dependency’. Similarly, our system extracted ‘latent

autoimmune diabetes in adults’, whereas MetaMap extracted ‘latent’ and ‘autoimmune

diabetes’.
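A sketch of this normalization step follows: greedy clustering of spelling variants by edit distance 1, reusing the edit_distance helper from the fuzzy-matching sketch above; the exact clustering procedure used for the comparison may differ.

from collections import Counter

def normalize_spelling_variants(phrase_counts):
    """Cluster extracted strings within edit distance 1 and map each cluster to
    its most frequent spelling variation (a sketch). phrase_counts is a Counter
    of extracted phrase -> extraction count."""
    representatives = []          # canonical spellings, most frequent first
    mapping = {}
    for phrase, _ in phrase_counts.most_common():
        rep = next((r for r in representatives if edit_distance(phrase, r) <= 1), None)
        if rep is None:
            representatives.append(phrase)
            rep = phrase
        mapping[phrase] = rep
    merged = Counter()
    for phrase, count in phrase_counts.items():
        merged[mapping[phrase]] += count
    return merged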

Below, I demonstrate a use case of the system to explore alternative treatments people

use for a symptom or condition. I manually labeled posts that mentioned new treatments

identified by our system as DTs and explored their efficacy by mining the sentiment towards them in the forum.

5.7.2 Case study: Anecdotal Efficacy

Our system can be used to explore different (possibly previously unknown) treatments peo-

ple are using for a condition. In turn, this can lead to novel insights, which can be further

explored by the medical community. For example, for Diabetes, our system extracted ‘Cin-

namon’ and ‘Vinegar’ as DTs. To study the anecdotal efficacy of ‘Cinnamon’ and ‘Vinegar’

for managing Diabetes, we manually labeled the posts that mentioned the terms as treat-

ment for Diabetes (47 out of 49 posts for ‘Cinnamon’ and 26 out of 30 posts for ‘Vinegar’)

with the sentiment towards that treatment. Both terms were extracted as DT by our system

for the Diabetes forum. ‘Strongly positive’ means the treatment helped the person. ‘Weakly

positive’ means the person is using the treatment or has heard positive effects of it. ‘Neu-

tral’ means the user is not using the treatment and did not express an opinion in the post.

‘Weakly negative’ means the person has heard that the treatment does not work. ‘Strongly

negative’ means the treatment did not work for the person. An informal analysis of the

posts reveals that ‘Cinnamon’ was generally considered helpful by the community and

‘Vinegar’ had mixed reviews (Figure 5.5). Below are more details about each label.


Figure 5.5: Study of efficacy of ‘Cinnamon’ and ‘Vinegar’, two DTs extracted by our system, for treating Type II Diabetes.


• Strongly positive: The person has explicitly mentioned that the treatment is helping

the subject of the post (many times the posts discuss health of a family member)

for Diabetes. Example: “. . . A relative with the same problem told her about taking

cinnamon gel tabs which had greatly helped her. She found a brand at the local health

store by the name of NewChapter titled Cinnamon Force. She was afraid to take it

with so many other medications and it sat in the cabinet about five months. Last week,

she got brave and took two tabs behind the two largest meals of the day.Wow! the

level dropped down into the safe range and has remained there for several days.All

that I can tell you about the product, is that it contains 140mg of cinnamon per gel

tab.We are so thrilled that after so many years of frustration, that we see a great

change in blood sugar levels . . . ”

• Weakly positive: The subject of the post is either using the treatment or heard/read

positive effects of the treatment for Diabetes. Example: “... Some people do think

things such as vinegar help. My belief is those things are worth trying but they are

secondary to tried and true things such as weight loss, exercise and lowering carb

intake.”

• Neutral: The subject of the post is neither using the treatment nor expressed any

sentiment about it in the post. Example: “. . . I may be wrong, but I haven’t heard of

cinnamon lowering glucose levels. Please take your mother to a doctor for a checkup

asap . . . ” Posts that asked a question about using the treatment were also labeled Neutral. For example, “Does vinegar help diabetes” is labeled Neutral.

• Weakly negative: The post mentioned that the user has heard that the treatment does

not work. For example, people citing studies that showed inconclusive evidence of

the efficacy of the treatment. Example: “. . . Studies now show that cinnamon doesn’t

lower glucose levels, but has been known to regulate blood pressure. I can vouch for

the latter . . . ”

• Strongly negative: The post mentioned that the treatment is not working from per-

sonal experience of the subject of the post (for example, a family member). Example:

“I have tried the Apple Cider Vinegar and it didn’t work for me . . . ”


                                          DT                           SC
System                         Precision  Recall     F1    Precision  Recall     F1
Our system                         86.88   58.67  70.04        78.10   75.56  76.81
Pattern Matches (No Gen.)          45.26   13.73  21.07        50.75   12.33  19.85
Pattern Matches                    36.58   16.60  22.84        43.07   17.10  24.48
Pattern Matches in Dictionary      80.00   13.28  22.78        83.51   15.47  26.11

Table 5.14: Precision, Recall, and F1 scores of systems that use pattern matching when labeling data and our system on the Asthma forum.

5.8 Additional Experiments

Labeling Data by Matching Patterns

I learn dictionaries for SC and DT phrases and use the dictionaries to label data. The label-

ing is done by dictionary look-up and does not consider context. Context is only considered

to learn patterns that extract new dictionary phrases. Another approach is to label data us-

ing the learned patterns, which uses only context. I compare both the approaches in Tables

5.14 and 5.15.

The system ‘Pattern Matches (No Gen.)’ applied all the patterns learned by our system for a given label and labeled every extraction as positive for the label. ‘Pattern Matches’

is similar to ‘Pattern Matches (No Gen.)’ except it used the dictionaries for generalizing

the context, which increased the recall. ‘Pattern Matches in Dictionary’ is the most con-

servative approach, in which a token was labeled as positive only if it was matched by both the dictionary and the learned patterns. That is, it filtered the output of ‘Pattern Matches’

to all the phrases that were also labeled by the dictionaries. All the pattern matching ap-

proaches have very low recall because many correct tokens did not occur in the patterns’

context. ‘Pattern Matches in Dictionary’ has high precision because it is the most restricted

approach of all, but suffers from low recall.


                                          DT                           SC
System                         Precision  Recall     F1    Precision  Recall     F1
Our system                         82.35   45.90  58.94        71.65   61.45  66.16
Pattern Matches (No Gen.)          47.36    4.71   8.57        46.15    0.98   1.92
Pattern Matches                       40    6.55  11.26        53.12    8.85  15.17
Pattern Matches in Dictionary      90.90    5.46  10.30        94.11    8.33  15.31

Table 5.15: Precision, Recall, and F1 scores of systems that use pattern matching when labeling data and our system on the ENT forum.

                          DT                           SC
System         Precision  Recall     F1    Precision  Recall     F1
OBA                52.25   56.50  54.25        78.87   60.08  68.20
OBA-C              62.06   53.15  57.25        83.62   58.24  68.66
OBA-C-T5           64.67   52.02  57.66        85.01   56.61  67.97

Table 5.16: Effects of use of GoogleCommonList in OBA on the Asthma forum. Precision, Recall, and F1 scores of OBA when words in GoogleCommonList are not labeled (‘-C’ suffix), and when words in GoogleCommonList and in manually identified negative phrases are not labeled (‘-C-T5’ suffix).

Manually removing top negative words from MetaMap and OBA

I sorted all words extracted by MetaMap and OBA by their frequency and manually identified the top 5 words that I judged as incorrect (without considering the context). I ran experiments in which those words were not labeled by OBA-C and MetaMap-C, that is, I added

them to the stop words list. The systems are marked as ‘OBA-C-T5’ and ‘MetaMap-C-

T5’, respectively, in Tables 5.16-5.17 and 5.18-5.19. The motivation for comparing the performance of these systems is the scenario in which a user manually identifies the top negative words and adds them to the stop words list. Removing the manually identified words generally increased precision, but reduced recall. I suspect the recall dropped

because the words might be correct when they appeared in some contexts. The reason for

the same scores for MetaMap-C and MetaMap-C-T5 for the SC label on the Asthma forum

is that the negative words were already in the GoogleCommonList.


                          DT                           SC
System         Precision  Recall     F1    Precision  Recall     F1
OBA                43.22   55.73  48.68        67.51   50.52  57.59
OBA-C              49.73   51.36  50.53        70.55   46.18  55.82
OBA-C-T5           53.71   51.36  52.51        82.31   42.01  55.63

Table 5.17: Effects of use of GoogleCommonList in OBA on the ENT forum. Precision, Recall, and F1 scores of OBA when words in GoogleCommonList are not labeled (‘-C’ suffix), and when words in GoogleCommonList and in manually identified negative phrases are not labeled (‘-C-T5’ suffix).

                             DT                           SC
System            Precision  Recall     F1    Precision  Recall     F1
MetaMap               68.42   57.56  62.52        58.63   80.24  67.75
MetaMap-C             77.60   54.98  64.36        70.28   75.15  72.63
MetaMap-C-T5          78.68   53.13  63.43        70.28   75.15  72.63

Table 5.18: Effects of use of GoogleCommonList in MetaMap on the Asthma forum. Precision, Recall, and F1 scores of MetaMap when words in GoogleCommonList are not labeled (‘-C’ suffix), and when words in GoogleCommonList and in manually identified negative phrases are not labeled (‘-C-T5’ suffix).

                             DT                           SC
System            Precision  Recall     F1    Precision  Recall     F1
MetaMap               56.39   53.00  54.64        57.01   64.23  60.40
MetaMap-C             64.08   49.72  55.99        67.40   58.50  62.63
MetaMap-C-T5          64.17   46.99  54.25        70.44   57.11  63.08

Table 5.19: Effects of use of GoogleCommonList in MetaMap on the ENT forum. Precision, Recall, and F1 scores of MetaMap when words in GoogleCommonList are not labeled (‘-C’ suffix), and when words in GoogleCommonList and in manually identified negative phrases are not labeled (‘-C-T5’ suffix).


Phrase threshold   Precision   Recall      F1
0.01                   86.78    55.71   67.86
0.1                    87.71    55.35   67.87
0.2                    87.77    58.30   70.06
0.8                    90.53    56.45   69.54
1.0                    90.53    56.45   69.54

Table 5.20: Scores when our system is run with different phrase threshold values. Increasing the threshold increases the precision but reduces recall. The value used in our final system was 0.2.

Pattern threshold   Precision   Recall      F1
0.2                     87.77    58.30   70.06
0.5                     87.77    58.30   70.06
0.8                     90.68    53.87   67.59
1.0                     89.65    47.97   62.50

Table 5.21: Scores when our system is run with different pattern threshold values. All other parameters remain unchanged. The thresholds of 0.2 and 0.5 did not make a difference because all extracted patterns had a score of more than 0.5. The threshold of 0.8 led to higher precision but lower recall. A threshold of 1.0 did not extract any patterns. The value used in our final system was 0.5.

Parameter Tuning

In our experiments, I tuned the parameters, such as N , K, and T , on the Asthma forum.

In this section, we discuss the effect of varying some of the parameters (keeping others the

same as in the final system) on extracting DT phrases from the Asthma forum. We observed a similar effect when varying the parameters for extracting SC phrases from the Asthma

forum.

Phrase and pattern thresholds

Tables 5.20 and 5.21 show scores of our system when different phrase and pattern thresh-

olds are used. In both cases, generally increasing the threshold resulted in higher precision

but lower recall.


N    T    Precision   Recall      F1
5    40       86.41    58.67   69.89
10   20       87.77    58.30   70.06
40   5        89.22    54.98   68.03

Table 5.22: Scores when our system is run with different values of N and T. The values used in our final system were N=10 and T=20.

K     T    Precision   Recall      F1
20    20       88.75    55.35   68.18
50    20       87.77    58.30   70.06
100   20       86.41    58.67   69.89

Table 5.23: Scores when our system is run with different values of K. Increasing K decreases precision but improves recall. The value used in our final system was K=50.

Number of phrases in each iteration (N )

Our system learned a maximum of 200 phrases (with maximum number of phrases in each

iteration N=10 and maximum number of iterations T=20). Table 5.22 shows scores for

different combinations of values of N and T, keeping the total number of phrases learned

constant.

Number of patterns in each iteration (K)

Table 5.23 shows results for different values of K, that is, the maximum number of patterns learned in each iteration.

5.9 Future Work

Future improvements to performance would allow us to reap enhanced benefits from au-

tomatic medical term extraction. Improving precision, for example, would reduce the manual effort required to verify extracted terms before doing an analysis similar to the one shown in Figure 5.5. Improving recall would increase the range of terms that we extract. For example,

at present, our system still misses relevant terms, such as ‘oatmeal’ as a DT for Diabetes.


Our results open several avenues for future work on mining and analyzing PAT. Extrac-

tion of DT and SC entities allows us to investigate connections and relationships between

drug pairs, and drugs and symptoms. Prior work has successfully identified adverse drug

events in electronic medical records (Tatonetti et al., 2012); using self-reported patient data

(such as that found on MedHelp), we might uncover novel information on how particular

drug combinations affect users. One such case study to identify side effects of drugs was

presented in Leaman et al. (2010). Our system can also help to analyze sentiment towards

various treatments, including home remedies and alternative treatments, for a particular

disease – manually enumerating all treatments, along with their morphological variations,

is difficult. Finally, I note that our system does not require any labeled data of sentences

and thus can be applied to many different types of PAT (like patient emails) and entity types

(like diagnostic tests).

5.10 Conclusion

I demonstrate a method for identifying medical entity types in patient-authored text. I in-

duce lexico-syntactic patterns using a seed dictionary of desirable terms. Annotating spe-

cific types of medical terms in PAT is difficult because of lexical and semantic mismatches

between experts’ and consumers’ description of medical terms. Previous ontology-based

tools like OBA and MetaMap are good at fine-grained concept mapping on expert-authored

text, but they have low accuracy on PAT.

I demonstrate that our method improves performance for the task of extracting two

entity types: drugs & treatments (DT) and symptoms & conditions (SC), from MedHelp’s

Asthma and ENT forums by effectively expanding dictionaries in context. Our system

extracts new entities missing from the seed dictionaries: abbreviations, relevant sub-phrases

of seed dictionary phrases, and spelling mistakes. In evaluation, in most cases, our system

significantly outperformed MetaMap, OBA, an existing system that uses word patterns for

extracting diseases, and a conditional random field classifier. I believe that the ability to

effectively extract specific entities is the key first step towards deriving novel findings from

PAT.

Pattern and entity scoring are the critical components of a bootstrapped pattern-based


learning system. The system developed in this chapter utilizes only the supervision pro-

vided by seed sets to score patterns and entities. Thus, many entities extracted by patterns

are unlabeled. During the pattern scoring phase, the unlabeled entities extracted by patterns

are considered negative. However, many of these unlabeled entities are actually positive,

resulting in lower scores for good patterns that extract many good (that is, positive) unla-

beled entities. In the next chapter, I propose improvements to the pattern scoring phase by

evaluating unlabeled entities using unsupervised measures. This leads to improved precision

and recall.


Chapter 6

Leveraging Unlabeled Data Improves Pattern Learning

In the previous chapters, I discussed bootstrapped pattern-based learning (BPL) as an effective approach for entity extraction with minimal distantly supervised data. In this chapter,

I propose improvements to BPL by leveraging unlabeled data to enhance the pattern scoring

function. The work has been published in Gupta and Manning (2014a).

6.1 Introduction

In a pattern-based entity learning system, scoring patterns and scoring entities are the most

important steps. In the pattern-scoring phase, patterns are scored by their ability to extract more positive entities and fewer negative entities. In a supervised setting, the efficacy of

patterns can be judged by their performance on a fully labeled dataset (Califf and Mooney,

1999; Ciravegna, 2001). In contrast, in a BPL system, seed dictionaries and/or patterns pro-

vide weak supervision. Thus, most entities extracted by candidate patterns are unlabeled,

making it harder for the system to learn good patterns.

Existing systems score patterns by making closed world assumptions about the unla-

beled entities. The problem is similar to the closed world assumption in distantly super-

vised relation extraction systems, when all propositions missing from a knowledge base

are considered false (Ritter et al., 2013; Xu et al., 2013). Consider the example discussed


in Chapter 2, also shown in Figure 6.1. Current pattern learning systems would score both

patterns, ‘own a X’ and ‘my pet X’, equally by either ignoring the unlabeled entities or

assuming them as negative. However, these scoring schemes cannot differentiate between

patterns that extract good versus bad unlabeled entities. Systems that ignore the unlabeled

entities do not leverage the unlabeled data in scoring patterns. Frequently, these systems

learn patterns that extract some positive entities but many bad unlabeled entities. Systems

that assume unlabeled entities to be negative are very conservative; in the example, they

wrongly penalize ‘Pattern 1’, which extracted the good unlabeled entity ‘cat’.

Predicting labels of unlabeled entities can improve scoring patterns. Features like dis-

tributional similarity can predict that ‘cat’ is closer to the seed set {dog} than ‘house’, and

a pattern learning system can use that information to rank ‘Pattern 1’ higher than ‘Pattern

2’. In this chapter, I improve the scoring of patterns for an entity class by defining a pat-

tern's score in terms of the number of positive entities it extracts and the ratio of the number of positive entities to the expected number of negative entities it extracts. I propose five features to predict

the scores of unlabeled entities. One feature, based on Google Ngrams, exploits the specialized nature of our dataset: entities that are frequent on the web are less likely to be drug-and-treatment entities. The other four features can be used to learn entities for generic

domains as well.

My main contribution is introducing the expected number of negative entities in pat-

tern scoring – I predict probabilities of unlabeled entities belonging to the negative class.

I estimate an unlabeled entity’s negative class probability by averaging probabilities from

various unsupervised class predictors, such as distributional similarity, string edit distances

from learned entities, and TF-IDF scores. Our system performs significantly better than ex-

isting pattern scoring measures for extracting drug-and-treatment entities from four medical

forums on MedHelp.

6.2 Related Work

I discuss pattern-based systems in Chapter 3. Here, I review the pattern-scoring aspects

of previous pattern-based systems. The pioneering work by Hearst (1992) used hand-written patterns to automatically generate more rules that were manually evaluated to extract


Figure 6.1: An example pattern learning system for the class ‘animals’ from the text starting with the seed entity ‘dog’. The figure shows two candidate patterns, along with their extracted entities, in the first iteration. Text matched with the patterns is shown in italics and the extracted entities are shown in bold.

hypernym-hyponym pairs from text. Other supervised systems like SRV (Freitag, 1998),

SLIPPER (Cohen and Singer, 1999), (LP )2 (Ciravegna, 2001), and RAPIER (Califf and

Mooney, 1999) used a fully labeled corpus to either create or score patterns.

Riloff (1996) used a set of seed entities to bootstrap learning of rules for entity extrac-

tion from unlabeled text. She scored a rule by a weighted conditional probability measure,

called RlogF, estimated by counting the number of positive entities among all the entities

extracted by the pattern. Thelen and Riloff (2002) extended the above bootstrapping al-

gorithm for multi-class learning. Riloff and Jones (1999) used a pattern scoring measure similar to that of Riloff (1996) for their multi-level bootstrapping approach. Snowball (Agichtein

and Gravano, 2000) used the same scoring function for patterns as Riloff (1996). Yangar-

ber et al. (2002) and Lin et al. (2003) used a combination of accuracy and confidence of

a pattern for multiclass entity learning, where the accuracy measure ignored the unlabeled

entities and the confidence measure treated them as negative. Talukdar et al. (2006) used

seed sets to learn trigger words for entities and a pattern automaton. Their pattern scoring measure is the same as that of Lin et al. (2003). In Chapter 5, I use the ratio of scaled frequencies

of positive entities among all extracted entities. None of the above measures predict labels

of unlabeled entities to score patterns. Our system outperforms them in our experiments.


Stevenson and Greenwood (2005) used Wordnet to assess patterns, which is not feasible

for domains that have low coverage in Wordnet, such as medical data. Zhang et al. (2008)

used HITS algorithm (Kleinberg, 1999) over patterns (authorities) and instances (hubs)

to overcome some of the problems with the above systems – unlabeled entities extracted

by patterns are either considered negative or are ignored when computing pattern scores.

However, they do not use any external unsupervised knowledge for evaluating the unlabeled

entities.

Current open entity extraction systems either ignore the unlabeled entities or consider

them as negative. KnowItAll’s entity extraction from the web (Downey et al., 2004; Etzioni

et al., 2005) used components such as list extractors, generic and domain specific pattern

learning, and subclass learning. They learned domain-specific patterns using a seed set and

scored them by ignoring unlabeled entities. One of our baselines is similar to their domain-

specific pattern learning component. Carlson et al. (2010a) learned multiple semantic types

using coupled semi-supervised training from web-scale data, which is not feasible for all

datasets and entity learning tasks. They assessed patterns by their precision, assuming un-

labeled entities to be negative; one of our baselines is similar to their pattern assessment

method. Other open information extraction systems like ReVerb (Fader et al., 2011) and

OLLIE (Mausam et al., 2012) are mainly geared towards generic, domain-independent rela-

tion extractors for web data. ReVerb used manually written patterns (called constraints)

to extract potential tuples, which were scored using a logistic regression classifier trained

on around 1000 manually labeled sentences. OLLIE ranked patterns by their frequency of

occurrence in the dataset. For more discussion on these systems, see Chapter 2.

6.3 Approach

In Chapter 2, I discussed the skeleton of a bootstrapped pattern-based learning system. In

this chapter, I use the same framework with lexico-syntactic surface word patterns. I extract

entities from unlabeled text starting with seed dictionaries of entities for multiple classes.

The success of bootstrapped pattern learning methods crucially depends on the effec-

tiveness of the pattern scorer and the entity scorer. Here I focus on improving the pattern

scoring measure.


6.3.1 Creating Patterns

Candidate patterns are created using contexts of words or their lemmas in a window of two

to four words before and after a positively labeled token. Context words that are labeled

with one of the classes are generalized with that class. The target term has a part-of-speech

(POS) restriction, which is the POS tag of the labeled token. I create flexible patterns by

ignoring the words {‘a’, ‘an’, ‘the’} and quotation marks when matching patterns to the

text. Some examples of the patterns are shown in Table 6.4.
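To make the pattern-creation step concrete, the following is a minimal sketch of how candidate surface word patterns could be generated around a positively labeled token. It is not the TokensRegex-based implementation actually used in this thesis; the data structures and function names are illustrative, and only the window sizes, the POS restriction on the target slot, and the ignored words follow the description above.

    # Illustrative sketch of candidate surface-pattern creation: a window of two
    # to four words on each side, determiners and quotation marks ignored, and a
    # POS restriction on the target slot X.
    IGNORED = {"a", "an", "the", '"', "'"}

    def candidate_patterns(tokens, pos_tags, labels, target_idx):
        """tokens, pos_tags, labels are parallel lists; labels[i] is a class name
        (e.g. 'DT', 'SC') or None; target_idx points at a positively labeled token."""
        def context(toks, labs):
            out = []
            for tok, lab in zip(toks, labs):
                if tok.lower() in IGNORED:
                    continue                      # flexible matching: skip a/an/the and quotes
                out.append(lab if lab is not None else tok.lower())
            return tuple(out)

        patterns = set()
        for w in (2, 3, 4):                       # window of two to four words
            left = context(tokens[max(0, target_idx - w):target_idx],
                           labels[max(0, target_idx - w):target_idx])
            right = context(tokens[target_idx + 1:target_idx + 1 + w],
                            labels[target_idx + 1:target_idx + 1 + w])
            patterns.add((left, "X:" + pos_tags[target_idx], right))
        return patterns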

6.3.2 Scoring Patterns

Judging the efficacy of patterns without using a fully labeled dataset can be challenging

because of two types of failures: 1. penalizing good patterns that extract good (that is,

positive) unlabeled entities, and 2. giving high scores to bad patterns that extract bad (that

is, negative) unlabeled entities. Existing systems that treat unlabeled entities as negative are too conservative in scoring patterns and suffer from the first problem. Systems that ignore unlabeled entities can suffer from both problems. For a pattern r, let the sets Pr, Nr,

and Ur denote the positive, negative, and unlabeled entities extracted by r, respectively.

One commonly used pattern scoring measure, RlogF (Riloff, 1996), calculates a pattern's score by the function |Pr| / (|Pr| + |Nr| + |Ur|) · log(|Pr|). The first term is a rough measure of precision,

which assumes unlabeled entities as negative. The second term gives higher weights to

patterns that extract more positive entities. The function has been shown to be effective for

learning patterns in many systems. However, it gives lower scores to patterns that extract

many unlabeled entities – regardless of whether those entities are good or bad.

I propose to estimate the labels of unlabeled entities to more accurately score the patterns. The pattern score, ps(r), is calculated as

ps(r) = |Pr| / (|Nr| + ∑_{e∈Ur} (1 − score(e))) · log(|Pr|)        (6.1)

where |.| denotes the size of a set. The function score(e) gives the probability of an entity

e belonging to C. If e is a common word, score(e) is 0. Otherwise, score(e) is calculated

as the average of five feature scores (explained below), each of which gives a score between


0 and 1. The feature scores are calculated using the seed dictionaries, learned entities for

all labels, Google Ngrams, and clustering of domain words using distributional similarity.

The log |Pr| term, inspired by RlogF, gives higher scores to patterns that extract more positive entities. Candidate patterns are ranked by ps(r) and the top patterns are added to the list of learned patterns.¹

¹Including the |Pr| term in the denominator of Equation 6.1 resulted in comparable but slightly lower performance in some experiments.

To calculate score(e), I use features that assess, in an unsupervised way, whether unlabeled entities are closer to the positive or to the negative entities. I motivate my choice of the five features below with the following insights. First, if the dataset consists of informally written

text, many unlabeled entities are spelling mistakes and morphological variations of labeled

entities. I use two edit distance based features to predict labels for these unlabeled entities.

Second, some unlabeled entities are substrings of multi-word dictionary phrases but do

not necessarily belong to the dictionary’s class. For example, for learning drug names,

the positive dictionary might contain ‘asthma meds’, but ‘asthma’ is negative and might

occur in a negative dictionary as ‘asthma disease’. To predict the labels of entities that

are a substring of dictionary phrases, I use SemOdd, which I also used in Chapter 5 to

learn entities. Third, for a specialized domain, unlabeled entities that commonly occur

in generic text are more likely to be negative. I use Google Ngrams (called GN) to get

a fast, non-sparse estimate of the frequency of entities over a broad range of domains.

The above features do not consider the context in which the entities occur in text. I use

the fifth feature, DistSim, to exploit contextual information of the labeled entities using

distributional similarity. The features are defined as:

Edit distance from positive entities (EDP): This feature gives a score of 1 if e has low edit

distance to the positive entities. It is computed as

max_{p∈Pr} 1( editDist(p, e) / |p| < 0.2 )

where 1(c) returns 1 if the condition c is true and 0 otherwise, |p| is the length of p,

and editDist(p, e) is the Damerau-Levenshtein string edit distance between p and e.



The hard cut-off for the edit distance function resulted in better results in the pilot

experiments as compared to a soft scoring function.

Edit distance from negative entities (EDN): It is similar to EDP and gives a score of 1 if e

has high edit distance to the negative entities. It is computed as

1 − max_{n∈Nr} 1( editDist(n, e) / |n| < 0.2 )

Semantic odds ratio (SemOdd): First, I calculate the ratio of the frequency of the entity term in the positive entities to its frequency in the negative entities, with Laplace smoothing. The ratio is then normalized using a softmax function. The feature values for the unlabeled entities extracted by all the candidate patterns are then normalized using the min-max function to scale the values between 0 and 1. I do min-max normalization on top of the softmax normalization because the maximum and minimum values produced by the softmax might not be close to 1 and 0, respectively. Treating out-of-feature-vocabulary entities the same as the worst-scored entities for the feature, that is, giving them a score of 0, performed best on the development dataset.

Google Ngrams score (GN): I calculate the ratio of scaled frequency of e in the dataset to

the frequency in Google Ngrams. The scaling factor is to balance the two frequencies

and is computed as the ratio of the total number of phrases in the dataset to the total number of

phrases in Google Ngrams. The feature values are normalized in the same way as

SemOdd.

Distributional similarity score (DistSim): Words that occur in similar contexts, such as

‘asthma’ and ‘depression’, are clustered using distributional similarity. Unlabeled

entities that get clustered with positive entities are given higher score than the ones

clustered with negative entities. To score the clusters, I learn a logistic regression

classifier using cluster IDs as features, and use the learned feature weights as scores for all the

entities in those clusters. The dataset for logistic regression is created by considering

all positively labeled words as positive and sampling negative and unlabeled words

as negative. The scores for entities are normalized in the same way as SemOdd and

GN.
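As a concrete illustration of the two edit-distance features defined above (EDP and EDN), here is a minimal sketch. For brevity it uses plain Levenshtein distance rather than the Damerau-Levenshtein distance used in the thesis (so transpositions count as two edits), and the positive and negative entity sets are ordinary Python sets; all names are illustrative.

    def levenshtein(a, b):
        # Standard dynamic-programming edit distance (the thesis uses the
        # Damerau-Levenshtein variant, which additionally allows transpositions).
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                 # deletion
                               cur[j - 1] + 1,              # insertion
                               prev[j - 1] + (ca != cb)))   # substitution
            prev = cur
        return prev[-1]

    def edp(e, positives, cutoff=0.2):
        # 1 if e is within 20% of some positive entity's length in edit distance.
        return float(any(levenshtein(p, e) / len(p) < cutoff for p in positives))

    def edn(e, negatives, cutoff=0.2):
        # 1 if e is far, in edit distance, from every negative entity.
        return 1.0 - float(any(levenshtein(n, e) / len(n) < cutoff for n in negatives))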


Entities outside the feature vocabulary are given a score of 0 for the features SemOdd,

GN, and DistSim. I use a simple way of combining the feature values: I give equal weights

to all features and average their scores. Features can be combined using a weighted average

by manually tuning the weights on a development set; I leave this to future work. Another

way of weighting the features is to learn the weights using machine learning. I discuss this

approach in the last section of the chapter.
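Putting the pieces together, the pattern scorer of Equation 6.1 could be sketched as below. The score(e) function is the equal-weight average described above; the individual feature functions and the common-word list are passed in as placeholders, and the handling of an empty denominator is an assumption of the sketch rather than something specified in the text.

    import math

    def score_entity(e, feature_fns, common_words):
        # score(e): 0 for common words, otherwise the unweighted average of the
        # five feature scores, each of which lies between 0 and 1.
        if e in common_words:
            return 0.0
        vals = [f(e) for f in feature_fns]       # e.g. [edp, edn, semodd, gn, distsim]
        return sum(vals) / len(vals)

    def pattern_score(P_r, N_r, U_r, feature_fns, common_words):
        # Equation 6.1: |Pr| / (|Nr| + sum over e in Ur of (1 - score(e))) * log(|Pr|)
        if not P_r:
            return 0.0
        denom = len(N_r) + sum(1.0 - score_entity(e, feature_fns, common_words)
                               for e in U_r)
        if denom == 0:
            return float("inf")                  # assumption: no negative evidence at all
        return len(P_r) / denom * math.log(len(P_r))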

6.3.3 Learning Entities

I apply the learned patterns to the text and extract candidate entities. I discard common

words, negative entities, and those containing non-alphanumeric characters from the set.

The rest are scored by averaging the scores of the DistSim, SemOdd, EDP, and EDN features

from Section 6.3.2 and the following features.

Pattern TF-IDF scoring (PTF): For an entity e, it is calculated as

(1 / log freq_e) · ∑_{r∈R} ps(r)

where R is the set of learned patterns that extract e, freq_e is the frequency of e in

the corpus, and ps(r) is the pattern score calculated in Equation 6.1. Entities that are

extracted by many high-weighted patterns get higher weight. To mitigate the effect

of many commonly occurring entities also getting extracted by several patterns, I

normalize the feature value with the log of the entity’s frequency. The values are

normalized in the same way as DistSim and SemOdd.

Domain N-grams TF-IDF (DN): This feature gives higher scores to entities that are more

prevalent in the corpus compared to the general domain. For example, to learn enti-

ties about a specific disease from a disease-related corpus, the feature favors entities

related to the disease over generic medical entities. It is calculated in the same way

as GN except the frequency is computed in the n-grams of the generic domain text.

Including GN in the phrase scoring features or including DN in the pattern scoring

features did not perform well on the development set in our pilot experiments.
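A small sketch of the entity-scoring step, including the PTF feature and the min-max normalization it shares with the other features. The guard for entities of frequency one (where log freq_e would be zero) is an assumption of the sketch, since the thesis does not spell out that corner case, and all names are illustrative.

    import math

    def ptf_raw(e, patterns_extracting, pattern_scores, freq):
        # PTF: (1 / log freq_e) * sum of ps(r) over the learned patterns r that extract e.
        total = sum(pattern_scores[r] for r in patterns_extracting[e])
        return total / max(math.log(freq[e]), 1.0)    # guard against freq_e = 1 (assumption)

    def min_max_normalize(values):
        # Scale raw feature values to [0, 1] across all candidate entities.
        lo, hi = min(values.values()), max(values.values())
        if hi == lo:
            return {e: 0.0 for e in values}
        return {e: (v - lo) / (hi - lo) for e, v in values.items()}

    def score_entities(candidates, feature_fns, patterns_extracting, pattern_scores, freq):
        ptf = min_max_normalize({e: ptf_raw(e, patterns_extracting, pattern_scores, freq)
                                 for e in candidates})
        scores = {}
        for e in candidates:
            # DistSim, SemOdd, EDP, EDN, DN feature scores, plus the PTF score.
            vals = [f(e) for f in feature_fns] + [ptf[e]]
            scores[e] = sum(vals) / len(vals)
        return scores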


6.4 Experiments

6.4.1 Dataset

I evaluate our system on extracting drug-and-treatment (DT) entities in sentences from four

forums on the MedHelp user health discussion website: 1. Acne, 2. Adult Type II Diabetes

(called Diabetes), 3. Ear Nose & Throat (called ENT), and 4. Asthma.

I used Asthma as the development forum for feature engineering and parameter tuning.

Similar to Chapter 5, a DT entity is defined as a pharmaceutical drug, or any treatment

or intervention mentioned that may help a symptom or a condition. It includes surgeries,

lifestyle changes, alternative treatments, home remedies, and components of daily care and

management of a disease, but does not include diagnostic tests and devices. Refer to Chap-

ter 2 for examples of sentences from these forums and the labeled entities. I used entities

from the following classes as negative: symptoms and conditions (SC), medical specialists,

body parts, and common temporal nouns to remove dates and dosage information.

Seed dictionaries

I used the DT and SC seed dictionaries from Chapter 5. The DT seed dictionary (36,091

phrases) and SC seed dictionary (97,211 phrases) were automatically constructed from

various sources on the Internet and expanded using the OAC Consumer Health Vocabulary,

which maps medical jargon to everyday phrases and their variants. Both dictionaries are

large because they contain many variants of entities. The dictionaries matched with 1065

phrases on the Acne forum, 1232 phrases on the Diabetes forum, 2271 phrases on the ENT

forum, and 1007 phrases on the Asthma forum. For each system, the SC dictionary was

further expanded by running the system with the SC class as positive (considering DT and

other classes as negative) and adding the top 50 words extracted by the top 300 patterns to

the SC class dictionary. This helps in adding corpus-specific SC words to the dictionary.

The lists of body parts and temporal nouns were obtained from Wordnet (Fellbaum, 1998).

The common words list was created using most common words on the web and Twitter. I

used the top 10,000 words from Google Ngrams and the most frequent 5,000 words from


Twitter.²

²www.twitter.com, accessed from May 19 to 25, 2012.

6.4.2 Labeling Guidelines

For evaluation, I hand labeled the learned entities pooled from all systems, to be used only

as a test set. For class DT, I labeled entities belonging to DT as positive and all others as

negative. I queried ‘word + forum name’ on Google and manually inspected the results.

Apart from the definition of the DT class above, the following instructions were followed

for each class.

Positive

The following types of variations of DT entities were allowed: spelling mistakes, abbre-

viations, and phonetically similar variations (for example, ‘brufen’ for ‘ibuprofen’). If a

word or phrase was a part of a DT entity, then it was labeled positive. For example, ‘nux’ is

considered positive because ‘nux vomica’ is sometimes used medicinally. Generic entities

that can be used as a treatment for the medical condition were included (like ‘moisturizer’

for Acne). Brand names of DT entities, like ‘Amway’, were labeled positive. Ways to

administer a medicine were included, such as ‘syrup’, ‘tabs’, and ‘inhalation’. Phrases like

‘anti-bacterial’ or ‘asthma meds’ were also considered positive.

Negative

Entities that were not labeled as positive were considered negative. If a phrase had any

non-DT word then it was considered negative, except when phrases had the name of the

disease or symptom for which the treatment is mentioned. For example, ‘sinus meds’ was

considered positive. Websites, dosages, diagnostic tests or devices, doctors, and specialists were

labeled as negative.

Inter-annotator agreement between the annotator and another researcher was computed

on 200 randomly sampled learned entities from each of the Asthma and ENT forums. The

agreement for the entities from the Asthma forum was 96% and from the ENT forum was

92.46%. The Cohen’s kappa scores were 0.91 and 0.83, respectively. Most disagreements


were on food items like ‘yogurt’, which are hard to label. Note that I do not use the hand

labeled entities for training.

6.4.3 Baselines

As in Section 6.3, the sets Pr, Nr, and Ur are defined as the positive, negative, and unlabeled entities extracted by a pattern r, respectively. The set Ar is defined as the union of all the

three sets. I compare our system with the following pattern scoring algorithms. Candidate

entities are scored in the same way as described in Section 6.3.3. It is important to note that

previous works also differ in how they create patterns, apply patterns, and score entities.

Since I focus on only the pattern scoring aspect, I run experiments that differ in only that

component.

PNOdd: Defined as |Pr|/|Nr|, this measure ignores unlabeled entities and is similar to the

domain specific pattern learning component of Etzioni et al. (2005) since all patterns

with |Pr| < 2 were discarded (more details in the next section).

PUNOdd: Defined as |Pr|/(|Ur|+ |Nr|), this measure treats unlabeled entities as negative

entities.

RlogF: Measure used by Riloff (1996) and Thelen and Riloff (2002), and calculated as

Rr log |Pr|, where Rr was defined as |Pr|/|Ar| (labeled RlogF-PUN). It assumed

unlabeled entities as negative entities. I also compare with a variant that ignores the

unlabeled entities, that is, by defining Rr as |Pr|/(|Pr| + |Nr|) (labeled RlogF-PN).

Yangarber02: This measure from Yangarber et al. (2002) calculated two scores, accr =

|Pr|/|Nr| and confr = (|Pr|/|Ar|) log |Pr|. Patterns with accr less than a threshold

were discarded and the rest were ranked using confr. I empirically determined that

a threshold of 0.8 performed best on the development forum.

Lin03: A measure proposed in Lin et al. (2003), it was similar to Yangarber02, except

confr was defined as log |Pr|(|Pr| − |Nr|)/|Ar|. In essence, it discards a pattern if it

extracts more negative entities than positive entities.


SqrtRatioAll: This pattern scoring method, which I used in Chapter 5 from Gupta et al. (2014b), is defined as ∑_{k∈Pr} √freq_k / ∑_{j∈Ar} √freq_j, where freq_i is the number of times

entity i is extracted by r. Sublinear scaling of the term-frequency prevents high

frequency words from overshadowing the contribution of low frequency words.
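For reference, the baseline pattern-scoring functions above can be written down directly from their definitions. The sketch below omits the thresholding and pattern-discarding steps and adds division-by-zero guards that the original definitions do not mention; P, N, and U stand for the positive, negative, and unlabeled entities extracted by a pattern.

    import math

    def pn_odd(P, N):            # PNOdd: ignores unlabeled entities
        return len(P) / max(len(N), 1)

    def pun_odd(P, N, U):        # PUNOdd: treats unlabeled entities as negative
        return len(P) / max(len(N) + len(U), 1)

    def rlogf_pun(P, N, U):      # RlogF with Rr = |Pr| / |Ar|
        A = len(P) + len(N) + len(U)
        return (len(P) / A) * math.log(len(P)) if P else 0.0

    def rlogf_pn(P, N):          # RlogF variant that ignores unlabeled entities
        return (len(P) / (len(P) + len(N))) * math.log(len(P)) if P else 0.0

    def yangarber02(P, N, U, acc_threshold=0.8):
        acc = len(P) / max(len(N), 1)
        if acc < acc_threshold:
            return None          # pattern discarded
        A = len(P) + len(N) + len(U)
        return (len(P) / A) * math.log(len(P)) if P else 0.0

    def lin03(P, N, U):
        A = len(P) + len(N) + len(U)
        return math.log(len(P)) * (len(P) - len(N)) / A if P else 0.0

    def sqrt_ratio_all(freq_P, freq_A):   # dicts mapping entity -> extraction frequency
        return (sum(math.sqrt(f) for f in freq_P.values()) /
                sum(math.sqrt(f) for f in freq_A.values()))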

6.4.4 Experimental Setup

I used the same experimental setup for our system and the baselines. When matching

phrases from a seed dictionary to text, a phrase is labeled with the dictionary’s class if

the sequence of phrase words or their lemmas match with the sequence of words of a

dictionary phrase. Since our corpora are from online discussion forums, they have many

spelling mistakes and morphological variations of entities. To deal with the variations, I do

fuzzy matching of words – if two words are one edit distance away and are more than 6

characters long, then they are considered a match.
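The fuzzy word-matching rule just described can be sketched as follows. The one-edit check below is a plain Levenshtein-style test (a transposition counts as two edits), which is a simplification of the matching actually used; function names are illustrative.

    def within_one_edit(a, b):
        # True if a and b are identical or exactly one insertion, deletion, or
        # substitution apart.
        if abs(len(a) - len(b)) > 1:
            return False
        if len(a) > len(b):
            a, b = b, a                   # ensure len(a) <= len(b)
        i = j = edits = 0
        while i < len(a) and j < len(b):
            if a[i] == b[j]:
                i += 1
                j += 1
                continue
            edits += 1
            if edits > 1:
                return False
            if len(a) == len(b):
                i += 1                    # substitution
            j += 1                        # insertion/deletion
        return edits + (len(b) - j) <= 1

    def fuzzy_word_match(w1, w2):
        # Words match if they are equal, or if both are more than 6 characters
        # long and are one edit distance apart (to tolerate forum typos).
        if w1 == w2:
            return True
        return len(w1) > 6 and len(w2) > 6 and within_one_edit(w1, w2)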

I used Stanford TokensRegex (Chang and Manning, 2014) to create and apply surface

word patterns to text, and used the Stanford part-of-speech (POS) tagger (Toutanova and

Manning, 2003) to find POS tags of tokens and lemmatize them. I created patterns in a

similar way as described in Chapter 5, and I discarded patterns whose left or right context consisted of only one or two stop words, to avoid generating low-precision patterns. In each iteration, I learned

a maximum of 20 patterns with ps(r) ≥ θr and a maximum of 10 words with score ≥ 0.2. The initial value of θr was 1.0, which was reduced to 0.8 × θr whenever the system did not extract any more patterns or words. I discarded patterns that extracted fewer than 2 positive

entities. I selected these parameters by their performance on the development forum.
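The overall iteration schedule just described could be organized roughly as follows. The candidate generation and scoring steps are abstracted behind illustrative callbacks (assumed to return only candidates that have not yet been learned), and the stopping condition once θr becomes very small is an assumption of the sketch.

    def bootstrap(score_candidate_patterns, score_candidate_entities,
                  theta_r=1.0, entity_threshold=0.2,
                  max_patterns=20, max_entities=10, decay=0.8, min_theta=1e-3):
        learned_patterns, learned_entities = [], []
        while theta_r >= min_theta:
            # Rank candidate patterns by ps(r) and keep the top ones above theta_r.
            ranked = score_candidate_patterns(learned_entities)    # [(pattern, score), ...]
            new_patterns = [p for p, s in ranked if s >= theta_r][:max_patterns]
            learned_patterns.extend(new_patterns)
            # Rank candidate entities extracted by the learned patterns.
            ranked = score_candidate_entities(learned_patterns)    # [(entity, score), ...]
            new_entities = [e for e, s in ranked if s >= entity_threshold][:max_entities]
            learned_entities.extend(new_entities)
            if not new_patterns and not new_entities:
                theta_r *= decay            # relax the pattern threshold and try again
        return learned_patterns, learned_entities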

For calculating the DistSim feature used for scoring patterns and entities, I clustered

all of MedHelp’s forum data into 1000 clusters using the Brown clustering algorithm. The

data consisted of around 4 million tokens. Words that occurred less than 50 times were dis-

carded, which resulted in 50353 unique words. For calculating the DN feature for scoring

entities, I used n-grams from all user forums in MedHelp as the domain n-grams.

I evaluate systems by their precision and recall in each iteration. I stopped learning

entities for a system if the precision dropped below 75% to extract entities with reasonably

high precision. Recall is defined as the fraction of correct entities among the total unique


correct entities pooled from all systems while maintaining the precision ≥ 75%. Note that true recall is very hard to compute since our dataset is unlabeled. To compare the systems overall, I calculate the area under the precision-recall curves (AUC-PR).

[Figure 6.2: Precision vs. Recall curves of our system and the baselines (OurSystem, RlogF-PUN, Yangarber02, SqrtAllRatio, Lin03, PUNOdd) for the Asthma forum; recall is out of 221 correct entities.]

System         Asthma   ENT     Diabetes   Acne
OurSystem      68.36    60.71   67.62      68.01
PNOdd          51.62    50.31   05.91      58.45
PUNOdd         42.42    30.44   36.11      58.38
RlogF-PUN      56.13    54.11   48.70      57.04
RlogF-PN       53.46    52.84   16.59      62.35
SqrtRatioAll   41.49    40.44   35.47      46.46
Yangarber02    53.76    48.46   41.45      59.85
Lin03          54.58    47.98   56.15      60.79

Table 6.1: Area under Precision-Recall curves of the systems.


[Figure 6.3: Precision vs. Recall curves of our system and the baselines for the ENT forum; recall is out of 645 correct entities.]

[Figure 6.4: Precision vs. Recall curves of our system and the baselines for the Acne forum; recall is out of 624 correct entities.]


[Figure 6.5: Precision vs. Recall curves of our system and the baselines for the Diabetes forum; recall is out of 118 correct entities.]

Feature        Asthma   ENT     Diabetes   Acne
All Features   68.36    60.71   67.62      68.01
EDP            68.66    59.07   60.03      65.15
EDN            59.39    59.21   16.75      65.96
SemOdd         67.07    58.41   60.51      65.04
GN             57.52    59.53   48.76      68.61
DistSim        64.87    59.05   71.11      69.48

Table 6.2: Individual feature effectiveness: Area under Precision-Recall curves when our system uses individual features during pattern scoring. Other features are still used for entity scoring.


Feature         Asthma   ENT     Diabetes   Acne
All Features    68.36    60.71   67.62      68.01
minus EDP       66.29    60.45   69.84      69.46
minus EDN       67.19    60.39   69.89      67.57
minus GN        65.53    60.33   66.07      67.28
minus SemOdd    66.66    60.76   70.79      68.25
minus DistSim   66.10    60.58   66.59      67.85

Table 6.3: Feature ablation study: Area under Precision-Recall curves when individual features are removed from our system during pattern scoring. The feature is still used for entity scoring.

6.4.5 Results

Figures 6.2–6.5 plot the precision and recall of the systems. I do not show plots of PNOdd

and RlogF-PN to improve clarity; they performed similarly to other baselines. All systems

extract more entities for Acne and ENT because different drugs and treatments are more

prevalent in these forums. Diabetes and Asthma have more interventions and lifestyle

changes that are harder to extract. Table 6.1 shows AUC-PR scores for all systems. RlogF-

PN and PNOdd have low values for Diabetes because they learned generic patterns in the initial iterations, which led them to learn incorrect entities. Overall, our system performed significantly better than the existing systems. This is because it is able to exploit the unlabeled data to score the patterns better – patterns that extract good unlabeled entities get ranked higher than patterns that extract bad unlabeled entities.

To compare the effectiveness of each feature in our system, Table 6.2 shows the AUC-

PR values when each feature was individually used for pattern scoring (other features were

still used to learn entities). EDP and DistSim were strong predictors of labels of unlabeled

entities because many good unlabeled entities were spelling mistakes of DT entities and

occurred in similar contexts to them. Table 6.3 shows the AUC-PR values when each feature

was removed from the set of features used to score patterns (the feature was still used for

learning entities). Removing GN and DistSim reduced the AUC-PR scores for all forums.

Table 6.4 shows some examples of patterns and the entities they extracted along with

their labels when the pattern was learned. Our system learned the first pattern because

‘pinacillin’ has low edit distance from the positive entity ‘penicillin’. Similarly, it scored


the second pattern higher than the baseline because ‘desoidne’ is a typo of the positive

entity ‘desonide’. Note that the seed dictionaries are noisy – the entity ‘metro’, part of the

positive entity ‘metrogel’, was falsely considered a negative entity because it was in the

common web words list. Our system learned the third pattern for two reasons: ‘inhaler’,

‘inhalers’, and ‘hfa’ occurred frequently as sub-phrases in the DT dictionary, and they

were clustered with positive entities by distributional similarity. Since RlogF-PUN does

not distinguish between unlabeled and negative entities, it does not learn the pattern.

Table 6.5 shows the top 10 patterns learned for the ENT forum by our system and RlogF-PUN, the best-performing baseline for the forum. Our system preferred to learn patterns with longer contexts, which usually have higher precision, first.

Forum    Pattern            Positive entities                         Negative   Unlabeled            OurSystem   Baseline
ENT      he give I more X   antibiotics, steroid, antibiotic                     pinacillin           68          NA (RlogF-PUN)
Acne     topical DT ( X     prednisone, clindamycin, differin,        metro      desoidne             149         231 (RlogF-PN)
                            benzoyl peroxide, tretinoin, metrogel
Asthma   i be put on X      cortisone, prednisone, asmanex, advair,              inhaler, inhalers,   8           NA (RlogF-PUN)
                            augmentin, bypass, nebulizer, xolair,                hfa
                            steroids, prilosec

Table 6.4: Example patterns and the entities extracted by them, along with the rank at which the pattern was added to the list of learned patterns. NA means that the system never learned the pattern. Baseline refers to the best-performing baseline system on the forum. The patterns have been simplified to show just the sequence of lemmas. X refers to the target entity; all of them in these examples had a noun POS restriction. Terms that have already been identified as the positive class were generalized to their class DT.


Our System          RlogF-PUN
low dose of X*      mg of X
mg of X             treat with X
X 10 mg             take DT and X
she prescribe X     be take X
X 500 mg            she prescribe X
be take DT and X*   put on X
ent put I on X*     stop take X
DT ( like X:NN      i be prescribe X
like DT and X       have be take X
then prescribe X*   tell I to take X

Table 6.5: Top 10 (simplified) patterns learned by our system and RlogF-PUN from the ENT forum. An asterisk denotes that the pattern was never learned by the other system. X is the target entity slot with a noun POS restriction.

6.5 Discussion and Conclusion

Our system extracted entities with higher precision and recall than other existing systems.

Since most entities extracted by patterns, especially in the crucial initial iterations, are un-

labeled, existing pattern scoring functions either unfairly penalize good patterns and/or do

not penalize bad patterns enough. Our system successfully leveraged the unlabeled data to

score patterns better – it evaluated unlabeled entities extracted by patterns in an unsuper-

vised way. However, learning entities from an informal text corpus that is partially labeled

from seed entities presents some challenges. Our system made mistakes primarily due to

three reasons. One, it sometimes extracted typos of negative entities that were not easily

predictable by the edit distance measures, such as ‘knowwhere’. Second, patterns that ex-

tracted many good but some bad unlabeled entities got high scores because of the good

unlabeled entities. However, the bad unlabeled entities extracted by the highly weighted

patterns were scored high by the PTF feature during the entity scoring phase, leading to

extraction of the bad entities. Better features to predict negative entities and robust text

normalization would help mitigate both the problems. Third, we used automatically con-

structed seed dictionaries that were not dataset specific, which led to incorrect labeling

of some entities (for example, ‘metro’ as negative in Table 6.4). Reducing noise in the

dictionaries would increase precision and recall.


In our proposed system, the features are weighted equally by taking the average of the

feature scores. In pilot experiments, learning a logistic regression classifier on heuristically

labeled data did not work well for either pattern scoring or entity scoring. In the next

chapter, I use logistic regression to learn an entity classifier; I improved the sampling of examples used to create the training set, resulting in better results with a classifier. In retrospect,

this approach could also be successfully applied to the system in this chapter.

One limitation of our system and evaluation is that I learned single word entities, since

calculating some features for multi-word phrases is not straightforward. For example, word

clusters using distributional similarity were constructed for single words. Our future work

includes expanding the features to evaluate multi-word phrases. Another avenue for fu-

ture work is to use our pattern scoring method for learning other kinds of patterns, such

as dependency patterns, and in different kinds of systems, such as hybrid entity learning

systems (Etzioni et al., 2005; Carlson et al., 2010a).

In conclusion, I show that predicting the labels of unlabeled entities in the pattern scorer

of a bootstrapped entity extraction system significantly improves precision and recall of

learned entities. Our experiments demonstrate the importance of having models that con-

trast domain-specific and general domain text, and the usefulness of features that allow

spelling variations when dealing with informal texts. Our pattern scorer outperforms ex-

isting pattern scoring methods for learning drug-and-treatment entities from four medical

web forums.


Chapter 7

Distributed Word Representations to Guide Entity Classifiers

In the last chapter, I improved the pattern scoring function of a bootstrapped pattern-based learning system using unlabeled data. In this chapter, I leverage the unlabeled data to improve the entity scoring function. I model it with a logistic regression classifier and use

the unlabeled data to enhance its training set. The work has been published in Gupta and

Manning (2015).

7.1 Introduction

The limited supervision provided in bootstrapped systems, though an attractive quality, is

also one of its main challenges. When seed sets are small, noisy, or do not cover the label

space, the bootstrapped classifiers do not generalize well. I use a major guiding inspiration

of deep learning and earlier approaches such as LSA (Landauer et al., 1998): we can learn

a lot about syntactic and semantic similarities between words in an unsupervised fashion

and capture this information in word vectors. This distributed representation can inform an

inductive bias to generalize in a bootstrapping system.

In the previous chapter, I used averaging of feature values to predict an entity’s class in

a bootstrapped system. In this chapter, I use a logistic regression classifier to predict scores

for candidate entities. My main contribution is a simple approach of using the distributed


Figure 7.1: An example of expanding a bootstrapped entity classifier’s training set using word vector similarity. The entities in blue represent known positive entities and the entities in red represent known negative entities. The entities in black are unlabeled but can be incorporated in the corresponding positive and negative sets because of their proximity to the known entities in the word vector space.

vector representations of words to expand training data for entity classifiers. To improve

the step of learning an entity classifier, I first learn a vector representation of entities using

the continuous bag of words model (Mikolov et al., 2013b). I then use kNN to expand the

training set of the classifier by adding unlabeled entities close to seed entities in the training

set. Figure 7.1 shows an example of expansion of a training set for a drugs-and-treatment

entity classifier tailored for online health forums. The unlabeled entities shown in the

figure are usually not found in seed sets that are automatically constructed using medical

ontologies. However, these entities can be incorporated into the training set because they

occur in similar contexts in the dataset. Expanding a training set not only makes it larger

but also less susceptible to false negatives, since the process of sampling the unlabeled

entities as negative is guided by the frequency and context of entities.

The key insight is to use the word vector similarity indirectly by enhancing training

data for the entity classifier. I do not directly label the unlabeled entities using the similar-

ity between word vectors, which I show extracts many noisy entities. I show that classifiers

trained with expanded sets of entities perform better on extracting drug-and-treatment en-

tities from four online health forums from MedHelp.


7.2 Related Work

In a pattern-based system, if the patterns are not very specific, they can extract noisy terms.

On the other hand, overly specific patterns can result in low recall. Many systems, such

as RAPIER (Califf and Mooney, 1999) and Kozareva and Hovy (2013), learn patterns

and extract all fillers that match the patterns. In supervised systems (e.g. RAPIER), the

patterns are scored using fully supervised data, and hence the patterns are presumably more

accurate. Learning all matched entities is a bigger problem in bootstrapped systems since

there is little labeled data to judge patterns. Kozareva and Hovy (2013) extended ontologies

using bootstrapping; they learned very specific ‘doubly-anchored’ patterns.

To mitigate the problem of extracting noisy entities, some BPL systems have an en-

tity evaluation step and they learn only the top ranked entities. There are several ways to

rank the candidate entities. Systems, such as Thelen and Riloff (2002), Lin et al. (2003),

and Agichtein and Gravano (2000), score entities using the number and scores of patterns

that extracted them. In Chapter 5, I used a similar function to rank the entities. Snow-

ball (Agichtein and Gravano, 2000) and DIPRE (Brin, 1999) also took into account how

well a pattern matched a sentence to extract an entity. All of the above systems use only

the patterns to score entities extracted by them. Surprisingly, only a few systems also use

entity-based features to score the entities. StatSnowball proposed using MLNs to extract entities and used both token-level features and joint entity-level features. In Chapter 6, I used five

features to evaluate an entity, four of them were entity-based features.

Some open IE systems like KnowItAll use the web to assess the quality of extractions.

KnowItAll’s assessor queried search engines to get a PMI score of the occurrence of the entity by itself vs. as a slot of the extraction patterns. The PMI scores are used as features in a naive

Bayes classifier. Downey et al. (2010) proposed a probabilistic urn model and compared

against noisy-or and PMI scoring models.

Most of the BPL systems do not use a machine learning-based classifier for the entity

scoring step. In this chapter, I model the entity scoring function using a logistic regression

classifier. To the best of my knowledge, this work is the first to improve a bootstrapped


system’s entity evaluation by expanding the classifier’s training set. I use distributed rep-

resentations of words to compute unlabeled entities that are similar to known entities. Dis-

tributed representations of words have been shown to be successful at improving general-

ization. Passos et al. (2014) proposed word embeddings that leverage lexicons and used the

embeddings to improve a CRF-based named entity recognition system.

7.3 Approach

In this section, I propose an entity classifier and its enhancement by expanding its training

set using an unsupervised word similarity measure.

I build a one-vs-all entity classifier using logistic regression. In each iteration, for

label l, the entity classifier is trained by treating l’s dictionary entities (seed and learned

in previous iterations) as positive and entities belonging to all other labels as negative. To

improve generalization, I also sample the unlabeled entities that are not function words as

negative. To train with a balanced dataset, I randomly sub-sample the negatives such that

the number of negative instances is equal to the number of positive instances.
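A minimal sketch of how the training set for one label could be assembled under this scheme (positives from the label's dictionary, negatives from the other labels plus sampled unlabeled non-function-words, sub-sampled to a balanced set). The featurization and the logistic regression classifier itself are elided, and all names are illustrative; the resulting (entity, label) pairs would then be featurized as described below and passed to an off-the-shelf logistic regression implementation.

    import random

    def build_training_set(label, dictionaries, unlabeled, function_words, seed=0):
        """dictionaries maps each label to its seed + previously learned entities."""
        rng = random.Random(seed)
        positives = sorted(dictionaries[label])
        negatives = [e for other, ents in dictionaries.items()
                     if other != label for e in ents]
        # Also sample unlabeled entities that are not function words as negatives.
        negatives += [e for e in unlabeled if e not in function_words]
        # Sub-sample the negatives so the two classes are balanced.
        rng.shuffle(negatives)
        negatives = negatives[:len(positives)]
        return [(e, 1) for e in positives] + [(e, 0) for e in negatives]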

The features for the entities are similar to the ones described in Chapter 6 from Gupta

and Manning (2014a): edit distances from positive and negative entities, relative frequency

of the entity words in the seed dictionaries, word classes computed using the Brown clus-

tering algorithm, and pattern TF-IDF score. Note that in Chapter 6, I averaged the feature

values to predict an entity’s score; one of the features was the score of the word class clus-

ter belonging to a label. First, the words were clustered using the Brown clustering method

(Brown et al., 1992). Then, each cluster was considered as an instance in a logistic regres-

sion classifier, which was trained to give a probability of whether a cluster belongs to the

given label. I then used this cluster score as a feature in the average function. Here, I simply

include the word cluster id directly as a feature in the logistic regression classifier, which

is trained to give a score of whether an entity belongs to the given label. The last feature,

pattern TF-IDF score, gives higher scores to entities that are extracted by many learned

patterns and have low frequency in the dataset. In the experiments, I call this classifier

NotExpanded.

The lack of labeled data to train a good entity classifier is one of the challenges in


bootstrapped learning. I use distributed representations of words, in the form of word

vectors, to guide the entity classifier by expanding its training set. I expand the positive

training set by labeling the unlabeled entities that are similar to the seed entities of the label

as positive examples, and labeling the unlabeled entities that are similar to seed entities of

other labels as negative examples. I take the cautious approach of finding similar entities

only to the seed entities and not the learned entities. The algorithm can be modified to

find similar entities to learned entities as well. Cautious approaches have been shown to be

better for bootstrapped learning (Abney, 2004; Surdeanu et al., 2006).

To compute the similarity of an unlabeled entity to the positive entities, I find the k most similar

positive entities, measured by cosine similarity between the word vectors, and average the

scores. Similarly, I compute similarity of the unlabeled entity to the negative entities. If the

entity’s positive similarity score is above a given threshold θ and is higher than its negative

similarity score, it is added to the training set with positive label. I expand the negative

entities similarly. I tried expanding just the positive entities and just the negative entities.

Their relative performance, though higher than the baselines, varied between the datasets.

Expanding both positives and negatives gave more stable results across the datasets. Thus,

I present results only for expanding both positives and negatives.
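A sketch of this expansion step, assuming the word vectors are available in a dictionary mapping each word to a numpy array. It follows the procedure above (average cosine similarity to the k most similar seed entities on each side, a threshold θ, and the requirement that the positive similarity exceed the negative one), with k = 2 and θ = 0.4 as used in the experiments later; the names are illustrative.

    import numpy as np

    def cosine(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

    def avg_knn_similarity(entity, seeds, vectors, k=2):
        sims = sorted((cosine(vectors[entity], vectors[s]) for s in seeds if s in vectors),
                      reverse=True)[:k]
        return sum(sims) / len(sims) if sims else 0.0

    def expand_training_set(unlabeled, pos_seeds, neg_seeds, vectors, k=2, theta=0.4):
        new_pos, new_neg = [], []
        for e in unlabeled:
            if e not in vectors:
                continue
            pos_sim = avg_knn_similarity(e, pos_seeds, vectors, k)
            neg_sim = avg_knn_similarity(e, neg_seeds, vectors, k)
            if pos_sim > theta and pos_sim > neg_sim:
                new_pos.append(e)          # added to the training set as positive
            elif neg_sim > theta and neg_sim > pos_sim:
                new_neg.append(e)          # added to the training set as negative
        return new_pos, new_neg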

An alternative to our approach is to directly label the entities using the vector simi-

larities. Our experimental results suggest that even though exploiting similarities between

word vectors is useful for guiding the classifier by expanding the training set, it is not ro-

bust enough to use for labeling entities directly. For example, for our development dataset,

when the similarity threshold θ was set as 0.4, 16 out of 41 unlabeled entities that were

expanded into the training set as positive entities were false positives. Increasing θ ex-

tracted far fewer entities. Setting θ to 0.5 extracted only 5 entities, all true positives, and

to 0.6 extracted none. Thus, labeling entities solely based on similarity scores resulted in

lower performance. A classifier, on the other hand, can use other sources of information as

features to predict an entity’s label.

I compute the distributed vector representations using the continuous bag-of-words

model (Mikolov et al., 2013b; Mikolov et al., 2013a) implemented in the word2vec toolkit.1

1http://code.google.com/p/word2vec/


Forum      Expanded   Expanded-M   NotExpanded   Average
Asthma     77.01      75.68        74.48         65.42
Acne       73.84      75.41        71.65         65.05
Diabetes   82.37      44.25        48.75         21.82
ENT        80.66      80.04        77.02         59.50

Table 7.1: Area under Precision-Recall curve for all the systems. Expanded is our system when word vectors are learned using the Wiki+Twit+MedHelp data and Expanded-M is when word vectors are learned using the MedHelp data. Average is the average of feature values, similar to Gupta and Manning (2014a).

The publicly available word vectors are not tailored towards the online health forums do-

main and thus I train new vector representations. I train 200-dimensional vector representa-

tions on a combined dataset of a 2014 Wikipedia dump (1.6 billion tokens), a sample of 50

million tweets from Twitter (200 million tokens), and an in-domain dataset of all MedHelp

forums (400 million tokens). The three types of datasets have words and context of differ-

ent kinds: the Wikipedia data mainly consists of domain-independent words; the Twitter

data has many slang and colloquial words, also common on online forums; and the Med-

Help data has the in-domain content. I tried learning 500-dimensional and 50-dimensional

vectors; the 200-dimensional vectors worked best on the development data. I removed

words that occurred less than 20 times, resulting in a vocabulary of 89k words. I call this

dataset Wiki+Twit+MedHelp. I used the parameters suggested in Pennington et al. (2014):

negative sampling with 10 samples and a window size of 10. I ran the model for 3 itera-

tions, which were enough to get good results; more iterations would presumably result in

better vectors.
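As one way to reproduce this setup, the vectors could equally be trained with the gensim re-implementation of word2vec instead of the original C toolkit. The parameter names below follow gensim 4.x (in older versions vector_size is called size), the corpus path is a placeholder for the tokenized Wikipedia + Twitter + MedHelp text described above, and only the hyperparameter values come from this section.

    from gensim.models import Word2Vec
    from gensim.models.word2vec import LineSentence

    # Placeholder corpus file: one tokenized sentence per line, drawn from the
    # combined Wikipedia + Twitter + MedHelp text.
    corpus = LineSentence("wiki_twit_medhelp.tokenized.txt")

    model = Word2Vec(
        corpus,
        vector_size=200,   # 200-dimensional vectors worked best on the development forum
        sg=0,              # continuous bag-of-words (CBOW) model
        window=10,         # context window of 10
        negative=10,       # negative sampling with 10 samples
        min_count=20,      # drop words occurring fewer than 20 times
        epochs=3,          # 3 passes over the corpus
    )
    model.wv.save_word2vec_format("vectors_200d.txt")   # placeholder output path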

7.4 Experimental Setup

I present results on the same experimental setup, dataset, and seed lists as discussed in

Chapter 6 from Gupta and Manning (2014a). The task is to extract drug-and-treatment

(DT) entities in sentences from four forums on the MedHelp user health discussion website:

1. Asthma, 2. Acne, 3. Adult Type II Diabetes (called Diabetes), and 4. Ear Nose &

Throat (called ENT). A DT entity is defined as a pharmaceutical drug, or any treatment


or intervention mentioned that may help a symptom or a condition. I judged the output of all systems, following the guidelines in the previous chapter. I used Asthma as the development forum for parameter and threshold tuning. I used the threshold θ as 0.4 and k (the number of nearest neighbors) as 2 when expanding the seed sets.

[Figure 7.2: Precision vs. Recall curves of our system (Expanded) and the baselines (NotExpanded, Average) for the Asthma forum.]

[Figure 7.3: Precision vs. Recall curves of our system and the baselines for the Acne forum.]

[Figure 7.4: Precision vs. Recall curves of our system and the baselines for the Diabetes forum.]

[Figure 7.5: Precision vs. Recall curves of our system and the baselines for the ENT forum.]

I evaluate systems by their precision and recall (see Chapter 2 for details). Similar to the previous chapter, I present the precision and recall curves for precision above 75% to

compare systems when they extract entities with reasonably high precision. Recall is de-

fined as the fraction of correct entities among the total unique correct entities pooled from

all systems. Note that precision at lower thresholds and true recall are very hard to compute.

Our dataset is unlabeled and manually labeling all entities is expensive. Pooling is a com-

mon evaluation strategy in such situations (such as, in information retrieval (Buckley et al.,

2007) and the TAC-KBP shared task). I calculate the area under the precision-recall curves

(AUC-PR) to compare the systems.

I call our system Expanded in the experiments. To compare the effects of word vectors

learned using different types of datasets, I also study our system when the word vectors are

learned using just the in-domain MedHelp data, called Expanded-M. I compare against two

baselines: NotExpanded, as explained in the previous section, and Average, in which I average

the feature values, similar to Gupta and Manning (2014a).

7.5 Results and Discussion

Table 7.1 shows the AUC-PR of the various systems and Figures 7.2–7.5 show the precision-recall curves. Our systems Expanded and Expanded-M, which used similar entities for training, improved the scores for all four forums. I believe the improvement for the Diabetes forum was much higher than for the other forums because the baseline’s performance on the forum

degraded quickly in later iterations (see the figure), and improving the classifier helped

in adding more correct entities. Additionally, Diabetes DT entities are more lifestyle-

based and hence occur frequently in web text, making the word vectors trained using the

Wiki+Twit+MedHelp dataset better suited.

In three out of four forums, word vectors trained using a large corpus perform better

than those trained using the smaller in-domain corpus. For the Acne forum, where brand

Page 138: DISTANTLY SUPERVISED INFORMATION EXTRACTION …The best thing about Stanford is its grad students – brilliant and yet so approachable. It has been amazing to be a part of the NLP

CHAPTER 7. WORD EMBEDDINGS IMPROVE ENTITY CLASSIFIERS 121

Positives NegativesAsthma

pranayama, steril-izing, expectorants,inhalable, sanitizers,ayurvedic

block, yougurt, med-cine, exertion, hate, vi-rally

Diabetesquinoa, vinegars,vegatables, thread-mill, possilbe, asanas,omegas

nicely, chiropracter,exhales, paralytic,metabolize, fluffy

Table 7.2: Examples of unlabeled entities that were expanded into the training sets. Graycolored entities were judged by the authors as falsely labeled.

name DT entities are more frequent, the entities expanded by MedHelp vectors had fewer

false positives than those expanded by Wiki+Twit+MedHelp.

Table 7.2 shows some examples of unlabeled entities that were included as positive/neg-

ative entities in the entity classifiers. Even though some entities were included in the train-

ing data with wrong labels, overall the classifiers benefited from the expansion.

7.6 Conclusion

I improve entity classifiers in bootstrapped entity extraction systems by enhancing the train-

ing set using unsupervised distributed representations of words. The classifiers learned us-

ing the expanded seed sets extract entities with better F1 score. This supports our hypoth-

esis that generalizing labels to entities that are similar according to unsupervised methods

of word vector learning is effective in improving entity classifiers, notwithstanding that the

label generalization is quite noisy. Using the word embedding based similarity measure to

directly label the data resulted in low scores. However, training a classifier with expanded

training sets improved the scores, underscoring its robustness to noise.

In the last three chapters, I worked on applying bootstrapped pattern-based learning

to extract entities from PAT, improving scoring of both patterns and entities by exploiting

unlabeled data. In the next chapter, I turn briefly to another aspect important to real life use

of pattern-based systems – their interpretability and explainability.


Chapter 8

Visualizing and Diagnosing BPL

In the previous chapters, I discussed bootstrapped pattern-based learning, along with its

improvements, as an effective practical tool for entity extraction. In this chapter, I dis-

cuss why patterns are popular in industry and present a visualization tool for developing

a pattern-based system more effectively and efficiently. The work has been published in

Gupta and Manning (2014b).

8.1 Introduction

Entity extraction using patterns dominates commercial industry, mainly because patterns

are effective, interpretable by humans, and easy to customize to cope with errors (Chiti-

cariu et al., 2013). Patterns or rules, which can be hand crafted or learned by a system,

are commonly created by looking at the context around already known entities, such as

lexico-syntactic surface word patterns and dependency patterns. Building a pattern-based

learning system is usually a repetitive process, usually performed by the system developer,

of manually examining a system’s output to identify improvements or errors introduced by

changing the entity or pattern extractor. Interpretability of patterns makes it easier for hu-

mans to identify sources of errors by inspecting patterns that extracted incorrect instances

or instances that resulted in learning of bad patterns. Parameters range from window size

of the context in surface word patterns to thresholds for learning a candidate entity. At

present, there is a lack of tools helping a system developer to understand results and to


improve results iteratively.

Visualizing diagnostic information of a system and contrasting it with another system

can make the iterative process easier and more efficient. For example, consider a user trying

to decide on the context’s window size in surface word patterns. The user suspects that a part-of-speech (POS) restriction on context words might be required for a reduced

window size to avoid extracting erroneous mentions. A shorter context size usually extracts

entities with higher recall but lower precision. By comparing and contrasting extractions

of two systems with different parameters, the user can investigate the cases in which the

POS restriction is required with smaller window size, and whether the restriction causes the

system to miss some correct entities. In contrast, comparing just the accuracy of two systems does not allow the user to inspect the finer details of extractions that increase or decrease accuracy and to make changes accordingly.

In this chapter, I present a pattern-based entity learning and diagnostics tool, SPIED. It

consists of two components: 1. pattern-based entity learning using bootstrapping (SPIED-

Learn), and 2. visualizing the output of one or two entity learning systems (SPIED-Viz).

SPIED-Viz is independent of SPIED-Learn and can be used with any pattern-based entity

learner. For demonstration, I use the output of SPIED-Learn as an input to SPIED-Viz.

SPIED-Viz has pattern-centric and entity-centric views, which visualize learned patterns

and entities, respectively, and the explanations for learning them. SPIED-Viz can also con-

trast two systems by comparing the ranks of learned entities and patterns. As a concrete ex-

ample, I learn and visualize drug-treatment (DT) entities from unlabeled patient-generated

medical text, starting with seed dictionaries of entities for multiple classes. This is the same

task proposed and developed in Chapters 5 and 6 from Gupta et al. (2014b) and Gupta and

Manning (2014a).

My contributions are: 1. I present a novel diagnostic tool for visualization of output of

multiple pattern-based entity learning systems, and 2. I release the code of an end-to-end

pattern learning system, which learns entities using patterns in a bootstrapped system and

visualizes its diagnostic output. The pattern learning and the visualization code are avail-

able at http://nlp.stanford.edu/software/patternslearning.shtml.


8.2 Learning Patterns and Entities

SPIED-Learn is based on the system described in Chapter 6 and published in Gupta and Manning (2014a). The system builds upon previous bootstrapped pattern-learning work and

proposes an improved measure to score patterns. It learns entities for given classes from

unlabeled text by bootstrapping from seed dictionaries. Patterns are learned using labeled

entities, and entities are learned based on the extractions of learned patterns. The process

is iteratively performed until no more patterns or entities can be learned.
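To make the loop concrete, the following is a minimal, self-contained Python sketch of a bootstrapped entity/pattern learning loop of this kind (single class, simple fixed-window word-context patterns, and an RlogF-style score). It illustrates the overall iterative process rather than SPIED-Learn's actual implementation, and all function and parameter names are illustrative.

    from collections import Counter
    from math import log2

    def bootstrap(sentences, seeds, window=1, iters=5, pats_per_iter=5, ents_per_iter=10):
        # sentences: list of token lists; seeds: iterable of single-token seed entities.
        entities = set(w.lower() for w in seeds)
        patterns = set()
        for _ in range(iters):
            # 1. Candidate patterns: word contexts around currently labeled entity tokens.
            pos, total = Counter(), Counter()
            for toks in sentences:
                low = [t.lower() for t in toks]
                for i, tok in enumerate(low):
                    pat = (tuple(low[max(0, i - window):i]), tuple(low[i + 1:i + 1 + window]))
                    total[pat] += 1
                    if tok in entities:
                        pos[pat] += 1
            # 2. Score candidate patterns (RlogF-style) and learn the top few new ones.
            scores = {p: (pos[p] / total[p]) * log2(1 + pos[p])
                      for p in pos if p not in patterns}
            new_patterns = set(sorted(scores, key=scores.get, reverse=True)[:pats_per_iter])
            patterns |= new_patterns
            # 3. Apply all learned patterns; learn the most frequently extracted new entities.
            candidates = Counter()
            for toks in sentences:
                low = [t.lower() for t in toks]
                for i, tok in enumerate(low):
                    pat = (tuple(low[max(0, i - window):i]), tuple(low[i + 1:i + 1 + window]))
                    if pat in patterns and tok not in entities:
                        candidates[tok] += 1
            new_entities = set(t for t, _ in candidates.most_common(ents_per_iter))
            if not new_patterns and not new_entities:
                break  # stop when no more patterns or entities can be learned
            entities |= new_entities
        return entities, patterns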

SPIED-Learn provides an option to use any of the pattern scoring measures described

in (Riloff, 1996; Thelen and Riloff, 2002; Yangarber et al., 2002; Lin et al., 2003; Gupta

et al., 2014b). A pattern is scored based on the positive, negative, and unlabeled entities

it extracts. The positive and negative labels of entities are heuristically determined by the

system using the dictionaries and the iterative entity learning process. The oracle labels

of learned entities are not available to the learning system. Note that an entity that the

system considered positive might actually be incorrect, since the seed dictionaries can be

noisy and the system can learn incorrect entities in the previous iterations, and vice-versa.
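As a concrete illustration of how these counts can feed a score, the sketch below derives positive/negative/unlabeled counts for a pattern from the (seed plus learned) dictionaries and plugs them into an RlogF-style measure in the spirit of Riloff (1996). The helper names and the exact treatment of unlabeled entities are illustrative, not SPIED-Learn's code.

    from math import log2

    def label_counts(extracted_entities, positive_dict, negative_dict):
        # Heuristically partition a pattern's extractions using the seed + learned dictionaries:
        # entities in the class dictionary are positive, entities in other classes' dictionaries
        # (or explicit negative lists) are negative, and the rest are unlabeled.
        pos = sum(1 for e in extracted_entities if e in positive_dict)
        neg = sum(1 for e in extracted_entities if e in negative_dict)
        unlabeled = len(extracted_entities) - pos - neg
        return pos, neg, unlabeled

    def rlogf(pos, neg, unlabeled):
        # RlogF-style score: precision of the pattern on labeled extractions, weighted by the
        # log of its positive count. Unlabeled extractions only appear in the denominator here,
        # which is one common (and, as Chapter 6 argues, improvable) choice.
        total = pos + neg + unlabeled
        return 0.0 if total == 0 or pos == 0 else (pos / total) * log2(pos)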

SPIED-Learn's entity scorer can be either of the entity scoring approaches described in Chapters 6 and 7.

Each candidate entity is scored using weights of the patterns that extract it and other

entity scoring measures, such as TF-IDF. Thus, learning of each entity can be explained by

the learned patterns that extract it, and learning of each pattern can be explained by all the

entities it extracts.
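The following is a minimal sketch of this entity-scoring step with an illustrative 50/50 weighting (not SPIED-Learn's actual formula); it also records the entity-to-pattern provenance that the explanations above rely on.

    from collections import defaultdict

    def score_entities(extractions, pattern_weights, tfidf):
        # extractions: iterable of (pattern, entity) pairs produced by applying learned patterns.
        # Each candidate entity is scored from the weights of the patterns that extracted it,
        # combined here 50/50 with a TF-IDF-style term weight (the combination is illustrative).
        entity_to_patterns = defaultdict(set)
        for pattern, entity in extractions:
            entity_to_patterns[entity].add(pattern)
        scores = {}
        for entity, pats in entity_to_patterns.items():
            avg_weight = sum(pattern_weights[p] for p in pats) / len(pats)
            scores[entity] = 0.5 * avg_weight + 0.5 * tfidf.get(entity, 0.0)
        # entity_to_patterns is the provenance that explains why each entity was learned.
        return scores, entity_to_patterns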

8.3 Design Criteria

The following design criteria are considered when designing the interface.

• Quick summary: The interface should provide a quick summary of the learned en-

tities and patterns, including the percentage of correct and incorrect entities, if gold

labels are provided.

• Provenance: In a pattern-based system, tracing the provenance of an extracted entity or pattern


is much easier than in a feature-based system. The visualization needs to be able to

drill down from a dictionary entry (usually a learned entity) to the learned patterns that extracted it. Similarly, it should have the ability to go from a pattern to the lists of entities it extracted (divided by gold labels, if provided) and to the pattern's perceived goodness.

• Individual goodness and its quick identification: Using heuristic criteria, the system should identify good and bad learned patterns and entities, which helps in quick identification and diagnosis of errors. In SPIED-Viz, an exclamation mark is

shown for a pattern if more than half of the entities extracted by it are incorrect. In

addition, various signs, such as trophy (correct entity extracted by only one system)

and star (unlabeled entity extracted by only one system), are used to identify various

types of entities.

• Comparison: The interface should be able to compare multiple systems, both at a

higher level and a fine-grained entity/pattern level.

• Pattern-centric and entity-centric views: These views can provide detailed informa-

tion, either from a pattern point of view or from an entity point of view.

• Easy and fast: The tool should not require any cumbersome installation and should

be fast to use. Web browser-based tools are easy to use since they do not require

installation of new software.

8.4 Visualizing Diagnostic Information

SPIED-Viz visualizes learned entities and patterns from one or two entity learning systems,

and the diagnostic information associated with them. It optionally uses the oracle labels

of learned entities to color code them, and contrast their ranks of correct/incorrect enti-

ties when comparing two systems. The oracle labels are usually determined by manually

judging each learned entity as correct or incorrect. SPIED-Viz has two views: 1. a pattern-centric view that visualizes the patterns of one or two systems, and 2. an entity-centric view that mainly focuses on the learned entities. Figure 8.1 shows a screenshot of the entity-centric view of SPIED-Viz. It displays the following information:


Summary: Summary information of each system at each iteration and overall. It shows

for each system the number of iterations, the number of patterns learned, and the

number of correct and incorrect entities learned.

Learned Entities with provenance: It shows a ranked list of entities learned by each system,

along with an explanation of why the entity was learned. The details shown include

the entity’s oracle label, its rank in the other system, and the learned patterns that

extracted the entity. Such information can help the user to identify and inspect the

patterns responsible for learning an incorrect entity. The interface also provides a

link to search the entity, along with any user-provided keywords (such as the domain of

the problem) on Google.

System Comparison: SPIED-Viz can be used to compare entities learned by two systems.

It marks entities that are learned by one system but not by the other system, by either

displaying a trophy sign (if the entity is correct), a thumbs down sign (if the entity is

incorrect), or a star sign (if the oracle label is not provided).

The second view of SPIED-Viz is pattern-centric. Figure 8.2 shows a screenshot of the

pattern-centric view. It displays the following information.

Summary: Summary information of each system including the number of iterations and

number of patterns learned at each iteration and overall.

Learned Patterns with provenance: It shows a ranked list of patterns along with the entities each pattern extracts and their labels. Note that each pattern is associated with a set of positive, negative, and unlabeled entities, which were used to determine its score.1 It also shows the percentage of unlabeled entities extracted by a pattern that were eventually learned by the system and assessed as correct by the oracle. A smaller percentage means that the pattern extracted many entities that were either never learned or learned but labeled as incorrect by the oracle.

1 Note that the positive, negative, and unlabeled labels are different from the oracle labels, correct and incorrect, for the learned entities. The former refer to the entity labels considered by the system when learning the pattern, and they come from the seed dictionaries and the learned entities. A positive entity considered by the system can be labeled as incorrect by the human assessor, in case the system made a mistake in labeling the data, and vice versa.


Figure 8.3 shows an option in the entity-centric view in which hovering over an entity opens a window on the side that shows the diagnostic information for the entity as learned by the other system. This comparison directly contrasts the learning of an entity by the two systems. For example, it can help the user inspect why an entity was learned at an earlier rank in one system than in the other.

An advantage of making the entity learning component and the visualization compo-

nent independent is that a developer can use any pattern scorer or entity scorer in the system

without depending on the visualization component to provide that functionality.

I develop a list-based visualization since it is easy to navigate and it can compare learn-

ing of individual entities/patterns. Additionally, since most pattern-based systems are it-

erative, ranking the entities/patterns in the visualization by the number of iterations helps

in diagnosing errors better. Other variations, such as clustering of entities based on patterns, can give higher-level insights into the learning process; however, they make it more difficult to diagnose the sources of errors.

8.5 System Details

SPIED-Learn uses TokensRegex (Chang and Manning, 2014) to create and apply surface

word patterns to text. SPIED-Viz takes the details of learned entities and patterns as input in JSON format. It uses JavaScript, Angular, and jQuery to visualize the information in a web browser.
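As an illustration of the kind of input SPIED-Viz consumes, the snippet below writes a small, entirely hypothetical diagnostics file: the field names and values are invented for this example, and the actual format is defined by the released code.

    import json

    # Hypothetical diagnostics for one system; field names and values are illustrative only.
    diagnostics = {
        "system": "DT-window2",
        "iterations": [
            {
                "iteration": 1,
                "patterns": [
                    {"pattern": "took __ for my pain", "score": 0.82,
                     "positive": ["ibuprofen"], "negative": [], "unlabeled": ["a break"]},
                ],
                "entities": [
                    {"entity": "ibuprofen", "score": 0.91, "oracle": "correct",
                     "extracted_by": ["took __ for my pain"]},
                ],
            },
        ],
    }

    with open("spied_viz_input.json", "w") as f:
        json.dump(diagnostics, f, indent=2)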

8.6 Related Work

Most interactive IE systems focus on annotation of text, labeling of entities, and manual

writing of rules. Some annotation and labeling tools are: MITRE’s Callisto2, Knowta-

tor3, SAPIENT (Liakata et al., 2009), brat4, Melita (Ciravegna et al., 2002), and XConc Suite (Kim et al., 2008).

2 http://callisto.mitre.org
3 http://knowtator.sourceforge.net
4 http://brat.nlplab.org


[Figure 8.1 appears here. Its callouts describe: the score of the entity in this system and the other system, along with a link to search it on Google; a star sign for an entity, indicating that the entity label is not provided and it was not extracted by the other system; a trophy sign, indicating that the entity is correct and was not extracted by the other system; the list of entities learned at each iteration, where green indicates that the entity is correct and red that it is incorrect; and the list of patterns that extracted the entity, with details similar to those shown in the pattern-centric view.]

Figure 8.1: Entity-centric view of SPIED-Viz. The interface allows the user to drill down the results to diagnose extraction of correct and incorrect entities, and contrast the details of the two systems. The entities that are not learned by the other system are marked with either a trophy (correct entity), a thumbs down (incorrect entity), or a star icon (oracle label missing), for easy identification.


[Figure 8.2 appears here. Its callouts describe: the list of entities considered as positive, negative, and unlabeled by the system when it learned this pattern; an exclamation sign, indicating that less than half of the unlabeled entities were eventually learned with the correct label; the details of the pattern; green color of an entity, indicating that the entity was learned by the system and the oracle assigned it the 'correct' label; and the list of patterns learned at each iteration, where a blue pattern indicates that the pattern was not learned by the other system.]

Figure 8.2: Pattern-centric view of SPIED-Viz.


Figure 8.3: When the user clicks on the compare icon for an entity, the explanations of the entity extraction for both systems (if available) are displayed. This allows direct comparison of why the two systems learned the entity.

Akbik et al. (2013) present a tool that interactively helps non-expert users manually write patterns over dependency trees. GATE5 provides the JAPE language, which recognizes regular expressions over annotations. Other systems focus on reducing manual effort

for developing extractors (Brauer et al., 2011; Li et al., 2011). ICE (He and Grishman,

2015) is an interface for building entity, relation, and event extractors using dependency

patterns. Valenzuela-Escarcega et al. (2015) built an interactive web-based event extraction

tool for event grammar development via rules. In contrast, our tool focuses on visualizing

and comparing diagnostic information associated with pattern learning systems.

WizIE (Li et al., 2012b) is an integrated environment for annotating text and writing

pattern extractors for information extraction. It also generates regular expressions around

labeled mentions and suggests patterns to users. It is most similar to our tool as it displays

an explanation of the results extracted by a pattern. However, it is focused on the hand-writing and selection of rules. In addition, it cannot be used to directly compare two pattern

learning systems.

What's Wrong With My NLP?6 is a tool for jointly visualizing various natural language processing formats such as trees, graphs, and entities. It shares our system's focus on diagnosing errors so that they can be fixed, but differs in providing no tools to drill down and find the source of errors. Since I focus on a particular task and a learning mechanism, I am able to develop a specialized tool that can provide more functionality,


which is presumably harder for a generic visualization tool.

5 http://gate.ac.uk
6 https://code.google.com/p/whatswrong

8.7 Future Work and Conclusion

A limitation of the tool is that there is no way for a user to give feedback, such as to provide

the oracle label of a learned entity. Currently, the oracle labels are assigned offline. It would

be useful to extend the interface to visualize diagnostic information of learned relations, in

addition to entities, from a pattern-based relation learning system. Another avenue of future

work is to evaluate SPIED-Viz by studying its users and their interactions with the system.

In addition, the visualization can be improved by summarizing the diagnostic information,

such as which parameters led to what mistakes, to make it easier to understand for systems

that extract a large number of patterns and entities.

In this chapter, I present a novel diagnostic tool for pattern-based entity learning that

visualizes and compares the output of one or two systems. It is a lightweight, web browser-based visualization that can be used with any pattern-based entity learner.

I make the code of an end-to-end system freely available. The system learns entities and

patterns using bootstrapping starting with seed dictionaries, and visualizes the diagnostic

output. The tool was crucial in diagnosing the systems I built in Chapters 6 and 7. The

problem with unlabeled entities in a pattern learning system, as described in Chapter 6,

became apparent when I was identifying the sources of errors in earlier systems using the

interface. I hope SPIED will also help other researchers and users to diagnose errors and tune parameters in their pattern-based entity learning systems in an easy and efficient way.


Chapter 9

Conclusions

In this dissertation, I presented bootstrapped pattern-based learning (BPL) as an effective

approach for entity extraction tasks that have no fully labeled data. Even though many real-

world information extraction tasks have no fully labeled data, developers can frequently

gather a few examples, either manually or by using existing knowledge bases. These ex-

amples can be used as seed sets, along with an unlabeled corpus, for bootstrapped learning.

I proposed two new tasks and showed that BPL extracts information effectively, starting

out with only a few handwritten patterns or automatically constructed dictionaries. Ex-

isting BPL systems underutilize the unlabeled data. I proposed improvements to BPL by

leveraging unlabeled data in its pattern and entity scoring functions.

The two tasks I proposed had not been studied before: 1. studying the influence of aca-

demic papers and communities by extracting techniques, domain, and focus entities, and 2.

extracting medical entities of types symptom-and-condition and drug-and-treatment from

challenging and loosely-structured patient-authored text. A bootstrapped system is suitable

for both problems since the tasks are new and no fully labeled dataset exists

to train supervised classifiers.

In Chapter 4, I described the first task, our approach, and a case study of the computa-

tional linguistics academic community. I bootstrapped with a few hand-written patterns for

the three entity types. Since the sentences in scientific articles are well-formed and gram-

matical, I learned dependency patterns using the dependency parses of the sentences. I also

showed that for one entity type our system outperformed a fully supervised conditional


random field (CRF) model. In a case study, I discuss how the Speech Recognition sub-

community has been very influential in the computational linguistics community, mainly because it introduced now-standard techniques, such as hidden Markov models,

expectation maximization, and language modeling.

I described the second task of extracting medical entities from patient-authored text on

online health forums in Chapter 5. The dataset was challenging because of the mismatch between the content of existing bioinformatics resources and that of the online health forums. Patients

use slang, colloquial, and descriptive phrases, usually not found in existing medical on-

tologies. Thus, our system started the learning process with dictionaries consisting mostly

of well-formed official names of medical entities. Over the iterations, it learned various

informal entities, including spelling mistakes, abbreviations, sub-phrases, and new terms.

Our system outperformed commonly used medical annotators (MetaMap and OBA), sta-

tistical approaches (self-trained CRFs), existing pattern-based learning systems (Xu et al.,

2008), and other dictionary-based approaches. I presented a case study comparing the anecdotal efficacy of two new alternative treatments extracted by our system. The analysis

shows that our system can potentially be used to study the efficacy and side effects of drugs

and treatments at a large scale.

In Chapters 6 and 7, I proposed improvements to BPL systems by exploiting the unla-

beled data. Similar to many distantly supervised learning systems, existing bootstrapped

pattern-based learning systems either ignore the unlabeled data or make closed world as-

sumptions. In Chapter 6, I discussed how to improve BPL’s pattern scoring by evaluating

the unlabeled entities extracted by patterns. I proposed five unsupervised features, such

as distributional similarity and contrasting domain vs. generic text, to predict labels of

unlabeled entities, and used the predictions to rank patterns better. My system performed

significantly better than the existing pattern ranking approaches.
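The underlying idea can be sketched as adjusting a pattern's positive count with a predicted probability of correctness for each unlabeled extraction. The function below illustrates that idea with an RlogF-style base score; predict_label stands in for the combination of unsupervised features and is not the exact weighting used in Chapter 6.

    from math import log2

    def pattern_score_with_unlabeled(pos, neg, unlabeled_entities, predict_label):
        # predict_label(entity) -> value in [0, 1] from unsupervised features such as
        # distributional similarity to known entities or domain-vs-generic frequency contrast.
        # The expected number of true positives among unlabeled extractions is added to pos
        # before computing an RlogF-style score.
        expected_pos = pos + sum(predict_label(e) for e in unlabeled_entities)
        total = pos + neg + len(unlabeled_entities)
        if total == 0 or expected_pos <= 1:
            return 0.0
        return (expected_pos / total) * log2(expected_pos)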

In Chapter 7, I proposed a method to improve BPL’s entity scoring using distributed

representations of words obtained by recent unsupervised neural network approaches. I

modeled the entity scoring function using a logistic regression classifier. Existing distantly

supervised approaches create the training set for the classifier using the known entities as

labeled examples and sampling the unlabeled entities as negative examples. This results

in the training set being both limited and noisy. The training set is limited in size because


the number of known entities is small. The dataset is noisy because many of the unlabeled

examples sampled as negative can actually be positive. I proposed a better way to create

the training set by exploiting the similarity of known entities to unlabeled entities. I used

distributed representations of words to find unlabeled entities that are similar to seed en-

tities and incorporated the k nearest neighbors in the training set. The proposed system

outperformed the baseline systems.
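A minimal sketch of this kNN-based expansion of the positive training set, assuming pre-trained word vectors are available as a dictionary from entity strings to numpy arrays (the similarity measure and value of k here are illustrative; Chapter 7's exact procedure differs in its details):

    import numpy as np

    def knn_positives(seed_entities, unlabeled_entities, vectors, k=2):
        # For each seed entity, add its k nearest unlabeled entities (by cosine similarity in
        # the word-vector space) to the positive side of the classifier's training set.
        def unit(v):
            return v / (np.linalg.norm(v) + 1e-8)
        unlabeled = [e for e in unlabeled_entities if e in vectors]
        if not unlabeled:
            return set()
        U = np.stack([unit(vectors[e]) for e in unlabeled])
        expanded = set()
        for seed in seed_entities:
            if seed in vectors:
                sims = U @ unit(vectors[seed])
                expanded.update(unlabeled[i] for i in np.argsort(-sims)[:k])
        return expanded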

The improvements I made to BPL’s pattern and entity scoring functions underscore the

potential of unlabeled data. Computing similarity between words in an unsupervised way

has always been a topic of interest. However, recent progress using deep learning approaches has improved accuracy. Our systems illustrate that in a distantly supervised

or semi-supervised system, performance can be improved by leveraging the unlabeled data

using unsupervised measures. This insight can be applied to existing relation learning

systems like NELL (Carlson et al., 2010b) and OLLIE (Mausam et al., 2012).

One of the main benefits of using patterns, apart from being effective and fast, is that

they are interpretable. Non-machine-learning experts can understand a pattern and its ex-

tractions to identify the source of errors and possible improvements. I presented a diagnos-

tic and visualization system for pattern-based learning systems in Chapter 8. The problem

with unlabeled entities extracted by patterns, described in Chapter 6, was diagnosed using

the visualization system. It can compare multiple systems by displaying their output and

the provenance of learned entities and patterns, helping developers to tune the parameters

of a system, which is usually not possible in a classifier-based system.

There are several interesting avenues to explore in the future. I discuss them below.

• Semantic drift: One main challenge of a bootstrapped system is avoiding semantic

drift, the phenomenon in which learning of a few false examples leads to a snowball

effect of learning more false patterns and examples. There has been some work

on detecting semantic drift (McIntosh and Curran, 2009). One solution is to have

a human give feedback occasionally to steer the system in the correct direction, as

implemented by NELL (Carlson et al., 2010b). More work is needed in automatically

detecting semantic drift using machine learning-based approaches and figuring out

the optimal opportunities for human input.


• Seed sets quality and quantity: Modeling the amount and quality of supervision

needed to build an effective bootstrapped system is hard but an important aspect

from a practical viewpoint. The question is not trivial since the definition of a class

is conveyed via the seed sets. When the seed sets do not cover the label set, the scope

of the task is not clear. For example, if the seed set consists of two football athlete

names, it is unclear whether the task is to learn all athlete names or just names of

footballers. Moreover, Pantel et al. (2009) showed that the seed set composition con-

siderably affects performance. In their experiments, the difference between the best

performing seed set and the worst performing seed set was 42% precision points and

39% recall points. Thus, using human experts to list high quality seed sets might lead

to a big performance boost. However, acquiring manually labeled seed sets would be

unrealistic for a large number of classes, as is the case with several open IE systems.

• Feature-based sequence models: A comprehensive comparison of pattern-based models with sequence models (such as HMMs and CRFs) is an important piece of work missing from the existing information extraction literature. In this dissertation, I show

that BPL approaches outperformed unsupervised and self-trained CRFs. Supervised

CRFs, on the other hand, are the go-to tools for building entity extraction systems

in academia. More experiments comparing BPL with supervised CRFs and hand-

written rule-based systems can give insights into strengths and weaknesses of each

type of system. Moreover, these insights can inform better hybrid systems that com-

bine pattern-based approaches and feature-based sequence models.

Pattern-based IE systems are very popular in industry; however, most of them are hand-

written. I hope my work on bootstrapped pattern-based learning will help in automating

these systems in an effective and efficient way. My research on exploiting the unlabeled

data in a bootstrapped system can improve accuracy of not only pattern-based machine

learning systems, but also presumably of feature-based machine learning systems. Fur-

thermore, I hope my research will motivate more researchers to work on practical and interpretable IE problems, bringing the academic world closer to the world of industry.


Appendix A

Stop Words List

Medical stop words

The medical stop words list is used to identify words that are common in medical text but are

not relevant to the entity types of the systems I worked on. Some of the words are not

dictionary words because the dataset is from online health forums, where users frequently

use abbreviations and slang words. The following is the list of words.

disease, diseases, disorder, symptom, symptoms, drug, drugs, problems, problem, prob,

probs, med, meds, pill, pills, medicine, medicines, medication, medications, treatment,

treatments, caps, capsules, capsule, tablet, tablets, tabs, doctor, dr, dr., doc, physician,

physicians, test, tests, testing, specialist, specialists, side-effect, side-effects, pharmaceu-

tical, pharmaceuticals, pharma, diagnosis, diagnose, diagnosed, exam, challenge, device,

condition, conditions, suffer, suffering, suffered, feel, feeling, prescription, prescribe, pre-

scribed, over-the-counter, otc

General stop words

The general stop words list is used to identify words that are commonly used in the English

language but do not belong to the entity types of the systems I worked on. It is similar to


many other stop word lists available on the web. It contains stop words for both whitespace-

tokenized text and PTB-tokenized1 text. Below is the list of words.

a, about, above, after, again, against, all, am, an, and, any, are, aren’t, as, at, be, because,

been, before, being, below, between, both, but, by, can, can’t, cannot, could, couldn’t,

did, didn’t, do, does, doesn’t, doing, don’t, down, during, each, few, for, from, further,

had, hadn’t, has, hasn’t, have, haven’t, having, he, he’d, he’ll, he’s, her, here, here’s, hers,

herself, him, himself, his, how, how’s, i, i’d, i’ll, i’m, i’ve, if, in, into, is, isn’t, it, it’s, its,

itself, let’s, me, more, most, mustn’t, my, myself, no, nor, not, of, off, on, once, only, or,

other, ought, our, ours, ourselves, out, over, own, same, shan’t, she, she’d, she’ll, she’s,

should, shouldn’t, so, some, such, than, that, that’s, the, their, theirs, them, themselves,

then, there, there’s, these, they, they’d, they’ll, they’re, they’ve, this, those, through, to,

too, under, until, up, very, was, wasn’t, we, we’d, we’ll, we’re, we’ve, were, weren’t, what,

what’s, when, when’s, where, where’s, which, while, who, who’s, whom, why, why’s,

with, won’t, would, wouldn’t, you, you’d, you’ll, you’re, you’ve, your, yours, yourself,

yourselves, n’t, ’re, ’ve, ’d, ’s, ’ll, ’m.

1 See http://nlp.stanford.edu/software/tokenizer.shtml and https://catalog.ldc.upenn.edu/LDC99T42 for the tokenization details.


Bibliography

Abney, Steven (2004). “Understanding the Yarowsky Algorithm”. In: Computational Lin-

guistics 30, pp. 365–395.

Agichtein, Eugene and Luis Gravano (2000). “Snowball: Extracting Relations from Large

Plain-text Collections”. In: Proceedings of the Fifth ACM Conference on Digital Li-

braries. DL’00.

Akbik, Alan, Oresti Konomi, and Michail Melnikov (2013). “Propminer: A Workflow for

Interactive Information Extraction and Exploration using Dependency Trees”. In: As-

sociation for Computer Linguistics System Demonstrations.

Angeli, Gabor, Sonal Gupta, Melvin Johnson Premkumar, Christopher D. Manning, Christo-

pher Re, Julie Tibshirani, Jean Y. Wu, Sen Wu, and Ce Zhang (2014). “Stanford’s Dis-

tantly Supervised Slot Filling Systems for KBP 2014”. In: Proceedings of the Text An-

alytics Conference.

Aronson, Alan R (2001). “Effective mapping of biomedical text to the UMLS Metathe-

saurus: the MetaMap program.” In: Proceedings of the AMIA Symposium.

Aronson, Alan R and Francois-Michel Lang (2010). “An overview of MetaMap: historical

perspective and recent advances”. In: Journal of the American Medical Informatics

Association 17, pp. 229–236.

Bellare, Kedar, Partha Talukdar, Giridhar Kumaran, Fernando Pereira, Mark Liberman,

Andrew McCallum, and Mark Dredze (2007). “Lightly-Supervised Attribute Extraction

for Web Search”. In: NIPS 2007 Workshop on Machine Learning for Web Search.

Bethard, Steven and Dan Jurafsky (2010). “Who should I cite: learning literature search

models from citation behavior”. In: Proceedings of the Conference on Information and

Knowledge Management.


Bird, Steven, Robert Dale, Bonnie J. Dorr, Bryan Gibson, Mark T. Joseph, Min yen Kan,

Dongwon Lee, Brett Powley, Dragomir R. Radev, and Yee Fan Tan (2008). “The ACL

Anthology Reference Corpus: A Reference Dataset for Bibliographic Research in Com-

putational Linguistics”. In: Proceedings of the Conference on Language Resources and

Evaluation (LREC).

Blei, David, Andrew Ng, and Michael I. Jordan (2003). “Latent Dirichlet Allocation”. In:

Journal of Machine Learning Research (JMLR) 3, pp. 993–1022.

Blum, Avrim and Tom Mitchell (1998). “Combining Labeled and Unlabeled Data with

Co-training”. In: Conference on Learning Theory (COLT).

Boella, Guido, Luigi Di Caro, and Livio Robaldo (2013). “Semantic Relation Extraction

from Legislative Text Using Generalized Syntactic Dependencies and Support Vector

Machines”. In: Proceedings of the 7th International Conference on Theory, Practice,

and Applications of Rules on the Web. RuleML’13.

Bollacker, Kurt, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor (2008).

“Freebase: A collaboratively created graph database for structuring human knowledge”.

In: International Conference on Management of Data (SIGMOD), pp. 1247–1250.

Brauer, Falk, Robert Rieger, Adrian Mocan, and Wojciech M. Barczynski (2011). “En-

abling information extraction by inference of regular expressions from sample enti-

ties”. In: Proceedings of the International Conference on Information and Knowledge

Management.

Brin, Sergey (1999). Extracting Patterns and Relations from the World Wide Web. Technical

Report. Previous number = SIDL-WP-1999-0119. Stanford InfoLab.

Brown, Peter F., Peter V. deSouza, Robert L. Mercer, Vincent J. Della Pietra, and Jennifer C. Lai (1992). “Class-Based n-gram Models of Natural Language”.

In: Computational Linguistics 18, pp. 467–479.

Buckley, Chris, Darrin Dimmick, Ian Soboroff, and Ellen Voorhees (2007). “Bias and the

limits of pooling for large collections”. In: Information Retrieval 10, pp. 491–508.

Buitelaar, Paul and Bernardo Magnini (2005). “Ontology Learning from Text: An Overview”.

In: In Ontology Learning from Text: Methods, Applications and Evaluation. IOS Press,

pp. 3–12.


Bunescu, Razvan C. and Raymond J. Mooney (2005). “A Shortest Path Dependency Ker-

nel for Relation Extraction”. In: Empirical Methods in Natural Language Processing

(EMNLP).

Butler, Declan (2013). “When Google got flu wrong”. In: Nature 494, pp. 155–156.

Califf, Mary Elaine and Raymond J. Mooney (1999). “Relational Learning of Pattern-

match Rules for Information Extraction”. In: Association for the Advancement of Arti-

ficial Intelligence (AAAI), pp. 328–334.

Carlson, Andrew, Justin Betteridge, Richard C. Wang, Estevam R. Hruschka Jr., and Tom

M. Mitchell (2010a). “Coupled Semi-supervised Learning for Information Extraction”.

In: Web Search and Data Mining (WSDM), pp. 101–110.

Carlson, Andrew, Justin Betteridge, Bryan Kisiel, Burr Settles, Estevam R. Hruschka Jr.,

and Tom M. Mitchell (2010b). “Toward an architecture for never-ending language

learning”. In: Association for the Advancement of Artificial Intelligence (AAAI).

Carneiro, Herman A. and Eleftherios Mylonakis (2009). “Google trends: a web-based tool

for real-time surveillance of disease outbreaks”. In: Clinical Infectious Diseases 10,

pp. 1557–1564.

Chang, Angel X. and Christopher D. Manning (2014). TokensRegex: Defining cascaded

regular expressions over tokens. Tech. rep. Department of Computer Science, Stanford

University (CSTR 2014-02).

Chinchor, Nancy A. (1998). “Proceedings of the Seventh Message Understanding Confer-

ence (MUC-7) Named Entity Task Definition”. In: Proceedings of the Seventh Message

Understanding Conference (MUC-7).

Chiticariu, Laura, Yunyao Li, and Frederick R. Reiss (2013). “Rule-Based Information

Extraction is Dead! Long Live Rule-Based Information Extraction Systems!” In: Em-

pirical Methods in Natural Language Processing (EMNLP), pp. 827–832.

Ciravegna, Fabio (2001). “Adaptive information extraction from text by rule induction and

generalisation”. In: International Joint Conference on Artificial Intelligence (IJCAI),

pp. 1251–1256.

Ciravegna, Fabio, Alexiei Dingli, Daniela Petrelli, and Yorick Wilks (2002). “User-system

cooperation in document annotation based on information extraction”. In: Proceedings


of the 13th International Conference on Knowledge Engineering and Knowledge Man-

agement.

Clark, Alexander (2001). “Unsupervised induction of stochastic context free grammars

with distributional clustering”. In: Computational Natural Language Learning (CoNLL).

Cohen, William W. and Sunita Sarawagi (2004). “Exploiting Dictionaries in Named Entity

Extraction: Combining semi-Markov Extraction Processes and Data Integration Meth-

ods”. In: Proceedings of the Tenth ACM SIGKDD International Conference on Knowl-

edge Discovery and Data Mining.

Cohen, William W. and Yoram Singer (1999). “A simple, fast, and effective rule learner”.

In: Association for the Advancement of Artificial Intelligence (AAAI), pp. 335–342.

Collins, Michael and Yoram Singer (1999). “Unsupervised Models for Named Entity Clas-

sification”. In: Empirical Methods in Natural Language Processing (EMNLP).

Collobert, Ronan and Jason Weston (2008). “A Unified Architecture for Natural Language

Processing: Deep Neural Networks with Multitask Learning”. In: Proceedings of the

25th International Conference on Machine Learning. ICML ’08.

De Marneffe, Marie Catherine, Bill MacCartney, and Christopher D. Manning (2006).

“Generating typed dependency parses from phrase structure parses”. In: Proceedings

of the Conference on Language Resources and Evaluation (LREC).

Demner-Fushman, Dina and Jimmy Lin (2007). “Answering clinical questions with knowledge-

based and statistical techniques”. In: Computational Linguistics 33, pp. 63–103.

Downey, Doug, Oren Etzioni, Stephen Soderland, and Daniel S. Weld (2004). “Learning

Text Patterns for Web Information Extraction and Assessment”. In: Proceedings of the

2004 AAAI Workshop on Adaptive Text Extraction and Mining (ATEM).

Downey, Doug, Oren Etzioni, and Stephen Soderland (2010). “Analysis of a Probabilistic

Model of Redundancy in Unsupervised Information Extraction”. In: Artificial Intelli-

gence 174.11, pp. 726–748. ISSN: 0004-3702.

Druck, Gregory, Gideon Mann, and Andrew McCallum (2008). “Learning from Labeled

Features using Generalized Expectation Criteria”. In: ACM Special Interest Group on

Information Retreival (SIGIR), pp. 595–602.

Epstein, Richard H., Paul St. Jacques, Michael Stockin, Brian Rothman, Jesse M. Ehren-

feld, and Joshua C. Denny (2013). “Automated identification of drug and food allergies


entered using non-standard terminology”. In: Journal of the American Medical Infor-

matics Association 20, pp. 962–968.

Etzioni, Oren, Michael Cafarella, Doug Downey, Ana-Maria Popescu, Tal Shaked, Stephen

Soderland, Daniel S. Weld, and Alexander Yates (2005). “Unsupervised named-entity

extraction from the web: An experimental study”. In: Artificial Intelligence 165.1, pp. 91–

134.

Fader, Anthony, Stephen Soderland, and Oren Etzioni (2011). “Identifying Relations for

Open Information Extraction”. In: Empirical Methods in Natural Language Processing

(EMNLP).

Fellbaum, Christiane (1998). WordNet: An Electronic Lexical Database. MIT Press.

Finkel, Jenny R. (2010). “Holistic Language Processing: Joint Models of Linguistic Struc-

ture”. PhD thesis. Stanford University.

Finkel, Jenny R., Trond Grenager, and Christopher Manning (2005). “Incorporating non-

local information into information extraction systems by Gibbs sampling”. In: Associ-

ation for Computational Linguistics (ACL), pp. 363–370.

Fox, Susannah and Maeve Duggan (2013). Health Online. http://www.pewinternet.

org/Reports/2013/Health-online.aspx.

Frantzi, Katerina, Sophia Ananiadou, and Hideki Mima (2000). “Automatic recognition of

multi-word terms: the C-value/NC-value method”. In: International Journal on Digital

Libraries 3.2, pp. 115–130.

Frati-Munari, Alberto C., Blanca E. Gordillo, Perla Altamirano, and C. Raul Ariza (1998).

“Hypoglycemic effect of Opuntia streptacantha Lemaire in NIDDM”. In: Diabetes Care

11, pp. 63–66.

Freitag, Dayne (1998). “Toward General-Purpose Learning for Information Extraction”. In:

International Conference on Computational Linguistics (COLING), pp. 404–408.

Freitag, Dayne and Nicholas Kushmerick (2000). “Boosted Wrapper Induction”. In: Asso-

ciation for the Advancement of Artificial Intelligence (AAAI), pp. 577–583.

Gerrish, Sean M. and David M. Blei (2010). “A language-based approach to measuring

scholarly impact”. In: International Conference on Machine Learning (ICML).


Govindaraju, Vidhya, Ce Zhang, and Christopher Re (2013). “Understanding Tables in

Context Using Standard NLP Toolkits”. In: Proceedings of the 51st Annual Meeting of

the Association for Computational Linguistics (Volume 2: Short Papers).

Gu, Baohua (2002). “Recognizing named entities in biomedical texts”. MA thesis. National

University of Singapore.

Gupta, Rahul, Alon Halevy, Xuezhi Wang, Steven Euijong Whang, and Fei Wu (2014a).

“Biperpedia: An Ontology for Search Applications”. In: Proceedings of the VLDB En-

dowment 7.7, pp. 505–516.

Gupta, Sonal and Christopher D. Manning (2011). “Analyzing the Dynamics of Research

by Extracting Key Aspects of Scientific Papers”. In: Proceedings of the International

Joint Conference on Natural Language Processing.

— (2014a). “Improved Pattern Learning for Bootstrapped Entity Extraction”. In: Compu-

tational Natural Language Learning (CoNLL).

— (2014b). “SPIED: Stanford Pattern-based Information Extraction and Diagnostics”. In:

Proceedings of the ACL 2014 Workshop on Interactive Language Learning, Visualiza-

tion, and Interfaces (ACL-ILLVI).

— (2015). “Distributed Representations of Words to Guide Bootstrapped Entity Classi-

fiers”. In: North American Association for Computational Linguistics (NAACL).

Gupta, Sonal, Diana MacLean, Jeffrey Heer, and Christopher D. Manning (2014b). “In-

duced Lexico-Syntactic Patterns Improve Information Extraction from Online Medical

Forums”. In: Journal of the American Medical Informatics Association 21, pp. 902–

909.

Hall, David, Daniel Jurafsky, and Christopher D Manning (2008). “Studying the history

of ideas using topic models”. In: Empirical Methods in Natural Language Processing

(EMNLP).

Hassan, Hany, Ahmed Hassan, and Ossama Emam (2006). “Unsupervised Information Ex-

traction Approach Using Graph Mutual Reinforcement”. In: Proceedings of the 2006

Conference on Empirical Methods in Natural Language Processing. EMNLP ’06.

He, Yifan and Ralph Grishman (2015). “ICE: Rapid Information Extraction Customiza-

tion for NLP Novices”. In: Proceedings of the 2015 Conference of the North American


Chapter of the Association for Computational Linguistics – Human Language Tech-

nologies (System Demostrations).

Hearst, Marti A (1992). “Automatic acquisition of hyponyms from large text corpora”. In:

Interational Conference on Computational linguistics, pp. 539–545.

Hobbs, Jerry R. and Ellen Riloff (2010). “Information Extraction”. In: Handbook of Natural

Language Processing, Second Edition. ISBN 978-1420085921.

Hobbs, Jerry R., John Bear, David Israel, and Mabry Tyson (1993). “FASTUS: A finite-

state processor for information extraction from real-world text”. In: International Joint

Conference on Artificial Intelligence. IJCAI’93, pp. 1172–1178.

Hobbs, Jerry R., Douglas E. Appelt, John Bear, David J. Israel, Megumi Kameyama, Mark

E. Stickel, and Mabry Tyson (1997). “FASTUS: A Cascaded Finite-State Transducer for

Extracting Information from Natural-Language Text”. In: Computing Research Repos-

itory (CoRR) cmp-lg/9705013.

Hovy, Eduard, Mitchell Marcus, Martha Palmer, Lance Ramshaw, and Ralph Weischedel

(2006). “OntoNotes: The 90% Solution”. In: Proceedings of the Human Language

Technology Conference of the NAACL. NAACL-Short ’06.

Illig, Jens, Benjamin Roth, and Dietrich Klakow (2014). “Unsupervised Parsing for Gener-

ating Surface-Based Relation Extraction Patterns”. In: European Association for Com-

putational Linguistics (EACL).

Jean-Louis, Ludovic, Romaric Besancon, Olivier Ferret, and Wei Wang (2011). “Using

a weakly supervised approach and lexical patterns for the KBP slot filling task”. In:

Proceedings of the Text Analysis Conference - Knowledge Base Propagation (KBP).

Jonquet, Clement, Nigam H. Shah, and Mark A. Musen (2009). “The Open Biomedical

Annotator”. In: Summit on translational bioinformatics 2009, pp. 56–60.

Kang, Ning, Bharat Singh, Zubair Afzal, Erik M. van Mulligen, and Jan A. Kors (2012).

“Using rule-based natural language processing to improve disease normalization in

biomedical text”. In: Journal of the American Medical Informatics Association 20,

pp. 876–881.

Khan, Alam, Mahpara Safdar, Mohammad Muzaffar Ali Khan, Khan Nawaz Khattak, and

Richard A. Anderson (2003). “Cinnamon improves glucose and lipids of people with

type 2 diabetes”. In: Diabetes Care 26.12, pp. 3215–3218.


Kim, Jin-Dong, Tomoko Ohta, Yuka Tateisi, and Jun ichi Tsujii (2003). “GENIA corpus -

A semantically annotated corpus for bio-textmining”. In: ISMB (Supplement of Bioin-

formatics), pp. 180–182.

Kim, Jin-Dong, Tomoko Ohta, and Jun ichi Tsujii (2008). “Corpus annotation for mining

biomedical events from literature.” In: BMC Bioinformatics 9.

Kleinberg, Jon M. (1999). “Authoritative sources in a hyperlinked environment”. In: Jour-

nal of the ACM 46, pp. 604–632.

Koo, Terry, Xavier Carreras, and Michael Collins (2008). “Simple Semi-Supervised De-

pendency Parsing”. In: Human Language Technology and Association for Computa-

tional Linguistics (HLT/ACL).

Kozareva, Zornitsa and Eduard H. Hovy (2013). “Tailoring the automated construction of

large-scale taxonomies using the web.” In: Language Resources and Evaluation 47.3,

pp. 859–890.

Lafferty, John, Andrew McCallum, and Fernando Pereira (2001). “Conditional Random

Fields: Probabilistic Models for Segmenting and Labeling Data”. In: International Con-

ference on Machine Learning (ICML), pp. 282–289.

Landauer, Thomas K., Peter W. Foltz, and Darrell Laham (1998). “An introduction to latent

semantic analysis”. In: Discourse processes 25, pp. 259–284.

Leaman, Robert, Laura Wojtulewicz, Ryan Sullivan, Annie Skariah, Jian Yang, and Gra-

ciela Gonzalez (2010). “Towards internet-age pharmacovigilance: Extracting adverse

drug reactions from user posts to health-related social networks”. In: Proceedings of

the 2010 workshop on biomedical natural language processing, pp. 117–125.

Letham, Benjamin, Cynthia Rudin, Tyler H. Mccormick, and David Madigan (2013). In-

terpretable classifiers using rules and Bayesian analysis: Building a better stroke pre-

diction model. Tech. rep. Department of Statistics, University of Washington (Report

No. 609).

Li, Na, Leilei Zhu, Prasenjit Mitra, Karl Mueller, Eric Poweleit, and C. Lee Giles (2010).

“OreChem ChemXSeer: A semantic digital library for chemistry”. In: Proceedings of

the Joint Conference on Digital libraries.


Li, Shen, Joao V. Graca, and Ben Taskar (2012a). “Wiki-ly Supervised Part-of-speech Tag-

ging”. In: Proceedings of the 2012 Joint Conference on Empirical Methods in Natu-

ral Language Processing and Computational Natural Language Learning. EMNLP-

CoNLL ’12.

Li, Yunyao, Vivian Chu, Sebastian Blohm, Huaiyu Zhu, and Howard Ho (2011). “Facili-

tating Pattern Discovery for Relation Extraction with Semantic-signature-based Clus-

tering”. In: Proceedings of the 20th ACM International Conference on Information and

Knowledge Management.

Li, Yunyao, Laura Chiticariu, Huahai Yang, Frederick R. Reiss, and Arnaldo Carreno-

fuentes (2012b). “WizIE: A Best Practices Guided Development Environment for In-

formation Extraction”. In: Proceedings of the ACL 2012 System Demonstrations.

Liakata, Maria, Claire Q, and Larisa N. Soldatova (2009). “Semantic Annotation of Pa-

pers: Interface & Enrichment Tool (SAPIENT)”. In: Proceedings of the BioNLP 2009

Workshop.

Liang, Percy (2005). “Semi-Supervised Learning for Natural Language”. MA thesis. Mas-

sachusetts Institute of Technology.

Lin, Winston, Roman Yangarber, and Ralph Grishman (2003). “Bootstrapped Learning of

Semantic Classes from Positive and Negative Examples”. In: International Conference

on Machine Learning (ICML).

Liu, Bin, Laura Chiticariu, Vivian Chu, H. V. Jagadish, and Frederick R. Reiss (2010).

“Automatic Rule Refinement for Information Extraction”. In: Proceedings of the VLDB

Endowment 3.1-2, pp. 588–597.

MacLean, Diana (2015). “Insights from Patient Authored Text : From Close Reading to

Automated Extraction”. PhD thesis. Stanford University.

MacLean, Diana and Jeffrey Heer (2013). “Identifying Medical Terms in Patient-Authored

Text: A Crowdsourcing-based Approach”. In: Journal of the American Medical Infor-

matics Association 20, pp. 1120–1127.

MacLean, Diana, Sonal Gupta, Anna Lembke, Christopher D. Manning, and Jeffrey Heer

(2015). “Forum77: An Analysis of an Online Health Forum Dedicated to Addiction

Recovery”. In: Computer Supported Cooperative Work and Social Computing (CSCW).


Mann, Gideon and Andrew McCallum (2008). “Generalized Expectation Criteria for Semi-

Supervised Learning of Conditional Random Fields”. In: Human Language Technology

and Association for Computational Linguistics (HLT/ACL), pp. 870–878.

Manning, Christopher, Prabhakar Raghavan, and Hinrich Schutze (2008). Introduction to

Information Retrieval. Vol. 1. Cambridge University Press.

Manning, Christopher D., Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard,

and Davic McClosky (2014). “The Stanford CoreNLP natural language processing

toolkit”. In: ACL system demonstrations.

Martins, Andre F. T., Noah A. Smith, Pedro M. Q. Aguiar, and Mario A. T. Figueiredo

(2011). “Structured Sparsity in Structured Prediction”. In: Proceedings of the Confer-

ence on Empirical Methods in Natural Language Processing. EMNLP’11.

Mausam, Michael Schmitz, Robert Bart, Stephen Soderland, and Oren Etzioni (2012).

“Open language learning for information extraction”. In: Empirical Methods in Nat-

ural Language Processing and Computational Natural Language Learning (EMNLP/-

CoNLL), pp. 523–534.

McIntosh, Tara and James R. Curran (2009). “Reducing Semantic Drift with Bagging

and Distributional Similarity”. In: Association for Computational Linguistics (ACL),

pp. 396–404.

McLernon, Brian and Nicholas Kushmerick (2006). “Transductive Pattern Learning for

Information Extraction”. In: Proceedings of the Workshop on Adaptive Text Extraction

and Mining (ATEM 2006).

Mikolov, Tomas, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean (2013a). “Dis-

tributed Representations of Words and Phrases and their Compositionality”. In: Ad-

vances in Neural Information Processing Systems (NIPS).

Mikolov, Tomas, Kai Chen, Greg Corrado, and Jeffrey Dean (2013b). Efficient Estimation

of Word Representations in Vector Space. Tech. rep. 1301.3781. arXiv.

Mintz, Mike, Steven Bills, Rion Snow, and Dan Jurafsky (2009). “Distant supervision for

relation extraction without labeled data”. In: Association for Computational Linguistics

(ACL), pp. 1003–1011.


Nallapati, Ramesh and Christopher D. Manning (2008). “Legal Docket-entry Classifica-

tion: Where Machine Learning Stumbles”. In: Empirical Methods in Natural Language

Processing (EMNLP), pp. 438–446.

Natarajan, Nagarajan, Inderjit S. Dhillon, Pradeep K. Ravikumar, and Ambuj Tewari (2013).

“Learning with Noisy Labels”. In: Advances in Neural Information Processing Systems.

Niu, Feng, Ce Zhang, Christopher Re, and Jude Shavlik (2012). “Elementary: Large-Scale

Knowledge-Base Construction via Machine Learning and Statistical Inference”. In: In-

ternational Journal on Semantic Web and Information Systems 8.3, pp. 42–73.

Noreen, E. (1989). Computer-intensive Methods for Testing Hypotheses: An Introduction.

John Wiley and Sons Inc.

Pado, Sebastian (2006). User’s guide to sigf: Significance testing by approximate randomisation. http://www.nlpado.de/~sebastian/software/sigf.shtml.

Pantel, Patrick, Eric Crestan, Arkady Borkovsky, Ana-Maria Popescu, and Vishnu Vyas

(2009). “Web-scale Distributional Similarity and Entity Set Expansion”. In: Proceed-

ings of the 2009 Conference on Empirical Methods in Natural Language Processing:

Volume 2 - Volume 2. EMNLP ’09, pp. 938–947.

Pasca, Marius (2004). “Acquisition of Categorized Named Entities for Web Search”. In:

Proceedings of the Thirteenth ACM International Conference on Information and Knowl-

edge Management. CIKM ’04.

Passos, Alexandre, Vineet Kumar, and Andrew McCallum (2014). “Lexicon Infused Phrase

Embeddings for Named Entity Resolution”. In: Proceedings of the Eighteenth Confer-

ence on Computational Natural Language Learning. Association for Computational

Linguistics, pp. 78–86.

Patwardhan, S. (2010). “Widening the Field of View of Information Extraction through

Sentential Event Recognition”. PhD thesis. University of Utah.

Pennington, Jeffrey, Richard Socher, and Christopher D. Manning (2014). “GloVe: Global

Vectors for Word Representation”. In: Empirical Methods in Natural Language Pro-

cessing (EMNLP).

Poon, Hoifung and Pedro Domingos (2010). “Unsupervised Ontology Induction from Text”.

In: Association for Computational Linguistics (ACL).


Pratt, Wanda and Meliha Yetisgen-Yildiz (2003). “A study of biomedical concept identification: MetaMap vs. people”. In: AMIA Annual Symposium Proceedings. Vol. 2003, pp. 529–533.

Putthividhya, Duangmanee (Pew) and Junling Hu (2011). “Bootstrapped Named Entity Recognition for Product Attribute Extraction”. In: Empirical Methods in Natural Language Processing (EMNLP), pp. 1557–1567.

Radev, Dragomir R., Eduard Hovy, and Kathleen McKeown (2002). “Introduction to the special issue on summarization”. In: Computational Linguistics 28, pp. 399–408.

Radev, Dragomir R., Pradeep Muthukrishnan, and Vahed Qazvinian (2009). “The ACL Anthology Network corpus”. In: Proceedings of the 2009 Workshop on Text and Citation Analysis for Scholarly Digital Libraries.

Ratinov, Lev and Dan Roth (2009). “Design Challenges and Misconceptions in Named Entity Recognition”. In: Computational Natural Language Learning (CoNLL).

Ravichandran, Deepak and Eduard Hovy (2002). “Learning Surface Text Patterns for a Question Answering System”. In: Association for Computational Linguistics (ACL), pp. 41–47.

Riloff, Ellen (1993). “Automatically Constructing a Dictionary for Information Extraction Tasks”. In: Proceedings of the Eleventh National Conference on Artificial Intelligence. AAAI’93, pp. 811–816.

— (1996). “Automatically Generating Extraction Patterns from Untagged Text”. In: Association for the Advancement of Artificial Intelligence (AAAI), pp. 1044–1049.

Riloff, Ellen and Rosie Jones (1999). “Learning Dictionaries for Information Extraction by Multi-level Bootstrapping”. In: Association for the Advancement of Artificial Intelligence (AAAI).

Ritter, Alan, Luke Zettlemoyer, Mausam, and Oren Etzioni (2013). “Modeling Missing Data in Distant Supervision for Information Extraction”. In: Transactions of the Association for Computational Linguistics (TACL) 1, pp. 367–378.

Roth, Benjamin and Dietrich Klakow (2013). “Combining Generative and Discriminative Model Scores for Distant Supervision”. In: Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing.

Ruch, Patrick, Celia Boyer, Christine Chichester, Imad Tbahriti, Antoine Geissbuhler, Paul Fabry, Julien Gobeill, Violaine Pillet, Dietrich Rebholz-Schuhmann, Christian Lovis, and Anne-Lise Veuthey (2007). “Using argumentation to extract key sentences from biomedical abstracts”. In: International Journal of Medical Informatics 76, pp. 195–200.

Sarawagi, Sunita (2008). “Information Extraction”. In: Foundations and Trends in Databases 1.3, pp. 261–377.

Schwartz, Ariel S. and Marti A. Hearst (2003). “A Simple Algorithm For Identifying Abbreviation Definitions in Biomedical Text”. In: Proceedings of the Pacific Symposium on Biocomputing.

Smith, Catherine Arnott and Paul J. Wicks (2008). “PatientsLikeMe: Consumer health vocabulary as a folksonomy”. In: AMIA Annual Symposium Proceedings. Vol. 2008.

Soderland, Stephen (1999). “Learning Information Extraction Rules for Semi-Structured and Free Text”. In: Machine Learning 34, pp. 233–272.

Soderland, Stephen, John Gilmer, Robert Bart, Oren Etzioni, and Daniel S. Weld (2013). “Open Information Extraction to KBP Relations in 3 Hours”. In: Proceedings of the Text Analysis Conference on Knowledge Base Population.

Stevenson, Mark and Mark A. Greenwood (2005). “A Semantic Approach to IE Pattern Induction”. In: Association for Computational Linguistics (ACL), pp. 379–386.

Subramanya, Amarnag, Slav Petrov, and Fernando Pereira (2010). “Efficient Graph-Based Semi-Supervised Learning of Structured Tagging Models”. In: Empirical Methods in Natural Language Processing (EMNLP).

Suchanek, Fabian M., Gjergji Kasneci, and Gerhard Weikum (2007). “YAGO: A core of semantic knowledge”. In: World Wide Web (WWW), pp. 697–706.

Suchanek, Fabian M., Mauro Sozio, and Gerhard Weikum (2009). “SOFIE: A Self-organizing Framework for Information Extraction”. In: Proceedings of the 18th International Conference on World Wide Web. WWW ’09.

Sudo, Kiyoshi, Satoshi Sekine, and Ralph Grishman (2003). “An Improved Extraction Pattern Representation Model for Automatic IE Pattern Acquisition”. In: Proceedings of the 41st Annual Meeting on Association for Computational Linguistics. ACL ’03.

Surdeanu, Mihai, Jordi Turmo, and Alicia Ageno (2006). “A Hybrid Approach for the Acquisition of Information Extraction Patterns”. In: Proceedings of the EACL 2006 Workshop on Adaptive Text Extraction and Mining. ATEM 2006.

Surdeanu, Mihai, Julie Tibshirani, Ramesh Nallapati, and Christopher D. Manning (2012). “Multi-instance Multi-label Learning for Relation Extraction”. In: Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning. EMNLP-CoNLL ’12.

Talukdar, Partha Pratim, Thorsten Brants, Mark Liberman, and Fernando Pereira (2006). “A Context Pattern Induction Method for Named Entity Extraction”. In: Proceedings of the Tenth Conference on Computational Natural Language Learning. CoNLL ’06.

Tateisi, Yuka, Yo Shidahara, Yusuke Miyao, and Akiko Aizawa (2014). “Annotation of Computer Science Papers for Semantic Relation Extraction”. In: Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC ’14).

Tatonetti, Nicholas P., Guy Haskin Fernald, and Russ B. Altman (2012). “A novel signal detection algorithm for identifying hidden drug-drug interactions in adverse event reports”. In: Journal of the American Medical Informatics Association 19, pp. 79–85.

Thelen, Michael and Ellen Riloff (2002). “A Bootstrapping Method for Learning Semantic Lexicons using Extraction Pattern Contexts”. In: Empirical Methods in Natural Language Processing (EMNLP), pp. 214–221.

Tibshirani, Julie and Christopher D. Manning (2014). “Robust Logistic Regression using Shift Parameters”. In: Proceedings of the Association for Computational Linguistics.

Toutanova, Kristina and Christopher D. Manning (2003). “Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network”. In: Human Language Technology and North American Association for Computational Linguistics (HLT/NAACL).

Tsai, Chen-Tse, Gourab Kundu, and Dan Roth (2013). “Concept-based analysis of scientific literature”. In: Proceedings of the 22nd ACM International Conference on Information and Knowledge Management. CIKM ’13.

Turian, Joseph, Lev Ratinov, and Yoshua Bengio (2010). “Word Representations: A Simple and General Method for Semi-supervised Learning”. In: Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics. ACL ’10.

Valenzuela, Marco, Vu Ha, and Oren Etzioni (2015). “Identifying Meaningful Citations”. In: AAAI Workshop on Scholarly Big Data.

Valenzuela-Escarcega, Marco A., Gustave Hahn-Powell, Thomas Hicks, and Mihai Surdeanu (2015). “A Domain-independent Rule-based Framework for Event Extraction”. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing: Software Demonstrations (ACL-IJCNLP).

White, Ryen W., Nicholas P. Tatonetti, Nigam H. Shah, Russ B. Altman, and Eric Horvitz (2013). “Web-scale pharmacovigilance: listening to signals from the crowd”. In: Journal of the American Medical Informatics Association 20, pp. 404–408.

Whitney, Max and Anoop Sarkar (2012). “Bootstrapping via Graph Propagation”. In: Association for Computational Linguistics (ACL).

Wicks, Paul, Timothy E. Vaughan, Michael P. Massagli, and James Heywood (2011). “Accelerated clinical discovery using self-reported patient data collected online and a patient-matching algorithm”. In: Nature Biotechnology 29, pp. 411–414.

Xu, Feiyu, Hans Uszkoreit, and Hong Li (2007). “A seed-driven bottom-up machine learning framework for extracting relations of various complexity”. In: Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics.

Xu, Rong, Kaustubh Supekar, Yang Huang, Amar Das, and Alan Garber (2006). “Combining text classification and Hidden Markov Modeling techniques for categorizing sentences in randomized clinical trial abstracts”. In: AMIA Annual Symposium Proceedings, pp. 824–828.

Xu, Rong, Kaustubh Supekar, Alex Morgan, Amar Das, and Alan Garber (2008). “Unsupervised method for automatic construction of a disease dictionary from a large free text collection”. In: AMIA Annual Symposium Proceedings. Vol. 2008, pp. 820–824.

Xu, Wei, Raphael Hoffmann, Le Zhao, and Ralph Grishman (2013). “Filling Knowledge Base Gaps for Distant Supervision of Relation Extraction”. In: Association for Computational Linguistics (ACL), pp. 665–670.

Yahya, Mohamed, Steven Euijong Whang, Rahul Gupta, and Alon Halevy (2014). “ReNoun: Fact Extraction for Nominal Attributes”. In: Empirical Methods in Natural Language Processing (EMNLP).

Yangarber, Roman, Ralph Grishman, and Pasi Tapanainen (2000). “Automatic Acquisition of Domain Knowledge for Information Extraction”. In: International Conference on Computational Linguistics (COLING), pp. 940–946.

Yangarber, Roman, Winston Lin, and Ralph Grishman (2002). “Unsupervised Learning of Generalized Names”. In: International Conference on Computational Linguistics (COLING).

Yarowsky, David (1995). “Unsupervised word sense disambiguation rivaling supervised methods”. In: Association for Computational Linguistics (ACL).

Yeh, A. (2000). “More accurate tests for the statistical significance of result differences”. In: The International Conference on Computational Linguistics.

Zeng, Qing T. and Tony Tse (2006). “Exploring and developing consumer health vocabularies”. In: Journal of the American Medical Informatics Association 13.

Zhang, Ce, Vidhya Govindaraju, Jackson Borchardt, Tim Foltz, Christopher Re, and Shanan Peters (2013). “GeoDeepDive: Statistical Inference Using Familiar Data-processing Languages”. In: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data. SIGMOD ’13.

Zhang, Qi, Yaqian Zhou, Xuanjing Huang, and Lide Wu (2008). “Graph Mutual Reinforcement Based Bootstrapping”. In: Proceedings of the 4th Asia Information Retrieval Conference on Information Retrieval Technology. AIRS ’08.

Zhu, Jun, Zaiqing Nie, Xiaojiang Liu, Bo Zhang, and Ji-Rong Wen (2009). “StatSnowball: A Statistical Approach to Extracting Entity Relationships”. In: Proceedings of the 18th International Conference on World Wide Web. WWW ’09.