A METHODOLOGY OF ERROR DETECTION:
IMPROVING SPEECH RECOGNITION IN RADIOLOGY
by
Kimberly Dawn Voll
B.A., Simon Fraser University, 2001
a thesis submitted in partial fulfillment
of the requirements for the degree of
Doctor of Philosophy
in the School
of
Computing Science
© Kimberly Dawn Voll 2006
SIMON FRASER UNIVERSITY
Spring 2006
All rights reserved. This work may not be
reproduced in whole or in part, by photocopy
or other means, without the permission of the author.
APPROVAL
Name: Kimberly Dawn Voll
Degree: Doctor of Philosophy
Title of thesis: A Methodology of Error Detection: Improving Speech Recognition in Radiology
Examining Committee: Dr. Bob Hadley, Chair
Dr. Veronica Dahl, Senior Supervisor
Professor, School of Computing Science, SFU
Dr. Stella Atkins, Supervisor
Professor, School of Computing Science, SFU
Dr. Fred Popowich, Supervisor
Professor, School of Computing Science, SFU
Dr. Bruce Forster, Supervisor
Associate Professor, Department of Radiology, UBC
Dr. Maite Taboada, SFU Examiner
Assistant Professor, Department of Linguistics, SFU
Dr. Janice Glasgow, External Examiner
Professor, School of Computing, Queen’s University
Date Approved:
Abstract
Automated speech recognition (ASR) in radiology report dictation demands highly accurate
and robust recognition software. Despite vendor claims, current implementations are suboptimal, leading to poor accuracy and to time and money wasted on proofreading. Thus,
other methods must be considered for increasing the reliability and performance of ASR
before it is a viable alternative to human transcription. One such method is post-ASR error
detection, used to recover from the inaccuracy of speech recognition. This thesis proposes
that detecting and highlighting errors, or areas of low confidence, in a machine-transcribed
report allows the radiologist to proofread more efficiently. This, in turn, restores the benefits
of ASR in radiology, including efficient report handling and resource utilization.
To this end, an objective classification of error-detection methods for ASR is established.
Under this classification, a new theory of error detection in ASR is derived from the hybrid
application of multiple error-detection heuristics. This theory is contingent upon the type of
recognition errors and the complementary coverage of the heuristics. Inspired by these prin-
ciples, a hybrid error-detection application is developed as proof of concept. The algorithm
relies on four separate artificial-intelligence heuristics that together cover semantic, syntactic, and structural error types, developed with the help of 2700 anonymised reports obtained
from a local radiology clinic. Two heuristics involve statistical modeling: pointwise mutual
information and co-occurrence analysis. The remaining two are non-statistical techniques: a
property-based, constraint-handling-rules grammar, and a conceptual distance metric rely-
ing on the ontological knowledge in the Unified Medical Language System. When the hybrid
algorithm is applied to thirty real-world radiology reports, the results are encouraging: up
to a 24% increase in recall and an 8% increase in precision
over the best single technique. In addition, the resulting algorithm is efficient and modular.
Also investigated is the development necessary to turn the hybrid algorithm into a real-
world application suitable for clinical deployment. Finally, as part of an investigation of
future directions for this research, the greater context of these contributions is demonstrated,
including two applications of the hybrid method in cognitive science and machine learning.
Keywords
medical informatics, automatic speech recognition, natural language processing, hybrid
error detection, computer-assisted editing, radiology reporting
To Curiosity...
“Not all who wander are lost.”
— J.R.R. Tolkien
Acknowledgments
The road was longer and harder than it promised, but in the end I persevered. For those
who have helped me along the way, know you have a place in my heart forever warmed by
my gratitude.
So here I say thank you to...
• The Sun Hang Do family, for the many years of stress relief, good times, and friendship.
In particular, I would like to thank Grand Master Kang, Mrs. Kang, the Janzen brothers,
the Fisher and Tsui family, as well as Zofia, Tammy, Richard, Kelvin, Anna, Annie
and the entire Coquitlam “gang”.
• The NSERC Postgraduate Awards Program and Simon Fraser University for ensuring
the best possible funding throughout my graduate career.
• Phinished.org, and in particular, Tom and my fellow “phinishers”.
• The computing science office and tech staff: we’d be lost without you guys. An extra
special thank you goes out to Val Galat for her kindness and swift E-mail skills, which
both contributed to the preservation of my sanity.
• Glendon, my Unix/Mac guru, thank you for your endless patience.
• The Spring 2005 COGS 100 class; you guys rocked.
• Ken MacAllister, for your useful comments on early portions of this work.
• The “Logic and Functional Programming Lab” as well as the “Natural Language Lab”
for your constructive comments, encouragement, and endless supply of fascinating
conversation. Thanks, in particular, go to Maryam, Dulce, Baohua, Jiang, Chris, and
Wendy.
• Dr. Diana Cukierman, for finding the time to help with my formalization and deeper
understanding of set theory, despite being up to your eyeballs in your own work.
• The Canada Diagnostic Centre, for welcoming me into your clinic, and sharing with
me your resources.
The following professors deserve special mention for their patience, guidance, and support,
but most importantly their kindness. You have all helped build a more capable and confident
researcher:
• Dr. Bob Hadley and Dr. Bill Havens, for your help in guiding me down the path of
research.
• Dr. Nancy Hedberg, for your unending enthusiasm and help over the years.
• Dr. Maite Taboada, for your gentle, always-helpful advice over the years. I’m hon-
oured to have you as my internal examiner.
• Dr. Janice Glasgow, for flying all the way out here to sit on my examining committee and
for your thoughtful and kind comments about my work.
And in particular, my supervisory committee:
• Dr. Stella Atkins, for encouraging and challenging me from the very moment we met.
I would not be here today if it were not for your medical computing class.
• Dr. Bruce Forster, my radiology expert, for the fascinating conversations, support,
and kindness. Thank you for taking the time to show me the world of radiology.
• Dr. Fred Popowich, for your energy, your incredibly positive attitude, and most
importantly your faith in me.
• Dr. Veronica Dahl, for your mentorship, support, and friendship that saw me through
the many ‘ups’ and ‘downs’ of my graduate career. I never doubted that you cared.
I wish to say thank you to my wonderful support network of friends:
• The “shore-line gang”, starring in alphabetical order: Annavie, Benny, Carl, Chantel,
David, Eileen, Eric, Kyle, Liz, and Patrick. For all the fun over the years.
• Catriona, for the runs, the coffee breaks, and all the wonderful company. I am glad
to call you my friend.
• Aki, for all the great MSN chats, both goofy and serious. I’m so happy that we are
back in touch.
• Alma (“Dr. Clam”), my sweet and fun-loving friend, for all the fun, advice, encour-
agement, and commiseration.
• Mark, my academic kindred spirit, for the long walks and the long talks on just about
anything.
• Rob, for “forcing” me to play all those board games (thanks, dude).
• My dear friend Katie and her fantastic wife, Krista, for being my official “stress-relief
committee”; Katie, thank you for the many years of great friendship.
• Chris, my very dear friend, whose shoulder was always there when I most needed it
(and who frightens me in his keen understanding of my twisted sense of humour), for
just being you.
I thank my family for their patience, encouragement, and endless faith in me:
• Jennifer and David, for the great company and unwavering support.
• Fiona and Rob, for everything (but most importantly the lattes). You guys are the
best.
• Carolynn (and her beautiful family), for all the laughs and for all the love. You may
not be my sister by family, but you are by choice.
• My grandparents, for your love and support over the years.
• Brian, for being such a cool guy, but more importantly, for being not only my brother
but my friend. (P.S. Mom still loves me more.)
In particular, I am blessed with three wonderful parents, whom I wish to thank for the gift of life, love, friendship, good sense, and good times:
• Dad, for always giving me everything, even in the face of adversity. I am proud to be
your daughter.
• Denny, for the love, support, and endless laughs. I am honoured to call you my
stepdad: how many daughters have two amazing dads who equally brighten their life?
• Mom, my angel, my confidant, my strongest supporter, I give you this quote:
“A mother is the truest friend we have, when trials heavy and sudden, fall
upon us; when adversity takes the place of prosperity; when friends who
rejoice with us in our sunshine desert us; when trouble thickens around us,
still will she cling to us, and endeavor by her kind precepts and counsels to
dissipate the clouds of darkness, and cause peace to return to our hearts.”
— Washington Irving.
And finally...
• Ian, the most patient of them all... thank you for everything.
Contents
Approval ii
Abstract iii
Dedication v
Quotation vi
Acknowledgments vii
Contents xi
List of Tables xvi
List of Figures xvii
1 The Thesis 1
1.1 The Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.1 Summary of Contributions . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Main Research Questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.4 Introduction to the Primary Research Problem . . . . . . . . . . . . . . . . . 4
1.4.1 ASR in the Reading Room . . . . . . . . . . . . . . . . . . . . . . . . 4
1.5 Extant Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.6 Beyond Radiology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.7 Hypotheses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.8 Canonical Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2 An Introduction to Medical Language Processing 10
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.1.1 Medical Language Processing . . . . . . . . . . . . . . . . . . . . . . . 11
2.2 General Challenges in MLP . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.3 Medical Language Processing in Radiology . . . . . . . . . . . . . . . . . . . . 14
2.3.1 The Radiology Environment . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3.2 The Radiology Report . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.3.3 Improving Radiology Reporting . . . . . . . . . . . . . . . . . . . . . . 16
2.3.4 Automated Interpretation . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.4 Natural Language Understanding in Medicine . . . . . . . . . . . . . . . . . . 20
2.5 The Needs of the Radiologist . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.5.1 Limitations of an Imperfect System . . . . . . . . . . . . . . . . . . . 21
2.6 Pushing the State of the Art . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.6.1 Overcoming Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3 A Classification of Error-Detection Methods 24
3.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.1.1 The Stages of Error Handling in Speech Recognition . . . . . . . . . . 24
3.1.2 On the Nature of Recognition Errors . . . . . . . . . . . . . . . . . . . 25
3.2 A Brief Introduction to Automatic Speech Recognition . . . . . . . . . . . . . 28
3.2.1 Recognizing Human Speech . . . . . . . . . . . . . . . . . . . . . . . . 29
3.2.2 Natural Language Understanding . . . . . . . . . . . . . . . . . . . . . 31
3.3 Confidence Scoring . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.4 A Classification of Error-Detection Methods for Speech Recognition . . . . . 33
3.4.1 The Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.5 Non-Black-Box Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.5.1 Probabilistic Approaches . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.5.2 Non-Probabilistic Approaches . . . . . . . . . . . . . . . . . . . . . . . 36
3.5.3 Hybrid Approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.6 Black-Box Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.6.1 Probabilistic Approaches . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.6.2 Non-Probabilistic Approaches . . . . . . . . . . . . . . . . . . . . . . . 42
3.6.3 Hybrid Approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.7 A Note on Stop Lists . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4 A Conceptual Model 47
4.1 The General Idea . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.2 Introducing A Hybrid Approach to Error Detection . . . . . . . . . . . . . . . 49
4.3 A Note on the Measure of Correctness . . . . . . . . . . . . . . . . . . . . . . 53
4.4 The Error-Detection Heuristics . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.4.1 Semantic Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.4.2 Syntactic Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.4.3 Word Occurrence Probabilities and “N-gram” Models . . . . . . . . . 59
4.5 A Formalization of the Hybrid Approach to Error Detection in Radiology . . . . . . 62
4.5.1 General Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.5.2 The Error-Detection Algorithm . . . . . . . . . . . . . . . . . . . . . . 69
4.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
5 Experimental Evidence 75
5.1 Introduction to Proposed System . . . . . . . . . . . . . . . . . . . . . . . . . 75
5.1.1 Materials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
5.2 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
5.2.1 Modular Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
5.2.2 Calculating Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
5.2.3 Aligning the Source and Output: Recognition Errors . . . . . . . . . . 78
5.2.4 Calculating Co-Occurrences . . . . . . . . . . . . . . . . . . . . . . . . 79
5.2.5 The Error-Detection Algorithms . . . . . . . . . . . . . . . . . . . . . 80
5.2.6 Conceptual Similarity . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
5.2.7 Semantic Grammar . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
5.2.8 Syntactic Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
5.2.9 Word Occurrence Probabilities . . . . . . . . . . . . . . . . . . . . . . 89
5.2.10 Comparing Co-occurrence Analysis and PMI . . . . . . . . . . . . . . 100
5.3 A Hybrid Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
5.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
6 Observations and Corollaries 103
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
6.2 The Findings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
6.2.1 The Hybrid Error-Detection Methodology . . . . . . . . . . . . . . . . 103
6.2.2 On the Nature of Report Errors . . . . . . . . . . . . . . . . . . . . . 105
6.2.3 General Observations . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
6.3 From a Radiologist’s Perspective . . . . . . . . . . . . . . . . . . . . . . . . . 107
6.4 A Critical Look at the Hybrid Error-Detection Methodology . . . . . . . . . . . . . 109
6.4.1 Challenges Facing the Hybrid Methodology . . . . . . . . . . . . . . . 109
6.4.2 Challenges Facing the Current Implementation . . . . . . . . . . . . . 112
6.5 Corollaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
6.5.1 Immediate Implications . . . . . . . . . . . . . . . . . . . . . . . . . . 114
6.5.2 Implications for Future Study . . . . . . . . . . . . . . . . . . . . . . . 114
6.6 A Standalone Application for the Radiology Workstation . . . . . . . . . . . 115
6.6.1 Steps to an Independent System . . . . . . . . . . . . . . . . . . . . . 115
6.6.2 User Interface for the Hybrid Error-Detection System . . . . . . . . . 117
6.6.3 Miscellaneous Requirements . . . . . . . . . . . . . . . . . . . . . . . . 118
6.7 Measuring the Real-World Success of the System . . . . . . . . . . . . . . . . 118
6.8 Data Sparseness: Smoothing . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
6.9 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
6.9.1 The Full System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
6.9.2 Immediate Extensions: Improving the Current Heuristics . . . . . . . . 122
6.9.3 Miscellaneous Improvements . . . . . . . . . . . . . . . . . . . . . . . 124
6.10 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
7 Beyond Radiology 127
7.1 Error Detection in the Greater Context . . . . . . . . . . . . . . . . . . . . . 127
7.1.1 The Methodology in Other Domains . . . . . . . . . . . . . . . . . . . 127
7.2 Cognitive Science Perspectives on Error Detection . . . . . . . . . . . . . . . 128
7.2.1 Error Detection: Applications in Neuro- and Psycholinguistics . . . . 129
7.2.2 Error Detection and Language Acquisition . . . . . . . . . . . . . . . . 131
7.3 Quality Control in NLP Applications . . . . . . . . . . . . . . . . . . . . . . . 133
7.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
8 Conclusions 138
A Glossary of Medical and Non-Medical Terms 141
A.1 Radiology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
A.2 Computational Linguistics/ Knowledge Representation . . . . . . . . . . . . . 143
A.3 Automated Speech Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . 147
A.4 Miscellaneous . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
B Ontologies in Healthcare 149
B.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
B.1.1 Controlled Medical Vocabulary . . . . . . . . . . . . . . . . . . . . . . 149
B.1.2 Semantic Lexicon . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
B.1.3 Ontology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
B.1.4 The Continuum of Knowledge Representation . . . . . . . . . . . . . . 151
B.1.5 Principles of Good Ontologies . . . . . . . . . . . . . . . . . . . . . . . 153
B.2 Methods of Knowledge Representation . . . . . . . . . . . . . . . . . . . . . . 157
B.2.1 First Order Predicate Calculus (FOPC) . . . . . . . . . . . . . . . . . 158
B.2.2 Semantic Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
B.2.3 Frame-Based Representations . . . . . . . . . . . . . . . . . . . . . . . 159
B.2.4 Description Logic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
B.3 Medical Terminologies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
B.3.1 Existing Vocabularies and Ontologies . . . . . . . . . . . . . . . . . . 162
B.3.2 Issues in Medical Informatics/Ontologies in General . . . . . . . . . . 170
B.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
C All Results 174
Bibliography 178
List of Tables
3.1 An example of the usefulness of co-occurrence relations in determining similarity between documents and queries [96, Page 554] . . . . . . . . . . 40
5.1 Co-occurrence statistics for “quadriceps”. . . . . . . . . . . . . . . . . . . . . 79
5.2 CHR parser results on all error types. . . . . . . . . . . . . . . . . . . . . . . 88
C.1 Co-occurrence analysis with windowsize=3, threshold=0. . . . . . . . . . . . . 174
C.2 Co-occurrence analysis on entire error set, windowsize=collocation . . . . . . 174
C.3 Co-occurrence analysis on non-stop-words only, windowsize=collocation . . . 175
C.4 Co-occurrence analysis on entire error set, windowsize=1 . . . . . . . . . . . 175
C.5 Co-occurrence analysis on non-stop-words only, windowsize=1 . . . . . . . . 175
C.6 Co-occurrence analysis on entire error set, windowsize=10 . . . . . . . . . . . 175
C.7 Co-occurrence analysis on non-stop-words only, windowsize=10 . . . . . . . . 175
C.8 PMI analysis on entire error set, windowsize=collocation . . . . . . . . . . . . 176
C.9 PMI analysis on non-stop-words only, windowsize=collocation . . . . . . . . . 176
C.10 PMI analysis on entire error set, windowsize=1 . . . . . . . . . . . . . . . . . 176
C.11 PMI analysis on non-stop-words only, windowsize=1 . . . . . . . . . . . . . . 176
C.12 PMI analysis on entire error set, windowsize=10 . . . . . . . . . . . . . . . . 177
C.13 PMI analysis on non-stop-words only, windowsize=10 . . . . . . . . . . . . . 177
C.14 Combined heuristics on all errors based upon top f-measure. . . . . . . . . . . 177
C.15 Combined heuristics on all errors based upon top recall score. . . . . . . . . . 177
List of Figures
2.1 Typical radiology workstation. . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.1 The relevant, overlapping error levels in radiology. . . . . . . . . . . . . . . . 26
3.2 The noisy channel model, based on Jurafsky and Martin, Figure 7.1, [81, page 237] . . . . . . . . . . 29
4.1 The abstract hybrid system. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.2 A Venn diagram showing the similarities between ER and AE. . . . . . . . . 73
5.1 CA results based upon report type. . . . . . . . . . . . . . . . . . . . . . . . . 92
5.2 CA recall results for 3 window sizes. . . . . . . . . . . . . . . . . . . . . . . . 93
5.3 CA precision results for 3 window sizes. . . . . . . . . . . . . . . . . . . . . . 94
5.4 CA f-measure results for 3 window sizes. . . . . . . . . . . . . . . . . . . . . . 94
5.5 PMI recall results for 3 window sizes. . . . . . . . . . . . . . . . . . . . . . . . 98
5.6 PMI precision results for 3 window sizes. . . . . . . . . . . . . . . . . . . . . . 99
5.7 PMI f-measure results for 3 window sizes. . . . . . . . . . . . . . . . . . . . . 99
5.8 PMI versus Co-occurrence Analysis (COA). . . . . . . . . . . . . . . . . . . . 100
5.9 Combined heuristics on all errors based upon top f-measure (overall performance) . . . . . . . . . . 101
6.1 The error detection process. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
6.2 Sample output using a grey-scale confidence indication. . . . . . . . . . . . . 117
6.3 The full system as envisioned. . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
B.1 The Knowledge Continuum. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
Chapter 1
The Thesis
1.1 The Thesis
Post-speech-recognition, hybrid error detection is an effective means to recover from low
recognition rates in radiology report dictation. In the following pages I will define precisely
“post-speech-recognition, hybrid error detection” and present the necessary evidence in
defense of this thesis, along with the contributions arising as a direct result of my research.
This includes applications extending beyond radiology to other domains, demonstrating the
wider context of this work.
This chapter provides an introduction and brief overview of the entire dissertation, in-
cluding a summary of the motivations, research questions, and hypotheses, as well as the
resulting contributions.
1.1.1 Summary of Contributions
This dissertation comprises four original contributions to the general problem of error de-
tection in natural-language text:
• A classification of error-detection methods for speech recognition.
• A hybrid error-detection methodology.
• A successful proof of concept applying the hybrid methodology to radiology report
dictation.
• Two theoretical applications of the technology beyond the domain of radiology.
In addition, other possible applications of the hybrid methodology are explored, demonstrating its broad applicability and underscoring the relevance of this contribution.
1.2 Motivation
Medical informatics is the study of information as it pertains to medicine. This notion
includes an impressive array of knowledge tasks, including representation, storage and re-
trieval, communication across information systems, and standards development. These are
applied to a wide range of medical tasks, such as management and billing, electronic pa-
tient records, and automated diagnosis. Equally important is the interface between people
and information. This interface can take many forms, from simple text dictation to more
advanced query engines that rely on complicated knowledge representation.
The field of Medical Language Processing (MLP) sits at the intersection of medical
informatics¹ and natural language processing. The ultimate aim is the seamless integra-
tion of medical information management with a natural language interface. Users are able
to communicate with the technology in their native tongue, as opposed to learning an
artificial language or alternative interface. The goal is minimized training requirements,
improved integration, and easier handling. This translates into increased acceptance of
medical-informatics technology and a greater willingness to adapt on the part of clinicians.
Indeed, user acceptance can be the sole determining factor in the survival of new technology
[97].
One of the ways in which MLP has influenced medicine is the introduction of automated
speech recognizers to medical dictation. The hope is to provide a hands-free means of record-
ing medical information, unchaining the physician from his pen and paper and providing an
electronic means of cataloguing information. This is particularly appropriate in the radi-
ology reading room, where radiologists routinely examine radiological imagery and dictate
their findings into a recording device. These reports are later transcribed and returned to a
radiologist for approval: a process that in its entirety can take days or more. With the onset
of automatic speech recognition (ASR) technology, however, the promise of truly hands-free
dictation and efficient reporting seems not far off. ASR can offer improved patient care
and resource management in the form of reduced report turnaround times (TATs), reduced
¹This is sometimes extended to the wider discipline of bioinformatics.
staffing needs, and the efficient completion and distribution of reports [72, 99]. The ability
to immediately revise a report also means that the radiologist will be freshly familiar with
the case. This translates directly into improved patient care.
In some radiology clinics, such benefits coupled with recent improvements in ASR tech-
nology have motivated the introduction of automated transcription software in lieu of human
stenographers². Yet as the technology comes of age, with vendors claiming accuracy rates
as high as 99%, the potential advantages of ASR over traditional dictation methods are not
being realized, leaving many radiologists frustrated with the technology [97, 52].
1.3 Main Research Questions
In light of the problems of ASR in the radiological setting, the following research questions
are put forth:
• How can the accuracy of speech recognition in radiology be improved?
• What is the current state of post-recognition error detection?
• How can the current state of error detection be improved, and applied to the problem
of radiology report dictation?
• What is the nature of recognition errors?
• What is needed for a general theory of error-detection methods as they relate to speech
recognition?
• How can this knowledge be combined into an error-detection methodology?
• What are the implications for this error-detection theory and methodology beyond
radiology?
The following chapters will present an in-depth look at each of these research questions.
²Also known as transcriptionists.
1.4 Introduction to the Primary Research Problem
The primary reason behind the apparent failure of ASR in radiology is accuracy. A 99%-
accurate speech recognizer still averages one error out of every hundred words, with no
guarantee as to the seriousness of such errors. Furthermore, actual accuracy rates in the
reading room often fall short of 99%. Radiologists are instead forced to maintain their
transcriptionists as correctionists, or to double as copy editors, painstakingly correcting
each case, often for nonsensical or inconspicuous errors. Not only is this frustrating, but it
is a poor use of time and resources. To compound matters, problems integrating with the
radiology suite and the introduction of delays have further soured many radiologists on the
technology. Those choosing to modernize their reading rooms with ASR software are often
plagued with difficulties, while those continuing to use traditional reporting methods have
no incentive to upgrade.
Within medicine the problem of accuracy is particularly insidious as the ramifications
of errors within a report can have serious consequences. According to the U.S. Institute of
Medicine, as many as 98,000 people in the United States die annually from medical errors
[104].
This section examines the integration of ASR into the existing report-dictation process
within the radiology reading room (the setting where images are interpreted by radiologists).
1.4.1 ASR in the Reading Room
Like an assembly line, radiology reporting relies on the order and completion of certain
events to run smoothly:
1. Physician submits exam requisition for a patient.
2. Patient is scanned and a radiograph (image) is generated.
3. Radiologist interprets the radiograph and simultaneously dictates his report into a
recording device in the radiology reading room.
4. The recording is added to the stenographer’s transcription queue.
5. The report is transcribed by the stenographer.
6. Lastly, the transcribed report is returned to a radiologist for final approval and sent
on to the requesting physician.
The time it takes for the above process to complete is referred to as the report turnaround
time (TAT). The ultimate goal of ASR is to improve the TAT as well as the reporting
process by removing steps four and five, while enhancing the radiologist’s experience in step
three. With ASR, instead of waiting in a transcription queue, a dictated report is immediately transcribed, proofread, signed off, and sent to the referring physician.
Although this process is theoretically more efficient, the accuracy of existing ASR technology, combined with poor interface design, results in time wasted on painstaking corrections.
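The contrast between the two workflows can be sketched in a few lines of Python. This is a toy illustration, not part of the thesis software; the step descriptions simply paraphrase the numbered list above:

```python
# Toy model of the radiology reporting pipeline (illustrative only).
TRADITIONAL_WORKFLOW = [
    "physician submits exam requisition",             # step 1
    "patient is scanned; radiograph generated",       # step 2
    "radiologist interprets image and dictates",      # step 3
    "recording queued for stenographer",              # step 4 (removed by ASR)
    "stenographer transcribes report",                # step 5 (removed by ASR)
    "radiologist approves; report sent to physician", # step 6
]

def asr_workflow(steps):
    """ASR replaces the queueing and transcription steps (4 and 5)
    with immediate machine transcription at dictation time."""
    return [s for i, s in enumerate(steps, start=1) if i not in (4, 5)]

print(asr_workflow(TRADITIONAL_WORKFLOW))
```

Removing steps four and five is where the promised reduction in turnaround time comes from; the difficulty, as discussed below, is that correcting recognition errors erodes this gain.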
In contrast, human stenographers are highly trained and familiar with radiological par-
lance. Errors or ambiguous areas of the recording can be actively clarified with the radiolo-
gist. Thus, when it comes time to sign off on the report, the radiologist need only perform
a quick skim to confirm that everything is in order. If the stenographers are replaced by
speech recognizers, though, this revising now must fall on the radiologist himself. Not only
is this frustrating for the radiologist, who wishes to focus on image interpretation, but it is
a poor use of the highly paid radiologist’s time. As an answer, some have suggested hiring
correctionists; however, this effectively re-invents the role of the transcriptionist and negates
most of ASR’s benefits over traditional dictation.
Since it is unlikely that speech recognizers will achieve 100% accuracy any time in the
near future, especially within medicine, the overarching research question is how to make the
technology work in the present. By creating a means for post-recognition analysis, some of
the burden of transcription can be shifted to systems tasked with more in-depth processing
of the information contained in the text. Such processing can allow a level of “damage
control” in the form of error detection and ultimately error correction. Errors made at the
ASR level can be detected in an auxiliary system that sits between the physician and the
dictation system. This can manifest itself in several ways: from a simple detection system
that allows the radiologist to efficiently skim the reports for tagged errors, or areas of low
confidence; to a full error correction system that corrects the text based on an advanced
analysis of the contents. If efficiently designed and seamlessly integrated, the time spent
proofreading is significantly reduced and the benefits of speech recognition over traditional
transcription are regained. In addition, corrected text lends itself to further processing by MLP systems.
An advantage of processing post-dictation over direct integration with ASR is the ability
to detect errors that may have been mistakenly introduced by the radiologist, in addition
to ASR errors. Moreover, what researchers may lose in not integrating directly with ASR is
arguably gained in software independence – a post-ASR system is not bound to a particular
speech recognizer and therefore can be readily modified and updated, and used in any clinic.
In addition, proprietary software restrictions on speech recognizers often make it challenging
for researchers to integrate their technology. As Jeong et al. observe, “If the speech recognizer
can be regarded as a black-box, we can perform robust and flexible domain adaptation
through the post error correction process.” [77].
In summary, the primary research problem is the development of an intelligent error
detection (and ultimately correction) system for radiology reporting that is sensitive to the
domain, and capable of capturing the sorts of errors made by speech recognizers and the
radiologists themselves.
1.5 Extant Work
While not a new problem, post-recognition error detection has never been applied to ASR
in radiology reporting. Previous work has focused on dialogue systems, giving rise to a
variety of error-detection techniques that could be extended to radiology report dictation.
This section offers a brief introduction to the status of error detection and other relevant
research areas. It is expanded upon in Chapter 3.
Most approaches to error detection have been statistical, relying on
N-gram models and pattern matching. By collecting co-occurrence statistics on each word
in the relevant corpus, it is possible to establish a list of context words that have a high
probability of occurring near a particular word. When an error is detected in a string, its
context is matched to the database and the corresponding corrected text is substituted.
Both Kaki et al. [82] and Sarma et al. [131] rely on co-occurrence statistics to determine
the likelihood of a recognition error. By analysing the context of a given target word, it
is possible to determine if the words in that context “match” the target word or another
recognition candidate better. Some researchers experimented with expanding the target
word to include one or more of the surrounding words [1, 121]. This target tuple is then
compared to the context window. Still other researchers have broken the target word up
into component syllables to capitalize on sub-syllabic features [77].
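The co-occurrence idea behind these systems can be illustrated with a minimal sketch (this is a simplification for exposition, not the actual methods of the cited authors; the corpus, window size, and scoring rule are all illustrative assumptions):

```python
from collections import defaultdict

def cooccurrence_counts(sentences, window=2):
    """Count how often each word co-occurs with its neighbours
    within a fixed-size window, over a training corpus."""
    counts = defaultdict(lambda: defaultdict(int))
    for sent in sentences:
        words = sent.split()
        for i, w in enumerate(words):
            lo, hi = max(0, i - window), min(len(words), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[w][words[j]] += 1
    return counts

def context_score(counts, target, context):
    """Fraction of context words previously observed near the target.
    A low score flags the target word as a possible recognition error."""
    if not context:
        return 0.0
    seen = counts.get(target, {})
    return sum(1 for c in context if c in seen) / len(context)
```

On a corpus of report sentences, a correctly recognized word such as “effusion” would score highly against the context “pleural . . . is”, while an intruder such as “sauna” would score near zero, flagging it for review.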
Alternative approaches to error detection involve language modeling, such as the noisy
channel model, where acoustic input is treated as a “noisy” version of the source sentence
and correspondingly “decoded” in an effort to find the “true” underlying utterance. All
possible utterances are considered as a match for the noisy input, and the one with the
highest probability is then selected [81].
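Formally, following the standard presentation in Jurafsky and Martin [81], the decoder seeks the word sequence that is most probable given the acoustic input:

```latex
\hat{W} \;=\; \operatorname*{arg\,max}_{W} P(W \mid A)
       \;=\; \operatorname*{arg\,max}_{W} \frac{P(A \mid W)\,P(W)}{P(A)}
       \;=\; \operatorname*{arg\,max}_{W} P(A \mid W)\,P(W)
```

where $A$ is the observed (“noisy”) acoustic input, $P(W)$ is the language-model prior over candidate utterances, and $P(A \mid W)$ is the acoustic (channel) model; $P(A)$ is constant across candidates and can be dropped.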
From a semantic perspective, Jeong et al. have experimented with lexico-semantic templates based on abstractions of particular word sequences found in the training data [77].
When an error is suspected, queries are matched to templates; templates with the minimum
distance from the query are selected as replacement candidates.
Outside of work explicitly in error detection, conceptual similarity offers promise as a
means for detecting errors of semantic origin. Given an ontology3, it is possible to determine
the distance between any two concepts, either directly via edge distance, or via more complex
measures involving, for example, information content [23, 103, 78, 118].
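A minimal sketch of edge-distance similarity over a toy is-a hierarchy follows; the concepts below are purely illustrative and not drawn from any real medical ontology:

```python
from collections import deque

# Toy is-a hierarchy (hypothetical concepts, for illustration only)
ONTOLOGY = {
    "finding": ["opacity", "effusion"],
    "opacity": ["nodule"],
    "effusion": ["pleural effusion"],
}

def build_adjacency(ontology):
    """Treat each parent-child link as an undirected edge."""
    adj = {}
    for parent, children in ontology.items():
        for child in children:
            adj.setdefault(parent, set()).add(child)
            adj.setdefault(child, set()).add(parent)
    return adj

def edge_distance(ontology, a, b):
    """Shortest-path length (in edges) between two concepts,
    via breadth-first search; None if either is unknown or unreachable."""
    adj = build_adjacency(ontology)
    if a not in adj or b not in adj:
        return None
    queue, seen = deque([(a, 0)]), {a}
    while queue:
        node, dist = queue.popleft()
        if node == b:
            return dist
        for nxt in adj[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None
```

A word whose concept lies far (in edge distance) from the concepts of its surrounding context is then a candidate semantic error.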
In all cases, the performance of these error-detection methods is contingent upon the
type of error made. That is, their coverage of error types varies. For example, concep-
tual similarity may detect errors of semantic origin, but may not detect a sentence that is
ungrammatical.
1.6 Beyond Radiology
In addition to the focus on the problem of ASR accuracy in radiology, the work presented
here will be shown to have application in the greater context of natural language processing
(NLP), and beyond, to problems of assessing language in cognitive science.
The qualitative assessment of text in NLP has been a recent focus in the literature,
particularly as it pertains to machine translation. Systems such as BLEU and ROUGE have
been popular approaches to assessing text quality on the basis of similarity to a reference
document [108, 34].
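The overlap idea underlying such reference-based metrics can be sketched as a single n-gram precision; note this is a toy illustration, whereas the real BLEU combines several n-gram orders with clipped counts and a brevity penalty:

```python
def ngram_precision(candidate, reference, n=1):
    """Fraction of the candidate's n-grams that also occur in the
    reference. A toy version of the overlap scoring behind BLEU."""
    cand = candidate.split()
    ref = reference.split()
    cand_ngrams = [tuple(cand[i:i + n]) for i in range(len(cand) - n + 1)]
    ref_ngrams = {tuple(ref[i:i + n]) for i in range(len(ref) - n + 1)}
    if not cand_ngrams:
        return 0.0
    return sum(1 for g in cand_ngrams if g in ref_ngrams) / len(cand_ngrams)
```

For instance, “the heart is normal” scores 0.75 in unigram precision against the reference “the heart is enlarged”, since three of its four words appear in the reference.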
Furthermore, automated methods for assessing language acquisition and pathology can
help lead the way to technology for rehabilitation, and a greater understanding of the brain
with respect to language processing.
1.7 Hypotheses
It is now possible to offer the following hypotheses, to be addressed in the coming chapters:
3 See Appendix B for an in-depth look at ontologies in healthcare.
• As a post-processing stage, methods in medical language processing can effectively
detect recognition errors in radiology reports dictated via ASR.
• Combining complementary methods of error detection results in improved sensitivity
to report errors.
• Tagging erroneous reports based on the quality of the recognized output can avoid the need for
an in-depth re-read of the report.
• Post-recognition error detection is an effective means to improve ASR in radiology
reporting.
• Post-recognition error detection has applications beyond radiology reporting.
1.8 Canonical Organization
This dissertation is arranged as follows. Chapter 1 has laid the groundwork, including the
introduction and motivation for the research described in the remaining chapters. Chapter
2 presents a general introduction to the field of medical language processing, placing the
intended application and proof of concept in the general context, as well as offering motiva-
tion for error detection and analysis within radiology. Chapter 3 introduces the classification
of error-detection methods within speech recognition, providing an objective framework in
which the conceptualization of the hybrid methodology can be integrated in Chapter 4. In
Chapter 5, this conceptualization is exemplified through proof of concept in the intended
domain, namely radiological report dictation, while the ramifications of this application and
the surrounding research are presented in Chapter 6. Although these contributions grew out
of the research on improving ASR in radiology, it is found that the methods and theories
find broader use beyond MLP. Thus, to demonstrate this greater context, Chapter 7 explores
two major applications of the methodology in cognitive science and natural language
processing in general, and sets the stage for future research. Finally, Chapter 8 summarizes
the research and contributions.
Three appendices are provided for the convenience of the reader. The terminology used
throughout this document follows the definitions provided in Appendix A, when not provided
in the main body of the text. Appendix B is an introduction to ontologies in healthcare and
provides additional information supporting the choice of ontology in Chapter 4. Finally,
Appendix C provides all of the experimental results from Chapter 5.
Chapter 2
An Introduction to Medical
Language Processing
2.1 Introduction
Since the late 1950s, medicine has been attracting researchers in artificial intelligence (AI)
[2]. Initially, medical diagnosis was a primary focus due to its highly structured reasoning
tasks, and met with reasonable success in terms of performance. Clinicians were nonetheless
displeased with the resulting technology; from their perspective, re-entering data that was
already in a patient’s paper chart into a computer seemed redundant
and a waste of their time. Consequently, the focus of AI in medicine began to shift from
the comparatively simple task of diagnosis to the challenge of automated data-acquisition,
natural language processing (NLP), and knowledge representation. Together these comprise
many of the research goals of medical language processing, or MLP1.
From a language-processing perspective, medicine is an ideal research area. Although
the medical domain is expansive, it remains a constrained domain with a large corpus of
literature suitable for MLP research. Furthermore, limited human resources, high report
turnaround times (TATs) of days or more, as well as an ever-increasing need to improve the
cost-benefit ratio have greatly increased the attraction of automated systems that enhance
medical document handling.
1 Readers wishing a more in-depth introduction to natural language processing than is possible here are referred to Jurafsky and Martin [81], and Manning and Schütze [96].
2.1.1 Medical Language Processing
The task of mapping natural language into the requisite medical terminology is no small feat.
While MLP reduces the domain of language to medicine, the nonetheless large vocabulary,
along with the tendency of medical professionals to use incomplete sentences, ensures the
domain is no trivial one: the scarcity of successful, comprehensive MLP systems currently
deployed in real clinical settings is testament to this. As a result, studies in mapping medical
free text into structured, machine-readable documentation have typically focused on one or
more sub-domains of medicine, such as radiology.
Researchers have been working to incorporate automatic speech recognition (ASR) in
radiology reporting for many years. Despite what may seem a natural pairing, a great deal
of research remains before the technology is suitable for wide-scale use. Problems
with integration in the radiology environment, accuracy, and the introduction of delays have
soured many radiologists on the technology. Nevertheless, the alluring potential for signif-
icant gains in efficiency, and greater overall data interactivity is leading many to speculate
that ASR is yet the way of the future.
In addition to ASR, there is currently research on the automated interpretation of radi-
ology reports. Modern methods in radiology reporting leave large amounts of information
effectively “inaccessible” in the form of free-text reports. “Free text” refers to unrestricted,
freely dictated reports, in contrast with “structured” reports, where the radiologist is
confined to a pre-formatted report with restrictions such as word count. While free-text reporting
allows the radiologist more freedom and is generally the more common format for dictation,
it is difficult to search, analyze or even summarize the information contained within the
resulting text report. In an effort to overcome these challenges, and to improve patient care
overall, systems that automatically interpret free-text reports and translate them into a
structured, machine-readable format are being developed, such as the MedLEE system [49].
The benefits of these automated interpretation and summarization systems coupled with
ASR are numerous. In hospitals, these include improved overall efficiency, reducing
report TATs from days to a few hours or less; enhanced patient monitoring2; and improved
data storage. As well, since natural language is the communication medium, MLP systems
are theoretically easier to use and require less training time than other interfaces. Moreover,
2 For example, patient charts and records can be automatically scanned for potential drug interactions that may have been overlooked.
radiographs can be made available throughout the hospital as soon as they are complete,
or even remotely via the Internet. When summarized, the information in a report becomes
accessible, meaning that clinicians can not only access past cases with greater efficiency, but
also that these reports can now be analyzed by computer. The result is radiological data that
is useful not only to clinicians, but to researchers, statisticians, and decision support teams
as well, and a more efficient and cost-effective environment.
Some examples of MLP technology include:
• Intelligent searching (not only medical records and patient reports, but the Internet
as well);
• Decision support;
• Diagnosis;
• Automated structuring of free-text reports (natural language understanding);
• Speech recognition of dictated reports.
These technologies are applicable throughout the medical world, including a wide variety
of hospital departments.
2.2 General Challenges in MLP
In the past, medicine has been criticized as the only major industry still relying on hand-
written documentation [2, page 70]. Although progress is being made, the transition from
hand-written to computerized documentation is challenged by the need for a standardized
terminology, and a means for mapping natural language into this terminology. Several
projects are currently underway for the development of extensive terminologies and ontolo-
gies for just this purpose. Examples include the UMLS, SNOMED, and GALEN lexicons,
which are discussed in more detail in Appendix B.
Perhaps the greatest challenge in language processing is ambiguity in the input. Am-
biguity arises when there exists more than one interpretation for a given statement, due
to the structure, syntax, or semantics of the expression. For instance, in the sentence “the
cat saw the dog on the mat”, there is more than one answer to the question “who is on
the mat?”. Other sources of ambiguity are lexical units that have more than one semantic
interpretation; for instance, “scarf” can refer to the knitted item worn around one’s neck,
or, it can refer to the verb, as in to “scarf” one’s food. In the case of speech recognition,
homophones (words or phrases that sound the same but are semantically distinct) also be-
come an issue, such as “aisle” and “I’ll”. Here the ASR system is forced to guess which
lexeme is appropriate based on the context. If the choice is between a verb and a noun, this
can be a relatively easy problem to solve based on grammaticality; however, if all of the
variations are the same part of speech, the problem is more challenging.
Qualification also affects MLP. Consider the phrases “possible cardiomegaly”, “heart
may be enlarged”, or “heart is probably enlarged”; in each instance a slightly different
qualification of the condition of the heart is provided, yet the meanings are extremely similar.
As Rector observes, “[h]uman users may be able to recognize that these are essentially the
same, but the rules for doing so must be made explicit to be usable by the computer” [114,
page 245]. The challenge is recognizing when it is necessary to capture differing qualifications
as distinct, and how best to represent this information when relevant.
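One way to make such rules explicit, as Rector suggests, is to map hedging cues onto an ordinal certainty scale. The cues and numeric scores below are purely illustrative assumptions, not an established scale:

```python
# Hypothetical qualifier-to-certainty mapping; first match wins.
QUALIFIERS = [
    ("possible", 0.3),
    ("may be", 0.4),
    ("probably", 0.7),
]

def certainty(phrase, default=0.9):
    """Return an illustrative certainty score for a finding phrase,
    based on the first hedging cue found; unqualified statements
    default to high certainty."""
    low = phrase.lower()
    for cue, score in QUALIFIERS:
        if cue in low:
            return score
    return default
```

Under this toy scale, “possible cardiomegaly”, “heart may be enlarged”, and “heart is probably enlarged” collapse to the same condition at three nearby certainty levels, making their near-equivalence explicit to the machine.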
Negation is a similar problem. While testing for the presence of a negating word such
as “no” may seem relatively straightforward, the challenge lies in the multitude of ways in
which negation can be expressed, and in determining the scope of negation. For example,
consider the differences between “pneumonia is not present” and “no pneumonia”; both
could be erroneously classified as indicative of pneumonia without the ability to accurately
detect negation.
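A crude sketch of trigger-based negation detection follows, in the spirit of systems such as NegEx but without scope termination or pseudo-negation handling; the trigger list and window size are illustrative assumptions:

```python
# Single-token negation triggers (illustrative; real systems use
# larger phrase lists and handle scope boundaries).
NEGATION_TRIGGERS = ("no", "not", "without", "denies")

def is_negated(sentence, finding, scope=4):
    """Return True if a negation trigger occurs within `scope`
    tokens of the finding, on either side. Crude: would also fire
    on constructions such as 'not only ...'."""
    tokens = sentence.lower().split()
    if finding not in tokens:
        return False
    idx = tokens.index(finding)
    nearby = tokens[max(0, idx - scope):idx] + tokens[idx + 1:idx + 1 + scope]
    return any(t in NEGATION_TRIGGERS for t in nearby)
```

Both “no pneumonia” and “pneumonia is not present” are flagged as negated, while “pneumonia is present” is not, capturing the two surface forms discussed above.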
Parsing coordination also causes difficulties within MLP. Part of this difficulty may be
blamed on the flexibility of English, and many other languages, in coordinating structures.
For instance, in English, any two constituents (e.g. noun phrase, verb phrase, et cetera),
even of differing kind, can be joined in a coordinating structure [155]. The sentence “John
was rich and a doctor” sees the conjunction of the adjectival phrase “rich” with the noun
phrase “a doctor”. Furthermore, it is often the case that information is missing from one of
the conjuncts. For example, the sentence “evidence of opacities and bullae in the right eye”
can be interpreted as there being evidence of [opacities in the right eye] and [bullae in the
right eye]; or, alternatively, as evidence of [opacities] (the location of which is unknown) and
[bullae in the right eye]. While possibly clear to a clinician, for a computer these ambiguities
must be explicitly resolved.
Figure 2.1: Typical radiology workstation, comprising an image display, a report (working) window, a keyboard, and a headset.
2.3 Medical Language Processing in Radiology
2.3.1 The Radiology Environment
As outlined in the first chapter (Section 1.4.1), in most radiology departments once an
examination is complete and the images are ready, a report is dictated and recorded by
the radiologist (not necessarily immediately following the examination). This recording is
then sent to the transcription department where it is added to the queue of reports to be
transcribed. A transcriptionist types the report, checks it for errors and sends it back to
the radiology department for verification and signing3. It is interesting to note that in
some high-volume facilities, a stenographer is present in the reading room for immediate
transcription of the dictated report. Most facilities, however, do not have the volume to
justify the luxury of a dedicated transcriptionist [83].
Figure 2.1 shows a typical radiology workstation layout.
3The signing radiologist is not necessarily the same radiologist who prepared the report.
2.3.2 The Radiology Report
Although variations may be seen from one radiology clinic to the next, the following shows
a typical layout of a radiology report:
• PATIENT AND HOSPITAL INFORMATION (Demographics)
– Name; hospital or clinic identification number; et cetera.
– Referring physician; Radiologist dictating.
– Date of exam; Date of report.
• “MRI OF THE LUMBAR SPINE”
– Title sentence indicating scan type and anatomical region of study.
• HISTORY
– Patient history such as onset of condition, family history, et cetera.
• TECHNIQUE
– Description of scanning technique, including any special procedures. When using ASR this is often “canned”, that is, stored as a pre-defined block of text that is selected at the time of dictation.
• FINDINGS
– The radiologist’s report on his findings on examination of the radiograph.
• IMPRESSIONS
– The radiologist’s conclusions based upon the findings he has reported. This is often dictated in bulleted format, and repeats any significant observations made in the FINDINGS section.
• SIGNATURE
– The signing radiologist’s approval, following dictation and transcription, that the report has been verified as correct.
2.3.3 Improving Radiology Reporting
Although in place for many years, the radiology-reporting system has much room for im-
provement. The introduction of PACS (Picture Archiving and Communication Systems) [8]
and improved RIS (Radiology Information Systems) has been a step in the right direction.
Similarly, adding MLP technology to the mix could help revolutionize radiology reporting.
The average report turnaround time, or TAT, is the time it takes the referring physi-
cian to receive the completed report. A critical factor in measuring the productivity and
workflow of the radiology clinic, TATs often run to days or more [99]. This is attributable,
in part, to the many steps in the reporting process – particularly problematic is
the wait for reports to be transcribed and signed off. Furthermore, most radiology reports
are free text, making automated interpretation or analysis difficult. As a partial solution
to the problems inherent in the current reporting system, some hospitals and clinics have
begun adopting ASR systems to augment their current dictation methods. The hope is to
ultimately eliminate the role of transcriptionist, improving the TAT and overall efficiency
[72].
Accuracy
Current ASR technology, however, is proving insufficient for widespread use in the radiology
department. Although vendors may claim accuracy rates as high as 99%, this still translates
into one error out of every 100 words. Thus, the possibility of having an error-free report is
almost non-existent4. Dr. Forster, who works at a radiology clinic using ASR, suggests that
as few as 10% of reports are error-free [43]. Where formerly a trained transcriptionist
would make the necessary corrections, with ASR the role of transcriptionist is removed and
the radiologist must make any corrections himself. Consider this: a radiologist reading 60
exams, who requires 90 seconds per report to proofread, would need to increase his day by
approximately 1.5 hours in order to turn over the same number of reports [52]. Not only
are these corrections essential given the ramifications of errors in healthcare, but they are
costly due to the high salary of radiologists and the time necessary to complete them.
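The arithmetic behind this estimate is straightforward:

```python
# Back-of-the-envelope estimate of the added daily proofreading burden,
# using the figures cited above (60 exams, 90 seconds per report).
reports_per_day = 60
proofread_seconds_per_report = 90

extra_seconds = reports_per_day * proofread_seconds_per_report  # 5400 s
extra_hours = extra_seconds / 3600                              # 1.5 h
print(f"Added proofreading time: {extra_hours:.1f} hours per day")
```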
Additionally, the 99-percent figure cited by vendors reflects near-perfect dictation in a
near-perfect environment – not a likely situation in the often frenetic environment of the
4With the exception of “canned”, brief reports such as a normal chest X-ray.
hospital. This problem is further compounded by the challenge of detecting recognition er-
rors (words incorrectly recorded by the speech recognizer). Frequently they are not replaced
by nonsense words or gibberish, but instead by the next best match in the terminology of the
ASR system (missing words are also very frequent, especially small words such as prepositions).
As a result, the errors are often inconspicuous and easily overlooked. This problem
is looked at in more detail in Chapter 3 and again in Chapter 6. Moreover, while actual
medical errors are less common, the presence of nonsensical words is a detriment to the
credibility of the report (for example, misrecognitions such as “sauna” for “centimetre” and
random word insertion errors, such as “jungle”, that are clearly unrelated to the text). Due
to the unpredictable nature of ASR, there is also a frequent need to monitor the dictation
screen in addition to the image, adding visual strain. This also complicates the dictation
task as the radiologist is forced to keep what he is going to say in his mind, while ensuring
that what he has already said is being accurately recorded.
Some researchers have suggested that hiring correctionists will result in lower costs and
more efficient use of a radiologist’s time. The difficulty in detecting these errors, though,
requires that the correctionists be highly trained, which in turn increases their salary. Con-
sequently, the cost-time benefit of ASR versus transcriptionists is lost.
In the case of incorrect words that are detected at the time of dictation, most ASR
systems allow the user to retrain the system on those particular words. Since this is a time-
consuming task, radiologists frequently opt to simply type in the replacement word directly
[43]. As a result, the machine-learning capabilities present in the system are never able
to improve the accuracy of the system and are consequently of little value in this setting.
Therefore, a system that is accurate “right out of the box” will be more valuable.
Users with accents, such as non-native speakers, may also experience a higher error rate [69].
Similarly, a cold or other condition affecting the quality of a user’s voice on any given day
may have a detrimental effect on the accuracy of ASR.
Other Issues Affecting ASR
Unfortunately, in addition to these problems, the adoption of ASR technology into the
radiology department is not necessarily smooth. The attitude of the radiologists can
have a profound impact on the success of new technology [97]. This is exacerbated by poor or
incomplete training, often due to the unavailability of radiologists during vendors’ limited
training periods. This is worsened still by an acclimatization period where productivity
drops as users adjust to a new system and its idiosyncrasies [99]. It is difficult to encourage
users to adapt to new technology when it does not immediately benefit them. Thus, support
from senior management is crucial along with removing alternate dictation systems that may
hinder a radiologist’s ability to adapt [99].
When upgrading to ASR (or upgrading the in-place ASR), there is a risk of software
compatibility issues [72]. It is a difficult task to test new software alongside existing appli-
cations; most vendors do not have the means to set up test environments that accurately
reflect the clinical setting of their clients. Furthermore, much of the software that is at risk
for conflict is often licensed and unavailable to the vendor for compatibility testing before
the ASR program reaches the client [72].
Difficulties in the integration with existing hospital information systems and PACS also
complicate matters. This can reduce efficiency and introduce further errors in the dictation
process. If it is necessary to load the reports separately into the dictation software and
then into PACS, for instance, this can add approximately twenty seconds per report [69]. In
addition, swapping between menus in both systems adds time and the potential for errors
(confusing patient identification numbers, for example), and disturbs the workflow of the
radiologist. Thus, the integration of these systems is vital.
Despite these challenges, many feel that the introduction of ASR into the radiology
suite remains a worthwhile endeavour, and one that some radiologists are now referring to
as inevitable.
2.3.4 Automated Interpretation
In recent years, researchers in MLP have started to tackle the problem of automatically
interpreting and structuring radiology free-text reports so that they are more accessible to
computer analysis and querying5. This section examines some of the relevant issues.
Report Summarization
Once a report has been dictated, the next step in post-processing is summarization. This can
be broken down into several tasks, including tokenization, stemming, part-of-speech tagging,
and parsing (further broken down into syntactic, semantic, and discourse analysis). The first
three tasks are relatively straightforward and can be handled with existing algorithms. The
5 This includes medical free-text reports in general.
parsing stage, however, is more complex. The system must maximally capture information
with minimal errors to ensure that the output is of the highest quality and utility. A
system that introduces errors, or glosses over important information will quickly render
itself useless. The challenge is then to determine what information is of value and what
can be safely glossed over; there is a tradeoff between the granularity of the information
retained and the efficiency of the system.
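The first of these “straightforward” tasks can be sketched in a few lines of Python; the suffix-stripping below is a deliberately naive stand-in for a real stemmer such as Porter’s algorithm:

```python
import re

def tokenize(text):
    """Lowercase the text and split on runs of non-alphanumeric
    characters (a minimal tokenizer)."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]

def naive_stem(token):
    """Strip a few common English suffixes. A toy illustration;
    a production system would use a real stemming algorithm."""
    for suffix in ("ies", "ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[:-len(suffix)]
    return token
```

Part-of-speech tagging and parsing, by contrast, require trained models and grammar formalisms well beyond a sketch of this size, which is why the parsing stage dominates the difficulty of summarization.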
Output Formats
As the rapid growth of the Internet demonstrated, the adoption of information
standards such as HTML can help promote a seamless integration across information sys-
tems. The medical field is no exception; similar benefits can be achieved if an information
standard is established for medical information systems and related software, including au-
tomated summarization. To this end, researchers have begun looking at markup languages
based on the Standard Generalized Markup Language (SGML) that will not only standard-
ize medical documents, but also allow them to be readily accessed via the Internet. SGML
itself is overly extensive and thus too complex for many operations; however, a relatively
new markup language based on SGML, Extensible Markup Language, or XML, captures
the power and expressibility of SGML in a simpler, more flexible format [158].
In brief, XML is a markup language that is readable by both computers and humans.
A markup language encases information between two labels, or tags, that help distinguish
the text from instructions for displaying that text or information about the text itself (for
example, highlighting key phrases in a textbook could be considered an example of “marking
up” a text) [56]. XML accomplishes the task of marking up text through tags that best
describe the contents in a human- and computer-readable fashion. Unlike HTML, these tags
do not contain information regarding formatting or display of the text, they simply store
the data in a machine-readable format6.
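As an illustration, a fragment of a report might be marked up as follows; the tag names here are hypothetical and chosen purely for exposition, whereas real systems follow established clinical specifications rather than ad hoc schemas:

```python
import xml.etree.ElementTree as ET

# Build a tiny, hypothetical XML representation of a report fragment.
report = ET.Element("report")
ET.SubElement(report, "title").text = "MRI OF THE LUMBAR SPINE"
findings = ET.SubElement(report, "findings")
obs = ET.SubElement(findings, "observation", negated="false")
obs.text = "disc protrusion at L4-L5"

xml_text = ET.tostring(report, encoding="unicode")
print(xml_text)
```

Note that the tags describe the content (a title, a finding, its negation status) rather than its display, which is precisely what makes the report machine-readable and searchable.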
Major standardization efforts such as those from the HL-7 (Health Level Seven) initia-
tive now employ XML encoding in their clinical specifications. HL-7 is the most common
standard for interfacing clinical data, which “enables disparate healthcare applications to
6 The XML file can then be combined with a language such as HTML or Cascading Style Sheets to display on a webpage, for instance [56].
exchange key sets of clinical and administrative data”7. This includes the well-known clin-
ical context management specification, CCOW (Clinical Context Object Workgroup), that
“enables multiple applications to be automatically coordinated and synchronized in clin-
ically meaningful ways at the point-of-use”8. For example, when a clinician opens up a
patient file within one application, the same patient is simultaneously accessed in all other
applications in the same environment.
Other large-scale standardization efforts include the industry-standard DICOM (Digital
Imaging and Communications in Medicine), which standardizes the communication of medical
images and information [73]. It “enables digital communication between diagnostic and
therapeutic equipment and systems from various manufacturers”9.
2.4 Natural Language Understanding in Medicine
The ability to recognize word dependencies and interrelations is crucial for a system to suc-
cessfully summarize a medical text. Without such “understanding” of the text, words exist
only as independent entities. This limits systems to little more than keyword search and
structural analysis, missing the subtleties present in real language. In medicine, adding NLU
capabilities to an MLP system allows the transition from a passive system that summarizes
data to a system that can actively interact with the data and clinician to give feedback,
and monitor issues such as drug compatibility. This gives rise to a wide array of more com-
plicated and useful applications, including automated clinical decision support and patient
monitoring, intelligent transcription, automated interpretation and structuring of reports,
and intelligent patient records.
Representing Knowledge in Medicine
One of the crucial challenges in MLP is the development of a standardized means for repre-
senting the salient information found in medical reports. This “salient information” is the
relevant information content of the document and is the information for which the computer
must have some representation for summarization tasks and more advanced tasks, such as
7Health Level Seven Homepage: www.hl7.org. Updated regularly; Accessed: February 2006.
8Again the reader is referred to the Health Level Seven Homepage for more information: www.hl7.org.
9The Radiological Society of North America Homepage: www.rsna.org. Updated regularly; Accessed: February 2006.
reasoning. In short, the appropriate formalism must “[be] sufficiently expressive to cap-
ture the information required, computationally tractable for practical cases, and [behave]
predictably in the domain” [114, page 264].
Therefore, in addition to the analysis of a sentence’s structure and meaning, an MLP
system must have a means for representing the information that is contained within [80].
In language processing, the meaning of a particular word is encoded using symbols. This
representation is known as a “type”. Johnson gives the example of the verb treat, which
might be encoded as the type THERAPEUTIC-ACTIVITY10. These types are then further
specified according to hierarchies that identify the relationships that exist between them.
This systematic arrangement is often referred to as an ontology or taxonomy [80], a “set of
definitions, which associate a term (the name of a defined entity) with axioms that constrain
its use and relate it to other terms” [Falasconi, 1994, page 81]. By employing taxonomic
encoding techniques, it is then possible to manage such complex representations using “inex-
pensive” set operations [165, 39]. A closer examination of ontologies in healthcare, including
the challenges present, is provided in Appendix B.
2.5 The Needs of the Radiologist
As with any new technology that is to be incorporated into an existing infrastructure, if
the integration is to be successful it must take into account the users of the technology. All
too often software engineers work hard at developing systems for a particular field without
actually interviewing those who will be using them. Consequently, when the systems are
introduced into the field, they are met with an unwillingness to adapt on the part of the
users and are quickly discarded. Instead, such technology needs to be designed alongside the
user to ensure a good fit. Technology created for the radiology workstation is no exception.
2.5.1 Limitations of an Imperfect System
As mentioned above, a system with a 99% accuracy rate is not as useful as it may seem.
Consequently, it is important to recognize the limited utility of imperfect systems, and the
need for developing ASR systems of even higher accuracy and/or compensatory software.
Although systems that do not meet the accuracy requirements should not be used in sensitive
10In the restricted domain of medicine, the noun treat would likely not have a representation.
areas (i.e. areas where errors could have serious consequences), they may still be useful in
some instances; when the nature of the accuracy problems is known, it is often possible
for the system to be applied to certain tasks confidently. For instance, a system that does
not give false negatives may be useful in searching tasks where a manual review is required
to remove the extraneous false positives [145].
Integrating with Existing Hospital Systems
By fully integrating ASR and automated summarization into the radiology workstation,
further delays in the system, as well as errors and operator fatigue, can be reduced. As pre-
viously mentioned, speech systems that are not linked directly with the existing information
systems, such as PACS, can introduce delays in the range of 20 seconds per report while
the radiologist scans the current report and then manually loads it into PACS [69]. This is
increased by an additional 20 seconds at the end of dictation while the radiologist navigates
the PACS menus to select a new case. Recall the radiologist from Section 2.3.3: he now
faces an increase of over two hours to his day. Moreover, it is possible to introduce serious errors
when a report is scanned into the ASR-based system but the incorrect report is called up
in PACS [69].
Initial studies by Hayt and his colleagues have suggested a time gain of nearly 40 seconds
by linking PACS to the ASR system. In this particular instance, when a case is opened in
PACS the corresponding ASR file is opened automatically. When the report is complete
the case can be signed off verbally and a new case is opened in PACS without the unwanted
navigation of menus. As Dr. Forster aptly states, “it is not true speech recognition until we
can put down the mouse” [44].
An ideal integration, as suggested by Dr. Eliot Siegel [132], would allow reports to be
opened based on dictated commands alone, such as “bring up the previous chest CT”, while
increased security could require the use of voice verification as well as a password. As the
sophistication increases, information from prior studies could be imported from PACS into
the present report. Systems involving computer-aided diagnosis (CAD) could also be added
[160], providing a reference tool for the examiner, and helping to ensure that nothing is
overlooked.
2.6 Pushing the State of the Art
2.6.1 Overcoming Challenges
There are many challenges facing researchers in the area of automated interpretation. Cur-
rently, there is no metric for the comparison of existing systems and their performance,
making objective analysis difficult. In addition, systems face the challenge of limited do-
main knowledge; a system that is too broad is over-general and suffers a loss of accuracy
[45], while a system that is insufficiently general may not provide enough coverage for the
domain at hand. Furthermore, there is a clear need for standards in the representation of
medical data, including output formats and the report itself.
Most crucially, if a system is to be deployed in a medical setting where it is responsible
for handling sensitive data, it must have extremely high accuracy. This includes a robust
means for handling ambiguity, negation and errors. If a report is returned to a requesting
physician mistakenly identifying a disease or lack thereof, the consequences could be fatal.
The system must also have a strong integration with the existing hospital information system
and PACS (and potentially any ASR system in place).
By building a successful foundation now, it will be possible to fully integrate systems
hospital-wide, from radiology to paediatrics, while making information available across the
country and beyond via the Internet. Accurate statistics on past cases could then easily be
collected and used for research, patient care and decision support.
2.7 Summary
The lure of time and cost efficiency, and improved patient care, is ensuring that healthcare-
related applications in artificial intelligence will continue to grow. Within radiology, this
includes the eventual replacement of transcriptionists with ASR systems, and the addition
of automated interpretation systems in the radiology department. Unfortunately, the low
accuracy rates, among other challenges, are preventing the wide-scale deployment of ASR in
lieu of traditional dictation. In the remaining chapters, a closer examination of ASR and the
nature of recognition errors is provided, followed by a solution to the problem of accuracy in
ASR, namely a hybrid error-detection methodology. This will be corroborated with a proof
of concept in radiology reporting, as well as a demonstration of the greater context of this
work beyond medicine.
Chapter 3

A Classification of Error-Detection Methods
Although accuracy is one of the limiting factors in the widespread introduction of auto-
matic speech recognition (ASR) in radiology, there is little if any work specifically on error
detection in this domain. Nonetheless, work in other contexts, such as spoken dialogue
systems [91], is useful for creating a methodology of error detection that is applicable to the
overriding problem of ASR in radiology dictation.
I develop an original classification for error-detection methods in ASR. Since one does
not presently exist in the literature, this sets the groundwork for future endeavours to be
objectively measured. This chapter presents this classification, and provides examples from
the literature where they exist. First, though, an introduction to speech recognition is
presented to help familiarize the reader with the relevant concepts and terminology.
3.1 Background
3.1.1 The Stages of Error Handling in Speech Recognition
The handling of recognition errors can be broken down into techniques applicable at various
levels throughout the recognition process [150]:
Error Prevention Preventing a recognition error altogether.
Error Prediction Detecting the likelihood of errors based on weaknesses in the system.
Error Detection Identifying recognition errors that have occurred.
Error Recovery This can be broken down into the following stages:
Diagnosis of Cause Identifying the sources of the error to guide error correction.
Error Correction Choosing and implementing the error correction strategy and in-
forming the user of changes made.
Error Handling Feedback Where relevant, the performance at the error detection and/or
correction level is collected for future applications (for example, machine-learning
methods).
3.1.2 On the Nature of Recognition Errors
As outlined in Kukich [87], there are five levels of text-based errors:
1. Lexical/Structural
2. Syntactic
3. Semantic
4. Discourse
5. Pragmatic
It is not possible for the speech recognizer to introduce errors at the discourse or prag-
matic level since no recognizer-level processing occurs at these levels1. Furthermore, since all
words are produced from a pre-defined lexicon, lexical errors are also not possible. Depend-
ing on the domain, however, errors pertaining to the misrecognition of specially formatted
lexical items may arise. This is frequently seen in the interpretation of radiology reports
where complex lexical items such as “L4/5”, representing the fourth and fifth lumbar ver-
tebrae, are misinterpreted as “L for/five”, for example, or the lexical representation of the
numbers is erroneously substituted for the orthographic representations (i.e. “four” versus
“4”). While these remain correct lexical elements, such errors nonetheless seem to sit below
1Exceptions to this are errors that follow as a side effect of errors at the syntactic or semantic level.
[Figure 3.1: The relevant, overlapping error levels in radiology (structural, syntactic, and semantic errors shown as overlapping regions).]
the level of syntactic and semantic errors. To refer to these instances, I have used the term
“structural”, which represents such errors as a subset of lexical errors.
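A minimal sketch of how such structural errors might be flagged is given below. The pattern and the example sentence are invented for illustration; a real detector would cover the full range of formatted lexical items used in a given reporting domain.

```python
import re

# Hypothetical sketch: flag candidate structural errors in ASR output by
# spotting spelled-out fragments where a formatted lexical item such as
# "L4/5" (the fourth and fifth lumbar vertebrae) was expected.
# The pattern below is illustrative only.
SPELLED_OUT = re.compile(r"\bL\s+(?:for|four)\s*/\s*(?:five|5)\b", re.IGNORECASE)

def flag_structural_errors(text):
    """Return a list of substrings that look like mangled 'L4/5' items."""
    return [m.group(0) for m in SPELLED_OUT.finditer(text)]

print(flag_structural_errors("Disc narrowing at L for/five is noted."))
# → ['L for/five']
```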
The structural, syntactic and semantic error levels overlap in instances where a recog-
nition error is recognizable as an error across more than one level. For example, the mis-
recognition “See four/5” is both a structural error and a semantic error (and potentially a
syntactic error, depending on the surrounding sentence). Figure 3.1 shows the overlapping
error coverage of these levels.
Considering the specific needs of radiology reporting, a further evaluation is offered,
applicable to all error levels and reflecting the inherent strength of the error. “Weak”
errors result in little or no change in the overall semantics and thus no shift in the report
interpretation. For example, the omission of a determiner rarely causes enough semantic
damage to be misinterpreted by the clinician. “Strong” errors, however, cause a major shift
in the semantics. Such errors may be readily identifiable as outliers within the domain,
for example the word “elephant” appearing in a radiology text; or may be inconspicuous
and hard to detect, for example, the substitution of one medical term for another that may
still be valid in that context. Kanal et al distinguish such errors with respect to radiology
reports according to the following four levels [83]:
Class 0 No change in meaning with respect to the original report.
Class 1 No change in meaning, but text is grammatically incorrect.
Class 2 Change in meaning, but error obvious.
Class 3 Change in meaning, but error subtle.
The authors group error classes 2 and 3 together as “significant” errors, with class 3 errors
further considered “subtle significant” [83]. As with all ultimately subjective measures, however,
there is a risk of inconsistency, and caution should be exercised when relying on these sorts of
descriptors. Differences between institutions such as the reading-room environment, user
variability, report quality, and existing infrastructure can all affect report quality and the
nature of the errors found in dictated reports. Consequently, a rigorous definition and
accounting of errors is difficult. For consistency, throughout this document any discrepancy
from the correct or reference report will be treated as a recognition error.
In general, there are six recognition error types that can cause errors at the structural,
syntactic or semantic level:
Stop Word Errors Any error involving a stop word (i.e. words with low semantic load,
such as prepositions, determiners, et cetera). In general, stop words can result in
errors at the syntactic or semantic level.
Merge Errors Two or more words erroneously recognized as a single word [121]. E.g.
“wreck a nice” → “recognize”.
Split Errors A single word erroneously recognized as two or more words [121]. E.g. “rec-
ognize” → “wreck a nice”.
Substitution Errors The replacement of one word by another [54].
Insertion Errors The insertion of a word that is not part of the original utterance [54].
Deletion Errors A word in the original utterance that does not appear in the final ASR
output.
Deletion errors are difficult to detect as they typically leave little record of their absence.
Similarly, the detection of stop word errors is also difficult due to their prevalence in the
language and the small semantic role they play. As a result, many error-detection systems
focus on the remaining four error types. By incorporating a range of error-detection methods,
however, it is possible to draw the complementary strengths of each, such as deletion
detection, into a single error-detection system, as will be shown in this dissertation. A more
detailed discussion of this and the role of stop words in error detection follows in Chapter
4.
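The substitution, insertion, and deletion error types above are conventionally identified by aligning the ASR output against a reference transcript using a word-level edit distance (the same alignment that underlies the word-error rate discussed later). The following sketch, with an invented example sentence, labels each discrepancy:

```python
# Sketch: label substitution, insertion, and deletion errors by aligning
# ASR output against a reference transcript with word-level edit distance.
def align_errors(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # match/substitution
    # Trace back through the table to recover the error labels.
    errors, i, j = [], len(ref), len(hyp)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] and ref[i - 1] == hyp[j - 1]:
            i, j = i - 1, j - 1                    # correct word
        elif i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + 1:
            errors.append(("substitution", ref[i - 1], hyp[j - 1]))
            i, j = i - 1, j - 1
        elif j > 0 and d[i][j] == d[i][j - 1] + 1:
            errors.append(("insertion", None, hyp[j - 1]))
            j -= 1
        else:
            errors.append(("deletion", ref[i - 1], None))
            i -= 1
    return list(reversed(errors))

print(align_errors("no acute fracture is seen", "no cute fracture seen"))
# → [('substitution', 'acute', 'cute'), ('deletion', 'is', None)]
```

Note that merge and split errors surface in such an alignment only indirectly, as combinations of substitutions with insertions or deletions, which is one reason they are harder to characterize.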
3.2 A Brief Introduction to Automatic Speech Recognition
In the space of little more than a decade, automatic speech recognition (ASR) has advanced
from discontinuous, or isolated-word systems, for which users are required to clearly separate
each spoken word by a pause, to continuous recognition systems in which users are able to
speak “freely”. Current systems can achieve accuracy rates as high as 99% and have seen
application in a variety of tasks including automated call processing, driver commands in
vehicles, and sub-titling for live sporting events. Within the radiology department, ASR
allows clinicians to dictate their reports directly into the computer, avoiding the need for
note-taking or transcriptionists.
ASR can be largely divided into four core technologies [99]:
1. Synthesis of human-readable characters into speech;
2. Speaker identification and verification;
3. Recognition of human speech; and
4. Natural language understanding.
Speech synthesis, or text-to-speech, allows computers to produce spoken output based on
text as input. In speaker identification and verification, speech input is used to authenticate
or identify a particular speaker. Perhaps of greatest interest to medical language processing
(MLP), though, is the recognition of human speech and natural language understanding.
Throughout this document, “ASR” is used to refer exclusively to the recognition of human
speech, while “NLU” is used to differentiate natural language understanding.
[Figure 3.2: The noisy channel model: a source sentence passes through a noisy channel, and a decoder produces a sentence guess. Based on Jurafsky and Martin, Figure 7.1 [81, page 237].]
3.2.1 Recognizing Human Speech
In general, ASR systems function on the basic premise of the probabilistic noisy channel
architecture [81]. Acoustic input is treated as if it is a “noisy” version of the source sentence,
and is correspondingly decoded in an effort to find the “true”, underlying sentence, as shown
in Figure 3.2. Required at the decoder level is a search algorithm that searches the space of
all possible sentences in order to find the best match for the noisy input, i.e. the sentence
with the highest probability [81]. As a side effect of the decoding process, a hypothesis list
for each utterance is produced, where utterance can be represented at the sentence, word,
or phone level. As will be shown in Section 3.5, the “N-best” of these hypotheses can be
used to assist certain error-detection methods.
Popular statistical decoding algorithms include the Viterbi algorithm [96] and Hidden
Markov Models [96]. In non-statistical methods, decoding templates are used to identify
recognition candidates; a database of sound patterns is stored as sequences of frames to
which the input sound frames are compared. The output from these decoders is limited
by the acoustic and language models that restrict the set of possible utterances. Acoustic
modeling relies on acoustic properties of the language, while language modeling relies on
properties of the domain and the language structure itself.
Speech Recognition in Radiology
Within the context of radiology, six key requirements for the successful integration of ASR
in the reading room are identified [Mehta et al, 1998]:
1. Integration with existing hospital information systems (HIS);
2. Availability of “canned” or pre-stored reports (such as a normal chest X-ray) and
templates (standardized report forms);
3. Allowable additions to a completed report even after it is “signed”;
4. User-defined fields to maintain flexibility and control over the report setup;
5. Barcode interface (this also relates to the integration with HIS); and
6. Security of patient information (e.g. password protection of sensitive materials).
Beyond the software level, a successful ASR system is also reliant on the hardware
supporting it [99, 83, 160]. One example is a high-tech microphone with noise-canceling
capabilities. Even in the quiet of a radiology reading room, there exist ambient noises from
people and equipment that can result in unwanted input to the system. In addition, the
computers supporting the speech recognizer must be powerful enough to avoid delays and
other complications, and preserve the workflow of the radiologist. Without such equipment,
there is risk of further errors and frustration to the user.
Current ASR Systems
Unfortunately, comparative studies of ASR systems in medicine are rare. In 2000, Devine
[38] performed a comparison study of three systems as they performed “right out of the box”,
that is, with the bare minimum of required training. After examining the performance of
IBM ViaVoice 98, Dragon NaturallySpeaking Medical Suite, version 3.0, and L&H’s Voice
Xpress for Medicine, General Medicine Edition, version 1.2, he concluded that ViaVoice
significantly outperformed the other two systems in consistent recognition accuracy. Although
Devine was careful to point out that later versions of these software programs might render
his results obsolete, a similar study was released showing IBM ViaVoice again significantly
outperforming the Dragon NaturallySpeaking Medical Suite, version 5.0, this time on French
medical dictations [63], suggesting that Devine’s earlier conclusions may still be valid. In
addition, the Canada Diagnostic Centre (a local radiology clinic) has been working with
Dragon NaturallySpeaking Medical Suite (version 8.0)2, and has had numerous complaints
with respect to low accuracy rates. Radiologists at the clinic estimate that as few as 10% of
dictated reports are ever error-free [43, 44].
Although other companies have also developed ASR systems, there are no impartial,
comparative studies available at this time. It is clear that further studies comparing the
2Version 8.0 was installed in January 2006, as an upgrade from Version 7.3.
recognition rates, as well as dictation/correction rates, of currently available systems are
needed before any qualitative evaluation and discussion is possible. Regardless, the current
state of ASR performance is inadequate for the purposes of radiology reporting.
3.2.2 Natural Language Understanding
In many respects, the ability of computers to understand and communicate freely with
humans is the defining technology of artificial intelligence. The area of natural language
understanding, or NLU, is at the very root of this freedom of communication.
Central to any NLU system is the translation of natural language input into a machine-
readable format, where “machine-readable” refers to data that can be processed by a com-
puter. While computer “understanding” of a text does not have the same connotations as
with a human, it should entail the ability to process data in order to interact with people
in a more intelligent manner. This means having some internal structure for the concepts
present in the natural language input, a means to extract those concepts, and finally a way
to reason about them.
For the purposes of this thesis, the focus is exclusively on the recognition of human
speech, leaving NLU as a separate area of pursuit. In Chapter 5 future possibilities are
discussed for later advancements of error detection and correction using NLU.
3.3 Confidence Scoring
In general, speech recognizers can be evaluated on the basis of their recognition accuracy.
This is commonly determined via the word-error rate (WER)3, a measure of the differences
between a recognized string and an actual utterance measured at the word level [81]. It
is possible, however, to determine a ranking for the individual components of a recognized
string in the form of a confidence score that directly represents the probability that that
string is correct. By modeling a recognized string or text in this fashion, it is possible to
direct error detection and correction more intelligently4.
In general, a confidence score reflects the overall result of a set of confidence measures.
3See also Section 4.3.
4Note that confidence accuracy is not equivalent to recognition accuracy. A speech recognizer can have poor recognition accuracy while the confidence accuracy is high; that is, low confidence rankings are correctly assigned to the erroneous recognizer output.
Typically these measures reflect statistical properties of the acoustic model and the language
model, divided into the phonetic, utterance, and word levels. Overall, “the features which
are utilized are chosen because, either by themselves or in conjunction with other features,
they can be shown to be correlated with the correctness of a recognition hypothesis.” [71,
page 2].
In some studies [70], [71], the researchers compute word-confidence scores, based pri-
marily on acoustic qualities, as a post-processing stage following speech recognition. These
measures are combined into a single feature vector which is then compressed via a projec-
tion vector to obtain the final confidence score. This confidence score is expressed as the
following (where ~p is the projection vector, ~f is the feature vector, c is the “raw confidence
score” [70, page 2], and T denotes the vector transpose):

c = ~p^T ~f    (3.1)
The researchers set a threshold value to “adjust the balance between false acceptances of
misrecognized words and false rejections of correctly recognized words” [70, page 2]. The
projection vector relies on a minimum classification error (MCE) technique. Nonetheless,
while such a simplistic approach worked well (reducing the false acceptance rate of mis-
recognized terms by as much as 25% in some cases), a more powerful classifier such as an
artificial neural network may ultimately prove more successful [70].
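The projection-vector confidence score of Equation 3.1 can be sketched as follows. The feature values, projection weights, and threshold below are invented for illustration; in the cited work the projection vector is trained with a minimum classification error technique rather than set by hand.

```python
# Sketch of word-confidence scoring in the style of Equation 3.1: the
# raw confidence score is the inner product of a per-word feature vector
# with a projection vector, then compared against a rejection threshold.
# All numbers here are invented for illustration.
def confidence(features, projection):
    """c = p^T f, the raw confidence score."""
    return sum(p * f for p, f in zip(projection, features))

def accept(features, projection, threshold):
    """Accept the word only if its confidence clears the threshold."""
    return confidence(features, projection) >= threshold

# Hypothetical per-word features: acoustic score, language-model score,
# and N-best agreement, each normalized to [0, 1].
projection = [0.5, 0.3, 0.2]
word_features = [0.9, 0.6, 1.0]
print(round(confidence(word_features, projection), 2))   # → 0.83
print(accept(word_features, projection, threshold=0.7))  # → True
```

Raising the threshold trades false acceptances of misrecognized words for false rejections of correctly recognized ones, which is exactly the balance the researchers tune.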
In another study [162], the researchers use the posterior probability of a word “given all
acoustic observations of the utterance”, as an indicator of confidence. They discovered a
relative reduction in confidence error rate between 19% and 35%.
In general, the confidence of a particular utterance is reflected in its N-best score – the
score that either the decoder assigns to the decoded utterance, or is later assigned by a
separate error-detection algorithm [37].
Although these methods are exclusively statistical, non-statistical, rule-based methods
of confidence ranking are also possible that do not rely on the internal ranking of the ASR
system.
The usefulness of confidence measures can be seen in their ability to direct the focus
to potentially problematic areas of a text. Such measures can be used as indicators of the
possibility of errors in areas of low confidence, and when applied with a threshold value, to
tag those words whose confidence ranking is too low.
3.4 A Classification of Error-Detection Methods for Speech
Recognition
The short survey in the previous sections provides the information needed to propose a
classification of error-detection methods for ASR. Such a classification will make it possible
to discuss aspects of error detection in a more formal and controlled manner (avoiding the
ad-hoc discussions that currently characterize the literature), as well as to compare and contrast
not only specific methods, but categories of methods as well.
3.4.1 The Classification
Error-detection methods in speech recognition can be divided into two broad categories:
• Non-Black-Box Methods
• Black-Box Methods
In Black-Box Methods, the internal recognizer information (the utterance hypothesis
list produced by the decoder) is completely inaccessible. In other words, the recognizer is
opaque, or a “black box”, for which we see only the input and the output. In Non-Black-
Box Methods, the recognizer is transparent, allowing us to access the internal ranking
information that the recognizer uses in producing its output.
Each category can be further classified into the following:
• Probabilistic Approaches
• Non-Probabilistic Approaches
• Hybrid Approaches
As we will see, there are advantages and disadvantages to working with or without
the black-box assumption. We next look at the various possibilities for non-black-box and
black-box error detection according to our classification.
3.5 Non-Black-Box Methods
This section presents a closer look at the possibilities for error detection in non-black-
box (NBB) methods, including examples of their application in the reference literature
wherever possible. NBB methods refer to error-detection systems that interface directly
with the speech recognizer. In these instances the internal ranking information that the
recognizer uses in producing its output is accessible. For example, given a Viterbi decoder
such information will take the form of likelihood ratios [32]. This information can be used
in a variety of ways, including comparison to a second decoder or a recognizer running in
parallel; input to a classifier such as Hidden Markov Models [81]; or in combination with
higher level analyses, such as the semantic level. The result is a measure of confidence in the
speech recognizer’s original output or an alternative output hypothesis for the utterance.
“N-best” Score
Given whatever model/decoder is used, in NBB methods the N-highest hypothesis scores
from the decoder can be used to create a list of the “N best” hypotheses corresponding to the
input segment. Such an “N-best list” can provide input to other error-detection algorithms
that will in turn “re-rank” this list, resulting in their own “N-best list”.
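The re-ranking idea can be sketched with a toy example. The hypotheses, decoder scores, interpolation weight, and the external scorer itself are all hypothetical; in practice the external score would come from one of the error-detection methods described in this chapter.

```python
# Sketch: re-rank a decoder's N-best list by combining each hypothesis's
# decoder score with a score from an external error-detection module.
# Hypotheses, scores, and the scorer below are invented for illustration.
def rerank(nbest, external_score, weight=0.5):
    """nbest: list of (hypothesis, decoder_score); returns a new N-best list."""
    rescored = [(hyp, (1 - weight) * s + weight * external_score(hyp))
                for hyp, s in nbest]
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)

# Toy external scorer: prefer hypotheses containing in-domain terms.
DOMAIN_TERMS = {"vertebrae", "lumbar"}
def domain_score(hyp):
    words = hyp.split()
    return sum(w in DOMAIN_TERMS for w in words) / len(words)

nbest = [("L for five lumbar", 0.70), ("L4/5 lumbar vertebrae", 0.65)]
for hyp, score in rerank(nbest, domain_score):
    print(f"{score:.3f}  {hyp}")
```

Here the domain-aware score promotes the second hypothesis over the decoder's original top choice, illustrating how an external module can override the recognizer's internal ranking.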
3.5.1 Probabilistic Approaches
As Gillick et al [54] observe, the most basic probabilistic confidence measure in a speech
recognizer’s output is simply the result of a long-term average over the performance of the
recognizer itself: the percentage error rate, p, collected over some timeframe, t. This
naïve approach has many failings, not the least of which is the failure to account for the effect
of the surrounding words on the resulting probabilities. The following sections examine the
efforts to refine this technique and create more intelligent probabilistic approaches for error
detection and confidence ranking.
Language Modeling
Recall that in ASR, the decoder output possibilities are limited by the language and acoustic
models in place describing the probability of a particular utterance. Essentially, through a
variety of statistical techniques, it is possible to estimate the probability of a word occurring
based on the previous words recognized. An early attempt at a more intelligent use of such
statistical language models was Kuhn [86]. Kuhn observed that the likelihood of a word
was higher if it had been spoken recently, suggesting a trend of coherence throughout a text
that could be exploited by weighing more recent words more heavily.
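Kuhn's recency observation can be sketched as a cache language model that interpolates a base unigram probability with a recency-weighted cache. The base probabilities, cache size, and interpolation weight below are invented for illustration.

```python
from collections import deque

# Sketch of a cache language model in the spirit of Kuhn's observation:
# a word's probability is boosted if it has appeared recently. The base
# probabilities, cache size, and mixing weight are invented.
class CacheLM:
    def __init__(self, base_prob, cache_size=100, mix=0.2):
        self.base_prob = base_prob            # fallback unigram model
        self.cache = deque(maxlen=cache_size)  # most recent words seen
        self.mix = mix                         # weight on the cache term

    def prob(self, word):
        cache_p = self.cache.count(word) / len(self.cache) if self.cache else 0.0
        return (1 - self.mix) * self.base_prob.get(word, 1e-6) + self.mix * cache_p

    def observe(self, word):
        self.cache.append(word)

lm = CacheLM({"effusion": 0.001, "the": 0.05})
before = lm.prob("effusion")
for w in "a small effusion is seen effusion persists".split():
    lm.observe(w)
after = lm.prob("effusion")
print(after > before)  # → True
```

Because the cache is a bounded queue, the boost decays as the discourse moves on, which matches the intuition that textual coherence is a local phenomenon.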
In general, the language model can be enforced using algorithms that reflect the state
of the domain, such as Kuhn’s weighted algorithm above. For example, using posterior
probabilities and Bayes’ Theorem, we can determine the optimum word sequence, W [77],
as shown in Equation 3.2.
W = argmaxW P(W|O) = argmaxW P(W)P(O|W)    (3.2)
Here W = w1, w2, ..., wn is a candidate word sequence, and O = o1, o2, ..., on is the utterance,
or output sequence from the speech recognizer. P(W) and P(O|W)5 are the source model
and channel model, respectively6. P(W) can be determined via Equation 3.3:
P(W) = ∏i P(wi|w1,i−1)    (3.3)
The condition w1,i−1 refers to the words occurring prior to the target word, wi. Based on the
assumption that the ASR output words are independent, we have the following Equation
(3.4) [77].
P(O|W) = ∏i P(o1,i|w1,i) = ∏i P(oi|wi)    (3.4)
Thus, the optimum sequence, W, becomes, finally, Equation 3.5 [77].

W = argmaxW ( ∏i P(wi|w1,i−1) ∏i P(oi|wi) )    (3.5)
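Equation 3.5 can be sketched with tiny toy models. The bigram probabilities, channel (confusion) probabilities, and candidate lists below are all invented for illustration; a real decoder would search this space with dynamic programming rather than by enumeration.

```python
import math
from itertools import product

# Toy sketch of Equation 3.5: choose the candidate word sequence W that
# maximizes prod_i P(wi|w_{i-1}) * prod_i P(oi|wi), using a bigram
# source model and an independent channel model. All probabilities and
# candidates below are invented for illustration.
BIGRAM = {  # P(word | previous word); "<s>" marks sentence start
    ("<s>", "no"): 0.5, ("no", "acute"): 0.4, ("no", "cute"): 0.01,
    ("acute", "fracture"): 0.5, ("cute", "fracture"): 0.01,
}
CHANNEL = {  # P(observed word | intended word): confusable pairs
    ("no", "no"): 0.9, ("cute", "acute"): 0.3, ("acute", "acute"): 0.6,
    ("cute", "cute"): 0.6, ("fracture", "fracture"): 0.9,
}
CANDIDATES = {"no": ["no"], "cute": ["cute", "acute"], "fracture": ["fracture"]}

def best_sequence(observed):
    best, best_lp = None, -math.inf
    for words in product(*(CANDIDATES[o] for o in observed)):
        lp, prev = 0.0, "<s>"
        for o, w in zip(observed, words):
            lp += math.log(BIGRAM.get((prev, w), 1e-6))   # source model
            lp += math.log(CHANNEL.get((o, w), 1e-6))     # channel model
            prev = w
        if lp > best_lp:
            best, best_lp = list(words), lp
    return best

print(best_sequence(["no", "cute", "fracture"]))  # → ['no', 'acute', 'fracture']
```

Even with a weak channel preference for the observed word "cute", the bigram source model recovers "acute" because "no cute fracture" is a far less probable word sequence.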
In Allen et al, and Ringger and Allen [1, 121], the authors rely on the likelihood of recog-
nition errors, as well as statistical data such as co-occurrences and word N-grams. N-grams
refer to the divisions representing the N words occurring in the context of the target word.
A “unigram” then refers to the word itself, a “bigram” to a two-word pairing, and so on
[77]. They observe that the assumption of independence above is an oversimplification that
neglects split or merge errors. Instead they permit a small window, such as P(o_{i−1}, o_i | w_i) or
P(o_i | w_i, w_{i+1}), that allows the system to make predictions based on the surrounding words,
theoretically mitigating the merge/split problem [121].
5 The probability of an accidental word-to-word transformation.
6 Note that the denominator P(O) predicted by Bayes' Theorem can be dropped, as its value is constant and independent of W.
In Jeong et al [77], however, the authors observe that the methods employed by Ringger
and Allen [121] do not show the expected increase in accuracy. They suggest that data
sparseness is to blame, owing to the large number of word-level correction pairs needed to
adequately characterize the search space7. To collect such pairs requires a prohibitively
large amount of training data. Instead, the authors propose the collection of sub-word
(i.e. syllable) correction pairs to overcome data-sparseness. By breaking up the words, it is
possible to achieve a greater number of correction pairs with the same amount of training
data. This syllable-channel model is shown in the following equation [77, 121]:
W = argmax_W ( P(W) P(X|W) P(S|X) )    (3.6)

where X is the source syllable sequence, P(X|W) is the word model, and P(S|X) is the
probability of a syllable-to-syllable transformation. Jeong et al demonstrated a 6-7% increase
over their baseline recognizers using the syllable-based method, tested on a Korean
question-answering system.
Alternative methods exist based on other probabilistic techniques, such as Hidden Markov
Models [93]. All of them share in common the notion that the previous words in a sequence
carry important information about the probability of the current or upcoming word.
3.5.2 Non-Probabilistic Approaches
Higher Level Feature Analysis
In addition to low-level lexical and statistical information, higher level information such as
prosodic features can be used alongside the recognizer’s own confidence score. In the case of
dialogue systems, Litman [95] observes that when people re-state their utterance they often
over-emphasize their words (a prosodic change), leading to poor recognition accuracy. Fur-
thermore, differences in gender, age, native-speaker status, and even temporary influences
such as colds, can affect speaker prosody. Based on such prosodic features as utterance
duration and speaker rate, Litman used a machine-learning algorithm to learn if-then-else
rules, which classify a recognition as correct or incorrect. Used in combination with the
ASR confidence measures, she was able to increase the overall accuracy of the system over the
acoustic score alone [95].
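Litman's learned if-then-else rules can be sketched as follows. The feature names and thresholds are invented stand-ins for rules a machine learner would induce from labelled data, combined here with the recognizer's own confidence score:

```python
def classify_recognition(features):
    """Toy if-then-else rules of the kind Litman's learner induces.

    `features` holds prosodic measurements plus the recognizer's own
    confidence score; all thresholds are invented for illustration.
    """
    if features["asr_confidence"] < 0.3:
        return "incorrect"
    # Over-emphasized re-statements tend to be long, slow utterances.
    if features["duration_sec"] > 3.0 and features["speaking_rate"] < 2.0:
        return "incorrect"
    return "correct"

assert classify_recognition({"asr_confidence": 0.9,
                             "duration_sec": 1.2,
                             "speaking_rate": 4.5}) == "correct"
assert classify_recognition({"asr_confidence": 0.8,
                             "duration_sec": 4.0,
                             "speaking_rate": 1.5}) == "incorrect"
```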
7 The problem of data sparseness is addressed again in Chapter 6.
In addition to prosody, features at the semantic and syntactic level can also be accessed.
Lieberman et al [93] use semantic information to re-rank the recognizer’s hypothesis list.
Those hypotheses that are semantically relevant to the context in which the utterance occurs
are moved higher in the rankings. They give the example of “my bike has a squeaky brake”.
Initially, the recognizer will select “break” due to its higher individual word probability,
instead of “brake”. Given the context of “bike”, however, the system is able to determine a
set of related concepts, using the semantic network ConceptNet [93]. Of this set “brake” is
a member, but “break” is not. Lieberman et al observe that by relying on a smaller corpus
of semantic knowledge only, the smaller amount of data along with greater natural language
processing means that a larger context can be considered in an N-gram model without
becoming intractable. Statistical techniques, in contrast, rely on low-order N-grams of no
more than two or three words. Using the method of commonsense reasoning to re-rank the
candidate hypotheses, Lieberman et al estimate an overall 17% reduction in errors (based
upon a post-analysis of the actual dictation errors) [93].
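The re-ranking step can be sketched as below, with a plain Python set standing in for the related-concept set that ConceptNet would return for the context word "bike":

```python
def rerank(hypotheses, context_concepts):
    """Move hypotheses containing context-related words up the N-best list.

    `hypotheses` arrive ordered by the recognizer's own score; a hypothesis
    earns one point per word related to the context. A stable sort keeps
    the recognizer's original order for ties.
    """
    def relatedness(hyp):
        return sum(1 for w in hyp.split() if w in context_concepts)
    return sorted(hypotheses, key=relatedness, reverse=True)

# Stand-in for the concepts ConceptNet relates to "bike".
related_to_bike = {"brake", "wheel", "pedal", "squeaky"}
nbest = ["my bike has a squeaky break", "my bike has a squeaky brake"]
assert rerank(nbest, related_to_bike)[0] == "my bike has a squeaky brake"
```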
Parallel Recognizers
By aligning the output of the ASR word- or utterance-level recognition with a paral-
lel, phone-level recognizer, it is possible to identify inconsistencies which may indicate
errors [33]. For example, the speech recognizer may have decoded the following phone
sequence q1, q2, ..., qN(i) for word wi, while the phone recognizer identified the sequence
p1, p2, ..., pN(i) [33]. Comparing qi to pi can be useful in identifying error candidates, effec-
tively separating the language modeling component from the acoustic modeling component
(represented independently in the phone analyser). One advantage of such an approach is
that it avoids exclusive reliance on the decoder algorithm. Cox and Dasmahapatra found
that while the parallel phone recognizer produced statistically significant results, they did
not improve on the baseline N-best technique found in Gillick et al, 1997 [54], which relied
on the stability of a word’s position in the recognizer word lattice8.
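A minimal sketch of the comparison, assuming the two phone sequences have already been aligned word by word (real systems need an explicit alignment step, since the two recognizers may segment the audio differently):

```python
def flag_phone_mismatches(words, decoder_phones, phone_recognizer_phones):
    """Compare the decoder's phone sequence for each word with the output
    of an independent phone-level recognizer; disagreement marks the word
    as an error candidate."""
    candidates = []
    for word, q, p in zip(words, decoder_phones, phone_recognizer_phones):
        if q != p:
            candidates.append(word)
    return candidates

# Invented ARPAbet-style phone sequences for illustration.
words = ["knee", "effusion"]
decoder = [["n", "iy"], ["ih", "f", "y", "uw", "zh", "ah", "n"]]
parallel = [["n", "iy"], ["ih", "f", "y", "uw", "s", "ah", "n"]]
assert flag_phone_mismatches(words, decoder, parallel) == ["effusion"]
```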
8 A lattice represents the probability of each word in the output sequence in terms of the probabilities of the preceding words [96].
3.5.3 Hybrid Approaches
Since NBB methods access the internal ASR confidence measures, any system that combines
a typical recognizer (relying on statistical decoding) with an error-detection method that
uses non-statistical features to re-rank utterance hypotheses is by default a hybrid approach.
For example, Lieberman’s approach described above adds higher level semantic knowledge
in order to re-rank the ASR output.
The goal with hybrid approaches is to take advantage of the strengths of both the prob-
abilistic and non-probabilistic approaches, while using their complementary error coverage
to balance out their weaknesses.
Jeong et al increase domain-specific recognition by combining their syllable-channel
model, described in 3.5.1, with a semantic analysis that is sensitive to both semantic and
lexical errors [77]. At the semantic level, they obtain the necessary semantic information
from their own generated domain dictionary, and more general thesauri. Lexico-semantic
patterns, or LSPs, are collected into a template database based on abstractions of partic-
ular word sequences found in the training data. Queries are mapped to their own LSPs
and then matched to the template LSPs when an error is suspected. Templates with the
minimum distance from the query LSP are selected as replacement candidates. On its own,
the LSP method gave a 4% increase in accuracy over the baseline method, and a 6-8%
increase over the baseline when combined with the syllable model and tested on a Korean
question-answering system.
Similarly, in Cox and Dasmahapatra 2002, latent semantic indexing as a measure of term
similarity is combined with N-best ranking (as in Gillick et al, 1997 [54]) and is shown to
be an improvement over either technique individually [33].
3.6 Black-Box Methods
This section examines black-box methods for error detection. By assuming a black-box
scenario, where the internal rankings of the ASR software are unavailable, the drive is to
develop post-processing solutions that avoid the complications of proprietary software.
Furthermore, such solutions are not restricted to a particular software suite, and so the
system can handle input from any recognizer. This in turn better reflects
the varying needs of reading rooms supporting different vendor software packages. As Cox
and Dasmahapatra observe [33, 32], the performance of methods relying on ASR-dependent
information may vary based on the ASR system or decoding algorithm being used.
3.6.1 Probabilistic Approaches
The unifying theory underlying probabilistic approaches is the understanding that human
languages are probabilistic entities, rather than fixed and absolute. Competence in the lan-
guage is therefore experience-based, depending on the frequency of observation of linguistic
and linguistic-related events. When applied to natural language processing, the research
focus is to automatically identify the frequency of events in a text, and use that information
to predict the features of novel texts.
This can readily be applied to error detection given the assumption that ASR errors
“occur in regular patterns rather than at random” [82]. Given a corpus of natural language
texts on which a system can train, it is possible to identify these patterns, and use their
frequencies to assess future texts. If the corpus is large enough to be a representative sample
of the domain of discourse, then those frequencies can be extended beyond the corpus to
the entire domain. In considering a novel text, given the observation of a new event, such as
a word occurring in a particular environment, if that event is sufficiently improbable based
on the training data, the most likely explanation is a recognition error.
The following sections describe common probabilistic tools for language analysis as applied to
error detection, with examples in the research literature where possible.
Latent Semantic Indexing
Latent Semantic Indexing9 (LSI) uses the co-occurrence of terms to determine the degree of
relatedness between them. Two terms co-occur if one occurs within the context of the other,
where “context” refers to the surrounding words. The general idea is that variability in word
choice due to synonymous words and phrases can make it difficult to identify semantically
related documents [96]. If each term in the domain and each document is represented
in multi-dimensional space, by restricting co-occurring terms to the same dimension we
can reduce the total number of dimensions overall and thus the noise. The result is a
compressed space with “latent” semantic dimensions in which document or term similarity
can be measured via vector cosine measures. The reduction of co-occurring terms to semantic
9 Often referred to as Latent Semantic Analysis.
dimensions means that it is possible to determine the similarity between documents, even
when they have minimal terms in common [96].
Consider the following example provided by Manning and Schutze [96] in Table 3.1.
Given our query, if we rely on keyword search alone, only Document 1 will be returned. Since
the terms “HCI” and “interaction” co-occur in Document 1 and Document 2, however, it is
likely that Document 2 is also related to the query. LSI allows us to determine a measure of
just how semantically related two terms are, or, by extension, two documents based on the
terms within, and thus provides a measure of similarity between the query and Document
2.
Table 3.1: An example of the usefulness of co-occurrence relations in determining similarity between documents and queries [96, Page 554]

             Term 1   Term 2      Term 3   Term 4
Query        user     interface
Document 1   user     interface   HCI      interaction
Document 2                        HCI      interaction
The notion of semantic similarity can be applied quite naturally to error detection if we
consider the assumption that many recognition errors are likely to be words that share little
semantic similarity with the neighbourhood of words in which they co-occur10.
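The Table 3.1 example can be worked through with a truncated singular value decomposition, which is the core of LSI. This sketch uses NumPy; a single latent dimension is kept only because the example is tiny:

```python
import numpy as np

# Term-document matrix for the Table 3.1 example:
# rows = user, interface, HCI, interaction; columns = Document 1, Document 2.
A = np.array([[1.0, 0.0],
              [1.0, 0.0],
              [1.0, 1.0],
              [1.0, 1.0]])

# Truncated SVD compresses co-occurring terms into latent dimensions.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 1                                      # keep one latent dimension
docs_latent = (np.diag(s[:k]) @ Vt[:k]).T  # documents in latent space

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

# Fold the query "user interface" into the latent space via U.
raw_query = np.array([1.0, 1.0, 0.0, 0.0])
query_latent = raw_query @ U[:, :k]

# Raw keyword overlap with Document 2 is zero, yet the latent-space
# similarity is high, because "HCI" and "interaction" co-occur with the
# query terms in Document 1.
assert cosine(raw_query, A[:, 1]) < 1e-9
assert cosine(query_latent, docs_latent[1]) > 0.9
```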
Co-Occurrence Relations
Co-occurrence relations are a statistical method for determining the number of times a word
occurs in a specific context [81, 96, 131]. Given a sufficiently representative training corpus,
words can be associated with particular contexts based on that corpus. These word-context
statistics can be applied to determine the probability of a word occurring in a given context
in a text. If that probability falls below a certain threshold, the word will be flagged as a
possible error. This technique was applied to the analysis of dialogue queries in [131] and
to radiology reports in [154]. This latter application is expanded on in detail in Chapter 5.
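A minimal sketch of this scheme, with an invented two-report training corpus; the window size and flagging threshold are illustrative choices:

```python
from collections import defaultdict

def train_cooccurrence(corpus, window=2):
    """Count how often each word appears within `window` words of another."""
    counts = defaultdict(lambda: defaultdict(int))
    for sentence in corpus:
        for i, word in enumerate(sentence):
            lo, hi = max(0, i - window), min(len(sentence), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[word][sentence[j]] += 1
    return counts

def flag_errors(sentence, counts, window=2, threshold=1):
    """Flag words whose context support falls below `threshold`."""
    flagged = []
    for i, word in enumerate(sentence):
        lo, hi = max(0, i - window), min(len(sentence), i + window + 1)
        support = sum(counts[word][sentence[j]] for j in range(lo, hi) if j != i)
        if support < threshold:
            flagged.append(word)
    return flagged

corpus = [["small", "pleural", "effusion"], ["large", "pleural", "effusion"]]
# "infusion" never co-occurs with "pleural" in training, so it is flagged.
assert flag_errors(["small", "pleural", "infusion"],
                   train_cooccurrence(corpus)) == ["infusion"]
```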
10 Cox and Dasmahapatra [32] used this assumption in their LSI algorithm for determining semantic confidence measures for recognizer output. They note that although it was a weak indicator of errors overall, LSI was nonetheless complementary to the basic decoder-only N-best list. Thus, by combining the semantic confidence measure and the N-best confidence measure they were able to improve over the baseline decoder measure [32].
Sarma and Palmer [131] use co-occurrence statistics to perform a context analysis on
the words in a query in order to detect and then correct errors. Given a query word, the
researchers determine the context window for that word, based upon its occurrence at the
centre. If the surrounding context words do not match the target word, they can be used
to identify misrecognition candidates, words for which the context words are appropriate.
From this list of candidates, the phonetic similarity between each word and the target word
is determined. If a candidate is both context-appropriate and phonetically similar to the
target, then it is considered likely that the target word was a misrecognized form of this
candidate [131].
Pointwise Mutual Information
Pointwise Mutual Information, or PMI, is a statistical measure of the degree of independence
between two variables and is defined in Equation 3.7 [96].
PMI(x, y) = log [ P(x, y) / (P(x) · P(y)) ]    (3.7)
Here P(x, y) is the probability of x and y co-occurring, while P(x) and P(y) are the individual
probabilities of x and y occurring, respectively. If P(x, y) is larger than the product of the
individual probabilities, P(x) · P(y), the two show a low degree of independence, with
P(x, y) = P(x) = P(y) being maximally dependent, and P(x) · P(y) = P(x, y) being
maximally independent [96].
As Manning and Schutze observe, measures of mutual information are particularly sen-
sitive to data sparseness [96]. Considering the case of maximum dependence above, where
two words only occur together, the value of PMI(x, y) becomes log(1/P (y)). This means
that the rarer the occurrence of (x, y), the higher the degree of mutual information. This
makes little sense, as words of higher frequency will be scored lower, despite the presence of
more evidence to support the score. Consequently, PMI is a poor measure of dependence.
For the purposes of ASR output in error detection, however, the focus is on the degree of
independence that one word shows from its surrounding context. Terms that demonstrate
a high independence are likely candidates for recognition errors. Inkpen and Desilets [75]
use this idea to determine errors in meeting transcripts. By establishing a target word's
neighbourhood of surrounding words, it is possible to calculate the PMI value for each
of those context words and compile them into a single value. This value represents the
“semantic coherence” (SC) of the target word. Those SC values that fall below a certain
threshold are then marked as indications of possible errors.
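Inkpen and Desilets' semantic-coherence score can be sketched as the average PMI of a word with its context words. The corpus, windowing, and averaging choices below are illustrative simplifications of their method:

```python
import math
from collections import Counter

def pmi_table(corpus, window=2):
    """Build unigram and within-window pair counts over a training corpus,
    and return a PMI function (Equation 3.7) over word pairs."""
    uni, pairs, total = Counter(), Counter(), 0
    for sentence in corpus:
        total += len(sentence)
        uni.update(sentence)
        for i, w in enumerate(sentence):
            for j in range(i + 1, min(len(sentence), i + window + 1)):
                pairs[frozenset((w, sentence[j]))] += 1
    def pmi(x, y):
        pxy = pairs[frozenset((x, y))]
        if pxy == 0:
            return float("-inf")   # never co-occurred in training
        return math.log((pxy / total) / ((uni[x] / total) * (uni[y] / total)))
    return pmi

def semantic_coherence(word, context, pmi):
    """Average PMI of `word` against its context words: a simplified form
    of the semantic-coherence (SC) score."""
    scores = [pmi(word, c) for c in context]
    return sum(scores) / len(scores) if scores else float("-inf")

corpus = [["pleural", "effusion"], ["pleural", "effusion"], ["knee", "joint"]]
pmi = pmi_table(corpus)
# "effusion" coheres with a "pleural" context; "joint" does not, so its SC
# score drops and it would be flagged as a possible error.
assert semantic_coherence("effusion", ["pleural"], pmi) > \
       semantic_coherence("joint", ["pleural"], pmi)
```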
In addition, the constrained vocabulary within radiology means that a smaller training
corpus is needed for more complete coverage, reducing the problem of data sparseness. Data
sparseness is discussed in more detail in Section 6.8.
3.6.2 Non-Probabilistic Approaches
Pattern Matching
A common, non-probabilistic, rule-based approach to error detection relies on the exploita-
tion of error patterns. By collecting a database of common error patterns relevant to a
particular language (or domain), it is possible to use rules to compare the ASR transcrip-
tion to this database. While such approaches are often very accurate within the domain of
the error database, they are nonetheless fragile. Any errors that do not have corresponding
templates in the database will be overlooked as the system cannot generalize beyond what
is known (i.e. what is in the database) [82, 77]. In addition, they are susceptible to false
positives in cases where correct words happen to occur in a known error context [77].
Kaki et al [82] developed an error correction system based on these principles. They
collected a database of common lexical errors and their corrections for Japanese. When a
string was encountered that matched an error template in the database, it was replaced by
the corresponding correct string.
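A minimal sketch of template-based correction in the spirit of Kaki et al; the error/correction pairs are invented radiology-flavoured examples, not entries from any real database:

```python
def correct_by_template(text, error_templates):
    """Replace any substring matching a known error pattern with its stored
    correction. Anything absent from the database passes through untouched,
    which is exactly the fragility noted above."""
    for error, correction in error_templates.items():
        text = text.replace(error, correction)
    return text

# Hypothetical error/correction pairs for a radiology error database.
templates = {"plural effusion": "pleural effusion",
             "me to static": "metastatic"}
assert correct_by_template("small plural effusion", templates) == \
       "small pleural effusion"
# Unseen errors are silently missed:
assert correct_by_template("pneumonia thorax", templates) == "pneumonia thorax"
```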
Conceptual Similarity
The comparison of concepts is necessary in a variety of human and machine reasoning
tasks, and allows high-level reasoning beyond the lexical and syntactic level. Importantly,
it is possible to derive a “quantitative similarity score between two concepts” [23, Page 77].
General semantic similarity techniques include the use of vector space measures and set
operations [96]. Although traditionally seen as relevant for vocabulary development and
maintenance, data mining, and decision support, in specialized domains such as medicine,
conceptual similarity is also applicable to error detection.
Given access to a hierarchical knowledge base there are two primary approaches to
determining conceptual similarity, namely edge-based and node-based similarity.
Edge-Based Similarity
In Caviedes and Cimino [23], the authors examine the problem of a conceptual distance met-
ric for the Unified Medical Language System (UMLS), a broad-coverage medical-language
ontology11 [103]. Despite the lack of homogeneity and the presence of inconsistency within
the ontology, they acknowledge that the UMLS is nonetheless progressing strongly towards
meeting formal terminology requirements [23] (see Appendix B for a more detailed discus-
sion). In their conceptual similarity metric they exploit the hierarchical structure of the
UMLS which, by default, places similar items nearer to one another. Previous work in this
area has indicated that a reasonable metric can be derived from the minimum path along
broader-than, or RB, links [113]. Caviedes and Cimino [23] extend this notion to include
parent, or PAR, links, which are semantically similar to is-a links but subsumed by broader-
than links. The authors note that “[o]ther Euclidean metrics based on geometric distances
in a feature space... are possible but very likely too computationally expensive for practical
use” [23, page 78]. As a rough solution to the problem of inconsistencies within the UMLS,
they assume the PAR trees are directed acyclic graphs (DAGs) and discard any cycles. They
acknowledge, however, that the ability to search within verified DAG hierarchies would im-
prove the accuracy of the distance values calculated, and that further research is needed
[23].
The authors calculate two values: the depth and the conceptual distance, CDist. The
depth value is a measure of the actual depth within the concept hierarchy and reflects the
specificity of the concept. Deeper concepts are more specialized, while shallower concepts
are more general (with the root concept being maximally general). Specifically, depth is
defined as the “shortest path from the most specific common ancestor [between the two
concepts being compared] to a root concept” [23, page 81]. The CDist is one measure of
conceptual distance calculated based on the “minimum path avoiding circular and infinite
paths” [23, page 79].
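The minimum-path idea behind CDist can be sketched as a breadth-first search over a toy concept hierarchy. The edges below are invented for illustration, not UMLS content:

```python
from collections import deque

def shortest_path(graph, start, goal):
    """Minimum number of links between two concepts, via BFS, avoiding
    circular paths by tracking visited nodes."""
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return float("inf")

# Toy concept hierarchy, stored as an undirected adjacency map.
edges = [("finding", "effusion"), ("finding", "fracture"),
         ("effusion", "pleural effusion"), ("effusion", "joint effusion")]
graph = {}
for a, b in edges:
    graph.setdefault(a, set()).add(b)
    graph.setdefault(b, set()).add(a)

# Sibling subtypes are two links apart; unrelated findings sit farther away.
assert shortest_path(graph, "pleural effusion", "joint effusion") == 2
assert shortest_path(graph, "pleural effusion", "fracture") == 3
```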
There have been a few suggestions as to the nature of the relationship between the
depth and the conceptual distance. Caviedes and Cimino [23] suggest the following metric:
Conceptual distance ∝ 1 / depth    (3.8)
11 See Appendix B for a detailed discussion.
This reflects the effect of depth on the generality of the concept and provides a means to
differentiate two concepts whose CDist values, as defined above, may be the same but whose
depth values differ. Other suggestions include a weighting value so that concepts nearer the
top of the hierarchy are less similar than those farther down [148, 119, 122, 84]. Richardson et
al [119] also use a measure of concept density within the hierarchy. They observe, however,
that irregularities in the densities give rise to unexpected distance measures [119, 122].
Roddick et al therefore extend this approach with transition costs accrued whenever a node
is traversed, and a “zooming” factor that gives preference to concepts that are closer to the
target concept [122].
Spanoudakis and Constantopoulos determine overall distance or similarity based on a
combination of partial distance factors that reflect different levels of detail: identification,
classification, generalization, and attribution [122, 141, 142].
Bousquet et al [20] use the weighted projections of two concepts (in their case, diagnoses)
along various axes and apply a vector distance calculation, Lp norm (a variant of L norm
[96]), to calculate the semantic distance between the two concepts. Thus, the semantic
distance or similarity between two concepts A and B can be determined using the following
Lp norm calculation:
L_p(A, B) = ( W_X |X_A − X_B|^p + W_Y |Y_A − Y_B|^p )^{1/p}    (3.9)

where X and Y are the axes, W_X and W_Y are the weights on the corresponding axes, and
p is the order of the norm.
Node-Based Similarity
One problem with the edge-based approach is the assumption that links within a vocabulary
represent uniform and symmetric distances [118, 78]. In fact, these distances can vary,
particularly in areas of high density, or where non-is-a links are used [118, 78]. Thus,
researchers have augmented the distance calculation via weights to reflect the information
content of a node (or concept).
As an initial approach to this problem, Resnik determined the informational content
(IC) of a concept, c, based on the information-theoretic notion of inverse log likelihood [78, 96, 118]
(here P(c) is the probability of c):
IC(c) = − log P(c)    (3.10)
Intuitively, as the probability of a concept increases the corresponding information content
decreases: concepts that are relatively high frequency (e.g. those that are higher up in the
hierarchy and consequently more general) provide a relatively small amount of information
[118]. It follows that “the more information that two concepts share in common, the more
similar they are” [118, page 2]. Within the taxonomy, this information content is determined
by the two subsuming concepts:
sim(c1, c2) = max_{c ∈ S(c1,c2)} IC(c)    (3.11)

where S(c1, c2) is the set of concepts subsuming both c1 and c2 in the network.
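Resnik's measure (Equations 3.10 and 3.11) can be sketched over a toy taxonomy. The concept probabilities and parent links below are invented for illustration; each concept's probability is assumed to include the occurrences of everything it subsumes:

```python
import math

# Hypothetical corpus probabilities for UMLS-like concepts.
p = {"entity": 1.0, "finding": 0.4, "effusion": 0.05, "fracture": 0.08,
     "pleural effusion": 0.02, "joint effusion": 0.01}
parents = {"finding": "entity", "effusion": "finding", "fracture": "finding",
           "pleural effusion": "effusion", "joint effusion": "effusion"}

def ic(concept):
    """Information content of a concept (Equation 3.10)."""
    return -math.log(p[concept])

def ancestors(c):
    out = {c}
    while c in parents:
        c = parents[c]
        out.add(c)
    return out

def resnik_similarity(c1, c2):
    """IC of the most informative common subsumer (Equation 3.11)."""
    common = ancestors(c1) & ancestors(c2)
    return max(ic(c) for c in common)

# The two effusion subtypes share the specific concept "effusion", while an
# effusion and a fracture share only the general concept "finding".
assert resnik_similarity("pleural effusion", "joint effusion") > \
       resnik_similarity("pleural effusion", "fracture")
```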
3.6.3 Hybrid Approaches
Like NBB methods, a hybrid approach of BB methods attempts to balance the weaknesses
of an individual approach with the complementary strengths of another. Thus, despite
the fragility of pattern matching, employing templates of common errors may increase the
performance of statistical techniques such as co-occurrence analysis. Likewise, the problem
of false positives in pattern matching may be offset by the co-occurrence score of the term or
phrase in question. Similarly, those errors for which insufficient training data was available
could be instead captured using non-probabilistic techniques.
To the author's knowledge, no such hybrid BB methods for error detection currently exist.
Chapter 4 contributes an original conceptualization for a hybrid, BB method of error
detection as applied to radiology reporting. Chapter 5 then provides a proof of concept
through a series of experiments.
3.7 A Note on Stop Lists
Stop words are words with little intrinsic meaning or semantic weight, such as “at” and
“the”. Typically, these words are found with such high frequency in the language that they
serve only as noise, losing all usefulness as search terms. In statistical analyses, stop words
are usually omitted since their overabundance in a text can affect the resulting probabilities
disproportionately. A list of stop words to be excluded from an analysis is referred to as a
“stop list”.
It may be argued that due to the low semantic load of stop words, errors involving them
are of minimal importance. From the perspective of safety-critical domains like medicine,
however, accuracy is vital and stop word errors should not be considered exempt from error
calculations or detection. Seemingly inconsequential errors can ultimately impact the clini-
cian’s interpretation of a report and should be avoided at all costs. For example, important
information conveying the location of pathology is often communicated via prepositions,
such as “in the”, “on the”, et cetera. This means that statistical methods employing such
stop lists (i.e. most if not all) will be inherently restricted in their success. As a result, the
best method for error detection in radiology will involve a non-statistical or hybrid approach.
The theory surrounding such a system will be the topic of Chapter 4.
3.8 Summary
In summary, the methods applied to error detection in ASR can be classified into black box
(BB) and non-black-box (NBB) methods. These in turn can be further specified according
to their use of probabilistic and non-probabilistic techniques. With such a classification
in place, it is now possible to put forward a new, hybrid BB method of error detection in
speech recognition within the context of radiology and with the goal of detecting errors at
the word level. In subsequent chapters, this new method will be introduced conceptually
and formally, and supported with a proof of concept.
Chapter 4
A Conceptual Model
Given the error classification discussed in Chapter 3, it is now possible to propose the follow-
ing original contribution: an error-detection methodology for the improvement of speech-
recognition output in radiology dictation. This chapter provides a conceptual introduction
to this model, its relation to the error-detection classification, and a formal definition. In
Chapter 5 a proof of concept will be provided via experimental evidence in the radiology
domain.
4.1 The General Idea
The overriding goal of this dissertation is not only to demonstrate that we can improve
the utility of ASR for radiologists, but to present a theoretical approach that does just
this. Central to this approach is the notion that by presenting radiologists with confidence
rankings on the ASR output, they will be able to proofread more efficiently through what
is essentially computer-aided document editing. This notion is supported by a recent study
in which Skantze and Edlund [135] demonstrate that human error-detection performance
improves when subjects are provided with a confidence ranking metric. In the study, this
metric was presented as a colour-coding on the words in the text based on the internal rank-
ing of the speech recognizer. A grey-scale representation ranging from dark grey, indicating
high recognizer confidence, to light grey, indicating low recognizer confidence, was used to
communicate these confidences to the user.
With this in mind, the goal of this dissertation can be formally stated:
Objective To develop a mapping from the individual words of a radiological free-text
report to a confidence ranking or error-tag set.
To achieve this mapping, an error-detection system must have some means for identifying
recognition errors within a text. This requires the identification of features whose values
will differentiate correct versus incorrect words. By relying on these features it is possible
to define different error-detection algorithms that may rely on different feature subsets, or
may differ in their feature handling.
Mapping the words of a text to an error-tag set provides a discrete indication of potential
errors or areas of low confidence. Essentially, words with confidence scores below a certain
threshold are flagged, while those above it are not. From the perspective of medicine,
however, all errors may be considered significant, so flagging every below-threshold word
equally, irrespective of error type, may be the most desirable behaviour.
Observation 1 The features of out-of-place words will be inconsistent with the expected
features of a word in that location.
For example, the probability of a word occurring given a particular context can be consid-
ered a feature of that word. A probability below a certain threshold, for instance, is not
consistent with the expected probability of a word in that location. Similarly, a word that
is syntactically out-of-place will not have the expected syntactic feature values. A hybrid
error detection approach utilizes as much information (i.e. features) from the text as pos-
sible through multiple detection algorithms to identify the maximum number of errors (of
varying type).
When working in radiology, we can take advantage of the constrained domain by defin-
ing features specific to radiology reports. For example, examining the various sections of
a standard radiology report reveals certain attributes that represent the expected features
of words occurring in those sections. Thus, the features of those words within the “Proce-
dures” section will relate to radiological procedures. Similarly, if a report is discussing an
examination of the knee, those concepts relating to the other parts of the body will have a
lower probability since the expected features will relate to the knee. When these and other
heuristics are combined, a characterization of each word or phrase in the report is
generated that can be used to calculate the degree of confidence in that word or phrase.
4.2 Introducing A Hybrid Approach to Error Detection
Observation 2 No single error-detection technique is sufficient to detect all potential errors
in a radiology report.
The goal in any post-recognition, error-detection system is 100% coverage of all error types
and 100% accuracy in identifying errors. The discussion in Chapter 3 shows that while the
various probabilistic and non-probabilistic methods of error-detection are each sensitive to
a particular subset of error types, none provide complete coverage over all types, nor do
any implementations achieve 100% or near-100% accuracy. In some cases, such as the use
of stop lists to omit stop words in statistical techniques, complete coverage is impossible.
Observation 3 By combining those methods of error detection that are complementary
in their coverage of error types, it is possible to achieve greater sensitivity to errors
within radiology reports.
Although individual error-detection techniques may be insufficient, if their coverage of error
types is shown to be complementary then the combination of multiple techniques via a
hybrid method will result in a higher coverage of error types. In addition, overlapping
areas of coverage will increase the reliability of each error mapping. Thus, the application
of complementary techniques in a hybrid approach will ensure maximum detection. In
this sense, the component error-detection algorithms can be considered as heuristics in the
hybrid system that improve the accuracy of the mapping of words into the error tag set.
Observation 4 Within the domain of radiology, black-box methods of error-detection are
the most viable.
Any solution to the problem of speech recognition in radiology must take into account the
potential variety of speech-recognition software currently in use. Not only is much of this
software proprietary, and therefore inaccessible to external interface, but variations in the
calculation of internal speech-utterance probabilities can affect any third-party software
designed to interface with these probabilities, creating inconsistent, unpredictable results.
Treating the ASR software as a black box and separating the error detection as a post-
recognition stage, however, avoids these problems and creates a second-level filter indepen-
dent from the speech recognizer. As mentioned in Chapter 1, this independence means that
a post-ASR error-detection system is not bound to a particular speech recognizer and there-
fore can be readily modified and updated. Furthermore, this avoids overspecializing, leaving
open the possibility of extending the methods to error detection beyond speech recognition.
Conclusion A black-box, hybrid approach to error detection is the best choice for an error-
detection methodology.
As an aside, in situations where the ASR software in question is non-proprietary or
otherwise accessible, or it is not necessary to have a system that is applicable to multiple
ASR implementations, the black-box restriction may be lifted in favour of a more general
hybrid approach.
On Tagging Errors – My Contribution
The ultimate goal of an error-detection system is a mapping function that when applied to a
text, such as a radiology report, will output a list of errors detected at the word level. This
list can be expressed superficially as a tag indicating that a word is “correct” or “incorrect”,
where “incorrect” means that a word can be described according to one of the error types
outlined in Chapter 3. Thus, all words are mapped to the error tag set {correct, incorrect}, irrespective of their error type.
In a hybrid method, however, this mapping function relies on the interaction of the
component error-detection heuristics. There are two possibilities for arriving at the word-
level tag map. In the direct method, the indication of an erroneous word in a text by at least
one heuristic is sufficient to trigger an “incorrect” tag on that word. In the indirect method,
the output from each error-detection heuristic is taken as input to a meta-level heuristic.
Each word in a text is provided with a score based upon the weighted aggregation of any
scores assigned to that word from the error-detection heuristics. Since the error coverage
differs with each error-detection method, not all words may have scores assigned from all
algorithms. If the output from each algorithm is a measure of confidence in the recognizer
output, then the combined result of applying all heuristics via the hybrid algorithm to the
text results in a complex confidence score for each word. These results can be combined
in any variety of ways, with the choice of meta-level heuristic affecting the final confidence
rankings and the overall performance of the system. Given a threshold that controls the
degree of filtering, these scores can be translated into “correct” and “incorrect” tags based
on a word’s proximity to this threshold. The threshold is chosen in order to maximize the
accuracy of the tag maps.
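The indirect method described above can be sketched in a few lines. This is a minimal illustration, not the thesis implementation: the heuristic interface (a function from word to a score in [0, 1], or None where the word falls outside that heuristic's coverage), the weights, and the threshold are all assumptions.

```python
# Sketch of the indirect method: per-word heuristic scores are aggregated
# by a weighted average, then thresholded into "correct"/"incorrect" tags.

def tag_words(words, heuristics, weights, threshold):
    tags = {}
    for w in words:
        scored = [(h(w), wt) for h, wt in zip(heuristics, weights)]
        scored = [(s, wt) for s, wt in scored if s is not None]
        if not scored:
            tags[w] = "correct"  # no heuristic covers this word
            continue
        total = sum(wt for _, wt in scored)
        confidence = sum(s * wt for s, wt in scored) / total
        tags[w] = "correct" if confidence >= threshold else "incorrect"
    return tags
```

Words covered by no heuristic default to "correct" here; a production system might instead mark them as low-confidence.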
The meta-level combination of the heuristics in a hybrid algorithm is a novel approach
to error detection in radiology reporting, and, to the author’s knowledge, ASR in gen-
eral. While some of the error-detection heuristics presented in this Chapter may take their
inspiration from previous research, the generation of a hybrid algorithm for radiology error-
detection based on the complementary strengths of the component algorithms is a completely
original contribution.
Creating a Single Confidence Score – Output Normalization
A naïve error-mapping function based on the direct method maps all words to “correct”
by default and “incorrect” if any error-detection heuristic indicates it as erroneous. Thus,
the error tag is based on the assignment of a single “incorrect” tag by any individual
heuristic, and not on the combined results of all heuristics returning an error value for that
word. While straightforward, such an approach does not take advantage of each heuristic’s
contribution to the assessment of the text and fails to exploit the differences in the nature
of the output from each. For example, a heuristic may have a high recall value, meaning
a high detection of actual errors in a text, but a low precision, meaning that it may also
return a high number of false positives. By using a cumulative value based on all heuristic
input, the results from the various algorithms may suppress the effect of a false positive
in the final output. This also makes it possible to represent the complicated relationships
between algorithms, such as the case of a heuristic that is particularly strong at detecting
one type of error. In this instance, an indication of an error of that type might be more
heavily weighted than an indication from another heuristic. In a similar fashion, multiple,
overlapping heuristics would act as backup measures, suppressing erroneous outliers and
increasing the reliability of the final mapping.
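For contrast, the naïve direct mapping discussed above is trivial to state: a word is "incorrect" as soon as any single heuristic flags it. The flagging functions in this sketch are invented for illustration.

```python
# Sketch of the direct method: any single flag suffices to mark a word
# "incorrect"; false positives from one heuristic propagate unchecked.

def direct_tag(words, heuristics):
    return {w: "incorrect" if any(h(w) for h in heuristics) else "correct"
            for w in words}
```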
An example of a more intelligent mapping function might create a single confidence
score via a meta-analysis of the component heuristics. Such a meta-analysis will take into
account the individual effect each heuristic brings to bear on the overall confidence score.
A simple, meta-level algorithm then normalizes the results from each heuristic and averages
them to produce a final confidence score, as shown in Equation 4.1, where hi(x) represents
the normalized value of heuristic h as applied to word x and there are n heuristics.
c(x) = (∑i hi(x)) / n    (4.1)
A more complicated aggregation algorithm will weigh the effect of a heuristic’s output
value on the final confidence score, reflecting individual differences among the heuristics.
This could be applied globally to all results from that heuristic, or, as mentioned above, only
affect the weight of the confidence of those words whose error types are a particular strength
or weakness of the heuristic. In Equation 4.2, a global weighting schema is shown that applies
to all output from a particular heuristic, where wi reflects the particular weighting of the
heuristic hi. Each heuristic may additionally harbour internal weighting schemes affecting
its own, interim output.
c(x) = ∑i hi(x) wi    (4.2)
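Equations 4.1 and 4.2 can be sketched together once raw heuristic outputs are rescaled to a common range. The min-max normalization and the example ranges are illustrative assumptions, not the thesis's chosen scheme.

```python
# Sketch of output normalization plus aggregation: raw heuristic scores on
# different scales are min-max rescaled to [0, 1], then combined either as
# a plain average (Equation 4.1) or a weighted sum (Equation 4.2).

def normalize(raw, lo, hi):
    # Assumes lo < hi; a degenerate range would need special handling.
    return (raw - lo) / (hi - lo)

def combined_confidence(raw_scores, ranges, weights=None):
    normed = [normalize(r, lo, hi) for r, (lo, hi) in zip(raw_scores, ranges)]
    if weights is None:                                   # Equation 4.1
        return sum(normed) / len(normed)
    return sum(h * w for h, w in zip(normed, weights))    # Equation 4.2
```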
It is not the case that an error-detection heuristic will necessarily map a word to a value
of a type that is compatible with the output types of the other heuristics. For instance,
one heuristic may map words to binary results, such as {correct, incorrect}, while another
may map from a more abstract level than the word level, such as the concept level, where
multiple words may represent a single concept and thus have a single error tag. In these
cases, we must normalize the output types into a common type suitable for aggregating
the individual heuristic results into a single confidence value for each word in the text
(or phrase, et cetera, depending on the chosen level of focus). In the concept-level error
detection example, this involves mapping concepts and their confidence scores back to the
individual words comprising those concepts, since the goal is error detection at the word
level. In the direct method, if all heuristic output translated to a word-level mapping
into {correct, incorrect}, then we can determine a final mapping for each word in the input
text. In the indirect method, the results from each heuristic are translated into a score. The
heuristics’ scores for each word are combined according to a weighting scheme to regulate
their effect to produce a final confidence score and/or tag.
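In the concept-level case mentioned above, normalization means projecting each concept's score back onto its component words. A minimal sketch, with invented concept spans over word indices:

```python
# Sketch of mapping concept-level confidence scores back to the word level.
# Words outside any concept receive None, meaning this heuristic offers no
# score for them.

def concept_to_word_scores(words, concept_spans):
    """concept_spans: list of ((start, end), score) over word indices,
    end exclusive."""
    scores = [None] * len(words)
    for (start, end), score in concept_spans:
        for i in range(start, end):
            scores[i] = score
    return scores
```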
Figure 4.1 provides an abstract representation of the hybrid approach. The filter repre-
sents the application of any weighting schemes applied to any particular heuristic output. It
is separate in the figure as it may have no effect on the input, which will be passed on to the
next stage of processing. From there, the output for each heuristic is normalized (converted
to a common type), and then combined to form an error tag (or confidence value) to word
mapping for each word in the report.

[Figure 4.1: The abstract hybrid system. A report is passed through the component heuristics (syntactic analysis, semantic analysis, word occurrence probabilities, and other heuristics); a filter applies any weighting to each heuristic's output, the error types are converted to a common form and combined, and the result is the final error mapping.]
4.3 A Note on the Measure of Correctness
This dissertation views the measure of correctness of a document as a direct consequence of
the word-error rate (WER) as shown in Equation 4.3, thus Cor(d) is the degree of correctness
of document d.
Cor(d) = 1 − WER    (4.3)
Alternative measures exist, such as the ratio of errors counted in a text to the number of
correct words counted; however, there is little evidence motivating the use of one measure
over another.
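Equation 4.3 can be made concrete with a standard word-level edit-distance computation. This is a generic WER sketch, not code from the thesis:

```python
# Word error rate via Levenshtein distance over word sequences, from which
# Cor(d) = 1 - WER (Equation 4.3) follows directly.

def wer(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j]: edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[-1][-1] / len(ref)

def correctness(reference, hypothesis):
    return 1 - wer(reference, hypothesis)
```

For the chapter's own example, "a tear at the crucial" against the reference "a tear at the cruciate ligament" involves one substitution and one deletion over six reference words.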
4.4 The Error-Detection Heuristics
A hybrid application of error-detection algorithms means increased sensitivity to errors and
error types. At the very least, the potential for 100% recall requires that the component
heuristics range over all error types as listed in Section 3.1.2. This means that regardless of
a hybrid method’s precision, it must at least be capable of detecting errors of all types.
The choice of heuristics for inclusion in the hybrid algorithm is an important question.
The motivation behind such a choice is twofold. First, choosing heuristics which are com-
plementary in their range of error types ensures that all types can be detected. Second, by
choosing algorithms with the greatest breadth of error type coverage at the relevant error
levels, the overlapping range of detection acts as a backup against false negatives. Together
these heuristics help to smooth out any weaknesses found within one approach and increase
the reliability of the output. Furthermore, since heuristics, by their very nature, are not
perfect methods, each heuristic aids in verifying and corroborating the results of the other
heuristics where their coverage overlaps.
Based on my review of the literature, presented in Chapter 3, I have selected the follow-
ing error-detection methods for their success in other applications (most notably dialogue
systems), their coverage of error report types, and their appropriateness for the radiology-
reporting domain. The intersection of the range of each heuristic’s output error types is
such that Structural, Syntactic, and Semantic errors are covered.
Together, the heuristics involve three levels of analysis:
Semantic analysis Semantic errors, generally covering all error types except stop words
and deletion errors.
Syntactic analysis Syntactic errors, generally covering all error types, including stop
words and deletions.
Word occurrence probabilities Semantic, syntactic and structural errors, generally cov-
ering all error types except stop words and deletion errors.
In addition to choosing a heuristic to cover each error type minimally, further heuristics
can be added to cover any error type in the interest of further insurance against system
errors.
4.4.1 Semantic Analysis
Since ASR works on the basis of the most probable translation of an audio signal to a word
in the lexicon, there is no restriction on the meaning of that word. Thus, in many cases even
if the word “giraffe” has exceptionally low probability of appearing in a radiology report, an
ASR system may still choose that word if it has the highest match probability with respect
to the audio input since it is a valid member of the lexicon. Such semantic errors can be
detected following recognition using techniques that rely on the meaning of words and how
those meanings constrain word use in a particular language or subset of language. In a
domain such as radiology, the concept base is more limited, making it possible to perform
more in-depth semantic analysis.
Observation 5 Concepts within a radiology report share a measurable degree of semantic
similarity.
Conceptual Similarity
Conceptual similarity is a high-level metric for determining the semantic relatedness (as
defined in Chapter 3) of two concepts. The aim is to exploit this similarity measure to
identify out-of-context or off-topic words or phrases that may indicate a recognition error.
Such measures are useful in constrained domains where language use is restricted to a subset
of that language. Given an ontological representation of the concepts within a domain, it
is possible to directly measure the distance between any two concepts, provided they occur
within that ontology. A more detailed discussion of this technique is available in 3.6.2. Such
a measure of distance is not intended as an exact measure, but rather an approximation of
semantic relatedness that is a consistent measure regardless of the concepts involved. Thus,
assumptions regarding the symmetrical nature of the distances between nodes within an
ontology should have little bearing on the overall result. When the distance between two
concepts lies beyond a certain threshold, this may be indicative of an error. To increase the
utility of this result, a weighting schema can be introduced to reflect the depth of concepts
within the tree. That is, a comparison of concepts at a relatively shallow depth should not
have as much impact on the confidence score as those at a deeper, more specified level.
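A toy sketch of the distance measure, assuming a tree-shaped ontology. The anatomy fragment, the parent table, and the depth convention are all invented for illustration; a real system would use a medical ontology such as the one discussed in Chapter 3.

```python
# Path distance between two concepts through their lowest common ancestor
# in a tiny, invented anatomy tree; depth can then weight comparisons so
# that shallow ones count for less.

PARENT = {                      # child -> parent
    "knee": "leg", "ligament": "knee", "meniscus": "knee",
    "leg": "body", "arm": "body", "elbow": "arm",
}

def path_to_root(concept):
    path = [concept]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path

def distance(a, b):
    """Number of edges from a to b via their lowest common ancestor."""
    pa, pb = path_to_root(a), path_to_root(b)
    ancestors = {c: i for i, c in enumerate(pa)}
    for j, c in enumerate(pb):
        if c in ancestors:
            return ancestors[c] + j
    return len(pa) + len(pb)    # disjoint trees: maximal distance

def depth(concept):
    return len(path_to_root(concept)) - 1
```

A distance above some threshold would flag a concept as a possible error, with the flag weighted by depth.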
The primary challenge in a conceptual-similarity metric as an error-detection heuristic
is the choice of concepts from a text to compare. Within radiology, at the report level, a
topic marker can be generated that reflects the overall topic of the report. For instance,
each radiology report focuses on a particular anatomical region, such as an MRI of the
knee. This can be used to set the topic marker to “knee”, priming the system to expect
input relevant to that topic. Thus, concepts within the body of the report can be compared
to the topic marker to check for relevance. At a lower level, within the body of the report,
concepts occurring within the same context window can be compared. This context window
can be restricted to consider only those concepts within a certain radius, those within the
same section in the report, or even the set of all concepts within the report. If the ontology
supporting the similarity analysis can be shown to be complete with respect to the lexicon,
or contain a very high percentage of the lexical items most likely to appear in an actual
report, the inability to find a concept within that ontology can also be used as an indicator
of errors in the form of medically irrelevant concepts. The usefulness of such ontological
outliers can be improved if the concepts considered from the report are restricted to those
belonging to a more limited subset of the domain. The ontology is more likely to contain
these concepts, making the absence of a concept an informative measure.
Semantic Grammar
Observation 6 Semantic relationships between entities within a radiology report can be
exploited to identify likely error candidates.
A semantic grammar defines the rules of language based on the major semantic classes
within the domain of discourse [81]. Thus, instead of constraining words on the basis of the
syntactic, or structural, role they play, they are constrained on the basis of meaning, where
meaning is defined in terms of the semantic classes. For example, a semantic grammar for
a flight scheduling dialogue system might include the following query rule [81]:
InfoRequest → when does Flight arrive in City
Based on such a query, the parser is able to predict the semantic category of upcoming
words in a text. Therefore, when a word does not fit the expected category, it can be
flagged as a potential error. The result is a grammar of rules that are highly dependent on
a particular domain. Within radiology, however, this is appropriate, though it may hinder
expanding the error-detection system beyond the radiological domain later on as the rules
will be radiology-specific.
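A minimal sketch of category checking against a semantic-grammar rule, in the style of the InfoRequest example. The categories, the rule, and the convention that a capitalized slot names a semantic class are all invented:

```python
# A rule mixes literal words and semantic-class slots; a word whose
# category does not match its slot is flagged as a potential error.

CATEGORIES = {
    "tear": "Finding", "effusion": "Finding",
    "ligament": "Anatomy", "meniscus": "Anatomy",
    "crucial": "Modifier",
}

RULE = ["Finding", "at", "Anatomy"]     # e.g. "tear at ligament"

def check_rule(words, rule):
    """Return indices of words violating the rule's expected categories."""
    errors = []
    for i, (word, slot) in enumerate(zip(words, rule)):
        if slot[0].isupper():                    # semantic-class slot
            if CATEGORIES.get(word) != slot:
                errors.append(i)
        elif word != slot:                       # literal slot
            errors.append(i)
    return errors
```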
While the rules themselves are typically not hard to express in a semantic grammar, there
must exist a rule for each possible semantic pattern, and each possible syntactic form (for
example, the active and passive voice differ in their structure and arrangement of concepts
within). As a result the development of the grammar is a time-consuming process.
In addition to the general semantic rules, within all coherent texts it is possible to identify
semantic relationships existing between the concepts within the text. These relationships
can exist at multiple levels:
• Syntactic, or physical placement of one concept relative to another (or relative to
functional words such as prepositions).
• Semantic, such as thematic roles describing the expectations one word has of its ar-
guments1.
• Discourse levels, such as the relationships existing at the scale beyond the sentence.
Beyond these levels, relationships at levels of abstraction specific to the domain, such
as anatomy or causality, also exist. These describe domain specific constraints on the
relationships between concepts.
All of these levels of analysis provide information about the expected relationships be-
tween concepts and how those are expressed. For example, two concepts may be linked via a
limited selection of prepositions that define the nature of that relationship, such as a person
puts clothes on themselves, not in themselves. Identifying domain-specific archetypes can
be challenging, such as the thematic roles that help characterize radiology texts. A full
analysis of these relationships is beyond the scope of this thesis; however, their applicability
to future enhancements is discussed in Chapter 6.
Unlike the conceptual-similarity analysis, which determines a quantitative measure of
similarity on the basis of the distance between two (or more) concepts in an ontology, a
semantic grammar constrains the meaning of words or concepts based upon the categories
to which they belong. While related, the former assigns a numerical value measuring relat-
edness, and the latter identifies those words or concepts whose semantic categories do not
match the expected categories.
1For example, a transitive verb requires one or more complements describing the objects on which it acts. These complements may be semantically restricted to animate objects, for example, in a non-fictional setting. Examples include “has qualifier”, “has role”, “pertains to”.
4.4.2 Syntactic Analysis
Parsing
Observation 7 The points of failure in a syntactic parse can be used to identify likely error
candidates.
Although statistical methods have dominated error detection in ASR, their use of stop
lists and surface-level analysis prevents such systems from achieving 100% accuracy. To fill
this gap, non-statistical methods can offer a more in-depth analysis of the features within a
text.
Syntactic recognition errors include words or phrases that are out of place with respect
to their syntactic placement. In a misrecognized text, for instance, a verb may occur in
the text where the syntactic analysis would predict a noun. It is possible to identify these
syntax errors and apply a weight to determine a confidence score for words within the text.
For example, the phrase “a tear at the cruciate ligament” may be misrecognized as “*a
tear at the crucial”. In the misrecognized sentence the word “crucial” is located where a
noun phrase is expected, in contrast with the correct sentence, which contains the noun
phrase “cruciate ligament” in this location. The lack of a noun phrase in the incorrect
sentence identifies a potential deletion (i.e. the correct word was deleted), a misrecognition
of “crucial”, or both.
Thus, as a component of the hybrid approach to error detection, a syntactic parser can
be used to identify syntactic errors, including those which involve stop words and deletions.
In addition, while native English speakers are unlikely to make grammatical errors, those
who have learned English as a non-native language may have some problems with grammar.
These mistakes can also be detected via syntactic analysis. The sensitivity to syntactic errors
can be adjusted using a meta-level heuristic, which controls the effect each error-detection
heuristic brings to bear on the final analysis as mentioned in Section 4.2.
In radiology, information is often recorded using incomplete sentences or “bulleted” form.
In such cases the system must be sufficiently flexible to allow for these looser constructions,
while of sufficient granularity to detect when a sentence fragment is likely ungrammatical.
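A toy illustration of the syntactic check behind the "cruciate ligament" example: after a determiner and any adjectives, a noun must appear before the phrase ends. The POS lexicon and the phrase rule are invented, and a real implementation would use a full parser tolerant of bulleted fragments.

```python
# Flag determiners whose noun phrase never reaches a noun, as in the
# misrecognized "*a tear at the crucial".

POS = {
    "a": "DET", "the": "DET", "tear": "NOUN", "at": "PREP",
    "cruciate": "ADJ", "crucial": "ADJ", "ligament": "NOUN",
}

def flag_unfinished_noun_phrases(words):
    """Return indices of determiners starting a noun phrase with no noun."""
    errors = []
    i = 0
    while i < len(words):
        if POS.get(words[i]) == "DET":
            j = i + 1
            while j < len(words) and POS.get(words[j]) == "ADJ":
                j += 1
            if j >= len(words) or POS.get(words[j]) != "NOUN":
                errors.append(i)     # phrase started but no noun followed
            i = j
        else:
            i += 1
    return errors
```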
4.4.3 Word Occurrence Probabilities and “N-gram” Models
“You shall know a word by the company it keeps”
–J. R. Firth, English linguist, A Synopsis of Linguistic Theory, 1957
For the purposes of this initial experiment, there are two probabilistic techniques that
have been developed for the proposed hybrid method, namely co-occurrence relations and
Pointwise Mutual Information (PMI). Underlying these approaches is the key notion that by
identifying patterns common to error-free reports, we can automatically detect inaccuracies
within novel reports.
The first technique is based on my earlier work described in Voll et al [154] in which
co-occurrence relations [81, 96, 131] were found to have a high recall in detecting errors in
radiology reports. Given a sufficiently representative training corpus, words are associated
with particular contexts based on that corpus. These word-context statistics are then applied
to determine the probability of a word occurring in a given context in a report. This
probability represents a measure of the confidence in that word; if it falls below a certain
threshold the word will be flagged as a possible error.
The second technique is based on the work of Inkpen and Desilets, who report similar
results using PMI. They also discuss other techniques previously employed, but conclude
that PMI performs the best, in part because of the potential to scale up well to larger
databases (which is ultimately desired for better characterization of radiology reports) [75,
149].
By choosing two statistical algorithms, the results can be combined via the indirect
method2 to smooth out any anomalies within the calculations themselves to produce more
reliable results. The results can also be used towards a comparative evaluation of the two
techniques.
The Context Window
Statistical techniques to error detection rely on the properties of the environment in which
a word occurs. This environment can be defined in a variety of ways, from simply any word
in the neighbourhood of the target word, to only those words of similar type (such as all
nouns or verbs), to words meeting other criteria in common with the target word. A word’s
2See Section 4.2.
neighbourhood is called the “context window” and refers to those words that co-occur with
the target word in the text. The context window can be any size measured as the n words to
the left and right of the target word, where n can be the size of the entire text or as small as
a single word and does not necessarily refer to consecutive surrounding words. The choice
of size can have an impact on the accuracy and generality of certain statistics. If the
window is too small to provide a sufficient sample, the feature will be inaccurately
represented. Similarly, if the window is too large there is risk of a “cross-pollination” of
features that interfere with the statistics. From an efficiency perspective, a large window size
can introduce issues with respect to tractability. Methods relying on the context in which a
word occurs are often referred to as N-gram methods, essentially corpus-based probabilistic
models of a text. Recall that a “unigram” refers to the word itself, a “bigram” to a two-word
pairing, and so on. Based on these models, it is possible to build the statistical estimators
to determine probability estimates for words or features in a text [96].
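Extracting a symmetric context window of size n around a target word, the basic operation behind these statistics, can be sketched directly:

```python
# The n words to the left and right of index t; near the edges of the
# text, fewer than 2n words are returned.

def context_window(words, t, n):
    left = words[max(0, t - n):t]
    right = words[t + 1:t + 1 + n]
    return left + right
```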
The Training Corpus
A training corpus is a corpus or body of text intended as a representative sample of a
language. This language can be as broad as an entire natural language, such as English,
or restricted to particular domains of discourse within that language, such as medicine. If
such a corpus is representative of the language, then the statistical properties of this limited
sample set can be generalized to the entire domain.
The immediate challenge facing corpus-based linguistics is the notion of what constitutes
a “representative” sampling of the domain. In short, within a truly representative corpus
any properties observed must be extensible to the entire domain. Unfortunately, it is often
the case that researchers relying on such corpora are unable to choose the sample set, or
cannot identify it as truly representative. Thus, the adage “more is better” applies here,
with the idea that the larger the sample set the more likely it is to represent the domain. The
tradeoff is tractability. In addition, a sample set will be more representative of a smaller
domain than a larger one. In general, however, we must be aware that any statistical
analyses based on a limited corpus may introduce errors when extended to the full domain
[96]. A discussion of smoothing to avoid problems of data sparseness in corpora is provided
in Section 6.8.
In error detection, the statistical property most useful is the probability of a word
occurring in a text. This can be estimated from a training corpus by counting the number
of occurrences within that corpus and dividing by the total number of words in the corpus.
Similarly, given a context window of size n, it is also possible to calculate the probability of
a word occurring with any word from that context window. This results in a database of
two-word probabilities; the probability of any two words occurring together within a given
context-window size.
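The construction of this two-word probability database can be sketched as follows; the corpus format (a list of tokenized reports) is an assumption, and no smoothing is applied here.

```python
# Build unigram probabilities and the probability of each ordered word
# pair co-occurring within a context window of size win, estimated by
# simple counting over a training corpus.

from collections import Counter

def build_statistics(reports, win):
    unigrams, pairs = Counter(), Counter()
    total = 0
    for words in reports:
        total += len(words)
        unigrams.update(words)
        for t, w in enumerate(words):
            for c in words[max(0, t - win):t] + words[t + 1:t + 1 + win]:
                pairs[(w, c)] += 1
    p_word = {w: n / total for w, n in unigrams.items()}
    npairs = sum(pairs.values())
    p_pair = {wc: n / npairs for wc, n in pairs.items()}
    return p_word, p_pair
```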
Co-occurrence Relations
Observation 8 The probability of a misrecognized term occurring in a radiology report is
lower than the probability of a correctly recognized term.
As discussed in Chapter 3, co-occurrence relations can be a reliable indicator of the
probability of a word in the context of a report. Conceptually, the probability of a word
occurring independently is combined with the probability of it occurring in a given context.
Using the two-word statistics generated from the training corpus it is possible to combine
these results to generate a single probability score for the target word.
One means for determining the probability of a word given its context words is Bayes’
Theorem. Bayes’ Theorem evaluates the most probable hypothesis based upon the observed
information so far [102]. This can be applied to error detection by considering a word’s
occurrence in a report as a hypothetical statement about the world of radiology reports.
Similarly, the context of that word can be viewed as the observed data so far. Thus, given
a word, x, and a list of context words for x, C, Bayes’ Theorem is defined as follows [102]:
P(x|C) = P(x) ∗ P(C|x) / P(C)    (4.4)
In Equation 4.4 P (x) is referred to as the prior probability, that is the probability of
the word x occurring regardless of its context. This is the probability of x occurring in the
training corpus. In contrast, the desired quantity, P (x|C) is the posterior probability, the
probability that x will occur given its context is C. To arrive at a value for P (x|C), the
prior probability as well as the independent probability of the context P (C) and finally the
probability of the context occurring given that x does occur in that context, P (C|x), are
combined [102]. The denominator, P (C), is a normalization factor. Since it is a constant
and assumed to be independent of the target word x for the purposes of this calculation it
can be dropped.
While the probability of any individual word is simply its occurrence in the training
corpus divided by the number of words in that corpus, the probability of a given context
(a set of words) is more difficult. Thus, the calculation of P (C|x) is also complex. This is
handled by the principle of Joint Probability shown in Equation 4.5.
P(x) ∗ P(C|x) = P(x) ∗ P(C1, ..., Cn|x)            (by joint probability)
              = P(x) ∗ P(C1|x) ∗ ... ∗ P(Cn|x)     (assuming the Ci are conditionally independent given x)    (4.5)
              = P(x) ∗ ∏i=1..n P(Ci|x)
Bayes’ Theorem is a straightforward approach to determining the probability of a word
given its context, although other methods could be easily applied here.
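The score of Equations 4.4 and 4.5 can be sketched directly from the two-word statistics. The estimate P(c|x) = P(x, c)/P(x) and the toy probabilities in the test are illustrative assumptions:

```python
# Unnormalized naive-Bayes confidence: P(x) * prod_i P(c_i | x), with the
# constant denominator P(C) dropped. A zero score flags a word never seen
# with its context in the training corpus.

def bayes_score(x, context, p_word, p_pair):
    px = p_word.get(x, 0.0)
    if px == 0.0:
        return 0.0                        # word unseen in training corpus
    score = px
    for c in context:
        score *= p_pair.get((x, c), 0.0) / px    # P(c|x) = P(x, c) / P(x)
    return score
```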
Pointwise Mutual Information
Like co-occurrence relations, the PMI value of a word and its surrounding context can be
a useful measure of its likelihood of being correct. Again, by determining a context for
each word, it is possible to use the probability-statistics generated for the training corpus to
calculate the probabilities necessary for the PMI calculation. For the complex calculation
of a word and its context window P (x,C), any number of aggregation techniques can be
applied. The simplest of these is an average over the individual probability of x occurring
with each word in C. Inkpen and Desilets [75] looked at three aggregation techniques, for
which they found that averaging performed slightly better. Thus, for simplicity, this will be
the method of choice for application of PMI to radiology-text error detection.
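A sketch of the PMI confidence with averaging as the aggregation technique; the handling of unseen pairs (negative infinity) and the toy probabilities are illustrative choices, not from Inkpen and Desilets.

```python
# Pointwise mutual information of the target word with each context word,
# aggregated by averaging over the context window.

import math

def pmi(x, c, p_word, p_pair):
    joint = p_pair.get((x, c), 0.0)
    if joint == 0.0:
        return float("-inf")     # pair never observed in training
    return math.log2(joint / (p_word[x] * p_word[c]))

def avg_pmi(x, context, p_word, p_pair):
    return sum(pmi(x, c, p_word, p_pair) for c in context) / len(context)
```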
4.5 A Formalization of the Hybrid Approach to Error
Detection in Radiology
Given the discussion in the first half of this chapter, it is now possible to state a formalized
theory of hybrid error detection in radiology. To help the reader, the following formalizations
are each augmented with an English gloss and, wherever possible, examples from the domain.
4.5.1 General Definitions
Let Z be the set of integers.
Let N be the set of natural numbers.
Let L be a lexicon of words in the English language. Since theoretically a report can
contain any English word, L is not restricted to any subset of English.
Let R be a tuple s.t. R ∈ L~. Here L~ is defined as follows:

L~ = {(x1, ..., xn) | n ≥ 1 ∧ xi ∈ L ∧ 1 ≤ i ≤ n}    (4.6)
In other words, L~ is the set of all possible tuples created over the lexicon L, while R
is a tuple from this set representing a natural language, free-text, radiology report. Note
that the remainder of this chapter adopts the notation convention where a report of size n
is denoted Rn.
Let win be an integer representing the context-window size, s.t. win ∈ N.
Let th be an integer representing a threshold number, s.t. th ∈ Z. The threshold is
a constant that controls the degree of filtering. Recall that the error-detection algorithm
returns a value representing the confidence assigned each word in a report. If, for any word,
that confidence value exceeds the threshold, it will be tagged as an error.
Let EDA be a set of functions representing the error-detection heuristics (each of these
will be subsequently defined):
EDA = {parser, cor, pmir, sdr} (4.7)
In the usual notion of a set, any duplicate members within that set are not considered
distinct and therefore cannot be counted (as is needed in our probability calculations). A
multiset or mset, however, is defined as a “set” in which repeated elements are allowed [12].
Within set theory, we can more precisely define an mset M as a pair (A,m) where A is
some set and m : A → N is the multiplicity function. The set A is called the underlying set
and is defined as U(M), while m(M) defines a mapping from elements in A, to the number
of times they occur in M .
An mset is frequently written as a set of ordered pairs where the first element is the
underlying set, and the second element is the definition of the multiplicity function. For
example, the mset {a, a, b, c, c, c} is the “set” containing 2 a’s, 1 b, and 3 c’s, which is defined
as ({a, b, c}, {(a, 2), (b, 1), (c, 3)}) and where U(M) = {a, b, c}.
Let TC be a set of radiology reports that represents the training corpus, s.t. TC ⊆ L~.
Let TW be an mset where U(TW ) = {w1, · · · , wn} and wi ∈ TC (s.t. 1 ≤ i ≤ n).
Calculating co-occurrences:
The co-occurrences in a given report Rn represent a symmetric context window: the set
of tuples defined by pairing a word x in Rn with each of the win words occurring to the left
of x, and the win words occurring to the right of x.
The following preliminary functions are needed to define co-occurrences formally. Con-
sider a report Rn. Let t represent the index of a given word in Rn, called the target word
(this convention will be continued through the remainder of this chapter). The functions
before and after determine the win words occurring before and after the target word t,
respectively. Taking the target word wt, report Rn, and the window size win, each function
returns either the list of win words before wt in report Rn, or the list of win words after wt
in Rn.
Let before : L~ × N × N → 2^L be defined as follows:
before(Rn, t, win) = {xi | xi ∈ Rn ∧ max(1, (t − win)) ≤ i < t}    (4.8)
Let after : L~ × N × N → 2^L be defined as follows:
after(Rn, t, win) = {xi | xi ∈ Rn ∧ t < i ≤ min((t + win), n)}    (4.9)
Restricting the boundaries of i by max and min in before and after, respectively,
accounts for target words occurring near the beginning or end of a report (which may cause
the total number of words in the context window to be less than 2win since there will be
fewer words returned by before or after).
For example, given the following trivial report (the indices are added for clarity):
R test5 = (the0, xray1, shows2, nothing3, abnormal4)
From R test5 it is possible to calculate the following for the target word xt = shows2 (t = 2):
before(R test5, 2, 2) = {the0, xray1}
after(R test5, 2, 2) = {nothing3, abnormal4}
The definitions of before and after can be used to determine the co-occurrence relations.
The function co will take a report Rn (of size n), a target-word indexed by t, and a window
size win, and return all co-occurrence pairs that occur in Rn. That is, all pairs where xt is
the first element, and the second element is from the set of words that occur win words to
the left and win words to the right of xt.
Let co : L~ × N × N → 2^(L×L) be defined as follows:
co(Rn, t, win) = {(xt, xi) | xi, xt ∈ Rn ∧ xi ∈ before(Rn, t, win)} ∪
                 {(xt, xi) | xi, xt ∈ Rn ∧ xi ∈ after(Rn, t, win)}    (4.10)
For example, given R test5 above (where x2 = shows2):
co(R test5, 2, 2) = {(shows2, the0), (shows2, xray1), (shows2, nothing3), (shows2, abnormal4)}
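The definitions of before, after, and co (Equations 4.8–4.10) translate directly into code. The sketch below is illustrative only (the thesis tooling was written in Perl, not Python) and uses 0-based indexing, matching the example indices in R_test5:

```python
def before(report, t, win):
    """The up-to-win words immediately preceding index t (cf. Equation 4.8)."""
    return report[max(0, t - win):t]

def after(report, t, win):
    """The up-to-win words immediately following index t (cf. Equation 4.9)."""
    return report[t + 1:t + 1 + win]

def co(report, t, win):
    """All (target word, context word) pairs for index t (cf. Equation 4.10)."""
    return [(report[t], w) for w in before(report, t, win) + after(report, t, win)]

r_test5 = ("the", "xray", "shows", "nothing", "abnormal")
```

For the target word at t = 2, co(r_test5, 2, 2) yields the four pairs listed above; near the report boundaries the window truncates, exactly as the max/min bounds require.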
Next, a function is needed to define the co-occurrences over the training corpus, TC.
Let tcs be the number of reports in TC.
Let ni be the number of words in a report TCi.
Let trainingCOs : 2^(L~) × N → 2^(L×L) × N be a function generating an mset of
co-occurrences based upon the training corpus, TC:
trainingCOs(TC, win) = ⋃(i=1..tcs) ⋃(t=0..ni−1) co(TCi, t, win)    (4.11)
For example, consider the following trivial case:
TCtest = {(the0, xray1, shows2, nothing3, abnormal4), (the0,mri1, is2, unremarkable3)}
It is possible to determine the following, based upon the definitions so far (where TCi refers
to the ith report in TCtest) with a sample window size of 2 (win = 2):
co(TC1, 0, 2) = {(the0, xray1), (the0, shows2)}
co(TC1, 1, 2) = {(xray1, the0), (xray1, shows2), (xray1, nothing3)}
co(TC1, 2, 2) = {(shows2, the0), (shows2, xray1), (shows2, nothing3), (shows2, abnormal4)}
co(TC1, 3, 2) = {(nothing3, xray1), (nothing3, shows2), (nothing3, abnormal4)}
co(TC1, 4, 2) = {(abnormal4, shows2), (abnormal4, nothing3)}
A similar result is obtained for TC2. Given this, trainingCOs(TCtest, 2) produces the
following mset as the combination of the results from TC1 and TC2 as per our definition of
trainingCOs:
{co(TC1, 0, 2) ∪ co(TC1, 1, 2) ∪ co(TC1, 2, 2)∪
co(TC1, 3, 2) ∪ co(TC1, 4, 2) ∪ co(TC2, 0, 2)∪
co(TC2, 1, 2) ∪ co(TC2, 2, 2) ∪ co(TC2, 3, 2)}
The above mset is defined by the set of ordered pairs consisting of a tuple, representing the
co-occurrence, and a cardinal number, representing the count of the number of times that
co-occurrence occurs in TC (over a window size win).
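Representing the co-occurrence mset with an explicit multiplicity function maps naturally onto a counting dictionary. A minimal sketch (illustrative names only; collections.Counter plays the role of the pair (A, m)):

```python
from collections import Counter

def co(report, t, win):
    """Co-occurrence pairs for the word at 0-based index t (cf. Equation 4.10)."""
    left = report[max(0, t - win):t]
    right = report[t + 1:t + 1 + win]
    return [(report[t], w) for w in left + right]

def training_cos(corpus, win):
    """The mset of all co-occurrence pairs over the corpus (cf. Equation 4.11)."""
    pairs = Counter()
    for report in corpus:
        for t in range(len(report)):
            pairs.update(co(report, t, win))
    return pairs

tc_test = [("the", "xray", "shows", "nothing", "abnormal"),
           ("the", "mri", "is", "unremarkable")]
```

The Counter's keys form the underlying set U(M) and its values the multiplicities m; summing the values counts every pair with repetition, which is exactly what the probability calculations below require.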
Calculating Probability
Recall that the probability of an element with respect to a corpus is the number of times
that element occurs in that corpus divided by the size of the corpus.
Let countPair : (L × L) × 2^(L~) × N → N be the number of times that the pair (xi, xj)
occurs in the mset (trainingCOs(TC, win), m) (recall m from our definition of mset above).
countPair((xi, xj), TC, win) = {n | ((xi, xj), n) ∈ m(trainingCOs(TC, win))}    (4.12)
Similarly, let countWord : L × 2^(L×N) → N be the number of times that a word, xi, occurs
in the mset TW.
countWord(xi, TW) = {n | (xi, n) ∈ m(TW)}    (4.13)
Let p1 : L → R be a function representing the probability of an element xi occurring in
the training corpus words TW .
p1(xi) = { 0                             if countWord(xi, TW) = 0
         { countWord(xi, TW) / |TW|     if countWord(xi, TW) > 0    (4.14)
Note that in the first case, when xi ∉ TW, the probability of that word is zero.
Let p2 : (L × L) × N → R be a function representing the probability of a pair (xi, xj) occurring
in the training corpus co-occurrences, as defined by the window size win.
p2((xi, xj), win) = { 0                                                          if countPair((xi, xj), TC, win) = 0
                    { countPair((xi, xj), TC, win) / |trainingCOs(TC, win)|      if countPair((xi, xj), TC, win) > 0    (4.15)
As with p1, the first case of p2 captures when the co-occurrence does not appear in the mset
defined by trainingCOs(TC, win), and thus the probability of that co-occurrence is zero.
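The two probability functions reduce to relative frequencies over the counts. A sketch with toy counts (loosely drawn from the "quadriceps" numbers in Table 5.1, not the real corpus statistics); note that a Counter returns 0 for unseen keys, which realizes the zero cases of Equations 4.14 and 4.15:

```python
from collections import Counter

# Toy counts standing in for the training words TW and the co-occurrence mset.
word_counts = Counter({"quadriceps": 123, "tendon": 500})
pair_counts = Counter({("quadriceps", "tendon"): 38, ("quadriceps", "patellar"): 32})

def p1(word):
    """Relative frequency of word in TW; 0 when unseen (cf. Equation 4.14)."""
    return word_counts[word] / sum(word_counts.values())

def p2(pair):
    """Relative frequency of a co-occurrence pair; 0 when unseen (cf. Equation 4.15)."""
    return pair_counts[pair] / sum(pair_counts.values())
```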
Calculating PMI
The function pmi calculates all co-occurrence pairs for Rn (a report of size n) given
an index t for a word within that report, and a window size win. It then applies the PMI
calculation3 to those pairs, and returns a real number ri for each such pair. Note that
all pairs will have as their first element the word at index t based upon the definition of
co-occurrence.
As before, let Rn be a report of size n, let t be the index of a word in Rn, and win the
window size such that 0 ≤ win ≤ n. Then, let pmi : L~ × N × N → 2^R be defined as the
following.
3 From Section 3.6.1.
pmi(Rn, t, win) = {ri | (xt, xi) ∈ co(Rn, t, win) ∧
                        ri = p2((xt, xi), win) / (p1(xt) × p1(xi)) ∧
                        1 ≤ i ≤ n}    (4.16)
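The score in Equation 4.16 is the ratio of the pair probability to the product of the word probabilities (classic PMI takes the logarithm of this ratio; Section 3.6.1 gives the exact form used, so only the bare ratio is sketched here, with illustrative names):

```python
def pmi_score(p_pair, p_x, p_y):
    """p2((x, y), win) / (p1(x) * p1(y)); taken as 0 when a marginal is unseen."""
    if p_x == 0.0 or p_y == 0.0:
        return 0.0
    return p_pair / (p_x * p_y)
```

A ratio above 1 means the pair co-occurs more often than independence would predict; rarely associated pairs score near 0, which is what pushes misrecognized words toward the threshold.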
Applying Bayes’ Theorem
Given the definitions of Bayes’ Theorem previously in this chapter (see Equations 4.4 and
4.5), the following formal definitions are possible.
The function bt takes a word x, and a list of words {y1, · · · , yn}, representing all words
with which x co-occurs in some report (that is, x’s context), and returns a real number
representing the probability of that word occurring in that context. Note that there is no
denominator (compare to Equation 4.4). This is because the denominator would represent
the probability of observing the context of x, namely p({y1, · · · , yn}), which in this case is
1 since the context has already been observed. Thus it has been omitted.
Let x be a word and let {y1, · · · , yn} be the set of context words for x. Then, let bt :
L × L^n × N → R be a function for applying Bayes' Theorem to a word and its context, given
a particular window size.
bt(x, {y1, · · · , yn}, win) = p1(x) × ∏(i=1..n) p2((x, yi), win)    (4.17)
The function context returns the words of the context window surrounding a word, xt,
in a report of size n, Rn.
Let context : L~ × N × N → 2^L be defined as the following.
context(Rn, t, win) = {xi | (xt, xi) ∈ co(Rn, t, win) ∧ 1 ≤ t ≤ n}    (4.18)
For example, recall the test report, R test5 = (the0, xray1, shows2, nothing3, abnormal4):
context(R test5, 2, 2) = {the0, xray1, nothing3, abnormal4}
4.5.2 The Error-Detection Algorithm
With the above definitions, the individual error-detection heuristics can now be formalized.
Co-Occurrence Report Analyser
The function cor applies a co-occurrence analysis to a report Rn, producing a confidence
value for each word within that report, based on its occurrence in TC. Here Bayes' Theorem
is used to aggregate the results of the co-occurrence analysis (co(Rn, i, win)) on the context
window of each word xi into a single value. This value is then compared to a threshold th,
and only those words for which the value falls at or below the threshold are returned as errors.
Let cor : L~ × N → 2^L be defined as the following.
cor(Rn, win) = {xi | xi ∈ Rn ∧
                     ri = bt(xi, context(Rn, i, win), win) ∧
                     ri ≤ th}    (4.19)
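The aggregation in cor is a product over the context (Equations 4.17 and 4.19). The sketch below supplies the probabilities as toy dictionaries in place of the trained statistics; all names and values are illustrative, not the thesis implementation:

```python
from math import prod

def bt(word, context, p1, p2):
    """p1(x) times the product of p2((x, y)) over context words y (cf. Equation 4.17)."""
    return p1.get(word, 0.0) * prod(p2.get((word, y), 0.0) for y in context)

def cor(report, win, th, p1, p2):
    """Flag words whose Bayes-aggregated confidence falls at or below th (cf. Equation 4.19)."""
    errors = []
    for i, word in enumerate(report):
        context = report[max(0, i - win):i] + report[i + 1:i + 1 + win]
        if bt(word, context, p1, p2) <= th:
            errors.append(word)
    return errors

# Toy probabilities standing in for the training-corpus statistics.
P1 = {"the": 0.3, "shows": 0.5, "xray": 0.2}
P2 = {("the", "shows"): 0.5, ("shows", "the"): 0.5,
      ("shows", "xray"): 0.5, ("xray", "shows"): 0.5}
```

An out-of-vocabulary word such as "eye" gets probability zero, drags its whole Bayes product to zero, and is flagged along with its disrupted context.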
Pointwise-Mutual-Information Report Analyser
Let aggregate : 2^R → R be a function which collects the results from applying pmi to a
report into one value. As there are many approaches to such aggregation, the specifics are
not defined here4.
The function pmir determines an aggregated PMI score for each word xi in the report
Rn and xi’s context. The PMI score is then compared to a threshold, and only those results
whose value falls below the threshold are returned as an error.
Let pmir : L~ × N → 2^L be defined as follows.
pmir(Rn, win) = {xi | xi ∈ Rn ∧
                      zi = aggregate(pmi(Rn, i, win)) ∧
                      zi ≤ th}    (4.20)
4 In Inkpen and Desilets, the authors discussed several options for aggregating PMI results; in this dissertation, for instance, the results are simply summed and averaged.
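A sketch of pmir (Equation 4.20) using sum-and-average aggregation, the option the footnote mentions. The PMI scores per pair are supplied as a toy table; names and values are illustrative:

```python
def aggregate(scores):
    """Sum-and-average aggregation of a word's PMI scores."""
    return sum(scores) / len(scores) if scores else 0.0

def pmir(report, win, th, pmi_table):
    """Flag words whose aggregated PMI falls at or below th (cf. Equation 4.20)."""
    errors = []
    for i, word in enumerate(report):
        context = report[max(0, i - win):i] + report[i + 1:i + 1 + win]
        scores = [pmi_table.get((word, y), 0.0) for y in context]
        if aggregate(scores) <= th:
            errors.append(word)
    return errors

# Toy PMI scores for the "eye laterally" fragment of Chapter 5.
PMI_TABLE = {("possible", "spondylolysis"): 3.0, ("spondylolysis", "possible"): 3.0,
             ("spondylolysis", "eye"): 0.0, ("eye", "spondylolysis"): 0.0}
```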
Syntactic Parser
The actual implementation specifics of the parser are not important here, as any syntactic
parser implementation would be considered functionally equivalent, provided the following
(more general) definitions still hold. An example implementation is provided in Chapter 5.
Let Sn = (x1, · · · , xn), xi ∈ L, be a tuple of words of size n representing an English
sentence.
Let parse be some relation between a sentence S and those subsequences of S which cor-
respond to constructions or constituents within the grammar defined by the natural language
being used (that is, those captured by ValidEnglishConstituents, where a "constituent" is
a functional unit of one or more words in a language5).
Let sent : L~ → 2^(L~) be a function that returns all of the valid English sentences within
a report, where a sentence is a subsequence of a report (tuple).
sent(Rn) = {Si | Si ⊆ Rn}    (4.21)
Let getErrors be a function mapping a parse relation to a set of words representing
errors. Again, this is only hollowly defined, as it may vary depending on one's method of
parsing, or desired method of error collection based upon the parser.
Let parser : L~ → 2^L be a function that, given a report Rn, returns those words that
are considered errors (based on some function getErrors above). Here s is defined as the
number of sentences in Rn.
parser(Rn) = ⋃(j=1..s) {xi | xi ∈ getErrors(parse(Sj)) ∧ Sj ∈ sent(Rn)}    (4.22)
The function parser takes a radiology report Rn and collects the union of all errors
returned for every sentence Sj found within that report.
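The parser heuristic (Equation 4.22) only requires that parse and getErrors exist; both are deliberately hollow here, as in the text. The stand-ins below accept a sentence only if a toy "grammar" lists it, and flag every word of a failed sentence — one of many possible error-collection policies, not the one implemented in Chapter 5:

```python
def parse(sentence, grammar):
    """Stand-in parse relation: succeeds iff the toy grammar lists the sentence."""
    return sentence in grammar

def get_errors(sentence, parsed_ok):
    """Stand-in error-collection policy: flag every word of a failed sentence."""
    return set() if parsed_ok else set(sentence)

def parser(sentences, grammar):
    """The union of the errors collected from each sentence (cf. Equation 4.22)."""
    errors = set()
    for s in sentences:
        errors |= get_errors(s, parse(s, grammar))
    return errors

GRAMMAR = {("the", "xray", "shows", "nothing", "abnormal")}
```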
Semantic Distancer
The semantic distancer is a conceptual formalization of the semantic-similarity measure. If
the ontology being used is considered as a directed graph, the following definitions hold (as
in the case of the UMLS):
5The reader is directed to Appendix A for more information on constituents.
Let V ⊆ L~ be a set of vertices (concepts).
Let E be a set of edges (tuples) of the form (x, y) where x is directed to y, s.t. x, y ∈ V .
Let G be a graph s.t. G = (V,E).
Let Distance : L~ × L~ → N be a function that returns the length of a path between
any two vertices, x ∈ V and y ∈ V, in a graph G.
Let C ∈ L~ be a tuple of words representing a concept. Note that a concept can be
comprised of more than one word. For example, “radiology report” is two words, but may
be represented by a single concept (i.e. tuple) containing both words.
Let Concepts be the set of all concepts C in the domain as defined over L~.
The reportConcepts function maps the words within a report (a tuple of words) to those
subsequences of that tuple which correspond to concepts. That is, those tuples which are
contained in the set Concepts.
Let reportConcepts : L~ → 2^(L~) be a function defined as follows.
reportConcepts(Rn) = {(xi, · · · , xj) | Rn = (x1, · · · , xn) ∧
                                        (xi, · · · , xj) ∈ Concepts ∧
                                        1 ≤ i ≤ j ≤ n}    (4.23)
It is also possible to calculate the co-occurrences of one concept with respect to another.
Thus, the following functions are modifications of the functions before, after, and co so
that they now apply to concepts:
Let CSc represent a set of concepts of size c, obtained via reportConcepts(Rn) for some
report Rn.
Let concept before : 2^(L~) × N × N → 2^(L~) be defined as follows.
concept before(CSc, t, win) = {ci | ci ∈ CSc ∧ max(1, (t − win)) ≤ i < t}    (4.24)
Let concept after : 2^(L~) × N × N → 2^(L~) be defined as follows.
concept after(CSc, t, win) = {ci | ci ∈ CSc ∧ t < i ≤ min((t + win), c)}    (4.25)
Let concept co : 2^(L~) × N × N → 2^(L~ × L~) be defined as follows.
concept co(CSc, t, win) = {(ct, ci) | ci ∈ CSc ∧ ci ∈ concept before(CSc, t, win)} ∪
                          {(ct, ci) | ci ∈ CSc ∧ ci ∈ concept after(CSc, t, win)}    (4.26)
The function sd determines the semantic distance of all concepts within a report, Rn,
that are up to win concepts away from the concept indexed at t. The value weight represents
an optional weight factor that may be applied to reflect the varying strength of certain edges
(as discussed in Section 3.6.2).
Let sd : 2^(L~) × N × N → 2^R be defined as follows.
sd(CSc, t, win) = {zi | (ct, ci) ∈ concept co(CSc, t, win) ∧
                        zi = Distance(ct, ci) × weight}    (4.27)
The function sdr takes a report, Rn and a window size, win, and determines the set
of semantic distance values for all concepts within the report. It returns the set of those
concepts whose semantic distance values are equal to or below the threshold value.
Let sdr : L~ × N → 2^(L~) be defined as follows.
sdr(Rn, win) = {ci | ci ∈ reportConcepts(Rn) ∧
                     zi ∈ sd(reportConcepts(Rn), i, win) ∧
                     zi ≤ th}    (4.28)
Lastly, let wordmap : (L~)^n → (2^L)^n be a function that maps a list of concepts CS
(represented as word tuples, recall) to a list of the sets of individual words within each
concept, {w1, · · · , wn}, where n = |CS|. Since a concept can be comprised of more than one
word, this mapping is necessary to identify the individual word errors (since the system in
question reports errors at the word level). For example:
wordmap({(radiology, report), (lesion)}) = {{radiology, report}, {lesion}}    (4.29)
Figure 4.2: A Venn diagram showing the similarities between ER (the set of errors detected) and AE (the set of actual errors); the overlap corresponds to TP, and the non-overlapping regions to FP and FN.
Errors Detected
ER is defined as the set of errors detected:
ER = pmir(Rn, win) ∪ parser(Rn) ∪ cor(Rn, win) ∪ wordmap(sdr(Rn, win))    (4.30)
C is defined as the set of correct words, i.e. the words of the report not detected as errors:
C = Rn \ ER    (4.31)
AE ∈ 2^L is defined as a set of words representing the set of actual erroneous words in a
report. The following definitions are then possible:
• FP is defined as a set of words representing the false positives, s.t. FP = ER \ AE
and FP ∈ 2^L.
• FN is defined as a set of words representing the false negatives, s.t. FN = AE \ ER
and FN ∈ 2^L.
• TP is defined as a set of words representing the true positives, s.t. TP = AE ∩ ER
and TP ∈ 2^L.
Figure 4.2 shows a Venn diagram highlighting the relationship between the two sets AE and
ER, and a visual representation of FP , TP , and FN .
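In code these definitions are plain set algebra: a false positive is detected but not actual, a false negative actual but not detected, and a true positive both. A minimal sketch with toy sets:

```python
ER = {"eye", "laterally", "tendons"}   # toy set of detected errors
AE = {"eye", "laterally", "vastus"}    # toy set of actual errors

FP = ER - AE    # flagged, but not actually an error
FN = AE - ER    # an actual error the system missed
TP = AE & ER    # correctly flagged errors
```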
4.6 Summary
This chapter has presented a hybrid, black-box-based error-detection method for ASR in
radiology. The observations provided in this chapter and the error-detection classification
laid out in Chapter 3 demonstrate a robust system that will capitalize on the strengths
of the heuristics when applied together on the same document. In the following chapter, a
series of experiments will be presented as proof of concept showing the system’s viability and
performance, including an increase in detection accuracy over any independent heuristic.
Chapter 5
Experimental Evidence
In this chapter the problem of error detection in radiology is viewed from an experimental
perspective. The heuristics outlined in the previous chapter are implemented to demonstrate
their efficacy as independent error-detection methods, and finally combined as proof of
concept of the hybrid approach.
5.1 Introduction to Proposed System
To demonstrate the viability of the methodology proposed in Chapter 4, the implementation
in this chapter has been designed as a proof of concept. The combined performance of the
error-detection heuristics is sufficient to support the thesis that error-detection is capable
of improving the performance of ASR in radiology, and likewise, the conclusion in Chapter
4 that a hybrid method will outperform any single method. On a larger scale, and beyond
the scope of this dissertation, the full error-detection system will provide an interface for an
interactive review of the report summary as well as the confidence scores. The radiologist
will be able to efficiently correct the erroneous input from this interface by concentrating on
words tagged with confidence scores below a certain threshold, while skimming those above
this threshold. This can be further extended by intelligently suggested corrections. The
interface will also present the option of searching the database of existing reports. This will
be set up to facilitate extension beyond the local database to intranet and Internet searches
as well. These extensions and others are explored in Chapter 6.
CHAPTER 5. EXPERIMENTAL EVIDENCE 76
5.1.1 Materials
Corpora
The Training Corpus This proof of concept relies on the availability of radiology
reports, collected via speech recognition, to design, train, and test the system. The Canada
Diagnostic Centre (CDC) in Vancouver, BC, has provided 2751 corrected and de-identified
radiology (MRI) reports obtained using the Dragon NaturallySpeaking speech-recognition
system, version 7.3. The co-occurrence statistics of varying window sizes have also been
compiled for these reports.
Note that in these reports the “Techniques” section was provided as a template that the
user selected at the time of dictation. As a result, this section is not susceptible to errors
introduced as a result of ASR and is not used in any of the upcoming studies.
The Test Corpus Since the 2751 radiology reports from the CDC have been corrected,
they cannot be used to test the error-detection system. In response to this, Dr. Bruce Forster
of the CDC has suggested an arrangement whereby raw, uncorrected reports can be obtained
from the CDC along with their corrected counterparts. Thus, an additional corpus of 30
raw, uncorrected radiology reports paired with their corrected versions was collected. Since
these reports were part of an ongoing collection by the CDC, they were produced only when
time was available, and include all scan types, unlike the training data, which is limited to
MRI. The presence of other scan types (such as CT) in the test data will influence the final
results via the system’s ability to successfully generalize beyond MRI reports. Arguably,
the resulting vocabulary variation between modalities should be minimal since much of the
radiological parlance overlaps. This is discussed in Section 5.3. Out of these test reports,
there is an average of 11.9 errors per report, with an average report length of 80.8 words.
This represents an average word-error rate (WER) of 15%.
In developing an adequate database of test-report pairs (that is, raw and corrected
report pairs), an initial attempt was made to re-dictate a series of corrected exams for which
the raw report was no longer available. Dr. Forster assisted in this process by reading
from a print-out of the report in question. Interestingly, it was found that the ASR system
performed surprisingly well on these reports. The speculation is that the cadence when
reading from a printed report is significantly different from that of dictating "on the fly".
Consequently, dictation is smoother, with fewer false starts or filler words such as "um" or
"er". As a result, such a method is not viable for creating a realistic test corpus.
As with the training corpus, the “Techniques” section of a report is ignored.
Ontological Knowledge Source
The semantic analysis portion of this work requires access to an ontological knowledge
source. Based on the discussion in Appendix B, the Unified Medical Language System
(UMLS) has been chosen for this purpose1. Briefly, the UMLS is developed by the Na-
tional Library of Medicine (NLM) with the intent to facilitate automated natural language
understanding in medicine. It comprises three knowledge sources, the Metathesaurus, the
Semantic Network, and the SPECIALIST lexicon. The Metathesaurus is an ontological
source of knowledge built upon many source vocabularies that have been combined into a
single database. Within this database concepts are organized by their relationship to other
concepts, such as, for example, the “is-a” relation. The Semantic Network provides a general
and overriding conceptualization of all concepts and their relationships within the Metathe-
saurus, regardless of their source vocabulary. Each concept within the Metathesaurus is
linked to one of the abstract concepts, called “semantic types”, within the Semantic Net-
work. These semantic types represent major groupings, split at the most general level into
event and entity. Finally, the SPECIALIST lexicon is a database of lexical information
useful for natural language processing. The terms found within the Metathesaurus and
Semantic Network, for example, are found within the SPECIALIST lexicon.
5.2 Methods
5.2.1 Modular Design
As a software-engineering methodology, the modular design of the hybrid algorithm has a
number of advantages over single techniques. First, it is possible to develop and evaluate
each heuristic incrementally and individually. In addition, many of the drawbacks applicable
to particular heuristics can be overcome by the combination of multiple results. For instance,
if it is not possible to obtain a sufficient training corpus for the purposes of co-occurrence
analysis, it is still possible to derive confidence rankings using the other heuristics. In the
1http://www.nlm.nih.gov/research/umls/ Accessed: February 2006; Updated: February 2006.
CHAPTER 5. EXPERIMENTAL EVIDENCE 78
long term, modularization lends itself to software reusability and the possibility of multiple
software developers, resulting in a more robust, usable system.
5.2.2 Calculating Results
To find the actual errors in our test reports, the corrected and uncorrected reports are
aligned and any differences are identified and tagged as errors. These are then compared
to the flagged errors from the program output to obtain the results: a match is considered
a correct detection, or true positive; a flagged error that does not correspond to an actual
error is considered a false positive; an error not flagged is considered a false negative.
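Word-level alignment of a raw report against its corrected counterpart can be sketched with difflib.SequenceMatcher. This is illustrative only — the thesis alignment involved human interpretation in the ambiguous cases — and the "corrected" wording below is invented for the example:

```python
from difflib import SequenceMatcher

def tag_errors(raw, corrected):
    """Return the raw-report words that differ from the corrected report."""
    matcher = SequenceMatcher(a=raw, b=corrected, autojunk=False)
    errors = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op != "equal":          # replace/delete blocks in the raw report
            errors.extend(raw[i1:i2])
    return errors

raw = "possible spondylolysis eye laterally of L5".split()
corrected = "possible spondylolysis bilaterally of L5".split()  # invented correction
```

Matched words count as true negatives; the differing spans are the "actual errors" against which the detector's flags are scored.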
In calculating the results for this experiment, Recall is a measure of the number of errors
correctly detected over the total number of errors actually present (how many actual errors
are found); Precision is a measure of the number of errors correctly detected over the total
number detected (how many of the errors found are actual errors).
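These two definitions, in code, reduce to ratios over the true-positive, false-positive, and false-negative counts (a minimal sketch):

```python
def recall(tp, fn):
    """Errors correctly detected over errors actually present."""
    return tp / (tp + fn) if (tp + fn) else 0.0

def precision(tp, fp):
    """Errors correctly detected over errors flagged."""
    return tp / (tp + fp) if (tp + fp) else 0.0
```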
5.2.3 Aligning the Source and Output: Recognition Errors
For the purposes of proof of concept, all error tags were manually collected and recorded
based upon the output of the error-detection heuristics. While generally an objective
process, on occasion situations arose that required a choice between which words to count in
error, and how many errors to record. In many cases, such as split or merge errors, what
might appear as one or more errors in the output document, corresponded to a different
number of words in the source document. In striving for consistency between all error
determinations, the following conventions were adhered to: given a split error, the error
count remains one, regardless of the number of words the source word was erroneously split
into; given a merge error, the same process applies, regardless of the number of consecutive
words erroneously compressed from the source document. For example, “recognize” may be
misrecognized as “wreck a nice”, or vice versa. In either instance the error count is one.
In some cases, multiple, consecutive errors may be identified by the error-detection
system; occasionally these are the effect of cascading errors. In these situations it can be a
complicated task to align the source document with the output from the detection system,
resulting in some interpretation on the part of the human analyser. Where the errors
involved content and stop words it was often difficult to determine whether such errors
constituted more than one. In these cases, tagged errors extending over six words (stop
words included) were counted as two errors, and similarly for every three errors occurring
consecutively beyond that.
As a final note, all tools were designed and run on a Mac G4, 1.5 GHz, OS X 10.3.9.
5.2.4 Calculating Co-Occurrences
The generation of a word’s or concept’s context in terms of pairs of co-occurring words or
concepts is necessary throughout this research. Given a word, w, occurring in a document,
d, a context window, C(w, d, n), is defined as the n words occurring to either side of w in d.
This technique is used to generate the training corpora for various window sizes, as well
as to analyze the test cases and compare them to the training database. A sample selection
of the co-occurrence relations for the word "quadriceps" from the training corpus is provided
in Table 5.1. For example, "quadriceps" occurred in the training corpus 123 times, and
co-occurred with the term "patellar" 32 times, for a frequency of 32/123 = 0.26.
Table 5.1: Co-occurrence statistics for “quadriceps”.
term          context      count    word count    term freq.
quadriceps    included         1           123          0.01
quadriceps    mechanism        1           123          0.01
quadriceps    patellar        32           123          0.26
quadriceps    tendon          38           123          0.31
quadriceps    tendons         50           123          0.41
quadriceps    vastus           1           123          0.01
As an example of the co-occurrences for a particular sentence, consider the incorrect
sentence fragment in Sentence 1:
...possible spondylolysis eye laterally of L5... (Sentence 1)
We can generate the following co-occurrences for the target word, “eye”, with a context
window of two (up to two words to either side of “eye”):
eye possible
eye spondylolysis
eye laterally
eye L5
Note that a stop list, as discussed in Section 3.7, is employed in all statistical calculations
described here.
In the next sections, the individual heuristic implementations and their results are ex-
amined.
5.2.5 The Error-Detection Algorithms
The error-detection algorithms in the proof of concept were chosen to cover all recognition
error types, as per Section 3.1.2. Based on the study of error-detection methods in Chapter
3, these were inspired by algorithms shown to have some success in other domains, as well
as original algorithms based on unique adaptations of other natural language processing
techniques.
There are a number of potential ways in which information about the likelihood of an
error can be determined. The aim is to explore, develop, and evaluate as many error-detection
heuristics as possible for use in this system. These include the following:
• Conceptual/semantic similarity.
• Semantic relationships, such as thematic roles and levels of abstraction.
• Syntactic analysis.
• Word occurrence probabilities.
5.2.6 Conceptual Similarity
The Semantic Distancer
Overview The method of conceptual similarity developed here was inspired by the
work of Rada and Blettner [113] and Caviedes and Cimino [23]. It is a simple system that
identifies the concepts within a radiology report by applying the NLM’s MMTx software.
The average distance each concept is from its context, and from the general topic of the
report (for example, anatomical region of study) is determined using the UMLS. The final
result is a confidence ranking of the concept itself. If a concept differs too drastically from
the surrounding concepts or the topic marker (i.e. the distance exceeds a certain threshold),
it is considered a recognition-error candidate.
Materials Central to the functioning of the semantic distancer is an ontological knowl-
edge base and a means for extracting and mapping concepts within radiology reports to this
ontology. The UMLS, v2005AB2 (see Section 5.1.1), and its corresponding MMTx soft-
ware, a program that maps biomedical text into UMLS concepts, have been chosen for this
purpose.
The MMTx is a Java implementation of the original MetaMap software intended for
public access3. Based on the same algorithms as MetaMap, in general, MMTx produces
equivalent output. Known differences stem from the use of a third-party tagger in MMTx,
but are not considered relevant to this work.
The MetaMap algorithm applies a shallow parse to an input text, and uses the resulting
phrases to determine all variants of the terms within each phrase from the SPECIALIST
lexicon. The Metathesaurus is then consulted to generate a candidate list of all concepts
that match those variants. The candidate list is ranked based upon the weighted average
of four metrics, including the degree of variation between the variant and the original term,
and the degree of match between the candidate and the text [4, 3]. The output is a list of
the top candidates ordered by match strength.
The UMLS was obtained via DVD directly from the NLM. Those files maintaining the
inter-conceptual relationships (e.g. parent and sibling relations) were transferred into a
local-access database using MySQL, v5.0.16. Note that all text manipulation and reformat-
ting of the UMLS and of the radiology reports was done in Perl, v5.8.1-RC3.
The entire corpus of 30 test reports is used in this experiment. The training corpus was
not used.
Method In order to determine the relevant concepts for analysis, each report is first
run through the MMTx software to produce a Prolog-compatible output list of concept
candidates. To simplify this preliminary implementation, only the top concept candidates
are kept in all cases. Without re-ranking the candidate list, this seems the best course
of action in lieu of testing multiple candidates for each concept in the text, which would
quickly result in an exponential growth of the search space. This leaves open the possibility
of incorporating these candidate lists more fully into the analysis and is discussed in Section
6.9. Where no candidate list is produced, yet MMTx identifies a probable concept, the
associated text is tagged as "unknown". The candidates themselves are in the form of
Concept Unique Identifiers (CUIs), a UMLS-specific unique identifier that exists for each
concept within the ontology.
2 2005AC was released during the course of this research; however, it was decided that, due to changes in the lexicon, an upgrade would risk further inconsistency in the results and was not necessary.
3 http://mmtx.nlm.nih.gov/FAQ.shtml Accessed: February 2006; Updated: July 2004.
Once a candidate list for all concepts in the source report has been identified, the CUIs
for each are then extracted. The context of each target concept is determined based on
a context-window size, and a list of CUI pairs is produced based on the target concept
and the concepts with which it co-occurs. For each target concept an additional CUI pair
is added to represent the relation of the target concept to the overall topic concept of the
report: In all cases, the test reports contain a title sentence that identifies the anatomical
region of focus. This title sentence is used to manually create a topic identifier in the form
of a CUI from the UMLS, which is then paired with the target concept to create a final CUI
pair.
For each CUI pair a semantic distance value is calculated using a distance calculator I
have designed in Perl, called sem dist. Using a reverse breadth-first search, sem dist will
search the UMLS MySQL database for a common parent of the two CUIs, CUI1 and CUI2.
Starting at CUI1, the algorithm searches up one level in the tree for all parent concepts of
CUI1, while doing the same for CUI2 in parallel. This generates two parent lists, P1 and
P2, which are compared. If no common parent is found, then starting with P1, the parent of
each CUI ∈ P1 is systematically calculated and compared to a list of the nodes traversed so
far. If a match is encountered, then the steps from CUI1 and CUI2 to the common parent
are counted and totaled.
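The search just described can be sketched in Python as follows. This is an illustration only: the actual sem_dist was implemented in Perl against the UMLS MySQL database, whereas here the hierarchy is a hypothetical in-memory mapping from each CUI to its parent CUIs.

```python
from collections import deque

def sem_dist(parents, cui1, cui2):
    """Bidirectional upward BFS: count the steps from cui1 and cui2 to
    their nearest common ancestor. Returns None when the search space is
    exhausted without finding a common parent (the "NULL" case)."""
    if cui1 == cui2:
        return 0
    seen1, seen2 = {cui1: 0}, {cui2: 0}
    frontier1, frontier2 = deque([cui1]), deque([cui2])
    while frontier1 or frontier2:
        # Expand one level on each side in parallel, as in sem_dist.
        for frontier, seen, other in ((frontier1, seen1, seen2),
                                      (frontier2, seen2, seen1)):
            for _ in range(len(frontier)):
                node = frontier.popleft()
                for parent in parents.get(node, ()):
                    if parent in other:  # common parent found
                        return seen[node] + 1 + other[parent]
                    if parent not in seen:
                        seen[parent] = seen[node] + 1
                        frontier.append(parent)
    return None  # no link across vocabularies: label the pair "NULL"
```

Since the search only ever moves upward through parent links, it cannot revisit a cycle below it, which is the loop-avoidance property noted later in this section.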
The above calculation results in a distance score for each CUI pair that was created for
a target concept. Since the goal is a confidence ranking of the target concept, these scores
must be combined into a single ranking. As an initial aggregation result, the semantic-
distance results for each concept in a report were averaged to produce the final confidence
score. The topic concept can be used as an independent measure of error – the semantic
distance scores between the concepts in the body of the text and the topic marker can
indicate errors. Alternatively, the distance from the topic concept may be averaged with
the other semantic-distance results for a particular target concept in the body of the text
to create one measure of confidence. The topic distance may also be weighted to reflect
the theoretical difference between its effect on the final confidence score versus the score
produced via the semantic distance of the neighbouring words.
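The aggregation options just described might be sketched as follows. The topic_weight parameter and the treatment of "NULL" links are illustrative assumptions of this sketch, not values fixed by the thesis.

```python
def confidence_score(neighbour_dists, topic_dist, topic_weight=1.0):
    """Average the semantic distances to neighbouring concepts, folding in
    the (optionally weighted) distance to the report's topic concept.
    None entries model "NULL" links and are excluded from the average."""
    dists = [d for d in neighbour_dists if d is not None]
    if not dists and topic_dist is None:
        return None  # no usable evidence for this target concept
    total = sum(dists)
    count = len(dists)
    if topic_dist is not None:
        total += topic_weight * topic_dist
        count += 1
    return total / count
```

With topic_weight = 1.0 this reduces to the plain average described above; raising or lowering it models the weighted variant in which the topic distance contributes differently from the neighbouring-word distances.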
One of the benefits of this approach is that it avoids the problems associated with loops
in the ontology4. In this implementation, the similarity of two unique terms, or CUIs, is
determined by working up through the database, following any relevant parent links until
an intersection is found. Since this process is always going up, it is not possible to get
trapped in a loop, as all terms must have a parent. If the search space ends (i.e. a root, or
even a sub-root concept is detected), the search terminates and the linking CUI is labeled
as “NULL”.
Results and Discussion For each concept detected in a report, the semantic distancer
returned a distance measure from the surrounding words defined by the window size, and a
separate distance measure from the topic concept marker. These results were then manually
analysed to determine the errors indicated.
After running the experiment on the test cases, it was found that as many as 10% of the
concepts within some reports were tagged as “unknown”. Compounding this, when applied
to these reports the results of the semantic distancer were inaccurate due to unrecognized
concepts, which were scored as zero. As a first step to recover from this problem, such results
were excluded from the calculation of the final average. Unfortunately, further investigation
revealed a discrepancy between the MMTx and the current UMLS: As new versions of the
UMLS are released, concepts that are considered obsolete are removed, or replaced with
more accurate ones. It is often the case that the MMTx maps report concepts to these
“retired” CUIs. Thus, when such CUIs are referenced in the MySQL database there are no
longer entries for those concepts and therefore they cannot be used for the semantic distance
calculation. Due to time constraints, this remains an open problem for future work and is
discussed in Chapter 6.
Lastly, as an unfortunate consequence of the UMLS design (a compilation of source
vocabularies) a small number of concepts could not be linked via the parent link provided in
the UMLS relationship database. Such concepts were reserved by their source vocabularies
and thus could only be linked to other concepts within that source vocabulary. Occasionally,
the concept of interest lay in a different source vocabulary and could not be linked to across
vocabularies via existing relationships. While common links do exist for all concepts at
the Semantic Network level, the sensitivity to differences between concepts at that level
was insufficient. The broad granularity of the Semantic Network meant that all concepts were
4Caviedes and Cimino [23], among others, have observed such loops within the UMLS.
generally within a short distance of one another or had identical distances, and thus the
calculation was not useful. Also, the nature of the UMLS is such that the source vocabularies
are still governed by their own access rights. This project was limited to those source
vocabularies for which access was free; consequently, the resulting ontology is somewhat
fractured.
In an effort to minimize the impact of unusable concepts within a target concept's window,
the window size is limited to collocations (defined here as a concept and its immediate
neighbour). Two consecutive concepts, however, often differ locally in meaning, resulting
in unusable semantic distances: two words side by side may be conceptually distant from
one another, despite each being related to the surrounding sentence when it is considered
in its entirety.
In contrast, the inclusion of more concepts in a large context window can smooth out the
normal degree of variation among local concepts so that only those that are exceptionally
distant are actually able to trigger an error tag.
The combination of these factors has resulted in an incomplete implementation of the
semantic-distance heuristic. The sample of results untouched by any of the above issues
and compiled with a reasonable context-window size (at least three, to minimize the local
variations mentioned above) was too small to be of any value. Nonetheless, of that small
set there were examples of out-of-place words that were conceptually
unrelated to the report that showed very low confidence scores. This, combined with the
results in Caviedes and Cimino [23], provides support for the underlying semantic-distance
concept and indicates further work in this area will prove fruitful.
In conclusion, this remains an open problem due to implementation details, and not
issues related to the underlying concept.
5.2.7 Semantic Grammar
Since the needed analysis of the semantic roles of the concepts within the radiology domain
is an extensive project, an implementation of the semantic-relationship analysis falls out of
the scope of this thesis (and is not necessary to establish our proof of concept). Despite this,
the parser discussed in the following section, Section 5.2.8, has been designed to support
semantic constraints, such as thematic roles, and roles specific to radiology. Thus, once
the above analysis is complete, it will be a straightforward task to augment the existing
syntactic parser. This is discussed in more detail in Chapter 6.
5.2.8 Syntactic Analysis
Overview As discussed in Section 4.4.2, the use of stop lists and surface-level anal-
ysis prevents statistical-based methods from achieving 100% efficacy in error detection. A
syntactic parser, however, can be used to identify syntactic errors, including those which
involve stop words and deletions.
With this in mind, a parser was developed to analyse radiology reports. In the interest
of rapid prototyping sufficient for proof of concept, the parser was built upon a constraint-
handling-rules grammar, or CHRG [29] and inspired by Property Grammars [10]. Dahl and
Blache [35] demonstrate this combination of grammar formalisms to be a robust option, with
the ability to handle various levels of granularity, as well as incomplete and incorrect input.
As discussed in Section 4.4.2, such flexibility is necessary to handle incomplete sentences
and the note form often found in radiology reports. Furthermore, by characterizing the
grammar as a series of properties, the properties constraining the language within radiology
reports are easily captured.
The parser’s design has left open the possibility of extending the constraint base to
include semantic constraints. This involves interfacing with an ontological knowledge source,
such as the Unified Medical Language System (UMLS) [15], to obtain the semantic properties
of phrases which can be used to test semantic-based constraints, as mentioned in Section 5.2.7. For
example, a verb may be restricted to apply to only anatomical concepts.
Materials This experiment uses MMTx for an initial partial parse of the text (see
Section 5.2.6). The main parser was developed in SICStus Prolog, v3.12.3 under a temporary
student license, using SICStus's built-in constraint handling rules (CHR) implementation and
Henning Christiansen’s CHR grammar (CHRG) system, v0.1 [29]. For those unfamiliar,
a brief introduction to CHRs is provided in Appendix A, while a more in-depth introduc-
tion to CHRs and CHRG is provided in Fruhwirth 1994 [50] and Christiansen 2005 [29],
respectively.
All 30 test reports are used in this experiment.
Method As a preprocessing step, each test report was run through MMTx, a program
that maps biomedical text into UMLS concepts5. MMTx provides semantic information
for each report in the form of UMLS Concept Unique Identifiers (CUIs), part-of-speech
5Available at http://mmtx.nlm.nih.gov/ Accessed: February 2006; Updated: February 2006.
tagging, as well as basic phrasal information. The tagging was particularly important as
MMTx includes a tagger trained on medical texts. Since a tagged, training corpus was not
available to train a tagger, this was an invaluable resource. As an example, the phrase “of
the thoracic spine”, once passed through MMTx and a pre-processor (which modifies MMTx
output for input to the error-detection parser), is returned as the following:
rep_phrase(1, 'of the thoracic spine',
    [prep([tag(prep), tokens([of])]),
     det([tag(det), tokens([the])]),
     mod([tag(adj), tokens([thoracic])]),
     head([tag(noun), tokens([spine])])],
    ['C0581269', ..., 'C0024659'], 3, 4).
As a second level of analysis, the parser was created in SICStus Prolog using CHRG
[29] and Property Grammars [10], a means for representing the structure of language as
properties constraining the allowable constructions within that language. The modified
reports are input to the parser and analysed according to a grammar created atop the
CHRG formalism and inspired by property grammars6.
Based on each phrase identified via rep_phrase/6, the parser first performs a series of
property checks to determine the appropriate phrase type. Each phrase type has its own rule
set defining its specific properties. Unique to property grammars, the properties defining the
allowable constructs within the grammar can be tagged as “relaxable” [36]. While needing
to relax a property is likely to indicate an error (i.e. an incorrect term or an incomplete
phrase), the parse is able to continue and information regarding the nature of the error is
collected (i.e. those properties that were not satisfied). The result is a robust parser that
does not fail in the face of errors. This is an ideal solution for error detection making it
possible to detect and locate errors within the text.
When parsing, each “rep phrase” is compared to the properties within the grammar
to identify a phrase-type candidate. When identified, the phrase is added as a phrase
constituent to the constraint store. In some cases, the property check is pre-empted by
a keyword that triggers the automatic assignment of a phrase constraint. For example,
auxiliary verbs such as “is” are immediately tagged as phrases of type “is”. If no keyword
6The grammar developed was not intended as a linguistically robust representation of English, but rather as a functional implementation of the characteristics of radiology reports. Thus, there are some deviations from typical parse-tree constructions attributed to English sentences, in the interest of computational feasibility and speed of development. Future iterations of this parser will see a more in-depth analysis of the underlying linguistic properties, and a more careful eye to the elegance of the resulting formalism.
is detected, then the phrase is passed on to the property check. There are three possible
cases that result from the property checks.
In the first case, all of the requisite properties are observed and the phrase is successfully
created with the matching phrase type. Since Prolog works from top to bottom when
analysing rules, the properties are tried in the order they are presented in the grammar
formalism. Thus, it is important to differentiate the rules representing various phrase types
by a unique list of properties; where phrase rules are ambiguous, careful consideration must
be given to the order presented since the first phrase to match will be the one added to the
constraint store, even if a phrase later in the Prolog listing is also possible. This latter phrase
will only be tried should the parse fail and have to backtrack to the phrase assignment rules.
In the second case, none of the required properties for any of the phrase types are met
and the attempt fails. The phrase is then tagged and added to the constraint store as
“unknown”.
In the third and final case, one or more properties labeled as “relaxable” may not be
met. Being relaxed, these properties are added to a list of unsatisfied properties but do
not halt the parse. As a result the parse will continue until all properties for the current
phrase-type are met or are tagged as “relaxable”, or until a non-relaxable property is not
satisfied. In the latter case, the phrase-type rule will fail and the next phrase type will be
tried. If a property check succeeds, then, as mentioned above, a constituent is added to the
constraint store that represents the phrase, phrase type and a list of the relaxed properties
that were unsatisfied.
Beyond the phrase-type identification, the rules of the grammar are defined via con-
straint handling rules (CHRs). After each change in the constraint store, the CHRs are
consulted and, wherever applicable, constraints are modified according to these rules and
the constraint store is updated. In this way the parse is completed, conjoining sub-phrases
as permitted by the CHRs. When no further changes are possible the system has “settled”
and the current contents of the constraint store are output. During the parse, the system
maintains a list of all “unknown” constituents. These are also output at the end.
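As an illustration only, the settle-until-fixpoint behaviour of the constraint store can be mimicked in Python; the single toy rule below mirrors the simplified np/vp rule shown later in this section. The real implementation relies on SICStus CHR, not code of this form.

```python
def settle(store):
    """Repeatedly apply propagation rules until no new constraints are
    added (the store has "settled"). Constraints are tuples such as
    ('np', X, Y), a noun phrase spanning word positions X..Y."""
    rules = [
        # constit(np,X,Y), constit(vp,Y,Z) ==> constit(s,X,Z):
        # an np immediately followed by a vp yields a sentence.
        lambda store: {('s', x, z)
                       for (t1, x, y) in store if t1 == 'np'
                       for (t2, y2, z) in store if t2 == 'vp' and y2 == y},
    ]
    changed = True
    while changed:
        changed = False
        for rule in rules:
            new = rule(store) - store
            if new:           # store grew: consult the rules again
                store |= new
                changed = True
    return store

store = settle({('np', 0, 2), ('vp', 2, 5)})
# store now also contains ('s', 0, 5) alongside the original constituents
```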
The interpretation of the results for error detection is currently performed manually
for the purposes of this experiment. Errors can take three forms given the parser output:
phrases tagged as “unknown”, unsatisfied property lists, and incomplete parse segments.
“Unknown” tags represent words or phrases that went unrecognized by MMTx, or subse-
quently could not be assigned a phrase type by the parser.
The following is an example property check for a verb phrase:
vp_properties(CUI, L, L2, L3, UnsatX, S, F) :-
    Unsat = [],
    (   has_x(verb, L), append([], Unsat, Unsat2)
    ;   relax(has_verb), append([has_verb, S, F], Unsat, Unsat2)
    ),
    UnsatX = Unsat2.
This rule enforces the property of having a verb in order for a phrase to be considered a
verb phrase. If a verb phrase is expected but no verb is present, the parse can nonetheless
proceed by relaxing this property and adding it to the unsatisfied property list (represented
by Unsat). The information on the unsatisfied properties is then available at the end of the
parse. For the purposes of error detection, all properties were marked as relaxable.
Next is an example constraint handling rule:
constit(np,X,Y), constit(vp,Y,Z) ==> constit(s,X,Z).
The preceding rule activates when an np and a vp (noun phrase and verb phrase, respectively)
are present consecutively in the constraint store (that is, from X to Y, and Y to Z), and
adds a further constraint to the store that represents a sentence, or s, across those words
(that is, from X to Z). This is a simplified version of the actual rule for the purposes of
readability here.
Table 5.2: CHR parser results on all error types.

                              Accuracy              Test Corpus
    Error Subset       Recall  Precision  f-measure    Size
    All Errors          29%      34%        32%          30
    Syntactic Errors    71%      17%        27%          30
Results Table 5.2 shows the result of applying the CHR parser to the 30 test reports.
When restricted to syntactic errors, the recall improves considerably.
to a large drop in the precision, this is attributable in part to the measurement of the
precision over all possible errors. Essentially, the set of correctly-detected errors is reduced to
include only those that are syntactic, while maintaining the total number of errors detected
(which may include correct detections of non-syntactic errors).
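For reference, the f-measure reported here can be read as the balanced harmonic mean of precision and recall (an assumption consistent with the table: 71% recall and 17% precision yield roughly 27%):

```python
def f_measure(recall, precision):
    """Balanced f-measure: the harmonic mean of precision and recall."""
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)

print(round(f_measure(0.71, 0.17), 2))  # 0.27
```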
Some of the undetected errors were attributable to errors introduced at the MMTx
level. In some instances, concepts were recognized and assigned a CUI yet still tagged as
“unknown”: from the MMTx perspective, this distinguishes terms that were found
in the Metathesaurus but not in the SPECIALIST lexicon (and are thus “unknown”). As a
result, given the sentence, “This examination extends from the T9 and T10 disc space to the
S2 and S3 level.” the terms “T9” and “T10” are assigned the correct CUI values, indicating
that they were correctly identified in the UMLS Metathesaurus, yet they are tagged with
“unknown”. Since the parser relies on “unknown” tags as an indication of an error, this
falsely indicates “T9” and “T10” both as errors.
Discussion While the parser performs poorly on the entire error set, the recall for
syntactic errors only is noteworthy. Developing the parser further will improve this result,
and refine the precision score. However, these results are of particular interest as they
show a strong affinity for syntactic errors, which will be useful in the hybrid approach.
Furthermore, by analyzing on the basis of syntax it is possible to identify stop word errors,
which are typically ignored by other methods (i.e. statistical-based methods, and semantic
analysis).
Though the parser takes longer than the following statistical techniques (up to three
minutes in the worst case on exceptionally long sentences), there is no overhead cost such as
that associated with generating the co-occurrence statistics. Also, in all cases the slow run time was
attributable to the preliminary nature of the parser and will improve with future iterations.
5.2.9 Word Occurrence Probabilities
Two probabilistic techniques for the proposed hybrid method have been developed, namely
co-occurrence relations and Pointwise Mutual Information (PMI). Underlying both is the
key notion that through identifying patterns common to error-free reports, inaccuracies in
novel reports can be automatically detected. The theory underlying both of these techniques
is discussed in Chapter 4.
As part of the setup for both co-occurrence analysis and PMI, co-occurrence statistics
for varying window sizes have been compiled from the 2751 anonymised MRI reports. Recall
that in co-occurrence analysis, stop words are usually omitted, since their overabundance in
a text can negatively affect the resulting probabilities and limit overall error detection.
Co-Occurrence Analysis
Overview As mentioned above, patterns within error-free reports can be used to
detect errors within novel reports. One means for identifying these patterns is via co-
occurrence relations [81, 96, 131], a statistical method for determining the number of times
a word occurs in a specific context window. Given a sufficiently representative training
corpus, we can associate words with particular contexts based on that corpus. We can then
apply these word-context statistics to determine the probability of a word occurring in a
given context in a report. If that probability falls below a certain threshold the word will
be flagged as a possible error.
Materials As an experiment on the effect of training data on statistical-based error
detection, a further test was run on the basis of splitting the training corpus into several
training sets: the full 2751 reports, as well as those obtained from dividing by section and
dividing by report type (i.e. anatomic region being studied). These divisions reflect the
observations that the type of words found in the “Findings” and “Impressions” sections
may differ from the “History” section, while the type of words found within a knee report,
for instance, are not as likely to occur in a report of the shoulder. Thus, by training and
testing these separately, there is no risk of dilution from other report types, increasing the
accuracy.
The final training sets include: all reports; reports separated into the “Findings” and
“Impressions” sections; and reports of the spine. To ensure adequate statistical represen-
tation, the training sets are restricted to those containing 800 or more reports. Of the
2751 reports divided by anatomic region, only “spine” had enough cases to meet the 800-
minimum requirement. Separate co-occurrence statistics are generated for each training set
based on the current context window size.
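The one-time compilation of co-occurrence statistics might be sketched as follows. The whitespace tokenization and the tiny stop-word list are illustrative placeholders, not the actual preprocessing used in this work.

```python
from collections import Counter

STOP_WORDS = {'the', 'of', 'a', 'is', 'and'}  # illustrative subset only

def cooccurrence_counts(reports, window=1):
    """Count, for each word in each training report, the words appearing
    within `window` positions of it, with stop words omitted."""
    word_counts = Counter()
    pair_counts = Counter()
    for report in reports:
        words = [w for w in report.lower().split() if w not in STOP_WORDS]
        for i, w in enumerate(words):
            word_counts[w] += 1
            lo, hi = max(0, i - window), min(len(words), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    pair_counts[(w, words[j])] += 1
    return word_counts, pair_counts
```

Once generated for a given window size, these counts can simply be stored and referenced, matching the one-time overhead cost described in the results.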
Method In the testing phase, a corpus of 30 uncorrected/corrected, anonymised report
pairs was obtained from the CDC using Dragon NaturallySpeaking7. For each uncorrected
report the context of each word and the relevant co-occurrences are determined. The ap-
propriate collection of co-occurrence statistics from the training data is then applied to
7The experiment on the effect of the training data on system performance was done prior to obtaining the full test corpus, and was instead based on a 20-report corpus subset of the full test corpus. See the results section for more information.
determine the relevant probabilities of the co-occurrences in the test report8.
Using Bayes’ Theorem (Equation 4.4, repeated in Equation 5.1 for convenience), it is
possible to combine the probability of each word that occurs within the context window of
the target word, and the probability of the target word itself, where wt = target word, and
C = context words. Bayes’ Theorem is a formula that allows us to calculate conditional
probabilities: the probability of an event, A, given the knowledge that another event, B,
has already taken place. In simpler terms, this means that the probability of our “event”,
the target word wt, can be calculated in terms of the probability of another “event”, the
context C. Since the target word and the context are closely related, this is an informative
calculation.
P(wt|C) = P(wt) · P(C|wt) / P(C)                    (5.1)
The expression P (wt|C) is read “the probability of wt given C”. The probability of the
target word, P (wt), is equal to the probability of occurrence in the training corpus. Since
we have already observed the context of the target word, we know that its probability of
occurring is 100%, thus P (C) = 1. Finally, we can calculate P (C|wt), the probability of the
context C occurring given the target word wt, using the Principle of Joint Probability, as
discussed in Section 4.4.3:
P(C|wt) = P(wt) P(C1, . . . , Cn|wt)                by Joint Probability
        = P(wt) P(C1|wt) · . . . · P(Cn|wt)         (5.2)
        = P(wt) ∏_{i=1}^{n} P(Ci|wt)
With this information we can now calculate our desired probability, P (wt|C).
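A minimal Python sketch of this scoring, taking the naive-Bayes form P(wt) · ∏ P(Ci|wt) with P(C) = 1 as read from Equations 5.1 and 5.2; the relative-frequency estimates and the absence of smoothing are assumptions of the sketch (so unseen word-context pairs score zero, matching the data-sparseness behaviour discussed in the results):

```python
def bayes_score(target, context, word_counts, pair_counts, n):
    """Score a target word against its context window using training-corpus
    counts: P(wt) estimated as count(wt)/n, and each P(Ci|wt) estimated as
    count(wt, Ci)/count(wt). Unseen words or pairs yield a score of 0."""
    ct = word_counts.get(target, 0)
    if ct == 0:
        return 0.0
    p = ct / n  # P(wt): relative frequency in the training corpus
    for c in context:
        p *= pair_counts.get((target, c), 0) / ct  # P(Ci|wt)
    return p
```

A target word is then flagged as a possible error whenever its score falls below the threshold k.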
For example, applying Bayes’ theorem to the sentence fragment in Section 5.2.4, Sentence
8Note that the training corpus statistics must be calculated on the same context-window size. Thus, while it is possible to change window sizes, doing so requires a recalculation of the training corpus statistics.
[Figure: bar chart comparing recall, precision, and f-measure (percentage) for the four training sets: All Reports, Findings, Impressions, and “Spine”.]
Figure 5.1: CA results based upon report type.
1, yields the following:
P(eye | possible, spondylolysis, laterally, L5)
        = P(eye) · P(possible, spondylolysis, laterally, L5 | eye)        (5.3)
Once we have obtained the value of P (wt|C) via Bayes’ Theorem, it can be compared to
a threshold value, k, flagging those target words, wt, where P (wt|C) < k. Thus, those words
in a report are captured whose occurrence in their context window is highly improbable.
This improbability reflects the likelihood of a recognition error.
For example, after processing Equation 5.3 we have P(eye | possible, . . . , L5) = 4.37067E−07,
a correspondingly low value that reflects the unlikelihood of “eye” occurring in that context.
Assuming an appropriate threshold k, this word is flagged as an error.
Results All results were collected by a manual analysis of the co-occurrence analyser’s
output. The graphs presented in Figures 5.1, 5.2, 5.3 and 5.4 are based on the data tables
in Appendix C.
[Figure: recall (percentage) versus threshold (0, 5.00E−06, 5.00E−04) for collocation, window size 1, and window size 10.]
Figure 5.2: CA recall results for 3 window sizes.
Since splitting by report type seems to indicate a generally positive impact, following
the experimental results obtained in Figure 5.1, all subsequent experiments were run on
the “Findings” and “Impressions” sections simultaneously. Since these sections share
similar language usage, combining them compensates for the lack of text in using the
“Impressions” section alone. Without more training data, the “spine” category was deemed
too small at this stage for accurate analysis.
The system is able to identify error candidates in under a minute in all cases, under-
scoring its viability for real-time use. There is a one-time overhead cost associated with
generating the co-occurrence statistics for the training sets. Once generated, however, the
database is simply stored and referenced. Re-generation would only occur if new training
data were added.
Figure 5.1 demonstrates the effect of splitting the training and test corpus by report type
and section. Note that this experiment was performed prior to obtaining the full 30 reports
in the test corpus. Therefore, the results in Figure 5.1 (and in Table C.1 in Appendix C)
were run based on a 20-report corpus subset of the full test corpus.
[Figure: precision (percentage) versus threshold (0, 5.00E−06, 5.00E−04) for collocation, window size 1, and window size 10.]
Figure 5.3: CA precision results for 3 window sizes.
[Figure: f-measure (percentage) versus threshold (0, 5.00E−06, 5.00E−04) for collocation, window size 1, and window size 10.]
Figure 5.4: CA f-measure results for 3 window sizes.
Discussion The high recall in Figure 5.2 reflects a high sensitivity to errors and a low
rate of false negatives. This is especially important as errors missed could have serious ram-
ifications. In contrast, the precision is low, indicating a high rate of false positives, as seen
in Figure 5.3. Although still important overall, false positives are nonetheless identifiable
by the radiologist and do not affect report quality. In most cases these false positives are
generated by data sparseness, that is, word-context pairs that were not previously encountered
in the training data (cf. Section 6.8). Thus we have P(C|wt) = 0, which results in
P (wt|C) = 0 by Equation 5.1. Evidence for this is seen in the “Impressions” data set, which
typically held the smallest amount of text, and the smallest training set. Correspondingly,
it has the lowest precision rate shown in Table 5.1. Increasing the number of reports
in the training corpus, however, would ensure greater coverage of the terms that typically
occur in a radiology report. This would cause the rate of false positives to drop and improve
the precision. While the ideal training corpus would contain every possible context of every
possible word in a radiology report, radiology reports do not in practice exhibit wide
variation; a fairly accurate depiction of the possible patterns within a report is feasible
with a large enough training set. Interestingly, though, some false positives may be
advantageous, indicating rare occurrences that merit closer inspection by the radiologist to
ensure there are no mistakes.
Separating the training and testing data by section has a positive impact, shown in
Table 5.1, though further testing is needed. This result is encouraging as “Impressions”
is the section most likely to be read by the referring physician. As mentioned above, the
lower precision for “Impressions” is explained by the typically small amount of text in this
section. Thus, while separating by type improved recall, overall the training set was still
too small for a fully effective analysis and must be followed up with more data.
The rate of error detection, or filtering, is affected by the threshold value, k. Higher
values of k mean less filtering and a higher WER, while lower values of k mean greater
filtering and a lower WER. In this way it is possible to increase the recall level to near 100%;
however, there is a corresponding loss of precision. Nonetheless, this does allow for some
flexibility in balancing between the recall and precision measurements.
Unlike the syntactic grammar discussed in Section 5.2.8, this analysis, as with other
statistical methods, omits stop words, or low-information-bearing words. These words are
ignored because a mis-recognized stop word rarely entails a shift in the intended semantics.
Exceptions exist, however, such as a substitution of “and” for
“at the”, that may have more serious consequences in medicine, and may prove difficult for
human editors to detect.
The choice of threshold value was a matter of trial and error. In the end, a minimal or
zero threshold gave the best results in light of the already low precision scores. If a larger
training corpus improves the precision score, a potentially more appropriate threshold could
be chosen. Similarly, the size of the context window was also chosen by trial and error. The
best output was obtained with a window size of one, reflecting the highest recall balanced
with the highest precision. Future experimentation with a larger variety of window sizes
will determine if a better value is possible.
Still, these results are encouraging, and demonstrate the feasibility of post-processing
error detection as a means to recover from the low accuracy of ASR in radiology.
Pointwise Mutual Information
Overview As a comparative measure, the PMI heuristic was developed according
to the work in Inkpen and Desilets [75]. Like the co-occurrence method above, given a
sufficiently representative training corpus, it is possible to derive word probabilities based
on the probability of occurrence within that corpus. Similarly, the probability of a word
co-occurring with another word within a particular context window can be determined by
the frequency of such a co-occurrence within the training corpus. The probability of two
words occurring independently, versus the rate at which they occur together, provides a
measure of independence that can be used to determine the likelihood of a word occurring
in a given context in a report. If that measure falls below a certain threshold, the word
will be flagged as a possible error.
Materials The training corpus was based on the full corpus of 2751 reports. As
needed, separate co-occurrence statistics for varying context-window sizes were generated.
Method As described in Inkpen and Desilets [75], a semantic similarity score between
two words, w1 and w2, is based on the shared information load of both words. Here
“information load” refers to contextual predictivity, that is, the notion that a word can be
predicted by its preceding word. Equation 3.7 shows the calculation of PMI for two words
(and is repeated in Equation 5.4 for convenience): C(w1, w2), C(w1) and C(w2) represent
the frequency of occurrence (in the training corpus) while n is the total number of words
in the corpus [75]. Therefore, the PMI semantic similarity measure is a reflection of the
probability of two words occurring together and the individual probability of each word
occurring in the training corpus, where “together” is limited by the defined context-window
size [75].
PMI(w1, w2) = log( P(w1, w2) / (P(w1) · P(w2)) )
            = log( C(w1, w2) · n / (C(w1) · C(w2)) )   (5.4)
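Equation 5.4 can be sketched in a few lines of Python. This is an illustrative reimplementation from the formula, not the thesis's actual code; the tokenized training corpus and the helper names (build_counts, pmi) are assumptions.

```python
import math
from collections import Counter

def build_counts(corpus, window=1):
    """Count unigrams and word pairs co-occurring within `window` words."""
    unigrams, pairs = Counter(), Counter()
    for tokens in corpus:
        unigrams.update(tokens)
        for i, w in enumerate(tokens):
            for j in range(i + 1, min(i + 1 + window, len(tokens))):
                pairs[frozenset((w, tokens[j]))] += 1  # unordered pair
    return unigrams, pairs

def pmi(w1, w2, unigrams, pairs, n):
    """PMI(w1, w2) = log(C(w1, w2) * n / (C(w1) * C(w2))), as in Eq. 5.4."""
    c12 = pairs[frozenset((w1, w2))]
    c1, c2 = unigrams[w1], unigrams[w2]
    if 0 in (c12, c1, c2):
        return 0.0  # unseen word or pair: "no similarity" by default
    return math.log(c12 * n / (c1 * c2))
```

Unseen words or pairs yield a PMI of zero here, mirroring the system's default of "no similarity" for incalculable values discussed below.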
The basic PMI calculation in Equation 5.4 applies to two individual words. In the case
of a document, however, the desired outcome is the semantic similarity of an individual word
with respect to the context in which it occurs. Thus, PMI(w1, wordlist) is calculated
as follows: For each word, w, in an uncorrected report, d, the probability of that word
occurring in the training corpus, P(w), is determined. In addition, the co-occurrences
within C(w, d, n) are calculated for a given window size, n; that is, all tuples comprised
of w paired with each member of the context window of w. For each co-occurrence,
the probability of that pair occurring in the training corpus is calculated9. This yields the
individual probabilities of each word with respect to its context, in other
words P(w1, w2). Given this value and the individual probabilities P(w1) and P(w2), the
PMI calculation in Equation 5.4 is applied to determine the semantic similarity between w1
and w2. To arrive at a single measure of PMI for a word, w, within C(w, d, n),
the results are then aggregated by averaging their probabilities over the size of the
context window [75] (as was done in Section 5.2.9).
Once the cumulative PMI value is obtained for each word, the results are normalized
by adding 100 to each value (removing any negative numbers in the dataset). The
final, normalized results are compared to a threshold value, k, flagging those target words,
wt, where P(wt|C) < k. Thus, we capture those words in a report whose occurrence in
their context window is highly improbable; this improbability reflects the likelihood of a
recognition error.
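The per-word scoring and thresholding just described might look as follows. The count structures, the strict-inequality test and the helper names are illustrative assumptions; the +100 normalization and the default k = 100 are taken from the text.

```python
import math

def word_score(word, context, unigrams, pairs, n):
    """Average PMI (Eq. 5.4) of `word` with each neighbour in its context
    window, normalized by adding 100 to remove negative values."""
    def pmi(w1, w2):
        c12 = pairs.get(frozenset((w1, w2)), 0)
        c1, c2 = unigrams.get(w1, 0), unigrams.get(w2, 0)
        if 0 in (c12, c1, c2):
            return 0.0  # unseen in training: treated as "no similarity"
        return math.log(c12 * n / (c1 * c2))
    if not context:
        return 100.0
    return 100 + sum(pmi(word, c) for c in context) / len(context)

def flag_errors(report, unigrams, pairs, n, window=1, k=100):
    """Flag report words whose normalized, window-averaged PMI is below k."""
    flagged = []
    for i, w in enumerate(report):
        ctx = report[max(0, i - window):i] + report[i + 1:i + 1 + window]
        if word_score(w, ctx, unigrams, pairs, n) < k:
            flagged.append((i, w))
    return flagged
```

With k = 100, only words whose average PMI with their context is negative (i.e. co-occurring less often than chance) are flagged.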
As with the co-occurrence analysis, the corrected and uncorrected test reports are
aligned to identify the actual errors. These errors are compared
9 As in the co-occurrence analysis, the training corpus statistics must be calculated using the same context-window size.
Figure 5.5: PMI recall results for 3 window sizes (recall versus threshold; collocation, window size 1, window size 10).
to the flagged errors from the program output to obtain the results.
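The alignment of corrected and uncorrected reports can be approximated with Python's standard difflib. This stand-in classifies mismatches into the usual substitution, insertion and deletion categories; it is not the thesis's actual alignment procedure.

```python
import difflib

def find_errors(uncorrected, corrected):
    """Align tokenized machine and corrected reports; return mismatches as
    (error_type, uncorrected_tokens, corrected_tokens) triples."""
    sm = difflib.SequenceMatcher(a=uncorrected, b=corrected, autojunk=False)
    errors = []
    for op, i1, i2, j1, j2 in sm.get_opcodes():
        if op == "replace":
            errors.append(("substitution", uncorrected[i1:i2], corrected[j1:j2]))
        elif op == "delete":   # extra tokens in the ASR output
            errors.append(("insertion", uncorrected[i1:i2], []))
        elif op == "insert":   # tokens the recognizer dropped
            errors.append(("deletion", [], corrected[j1:j2]))
    return errors
```

For example, aligning "normal alignment with cassettes intact" against the corrected "normal A-P alignment with facets intact" yields one deletion ("A-P") and one substitution ("cassettes" for "facets"), matching the error types discussed in this chapter.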
Results As with the co-occurrence analysis, all results were collected by a manual
analysis of the PMI analyser's output. The graphs in Figures 5.5, 5.6 and 5.7 are based
on the data tables in Appendix C, and show recall, precision and f-measure against
the chosen threshold value for three separate window sizes: collocation (word
pairs/bigrams), and 1 and 10 words to either side of the target word, respectively.
Again, as with the co-occurrence analysis, the system is able to identify error candidates
in under a minute in all cases. The same one-time overhead cost of generating the
co-occurrence statistics for the training sets applies.
Discussion The results shown here do not reflect the same degree of success that
was seen in Inkpen and Desilets. This reflects both the difference in domains (meeting
transcriptions versus radiology reports) and the significantly smaller training set used here.
If a word is not found in the training data, then its probability, and the probability of any
co-occurrence tuples containing it, will be zero, resulting in an incalculable PMI value. By
default, the system sets these values to zero, indicating no similarity.
As in Section 5.2.9, the rate of error detection, or filtering, is affected by the threshold
Figure 5.6: PMI precision results for 3 window sizes (precision versus threshold; collocation, window size 1, window size 10).
Figure 5.7: PMI f-measure results for 3 window sizes (f-measure versus threshold; collocation, window size 1, window size 10).
Figure 5.8: PMI versus Co-occurrence Analysis (COA): recall, precision and f-measure.
value, k, which was established via trial and error. Here the results were normalized (to
avoid negative values) and the best overall results were obtained with a threshold of k = 100.
Since this is a corpus-based technique, it could, as described in Section 5.2.9, easily be
extended to other areas of medicine that share the restricted vocabulary seen in radiology,
provided an adequate training corpus is available.
5.2.10 Comparing Co-occurrence Analysis and PMI
Figure 5.8 compares the performance of the co-occurrence analysis and the PMI analysis,
based upon the best results obtained within each (that is, the window size and threshold
that yield the highest f-measure). As mentioned previously, incorporating multiple
techniques with the same error-type coverage yields more reliable results and, consequently,
a more robust system.
Both the co-occurrence and the PMI analysis techniques could easily be extended to other
areas of medicine that share the restricted vocabulary seen in radiology, provided an
adequate training corpus is available.
Figure 5.9: Combined heuristics on all errors based upon top f-measure (overall performance): best co-occurrence, best PMI, parser, and hybrid.
5.3 A Hybrid Approach
As a proof of concept of the proposed hybrid error-detection method, the above heuristics
have been applied in combination to the test corpus. The results in Figure 5.9 show each
(completed) heuristic applied to the entire error set (regardless of the error subset on which
the heuristic is capable of performing), to reflect its actual performance in the radiology
report setting. "Combined" refers to the application of all heuristics together in a report
analysis via the direct method described in Section 4.2. The combined result shows a 24%,
8%, and 14% increase in recall, precision and the f-measure, respectively, over the best single
heuristic technique, co-occurrence analysis, when compared according to highest f-measure
performance. The high increase in recall is perhaps the most promising as it demonstrates
an increased sensitivity to actual errors, and consequently a lower rate of false negatives.
Clearly, these results favour the hybrid method over previous, independent applications of
error-detection methods in ASR when applied to radiology reports.
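As a sketch of how such a combined run can be scored: treating each heuristic's output as a set of flagged word positions and taking their union is one plausible reading of the direct method (an assumption, not a quote of the implementation); precision, recall and f-measure then follow from the aligned ground-truth error positions.

```python
def combine_flags(*heuristic_flags):
    """Union of flagged word positions: a position flagged by any heuristic
    is flagged overall (one plausible reading of the direct method)."""
    combined = set()
    for flags in heuristic_flags:
        combined |= set(flags)
    return combined

def score(flagged, actual):
    """Precision, recall and f-measure of flagged vs. actual error positions."""
    flagged, actual = set(flagged), set(actual)
    tp = len(flagged & actual)
    precision = tp / len(flagged) if flagged else 0.0
    recall = tp / len(actual) if actual else 0.0
    f = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f
```

Because the union can only grow the flag set, recall never decreases when heuristics are added, while precision depends on how many of the new flags are genuine errors.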
5.4 Summary
This chapter has successfully demonstrated that the conceptual model presented in Chapter
4 is viable and offers the final, concluding evidence that post-recognition error detection can
improve the quality of speech recognition output in radiology dictation. In addition, the
hybrid approach to error detection was shown to be an improvement over any single error-detection
heuristic. In light of these conclusions, the next chapter examines the consequences
and corollaries of the research presented so far.
Chapter 6
Observations and Corollaries
6.1 Introduction
Given the findings in the preceding chapters, it should now be clear that post-ASR, hybrid
error detection is an effective means to recover from low recognition rates in radiology report
dictation. In this chapter, these findings are summarized, and the research questions posed
in Chapter 1 are re-examined. Finally, a critique of the hybrid methodology is provided,
including a list of challenges currently being faced, as well as a look at the implications of
this study and its impact on future studies.
6.2 The Findings
6.2.1 The Hybrid Error-Detection Methodology
The preceding chapters have demonstrated a successful application of the hybrid, multi-heuristic
algorithm, which achieved a performance increase of as much as 24% (recall score)
over any single heuristic technique tested. This shows convincingly that post-ASR, hybrid
error detection is an effective means to recover from low recognition rates in radiology
report dictation. In addition, a series of error-detection heuristics was evaluated
and applied to the problem of error detection in speech-recognized radiology reports. Each
heuristic was evaluated as applied to the entire set of possible errors within a report, as well
as to a subset of errors for which the technique was determined to be the most suitable.
For instance, since the probabilistic methods all employ a stop list, any errors involving
such words cannot be detected (unless they cause an additional error of a type detectable
by that algorithm). Thus, while the system may perform reasonably well when restricted
to its detectable error set, the goal is a system capable of detecting any error in a report;
performance on the entire error set is therefore the primary concern.
The individual results of the probabilistic heuristics are examined using a variety of
context window sizes as well as varying threshold factors controlling the degree of filtering
(i.e. the percentage of words actually tagged as errors), including a study on the effect
of report type and section on the N-gram model. In general, the smaller the window size
used in the N-gram model (and in the subsequent test cases), the poorer the precision rate.
This reflects the inability of the model to generalize sufficiently about the characteristics
of errors, resulting in an oversensitivity and a tendency to overtag. The high recall further
reflects this, as an exceptionally low precision is little different from tagging every word in
a report as an error: in that case the recall is 100% while the precision approaches 0%.
Adjusting the threshold value reflects a tradeoff between recall and precision. With a low
threshold few words are flagged, so the recall is low while the precision is high. As the
threshold increases and more words are flagged, the recall increases while the precision
drops. Nonetheless, the best ratio, calculated via the f-measure (a combined measure of
precision and recall), was found when the threshold was set to zero (or 100 in the case of
the normalized PMI data).
With respect to the co-occurrence analysis, a further step tests the effect on the N-gram
model by splitting the corpora by report type (i.e. anatomical region) and by report section
(limited to “Impressions” or “Findings”, the two largest sections of free text). This test
was performed in the early stages of this research, and therefore on a smaller test corpus
than the eventual 30 reports. Nonetheless, while the “Impressions” dataset proved to have
too little training data (due to the typically small amount of summative text found in the
“Impressions” section), dividing by the “Findings” section and by anatomical region (recall
that “spine” is the only corpus in this study with sufficient reports to support this division)
showed an overall increase in f-measure of at least 6% (see Table C.1), suggesting that
restriction by type or section does have a positive impact and is worth further investigation.
Although such divisions require multiple training corpora, this is again a one-time, up-front
cost.
While the results of the PMI heuristic are lower than those obtained by Inkpen
and Desilets, this is not necessarily indicative of performance failure but rather reflects
the differing domains to which the technique was applied. Further study is needed with a
comparable training data set. From the perspective of the hybrid technique, however, the
performance of the PMI heuristic is sufficient for proof of concept.
Not surprisingly, the parser was found to perform reasonably well on syntactic errors
alone, and more poorly on the entire error set. Nonetheless, the design is such that the
rule set can be readily expanded to account for a wider variety of errors, as well as to
incorporate greater sensitivity to syntactic errors, which will in turn improve the parser’s
individual performance.
6.2.2 On the Nature of Report Errors
After extensive analysis of the test corpus, coupled with further discussions with radiologists,
the following observations on the nature of the errors as found in radiology reports have
been compiled.
Recognition Errors
Error Bias In the CDC-compiled test corpus, the repetition of recognition errors
within an individual report was frequently noted. That is, once a recognition error of a
particular kind was made, the recognizer seemed to show a bias towards
that same error wherever the corresponding sequence of words was repeated. For example,
in one test report the substitution “cassettes” for “facets” was made three times. While on
the surface this may seem reflective of vocal variations among radiologists, in several cases
such error repetition was found to occur only in some, but not all, reports dictated by the
same person. This may suggest transient vocal or ambient influences on the radiologist oc-
curring between reports, such as having a cold or a temporary change in background noise;
to eliminate this possibility a larger sampling of erroneous reports is needed, as well as a
record of the conditions under which the person is dictating. If speaker variation can be
eliminated then the root cause of the repetition may be linked to the recognizer itself.
Insidious Errors Many of the recognition errors within the test corpus were partic-
ularly inconspicuous, such as the substitution of “is” for “as”. When skimming a report
for errors these mistakes can easily be overlooked due to their similarity. Furthermore, it
is often the case that the proofreader may subconsciously correct for the error, especially if
he has dictated the report, as his own expectations can introduce bias. Although in some
cases the intended sentence or phrase may seem clear, when relying on computer-generated
summaries these errors will nonetheless affect the final summarization and any subsequent
reasoning based upon this summarization.
Particularly insidious errors for both humans and computers include deletion errors;
while many deletion errors are detectable by the syntactic parser when a word’s omission
results in an ungrammatical sentence, when the deletion results in an acceptable sentence
such errors are virtually undetectable. Examples include the omission of “A-P” in the
fragment “normal A-P alignment”, the omission of “and” in “central and canal”, or the
omission of “no” in “no evidence of”, where the resulting sentence is still parseable.
Such errors are doubly challenging: the deletion is often a serious one, as in the case
of a missing "no", and current NLP technology is virtually unable to detect it. Detection
requires a deep understanding of the text from a semantic, discourse, and even pragmatic
point of view, to determine whether the surrounding sentence makes sense in the context of
the report.
Words that are particularly susceptible to such insidious errors may need to be replaced
by less problematic words until error analysis reaches a stage where detection is possible.
As an example, “no” might be corrected for by using the words “negative” and “positive”
instead. An immediate challenge to such a solution, however, is the need to convince the
radiologists to modify the way that they dictate.
Post-Recognition Errors
In some cases there were post-ASR errors introduced when the reports were manually cor-
rected by the radiologist (none of these were “strong” errors as per Section 3.1.2). These
errors were detected by the hybrid error-detection system, underscoring the value of a sys-
tem that can provide a second set of “eyes” for the radiologist, beyond ASR, much in the
same way computer-aided diagnosis (CAD) can assist human diagnosis.
6.2.3 General Observations
One of the reasons errors in reports can be difficult to detect by human eyes is that expec-
tations override the actual words present. There is evidence that not every letter in a word
must be read in order to understand it. For example, it is still possible to read a word even
when its interior letters have been permuted, provided the first and last letters remain1. Likewise, this effect can be expanded
beyond the word level to the sentence level where the brain completes the sentence not
based on visual perception but rather on the expectation of its content; anyone who has
proofread their own work is likely to have experienced this effect. Thus, when a radiologist
reviews a report, his expectations of what the report should say can have a negative impact
on proofreading. What the error-detection system does is draw the radiologist’s attention
back to certain areas, forcing a closer look. Recognizing this tendency lends the technology
to other medical tasks and, beyond error detection, to the general problem of
computer-assisted proofreading. For example, by collecting statistics on the errors
overlooked during manual proofreading, it is possible to characterize the nature of these
missed errors. This can help in understanding the mechanisms that allow our expectations
to obscure the actual word, such as the features of particularly problematic words, which
might include a similar orthography, phonology or even features of the surrounding context
words.
6.3 From a Radiologist’s Perspective
There are many issues with ASR in the reading room beyond the immediate problems
with accuracy. An interview with Dr. Forster revealed a long list of problems with the
software and its integration into the radiology environment. Many of these
complaints are echoed in the literature by radiologists working with, or considering, ASR
versus traditional dictation methods (see Chapter 2). The following is a list of the most
common complaints. These do not directly pertain to ASR as it has been covered so far in
this dissertation, yet they directly relate to future extensions of the hybrid error-detection
system as discussed in Section 6.9.
Interface Perhaps the greatest complaint aside from accuracy is the interface between the
radiologist and ASR software. Issues include:
Speed There is often a noticeable delay before dictated commands are implemented,
or before dictated text appears on the screen.
1Tihs is an emxlpae of the atilbiy to raed txet beasd on the frsit and lsat lterts aonle.
Navigation In the ideal user interface, complete verbal navigation is not only possible
but painless. In reality, navigation commands are often printed as text instead
of interpreted directly, or ignored completely. Placing the cursor to select and
correct words in the text is complicated, error-prone and time-consuming via
voice commands alone.
Workspace and Workflow The design and setup of the ASR console should result in a
smooth integration with the workstation. From a software-engineering perspective,
conflicts with existing hospital or clinic software arise frequently. Physically the radi-
ologist often deals with poorly adapted equipment, such as corded headsets, and the
challenge of switching from the image to the dictation screen or between modalities,
such as between the mouse and keyboard when verbal navigation fails.
Inconsistent Performance In some cases, ASR performance seems to degrade after pro-
longed dictation sessions, while certain verbal commands result in seemingly random
responses at times.
Inadequate Training The steep learning curve is exacerbated by poor training on the part
of the vendor, and scheduling conflicts among the radiologists or within the hospital
[97].
Chronic Misrecognition: Poor Handling of Special Words or Phrases Due to very
specific cadence expectations, speech recognizers often misinterpret special words or
phrases, such as the following:
Proper names These include patient or clinician names.
Jargon and Acronyms Many highly specialized medical terms are acronyms, such
as “FSE T2” or “C4/5” and are a frequent source of recognition errors.
Postal codes While not frequently dictated in radiology (and less so with systems
that integrate well with the existing patient information system), Dr. Forster
observes that a very particular cadence is required to successfully do so.
As emphasized throughout this dissertation, the utility of ASR in the reading room is
contingent on its accuracy. Consequently, many of the problems listed directly above may
be reduced to inconveniences once the problem of accuracy is solved. Nonetheless, in the
interest of smoothly integrating ASR and ensuring that radiologists remain as productive as
possible, these issues are highly relevant and will help direct the course of future endeavours
as discussed in Section 6.9.
6.4 A Critical Look at the Hybrid Error-Detection
Methodology
Having established post-ASR, hybrid error detection as an effective means to recover from
low recognition rates, it is now possible to turn a critical eye to the methodology in the
hope of future improvement. As a new theory and application in error detection, the hybrid
methodology is not without challenges. This section examines open problems and weaknesses,
both those facing the methodology in general and those specific to the current
implementation. Where such challenges overlap, they are presented in the section on
methodology challenges.
6.4.1 Challenges Facing the Hybrid Methodology
The following is a list of the current challenges and open problems with respect to the
hybrid error-detection methodology. These will help lay the groundwork for future study
and improvement.
Subtlety in Errors As mentioned above, certain errors are particularly hard to detect.
Deletion errors are especially challenging as omitted words rarely leave a record of their
absence. As a result, the now incorrect sentence remains parseable. Theoretically, an N-
gram model of the domain may detect errors where the omission results in an N-gram with
very low probability. That is, two words that are always separated by some word(s) may
now find themselves adjacent as a result of the deletion error. Unfortunately, this only works
well in the case where the N-gram model is built upon collocations. If the context window,
n, is any larger, the combined result of the co-occurrence probabilities will smooth out the
effect of adjacency.
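The collocation case described above can be sketched as follows; the counts, threshold and sample report are illustrative assumptions.

```python
from collections import Counter

def bigram_counts(corpus):
    """Count adjacent word pairs (collocations) in a tokenized training corpus."""
    counts = Counter()
    for tokens in corpus:
        counts.update(zip(tokens, tokens[1:]))
    return counts

def suspect_deletions(report, counts, min_count=1):
    """Flag adjacent pairs whose training frequency falls below min_count --
    a possible sign that an intervening word was dropped."""
    return [(i, report[i], report[i + 1])
            for i in range(len(report) - 1)
            if counts[(report[i], report[i + 1])] < min_count]
```

If "normal" and "alignment" are always separated by "A-P" in training, then the pair ("normal", "alignment") has a count of zero, and its adjacency in a report flags a possible deletion.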
In addition, while parsing is effective at detecting grammatical errors or concepts that
are in disagreement with the surrounding words in the text, recognition errors do arise
that are not caught within the current defined constraints of the syntactic and semantic
grammar. The hybrid approach means that statistical methods, which characterize reports
by the frequency with which words co-occur with other words in the domain, may detect
recognition errors that the parser failed to detect, on the basis of their infrequency2. Still,
errors do arise that may not be caught by any heuristic, such as contextual errors that
would make sense in another report. For instance, a knee report describes facts relevant
to the knee; a recognition error within such a report that is grammatically correct and
relatively frequent in the training database may nonetheless go undetected.
Thus, the nature of errors merits further investigation, including a detailed analysis of
why certain errors go undetected. Implementation details aside, this can only be done once
the problem of insufficient training data has been controlled for (if there is a statistical
component).
Meta-Level Heuristic Interaction It is possible that an error from one heuristic can
be exacerbated when combined at the meta-level with the results from the other heuristics.
This problem of system reliability can be helped with the inclusion of more heuristics with
overlapping error coverage; in this way no one error is determined by the output of a single
heuristic, and thus if an error should be introduced from one heuristic, the overlapping
output will smooth over the erroneous data. Still, careful study of the meta-level interactions
is needed.
Ambiguity As with any NLP application, the problem of ambiguity is ever-present.
Ambiguity arises when there exists more than one interpretation for a text or segment.
This can happen at any level in the analysis, from multiple syntactic parses to multiple
conceptual analyses, such as those introduced by the MMTx software in deciding between UMLS
concept candidates. It may be the case that despite the semantic, syntactic and N-gram
model restrictions on a text or segment, more than one interpretation may still remain.
Depending on the implementation, the system may simply fail at this point, or choose the
wrong interpretation, resulting in either a false positive or a false negative.
Assessing Implementations of the Methodology Beyond the hybrid methodology,
there is currently no metric for comparing existing error-detection systems and
their performance, making comparative analysis difficult. Even matters as "straightforward"
as the word error rate vary within the literature. This is compounded by a lack of ASR
2Presumably an error should be infrequent or non-occurring in the training corpus if the corpus is built upon correct reports.
error-detection research in radiology. As a result, although the hybrid method outperforms
the individual heuristics in the local domain, it is difficult to compare its performance to the
problem of error detection at large. Still, the hope is that this work will provide a starting
point for comparison of problems in error detection in radiology, as well as inspiration for
expansion beyond the problem of radiology.
What is more, in order to assess the performance of this implementation, and the actual
effect it has in the radiology reading room, a clinic must be found that is willing to have
the system integrated within their current ASR setup.
Data Standardization There is a clear need for standardization in the representation
of medical knowledge, which will affect eventual extensions of this methodology to automated
correction and summarization (discussed in Section 6.9). Furthermore, the field
must see a standardization in the vocabularies and their interfaces, such as the UMLS,
required by many applications of MLP, including the hybrid error-detection methodology
(see Appendix B). By building a successful foundation now, it will be possible to fully
integrate systems hospital-wide, from radiology to paediatrics, while making information
available across the country and beyond via the Internet. Accurate statistics on past cases
can then easily be collected and used for research, patient care and decision support.
Adequate Domain Coverage Any implementation relying on a semantic or ontolog-
ical component faces the challenge of limited domain knowledge; a system that is too broad
is over-general and suffers a loss of accuracy [45], while a system that is insufficiently gen-
eral may not provide enough coverage for the domain at hand. Thus, a conceptual distance
metric risks being mired in an overly detailed ontology, or failing as a result of insufficient
distinction between the terms (that is, all distance measures will be too small to be of any
use).
Accuracy Despite the degree of error detection achieved in the implementation provided, it is
still far from the goal of 100% accuracy. If a system is to be deployed in a medical setting
where it is responsible for handling sensitive data, it must have extremely high accuracy. If a
report is returned to a requesting physician mistakenly identifying a disease or lack thereof,
the consequences could be fatal. As an extension of this, the system must have a strong
integration with the existing PACS3 and hospital information system (and potentially any
3Picture Archiving and Communication Systems.
ASR system already in place) so as to avoid additional errors being introduced throughout
the reporting process.
Data Sparseness As underscored by the existing implementation, data sparseness
and the overall quality of the training corpus is always a potential problem and must be
kept in mind for all statistical analyses. In Section 6.8 this problem is discussed in more
detail.
Choosing the Right Heuristics The choice of heuristics implemented in the hybrid
method is influenced by a number of factors. As mentioned previously, having multiple
error-detection methods with overlapping range of coverage can help increase overall system
reliability. In some cases, it may be known beforehand that only certain error levels or
types are relevant, which may limit the choices of heuristics or influence the choice of one
method over another. For instance, in ASR applications lexical or orthographic errors are
not relevant and have no bearing on the system. In contrast, web-based analysis, such as
the study of weblogs, is likely to encounter colloquial spellings and other variants, all of
which the system must take into account.
6.4.2 Challenges Facing the Current Implementation
The following is a list of the weaknesses within the current implementation of the hybrid,
error-detection algorithm.
Reliance on External Knowledge Sources Relying on the UMLS for the ontolog-
ical component means the implementation is susceptible to the weaknesses of that ontology.
For example, incomplete coverage means that occasionally a valid medical term is found in
a report that is not found in the UMLS. When a NULL value is returned on a legitimate
word, this disrupts the ability of the system to accurately detect errors. The problem was
exacerbated by MMTx's inconsistent handling of legitimate terms that are unknown to the
ontology or that have an entry only in the Metathesaurus, for example.
Data Sparseness Common to both probabilistic approaches is the insufficiency of the
training data. While sufficient for proof of concept, a larger corpus is needed to improve
the accuracy and reliability of the statistical heuristics. Many of the problems with respect
to false positives (and consequently the low precision rate) were attributable to a legitimate
medical word’s absence in the training data.
Incomplete Information As an initial attempt at the problem of low recognition
rates in radiology reporting, the goal was recovery from recognizer-induced errors. These
are errors that occur despite correct input. This discounts any input where the user may
have had a speech impediment (such as a cold), or where unexpected ambient noise was
present. Accounting for these problems is not possible without access to the corresponding
audio tracks; therefore, a more in-depth analysis of the recognition errors was not possible
at this stage.
Assessing System Performance The process for identifying recognition errors de-
scribed in Section 5.2.3 is susceptible to inconsistent interpretations and does not represent
the best way to identify errors. A deeper analysis of the causes underlying consecutive errors,
in particular, is needed before an automated error collection system can be developed.
Corpus Bias As mentioned in Section 5.1.1, the training corpus comprised MRI
reports alone, yet the system was tested on a corpus that included both MRI and CT reports. While the
radiological parlance is similar in both MRI and CT dictations, it is important to recognize
the potential for bias this discrepancy introduces. For example, reference to images specific
to a particular imaging technique may be found in the “Findings” section4. Nonetheless,
when the current test corpus was split and the results separately tabulated for those reports
which were MRI-based and those which were CT-based, no difference was noted in the
performance of the statistical algorithms on either report type. Still, a larger test corpus is
needed to confirm this finding.
Also, since the training corpus was obtained from one clinic only, there is a risk of further
bias in the data. Therefore, it is important that in developing or expanding the training
corpus a greater variety of reports be obtained. This includes a mix of MRI and CT reports
(as well as any other imaging report to which the error-detection algorithm may be applied),
along with input from other clinics.
4The greatest potential disparity, however, is within the "Techniques" section of the report, which is excluded in this research since it is a template selected by the user and not likely to contain any errors (as discussed in Section 5.1.1).
6.5 Corollaries
There are a number of implications stemming from the conclusion that post-ASR, hybrid
error-detection is an effective means to recover from low recognition rates in radiology report
dictation. These are divided into immediate and longer-reaching consequences.
6.5.1 Immediate Implications
The classification of error-detection methods presented in Chapter 3 enables the objective
discussion of existing and future error-detection techniques. This will assist in developing
gold standards both within and outside medicine, making it easier to develop and assess
new error-detection technology.
What is more, the proof of concept from Chapter 5 provides an immediate roadmap for
the development of a system for actual use in the radiology reading room as discussed in
Section 6.6. Combined with the conceptualization in Chapter 4, this will allow improvements
and extensions over the current implementation. As an immediate consequence of a viable
application, radiologists will have another weapon against the problems currently plaguing
ASR in the reading room. Improving the experience with ASR will encourage other radiology
clinics to upgrade without worry of a reduced net performance either in efficiency or in report
quality. A highly reliable ASR system will remove the need for transcriptionists, while an
automated error-detection system will allow the radiologist to proofread and correct his
own reports efficiently. The result is improved report handling and turnaround time (TAT),
improved report quality, and, finally, improved patient handling.
The strength of the hybrid error-detection method over the reliance on any single heuris-
tic also has implications in the development of the meta-level analysis of the component
heuristics and their interactions. The nature of this interaction is important in further-
ing our understanding of how various levels of linguistic knowledge, both probabilistic and
non-probabilistic, work together to form a coherent analysis.
6.5.2 Implications for Future Study
Beyond the immediate implications of this thesis, there are also farther reaching conse-
quences. On a larger scale, the error-detection system (and subsequent advancements) will
help mitigate the difficulties of the transition from traditional dictation methods to ASR-
based systems, a transition that some are now citing as inevitable (see Chapter 2).
The processing in the error-detection system lends itself quite naturally to the problem
of report summarization. What is more, the ability to detect errors in such cases is especially
important since not only must the summarizations be correct for the current patient, but
as electronic records they are likely to find use in subsequent research.
The ability to quickly create an electronic record of a report helps streamline the re-
porting process, resulting in radiology reports that are available throughout the hospital
(via the hospital information system), and remotely to clinics. Doctors waiting for results
will receive them as soon as they are complete, radically improving the TAT. This leaves
open the possibility for efficient tele-radiology operations, or remote consultations between
radiologists, that otherwise might not be possible with multi-day TATs. In addition, pro-
viding well-structured reports will allow clinicians to easily search past cases and perform
statistical analyses, making these reports accessible to both further research and decision
support.
6.6 A Standalone Application for the Radiology Workstation
On its own, the hybrid system from Chapter 5 is nothing more than a promising idea for
post-ASR error-detection. To show the true value, the system must be integrated and tested
within an actual radiology reading room. This section examines what is required to turn
the current software into a program for practical application.
Figure 6.1 shows the error-detection process.
6.6.1 Steps to an Independent System
As it exists now, the hybrid error-detection system is a juxtaposition of various heuristics
that are manually applied to the test corpus. If the system is to become a standalone
application, a front end must be designed that will handle running the various heuristics
in parallel. Since the system runs as a post-processing stage, the output from the speech
recognizer can be provided as external input to the error-detection analysis. The analysis is
then performed and the results automatically output in a format which the radiologist can
modify or correct.
[Figure: interactive front end. Report dictation produces the SR output, which passes through error detection to yield an interim report; the user's corrections then produce the final report.]

Figure 6.1: The error detection process.
Output
Currently, the results of the error-detection process are collected by hand. In order to achieve
application independence, these results must be automatically collected and displayed in a
user-friendly manner. The nature of this display has been given considerable thought and
depends upon the final choice of mappings (the error tag-set {correct,incorrect} or raw
confidence scores).
In the current, manual collection of results, the error-tag-set mappings, as opposed to a
confidence score or percentage, are provided as the final output. This reflects the error-tag
mappings assigned by the individual heuristics. For example, if at least one heuristic maps
a word in a report to incorrect, then it is mapped to incorrect in the final output (as
per the discussion in Section 4.2). These results are compiled by hand and collected in a
text file as a list of word-tag pairs. Such a format offers poor readability; what is needed is
a script that applies these tags to the actual report in an easily readable format.
For instance, all words with an "incorrect" tag may be coloured red within the body of the
report to draw the radiologist's attention to these problem areas.
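The aggregation rule and the marking step might be sketched as follows. This is an illustrative reconstruction, not the thesis implementation; the `<<…>>` marker stands in for whatever highlighting (e.g. red text) the front end applies.

```python
# Sketch: combine per-heuristic error tags using the rule from Section 4.2 --
# a word is "incorrect" in the final output if at least one heuristic says so.

def combine_tags(per_heuristic_tags):
    """per_heuristic_tags: list of lists of (word, tag) pairs, one list per
    heuristic, all aligned on the same word sequence."""
    words = [w for w, _ in per_heuristic_tags[0]]
    combined = []
    for i, word in enumerate(words):
        tags = {tags_i[i][1] for tags_i in per_heuristic_tags}
        combined.append((word, "incorrect" if "incorrect" in tags else "correct"))
    return combined

def mark_report(tagged_words):
    """Render the report, wrapping flagged words in a visible marker (a word
    processor front end could map this marker to red text instead)."""
    return " ".join(f"<<{w}>>" if t == "incorrect" else w
                    for w, t in tagged_words)

# Toy heuristic outputs (invented for illustration):
ngram  = [("no", "correct"), ("spondylolysi", "incorrect"), ("seen", "correct")]
parser = [("no", "correct"), ("spondylolysi", "correct"),   ("seen", "correct")]
final = combine_tags([ngram, parser])
print(mark_report(final))  # no <<spondylolysi>> seen
```

Because the rule is a simple disjunction, a word escapes flagging only if every heuristic agrees it is correct.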
Given a more complicated aggregation of heuristic results that returns a confidence or
percentage score, the combined confidence value of a word may be displayed as a grey- (or
colour-) scale representation of the report, as in Skantze’s work on error detection in spoken-
dialogue systems [135]. This allows a radiologist to immediately and visually characterize the
“Possible spondylolysis eye laterally of L5.
If clinically indicated, CT scan could be
performed for further assessment, but no
spondylolysi cysts is seen. Advanced
degenerative disease at the L-2/3 level.”
1. Possible spondylolysis eye laterally of L5.
2. If clinically indicated, CT scan could be performed
for further assessment, but no spondylolysi cysts is seen.
3. Advanced degenerative disease a the L-2/3 level.
Figure 6.2: Sample output using a grey-scale confidence indication.
state of the report. As discussed in Section 4.1, from the perspective of medicine some might
suggest that all errors should be considered significant; thus, mapping directly to
{correct,incorrect} with a corresponding binary colour scheme may be more desirable than
a gradient representation that admits degrees (that is, a word is either misrecognized or it
is not).
Figure 6.2 is an example of how the error confidence information may be conveyed in the
final output; the sentence “*Possible spondylolysis eye laterally of L5” is a misrecognition of
the sentence “Possible spondylolysis bilaterally of L5”, “*spondylolysi” is a misrecognition
of “spondylolysis”, and finally “a the” is an insertion error.
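A grey-scale display of this kind might be sketched as follows. The bin boundaries and grey labels are illustrative assumptions, not values from the thesis or from Skantze [135].

```python
# Sketch: map a combined per-word confidence score in [0, 1] to one of a few
# grey levels, in the spirit of the grey-scale display of Figure 6.2.

GREYS = ["black", "dark-grey", "light-grey", "white"]  # low -> high confidence

def grey_level(confidence, bins=4):
    """Return a discrete grey level index: 0 (lowest confidence, most
    visually prominent) up to bins - 1 (highest confidence)."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    return min(int(confidence * bins), bins - 1)

def shade_report(scored_words):
    return [(word, GREYS[grey_level(score)]) for word, score in scored_words]

# Invented scores for the Figure 6.2 example; "eye" (0.1) renders black.
report = [("Possible", 0.95), ("spondylolysis", 0.9), ("eye", 0.1),
          ("laterally", 0.4), ("of", 0.97), ("L5", 0.85)]
print(shade_report(report))
```

A binary colour scheme, as suggested above, is just the special case `bins=2`.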
6.6.2 User Interface for the Hybrid Error-Detection System
The user interface of any practical application is the face by which we judge its overall
quality. If a program is cumbersome, difficult to learn, or difficult to operate, it will not be
accepted by the radiology community, which has already shown an understandable resistance
to poorly integrated software [97]. From a purely functional perspective, a system that does
not interface effectively with its user will not run efficiently, irrespective of the computational
efficiency of the system itself.
The application of the error-detection system on the recognizer output should be con-
trollable via the main dictation window (for example, as a Microsoft Word macro). Once
a radiologist has dictated the report, he has the option of choosing to run the resulting
dictation through the error-detection system, or setting the system to run automatically
following report dictation (some speech-recognition systems allow user-defined commands
that could be linked to the error-detection system and called at the end of dictation). On
completion of the analysis, the report, complete with error mapping, is then available via a
word-processing interface (with possibilities for later expansion via suggested correction
candidates, as discussed below in Section 6.9).
Though the exact nature of the word-processing interface is open to speculation, it
must include facilities for correcting errors via the keyboard/mouse or through further voice
commands. As a future extension, a facility for suggesting correction candidates will allow
the radiologist to simply click the appropriate correction and immediately replace it without
further typing, or switching of modalities (i.e. from mouse navigation to the keyboard, or
vice versa).
6.6.3 Miscellaneous Requirements
Since the performance of the system relies on the presence of threshold values that determine
the degree of error filtering, a useful extension is the presence of a “slider”-based interface
that allows radiologists to control the extent of filtering, depending on their preference (and
the task at hand).
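The slider mechanism amounts to a movable confidence cutoff. A minimal sketch, with invented scores:

```python
# Sketch of the "slider" idea: a radiologist-controlled threshold decides how
# aggressively low-confidence words are flagged for review.

def flag_below_threshold(scored_words, threshold):
    """Return the words whose confidence falls below the slider setting.
    A higher threshold flags more words (stricter filtering)."""
    return [word for word, score in scored_words if score < threshold]

scored = [("no", 0.98), ("spondylolysi", 0.2), ("cysts", 0.55), ("seen", 0.9)]
print(flag_below_threshold(scored, 0.3))  # ['spondylolysi']
print(flag_below_threshold(scored, 0.6))  # ['spondylolysi', 'cysts']
```

Lowering the slider trades recall for fewer interruptions; raising it does the reverse, which is why the preference belongs with the user and the task at hand.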
6.7 Measuring the Real-World Success of the System
While assessing the accuracy of the hybrid error-detection system is a useful indication of
the quality of the software itself, it does not reflect the system’s performance with respect
to the actual radiology environment. This performance is a product of not only the software
calibre, but the integration with the existing ASR software and user interface, and addresses
the question, does error-detection augmentation equal or surpass the TAT efficiency of
traditional methods? Thus, any standalone error-detection system must be assessed in the
radiology suite and the report TATs measured and compared against traditional dictation
methods, as well as ASR without error detection. Although a positive effect on the TAT
is expected, it is impossible to assert this as fact without evidence from studies within a
radiology reading-room. Furthermore, the magnitude of improvement over standard ASR
(and across vendor systems) must be measured, as well as the differences between ASR-
augmented-with-error-detection versus non-ASR, traditional methods.
6.8 Data Sparseness: Smoothing
As mentioned in the discussion of the co-occurrence results in Chapter 5, Section 5.2.9,
a zero probability can indicate a failure of the training corpus to provide an adequate
representation of the words within the domain, otherwise known as the problem of data
sparseness. With this in mind, it makes little sense to treat a zero probability as actually
zero, but rather as an inaccurate assessment. Since the training corpus is at best a subset
of the domain, it is impossible to generalize beyond the training corpus to conclude that
any string is impossible, especially if that string is considered “correct” by the domain’s
standards, such as in the case of a false positive. Given that creating a corpus that contains
every single possible word and its environments is an impossible task, it is not possible to
know the “true” meaning of a zero probability relative to the domain. That is, we must
decide whether a zero probability means that the word or N-gram is so rare that it is not
in the corpus, or that the word or N-gram does not occur in the domain.
Although a larger context window size can provide more information to better characterize
a word and its features, data sparseness means that within this larger window the
probability of encountering a word pair that has not occurred in the training corpus is
increased (and therefore the likelihood of having to handle a zero probability) [96]. Even an
exceptionally rare word, which would have a minimal probability of occurrence within the
training corpus, may never have occurred in the relevant N-gram in that corpus, rendering
the results unreliable [96].
The answer to this problem is a technique called “smoothing” [96, 81], which modifies
all probabilities within the training corpus to reduce the effect of data sparseness. A simple
example considers only the k most common N-grams, and discards all other words as “out of
vocabulary” (OOV) words [96]. Manning and Schutze observe that this serves two purposes:
to smooth the resulting probability distribution and reduce or eliminate the presence of zero-
probability words or N-grams; and, to reduce the memory requirements by reducing the
parameter space (i.e. the smaller training corpus) [96]. For the purposes of error detection,
however, reducing the training corpus risks increasing the false-positive rate unacceptably.
In another example of smoothing, zero- or low-probability words and N-grams are re-
assessed and their probabilities modified to better reflect the domain. One naïve method,
add-one (Laplace) smoothing, adds 1 to every frequency count and renormalizes. While a
straightforward solution, this shifts the distribution of probabilities at all levels, not simply
the low-frequency ones, and consequently produces poor estimates that are at times a few
orders of magnitude off [53].
Alternatively, a technique called "Witten-Bell smoothing" uses the probability of extremely
rare occurrences ("things seen once") to estimate those never seen, on the assumption
that a zero-probability occurrence simply has not happened yet [96]. That is, the probability
of a new event is estimated from how often genuinely new events (word types seen for the
first time) appeared as the training corpus was read [96]. It is also important to note that
some words are more likely than others to precede a previously unseen word. By calculating
how many distinct word types follow each word in the training corpus, it is possible to
estimate the likelihood that a new, unseen word will follow that word [96].
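A simplified Witten-Bell-style bigram estimate can be sketched as follows. This is an illustration of the idea rather than the full scheme in Manning and Schutze [96]: the value returned for an unseen continuation here is the *total* mass reserved for new events after that history, which a complete implementation would further divide among the unseen words.

```python
# Illustrative Witten-Bell-style bigram estimate: the probability mass given
# to unseen continuations of a history h is T(h) / (c(h) + T(h)), where T(h)
# is the number of distinct word types ever seen after h in training.

from collections import Counter, defaultdict

def train(tokens):
    followers = defaultdict(Counter)
    for h, w in zip(tokens, tokens[1:]):
        followers[h][w] += 1
    return followers

def wb_prob(followers, h, w):
    seen = followers[h]
    c_h, types = sum(seen.values()), len(seen)
    if c_h == 0:
        return 0.0                       # history itself never observed
    if w in seen:
        return seen[w] / (c_h + types)   # discounted seen-event probability
    return types / (c_h + types)         # total mass reserved for new events

# Toy "corpus" invented for illustration:
tokens = "no evidence of fracture no evidence of effusion".split()
f = train(tokens)
print(wb_prob(f, "evidence", "of"))      # 2/3: seen event, slightly discounted
print(wb_prob(f, "of", "hemorrhage"))    # 0.5: unseen continuation, yet non-zero
```

The key property for error detection is the last line: a legitimate but unseen N-gram no longer receives probability zero, so it is less likely to be flagged as a false positive.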
In conclusion, N-gram, corpus-based probabilistic methods, such as those used for the
error-detection analysis, are susceptible to problems of data sparseness. Any training corpus
is only an approximation of the distribution of words within that domain. Once a larger
(or improved) training corpus has been obtained, experiments with N-gram size can be
conducted with specific attention to word distribution: where distribution can support
larger multi-gram analyses, it should be used [96, pp. 199-202]. Furthermore, a consideration
of smoothing techniques, such as Witten-Bell, may provide relief from the problem of data
sparseness (especially if a larger data corpus does not improve the results).
6.9 Future Work
Perhaps the greatest contribution stemming from the hybrid, error-detection methodology
exists as a function of a larger system. As mentioned above, although the research so
far is sufficient for proof of concept, further development is necessary for this software to
be of actual use in the radiology setting. The error-detection system nonetheless leaves
open the possibility for many future developments to improve the system with respect to
error detection and beyond. This section takes a look at some of the immediate extensions
possible, followed by a look at possibilities in the more distant future.
6.9.1 The Full System
Beyond error detection, the work here can be expanded into a full report analysis system,
the report analyser. As discussed in Chapter 2, such a system involves natural language
processing of reports to produce a computer-accessible (i.e. searchable and updateable)
summarization of a report, a subset of which involves error detection and correction. Since
much of the analysis required for a successful error-detection system overlaps that required
for a full report summarization system, including in-depth syntactic and semantic analysis,
expanding to automated summarization is a natural step. Combined with the value of
summarized reports, it makes sense for the report analyser to be the eventual goal of post-
processing in radiology.
Figure 6.3 shows the full system as envisioned. The user has three possible actions within
the system: dictate report; correct existing report; or query the report database. When dic-
tating a report, the text is collected via ASR and then run through the report analyser. The
report analyser performs an in-depth linguistic analysis and creates a computer-accessible
XML representation of the knowledge within the report, which will support future natural
language user queries. During this analysis, the report analyser applies the error-detection
algorithm to tag the summarized report, displaying the results as a copy of the original
dictated text with error tags. The user is able to make any corrections (or in the case of
an automated correction system to review the corrections made), and finally to “sign off”
on the report as correct and complete. This summarized and signed report is added to the
report database, which the user has the option to query. Figure 6.3 shows the database
query engine as a dotted line since this avenue of research has not yet begun.
The reliance on a XML-based report representation is beneficial for two reasons: it
ensures the adherence to standards in representation that will make integration with other
systems uniform; and, the final reports are in a web-ready format that will allow transmission
throughout the hospital, and even remotely to doctors at external clinics.
[Figure: interactive front end. The user may dictate a report, correct an existing report, or issue a database query; dictation feeds the analysis and error-correction stages, and reports are stored in and retrieved from the report database (DB), which the query engine accesses.]

Figure 6.3: The full system as envisioned.
6.9.2 Immediate Extensions: Improving the Current Heuristics
Improving the Statistical Heuristics
As mentioned in the discussions of both probabilistic heuristics, the lack of training data
has a detrimental impact on the results of N-gram-based models due to data sparseness.
Therefore, the immediate task is to expand the training corpus beyond the current 2751
reports. While the “more is better” approach is generally adequate in the face of corpus
design, the creation of a balanced corpus that better reflects the expected distribution of
radiological text may be a worthwhile approach, given the limited domain and in the interest
of tractability. Although a large domain, radiology nonetheless does not exhibit a wide variety
of expressions within its reports. This could make it possible to intelligently select a training
corpus that would best represent the domain. The problem of rare words may still remain,
however, and may call for smoothing to be applied.
Currently, there is no pre-processing on the text used in the statistical approaches. As
a result, all orthographic variants of words, regardless of tense, et cetera, are treated as
independent terms; that is, "examine", "examined" and "examines", for instance, are all
considered separate, independent terms within the analysis. An interesting experiment is
to stem the words in the text to see whether this affects performance. Hypothetically,
stemming should reduce the variety within the training corpus, thereby increasing its ability
to generalize.
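The conflation effect of stemming can be illustrated with a deliberately crude suffix stripper; a real experiment would use an established stemmer such as the Porter algorithm instead.

```python
# Minimal suffix-stripping sketch (purely illustrative, not a real stemmer):
# conflating orthographic variants before training should reduce the
# apparent variety in the corpus.

def crude_stem(word):
    """Strip a few common English suffixes, longest first, keeping at
    least three characters of stem."""
    for suffix in ("ations", "ation", "ings", "ing", "es", "ed", "e", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

variants = ["examine", "examined", "examines"]
print({crude_stem(w) for w in variants})  # {'examin'}: one term, not three
```

With all three variants mapping to a single stem, their counts pool together, which is exactly the reduction in sparseness the experiment would test.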
By expanding the definition of stop word, low-content words, which have a higher
frequency in a text, can be added to the stop list. The information content of a word is
closely related to its frequency in the training corpus: the lower the probability of a word
occurring, the more information it brings to bear on the surrounding context. Likewise, the
higher the probability, the less information it has to offer (in the extreme case, consider a text
containing only one word repeated multiple times; that word would have zero information
content). Thus, expanding the stop list to include very high frequency words can reduce
the reliance on these words, theoretically improving error-detection performance.
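A frequency-driven expansion of the stop list might look like the following sketch; the cutoff `k` is a tunable assumption, not a value from the thesis.

```python
# Sketch: extend the stop list with the highest-frequency (lowest
# information) words found in the training corpus.

from collections import Counter

def expand_stop_list(tokens, base_stop_words, k):
    """Add the k most frequent non-stop words to the existing stop list."""
    counts = Counter(t for t in tokens if t not in base_stop_words)
    high_frequency = [w for w, _ in counts.most_common(k)]
    return set(base_stop_words) | set(high_frequency)

# Toy training text invented for illustration:
tokens = "is seen is noted is seen no acute fracture is seen".split()
stops = expand_stop_list(tokens, {"no"}, k=2)
print(stops)  # the very frequent "is" and "seen" join the stop list
```

In practice the cutoff would be chosen by inspecting the frequency distribution rather than fixed in advance, since the boundary between function words and low-content domain words is corpus-dependent.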
Some previous work in N-gram models includes sentence boundaries and punctuation
when analysing a training corpus. This may be a useful extension as many sentences within
radiology reports are short, assertive sentences of a limited form. Constraining the words
further based upon sentence boundaries might help characterize this property of a report
within the training corpus.
Lastly, further work on the division of the training corpus by report type and/or section
may be helpful, based upon the results in Table C.1 and the discussion in Section 5.2.9. A
larger test set, also divided by anatomical region, is needed.
Improving the Syntactic Analysis
The parsing heuristic (discussed in Chapter 4) is based upon a basic CHR property grammar
(see Appendix A). Immediate extensions include a finer characterization of the domain
with a more extensive property base, including tense, person, voice, number and mood. By
increasing the variety of sentences recognized by the parser, the presence of false positives
caused by parser failure is reduced (as opposed to an actual error in the text triggering an
incomplete parse).
Improving the Semantic Analysis
The semantic distancer measures the conceptual similarity between two concepts as a func-
tion of their edge distance within a semantic network, as described in Chapter 4. Although
the present implementation is faced with a number of difficulties, with careful consideration
these are not insurmountable. One of the inherent difficulties in the current approach is
the reliance on the MMTx software (see Section 5.2.6). MMTx was chosen as a readily
available program that would allow the error-detection system to interface with the UMLS.
Unfortunately, since MMTx was not designed specifically for this purpose, there were several
difficulties encountered that have delayed the success of the semantic distancer as discussed
in Section 5.2.6. Additionally, it is not possible to supply a stop list, which ultimately could
have an effect on the efficiency of the system given the size of the UMLS. The design and
development of a specialized program to interface with the UMLS dedicated to the task of
error detection will help with these difficulties, and is left to future work.
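The edge-distance measure itself is straightforward to sketch. The toy network below is invented for illustration; a dedicated interface would instead walk UMLS Metathesaurus relations.

```python
# Sketch of conceptual distance as shortest edge distance in a semantic
# network (Chapter 4), computed by breadth-first search.

from collections import deque

def edge_distance(graph, source, target):
    """Return the number of edges on the shortest path between two
    concepts, or None if they are unconnected."""
    if source == target:
        return 0
    seen, queue = {source}, deque([(source, 0)])
    while queue:
        node, d = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour == target:
                return d + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, d + 1))
    return None

# Toy semantic network (invented concepts and links):
toy_network = {
    "fracture":  ["injury"],
    "injury":    ["fracture", "pathology"],
    "pathology": ["injury", "effusion"],
    "effusion":  ["pathology"],
}
print(edge_distance(toy_network, "fracture", "effusion"))  # 3
```

A word whose distance to its report context exceeds some threshold would then be a candidate for flagging, independent of any MMTx-specific behaviour.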
Due to time constraints and the need to keep the implementation within a reasonable
developmental timeframe, the semantic grammar as laid out in Chapter 4 remains to be developed. To
this end, an in-depth analysis of the radiological archetypes that can be used to develop
semantic-grammar rules is required. As a first step, a concordance analysis of the training
corpus will help reveal those constructions most common to a radiology report. From these
it is possible to abstract semantic rules for use in a semantic grammar. Recall that the
syntactic grammar developed has been created with these extensions in mind (see Section
5.2.7) and can be easily extended to accommodate semantic rules. Given a careful analysis,
once a selection of the semantic properties of radiology reports have been extracted, a
translation by hand into semantic rules or constraints will provide a proof of concept, with
an eye to automated rule induction in later stages.
6.9.3 Miscellaneous Improvements
The dictation of individual letters is a known problem within ASR, with 20 of the 26 letters
of the English alphabet, for instance, causing difficulty [127]. Rolandi notes that not only
are English letters monosyllabic for the most part (the exception being “W”), but many of
the letters also sound similar and can be grouped into what he calls “confusion classes”5 [127].
Naturally, this results in difficulties dictating acronyms, or single-letter codes in medicine,
such as “C4”. These difficulties may be partially addressed by the expansion of acronyms;
for example, the expansion of “C4” into “cervical 4”. Zahariev developed an automated
acronym-expansion algorithm designed to function as part of a larger NLP system [164].
This may offer some help in the existing error-detection infrastructure proposed here, po-
tentially helping to solve a subset of the problems surrounding acronyms and abbreviations.
Perhaps the most potential for success with acronyms, however, comes with a change in ra-
diologist dictation habits; teaching radiologists to dictate expanded acronyms. The tradeoff
5For example, {“F”,“S”,“X”}, or {“M”,“N”}.
will depend on the extra time needed to dictate a longer phrase, versus the frequency of
acronym errors and the time spent on corrections.
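A simple dictionary-based expansion of single-letter spinal codes can be sketched as follows. The pattern table is an illustrative assumption, not Zahariev's algorithm [164], which induces expansions automatically; note too that a naïve rule like this would wrongly expand non-anatomical codes such as the MRI sequence "T2", so a real system needs contextual safeguards.

```python
# Sketch: expand single-letter spinal codes such as "C4" -> "cervical 4".

import re

SPINE_PREFIXES = {"C": "cervical", "T": "thoracic", "L": "lumbar", "S": "sacral"}

def expand_codes(text):
    """Replace letter-number spinal codes with their expanded forms."""
    def repl(match):
        letter, number = match.group(1), match.group(2)
        return f"{SPINE_PREFIXES[letter]} {number}"
    return re.sub(r"\b([CTLS])(\d+)\b", repl, text)

print(expand_codes("Possible spondylolysis bilaterally of L5."))
# Possible spondylolysis bilaterally of lumbar 5.
```

Multi-letter acronyms such as "CT" pass through unchanged, since the pattern requires a single letter immediately followed by digits.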
As laid out, the hybrid error-detection algorithm provides feedback to the user as a post-
processing stage following dictation. It may prove useful to adjust the system so that
detection occurs in a more "on the fly" manner, whereby radiologists receive feedback
on the error status of their dictation as it is being dictated. This might involve tagging the
text as it is dictated. Without the benefit of the text from the entire report, however,
as-you-go tagging may allow only certain heuristics to function to their full efficacy (a dual-
pass mechanism might be devised whereby some errors are detected on the fly, with a full
error check following dictation).
From Detection to Correction
Once the errors are identified in a report it falls on the radiologist to review the report and
implement any needed corrections. This process can be made more efficient by the presence
of a well-thought-out user interface. Currently, following ASR dictation, the radiologist can
switch to the dictation screen and, via the mouse and keyboard, implement the necessary
changes. In many cases making changes using the voice interface is challenging and time-
consuming, while the speech recognizer’s suggested “corrections”, intended to streamline
the correction process, are often far from relevant.
Once a system is in place for detecting errors, however, it is possible to use the infor-
mation from the error analysis towards intelligently suggested corrections. For example,
a semantic-based analysis can reveal clues as to the semantic type of the expected word
or phrase, while a syntactic analysis can reveal the expected part of speech. In addition,
the probabilities determined based upon the N-gram model of the domain can be added
to this information to help narrow down the list of possible candidates. When augmented
with the N-best list from the recognizer, this can help create a more intelligent suggestion
list. While a black-box method does offer a “fresh perspective” on confidence ranking of
recognized terms, independent of any recognizer influence, it is still useful to have access to
the internal workings of the recognizer when possible.
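One piece of this combination, re-ranking the recognizer's N-best list by the domain N-gram model, can be sketched as follows. All names and probabilities are invented for illustration; a fuller version would also filter candidates by expected part of speech and semantic type.

```python
# Sketch: re-rank recognizer N-best candidates for a flagged word by the
# bigram probability of each candidate given its left context.

def rank_candidates(nbest, left_word, bigram_probs):
    """nbest: candidate words from the recognizer for the flagged position.
    bigram_probs: mapping (left_word, word) -> domain-model probability.
    Returns candidates ordered from most to least probable."""
    scored = [(bigram_probs.get((left_word, c), 0.0), c) for c in nbest]
    return [c for _, c in sorted(scored, reverse=True)]

# Invented probabilities for the Figure 6.2 example:
bigram_probs = {
    ("spondylolysis", "bilaterally"): 0.02,
    ("spondylolysis", "eye"): 0.0,
    ("spondylolysis", "laterally"): 0.005,
}
nbest = ["eye", "bilaterally", "laterally"]
print(rank_candidates(nbest, "spondylolysis", bigram_probs))
# ['bilaterally', 'laterally', 'eye']
```

Here the domain model promotes "bilaterally" over the recognizer's original choice "eye", which is exactly the kind of suggestion the radiologist could accept with a single click.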
While arguably the ultimate goal of a “full-service” summarization system is not only au-
tomated error detection, but automated error correction as well, any procedure in medicine
that removes human eyes from a process introduces the risk of undetected errors. The pur-
pose of the radiology application of the error-detection methodology is to add reliability and
reduce the risk of user error. A fully automated system may risk the introduction of new
errors altogether, thus its development, design and integration will be difficult and further
study is required on the expected impact and reliability.
6.10 Summary
This chapter explores the consequences arising from the theoretical and experimental work
from previous chapters, both within radiology, and within error detection and report sum-
marization in general. An investigation of the challenges faced by the current approach,
and suggestions for improvement is presented. These provide a natural segue into a vari-
ety of avenues for future work that include improving the current hybrid system, adding
further heuristics, and the development of a full report analysis software system that takes
advantage of and extends the processing already done towards error detection. In the next
chapter, the wider application of the hybrid error-detection methodology will be explored.
Chapter 7
Beyond Radiology
7.1 Error Detection in the Greater Context
While post-ASR, hybrid error detection is an effective means to recover from low recogni-
tion rates in radiology report dictation, there exist valuable applications beyond the medical
domain. This chapter explores two such applications in the fields of cognitive science and
general natural language processing. These discussions are not intended as a proof of con-
cept, but rather to demonstrate that the hybrid, error-detection methodology is applicable
to a larger context, and to inspire future work.
7.1.1 The Methodology in Other Domains
Although the application of the error-detection methodology has focused on radiological
functions alone, the methodology itself is sufficiently general to be extended to other prob-
lems of error detection from other areas of medicine, the World Wide Web, and beyond. In
general, this requires sufficient domain information in the form of an ontology to support
the semantic analysis (the UMLS would suffice for any medical applications, for example)
as well as an update of the semantic rules, currently representing radiological archetypes,
to reflect a new domain. Lastly, a database of text samples in the new domain on which to
train the statistical algorithms is also needed. Still, despite these requirements the general
challenge of adapting to another domain is not unreasonable, and the methodology may
find favour in other fields.
The black-box error-detection techniques discussed so far are not dependent on input
CHAPTER 7. BEYOND RADIOLOGY 128
from ASR and therefore can be extended to error-detection tasks from different sources, such
as the general problem of computer-assisted editing. This can include improved utilities for
word processing, the World Wide Web, and more.
Within some domains, the choice of heuristic may be influenced by constraints on the
domain itself. For example, it may not be possible to collect an adequate training corpus
to develop an N-gram model of the domain. While this may be especially true in highly
constrained domains, it may be offset by the ability to develop highly accurate parsers based
on the limitations of expressions occurring within the domain. Similarly, some domains may
not enjoy the same degree of constraint in the possible range of concepts, or likewise may
have fewer grammatical restrictions on the words in the language. Both of these conditions
would preclude the use of conceptual distance (with too many varied concepts the measure
of distance loses predictive power), and certain highly-specified semantic analyses, such as
domain-specific verb complements.
An automated error-detection system allows for text quality assessment that is not
susceptible to errors or bias in human detection, and provides opportunities for higher level
statistical analysis that is not possible by humans alone.
As a final general note, a large domain necessitates a larger information base (training
corpus, ontology, lexicon, et cetera), which increases processing time and in turn
may limit the usefulness in some domains.
The following sections take a closer look at two particular applications of the hybrid
error-detection methodology outside of medicine: cognitive science and machine translation.
7.2 Cognitive Science Perspectives on Error Detection
Cognitive science is the scientific study of the mind and intelligence from a multi-disciplinary
perspective; it sits at the intersection of the broad areas of neuroscience, linguistics, psy-
chology, philosophy, and computing science. Specifically, cognitive scientists are interested
in cross-applying the methodologies and theories from these fields in an effort to understand
cognition: the mental operations and structures relating to the brain. This section examines
an application of the hybrid error-detection methodology to cognitive science, specifically
the sub-area of psycho- and neuro-linguistics.
7.2.1 Error Detection: Applications in Neuro- and Psycholinguistics
Neuro- and psycholinguistics are two approaches to linguistics that fall under the purview
of cognitive science1. Although overlapping, in general, neurolinguistics sits at the inter-
section of linguistics and neurology, while psycholinguistics focuses on linguistics and the
psychological aspects of language. In particular, neurolinguists are interested in the struc-
ture underlying language in the brain, as well as the relation of the various components of
language (lexicon, syntax, semantics, phonology) among themselves, and correspondingly
to the structure of the brain itself. Of primary interest are the neuropsychological linguistic
mechanisms driving language and grammar. Although closely related, within psycholinguis-
tics, researchers are focused on the psychology of linguistic behaviour, including first and
second language acquisition, and the mental representation of language.
Toward furthering the research in these fields, computing science often finds its niche
in the creation of applications modeling current theories. In general, there are two (over-
lapping) goals within computer applications in cognitive science: to model, replicate, and
improve on human mental capabilities; and, to further the understanding of the human mind
through computer models. While the former concentrates on artificial intelligence with the
ultimate goal of meeting or exceeding human intelligence, within the latter, the goal is to
recreate human intelligence complete with human inadequacies in an effort to learn how
the mind works. When focusing on understanding actual human intelligence, creating an
adequate model to test one’s theory is a challenging and open problem that plagues many
subareas within the field.
The multi-heuristic method of the hybrid approach to error detection is useful from
the perspective of modeling human language representation and processing in children and
adults. Although the nature of the representation and handling of language remains open
to debate, several promising theories have been put forth. These theories exist at various
levels of representation, including the conceptual, rule-based, logical and image-based mod-
els; as well as a hybrid, multi-representational account of the mind and language processing
that spans these models [146, 64, 151]. The performance of the hybrid language analysis
present in the error-detection system can offer some insight into these arguments with the
1 This information was compiled in part from information provided at http://www.nytud.hu/depts/neuro/index.html (last accessed February 16, 2006).
demonstrated increase in performance from the multi-heuristic approach. If semantic, syntactic, and statistical methods in combination can be shown to outperform each method individually, this
may offer some support to the hybrid-representation theories and further the theory of the
modularity of language.
Furthermore, capturing errors via a multi-level technique allows for error analysis at
each of these levels, leading to a more in-depth characterization of the nature of errors.
This includes understanding the similarities and differences in error detection at each level,
which will in turn help future error-detection software as well as contribute to a deeper
understanding of the nature of these errors.
Non-Invasive Techniques for Speech Pathology
As an extension of this discussion, studying acquired cognitive deficits, such as speech
pathology, requires an understanding of the representation and handling of language within
the brain. While studying such pathologies can lead to better understanding of natural
language processing in computers, the reverse is also true – the modeling of such deficiencies
through natural language processing and knowledge representation can lead to a better
characterization of the nature of the pathology. This may be useful in the diagnosis of such
disorders and in the development of assistive technology for speech-impaired individuals.
The hybrid error-detection methodology is an example of a tool that can be applied to
cognitive language deficiencies. Acting as a model of the errors present within a patient’s
speech, the methodology can be applied to help characterize and diagnose such errors, and
could have ultimate extensions in an error correction utility for sufferers of such afflictions
(such as the case of high-functioning individuals).
Depending on the extent of injury, the language impairment may be selective (referred to
as aphasia), such as the inability to form syntactically appropriate sentences, or to correctly
interpret the meaning of words of a sentence, resulting in errors when speaking [151]. The
selective hybrid error-detection mechanism allows a more precise measurement of the injury
based on the distinctive and discriminating errors made when the patient is speaking, which
can include errors involving the lexicon, syntax, semantics, or even morphology.
The hybrid approach to error detection readily lends itself to division by error type. Such
knowledge of a speaker’s language and the corresponding errors can help in broadening the
understanding of the related processes within the brain (for instance, if statistical analysis,
as suggested by theories of “global lexical co-occurrence”2 [92], occurs in a fashion separate
from syntactic or semantic processing within the brain).
In addition to speech pathology, the error-detection algorithm is useful in developmen-
tal studies, such as formal and informal assessment procedures for syntactic, semantic, and
pragmatic aspects of oral and written language (including pathology diagnosis, as mentioned
above). By providing a computerized analysis of language samples, language efficacy can
be tested and qualitatively evaluated. This includes development aspects such as the order
of acquisition of various syntactic and morphological processes, as well as the nature of
errors made by children in spontaneous speech. This gives rise to comparison studies be-
tween developmentally delayed and normal children to help understand and diagnose specific
language impairment.
Finally, the modular design lends itself to expansion by further heuristics, which may
be used to test other cognitive theories that pertain to language and language pathology.
Within neurolinguistics, the error-detection results can be used to validate existing theories,
or postulate new ones by comparing the actual patient errors with the ones predicted by
current aphasic grammar models and corresponding theories.
Other Applications
Aligning the output of the error-detection analysis of speaker output with a magnetoen-
cephalography (MEG) or electroencephalography (EEG) study, if time stamped, would
allow errors and their type to be correlated with specific brain events or activation bursts.
This may in turn offer insights as to the processes occurring alongside each type of error. Al-
ternative forms of analysis, such as PET (and SPECT) and functional MRI, which monitor
glucose metabolism and changing blood flow to show patterns of activity within the brain,
can also be correlated with error occurrence. Such knowledge may further understanding
of the interaction between the multiple linguistic levels of processing in the brain and allow
for more in-depth functional mapping.
7.2.2 Error Detection and Language Acquisition
Despite an extensive history in the literature, many aspects of first-language acquisition in
children remain open problems [92]. This section takes a brief survey of areas in which the
2 Briefly, this refers to a word’s co-occurrence statistics.
hybrid error-detection methodology may find application.
Meaning Acquisition
A particularly difficult issue in language acquisition is that of meaning acquisition, a topic
which has generated two major hypotheses: the semantic bootstrapping hypothesis, and
the syntactic bootstrapping hypothesis [57, 109, 110, 88, 55]. As Li et al introduce, the
semantic bootstrapping hypothesis postulates that children learn syntax based on the un-
derlying semantics of language. This is driven by the ontological mappings of the world,
which constrain valid sentence construction [92]. In the syntactic bootstrapping hypothesis
the reverse is true: children glean semantic knowledge based on the grammatical context
of words and their corresponding semantic classes. Li et al explain that the underlying
assumption is that such classes are partly constrained by the syntax, meaning only certain
classes occur in certain syntactic scenarios [92].
Pinker [111] has suggested that syntactic bootstrapping can only give rise to knowledge of
categories of meaning, as opposed to actual “semantic content”. If one expands the context
of a word to include that word’s “total experience in the context of all other words with
which it co-occurs” [92, page 168], Li et al show that this criticism is no longer valid. This
“global context” can be likened to the co-occurrence analyses discussed in Chapters 4 and
5 and is not restricted to the immediate grammatical environment. This is in contrast with
the “local context”, which refers to only the immediate grammatical constraints acting on
a word (such as complement clauses) [92].
The Critical Period Hypothesis
The critical period hypothesis postulates a period of optimum language acquisition from
birth to puberty, after which the ability to learn a language sharply diminishes [138]. Sup-
porters of this theory suggest that such a change is due to the ways in which the brain
processes information past puberty. Evidence for this theory is in part shown in children’s
remarkable ability to develop a full grammar despite insufficient input (e.g. a lack of negative
grammar examples)3 [138]. Still, the nature of language within the brain is not completely
understood. Evidence of varying rates of second-language acquisition beyond puberty, as
3 This is referred to as the argument from the poverty of stimulus.
well as changes in attitudes towards learning and in continuity of learning (i.e. one may
actually fall out of practice, as opposed to suffering a deterioration of brain capacity), raises
serious questions that the critical period hypothesis must address.
With this in mind, understanding the nature of errors made at different points in lan-
guage development may help delineate any relevant age markers surrounding the so-called
critical period. Error-detection analysis may reveal evidence for a “language threshold”
after which the nature and number of errors made may drastically change. Alternatively,
it may instead reveal a gradual degradation in learning that is not marked by a sudden
decrease in ability, which may offer evidence against the critical period hypothesis.
7.3 Quality Control in NLP Applications
Within natural language processing (NLP)4 many of the major tasks involve summariza-
tion and/or translation, including document summarization, query-answering and machine
translation. In order for any of the applications within these sub-areas to be considered a
success, there must first be some means of evaluating performance to ensure accuracy and
establish the potential margin of error. While human evaluations are often considered the
“gold standard”, such studies can take months to complete [108]. Further, there is some
question as to their reliability and ultimate limitations [34]. Consequently, much recent work
has gone into evaluating and creating automated evaluation metrics for this very purpose,
such as IBM’s BLEU [108] and ISI/USC’s ROUGE [94]. Such automated techniques are
based upon the “N-gram” metric and have found recent favour within machine translation,
among other areas, for evaluating translation quality. This section explores the basic formu-
lation of such metrics, and suggests how the hybrid error-detection methodology may help
advance the field of automated evaluation.
“N-gram” Metrics for Machine Translation
Although originally introduced to machine translation (MT) within NLP, the so-called “N-
gram” metrics can be applied to a wide range of NLP tasks, united in the goal to establish
document quality on the basis of one or more “correct” reference document(s).
4 A more thorough introduction to NLP and subsequently to Medical Language Processing, or MLP, is provided in Chapter 2.
BLEU and ROUGE are two popular examples of N-gram-based models of language.
Like the context windows discussed in Chapter 4, N-gram models rely on words’ contexts
to determine an individual word’s “N-gram count” in a text, a useful feature for many
calculations. Here “N-gram” refers to the context size (the words preceding or following the
target word): a “unigram” is the word itself, while “bigram” and “trigram” refer to two-
and three-word tuples, respectively. The “N-gram count” is how often a particular N-gram
occurs in the training corpus.
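To make the counting step concrete, the following is a minimal sketch in Python; the toy corpus and the function name are illustrative only, and are not part of BLEU or ROUGE themselves:

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every n-gram (tuple of n consecutive words) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

corpus = "my favourite flavour is vanilla and my favourite colour is blue".split()
unigrams = ngram_counts(corpus, 1)
bigrams = ngram_counts(corpus, 2)

print(unigrams[("favourite",)])      # unigram count of "favourite": 2
print(bigrams[("my", "favourite")])  # bigram count of "my favourite": 2
```

The same function yields trigram counts with n = 3, mirroring the context-window sizes discussed in Chapter 4.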
Essentially, the N-gram metric relies on the notion of “N-gram precision”, computed by
aligning all N-gram counts in the source document with those in the reference documents
[108]. If one considers the word-error rate (WER) calculation used in speech recognition
as a measure of the distance between a document and the underlying, “true” document,
then this measure can be adapted to measure the degree of alignment between a source text
and the reference translation(s) [34, 108]. Like the work in Chapter 4, N-gram evaluation
methods such as BLEU calculate the WER based on the probability of a word occurring in a
given context by conditioning on the preceding words, and are dependent on the context-window
size. For example, consider the following:
"My favourite flavour is vanilla."
Instead of calculating the individual probability of each word in its context, the probability
of the entire sentence above is approximated by examining a limited context window for
each word within the sentence, and combining the resulting probabilities [34]. This gives us
the following, where “<s>” represents the sentence boundary:
P(my, favourite, flavour, is, vanilla) = P(my | <s>) · P(favourite | my) · · · P(vanilla | is).    (7.1)
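Equation 7.1 can be computed directly from such counts. The sketch below uses maximum-likelihood bigram estimates; it is a simplified illustration with an invented training text, not the implementation used by the heuristics in Chapter 4:

```python
from collections import Counter

def bigram_probability(sentence, training_text):
    """Approximate P(sentence) as a product of P(w_i | w_{i-1}) terms,
    estimated by maximum likelihood from bigram and unigram counts."""
    tokens = ["<s>"] + training_text.split()
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    words = ["<s>"] + sentence.split()
    p = 1.0
    for prev, word in zip(words, words[1:]):
        if unigrams[prev] == 0:
            return 0.0  # unseen history: the estimate collapses to zero
        p *= bigrams[(prev, word)] / unigrams[prev]
    return p

# Training on the sentence itself makes every factor 1, so P = 1.0:
print(bigram_probability("my favourite flavour is vanilla",
                         "my favourite flavour is vanilla"))  # 1.0
```

A practical model would smooth these estimates; here any unseen bigram simply drives the sentence probability to zero.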
While a unigram measure can be used, Papineni et al observe that a translation of a text
that uses the same words as the reference translation, but in random order, will still have
a high unigram overlap yet poor fluency (i.e. document coherence). Thus, a
measure of the length of consecutive matches, which is achieved through the longer N-gram
matches, is needed to account for overall document fluency.
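This fluency argument is easy to demonstrate. The sketch below computes a clipped n-gram precision in the spirit of BLEU (a single-n simplification, not the full metric with its brevity penalty and geometric mean over n): a scrambled candidate retains perfect unigram precision against the reference but loses its bigram precision entirely.

```python
from collections import Counter

def ngram_precision(candidate, reference, n):
    """Fraction of the candidate's n-grams that also occur in the reference,
    clipping each n-gram's credit at its count in the reference."""
    def counts(text):
        tokens = text.split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    cand, ref = counts(candidate), counts(reference)
    overlap = sum(min(c, ref[g]) for g, c in cand.items())
    return overlap / max(sum(cand.values()), 1)

reference = "my favourite flavour is vanilla"
scrambled = "vanilla is my flavour favourite"  # same words, random order

print(ngram_precision(scrambled, reference, 1))  # 1.0
print(ngram_precision(scrambled, reference, 2))  # 0.0
```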
Lately the N-gram methods have met with criticism for their lack of restriction on
translation coherence or grammaticality. As Culy and Riehemann note, the above N-gram
metrics are not “measures of translation goodness” but rather of document similarity, and
rely on the assumption that “a good translation of a text will be similar to other good
translations of the same text” [34, page 71]. In their experiment, however, they noted
that incomprehensible (low fluency) machine translations could still score higher than a
fluent human translation using these metrics. Furthermore, the quality of the translation
assessment was directly dependent on the number of reference translations available. Culy
and Riehemann conclude that while not terrible, the results of N-gram metrics are not great,
either.
Improving on “N-gram” Metrics using Error Detection
In response to the above weaknesses of existing N-gram-based metrics for evaluation, the
hybrid error-detection algorithm can offer an alternative measure of machine translation
quality.
Observation 8 Machine translation errors often result in grammatical errors.
Observation 9 The word-error rate is an indication of translation quality.
Observations 8 and 9 are based on the assumption that a poor translation will have a high
number of errors, where those errors can be syntactic or semantic (as in speech recognition,
a fixed lexicon precludes errors at the lexical level). Consider the following translation
candidates from Papineni et al [108]:
It is a guide to action which ensures that the military always
obeys the commands of the party. (Sentence 2)
It is to insure the troops forever hearing the activity guidebook
that party direct. (Sentence 3)
It is clear that Sentence 3 is a poor translation candidate based upon its ungrammat-
icality. The hybrid error-detection method is able to detect such errors, among others.
Instead of relying on the similarity between a machine-translated document and a series of
reference translations (which can be erroneous themselves), this algorithm requires only the
machine translation output being tested. This reduces complexity and yields a direct
qualitative assessment, rather than a similarity score that presumes the reference
documents are themselves correct.
From a different perspective, the hybrid algorithm might be considered an alternative
similarity measure, where the document being analysed is compared to the features of what
would be the grammatically correct text.
The hybrid nature of the error-detection algorithm also allows for discovery of errors
based on type, giving rise to a qualitative evaluation of a document as being “semantically”
sound, despite syntactic errors. This indicates that the concepts within the translation
are correct, meaning the document may, in fact, provide an adequate gist of the original.
Semantic errors, on the other hand, may indicate more fatal flaws in the resulting translation.
As with the other applications of the hybrid error-detection methodology, a sufficient
training corpus characterizing the relevant domain is required for the statistical analysis
portion. However, depending on one’s goals, and the quality of the syntactic and seman-
tic rule-based component, the choice may be made to omit statistical analyses from the
heuristics used.
Related Areas of Application in NLP
The issue of the evaluation of automatically generated text extends beyond machine trans-
lation to other research areas within NLP. It is often the case that a poor translation or
output in one domain of NLP will have syntactic and semantic errors at the very least.
Thus, the error-detection methodology is useful in any evaluation of an NLP technique
where the output is susceptible to grammatical errors introduced as a result of the process
in question. This includes machine translation, document summarization, question answer-
ing, natural-language generation, information extraction and computer-assisted document
proofing. Errors or grammatical inconsistencies present in the output text can indicate
flaws in the underlying generation algorithm, much as such errors indicate translation
flaws in the machine-translation case above.
7.4 Summary
This chapter has surveyed a range of applications of the hybrid error-detection methodology,
beyond recovery from low recognition rates in radiology report dictation. This establishes
the methodology within the broader context of error detection, and demonstrates the wider
extent of the contributions presented so far. Quantitative analysis of these theories is left
as future work with the hope that the ideas presented within this chapter will serve as
inspiration for further research, including other unique applications of the methodology not
covered here. In the following, final chapter, the conclusions, contributions and consequences
resulting from the research presented in this dissertation are summarized.
Chapter 8
Conclusions
Lured by efficient services with respect to time and money, as well as improved patient
care, medicine continues to incorporate artificial-intelligence technologies more fully into
the existing armamentarium. This includes the gradual replacement of transcriptionists
with ASR systems, and the addition of automated summarization systems in the radiology
department.
Despite the trend towards automation in the reading room, ASR remains a weak al-
ternative to traditional transcription. This is attributable to poor accuracy rates and the
wasted resources spent on proofreading erroneous reports.
The work presented here was motivated by poor integration and low accuracy rates
of ASR in radiology, and the frustration of radiologists with the technology. This had
introduced many delays and wasted resources, including the need to extensively proofread
reports to search for recognition errors, any of which could have serious consequences for
the patient.
In addressing these issues, this dissertation has made several contributions to the field
of error detection. Early in the research a lack of a comprehensive theory of error detection
in ASR was noted, leading to the development of a classification of error-detection methods
to account for this absence and providing an objective measure of existing and future ASR
error-detection endeavours. In addition, the nature of recognition errors was extensively
investigated, providing the foundation for the hybrid methodology in Chapter 3.
A hybrid methodology was postulated as a multi-heuristic, modular approach to ASR
error detection in radiology report dictation (see Chapter 4). This methodology was built
upon the notion of the complementary coverage of error types in different error-detection
heuristics. Four AI-inspired heuristics were developed and analysed based upon their varying
strengths with respect to relevant error types and overlapping coverage (to help ensure over-
all system reliability). These included two probabilistic, N-gram-based heuristics: pointwise
mutual information (inspired by the work of Inkpen and Desilets [75]) and co-occurrence
analysis (inspired by previous work by Voll et al [154]). The remaining two heuristics are
non-probabilistic methods: a parser inspired by constraint handling rules, chosen for their ease of
development and suitability for the purposes of proof of concept, and a conceptual distance
metric, inspired by previous work [23].
When the heuristics are combined, the result is a high-coverage, high-accuracy error-
detection system. This was demonstrated with a proof of concept applying the hybrid
methodology to detection of errors in actual radiology reports (presented in Chapter 5).
Most notably, the hybrid approach achieves a 24% recall improvement over any individual
heuristic. This technique shows promise as an effective means to recover from the unaccept-
able accuracy rates of ASR. Flagging potential errors enhances the proofreading process,
restoring the benefits of ASR in resources saved. The result is a more efficient reading room
and an improved experience with ASR.
The implications of the hybrid error-detection methodology were presented, both within
radiology and within error detection and report summarization in general. In addition, a
wide range of avenues for future work within radiology report dictation
and ASR were put forth, including the challenges currently faced by the methodology and
ASR in medicine.
The hybrid methodology is not limited to the domain of radiology. To illustrate the
applicability to the general context of error detection, two non-medical domains, namely
speech pathology and machine translation, were offered as theoretical applications of the
methodology and inspiration for further work (see Chapter 7).
With this in mind, the research questions first posed in Chapter 1 can now be revisited.
Foremost, the question of improving the accuracy of speech recognition in radiology
has been answered with the novel solution of a post-recognition, hybrid error-detection
methodology. This methodology not only demonstrates how error detection is applicable to
radiology report dictation, but it advances current methods of error detection in general.
Previous attempts at error detection, along with other relevant work, were first examined
to characterize the current state of error detection. The nature of recognition errors was
explored, with a breakdown of error type categories, along with the linguistic levels at which
they occur in Chapter 3. The hybrid error-detection methodology and its implementation
combined the extant relevant work; the error-type analysis; and the investigation
of the properties of ASR, ASR-related errors, and radiology reporting. The implications of
this research were explored not only in the ramifications relevant to radiology, but also in
applications outside of medicine.
I lastly reiterate the contributions of this research to the field of error detection in ASR:
• A classification of error-detection methods for speech recognition.
• A hybrid error-detection methodology.
• A successful proof of concept applying the hybrid methodology to radiology report
dictation.
• Two theoretical applications of the technology beyond the domain of radiology.
In conclusion, I submit that the research within this dissertation supports each of the
hypotheses originally presented in Chapter 1:
• As a post-processing stage, methods in medical language processing can effectively de-
tect recognition errors in radiology reports dictated via automatic speech recognition.
• Combining complementary methods of error detection results in improved sensitivity
to report errors.
• Tagging erroneous reports based on the quality of their output can avoid the need for
an in-depth re-read of the report.
• Post-recognition error detection is a viable means to improve ASR in radiology re-
porting.
• Post-recognition error detection has applications beyond radiology reporting.
Therefore, it is concluded that post-speech-recognition, hybrid error detection is an ef-
fective means to recover from low recognition rates in radiology report dictation.
Appendix A
Glossary of Medical and
Non-Medical Terms
A.1 Radiology
Image Modalities Methods for generating internal images of the body. Modalities
include:
• Magnetic Resonance Imaging (MR) A radiology imaging modality that uses radiofre-
quency waves and a strong magnetic field to take an image of internal tissues. It is
particularly well-suited for soft tissue analysis.
• Radiography (X-ray) An internal image of the body that relies on the different absorp-
tion of X-rays by varying tissues in the body. The X-rays that are not absorbed pass
through the body into the film. These exposed areas of the film turn dark, leaving the
white, characteristic patterns of the bones (which tend to absorb most of the X-ray
energy and hence do not pass through to affect the film).
• Computed (Axial) Tomography (CT/CAT) A technique for obtaining multiple X-ray
images of different angles of the body. These images are then combined using a
computer to generate cross-sectional views of the body.
• Positron emission tomography (PET) A technique for acquiring images based on the
detection of emissions from radioactive substances (typically injected into the body).
Like an X-ray, the emissions are detected using a film that differentiates areas of high
versus low emissions. Where the radioactive substance collects in the body, this will
produce greater emissions that are detectable on the film.
Picture Archiving and Communication System (PACS) In radiology, this sys-
tem manages the acquisition, storage, transmission and display of digital (filmless) images
on a computer network.
Radiograph A picture produced by a radiology imaging technique, such as an X-ray.
This image may be generated traditionally using film, or digitally as a filmless image stored
on a computer.
Radiology A branch of medicine that applies radiant energy or radioactive material
in the diagnosis and treatment of disease.
Radiology Report The radiologist’s report based upon his analysis of a radiograph.
Typically this is dictated and recorded at the time of the image examination, and later tran-
scribed by a stenographer. In the case of automated speech recognition, the transcription
occurs simultaneously with dictation.
• Free-text Report A radiology report in which the information is in unstructured, nat-
ural language format. Contrast with structured reporting.
• Structured Reporting A reporting system in which the information is standardized and
structured. The radiologist is confined to a predefined format and often prompted for
information, in contrast with a free, unrestricted dictation of the report.
Reading Room A room containing large light boxes or computer screens in which a
radiologist examines radiological images, such as X-rays.
Transcriptionist A person responsible for transcribing an audio text dictation. Also
called a stenographer.
Turnaround Time (TAT) The time it takes from report requisition (i.e. a physician
requesting a scan) until the dictated report is completed and signed off by the radiologist.
A.2 Computational Linguistics/ Knowledge Representation
Accuracy In NLP, a measure of the performance of any NLP algorithm (see also
Evaluation). In general, accuracy is defined by the following three measures (defined with
respect to error detection):
• Precision A measure of the number of relevant errors (true positives, TP) found ver-
sus the number of relevant and irrelevant errors (true and false positives, TP and
FP). It essentially asks: of the errors found, how many corresponded to actual errors?
P = TP / (TP + FP)    (A.1)
• Recall A measure of the number of errors identified (true positives, TP) versus the
total number of errors actually present (true positives and false negatives, TP and
FN). It essentially asks: of the actual errors, how many did the algorithm find?
R = TP / (TP + FN)    (A.2)
• f-Measure Although precision and recall are inversely related, the f-measure provides
a combined measure of performance. Typically precision and recall are weighted evenly,
resulting in the following definition:
F = (2 · P · R) / (P + R)    (A.3)
Collocation In linguistics, this refers to one or more words that occur frequently
together and generally have connotations beyond the meaning of the component words. For
the purposes of this document, this definition has been relaxed to refer to any two-word
consecutive pairing in the text.
Constituent A functional unit of one or more words. A group of words is considered
a constituent of some type if all constituents of that type can occur in similar syntactic
environments [81]. For instance, a constituent of type noun phrase (that is, a grouping
of words around a noun) can occur before a verb. The following are examples of other
constituents:
The cat in the hat
Marcy hiccupped
The pickle that ended up in the sandwich tasted
Alternatively, a grouping of words may be considered a constituent if they must be
moved as a unit to a different position in a sentence, in contrast to moving the component
words individually. For example, the prepositional phrase “by October sixth” can be placed
in a number of different locations: at the beginning, the middle, or the end
of a sentence [81]. Consider the following:
By October sixth, I would like to finish my thesis.
I would like, by October sixth, to finish my thesis.
I would like to finish my thesis by October sixth.
*By, I would like to finish my thesis October sixth.
It is possible to move the prepositional phrase as a unit, but not the individual words
making up the phrase (as shown in the last sentence). This ability to identify constituent
structures is important in identifying patterns in the language, as well as understanding
how words work. This notion is used in Section 4.5 to constrain the subset of words created
from a sentence.
Constraint Handling Rules Grammar (CHRG) A logical grammar formalization
built upon constraint handling rules (CHR [51]) and developed by Henning Christiansen [29].
Christiansen observes that the relationship between CHRs and CHRGs is analogous to the
relationship between Prolog and definite clause grammars (DCGs).
Computational Linguistics The use of computers to augment linguistic study. Of-
ten equated with natural language processing.
Co-occurrence Relations A relationship defined between a target word and the
words with which it co-occurs in a text. These words are referred to as the “context” and
are defined by those words occurring within a certain radius of the target word. This radius
is defined by the “window size”. A window size of three, for example, refers to the three
words before and after the target word (thus the total window size is six).
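As a minimal sketch (the function and example sentence are mine, not the thesis's), the context of a target word under a given window size can be extracted as follows:

```python
def context_window(tokens, index, window_size):
    """Return the words within `window_size` positions before and
    after the target word at `index` (the target itself excluded)."""
    before = tokens[max(0, index - window_size):index]
    after = tokens[index + 1:index + 1 + window_size]
    return before + after

tokens = "the scan shows a small nodule in the left lung".split()
# Target "nodule" (index 5), window size 3 -> 3 words either side:
print(context_window(tokens, 5, 3))
# -> ['shows', 'a', 'small', 'in', 'the', 'left']
```

Near the edges of the text the window is simply truncated, so the total context can be smaller than twice the window size.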
Evaluation The following measures are used in the evaluation of many language-
processing techniques, including the hybrid error-detection methodology presented in this
document:
• True Positive A correct identification of an error.
• True Negative A correct identification of a non-error.
• False Positive An incorrect identification of an error.
• False Negative An incorrect identification of a non-error.
f-Measure See Accuracy.
Grammar The rules governing a language and appropriate utterances within that
language.
Lexeme Any meaningful linguistic unit.
Linguistics The study of human language, including structure, meaning and evolu-
tion.
Medical Language Processing (MLP) The application of NLP technology to the
medical domain.
Natural Language Processing (NLP) The subfield in artificial intelligence that
deals with the processing of natural human languages, such as English. This processing
includes translating human language into a formal representation that a computer can ma-
nipulate, as well as the reverse: translating a formal computer representation into natural
language. This discipline comprises many areas of inquiry, including, but not limited to,
the following:
• Speech recognition
• Natural language generation
• Information retrieval and extraction
• Machine translation
• Question answering
• Automatic summarization
While the goals of NLP and computational linguistics often overlap, the motivations are
slightly different. Within NLP the goal is the ability of computers to process natural
language, often irrespective of the underlying linguistic theory. This is in contrast with
computational linguists who seek to augment linguistic knowledge of human languages with
computers.
Natural Language Understanding The ability of a computer to process language
and respond to that language in a way that mimics human understanding of language. That
is, the responses are appropriate based on the expectations of the user and/or domain.
Parser A program that disassembles language into its component parts using a set of
rules known as a grammar.
Phone An independent speech sound event.
Pointwise Mutual Information (PMI) A measure of the information provided by
an event (such as a word in a text) about the occurrence of another event (or word):
PMI = log( P(x, y) / (P(x) · P(y)) )    (A.4)
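The PMI of a word pair can be estimated from raw corpus counts. A small Python sketch (the counts and word pair are invented for illustration):

```python
import math

def pmi(count_xy, count_x, count_y, n):
    """Pointwise mutual information estimated from raw counts over a
    corpus of n tokens: log( P(x, y) / (P(x) * P(y)) )."""
    p_xy = count_xy / n
    p_x = count_x / n
    p_y = count_y / n
    return math.log(p_xy / (p_x * p_y))

# If two words each occur 10 times in 10,000 tokens and co-occur
# 8 times, the pair is strongly associated (PMI well above zero):
print(round(pmi(8, 10, 10, 10_000), 2))  # -> 6.68
```

A PMI near zero indicates the words co-occur about as often as chance predicts; a large positive value indicates a strong association, as with a collocation.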
Precision See Accuracy.
Property Grammar A grammar that represents linguistic information via proper-
ties that describe the rules of the language. Parsing is the process of ensuring that these
properties are met within a text. This representation lends itself naturally to expression via
constraints (and hence CHRGs).
Prosody The quantitative and qualitative properties of speech (not text) involving
intonation, cadence, and stress.
Recall See Accuracy.
Statistical NLP A subfield of NLP that relies on a statistical account of the common
patterns of natural language usage. Also called corpus-based linguistics. This is in contrast
with rule-based approaches, which seek to characterize language on the basis of proposed
rules that govern linguistics.
Stop Word A word that is considered to have little overall information with respect to
a text, usually because of its high frequency within that text. This is generally irrespective
of the word’s semantic role within the language.
Stop List A list of the words treated as stop words.
Ontology A formalized taxonomic representation of semantic knowledge. Typically
comprised of relations such as “is-a”. Refer to Appendix B.
Utterance A spoken language segment.
A.3 Automated Speech Recognition
Automated Speech Recognition (ASR) The automated recognition of spoken text
by a computer. As discussed in Chapter 3, ASR can refer more broadly to any AI tasks
involving speech, however, for the purposes of this document, it refers only to the recognition
of human speech.
Recognition Error A word incorrectly transcribed by the speech recognizer.
Word Error Rate (WER) The rate of recognition errors produced by an ASR pro-
gram. This can be measured in a variety of ways. For the purposes of this document, the
WER is defined as follows:
Cor(d) = 1−WER
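One common way of measuring WER, used here only as an illustrative sketch, is word-level edit distance (substitutions, deletions, and insertions) normalized by the number of reference words; the example transcripts are invented:

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance between a reference transcript and an
    ASR hypothesis, divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming table for the edit distance.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

wer = word_error_rate("no acute cardiopulmonary disease",
                      "no cute cardiopulmonary disease seen")
print(wer)  # 1 substitution + 1 insertion over 4 words -> 0.5
```

Under the definition above, this document's correctness measure would be Cor(d) = 1 − WER, i.e. 0.5 for this example.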
A.4 Miscellaneous
Black Box Anything for which the internal workings are unknown to the user.
Constraint Handling Rules (CHR) CHR was first proposed by Frühwirth in 1998
[51] as a declarative, high-level language for expressing and solving constraints. Comprised of
guarded rules (rules with restrictions on their application), CHRs permit constraint simpli-
fication and propagation through constraint rewriting. The constraint engine is responsible
for determining when CHRs apply, that is, when constraints are collapsed into simpler ones,
or when they give rise to new constraints. See also constraint handling rules grammars
(CHRGs).
Hybrid Method Any composite method comprising multiple techniques applied to
the same problem to achieve a single outcome.
Appendix B
Ontologies in Healthcare
B.1 Introduction
In recent years, the term “ontology” has become somewhat of a buzzword – overused and
under-defined. Its recent ubiquity within computing, and in particular medical informatics,
has had the unfortunate side effect of an erosion of meaning and increased ambiguity in day-
to-day usage. Thus, the obvious and immediate task is to define “ontology” and establish
a consistent level of discourse. Beyond that, this chapter considers the motivation behind
ontological research as it pertains to medical informatics, the principles of good ontology
design, and the major systems in use today.
B.1.1 Controlled Medical Vocabulary
Within medical informatics there exist information structures called controlled medical vo-
cabularies. These represent and classify medical terms in order to systematize the represen-
tation of medical concepts. In general, however, they tend to focus on nouns and lack an
expanded vocabulary required to handle non-medical terms such as qualifiers (e.g. size or
degree) [80].
B.1.2 Semantic Lexicon
Like a syntactic lexicon that provides information regarding part-of-speech, et cetera, a
semantic lexicon is a “resource that maps a lexical item (word or phrase) to one or more
semantic types” [80, page 206]. Such a lexicon is necessary to translate natural language
APPENDIX B. ONTOLOGIES IN HEALTHCARE 150
text into a format that a computer can understand and manipulate. These semantic types
stand for entities within the domain of discourse and can be systematically arranged into an
ontology. An ontology defines meaning formally within a domain by virtue of its structure
and the interrelations it defines between the semantic types [80].
B.1.3 Ontology
In the literature, the term “ontology” has become an over-generalization, referring instead
to a continuum of terminological knowledge. Ontologies, in the strict sense, exist at one end
of the continuum, while loosely controlled sets of concepts or terms exist at the other [14].
The unfortunate consequence of this has been needless ambiguity and a dilution of the actual
meaning of the word. As Bodenreider observes, “although more than sixty terminological
systems exist in the biomedical domain, few actually qualify as an ontology” [15].
The name “ontology” derives from philosophy where it embodies a “systematic account
of Existence” [58]. Within knowledge-based systems, this account of existence is simply
that which can be represented. On the most fundamental level when referring to a body of
knowledge one can talk about a conceptualization: “the objects, concepts, and other entities
that are presumed to exist in some area of interest and the relationships that hold among
them” [58]. In other words, the abstract and simplified representation of the domain of discourse
that lies beneath any knowledge base.
It follows that an ontology is “an explicit specification of a conceptualization” [58] – a
formalism for a shared base of knowledge or understanding within a community that permits
intelligent discourse. More specifically, it identifies through a representational vocabulary
the entities, or semantic types, that exist within that domain of discourse, and the logical
relationships between them. An ontology is an ontology by virtue of the logical framework
that it sits upon. This framework is a defining feature of a true ontology and provides the
strict laws governing the relationships within as well as the addition of new concepts, and
exists independently of any specific application of the terminology [14, 15]. It is a direct
result of strict adherence to this formal structure that ontologies achieve their ability to
support sound reasoning (via inheritance or subsumption) and hence their power.
B.1.4 The Continuum of Knowledge Representation
If (reference) terminologies represent a continuum of increasing formalization and structure,
then ontologies exist at the far, formal, end. At the opposite end, coding schemes exist
as comparatively flat and contrived structures to which medical codes are allocated. The
conditions determining classification are coded in one place only and do not capture any
relationships among the terms. Examples include the ICD and DICOM. Such schemes
seek to improve consistency in medical terminology, but do not support depth of structure,
and thus limit reasoning. They can be used as a terminological repository for populating
ontologies or other medical collections.
Taxonomies are the next step up from simple coding schemes, and can range from a min-
imally (or poorly) structured representation of the domain to a fully structured represen-
tation. Existing as hierarchies denoting taxonomic is-a relationships via subordination[21],
taxonomies were in part conceived to address the naming problem, the task of “determining
a controlled set of language labels” [14]. In this way, taxonomies account for variations in
terminology that arise when the same concepts are expressed by different sources (often
for different applications) [103]. Although in theory taxonomic relationships are conducive
to information storage and reuse, in reality many taxonomies are poorly or insufficiently
structured, lacking the necessary rigor for formal inheritance [14, 161]. Worse, it is often
the case that the inter-relationships within a taxonomy contain non-taxonomic relationships
such as meronomy (part-of ) or hyponymy (kind-of ) that have not been explicitly defined
or even acknowledged. Thus, attempts to reason on the assumption of taxonomy are open
to inconsistency and error [13].
Although within the literature taxonomies are often equated with ontologies, only a
well-constructed, rigorous taxonomy that supports formal reasoning is a true ontology. On-
tologies are challenging to construct and are computationally complex for anything beyond
a trivial domain. The high degree of formality, however, is not always necessary depend-
ing on the application. Semantic spaces exist as intermediary taxonomies between basic
taxonomies and ontologies. While increasing in formality, though still lacking the full
power of ontologies, semantic spaces introduce more in-depth and controlled relationships
amongst the terms in the hierarchy [14]. By adding these restrictions on the data, semantic
spaces increase the accuracy of information retrieval, though still fall short of full reasoning
capabilities.
[Figure B.1: The Knowledge Continuum. Terminologies arranged by increasing degree of formalization: coding schemes (e.g. ICD, DICOM), taxonomies, semantic spaces, and ontologies (e.g. SNOMED-CT, UMLS).]
Although most taxonomies are not themselves ontologies, Bodenreider does point out
that it is possible to add the “formality and consistency to the organization of a partially
structured set of concept[s]” [14] in order to create a true ontology.
It is important to consider that the rigid formality expected in the philosophical sense of
ontology is only a logical approximation in the knowledge representation sense [117]. Rector
observes that the concepts expressed in language elude the concrete expression of logic due
to the “flexible fluid dependence on context”. Therefore, the knowledge found within an
ontology is at best an approximation of that knowledge, which will forever remain open
to further specification in a tradeoff between expressivity and tractability [117]. This is
not necessarily a fault of the ontological specification, but a reflection of the dynamic and
productive nature of language. A definition of “ontology” demands a maximally formal
representation given this consideration.
In addition to “ontology”, a second ambiguous term, “vocabulary”, is often used to re-
fer to any of the above knowledge structures, or to refer to the terminology representing
the entities within an ontology [58]. Figure B.1 illustrates these terms and their coverage.
Many of the existing coding or classification systems lack sufficient granularity for capturing
the relevant information in medicine [5]. Attempts to expand these systems have resulted
in “combinatorial explosions” and terminologies that are simply too large to maintain [5].
Furthermore, without any explicit representation of the relationships within these termi-
nologies they are largely unmanageable and it is difficult to write software using them [5].
There have been major attempts at improving such terminologies through imposing formal
structure either on existing systems (e.g. SNOMED CT), or by designing from the bottom
up (e.g. GALEN). An introduction to existing terminologies is presented in Section B.3.1.
B.1.5 Principles of Good Ontologies
Alan Rector identifies four types of information typically encountered in medical informatics
[114]:
1. Information on individual patients (i.e. medical records);
2. Information on populations of patients;
3. Information on institutions and the health care system;
4. Information on the current state of knowledge of best medical practices (i.e. knowledge
management and decision support in its widest sense).
He also lays out four primary and four secondary tasks for processing this information,
namely:
Primary Tasks:
1. Entering patient data;
2. Presenting information about particular patients;
3. Patient population information retrieval (e.g. a query-answer system);
4. Sharing and integrating information.
Secondary Tasks:
1. Navigating and browsing information;
2. Authoring knowledge;
3. Indexing knowledge;
4. Analyzing and generating natural language.
In order to handle these tasks, the representation and processing of information relies
heavily on the terminology upon which the system is designed. Despite this, Rector cautions,
“there is as yet no proof that a general re-usable terminology serving all of the aspirations for
clinical information systems is possible.” [114]. Even so, the growing need for terminological
representation in medical informatics continues to drive research. While no single ontology
will serve all medical informatics applications, adherence to good ontological principles will
allow multiple ontologies to be linked together, achieving what a single one could not.
The principles of good ontological design can be split into three main areas: the principles
of classification, the principles of inheritance, and the principles of partial ordering.
Principles of Classification
Classification allows the organization of information on the basis of relationships as opposed
to knowledge in isolation (such as a traditional dictionary) [100]. This organization can vary
in formality from single-level coding schemes to the formal structure of an ontology. For
consideration as an ontology, however, a terminology must adhere to the following principles
of classification [114, 100, 15, 21].
1. Subordinate classes must be mutually exclusive – that is a concept must be uniquely
identified.
2. Subordinate classes must be jointly exhaustive – that is the implication that there are
no further concepts than what have been represented in the classification.
3. A hierarchy can have only a single root.
4. Each class must have at least one parent.
5. Non-leaf classes must have at least two children.
6. Each child must differ from its parent; siblings must differ from one another.
7. Adherence to the Economy Principle [21]:
Rule 1 Assign the most specific semantic type available.
Rule 2 Assign multiple semantic types if necessary.
Rule 3 Assign a less specific semantic type if no more specific semantic type is available.
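Several of these structural principles are mechanically checkable. The sketch below (a hypothetical helper; it assumes single inheritance and covers only the single-root, two-children, and child-differs-from-parent principles) validates a hierarchy given as a child-to-parent mapping, using an invented meningitis fragment as data:

```python
def check_classification(parents):
    """Check a subset of the classification principles on a hierarchy
    given as a child -> parent mapping (single inheritance assumed).
    Returns a list of violation messages; an empty list means the
    checks pass."""
    problems = []
    classes = set(parents) | set(parents.values())
    roots = [c for c in classes if c not in parents]
    if len(roots) != 1:                       # Principle 3: single root
        problems.append("hierarchy must have exactly one root")
    children = {}
    for child, parent in parents.items():
        children.setdefault(parent, []).append(child)
        if child == parent:                   # Principle 6: child differs
            problems.append(f"{child} is its own parent")
    for cls, kids in children.items():
        if len(kids) < 2:                     # Principle 5: >= 2 children
            problems.append(f"non-leaf class {cls} has fewer than two children")
    return problems

hierarchy = {"Infective Meningitis": "Meningitis",
             "Non-infective Meningitis": "Meningitis",
             "Viral Meningitis": "Infective Meningitis",
             "Bacterial Meningitis": "Infective Meningitis"}
print(check_classification(hierarchy))  # -> []
```

Mutual exclusivity and joint exhaustiveness (Principles 1 and 2) concern the meanings of the classes and cannot be verified from the graph structure alone.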
Principles of Inheritance
In addition to the principles of classification, it is possible to derive a series of principles of
inheritance existing between a parent and its child, or between a class and its immediate
subordinate [16]:
1. Unique differentia. That is, the differentia (distinguishing criteria) from child to parent
should uniquely result either from the refinement of the value of a common role, or the
introduction of a new role [21]. For example, the introduction of the role CAUSATIVE
AGENT with value Infectious Agent explains the subsumption relation of Meningitis
to Infective Meningitis. Similarly, the subsumption relation of Infective Meningitis
to Viral Meningitis is explained by the refinement of the role value for CAUSATIVE
AGENT since Infectious agent subsumes Virus [21].
2. If A is a child of B, then all properties of B are also properties of A (via inheritance).
3. Cycles are forbidden.
4. Adherence to the Sibling Opposition Principle [21]: A category must be opposed to its
siblings via some differentia that is fundamentally unresolvable within the ontology.
Note that Inheritance Principles 1 and 4 essentially extend Classification Principle 6.
It is also important to note that the Sibling Opposition Principle and the Economy
Principle are not accepted by all researchers as desired qualities. The Sibling Opposition
Principle says that in order to maintain unambiguous representation, children of a class must
stand in opposition to one another. That is, they must differ in a way that is fundamentally
unresolvable within the ontology [19, 21]. The validity of this assumption comes under
question, however, with respect to certain concepts that stand better in relations of scale
than of opposition, such as differentia that cannot be defined with precision (i.e. discretely)
[21].
Similarly, the Economy Principle was developed by the designers of the UMLS to prevent
unnecessary categories from being represented. One of its sub-principles explicitly requires
that the relations stand in a strict hierarchy (that is no hybrid types inherit from two super-
classes [21]). Critics of this principle, however, observe that hybrid subtypes are sometimes
necessary to capture the full essence of a complicated concept [21].
The Economy Principle and Classification Principle 2 have been jointly referred to as
the Principle of Orthogonal Taxonomies, stating that properties and differentiae must be
“represented explicitly and independently, even at the cost of apparent redundancy” [114].
Instead any information gained from hybrid or multi-parent classifications must “be inferred
from the descriptions and definitions” [114]. In theory, this makes it possible to re-arrange
the hierarchy along any axis (e.g. anatomy, pathology, et cetera).
Principles of Partial Orderings
Lastly, in order to ensure compatibility, an ontology must adhere to the intrinsic principles
of partial orderings, essentially a mathematical definition of hierarchy [13].
1. Reflexivity Every element of a set is related to itself.
2. Antisymmetry If x is related to y and y is related to x, then x and y are equal.
3. Transitivity If x is related to y and y is related to z, then x is related to z.
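These three properties can be verified directly on a relation represented as a set of ordered pairs. The sketch below (names and the toy is-a relation are mine) assumes the relation has been enumerated explicitly, including its reflexive and transitive closure:

```python
def is_partial_order(elements, relation):
    """Verify reflexivity, antisymmetry, and transitivity for a relation
    given as a set of (x, y) pairs meaning 'x is related to y'."""
    reflexive = all((x, x) in relation for x in elements)
    antisymmetric = all(x == y
                        for (x, y) in relation if (y, x) in relation)
    transitive = all((x, z) in relation
                     for (x, y) in relation
                     for (w, z) in relation if y == w)
    return reflexive and antisymmetric and transitive

# is-a over a tiny hierarchy, closed under reflexivity and transitivity:
isa = {("viral meningitis", "infective meningitis"),
       ("infective meningitis", "meningitis"),
       ("viral meningitis", "meningitis"),
       ("viral meningitis", "viral meningitis"),
       ("infective meningitis", "infective meningitis"),
       ("meningitis", "meningitis")}
elements = {"viral meningitis", "infective meningitis", "meningitis"}
print(is_partial_order(elements, isa))  # -> True
```

Dropping any of the reflexive pairs, or adding a symmetric pair between two distinct concepts, makes the check fail, mirroring the UMLS violations discussed below.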
The less formal semantic spaces, for instance, do not exhibit these properties (one of
the reasons for their lower classification on the scale of knowledge representation) [13]. The
UMLS is-a relationship, for example, does not exhibit reflexivity [21]. Consider ibupro-
fen, a “non-steroidal anti-inflammatory (NSAI) substance”. Within the Metathesaurus (a
subcomponent of the UMLS) “ibuprofen” and “NSAI” correctly exist as part of an is-a rela-
tionship. However, although both “ibuprofen” and “NSAI” are represented as the semantic
type, Pharmacologic Substance, there is no corresponding is-a relationship standing between
Pharmacologic Substance and itself to mirror the is-a relationship between “ibuprofen” and
“NSAI” [21]. Thus, the terms are artificially compressed into the single term Pharmacologic
Substance, which could affect later reasoning.
If a parent-child relationship holds between concepts C1 and C2, antisymmetry ensures
that the reverse relationship cannot hold (or if it does, that those concepts must be equiva-
lent). The UMLS encounters violations of this principle because of its development through
the inclusion of multiple vocabularies.
Violations of transitivity arise from the presence of implicit qualifiers in medical terms
[21]. Consider the UMLS vocabulary.1 The authors of [21] observe:
1See Section 4 for a more in-depth discussion of the UMLS.
The isa relation is found in the UMLS at three different levels: between se-
mantic types in the Semantic Network, between concepts in the Metathesaurus,
and between a concept and a semantic type through the categorization. As-
suming that this isa relation represents the same kind of abstraction at different
levels in the UMLS, transitivity is expected to apply not only between semantic
types [in the Semantic Network], or between Metathesaurus concepts, but also
between semantic types and Metathesaurus concepts. Thus, the semantic type
of any ancestor C1 of a concept C2 is expected to be a supertype of the semantic
type of C2.
Consider the violation of transitivity arising with the concept “hip dislocation”, which is
grouped together with the more specific concept “acquired hip dislocation”. This grouping
has arisen from the more frequent observation of acquired hip dislocations compared to
congenital ones. It requires that “congenital hip dislocation” be a child of “[acquired] hip
dislocation”, which is in turn a child of the semantic type Injury or Poisoning. “Congenital
hip dislocation”, however, is also a child of the semantic type Congenital Abnormality.
Based on transitivity an is-a relationship must also exist between Congenital Abnormality
and Injury or Poisoning. However, only non-taxonomic relationships are postulated, such
as has-result; complicates [21].
Non-specificity of the is-a relationship can give rise to a weaker version such as is-
generally-a, and can also undermine transitivity. Again, using the UMLS for example,
Burgun and Bodenreider cite “Addison’s disease”, which is found in the relationship: “Ad-
dison’s disease” is-a “autoimmune disease”. While true typically, the is-a in this instance
should be an is-generally-a relationship to avoid the exception, “Tuberculous Addison’s dis-
ease”, which results via transitivity in the erroneous relationship “Tuberculous Addison’s
disease” is-a “autoimmune disease” [21].
B.2 Methods of Knowledge Representation
Within a controlled vocabulary or ontology, there are several ways in which conceptual
information can be represented. These variations can affect the specificity of the information
stored, as well as the ability to link these knowledge sources to other systems. By meeting
certain conditions, a meaning representation ensures that it can be used for reasoning tasks.
These conditions include [81]:
Verifiability. The ability to determine the truth value of a meaning representation;
Clarity. Unambiguous data;
Consistency. Ensuring that types with the same meaning are in fact mapped to the same
concept structure;
Expressivity. Ensuring adequate expressivity sufficient for the task at hand.
This section gives a brief introduction to some of the commonly-used meaning represen-
tation methods.
B.2.1 First Order Predicate Calculus (FOPC)
FOPC is a well-understood formalism for representing meaning that meets the conditions
mentioned above. Computationally tractable, it places few restrictions on how concepts
are represented. The language is verifiable, highly expressive, and provides a means for
solid inference (e.g. forward or backward chaining systems) [81]. One of the major pitfalls,
though, is the assumption that the English conjunctives2 such as and, or and if are directly
related to the equivalently-named FOPC terms [81]. This can quickly lead to inconsistency
within the system.
In addition, inference methods such as forward or backward chaining are sound but not
complete. Therefore, it is possible that some valid inferences are not obtained by a system
employing these techniques. Unfortunately, the alternative method, resolution, can be very
computationally-expensive [81] (a tradeoff that may be acceptable in certain, small-scale
applications).
B.2.2 Semantic Networks
The 1970s saw the emergence of semantic networks as an attempt to standardize meaning
representation and step away from the rather ad hoc attempts during the 1960s. Semantic
networks represent the meaning of concepts as defined by the relations held with other
concepts: “in general, semantic networks attempt to impart common sense knowledge to
computers, allowing them to ‘reason’ and draw conclusions about entities by virtue of the
categories to which they have been assigned” [103, page 373]. Concepts are represented
2Similarly for conjunctions in other languages.
by nodes in a graph whose links define the binary relationships held between other nodes.
The standard relationships include ISA (is-a) and AKO (a-kind-of ), which link a class to
its superclass. For instance, ISA(dog, mammal) links dog to its superclass mammal. The
encodings of concepts within a semantic network can be informal or formal; however, they
typically lack any axioms for reasoning.
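A toy illustration of ISA links and property inheritance (the network and property names are invented for illustration):

```python
# A toy semantic network: ISA links map a class to its superclass.
isa = {"dog": "mammal", "mammal": "animal"}
# Properties attached directly to each node.
properties = {"animal": {"alive"}, "mammal": {"has_fur"}, "dog": {"barks"}}

def inherited_properties(node):
    """Collect a node's own properties plus everything inherited by
    following ISA links up to the root of the network."""
    props = set()
    while node is not None:
        props |= properties.get(node, set())
        node = isa.get(node)
    return props

print(sorted(inherited_properties("dog")))  # -> ['alive', 'barks', 'has_fur']
```

The reasoning here is purely structural: "dog" acquires "alive" only because of the chain of ISA links, which is exactly the kind of category-based inference semantic networks were designed to support.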
Although semantic networks allow for an explicit and concise statement of the associations
between concepts, the lack of a standard interpretation, or a standard for the links
joining concepts, limits the usefulness in systems relying on data from multiple sources
(such as might be found in a comprehensive medical system). In partial answer to this
criticism the KL-ONE family of knowledge representation systems was developed based on
the structured inheritance network initially proposed by Ron Brachman in his PhD thesis
[163], which ultimately led to the development of description logics (see Section B.2.4).
B.2.3 Frame-Based Representations
In the 1970s and 1980s, researchers further structured the semantic network representa-
tion into frame-based representations. Frames contain slots that maintain the relations to
other frames and have much in common with objects in the object-oriented programming
paradigm. Each slot is broken down into facets representing not only the value, but in-
formation about that value and/or slot such as default values, constraints or axioms [163].
These systems are capable of a new type of inference, called classification, which allows the
system to automatically determine the appropriate place in an existing hierarchy of objects
for a new object or description.
B.2.4 Description Logic
Inspired by the ambiguities present in “early semantic networks and frames”, description
logics (DLs) are a family of knowledge representation formalisms for the logical represen-
tation of terminology [5, 140]. As mentioned previously, KL-ONE, for example, was an
early DL-based formalism [163]. In general, the following statements, initially put forth by
Ronald Brachman, characterize a DL system ([5]):
1. The building blocks consist of atomic concepts (unary predicates), atomic roles (binary
predicates), and individuals (constants).
2. Expressivity is limited by a small set of constructors for building complex concepts
and roles.
3. Implicit knowledge can be inferred from explicit knowledge through subsumption and
instance relationships.
Essentially DLs identify the domain of discourse, namely the concepts represented by
the terminology, and then generate a world description using those concepts to describe
the properties of objects within the domain [5]. This description is provided using concepts
(classes), roles (properties and relations) and individuals (instances of classes).3 The
strength of the DL formalism comes from its formal, logic-based semantics that allows rea-
soning (the inference of implicit knowledge from explicitly represented knowledge). DLs
support classification of concepts and allow for an algorithmic specification of hierarchical
knowledge and synonymy [140]. Most importantly, though, they balance the tradeoffs be-
tween the rigor of first-order logic (FOL) and expressivity, resulting in a relatively expressive
yet decidable system [15, 31].
The advantages of this expressivity include more accurate and less ambiguous represen-
tation of concept semantics; more advanced inferencing (which can help in maintaining a
consistent system); and better possibilities for querying and aggregation [31]. In the max-
imally expressive case, FOL, tractability and decidability are compromised, while in the
more limited frame-based or DL systems, tractability is maintained at the cost of some
expressivity [105].
One of the strengths of DL formalisms over semantic networks and frame-based systems
is that the user need not explicitly introduce is-a relationships. Instead, the subsumption
and instance relationships are “inferred from the definition of the concepts and the prop-
erties of the individuals”[5, page 45]. DLs have the ability to define such concepts using
“explicitly agreed-upon semantics”, in contrast with “[f]rames, where semantics often de-
pend on interpretation” [31]. Knowledge is modeled using these concept definitions and their
inter-relations or roles. As the authors in [16] observe, “they promise to make available for
formal reasoning tools detailed descriptions for each class, representing through roles the
defining characteristics of these classes”.
In [117] the authors identify five elements of a DL-based ontology:
3“Description Logics”: http://www.openclinical.org/descriptionlogics.html Accessed: February 2006; Updated: October 2004.
1. A hierarchy of elementary categories (atomic concepts);
2. A hierarchy of semantic links (roles) that connect the elementary categories (note that
subsumption relationships are represented as a logical inference of the form “All Bs
are As”);
3. A set of definitions of composite concepts in terms of the elementary concepts, such
as “foot bone” = Bone which isStructuralComponentOf Foot;
4. An axiom base of the form “All X haveLinkTo some Y”, such as “All feet are a division
of some LowerExtremity”;
5. A constraint base determining what concepts can be linked via roles.
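Using standard DL notation (where ⊑ denotes subsumption, ⊓ conjunction, and ∃R.C an existential role restriction), elements 3 and 4 above might be sketched as follows; the role name isDivisionOf is an illustrative rendering of “are a division of”, not a name taken from [117]:

```latex
% Element 3: a composite concept defined from elementary concepts
\mathit{FootBone} \equiv \mathit{Bone} \sqcap
    \exists\, \mathit{isStructuralComponentOf}.\mathit{Foot}

% Element 4: an axiom of the form "All X haveLinkTo some Y"
\mathit{Foot} \sqsubseteq \exists\, \mathit{isDivisionOf}.\mathit{LowerExtremity}
```

Given such definitions, a reasoner can infer by subsumption that every FootBone is a Bone without that relationship ever being asserted explicitly.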
The most widely known medical terminology efforts to incorporate DLs have been GALEN
and SNOMED-CT.
While DLs do readily lend themselves to automatic reasoning and information retrieval,
they “do not systematically ensure compliance with the principles of classification required
if reasoning is to be performed accurately” [16]. Thus, such compliance falls on the ontology
designers.
As a last note of interest, some DL formalisms include an epistemic operator that “makes
it possible to define what is known about a concept” [31], adding another facet to ontology
design.
B.3 Medical Terminologies
In general, medical terminologies can be evaluated according to their domain coverage,
intended use, and the techniques underlying their construction [14]. The use of such termi-
nologies can range from billing to record keeping to a full-fledged reference terminology. A
reference terminology provides a common framework into which other knowledge sources can
be linked using the same mapping schema (e.g. SNOMED-CT). With respect to construc-
tion, terminologies also range from simple enumerated lists such as the ICD to compositional
approaches that rely on maintaining a set of atomic concepts from which all concepts are
generated. Compositional solutions, however, often do not sufficiently capture the essence
of the concepts they represent, such as recognizing that “hepatitis” and “inflammation of
the liver” refer to the same thing [14]. One way to enhance the capabilities of terminologies
has been to incorporate lexical techniques, which take into account the lexical aspects of
concepts as they are expressed. By breaking concepts down in this fashion, it is possible to
unify phrases across existing terminologies, as well as in free-text reports, research papers,
and the World Wide Web. This is the basis of the UMLS [14].
B.3.1 Existing Vocabularies and Ontologies
There exist a handful of wide-coverage vocabularies within medical informatics that have
seen fairly widespread use. These range in formality from semantic spaces to ontologies.
Unified Medical Language System (UMLS)
In 1986, the National Library of Medicine (NLM) began developing the Unified Medical
Language System (UMLS). Headed by Donald Lindberg, M.D., then director of the NLM,
the project aimed to create an integrated vocabulary based upon a unified semantic structure
in anticipation of the growth of electronically available medical information [103, 2]. The UMLS
is intended to transcend terminological variation by unifying the available electronic
medical vocabularies according to a standardized semantic structure and representation of
lexical items.
Three primary knowledge sources comprise the UMLS: the Metathesaurus, the Seman-
tic Network and the SPECIALIST Lexicon. The Metathesaurus is the primary vocabulary
database containing information on concepts obtained from pre-existing vocabularies such
as GALEN and SNOMED-CT [103, 48]. This integration is performed on the source vocabulary's
existing meaning representations and inter-relationships through unification. Representations
are expanded to include any missing essential primitive knowledge, as well as
to instantiate new relationships across the different source vocabularies [14, 103]. In essence,
the multiple trees from the source vocabularies are brought together into one unified graph.
Structurally, the Metathesaurus is arranged so that all words and phrases referring
to the same concept are grouped together and linked to other concepts present in the
Metathesaurus. These links define the “semantic neighbourhood” of a concept, and can be
navigated to obtain the names of needed concepts [103]. In the case of a polysemous word,
each meaning exists as its own concept within the Metathesaurus. The intervening links
then define the relationships between the concepts, such as hierarchy and context.
The Metathesaurus preserves any contextual assumptions present in a source vocabulary.
Such context may be reflected in the hierarchical arrangement of the vocabulary. Consequently,
a single concept may appear in more than one hierarchy within the Metathesaurus
[103].
According to the UMLS fact sheet4, the 2003 edition of the Metathesaurus “includes
900,551 concepts and 2.5 million concept names,” spread over 100 biomedical source
vocabularies in multiple languages.
The Semantic Network provides two high-level hierarchies of semantic types intended to
categorize the concepts within the Metathesaurus, namely “event” and “entity” [21]. All
other semantic types within the hierarchies are directly or indirectly linked back to these
two types [103]. Each concept within the Metathesaurus is assigned one (or more) semantic
types based on the most specific one available in the Semantic Network [103]. More general
than the concept level, semantic types permit a broad categorization of the concept, allowing
some reasoning about the definition of the concept. For instance, at the Semantic-Network
level high-level knowledge such as “drugs treat diseases” is represented, whereas at the
Metathesaurus level the more specific, low-level knowledge such as “aspirin treats fever” is
represented [21]. A relationship at the Semantic Network level, however, will not necessarily
hold for all pairs of low-level concepts at the Metathesaurus level assigned to those semantic
types (see discussion of transitivity in Section B.1.5); for instance, not every drug treats
every disease.
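The two-level arrangement can be illustrated with a minimal sketch; the concept and type entries below are hypothetical toy data, not actual UMLS content:

```python
# Toy two-level model: type-level relations in the Semantic Network versus
# concept-level relations in the Metathesaurus (all data here is illustrative).

semantic_type = {"aspirin": "Drug", "penicillin": "Drug",
                 "fever": "Disease", "tuberculosis": "Disease"}

# Semantic Network: high-level knowledge such as "drugs treat diseases".
type_relations = {("Drug", "Disease"): "treats"}

# Metathesaurus: specific, low-level knowledge such as "aspirin treats fever".
concept_relations = {("aspirin", "fever"): "treats"}

def type_level_treats(a, b):
    """A relation at the type level is merely possible for concept pairs."""
    return type_relations.get((semantic_type[a], semantic_type[b])) == "treats"

def concept_level_treats(a, b):
    """Only explicitly asserted pairs hold at the concept level."""
    return concept_relations.get((a, b)) == "treats"

# "Drugs treat diseases" holds at the type level for both pairs...
print(type_level_treats("aspirin", "fever"))        # True
print(type_level_treats("penicillin", "fever"))     # True
# ...but not every drug treats every disease at the concept level.
print(concept_level_treats("penicillin", "fever"))  # False
```

The sketch makes the asymmetry concrete: the type-level relation licenses a possibility, while only the Metathesaurus assertion makes a specific pair hold.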
Lastly, the SPECIALIST Lexicon is responsible for maintaining the syntactic, morpho-
logical and orthographic information for both the medical and non-medical words found in
English.
Evaluation. The UMLS is a unique, two-level arrangement of the Semantic Network
and the Metathesaurus [14]. On one level, the Semantic Network is a type hierarchy
compatible with the ontology definition laid out above. The Metathesaurus, however, by
its very construction, is bound by different principles. It is often the case that multiple
organizational principles are inherited via the source vocabularies, and that these principles
do not necessarily adhere to the ones cited above. Thus, as Bodenreider concludes, “the
Metathesaurus fails to meet basic ontological requirements” and is better categorized at
the level of semantic space at present [14]. Indeed, “the current level of organization is not
consistent and principled enough to fully support reasoning” [14, page 7].
4http://www.nlm.nih.gov/pubs/factsheets/umls.html Accessed: February 2006; Updated: May 2004.
Frequently cited issues include circularity within the Metathesaurus hierarchy, as well as
categorization inconsistencies and incongruities in the Semantic Network and Metathesaurus
[14]. Much of this can be blamed on the unrestricted adoption of source vocabularies;
without methods to enforce sound ontological principles, the UMLS is limited by the degree
of rigor present in its source vocabularies. These source vocabularies are not restricted
in the types of relationships expressed, nor, as mentioned above, is the nature of these
relationships often defined. As a result, there can be no assumptions about the nature of
the relationships present within the UMLS either [14].
There are other causes for circularity within the Metathesaurus, including underspecified
terms, represented as “unspecified” or “not otherwise specified”. If a term, T, is listed as
“T, unspecified”, it is generally considered a descendant of “T”. This is not the case
in the Metathesaurus, where such terms are clustered together as one, since they are not
considered to differ in meaning [14]. This compression of concepts can result in
a circular relationship.
Other difficulties arise with the presence of implicit knowledge in the source vocabulary
that is incorrectly characterized in the Metathesaurus, or absent altogether. When terms
are erroneously collapsed, this information is lost and further circularity is introduced.
As an evaluation of the UMLS, Bodenreider et al. identify the semantic neighbourhood
of the concept “heart” [14]. In doing so they discovered that, out of 6894 pairs of related
concepts, 65% could be “inferred unambiguously from the Semantic Network”,
22% exhibited multiple semantic links between the two, and 13% revealed an inconsistency
between the Semantic Network and the Metathesaurus. In many cases, they noted that discrepancies
between the two levels were actually an artefact of the representation of abstract
versus concrete semantic types: the relationship of an abstract concept to a concrete concept
would not be captured at the Semantic Network level, thus appearing as a discrepancy
between the Metathesaurus and the Semantic Network. Other reasons pertain to
the assumptions placed on the source vocabulary by the UMLS, such as partonomic relationships.
In the Semantic Network, part-of relationships are assumed to be associative,
which may contrast with the interpretation in a source vocabulary (e.g. one used hierarchically)
[14].
Issues with respect to the principles of partial orderings also arise within the UMLS due
to the asymmetric relationship between the relations in the Semantic Network and those in
the Metathesaurus. See Section B.1.5 on transitivity for a more detailed example.
Systematized Nomenclature of Human and Veterinary Medicine (SNOMED)
The Systematized Nomenclature of Human and Veterinary Medicine, SNOMED, can trace
its history back to the Systematized Nomenclature of Pathology, SNOP, developed in 1965 under
the College of American Pathologists (CAP)5. SNOP was initially developed as a “comprehensive
and flexible tool for pathologists interested in the storage and retrieval of medical data”;
SNOMED was born in the 1970s when Dr. Roger Cote extended SNOP beyond pathology
to a wide range of specialties within medicine. In 2000, SNOMED-RT was introduced as a
concept-based reference terminology. Comprised of multiple hierarchies, it contained over
121,000 concepts that were linked to over 190,000 synonymous terms. Finally, in early 2002,
CAP released SNOMED-CT, formed through the amalgamation of SNOMED-RT and the
Clinical Terms Version 3 (previously known as the Read Codes) [25]. Today, SNOMED-CT
exists as a compositional reference terminology. As of the 2005 edition, it contains over
980,000 English language descriptions (or synonyms), 1.45 million semantic relationships,
and over 360,000 uniquely identified concepts. These are distributed over 18 hierarchies [16].
The mathematical, hierarchical relationships within SNOMED-CT are expressed via a
DL-like formalism [5]. Each class has a unique description consisting of a unique identifier
number, (at least) one parent, as well as a list of synonymous names [16]. In addition,
classes are assigned unique and fully specified names consisting of a regular (English) name
immediately followed by a parenthetical reference to the “primary hierarchy” of the class.
This reference roughly corresponds to one of the top levels of the SNOMED-CT hierarchy
[16]. With the obvious exception of the root, each class is “linked hierarchically to exactly
one top-level class” [16].
Inheritance is represented within SNOMED-CT via is-a relationships between classes,
which are refined through their role fillers [16]. (See the discussion of the inheritance
principle above.)
Evaluation. SNOMED-CT is perhaps best thought of as an ontology with room for
formalization. Its re-design under DL has helped greatly in the degree of formalism and the
capacity for reasoning; however, the presence of errors, as well as non-strict adherence
to ontological principles, has tempered this improvement.
As an example, Ceusters et al. have identified errors of the following nature detected
within SNOMED-CT: human error, technology-induced errors, meaning shifts (from the
transfer of SNOMED-RT to -CT), redundancy, and mistakes attributable to the underlying
ontological theory [24].
5The information in this section was obtained from SNOMED International Historical Perspectives, unless otherwise indicated: http://www.snomed.org/about/perspectives.html Accessed: June 2005; Updated: 2005.
Although Bodenreider et al. acknowledge that SNOMED-CT’s overall coherence does
permit reasoning, they are still able to identify class descriptions that are “minimal or incomplete,
with possible detrimental consequences on inheritance” [16]. Some of these problems
are attributable to taxonomic relations, or to issues of multiple inheritance (over 27% of
classes within SNOMED-CT were found to have more than one parent), where a parent
and child share roles with values that cannot be linked via inheritance (e.g. identity).
Also, the presence of single-child classes does not comply with the ontological and
classification criteria above, indicating the possible presence of errors: if a class has only
one child, it is questionable whether a distinction should even exist between parent and child
[21, 16]. Single-child classes can arise from incompleteness in the hierarchy; from hybrid
classes, where two parent classes intersect and the child may be a single child of one of
them; or from redundant classes, where there is no evidence of refinement or difference
between parent and child, suggesting incomplete descriptions [16]. In approximately 56% of
the single-child cases there was no connection to hybrid classes; thus the child was simply a
refinement of its only parent.
The presence of an overly large number of children may also point to incomplete de-
scriptions leading to a lack of discrimination within the terminology [15].
The General Architecture for Languages Encyclopaedias and Nomenclatures
in Medicine Project (GALEN)
As part of the Advanced Informatics in Medicine (AIM) Program, the Generalized Architecture
for Languages, Encyclopaedias and Nomenclatures in Medicine, otherwise known as
GALEN, was developed in the early 1990s by the European Commission for the representation
of surgical procedures6 [26, 115]. Compositional in design, the project’s primary goal was
the development of an alternative to “static look-up terminologies” in the form of a “ter-
minology server”, a client-server system that mediates access across various ontologies and
information systems, while facilitating the development of new systems [115, 116]. This
6OpenGALEN Homepage: http://www.opengalen.org/technology/galen-faq.html; Accessed: February 2006; Updated: 1999.
makes it possible to reference concepts, pose queries, and translate concepts between repre-
sentations [115]. The terminology server functions also as an interface between applications,
and a facilitator for the development and integration of new concepts.
The basis of the server is the language-independent, ontological concept reference model
known as CORE (the Common Reference Model) [115, 116, 124, 147]. Represented using
GRAIL, a DL-like knowledge-representation formalism developed specifically for medical
terminology, CORE allows the development of medical applications that can successfully
intercommunicate based on a common (meta-) language [5, 115]. The goal of CORE is to
“represent the underlying conceptual model of medicine shared across national boundaries”7. Highly specific knowledge, for instance pertaining to protocols, is not part of the CORE
model itself, but rather employs the model as a foundation for representation. Roughly, if
it is considered non-controversial, widely-accepted knowledge it will be present in CORE,
giving rise to the developer’s slogan “managing diversity, without imposing uniformity”8.
This is rather nicely expressed as “the level of detail two specialists need to talk about
medicine outside their specialty”9.
GALEN works by translating natural, free text into intermediary, simplified conceptual
representations called dissections [147]. This representation is a tradeoff between the free
expression of natural language and highly complex knowledge structures. From here, the
dissection is then translated into its GRAIL representation and classified according to the
CORE model.
The GALEN terminology server is itself not a terminology, but a knowledge representa-
tion infrastructure to support existing terminologies (ranging from coding schemes such as
ICD-10 to more complex taxonomies such as SNOMED), and the development and reclassi-
fication of new terminologies [147]. By authoring the necessary conceptual representations
in GRAIL, it is possible to construct classifications in GALEN, as well as “ensure the main-
tenance, extensibility and coherence of existing ones” [147, page 73].
CORE consists of more than 13,000 elementary concepts, approximately 800 of which
consist of a set of roles, as well as a series of production rules for generating complex concepts
[128, 147]. Approximately 5700 concepts are composite, depending on the maintenance
of other definitions [147]. This generation of concepts is restricted to what is considered
7OpenGALEN Homepage.
8OpenGALEN Homepage.
9OpenGALEN Homepage.
medically sensible; complex concepts can only be generated once they have been sanctioned
by the knowledge constraints within the system. This has the effect of reducing over-
production of nonsensical concepts.
While GALEN and SNOMED both use DL-based formalisms, GRAIL is a more compre-
hensive DL, including additional role constructors, “namely role hierarchies, inverse roles,
role chaining and transitive roles” [31].
GALEN has been used in a variety of projects within Europe, such as the French coding
system, CCAM, particularly because of its inherent support of multiple languages [147, 123].
OpenGALEN. OpenGALEN is a nonprofit organization that provides free and open
access to the GALEN Common Reference Model. Thus, researchers incorporating GALEN
in their systems can modify it freely and distribute their software without worry of licensing
fees. The hope is to set the stage for the development of an Open Source community for
medical terminology.
Evaluation. Unfortunately, the literature lacks an evaluative investigation of the strengths
and weaknesses of the GALEN formalism at this time. Much of the information available is
now out of date, as the initiative is currently in hibernation [125, 126].
Despite the lack of content development, the GALEN formalism is far from dead. Dr.
Jeremy Rogers reports that researchers are continuing to work with GALEN and to expand
its purview beyond surgical procedures (with a particular focus on drug representations).
This includes a comparative analysis of the anatomy content of GALEN and the FMA, and
an analysis of the representation of ICD-10 diseases using GALEN [125, 126].
Work is also underway as part of the World Health Organization’s International Clas-
sification of Health Interventions (ICHI),10 on adapting OpenGALEN to handle ICHI ter-
minology. This involves testing an updated KERMANOG11 e-platform that allows users to
maintain a constant connection to a central service (as opposed to periodic connections for
resynchronisation of the knowledge base). This research will likely result in a new release
of OpenGALEN in 2006 [126].
GALEN and its various components continue to be made available via the OpenGALEN
website12, as well as OpenKnoME, the GRAIL knowledge engineering environment produced
10For more information refer to http://www.who.int/classifications/ichi/en/ Accessed: February 2006.
11KERMANOG is a Dutch company that has developed applications for use with GALEN, including the only commercially available terminology server for GALEN, and the Classification Workbench, a toolset for the development and maintenance of classification schemes, such as the work with CCAM.
12OpenGALEN Homepage.
by topThing13.
LinKBase
LinKBase is a large-scale, proprietary, medical ontology created by Language and Com-
puting (L&C). Unlike other ontologies, LinKBase was developed using LinKFactory, L&C’s
own ontology-authoring environment [25]. LinKBase presently contains over 1.5 million
language-independent medical and general-purpose concepts (such as “human body”) and
particular instances (such as “United States”), associated with more than 4 million terms
in several natural languages [25, 27]. The concepts and instances are linked via a semantic
network containing approximately 480 link types [25]. The connections within the seman-
tic network constitute a formal framework derived from logically axiomatized theories in
mereology and topology, augmented with causality and time, and adhering to good rules of
classification [25, 24, 27]. In fact, only about 15% of the total relationships in LinKBase
are subsumption-based, with the remaining 85% comprising richer structures than are possible
using DL formalisms [24, 27].
LinKBase has recently been re-engineered in adherence with the theory of granular
partitions – the notion of representing knowledge as grids of labeled cells [9] – and basic
formal ontology, which stipulates formal distinctions between the relationship of universals
and particulars [25, 137].
The authors of LinKBase draw a subtle distinction between concepts and the entities
they represent. Within the context of LinKBase, concepts abstract the necessary features
of natural language. They do not represent abstractions of how humans think but rather
the actual real-world entities. Concepts that do refer to people’s conceptions are called
meta-entities, included to facilitate mappings to third-party ontologies [24].
In addition to LinKBase, L&C also maintains LinKFactory, a knowledge-engineering
system, and TeSSI, an engine for semantic indexing, retrieval and extraction.
Evaluation. Unfortunately, there exist no critical analyses of the LinKBase system in
the literature, which is likely attributable to the proprietary nature of the system.
In [114], it is observed, however, that: “until proven otherwise, the fact that so many
commercial vendors have chosen not to use standard terminologies must be taken as strong
presumptive evidence that those terminologies are seriously flawed from the point of view of
13http://www.topthing.com/ Accessed: February 2006.
practical use in scalable systems” [114, page 8]. Still, this does little to alleviate the problem
of access to such ontologies for the purposes of academic research.
B.3.2 Issues in Medical Informatics/Ontologies in General
Semantic Challenges
Ontological design is a challenging task. At the semantic level, there are considerations
affecting the expressivity and the scale of the ontology. These include adequate coverage of
the necessary medical and non-medical concepts; facilitation of consistent ontology growth
[21]; and sufficient representational granularity. The granularity needs will vary depending
on the task at hand: too narrow and complexity grows, while too broad and insufficient
distinctions are drawn between concepts [5, 80]. Decisions regarding complexity can be in-
fluenced by a range of applications, from payment and administration details, to descriptions
of symptoms and procedures [5].
Other considerations involve concept definitions. For instance, in the case of polysemy,
orthographically equivalent words can represent multiple senses14. The ontology must pos-
sess sufficient discriminations to enable a computer to resolve cases of ambiguity [15, 112].
Unfortunately, if multiple senses are represented via multiple types, they can have a
detrimental effect on parsing efficiency: more semantic types mean more choices for the
parser and, consequently, a more complex task [80].
Combining semantic and syntactic information can assist in reducing ambiguity in both
the potential syntactic types of a lexeme and the potential semantic types. For instance,
Johnson observes that without syntactic knowledge to differentiate between adjective and
verb positions in a sentence, the word “left” in the sentences, “opacity seen in left lung”
and “patient left hospital”, would be considered ambiguous by a semantic parser [80].
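A minimal sketch of this idea follows; the tiny tag rule and sense inventory below are illustrative assumptions, not Johnson's actual parser:

```python
# Toy disambiguation of the polysemous word "left" using syntactic position:
# after a subject noun it is a verb (past tense of "leave"); before a noun
# such as "lung" it behaves as an adjective. All rules here are illustrative.

SENSES = {("left", "ADJ"): "side:left", ("left", "VERB"): "depart:past"}

def tag_left(prev_word: str) -> str:
    """Crude positional heuristic standing in for real syntactic knowledge."""
    return "VERB" if prev_word in {"patient", "he", "she"} else "ADJ"

def sense_of_left(sentence: str) -> str:
    words = sentence.split()
    i = words.index("left")
    pos = tag_left(words[i - 1]) if i > 0 else "ADJ"
    return SENSES[("left", pos)]

print(sense_of_left("opacity seen in left lung"))  # side:left
print(sense_of_left("patient left hospital"))      # depart:past
```

Even this crude rule shows how a single bit of syntactic context collapses the ambiguity that a purely semantic parser would face.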
In addition, there must be “a clear differentiation at all levels between ‘false’, and ‘not
done’ or ‘unknown’ ” within the ontology [5]. This can take the form of explicit negation,
or negation implied via taxonomic relationships that represent only what is true (and
say nothing of the truth or falsity of what is not mentioned). In some ontologies, such
as GALEN, negation is not expressed explicitly, but simulated through modifiers such as
14For example, “bank” is a polysemous word with multiple senses: The financial institution – “I depositedmoney in the bank”; A sloping surface – “I went down to the river bank”; A collection of items – “I walkedover to the bank of machines”.
“presence/absence” and “done/not-done” [117]. Such restrictive qualifiers, however, limit a
system’s capacity to express vague concepts.
The degree of formalization is important within an ontology. As the size grows, the
ontology becomes increasingly susceptible to inconsistency without some formal system in
place, such as an axiom base. If the axiom system present is too formal, however, it may
become extremely difficult to understand and maintain the knowledge base [80].
The extent to which implied information is represented is also significant [117]. Rector
provides the example of the procedure, ‘Insertion of pins in femur’, which he points out
should imply a ‘fixation procedure’. Is the classification system responsible for automatically
deriving this implication, and if so, to what extent should implication continue? To allow
one implication is to open the door for a potential series of cascading implications which
could have detrimental consequences [117].
Structural Challenges
From a more formal perspective, a number of issues arise with respect to the structure of
the ontology itself. These include redundancy, is-a overloading, incomplete descriptions,
ontology growth, and the limits of taxonomic relationships.
Redundancy. Redundancy arises whenever there exists more than one method to encode
the same concept. For example, the concept “ruptured ovarian cyst” can be derived by
the composition of “ruptured ovary” and “cyst”, or “ruptured cyst” and “ovary” [140].
Redundancy can exist at two levels [30]: At the term level it serves a useful purpose,
allowing multiple expressions to be mapped to the same concept while the coding or
identifier of the underlying concept remains unique. Redundancy at the identifier level,
however, is generally considered problematic as it can lead to problems in inference
when there is no underlying unique representation of a concept [30]. Such redundancy
tends to arise as a result of ontological expansion. In some instances, such as in the
integration of vocabularies in the UMLS, the presence of redundancy indicates that
a concept has been asserted by multiple sources, which can be interpreted as a high
probability that that concept is semantically valid [17]. Redundancy can also allow
direct connections between important concepts that might otherwise be very distant
in the ontology. Such dependence, nevertheless, is rarely explicitly noted or rigorously
maintained, and subsequent updates can ultimately lead to inconsistency [166].
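A small sketch (with hypothetical terms and concept identifiers) of the distinction between the two levels of redundancy:

```python
# Term-level redundancy is useful: many surface forms map to one unique
# concept identifier. (Identifiers here are invented for illustration.)
term_to_id = {
    "hepatitis": "CONCEPT-042",
    "inflammation of the liver": "CONCEPT-042",  # same concept, same id
}

# Identifier-level redundancy is problematic: two identifiers for what is
# semantically one concept, e.g. two derivations of "ruptured ovarian cyst".
concepts = {
    "RO-CYST-1": ("ruptured ovary", "cyst"),
    "RO-CYST-2": ("ruptured cyst", "ovary"),  # no shared unique representation
}

# An inference engine keyed on identifiers resolves the first case cleanly
# but treats the second as two distinct concepts:
ids = {term_to_id["hepatitis"], term_to_id["inflammation of the liver"]}
print(len(ids))        # 1 -- term-level redundancy collapses to one concept
print(len(concepts))   # 2 -- identifier-level redundancy does not
```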
Is-a Overloading. It is often the case that the relationships within a taxonomy are not
constrained to strictly is-a or taxonomic relationships. Such “uncontrolled use” is re-
ferred to as is-a overloading and is often associated with multiple inheritance (allowing
subclasses to have more than one parent). It can result in subsumption errors brought
about by the different semantics of the relationships within the ontology [161].
Incomplete Descriptions. If a concept within a terminology lacks a complete description,
it may be incorrectly placed in the ontology, or may not be accurately referenced.
Ontology Growth. The structure and consistency within an ontology are sensitive to growth
and must be carefully maintained.
Limits of Taxonomic Relationships. More complicated relationships may be needed to
accurately capture the essence of some medical terms. As a result, more complicated
reasoning engines are needed to handle the variety of relationships, as well as strict
formalization to ensure that no incorrect assumptions are made about the nature of
these relationships.
Other Challenges
Aside from the semantic and structural concerns, attention must be paid to the application
itself. Johnson describes application independence as a crucial factor in a successful
semantic lexicon [80, 49]. A well-constructed ontology should be accessible to a variety of
applications. The challenge is ensuring that it draws just enough distinctions to remain
efficient, while ensuring that the output is mappable to the standardized vocabularies and
databases. As an alternative, Johnson suggests creating an intermediate, or meta-, repre-
sentation that could be mapped into any target vocabulary or database.
Another interesting concern related to application independence is the need to ensure
that the MLP system is not too heavily dependent on a particular ontology. If it is, then
changes in the ontology may undesirably result in the need for significant changes to the
MLP system.
Choosing the appropriate level of formality and coverage for an application is also
important. According to Johnson, it is parser efficiency that differentiates a semantic lexicon
from a general-purpose, controlled vocabulary [80]. In a controlled vocabulary, as much
information as possible is provided for each entry; such information is not necessarily
required for parsing and can therefore slow down the system. It is, however, often necessary
for other applications. Thus, while researchers may choose not to parse the input using
such a detailed vocabulary, they may ultimately need to map their output into it.
B.4 Summary
This appendix has introduced the term “ontology” with respect to medical informatics. The
principles necessary for effective ontologies have been outlined, as well as various means
of knowledge representation. Lastly, the major medical vocabularies in common use today
have been discussed.
Overall, the limits of existing ontologies with respect to breadth of coverage, the
tradeoff between expressivity and decidability, and the desire for interoperability
between various applications suggest that a single, multipurpose ontology may be too much
to ask. A more attainable goal is the construction of multi-ontological systems that adhere
to strict ontological principles and standards, allowing existing ontologies to
communicate via standard protocols and achieve greater coverage without intractability.
Appendix C
All Results
All results shown here were collected by a manual analysis of the output. Corpus Size: Training
is the number of reports in the training set, while Corpus Size: Test is the number
of test cases on which the system was run. In all instances, unless indicated otherwise,
Corpus Size: Training = 2751 and Corpus Size: Test = 30.
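For reference, the Recall, Precision, and f-Measure columns in the tables that follow use the standard definitions, which can be sketched as below. The sample counts in the example are invented for illustration and are not drawn from the thesis experiments.

```python
def precision_recall_f(true_pos, false_pos, false_neg):
    """Standard definitions behind the Recall/Precision/f-Measure columns."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    # f-measure is the harmonic mean of precision and recall
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

# Illustrative counts only (not taken from the experiments):
p, r, f = precision_recall_f(true_pos=19, false_pos=31, false_neg=19)
print(f"precision={p:.0%} recall={r:.0%} f-measure={f:.0%}")
```

Note that a heuristic can score high recall with low precision (flagging nearly everything), which is why both are reported alongside their harmonic mean.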
Table C.1: Co-occurrence analysis with windowsize=3, threshold=0.

                            Accuracy                      Corpus Size
Report Type        Recall  Precision  f-Measure       Training  Test
All                  83%      26%        40%            2751     20
Findings only        88%      31%        46%            2751     20
Impressions only     96%      15%        26%            2751     20
Spine only           77%      35%        48%             891     10
Table C.2: Co-occurrence analysis on entire error set, windowsize=collocation

          Accuracy
Recall  Precision  f-Measure    Threshold
 57%       19%        28%        0
 57%       19%        29%        5E-06
 59%        9%        16%        5E-04
Table C.3: Co-occurrence analysis on non-stop-words only, windowsize=collocation

          Accuracy
Recall  Precision  f-Measure    Threshold
 96%       15%        26%        0
 96%       16%        27%        5E-06
 96%        7%        14%        5E-04
Table C.4: Co-occurrence analysis on entire error set, windowsize=1

          Accuracy
Recall  Precision  f-Measure    Threshold
 50%       36%        42%        0
 50%       38%        43%        5E-06
 57%       11%        19%        5E-04
Table C.5: Co-occurrence analysis on non-stop-words only, windowsize=1

          Accuracy
Recall  Precision  f-Measure    Threshold
 86%       29%        44%        0
 86%       31%        46%        5E-06
 86%        8%        15%        5E-04
Table C.6: Co-occurrence analysis on entire error set, windowsize=10

          Accuracy
Recall  Precision  f-Measure    Threshold
 30%       45%        36%        0
 42%       21%        28%        5E-06
 94%        4%         7%        5E-04
Table C.7: Co-occurrence analysis on non-stop-words only, windowsize=10

          Accuracy
Recall  Precision  f-Measure    Threshold
 66%       47%        82%        0
 82%       20%         2%        5E-06
 82%       32%         3%        5E-04
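The co-occurrence heuristic evaluated in Tables C.1 through C.7 can be sketched as follows: pair statistics are gathered from the training corpus, and a word in a transcribed report is flagged as a likely error when its co-occurrence support from its neighbours, within the given window size, falls at or below the threshold. The counting scheme and scoring below are a hypothetical simplification for illustration, not the thesis's actual implementation; the toy corpus is invented.

```python
from collections import Counter

def train_cooccurrence(corpus, window):
    """Count word pairs appearing within `window` words of each other."""
    pairs = Counter()
    for report in corpus:
        words = report.split()
        for i, w in enumerate(words):
            for v in words[i + 1 : i + 1 + window]:
                pairs[frozenset((w, v))] += 1
    return pairs

def flag_errors(report, pairs, window, threshold):
    """Flag words whose co-occurrence support from neighbours is <= threshold."""
    words = report.split()
    flagged = []
    for i, w in enumerate(words):
        neighbours = words[max(0, i - window) : i] + words[i + 1 : i + 1 + window]
        support = sum(pairs[frozenset((w, v))] for v in neighbours)
        if support <= threshold:
            flagged.append(w)
    return flagged

# Invented toy corpus standing in for the 2751 training reports:
corpus = ["no acute disease", "no acute fracture", "acute disease noted"]
pairs = train_cooccurrence(corpus, window=3)
print(flag_errors("no acute decease", pairs, window=3, threshold=0))  # → ['decease']
```

Raising the threshold flags more words, which is consistent with the recall/precision trade-off visible across the rows of the tables above.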
Table C.8: PMI analysis on entire error set, windowsize=collocation

          Accuracy
Recall  Precision  f-Measure    Threshold
 65%       16%        26%        10^0
 65%       15%        25%        10^1
 68%       14%        24%        10^2
 70%       13%        21%        10^3
Table C.9: PMI analysis on non-stop-words only, windowsize=collocation

          Accuracy
Recall  Precision  f-Measure    Threshold
 96%       12%        21%        10^0
 96%       11%        20%        10^1
 96%       10%        17%        10^2
 96%        8%        15%        10^3
Table C.10: PMI analysis on entire error set, windowsize=1

          Accuracy
Recall  Precision  f-Measure    Threshold
 47%       19%        27%        10^0
 48%       16%        24%        10^1
 52%       12%        20%        10^2
 51%        9%        16%        10^3
Table C.11: PMI analysis on non-stop-words only, windowsize=1

          Accuracy
Recall  Precision  f-Measure    Threshold
 87%       17%        28%        10^0
 87%       14%        24%        10^1
 91%       10%        19%        10^2
 91%        8%        15%        10^3
Table C.12: PMI analysis on entire error set, windowsize=10

          Accuracy
Recall  Precision  f-Measure    Threshold
 34%       34%        34%        10^0
 37%       31%        34%        10^1
 39%       24%        30%        10^2
 43%       19%        26%        10^3
Table C.13: PMI analysis on non-stop-words only, windowsize=10

          Accuracy
Recall  Precision  f-Measure    Threshold
 73%       35%        47%        10^0
 77%       31%        44%        10^1
 82%       24%        37%        10^2
 88%       19%        41%        10^3
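Pointwise mutual information, the second statistical heuristic, scores a word pair by how much more often the two words co-occur than chance would predict. A minimal sketch of the standard log-based definition is given below; the probability estimates and counts are invented for illustration, and the thesis's own estimator and thresholding scheme may differ.

```python
import math

def pmi(pair_count, count_x, count_y, total):
    """PMI(x, y) = log2( P(x, y) / (P(x) * P(y)) )."""
    p_xy = pair_count / total
    p_x = count_x / total
    p_y = count_y / total
    return math.log2(p_xy / (p_x * p_y))

# Invented counts: the pair occurs 10 times in 1000 windows, ten times
# more often than chance (0.02 * 0.05 = 0.001) predicts, so PMI = log2(10).
print(round(pmi(10, 20, 50, 1000), 2))  # → 3.32
```

A word is then flagged as a likely recognition error when its score against its context falls below a chosen threshold, so raising the threshold flags more words and trades precision for recall, as in the tables above.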
Table C.14: Combined heuristics on all errors based upon top f-measure.

                          Accuracy                     Corpus Size
Heuristic        Recall  Precision  f-Measure      Training  Test
Best Co-Occur      50%      38%        43%           2751     30
Best PMI           35%      34%        34%           2751     30
Parser             29%      34%        32%            n/a     30
Hybrid             74%      46%        57%           2751     30
Table C.15: Combined heuristics on all errors based upon top recall score.

                          Accuracy                     Corpus Size
Heuristic        Recall  Precision  f-Measure      Training  Test
Best Co-Occur      59%       9%        16%           2751     30
Best PMI           70%      13%        21%           2751     30
Parser             29%      34%        32%            n/a     30
Hybrid             83%      13%        22%           2751     30
Bibliography
[1] J. F. Allen, B. W. Miller, E. K. Ringger, and T. Sikorski. A robust system for natural spoken dialogue. In Proceedings of the 34th Annual Meeting of the ACL, pages 62–70, 1996.
[2] R. Altman. AI in medicine: The spectrum of challenges from managed care to molecular medicine. AI Magazine, 20(3):67–77, 1999.
[3] A. R. Aronson. Meta-map: mapping text to the UMLS Metathesaurus, 1996. Electronic document. Date of publication: March 6, 1996. Date retrieved: January 14, 2006.
[4] A. R. Aronson. Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program. In Proceedings of the AMIA Symposium, pages 17–21, 2001.
[5] Franz Baader, Diego Calvanese, Deborah McGuinness, Daniele Nardi, and Peter F. Patel-Schneider. The Description Logic Handbook: Theory, Implementation, and Applications. Cambridge University Press, 2003.
[6] R. Barrows, M. Busuioc, and C. Friedman. Limited parsing of notational text visit notes: Ad-hoc vs. NLP approaches. In Proceedings of AMIA Annual Symposium, pages 50–55, 2000.
[7] M. Carmen Benitez, Antonio Rubio, Pedro Garcia, and Jesus Diaz-Verdejo. Word verification using confidence measures in speech recognition. In Proceedings of ICSLP, pages 1082–1085, November 1998.
[8] D. S. Bhachu. Introduction to PACS. In Consumers Association. Medical Devices Agency (MDA), March 2002.
[9] T. Bittner and B. Smith. A theory of granular partitions. In Matthew Duckham, Michael F. Goodchild, and Michael F. Worboys, editors, Foundations of Geographic Information Science, pages 117–151. Taylor and Francis Books, London, 2003.
[10] Philippe Blache. Property grammars: A fully constraint-based theory. In H. Christiansen, P. R. Skadhauge, and J. Villadsen, editors, Constraint Solving and Language Processing, volume 3438 of Lecture Notes in Artificial Intelligence. Springer, 2005.
[11] Alan W. Black, Ralf D. Brown, Robert Frederking, Rita Singh, John Moody, and Eric Steinbrecher. TONGUES: Rapid development of a speech-to-speech translation system. In Proceedings of the Second International Conference on Human Language Technology Research (HLT), pages 183–186, March 2002.
[12] Wayne D. Blizard. Multiset theory. Notre Dame Journal of Formal Logic, 30(1):36–66, 1989.
[13] Olivier Bodenreider. Circular hierarchical relationships in the UMLS: Etiology, diagnosis, treatment, complications and prevention. In Proceedings of AMIA Annual Symposium, pages 57–61, 2001.
[14] Olivier Bodenreider. Medical ontology research. Technical report, Lister Hill National Center for Biomedical Communications, 2001.
[15] Olivier Bodenreider, Joyce A. Mitchell, and Alexa T. McCray. Biomedical ontologies. Proceedings of the 2003 Pacific Symposium on Biocomputing, 8:562–564, 2003. Session introduction.
[16] Olivier Bodenreider, Barry Smith, Anand Kumar, and Anita Burgun. Investigating subsumption in DL-based terminologies: A case study in SNOMED CT. In Proceedings of the First International Workshop on Formal Biomedical Knowledge Representation (KR-MED 2004), pages 12–20, 2004.
[17] Olivier Bodenreider and Songmao Zhang. Semantic integration in biomedicine. In Proceedings of the Semantic Integration Workshop at the Second International Semantic Web Conference (ISWC 2003), pages 156–157, 2003.
[18] S. M. Borowitz. Computer-based speech recognition as an alternative to medical transcription. Journal of the American Medical Informatics Association, 8:101–102, 2001.
[19] J. Bouaud, B. Bachimont, J. Charlet, and P. Zweigenbaum. Acquisition and structuring of an ontology within conceptual graphs. In Proceedings of ICCS'94 Workshop on Knowledge Acquisition using Conceptual Graph Theory, pages 1–25, 1994.
[20] C. Bousquet, M. C. Jaulent, G. Chatellier, and P. Degoulet. Using semantic distance for the efficient coding of medical concepts. In Proceedings of AMIA Annual Symposium, pages 96–100, 2000.
[21] Anita Burgun and Olivier Bodenreider. Aspects of the taxonomic relation in the biomedical domain. In International Conference on Formal Ontology in Information Systems, pages 222–233. ACM, October 17–19, 2001.
[22] Anita Burgun and Olivier Bodenreider. Comparing terms, concepts and semantic classes in WordNet and the Unified Medical Language System. In Proceedings of the NAACL'2001 Workshop, “WordNet and Other Lexical Resources: Applications, Extensions and Customizations”, pages 77–82. ACM, 2001.
[23] J. E. Caviedes and J. J. Cimino. Towards the development of a conceptual distance metric for the UMLS. Journal of Biomedical Informatics, 37:77–85, 2004.
[24] W. Ceusters, B. Smith, A. Kumar, and C. Dhaen. Mistakes in medical ontologies: Where do they come from and how can they be detected? In D. M. Pisanelli, editor, Ontologies in Medicine: Proceedings of the Workshop on Medical Ontologies, pages 16–18, Amsterdam, October 2004. IOS Press.
[25] W. Ceusters, B. Smith, A. Kumar, and C. Dhaen. Ontology-based error detection in SNOMED-CT. In Proceedings of MEDINFO, pages 482–486, 2004.
[26] Werner Ceusters, Jeremy Rogers, Fabrizio Consorti, and Angelo Rossi-Mori. Syntactic-semantic tagging as a mediator between linguistic representations and formal models: an exercise in linking SNOMED to GALEN. Artificial Intelligence in Medicine, 15:5–23, 1999.
[27] Werner Ceusters, Barry Smith, and Jim Flanagan. Ontology and medical terminology: Why description logics are not enough. In Towards an Electronic Patient Record (TEPR 2003), Boston, MA, May 10–14, 2003. Medical Records Institute (CD-ROM publication).
[28] L. Christensen, P. Haug, and M. Fiszman. MPLUS: A probabilistic medical understanding system. In Proceedings of the Workshop on Natural Language Processing in the Biomedical Domain, pages 29–36, 2002.
[29] Henning Christiansen. CHR grammars. Theory and Practice of Logic Programming, 5(4):467–501, 2005.
[30] J. J. Cimino. Desiderata for controlled medical vocabularies in the twenty-first century. Methods of Information in Medicine, 37:394–403, 1998.
[31] R. Cornet and A. Abu-Hanna. Usability of expressive description logics – a case study in UMLS. In Proceedings of the AMIA Annual Symposium, pages 180–184, 2002.
[32] Stephen Cox and Srinandan Dasmahapatra. A semantically-based confidence measure for speech recognition. In Proceedings of the Int. Conf. on Spoken Language Processing, volume 4, pages 206–209, Beijing, China, 2000.
[33] Stephen Cox and Srinandan Dasmahapatra. High-level approaches to confidence measure estimation in speech recognition. IEEE Transactions on Speech and Audio Processing, 10(7):460–471, Oct 2002.
[34] Christopher Culy and S. Z. Riehemann. The limits of N-gram translation evaluation metrics. In MT Summit IX, pages 71–78, New Orleans, USA, September 2003.
[35] Veronica Dahl and Philippe Blache. Directly executable constraint based grammars. In Proc. Journees Francophones de Programmation en Logique avec Contraintes, Angers, France, June 2004.
[36] Veronica Dahl and Kimberly Voll. Concept formation rules: An executable cognitive model of knowledge construction. In Proceedings of the First International Workshop on Natural Language Understanding and Cognitive Sciences, pages 28–36, Porto, Portugal, April 2004.
[37] Srinandan Dasmahapatra and Stephen Cox. Meta-models for confidence estimation in speech recognition. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing, volume 3, pages 1815–1818, June 2000.
[38] E. Devine, S. Gaehde, and A. Curtis. Comparative evaluation of three continuous speech recognition software packages in the generation of medical reports. Journal of the American Medical Informatics Association, 7:462–468, 2000.
[39] A. Fall. An abstract framework for taxonomic encoding. In Proceedings of the First International Symposium on Knowledge Retrieval, Use and Storage for Efficiency, pages 162–167, 1995.
[40] E. Filisko and S. Seneff. Error detection and recovery in spoken dialogue systems. In Proceedings of the HLT-NAACL 2004 Workshop on Spoken Language Understanding for Conversational Systems, pages 31–38, Boston, MA, May 2004.
[41] M. Fiszman, W. Chapman, S. Evans, and P. Haug. Automatic identification of pneumonia related concepts on chest x-ray reports. In Proceedings of AMIA Annual Symposium, pages 67–71, 1999.
[42] M. Fiszman and P. Haug. Using medical language processing to support real-time evaluation of pneumonia guidelines. In Proceedings of AMIA Annual Symposium, pages 235–239, 2000.
[43] Bruce Forster. Private interview. Canada Diagnostic Centre, Vancouver, BC, May 23, 2003.
[44] Bruce Forster. Private interview. Canada Diagnostic Centre, Vancouver, BC, May 24, 2005.
[45] C. Friedman. A broad-coverage natural language processing system. In Proceedings of AMIA Annual Symposium, pages 270–274, 2000.
[46] C. Friedman, P. O. Alderson, J. H. Austin, J. J. Cimino, and S. Johnson. A general natural-language text processor for clinical radiology. Journal of the American Medical Informatics Association, 1(2):161–174, 1994.
[47] C. Friedman and G. Hripcsak. Natural language processing and its future in medicine: Can computers make sense out of natural language text? Academic Medicine: Journal of the Association of American Medical Colleges, 74(8):890–895, 1999.
[48] C. Friedman, S. Johnson, B. Forman, and J. Starren. Architectural requirements for a multipurpose natural language processor in the clinical environment. In Proceedings of AMIA Annual Symposium, pages 347–351, 1995.
[49] C. Friedman, L. Shagina, Y. Lussier, and G. Hripcsak. Automated encoding of clinical documents based on natural language processing. Journal of the American Medical Informatics Association, 11(5):392–402, 2004.
[50] Thom W. Fruhwirth. Constraint handling rules. In Constraint Programming, pages 90–107, 1994.
[51] Thom W. Fruhwirth. Theory and practice of constraint handling rules. Journal of Logic Programming, Special Issue on Constraint Logic Programming, 37(1–3):95–138, October 1998.
[52] B. Gale, Y. Safriel, and A. Lukban. Radiology report production times: Voice recognition vs. transcription. Radiology Management, 23:18–22, 2001.
[53] W. Gale and K. Church. What's wrong with adding one? In N. Oostdijk and P. de Haan, editors, Corpus-Based Research into Language: In honour of Jan Aarts, pages 189–200. Rodopi, Amsterdam, 1994.
[54] L. Gillick, Y. Ito, and J. Young. A probabilistic approach to confidence measure estimation and evaluation. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, pages 879–882, April 1997.
[55] L. Gleitman. The structural sources of verb meaning. Language Acquisition, 1:3–55, 1990.
[56] J. Greenspan. Introduction to XML, 1998. http://hotwired.lycos.com/webmonkey/98/41/index1a.html?tw=authoring Electronic publication. Date retrieved: January 14, 2006.
[57] J. Grimshaw. Form, function, and the language acquisition device. In C. Baker and J. McCarthy, editors, The Logical Problem of Language Acquisition. MIT Press, Cambridge, MA, 1981.
[58] T. R. Gruber. A translation approach to portable ontology specifications. Knowledge Acquisition, 5:199–220, 1993.
[59] A. Gunawardana, A. Hon, and H. W. Jiang. Word-based acoustic confidence measures for large-vocabulary speech recognition. In Proceedings of the International Conference on Logic Programming, volume 3, pages 791–794, 1998.
[60] U. Hahn, M. Romacker, and S. Schulz. Why discourse structures in medical reports matter for the validity of automatically generated text knowledge bases. In Proceedings of MEDINFO, pages 633–638, 1998.
[61] U. Hahn, M. Romacker, and S. Schulz. Discourse structures in medical reports – Watch out! The generation of referentially coherent and valid text knowledge bases in the MEDsyndikate system. International Journal of Medical Informatics, 53(1):1–28, 1999.
[62] U. Hahn, M. Romacker, and S. Schulz. MEDsyndikate – design considerations for an ontology-based medical text understanding system. In Proceedings of AMIA Annual Symposium, pages 330–334, 2000.
[63] A. Happe, B. Pouliquen, A. Burgun, M. Cuggia, and P. Le Beux. Automatic concept extraction from spoken medical reports. International Journal of Medical Informatics, 70(2–3):255–263, July 2003.
[64] Robert Harnish. Minds, Brains, Computers: An Historical Introduction to the Foundations of Cognitive Science. Blackwell Publishers, 2002.
[65] Z. Harris. Mathematical Structures of Language. Wiley Interscience, 1968.
[66] Kaichiro Hatazaki, Jun Noguchi, Akitoshi Okumura, Kazunaga Yoshida, and Takao Watanabe. INTERTALKER: an experimental automatic interpretation system using conceptual representation. In Proceedings of the International Conference on Spoken Language Processing (ICSLP), Oct 1992.
[67] P. Haug, S. Koehler, L. Lau, P. Wang, R. Rocha, and S. Huff. Experience with a mixed semantic/syntactic parser. In Proceedings of the Annual AMIA Symposium, pages 284–288, 1995.
[68] P. Haug, D. Ranum, and P. Frederick. Computerized extraction of coded findings from free-text radiology reports. Radiology, 174:543–548, 1990.
[69] D. Hayt and S. Alexander. The pros and cons of implementing PACS and speech recognition systems. Journal of Digital Imaging, 14(3):149–157, 2001.
[70] T. J. Hazen and I. Bazzi. A comparison and combination of methods for OOV word detection and word confidence scoring. In Proceedings of the International Conference on Acoustics IC, volume 1, pages 397–400, May 2001.
[71] T. J. Hazen, J. Polifroni, and S. Seneff. Recognition confidence scoring for use in speech understanding systems. Computer Speech and Language, 16(1):49–67, 2002.
[72] S. Horii, R. Redfern, H. Kundel, and C. Nodine. PACS technologies and reliability: Are we making things better or worse? In Proceedings of SPIE, volume 4685, pages 16–24, 2002.
[73] S. C. Horii. Primer on computers and information technology. Part four: A nontechnical introduction to DICOM. Radiographics, 17:1297–1309, 1997.
[74] Health Level Seven Inc. Health Level Seven home page. Accessed: February 2006. Last known update: 2006. http://www.hl7.org/.
[75] Diana Inkpen and Alain Desilets. Semantic similarity for detecting recognition errors in automatic speech transcripts. In Proceedings of EMNLP, pages 49–56, Vancouver, British Columbia, Canada, October 2005. Association for Computational Linguistics.
[76] N. Jain and C. Friedman. Identification of findings suspicious for breast cancer based on natural language processing of mammogram reports. In Proceedings of AMIA Annual Fall Symposium, pages 29–33, 1997.
[77] M. Jeong, B. Kim, and G. Lee. Using higher-level linguistic knowledge for speech recognition error correction in a spoken Q/A dialog. In Proceedings of the HLT-NAACL Special Workshop on Higher-Level Linguistic Information for Speech Processing, pages 48–55, 2004.
[78] J. J. Jiang and D. W. Conrath. Semantic similarity based on corpus statistics and lexical taxonomy. In Proceedings of the International Conference on Research in Computational Linguistics, pages 19–33, 1997.
[79] D. Johnson, R. Taira, A. Cardenas, and D. Aberle. Extracting information from free text radiology reports. Journal of Digital Libraries, 1:297–308, 1997.
[80] S. Johnson. A semantic lexicon for medical language processing. Journal of the American Medical Informatics Association, 6(3):205–218, 1999.
[81] D. Jurafsky and J. Martin. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice-Hall Inc, 2000.
[82] Satoshi Kaki, Eiichiro Sumita, and Hitoshi Iida. A method for correcting errors in speech recognition using the statistical features of character co-occurrence. In ACL-COLING, pages 653–657, 1998.
[83] K. Kanal, N. Hangiandreou, A. Sykes, H. Eklund, P. Araoz, J. Leon, and B. Erickson. Initial evaluation of a continuous speech recognition program for radiology. Journal of Digital Imaging, 14(1):30–37, March 2001.
[84] Y. W. Kim and J. H. Kim. A model of knowledge based information retrieval with hierarchical concept graph. Journal of Documentation, 2:113–137, 1990.
[85] Myoung-Wan Koo, Il-Hyun Sohn, Woo-Sung Kim, and Du-Seong Chang. KT-STS: A speech translation system for hotel reservation and a continuous speech recognition system for speech translation. In Proceedings of Eurospeech, pages 1227–1231, 1995.
[86] H. Kuhn. Speech recognition and the frequency of recently used words: A modified Markov model for natural language. In Proceedings of the 12th Conference on Computational Linguistics, volume 1, pages 348–350, Budapest, Hungary, 1988.
[87] K. Kukich. Techniques for automatically correcting words in text. ACM Computing Surveys, 24(4):377–439, 1992.
[88] B. Landau and L. Gleitman. Language and Experience: Evidence from Blind Children. Harvard University Press, Cambridge, MA, 1985.
[89] A. Lavie, A. Waibel, L. Levin, M. Finke, D. Gates, M. Gavalda, T. Zeppenfeld, and P. Zhan. JANUS-III: Speech-to-speech translation in multiple languages. In Proceedings of the 22nd IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 1997–2004, April 1997.
[90] Alon Lavie, Lori S. Levin, Robert E. Frederking, and Fabio Pianesi. The NESPOLE! speech-to-speech translation system. In Proceedings of the 5th Conference of the Association for Machine Translation in the Americas (AMTA), volume 2499 of Lecture Notes in Computer Science, pages 240–243. Springer, October 2002.
[91] P. Lendvai, A. Van den Bosch, E. Krahmer, and M. Swerts. Multi-feature error detection in spoken dialogue systems. In Proceedings of the 12th Computational Linguistics in the Netherlands Meeting, pages 163–178, Nov 2001.
[92] Ping Li, Curt Burgess, and Kevin Lund. The acquisition of word meaning through global lexical co-occurrences. In Proceedings of the Thirtieth Annual Child Language Research Forum, pages 166–178, 2000.
[93] Henry Lieberman, Alexander Faaborg, Waseem Daher, and Jose Espinosa. How to wreck a nice beach you sing calm incense. In International Conference on Intelligent User Interfaces, pages 278–280, San Diego, January 2005.
[94] Chin-Yew Lin. ROUGE: A package for automatic evaluation of summaries. In Marie-Francine Moens and Stan Szpakowicz, editors, Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, pages 74–81, Barcelona, Spain, July 2004.
[95] D. J. Litman, J. Hirschberg, and M. Swerts. Predicting automatic speech recognition performance using prosodic cues. In Proceedings of the 1st Meeting of the North American Chapter of the Association for Computational Linguistics, pages 218–225, 2000.
[96] Christopher D. Manning and Hinrich Schutze. Foundations of Statistical Natural Language Processing. MIT Press, 2002.
[97] J. Marion. Radiologists' attitudes can make or break speech recognition. Diagnostic Imaging Online, 2002. http://www.diagnosticimaging.com/db area/archives/2002/0202.marion.di.pacs.shtm Electronic document. Date of publication: February 1, 2002. Date retrieved: January 14, 2006.
[98] D. G. Maynard and S. Ananiadou. Incorporating linguistic information for multi-word term extraction. In 2nd Computational Linguistics UK Research Colloquium (CLUK2), 1999.
[99] A. Mehta, K. Dreyer, A. Schweitzer, J. Couris, and D. Rosenthal. Voice recognition – an emerging necessity within radiology: Experiences of the Massachusetts General Hospital. Journal of Digital Imaging, 11(4):20–23, 1998.
[100] J. Michael, J. L. Mejino, and C. Rosse. The role of definitions in biomedical concept representation. In Proceedings of the AMIA Symposium, pages 463–467, 2001.
[101] K. J. Mitchell, M. J. Becich, J. J. Berman, W. W. Chapman, J. Gilbertson, D. Gupta, J. Harrison, E. Legowski, and R. S. Crowley. Implementation and evaluation of a negation tagger in a pipeline-based system for information extraction from pathology reports. Medinfo, pages 663–667, 2004.
[102] Tom M. Mitchell. Machine Learning. McGraw-Hill, 1997.
[103] S. Nelson, T. Powell, and B. Humphreys. The Unified Medical Language System (UMLS) Project, volume 71 of Encyclopedia of Library and Information Science, pages 369–378. Marcel Dekker Inc, 2002.
[104] Tamar Nordenberg. Make no mistake: Medical errors can be deadly serious. FDA Consumer, 34(5), September 2000.
[105] N. F. Noy, M. A. Musen, J. L. V. Mejino, and C. Rosse. Pushing the envelope: Challenges in a frame-based representation of human anatomy. Data and Knowledge Engineering, 48:335–359, 2004.
[106] OpenClinical. Description logics. In OpenClinical web site. Accessed: February 2006. Last known update: October 18, 2004. http://www.openclinical.org/descriptionlogics.html.
[107] OpenGALEN. OpenGALEN FAQ. In OpenGALEN web site. Accessed: February 2006. Last known update: 1999. http://www.opengalen.org/technology/galen-faq.html.
[108] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, July 2002.
[109] S. Pinker. Language Learnability and Language Development. Harvard University Press, Cambridge, MA, 1984.
[110] S. Pinker. The bootstrapping problem in language acquisition. In B. MacWhinney, editor, Mechanisms of Language Acquisition. Lawrence Erlbaum, Hillsdale, NJ, 1987.
[111] S. Pinker. How could a child use verb syntax to learn verb semantics? Lingua, 92:377–410, 1994.
[112] D. M. Pisanelli, A. Gangemi, M. Battaglia, and C. Catenacci. Coping with medical polysemy in the semantic web: the role of ontologies. In Proceedings of MedInfo 2004, pages 416–419, Amsterdam, September 7–11, 2004. IOS Press.
[113] R. Rada, H. Mili, E. Bicknell, and M. Blettner. Development and application of a metric on semantic nets. IEEE Transactions on Systems, Man and Cybernetics, 19(1):17–30, Feb 1989.
[114] A. Rector. Clinical terminology: Why is it so hard? Methods of Information in Medicine, 38:239–252, 1999.
[115] A. Rector, W. Solomon, W. Nowlan, and T. Rush. A terminology server for medical language and medical information systems. In Proceedings of IMIA WG6, volume 34, pages 147–157, 1994.
[116] A. L. Rector, J. E. Rogers, and P. Pole. The GALEN high level ontology. In Proceedings of Medical Informatics Europe '96 (MIE'96), pages 174–178, Amsterdam, 1996. IOS Press.
[117] Alan Rector and Jeremy Rogers. Ontological issues in using a description logic to represent medical concepts: Experience from GALEN. In IMIA WG6 Workshop: Terminology and Natural Language in Medicine, Phoenix, Arizona, November 1999.
[118] Philip Resnik. Using information content to evaluate semantic similarity in a taxonomy. In IJCAI, pages 448–453, 1995.
[119] R. Richardson, A. Smeaton, and J. Murphy. Using WordNet as a knowledge base for measuring semantic similarity between words. Technical Report Working Paper CA-1294, School of Computer Applications, Dublin City University, 1994.
[120] T. Rindflesch, J. Rajah, and L. Hunter. Extracting molecular binding relationships from biomedical text. In Proceedings of the 6th Applied Natural Language Processing Conference (ANLP-NAACL 2000), pages 188–195, 2000.
[121] E. K. Ringger and J. F. Allen. A fertility model for post correction of continuous speech recognition. In ICSLP96, pages 897–900, 1996.
[122] J. F. Roddick, K. Hornsby, and D. deVries. A unifying semantic distance model for determining the similarity of attribute values. In M. J. Oudshoorn, editor, Proc. Twenty-Sixth Australasian Computer Science Conference (ACSC2003), volume 16, pages 111–118, 2003.
[123] J. M. Rodrigues, B. Trombert-Paviot, R. Baud, J. Wagner, and F. Meusnier-Carriot. GALEN-In-Use: Using artificial intelligence terminology tools to improve the linguistic coherence of a national coding system for surgical procedures. Medinfo, 9(1):623–627, 1998.
[124] J. Rogers, A. Roberts, D. Solomon, E. van der Haring, C. Wroe, P. Zanstra, and A. Rector. GALEN ten years on: Tasks and supporting tools. In Medinfo, volume 10, pages 256–260, 2001.
[125] Jeremy Rogers. Electronic mail correspondence. University of Manchester, Manchester, UK, March 4, 2005.
[126] Jeremy Rogers. Electronic mail correspondence. University of Manchester, Manchester, UK, June 21, 2005.
[127] Walter Rolandi. Alpha bail. Speech Technology Magazine, 11(1), January 2006.
[128] Patrick Ruch. Applying Natural Language Processing to Information Retrieval in Clinical Records and Biomedical Texts. PhD thesis, University of Geneva, March 2003.
[129] S. Shiffman, W. M. S. Detmer, C. D. Lane, and L. M. Fagan. A continuous-speech interface to a decision support system: I. Techniques to accommodate misrecognized input. AMIA, 2:36–45, 1995.
[130] N. Sager, M. Lyman, C. Bucknall, N. Nhan, and L. J. Tick. Natural language processing and the representation of clinical data. Journal of the American Medical Informatics Association, 1(2):142–160, 1994.
[131] Arup Sarma and David Palmer. Context-based speech recognition error detection and correction. In Proceedings of HLT-NAACL 2004, pages 85–88, 2004.
[132] SCAR. SCAR Expert Hotline: Speech recognition. In Eliot Siegal, editor, SCAR Spring Newsletter. Society for Computer Applications in Radiology, April 2002.
[133] U. Sinha, B. Dai, D. B. Johnson, R. Taira, J. Dionisio, G. Tashima, M. Golamco, and H. Kangarloo. Interactive software for generation and visualization of structured findings in radiology reports. Am. J. Roentgenology, 175(3):609–612, September 2000.
[134] U. Sinha, A. Yaghmai, B. Dai, L. Thompson, R. Taira, J. Dionisio, and H. Kangarloo. Evaluation of SNOMED 3.5 in representing concepts in chest radiology reports: Integration of a SNOMED mapper with a radiology reporting workstation. In Proceedings of AMIA Annual Symposium, pages 799–803, 2000.
[135] G. Skantze and J. Edlund. Early error detection on word level. In ISCA Tutorial and Research Workshop (ITRW) on Robustness Issues in Conversational Interaction, Norwich, UK, 2004.
[136] Gabriel Skantze. Error detection in spoken dialogue systems, 2002. Term paper, Graduate School for Language Technology, Faculty of Arts, Goteborg University. Course project in dialogue systems. Available: http://www.speech.kth.se/∼gabriel/publications.html. Accessed: February 2006. Last known update: September 15, 2005.
[137] Barry Smith, Anand Kumar, and Thomas Bittner. Basic formal ontology for bioinformatics. Journal of Information Systems, 2005.
[138] Neil Smith. Chomsky: Ideas and Ideals. Cambridge University Press, second edition, 1999.
[139] SNOMED. SNOMED International: Historical perspectives. In SNOMED International web site. Accessed: February 2006. Last known update: June 3, 2005. http://www.snomed.org/about/perspectives.html.
[140] K. Spackman, K. Campbell, and R. Cote. SNOMED RT: A reference terminology for health care. In Proceedings of AMIA, pages 640–644, 1997.
[141] G. Spanoudakis and P. Constantopoulos. Similarity for analogical software reuse: A computational model. In 11th European Conference on Artificial Intelligence (ECAI94), pages 18–22, Amsterdam, The Netherlands, 1994.
[142] G. Spanoudakis and P. Constantopoulos. Elaborating analogies from conceptual models. International Journal of Intelligent Systems, 11(11):917–974, 1996.
[143] P. Spyns. Natural language processing in medicine: An overview. Methods of Information in Medicine, 3:285–301, 1996.
[144] R. Taira and S. Soderland. A statistical natural language processor for medical reports. In Proc. AMIA Fall Symposium, pages 970–974, 1999.
[145] R. Taira, S. G. Soderland, and R. M. Jakobovits. Automatic structuring of radiology free-text reports. RadioGraphics, 21(1):237–245, Jan 2001.
[146] Paul Thagard. Mind: Introduction to Cognitive Science. The MIT Press, Cambridge, 2005.
[147] B. Trombert-Paviot, J. M. Rodrigues, J. E. Rogers, R. Baud, E. van der Haring, A. M. Rassinoux, V. Abrial, L. Clavel, and H. Idir. GALEN: a third generation terminology tool to support a multipurpose national coding system for surgical procedures. International Journal of Medical Informatics, 58–59(1):71–85, 2000.
[148] D. Tudhope and C. Taylor. Navigation via similarity: automatic linking based on semantic closeness. Information Processing and Management, 33(2):233–242, 1997.
[149] P. D. Turney. Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In Proceedings of the Twelfth European Conference on Machine Learning, pages 491–502, Freiburg, Germany, 2001.
[150] M. Turunen and J. Hakulinen. Agent-based error handling in spoken dialogue systems. In Proceedings of Eurospeech, pages 2189–2192, 2001.
[151] Geoffrey Underwood, editor. Oxford Guide to the Mind. Oxford University Press, Oxford, New York, 2001.
[152] Kimberly Voll. Medical language processing. ACM Computing Surveys, 2005. Submitted.
[153] Kimberly Voll. A Methodology of Error Detection: Improving Speech Recognition in Radiology. PhD thesis, Simon Fraser University, School of Computing Science, 8888 University Drive, Burnaby, BC, Canada, June 2006.
[154] Kimberly Voll, Stella Atkins, and Bruce Forster. Improving the utility of speech recognition through error detection. In SCAR Annual Meeting, 2006. In press.
[155] Kimberly Voll, Tom Yeh, and Veronica Dahl. An assumptive logic programming methodology for parsing. International Journal on Artificial Intelligence Tools, 10(4):573–588, 2001.
[156] W. Wahlster, editor. Verbmobil: Foundations of Speech-to-Speech Translation. Springer, 2000.
[157] Alex Waibel, Ajay N. Jain, Arthur E. McNair, Joe Tebelskis, Louise Osterholtz, Hiroaki Saito, Otto Schmidbauer, Tilo Sloboda, and Monika Woszczyna. JANUS: Speech-to-speech translation using connectionist and non-connectionist techniques. In Proceedings of Advanced Neural Information Processing Systems, pages 183–190, 1991.
[158] C. Wang and C. E. Kahn. Potential use of extensible markup language for radiology reporting: A tutorial. RadioGraphics, 20:287–293, 2000.
[159] Julie Weeds and David Weir. Co-occurrence retrieval: A flexible framework for lexical distributional similarity. Computational Linguistics, 31(4), 2006.
[160] D. L. Weiss. Speech recognition need not slow reporting time. SCAR Conference Reporter, August 2003.
[161] C. Welty and N. Guarino. Supporting ontological analysis of taxonomic relationships. Data and Knowledge Engineering, 39(1), 2001.
[162] F. Wessel, R. Schlüter, K. Macherey, and H. Ney. Confidence measures for large vocabulary continuous speech recognition. IEEE Transactions on Speech and Audio Processing, 9(3):288–298, March 2001.
[163] W. Woods and J. Schmolze. The KL-One family. Computers and Mathematics with Applications, 23(2-5):133–177, 1992.
[164] Manuel Zahariev. A (Acronyms). PhD thesis, Simon Fraser University, Vancouver, BC, June 2004.
[165] O. R. Zaïane, A. Fall, S. Rochefort, and V. Dahl. Concept-based retrieval using controlled natural language. In Proceedings of Computer-Assisted Searching on the Internet, pages 335–355, 1997.
[166] Songmao Zhang and Olivier Bodenreider. Investigating implicit knowledge in ontologies with application to the anatomical domain. In Proceedings of the 2004 Pacific Symposium on Biocomputing, pages 164–175. World Scientific Publishing Co., 2003.